Mar 20
File systems with a billion files, intro / TOC
what
This is a story about benchmarking and optimization.
Lars Wirzenius blogged about making a file system with a billion empty files.
Working at that scale can make ordinarily quick things very slow – listing a directory or deleting files can take minutes.
Initially, I was curious about how well general-purpose compression like gzip
would fare with the edge case of gigabytes of zeroes, and then I fell down a rabbit hole.
I found a couple of major speedups, tried a couple of other archive formats, and experimented with other methods for creating that many files.
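For a sense of the starting question: gzip's deflate format tops out around a 1000:1 ratio, so even a stream of pure zeroes leaves roughly a megabyte of output per gigabyte of input. A quick sketch of the kind of check I mean (illustrative commands, not the ones from the later posts):

```
# feed a gigabyte of zeroes through gzip and count the output bytes;
# expect roughly a megabyte, since deflate maxes out near 1000:1
dd if=/dev/zero bs=1M count=1024 status=none | gzip -c | wc -c
```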
timing
For a brief spoiler: Lars’ best time was about 26 hours. I got their Rust program down to under 16 hours, on a Raspberry Pi. And I managed to get a couple of other methods – shell scripts – to finish in under 24 hours (update: now under 7 hours).
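If you want to reproduce timings like these, GNU time (the `time` package on Debian, not the shell builtin) reports wall-clock time and peak memory for a whole run; `make-forest` below is a hypothetical stand-in for whichever method is being tested:

```
# capture wall-clock time and peak memory for a long run;
# "make-forest" is a hypothetical stand-in, not a real command
/usr/bin/time -v ./make-forest 2> timing.log
grep -E 'Elapsed|Maximum resident' timing.log
```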
sections
I was polishing up a lengthy blog post when I fell into what might be a whole other wing of the rabbit hole, and I realized it might need to be its own post – or that several posts would be better anyway.
Here are the sections I can see now; I’ll add links as I go, and there’s a tag as well:
- hardware, below
- making the forests – creating all those files and folders (see the toy sketch after this list)
- more info on the Rust program, and some tuning
- extra post: making the forests, in parallel
- archiving and compressing the file systems
- the “whole other wing” possibility
  - when I wrote this, I meant profiling and optimizing the Rust app; that’s off my list for now
- troubleshooting breakage on Debian 12, bookworm
- conclusions?
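To make “making the forests” concrete before that post exists: the core task is just directories full of empty files, scaled way up. A toy shell sketch of the shape of it (tiny numbers, and not the actual scripts from either repo):

```
#!/bin/sh
# toy version: 100 directories of 100 empty files each (10,000 files).
# The real job is the same shape with five more orders of magnitude.
for d in $(seq 1 100); do
    mkdir -p "forest/$d"
    for f in $(seq 1 100); do
        : > "forest/$d/$f"   # cheapest way to create an empty file
    done
done
```

At a billion files, per-file overhead dominates everything else, so even tiny savings on each create add up to hours.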
the hardware I’m using
I worked from a Raspberry Pi 4 with 4 GB of RAM, running Debian 12 (bookworm). The media was a Seagate USB drive, which turned out to be SMR (Shingled Magnetic Recording) – not great for writing a lot of data, probably noticeable at a gigabyte and definitely at a terabyte. This setup is easy to improve upon! The benefit: it was handy, and it could crash without inconvenience.
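As an aside, if you’re wondering whether one of your own drives is SMR: there’s no fully reliable software check, because drive-managed SMR disks (most consumer USB drives) usually don’t advertise it, but the kernel’s zoned-device attribute is worth a glance. Assuming the drive shows up as /dev/sda:

```
# "host-aware" or "host-managed" means a zoned (SMR) device;
# drive-managed SMR typically still reports "none", so a "none"
# result isn't conclusive; the vendor's datasheet is surer
cat /sys/block/sda/queue/zoned
lsblk --output NAME,SIZE,ZONED
```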
I tried using my Synology NAS, but it never finished a run. Once, it crashed so hard I had to pull the power cord from the wall. I think its 2 GB of memory wasn’t enough.
resources
Lars Wirzenius wrote:
- first blog post (2020 or 2022)
- first git repo, including a Python script to create a forest of empty files
- a second blog post (2024), which was my entry point
- second git repo, including a Rust program to create the drive image, make the file system, and populate it with empty files
Slides from Ric Wheeler’s 2010 presentation, “One Billion Files: Scalability Limits in Linux File Systems”