Mar 21
Making file systems with a billion files
this is part 2 – part 1 has an intro and links to the others
I forget where I picked up “forest” as “many files or hardlinks, largely identical”. I hope it’s more useful than confusing. Anyway. Let’s make a thousand thousand thousand files!
file structures
Putting even a million files in a single folder is not recommended, so the usual structure (sketched just below) is:
- a thousand level 1 folders, each containing:
  - a thousand level 2 folders, each containing:
    - a thousand empty files
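As a rough sketch, the two-level layout can be built like this (the zero-padded names are my own choice; the actual scripts may use a different naming scheme):

```sh
#!/bin/sh
# Build the 1000 x 1000 directory tree.
# seq -w zero-pads to the widest number, so names run 0001..1000.
for i in $(seq -w 1 1000); do
  for j in $(seq -w 1 1000); do
    mkdir -p "$i/$j"
  done
done
```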
various script attempts
These are ordered, roughly, slowest to fastest; the times were measured on an ext4 file system.
Lots more details are over in a GitLab repo, a fork of the Rust program’s repo.
- forest-touch.sh – runs `touch $file` in a loop, 1 billion times
- create_files.py – touches a file, 1 billion times; from Lars Wirzenius, take 1, repo
- forest-tar.sh – builds a tar.gz with a million files, then unpacks it a thousand times; makes an effort for consistent timestamps
- forest-multitouch.sh – runs `touch 0001 ... 1000` in a loop, 1 million times; makes an effort for consistent timestamps (sketched after this list)
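A minimal sketch of the multitouch idea, assuming GNU touch and the layout above (the timestamp shown is illustrative):

```sh
#!/bin/sh
# One touch call creates 1000 files at once; -d pins them all to the same
# timestamp, which helps later compression of the drive image.
STAMP="2024-03-01 00:00:00"
NAMES=$(seq -w 1 1000)
for dir in */*/; do
  (cd "$dir" && touch -d "$STAMP" $NAMES)   # $NAMES unquoted on purpose: 1000 args
done
```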
More consistent timestamps can lead to better compression of drive images, later.
A friend, Elliot Grafil, suggested that tar would have the benefits of decades of optimization. It’s not a bad showing! zip didn’t fare as well: it was slower, it took more space, and couldn’t be streamed through a pipe like tar.gz can.
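A minimal sketch of the tar approach, assuming the files were first staged somewhere (the `staging` directory and archive name are mine, not the script’s):

```sh
#!/bin/sh
# Pack a million files once, then unpack the archive a thousand times.
# tar.gz streams through a pipe, which zip can't do.
tar -czf million.tar.gz -C staging .        # staging/ holds 1000 dirs x 1000 files
for i in $(seq -w 1 1000); do
  mkdir -p "$i"
  tar -xzf million.tar.gz -C "$i"
done
```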
the Rust program
Lars Wirzenius’ create-empty-files, with some modifications, was the fastest method.
Some notes on usage:
- for speed, skip the state file – filed merge request #1 – still WIP
- for documentation, filed merge request #3, merged 2024-03-17
- other file system types are doable, as long as `mount` recognizes them automatically
- if, when you run this, it takes over a minute to show a progress meter, make sure you’re running on a file system that supports sparse files (see the sketch after this list)
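Sparse files are why the image can start out tiny. A minimal illustration (the file name and size are hypothetical):

```sh
# A sparse image has a huge apparent size but allocates no blocks
# until data is actually written.
truncate -s 1T forest.img
du -h --apparent-size forest.img   # reports ~1.0T
du -h forest.img                   # reports ~0: nothing allocated yet
```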
about the speed impacts of saving the state
The fastest version was the one where I’d commented out all saving of state. Saving state to a tmpfs in memory slowed it down by a third; saving state to the internal Micro SD card – and this was my starting point – ran at about 4% of the no-state speed.
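For reference, /dev/shm is a tmpfs that most Linux systems mount by default; a dedicated one works the same way (the mount point here is mine):

```sh
# /dev/shm is RAM-backed, so state writes never touch the SD card
df -h /dev/shm
# or mount a dedicated tmpfs to hold the state file
sudo mkdir -p /mnt/state
sudo mount -t tmpfs -o size=512m tmpfs /mnt/state
```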
file system formats
ext2 vs. ext4
The Rust program was documented as making an ext4 file system, but it was really making an ext2 file system. (I corrected this oversight with merge request #2, merged 2024-03-17.) Switching to an ext4 file system sped up the process by about 45%.
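One quick way to check what mkfs actually produced (the image name is hypothetical):

```sh
# blkid probes the superblock and reports the real file system type,
# regardless of what the documentation claims
blkid forest.img    # e.g. forest.img: ... TYPE="ext2"
```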
XFS
I didn’t modify the defaults. After 100 minutes, it estimated 19 days remaining. After I hit Ctrl-C, it took over 20 minutes to get a responsive shell back, and unmounting took a few more minutes.
btrfs
By default, it stores two copies of metadata (the DUP profile). For speed, my second attempt (“v2”) switched to one copy of metadata:
```sh
mkfs.btrfs --metadata single --nodesize 64k -f $image
```
overall timings for making forests
These are the method timings to create a billion files, slowest to fastest.
| method | clock time | files/second | disk space |
|---|---|---|---|
| shell script: run touch, 1 billion times, ext4 | 31 d (estimated) | 375 | |
| Rust program, XFS defaults | 19 d (estimated) | 610 | |
| Rust program, ext4, state on Micro SD | 17 d (estimated) | 675 | |
| Rust program, btrfs defaults | 38 hr 50 min | 7510 | 781 GB |
| shell script: unzip 1 million files, 1k times, ext4 | 34 hr (estimated) | 7960 | |
| Rust program, ext2 | 27 hr 5 min 57 s | 10250 | 276 GB |
| Python script, ext4 | 24 hr 11 min 43 s | 11480 | 275 GB |
| Rust program, ext4, state on /dev/shm | 23 hr (estimated) | 11760 | |
| shell script: untar 1 million files, 1k times, ext4 | 21 hr 39 min 16 s | 12830 | 260 GB |
| shell script: touch 1k files, 1 million times, ext4 | 19 hr 17 min 54 s | 14390 | 260 GB |
| Rust program, btrfs v2 | 18 hr 19 min 14 s | 15160 | 407 GB |
| Rust program, ext4 | 15 hr 23 min 46 s | 18040 | 278 GB |
Edit, 2024-06-23: working in parallel speeds it up a bit
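As a rough sketch of the parallel idea, assuming the layout above (the worker count is arbitrary):

```sh
# Fan the per-directory work out over 8 workers with xargs -P.
printf '%s\n' */ | xargs -P 8 -I{} sh -c '
  cd "$1" && for d in */; do (cd "$d" && touch $(seq -w 1 1000)); done
' _ {}
```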