Mar 21

Making file systems with a billion files

This is part 2; part 1 has an intro and links to the others.

I forget where I picked up “forest” as “many files or hardlinks, largely identical”. I hope it’s more useful than confusing. Anyway. Let’s make a thousand thousand thousand files!

file structures

Putting even a million files in a single folder is not recommended, so I used the usual nested structure:

  • a thousand level 1 folders, each containing:
    • a thousand level 2 folders, each containing:
      • a thousand empty files
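As a rough illustration of that layout, here is a minimal Python sketch. It is not one of the scripts timed below; the function name make_forest and the zero-padded folder names are my own assumptions, and the defaults only mirror the thousand/thousand/thousand shape.

```python
import os
from pathlib import Path


def make_forest(root, level1=1000, level2=1000, files=1000):
    """Create level1 x level2 folders, each holding `files` empty files.

    Names are zero-padded (0001, 0002, ...), which is an assumption
    made for this sketch. Returns the number of files created.
    """
    root = Path(root)
    created = 0
    for i in range(1, level1 + 1):
        for j in range(1, level2 + 1):
            folder = root / f"{i:04d}" / f"{j:04d}"
            folder.mkdir(parents=True, exist_ok=True)
            for k in range(1, files + 1):
                # Opening with O_CREAT and closing leaves an empty file,
                # much like `touch` does.
                fd = os.open(folder / f"{k:04d}", os.O_CREAT | os.O_WRONLY, 0o644)
                os.close(fd)
                created += 1
    return created
```

Trying it with small numbers first, for example make_forest("/tmp/forest", 2, 3, 4), is a cheap way to check the shape before committing to the full billion.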

various script attempts

These are ordered roughly from slowest to fastest. All times were measured on an ext4 file system.

Lots more details are over in a GitLab repo, which is a fork of the Rust program's repo.

  • forest-touch.sh – run touch $file in a loop, 1 billion times
  • create_files.py – touches a file, 1 billion times. From Lars Wirzenius, take 1 (repo).
  • forest-tar.sh – build a tar.gz with a million files, then unpack it, a thousand times. makes an effort for consistent timestamps.
  • forest-multitouch.sh – run touch 0001 ... 1000 in a loop, 1 million times. makes an effort for consistent timestamps.

More consistent timestamps can lead to better compression of drive images, later.
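To show what "consistent timestamps" could look like, here is a small Python sketch that creates a batch of empty files and pins their modification and access times to one fixed value. The name touch_batch and the FIXED_TIME value are hypothetical, and this is not how the shell scripts above do it. Note that the inode change time (ctime) cannot be set this way.

```python
import os
from pathlib import Path

# Hypothetical fixed timestamp (seconds since the epoch); any constant works.
FIXED_TIME = 1_700_000_000


def touch_batch(folder, count=1000, timestamp=FIXED_TIME):
    """Create `count` empty files in `folder`, all with the same atime/mtime."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for k in range(1, count + 1):
        path = folder / f"{k:04d}"
        path.touch()
        os.utime(path, (timestamp, timestamp))
    # Adding entries updates the folder's own mtime, so pin it last.
    os.utime(folder, (timestamp, timestamp))
    return count
```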

A friend, Elliot Grafil, suggested that tar would have the benefits of decades of optimization. It’s not a bad showing! zip didn’t fare as well: it was slower, it took more space, and it couldn’t be streamed through a pipe the way tar.gz can.
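The build-once, unpack-many idea can be sketched with Python's tarfile module. This is only an illustration of the technique, not forest-tar.sh itself; the helper names build_tar_gz and unpack_tar_gz are made up for this example.

```python
import io
import tarfile


def build_tar_gz(names, mtime=0):
    """Return the bytes of a tar.gz holding one empty file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = 0        # empty file, so no data needs to follow
            info.mtime = mtime   # same timestamp for every member
            tar.addfile(info)
    return buf.getvalue()


def unpack_tar_gz(data, dest):
    """Unpack the archive bytes under the folder `dest`."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        # Newer Python versions have extraction filters; use the safe
        # "data" filter when it exists, otherwise fall back.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
```

The archive only needs to be built once; after that, each unpack into a different level 1 folder reuses the same bytes. Parent folders are created implicitly during extraction and get the current time, so they would need their timestamps set separately.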

the Rust program

Lars Wirzenius’ create-empty-files, with some modifications, was the fastest method.

For documentation, I filed merge request #3, merged 2024-03-17.

Some notes on usage:

  • other file system types are doable, as long as mount recognizes them automatically.
  • if, when you run this, it takes over a minute to show a progress meter, make sure you’re running on a file system that supports sparse files
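One way to check for sparse file support before a long run is to make a large empty file and see how much space it really occupies. This Python sketch assumes a Linux-like system where st_blocks counts 512-byte units; the helper names are hypothetical, and it is not part of the Rust program.

```python
import os


def make_sparse_image(path, size_bytes):
    """Create a file of `size_bytes` without writing any data to it."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


def looks_sparse(path):
    """True if the file occupies less disk space than its apparent size."""
    st = os.stat(path)
    # st_blocks is the number of 512-byte blocks actually allocated.
    return st.st_blocks * 512 < st.st_size
```

If looks_sparse() returns False for a freshly made image, the file system has probably written out every zero, which would fit the slow start described above.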

about the speed impacts of saving the state

The fastest version was the one where I’d commented out all saving of state. When state was saved to a tmpfs in memory, it slowed down by about a third. When state was saved to the internal Micro SD card (my starting point), it ran at about 4% of the speed.
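Those ratios can be checked against the files/second figures from the timing table further down. This sketch assumes the fastest "Rust program, ext4" row is the run with state saving commented out.

```python
# files/second, taken from the overall timings table
no_state = 18040   # Rust program, ext4 (assumed: no state saved)
tmpfs = 11760      # Rust program, ext4, state on /dev/shm
micro_sd = 675     # Rust program, ext4, state on Micro SD

slowdown_tmpfs = 1 - tmpfs / no_state    # fraction of speed lost
relative_micro_sd = micro_sd / no_state  # fraction of speed kept

print(f"tmpfs: {slowdown_tmpfs:.0%} slower")
print(f"Micro SD: {relative_micro_sd:.1%} of the fastest speed")
```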

file system formats

ext2 vs. ext4

The Rust program was documented as making an ext4 file system, but it was really making an ext2 file system. (I corrected this oversight with merge request #2, merged 2024-03-17.) Switching to an ext4 file system sped up the process by about 45%.

XFS

I didn’t modify the defaults. After 100 min, it estimated 19 days remaining. After hitting ctrl-c, it took 20+ min to get a responsive shell. Unmounting took a few minutes.

btrfs

By default, it stores two copies of metadata. For speed, my second attempt (“v2”) switched to one copy of metadata:

mkfs.btrfs --metadata single --nodesize 64k -f $image

overall timings for making forests

These are the timings for each method to create a billion files, slowest to fastest.

method                                                 | clock time         | files/second | space
shell script: run touch x 1 billion times, ext4        | 31d (estimated)    | 375          |
Rust program, xfs defaults                             | 19d (estimated)    | 610          |
Rust program, ext4, state on Micro SD                  | 17d (estimated)    | 675          |
Rust program, btrfs defaults                           | 38hr 50min         | 7510         | 781GB
shell script: unzip 1 million files, 1k times, ext4    | 34hr (estimated)   | 7960         |
Rust program, ext2                                     | 27hr 5min 57s      | 10250        | 276GB
Python script, ext4                                    | 24hr 11min 43s     | 11480        | 275GB
Rust program, ext4, state on /dev/shm                  | 23hr (estimated)   | 11760        |
shell script: untar 1 million files, 1k times, ext4    | 21hr 39min 16s     | 12830        | 260GB
shell script: touch 1k files, 1 million times, ext4    | 19hr 17min 54s     | 14390        | 260GB
Rust program, btrfs v2                                 | 18hr 19min 14s     | 15160        | 407GB
Rust program, ext4                                     | 15hr 23min 46s     | 18040        | 278GB
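For the completed runs, the files/second column appears to be a billion divided by the elapsed seconds, rounded to the nearest ten. Here is a small Python sketch of that arithmetic, plus the reverse conversion used to read the estimated rows; the function names are my own.

```python
FILES = 1_000_000_000


def files_per_second(hours, minutes=0, seconds=0, files=FILES):
    """Average creation rate for a run with the given clock time."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return files / elapsed


def days_needed(rate, files=FILES):
    """Days to create `files` files at `rate` files per second."""
    return files / rate / 86400


# Rust program, ext4: 15hr 23min 46s
print(round(files_per_second(15, 23, 46), -1))
# shell script running touch a billion times, at 375 files/second
print(round(days_needed(375)))
```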

Edit, 2024-06-23: working in parallel speeds it up a bit
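One possible way to split the work, sketched in Python: hand each level 1 folder to a worker thread. This is only an assumption about what "in parallel" might look like, not the approach used for that edit; the names fill_level1 and make_forest_parallel and the worker count are hypothetical, and any speedup depends on the file system and storage.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fill_level1(root, i, level2, files):
    """Create one level 1 folder, its level 2 folders, and their empty files."""
    created = 0
    for j in range(1, level2 + 1):
        folder = Path(root) / f"{i:04d}" / f"{j:04d}"
        folder.mkdir(parents=True, exist_ok=True)
        for k in range(1, files + 1):
            fd = os.open(folder / f"{k:04d}", os.O_CREAT | os.O_WRONLY, 0o644)
            os.close(fd)
            created += 1
    return created


def make_forest_parallel(root, level1=1000, level2=1000, files=1000, workers=4):
    """Fill the level 1 folders concurrently; returns the total file count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda i: fill_level1(root, i, level2, files),
            range(1, level1 + 1),
        )
        return sum(counts)
```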
