Jun 23
File systems with a billion files, making forests, parallel multitouch
Making file systems with a billion files is interesting for feeling out scaling issues.
The intro post for file systems with a billion files has a table of contents for the series; this post adds yet another way to make them.
While working on the upcoming archiving and compression post (and hitting various obstacles along the way), yet another method for making these file systems came to mind: running multiple multitouch processes in parallel. Spoiler: it’s the fastest method for making a file system with a billion files that I’ve run.
hardware
I worked from a Raspberry Pi 4 with 4 GB of RAM and four cores, running Debian 12 (bookworm). For storage – and this is new – I connected the Pi to my NAS.
environmental changes
While I was gathering data for the upcoming archiving and compression post, the first hard drive, a 2017-era Seagate USB drive, flaked out. When connected to the Pi, the drive would go offline, with kernel log messages about an “over-current change”, which might point at the Pi’s power supply. That drive is shelved for now.
The second spare drive I reached for was a 2023-era WD USB hard drive. It was also an SMR drive, and the performance was surprisingly awful. As in, I went looking for malware. Perhaps it just needs some time for internal remapping and housekeeping, but it’s off to the side for now.
So I connected the Pi to my NAS, which has redundant CMR hard drives, over Gigabit Ethernet.
new repo
Before, I’d placed a few scripts in a fork of create-empty-files; I’m adding more, and it feels like tangential clutter there. So I started a billion-file-fs repo. I plan to put everything from this post there; see the scripts directory.
parallel multitouch
I realized I could run multiple multitouch processes in parallel.
There’s a rule of thumb for parallel processes: the number of processor cores, plus one. I benchmarked with a million files and one to ten processes; the peak was at five processes, matching that expectation.
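As a concrete sketch, here’s roughly what a driver for that benchmark can look like, assuming “multitouch” means passing a large batch of filenames to each touch invocation; the worker count, batch size, and per-worker directories are illustrative rather than the exact script in the repo:

```sh
#!/bin/sh
# Illustrative parallel-multitouch driver for the million-file benchmark.
WORKERS=5          # cores + 1 on a four-core Pi 4
TOTAL=1000000      # files to create for the benchmark run
BATCH=1000         # filenames handed to each touch invocation
PER_WORKER=$((TOTAL / WORKERS))

for w in $(seq 1 "$WORKERS"); do
  (
    mkdir -p "worker$w"
    i=0
    while [ "$i" -lt "$PER_WORKER" ]; do
      # Build one batch of names and create them with a single touch call.
      names=$(seq -f "worker$w/f%.0f" "$i" $((i + BATCH - 1)))
      touch $names
      i=$((i + BATCH))
    done
  ) &
done
wait               # block until every worker has finished
```

Timing runs of a script like that, with the worker count varied, gives elapsed figures like the ones in the table below.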
| processes | time | files/sec |
|---|---|---|
| 1 | 1m57s | 8490 |
| 2 | 1m5s | 15200 |
| 3 | 50s | 19700 |
| 4 | 45.2s | 22000 |
| 5 | 43.7s | 22800 |
| 6 | 44.1s | 22600 |
| 7 | 45.0s | 22100 |
| 8 | 44.8s | 22200 |
| 9 | 44.8s | 22200 |
| 10 | 45.0s | 22100 |
how did it do?
It populated a billion files in under seven hours, a new personal record!
Here’s how the fastest generators ran against NAS storage:
| method | time | space used | files/sec |
|---|---|---|---|
| multitouch on ext4 | 18hr36min | 260GB | 14900 |
| Rust program on ext4 | 14hr14min | 278GB | 19500 |
| Rust program on ext2 | 12hr48min | 276GB | 21700 |
| parallel multitouch on ext4 | 6hr41min | 260GB | 41500 |
Against the first USB drive, the Rust program on ext2 was slower than on ext4; here they switched places, which surprised me.
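For sanity-checking numbers like these, a couple of quick commands work (the mount point here is illustrative):

```sh
df -i /mnt/nas-scratch    # inodes used: just over a billion after a full run
df -h /mnt/nas-scratch    # block usage: where a "space used" figure like 260GB comes from
# files/sec is total files divided by elapsed seconds, e.g. for the parallel run:
echo $((1000000000 / (6 * 3600 + 41 * 60)))   # prints 41562, close to the 41500 above
```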
what’s next?
Probably archiving and compression; that draft is already much longer, and it has graphs.