Jun 23

File systems with a billion files, making forests, parallel multitouch


about

Making file systems with a billion files is an interesting way to feel out scaling issues.

The intro post for file systems with a billion files has a table of contents for the series. This post covers yet another way to make them.

While working on the upcoming archiving and compression post (and hitting various obstacles), yet another method for making these file systems came to mind: running multiple multitouch processes in parallel. Spoilers: it’s the fastest method for making a file system with a billion files that I’ve run.

hardware

I worked from a Raspberry Pi 4, with 4 GB RAM, running Debian 12 (bookworm). It has four cores. For storage – this is new – I connected the Pi to my NAS.

environmental changes

While gathering data for the upcoming archiving and compression post, the first spare drive, a 2017-era Seagate USB hard drive, flaked out. When connected to my Pi, the drive would go offline, with kernel log messages involving “over-current change” – which might point at the power supply for the Pi. That drive is shelved for now.

The second spare drive I reached for was a 2023-era WD USB hard drive. It was also an SMR drive, and the performance was surprisingly awful. As in, I went looking for malware. Perhaps it just needs some time for internal remapping and housekeeping, but it’s off to the side for now.

I connected the Pi to my NAS, with redundant CMR hard drives, over Gigabit Ethernet.

new repo

Before, I’d placed a few scripts in a fork of create-empty-files; I’m adding more, and they feel like tangential clutter there. So I started a billion-file-fs repo. I plan to put everything from this post there; see the scripts directory.

parallel multitouch

I realized I could run multiple multitouch processes in parallel.
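As a rough illustration – not the actual script from the repo – here’s a minimal sketch of the idea. It assumes “multitouch” amounts to handing one touch invocation a big batch of file names, and it splits the work across per-worker subdirectories; the argument handling, names, and defaults are made up for this post.

#!/usr/bin/env bash
# Sketch only -- assumptions: "multitouch" means passing a large batch of
# file names to a single touch call, and each worker writes into its own
# subdirectory. Names, arguments, and defaults are illustrative.
set -eu

OUT=${1:?target directory}        # e.g. a directory on the NAS mount
WORKERS=${2:-5}                   # number of parallel multitouch processes
TOTAL=${3:-1000000}               # total files to create across all workers
BATCH=1000                        # file names handed to each touch invocation

per_worker=$(( TOTAL / WORKERS )) # rounds down; close enough for a sketch

worker() {
    local dir="$OUT/worker$1"
    mkdir -p "$dir"
    local n=0
    while [ "$n" -lt "$per_worker" ]; do
        # Build one batch of names, then create them all with a single touch.
        local names=()
        local end=$(( n + BATCH ))
        while [ "$n" -lt "$end" ] && [ "$n" -lt "$per_worker" ]; do
            names+=("$dir/f$n")
            n=$(( n + 1 ))
        done
        touch "${names[@]}"
    done
}

for i in $(seq 1 "$WORKERS"); do
    worker "$i" &                 # one background process per worker
done
wait                              # block until every worker finishes

Giving each worker its own subdirectory is a guess at reducing contention on a single directory; I haven’t isolated how much that part matters.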

There’s a rule of thumb for parallel processes: the number of processor cores, plus one. I benchmarked with a million files and one to ten processes; the peak was five, matching that expectation.

processes   time     files/sec
1           1m57s     8490
2           1m5s     15200
3           50s      19700
4           45.2s    22000
5           43.7s    22800
6           44.1s    22600
7           45.0s    22100
8           44.8s    22200
9           44.8s    22200
10          45.0s    22100
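The benchmark loop itself is nothing fancy; something along these lines would reproduce the shape of it (the script name and paths are placeholders for whatever runner and mount point you actually use):

# Timing loop for the table above -- a sketch, not the repo's script.
# "parallel-multitouch.sh" and the paths are placeholder names.
for n in $(seq 1 10); do
    dir=/mnt/nas/bench-$n                                # fresh scratch directory per run
    mkdir -p "$dir"
    echo "== $n processes =="
    time ./parallel-multitouch.sh "$dir" "$n" 1000000    # one million files total
    rm -rf "$dir"                                        # clean up before the next run
done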

how did it do?

It populated a billion files in under seven hours, a new personal record!

Here’s how the fastest generators ran against NAS storage:

method                        time        space used   files/sec
multitouch on ext4            18hr36min   260GB        14900
Rust program on ext4          14hr14min   278GB        19500
Rust program on ext2          12hr48min   276GB        21700
parallel multitouch on ext4   6hr41min    260GB        41500

Against the first USB drive, the Rust program on ext2 was slower than on ext4. They switched places here, which surprised me.

what’s next?

Probably archiving and compression – that post is already much longer, with graphs.
