billion files – miscellaneous
https://pronoiac.org/misc

File systems with a billion files, archiving and compression
https://pronoiac.org/misc/2024/09/file-systems-with-a-billion-files-archiving-and-compression/
Mon, 02 Sep 2024

about

This continues the billion-file file systems blog posts (tag); the first post has an introduction and a table of contents.

Previously, we looked at populating file systems.

The file systems / drive images are unwieldy and tricky to copy around efficiently. Archiving and compressing them makes them much smaller and easier to move around.

This is a long post; sorry not sorry.

goals and theories to test

  • reproduce (and add to) the archive stats from Lars Wirzenius’ blog post
  • recompressing – decompressing the gzip -1 output and piping it into another compressor – could speed things up, unless the cpu cost of decompression outweighs the other resources saved
  • do ext2 and ext4 produce drive images that compress better or worse? how about the different generators?
  • using tar’s sparse file handling should help with compression
  • there are a lot of compressors; let’s figure out the interesting ones

overall notes

Notes:

  • compressed sizes are as reported by pv, which uses binary units – gibibytes, base 1024 – rather than base-1000 gigabytes. When I’m not citing a measured number, I’ll probably say “gigabytes”, out of habit.
  • the timings are by wall clock, not cpu
  • decompression time is measured to /dev/null, not to a drive

xz – running 5.4.1-0.2, with liblzma 5.4.1. It is not impacted by the 2024-03-29 discovery of an xz backdoor, at least, so far.

  • the full xz -1 command: xz -T0 -1
  • the full xz -9 command: xz -T3 -M 4GiB -9
    • I’m using only three threads, on a four-core system, because of memory constraints: this is a 4GiB system, and in practice, each thread uses over a gigabyte of memory.
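Putting those pieces together, the two measurement paths look roughly like this – a sketch, with the image and archive names standing in for whichever file system is being measured:

    # compress the raw image directly; pv, after the compressor, reports the compressed size
    xz -T0 -1 < 1tb-ext2.img | pv > 1tb-ext2.img.xz-1

    # recompress: decompress an existing gzip -1 archive and pipe it into another compressor
    zcat 1tb-ext2.img.gz-1 | xz -T3 -M 4GiB -9 | pv > 1tb-ext2.img.xz-9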

hardware changes

As mentioned in the parallel multitouch post, one SMR hard drive fell over, and its replacement (another arbitrary unused drive off my shelf) was painfully slow, writing at 3-4MiB/s. I’d started the archiving and compression work on the SMR drives before shifting to a network share on my NAS as storage for the Pi. That sped up populating the file systems, but there’s a caveat later.

new, faster media! let’s make those file systems again!

Some timings for generating the file systems into drive images, onto the NAS. Roughly in order of appearance in this post:

method                         time
Rust program on ext2           12hr 48min
Rust program on ext4           14hr 14min
multitouch on ext4             18hr 36min
parallel multitouch on ext4    7hr 4min

That Rust on ext2 was the fastest generator I’d run, to that point.

Surprisingly, populating ext4 was slower than ext2; they switched places from last time, on the first SMR hard drive I tried.

retracing Wirzenius’ results

Let’s check the numbers from Wirzenius’ blog post. His post intended to use ext4, but the program was actually making ext2 (a bug, now fixed). To reproduce his statistics, we’ll use the Rust program and ext2 for these results. For example, the resulting files would look like 1tb-ext2.img.gz-1.

These are compressing a 1TB file, with 276GB of data reported by du.

compression   previously (bytes)   my ext2 size   in bytes       compression time   gzip -1 recompression   decompression
gzip -1       20546385531          18.7GiB        20066756423    4hr 22m            4hr 10m                 1hr 49m
gzip -9       15948883594          14.7GiB        15798444076    19hr 27m           19hr 13m                2hr 35m
xz -1         11389236780          10.6GiB        11340387800    4hr 23m            4hr 6m                  44m 38s
xz -9         10465717256          9.6GiB         10396503772    20hr 45m           19hr 57m                44m 46s
total                              53.6GiB                       49hr               47hr 26m                5hr 52m

Some notes:

  • it looks like decompressing a gzip version, and compressing that output, does save some time, though only 1% to 4% here. On the SMR drives, I’d seen improvements of 10% and more. This is enough for me to skip “compress directly from the raw image file, over the network” from now on.
  • the comparative sizes between my results and Wirzenius’ are roughly the same. My compressed images are a bit smaller – between 0.4% and 2.3%. My theory: my image generation ran faster, so the timestamps varied less, and so the images compressed better.

onward, to ext4

As mentioned, using ext2 wasn’t intended. As of March 2024, the Linux kernel has deprecated ext2, because its timestamps can’t handle the Year 2038 Problem.

ext4, rust

I used the observation that gzip decompression paid for itself timewise, compared to network I/O.

compression   my ext2 size   my ext4 size   in bytes       gzip -1 recompression   decompression
gzip -1       18.7GiB        25.8GiB        27676496022    4hr 25m                 1hr 53m
gzip -9       14.7GiB        21.7GiB        23267641548    27hr 11m                2hr 39m
xz -1         10.6GiB        15.4GiB        16492139376    4hr 44m                 47m
xz -9         9.6GiB         14.3GiB        15361740048    27hr                    51m
total         53.6GiB        77.2GiB                       63hr 20m                6hr 10m

Comparing against ext2:

  • the compressed file sizes: ext4 came out 37% to 48% larger than ext2.
  • compression time increased by 15% to 42%.

using tar on file systems

The resulting files here would be something like 1tb-ext4.img.tar.gz – we’re archiving the drive image itself, not the mounted file system.

a sidenote: sparse files

The file systems / drive images are sparse files – while they could hold up to a terabyte, we’ve only written about a quarter of that. When we measure storage used – du -h, instead of ls -h – they only take up the smaller amount. They’re still unwieldy – about 270 gigabytes! – but a full terabyte is worse. The remainder is empty, which we don’t need to reserve space for, and we can (often) skip processing or storing it, with care. But, if anything along the way – operating systems, software, the file system holding the drive image, network protocols – doesn’t support sparse files, or we don’t enable sparse file handling, then it’ll deal with almost four times as much “data.” (Yes, this is ominous foreshadowing.)
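For example – the image name and the exact numbers here are illustrative, not a transcript:

    ls -lh 1tb-ext4.img    # apparent size: the full 1.0T
    du -h 1tb-ext4.img     # blocks actually allocated: roughly 270G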

The gzip and xz compressors above processed the full, mostly empty, terabyte. Unpacking those, as is, will take up a full terabyte.

The theory: if we archive with sparse file handling, then compress, the results are smaller, and we can transfer them with less hassle – we benefit from sparse file handling once, and then it becomes a non-issue for everything downstream.

issue: sparse files on a network share

tar can handle sparse files efficiently. Even without compression, we’d expect tar to read that sparse file, and make a tar file that’s closer to 270 GiB than a terabyte.
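A sketch of that step, with compression tacked on – the file names are assumptions, and GNU tar’s --sparse flag is what enables the hole detection discussed next:

    tar --sparse -cf 1tb-ext4.img.tar 1tb-ext4.img           # store the holes as holes
    gzip -1 < 1tb-ext4.img.tar | pv > 1tb-ext4.img.tar.gz    # then compress, pv reporting the size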

An issue here: While the Raspberry Pi, the NAS, and each of their operating systems can each handle sparse files well, the network share doesn’t – I was using SMB / Samba or the equivalent, not, say, NFS or WebDAV. Writing sparse files works well! Reading sparse files: the network share sends all the zeros over the wire, instead of “hey, there’s a gigabyte of sparse data gap here”. While a gigabyte of zeros only takes a few seconds to transfer over gigabit ethernet, we’re dealing with over 700 of them. It adds up.

How this interacts with tar:

However, be aware that --sparse option may present a serious drawback. Namely, in order to determine the positions of holes in a file tar may have to read it before trying to archive it, so in total the file may be read twice. This may happen when your OS or your FS does not support SEEK_HOLE/SEEK_DATA feature in lseek.

Doing the math (on the order of 700 GiB of zeros at gigabit-ethernet speeds), I’d expect tar to take over two hours to get to the point of actually emitting output. Creating image.tar over the network did take over four hours, but counting that against tar seems uncharitable, like we’re setting it up to fail. After the initial, slow tar creation, we’re no longer working with a bulky sparse file; we can recompress the tar and (in theory) see something like what we’d get with another network sharing protocol or a fast local drive.

As a sidenote: Debian includes bsdtar, a port of tar from (I think) FreeBSD; it didn’t detect gaps at all in this context, so the generated tar file was a full terabyte, and the image.tar.gz didn’t save space compared to the image.gz. On a positive note, it can detect gaps on unpacking, if you happened to make a non-sparse tar file and want to preserve the timestamps and other metadata.

tar compression results

To gauge compression speed, I started – as above – from an initial tar.gz file, decompressed it into a pipe, and recompressed it.

We’ll call the largest entry per column 100% – that’s gzip -1.

compressor   metric               rust, image      rust, tar        tar saves
gzip -1      space                25.8GiB, 100%    22.4GiB, 100%    3.4GiB, 13%
             compression time     4hr 25m          1hr 33m          64%
             decompression time   1hr 53m          37m 16s          67%
gzip -9      space                21.7GiB, -15%    20.9GiB, -6%     0.8GiB, 3.6%
             compression time     27hr 11m         23hr 58m         11%
             decompression time   2hr 39m          36m 54s          77%
xz -1        space                15.4GiB, -40%    15.2GiB, -32%    0.2GiB, 1.2%
             compression time     4hr 44m          2hr 26m          48%
             decompression time   47m              18m 25s          61%
xz -9        space                14.3GiB, -44%    14.2GiB, -36%    0.1GiB, 0.6%
             compression time     27hr             23hr 33m         12%
             decompression time   51m              21m 17s          58%

In short: Using tar (after the initial network share hurdle) saved resources, both compression time and file size.

more compressors

There’s a trend, when gathering compressors, to try to catch them all – er, to be comprehensive. “My Pokemans compressors, let me show you them.” I’m not immune to this, though I showed some restraint.

I tried compressors I was familiar with, plus some that came up in discussions and were already installed or easily available for Debian on ARM.

issues found: crashes when decompressing

Various compressors were glitchy in the latest OS (bookworm, Debian 12), against storage on the NAS: lzop, pbzip2, pigz, and plzip. They seemed to compress, but then crashed on attempts to decompress. After this, I used checksums to verify the output of decompression, and verified that these were the only compressors with issues. I made a fresh bookworm MicroSD card, and reproduced the issue.
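The verification was along these lines – a sketch, with stand-in file names; the exact decompression flags vary per compressor:

    sha256sum 1tb-ext4.img.tar                  # checksum of the original archive, recorded beforehand
    pigz -dc 1tb-ext4.img.tar.gz | sha256sum    # decompress to stdout; the sums should match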

These compressors worked in the previous version (bullseye, Debian 11), against storage on the NAS. They’re marked on the graph with different colors and shapes. I intend to explore the issues further, but in a later, separate blog post.

my method for benchmarking compression

I wrote shell scripts for this. A fun fact: bash doesn’t read far ahead in a script while interpreting it, so I was able to edit and reprioritize upcoming steps while it was running – cheap and dirty.

The form of the scripts evolved as I worked on this; the latest version was DRY’ed up (Don’t Repeat Yourself), and I’ll post it on my Gitlab repo. This should be enough for others to check that, say, I didn’t leave out multithreading for some compression or decompression.
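Until that’s posted, here’s a rough sketch of the shape of one iteration – not the actual script; the compressor and file names are illustrative:

    src=1tb-ext4.img.tar.gz    # the initial gzip -1 tar from earlier

    # recompress: decompress the gzip -1 tar into a pipe, feed it to the compressor
    # under test, and let pv report the compressed size
    time zcat "$src" | xz -T0 -1 | pv > 1tb-ext4.img.tar.xz-1

    # decompress to /dev/null, not to a drive, to time it without storage overhead
    time xz -T0 -dc 1tb-ext4.img.tar.xz-1 > /dev/null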

results

I graphed compression time against compressed file size, for the ext4 file system made by the Rust program, archived as a tar file.

[graph: compression time vs compressed file size]

Over on the billion file fs repo:

  • the data tables, as CSVs
  • graphs including decompression time
  • R code for all the graphs

Not included in those tables or graphs:

  • brotli -11 was on track to take 23 days, while still coming out larger than xz -9.
  • lrzip took over 22 hours to decompress to stdout, over ten times slower than the next slowest, pbzip2 -9. Even on a log scale, it was the only data point in the top half of the graph.

The compressors of interest to me were the ones with the fastest compression and smallest compressed files. From fastest to slowest:

  • zstd -1 and -9
  • plzip -0 – could replace zstd -9, but was only reliable on bullseye
  • xz -1 and -9
  • gzip -1 and -9 – just for comparison

To be thorough about the Pareto frontier:

  • lzop -1 has potential – a minute faster than zstd -1, but 50% larger, unreliable on bookworm, and four minutes slower on bullseye. (Benchmarking bullseye vs. bookworm is out of scope.)
  • plzip -9 offered a bit over 1% space savings over xz -9, but, took 2.6 times as long – 64 hours vs 23 hours

how well do file systems from the different generators compress?

I expected the Rust image to compress worse than multitouch, and I wasn’t sure where parallel multitouch would fit.

Let’s discuss how they lay out the empty files.

generators – Rust

Wirzenius’ Rust program creates the files, in order, one at a time.

Looking at a file system it made, the directory layouts are (roughly, with a caveat):

  • 0 to 999/0 to 999/file-1.txt to file-1000000000.txt

generators – multitouch

Multitouch creates the files, in order, 1k at a time. Parallel multitouch makes the directories roughly in order, but slightly interleaved. Both attempt to use consistent timestamps and filenames.

The directory layouts:

  • 0001 to 1000/0001 to 1000/0001 to 1000

The shorter file names led to less space usage, before compression.

results

The table of data is in the same place as above; here’s the chart comparing the different generators, for the compressors I decided on above.

[graph: compression time vs compressed file size, for different generators]

analysis

Looking at the chart, parallel multitouch was a bit worse than multitouch, but both compressed to smaller files compared to the Rust image.

plzip -0 usually, but not always, fit between zstd -1 and xz -1. For multitouch (not parallel), it was larger than zstd -9.

compression on ext4 file systems

generator             compressor              size      compression time   decompression time
Rust                  ext4 image, xz -9       14.3GiB   27hr               51min
Rust                  ext4 image.tar, xz -9   14.2GiB   23hr 33min         21min
multitouch            ext4 image.tar, xz -9   9.8GiB    19hr 5m            18min
parallel multitouch   ext4 image.tar, xz -9   10.1GiB   19hr 19min         18min

overall summary

  • decompressing gzip and piping that to other compressors saves some resources
  • compressed ext4 drive images were larger than ext2, by 37% to 48%
  • tar’s sparse file handling helps with compression time and space
  • the compressors I’d recommend: zstd -1 and -9; plzip -0; and xz -1 and -9
  • multitouch file systems compress better than the Rust file systems

reference: software versions

These are the versions reported by apt.

For Debian 12, bookworm, the newer release:

  • brotli: 1.0.9-2+b6
  • gzip: 1.12-1
  • lrzip: 0.651-2
  • lz4: 1.9.4-1
  • xz: 5.4.1-0.2, with liblzma 5.4.1. It is not impacted by the 2024-03-29 discovery of an xz backdoor, at least, so far.
  • zstd: 1.5.4+dfsg2-5

Debian 11, bullseye, the older release:

  • lzop: 1.04-2
  • pbzip2: 1.1.13-1. parallel bzip2
  • pigz: 2.6-1. parallel gzip
  • plzip: 1.9-1

File systems with a billion files, making forests, parallel multitouch
https://pronoiac.org/misc/2024/06/file-systems-with-a-billion-files-making-forests-parallel-multitouch/
Mon, 24 Jun 2024

about

Making file systems with a billion files is interesting for feeling out scaling issues.

The intro post for file systems with a billion files has a table of contents. This post covers yet another way to make file systems with a billion files.

While working on the upcoming archiving and compression post, and hitting various obstacles, yet another method for making those file systems came to mind: running multiple multitouch processes in parallel. Spoilers: it’s the fastest method I’ve run for making file systems with a billion files.

hardware

I worked from a Raspberry Pi 4, with 4 GB RAM, running Debian 12 (bookworm). It has four cores. For storage – this is new – I connected the Pi to my NAS.

environmental changes

While I was gathering data for the upcoming archiving and compression post, the first hard drive – a 2017-era Seagate USB hard drive – flaked out. When connected to my Pi, the drive would go offline, with kernel log messages involving “over-current change”, which might point at the power supply for the Pi. This hard drive’s shelved for now.

The second spare drive I reached for was a 2023-era WD USB hard drive. It was also an SMR drive, and the performance was surprisingly awful. As in, I went looking for malware. Perhaps it just needs some time for internal remapping and housekeeping, but it’s off to the side for now.

I connected the Pi to my NAS, with redundant CMR hard drives, over Gigabit Ethernet.

new repo

Before, I’d placed a few scripts in a fork of create-empty-files; I’m adding more, and it feels like tangential clutter there. So, I started a billion-file-fs repo. I plan to put everything in this post there; see the scripts directory.

parallel multitouch

I realized I could run multiple multitouch processes in parallel.

There’s a rule of thumb for parallel processes: the number of processor cores, plus one. I benchmarked with a million files, and one to ten processes; the peak was five, matching that expectation.

processes   time    files / sec
1           1m57s   8490
2           1m5s    15200
3           50s     19700
4           45.2s   22000
5           43.7s   22800
6           44.1s   22600
7           45.0s   22100
8           44.8s   22200
9           44.8s   22200
10          45.0s   22100
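The gist of it, sketched – the real script is in the repo’s scripts directory; populate-dir.sh here is a hypothetical worker that fills one top-level directory with its thousand subdirectories of a thousand files each:

    # five workers, matching the cores-plus-one rule of thumb above
    seq -w 0001 1000 | xargs -P 5 -n 1 ./populate-dir.sh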

how did it do?

It populated a billion files in under seven hours, a new personal record!

Here’s how the fastest generators ran, against NAS storage:

method                        time         space used   files/sec
multitouch on ext4            18hr 36min   260GB        14900
Rust program on ext4          14hr 14min   278GB        19500
Rust program on ext2          12hr 48min   276GB        21700
parallel multitouch on ext4   6hr 41min    260GB        41500

Against the first USB drive, Rust & ext2 was slower than ext4. They switched places here, which surprised me.

what’s next?

Probably archiving and compression – that post is already much longer than this one, with graphs.


Making file systems with a billion files
https://pronoiac.org/misc/2024/03/making-file-systems-with-a-billion-files/
Fri, 22 Mar 2024

This is part 2 – part 1 has an intro and links to the others.

I forget where I picked up “forest” as “many files or hardlinks, largely identical”. I hope it’s more useful than confusing. Anyway. Let’s make a thousand thousand thousand files!

file structures

Putting even a million files in a single folder is not recommended. For this, the usual structure:

  • a thousand level 1 folders, each containing:
    • a thousand level 2 folders, each containing:
      • a thousand empty files

various script attempts

These are ordered, roughly, slowest to fastest. These times were on an ext4 file system.

Lots more details over in a Gitlab repo, a fork of the Rust program repo.

  • forest-touch.sh – run touch $file in a loop, 1 billion times
  • create_files.py – touches a file, 1 billion times. from Lars Wirzenius, take 1, repo.
  • forest-tar.sh – build a tar.gz with a million files, then unpack it, a thousand times. makes an effort for consistent timestamps.
  • forest-multitouch.sh – run touch 0001 ... 1000 in a loop, 1 million times. makes an effort for consistent timestamps.

More consistent timestamps can lead to better compression of drive images, later.
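The core of the multitouch approach looks roughly like this – a sketch, not the exact forest-multitouch.sh; the fixed timestamp is an assumption, standing in for “makes an effort for consistent timestamps”:

    names=$(seq -w 0001 1000)
    for l1 in $(seq -w 0001 1000); do
        for l2 in $(seq -w 0001 1000); do
            mkdir -p "$l1/$l2"
            # one touch call creates (and timestamps) a thousand files at once
            ( cd "$l1/$l2" && touch -t 202403010000 $names )
        done
    done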

A friend, Elliot Grafil, suggested that tar would have the benefits of decades of optimization. It’s not a bad showing! zip didn’t fare as well: it was slower, it took more space, and couldn’t be streamed through a pipe like tar.gz can.

the Rust program

Lars Wirzenius’ create-empty-files, with some modifications, was the fastest method.

Some notes on usage:

(For documentation, I filed merge request #3, merged 2024-03-17.)

  • other file system types are doable, as long as mount recognizes them automatically.
  • if, when you run this, it takes over a minute to show a progress meter, make sure you’re running on a file system that supports sparse files
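A quick check for that – the mount point is an assumption:

    truncate -s 1T /mnt/scratch/sparse-test.img    # a 1 TiB file, no data written
    du -h /mnt/scratch/sparse-test.img             # should report ~0 if sparse files are supported
    rm /mnt/scratch/sparse-test.img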

about the speed impacts of saving the state

The fastest version was the one where I’d commented out all saving of state. If state were saved to a tmpfs in memory, it slowed down by a third. If state were saved to the internal Micro SD card – and this was my starting point – it ran at about 4% of the speed.

file system formats

ext2 vs. ext4

The Rust program was documented as making an ext4 file system, but it was really making an ext2 file system. (I corrected this oversight with merge request #2, merged 2024-03-17.) Switching to an ext4 file system sped up the process by about 45%.

XFS

I didn’t modify the defaults. After 100 min, it estimated 19 days remaining. After hitting ctrl-c, it took 20+ min to get a responsive shell. Unmounting took a few minutes.

btrfs

By default, it stores two copies of metadata. For speed, my second attempt (“v2”) switched to one copy of metadata:

mkfs.btrfs --metadata single --nodesize 64k -f $image

overall timings for making forests

These are the method timings to create a billion files, slowest to fastest.

method                                                clock time            files/second   space
shell script: run touch x 1 billion times, ext4       31d (estimated)       375
Rust program, xfs defaults                            19d (estimated)       610
Rust program, ext4, state on Micro SD                 17 days (estimated)   675
Rust program, btrfs defaults                          38hr 50min            7510           781GB
shell script: unzip 1 million files, 1k times, ext4   34 hrs (estimated)    7960
Rust program, ext2                                    27hr 5min 57s         10250          276GB
Python script, ext4                                   24hr 11min 43s        11480          275GB
Rust program, ext4, state on /dev/shm                 23hr (estimated)      11760
shell script: untar 1 million files, 1k times, ext4   21hr 39min 16s        12830          260GB
shell script: touch 1k files, 1 million times, ext4   19hr 17min 54sec      14390          260GB
Rust program, btrfs v2                                18hr 19min 14s        15160          407GB
Rust program, ext4                                    15hr 23m 46s          18040          278GB

Edit, 2024-06-23: working in parallel speeds it up a bit


File systems with a billion files, intro / TOC
https://pronoiac.org/misc/2024/03/file-systems-with-a-billion-files-intro-toc/
Thu, 21 Mar 2024

what

This is a story about benchmarking and optimization.

Lars Wirzenius blogged about making a file system with a billion empty files. Working on that scale can make ordinarily quick things very slow – like taking minutes to list folder contents, or delete files. Initially, I was curious about how well general-purpose compression like gzip would fare with the edge case of gigabytes of zeroes, and then I fell down a rabbit hole. I found a couple of major speedups, tried a couple of other formats, and tried some other methods for making so many files.

timing

For a brief spoiler: Lars’ best time was about 26 hours. I got their Rust program down to under 16 hours, on a Raspberry Pi. And I managed to get a couple of other methods – shell scripts – to finish in under 24 hours; update: under 7 hours.

sections

I was polishing up a lengthy blog post, fell into what might be a whole other wing of the rabbit hole, and realized it might be another blog post – or maybe several posts would be better anyway.

The sections I can see now (I’ll add links as I go; there’s a tag as well):

the hardware I’m using

I worked from a Raspberry Pi 4, with 4 GB RAM, running Debian 12 (bookworm). The media was a Seagate USB drive, which turned out to be SMR (Shingled Magnetic Recording), and non-optimal when writing a lot of data – probably when writing a gigabyte, and definitely when writing a terabyte. This is definitely easy to improve upon! The benefit here: It was handy, and it could crash without inconvenience.

I tried using my Synology NAS, but it never finished a run. Once, it crashed to the point of having to pull the power cord from the wall. I think its 2GB of memory wasn’t enough.

resources

Lars Wirzenius wrote:

Slides from Ric Wheeler’s 2010 presentation, “One Billion Files: Scalability Limits in Linux File Systems”

The next part is up!
