{"id":193,"date":"2024-03-21T22:45:45","date_gmt":"2024-03-22T05:45:45","guid":{"rendered":"https:\/\/pronoiac.org\/misc\/?p=193"},"modified":"2024-06-23T23:45:18","modified_gmt":"2024-06-24T06:45:18","slug":"making-file-systems-with-a-billion-files","status":"publish","type":"post","link":"https:\/\/pronoiac.org\/misc\/2024\/03\/making-file-systems-with-a-billion-files\/","title":{"rendered":"Making file systems with a billion files"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><em>this is part 2 &#8211; <a href=\"https:\/\/pronoiac.org\/misc\/2024\/03\/file-systems-with-a-billion-files-intro-toc\/\">part 1<\/a> has an intro and links to the others<\/em><\/p>\n<\/blockquote>\n\n\n\n<p>I forget where I picked up &#8220;forest&#8221; as &#8220;many files or hardlinks, largely identical&#8221;. I hope it&#8217;s more useful than confusing. Anyway. Let&#8217;s make a thousand thousand thousand files! <\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">file structures<\/h2>\n\n\n\n<p>Putting even a million files in a single folder is <em>not recommended<\/em>. 
For this, I used the usual structure:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>a thousand level 1 folders, each containing:\n<ul class=\"wp-block-list\">\n<li>a thousand level 2 folders, each containing:\n<ul class=\"wp-block-list\">\n<li>a thousand empty files<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">various script attempts<\/h2>\n\n\n\n<p>These are ordered roughly from slowest to fastest.\nAll of them were timed on an ext4 file system.<\/p>\n\n\n\n<p>There are lots more details in <a href=\"https:\/\/gitlab.com\/pronoiac\/create-empty-files\/-\/tree\/main\/scripts?ref_type=heads\">a GitLab repo, a fork of the Rust program repo<\/a>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>forest-touch.sh<\/code> &#8211; run <code>touch $file<\/code> in a loop, 1 billion times<\/li>\n\n\n\n<li><code>create_files.py<\/code> &#8211; touch a file, 1 billion times. from <a href=\"http:\/\/git.liw.fi\/billion-files\/\">Lars Wirzenius, take 1, repo<\/a>.<\/li>\n\n\n\n<li><code>forest-tar.sh<\/code> &#8211; build a tar.gz with a million files, then unpack it, a thousand times.\nmakes an effort to keep timestamps consistent.<\/li>\n\n\n\n<li><code>forest-multitouch.sh<\/code> &#8211; run <code>touch 0001 ... 1000<\/code> in a loop, 1 million times.\nmakes an effort to keep timestamps consistent.<\/li>\n<\/ul>\n\n\n\n<p>More consistent timestamps can lead to better compression of drive images, later.<\/p>\n\n\n\n<p>A friend, Elliot Grafil, suggested that tar would have the benefits of decades of optimization. It&#8217;s not a bad showing! 
zip didn&#8217;t fare as well: it was slower, it took more space, and it couldn&#8217;t be streamed through a pipe the way tar.gz can.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">the Rust program<\/h3>\n\n\n\n<p>Lars Wirzenius&#8217; <a href=\"https:\/\/gitlab.com\/pronoiac\/create-empty-files\"><code>create-empty-files<\/code><\/a>, with some modifications, was the fastest method.<\/p>\n\n\n\n<p>Some notes on usage:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>for speed, skip the state file &#8211; <a href=\"https:\/\/gitlab.com\/larswirzenius\/create-empty-files\/-\/merge_requests\/1\">filed merge request #1<\/a> &#8211; still WIP<\/li>\n<\/ul>\n\n\n\n<p>For documentation, I <a href=\"https:\/\/gitlab.com\/larswirzenius\/create-empty-files\/-\/merge_requests\/3\">filed merge request #3<\/a><em>, merged 2024-03-17<\/em>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>other file system types are doable, as long as <code>mount<\/code> recognizes them automatically.<\/li>\n\n\n\n<li>if, when you run this, it takes over a minute to show a progress meter, make sure you\u2019re running on a file system that supports sparse files<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">about the speed impacts of saving the state<\/h4>\n\n\n\n<p>The fastest version was the one where I\u2019d commented out all saving of state. When state was saved to a tmpfs in memory, it slowed down by about a third. When state was saved to the internal Micro SD card &#8211; and this was my starting point &#8211; it ran at about 4% of the fastest version&#8217;s speed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">file system formats<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">ext2 vs. ext4<\/h3>\n\n\n\n<p>The Rust program was documented as making an ext4 file system, but it was really making an ext2 file system. (I corrected this oversight with <a href=\"https:\/\/gitlab.com\/larswirzenius\/create-empty-files\/-\/merge_requests\/2\">merge request #2<\/a>, merged 2024-03-17.) 
Switching to an ext4 file system cut the run time by a bit over 40%, from about 27 hours to about 15.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">XFS<\/h3>\n\n\n\n<p>I didn\u2019t modify the defaults.\nAfter 100 minutes, it estimated 19 days remaining.\nAfter I hit Ctrl-C, it took over 20 minutes to get a responsive shell.\nUnmounting took a few minutes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">btrfs<\/h3>\n\n\n\n<p>By default, it stores two copies of metadata.\nFor speed, my second attempt (\u201cv2\u201d) switched to <em>one<\/em> copy of metadata and set a 64k node size:<\/p>\n\n\n\n<p><code>mkfs.btrfs --metadata single --nodesize 64k -f $image<\/code><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">overall timings for making forests<\/h2>\n\n\n\n<p>These are the timings for each method to create a billion files, slowest to fastest.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>method<\/th><th>clock time<\/th><th>files\/second<\/th><th>space<\/th><\/tr><\/thead><tbody><tr><td>shell script: run <code>touch<\/code> 1 billion times, ext4<\/td><td>31d (estimated)<\/td><td>375<\/td><td><\/td><\/tr><tr><td>Rust program, xfs defaults<\/td><td>19d (estimated)<\/td><td>610<\/td><td><\/td><\/tr><tr><td>Rust program, ext4, state on Micro SD<\/td><td>17d (estimated)<\/td><td>675<\/td><td><\/td><\/tr><tr><td>Rust program, btrfs defaults<\/td><td>38hr 50min<\/td><td>7510<\/td><td>781GB<\/td><\/tr><tr><td>shell script: unzip 1 million files, 1k times, ext4<\/td><td>34hr (estimated)<\/td><td>7960<\/td><td><\/td><\/tr><tr><td>Rust program, ext2<\/td><td>27hr 5min 57s<\/td><td>10250<\/td><td>276GB<\/td><\/tr><tr><td>Python script, ext4<\/td><td>24hr 11min 43s<\/td><td>11480<\/td><td>275GB<\/td><\/tr><tr><td>Rust program, ext4, state on <code>\/dev\/shm<\/code><\/td><td>23hr (estimated)<\/td><td>11760<\/td><td><\/td><\/tr><tr><td>shell script: untar 1 million files, 1k times, ext4<\/td><td>21hr 39min 16s<\/td><td>12830<\/td><td>260GB<\/td><\/tr><tr><td>shell script: touch 1k files, 1 million times, ext4<\/td><td>19hr 17min 
54s<\/td><td>14390<\/td><td>260GB<\/td><\/tr><tr><td>Rust program, btrfs v2<\/td><td>18hr 19min 14s<\/td><td>15160<\/td><td>407GB<\/td><\/tr><tr><td>Rust program, ext4<\/td><td>15hr 23min 46s<\/td><td>18040<\/td><td>278GB<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Edit, 2024-06-23: <a href=\"https:\/\/pronoiac.org\/misc\/2024\/06\/file-systems-with-a-billion-files-making-forests-parallel-multitouch\/\">working in parallel speeds it up a bit<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>this is part 2 &#8211; part 1 has an intro and links to the others I forget where I picked up &#8220;forest&#8221; as &#8220;many files or hardlinks, largely identical&#8221;. I hope it&#8217;s more useful than confusing. Anyway. Let&#8217;s make a thousand thousand thousand files!<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[17],"class_list":["post-193","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-billion-files"],"_links":{"self":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts\/193"}],"collection":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/comments?post=193"}],"version-history":[{"count":5,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts\/193\/revisions"}],"predecessor-version":[{"id":223,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts\/193\/revisions\/223"}],"wp:attachment":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/media?parent=193"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/
wp\/v2\/categories?post=193"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/tags?post=193"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}