{"id":188,"date":"2024-03-20T21:30:47","date_gmt":"2024-03-21T04:30:47","guid":{"rendered":"https:\/\/pronoiac.org\/misc\/?p=188"},"modified":"2024-09-07T19:15:58","modified_gmt":"2024-09-08T02:15:58","slug":"file-systems-with-a-billion-files-intro-toc","status":"publish","type":"post","link":"https:\/\/pronoiac.org\/misc\/2024\/03\/file-systems-with-a-billion-files-intro-toc\/","title":{"rendered":"File systems with a billion files, intro \/ TOC"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">what<\/h2>\n\n\n\n<p>This is a story about benchmarking and optimization.<\/p>\n\n\n\n<p><a href=\"https:\/\/blog.liw.fi\/posts\/2024\/billion\/\">Lars Wirzenius blogged about making a file system with a billion empty files<\/a>.\nWorking on that scale can make ordinarily quick things very slow &#8211; like taking minutes to list folder contents, or delete files.\nInitially, I was curious about how well general-purpose compression like <code>gzip<\/code> would fare with the edge case of gigabytes of zeroes, and then I fell down a rabbit hole.\nI found a couple of major speedups, tried a couple of other formats, and tried some other methods for making <em>so many<\/em> files.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">timing<\/h2>\n\n\n\n<p>For a brief spoiler: Lars\u2019 best time was about 26 hours. I got their Rust program down to under 16 hours, on a Raspberry Pi. And I managed to get a couple of other methods &#8211; shell scripts &#8211; to finish in under <s>24 hours<\/s> (update!) 
7 hours.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">sections<\/h2>\n\n\n\n<p>I was polishing up a lengthy blog post when I fell into what might be a whole other <em>wing<\/em> of the rabbit hole. I realized it might be another blog post, or maybe several posts would be better anyway.<\/p>\n\n\n\n<p>Here are the sections I can see now. I\u2019ll add links as I go, and there&#8217;s <a href=\"https:\/\/pronoiac.org\/misc\/tag\/billion-files\/\">a tag<\/a> as well:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>hardware, below<\/li>\n\n\n\n<li><a href=\"https:\/\/pronoiac.org\/misc\/2024\/03\/making-file-systems-with-a-billion-files\/\">making the forests<\/a> &#8211; making all those files and folders\n<ul class=\"wp-block-list\">\n<li>more info on the Rust program, and some tuning<\/li>\n\n\n\n<li><a href=\"https:\/\/pronoiac.org\/misc\/2024\/06\/file-systems-with-a-billion-files-making-forests-parallel-multitouch\/\">extra post<\/a>: <a href=\"https:\/\/pronoiac.org\/misc\/2024\/06\/file-systems-with-a-billion-files-making-forests-parallel-multitouch\/\">making the forests, in parallel<\/a><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><a href=\"https:\/\/pronoiac.org\/misc\/2024\/09\/file-systems-with-a-billion-files-archiving-and-compression\/\">archiving and compressing the file systems<\/a><\/li>\n\n\n\n<li>the \u201cwhole other wing\u201d possibility\n<ul class=\"wp-block-list\">\n<li>when I wrote this, I meant profiling and optimizing the Rust app. 
that&#8217;s off my list for now.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>troubleshooting breakage on Debian 12, bookworm<\/li>\n\n\n\n<li>conclusions?<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">the hardware I\u2019m using<\/h2>\n\n\n\n<p>I worked from a Raspberry Pi 4, with 4 GB RAM, running Debian 12 (bookworm).\nThe media was a Seagate USB drive, which turned out to be <a href=\"https:\/\/en.wikipedia.org\/wiki\/Shingled_magnetic_recording\">SMR<\/a> (Shingled Magnetic Recording), and non-optimal when writing a lot of data &#8211; probably when writing a gigabyte, and <em>definitely<\/em> when writing a terabyte.\nThis is definitely easy to improve upon!\nThe benefit here: It was handy, and it could crash without inconvenience.<\/p>\n\n\n\n<p>I tried using my Synology NAS, but it never finished a run.\nOnce, it crashed to the point of having to pull the power cord from the wall.\nI think its 2GB of memory wasn\u2019t enough.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">resources<\/h2>\n\n\n\n<p>Lars Wirzenius wrote:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/blog.liw.fi\/posts\/2020\/10\/01\/a_billion_files\/\">first blog post (2020 or 2022)<\/a><\/li>\n\n\n\n<li><a href=\"http:\/\/git.liw.fi\/billion-files\/tree\/\">first git repo<\/a>, including a Python script to create a forest of empty files<\/li>\n\n\n\n<li><a href=\"https:\/\/blog.liw.fi\/posts\/2024\/billion\/\">a second blog post (2024)<\/a>, which was my entry point<\/li>\n\n\n\n<li><a href=\"https:\/\/gitlab.com\/larswirzenius\/create-empty-files\">second git repo<\/a>, including a Rust program to create the drive image, make the file system, and populate it with empty files<\/li>\n<\/ul>\n\n\n\n<p>Slides from Ric Wheeler\u2019s 2010 presentation, <a href=\"https:\/\/events.static.linuxfound.org\/slides\/2010\/linuxcon2010_wheeler.pdf\">\u201cOne Billion Files:\nScalability Limits in Linux File Systems\u201d<\/a><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow 
wp-block-quote-is-layout-flow\">\n<p><a href=\"https:\/\/pronoiac.org\/misc\/2024\/03\/making-file-systems-with-a-billion-files\/\">The next part is up!<\/a><\/p>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>what This is a story about benchmarking and optimization. Lars Wirzenius blogged about making a file system with a billion empty files. Working on that scale can make ordinarily quick things very slow &#8211; like taking minutes to list folder contents, or delete files. Initially, I was curious about how well general-purpose compression like gzip [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[17],"class_list":["post-188","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-billion-files"],"_links":{"self":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts\/188"}],"collection":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/comments?post=188"}],"version-history":[{"count":12,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts\/188\/revisions"}],"predecessor-version":[{"id":234,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/posts\/188\/revisions\/234"}],"wp:attachment":[{"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/media?parent=188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/categories?post=188"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pronoiac.org\/misc\/wp-json\/wp\/v2\/tags?post=188"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","te
mplated":true}]}}