Fuse gzip decompression on the fly
I have an archive of the old B3 (BVMF Bovespa) FTP site that once resided at ftp.bvmf.com.br. As-is, it weighs in at around 360 GB, 309 GB of which are gzipped text files.
I have to move this archive off my NAS because I want to fix my RAID setup, and after checking what spare hardware I have around, I found two options:
- copy those files to a spare 500 GB disk and waste 140 GB;
- make it fit into a 320 GB disk.
Of course, the second option is the right choice.
Those text files compress really well (around 10x!), but as they stand, they are islands: each one is compressed on its own. Since the files are very similar to one another, solid compression can exploit that cross-file redundancy to shrink the total size further. Besides that, gzip isn't the best compression algorithm around; we can do better. 7-zip to the rescue.
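To see why that matters, here is a toy comparison using Python's lzma module as a stand-in for 7-zip's LZMA2 (the sample/ directory and the *.txt glob are just placeholders for a handful of similar text files):

```python
# Toy illustration of solid compression: similar files compressed together
# share one dictionary, so the combined stream ends up smaller than the sum
# of the individually compressed files. Paths here are hypothetical.
import lzma
import pathlib

files = sorted(pathlib.Path("sample").glob("*.txt"))

individual = sum(len(lzma.compress(f.read_bytes())) for f in files)
solid = len(lzma.compress(b"".join(f.read_bytes() for f in files)))

print(f"individual: {individual} bytes, solid: {solid} bytes")
```

Compressed together, repeated structure across files is stored only once; that is what 7-zip's solid mode (the -ms=on switch) does at archive scale.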
Now, how can we extract those individual .gz files on the fly? Remember, they inflate to 10x their stored size, and I don't have 3 TB lying around.
I could create some find -exec incantation to gunzip and append each file to the 7z archive, but I figured that wouldn't be efficient, as 7-zip might recalculate the dictionary and recompress the archive for every appended file. Another option would be hacking 7-zip itself to extract the file when needed, but ain't nobody got time for that. What I need is a hook called on every open(2). LD_PRELOAD? Nah, too much boilerplate. What about doing this outside 7-zip?
7-zip's input is just files on a filesystem. FUSE fits right there.
Turns out, 9 years ago, Gautier Portet wrote gzfuse: exactly what I need.
After running 2to3 to make it work on Python 3 and adding an exception handler to deal with truncated gzip files, we can now recompress those gzipped text files into a solid 7z archive without wasting precious disk space.
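For the curious, here is a rough sketch of what such a filesystem looks like with fusepy. This is my own toy reconstruction of the idea, not Gautier's code: the GunzipFS name, the strip-the-.gz-suffix naming and the single-file in-memory cache are choices made for the example, and a serious version would stream instead of buffering whole files.

```python
#!/usr/bin/env python3
# Sketch of a read-only FUSE filesystem that presents each stored foo.gz as a
# decompressed foo. A toy reconstruction of the idea, not the actual gzfuse.
import errno
import gzip
import os
import stat
import sys

from fuse import FUSE, FuseOSError, Operations  # pip install fusepy


class GunzipFS(Operations):
    def __init__(self, root):
        self.root = root
        self.last = (None, b"")  # cache only the last decompressed file

    def _real(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def _data(self, path):
        if self.last[0] != path:
            chunks = []
            try:
                with gzip.open(self._real(path) + ".gz", "rb") as f:
                    while chunk := f.read(1 << 20):
                        chunks.append(chunk)
            except EOFError:
                pass  # truncated gzip file: keep whatever decompressed cleanly
            self.last = (path, b"".join(chunks))
        return self.last[1]

    def getattr(self, path, fh=None):
        real = self._real(path)
        if os.path.isdir(real):
            st = os.lstat(real)
            return {"st_mode": st.st_mode, "st_nlink": st.st_nlink,
                    "st_mtime": st.st_mtime, "st_size": st.st_size}
        if os.path.isfile(real + ".gz"):
            st = os.lstat(real + ".gz")
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_mtime": st.st_mtime,
                    # reporting the real size needs a full decompression pass
                    "st_size": len(self._data(path))}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        entries = [".", ".."]
        for name in os.listdir(self._real(path)):
            if name.endswith(".gz"):
                entries.append(name[:-3])  # show foo.txt.gz as foo.txt
            elif os.path.isdir(os.path.join(self._real(path), name)):
                entries.append(name)       # non-gz plain files ignored for brevity
        return entries

    def open(self, path, flags):
        if flags & (os.O_WRONLY | os.O_RDWR):
            raise FuseOSError(errno.EROFS)
        return 0

    def read(self, path, size, offset, fh):
        return self._data(path)[offset:offset + size]


if __name__ == "__main__":
    # usage: gunzipfs.py <gzipped-tree> <mountpoint>
    FUSE(GunzipFS(sys.argv[1]), sys.argv[2], foreground=True, ro=True)
```

Mount the gzipped tree on an empty mountpoint, point 7z a -ms=on at the mount, and 7-zip sees plain text files while the disk keeps only the gzipped originals.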
Get the Python 3 version of gzfuse here. It requires fusepy. Also useful: nullfs (a /dev/null filesystem, nothing to do with the BSD nullfs).