Which of the 3 standard compression algorithms on Unix (gz, xz, or bz2) is best for long term data archival at their highest compression?
I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long-term offline archival. I want to make the most of the drive's capacity, so I plan to compress everything at the highest ratio the standard tools support. I've zeroed out the free space in my disk images so that a full image only compresses down to roughly the size of the files actually on it, and in my experience raw photos can shrink by a third or even half at maximum compression (which I assume is lossless, since file-level compression has to be able to regenerate the original file in its entirety?)
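For example, I'm planning to run each tool at its maximum setting, something along these lines (just a sketch; the file names are placeholders):

```
# gzip: -9 is the highest compression level, -k keeps the original around
gzip -9 -k backup.tar

# bzip2: -9 selects the largest (900 kB) block size
bzip2 -9 -k backup.tar

# xz: -9 together with -e ("extreme") gives the highest ratio, at a cost in time and RAM
xz -9 -e -k diskimage.img
```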
I've heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don't know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best possible chance of recovery if something does go wrong.
I also want the files to be extractable with just the standard utilities that ship with a Linux/Unix system, since this is my disaster recovery plan and I want to be able to work with it from a Linux live image without installing any extra packages when my server dies. Hence I'm only looking at gz, xz, or bz2.
So out of the three, which is generally considered the most stable and corruption-resistant when the compression ratio is turned all the way up? Can any of them recover from a bit flip, or at the very least detect with certainty whether the data is corrupted when extracting? Additionally, should I be generating separate checksum files for the original data, or do the compressed formats include checksumming themselves?
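To make the question concrete, the kind of verification I have in mind would look roughly like this (file names are placeholders):

```
# what I'd like to be able to run after pulling the drive out of storage:
gzip  -t backup.tar.gz     # does this actually verify an embedded checksum?
bzip2 -t backup.tar.bz2
xz    -t diskimage.img.xz

# or, if the formats don't verify themselves, separate checksum files:
sha256sum backup.tar.gz diskimage.img.xz > SHA256SUMS
sha256sum -c SHA256SUMS
```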
SGH
in reply to HiddenLayer555 • • •Honestly, given that they should be purely compressing data, I would suppose that none of the formats you mentioned has ECC recovery or built-in checksums (but I might be very mistaken on this). I think I've only seen this in WinRAR, but also try other GUI tools like 7zip and check their features for anything that looks like what you need; if the formats support ECC, then 7zip will surely offer you the option.
I just wanted to point out that, no matter what anyone else says, if you split your data across multiple compressed files, the chances of bit rot destroying your entire library are much lower, i.e. try to arrange things so that only a small chunk of your data is lost if something catastrophic happens (see the sketch below).
However, if one of your filesystem-relevant bits rots, you may be in for a much longer recovery session.
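For example, something along these lines, so each archive only covers one slice of the library (just a sketch; directory names and the tool choice are placeholders):

```
# one compressed archive per top-level directory instead of one huge archive,
# so a corrupted archive only takes a small slice of the data with it
for dir in photos-2023 photos-2024 disk-images; do
    tar -cf - "$dir" | xz -9 -e > "$dir.tar.xz"
done
```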
anotherspinelessdem
in reply to HiddenLayer555 • • •
DasFaultier
in reply to HiddenLayer555 • • •You're asking the right questions, and there have been some great answers on here already.
I work at the crossover between IT and digital preservation in a large GLAM institution, so I'd like to offer my perspective. Sorry if there are any peculiarities in my comment; English is my 2nd language.
First of all (and as you've correctly realized), compression is an antipattern in DigiPres and adds risk that you should only accept if you know what you're doing. Some formats do offer integrity information (MKV/FFV1 for video comes to mind, or the BagIt archival information package structure), including formats that use lossless compression, and these should be preferred.
You might want to check this list to find a suitable format: en.wikipedia.org/wiki/List_of_… -> Containers and compression
Depending on your file formats, it might not even be beneficial to use a compressed container, e.g. if you're archiving photos/videos that already exist in compressed formats (JPEG/JFIF, h.264, ...).
You can make your data more resilient by choosing appropriate formats, not only for the compressed container but also for the payload itself. Find the significant properties of your data and pick formats accordingly, not the other way round. Convert before archival if necessary (the term is normalization).
You might also want to reduce the risk of losing the entirety of your archive by compressing each file individually. Bit rot is a real threat, and you probably want to limit the impact of flipped bits. Error rates for spinning HDDs are well studied and understood, and even relatively small archives are within the size range where bit flips become likely. I can't seem to find the sources just now, but IIRC it was something like 1 bit in 1.5 TB for disks at write time.
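A minimal sketch of what per-file compression could look like (the tool, path and file extension are only placeholders, not a recommendation):

```
# compress every raw photo individually, keeping the originals (-k),
# so a flipped bit can only ever damage a single payload file
find /archive/photos -type f -name '*.dng' -exec xz -9 -e -k {} \;
```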
Also, there's only so much you can do against bit rot on the format side, so consider using a filesystem that allows you to run regular scrubs, and then actually run them; ZFS or Btrfs come to mind. If you use a more "traditional" filesystem like ext4, you could at least create checksum files for all of your archival data that you can then use as a baseline for manual checks, but these won't help you repair damaged payload files. You can also create BagIt bags for your archive contents, because bags come with fixity mechanisms included; see RFC 8493 (datatracker.ietf.org/doc/html/…). There are even libraries and tools that help you verify the integrity of bags, so that may be helpful.
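On a "traditional" filesystem, that checksum baseline can be as simple as this (a sketch; the mount point is a placeholder):

```
# create the baseline once, right after writing the archive
cd /mnt/archive && find . -type f -print0 | xargs -0 sha256sum > ~/archive-SHA256SUMS

# on every later inspection, re-hash everything and report mismatches
cd /mnt/archive && sha256sum -c ~/archive-SHA256SUMS | grep -v ': OK$'

# on Btrfs/ZFS, a scrub does the equivalent at the block level
# (and can repair, if there is redundancy):
# btrfs scrub start /mnt/archive
```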
The disk hardware itself is a risk as well; having your disk lying around for prolonged periods of time might have an adverse effect on bearings etc. You don't have to keep it running every day, but regular scrubs may help to detect early signs of hardware degradation. Enable SMART monitoring if possible. Don't skimp on disk quality. If at all possible, buy two disks (different make & model) to store the information.
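If smartmontools is available on the machine you plug the drive into, the occasional health check could look like this (the device name is a placeholder):

```
# quick overall health verdict
smartctl -H /dev/sdX

# kick off a long self-test, then review the attributes once it's done
smartctl -t long /dev/sdX
smartctl -a /dev/sdX
```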
DigiPres is first and foremost a game of risk reduction and an organizational process, even if we tend to prioritize the technical aspects of it. Keep that in mind at all times.
And finally, I want to leave you with some reading material on DigiPres and personal archiving in general.
* langzeitarchivierung.de/Webs/n… (in German)
* meindigitalesarchiv.de/ (in German)
* digitalpreservation.gov/person… (by the Library of Congress, who are extremely competent in DigiPres)
I've probably forgotten a few things (it's late...), but if you have any further questions, feel free to ask.
EDIT: I answered to a similar thread a few months ago, see sh.itjust.works/comment/139223…
RiverRabbits
in reply to DasFaultier • • •Especially the perspective of how your field approaches this topic is very valuable for laypeople, to get a feel for which aspects really matter!
(and judging by the username, I assume you speak German haha)