Massive data backup question: What Linux software do you folks recommend for helping sort out and organize terabytes of files and remove duplicates?
I've got a whole bucket full of old hard drives, CDs and DVDs, and I'm starting the process of backing up as much as still works to a 4TB drive.
It's gonna be a long journey and lots of files, many prone to being duplicates from some of the drives.
What sorts of software do you Linux users recommend?
I'm on Linux Mint MATE, if that matters much.
Edit: One of the programs I'm accustomed to from my Windows days is FolderMatch, which is a step above simple duplicate file scanning: it scans for duplicate or semi-duplicate folders as well, and breaks down individual file differences when comparing two folders.
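(A rough command-line counterpart on Linux, if it helps anyone, is plain GNU diff; just a sketch, the paths are made up:)

    # Compare two folder trees: -r recurses, -q only reports which
    # files differ or exist on one side, without dumping contents.
    diff -qr /mnt/old-drive/photos /mnt/4TB/photos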
I see I've already gotten some responses, and I thank everyone in advance. I'm on a road trip right now; I'll be checking out the software you folks recommend later this evening, or as soon as I can anyway.
baltakatei
in reply to over_clox • • •
Baltakatei's Useful CLI Commands - Reboil
reboil.com

truthfultemporarily
in reply to over_clox • • •

over_clox
in reply to truthfultemporarily • • •
I have like 10+ hard drives and probably 75+ optical discs to back up, and across the different devices and media, the folder and file structure isn't exactly consistent.
I already know in advance that I'm gonna have to curate this backup myself, it's not quite as easy to just purely let backup/sync software do it all for me.
But I do need software to help.
everett
in reply to over_clox • • •
That's the thing: it doesn't need to be. If your backup software or filesystem supports block-level deduplication, all matching data only gets stored once, and filenames don't matter. The files don't even have to 100% match. You'll still see all your files when browsing, but the system is transparently making sure to only store stuff once.
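To make that concrete: on a filesystem with reflink support (Btrfs, XFS), you can see the same block-sharing idea with plain cp. A sketch, assuming GNU coreutils; the filenames are made up:

    # The copy shares all of its data blocks with the original, so the
    # second name costs almost nothing until one of the files changes.
    cp --reflink=always big-backup.iso second-copy.iso

Backup tools with built-in dedup do the same kind of sharing automatically, at block level, across the whole repository.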
over_clox
in reply to everett • • •
I guess you're missing the point then. I'm backing up data coming from many different file systems: FAT12, FAT16, FAT32, exFAT, NTFS, HPFS, ext2/3/4, and ISOs (of varying degrees of copy protection, plus MODE1 and MODE2 discs with audio tracks)...
Plus different date revisions of many files.
You think there's anything consistent enough where any one solution works?
I need all the recommended software I can throw at it. Sure, I'd love a purely automated solution, but I know there's still gonna be a lot of manual curating on my part as well.
Also, files don't have to match and filenames aren't important? Are you a psychopath? That's exactly what I want: to organize folders and filenames, and match and remove duplicates based on file hashes.
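Even something as crude as this is closer to what I'm after (a rough sketch; the mount point is just an example):

    # List all files that share an identical SHA-256 hash, grouped so
    # each set of byte-for-byte duplicates is separated by a blank line.
    find /mnt/4TB -type f -print0 \
      | xargs -0 sha256sum \
      | sort \
      | uniq -w64 --all-repeated=separate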
everett
in reply to over_clox • • •

over_clox
in reply to everett • • •
I get the concept of block-level deduplication, no problem.
But some of these drives came from friends who reorganized their copies of the files their own way, while I took the main branch they copied from and salvaged damaged files.
Ever heard of GoodTools? I've spent an awful lot of time salvaging corrupt video game console ROMs. I have all of Atari 2600, most of NES and SNES, and a number of N64 and PSP games, along with a lot of other stuff.
I ain't about to play head games over what I have and haven't salvaged already; I must keep track of what device stores what, what filename is what, and what dates are what.
I want an organized file/folder structure. I didn't spend the past 20+ years curating all this just to trust everything to automation.
everett
in reply to over_clox • • •
This is precisely the headache I'm trying to save you from: micromanaging what you store for the purpose of saving storage space. Store it all, store every version of every file on the same filesystem, or throw it into the same backup system (one that supports block-level deduplication); you won't be wasting any space, and you get to keep your organized file structure.
Ultimately, what we're talking about is storing files, right? And your goal is to now keep files from these old systems in some kind of unified modern system, right? Okay, then. All disks store files as blocks, and with block-level dedup, a common block of data that appears in multiple files only gets stored once, and if you have more than one copy of the file, the difference between the versions (if there is any) gets stored as a diff. The stuff you said about filenames, modified dates and what ancient filesystem it was originally stored on... sorry, none of that is relevant.
When you browse your new, consolidated collection, you'll see all the original folders and files. If two copies of a file happen to contain all the same data, the incremental storage needed to store the second copy is ~0. If you have two copies of the same file, but one was stored by your friend and 10% of it got corrupted, storing that second copy only costs you ~10% in extra storage. If you have historical versions of a file that was modified in 1986, 1992 and 2005 that lived on a different OS each time, what it costs to store each copy is just the difference.
I must reiterate that block-level deduplication doesn't care what files the common data resides in; if it's on the same filesystem, it gets deduplicated. This means you can store all the files you have and keep them all in their original contexts (folder structure), without wasting space storing any common parts of any files more than once.
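If you want to see the idea without any special tooling, here's a toy sketch (real dedup tools use content-defined chunking rather than fixed 4 KiB blocks, and the filenames here are made up):

    # Hash every 4 KiB block of two versions of a file, then count how
    # many blocks they have in common; shared blocks are stored once.
    split -b 4096 -d old-version.bin /tmp/blk-old-
    split -b 4096 -d new-version.bin /tmp/blk-new-
    sha256sum /tmp/blk-old-* | awk '{print $1}' | sort -u > /tmp/sums-old
    sha256sum /tmp/blk-new-* | awk '{print $1}' | sort -u > /tmp/sums-new
    comm -12 /tmp/sums-old /tmp/sums-new | wc -l   # blocks in common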
over_clox
in reply to everett • • •
Block-level dedupe doesn't account for random data at the end of the last block. I want byte-for-byte, hash-level file and folder comparison, with the file slack space nulled out. I also want to consolidate all related files into logically organized folders, not just a bunch of random folders titled '20250505 Backup Turd'.
I also have numerous drives with similar folder structures, some just trimmed down to fit smaller drives. I also have archives from friends based on the original structure from like 10 years ago, but their folder structures have drifted from mine over the years.
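Something like this two-tree comparison is what I mean (a rough sketch; the paths are just examples, and since it hashes file contents it never reads slack space at all):

    # Hash every file under each tree, then diff the two sorted lists;
    # identical content matches even when names or dates differ.
    ( cd /mnt/4TB/source1 && find . -type f -exec sha256sum {} + | sort -k2 ) > /tmp/tree1
    ( cd /mnt/4TB/source2 && find . -type f -exec sha256sum {} + | sort -k2 ) > /tmp/tree2
    diff /tmp/tree1 /tmp/tree2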
over_clox
in reply to everett • • •
Also, try converting big-endian vs. little-endian ROM file formats. I spent many months doing that, via GoodTools.
I'm not in any hurry to accidentally overwrite a ROM that's been corrected for consistency in my archives because some automatic sync software might think they're supposed to be the same file.
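(For reference, the byte-swapped conversions are simple enough on their own; a sketch with made-up filenames - conv=swab exchanges each pair of input bytes, which is e.g. the difference between a byte-swapped N64 .v64 image and a big-endian .z64 one:)

    # Byte-swap a ROM image: every pair of input bytes is exchanged.
    dd if=game.v64 of=game.z64 conv=swab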
doeknius_gloek
in reply to over_clox • • •
I've had great success with restic. It will handle your 4TB just fine; here are some stats of mine:
and another one, not as large but with lots of files
Restic will automatically deduplicate your data so your duplicates won't waste storage at your backup location.
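The basic flow looks something like this (a sketch; the repository path is just an example):

    # Create a repository on the destination drive, back up a source,
    # then check how much unique data is actually stored.
    restic -r /mnt/4TB/restic-repo init
    restic -r /mnt/4TB/restic-repo backup /mnt/source1
    restic -r /mnt/4TB/restic-repo stats --mode raw-data

Because restic chunks and hashes everything, a second backup containing mostly the same files barely grows the repository.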
I've recently learned about backrest, which can serve as a restic UI if you're not comfortable with the CLI, but I haven't used it myself.
To clean your duplicates at the source I would look into Czkawka as another lemming already suggested.
GitHub - garethgeorge/backrest: Backrest is a web UI and orchestrator for restic backup.
Churbleyimyam
in reply to over_clox • • •

solrize
in reply to over_clox • • •

MonkderVierte
in reply to over_clox • • •
That's filesystem-level. Btrfs and, I think, ZFS have deduplication built in.
Btrfs gave me 150 GB on my 2 TB gaming disk that way.
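On Btrfs it's out-of-band: you run a dedup tool over the mounted filesystem. A sketch with duperemove (the mount point is just an example):

    # Scan recursively (-r), submit matching extents for dedup (-d),
    # and cache hashes so later re-runs are fast (--hashfile).
    sudo duperemove -rd --hashfile=/var/tmp/dupes.hash /mnt/games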
SayCyberOnceMore
in reply to over_clox • • •There's BeyondCompare and Meld if you want a GUI, but, if I understand this correctly,
rmlint
andfdupes
might be helpful hereI've done similar in the past - I prefer commandline for this...
What I'd do is create a "final destination" folder on the 4TB drive, and then other working folders for each HDD / CD / DVD that you're working through, i.e.:
/mnt/4TB/finaldestination
/mnt/4TB/source1
/mnt/4TB/source2
...
Obviously finaldestination is empty to start with, so it could just be a direct copy of your first HDD - so make that the largest drive.
(I'm saying copy here, presuming you want to keep the old drives for now, just in case you accidentally delete the wrong stuff on the 4TB drive)
Maybe clean up any obvious stuff
Remove that first drive
Mount the next and copy the data to /mnt/4TB/source2
Now use rmlint or fdupes to do a dry run between source2 and finaldestination and get a feel for whether they're similar or not; then you'll know whether to just move it all to finaldestination or maybe use the GUI tools instead. You might completely empty /mnt/4TB/source2, or it might still have something in it, depending on how you feel it's going.
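Something along these lines (paths match the example above; note rmlint only writes out an rmlint.sh script, so nothing is deleted until you review and run it):

    # fdupes just lists the duplicate sets across the two trees.
    fdupes -r /mnt/4TB/finaldestination /mnt/4TB/source2

    # rmlint: paths after // are "tagged" as the originals to keep;
    # only files in source2 that also exist in finaldestination match.
    rmlint /mnt/4TB/source2 // /mnt/4TB/finaldestination \
        --keep-all-tagged --must-match-tagged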
Repeat for the rest, working on smaller & smaller drives, comparing with the finaldestination first and then moving the data.
Slow? Yep. Satisfying that you know there's only 1 version there? Yep.
Then do a backup 😉
over_clox
in reply to SayCyberOnceMore • • •
The way I'm organizing the main backups to start with is with folder names such as 20250505 Laptop Backup, 20250508 Media Backup, etc.
Eventually I plan on organizing things in bulk folders with simple straightforward names such as Movies, Music, Game ROMs, Virtual Machines, etc.
Yes, thankfully I already got all my main files, music and movies backed up. Right now I'm backing up my software, games, emulator ROMs, etc.
Hopefully that drive finishes backing up before the weather gets bad, cuz I'm definitely shutting things down when there's lightning around...