Monday, January 6, 2020

Tools for file copying and compression

If I were Stephanie Hofeller for a day, I would use these Debian GNU/Linux command-line tools: rsync to copy files, rhash to let users confirm the files were copied correctly, and tar with xz to compress the files so they take up as little space as possible.

  • Create checksums of all files so file integrity can be confirmed.

    • Create the file checksums.txt containing SHA256 checksums of all files in the /home/baltakatei/Backups directory (writing the manifest outside the directory first keeps rhash from hashing its own, still-empty, output file; it can be moved in afterwards):

      $ rhash --recursive --sha256 /home/baltakatei/Backups > /home/baltakatei/checksums.txt
      $ mv /home/baltakatei/checksums.txt /home/baltakatei/Backups/checksums.txt

    • Verify that all files in /home/baltakatei/Backups match the SHA256 hashes listed in checksums.txt:

      $ rhash --check --sha256 /home/baltakatei/Backups/checksums.txt
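
    • If rhash is unavailable, a roughly equivalent manifest can be made with GNU coreutils; a minimal sketch, assuming find, xargs, and sha256sum are present (the relative ./ paths mean verification must be run from inside the Backups directory):

      $ cd /home/baltakatei/Backups
      $ find . -type f -print0 | xargs -0 sha256sum > /home/baltakatei/checksums.txt
      $ sha256sum --check /home/baltakatei/checksums.txt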

  • Copy files from one location to another (ex: backing up your Documents folder):

    $ rsync -avu --modify-window=1 --progress /home/baltakatei/Documents /home/baltakatei/Backups/20200106T0909Z_Documents

    • References
      • Link. General guide to copying files with rsync.
      • Link. Why --modify-window=1 is useful when copying from Windows file systems with rsync.
      • Link. Note on improving transfer efficiency between macOS and GNU/Linux by specifying the filename encoding.
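    • To preview what rsync would change before committing, its standard --dry-run and --itemize-changes options can be added to the same command; a minimal sketch:

      $ rsync -avu --modify-window=1 --dry-run --itemize-changes /home/baltakatei/Documents /home/baltakatei/Backups/20200106T0909Z_Documents
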
  • Compress the directory (and all files it contains) into a single tar.xz file (maximum compression, high memory requirements).

    • Create the compressed tar.xz file (note: a custom --lzma2 filter chain overrides preset flags like -9e, so the preset is specified inside the option string instead):

      $ tar cf - /home/baltakatei/Backups | xz --lzma2=preset=9e,dict=1536MiB,mf=bt4,nice=273,depth=1000 > Backups.tar.xz

    • Decompress the contents of the tar.xz file into the /home/baltakatei/Downloads/Backups directory (tar's -C option requires the target directory to already exist, hence the mkdir):

      $ mkdir -p /home/baltakatei/Downloads/Backups
      $ tar -xf Backups.tar.xz -C /home/baltakatei/Downloads/Backups
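
    • To check the archive for corruption without extracting it, xz's standard --test and --list modes can be used; a minimal sketch:

      $ xz --test Backups.tar.xz
      $ xz --list Backups.tar.xz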

    • Note: A large dictionary (1536 MiB) lets xz find matches between widely separated copies of the same data, so highly redundant input (ex: multiple backups of mostly the same files) compresses very efficiently. The trade-off is RAM: with the bt4 match finder, compression needs roughly ten times the dictionary size in memory, and decompression needs roughly the dictionary size. A lower-memory, multithreaded alternative is sketched after the references below.

    • References

      • Link. How to uncompress tar.xz files.
      • Link. How to specify output location for files extracted tar.
      • Link. How xz dictionary size affects compression.
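    • If that much RAM is not available, the stock -9e preset (64 MiB dictionary) combined with xz's standard -T/--threads option is a faster, lower-memory alternative; a minimal sketch (threading splits the input into independent blocks, so some long-range redundancy between distant duplicates is lost):

      $ tar cf - /home/baltakatei/Backups | xz -9e -T0 > Backups.tar.xz
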
  • Notes:

    • These steps can mostly be replicated in macOS via Homebrew.
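    • A minimal sketch of the macOS setup, assuming the Homebrew formula names match the Debian package names rsync, xz, and rhash (I have not verified the formula names):

      $ brew install rsync xz rhash
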
    • Packages may be installed on Debian 10 via the apt package manager:

      $ sudo apt-get update
      $ sudo apt-get install rsync xz-utils rhash
    • I don't use Ubuntu, but it also uses the apt package manager and appears to ship rsync, xz-utils, and rhash.

    • Optional:

      • If I had an OpenPGP key, I would sign checksums.txt with $ gpg -o checksums.txt.gpg -s checksums.txt so the integrity of the checksum file could be verified with $ gpg --verify checksums.txt.gpg.
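      • A detached signature would keep the manifest itself readable as plain text; a minimal sketch using gpg's standard --detach-sign option (this writes checksums.txt.asc alongside the manifest):

        $ gpg --armor --detach-sign checksums.txt
        $ gpg --verify checksums.txt.asc checksums.txt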
      • I would upload the checksums.txt file to OpenTimestamps so the files' existence at a point in time could be proved in a decentralized fashion by way of the Bitcoin blockchain.
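      • A minimal sketch of the timestamping step, assuming the opentimestamps-client command-line tool (ots) is installed; stamping writes checksums.txt.ots, which anyone can later verify against the Bitcoin blockchain:

        $ ots stamp checksums.txt
        $ ots verify checksums.txt.ots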

Narrative Summary: I would mount all of the USB drives on my desktop computer and use rsync to copy their contents onto a local 4 TB hard drive. Then I would run rhash to create a file manifest, so that anyone who downloaded the data could audit it with the same rhash program and confirm all files were transferred correctly. Then I would compress everything with tar and xz, since most of the space is taken up by multiple backups of mostly the same data. Finally, I would upload the resulting tar.xz file to Google Drive (or some other file hosting service), create a BitTorrent, or try to get someone at the Internet Archive to host it for me. Anyone wanting to verify that the data had not been tampered with could do so with the checksums.txt file (or with one of the optional measures I described), regardless of how they acquired the files.

tl;dr I'd use GNU/Linux to make a SHA256 manifest for the files before I uploaded them.


