rsync: A Quick Tour

Pierre Rioux
ACElab Developer Meeting, April 2016

What is it?

  • it's a UNIX utility program,
  • command line only,
  • and boringly, it copies files...
  • ... yet it's still the greatest thing ever.

History of rsync

  • Originally co-developed by Andrew Tridgell,
    as a part of his Ph.D. thesis.
  • So popular that it is now installed by default
    on almost all Linux distributions.
  • It's the basis of many common solutions for
    distributing files or backing them up.
  • At the ACElab, almost all nightly backups are
    performed, basically, by rsync.

cp, scp and rsync usage

Well, they all look much the same, eh?
cp SOURCE DESTINATION
scp SOURCE DESTINATION
rsync SOURCE DESTINATION

What does cp do?

cp    myfile myfilecopy
cp -r mydir mydircopy
  • Source files always fully read from disk.
  • Data sent to destination even if it already exists there (overwritten!)

What does scp do?

scp    myfile           user@host:myfilecopy
scp -r mydir            user@host:mydircopy
scp    user@host:myfile myfilecopy
scp -r user@host:mydir  mydircopy
  • This is just like cp except it sends
    the data through the network to another host.
  • So? Same problems as with cp.

What does rsync do?

rsync -a myfile           user@host:myfilecopy
rsync -a mydir            user@host:mydircopy
rsync -a user@host:myfile myfilecopy
rsync -a user@host:mydir  mydircopy
  • The source and destination specifications
    are familiar to scp users
  • Ignore the -a option for the moment
  • Darn it Pierre that still doesn't tell us what's different about it!

The destination is the key

Case 1: When the files at the destination DO NOT exist,
the behavior of all these tools is similar:

  1. all the input files are opened at the source;
  2. their contents are read in;
  3. they are sent to the destination...
  4. ...where file entries are created
  5. and the contents put in them.

The destination is the key

Case 2: When the files at the destination DO exist,
the behavior is quite different:

  • cp and scp will just overwrite the content of the destination files
  • rsync will try to identify the differences in content, and only send those differences
  • rsync will adjust the files at the destination so they match the source
  • rsync will not send anything if there is no need to do so!
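
To make "only send those differences" a bit more concrete, here is a minimal Python sketch of the idea. It is a simplification under stated assumptions, not rsync's actual implementation: real rsync pairs a cheap rolling checksum with a stronger hash, while this sketch assumes a single MD5 per fixed-size block. The names block_signatures, compute_delta and apply_delta, and the tiny block size, are made up for illustration.

```python
import hashlib

BLOCK = 4  # deliberately tiny block size so the example is easy to follow


def block_signatures(old: bytes, block: int = BLOCK) -> dict:
    """Receiver side: hash every full fixed-size block of the existing file."""
    sigs = {}
    for offset in range(0, len(old) - block + 1, block):
        digest = hashlib.md5(old[offset:offset + block]).digest()
        sigs.setdefault(digest, offset // block)  # keep the first block index
    return sigs


def compute_delta(new: bytes, sigs: dict, block: int = BLOCK) -> list:
    """Sender side: emit ("copy", block_index) or ("data", literal_bytes)."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block]
        index = None
        if len(window) == block:
            index = sigs.get(hashlib.md5(window).digest())
        if index is not None:
            if literal:  # flush pending literal bytes before the match
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", index))
            pos += block
        else:
            literal.append(new[pos])  # no match here: slide forward one byte
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list, block: int = BLOCK) -> bytes:
    """Receiver side: rebuild the new file from old blocks plus literals."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value * block:(value + 1) * block]
        else:
            out += value
    return bytes(out)


old = b"AAAABBBBCCCCDDDD"    # what the destination already has
new = b"AAAABBBBxxCCCCDDDD"  # the source, with two bytes inserted
delta = compute_delta(new, block_signatures(old))
rebuilt = apply_delta(old, delta)
```

In this toy run only the two inserted bytes travel as literal data; the rest of the file is described by references to blocks the destination already holds.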

rsync runs at both ends

Whenever rsync is launched, two[1] copies
of the program are involved:

  • One copy at the source, the sender
  • One copy at the destination, the receiver

Their roles are defined by the rsync algorithm.

[1] Actually, more than two.

When is rsync useful?

  • When the connection speed is slow while local file access is fast
  • When there are small changes among large files
  • When there are a limited number of changed files in large directory trees
  • When file content doesn't change but other attributes do

Usage statement

This is a sophisticated piece of software with many features.
rsync  version 3.0.9  protocol version 30
Copyright (C) 1996-2011 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.

rsync is a file transfer program capable of efficient remote update
via a fast differencing algorithm.

Usage: rsync [OPTION]... SRC [SRC]... DEST
  or   rsync [OPTION]... SRC [SRC]... [USER@]HOST:DEST
  or   rsync [OPTION]... SRC [SRC]... [USER@]HOST::DEST
  or   rsync [OPTION]... SRC [SRC]... rsync://[USER@]HOST[:PORT]/DEST
  or   rsync [OPTION]... [USER@]HOST:SRC [DEST]
  or   rsync [OPTION]... [USER@]HOST::SRC [DEST]
  or   rsync [OPTION]... rsync://[USER@]HOST[:PORT]/SRC [DEST]
The ':' usages connect via remote shell, while '::' & 'rsync://' usages connect
to an rsync daemon, and require SRC or DEST to start with a module name.
Options
 -v, --verbose               increase verbosity
 -q, --quiet                 suppress non-error messages
     --no-motd               suppress daemon-mode MOTD (see manpage caveat)
 -c, --checksum              skip based on checksum, not mod-time & size
 -a, --archive               archive mode; equals -rlptgoD (no -H,-A,-X)
     --no-OPTION             turn off an implied OPTION (e.g. --no-D)
 -r, --recursive             recurse into directories
 -R, --relative              use relative path names
     --no-implied-dirs       don't send implied dirs with --relative
 -b, --backup                make backups (see --suffix & --backup-dir)
     --backup-dir=DIR        make backups into hierarchy based in DIR
     --suffix=SUFFIX         set backup suffix (default ~ w/o --backup-dir)
 -u, --update                skip files that are newer on the receiver
     --inplace               update destination files in-place (SEE MAN PAGE)
     --append                append data onto shorter files
     --append-verify         like --append, but with old data in file checksum
 -d, --dirs                  transfer directories without recursing
 -l, --links                 copy symlinks as symlinks
 -L, --copy-links            transform symlink into referent file/dir
     --copy-unsafe-links     only "unsafe" symlinks are transformed
     --safe-links            ignore symlinks that point outside the source tree
 -k, --copy-dirlinks         transform symlink to a dir into referent dir
 -K, --keep-dirlinks         treat symlinked dir on receiver as dir
 -H, --hard-links            preserve hard links
 -p, --perms                 preserve permissions
 -E, --executability         preserve the file's executability
     --chmod=CHMOD           affect file and/or directory permissions
 -A, --acls                  preserve ACLs (implies --perms)
 -X, --xattrs                preserve extended attributes
 -o, --owner                 preserve owner (super-user only)
 -g, --group                 preserve group
     --devices               preserve device files (super-user only)
     --copy-devices          copy device contents as regular file
     --specials              preserve special files
 -D                          same as --devices --specials
 -t, --times                 preserve modification times
 -O, --omit-dir-times        omit directories from --times
     --super                 receiver attempts super-user activities
     --fake-super            store/recover privileged attrs using xattrs
 -S, --sparse                handle sparse files efficiently
 -n, --dry-run               perform a trial run with no changes made
 -W, --whole-file            copy files whole (without delta-xfer algorithm)
 -x, --one-file-system       don't cross filesystem boundaries
 -B, --block-size=SIZE       force a fixed checksum block-size
 -e, --rsh=COMMAND           specify the remote shell to use
     --rsync-path=PROGRAM    specify the rsync to run on the remote machine
     --existing              skip creating new files on receiver
     --ignore-existing       skip updating files that already exist on receiver
     --remove-source-files   sender removes synchronized files (non-dirs)
     --del                   an alias for --delete-during
     --delete                delete extraneous files from destination dirs
     --delete-before         receiver deletes before transfer, not during
     --delete-during         receiver deletes during the transfer
     --delete-delay          find deletions during, delete after
     --delete-after          receiver deletes after transfer, not during
     --delete-excluded       also delete excluded files from destination dirs
     --ignore-errors         delete even if there are I/O errors
     --force                 force deletion of directories even if not empty
     --max-delete=NUM        don't delete more than NUM files
     --max-size=SIZE         don't transfer any file larger than SIZE
     --min-size=SIZE         don't transfer any file smaller than SIZE
     --partial               keep partially transferred files
     --partial-dir=DIR       put a partially transferred file into DIR
     --delay-updates         put all updated files into place at transfer's end
 -m, --prune-empty-dirs      prune empty directory chains from the file-list
     --numeric-ids           don't map uid/gid values by user/group name
     --timeout=SECONDS       set I/O timeout in seconds
     --contimeout=SECONDS    set daemon connection timeout in seconds
 -I, --ignore-times          don't skip files that match in size and mod-time
     --size-only             skip files that match in size
     --modify-window=NUM     compare mod-times with reduced accuracy
 -T, --temp-dir=DIR          create temporary files in directory DIR
 -y, --fuzzy                 find similar file for basis if no dest file
     --compare-dest=DIR      also compare destination files relative to DIR
     --copy-dest=DIR         ... and include copies of unchanged files
     --link-dest=DIR         hardlink to files in DIR when unchanged
 -z, --compress              compress file data during the transfer
     --compress-level=NUM    explicitly set compression level
     --skip-compress=LIST    skip compressing files with a suffix in LIST
 -C, --cvs-exclude           auto-ignore files the same way CVS does
 -f, --filter=RULE           add a file-filtering RULE
 -F                          same as --filter='dir-merge /.rsync-filter'
                             repeated: --filter='- .rsync-filter'
     --exclude=PATTERN       exclude files matching PATTERN
     --exclude-from=FILE     read exclude patterns from FILE
     --include=PATTERN       don't exclude files matching PATTERN
     --include-from=FILE     read include patterns from FILE
     --files-from=FILE       read list of source-file names from FILE
 -0, --from0                 all *-from/filter files are delimited by 0s
 -s, --protect-args          no space-splitting; only wildcard special-chars
     --address=ADDRESS       bind address for outgoing socket to daemon
     --port=PORT             specify double-colon alternate port number
     --sockopts=OPTIONS      specify custom TCP options
     --blocking-io           use blocking I/O for the remote shell
     --stats                 give some file-transfer stats
 -8, --8-bit-output          leave high-bit chars unescaped in output
 -h, --human-readable        output numbers in a human-readable format
     --progress              show progress during transfer
 -P                          same as --partial --progress
 -i, --itemize-changes       output a change-summary for all updates
     --out-format=FORMAT     output updates using the specified FORMAT
     --log-file=FILE         log what we're doing to the specified FILE
     --log-file-format=FMT   log updates using the specified FMT
     --password-file=FILE    read daemon-access password from FILE
     --list-only             list the files instead of copying them
     --bwlimit=KBPS          limit I/O bandwidth; KBytes per second
     --write-batch=FILE      write a batched update to FILE
     --only-write-batch=FILE like --write-batch but w/o updating destination
     --read-batch=FILE       read a batched update from FILE
     --protocol=NUM          force an older protocol version to be used
     --iconv=CONVERT_SPEC    request charset conversion of filenames
     --checksum-seed=NUM     set block/file checksum seed (advanced)
 -4, --ipv4                  prefer IPv4
 -6, --ipv6                  prefer IPv6
     --version               print version number
(-h) --help                  show this help (-h is --help only if used alone)

Use "rsync --daemon --help" to see the daemon-mode command-line options.
Please see the rsync(1) and rsyncd.conf(5) man pages for full documentation.
See http://rsync.samba.org/ for updates, bug reports, and answers

Did you notice -a in there?

It's the most commonly used option. It means:

  • recursive (-r)
  • copy symbolic links (-l)
  • copy file permissions (-p)
  • copy file timestamps (-t)
  • copy file group ownership (-g)
  • copy file ownership (-o)
  • ... and a few more

This is why most of the examples in this
presentation will show rsync -a.

Source or target specifications

Just like for scp, a source or destination can be:

A local absolute path: /data/study/minecraft
A local relative path: mydir/subdir
A remote absolute path, accessible through SSH using the same user name: xyz.mcgill.ca:/home/prioux
A remote relative path, accessible through SSH using the same user name: xyz.mcgill.ca:presentations
(defaults to home of user)
A remote path, accessible through SSH using some other user name: alan@xyz.mcgill.ca:/papers

Local copy

rsync -a myfile myfilecopy
rsync -a /data/mydir/ /home/prioux/localdir

Push mode

rsync -a /data/localdir/ prioux@superserver.mcgill.ca:/data/brainfolly

Pull mode

rsync -a prioux@superserver.mcgill.ca:/data/brainfolly/ ~/follycopy

Important!

Does the destination directory exist?

rsync creates the destination directory if it
doesn't already exist (last component only).

E.g. if /X/Y/Z exists but not N, and C is a directory,
then N will be created and C will be recreated inside it, as /X/Y/Z/N/C:

rsync -a /A/B/C   /X/Y/Z/N
With cp -r or scp -r, the first run would make N a copy of C and a
repetition would then create N/C: total duplication! rsync places C
inside N whether or not N already exists, so repeating it is safe:
rsync -a /A/B/C   /X/Y/Z/N  # N doesn't exist but is created; data sent to /X/Y/Z/N/C
rsync -a /A/B/C   /X/Y/Z/N  # N exists now; data still goes to /X/Y/Z/N/C

Important!

The trailing / in the source!

rsync's behavior changes when a trailing / is added to the source specification.

E.g. Assuming /X/Y/Z exists:

rsync -a /A/B/C   /X/Y/Z     # will create/update /X/Y/Z/C at the destination
rsync -a /A/B/C/  /X/Y/Z     # the files inside C go directly inside /X/Y/Z
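
The trailing-slash rule can be written down as a tiny Python function. This is only a model of the rule for a source that is a directory, not code taken from rsync; the name landing_dir is made up for illustration.

```python
import posixpath


def landing_dir(source: str, dest: str) -> str:
    """Directory under which the files inside the source directory end up."""
    if source.endswith("/"):
        # Trailing slash: "the contents of" the source go straight into dest.
        return dest
    # No trailing slash: the source directory itself is recreated inside dest.
    return posixpath.join(dest, posixpath.basename(source))


print(landing_dir("/A/B/C", "/X/Y/Z"))   # /X/Y/Z/C
print(landing_dir("/A/B/C/", "/X/Y/Z"))  # /X/Y/Z
```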

rsync algorithm

Without going into much detail:

  • If the file doesn't exist at the destination, send it over
  • If the file does exist, compare size and timestamps:
    1. If they match, only adjust the other attributes (owner, group, permissions...)
    2. If they do NOT match, enter dialog mode to identify differences in content, and synchronize.
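
The decision above can be sketched in a few lines of Python. This is a rough model under stated assumptions, not rsync's code: timestamps are compared as whole seconds, and the function name quick_check and the returned strings are invented for this example.

```python
import os
import tempfile


def quick_check(src_path: str, dst_path: str) -> str:
    """Decide what to do with one file, based on existence, size and mtime."""
    if not os.path.exists(dst_path):
        return "send whole file"
    src, dst = os.stat(src_path), os.stat(dst_path)
    same_size = src.st_size == dst.st_size
    same_mtime = int(src.st_mtime) == int(dst.st_mtime)  # whole seconds only
    if same_size and same_mtime:
        return "adjust attributes only"
    return "compare contents and send differences"


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    dst = os.path.join(tmp, "dst.txt")
    with open(src, "w") as fh:
        fh.write("hello")
    missing = quick_check(src, dst)        # destination not there yet

    with open(dst, "w") as fh:
        fh.write("hello")
    os.utime(src, (1_000_000, 1_000_000))  # force identical timestamps
    os.utime(dst, (1_000_000, 1_000_000))
    in_sync = quick_check(src, dst)

    os.utime(dst, (2_000_000, 2_000_000))  # same size, different mtime
    stale = quick_check(src, dst)

print(missing, in_sync, stale, sep="\n")
```

Note that file content is never read in the first two cases, which is where most of the time savings on large, mostly unchanged trees come from.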

Reports: simple file list

rsync -a -v SOURCE DEST
building file list ... done
./
README.md
index.html
css/
css/reveal.css
css/reveal.scss
(etc)

Reports: itemized file list

rsync -a -i SOURCE DEST
>f..t.... README.md
>f.st.... index.html
.f.....g. css/reveal.css
(etc)
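
The cryptic flags can be decoded mechanically. The Python sketch below reads only the first eight flag characters (update type, file type, then checksum, size, mtime, permissions, owner, group), as described in the rsync man page; the number of trailing characters varies between rsync versions, so the rest is ignored. The function name decode_itemized and the wording of the descriptions are assumptions made for this example.

```python
UPDATE_TYPES = {
    "<": "sent to remote host",
    ">": "received by local host",
    "c": "local change or creation",
    "h": "hard link",
    ".": "not updated (attributes may change)",
    "*": "message",
}
FILE_TYPES = {
    "f": "file",
    "d": "directory",
    "L": "symlink",
    "D": "device",
    "S": "special file",
}
# Attribute flags in positions 2 to 7 of the itemized string.
ATTRIBUTES = ["checksum", "size", "mtime", "permissions", "owner", "group"]


def decode_itemized(line: str) -> dict:
    """Split one -i output line into a name and human-readable flags."""
    flags, _, name = line.partition(" ")
    # A "." (or a space) means "unchanged"; anything else marks a change.
    # For brand-new items rsync prints "+" everywhere, which this sketch
    # simply reports as every attribute being set.
    changed = [attr for attr, flag in zip(ATTRIBUTES, flags[2:8])
               if flag not in ". "]
    return {
        "name": name,
        "update": UPDATE_TYPES.get(flags[0], "unknown"),
        "type": FILE_TYPES.get(flags[1], "unknown"),
        "changed": changed,
    }


for line in (">f..t.... README.md",
             ">f.st.... index.html",
             ".f.....g. css/reveal.css"):
    print(decode_itemized(line))
```

So README.md is transferred because its timestamp differs, index.html because size and timestamp differ, and css/reveal.css only gets its group adjusted.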

Reports: stats

This shows the stats for the actual ACElab nightly backup of Breitner1-vh's NFS data partition (where all VMs get their /data mounts) for April 7th, 2016.

The backup took about 40 minutes.
Source: 5T, 2M files. Changes: 6G.

rsync -a --stats SOURCE DEST
Number of files: 2014473
Number of files transferred: 9360
Total file size: 4987326901335 bytes
Total transferred file size: 6148253201 bytes
Literal data: 6147647701 bytes
Matched data: 605500 bytes
File list size: 55018731
File list generation time: 0.035 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 6203895660
Total bytes received: 277811

sent 6203895660 bytes  received 277811 bytes  2732514.19 bytes/sec
total size is 4987326901335  speedup is 803.87
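
The last two lines of the report can be cross-checked with a bit of arithmetic. The Python sketch below assumes that "speedup" is the total size of the tree divided by the bytes that actually crossed the wire, and that the reported rate is that same traffic divided by the elapsed time; the variable names are made up.

```python
# Numbers copied from the --stats report above.
total_size = 4987326901335    # "total size is ..."
bytes_sent = 6203895660       # "Total bytes sent"
bytes_received = 277811       # "Total bytes received"
bytes_per_sec = 2732514.19    # rate from the "sent ... received ..." line

traffic = bytes_sent + bytes_received
speedup = total_size / traffic           # tree size vs. actual traffic
minutes = traffic / bytes_per_sec / 60   # elapsed time implied by the rate

print(f"speedup:  {speedup:.2f}")
print(f"duration: about {minutes:.0f} minutes")
```

Under those assumptions the numbers agree with the report: the speedup matches the printed 803.87, and the implied duration is consistent with the "about 40 minutes" quoted above.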

Running sender as root

  • Can traverse the source tree completely
  • Only needed when that tree contains mixed ownership data

Running receiver as root

  • Can recreate file ownership completely
  • Security risk for automatic systems

Fake root mode (--fake-super)

  • You still run as root on the source side
  • On the destination, a non-privileged user can run rsync
  • Information about ownership, group ownership, etc. is stored in extended attributes, assuming the target filesystem supports them (e.g. ext4)
  • This is what the ACElab uses for incremental backups

Network transport options

  • When copying locally, there is no network transport; the two rsync programs talk to each other using local IPC
  • When copying remotely, two transport mechanisms are supported:
    • rsync protocol over port 873 (unencrypted)
    • rsync protocol over SSH (encrypted)
  • Note: source and destination syntax change for each mode; not shown.

Delete spurious files

Counterintuitively, by default...

  • if other files exist in the destination directory, they are left alone
  • we must use --delete to tell rsync to clean them up

rsync -a --delete /mystudy ace-storage-19:/studybackup
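
What counts as a file to clean up can be modelled as a set difference: anything present under the destination tree but absent from the source tree. The Python sketch below only illustrates that idea on two throwaway directories; it does not call rsync, and the helper names relative_files and extraneous_files are invented.

```python
import os
import tempfile


def relative_files(root: str) -> set:
    """All regular files under root, as paths relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def extraneous_files(source_root: str, dest_root: str) -> set:
    """Files that exist only at the destination (what --delete would remove)."""
    return relative_files(dest_root) - relative_files(source_root)


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for root, names in ((src, ["a.txt"]), (dst, ["a.txt", "old_report.txt"])):
        for name in names:
            with open(os.path.join(root, name), "w") as fh:
                fh.write("x")
    to_delete = extraneous_files(src, dst)

print(to_delete)  # {'old_report.txt'}
```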

Dry-run mode

A useful option for making sure all specifications and directories are OK is to use the dry run mode:
rsync -a --dry-run -v /source /dest # same as -n
rsync -a -n        -v /source /dest # same as --dry-run
In such a mode, no files are modified at the destination at all. In combination with the reporting options, one can double-check that rsync will do what is expected.

File updating

Not performed in place!

When updating an existing file on the destination side, rsync recreates it beside the previous version and then renames it in place:
unix$ ls -al
-rw-r--r--    1 prioux  staff   44871173 Apr 08  2016 .big_one_gig_file.img.yAGHxB
-rw-r--r--    1 prioux  staff 1023422499 Oct 23  2013 big_one_gig_file.img
(After the file's entire data is synchronized)
unix$ ls -al
-rw-r--r--    1 prioux  staff 1071933211 Mar 21  2015 big_one_gig_file.img

File updating: --inplace

In some circumstances, we may prefer to let rsync modify the content of destination files directly.
rsync -a --inplace big_one_gig_file.img prioux@destination:big_one_gig_file.img
This kind of behavior is ideal for LARGE files with internal content changing in localized regions, such as VM disk images. This takes full advantage of the differencing algorithm of rsync.
-rw-------. 1 qemu qemu  7760445440 Apr  8 11:18 abou-haider.img
-rw-------. 1 qemu qemu 19897057280 Apr  8 11:19 ccna.img
-rw-------. 1 qemu qemu 19308150784 Apr  8 11:17 cecile.img
-rw-------. 1 qemu qemu 20052967424 Apr  8 11:17 christine.qcow2
-rw-------. 1 qemu qemu 12864585728 Apr  8 11:17 dave.img
-rw-------. 1 qemu qemu 19894697984 Apr  8 11:17 epigenomics.img
-rw-------. 1 qemu qemu 20913717248 Apr  8 11:19 greg.img
-rw-------. 1 qemu qemu 14506393600 Apr  8 11:19 jordan.img
-rw-------. 1 qemu qemu  8932556800 Apr  8 11:17 jsaigle.img
-rw-------. 1 qemu qemu 31374770176 Apr  8 11:20 justin.qcow2

Exclusions

You can tell rsync to ignore part of your source tree:
rsync -a --exclude="*/not_backed_up" --exclude="**/*.bak" /nfs/data remote:/dest

Timestamp deltas

Sometimes, timestamps cannot be compared exactly.

rsync -a --modify-window=2 source dest
  • You can specify a 'fuzziness' level when comparing timestamps.
  • Useful for source and destination filesystems with different time granularity
  • E.g. FAT filesystems on USB keys, with a resolution of 2 seconds
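
The effect of the window can be sketched in Python: two timestamps are treated as equal when they differ by no more than the window. The function name same_mtime and the sample timestamps are made up for illustration, and whole-second timestamps are assumed.

```python
def same_mtime(src_mtime: int, dst_mtime: int, modify_window: int = 0) -> bool:
    """True when the two timestamps differ by at most modify_window seconds."""
    return abs(src_mtime - dst_mtime) <= modify_window


src_mtime = 1460123457  # odd second, as stored on the source filesystem
fat_mtime = 1460123458  # same file on a FAT key, rounded to an even second

print(same_mtime(src_mtime, fat_mtime))                   # False
print(same_mtime(src_mtime, fat_mtime, modify_window=2))  # True
```

Without the window, every file with an odd timestamp would look modified on each run and its content would be compared again.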

Skipping files while ignoring timestamps

If the destination already has files with known content in sync, but with wrong timestamps, then you can use rsync to adjust all the timestamps by telling it to ONLY compare files by their presence and size.
rsync -a --size-only source dest
Otherwise, because the timestamps are different, the contents of all files would be inspected!

Using multiple sources

Multiple sources can be sent to a single destination:
rsync -a /A/B  /X/Y/Z  /home/prioux /dest  # updates /dest/B, /dest/Z and /dest/prioux
rsync -a /A/B/ /X/Y/Z/ /home/prioux/ /dest # all goes inside /dest

Things to consider

Before launching an rsync command, consider:

  • Are there hardlinks among my sources?
  • Are timestamps reliable? (think FAT filesystems)
  • Do I want to make changes --inplace?
  • What are the --excludes I can specify?
  • Do I need rsync to run as root on the source side?
  • What about on the destination side?

The perfect mirror

This command perfectly mirrors a directory tree, and avoids all the pitfalls of source and destination specifications.
rsync -a -H --delete /my/source/data/ /my/dest/data

Thank you!

  • Please read the man page for rsync, it's worth it
  • If you are not sure about running an rsync command that could potentially be destructive, please use --dry-run and contact me to review the command.
  • The word rsync can be spelled in other colors.