diff --git a/doc/Design.md b/doc/Design.md
index b815efad9..4463fadb3 100644
--- a/doc/Design.md
+++ b/doc/Design.md
@@ -21,49 +21,65 @@ been backed up at some point in time. The state here means the content and meta
 data like the name and modification time for the file or the directory and its
 contents.
 
+*Storage ID*: A storage ID is the hash of the content of a file stored in the
+repository. This ID is needed in order to load the file from the repository.
+The storage hash is the SHA-256 hash of the content.
+
 Repository Format
 =================
 
 All data is stored in a restic repository. A repository is able to store data
-of several different types, which can later be requested based on an ID. The ID
-is the hash (SHA-256) of the content of a file. All files in a repository are
-only written once and never modified afterwards. This allows accessing and even
-writing to the repository with multiple clients in parallel. Only the delete
-operation changes data in the repository.
+of several different types, which can later be requested based on an ID. This
+so-called "storage ID" is the hash (SHA-256) of the content of a file. All
+files in a repository are only written once and never modified afterwards. This
+allows accessing and even writing to the repository with multiple clients in
+parallel. Only the delete operation removes data from the repository.
 
 At the time of writing, the only implemented repository type is based on
 directories and files. Such repositories can be accessed locally on the same
 system or via the integrated SFTP client. The directory layout is the same for
 both access methods. This repository type is described in the following.
 
-Repositories consists of several directories and a file called `version`. This
-file contains the version number of the repository. At the moment, this file
-is expected to hold the string `1`, with an optional newline character.
-Additionally there is a file named `id` which contains 32 random bytes, encoded
-in hexadecimal. This uniquely identifies the repository, regardless if it is
-accessed via SFTP or locally.
+Repositories consists of several directories and a file called `config`. For
+all other files stored in the repository, the name for the file is the lower
+case hexadecimal representation of the storage ID, which is the SHA-256 hash of
+the file's contents. This allows easily checking all files for accidental
+modifications like disk read errors by simply running the program `sha256sum`
+and comparing its output to the file name. If the prefix of a filename is
+unique amongst all the other files in the same directory, the prefix may be
+used instead of the complete filename.
 
-For all other files stored in the repository, the name for the file is the
-lower case hexadecimal representation of the SHA-256 hash of the file's
-contents. This allows easily checking all files for accidental modifications
-like disk read errors by simply running the program `sha256sum` and comparing
-its output to the file name. If the prefix of a filename is unique amongst all
-the other files in the same directory, the prefix may be used instead of the
-complete filename.
-
-Apart from the files `version`, `id` and the files stored below the `keys`
-directory, all files are encrypted with AES-256 in counter mode (CTR). The
-integrity of the encrypted data is secured by a Poly1305-AES message
-authentication code (sometimes also referred to as a "signature").
+Apart from the files stored below the `keys` directory, all files are encrypted
+with AES-256 in counter mode (CTR). The integrity of the encrypted data is
+secured by a Poly1305-AES message authentication code (sometimes also referred
+to as a "signature").
 
 In the first 16 bytes of each encrypted file the initialisation vector (IV) is
 stored. It is followed by the encrypted data and completed by the 16 byte
 MAC. The format is: `IV || CIPHERTEXT || MAC`. The complete encryption
-overhead is 32 byte. For each file, a new random IV is selected.
+overhead is 32 bytes. For each file, a new random IV is selected.
 
-The basic layout of a sample restic repository is shown below:
+The file `config` is encrypted this way and contains a JSON document like the
+following:
+
+    {
+      "version": 1,
+      "id": "5956a3f67a6230d4a92cefb29529f10196c7d92582ec305fd71ff6d331d6271b",
+      "chunker_polynomial": "25b468838dcb75"
+    }
+
+After decryption, restic first checks that the version field contains a version
+number that it understands, otherwise it aborts. At the moment, the version is
+expected to be 1. The field `id` holds a unique ID which consists of 32
+random bytes, encoded in hexadecimal. This uniquely identifies the repository,
+regardless if it is accessed via SFTP or locally. The field
+`chunker_polynomial` contains a parameter that is used for splitting large
+files into smaller chunks (see below).
+
+The basic layout of a sample restic repository is shown here:
 
     /tmp/restic-repo
+    ├── config
     ├── data
     │   ├── 21
     │   │   └── 2159dd48f8a24f33c307b750592773f8b71ff8d11452132a7b2e2a6a01611be1
@@ -74,7 +90,6 @@ The basic layout of a sample restic repository is shown below:
     │   ├── 73
     │   │   └── 73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c
     │   [...]
-    ├── id
     ├── index
     │   ├── c38f5fb68307c6a3e3aa945d556e325dc38f5fb68307c6a3e3aa945d556e325d
     │   └── ca171b1b7394d90d330b265d90f506f9984043b342525f019788f97e745c71fd
@@ -83,8 +98,7 @@ The basic layout of a sample restic repository is shown below:
     ├── locks
     ├── snapshots
     │   └── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec
-    ├── tmp
-    └── version
+    └── tmp
 
 A repository can be initialized with the `restic init` command, e.g.:
 
@@ -93,21 +107,21 @@ A repository can be initialized with the `restic init` command, e.g.:
 Pack Format
 -----------
 
-All files in the repository except Key and Data files just contain raw data,
-stored as `IV || Ciphertext || MAC`. Data files may contain one or more Blobs
-of data. The format is described in the following.
+All files in the repository except Key and Pack files just contain raw data,
+stored as `IV || Ciphertext || MAC`. Pack files may contain one or more Blobs
+of data.
 
-The Pack's structure is as follows:
+A Pack's structure is as follows:
 
     EncryptedBlob1 || ... || EncryptedBlobN || EncryptedHeader || Header_Length
 
-At the end of the Pack is a header, which describes the content. The header is
-encrypted and authenticated. `Header_Length` is the length of the encrypted header
-encoded as a four byte integer in little-endian encoding. Placing the header at
-the end of a file allows writing the blobs in a continuous stream as soon as
-they are read during the backup phase. This reduces code complexity and avoids
-having to re-write a file once the pack is complete and the content and length
-of the header is known.
+At the end of the Pack file is a header, which describes the content. The
+header is encrypted and authenticated. `Header_Length` is the length of the
+encrypted header encoded as a four byte integer in little-endian encoding.
+Placing the header at the end of a file allows writing the blobs in a
+continuous stream as soon as they are read during the backup phase. This
+reduces code complexity and avoids having to re-write a file once the pack is
+complete and the content and length of the header is known.
 
 All the blobs (`EncryptedBlob1`, `EncryptedBlobN` etc.) are authenticated and
 encrypted independently. This enables repository reorganisation without having
@@ -178,7 +192,7 @@ listed afterwards.
 
 There may be an arbitrary number of index files, containing information on
 non-disjoint sets of Packs. The number of packs described in a single file is
-chosen so that the file size is kep below 8 MiB.
+chosen so that the file size is kept below 8 MiB.
 
 Keys, Encryption and MAC
 ------------------------
@@ -230,9 +244,8 @@ tampered with, the computed MAC will not match the last 16 bytes of the data,
 and restic exits with an error. Otherwise, the data is decrypted with the
 encryption key derived from `scrypt`. This yields a JSON document which
 contains the master encryption and message authentication keys for this
-repository (encoded in Base64) and the polynomial that is used for CDC. The
-command `restic cat masterkey` can be used as follows to decrypt and
-pretty-print the master key:
+repository (encoded in Base64). The command `restic cat masterkey` can be used
+as follows to decrypt and pretty-print the master key:
 
     $ restic -r /tmp/restic-repo cat masterkey
     {
@@ -241,7 +254,6 @@ pretty-print the master key:
         "r": "E9eEDnSJZgqwTOkDtOp+Dw=="
       },
       "encrypt": "UQCqa0lKZ94PygPxMRqkePTZnHRYh1k1pX2k2lM2v3Q=",
-      "chunker_polynomial": "2f0797d9c2363f"
     }
 
 All data in the repository is encrypted and authenticated with these master keys.
@@ -284,9 +296,9 @@ hash. Before saving, each file is split into variable sized Blobs of data. The
 SHA-256 hashes of all Blobs are saved in an ordered list which then represents
 the content of the file.
 
-In order to relate these plain text hashes to the actual encrypted storage
-hashes (which vary due to random IVs), an index is used. If the index is not
-available, the header of all data Blobs can be read.
+In order to relate these plain text hashes to the actual location within a Pack
+file , an index is used. If the index is not available, the header of all data
+Blobs can be read.
 
 Trees and Data
 --------------
@@ -321,7 +333,7 @@ The command `restic cat tree` can be used to inspect the tree referenced above:
 
 A tree contains a list of entries (in the field `nodes`) which contain meta
 data like a name and timestamps. When the entry references a directory, the
-field `subtree` contains the plain text ID of another tree object.
+field `subtree` contains the plain text ID of another tree object.
 
 When the command `restic cat tree` is used, the storage hash is needed to print
 a tree. The tree referenced above can be dumped as follows:
@@ -372,8 +384,9 @@ For creating a backup, restic scans the source directory for all files,
 sub-directories and other entries. The data from each file is split into
 variable length Blobs cut at offsets defined by a sliding window of 64 byte. The
 implementation uses Rabin Fingerprints for implementing this Content
-Defined Chunking (CDC). An irreducible polynomial is selected at random when a
-repository is initialized.
+Defined Chunking (CDC). An irreducible polynomial is selected at random and
+saved in the file `config` when a repository is initialized, so that watermark
+attacks are much harder.
 
 Files smaller than 512 KiB are not split, Blobs are of 512 KiB to 8 MiB in
 size. The implementation aims for 1 MiB Blob size on average.