doc: Add config

2024-12-22 10:58:55 +00:00 · 2015-05-03 15:00:26 +02:00 · 2015-05-03 15:00:26 +02:00 · 062c328f2d
commit 062c328f2d
parent 13a42ec5ec
1 changed files with 63 additions and 50 deletions
--- a/doc/Design.md
+++ b/doc/Design.md
@ -21,49 +21,65 @@ been backed up at some point in time. The state here means the content and meta
 data like the name and modification time for the file or the directory and its
 contents.
 *Storage ID*: A storage ID is the hash of the content of a file stored in the
 repository. This ID is needed in order to load the file from the repository.
 The storage hash is the SHA-256 hash of the content.
 Repository Format
 =================
 All data is stored in a restic repository. A repository is able to store data
-of several different types, which can later be requested based on an ID. The ID
+of several different types, which can later be requested based on an ID. This
-is the hash (SHA-256) of the content of a file. All files in a repository are
+so-called "storage ID" is the hash (SHA-256) of the content of a file. All
-only written once and never modified afterwards. This allows accessing and even
+files in a repository are only written once and never modified afterwards. This
-writing to the repository with multiple clients in parallel. Only the delete
+allows accessing and even writing to the repository with multiple clients in
-operation changes data in the repository.
+parallel. Only the delete operation removes data from the repository.
 At the time of writing, the only implemented repository type is based on
 directories and files. Such repositories can be accessed locally on the same
 system or via the integrated SFTP client. The directory layout is the same for
 both access methods. This repository type is described in the following.
-Repositories consists of several directories and a file called `version`. This
+Repositories consist of several directories and a file called `config`. For
-file contains the version number of the repository. At the moment, this file
+all other files stored in the repository, the name for the file is the lower
-is expected to hold the string `1`, with an optional newline character.
+case hexadecimal representation of the storage ID, which is the SHA-256 hash of
-Additionally there is a file named `id` which contains 32 random bytes, encoded
+the file's contents. This allows easily checking all files for accidental
-in hexadecimal. This uniquely identifies the repository, regardless if it is
+modifications like disk read errors by simply running the program `sha256sum`
-accessed via SFTP or locally.
+and comparing its output to the file name. If the prefix of a filename is
 unique amongst all the other files in the same directory, the prefix may be
 used instead of the complete filename.
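The naming rule above can be checked with a few lines of Python. This is a minimal sketch, not restic's implementation: the helper names `storage_id`, `name_matches_content` and `resolve_prefix` are hypothetical, and the sketch assumes the file is read in full from a local directory.

```python
import hashlib
import os


def storage_id(content: bytes) -> str:
    """Return the lower case hexadecimal SHA-256 digest of the content."""
    return hashlib.sha256(content).hexdigest()


def name_matches_content(path: str) -> bool:
    """Check that a file's name equals the SHA-256 hash of its contents.

    This mirrors running `sha256sum` and comparing its output to the name.
    """
    with open(path, "rb") as f:
        return storage_id(f.read()) == os.path.basename(path)


def resolve_prefix(directory: str, prefix: str) -> str:
    """Return the one file name in `directory` that starts with `prefix`.

    A prefix is only usable if it is unique within the directory, so any
    other number of matches is treated as an error.
    """
    matches = [name for name in os.listdir(directory) if name.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"prefix {prefix!r} matches {len(matches)} files")
    return matches[0]
```

A file whose name no longer matches its digest has been modified or damaged since it was written.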
-For all other files stored in the repository, the name for the file is the
+Apart from the files stored below the `keys` directory, all files are encrypted
-lower case hexadecimal representation of the SHA-256 hash of the file's
+with AES-256 in counter mode (CTR). The integrity of the encrypted data is
-contents. This allows easily checking all files for accidental modifications
+secured by a Poly1305-AES message authentication code (sometimes also referred
-like disk read errors by simply running the program `sha256sum` and comparing
+to as a "signature").
 its output to the file name. If the prefix of a filename is unique amongst all
 the other files in the same directory, the prefix may be used instead of the
 complete filename.
 Apart from the files `version`, `id` and the files stored below the `keys`
 directory, all files are encrypted with AES-256 in counter mode (CTR). The
 integrity of the encrypted data is secured by a Poly1305-AES message
 authentication code (sometimes also referred to as a "signature").
 In the first 16 bytes of each encrypted file the initialisation vector (IV) is
 stored. It is followed by the encrypted data and completed by the 16 byte
 MAC. The format is: `IV || CIPHERTEXT || MAC`. The complete encryption
-overhead is 32 byte. For each file, a new random IV is selected.
+overhead is 32 bytes. For each file, a new random IV is selected.
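As an illustration of the `IV || CIPHERTEXT || MAC` layout, the following sketch splits an encrypted file into its three parts. It only slices bytes and does not decrypt or verify anything; the function name `split_encrypted` and the constant names are assumptions made for this example.

```python
IV_SIZE = 16   # initialisation vector, stored in the first 16 bytes
MAC_SIZE = 16  # message authentication code, stored in the last 16 bytes
OVERHEAD = IV_SIZE + MAC_SIZE  # 32 bytes in total


def split_encrypted(buf: bytes) -> tuple[bytes, bytes, bytes]:
    """Split `IV || CIPHERTEXT || MAC` into (iv, ciphertext, mac)."""
    if len(buf) < OVERHEAD:
        raise ValueError("buffer is too short to hold an IV and a MAC")
    iv = buf[:IV_SIZE]
    ciphertext = buf[IV_SIZE:len(buf) - MAC_SIZE]
    mac = buf[len(buf) - MAC_SIZE:]
    return iv, ciphertext, mac
```

Because counter mode does not pad the data, the ciphertext has the same length as the plaintext, so an encrypted file is exactly 32 bytes larger than its content.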
-The basic layout of a sample restic repository is shown below:
+The file `config` is encrypted this way and contains a JSON document like the
 following:
    {
      "version": 1,
      "id": "5956a3f67a6230d4a92cefb29529f10196c7d92582ec305fd71ff6d331d6271b",
      "chunker_polynomial": "25b468838dcb75"
    }
 After decryption, restic first checks that the version field contains a version
 number that it understands, otherwise it aborts. At the moment, the version is
 expected to be 1. The field `id` holds a unique ID which consists of 32
 random bytes, encoded in hexadecimal. This uniquely identifies the repository,
 regardless of whether it is accessed via SFTP or locally. The field
 `chunker_polynomial` contains a parameter that is used for splitting large
 files into smaller chunks (see below).
 The basic layout of a sample restic repository is shown here:
    /tmp/restic-repo
    ├── config
    ├── data
    │   ├── 21
    │   │   └── 2159dd48f8a24f33c307b750592773f8b71ff8d11452132a7b2e2a6a01611be1
@ -74,7 +90,6 @@ The basic layout of a sample restic repository is shown below:
    │   ├── 73
    │   │   └── 73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c
    │   [...]
    ├── id
    ├── index
    │   ├── c38f5fb68307c6a3e3aa945d556e325dc38f5fb68307c6a3e3aa945d556e325d
    │   └── ca171b1b7394d90d330b265d90f506f9984043b342525f019788f97e745c71fd
@ -83,8 +98,7 @@ The basic layout of a sample restic repository is shown below:
    ├── locks
    ├── snapshots
    │   └── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec
-    ├── tmp
+    └── tmp
    └── version
 A repository can be initialized with the `restic init` command, e.g.:
@ -93,21 +107,21 @@ A repository can be initialized with the `restic init` command, e.g.:
 Pack Format
 -----------
-All files in the repository except Key and Data files just contain raw data,
+All files in the repository except Key and Pack files just contain raw data,
-stored as `IV || Ciphertext || MAC`. Data files may contain one or more Blobs
+stored as `IV || Ciphertext || MAC`. Pack files may contain one or more Blobs
-of data. The format is described in the following.
+of data.
-The Pack's structure is as follows:
+A Pack's structure is as follows:
    EncryptedBlob1 || ... || EncryptedBlobN || EncryptedHeader || Header_Length
-At the end of the Pack is a header, which describes the content. The header is
+At the end of the Pack file is a header, which describes the content. The
-encrypted and authenticated. `Header_Length` is the length of the encrypted header
+header is encrypted and authenticated. `Header_Length` is the length of the
-encoded as a four byte integer in little-endian encoding. Placing the header at
+encrypted header encoded as a four byte integer in little-endian encoding.
-the end of a file allows writing the blobs in a continuous stream as soon as
+Placing the header at the end of a file allows writing the blobs in a
-they are read during the backup phase. This reduces code complexity and avoids
+continuous stream as soon as they are read during the backup phase. This
-having to re-write a file once the pack is complete and the content and length
+reduces code complexity and avoids having to re-write a file once the pack is
-of the header is known.
+complete and the content and length of the header are known.
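To make the trailer layout concrete, here is a small sketch that locates the encrypted header inside a complete Pack file held in memory. It assumes only what the text states (a four byte little-endian `Header_Length` at the very end); the function name `locate_header` is hypothetical, and decrypting the header is out of scope.

```python
import struct

LENGTH_FIELD_SIZE = 4  # Header_Length is a four byte little-endian integer


def locate_header(pack: bytes) -> tuple[bytes, bytes]:
    """Split a Pack into (encrypted blobs, encrypted header)."""
    if len(pack) < LENGTH_FIELD_SIZE:
        raise ValueError("pack is too short to contain Header_Length")

    # Read Header_Length from the last four bytes of the file.
    (header_length,) = struct.unpack("<I", pack[-LENGTH_FIELD_SIZE:])

    header_end = len(pack) - LENGTH_FIELD_SIZE
    header_start = header_end - header_length
    if header_start < 0:
        raise ValueError("Header_Length exceeds the size of the pack")

    return pack[:header_start], pack[header_start:header_end]
```

A reader therefore starts at the end of the file: it reads the length, steps back over the header, and treats everything before the header as the concatenated encrypted blobs.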
 All the blobs (`EncryptedBlob1`, `EncryptedBlobN` etc.) are authenticated and
 encrypted independently. This enables repository reorganisation without having
@ -178,7 +192,7 @@ listed afterwards.
 There may be an arbitrary number of index files, containing information on
 non-disjoint sets of Packs. The number of packs described in a single file is
-chosen so that the file size is kep below 8 MiB.
+chosen so that the file size is kept below 8 MiB.
 Keys, Encryption and MAC
 ------------------------
@ -230,9 +244,8 @@ tampered with, the computed MAC will not match the last 16 bytes of the data,
 and restic exits with an error. Otherwise, the data is decrypted with the
 encryption key derived from `scrypt`. This yields a JSON document which
 contains the master encryption and message authentication keys for this
-repository (encoded in Base64) and the polynomial that is used for CDC. The
+repository (encoded in Base64). The command `restic cat masterkey` can be used
-command `restic cat masterkey` can be used as follows to decrypt and
+as follows to decrypt and pretty-print the master key:
 pretty-print the master key:
    $ restic -r /tmp/restic-repo cat masterkey
    {
@ -241,7 +254,6 @@ pretty-print the master key:
          "r": "E9eEDnSJZgqwTOkDtOp+Dw=="
        },
        "encrypt": "UQCqa0lKZ94PygPxMRqkePTZnHRYh1k1pX2k2lM2v3Q=",
        "chunker_polynomial": "2f0797d9c2363f"
    }
 All data in the repository is encrypted and authenticated with these master keys.
@ -284,9 +296,9 @@ hash. Before saving, each file is split into variable sized Blobs of data. The
 SHA-256 hashes of all Blobs are saved in an ordered list which then represents
 the content of the file.
-In order to relate these plain text hashes to the actual encrypted storage
+In order to relate these plain text hashes to the actual location within a Pack
-hashes (which vary due to random IVs), an index is used. If the index is not
+file, an index is used. If the index is not available, the header of all data
-available, the header of all data Blobs can be read.
+Blobs can be read.
 Trees and Data
 --------------
@ -321,7 +333,7 @@ The command `restic cat tree` can be used to inspect the tree referenced above:
 A tree contains a list of entries (in the field `nodes`) which contain meta
 data like a name and timestamps. When the entry references a directory, the
-field `subtree` contains the plain text ID of another tree object. 
+field `subtree` contains the plain text ID of another tree object.
 When the command `restic cat tree` is used, the storage hash is needed to print
 a tree. The tree referenced above can be dumped as follows:
@ -372,8 +384,9 @@ For creating a backup, restic scans the source directory for all files,
 sub-directories and other entries. The data from each file is split into
 variable length Blobs cut at offsets defined by a sliding window of 64 bytes.
 The implementation uses Rabin Fingerprints for implementing this Content
-Defined Chunking (CDC). An irreducible polynomial is selected at random when a
+Defined Chunking (CDC). An irreducible polynomial is selected at random and
-repository is initialized.
+saved in the file `config` when a repository is initialized, so that watermark
 attacks are much harder.
 Files smaller than 512 KiB are not split; Blobs range from 512 KiB to 8 MiB in
 size. The implementation aims for 1 MiB Blob size on average.