mirror of https://github.com/octoleo/restic.git synced 2024-12-22 02:48:55 +00:00

Clarify terminology

Alexander Neumann 2015-04-09 20:48:32 +02:00
parent b99474154c
commit 58f289003c


@@ -1,15 +1,35 @@
 This document gives a high-level overview of the design and repository layout
 of the restic backup program.

+Terminology
+===========
+
+This section introduces terminology used in this document.
+
+*Repository*: All data produced during a backup is sent to and stored at a
+repository in structured form, for example in a file system hierarchy of with
+several subdirectories. A repository implementation must be able to fulfil a
+number of operations, e.g. list the contents.
+
+*Blob*: A Blob combines a number of data bytes with identifying information
+like the SHA256 hash of the data and its length.
+
+*Pack*: A Pack combines one or more Blobs together, e.g. in a single file.
+
+*Snapshot*: A Snapshot stands for the state of a file or directory that has
+been backed up at some point in time. The state here means the content and meta
+data like the name and modification time for the file or the directory and its
+contents.
+
 Repository Format
 =================

 All data is stored in a restic repository. A repository is able to store data
-in blobs of several different types, which can later be requested based on an
-ID. The ID is the hash (SHA-256) of the content of a blob. All blobs in a
-repository are only written once and never modified afterwards. This allows
-accessing and even writing to the repository with multiple clients in parallel.
-Only the delete operation changes data in the repository.
+of several different types, which can later be requested based on an ID. The ID
+is the hash (SHA-256) of the content of a file. All files in a repository are
+only written once and never modified afterwards. This allows accessing and even
+writing to the repository with multiple clients in parallel. Only the delete
+operation changes data in the repository.

 At the time of writing, the only implemented repository type is based on
 directories and files. Such repositories can be accessed locally on the same
@@ -23,7 +43,7 @@ Additionally there is a file named `id` which contains 32 random bytes, encoded
 in hexadecimal. This uniquely identifies the repository, regardless if it is
 accessed via SFTP or locally.

-For all other blobs stored in the repository, the name for the file is the
+For all other files stored in the repository, the name for the file is the
 lower case hexadecimal representation of the SHA-256 hash of the file's
 contents. This allows easily checking all files for accidental modifications
 like disk read errors by simply running the program `sha256sum` and comparing
@@ -76,64 +96,66 @@ A repository can be initialized with the `restic init` command, e.g.:

     $ restic -r /tmp/restic-repo init

-Blob Format
+Pack Format
 -----------

-All blobs except key, tree and data blobs just contain raw data, stored as `IV
-|| Ciphertext || MAC`. Tree and Data blobs may contain several chunks of data.
-The format is described in the following.
+All files in the repository except Key, Tree and Data files just contain raw
+data, stored as `IV || Ciphertext || MAC`. Tree and Data files may contain
+several Blobs of data. The format is described in the following.

-The blob starts with a nonce and a header, the header describes the content and
-is encrypted and signed. The blob's structure is as follows:
+A Pack starts with a nonce and a header, the header describes the content and
+is encrypted and signed. The Pack's structure is as follows:

     NONCE || Header_Length ||
     IV_Header || Ciphertext_Header || MAC_Header ||
-    IV_Chunk_1 || Ciphertext_Chunk_1 || MAC_Chunk_1 ||
+    IV_Blob_1 || Ciphertext_Blob_1 || MAC_Blob_1 ||
     [...]
-    IV_Chunk_n || Ciphertext_Chunk_n || MAC_Chunk_n ||
+    IV_Blob_n || Ciphertext_Blob_n || MAC_Blob_n ||
     MAC

 `NONCE` consists of 16 bytes and `Header_Length` is a four byte integer in
 little-endian encoding.

-All the parts (`Ciphertext_Header`, `Ciphertext_Chunk1` etc.) are signed and
-encrypted independently. In addition, the complete blob is signed again using
+All the parts (`Ciphertext_Header`, `Ciphertext_Blob1` etc.) are signed and
+encrypted independently. In addition, the complete pack is signed using
 `NONCE`. This enables repository reorganisation without having to touch the
-encrypted chunks. In addition it also allows efficient indexing, for only the
-header needs to be read in order to find out which chunks are contained in the
-blob. Since the header is signed, authenticity of the header can be checked
-without having to read the complete blob.
+encrypted Blobs. In addition it also allows efficient indexing, for only the
+header needs to be read in order to find out which Blobs are contained in the
+Pack. Since the header is signed, authenticity of the header can be checked
+without having to read the complete Pack.

-After decryption, a blob's header consists of the following elements:
+After decryption, a Pack's header consists of the following elements:

-    Length(IV_Chunk1+Ciphertext_Chunk1+MAC_Chunk1) || Hash(Plaintext_Chunk1) ||
+    Length(IV_Blob_1+Ciphertext_Blob1+MAC_Blob_1) || Hash(Plaintext_Blob_1) ||
     [...]
-    Length(IV_Chunk_n+Ciphertext_Chunk_n+MAC_Chunk_n) || Hash(Plaintext_Chunk_n) ||
+    Length(IV_Blob_n+Ciphertext_Blob_n+MAC_Blob_n) || Hash(Plaintext_Blob_n) ||

-This is enough to calculate the offsets for all the chunks in the blob. Length
-is the length of the chunk as a four byte integer in little-endian format.
+This is enough to calculate the offsets for all the Blobs in the Pack. Length
+is the length of a Blob as a four byte integer in little-endian format.

 Indexing
 --------

-Index blobs pack together information about data and tree blobs and stores this
-information in the repository. When the local cached index is not accessible
-any more, the index files can be downloaded and used to reconstruct the index.
-The index blobs are encrypted and signed like data and tree blobs, so the outer
-structure is `IV || Ciphertext || MAC` again. The plaintext consists of a JSON
-document like the following:
+Index files contain information about Data and Tree Blobs and the Packs they
+are contained in and store this information in the repository. When the local
+cached index is not accessible any more, the index files can be downloaded and
+used to reconstruct the index. The index Blobs are encrypted and signed like
+Data and Tree Blobs, so the outer structure is `IV || Ciphertext || MAC` again.
+The plaintext consists of a JSON document like the following:

-    {
-      "73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c":
-        [
-          "3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce",
-          "9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae",
-          "d3dc577b4ffd38cc4b32122cabf8655a0223ed22edfd93b353dc0c3f2b0fdf66"
-        ]
-    }
+    [
+      {
+        "id": "73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c",
+        "blobs": [
+          "3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce",
+          "9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae",
+          "d3dc577b4ffd38cc4b32122cabf8655a0223ed22edfd93b353dc0c3f2b0fdf66"
+        ]
+      }
+    ]

-This JSON document lists all the blobs with the contents. In this example, the
-blob `73d04e61` contains three chunks, the plaintext hashes are listed afterwards.
+This JSON document lists all the Blobs with contents. In this example, the Pack
+`73d04e61` contains three Blobs, the plaintext hashes are listed afterwards.

 Keys, Encryption and MAC
 ------------------------
@@ -232,13 +254,13 @@ Here it can be seen that this snapshot represents the contents of the directory
 `/tmp/testdata`. The most important field is `tree`.

 All content within a restic repository is referenced according to its SHA-256
-hash. Before saving, each file is split into variable sized chunks of data. The
-SHA-256 hashes of all chunks are saved in an ordered list which then represents
+hash. Before saving, each file is split into variable sized Blobs of data. The
+SHA-256 hashes of all Blobs are saved in an ordered list which then represents
 the content of the file.

 In order to relate these plain text hashes to the actual encrypted storage
 hashes (which vary due to random IVs), an index is used. If the index is not
-available, the header of all data blobs can be read.
+available, the header of all data Blobs can be read.

 Trees and Data
 --------------
@@ -322,15 +344,15 @@ Backups and Deduplication

 For creating a backup, restic scans the target directory for all files,
 sub-directories and other entries. The data from each file is split into
-variable length chunks cut at offsets defined by a sliding window of 64 byte.
+variable length Blobs cut at offsets defined by a sliding window of 64 byte.
 The implementation uses Rabin Fingerprints for implementing this Content
 Defined Chunking (CDC). An irreducible polynomial is selected at random when a
 repository is initialized.

-Files smaller than 512 KiB are not split, chunks are of 512 KiB to 8 MiB in
-size. The implementation aims for 1 MiB chunk size on average.
+Files smaller than 512 KiB are not split, Blobs are of 512 KiB to 8 MiB in
+size. The implementation aims for 1 MiB Blob size on average.

-For modified files, only modified chunks have to be saved in a subsequent
+For modified files, only modified Blobs have to be saved in a subsequent
 backup. This even works if bytes are inserted or removed at arbitrary positions
 within the file.
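
Review note on the chunking paragraph: the idea of cutting Blobs at content-defined offsets can be sketched in a few lines. The code below is a simplified illustration, not restic's chunker. It swaps the Rabin fingerprint over a random irreducible polynomial for a plain polynomial rolling hash with fixed constants. The 64 byte window and the 512 KiB and 8 MiB limits come from the text. The function name `cdc_split`, the constants `base` and `mod`, and the `mask_bits` parameter (including its default) are assumptions.

```python
from typing import List


def cdc_split(data: bytes, window: int = 64,
              min_size: int = 512 * 1024,
              max_size: int = 8 * 1024 * 1024,
              mask_bits: int = 19) -> List[bytes]:
    """Split data at offsets chosen by a rolling hash over the last `window` bytes."""
    base, mod = 257, (1 << 61) - 1        # assumed constants, not restic's polynomial
    out_factor = pow(base, window, mod)   # weight of the byte leaving the window
    mask = (1 << mask_bits) - 1
    chunks = []
    start = 0   # first byte of the chunk being built
    h = 0       # rolling hash, restarted at every cut
    for i, byte in enumerate(data):
        h = (h * base + byte) % mod
        length = i - start + 1
        if length > window:
            # Drop the byte that has slid out of the window.
            h = (h - data[i - window] * out_factor) % mod
        # Cut when the low bits of the hash are zero, but never before
        # min_size and always at max_size.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing data, possibly shorter than min_size
    return chunks
```

Once a chunk is longer than the window, the cut decision depends only on the most recent `window` bytes, so an insertion early in a file tends to shift later cut points along with the content instead of changing every subsequent chunk. With the default `mask_bits` the average chunk size should land roughly around `min_size + 2**mask_bits`, about 1 MiB, though that figure is an estimate for well-mixed input rather than a measured result.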