mirror of
https://github.com/octoleo/restic.git
synced 2024-12-22 10:58:55 +00:00
Clarify terminology
parent
b99474154c
commit
58f289003c
120
doc/Design.md
@@ -1,15 +1,35 @@
|
|||||||
This document gives a high-level overview of the design and repository layout
|
This document gives a high-level overview of the design and repository layout
|
||||||
of the restic backup program.
|
of the restic backup program.
|
||||||
|
|
||||||
|
Terminology
|
||||||
|
===========
|
||||||
|
|
||||||
|
This section introduces terminology used in this document.
|
||||||
|
|
||||||
|
*Repository*: All data produced during a backup is sent to and stored at a
|
||||||
|
repository in structured form, for example in a file system hierarchy of with
|
||||||
|
several subdirectories. A repository implementation must be able to fulfil a
|
||||||
|
number of operations, e.g. list the contents.
|
||||||
|
|
||||||
|
*Blob*: A Blob combines a number of data bytes with identifying information
|
||||||
|
like the SHA256 hash of the data and its length.
|
||||||
|
|
||||||
|
*Pack*: A Pack combines one or more Blobs together, e.g. in a single file.
|
||||||
|
|
||||||
|
*Snapshot*: A Snapshot stands for the state of a file or directory that has
|
||||||
|
been backed up at some point in time. The state here means the content and meta
|
||||||
|
data like the name and modification time for the file or the directory and its
|
||||||
|
contents.
|
||||||
|
|
||||||
Repository Format
|
Repository Format
|
||||||
=================
|
=================
|
||||||
|
|
||||||
All data is stored in a restic repository. A repository is able to store data
|
All data is stored in a restic repository. A repository is able to store data
|
||||||
in blobs of several different types, which can later be requested based on an
|
of several different types, which can later be requested based on an ID. The ID
|
||||||
ID. The ID is the hash (SHA-256) of the content of a blob. All blobs in a
|
is the hash (SHA-256) of the content of a file. All files in a repository are
|
||||||
repository are only written once and never modified afterwards. This allows
|
only written once and never modified afterwards. This allows accessing and even
|
||||||
accessing and even writing to the repository with multiple clients in parallel.
|
writing to the repository with multiple clients in parallel. Only the delete
|
||||||
Only the delete operation changes data in the repository.
|
operation changes data in the repository.
|
||||||
|
|
||||||
At the time of writing, the only implemented repository type is based on
|
At the time of writing, the only implemented repository type is based on
|
||||||
directories and files. Such repositories can be accessed locally on the same
|
directories and files. Such repositories can be accessed locally on the same
|
||||||
@@ -23,7 +43,7 @@ Additionally there is a file named `id` which contains 32 random bytes, encoded
|
|||||||
in hexadecimal. This uniquely identifies the repository, regardless if it is
|
in hexadecimal. This uniquely identifies the repository, regardless if it is
|
||||||
accessed via SFTP or locally.
|
accessed via SFTP or locally.
|
||||||
|
|
||||||
For all other blobs stored in the repository, the name for the file is the
|
For all other files stored in the repository, the name for the file is the
|
||||||
lower case hexadecimal representation of the SHA-256 hash of the file's
|
lower case hexadecimal representation of the SHA-256 hash of the file's
|
||||||
contents. This allows easily checking all files for accidental modifications
|
contents. This allows easily checking all files for accidental modifications
|
||||||
like disk read errors by simply running the program `sha256sum` and comparing
|
like disk read errors by simply running the program `sha256sum` and comparing
|
||||||
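The naming rule in this hunk (file name equals the lower case hexadecimal SHA-256 of the file's contents) can be checked without `sha256sum` as well. The following is a minimal sketch, not restic code; the function name `file_matches_name` is made up for illustration, and it assumes the file is small enough to read into memory in one go.

```python
import hashlib
from pathlib import Path


def file_matches_name(path: Path) -> bool:
    """Return True if the file name is the SHA-256 of the file's contents.

    Hypothetical helper for illustration; hexdigest() already returns
    lower case hexadecimal, matching the naming rule described above.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return path.name == digest
```

A file for which this returns `False` has either been modified after it was written or was never named according to the rule (for example the `id` file mentioned above).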
@@ -76,64 +96,66 @@ A repository can be initialized with the `restic init` command, e.g.:
|
|||||||
|
|
||||||
$ restic -r /tmp/restic-repo init
|
$ restic -r /tmp/restic-repo init
|
||||||
|
|
||||||
Blob Format
|
Pack Format
|
||||||
-----------
|
-----------
|
||||||
|
|
||||||
All blobs except key, tree and data blobs just contain raw data, stored as `IV
|
All files in the repository except Key, Tree and Data files just contain raw
|
||||||
|| Ciphertext || MAC`. Tree and Data blobs may contain several chunks of data.
|
data, stored as `IV || Ciphertext || MAC`. Tree and Data files may contain
|
||||||
The format is described in the following.
|
several Blobs of data. The format is described in the following.
|
||||||
|
|
||||||
The blob starts with a nonce and a header, the header describes the content and
|
A Pack starts with a nonce and a header, the header describes the content and
|
||||||
is encrypted and signed. The blob's structure is as follows:
|
is encrypted and signed. The Pack's structure is as follows:
|
||||||
|
|
||||||
NONCE || Header_Length ||
|
NONCE || Header_Length ||
|
||||||
IV_Header || Ciphertext_Header || MAC_Header ||
|
IV_Header || Ciphertext_Header || MAC_Header ||
|
||||||
IV_Chunk_1 || Ciphertext_Chunk_1 || MAC_Chunk_1 ||
|
IV_Blob_1 || Ciphertext_Blob_1 || MAC_Blob_1 ||
|
||||||
[...]
|
[...]
|
||||||
IV_Chunk_n || Ciphertext_Chunk_n || MAC_Chunk_n ||
|
IV_Blob_n || Ciphertext_Blob_n || MAC_Blob_n ||
|
||||||
MAC
|
MAC
|
||||||
|
|
||||||
`NONCE` consists of 16 bytes and `Header_Length` is a four byte integer in
|
`NONCE` consists of 16 bytes and `Header_Length` is a four byte integer in
|
||||||
little-endian encoding.
|
little-endian encoding.
|
||||||
|
|
||||||
All the parts (`Ciphertext_Header`, `Ciphertext_Chunk1` etc.) are signed and
|
All the parts (`Ciphertext_Header`, `Ciphertext_Blob1` etc.) are signed and
|
||||||
encrypted independently. In addition, the complete blob is signed again using
|
encrypted independently. In addition, the complete pack is signed using
|
||||||
`NONCE`. This enables repository reorganisation without having to touch the
|
`NONCE`. This enables repository reorganisation without having to touch the
|
||||||
encrypted chunks. In addition it also allows efficient indexing, for only the
|
encrypted Blobs. In addition it also allows efficient indexing, for only the
|
||||||
header needs to be read in order to find out which chunks are contained in the
|
header needs to be read in order to find out which Blobs are contained in the
|
||||||
blob. Since the header is signed, authenticity of the header can be checked
|
Pack. Since the header is signed, authenticity of the header can be checked
|
||||||
without having to read the complete blob.
|
without having to read the complete Pack.
|
||||||
|
|
||||||
After decryption, a blob's header consists of the following elements:
|
After decryption, a Pack's header consists of the following elements:
|
||||||
|
|
||||||
Length(IV_Chunk1+Ciphertext_Chunk1+MAC_Chunk1) || Hash(Plaintext_Chunk1) ||
|
Length(IV_Blob_1+Ciphertext_Blob1+MAC_Blob_1) || Hash(Plaintext_Blob_1) ||
|
||||||
[...]
|
[...]
|
||||||
Length(IV_Chunk_n+Ciphertext_Chunk_n+MAC_Chunk_n) || Hash(Plaintext_Chunk_n) ||
|
Length(IV_Blob_n+Ciphertext_Blob_n+MAC_Blob_n) || Hash(Plaintext_Blob_n) ||
|
||||||
|
|
||||||
This is enough to calculate the offsets for all the chunks in the blob. Length
|
This is enough to calculate the offsets for all the Blobs in the Pack. Length
|
||||||
is the length of the chunk as a four byte integer in little-endian format.
|
is the length of a Blob as a four byte integer in little-endian format.
|
||||||
|
|
||||||
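To illustrate how the decrypted header is "enough to calculate the offsets", here is a small sketch of walking the `Length || Hash` entries. It is not taken from restic: the function names are invented, `Hash` is assumed to be a 32 byte SHA-256 digest, and the offset of the first Blob is passed in by the caller because the text does not spell out exactly which bytes `Header_Length` counts.

```python
import struct

NONCE_SIZE = 16    # from the text: NONCE consists of 16 bytes
LENGTH_SIZE = 4    # from the text: four byte little-endian integer
HASH_SIZE = 32     # assumption: Hash() is a raw SHA-256 digest
ENTRY_SIZE = LENGTH_SIZE + HASH_SIZE


def read_header_length(pack_prefix: bytes) -> int:
    """Read Header_Length, the little-endian integer following NONCE."""
    (length,) = struct.unpack_from("<I", pack_prefix, NONCE_SIZE)
    return length


def blob_offsets(plain_header: bytes, first_blob_offset: int):
    """Return (offset, length, hash_hex) for every entry of a decrypted header.

    first_blob_offset is supplied by the caller (assumption, see lead-in).
    Each length covers IV, ciphertext and MAC of one Blob, so the next
    Blob starts exactly `length` bytes after the current one.
    """
    if len(plain_header) % ENTRY_SIZE != 0:
        raise ValueError("header is not a whole number of entries")
    result = []
    offset = first_blob_offset
    for pos in range(0, len(plain_header), ENTRY_SIZE):
        (length,) = struct.unpack_from("<I", plain_header, pos)
        blob_hash = plain_header[pos + LENGTH_SIZE:pos + ENTRY_SIZE]
        result.append((offset, length, blob_hash.hex()))
        offset += length
    return result
```

Because only lengths are stored, the offsets are a running sum; no per-Blob offset field is needed in the header.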
Indexing
|
Indexing
|
||||||
--------
|
--------
|
||||||
|
|
||||||
Index blobs pack together information about data and tree blobs and stores this
|
Index files contain information about Data and Tree Blobs and the Packs they
|
||||||
information in the repository. When the local cached index is not accessible
|
are contained in and store this information in the repository. When the local
|
||||||
any more, the index files can be downloaded and used to reconstruct the index.
|
cached index is not accessible any more, the index files can be downloaded and
|
||||||
The index blobs are encrypted and signed like data and tree blobs, so the outer
|
used to reconstruct the index. The index Blobs are encrypted and signed like
|
||||||
structure is `IV || Ciphertext || MAC` again. The plaintext consists of a JSON
|
Data and Tree Blobs, so the outer structure is `IV || Ciphertext || MAC` again.
|
||||||
document like the following:
|
The plaintext consists of a JSON document like the following:
|
||||||
|
|
||||||
{
|
[
|
||||||
"73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c":
|
{
|
||||||
[
|
"id": "73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c",
|
||||||
"3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce",
|
"blobs": [
|
||||||
"9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae",
|
"3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce",
|
||||||
"d3dc577b4ffd38cc4b32122cabf8655a0223ed22edfd93b353dc0c3f2b0fdf66"
|
"9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae",
|
||||||
]
|
"d3dc577b4ffd38cc4b32122cabf8655a0223ed22edfd93b353dc0c3f2b0fdf66"
|
||||||
}
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
This JSON document lists all the blobs with the contents. In this example, the
|
This JSON document lists all the Blobs with contents. In this example, the Pack
|
||||||
blob `73d04e61` contains three chunks, the plaintext hashes are listed afterwards.
|
`73d04e61` contains three Blobs, the plaintext hashes are listed afterwards.
|
||||||
|
|
||||||
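The new list-based index layout shown on the right-hand side can be turned into a lookup table from Blob hash to containing Pack. This is a sketch for illustration only; `blob_to_pack` is a made-up name, and it assumes the index plaintext has already been decrypted and is exactly a JSON list of objects with `id` and `blobs` fields as in the example.

```python
import json
from typing import Dict


def blob_to_pack(index_json: str) -> Dict[str, str]:
    """Map each plaintext Blob hash to the ID of the Pack that contains it."""
    mapping: Dict[str, str] = {}
    for pack in json.loads(index_json):
        for blob_id in pack["blobs"]:
            mapping[blob_id] = pack["id"]
    return mapping
```

With such a table, finding the Pack for a given Blob is a single dictionary lookup instead of a scan over all Pack headers.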
Keys, Encryption and MAC
|
Keys, Encryption and MAC
|
||||||
------------------------
|
------------------------
|
||||||
@@ -232,13 +254,13 @@ Here it can be seen that this snapshot represents the contents of the directory
|
|||||||
`/tmp/testdata`. The most important field is `tree`.
|
`/tmp/testdata`. The most important field is `tree`.
|
||||||
|
|
||||||
All content within a restic repository is referenced according to its SHA-256
|
All content within a restic repository is referenced according to its SHA-256
|
||||||
hash. Before saving, each file is split into variable sized chunks of data. The
|
hash. Before saving, each file is split into variable sized Blobs of data. The
|
||||||
SHA-256 hashes of all chunks are saved in an ordered list which then represents
|
SHA-256 hashes of all Blobs are saved in an ordered list which then represents
|
||||||
the content of the file.
|
the content of the file.
|
||||||
|
|
||||||
In order to relate these plain text hashes to the actual encrypted storage
|
In order to relate these plain text hashes to the actual encrypted storage
|
||||||
hashes (which vary due to random IVs), an index is used. If the index is not
|
hashes (which vary due to random IVs), an index is used. If the index is not
|
||||||
available, the header of all data blobs can be read.
|
available, the header of all data Blobs can be read.
|
||||||
|
|
||||||
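The idea that a file is represented by an ordered list of Blob hashes can be sketched as follows. This is an illustration under simplifying assumptions, not restic's implementation: an in-memory `dict` stands in for the repository, nothing is encrypted, and the function names `save_blobs` and `load_file` are invented.

```python
import hashlib


def save_blobs(blobs, store):
    """Store each Blob under its SHA-256 and return the ordered list of IDs.

    `store` is a plain dict standing in for the repository (assumption).
    """
    ids = []
    for blob in blobs:
        blob_id = hashlib.sha256(blob).hexdigest()
        store.setdefault(blob_id, blob)  # an identical Blob is stored only once
        ids.append(blob_id)
    return ids


def load_file(ids, store):
    """Rebuild the file content by concatenating the Blobs in list order."""
    return b"".join(store[blob_id] for blob_id in ids)
```

The list keeps duplicates and order, while the store keeps each distinct Blob once; that split is what makes deduplication possible.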
Trees and Data
|
Trees and Data
|
||||||
--------------
|
--------------
|
||||||
@@ -322,15 +344,15 @@ Backups and Deduplication
|
|||||||
|
|
||||||
For creating a backup, restic scans the target directory for all files,
|
For creating a backup, restic scans the target directory for all files,
|
||||||
sub-directories and other entries. The data from each file is split into
|
sub-directories and other entries. The data from each file is split into
|
||||||
variable length chunks cut at offsets defined by a sliding window of 64 byte.
|
variable length Blobs cut at offsets defined by a sliding window of 64 byte.
|
||||||
The implementation uses Rabin Fingerprints for implementing this Content
|
The implementation uses Rabin Fingerprints for implementing this Content
|
||||||
Defined Chunking (CDC). An irreducible polynomial is selected at random when a
|
Defined Chunking (CDC). An irreducible polynomial is selected at random when a
|
||||||
repository is initialized.
|
repository is initialized.
|
||||||
|
|
||||||
Files smaller than 512 KiB are not split, chunks are of 512 KiB to 8 MiB in
|
Files smaller than 512 KiB are not split, Blobs are of 512 KiB to 8 MiB in
|
||||||
size. The implementation aims for 1 MiB chunk size on average.
|
size. The implementation aims for 1 MiB Blob size on average.
|
||||||
|
|
||||||
For modified files, only modified chunks have to be saved in a subsequent
|
For modified files, only modified Blobs have to be saved in a subsequent
|
||||||
backup. This even works if bytes are inserted or removed at arbitrary positions
|
backup. This even works if bytes are inserted or removed at arbitrary positions
|
||||||
within the file.
|
within the file.
|
||||||
|
|
||||||
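The cutting rule described in this hunk can be sketched with a much simpler rolling hash. Note the swap: the text says the implementation uses Rabin Fingerprints with a random irreducible polynomial, whereas the sketch below uses a plain polynomial rolling hash modulo 2^32 as a stand-in. The constants `BASE` and `MOD`, the `mask` parameter and the function name `split` are assumptions made for illustration; only the 64 byte window comes from the text.

```python
WINDOW = 64                            # sliding window size, from the text
BASE = 257                             # assumption: stand-in hash multiplier
MOD = 1 << 32                          # assumption: stand-in hash modulus
BASE_POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def split(data: bytes, min_size: int, max_size: int, mask: int):
    """Cut data into Blobs at content-defined offsets.

    A cut is made after a byte if the current Blob has at least min_size
    bytes and the rolling hash of the last WINDOW bytes has no bit of
    `mask` set, or if the Blob has reached max_size.
    """
    blobs = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the byte that left the window.
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            blobs.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blobs.append(data[start:])  # remainder, may be shorter than min_size
    return blobs
```

Input shorter than `min_size` comes back as a single Blob, which mirrors the statement that small files are not split. With the sizes from the text one would pass `min_size=512 * 1024` and `max_size=8 * 1024 * 1024`; the `mask` steers the average Blob size, and the value needed to hit 1 MiB on average is not given in this document.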