syncthing/vendor/github.com/cznic/ql/design/doc.go
2016-09-13 22:20:22 +02:00

// Copyright 2014 The ql Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
/*
Package design describes some of the data structures used in QL.
Handles
A handle is a 7 byte "pointer" to a block in the DB[0].
Scalar encoding
Encoding of so called "scalars" is provided by [1]. Unless specified otherwise,
all values discussed below are scalars, encoded scalars or encodings of scalar
arrays.
Database root
DB root is a 1-scalar found at a fixed handle (#1).
+---+------+--------+-----------------------+
| # | Name | Type   | Description           |
+---+------+--------+-----------------------+
| 0 | head | handle | First table meta data |
+---+------+--------+-----------------------+
Head is the head of a singly linked list of table meta data. It's zero if
there are no tables in the DB.
Table meta data
Table meta data are a 6-scalar.
+---+---------+--------+--------------------------+
| # | Name    | Type   | Description              |
+---+---------+--------+--------------------------+
| 0 | next    | handle | Next table meta data.    |
| 1 | scols   | string | Column definitions       |
| 2 | hhead   | handle | -> head -> first record  |
| 3 | name    | string | Table name               |
| 4 | indices | string | Index definitions        |
| 5 | hxroots | handle | Index B+Trees roots list |
+---+---------+--------+--------------------------+
Fields #4 and #5 are optional for backward compatibility with existing
databases. OTOH, forward compatibility will not work. Once any indices are
created using a newer QL version, the older versions of QL, expecting only 4
fields of meta data, will not be able to use the DB. That's the intended
behavior because the older versions of QL cannot update the indexes, which can
break queries run by the newer QL version, which expects indices to be always
up to date after any mutation of a table with indices.
The handle of the next table meta data is in the field #0 (next). If there is
no next table meta data, the field is zero. Names and types of table columns
are stored in field #1 (scols). A single field is described by concatenating a
type tag and the column name. The type tags are
bool       'b'
complex64  'c'
complex128 'd'
float32    'f'
float64    'g', alias float
int8       'i'
int16      'j'
int32      'k'
int64      'l', alias int
string     's'
uint8      'u', alias byte
uint16     'v'
uint32     'w'
uint64     'x', alias uint
bigInt     'I'
bigRat     'R'
blob       'B'
duration   'D'
time       'T'
The scols value is the above described encoded fields joined using "|". For
example
CREATE TABLE t (Foo bool, Bar string, Baz float);
This statement adds a table meta data with scols
"bFoo|sBar|gBaz"
Columns can be dropped from a table
ALTER TABLE t DROP COLUMN Bar;
This "erases" the field info in scols, so the value becomes
"bFoo||gBaz"
Columns can be added to a table
ALTER TABLE t ADD Count uint;
New fields are always added to the end of scols
"bFoo||gBaz|xCount"
Index of a field in strings.Split(scols, "|") is the index of the field in a
table record. The above discussed rules for column dropping and column adding
allow for schema evolution without a need to reshape any existing table data.
Dropped columns are left where they are and new records insert nil in their
place. The encoded nil is one byte. Added columns, when not present in
preexisting records, are returned as nil values. If the overhead of dropped
columns becomes an issue and there is enough time, space and memory to move the
records of a table around:
BEGIN TRANSACTION;
CREATE TABLE new (column definitions);
INSERT INTO new SELECT * FROM old;
DROP TABLE old;
CREATE TABLE old (column definitions);
INSERT INTO old SELECT * FROM new;
DROP TABLE new;
COMMIT;
This is not very time/space efficient and for Big Data it can cause an OOM
because transactions are limited by the memory resources available to the
process. Perhaps a method and/or QL statement to do this in-place should be
added (MAYBE consider adopting MySQL's OPTIMIZE TABLE syntax).
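The scols encoding and the rule that a field's position in
strings.Split(scols, "|") is its position in a table record can be sketched as
follows. This is only an illustration, not QL's actual code: the column type,
the typeTags map and parseScols are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// typeTags maps a type tag to its QL type name, as listed above.
var typeTags = map[byte]string{
	'b': "bool", 'c': "complex64", 'd': "complex128", 'f': "float32",
	'g': "float64", 'i': "int8", 'j': "int16", 'k': "int32", 'l': "int64",
	's': "string", 'u': "uint8", 'v': "uint16", 'w': "uint32", 'x': "uint64",
	'I': "bigInt", 'R': "bigRat", 'B': "blob", 'D': "duration", 'T': "time",
}

// column describes one slot of a table record as recorded in scols.
// The type and its field names are hypothetical, chosen for this sketch.
type column struct {
	Index   int    // position of the field in a table record
	Type    string // QL type name; empty for a dropped column
	Name    string // column name; empty for a dropped column
	Dropped bool
}

// parseScols splits scols on "|" and decodes every part as a one byte type
// tag followed by the column name. An empty part is a dropped column which
// still occupies its slot in the record.
func parseScols(scols string) ([]column, error) {
	parts := strings.Split(scols, "|")
	cols := make([]column, 0, len(parts))
	for i, p := range parts {
		if p == "" {
			cols = append(cols, column{Index: i, Dropped: true})
			continue
		}
		typ, ok := typeTags[p[0]]
		if !ok {
			return nil, fmt.Errorf("unknown type tag %q", p[0])
		}
		cols = append(cols, column{Index: i, Type: typ, Name: p[1:]})
	}
	return cols, nil
}

func main() {
	cols, err := parseScols("bFoo||gBaz|xCount")
	if err != nil {
		panic(err)
	}
	for _, c := range cols {
		fmt.Printf("%d %q %q dropped=%v\n", c.Index, c.Type, c.Name, c.Dropped)
	}
}
```

Note that Count keeps index 3 although only three columns are live: the slot
of the dropped column is never reused.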
Field #2 (hhead) is a handle to the head of the table records, i.e. not a
handle to the first record in the table. It is thus always non zero, even for a
table having no records. The reason for this "double pointer" scheme is to
enable adding (linking) a new record by updating a single value: the head that
hhead points to.
tableMeta.hhead -> head -> firstTableRecord
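A minimal in-memory sketch of the "double pointer" scheme follows. It is not
how QL talks to lldb: the store type, its maps and all method names are
assumptions made for this illustration, and an integer stands in for the 7
byte handle.

```go
package main

import "fmt"

// handle stands in for a block handle; zero means "none".
type handle int64

// record is a simplified table record: only the next link and the id.
type record struct {
	next handle
	id   int64
}

// store is a hypothetical in-memory stand-in for the DB file.
type store struct {
	heads   map[handle]handle // head blocks, each holding the first record's handle
	records map[handle]record
	last    handle // last allocated handle
}

func newStore() *store {
	return &store{heads: map[handle]handle{}, records: map[handle]record{}}
}

func (s *store) alloc() handle {
	s.last++
	return s.last
}

// newTable allocates the head block and returns hhead. The head block exists,
// holding zero, even though the table has no records yet.
func (s *store) newTable() handle {
	hhead := s.alloc()
	s.heads[hhead] = 0
	return hhead
}

// insert links a new record in front of the list. Only the head block is
// rewritten; the table meta data, and hhead stored in it, stay untouched.
func (s *store) insert(hhead handle, id int64) handle {
	h := s.alloc()
	s.records[h] = record{next: s.heads[hhead], id: id}
	s.heads[hhead] = h
	return h
}

// ids walks the list from the head and returns the ids in link order.
func (s *store) ids(hhead handle) []int64 {
	var out []int64
	for h := s.heads[hhead]; h != 0; h = s.records[h].next {
		out = append(out, s.records[h].id)
	}
	return out
}

func main() {
	s := newStore()
	hhead := s.newTable()
	for id := int64(1); id <= 3; id++ {
		s.insert(hhead, id)
	}
	fmt.Println(s.ids(hhead)) // link order is the reverse of insertion order
}
```

Linking at the front keeps an insert at one new block plus one updated value,
which is why the link order comes out reversed.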
The table name is stored in field #3 (name).
Indices
Consider an index named N, indexing a column named C. The encoding of this
particular index is a string "<tag>N". <tag> is a string "n" for non unique
indices and "u" for unique indices. The indices field holds one such index info
for the record id(), which may also be indexed, followed by one for each column
of scols. Where the column is not indexed, the index info is an empty string.
Infos for all indexes are joined with "|". For example
BEGIN TRANSACTION;
CREATE TABLE t (Foo int, Bar bool, Baz string);
CREATE INDEX X ON t (Baz);
CREATE UNIQUE INDEX Y ON t (Foo);
COMMIT;
The values of fields #1 and #4 for the above are
scols: "lFoo|bBar|sBaz"
indices: "|uY||nX"
Properly aligning the "|" split parts
             id   col #0   col #1   col #2
+----------+----+--------+--------+--------+
| scols:   |    | "lFoo" | "bBar" | "sBaz" |
+----------+----+--------+--------+--------+
| indices: | "" | "uY"   | ""     | "nX"   |
+----------+----+--------+--------+--------+
shows that the record id() is not indexed for this table while the columns Foo
and Baz are.
Note that there cannot be two differently named indexes for the same column;
this is intended. The indices are B+Trees[2]. The list of handles to their
roots is pointed to by hxroots, with zeros for non indexed columns. For the
previous example
tableMeta.hxroots -> {0, y, 0, x}
where x is the root of the B+Tree for the X index and y is the root of the
B+Tree for the Y index. If there were an index for id(), its B+Tree root
would be present where the first zero is. Similarly to hhead, hxroots is never
zero, even when there are no indices for a table.
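The alignment of the indices entries with id() and the scols entries can be
sketched as below. The indexInfo type and parseIndices are hypothetical names,
not part of QL. The sketch also assumes that indices always has exactly one
entry more than scols and that a dropped column carries no index.

```go
package main

import (
	"fmt"
	"strings"
)

// indexInfo is a hypothetical decoded form of one entry of the indices field.
type indexInfo struct {
	Slot   int    // position in indices and in the hxroots list
	Column string // "id()" for the record id, else the column name
	Name   string // index name
	Unique bool
}

// parseIndices pairs entry 0 of indices with the record id() and entry i+1
// with the column at position i of scols.
func parseIndices(scols, indices string) ([]indexInfo, error) {
	cols := strings.Split(scols, "|")
	xs := strings.Split(indices, "|")
	if len(xs) != len(cols)+1 { // assumption: one entry for id() plus one per column
		return nil, fmt.Errorf("got %d index entries, want %d", len(xs), len(cols)+1)
	}
	var out []indexInfo
	for i, x := range xs {
		if x == "" { // not indexed
			continue
		}
		if x[0] != 'n' && x[0] != 'u' {
			return nil, fmt.Errorf("invalid index tag %q", x[0])
		}
		col := "id()"
		if i > 0 {
			if cols[i-1] == "" { // assumption: dropped columns are never indexed
				return nil, fmt.Errorf("index %q is on a dropped column", x[1:])
			}
			col = cols[i-1][1:] // strip the type tag
		}
		out = append(out, indexInfo{Slot: i, Column: col, Name: x[1:], Unique: x[0] == 'u'})
	}
	return out, nil
}

func main() {
	infos, err := parseIndices("lFoo|bBar|sBaz", "|uY||nX")
	if err != nil {
		panic(err)
	}
	for _, x := range infos {
		fmt.Printf("slot %d: index %s on %s, unique=%v\n", x.Slot, x.Name, x.Column, x.Unique)
	}
}
```

Slot is also the position of the index root in the list hxroots points to, so
Y is found at position 1 and X at position 3 of {0, y, 0, x}.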
Table record
A table record is an N-scalar.
+-----+------------+--------+-------------------------------+
|  #  | Name       | Type   | Description                   |
+-----+------------+--------+-------------------------------+
|  0  | next       | handle | Next record or zero.          |
|  1  | id         | int64  | Automatically assigned unique |
|     |            |        | value obtainable by id().     |
|  2  | field #0   | scalar | First field of the record.    |
|  3  | field #1   | scalar | Second field of the record.   |
...
| N-1 | field #N-2 | scalar | Last field of the record.     |
+-----+------------+--------+-------------------------------+
The linked "ordering" of table records has no semantics and it doesn't have to
correlate with the order in which the records were added to the table. In fact,
an efficient way of linking leads to an "ordering" which is actually reversed
wrt the insertion order.
Non unique index
The composite key of the B+Tree is {indexed values, record handle}. The B+Tree
value is not used.
            B+Tree key                  B+Tree value
+----------------+---------------+    +--------------+
| Indexed Values | Record Handle | -> | not used     |
+----------------+---------------+    +--------------+
Unique index
If the indexed values are all NULL then the composite B+Tree key is {nil,
record handle} and the B+Tree value is not used.
        B+Tree key              B+Tree value
+------+-----------------+    +--------------+
| NULL | Record Handle   | -> | not used     |
+------+-----------------+    +--------------+
If the indexed values are not all NULL then the B+Tree key consists of the
indexed values and the B+Tree value is the record handle.
    B+Tree key          B+Tree value
+----------------+    +---------------+
| Indexed Values | -> | Record Handle |
+----------------+    +---------------+
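The three key/value layouts above can be summarized in one function. This is a
sketch only: indexEntry is a hypothetical helper, nil stands for NULL, and the
slices stand for what would be passed to the scalar encoding of [1].

```go
package main

import "fmt"

// handle stands in for a record handle.
type handle int64

// indexEntry returns the B+Tree key and value for one indexed record.
// A nil value means the B+Tree value is not used.
func indexEntry(unique bool, values []interface{}, h handle) (key, value []interface{}) {
	allNull := true
	for _, v := range values {
		if v != nil {
			allNull = false
			break
		}
	}
	switch {
	case !unique:
		// Non unique index: {indexed values, record handle} -> not used.
		key = append(append([]interface{}{}, values...), h)
		return key, nil
	case allNull:
		// Unique index, all values NULL: {nil, record handle} -> not used.
		return []interface{}{nil, h}, nil
	default:
		// Unique index otherwise: {indexed values} -> record handle.
		return append([]interface{}{}, values...), []interface{}{h}
	}
}

func main() {
	h := handle(42)
	fmt.Println(indexEntry(false, []interface{}{"a", int64(1)}, h))
	fmt.Println(indexEntry(true, []interface{}{nil, nil}, h))
	fmt.Println(indexEntry(true, []interface{}{"a", nil}, h))
}
```

Putting the record handle into the key presumably serves to keep keys distinct
where the indexed values alone may repeat: always for a non unique index, and
for all-NULL rows of a unique one.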
Non scalar types
Scalar types of [1] are bool, complex*, float*, int*, uint*, string and []byte
types. All other types are "blob-like".
QL type    Go type
-----------------------------
blob       []byte
bigint     big.Int
bigrat     big.Rat
time       time.Time
duration   time.Duration
The memory back-end stores the Go type directly. The file back-end must resort
to encoding all of the above as (tagged) []byte due to the lack of more types
supported natively by lldb. NULL values of blob-like types are encoded as nil
(gbNull in lldb/gb.go), exactly the same as for the already existing QL types.
Blob encoding
The values of the blob-like types are first encoded into a []byte slice:
+-----------------------+-------------------+
| blob                  | raw               |
| bigint, bigrat, time  | gob encoded       |
| duration              | gob encoded int64 |
+-----------------------+-------------------+
The gob encoding is "differential" wrt an initial encoding of all of the
blob-like types. IOW, the initial type descriptors which gob encoding must write
out are stripped off and "resupplied" on decoding transparently. See also
blob.go. If the length of the resulting slice is <= shortBlob, the first and
only chunk is the scalar encoding of
[]interface{}{typeTag, slice}. // initial (and last) chunk
The length of slice can be zero (for blob("")). If the resulting slice is long
(> shortBlob), the first chunk comes from encoding
[]interface{}{typeTag, nextHandle, firstPart}. // initial, but not final chunk
In this case len(firstPart) <= shortBlob. Second and other chunks: If the chunk
is the last one, src is
[]interface{}{lastPart}. // overflow chunk (last)
In this case len(lastPart) <= 64kB. If the chunk is not the last one, src is
[]interface{}{nextHandle, part}. // overflow chunk (not last)
In this case len(part) == 64kB.
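How a slice is cut into chunk payloads under these rules can be sketched as
follows. The sketch covers only the splitting, not the type tag, the handles
or the scalar encoding. splitBlob and maxChunk are hypothetical names, the
value of shortBlob is an assumption (see blob.go for the real one) and the
first part is assumed to be filled up to exactly shortBlob bytes.

```go
package main

import "fmt"

const (
	shortBlob = 256      // assumed value, used for illustration only
	maxChunk  = 64 << 10 // 64kB payload of an overflow chunk
)

// splitBlob returns the payload parts in chunk order: the part stored in the
// initial chunk followed by the parts of the overflow chunks, if any.
func splitBlob(b []byte) [][]byte {
	if len(b) <= shortBlob {
		return [][]byte{b} // initial (and last) chunk
	}
	parts := [][]byte{b[:shortBlob]} // initial, but not final chunk
	rest := b[shortBlob:]
	for len(rest) > maxChunk {
		parts = append(parts, rest[:maxChunk]) // overflow chunk (not last)
		rest = rest[maxChunk:]
	}
	return append(parts, rest) // overflow chunk (last), 1 to 64kB long
}

func main() {
	for _, n := range []int{0, shortBlob, shortBlob + 1, shortBlob + 2*maxChunk + 5} {
		var sizes []int
		for _, p := range splitBlob(make([]byte, n)) {
			sizes = append(sizes, len(p))
		}
		fmt.Println(n, sizes)
	}
}
```

Because the loop stops as soon as at most 64kB remain, the last overflow part
is never empty and every earlier overflow part is exactly 64kB long.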
Links
Referenced from above:
[0]: http://godoc.org/github.com/cznic/lldb#hdr-Block_handles
[1]: http://godoc.org/github.com/cznic/lldb#EncodeScalars
[2]: http://godoc.org/github.com/cznic/lldb#BTree
Rationale
While these notes might be useful to anyone looking at QL sources, the
specifically intended reader is my future self.
*/
package design