If the GC finds a key k that it wants to keep, it records that in a
Bloom filter. If a key k' can be removed but its hash collides with k,
it will be kept. Since the old Bloom filter code was completely
deterministic, the next run would encounter the same collision, assuming
k must still be kept.
A randomized hash function that uses all the SHA-256 bits solves this
problem: the second run has a non-zero probability of removing k', as
long as the Bloom filter is not completely full.
Group the global list of files by version, instead of having one flat list for all devices. This removes lots of duplicate protocol.Vectors.
Co-authored-by: Jakob Borg <jakob@kastelo.net>
This adds indirection of large version vectors in the same manner as we
already to block lists. The effect is the same: less duplicated data in
some situations.
To mitigate the impact for when this indirection
wouldn't be needed I've added an indirection cutoff for both blocks and
the new version vector stuff: we don't do the indirection at all for
small block lists or small version vectors, instead storing it directly
like we used to do. This is faster for small files and small setups.
This makes the GC runner a service that will stop fairly quickly when
told to.
As a bonus, STTRACE=app will print the service tree on the way out,
including any errors they've flagged.
As of the latest database checker we are again putting files without
blocks. I'm not 100% convinced that's a great idea, but we also do it
for ignored files apparently so it looks like we probably should support
it. This adds an escape hatch that must be manually enabled...
I was working on indirecting version vectors, and that resulted in some
refactoring and improving the existing block indirection stuff. We may
or may not end up doing the version vector indirection, but I think
these changes are reasonable anyhow and will simplify the diff
significantly if we do go there. The main points are:
- A bunch of renaming to make the indirection and GC not about "blocks"
but about "indirection".
- Adding a cutoff so that we don't actually indirect for small block
lists. This gets us better performance when handling small files as it
cuts out the indirection for quite small loss in space efficiency.
- Being paranoid and always recalculating the hash on put. This costs
some CPU, but the consequences if a buggy or malicious implementation
silently substituted the block list by lying about the hash would be bad.
This adds metadata updates to the same write batch as the underlying
file change. The odds of a metadata update going missing is greatly
reduced.
Bonus change: actually commit the transaction in recalcMeta.
The readWriteTransaction offered both commit() (the one to use) and
Commit() (via embedding) where the latter didn't close the read
transaction. This removes the lower cased variant in order to prevent
the mistake.
The only place where the mistake was made was the new gc runner, where
it would leave a read snapshot open forever.
Also retain the interval over restarts by storing last GC time in the
database. This to make sure that GC eventually happens even if the
interval is configured to a long time (say, a month).