At a high level, this is what I've done and why:
- I'm moving the protobuf generation for the `protocol`, `discovery` and
`db` packages to the modern alternatives, and using `buf` to generate
because it's nice and simple.
- After trying various approaches on how to integrate the new types with
the existing code, I opted for splitting off our own data model types
from the on-the-wire generated types. This means we can have a
`FileInfo` type with nicer ergonomics and lots of methods, while the
protobuf generated type stays clean and close to the wire protocol. It
does mean copying between the two when required, which certainly adds a
small amount of inefficiency. If we want to walk this back in the future
and use the raw generated type throughout, that's possible, this however
makes the refactor smaller (!) as it doesn't change everything about the
type for everyone at the same time.
- I have simply removed in cold blood a significant number of old
database migrations. These depended on previous generations of generated
messages of various kinds and were annoying to support in the new
fashion. The oldest supported database version now is the one from
Syncthing 1.9.0 from Sep 7, 2020.
- I changed config structs to be regular manually defined structs.
For the sake of discussion, some things I tried that turned out not to
work...
### Embedding / wrapping
Embedding the protobuf generated structs in our existing types as a data
container and keeping our methods and stuff:
```
package protocol
type FileInfo struct {
*generated.FileInfo
}
```
This generates a lot of problems because the internal shape of the
generated struct is quite different (different names, different types,
more pointers), because initializing it doesn't work like you'd expect
(i.e., you end up with an embedded nil pointer and a panic), and because
the types of child types don't get wrapped. That is, even if we also
have a similar wrapper around a `Vector`, that's not the type you get
when accessing `someFileInfo.Version`, you get the `*generated.Vector`
that doesn't have methods, etc.
### Aliasing
```
package protocol
type FileInfo = generated.FileInfo
```
Doesn't help because you can't attach methods to it, plus all the above.
### Generating the types into the target package like we do now and
attaching methods
This fails because of the different shape of the generated type (as in
the embedding case above) plus the generated struct already has a bunch
of methods that we can't necessarily override properly (like `String()`
and a bunch of getters).
### Methods to functions
I considered just moving all the methods we attach to functions in a
specific package, so that for example
```
package protocol
func (f FileInfo) Equal(other FileInfo) bool
```
would become
```
package fileinfos
func Equal(a, b *generated.FileInfo) bool
```
and this would mostly work, but becomes quite verbose and cumbersome,
and somewhat limits discoverability (you can't see what methods are
available on the type in auto completions, etc). In the end I did this
in some cases, like in the database layer where a lot of things like
`func (fv *FileVersion) IsEmpty() bool` becomes `func fvIsEmpty(fv
*generated.FileVersion)` because they were anyway just internal methods.
Fixes#8247
Go is not cgroup aware and by default will set GOMAXPROCS to the number
of available threads, regardless of whether it is within the allocated
quota. This behaviour causes high amount of CPU throttling and degraded
application performance.
The intention was that if no peers are given, we shouldn't start the
listener. We did that anyway, because:
- splitting an empty string on comma returns a slice with one empty
string in it
- parsing the empty string as a device ID returns the empty device ID
so we end up with a valid replication peer which is the empty device ID.
This changes the build script to build all the things in one go
invocation, instead of one invocation per cmd. This is a lot faster
because it means more things get compiled concurrently. It's especially
a lot faster when things *don't* need to be rebuilt, possibly because it
only needs to build the dependency map and such once instead of once per
binary.
In order for this to work we need to be able to pass the same ldflags to
all the binaries. This means we can't set the program name with an
ldflag.
When it needs to rebuild everything (go clean -cache):
( ./old-build -gocmd go1.14.2 build all 2> /dev/null; ) 65.82s user 11.28s system 574% cpu 13.409 total
( ./new-build -gocmd go1.14.2 build all 2> /dev/null; ) 63.26s user 7.12s system 1220% cpu 5.766 total
On a subsequent run (nothing to build, just link the binaries):
( ./old-build -gocmd go1.14.2 build all 2> /dev/null; ) 26.58s user 7.53s system 582% cpu 5.853 total
( ./new-build -gocmd go1.14.2 build all 2> /dev/null; ) 18.66s user 2.45s system 1090% cpu 1.935 total
This adds a certificate lifetime parameter to our certificate generation
and hard codes it to twenty years in some uninteresting places. In the
main binary there are a couple of constants but it results in twenty
years for the device certificate and 820 days for the HTTPS one. 820 is
less than the 825 maximum Apple allows nowadays.
This also means we must be prepared for certificates to expire, so I add
some handling for that and generate a new certificate when needed. For
self signed certificates we regenerate a month ahead of time. For other
certificates we leave well enough alone.
The relay and discosrv didn't use the new lib/build package, now they
do. Conversely the lib/build package wasn't aware there might be other
users and hard coded the program name - now it's set by the build
script
This changes the TLS and certificate handling in a few ways:
- We always use TLS 1.2, both for sync connections (as previously) and
the GUI/REST/discovery stuff. This is a tightening of the requirements
on the GUI. AS far as I can tell from caniusethis.com every browser from
2013 and forward supports TLS 1.2, so I think we should be fine.
- We always greate ECDSA certificates. Previously we'd create
ECDSA-with-RSA certificates for sync connections and pure RSA
certificates for the web stuff. The new default is more modern and the
same everywhere. These certificates are OK in TLS 1.2.
- We use the Go CPU detection stuff to choose the cipher suites to use,
indirectly. The TLS package uses CPU capabilities probing to select
either AES-GCM (fast if we have AES-NI) or ChaCha20 (faster if we
don't). These CPU detection things aren't exported though, so the tlsutil
package now does a quick TLS handshake with itself as part of init().
If the chosen cipher suite was AES-GCM we prioritize that, otherwise we
prefer ChaCha20. Some might call this ugly. I think it's awesome.
This is a new revision of the discovery server. Relevant changes and
non-changes:
- Protocol towards clients is unchanged.
- Recommended large scale design is still to be deployed nehind nginx (I
tested, and it's still a lot faster at terminating TLS).
- Database backend is leveldb again, only. It scales enough, is easy to
setup, and we don't need any backend to take care of.
- Server supports replication. This is a simple TCP channel - protect it
with a firewall when deploying over the internet. (We deploy this within
the same datacenter, and with firewall.) Any incoming client announces
are sent over the replication channel(s) to other peer discosrvs.
Incoming replication changes are applied to the database as if they came
from clients, but without the TLS/certificate overhead.
- Metrics are exposed using the prometheus library, when enabled.
- The database values and replication protocol is protobuf, because JSON
was quite CPU intensive when I tried that and benchmarked it.
- The "Retry-After" value for failed lookups gets slowly increased from
a default of 120 seconds, by 5 seconds for each failed lookup,
independently by each discosrv. This lowers the query load over time for
clients that are never seen. The Retry-After maxes out at 3600 after a
couple of weeks of this increase. The number of failed lookups is
stored in the database, now and then (avoiding making each lookup a
database put).
All in all this means clients can be pointed towards a cluster using
just multiple A / AAAA records to gain both load sharing and redundancy
(if one is down, clients will talk to the remaining ones).
GitHub-Pull-Request: https://github.com/syncthing/syncthing/pull/4648