This changes the TLS and certificate handling in a few ways:
- We always use TLS 1.2, both for sync connections (as previously) and
the GUI/REST/discovery stuff. This is a tightening of the requirements
on the GUI. AS far as I can tell from caniusethis.com every browser from
2013 and forward supports TLS 1.2, so I think we should be fine.
- We always greate ECDSA certificates. Previously we'd create
ECDSA-with-RSA certificates for sync connections and pure RSA
certificates for the web stuff. The new default is more modern and the
same everywhere. These certificates are OK in TLS 1.2.
- We use the Go CPU detection stuff to choose the cipher suites to use,
indirectly. The TLS package uses CPU capabilities probing to select
either AES-GCM (fast if we have AES-NI) or ChaCha20 (faster if we
don't). These CPU detection things aren't exported though, so the tlsutil
package now does a quick TLS handshake with itself as part of init().
If the chosen cipher suite was AES-GCM we prioritize that, otherwise we
prefer ChaCha20. Some might call this ugly. I think it's awesome.
This is a new revision of the discovery server. Relevant changes and
non-changes:
- Protocol towards clients is unchanged.
- Recommended large scale design is still to be deployed nehind nginx (I
tested, and it's still a lot faster at terminating TLS).
- Database backend is leveldb again, only. It scales enough, is easy to
setup, and we don't need any backend to take care of.
- Server supports replication. This is a simple TCP channel - protect it
with a firewall when deploying over the internet. (We deploy this within
the same datacenter, and with firewall.) Any incoming client announces
are sent over the replication channel(s) to other peer discosrvs.
Incoming replication changes are applied to the database as if they came
from clients, but without the TLS/certificate overhead.
- Metrics are exposed using the prometheus library, when enabled.
- The database values and replication protocol is protobuf, because JSON
was quite CPU intensive when I tried that and benchmarked it.
- The "Retry-After" value for failed lookups gets slowly increased from
a default of 120 seconds, by 5 seconds for each failed lookup,
independently by each discosrv. This lowers the query load over time for
clients that are never seen. The Retry-After maxes out at 3600 after a
couple of weeks of this increase. The number of failed lookups is
stored in the database, now and then (avoiding making each lookup a
database put).
All in all this means clients can be pointed towards a cluster using
just multiple A / AAAA records to gain both load sharing and redundancy
(if one is down, clients will talk to the remaining ones).
GitHub-Pull-Request: https://github.com/syncthing/syncthing/pull/4648