Designing Non-Correlation: Deep Dive Into DVT and Charon’s Architecture
I have been worrying about how to decentralise proof of stake since the original Serenity design spec launched at Devcon 4. Today I want to talk about how Charon, Obol's distributed validator middleware client, is architected in a manner that will allow it to grow the Ethereum validator set by an order of magnitude without introducing correlated behaviour into the validators that adopt the technology.
Firstly, what’s so important about correlation?
The proof of stake Ethereum specification is designed to encourage decentralisation by heavily penalising centralisation. There are mechanisms designed to penalise correlated inactivity, and there are mechanisms to maximally punish correlated malicious behaviour. The more validators offline at the same time as you, the worse your inactivity penalty is (particularly if >33% of the network are offline). Similarly, if your validator is considered to be attacking the network it will be slashed, and the slashing penalty grows as a function of how many other validators get slashed in the roughly three weeks following your slashing. If your validator was the only one slashed in this period, it would lose maybe 2 ether. If a mass slashing event happens, you could lose the entirety of your validator's ether.
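To make the scaling concrete, here is a minimal sketch of the correlation penalty as the consensus spec's process_slashings computes it, assuming the Bellatrix-era proportional slashing multiplier of 3 (exact constants vary by fork; this is illustrative, not consensus code):

```go
package main

import "fmt"

// Constants from the consensus spec, Bellatrix-era values assumed.
const (
	multiplier       uint64 = 3                  // PROPORTIONAL_SLASHING_MULTIPLIER
	increment        uint64 = 1_000_000_000      // EFFECTIVE_BALANCE_INCREMENT, 1 ETH in Gwei
	effectiveBalance uint64 = 32 * 1_000_000_000 // 32 ETH in Gwei
)

// correlationPenalty returns the extra penalty (in Gwei) applied to a
// slashed validator, given the total balance slashed in the surrounding
// window and the total active balance of the network.
func correlationPenalty(totalSlashed, totalBalance uint64) uint64 {
	adjusted := totalSlashed * multiplier
	if adjusted > totalBalance {
		adjusted = totalBalance
	}
	// Mirrors the spec's integer arithmetic: the penalty scales linearly
	// with the fraction of total stake slashed in the same window.
	return effectiveBalance / increment * adjusted / totalBalance * increment
}

func main() {
	totalBalance := 600_000 * effectiveBalance // ~19.2M ETH staked

	// A lone slashing rounds down to zero extra penalty; if a third of
	// all stake is slashed in the same window, the whole 32 ETH is gone.
	fmt.Println("lone slashing:", correlationPenalty(effectiveBalance, totalBalance), "Gwei")
	fmt.Println("1/3 of stake slashed:", correlationPenalty(totalBalance/3, totalBalance), "Gwei")
}
```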
So now that we understand the stakes here 😏 let’s look at some good (and bad) ideas for architecting a distributed validator client that aims to make the network safer, more resilient, and less correlated than the network would otherwise be without it.
Terminology
To quickly touch on a few terms for people newly introduced to Charon and DVT generally:
- Distributed Validator: One 32 ether stake, represented by a single BLS public key, operated in tandem by multiple private key shares.
- Distributed Validator Cluster: A set of Ethereum nodes (including Charon), connected together to operate one or more distributed validators.
- Middleware: Software that sits between two other independent pieces of software, intercepting the communication between them. Charon sits between a validator client and the HTTP Beacon API on a consensus client.
Private keys
Where else could you start when discussing distributed validator architecture other than with the private keys? There are two types of private keys involved in a distributed validator cluster:
- A secp256k1 public/private key pair is used to identify a Charon client. Charon aligns closely with the existing eth1 and eth2 networking stacks in this manner, using Ethereum Node Records (ENRs) and the discv5 discovery protocol to find the right peers on the internet no matter where they end up.
- BLS12-381 threshold key shares. These are BLS private keys that sign duties like a normal Ethereum validator's key would, but with a twist: multiple key shares are generated in such a manner that together they represent one underlying private key. Importantly, not all of the shares are needed to construct a valid signature for the associated public key; any threshold subset suffices. Think of it like a multi-sig wallet for a validator private key (there is a sketch of the threshold maths just below).
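For intuition, here is a minimal, illustrative sketch of that threshold property, using Shamir secret sharing over the BLS12-381 scalar field with Go's math/big. To be clear about the assumptions: a real DKG never assembles the full key in one place, and real signing combines partial signatures rather than reconstructing the secret; this only shows why any t of n shares are enough:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// Order of the BLS12-381 scalar field, where validator secret keys live.
var r, _ = new(big.Int).SetString(
	"73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001", 16)

type share struct{ x, y *big.Int }

// split produces n Shamir shares of secret such that any t reconstruct it.
func split(secret *big.Int, t, n int) []share {
	// Random polynomial of degree t-1 with the secret as constant term.
	coeffs := []*big.Int{secret}
	for i := 1; i < t; i++ {
		c, _ := rand.Int(rand.Reader, r)
		coeffs = append(coeffs, c)
	}
	shares := make([]share, n)
	for i := 1; i <= n; i++ {
		x := big.NewInt(int64(i))
		y := big.NewInt(0)
		for j := len(coeffs) - 1; j >= 0; j-- { // Horner evaluation mod r
			y.Mul(y, x)
			y.Add(y, coeffs[j])
			y.Mod(y, r)
		}
		shares[i-1] = share{x, y}
	}
	return shares
}

// reconstruct recovers the secret from any t shares via Lagrange
// interpolation at x = 0.
func reconstruct(shares []share) *big.Int {
	secret := big.NewInt(0)
	for i, si := range shares {
		num, den := big.NewInt(1), big.NewInt(1)
		for j, sj := range shares {
			if i == j {
				continue
			}
			num.Mul(num, new(big.Int).Neg(sj.x)).Mod(num, r)
			den.Mul(den, new(big.Int).Sub(si.x, sj.x)).Mod(den, r)
		}
		li := new(big.Int).Mul(num, new(big.Int).ModInverse(den, r))
		li.Mod(li, r)
		secret.Add(secret, new(big.Int).Mul(si.y, li)).Mod(secret, r)
	}
	return secret
}

func main() {
	secret, _ := rand.Int(rand.Reader, r)
	shares := split(secret, 3, 4) // 3-of-4, e.g. a four-operator cluster
	got := reconstruct(shares[:3])
	fmt.Println("recovered from 3 of 4 shares:", got.Cmp(secret) == 0)
}
```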
When setting up a new distributed validator cluster, each operator generates a private key (1) for their Charon client to use, and then they prepare a distributed key generation ceremony with the help of the distributed validator launchpad to create the distributed validator private keys (2).
After setting the terms of the cluster, adding all of the operators, and authenticating them all through the launchpad, the operators give their Charon clients the resulting cluster-definition file, and the DKG process begins. The Charon clients find one another on the internet, establish secure and encrypted lines of communication, and then create these BLS private keys in a manner where no one client ever controls or knows the full private key. Once complete, each Charon client writes its key shares to disk in the widely adopted EIP-2335 keystore format. As part of this process, the private keys are used to produce deposit data for activating the validator on the chosen network. This is the one and only time Charon has custody of, and the ability to sign with, the distributed validator's private keys. After the DKG process completes, the validator key shares get imported into each operator's validator client of choice.
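For reference, the EIP-2335 keystore each Charon client writes to disk is a JSON document shaped roughly like the Go structs below. This is a sketch of the standard's layout, not Charon's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Keystore mirrors the EIP-2335 JSON layout a BLS key share is stored in.
type Keystore struct {
	Crypto      Crypto `json:"crypto"`
	Description string `json:"description"`
	Pubkey      string `json:"pubkey"`  // hex-encoded BLS public key share
	Path        string `json:"path"`    // EIP-2334 derivation path (may be empty)
	UUID        string `json:"uuid"`
	Version     int    `json:"version"` // always 4 for EIP-2335
}

// Crypto holds the three modules EIP-2335 defines: a KDF to stretch the
// decryption passphrase, a checksum to verify a decryption attempt, and
// a cipher containing the encrypted secret itself.
type Crypto struct {
	KDF      Module `json:"kdf"`      // e.g. scrypt or pbkdf2
	Checksum Module `json:"checksum"` // sha256
	Cipher   Module `json:"cipher"`   // e.g. aes-128-ctr
}

// Module is the generic {function, params, message} shape shared by all
// three crypto modules.
type Module struct {
	Function string         `json:"function"`
	Params   map[string]any `json:"params"`
	Message  string         `json:"message"` // hex-encoded payload
}

func main() {
	// Illustrative values only, truncated for brevity.
	ks := Keystore{Version: 4, Pubkey: "8f3c"}
	out, _ := json.MarshalIndent(ks, "", "  ")
	fmt.Println(string(out))
}
```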
To contrast this key generation approach with another: it is possible (but unwise) to have a trusted entity (like a potential customer) create a full private key on their own, split it into shares, encrypt each share against an operator's client public key, and post the encrypted shares on-chain for the world to see. This is our first correlation risk, and it relates to the damage that can be caused if an operator loses their client's private key. In Charon, not much happens: that node can start acting in a Byzantine manner, which is not a major concern as the cluster is Byzantine-fault tolerant. In the alternative architecture, the attacker can retrospectively decrypt (and publish) every BLS validator key share sitting on-chain that was sent to the compromised operator, making every distributed validator this operator was a part of less secure.
Middleware
Once the distributed validator cluster has been created, the next architectural decision is to mitigate the risk of correlated slashing at all costs. Charon aims to achieve this by not having the ability to do anything slashable. Charon does not custody the validator private keys, nor does it have the ability to sign arbitrary data. Instead, Charon is a middleware that sits between existing consensus clients and validator clients and intercepts the data flowing between the two. All Charon clients do is the following (there is a sketch of the interception point after this list):
- Come to consensus amongst themselves on what should be presented to their connected validator clients for signing.
- Capture the returned signatures and combine them into an aggregate signature that gets sent to the wider network.
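To make the middleware shape concrete, here is a minimal sketch of an intercepting proxy between a validator client and the Beacon API, built on Go's standard library reverse proxy. The ports, endpoint prefixes, and the interceptDuty placeholder are illustrative assumptions, not Charon's actual implementation:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// interceptDuty is a placeholder: a real distributed validator client
// would run consensus with its peers on the duty data before returning
// it, and capture the partial signature the validator client submits.
// Here we simply forward the request upstream.
func interceptDuty(proxy *httputil.ReverseProxy, w http.ResponseWriter, r *http.Request) {
	proxy.ServeHTTP(w, r)
}

func main() {
	// Upstream consensus client exposing the HTTP Beacon API (assumed address).
	beacon, err := url.Parse("http://localhost:5052")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(beacon)

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Intercept duty-related endpoints so the cluster can agree on
		// what the validator client should sign; pass everything else
		// through to the real beacon node untouched.
		if strings.HasPrefix(r.URL.Path, "/eth/v1/validator/") ||
			strings.HasPrefix(r.URL.Path, "/eth/v2/validator/") {
			interceptDuty(proxy, w, r)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// The validator client is pointed at this address instead of the
	// beacon node directly (port chosen for illustration).
	log.Fatal(http.ListenAndServe("localhost:3600", mux))
}
```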
We think a middleware-based architecture is much more secure and trust-minimised than the alternative, which is to implement a distributed validator as a full validator client with custody of the private keys and the capability to sign arbitrary data. Consider the worst-case scenarios: Charon is compromised by a supply chain attack or a remote code execution exploit, or the Obol team become bad actors and push a malicious release. Even then, Charon cannot do much damage as a middleware. If a compromised Charon client proposes a potential double vote or surround vote to a validator client for signing, the validator client will check its anti-slashing database, see that it has already signed something conflicting, and simply refuse to return a signature. Charon could ask a validator client to sign an invalid block, but the chain would reject it and simply consider the validator offline, which is far better than getting slashed.
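Those two slashable conditions, the double vote and the surround vote, are easy to state precisely. Here is a Go rendering of the predicate from the consensus spec's is_slashable_attestation_data; a validator client's anti-slashing database exists to refuse any signature that would make this true against something it has already signed:

```go
package main

import "fmt"

// Vote is the FFG (source, target) pair carried by an attestation.
type Vote struct {
	Source uint64 // justified checkpoint epoch the vote builds on
	Target uint64 // checkpoint epoch being attested to
}

// isSlashablePair mirrors the spec's is_slashable_attestation_data: two
// attestations by the same key are slashable if they are a double vote
// (different data, same target epoch) or if the first surrounds the second.
func isSlashablePair(a, b Vote, sameData bool) bool {
	doubleVote := !sameData && a.Target == b.Target
	surroundVote := a.Source < b.Source && b.Target < a.Target
	return doubleVote || surroundVote
}

func main() {
	prev := Vote{Source: 10, Target: 20} // already signed, in the database
	next := Vote{Source: 12, Target: 15} // surrounded by prev: refuse to sign
	fmt.Println("would be slashable:", isSlashablePair(prev, next, false))
}
```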
As a middleware, Charon gets the benefit of having all of the existing validator clients as fail-safes that double-check nothing is wrong. Multiple implementations working together to validate make the odds of unintended slashing vanishingly small. Distributed validators implemented as a full validator client capable of signing arbitrary data, without the oversight of a second software implementation, have a much higher risk of causing correlated slashing in my opinion.
Communication
As we move down the threat severity from key compromise, to slashing risk, to correlated inactivity risk, the next architecture decision to look at is how Charon clients communicate with one another. Working under the guiding principle of not making validators more correlated when you’re aiming to make them less correlated, we have decided to keep shared networking infrastructure between distributed validator clusters to a minimum.
Rather than having every single Charon client connect to one shared gossip network, each cluster is completely isolated from the others. Charon clients in a cluster establish direct TCP connections to their peers. This approach takes more initial setup than connecting to a public gossip network, in that you need to make sure your Charon node can be reached directly on the public internet, but the payoff in the long run is worth it in my opinion; a sketch of how simple the direct approach is follows the list below.
- Direct TCP connections are reliable; messages don't get lost the way they can on a gossip network.
- Direct TCP connections are much faster than a signature bouncing around a network over multiple hops before getting to you. This improves validator profitability.
- There are no messages being sent to your client that aren’t intended for your client.
- There is no central networking layer whose failure would take every single distributed validator offline in one correlated outage.
- If you are running a cluster in a cloud infrastructure provider, using their private inter-data centre networking is fast, high throughput, and cost-subsidised. Using their network egress to the public internet for a gossip protocol incurs more cloud costs for a slower network that has less bandwidth.
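As a rough sketch of how little machinery the direct approach needs, here is a minimal Go example of dialling a static peer list over TCP. The addresses and port are hypothetical, and in practice Charon resolves peers from their ENRs via discv5 rather than from a static list:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Hypothetical static view of the other nodes in a four-node cluster.
	peers := []string{
		"10.0.0.2:3610",
		"10.0.0.3:3610",
		"10.0.0.4:3610",
	}

	conns := make(map[string]net.Conn)
	for _, addr := range peers {
		// Direct TCP dial with a timeout: no gossip mesh, no relays,
		// just one reliable, ordered stream per peer.
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			log.Printf("peer %s unreachable, will retry: %v", addr, err)
			continue // a BFT cluster tolerates some peers being down
		}
		conns[addr] = conn
	}
	log.Printf("connected to %d/%d peers", len(conns), len(peers))
}
```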
There is only one piece of common networking infrastructure Obol currently host, and that is a discv5 bootnode that enables Charon nodes to find one another. For a security-conscious staking community, it makes sense to self-host their own bootnode instead, severing the only common link left between Charon clusters. We have made it easy to do so, and have seen a number of our Athena testing program participants already opt for this.
Versioning
Last but not least, doing the hard work to decouple clusters from one another opens up another way to ensure you don't fail in a correlated manner, and that relates to software upgrades. Software upgrades are always scary and risky, much more so in a distributed system. If every client shares the same message bus and you release a new version that changes the way messages are structured, old versions of the software won't know what to do with these messages. To work around this, you set a time (or slot number) when the new protocol becomes active, and you blast everyone running your software on your comms channels to say they have to be on the latest version before that time or their nodes will go offline. Once that slot arrives, every single distributed validator swaps to the new protocol simultaneously. I barely need to highlight how correlated this is: if anything goes wrong, it goes wrong for everyone. Even if nothing goes wrong, people who didn't update in time get forced offline.
Sometimes, like when you run a layer 1 blockchain with multiple independent client implementations, picking a coordinated moment to upgrade is the only feasible way to roll out changes. In Obol, however, because each cluster is independent of the others, and because these clusters are Byzantine-fault tolerant, it becomes possible to let each cluster upgrade to the latest protocol at its leisure, when it is ready. We are building graceful version upgrades into Charon clients such that they periodically come to consensus with one another on the most recent protocol version they all speak, and once everyone is running the latest software the cluster realises, "Hey, we all speak version n+1 now, let's use it from next epoch onwards". This reduces the correlation risk involved in rolling out a breaking change: clusters can upgrade one at a time, and as confidence builds in the stability of the new release, the remaining clusters can follow asynchronously.
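Here is a sketch of that negotiation idea. All names and details are illustrative assumptions rather than Charon's actual wire protocol: each node advertises the protocol versions it supports, and the cluster activates the highest version every peer shares, deferring the switch to an epoch boundary:

```go
package main

import (
	"fmt"
	"sort"
)

// highestCommonVersion returns the largest protocol version supported by
// every peer in the cluster, or 0 if there is no overlap. Each peer is
// assumed to list each version it supports exactly once.
func highestCommonVersion(peerVersions [][]int) int {
	count := map[int]int{}
	for _, versions := range peerVersions {
		for _, v := range versions {
			count[v]++
		}
	}
	var common []int
	for v, n := range count {
		if n == len(peerVersions) {
			common = append(common, v)
		}
	}
	if len(common) == 0 {
		return 0
	}
	sort.Ints(common)
	return common[len(common)-1]
}

func main() {
	// Three peers already upgraded to v3; one still only speaks v2, so
	// the cluster keeps using v2 until the laggard upgrades. Once every
	// peer reports v3, the cluster switches from the next epoch onwards,
	// with no coordinated global fork required.
	peers := [][]int{
		{1, 2, 3},
		{1, 2, 3},
		{1, 2},
		{1, 2, 3},
	}
	fmt.Println("active protocol version:", highestCommonVersion(peers))
}
```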
Conclusion
I started this post by highlighting that I have been thinking about the architecture of proof of stake Ethereum for four years now. I hope by the end of this post you are convinced that Obol take seriously the importance of bringing DVT to the ecosystem safely. I also hope you come away from this article with a better understanding of the design tradeoffs that can be made when architecting distributed validators. I firmly believe that a trust-minimised, non-custodial, non-correlated architecture is a much healthier way to introduce high-availability validation to the space than a custom validator client with a common networking layer.
Thanks for reading.