Tour de SOL – Stage 1: Week 1 Recap
We’re officially a week into Stage 1 of Tour de SOL (TdS), and what a week it has been. Our goal was to begin stress testing our latest release, v0.23.3, and iron out any major issues in it, with the aim of using it to upgrade our Soft Launch Phase 1 (SLP1) cluster to the Global Cluster (GC), the first major release of our cluster. So far we’ve had 2 cluster restarts, 1 critical bug identified, and a series of smaller bugs that we’re currently working through. Suffice it to say, it’s been everything we hoped it would be when we first set out to launch this event.
ATTACK! CRITICAL BUG FOUND
Congratulations to the team at Certus One, who successfully submitted the first PR for a critical bug! With this attack they managed to weaponize a memory leak bug in the gossip network to starve the kernel allocator, which would take down any Validator on the cluster (either one at a time, or all at once).
The diagrams below show how the cluster reacts when this attack is performed. The diagram on the left shows an ever-growing gap between the latest block and the last finalized block, where a healthy cluster would instead show a flat line. The diagram on the right should look like a linearly increasing function; instead it’s a flat line, indicating that consensus isn’t being achieved.
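The pattern in the two diagrams can be expressed as a simple check. The following is a minimal sketch, not the monitoring we actually run: the function name, the sample format, and the `max_gap` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple


def finality_stalled(samples: List[Tuple[int, int]], max_gap: int = 100) -> bool:
    """Return True when finality looks stalled.

    `samples` is a chronological list of hypothetical
    (latest_block, last_finalized_block) readings.
    """
    if len(samples) < 2:
        return False  # not enough data to see a trend

    # Gap between the newest block and the last finalized one (left diagram).
    gaps = [latest - finalized for latest, finalized in samples]

    # Finalized height should rise over time (right diagram); flat means no consensus.
    finalized_flat = samples[-1][1] == samples[0][1]
    gap_growing = all(later > earlier for earlier, later in zip(gaps, gaps[1:]))

    return finalized_flat and gap_growing and gaps[-1] > max_gap


# Illustrative readings only, not real cluster data.
healthy = [(100, 95), (200, 195), (300, 295)]
attacked = [(100, 95), (200, 95), (300, 95)]
```

With these made-up readings, `finality_stalled(healthy)` is false because the gap stays flat, while `finality_stalled(attacked)` is true because the finalized height never moves and the gap keeps widening.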
OTHER ISSUES IDENTIFIED
The great news is that, despite the following 2 issues, the cluster managed to stay alive. However, they did lead to a fair amount of on-boarding friction for Validators:
- Mixed Public Keys across Solana Clusters
The gossip networks across both TdS and SLP1.1 merged into a larger cluster during the on-boarding period.
A few weeks ago, some of you may have seen our announcement about the launch of our SLP1 cluster, which runs in parallel to TdS. To connect to this, or any of our other clusters, we typically require Validators to provide us with their public key, a public identifier for their node.
So, as Validators started to connect to the newly set up TdS cluster during the initial 24-hour on-boarding period, we collected their public keys in advance so that we could delegate tokens to them and get them staked.
Some of the Validators coming online re-used the same public keys that they were using for the SLP1 cluster. As a result, these Validators, which couldn’t tell that the SLP1 and TdS clusters were separate, merged the two gossip networks into a larger whole, and this eventually turned Validators delinquent.
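One way to catch this during on-boarding is to compare the public keys collected for each cluster before delegating any stake. This is a hedged sketch only: the function name and the placeholder key strings are hypothetical, and it is not a description of our actual on-boarding tooling.

```python
from typing import Iterable, List


def reused_identities(cluster_a: Iterable[str], cluster_b: Iterable[str]) -> List[str]:
    """Return the public keys that were submitted for both clusters."""
    return sorted(set(cluster_a) & set(cluster_b))


# Placeholder identifiers, not real Validator public keys.
slp1_keys = ["key-alpha", "key-bravo", "key-charlie"]
tds_keys = ["key-bravo", "key-delta"]

overlap = reused_identities(slp1_keys, tds_keys)
```

Any key that appears in `overlap` belongs to a node that could bridge the two gossip networks, so its operator would be asked to generate a fresh identity for TdS.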
- Overloaded RPC Entrypoint
The RPC entrypoint for the tds.blog.solana.com Validator was added to gossip multiple times, leading to excessive inbound traffic. Inefficient handling of that additional traffic eventually caused the Validator to run out of memory.
For Validators joining the Tour de SOL cluster, there are two methods of connecting their Validator:
- Using the existing RPC entrypoint provided by Solana
- Setting up their own RPC entrypoint
The first option is by far the most convenient. However, during the first 48 hours of the cluster coming online, Validators found that they could only intermittently connect through the single RPC entrypoint we had prepared, often only after multiple attempts.
While we’re still debugging this, we observed two key issues. First, our RPC entrypoint was being added to the gossip network multiple times (at least 10 times) due to a snapshot issue, which led to a significant increase in traffic. Second, and as a consequence, the RPC node wasn’t able to properly manage the extra traffic and ran out of memory. It’s possible that other factors are involved, and we’re still investigating.
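The first symptom, one node showing up in gossip many times, is easy to illustrate. The snippet below is only a sketch under assumptions: it treats a gossip table as a plain list of node identifiers, which is a simplification for illustration and not how the Validator stores gossip data.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicated_entries(entries: Iterable[str]) -> Dict[str, int]:
    """Return identifiers that appear more than once, with their counts."""
    counts = Counter(entries)
    return {node: n for node, n in counts.items() if n > 1}


# Hypothetical table: the entrypoint listed 10 times alongside two other nodes.
gossip_table = ["rpc-entrypoint"] * 10 + ["validator-1", "validator-2"]

duplicates = duplicated_entries(gossip_table)
```

Here `duplicates` contains only the entrypoint with a count of 10. Each extra copy is another advertised route for inbound traffic, which matches the load pattern we observed.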
- Failed Migration of Bootstrap Leader Node
An attempt to migrate the bootstrap node from Google Cloud into a Co-Location setup caused it to re-transmit a block. The rest of the cluster forked away, and consensus was permanently lost.
- Cluster Outage in Data Centre
Due to an outage in the Data Centre hosting our bootstrap node, the cluster stopped making progress, and we had to restart it.