The alarm went off at 6:20am. I rolled out of bed, opened my laptop, and pulled up Grafana. Equity markets open at 6:30 and I hoped last night’s fixes would hold. I watched request and order counts, comparing them to a week ago when things ran fine. Slack was already buzzing with engineers also checking metrics.
The clock flipped to 6:30. The first metric tick came in, spiking the graphs. Okay, not bad. The next ticks pushed them higher, but error rates didn’t budge. Traffic just needed to hold steady for a minute, then ease off. This could’ve been any morning from 2014, when we started bringing in external Robinhood customers, to 2024, when I left. I was an engineer and later a manager, mostly on brokerage systems.
For stock brokers, market open is a dual threat that’s brutal to handle: orders placed overnight all execute at once, and it’s the moment when the largest share of customers check their accounts. Brokers handle customer money, so outages damage reputation and trust, which makes reliability critically important.
For the purposes of handling market open load, two Robinhood systems were the most critical. The first was brokeback (“brokerage backend”), a python/django service backed by a postgres database and a memcached caching layer. Brokeback’s architecture is fairly standard, although nowadays it leverages application sharding: multiple copies of brokeback run in parallel, sharded by user, each with its own database and cache. The second was main street (it connects to “wall street”), a Go application backed by postgres databases that places orders, manages order state, and produces a stream of updates. Unlike brokeback, main street was built from the start to be sharded, by order id, with its own routing layer. It communicates with counterparties primarily over FIX, a standard stateful point-to-point protocol widely used in the financial world.
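To make the sharded layout concrete, here is a minimal sketch of the key-to-shard mapping that application sharding implies; the shard names and counts are hypothetical, not Robinhood’s actual code, and the shard key would be the user for brokeback and the order id for main street:

```python
import hashlib

# Hypothetical shard map: each shard owns its own database and cache.
BROKEBACK_SHARDS = {
    0: {"db": "brokeback-db-0", "cache": "brokeback-memcached-0"},
    1: {"db": "brokeback-db-1", "cache": "brokeback-memcached-1"},
    2: {"db": "brokeback-db-2", "cache": "brokeback-memcached-2"},
}

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a shard key (user id, order id, ...) to a shard number."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def backends_for_user(user_id: str) -> dict:
    """Pick the database and cache that a given user's requests should hit."""
    return BROKEBACK_SHARDS[shard_for(user_id, len(BROKEBACK_SHARDS))]

print(backends_for_user("user-12345"))
```

A plain hash-modulo mapping makes migrating users between shards awkward, so in practice a lookup table or config service tends to replace the modulo once shards are added over time.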
Outside of regular market hours, users can still place orders that will execute starting from the open. Shortly before open, main street resends those orders to execution venues. At open, the executions start pouring in. Main street handles them and publishes execution information along with the updated order state. Exactly how this is published has changed over the years; it has been implemented with both websockets and kafka. Brokeback consumes these updates, updates its database, invalidates caches, and then sends push notifications. The database update involves substantial business logic, and each order update opens a transaction and locks the associated account row. At that point, users can see their updated execution information.
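A minimal sketch of that consume-and-apply loop, assuming a kafka topic of order updates and a made-up schema (accounts and orders tables, a stubbed push notification); the real business logic was far heavier than this:

```python
import json

import psycopg2
from kafka import KafkaConsumer  # kafka-python
from pymemcache.client.base import Client as MemcacheClient


def send_push_notification(account_id, update):
    """Stand-in for the real push notification pipeline."""
    print(f"notify {account_id}: order {update['order_id']} -> {update['state']}")


conn = psycopg2.connect("dbname=brokeback")
cache = MemcacheClient(("localhost", 11211))
consumer = KafkaConsumer("order-updates", bootstrap_servers="localhost:9092")

for message in consumer:
    update = json.loads(message.value)
    with conn:  # one transaction per order update
        with conn.cursor() as cur:
            # Lock the account row so concurrent executions for the same
            # account apply one at a time.
            cur.execute(
                "SELECT id FROM accounts WHERE id = %s FOR UPDATE",
                (update["account_id"],),
            )
            cur.execute(
                "UPDATE orders SET state = %s, filled_qty = %s WHERE id = %s",
                (update["state"], update["filled_qty"], update["order_id"]),
            )
    # After the commit: drop cached views of the account, then notify the user.
    cache.delete(f"account:{update['account_id']}:positions")
    send_push_notification(update["account_id"], update)
```

Locking the account row serializes concurrent executions for the same account, which is part of why these updates are expensive for the database at open.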
These updates are handled at the same time as the highest client traffic rates seen all day; even significant events like Fed announcements didn’t push traffic over market open peaks. We didn’t want the queued order handling to significantly degrade the user experience when such a high percentage of users were in the app.
Postgres Redlines Early
Both of these challenges were magnified by massive growth. From March 2020, when a scale-related outage took systems down for a full day, to January 2021, traffic grew more than 10x. January 2021 saw over 2M requests per second going to brokeback, which at the time could comfortably handle about 100 requests per second per application server.
This scale and growth constantly created challenges for us over the years. Very early on, we saw issues around connection instantiation, which were quickly solved with connection pooling. Occasionally a severe performance regression landed in python code, which we could track down quickly by identifying the slow endpoint and isolating the changes to that endpoint from the previous deployment. Looking at compute flamegraphs didn’t show many clear wins. But more often than not, the database was where the scalability challenges lived.
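For reference, the simplest shape of that connection pooling fix in django is persistent connections via CONN_MAX_AGE; whether we used this, an external pooler like pgbouncer, or both isn’t something the story above pins down, so treat this as one plausible sketch:

```python
# settings.py -- persistent connections keep one database connection open per
# worker process instead of reconnecting on every request. An external pooler
# such as pgbouncer is the other common approach; either way, the goal is to
# stop paying connection-setup costs on the hot path.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "brokeback",
        "HOST": "brokeback-db",
        "CONN_MAX_AGE": 60,  # seconds to keep a connection alive (0 = close per request)
    }
}
```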
Initially, we could tackle the database problems iteratively. New query patterns would pop up and demand additional indexes. Postgres’s query planner sometimes picked the wrong query plan, which a postgres settings change would fix. Caching was introduced and its use cases expanded. Indexes were deleted when we realized they weren’t actually being used. PGAnalyze and AWS RDS Performance Insights were invaluable tools for us. Our postgres instance got upsized to the largest available instance. Read replicas were added, and then used more heavily once we added replica lag monitoring.
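Two of those checks, expressed as queries against standard postgres statistics views: finding indexes that are never scanned (candidates for deletion) and measuring replica lag. The connection strings and wrapper script are illustrative, not our actual tooling:

```python
import psycopg2

# Indexes that have never been scanned since the last stats reset -- candidates
# for deletion (pg_stat_user_indexes is a standard postgres statistics view).
UNUSED_INDEXES = """
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY relname
"""

# Replica lag as seen from a read replica: how far behind the last replayed
# transaction is. Alert if this grows past what the product can tolerate.
REPLICA_LAG = "SELECT now() - pg_last_xact_replay_timestamp() AS lag"

with psycopg2.connect("dbname=brokeback") as conn, conn.cursor() as cur:
    cur.execute(UNUSED_INDEXES)
    for schema, table, index, scans in cur.fetchall():
        print(f"unused index {schema}.{index} on {table}")

with psycopg2.connect("dbname=brokeback host=replica-1") as conn, conn.cursor() as cur:
    cur.execute(REPLICA_LAG)
    print("replica lag:", cur.fetchone()[0])
```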
But ultimately, these kinds of fixes were not going to buy us 10x scale when our database was already close to redlining. We needed to shard brokeback.
Sharding Takes Over
Sharding had been a conversation for years before we actually executed the project. At a high level, we narrowed it down to two approaches: data layer sharding and application sharding. With data layer sharding, we would keep a single fleet of application servers handling traffic, but the specific database each request talked to would depend on which user the request referenced. Since brokeback had a lot of batch jobs and asynchronous consumers, this would not be as simple as tweaking some django settings for API requests; extensive changes would be needed. Furthermore, data layer sharding solves for scale and nothing else, and we would have liked to get better canary testing out of the effort.
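For context, data layer sharding in django would have looked roughly like a database router choosing a connection per user. The toy sketch below shows only the easy part, the request path; the batch jobs and asynchronous consumers mentioned above are where the extensive changes would have landed:

```python
# settings.py would define one DATABASES entry per shard ("shard_0", "shard_1",
# ...) and point DATABASE_ROUTERS at this class.
DATABASE_ROUTERS = ["routers.UserShardRouter"]


# routers.py
class UserShardRouter:
    """Route reads and writes to a per-user shard when a user is in scope."""

    NUM_SHARDS = 4

    def _shard_for_user(self, user_id):
        return f"shard_{user_id % self.NUM_SHARDS}"

    def db_for_read(self, model, **hints):
        instance = hints.get("instance")
        if instance is not None and hasattr(instance, "user_id"):
            return self._shard_for_user(instance.user_id)
        return None  # fall back to the default database

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)
```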
But application sharding would come with its own set of challenges. Everything would have to be duplicated: application servers, databases, caches, airflow workers, kafka consumers, daemons, and more. Furthermore, a new service would have to be introduced to manage routing to each of the application shards. This new service would have to be both performant enough to handle end user traffic and flexible enough to join queries across shards for internal admin panels.
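The “flexible” half of that routing service is a scatter-gather: fan an admin query out to every shard and merge the results. A rough sketch with made-up shard endpoints and response shape:

```python
import asyncio

import httpx

# Hypothetical shard endpoints; the real router would keep this in configuration.
SHARD_URLS = [f"https://brokeback-shard-{i}.internal" for i in range(4)]


async def query_shard(client: httpx.AsyncClient, base_url: str, email: str):
    """Ask a single shard for accounts matching an admin search."""
    resp = await client.get(f"{base_url}/admin/accounts", params={"email": email})
    resp.raise_for_status()
    return resp.json()["results"]


async def search_accounts(email: str):
    """Scatter the query to every shard, then gather and merge the results."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        per_shard = await asyncio.gather(
            *(query_shard(client, url, email) for url in SHARD_URLS)
        )
    # A user lives on exactly one shard, so merging is just concatenation here.
    return [account for results in per_shard for account in results]


if __name__ == "__main__":
    print(asyncio.run(search_accounts("user@example.com")))
```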
Ultimately, the most tenured engineer decided that application sharding was the right call, spent all of his political capital to make it happen, and drove the project forward. He presented his plan in late 2019 despite considerable internal opposition and skepticism. By the time COVID lockdowns caused traffic to spike through early 2020, application sharding had been identified as a critical project to complete, since it would let us throw money at the problem of scaling backend systems instead of relying on engineering projects that could take months or years to finish.
The router was also implemented as a python/django service. Our python/django services were already fronted by nginx, so routing client requests was simply a matter of writing a lua module to run in nginx. Python/django let us write the administrative API sharding and joining logic at a higher level than Go would have, and performance was much less of a concern there.
Application sharding had side benefits: it provided more robust canary options (a dedicated canary shard, or rotating through shards to serve that role) and greater isolation between US and international users, who could sit on different shards.
Holding Up at Scale
As the end of 2020 approached, we began rolling out sharding, and just in time too. At the end of January 2021, Robinhood was all over the news and traffic really began spiking. 6:30 AM days were the norm, and most of the engineers were awake for them. We just barely kept up with growth by frantically spinning up new shards and migrating users to them. The strategy turned out to work incredibly well: we saw a scale that certainly would have led to extended outages if sharding had not happened.
That engineer’s stand in 2019, burning political capital against pushback, bridged us from the engineering failure of 2020 to the win of 2021. The work was not glamorous, but it was critical.