I’ve been working at Robinhood since the early days, and one phenomenon I’ve been fascinated by is the dramatic decrease in developer velocity as both the engineering organization and the production systems have scaled.

Here’s an example, inspired by real-world events, of building out the backend API for showing dividends in-app in 2014:

  1. We decide in the morning we want to build out some basic support for dividends. Our clearing firm provides a daily file showing upcoming dividends. You’re the backend engineer on this, and you say you can have a first draft of the diff out by EOD today.
  2. You read the Investopedia and Wikipedia entries on dividends and the clearing firm’s file format. You poke around the APIs of some other brokerages to get a sense of what the schema looks like.
  3. Our technology stack is Python/Django/DRF/Postgres. You define the schema in a models.py file and then write some API views and serializers whose types are pulled directly from the models.py definition. Basically all you’re specifying in the API views is which fields from the model should be exposed. You write the Django command to pull in the data from the clearing firm’s file, and tests for all of it. Diff goes out. (A sketch of what this might look like follows the list.)
  4. Frontend engineer consumes the API and both of you test it together with some sample data in a test environment.
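
To make that concrete, here’s a minimal sketch of what the model, serializer, and view in such a diff might look like. The field names and related models are invented for illustration; the point is how little of this is anything other than declaring the schema and picking which fields to expose.

```python
# models.py -- illustrative fields only, not the actual schema
from django.db import models

class Dividend(models.Model):
    account = models.ForeignKey('accounts.Account', on_delete=models.CASCADE)
    instrument = models.ForeignKey('instruments.Instrument', on_delete=models.CASCADE)
    rate = models.DecimalField(max_digits=10, decimal_places=4)    # per-share amount
    amount = models.DecimalField(max_digits=12, decimal_places=2)  # total payout
    record_date = models.DateField()
    payable_date = models.DateField()

# serializers.py -- field types are pulled straight from the model definition
from rest_framework import serializers

class DividendSerializer(serializers.ModelSerializer):
    class Meta:
        model = Dividend
        fields = ('id', 'instrument', 'rate', 'amount', 'record_date', 'payable_date')

# views.py -- a read-only endpoint scoped to the requesting user's account
from rest_framework import viewsets

class DividendViewSet(viewsets.ReadOnlyModelViewSet):
    serializer_class = DividendSerializer

    def get_queryset(self):
        return Dividend.objects.filter(account__user=self.request.user)
```

Add a management command to ingest the clearing firm’s file and a router entry for the viewset, and the diff is essentially done.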

Looking at this workstream, it’s incredible how much of the engineer’s time is focused on what matters. Most of the backend engineer’s time is spent understanding the business requirements and implementing them with high-level abstractions like models and views. And what matters is the end result delivered to the user: can they see their dividends?

Breaking down at scale

I like to think about scalability in two ways: systems scalability and organizational scalability. Systems scalability is more straightforward, so let’s look first at how the workflow above breaks down in that dimension:

  1. Postgres cannot be trivially scaled horizontally. In a basic architecture of app traffic -> load balancers -> API servers -> database, the database tends to be the only tier that can only be scaled vertically. Sure, there are other database options, but if you use something like DynamoDB or Spanner, you’re giving up Django/DRF. Maybe you could use something like CockroachDB, which promises very good Postgres compatibility, but at that point you’re probably spending an innovation token. We’re building a mobile-first, zero-commission stock brokerage; that’s already a lot of innovation tokens.
  2. Python/Django/DRF is slow and inefficient. This one hits us in many ways. As traffic increases we’ll need to deploy more servers, which pushes more complexity onto infrastructure teams sooner. Running tests locally will get slower and slower as starting the Python interpreter and importing the application takes longer. CI and integration tests will take much more time as spinning up the app in a fresh environment gets slower.
  3. Python/Django/DRF is expensive. Obviously this is very related to the previous point, but every application that serves even a modest number of users is going to care about cost at some point. And a lot of money will get burned starting the application in CI and integration testing environments.
  4. Django/DRF’s overhead is substantial. The abstractions that give the developer this productivity aren’t free, and they’re paid for at runtime. The Django ORM is quite inefficient — one of my biggest CPU performance improvements was changing the OAuth token query from the Django ORM to a raw SQL query (a sketch of that kind of change follows this list). The creator of DRF wrote a nice blog post showing how bypassing some of the abstractions provided by DRF can improve performance.
  5. Python has some peculiarities when used to run a high-QPS service. Rachel Kroll had a good blog post on this here. Python 3’s asyncio will hopefully fix some of these things, but then we’re giving up Django/DRF (for now, at least).
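
To illustrate point 4, here’s the shape of that kind of swap, assuming a django-oauth-toolkit-style AccessToken model. This is not the actual change that was made, just an illustration of the idea.

```python
from django.db import connection
from oauth2_provider.models import AccessToken  # assuming django-oauth-toolkit

# ORM version: concise, but every call pays for queryset construction,
# SQL generation, and Python-level field handling.
def user_id_for_token_orm(token):
    return AccessToken.objects.values_list('user_id', flat=True).get(token=token)

# Raw SQL version of the same lookup, skipping most of that overhead.
def user_id_for_token_raw(token):
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT user_id FROM oauth2_provider_accesstoken WHERE token = %s",
            [token],
        )
        row = cursor.fetchone()
        return row[0] if row else None
```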

Now let’s talk about organizational scalability:

  1. It’s relatively easy to break forward compatibility with REST APIs. As a comparison, protobufs identify fields and enum values on the wire by number rather than by name, which makes renaming fields and enum values safe. With a variety of clients (even just iOS/Android/Web), there will be clients that make implicit assumptions about the data.
  2. While you can make OpenAPI schemas, they’re not first-class citizens. By default, it’s also possible to return data that doesn’t conform to the schema. This means frontend engineers have to be cautious when relying on these schemas as a source of truth.
  3. Typically today, a data infrastructure team will snapshot the database periodically and dump it into a data lake. This is phenomenal for data analytics use cases, but invariably leads to a duplication of the schema definition. It also becomes incredibly easy to build some critical workflows on top of the data lake, which can be quite brittle.
  4. Another team will inevitably ask for a stream of events to solve a particular problem or implement a new product feature. This doesn’t really fit neatly into existing CRUD frameworks, so you explain to them that this isn’t trivial and then they’ll wonder why you didn’t implement it as an event-based system originally.
  5. Typing is often more of an afterthought in a Django/DRF-like framework. It’s a lot easier to build these frameworks without being constrained by a type system, so the core models don’t have types unless you cobble together a few libraries, which a super early stage company is less likely to do.
  6. Rewriting subsets of APIs in a new service is much harder. You’re basically picking between using the same tech stack or using a new tech stack and attempting to perfectly match the existing responses, which takes some trial and error. Often you’ll just use the same tech stack, leaving many of these challenges in place.

The developer velocity vs scalability “tradeoff”

I think a framework with good developer velocity would not only let you build MVPs quickly, as shown at the top, but also make it easy to scale. But too often the tools available to us pick one or the other: either they’re great for MVPs but hard to scale, or they’re easier to scale but carry a higher upfront cost to build with. And since definitions of developer velocity often exclude the scaling piece, we perceive a tradeoff between developer velocity and scalability.

For instance, take this blog post which discusses common anti-patterns in Go applications. It talks about the drawbacks of having a single Go struct serve as both the schema for the database and the schema for the API. It recommends (correctly) that developers avoid this and instead write two Go structs, one for the API and one for the database, plus some hand-written glue code between the two. When choosing between these two approaches, a developer is ultimately making this tradeoff: do I want my MVP built faster, or am I building something more for the long term? This is just one decision, and at an early stage company you’ll make dozens of such decisions, ranging from something minor like directory structure to something major like which programming language or cloud offering to use.
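
The same choice exists in any stack. Here’s the tradeoff sketched in Python (the language of the earlier example) rather than Go, with invented names: one shape reused everywhere versus two shapes plus glue.

```python
from dataclasses import dataclass

# MVP approach: one class doubles as the database row and the API response.
# Fast to write, but the API is now coupled to the storage layout.
@dataclass
class Dividend:
    id: int
    account_id: int
    amount_cents: int
    payable_date: str

# Longer-term approach: separate shapes for storage and API, plus glue code.
@dataclass
class DividendRecord:        # what the database stores
    id: int
    account_id: int
    amount_cents: int
    payable_date: str

@dataclass
class DividendResponse:      # what the API exposes
    id: int
    amount: str              # formatted dollars, decoupled from internal cents
    payable_date: str

def to_response(record):
    return DividendResponse(
        id=record.id,
        amount='{:.2f}'.format(record.amount_cents / 100),
        payable_date=record.payable_date,
    )
```

The second version is more code to write and keep in sync by hand, which is exactly the upfront cost the MVP approach avoids.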

And ultimately, if a very early stage company systematically makes decisions that solely prioritize the long run, the company will fail. Either the MVP will take too long to build, or you’ll find that you over-engineered certain parts of the system and are drowning in complexity and tech debt.

But this tradeoff doesn’t need to exist anymore. We’re seeing cloud architectures converge more and more around tools like gRPC and Kubernetes. Languages like Go offer such fast compile times that running tests feels more like working with an interpreted language than a compiled one. Some horizontally scalable database options have progressed to the point where using them no longer requires spending an innovation token.

The “ideal” next-gen framework

Let’s start with the end state — suppose we want a new application framework that avoids this tradeoff. There are three key requirements this framework would need to satisfy:

  1. Fast runtime performance. The programming language is native or near-native, has a fast startup time, and the framework’s abstractions have zero or low runtime cost, achieved through either templates/generics or code generation.
  2. Fast iterative development feedback loops even with large code bases. Compiling should be fast and running tests should be fast.
  3. Schema-first. Even if this framework reaches the pinnacle of perfection, there will still be code not written in the same language as the framework, so the schema must be defined outside of the code, with the code being an equal consumer of the schema. This also lets us be a bit more technology-agnostic.

It’s about the schemas

All of these lines of thought led me to a surprising place — improving how we write schemas is the best place to start. Generally, schemas are defined either in code or in data.

In a code-first schema (like Django/DRF), the application code itself specifies the schema. This is great for initial developer velocity, but the schema definition itself often gets overloaded with additional metadata that doesn’t make its way into the generated schema. Ultimately, if the schema is defined in the same programming language that it’s used in, there’s a very high chance some information is lost when converting it to a schema.

In contrast, in a schema-first approach (like gRPC), the schema is defined in some data format, and all consumers of the schema are effectively on a level playing field. Protobuf files in particular play a nice trick where the file looks like code but is actually just data. In fact, the compiled form of a protobuf file is itself just a protobuf message (its format is specified here), and it basically matches exactly what the developer writes in the .proto file.

So if we had a better way to write schemas, one that supports attaching the kind of metadata that gives code-first schemas their developer velocity, we could get the best of both worlds. The process would look like:

  1. Write the schemas in some new way, covering all cases where data is transmitted or stored. This would even mean that the database schema is derived from these universal schemas, in addition to the more traditional cases like API schemas or interservice message schemas.
  2. Use metadata from these schemas to perform various types of code generation. The tooling would generate the necessary SQL statements to migrate the database and provide high-level hooks for the application code to interact with the database and serve the APIs defined in the schema (a toy sketch of this follows the list).
  3. Write application code leveraging the generated code, letting the developer focus primarily on implementing business logic.
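
No such framework exists yet, so this is purely illustrative, but here’s a toy sketch in Python of steps 1 and 2: a schema expressed as plain data, and a trivial generator that derives the database DDL from it. A real framework would generate far more than this (migrations, typed model classes, API handlers), but the flow is the same. Every name here is invented.

```python
# A toy "universal schema" expressed as data rather than as application code.
DIVIDEND_SCHEMA = {
    "name": "dividend",
    "fields": [
        {"name": "id",           "type": "int64", "primary_key": True},
        {"name": "account_id",   "type": "int64"},
        {"name": "amount_cents", "type": "int64"},
        {"name": "payable_date", "type": "date"},
    ],
}

SQL_TYPES = {"int64": "bigint", "date": "date", "string": "text"}

def create_table_sql(schema):
    # Derive the database DDL from the schema; other generators could derive
    # API handlers, serializers, or client types from the same definition.
    cols = []
    for field in schema["fields"]:
        col = '{} {}'.format(field["name"], SQL_TYPES[field["type"]])
        if field.get("primary_key"):
            col += " PRIMARY KEY"
        cols.append(col)
    return 'CREATE TABLE {} ({});'.format(schema["name"], ", ".join(cols))

print(create_table_sql(DIVIDEND_SCHEMA))
# CREATE TABLE dividend (id bigint PRIMARY KEY, account_id bigint,
#   amount_cents bigint, payable_date date);
```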

If something like this were to exist, all of the scalability issues I noted above could go away:

  1. Provides looser coupling with the database. The schema comes first and the database schema is derived from it. This would make it easier to keep the same abstract models while using a different database. Starting from scratch, this use case could be treated as a first-class citizen.
  2. Python/Django/DRF is no longer required to get that fast initial developer velocity; a language like Go could be used instead, since the high-level abstractions come from code generation. And because they are generated rather than resolved at runtime, those abstractions don’t need to carry substantial runtime overhead.
  3. We could use gRPC as the primary API server mechanism, which avoids many of the pitfalls of REST APIs while still being able to expose a REST API through the grpc-gateway project. This also enables easier re-implementations of subsets of the API.
  4. This approach is still schema-first in the sense that all consumers of the schema are equal.
  5. If a data infra team wants to access the data from the database directly, they can use the same schema that was used to generate the database schema. This also makes it easier to see who is consuming the schema, which improves the ability to catch schema changes that would introduce a regression. If you’d like, you could even expose the database changelog to create a kind of event stream.
  6. Typing can be a first-class citizen.