Reflecting on Building Real-time APIs at Facebook

Disclaimer: I no longer work at Facebook. These opinions are mine alone.

Last month, I took some time to reflect on what I learned while building real-time APIs at Facebook. One useful technique stands out: build the null API first.

Consider a null endpoint for a new HTTP API: an endpoint that returns status code 200 with an empty body. With the null 200 API, the client can:

Check network connectivity
Measure latency
Validate an access token
Test the SSL certificate
Test intermediate caching layers

The client can also detect and handle various failure cases:

Network unavailable (airplane mode)
DNS lookup failed
Request timeout (retry the request?)
Flood response (503)
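
As a rough illustration, here is a minimal Python sketch of a client probing a null endpoint and sorting the result into the cases above. The names `ProbeResult`, `classify_failure`, and `probe_null_endpoint` are made up for this example, and treating every 503 as a flood response is an assumption carried over from the list, not a rule.

```python
import socket
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    outcome: str            # "ok", "flood", "http_error", "dns_failure", "timeout", "unreachable"
    status: Optional[int]   # HTTP status code, if a response arrived
    latency_s: float        # wall-clock time spent on the attempt


def classify_failure(err: BaseException) -> str:
    """Map an exception from urlopen() to one of the failure cases."""
    if isinstance(err, urllib.error.HTTPError):
        # Assumption: the server signals flooding with 503.
        return "flood" if err.code == 503 else "http_error"
    # URLError wraps the underlying socket-level error in .reason.
    reason = err.reason if isinstance(err, urllib.error.URLError) else err
    if isinstance(reason, socket.gaierror):
        return "dns_failure"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    return "unreachable"    # e.g. airplane mode or connection refused


def probe_null_endpoint(url: str, timeout_s: float = 3.0) -> ProbeResult:
    """Call the null endpoint once and report what happened."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            outcome, status = "ok", resp.status
    except OSError as err:  # URLError, HTTPError, and socket timeouts derive from OSError
        outcome, status = classify_failure(err), getattr(err, "code", None)
    return ProbeResult(outcome, status, time.monotonic() - start)
```

Calling `probe_null_endpoint` with the URL of your own null endpoint yields both a latency measurement and a failure category, with no application data involved.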

Notice that we haven’t mentioned application data. That’s the point. Engineering teams often focus on functional requirements first, and non-functional requirements second. By omitting the application data, the null API technique forces us to work backwards and confront non-functional requirements first.

Before turning to the streaming null API, we need to establish a foundation by discussing how streaming differs from request/response. In a streaming API, the client opens a persistent connection to the server, which can be an actual persistent connection (e.g. WebSocket) or a simulated connection (e.g. HTTP long polling). Over the lifetime of this connection, the server chooses when to push data to the client.

While request/response APIs can be stateful or stateless, a streaming API is always stateful. For example, in a chat app, the server needs to store at least three pieces of state:

Which chat rooms has the user joined?
What was the last message the client received? (if we want coherent conversations)
Which persistent connection will carry data to the correct user?
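
To make those three pieces of state concrete, here is a small in-memory sketch. `UserSession` and `SessionRegistry` are hypothetical names, and the per-room sequence numbers are an assumption about how "last message received" might be tracked; a real server would keep this in shared storage, not a dict.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class UserSession:
    connection_id: str                                       # which persistent connection reaches this user
    rooms: Set[str] = field(default_factory=set)             # chat rooms the user has joined
    last_seq: Dict[str, int] = field(default_factory=dict)   # room -> sequence number of the last delivered message


class SessionRegistry:
    """In-memory stand-in for the per-user state a chat server must keep."""

    def __init__(self) -> None:
        self._sessions: Dict[str, UserSession] = {}

    def attach(self, user_id: str, connection_id: str) -> UserSession:
        """Bind a user to a connection; a reconnect keeps rooms and cursors."""
        session = self._sessions.get(user_id)
        if session is None:
            session = self._sessions[user_id] = UserSession(connection_id)
        else:
            session.connection_id = connection_id
        return session

    def join(self, user_id: str, room: str) -> None:
        session = self._sessions[user_id]
        session.rooms.add(room)
        session.last_seq.setdefault(room, 0)

    def mark_delivered(self, user_id: str, room: str, seq: int) -> None:
        session = self._sessions[user_id]
        session.last_seq[room] = max(session.last_seq.get(room, 0), seq)

    def missed(self, user_id: str, room: str, latest_seq: int) -> range:
        """Sequence numbers the client has not yet received in this room."""
        return range(self._sessions[user_id].last_seq.get(room, 0) + 1, latest_seq + 1)
```

When a client reconnects on a new connection, `attach` swaps the connection while `missed` tells the server which messages to replay, which is what keeps the conversation coherent.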

Stateless application-layer protocols like HTTP have extreme amnesia. They remember only what’s necessary for a single request/response. By contrast, real-time APIs are also stateful at the application protocol layer. That is, they must remember what happened in the past. For example, real-time protocols frequently rely on the publisher-subscriber pattern (e.g. MQTT and Redis) where an unsubscribe request might only be valid if it follows a subscribe request for the same channel name.
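
A toy sketch of that kind of protocol-level memory, in Python. This models a hypothetical protocol where unsubscribing without a prior subscribe is an error; it is not a description of how MQTT or Redis actually respond, and `ChannelState` and `ProtocolError` are invented names.

```python
class ProtocolError(Exception):
    """Raised when a request is invalid given the connection's history."""


class ChannelState:
    """Tracks which channels one connection has subscribed to."""

    def __init__(self) -> None:
        self._subscribed = set()

    def subscribe(self, channel: str) -> None:
        self._subscribed.add(channel)

    def unsubscribe(self, channel: str) -> None:
        # Validity depends on history: there must be a prior subscribe.
        if channel not in self._subscribed:
            raise ProtocolError(f"not subscribed to {channel!r}")
        self._subscribed.remove(channel)

    @property
    def channels(self) -> frozenset:
        return frozenset(self._subscribed)
```

The same `unsubscribe("room:42")` request succeeds or fails depending on what came before it, which is exactly what a stateless protocol cannot express.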

Handling this kind of state at scale is challenging. First, the server must make trade-offs between consistency, availability, and partition tolerance (the CAP theorem), as well as durability and latency. Second, some kinds of state, such as chat room membership, must be synchronized across the network. If the client and server disagree about the set of chat rooms for the current user, the result is either a broken user experience or leaked resources. Unfortunately, state synchronization is hard. Third, client-side caches are often ill-equipped to deal with real-time data streams.
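
The chat room disagreement can be shown with a few lines of Python. `reconcile_rooms` is a hypothetical helper that only detects the mismatch; deciding which side is authoritative is a policy choice this sketch does not make.

```python
from typing import Dict, Iterable, List


def reconcile_rooms(client_rooms: Iterable[str], server_rooms: Iterable[str]) -> Dict[str, List[str]]:
    """Report how the client's and server's views of room membership differ."""
    client, server = set(client_rooms), set(server_rooms)
    return {
        # The client expects messages the server will never send: broken user experience.
        "missing_on_server": sorted(client - server),
        # The server keeps state and pushes data for rooms the client dropped: leaked resources.
        "leaked_on_server": sorted(server - client),
    }
```

For example, `reconcile_rooms(["lobby", "dev"], ["lobby", "ops"])` reports `dev` as missing on the server and `ops` as leaked.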

How many persistent connections does a client need? “One” seems like the obvious answer. But if this single persistent connection drops, all data streams sharing the connection would drop simultaneously. Then again, maybe multiple persistent connections would fare no better, such as cases where your cell phone enters a dead zone. The point is, real-time APIs are much more complicated than request/response APIs, partly because they are stateful at the application protocol layer as well as the transport layer.

With those concepts sorted out, we’re ready to model the null streaming API: an API where the client opens a persistent connection to the server, but the server pushes no application data. The client can use such an API to anticipate a bountiful assortment of success and failure cases:

Success Cases:

Reach the server (network available, DNS resolves, server cert is legit)
Create the persistent connection
Issue a request across the persistent connection
Detect server-initiated stream termination (end-of-stream)
Close the stream
Close the persistent connection
Pause the stream

Failure Cases:

Failed to create a persistent connection (transport level): Network unavailable, DNS lookup failure, Request timeout, Bad cert, Bad access token
Failed to create a data stream (protocol level): Unauthenticated, Unauthorized, Invalid request, Flood response
Connection interrupted: Server load shedding, Server gateway node failure
Stream interrupted: Connection interrupted (above)
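
One way to reason about these cases is as a client-side state machine. The sketch below is only an illustration: the state names, the event names, and the `resume` event are assumptions invented for this example, not part of any particular protocol.

```python
from enum import Enum, auto


class State(Enum):
    DISCONNECTED = auto()   # no persistent connection
    CONNECTED = auto()      # connection is up, no stream open
    STREAMING = auto()      # stream open (the null API pushes no data)
    PAUSED = auto()         # stream open but paused by the client


# (current state, event) -> next state
TRANSITIONS = {
    (State.DISCONNECTED, "connect_ok"): State.CONNECTED,
    (State.DISCONNECTED, "connect_failed"): State.DISCONNECTED,   # transport-level failure
    (State.CONNECTED, "stream_opened"): State.STREAMING,
    (State.CONNECTED, "stream_rejected"): State.CONNECTED,        # protocol-level failure
    (State.STREAMING, "pause"): State.PAUSED,
    (State.PAUSED, "resume"): State.STREAMING,
    (State.STREAMING, "end_of_stream"): State.CONNECTED,          # server-initiated termination
    (State.STREAMING, "close_stream"): State.CONNECTED,
    (State.PAUSED, "close_stream"): State.CONNECTED,
    (State.CONNECTED, "close_connection"): State.DISCONNECTED,
}


def next_state(state: State, event: str) -> State:
    """Advance the client state machine, rejecting events that make no sense."""
    # Losing the connection interrupts any stream riding on it.
    if event == "connection_interrupted" and state is not State.DISCONNECTED:
        return State.DISCONNECTED
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None
```

Because a connection interruption is handled from every connected state, the "stream interrupted" case falls out of the "connection interrupted" case, matching the list above.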

The stream/connection interrupted cases are particularly interesting. An interrupted stream cannot possibly yield data, but the interruption can occur at either the transport level or the protocol level. If that stream was feeding essential data to the client (for example, player health in a multiplayer game), it’s important to let the user know what’s happening: “hey, you’re lagging, hang on!”.

This list of non-functional cases is obviously not exhaustive, but it already shows how much smarter the client needs to be when handling edge cases. By building such an API first, the team can discover and design the best user experience when building apps running over unreliable networks, such as mobile networks.

If you’ve had success using some variation of the null API technique or if you know a better way, I’d love to hear from you in the comments!