Why Does gRPC Insist on Trailers?
gRPC comes up occasionally on the Orange Site, often with a redress of grievances in the comment section. One of the major complaints people have with gRPC is that it requires HTTP trailers. This one misstep has caused so much heartache and trouble that I think it is probably the reason gRPC failed to achieve its goal. Since I was closely involved with the project, I wanted to rebut some misconceptions I see posted a lot, and warn future protocol designers against the mistakes we made.
Mini History of gRPC’s Origin
gRPC was reared by two parents trying to solve similar problems:
- The Stubby team. They had just begun the next iteration of their RPC system, which was used almost exclusively throughout Google. It handled 10^10 queries per second in 2015. Performance was a key concern.
- The API team. This team owned the common infrastructure serving (all) public APIs at Google. The primary value-add was converting REST+JSON calls to Stubby+Protobuf. Performance was a key concern.
The push to Cloud was coming on strong from the top, and the two teams joined forces to ease communication from the outside world to the inside. Rather than boil the ocean, they decided to reuse the newly minted HTTP/2 protocol. Additionally, they chose to keep Protobuf as the default wire format, but allow other encodings too. Stubby had tightly coupled the message, the protocol format, and custom extensions, making it impossible to open source just the protocol.
Thus, gRPC would allow intercommunication between browsers, phones, servers, and proxies, all using HTTP semantics, and without forcing the entirety of Google to change message formats. Since message translation would no longer be needed, high-speed communication between endpoints became tractable.
HTTP, HTTP/1.1, and HTTP/2
HTTP is about semantics: headers, messages, and verbs. HTTP/1.1 is a mix of a wire format plus those semantics (RFCs 7230-7235). gRPC tries to keep the HTTP semantics while upgrading the wire format. Around 2014-15, SPDY was being tested by Chrome and GFE as a workaround for problems with HTTP/1.1. Specifically:
- Most browsers limit the number of connections to a domain to between 2 and 6. This means there can be at most 2-6 active requests at a time.
- Pipelining breaks many devices that neither the end user nor the server controls. A failure in a pipelined request causes the entire connection to be severed.
- Head-of-line blocking. A slow response in a pipeline blocks other, already-ready responses from being delivered.
- Authentication tokens, cookies, and other headers have become enormous. The headers often exceed the size of the body.
Based on the promising improvements seen in the SPDY experimentation, the protocol was formalized into HTTP/2. HTTP/2 only changes the wire format, but keeps the HTTP semantics. This allows newer devices to downgrade the wire format when speaking with older devices.
As an aside, HTTP/2 is technically superior to WebSockets. HTTP/2 keeps the semantics of the web, while WS does not. Additionally, WebSockets suffers from the same head-of-line blocking problem HTTP/1.1 does.
Those Contemptible Trailers
Most people do not know this, but HTTP has had trailers in the specification since 1.1. The reason they are so uncommonly used is that most user agents don’t implement them, and don’t surface them to the JS layer.
Several events happened around the same time, which led to the bet on requiring trailers:
- HTTP/1.1 had semantic support for trailers.
- HTTP/2 had just been minted, and had wire support for trailers.
- The fetch API had just added support for trailers.
The thinking went like this:
- Since we are using a new protocol, any devices that use it will need to upgrade their code.
- When they upgrade their code, they will need to implement trailer support anyways.
- Since HTTP/2 effectively mandates TLS (browsers will only speak it over a secure connection), it is unlikely middleboxes will see, let alone error on, unexpected trailers.
Why Do We Need Trailers At All?
So far, we’ve only talked about whether it’s possible to use trailers, not whether we should use them. HTTP went over two decades without needing them, so why put such a big risk into the gRPC protocol?
The answer is that trailers resolve an ambiguity. Consider the following HTTP conversation:
GET /data HTTP/1.1
Host: example.com
HTTP/1.1 200 OK
Server: example.com
abc123
In this flow, what was the length of the /data resource? Since we don’t have a Content-Length, we can’t be sure the entire response came back. If the connection was closed, does that mean the response succeeded or failed? We can’t tell. And since streaming is a primary feature of gRPC, we often won’t know the length of the response ahead of time. HTTP aficionados are probably feeling pretty smug right now: “Why don’t you use Transfer-Encoding: chunked?” That too is insufficient, because errors can happen late in the response cycle.
Consider this exchange:
GET /data HTTP/1.1
Host: example.com
HTTP/1.1 200 OK
Server: example.com
Transfer-Encoding: chunked
6
abc123
0
Suppose the server was in the middle of streaming a chat room message back to us, and there is a reverse proxy between our user agent and the server. The server sends chunks back to us, but after sending the first chunk of 6 bytes, it crashes. What should the proxy send back to us? It’s too late to change the response code from 200 to 503. If there were pipelined requests, all of them would need to be thrown away too. The proxy’s only remaining way to signal failure is to sever the connection, but connections cost a lot to set up, and this is an arguably recoverable scenario that the proxy would rather not pay that price for.
Hopefully this illustrates the ambiguity between successful, complete responses, and a mic-drop. What we need is a clear sign the response is done, or a clear sign there was an error.
Trailers are this final word, where the server can indicate success or failure in an unambiguous way.
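To make that final word concrete, here is a sketch of how the chunked exchange above could end with a trailer. The grpc-status name and its “0 means OK” convention come from the gRPC protocol; the rest of the exchange is illustrative (gRPC itself carries the equivalent as a final HTTP/2 HEADERS frame rather than HTTP/1.1 chunking):
GET /data HTTP/1.1
Host: example.com
HTTP/1.1 200 OK
Server: example.com
Transfer-Encoding: chunked
Trailer: grpc-status
6
abc123
0
grpc-status: 0
If the server had crashed mid-stream, the client would either see a non-zero grpc-status or no trailer at all before the connection dropped; either way, there is no ambiguity about whether the response completed successfully.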
Trailers for JSON vs. Protobuf
While gRPC is definitely not Protobuf-specific, it was created by people who had been burned by Protobuf’s encoding. That encoding probably had a hand in the need for trailers, because it’s not obvious when a Proto is finished. Protobuf messages are a concatenation of Key-Length-Value fields. Because of this structure, it’s possible to concatenate two Protos together and the result is still valid. The downside is that there is no obvious point at which the message is complete. An example of the problem:
syntax = "proto3";

message DeleteRequest {
  string id = 1;
  int32 limit = 2;
}
The wire format for an example message looks like:
Field 1: "zxy987"
Field 2: 1
A program can override a value by appending another field onto the end:
Field 2: 1000
The concatenation would be:
Field 1: "zxy987"
Field 2: 1
Field 2: 1000
Which would be interpreted as:
Field 1: "zxy987"
Field 2: 1000
This makes encoding messages faster, since there is no size
field at the
beginning of the message. However, there is now a (mis-)feature where Protos
can be split or joined along KLV boundaries.
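As a rough sketch of that (mis-)feature, assume the DeleteRequest message above has been compiled with protoc into a hypothetical delete_request_pb2 Python module. Concatenating two serialized messages still parses as one valid message, and a prefix cut at a KLV boundary parses cleanly too:

```python
# Sketch only: assumes protoc generated delete_request_pb2 from the
# DeleteRequest message defined above.
from delete_request_pb2 import DeleteRequest

first = DeleteRequest(id="zxy987", limit=1)
override = DeleteRequest(limit=1000)

# Joining two encoded messages is still a valid encoding; the later
# occurrence of field 2 (limit) wins.
joined = first.SerializeToString() + override.SerializeToString()
merged = DeleteRequest()
merged.ParseFromString(joined)
print(merged.id, merged.limit)  # zxy987 1000

# Splitting at a KLV boundary also parses without complaint: the first
# 8 bytes are exactly the field-1 KLV ("zxy987"), so the receiver has
# no way to notice that limit never arrived.
partial = DeleteRequest()
partial.ParseFromString(first.SerializeToString()[:8])
print(partial.id, partial.limit)  # zxy987 0
```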
JSON has the upper hand here. A JSON message has to end with a closing curly brace (}). If we haven’t seen the final curly brace and the connection hangs up, we know something bad has happened. JSON is self-delimiting, while Protobuf is not. It’s not hard to imagine that trailers would have been less of an issue if the default encoding had been JSON.
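A quick sketch of that difference, using Python’s standard json module (the truncation point is arbitrary):

```python
import json

body = '{"id": "zxy987", "limit": 1000}'

json.loads(body)       # complete message: parses to {'id': 'zxy987', 'limit': 1000}
json.loads(body[:20])  # cut off mid-stream: raises json.JSONDecodeError
```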
The Final Nail in gRPC’s Trailers
Trailers were officially added to the fetch API, and all major browsers said they would support them. The authors were part of the WHATWG, and worked at the companies that could actually put them into practice. However, Google is not one single company, but a collection of independent and distrusting companies. While the point of this post is not to point fingers, a single engineer on the Chrome team decided that trailers should not be surfaced up to the JS layer. You can read the arguments against it, but the short version is that there was some fear of semantic differences causing security problems. For example, if a Cache-Control header appears in the trailers, does it override the one in the headers?

I personally found this reason weak, and offered a compromise of treating trailers as semantics-free key-values surfaced up to the fetch layer. Whether that was because I was wrong, or because I failed to make the argument well, I strongly suspect organizational boundaries had a substantial effect. The Area Tech Leads of Cloud also failed to convince their peers in Chrome, and as a result, trailers were ripped out.
Lessons for Designers
This post hopefully exposed why trailers were included, and why they ultimately didn’t work out. I left the gRPC team in 2019, but I still think fondly of what we created. There are gobs of things the team got right; unfortunately, this one mistake ended up being its demise. Some takeaways:
- Organizational problems are harder than technological ones. Solve the harder problems first. If we had met with the Chrome team years earlier, we could have designed around this roadblock. As the saying goes, “Weeks of working can save hours of planning”.
- Updating code is nearly impossible. Compatibility with the existing system matters more than all the features and performance improvements. The best protocol is the one you can use.
- Focus on customers. Despite locking horns with other orgs, our team had a more critical problem: we didn’t listen to early customer feedback. We could have modified the servers and clients to speak an updated version of the protocol that obviated the need for trailers (there’s even room in the gRPC frame for it!). Ultimately, it was our lack of sympathy that sank us.