Health Checking Best Practices
When it’s time to productionize your service, you will need a way to tell whether it is running. In this post, we’ll look at how health checking works, and at some of the possible trade-offs.
What is a Health Check?
Before we go too deep, let’s first talk about what a health check is and is not. I claim:
A health check is a single probe to a service that tells whether the service can handle requests.
This can be as simple as opening a TCP socket to the server, or it could be an API request. It could even be just checking if the process is running. Health checks solve a number of systemic problems:
- Health checking lets a server shut down gracefully. If you want to stop a server in order to roll out a new version, you need a way to stop incoming requests. Shutting down the existing connections is not enough, because new connections could be arriving at the same time. When it is time to stop, a server marks itself as unhealthy and starts closing its existing connections. Once the connections are gone, the server can safely stop without dropping any traffic (see the sketch after this list).
- Health checking lets a server warm up after starting. After rolling out your new server version, the server may need to connect to dependent services. For example, your server may need to open a connection to the auth server or the logging server. If the server is implemented in a JIT-compiled language (like C# or Java), it may also need to warm up by loading and compiling code. When a server starts up, it marks itself as unhealthy. Once all the dependent connections are ready and the server can handle traffic, it marks itself as healthy.
- If a server loses its connection to one of its backends, or if the server is overloaded with work, it can mark itself as unhealthy. This allows it to push back on clients (indirectly): the server can indicate that it is unable to handle traffic in a timely manner.
- Health checking lets the container tell if a server cannot start up. For example, if a bug makes the server hang on start up, if a bad configuration is pushed, or if one of the dependent backends is not reachable, the container can use the failing health checks to terminate the server and possibly retry starting it.
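To make the lifecycle concrete, here is a minimal Go sketch using the grpc-go health package: the server starts out unhealthy, marks itself healthy after warming up, and flips back to unhealthy before a graceful shutdown. The warmUp and waitForShutdownSignal functions are placeholders for whatever your service actually needs.

```go
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func warmUp()                {} // placeholder: connect to auth/logging backends, JIT warm-up, etc.
func waitForShutdownSignal() {} // placeholder: e.g. wait for SIGTERM from the container runtime

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer()
	hs := health.NewServer()
	healthpb.RegisterHealthServer(srv, hs)

	// Start out unhealthy: dependent connections are not ready yet.
	hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

	go func() {
		warmUp()
		// Only now advertise that we can take traffic.
		hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	}()

	go func() {
		waitForShutdownSignal()
		// Stop advertising health first, then drain existing connections.
		hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
		srv.GracefulStop()
	}()

	srv.Serve(lis)
}
```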
Health Checks and Keep-Alives
These two seemingly similar concepts are often used interchangeably, but they actually serve different purposes. A health check tells whether the service can handle requests, while a keep-alive tells whether a particular client is still connected to a server. Keep-alives solve a number of network problems:
- Keep-alives let a client know if the server has disconnected. If a client hasn’t heard from the server recently, the server may just be idle, or the plug may have been pulled. Without periodically pinging the server, the client can’t tell the difference. In gRPC, the default keep-alive interval is 20 seconds.
- Keep-alives also let a server know if the client has disconnected. If a server hasn’t heard from a client recently, the client may be doing other work, or it may have been turned off abruptly. Again, keep-alives let a server know whether it should keep a connection open. Since clients are typically more interested in connection liveness, the server keep-alive interval is 270 seconds.
- If there is a proxy between the client and server, it doesn’t know whether the connections are still active. For example, a NAT doesn’t know if it should keep connection info in memory when it hasn’t seen any traffic recently. Regular keep-alives let the NAT (or other proxy) know that both client and server still want the connection.
- Keep-alives can measure network latency. In gRPC, keep-alives are implemented as HTTP/2 PING frames. These allow gRPC to measure latency independently of RPCs, whose timing may also include app processing time.
As we can see, the difference is that keep-alives are scoped to the connection, while health checks are system-wide. Generally, you want to have both. A future post will talk about how to configure keep-alives.
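Configuration is left for that future post, but as a taste, here is a minimal grpc-go sketch of enabling client-side keep-alives; the address and the exact intervals below are assumptions for illustration, not recommended values.

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Keep-alives are sent as HTTP/2 PING frames on the underlying connection.
	conn, err := grpc.Dial(
		"example.com:50051", // placeholder address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                20 * time.Second, // ping after this much idle time
			Timeout:             5 * time.Second,  // assumed: give up if no ack arrives in time
			PermitWithoutStream: true,             // ping even when no RPCs are active
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```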
Probing
While not the subject of this post, probing is an important part of overall system stability. Like health checking, it too tells if your service is up. However, probing is usually done across all instances of your service, rather than to a single instance. In other words, probes are issued to a load-balanced target, either region-wide or globally. Additionally, the response to a failed probe is different than a failed health check. A failed probe will trigger alerting and notify someone, while a failed health check may just restart a server. Lastly, probes are either white-box (checks the response matches) or black-box (any response is good). Health checks are always black-box.
Option 1: Point-to-Point Health Checks
In the simplest case, a health check is a keep-alive. Your client can send a request to your server periodically to see if it’s still willing to take more traffic. If the server ever becomes unhealthy, the client can close the connection, and try connecting to a different server.
Point-to-point health checks let the client query the server to see if it should still receive traffic. Here “client” includes all interested parties, such as your Kubernetes container. Your server exposes a “health checking” endpoint (such as a gRPC service defined in your .proto file), which can be called by anyone. For example, in Kubernetes this is an HTTP GET request to “/healthz”.
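As a minimal illustration, a “/healthz” endpoint might look like the following Go sketch; the healthy flag, the port, and when the flag gets flipped are assumptions for this sketch.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// healthy is flipped by the rest of the server as its state changes
// (e.g. after warm-up, before shutdown, when a backend is lost).
var healthy atomic.Bool

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if healthy.Load() {
			w.WriteHeader(http.StatusOK) // Kubernetes treats a 2xx response as a passing probe
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	healthy.Store(true) // placeholder: in a real server, set this only after warm-up completes
	http.ListenAndServe(":8080", nil)
}
```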
Pros
- Simple to implement
- Any program can find out if your server is alive, using a standard interface.
Cons
- Expensive and slow. As your service gets more popular, you spin up more service instances, more clients connect, and the CPU and network load grows. For C clients and S servers, checking at an interval of H seconds, you end up with C × S / H health checks per second. Imagine your service with 10,000 clients connected to your 10 servers, each client checking each server once every 20 seconds. That’s 5,000 health checks a second! Worse, if most of the clients are idle, you still pay the network and CPU costs.
- No caching. While the healthiness of your server may change infrequently, new clients don’t know that. They have to ask repeatedly, and they cannot share the answer with other clients.
- Hug of death. In order for a client to tell if a server is healthy, it has to connect and ask. When the server is under heavy load, these extra connections and health checks become even more taxing.
The problems with point-to-point health checking become more obvious as your service comes under higher load, and they are not at all obvious when you start out. This is the scary part: you don’t know about the problems until it’s too late, and you are in the middle of a crisis.
Minor Cons
There are a few other issues that depend on how point-to-point health checks are implemented. The following points may not apply to your setup, but they are worth noting:
- Generalized health checks, such as those made through a standardized interface, can fail to tell you what you want to know. For example, when handling a health check request, the server goes through a different code path than a normal API call would. This means a health check might return healthy, when in fact a subsequent request would fail. The health check for a key-value service may return healthy even though the connection to the auth server is down: because the health check didn’t need to check permissions, the auth server was never queried, and so the server reported that it was healthy. Fixing this behavior is hard, because now the health check service has to know about every single dependent backend connection.
- Access controls on requests are harder. Ideally, you would deny access to any client that doesn’t need it. Who needs access to the health check service? With point-to-point health checking, everyone! Securing the service is now harder than it might otherwise be.
- API compatibility is harder. If the health checking API were to change (say, from a plain HTML response to a JSON one), it would be difficult to upgrade. With point-to-point health checking, any client can query the service, and clients may not upgrade for a long time. Effectively, the health checking API is frozen.
- iPhone and Android clients do more work, wasting battery power and network bandwidth.
Option 2: Centralized Health Checks
To get around some of the problems with the point-to-point model, we can define a centralized health checking service. Rather than each client asking a server whether it is healthy, a single service queries each server. The health checking service passes this information to the load balancer, which can then decide to remove servers from the pool, or add them back when they are ready. At startup, clients query the load balancer to get a list of healthy servers. Because the unhealthy servers won’t be present, clients avoid connecting to servers that don’t want traffic.
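A hypothetical centralized checker could look something like the following Go sketch: it periodically polls each server’s standard gRPC health endpoint and hands the resulting list to the load balancer. The serverAddrs list, updateLoadBalancer function, and intervals are assumptions; a real checker would also reuse connections rather than redialing every round.

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// Placeholders for this sketch: the set of servers to watch, and however
// the healthy set actually gets pushed to your load balancer.
var serverAddrs = []string{"10.0.0.1:50051", "10.0.0.2:50051"}

func updateLoadBalancer(healthy []string) { /* push the healthy set to the LB */ }

func checkOnce(addr string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return false
	}
	defer conn.Close()

	// An empty service name asks about the overall health of the server.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	return err == nil && resp.GetStatus() == healthpb.HealthCheckResponse_SERVING
}

func main() {
	for {
		var healthy []string
		for _, addr := range serverAddrs {
			if checkOnce(addr) {
				healthy = append(healthy, addr)
			}
		}
		updateLoadBalancer(healthy)
		time.Sleep(10 * time.Second) // assumed checking interval
	}
}
```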
While operationally more complex, this does solve many of the above problems. Let’s list them:
Pros
- Servers only receive a limited number of health checks. Health check load scales with the number of health checkers rather than the number of clients.
- Natural caching. The healthiness of a server is indicated by its presence in the load balancing list, which can easily be shared by multiple clients. A single health check is effectively relayed to lots of clients indirectly through the load balancer.
- No avalanching load. Because clients no longer have to connect to a server to see if it’s healthy, they avoid piling extra load onto an already overloaded server.
- Easy to secure, hard to abuse. The health checking service is the only one that needs access to issue health requests, narrowing the scope of who can reach the server.
- Easier to upgrade. Because you are in control of your servers, and you control the health check service, you can evolve the health check API and infrastructure.
Cons
- Operational complexity. Now that there is an additional party in the picture, the system is harder to reason about. The centralized health checking service has to be managed, rolled out, periodically updated, and monitored.
- Leans towards generalized health checking. A centralized health checker will expect all servers to implement the same health check service, which pushes servers away from exercising their normal code paths. Using the example from the point-to-point section, the auth backend may not be queried while building a health check response, even if it is down.
Variations
If your architecture does not include a separate load balancer service, it is possible to have clients query the central health checker directly. This still has some of the benefits of offloading work from the servers and limiting access. The client can combine the health data with the list of servers it knows about to decide which ones to connect to.
It may also be possible to report server healthiness directly to the load balancer, rather than run a separate health checker service. This avoids the operational complexity of an additional binary, in exchange for increasing the load balancer’s responsibilities.
Make Health Checks Look Like Real Requests
Regardless of which option you go with, strongly consider using a real, idempotent request to check health. Doing so raises the fidelity of the response, because it exercises the same code paths a normal request would.
If making a custom request is difficult, consider having the server issue a real request to itself. For example, if our key-value service gets a health check request, it could issue a lookup request back to itself to see if it is healthy. Thus, each server is also its own client, in a way. If you take this approach, make sure to propagate the original requestor’s identity and credentials, so that your server doesn’t escalate the privileges of the health check.
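Here is a hedged Go sketch of that idea, assuming a hypothetical key-value service; the kvClient interface and the probe key are stand-ins, not a real API. The health handler issues a real lookup back through the server’s own client stub and forwards the caller’s metadata so no privilege is gained.

```go
package kv

import (
	"context"

	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/metadata"
)

// kvClient stands in for the generated client stub of a hypothetical
// key-value service; only a cheap, idempotent Get call matters here.
type kvClient interface {
	Get(ctx context.Context, key string) (string, error)
}

// selfCheckServer implements grpc.health.v1.Health by issuing a real
// request back to its own server.
type selfCheckServer struct {
	healthpb.UnimplementedHealthServer
	self kvClient
}

func (s *selfCheckServer) Check(ctx context.Context, req *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	// Forward the original requestor's metadata (e.g. auth tokens) unchanged,
	// so the self-issued request runs with the caller's privileges, not the server's.
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		ctx = metadata.NewOutgoingContext(ctx, md)
	}

	// "health-probe" is a placeholder key; any real, idempotent lookup works.
	if _, err := s.self.Get(ctx, "health-probe"); err != nil {
		return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_NOT_SERVING}, nil
	}
	return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_SERVING}, nil
}
```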
Always make sure to use authenticated requests. If normal client traffic has to include OAuth tokens to make queries, then the health checks should include them too. Despite our best efforts, requests sometimes do have side effects, and an unauthenticated path is a security risk you don’t need. Additionally, if the auth checks fail, it means your clients will likely be getting failure responses too.
Consider using gRPC!
gRPC has support for a standardized health checking service that fits the centralized model. It includes a protobuf definition which can be used directly. gRPC also has full support for keep-alives, so you can get both at the same time.
Conclusion
Health checking is tricky to do correctly, but it can be tamed with the right service setup. Using a centralized health checking service with service-specific health checks provides the most useful, stable healthiness data.