Microservices — part 1: Health check API in practice

How do you understand the behavior of an application and troubleshoot problems?

The ability to verify that our application is running correctly is one of the most important things for ensuring that the business works as expected. In this article, I will describe how to handle it, based on my experience working with thousands of microservices.

The main gain is that we can immediately get a notification that something is not working properly and start resolving the incident. Please take a minute to think about how it could be done, and let’s see whether I can surprise you with how complex it can be.

Response from endpoint

This pattern is based on the rule that each service should expose a GET endpoint called /health which returns the service state. A basic JSON response might look like this:

{
  "name": "order-repository-service",
  "status": "UP"
}

As you can see, there is a service name and the current state, called status. Statuses which can be returned:

  • UP: functioning correctly
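A minimal sketch of such an endpoint, assuming Python and only its standard library (the article does not prescribe a technology). The names `build_health_payload`, `HealthHandler`, and `serve`, as well as port 8080, are illustrative choices, not part of the pattern itself:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SERVICE_NAME = "order-repository-service"  # name borrowed from the example response


def build_health_payload() -> dict:
    """Return the basic health document: service name plus current status."""
    return {"name": SERVICE_NAME, "status": "UP"}


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with the JSON health document, 404 for anything else."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(build_health_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def serve(port: int = 8080) -> None:
    """Blocking helper (hypothetical): serve /health until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8080/health` should return the JSON document shown above.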

At first glance, it seems that everything is covered here; in practice, however, it will not be enough. Status DOWN could mean that one of the service's dependencies is unreachable, and the basic response does not say which one. In an emergency, this extends MTTR.

Mean Time to Recovery (MTTR) answers the question: how long does it generally take to restore service when a service incident occurs (e.g., unplanned outage, service impairment)?

This is one of the DORA metrics; more details can be found in my article about them.

To speed up recovery, the health check should be extended with the statuses of the service's dependencies. An example of the extended response with webReferences:

{
  "name": "order-repository-service",
  "status": "DOWN",
  "webReferences": [
    {
      "responseStatusCode": 200,
      "name": "payment-service",
      "status": "UP"
    },
    {
      "responseStatusCode": 500,
      "name": "book-repository-service",
      "status": "DOWN"
    }
  ]
}

As you can see, such a response is extremely readable: we can immediately start checking why book-repository-service is DOWN. Web references should use the same set of statuses as the service state; however, they need to be determined in a slightly different way.
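The extended response can be assembled from the individual dependency results. The sketch below assumes one possible rule, consistent with the example but not stated as mandatory in the article: the service reports DOWN as soon as any web reference is not UP. The function names are hypothetical:

```python
from __future__ import annotations


def overall_status(web_references: list[dict]) -> str:
    """Assumed rule: DOWN if any dependency is not UP, otherwise UP."""
    if any(reference["status"] != "UP" for reference in web_references):
        return "DOWN"
    return "UP"


def failing_dependencies(web_references: list[dict]) -> list[str]:
    """Names of the dependencies to look at first during an incident."""
    return [ref["name"] for ref in web_references if ref["status"] != "UP"]


def build_health(name: str, web_references: list[dict]) -> dict:
    """Combine the service name, the derived status and the dependency list."""
    return {
        "name": name,
        "status": overall_status(web_references),
        "webReferences": web_references,
    }
```

Whether one failing dependency should mark the whole service as DOWN is a design decision; some teams may prefer to treat optional dependencies differently.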

Statuses for web references:

  • UP: Web reference is UP and has responded in a reasonable time

As you can see, I wrote “reasonable time”. The piece of code that decides whether a service is UP should consider the following timeouts:

  • Connect timeout in seconds: the number of seconds the client should wait before giving up on connecting to the health check
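A check of a single web reference with a timeout could be sketched as follows, assuming Python's standard `urllib`. Note that `urlopen`'s `timeout` applies to blocking socket operations in general (including the connection attempt), so it only approximates a pure connect timeout. The function names, the 2-second default, and the `responseTime` format are assumptions made for illustration:

```python
import time
import urllib.error
import urllib.request


def status_from_code(status_code: int) -> str:
    """Assumed rule: any 2xx answer counts as UP, everything else as DOWN."""
    return "UP" if 200 <= status_code < 300 else "DOWN"


def check_web_reference(name: str, url: str, timeout_seconds: float = 2.0) -> dict:
    """Call a dependency's health endpoint and summarise the outcome."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            status_code = response.status
    except urllib.error.HTTPError as error:
        status_code = error.code  # the dependency answered, but with 4xx/5xx
    except OSError:
        # URLError and socket timeouts are OSError subclasses:
        # connection refused, DNS failure, or no answer within the timeout.
        return {"name": name, "status": "DOWN",
                "responseStatusCode": None, "responseTime": None}
    elapsed = time.monotonic() - started
    status = status_from_code(status_code)
    return {
        "name": name,
        "status": status,
        "responseStatusCode": status_code,
        # Report the time only for healthy answers.
        "responseTime": f"{elapsed:.4f}s" if status == "UP" else None,
    }
```

For example, `check_web_reference("payment-service", "http://payment-service/health")` would return a dictionary ready to be placed in `webReferences`; the URL here is a placeholder.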

Based on that, the final response might look like this:

{
  "name": "order-repository-service",
  "status": "DOWN",
  "webReferences": [
    {
      "responseStatusCode": 200,
      "name": "payment-service",
      "status": "UP",
      "responseTime": "0.0191s"
    },
    {
      "responseStatusCode": 500,
      "name": "book-repository-service",
      "status": "DOWN",
      "responseTime": null
    }
  ]
}

This brings additional value: we can always easily check the response time of each service. Let’s move on: what about other dependencies such as databases or queues? To be honest, it is similar to web references; however, there is no guarantee that such a dependency exposes the same kind of endpoint that all webReferences do, so you will probably need extra development for it.

As I don’t want to promote services which have a lot of dependencies, I will drop the previous example and start with a new one:

{
  "name": "products-repository-service",
  "status": "UP",
  "webReferences": [],
  "mysqlDatabaseConnections": [
    {
      "name": "product-schema",
      "status": "UP",
      "responseTime": "0.13432s"
    }
  ],
  "rmqQueues": [
    {
      "name": "product-notification-queue",
      "status": "UP",
      "responseTime": "0.15273423s"
    }
  ]
}

In this example, I added mysqlDatabaseConnections, which covers connections to the MySQL database, and rmqQueues, which covers RabbitMQ queues. As you can see, these entries do not contain, e.g., a response status code, because the connection works differently.
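Since databases and queues have no HTTP status code, one way to treat them uniformly is to wrap each check in a small "probe" callable that either returns or raises. This is only a sketch: `timed_check` and `build_health` are hypothetical helpers, and the actual probes (for instance running `SELECT 1` through a MySQL driver, or opening a RabbitMQ channel) are left to whatever client library the service already uses:

```python
import time
from typing import Callable, Dict


def timed_check(name: str, probe: Callable[[], None]) -> dict:
    """Run a probe; UP with the elapsed time if it returns, DOWN if it raises."""
    started = time.monotonic()
    try:
        probe()
    except Exception:
        return {"name": name, "status": "DOWN", "responseTime": None}
    elapsed = time.monotonic() - started
    return {"name": name, "status": "UP", "responseTime": f"{elapsed:.5f}s"}


def build_health(
    name: str,
    database_probes: Dict[str, Callable[[], None]],
    queue_probes: Dict[str, Callable[[], None]],
) -> dict:
    """Assemble a response with one section per dependency type."""
    databases = [timed_check(n, p) for n, p in database_probes.items()]
    queues = [timed_check(n, p) for n, p in queue_probes.items()]
    all_up = all(check["status"] == "UP" for check in databases + queues)
    return {
        "name": name,
        "status": "UP" if all_up else "DOWN",  # assumed aggregation rule
        "webReferences": [],
        "mysqlDatabaseConnections": databases,
        "rmqQueues": queues,
    }
```

Keeping the probe separate from the timing and status logic means a new dependency type only needs a new probe function.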

This kind of health check structure is clear to read and can really speed up investigations. One extra thing I would recommend adding to the response is the service version. Depending on the technology, it might be read from, e.g., package.json, or from any other source, even set while building the package. Example of a health check response for a service without dependencies:

{
  "name": "customer-service",
  "status": "UP",
  "version": "1.0.0",
  "webReferences": [],
  "mysqlDatabaseConnections": [],
  "rmqQueues": []
}
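Reading the version from a manifest file can be sketched like this. The example assumes a JSON manifest with a top-level "version" key, as in package.json; `read_version` and the "unknown" fallback are illustrative choices, and other technologies would read the version from their own build metadata:

```python
import json
from pathlib import Path


def read_version(manifest_path: str, default: str = "unknown") -> str:
    """Return the "version" field of a JSON manifest, or a default if unavailable."""
    try:
        manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return default  # missing or unreadable manifest should not break /health
    if not isinstance(manifest, dict):
        return default
    return str(manifest.get("version", default))
```

Falling back to a default keeps the health endpoint working even when the manifest is missing in a given environment.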

Notifications

Even if you have the best-implemented health mechanism on all services, for all the technologies you use, the mechanism will be useless if you do not have any notifications. It is a good idea to connect it with a tool that can alert, for example Nagios or Shinken, or a more advanced tool such as xMatters.

A good idea is to combine those tools and add visual dashboards in Grafana, so the team can immediately see that something is wrong, and in parallel send an email or Slack notification.

Additional use

This mechanism is also very useful for a service registry or load balancer: they should not route requests to a failed service instance.
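As a rough illustration of that idea, the sketch below filters instances by their last reported health status before picking one. It assumes each instance is a dictionary with a "status" field taken from its health check; the round-robin selection and the function names are hypothetical and not how any particular registry or load balancer is implemented:

```python
from typing import Optional


def healthy_instances(instances: list) -> list:
    """Keep only the instances whose last health check reported UP."""
    return [instance for instance in instances if instance["status"] == "UP"]


def choose_instance(instances: list, request_number: int) -> Optional[dict]:
    """Round-robin over healthy instances; None when nothing is healthy."""
    candidates = healthy_instances(instances)
    if not candidates:
        return None
    return candidates[request_number % len(candidates)]
```

Returning `None` when no instance is healthy lets the caller decide how to fail, for example by answering with an error instead of routing to a broken instance.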

Tips for implementation

From my experience in microservices architecture, it’s good to:

  • prepare a common library for each technology

Summary

The health check API pattern resolves one of the cross-cutting concerns and is very useful in a microservices architecture. As you can see, it is not that easy to establish all the rules. It can start as a basic endpoint, but if you dig into it a little, it can become quite a complex piece of code, especially if you have many different service dependencies.

Principal Developer in StepStone, trainer in the field of architecture and business-oriented programming, after-hours passionate about ultra mountain marathons
