Microservices — part 1: Health check API in practice

May 20, 2020

How do you understand the behavior of an application and troubleshoot problems?

Being able to verify that our application is running correctly is one of the most important ways to ensure that the business works as expected. In this article, I will describe how to handle it, based on my experience working with thousands of microservices.

The main gain is that we are notified immediately when something does not work properly and can start resolving the incident. Please take a minute to think about how it could be done, and let’s see whether I can surprise you with how complex it can get.

Response from endpoint

This pattern is based on a rule that each service should expose a GET endpoint called /health which returns the service state. A basic JSON response might look like this:

{
    "name": "order-repository-service",
    "status": "UP"
}

As you can see, there is a service name and the current state, called status. The statuses which can be returned:

UP: functioning correctly
DOWN: not functioning correctly
UNKNOWN: the health check failed at some point, so the exact state of the application cannot be determined; the service may be dead or alive.
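To make the basic endpoint concrete, here is a minimal sketch using only the Python standard library. The `basic_health` helper, the `is_healthy` probe, and the choice to answer with HTTP 503 when the service is not UP are my assumptions for illustration; the pattern itself only requires that GET /health returns the state.

```python
import json
from http.server import BaseHTTPRequestHandler

SERVICE_NAME = "order-repository-service"  # name reused from the example above


def basic_health(name, is_healthy):
    """Build the basic response. `is_healthy` is a hypothetical zero-argument probe."""
    try:
        status = "UP" if is_healthy() else "DOWN"
    except Exception:
        # The check itself failed, so the real state cannot be determined.
        status = "UNKNOWN"
    return {"name": name, "status": status}


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health and answers 404 for every other path."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        report = basic_health(SERVICE_NAME, lambda: True)  # placeholder probe
        body = json.dumps(report).encode("utf-8")
        # Assumption: 200 only when UP, 503 otherwise, so monitors that look
        # at the status code can react without parsing the body.
        self.send_response(200 if report["status"] == "UP" else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Passing `HealthHandler` to `http.server.HTTPServer` and calling `serve_forever()` would expose the endpoint; a real service would normally register the route in whatever web framework it already uses.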

At first glance it seems that everything is covered here; in practice, however, it is not enough. Status DOWN could mean that one of the service's dependencies is not reachable, and the basic response does not say which one. In an emergency, that extends the MTTR.

Mean Time to Recovery (MTTR) answers the question: how long does it generally take to restore service when a service incident occurs (e.g., unplanned outage, service impairment)?

This is one of the DORA metrics; more details can be found in my article about them.

To speed recovery up, the health check should be extended with the statuses of the service's dependencies. An example of the extended response with webReferences:

{
    "name": "order-repository-service",
    "status": "DOWN",
    "webReferences": [
      {
          "responseStatusCode": 200,
          "name": "payment-service",
          "status": "UP"
      },
      {
          "responseStatusCode": 500,
          "name": "book-repository-service",
          "status": "DOWN"
      }
    ]
}

As you can see, such a response is extremely readable: we can immediately start checking why book-repository-service is DOWN. Web references should use the same set of statuses as the service itself; however, they are determined in a slightly different way.

Statuses for web references:

UP: Web reference is UP and has responded in a reasonable time
DOWN: Cannot establish a connection to the web reference in a reasonable time, or its response code is other than 200
UNKNOWN: Health check failed at some point, cannot determine the exact state of the web reference

As you can see, I wrote "reasonable time". The code that decides whether a service is UP should take the following timeouts into account:

Connect timeout in seconds — number of seconds client should wait before giving up connecting to health-check
Request timeout in seconds — number of seconds client should wait for health-check request completion

Based on that, the final response might look like this:

{
    "name": "order-repository-service",
    "status": "DOWN",
    "webReferences": [
      {
          "responseStatusCode": 200,
          "name": "payment-service",
          "status": "UP",
          "responseTime": "0.0191s"
      },
      {
          "responseStatusCode": 500,
          "name": "book-repository-service",
          "status": "DOWN",
          "responseTime": null
      }
    ]
}
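The two timeouts and the responseTime field can be sketched roughly like this. It is a minimal example built on Python's `http.client`, and several things in it are assumptions on my side: the function name `check_web_reference`, the default timeout values, plain HTTP only, and treating the request timeout as a limit on each socket operation rather than on the whole request.

```python
import http.client
import time
from urllib.parse import urlsplit


def check_web_reference(name, url, connect_timeout=2.0, request_timeout=5.0):
    """Probe a dependency's health URL and return one webReferences entry."""
    parts = urlsplit(url)
    entry = {
        "responseStatusCode": None,
        "name": name,
        "status": "DOWN",
        "responseTime": None,
    }
    conn = http.client.HTTPConnection(
        parts.hostname, parts.port or 80, timeout=connect_timeout
    )
    started = time.perf_counter()
    try:
        conn.connect()  # bounded by connect_timeout
        # From here on, every socket read/write is bounded by request_timeout.
        conn.sock.settimeout(request_timeout)
        conn.request("GET", parts.path or "/health")
        response = conn.getresponse()
        response.read()
        entry["responseStatusCode"] = response.status
        if response.status == 200:
            entry["status"] = "UP"
            entry["responseTime"] = f"{time.perf_counter() - started:.4f}s"
    except (OSError, http.client.HTTPException):
        # Refused connection, DNS failure, timeout, or a broken HTTP reply:
        # the dependency counts as DOWN.
        pass
    except Exception:
        # Anything else means the check itself misbehaved.
        entry["status"] = "UNKNOWN"
    finally:
        conn.close()
    return entry
```

For example, `check_web_reference("payment-service", "http://payment-service/health")` would return an entry shaped like the ones above; the hostname is of course hypothetical.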

This brings additional value: we can always easily check the response time of each service. Let’s move on: what about other dependencies, like databases or queues? To be honest, it is similar to web references; however, there is no guarantee that such a service exposes a health endpoint the way all webReferences do, so you will probably need extra development for it.

As I don’t want to promote a service which has a lot of dependencies, I will drop the previous example and start with a new one:

{
    "name": "products-repository-service",
    "status": "UP",
    "webReferences": [],
    "mysqlDatabaseConnections": [
      {
          "name": "product-schema",
          "status": "UP",
          "responseTime": "0.13432s"
      }
    ],
    "rmqQueues": [
      {
          "name": "product-notification-queue",
          "status": "UP",
          "responseTime": "0.15273423s"
      }
    ]
}

In this example, I added mysqlDatabaseConnections, which covers connections to the MySQL database, and rmqQueues, which covers RabbitMQ queues. As you can see, these entries do not contain, e.g., a response status code, because the connection works differently.

This kind of health check structure is clear to read and can really speed up investigations. One extra thing which I would recommend adding to the response is the service version. Depending on the technology, it might be read from, e.g., package.json, or from any other source, even generated while building the package. An example of a health check response for a service without dependencies:
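Since databases and queues have no /health endpoint of their own, one way to handle them is to wrap whatever "ping" the client library offers in a timed check and then roll all dependency statuses up into the service status. The sketch below assumes exactly that: `ping` is a hypothetical zero-argument callable (for MySQL it could run `SELECT 1`, for RabbitMQ it could open a channel), and the roll-up rule (any DOWN makes the service DOWN, otherwise any UNKNOWN makes it UNKNOWN) is my assumption, not a fixed part of the pattern.

```python
import time

DEPENDENCY_KEYS = ("webReferences", "mysqlDatabaseConnections", "rmqQueues")


def timed_check(name, ping):
    """Run `ping` (a hypothetical zero-argument callable) and time it."""
    started = time.perf_counter()
    try:
        ping()
    except Exception:
        return {"name": name, "status": "DOWN", "responseTime": None}
    elapsed = time.perf_counter() - started
    return {"name": name, "status": "UP", "responseTime": f"{elapsed:.4f}s"}


def overall_status(report):
    """Roll dependency statuses up into one service status (assumed rule)."""
    statuses = [
        entry["status"]
        for key in DEPENDENCY_KEYS
        for entry in report.get(key, [])
    ]
    if "DOWN" in statuses:
        return "DOWN"
    if "UNKNOWN" in statuses:
        return "UNKNOWN"
    return "UP"


def build_report(name, mysql_pings, queue_pings):
    """Build the full response from {dependency name: ping callable} mappings."""
    report = {
        "name": name,
        "webReferences": [],
        "mysqlDatabaseConnections": [
            timed_check(dep, ping) for dep, ping in mysql_pings.items()
        ],
        "rmqQueues": [timed_check(dep, ping) for dep, ping in queue_pings.items()],
    }
    report["status"] = overall_status(report)
    return report
```

Keeping the ping as a plain callable means the same timing and status code path works for every technology; only the one-line ping differs.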

{
    "name": "customer-service",
    "status": "UP",
    "version": "1.0.0",
    "webReferences": [],
    "mysqlDatabaseConnections": [],
    "rmqQueues": []
}
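Reading the version can be as simple as parsing a manifest file once at startup. The sketch below reads the "version" key from a package.json-style file; the helper name `read_version` and the "unknown" fallback are assumptions made for illustration.

```python
import json
from pathlib import Path


def read_version(manifest_path, default="unknown"):
    """Return the "version" string from a JSON manifest, or `default`."""
    try:
        data = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON: never fail the
        # health check just because the version is unavailable.
        return default
    version = data.get("version") if isinstance(data, dict) else None
    return version if isinstance(version, str) else default
```

A Python service would more likely ask its own package metadata, for example through `importlib.metadata.version`, but the idea is the same: resolve the version once and attach it to every health response.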

Notifications

Even if you have the best-implemented health mechanism on all services for all the technologies you use, the mechanism will be useless if you do not have any notifications. It is best to connect it to a tool which can alert, for example Nagios or Shinken, or a more advanced tool for extended alert handling such as xMatters.

A good idea is to combine those things and add visual dashboards in Grafana, so the team can immediately see that something is wrong, and in parallel send an email or Slack notification.
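As an illustration of how the health response can feed an alerting tool, here is a sketch of the core of a Nagios-style check. Nagios plugins report their result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of text; the `evaluate` function and the message wording are my own assumptions.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(payload):
    """Turn a /health response body into (exit_code, message)."""
    try:
        report = json.loads(payload)
        status = report["status"]
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN - health response could not be parsed"
    name = report.get("name", "service")
    if status == "UP":
        return OK, f"OK - {name} is UP"
    if status == "DOWN":
        # Name the failing web references so the alert is actionable.
        failing = [
            ref.get("name", "?")
            for ref in report.get("webReferences", [])
            if ref.get("status") != "UP"
        ]
        detail = f" (failing: {', '.join(failing)})" if failing else ""
        return CRITICAL, f"CRITICAL - {name} is DOWN{detail}"
    return UNKNOWN, f"UNKNOWN - {name} reported {status}"
```

A small wrapper script would fetch the /health body, print the message, and exit with the returned code; the alerting tool takes care of the rest.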

Additional use

This mechanism is very useful for the service registry or load balancer — they should not route requests to the failed service instance.
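A toy example of that idea: a round-robin balancer that only hands out instances whose last health check said UP. The class and method names are hypothetical, and real registries and load balancers have their own health check configuration; this only shows the routing rule.

```python
class RoundRobinBalancer:
    """Round-robin over instances, skipping any that are not UP."""

    def __init__(self, instances):
        self._instances = list(instances)
        # Until the first health check arrives, an instance is not trusted.
        self._statuses = {instance: "UNKNOWN" for instance in self._instances}
        self._next = 0

    def record_health(self, instance, status):
        """Store the latest status reported by the instance's /health."""
        self._statuses[instance] = status

    def pick(self):
        """Return the next UP instance, or None if there is none."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if self._statuses[candidate] == "UP":
                return candidate
        return None
```

Treating UNKNOWN the same as DOWN for routing is a conservative assumption; some setups prefer to keep routing to instances in an unknown state for a grace period.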

Tips for implementation

From my experience in microservices architecture, it’s good to:

prepare a common library for each technology
the library should be part of a chassis framework (which is basically a kind of template/starter for a service)
the library implementation might depend on the service configuration file, which should be aligned across services
the library should work out of the box and be mandatory for each service

Summary

The health check API pattern resolves one of the cross-cutting concerns and is very useful in a microservices architecture. As you can see, it is not that easy to establish all the rules. It could be a basic endpoint, but if you dig into it a little, it can become quite a complex piece of code, especially if you have many different service dependencies.


Written by Kamil Żurawski