
What are we solving?
When you are working in a cloud-native environment, you cannot keep making the assumptions that you made when you were working with a single-node system. Networks can & will go wrong. A user flow which is powered by an orchestration of multiple services can come to a halt because one of those services shipped a bad configuration & is now unreachable in production. Your system should be able to adapt to these challenges & continue serving user requests, even if it means processing those requests in a degraded form.
Consider the following example. In order to process a request for placing an order, the order service updates the data stores & then calls an internationalization (i18n) service to build the confirmation message in the locale of the user. Now if, due to some issue, the i18n service becomes unresponsive, do you want to fail the original request for placing the order & roll back all changes from the data stores? What happens if this workflow was part of a bigger orchestration? Say the order flow was part of a real-time delivery operation where we have already assigned a delivery agent, charged the customer & placed an order with the merchant? Should we roll back all of those changes? You can see how one service failure can turn into a domino effect. Instead, can there be a message in English which can be used as a fallback response?
Also consider this: do you want to keep calling the i18n service even when all your requests are failing with timeouts or service-unavailable errors? Can there be a check which prevents calls to the i18n service once you have confirmation of its unavailability? This helps in scenarios where a service is timing out because it is overloaded & is in the middle of auto-scaling. If you keep calling the service, there is a higher chance that it will keep running in a degraded state & eventually go down completely.
Circuit Breakers
The circuit breaker pattern is designed specifically to handle these scenarios & make your application fault-tolerant in the face of such failures. I will go through an introduction of the possible states of a circuit breaker & then demo the circuit breaker & retry mechanism using Spring Cloud & Istio.
A circuit breaker is a cloud-native design pattern which provides fault-tolerance & resiliency for your application in case of a failure. Consider a circuit breaker as a system sitting in between 2 services, like a proxy. It observes the request-response behavior & acts once it starts noticing failures. It introduces 3 primary states (a minimal code sketch follows the list):
- Closed: Requests go through normally
- Open: Triggered when the failure threshold is reached. Requests fail fast & you usually end up getting a 503
- Half-Open: A sampling phase to check if the faulty service has recovered. If we get a successful response the state changes to closed, else it goes back to open
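To make these transitions concrete, here is a minimal, illustrative circuit breaker in plain Java. This is not the Resilience4j implementation used later in the post; the class name, threshold & cool-down are made up for this sketch.

import java.time.Duration;
import java.time.Instant;

// Illustrative only: trips open after a fixed number of consecutive failures
// & allows a probe request after a cool-down period.
public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    // Closed & half-open allow requests; open fails fast until the cool-down elapses.
    public synchronized boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(openDuration))) {
            state = State.HALF_OPEN; // let a probe request through
        }
        return state != State.OPEN;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // start failing fast
            openedAt = Instant.now();
        }
    }
}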

With circuit breakers, you rely on some form of fallback behavior, such as building the response in English instead of calling the i18n service. This nudges every service to evaluate its dependencies & how it will act if one of those dependencies is unreachable. It also provides breathing room for services when they get overloaded, as they can rely upon circuit breakers to cut the incoming traffic & give them enough time to recover. You can also set a retry policy which defines how many times your service is going to retry a request & how those retry attempts are spread over time. Let's see this in action through our over-complicated demo application.
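Before jumping into the demo, here is a quick standalone illustration of such a retry policy using Resilience4j's backoff support. The i18n call here is simulated & the English fallback is hard-coded; it is a sketch of the idea, not code from the demo.

import java.util.function.Supplier;

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class RetryPolicyExample {

    public static void main(String[] args) {
        // 3 attempts in total, spread out with exponential backoff: 100ms, then 200ms
        RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)
            .intervalFunction(IntervalFunction.ofExponentialBackoff(100, 2.0))
            .build();
        Retry retry = Retry.of("i18nService", config);

        Supplier<String> confirmation = Retry.decorateSupplier(retry, RetryPolicyExample::callI18nService);
        String message;
        try {
            message = confirmation.get();
        } catch (Exception e) {
            message = "Your order has been placed."; // English fallback once all attempts fail
        }
        System.out.println(message);
    }

    private static String callI18nService() {
        // Stand-in for the remote call; always fails here to demonstrate the retries & the fallback
        throw new IllegalStateException("i18n service unavailable");
    }
}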
Demo Application
For our demo application, we are looking at a reception desk in a hospital where patients sign up. In order to register a patient, the system calls a health-record service to get the patient's updated profile. The health-record service in turn calls an insurance-service to fetch the plans that the patient's insurance supports. Here is a simple diagram for this user flow:

Now how should our system act if the insurance-service is down? In the absence of a circuit breaker, we will keep calling the insurance-service & receiving error responses. This will also delay the recovery of the insurance-service, as any new nodes will be bombarded with retries of failed requests.
Should we block the sign-ups? That would result in a huge queue at the reception desk, as sign-ups stay blocked until the insurance-service comes back up. A circuit breaker helps us here by short-circuiting the communication between the 2 services while we rely on a fallback insurance plan to unblock patient sign-ups. These sign-ups can later be retried once the insurance-service comes back up. Let's see this in action below. You can check out the starter code for these 2 services here.
Spring Cloud
Spring Cloud provides circuit breaker support through its integration with the Resilience4j library. I have set up the circuit breaker in the health-record service so that it monitors responses from the insurance-service & moves the state to open if 50% of the requests fail within a sliding window of 10 requests. There is also a retry configuration which attempts the request to the insurance-service up to 3 times in total before failing. Spring Cloud makes this straightforward with the following config definition:
@Configuration
public class CircuitBreakerConfiguration {

    @Bean
    public Customizer<Resilience4JCircuitBreakerFactory> defaultCustomizer() {
        return factory -> factory.configureDefault(
            id -> new Resilience4JConfigBuilder(id)
                .timeLimiterConfig(TimeLimiterConfig.custom().timeoutDuration(Duration.ofSeconds(4)).build())
                .circuitBreakerConfig(CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)
                    .waitDurationInOpenState(Duration.ofSeconds(2))
                    .slidingWindowSize(10)
                    .build())
                .build());
    }

    @Bean
    public RetryRegistry retryRegistry() {
        RetryConfig retryConfig = RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofMillis(100))
            .retryOnException(throwable -> true)
            .ignoreExceptions(InsuranceVerificationException.class)
            .build();
        return RetryRegistry.of(retryConfig);
    }
}
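The block above sets a single default that applies to every circuit breaker the factory creates. If you want different settings for a specific circuit breaker, the factory also accepts per-id configuration; here is a minimal sketch reusing the "insuranceService" id that appears in the next snippet (the values simply mirror the defaults above).

@Bean
public Customizer<Resilience4JCircuitBreakerFactory> insuranceServiceCustomizer() {
    // Applies only to the circuit breaker created with the id "insuranceService"
    return factory -> factory.configure(
        builder -> builder
            .timeLimiterConfig(TimeLimiterConfig.custom()
                .timeoutDuration(Duration.ofSeconds(4))
                .build())
            .circuitBreakerConfig(CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .slidingWindowSize(10)
                .build()),
        "insuranceService");
}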
Now while calling the insurance-service endpoint, we make use of the circuit breaker & retry mechanism as shown below. We also provide a fallback response in case we end up with 5XX errors for these requests:
public RegisterResponse register(
        String firstName, String lastName, int age, String controlNumber, String policyNumber) {
    VerificationRequest verificationRequest = new VerificationRequest(controlNumber, policyNumber, LocalDate.now());
    CircuitBreaker circuitBreaker = circuitBreakerFactory.create("insuranceService");
    Retry retry = retryRegistry.retry("insuranceServiceRetry");
    return circuitBreaker.run(
        () -> retry.executeSupplier(() ->
            callVerificationService(verificationRequest, firstName, lastName, age, controlNumber, policyNumber)),
        throwable -> getFallbackResponse(firstName, lastName, age, controlNumber, policyNumber, throwable));
}
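The getFallbackResponse method referenced above is not shown in the snippet; here is a rough sketch of what it could look like. The RegisterResponse constructor & the placeholder plan name are assumptions made for illustration, not the exact code from the demo.

private RegisterResponse getFallbackResponse(
        String firstName, String lastName, int age, String controlNumber, String policyNumber, Throwable throwable) {
    // The insurance-service is unreachable or failing, so register the patient with a
    // placeholder plan to unblock the sign-up & verify the real plan once the service recovers
    return new RegisterResponse(
            firstName, lastName, age, controlNumber, policyNumber, "DEFAULT_PLAN_PENDING_VERIFICATION");
}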
Let's take a look at this circuit breaker in action. In order to demonstrate failures, I have added a Spring filter which returns a failure response based on a config that I can toggle externally. In the demo below we can see the retry behavior, as the filter gets invoked 3 times for every request before we land on the fallback response.
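The filter itself is not shown above; here is a rough sketch of how such a toggleable failure-injection filter could look. The flag name & the way it is wired are assumptions for illustration, not the exact code from the demo.

import java.io.IOException;

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

// Sketch of a toggleable failure-injection filter (use javax.servlet.* on older Spring Boot versions)
@Component
public class FailureInjectionFilter extends OncePerRequestFilter {

    // Hypothetical flag; the demo may read its toggle differently (e.g. from a refreshable config source)
    @Value("${failure.injection.enabled:false}")
    private boolean failureInjectionEnabled;

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain filterChain)
            throws ServletException, IOException {
        if (failureInjectionEnabled) {
            // Simulate the downstream dependency being unavailable
            response.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "Injected failure");
            return;
        }
        filterChain.doFilter(request, response);
    }
}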
Let's also see what happens when we cross the failure threshold from the circuit breaker's perspective. In the following demo, you will see that the circuit breaker is initially in the closed state & then, after repeated failures, it changes to the open state.
You can find all the code for this demo on this GitHub branch.
Istio
Now, as per the theme of this series, we remove everything that Spring provides & try to replicate the same behavior in the Kubernetes world. In order to achieve the same circuit breaker & retry functionality, we make use of Istio. Please note that Istio is a very powerful tool & I am using only a small slice of it here to set up retries. Istio adds a proxy in front of our services, so retries happen at the network layer instead of the application layer.
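That proxy is the Envoy sidecar, which has to be injected into the service pods. One common way to get it (an assumption about how the demo cluster is set up) is to label the namespace for automatic sidecar injection:

apiVersion: v1
kind: Namespace
metadata:
  name: circuit-breaker
  labels:
    # Tells Istio to automatically inject the Envoy sidecar into pods created in this namespace
    istio-injection: enabled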
In order to set up retries, I have defined a gateway which routes requests to the health-record service. The gateway definition is as below:
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: health-record-gateway
  namespace: circuit-breaker
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: health-record-vs
  namespace: circuit-breaker
spec:
  hosts:
  - "*"
  gateways:
  - health-record-gateway
  http:
  - route:
    - destination:
        host: health-record-service
        port:
          number: 80
Then there is a destination rule for the insurance-service, whose connection-pool limits & outlier detection act as the circuit breaker, along with a virtual service that defines the retry configuration: 2 retry attempts in addition to the original attempt.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: insurance-service-destination
  namespace: circuit-breaker
spec:
  host: insurance-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutiveErrors: 1
      interval: 10s
      baseEjectionTime: 10s
      splitExternalLocalOriginErrors: true
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: insurance-service-vs
  namespace: circuit-breaker
spec:
  hosts:
  - insurance-service
  http:
  - route:
    - destination:
        host: insurance-service
    retries:
      attempts: 2
      perTryTimeout: 2s
      retryOn: 5xx
      retryRemoteLocalities: true
    timeout: 10s
In order to see this in the demo, I have added a debug log to the Spring filter & we can see the log getting triggered thrice for every request we send to the insurance-service.
Failure injection in this case happens through a ConfigMap which the insurance-service consumes & continuously polls for the updated value. I re-apply the ConfigMap definition with a changed value & we can see that the failure injection starts working as expected. Similar failure injection behavior can also be achieved natively through Istio. Here is a demo where we can see failure injection in action while changing the config value through the ConfigMap.
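For reference, here is roughly what the Istio-native alternative could look like using HTTP fault injection. This is a sketch rather than a manifest from the demo; in practice you would add the fault block to the existing insurance-service-vs instead of creating a second virtual service for the same host.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: insurance-service-fault-vs
  namespace: circuit-breaker
spec:
  hosts:
  - insurance-service
  http:
  - fault:
      abort:
        # Fail half of the requests with a 503 instead of forwarding them
        httpStatus: 503
        percentage:
          value: 50
    route:
    - destination:
        host: insurance-service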
You can find all the code for the Istio demo on this GitHub branch.
Conclusion
Circuit breakers should be considered one of the primitives of working in a cloud-native ecosystem. Every service should act like a state machine that can tell you its expected behavior when it faces a failure in one of its dependencies. A service also shouldn't become a bottleneck that prevents its dependencies from recovering from a failure. You can additionally configure timeouts for each retry & for the complete retry mechanism.
Hope you were able to get something useful out of this post. Next, I am going to write about health-checks & deep-checks to ensure the reliability of systems in a cloud-native ecosystem. Happy learning!
