Cloud Bits: Dancing Through Failures – Saga Pattern for Resilient Microservices


In most software applications we need to talk to multiple microservices to achieve a single business outcome. An example is performing a checkout on an e-commerce application, which might require charging the customer’s payment method, reserving inventory & adding an order record. In a typical monolithic system these operations can be done in a transactional manner, as we are generally dealing with a single database. But as we move to the world of microservices this becomes a big challenge. In this post I will try to explain the challenges involved in performing transactions across microservices & how the Saga pattern can help us resolve them. Let’s dive in.

Understanding the challenges with distributed transactions

So let’s set the stage. You have an e-commerce application where you are trying to perform a checkout by calling an API on order-service. A successful checkout requires the following tasks to complete successfully:

A record persisted in the database owned by order-service
payment-service successfully charging the customer’s payment method & persisting a record in the database owned by the service
inventory-service successfully reserving the required inventory & persisting a record in the database owned by the service

From the customer’s perspective, all of the above tasks should succeed if the order was successful, or else none of them should affect the application state. We want to avoid scenarios where the customer gets charged successfully but we fail to reserve the inventory, or vice versa. We want atomicity across service boundaries & this poses a big challenge, as we can no longer use the transactional guarantees that a single database gave us in a monolithic application. Without extra care, we can & will end up in an inconsistent state across services. We want a solution which either avoids these inconsistencies in the first place (something like a two-phase commit) or leads to an eventually consistent state.

The Saga pattern gives us a mechanism to eventually return to a consistent state after we have landed in an inconsistent one.

Introducing Saga pattern

In a distributed system we cannot avoid failures, as we are dealing with unreliable networks, multiple service boundaries & applications which can fail independently. The Saga pattern accounts for these failures & approaches the problem of distributed transactions by introducing the concept of compensations. It assumes that failure is inevitable & instead of trying to avoid failures, it plans for what to do when they happen.

So now if one of the services fails, all the other services which performed their original operation successfully are instructed to revert to the original state through a compensation operation. A compensation operation can be calling a payment provider API to refund the original transaction, or releasing a lock on inventory that was reserved in the database. These compensation operations can run whenever the services are available, so this also covers failure scenarios where a service is unavailable right now: it will eventually perform the compensation, leading to a consistent state.
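A minimal sketch of this idea, written in Kotlin to match the code later in the post. `SagaStep`, `runSaga` and the step names are hypothetical and not taken from the demo repository. Each step pairs an action with its compensation, and when a step fails, the compensations of the steps that already completed run in reverse order.

```kotlin
// Hypothetical types for illustration; not taken from the demo code.
class SagaStep(
    val name: String,
    val action: () -> Unit,       // forward operation, throws on failure
    val compensation: () -> Unit  // undoes the forward operation
)

// Runs the steps in order. If one fails, compensates the steps that
// already succeeded, most recent first. Returns true if the saga completed.
fun runSaga(steps: List<SagaStep>): Boolean {
    val completed = ArrayDeque<SagaStep>()
    for (step in steps) {
        try {
            step.action()
            completed.addFirst(step)
        } catch (e: Exception) {
            completed.forEach { it.compensation() }
            return false
        }
    }
    return true
}

// Usage: payment succeeds, inventory reservation fails, payment is refunded.
fun demoCompensation(): List<String> {
    val log = mutableListOf<String>()
    val ok = runSaga(
        listOf(
            SagaStep("payment", { log.add("charge") }, { log.add("refund") }),
            SagaStep(
                "inventory",
                { throw IllegalStateException("out of stock") },
                { log.add("release") }
            )
        )
    )
    log.add("completed=$ok")
    return log // [charge, refund, completed=false]
}
```

This sketch keeps everything in memory and runs the compensations immediately. A real implementation would persist its progress and retry compensations until the target service is reachable.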

One critical point to consider here is that the compensation operation should be idempotent, since in an asynchronous Saga architecture the same message can be delivered or retried more than once. So if inventory-service is reserving an inventory unit, then instead of decrementing a count (where the compensation operation would be to increment the count), it should rather update the state of an inventory record for the reservation. This way the compensation operation can just update the state of that record, which is idempotent. Systems usually end up using an idempotency key which is shared across services to maintain the idempotent nature of transactions.
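The difference between the two approaches can be sketched as follows. This is an illustrative in-memory model, not the demo's inventory-service: `InventoryStore`, `ReservationState` and the key format are assumptions. Because the compensation sets a state rather than incrementing a counter, applying it twice leaves the same result.

```kotlin
enum class ReservationState { RESERVED, RELEASED }

// In-memory stand-in for the inventory-service database (hypothetical).
class InventoryStore {
    private val reservations = mutableMapOf<String, ReservationState>()

    // Reserving again with a key that already has a record changes nothing.
    fun reserve(idempotencyKey: String) {
        if (idempotencyKey !in reservations) {
            reservations[idempotencyKey] = ReservationState.RESERVED
        }
    }

    // Compensation: sets the state instead of incrementing a counter, so a
    // duplicate delivery has no extra effect. Recording RELEASED even when no
    // reservation exists yet also neutralises a reserve that arrives late.
    fun release(idempotencyKey: String) {
        reservations[idempotencyKey] = ReservationState.RELEASED
    }

    fun activeReservations(): Int =
        reservations.values.count { it == ReservationState.RESERVED }
}

// Usage: the compensation for order-1 is delivered twice.
fun demoIdempotentRelease(): Int {
    val store = InventoryStore()
    store.reserve("order-1")
    store.reserve("order-2")
    store.release("order-1")
    store.release("order-1") // duplicate delivery of the compensation
    return store.activeReservations() // 1
}
```

With a plain counter, the duplicate release would have incremented the available stock twice and left the inventory overstated by one unit.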

The Saga pattern can be implemented through either choreography or orchestration. Both have their own set of benefits & challenges, which we will look into in the next two sections.

Saga through choreography

Choreography, as the name suggests, is an implementation of Saga where each component of your system acts on a set of events, just like a performer in a dance responds to the moves of their peers. There is no single component acting as a central point of control & each component responds to the events it receives from other components. This is what an event-driven system looks like: the components of your system are loosely coupled through some form of event queue & the overall transaction is said to be finished once all components have processed their events.

Let’s look at this pattern in terms of implementation, where we decouple the services described above using Kafka. From the client side, the flow looks like this:

The client sends a request to process an order
order-service inserts a record in CREATED state & publishes a message to the order-created topic
The client is returned an order id, which it can use to poll for the order status; alternatively, the order status can be communicated to the client through some form of notification mechanism

Now both payment-service & inventory-service subscribe to the order-created topic & process the order on their end. This processing can result in either a success or a failure, and each service publishes an event to a success or a failure topic accordingly. order-service subscribes to these topics & marks an order as successful once it has received success events for that order from all dependent services.

If it receives even a single failure event, it marks the order as failed & publishes another event to the order-failed topic. All dependent services subscribe to this topic & perform their compensation operation when they receive the event. This way the system eventually returns to a consistent state.
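The bookkeeping that order-service does here can be sketched without Kafka. In this sketch `OrderTracker`, `ServiceResult` and the `publishOrderFailed` callback are hypothetical stand-ins: the callback takes the place of a producer writing to the order-failed topic, and the event shape in the actual demo may differ.

```kotlin
enum class OrderStatus { CREATED, SUCCEEDED, FAILED }

// Hypothetical event shape; the demo's actual events may differ.
data class ServiceResult(val orderId: String, val service: String, val success: Boolean)

class OrderTracker(
    private val requiredServices: Set<String>,
    // Stand-in for a producer publishing to the order-failed topic.
    private val publishOrderFailed: (orderId: String) -> Unit
) {
    private val status = mutableMapOf<String, OrderStatus>()
    private val succeeded = mutableMapOf<String, MutableSet<String>>()

    fun onOrderCreated(orderId: String) {
        status[orderId] = OrderStatus.CREATED
    }

    // Called for every message consumed from the success and failure topics.
    fun onServiceResult(event: ServiceResult) {
        // Ignore unknown orders and orders that already reached a final state.
        if (status[event.orderId] != OrderStatus.CREATED) return

        if (!event.success) {
            status[event.orderId] = OrderStatus.FAILED
            publishOrderFailed(event.orderId)
            return
        }
        val done = succeeded.getOrPut(event.orderId) { mutableSetOf() }
        done.add(event.service)
        if (done.containsAll(requiredServices)) {
            status[event.orderId] = OrderStatus.SUCCEEDED
        }
    }

    fun statusOf(orderId: String): OrderStatus? = status[orderId]
}

// Usage: inventory succeeds, payment fails, so the order is marked failed
// and one message goes to the (simulated) order-failed topic.
fun demoFailedOrder(): Pair<OrderStatus?, List<String>> {
    val failedTopic = mutableListOf<String>()
    val tracker = OrderTracker(setOf("payment", "inventory")) { failedTopic.add(it) }
    tracker.onOrderCreated("42")
    tracker.onServiceResult(ServiceResult("42", "inventory", success = true))
    tracker.onServiceResult(ServiceResult("42", "payment", success = false))
    return tracker.statusOf("42") to failedTopic // (FAILED, [42])
}
```

Ignoring events for orders that are already in a final state means a duplicate or late event cannot flip a failed order back to successful or publish the failure twice.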

Here is a quick demo of the choreography-based Saga for a successful order.

Below is a demo of how the choreography-based Saga handles a failure scenario where the inventory reservation succeeds but the payment fails (due to a fraud check for an amount exceeding a certain threshold). The system reaches an eventually consistent state in which the inventory reservation is rolled back & the order is marked as failed. You can find all the code for this demo on this GitHub branch.

The core benefit of choreography is that your system is loosely coupled & your services remain autonomous. But as the number of services in your application grows, the cognitive complexity of the system can grow quickly, since the flow of a single transaction is spread across many event handlers. All your services need to be onboarded onto a comprehensive observability stack so that you can debug issues in production. Tracing a single transaction also means relying on distributed tracing tools to figure out how your order is processed across network boundaries.

Saga through orchestration

Saga using orchestration relies on an orchestrator component, which can be either external or embedded. Let’s first think about what we really want from this component to achieve our end goal of an eventually consistent system.

The component should be able to perform all the external service calls in the desired order (in parallel or one after another)
In case of a failure we should be able to resume the transaction, not from the very first step but from the last successful step. This requires some form of state management which stores the outputs of successful steps, responses, failures etc. & uses this preserved state while replaying the transaction.
Some form of mechanism to run compensation operations if one of the service calls results in a failure.
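The state-management requirement is the least obvious of the three, so here is a minimal in-memory sketch of it. `StepJournal`, `checkout` and the step names are hypothetical, and real orchestrators persist this state in a database and are considerably more involved. The idea is that each step's output is recorded under a workflow id, so replaying the workflow after a crash skips the steps that already completed. Compensation is left out to keep the sketch short.

```kotlin
// In-memory stand-in for the durable store an orchestrator would use
// (hypothetical; a real engine would persist this in a database).
class StepJournal {
    private val results = mutableMapOf<String, String>()

    // Runs the step at most once per (workflowId, stepName). On replay the
    // saved output is returned instead of calling the service again.
    // If the step throws, nothing is recorded, so it runs again on replay.
    fun runStep(workflowId: String, stepName: String, step: () -> String): String =
        results.getOrPut("$workflowId/$stepName") { step() }
}

// A two-step workflow. `calls` records which service calls were really made.
fun checkout(
    journal: StepJournal,
    workflowId: String,
    calls: MutableList<String>,
    crashAfterPayment: Boolean
): String {
    val payment = journal.runStep(workflowId, "chargePayment") {
        calls.add("chargePayment")
        "payment-ok"
    }
    if (crashAfterPayment) throw IllegalStateException("simulated crash")
    val inventory = journal.runStep(workflowId, "reserveInventory") {
        calls.add("reserveInventory")
        "inventory-ok"
    }
    return "$payment,$inventory"
}

// Usage: the first run crashes after the payment step; the replay resumes
// from the journal and does not charge the customer a second time.
fun demoResume(): List<String> {
    val journal = StepJournal()
    val calls = mutableListOf<String>()
    try {
        checkout(journal, "order-42", calls, crashAfterPayment = true)
    } catch (e: IllegalStateException) {
        // The process "restarts" and replays the same workflow below.
    }
    checkout(journal, "order-42", calls, crashAfterPayment = false)
    return calls // [chargePayment, reserveInventory]
}
```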

There are multiple solutions available which provide this feature set & all of them come under the umbrella of a concept called durable execution (I am planning to do a deep dive on this concept in an upcoming post). Durable execution is a programming paradigm that manages the state of a transaction that is spread across service boundaries while treating failure as a first-class citizen. The paradigm can be implemented either by relying on an external component such as Temporal or by using a library such as DBOS.

For simplicity I will demo an orchestration-based Saga using DBOS. With an orchestrator-based Saga our workflow is simplified to a great extent, as we move away from an event-driven architecture to a system which looks closer to procedural code that is easy to reason about. The overall architecture is also simplified, as your application doesn’t need to manage event state in order to decide on the status of the operation. You also now have a single point of request dispatch, which simplifies your observability setup.

DBOS provides us with the primitives of Workflows & Steps. Even though DBOS does not provide a native Saga solution, we can make use of these primitives to implement our orchestrator-based Saga. In short, a workflow is an operation such as placing an order, which comprises multiple steps such as making a payment or reserving inventory. Compensation operations are also described in the form of steps.

Here is what the code for order processing looks like using a DBOS workflow:

// Excerpt from the workflow class: orderStepService, kafkaTemplate & the topic
// constants are defined elsewhere, and framework imports are omitted.
// The kotlinx.serialization imports are an assumption based on the Json calls.
import java.math.BigInteger
import java.util.concurrent.CompletableFuture
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

@Workflow
fun processOrder(
    orderId: BigInteger,
    inventoryCount: Int,
    amount: Double,
    customerName: String,
    idempotencyKey: String
) {
    // Start both steps in parallel; each step returns true on success
    val paymentFuture = CompletableFuture.supplyAsync {
        orderStepService.chargePayment(
            orderId = orderId,
            idempotencyKey = idempotencyKey,
            amount = amount,
            customerName = customerName,
        )
    }
    val inventoryFuture = CompletableFuture.supplyAsync {
        orderStepService.reserveInventory(
            orderId = orderId,
            customerName = customerName,
            inventoryCount = inventoryCount,
        )
    }

    try {
        // Wait for both steps; join() throws if either one completed exceptionally
        CompletableFuture.allOf(paymentFuture, inventoryFuture).join()

        val paymentCharged = paymentFuture.get()
        val inventoryReserved = inventoryFuture.get()

        if (paymentCharged && inventoryReserved) {
            val orderSucceededEvent = OrderSucceededEvent(orderId = orderId.toString())
            val eventString = Json.encodeToString(orderSucceededEvent)
            kafkaTemplate.send(ORDER_SUCCESSFUL_TOPIC, eventString)
        } else {
            // Hand over to the compensation logic in the catch block
            throw RuntimeException("One or more services failed to complete processing.")
        }
    } catch (_: Exception) {
        // Work out which steps actually succeeded so that only those are compensated
        val paymentWasSuccess = try {
            paymentFuture.isDone && !paymentFuture.isCompletedExceptionally && paymentFuture.get()
        } catch (_: Exception) {
            false
        }
        val inventoryWasSuccess = try {
            inventoryFuture.isDone && !inventoryFuture.isCompletedExceptionally && inventoryFuture.get()
        } catch (_: Exception) {
            false
        }

        // Compensation steps
        if (paymentWasSuccess) {
            orderStepService.reversePayment(orderId)
        }
        if (inventoryWasSuccess) {
            orderStepService.revertInventoryReservation(orderId)
        }

        // Notify downstream consumers that the order failed
        val orderFailedEvent = OrderFailedEvent(orderId = orderId.toString())
        val eventString = Json.encodeToString(orderFailedEvent)
        kafkaTemplate.send(ORDER_FAILED_TOPIC, eventString)
    }
}

With the above workflow, we describe what steps we want to perform & what compensation operations we want to run in case of failures. Let’s take a look at how this workflow operates in a failure scenario where a user tries to reserve inventory exceeding the system threshold. We can see that the payment initially succeeded & was then reverted as part of the compensation operation because we failed to reserve the inventory. You can find all the code for the orchestration-based implementation on this GitHub branch.

Conclusion

The Saga pattern is an excellent tool for solving challenges with transactions across service boundaries. Whether you go with an event-driven architecture using a choreography-based implementation, or rely on an external dependency such as DBOS for an orchestration-based Saga, will depend on your use case.

Hope you got something out of this post. Durable execution has piqued my interest so I am planning to cover it in more detail in upcoming posts. Till then happy learning.
