Clarifying the Saga pattern

By kellabyte  //  Architecture  //  10 Comments

As you build more advanced solutions, you may find the need to do collaborative processing that goes across multiple systems. Since failures are normal in these environments, some activities may fail while others succeed, which will yield inconsistent outcomes. There are mechanisms to handle these situations such as coordinated 2-phase commit (distributed transactions), but in the world of higher scale and cloud computing, these options are no longer suitable.

The Saga interaction pattern was designed to handle these failures. A Saga is a long-lived transaction distributed as a series of steps that may interleave. Each step has an associated compensating transaction, and together these provide a compensation path across databases when a fault occurs. That path may or may not compensate the entire chain back to the originator.

In the 1987 paper that introduced the Saga pattern, Hector Garcia-Molina presented two possible implementations. In the first, the Saga is implemented directly in the DBMS (database management system) alongside the transaction coordinator, so the Saga log and the database transaction log can be merged. In the second, the Saga is implemented on top of an existing DBMS using save points.

The original paper intended Sagas to be used within or on top of a database. A Saga is useful in other contexts as well, but this doesn’t change the fact that a Saga is about compensation across systems.

A Saga is a set of rules for routing a job to multiple collaborating parties, and allowing these parties to backtrack and/or take corrective action in the case of failure.

There are frameworks and examples out there using the name Saga that don’t exhibit the properties defined by Hector Garcia-Molina and instead resemble a state machine or workflow. A Saga may be built on top of a workflow, and a workflow is built on top of a state machine, but state machines and workflows are not Sagas.

State Machine

Wikipedia states

The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition.

A state machine has a set of defined states. Each state has a set of defined operations available.

An example is a car alarm. When the car alarm is engaged, the possible operations are to disengage the alarm or trip the alarm.
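
The car alarm can be sketched as a small transition table. This is only an illustration: the `TRIPPED` state, the `engage` operation and the transitions out of any state other than engaged are assumptions added to make the example complete, and the names `AlarmState`, `TRANSITIONS` and `CarAlarm` are hypothetical.

```python
from enum import Enum, auto


class AlarmState(Enum):
    DISENGAGED = auto()
    ENGAGED = auto()
    TRIPPED = auto()  # assumed state, not named in the text


# For each state, the operations available and the state each one leads to.
# Only the ENGAGED row comes from the example above; the rest is assumed.
TRANSITIONS = {
    AlarmState.DISENGAGED: {"engage": AlarmState.ENGAGED},
    AlarmState.ENGAGED: {
        "disengage": AlarmState.DISENGAGED,
        "trip": AlarmState.TRIPPED,
    },
    AlarmState.TRIPPED: {"disengage": AlarmState.DISENGAGED},
}


class CarAlarm:
    def __init__(self):
        # The machine is in exactly one state at a time.
        self.state = AlarmState.DISENGAGED

    def available_operations(self):
        return sorted(TRANSITIONS[self.state])

    def fire(self, operation):
        """Apply an operation; reject it if the current state does not allow it."""
        allowed = TRANSITIONS[self.state]
        if operation not in allowed:
            raise ValueError(f"{operation!r} is not valid in state {self.state.name}")
        self.state = allowed[operation]
        return self.state
```

After `fire("engage")`, `available_operations()` lists only `disengage` and `trip`, which matches the description of the engaged alarm.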


Workflow

Kevin Junghans describes a workflow as

A workflow represents a sequence of activities. The transition between each activity, or step, occurs when a previous activity is completed. Workflows can have decisions on transitions that can cause branching to other activities. Workflows are commonly used to depict business processes.

Kevin continues to state

The main difference between our state machine and activity diagram (i.e. workflow) is that the focus is on actions instead of states and the transitions occur when an action is completed, instead of when an event occurs.

Wikipedia states

In Service-oriented architectures an application can be represented through an executable workflow, where different, possibly geographically distributed, service components interact to provide the corresponding functionality, under the control of a Workflow Management System.

A workflow can be built on top of a state machine and is similar to a program in that it has a defined set of steps that can alter the control flow based on the results of each step execution.

An example is creating an IT ticket because your computer won’t boot. You submit a ticket into the system, the system emails the IT staff, the IT staff fixes your computer, and the ticket is closed.
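
The ticket example can be sketched as a workflow in which each activity returns the name of the next one. This is a minimal illustration, not a workflow engine: the function names, the `ticket` dictionary and the retry decision in `fix_computer` are assumptions added to show a branching transition.

```python
def submit_ticket(ticket):
    ticket["status"] = "submitted"
    return "notify_staff"


def notify_staff(ticket):
    ticket["notified"] = True  # stand-in for emailing the IT staff
    return "fix_computer"


def fix_computer(ticket):
    ticket["attempts"] = ticket.get("attempts", 0) + 1
    fixed = ticket["attempts"] >= ticket.get("attempts_needed", 1)
    # Decision on the transition: repeat the repair or move on to closing.
    return "close_ticket" if fixed else "fix_computer"


def close_ticket(ticket):
    ticket["status"] = "closed"
    return None  # no further activity: the workflow is finished


ACTIVITIES = {
    "submit_ticket": submit_ticket,
    "notify_staff": notify_staff,
    "fix_computer": fix_computer,
    "close_ticket": close_ticket,
}


def run_workflow(ticket, start="submit_ticket"):
    """Run activities one after another; each transition fires when a step completes."""
    history = []
    step = start
    while step is not None:
        history.append(step)
        step = ACTIVITIES[step](ticket)
    return history
```

The focus is on the actions and their order. Nothing here describes how to undo a step, which is the part a Saga adds.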

The Process Manager Pattern

The Process Manager pattern from the Enterprise Integration Patterns book by Gregor Hohpe is a workflow pattern and more closely resembles the implementations we are seeing in the community that are incorrectly being labelled Sagas.


Saga

A Saga is a distribution of multiple workflows across multiple systems, each providing a path (fork) of compensating actions in the event that any of the steps in the workflow fails.


Figure 1.  Saga compensating due to a failed transaction (T4).

In the 1992 publication ACTA: The SAGA Continues, Panos K. Chrysanthis and Krithi Ramamritham describe the characteristics of a Saga as

Sagas have been proposed as a transaction model for long lived activities. A saga is a set of relatively independent (component) transactions T1, T2…Tn which can interleave in any way with component transactions of other sagas. Component transactions within a saga execute in a predefined order which, in the simplest case, is either sequential or parallel (no order).

Each component transaction Ti (0 ≤ i < n) is associated with a compensating transaction CTi. A compensating transaction CTi undoes, from a semantic point of view, any effects of Ti, but does not necessarily restore the database to the state that existed when Ti began executing.
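
The pairing of each component transaction with a compensating transaction can be sketched as follows. This is a single-process illustration of the compensation rule only, not a distributed implementation: `Step`, `run_saga` and the ledger functions are hypothetical names, and the sketch assumes sequential execution and that a failed step leaves no effects of its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    transaction: Callable[[list], None]   # the component transaction
    compensation: Callable[[list], None]  # its compensating transaction


def run_saga(steps, ledger):
    """Run each transaction in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.transaction(ledger)
        except Exception:
            # Assumption: the failed step left nothing behind, so only the
            # steps that completed before it need compensating.
            for done in reversed(completed):
                done.compensation(ledger)
            return False
        completed.append(step)
    return True


def charge(ledger):
    ledger.append(("charge", 100))


def refund(ledger):
    ledger.append(("refund", 100))


def reserve_stock(ledger):
    raise RuntimeError("out of stock")


def release_stock(ledger):
    ledger.append(("release", 1))
```

Running `charge` followed by the failing `reserve_stock` leaves both a charge and a refund entry in the ledger. The effect of the charge is undone semantically, but the ledger is not restored to its original empty state, which is the distinction the quote draws.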

Furthermore, other patterns build on the Saga pattern, such as Kangaroo Transactions (Margaret H. Dunham, 1997). Kangaroo transactions deal with transactions in mobile environments that hop from one base station to another as the mobile unit moves through cells. Similar to the Saga pattern, a kangaroo transaction is designed to compensate in the event of failure.

Arnon Rotem-Gal-Oz states the following about Sagas

Saga is similar to a transaction in the sense that it provides a shared context for an attempt to get a distributed consensus.  Unlike a transaction which insures ACID properties, Sagas are not.

When a saga is aborted the only thing the coordinator can do is pass the status to the participants. Each of the services is responsible to do its best effort to handle the abort (either by rolling back, compensation or whatever)

Workflow is another thing altogether. which keeps a context between calls and means externalizing the decisions on the logic flow from the business logic (usually with a workflow engine). You can use workflows within a service (a pattern I call workflodize) or you can use them externally (a pattern I call orchestrated choreography e.g. BPM).

You can use either form of workflow to support the implementation of a saga but you can also implement sagas without workflows.

Example of a Saga

Clemens Vasters has kindly implemented an example of a Saga that he describes on his blog. I highly recommend reading his explanation of the example. The key things that set it apart from implementations that call themselves Sagas but are not are the lack of centralized coordination and the lack of centralized state. As Arnon Rotem-Gal-Oz stated above: “a shared context”.
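
One way to picture a shared context without a central coordinator is a routing slip that travels with the message. The sketch below is not a reproduction of the linked example. It assumes the context is a slip listing pending and completed participants, and it uses direct method calls and a `network` dictionary as stand-ins for message queues. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingSlip:
    pending: list                                   # participants still to visit, in order
    completed: list = field(default_factory=list)   # participants whose work succeeded
    log: list = field(default_factory=list)         # trace kept only for illustration


class Participant:
    def __init__(self, name, network, fails=False):
        self.name = name
        self.network = network  # stand-in for addressable queues
        self.fails = fails
        network[name] = self

    def handle_forward(self, slip):
        """Do local work, then pass the slip to the next participant."""
        if self.fails:
            slip.log.append(f"{self.name} failed")
            self._route_backward(slip)
            return
        slip.log.append(f"{self.name} did work")
        slip.completed.append(self.name)
        slip.pending.pop(0)
        if slip.pending:
            self.network[slip.pending[0]].handle_forward(slip)

    def handle_backward(self, slip):
        """Compensate local work, then pass the slip further back."""
        slip.log.append(f"{self.name} compensated")
        self._route_backward(slip)

    def _route_backward(self, slip):
        # The slip itself says who completed work; no coordinator is consulted.
        if slip.completed:
            previous = slip.completed.pop()
            self.network[previous].handle_backward(slip)
```

Each participant decides the next hop from the slip alone, so the only state describing the Saga is the slip that moves between them.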

  • Arnon Rotem-Gal-Oz

    Nice explanation of Saga vs. other things.
    Regarding “The Process Manager pattern from the Enterprise Integration Patterns book by Gregor Hohpe is a workflow pattern and more closely resembles the implementations we are seeing in the community that are incorrectly being labelled Sagas.” – I was very surprised to learn that this is becoming prevalent – but I guess that’s happening. Hopefully more posts like yours will help rectify this :)

    Lastly you can hear a little about a saga implementation that is more aligned with the original one in a presentation I gave last year (around the 18min mark)


  • Frank Quednau

    Thank you for this post explaining key differences of Sagas, Workflows and State machines and indeed providing relevant citations.

    Considering the relative complexity of differentiating between those concepts I am surprised people try to discuss that on Twitter in earnest – here’s wishing that more go back to explain their points on personal blogs.

  • Marco Franssen

    Nice post Kelly! Did you find any representative example in .NET/C# yet? Please let me know if you find any… Keep up the good work on sharing your knowledge… Greets from a proud reader….

  • Scooletz

    It’s always pleasant to use a stronger term ‘saga’ for describing weaker implementations ‘process managers’. It’s good to see these two terms being split. The quote from the original paper can be used to check whether we’re discussing a saga or not. The compensation tx, that’s the key.

  • Julien Letrouit

    Little late to the party, but thanks for the clarification! However there are a number of problems in practice. For the Saga to be really distributed, you need all systems to share the same runtime (JVM or .NET or whatever). This is increasingly not a reality in enterprises, especially with the trendy micro-services. And what happens if the service processing the current transaction becomes unresponsive? It means that in practice you need a non-distributed Saga Execution Coordinator that times out the transaction (see for example ). And that means the SEC needs to store some kind of state. But then, the combo Saga + SEC becomes awfully close to a Process Manager… Is there a real way to avoid centralized state in the coordinator?

    • Brent Arias

      A saga implementation is nothing more than a class instance; it is merely an object. The saga object executes on only one machine. If that machine is part of a load-balanced group of machines, then access to the saga object is coordinated through locks (perhaps a shared cache locking mechanism). The logical lifecycle of the saga might last for days, but the actual saga object is active and “locked” for only milliseconds at a time. It is both highly unlikely and undesirable to have the implementation of the saga object be replicated in multiple run-times (JVM or .NET). There is no reason for that.
      The saga object is coordinating both the low-latency request-responses and high-latency events associated with distributed services. The low-latency service client tools are pre-programmed with SLA timeouts (e.g. 50 ms, 100 ms). You can call this “statefulness” if you wish, but that “state” is a welcome, local, low-latency situation. It is ok for the saga object to be locked for a short duration in this fashion.
