Jun 7

Amortizing de-duplication at read time instead of write time

By kellabyte  //  Architecture, Databases  //  8 Comments

Yesterday I was reading a blog post by Jimmy Bogard where he describes using ACID 2.0 (Associative, Commutative, Idempotent, Distributed) to create a ledger for a point-of-sale system. I’m not going in depth on “ACID 2.0” in this post, but the idea is that you can provide correctness in distributed systems using different techniques.

One part of the post describes idempotency as:

Idempotent – we only record a transaction if we know we haven’t seen that transaction ID before

There are a few problems with this approach that make it difficult to ensure idempotency.

  • It requires the entire history of transaction IDs to be available to query for the de-duplication logic.
  • The list of transaction IDs is unbounded.
  • It requires every write to do a read before the write (see the sketch after this list).
  • It penalizes every transaction commit with the cost of compensating for something that has a low rate of failure. Why make 95% of the commits pay the price for 5% of the problems?
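
For contrast, here is a rough sketch of the check-then-write pattern those points describe. The table and column names are illustrative, and SQLite stands in for whatever store the ledger actually uses:

```python
import sqlite3

# Write-time de-duplication: every commit first queries the ever-growing
# transaction-ID history, which is exactly the unbounded read criticized above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (tx_id TEXT, amount INTEGER)")

def record(tx_id, amount):
    # Read before write: check the full history for this transaction ID.
    seen = conn.execute(
        "SELECT 1 FROM transactions WHERE tx_id = ?", (tx_id,)
    ).fetchone()
    if seen:
        return False  # duplicate; silently drop it
    conn.execute("INSERT INTO transactions VALUES (?, ?)", (tx_id, amount))
    conn.commit()
    return True
```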

In my experience, even single-system databases running OLTP workloads with designs like this aren’t sustainable for very long, because the transaction ID index grows without bound. And if the system does any kind of archiving of old transactions, you can no longer guarantee idempotency.

Providing correctness for a window of time can be a good solution, and cloud services like Azure Service Bus do this to de-dupe messages that arrive within a bounded time window (because you can’t hold every single message ID). However, this is not idempotency: a transaction that alters the state of the system can still arrive outside that window.

I also don’t like the fact that I’m forcing a read before a write. Databases work hard at building append-only internals for performance, and here we throw all of that away by forcing a query before every commit.

Another problem: if this is a highly available system sharded by some property, say user ID, and the shards for a given user ID are down, you may still want to accept the transaction on other shards for the time being. Unfortunately, those nodes don’t have the history to make the de-duplication decision, so the duplicate write will be accepted.

Let me propose another solution.

We don’t actually have to incur the cost of de-duplication at transaction commit time. Instead we can amortize the de-duplication cost at read time. Record the transaction, even the duplicates, and have every query select only unique transaction IDs, which leaves the duplicates out. SQL views, stored procs, or the public API of the service will isolate this private implementation detail (you shouldn’t be doing shared-database integration anyway!).
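
A minimal sketch of what that could look like, again with illustrative names and SQLite standing in for the real store: the write path never reads, and a view is the only surface readers touch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (tx_id TEXT, amount INTEGER)")

# The write path is a blind append: duplicates are accepted and stored as-is.
def record(tx_id, amount):
    conn.execute("INSERT INTO transactions VALUES (?, ?)", (tx_id, amount))
    conn.commit()

# Readers only ever see one row per transaction ID. Because a resent duplicate
# carries the same payload, collapsing on tx_id is safe; the view hides the
# duplicated rows as a private implementation detail.
conn.execute("""
    CREATE VIEW ledger AS
    SELECT tx_id, MIN(amount) AS amount
    FROM transactions
    GROUP BY tx_id
""")

record("tx-1", 100)
record("tx-1", 100)  # duplicate delivery
record("tx-2", 250)

print(conn.execute("SELECT tx_id, amount FROM ledger ORDER BY tx_id").fetchall())
# [('tx-1', 100), ('tx-2', 250)]
```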

There are a few options for dealing with the duplicate data. One approach is that during reads we kick off a “garbage collection” that removes duplicate transaction IDs in the background. Another is a periodic schedule that cleans up duplicate transactions. A further benefit of this approach is that the GC process can report how many duplicates it cleans up, so we can set alerts if a system is misbehaving and sending excessive duplicates; a lot of duplicates may indicate a problem elsewhere in the system.
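
One possible shape for that cleanup, reusing the illustrative `transactions` table from the sketch above: keep the earliest row per transaction ID, delete the rest, and report the count so it can feed a metric or alert.

```python
DUPLICATE_ALERT_THRESHOLD = 1000  # illustrative; tune for the workload

def collect_duplicates(conn):
    # Assumes the transactions(tx_id, amount) table from the earlier sketch.
    # Keep the earliest row per transaction ID and delete everything else.
    cur = conn.execute("""
        DELETE FROM transactions
        WHERE rowid NOT IN (
            SELECT MIN(rowid) FROM transactions GROUP BY tx_id
        )
    """)
    conn.commit()
    return cur.rowcount  # how many duplicate rows this pass removed

removed = collect_duplicates(conn)  # conn from the sketch above
print(f"garbage collected {removed} duplicate transactions")
if removed > DUPLICATE_ALERT_THRESHOLD:
    print("excessive duplicates: something upstream may be misbehaving")
```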

This option has predictable capacity because it eliminates the unbounded query, which is a really good characteristic to have. We gain transaction performance by eliminating the read before each write, and we don’t have to hold a high-cardinality index in memory to do it.

This isn’t the only option, just one that came to my mind while reading the post.

  • Qudoos

    So, when you say record the transactions, even the duplicates, I guess this is not a transaction that would go and change actual state, but more like an event being stored, where state is built at runtime?

  • http://drmathochist.wordpress.com/ John Armstrong

    Same thing in fancier language:

    All “idempotent” really means is that if you do the same operation multiple times (in a row, usually), it’s the same as doing it once. In this case, the resulting state of the DB is the same whether you insert the same record once or twice.

    The insight here is that we don’t actually examine the state of the DB itself; we only ever examine the results of reads. That is, if we get the same result from any read after inserting the same record many times as we do after inserting it once, it’s good enough. To be really fancy: the operation is “extensionally idempotent”.

    And to break it down more plainly again: it doesn’t matter what the internal state of the DB is as long as reads only return a single copy (and maybe clean up internal duplicates when they find them, to be nice).
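
    A toy way to state that property in code (the store and names are purely illustrative): compare only what reads return, never the internal rows.

    ```python
    # Toy statement of "extensionally idempotent": reads after one insert and
    # reads after many inserts of the same record must be indistinguishable.
    def read(rows):
        # The read path de-duplicates; internal duplicates are invisible.
        return sorted(set(rows))

    once = ["tx-1"]
    many = ["tx-1", "tx-1", "tx-1"]  # the same record inserted several times
    assert read(once) == read(many)  # the observable state is identical
    ```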

  • http://ravendb.net Ayende Rahien

    I would actually strongly disagree with you here.
    This assumes that the only goal of the system is to record that something happened, but in many cases, you need to _do_ something about it. And more than just modify some internal data.

    Using the send-SMS-as-a-side-effect-of-processing-a-message example, you have a big problem with this.

  • Michael R Schmidt

    What if the system is processing credit card transactions? You can’t duplicate one. You could, however, queue the transaction, which would free up the calling thread/process to do other work while the payment processor decides whether it should send back the previous results (duplicate transaction) or process it as a new transaction.

  • Harry Brumleve

    Hi Kelly,

    I hate to say it, but I kind of use exception-driven design in my implementation of idempotency.

    We have a system of fragile legacy processors living in Azure. Due to performance and their aforementioned fragility, we replicate each command several times and deliver the copies to partitioned processors.

    The way we make sure that each command is executed once and only once is by having the command contain deterministic information that allows each processor to generate the same unique partition and row key.

    When the processor is finished with its work and is ready to save out its results, the processor tries to insert to that partition/row key combination. A Storage Failure of “Entity already exists” means that the processor is done and can move to the next command.
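
    Roughly, the pattern looks like this (a sketch assuming the current azure-data-tables Python SDK; the connection string, table name, and key derivation are placeholders):

    ```python
    import hashlib

    from azure.core.exceptions import ResourceExistsError
    from azure.data.tables import TableClient

    # Placeholders: a real connection string and table name come from config.
    table = TableClient.from_connection_string(
        conn_str="<storage-connection-string>", table_name="results"
    )

    def save_result(command_id: str, result: str) -> bool:
        # Derive the partition/row key deterministically from the command so
        # every replica of the same command maps to exactly the same entity.
        digest = hashlib.sha256(command_id.encode()).hexdigest()
        entity = {
            "PartitionKey": digest[:4],  # illustrative partitioning scheme
            "RowKey": digest,
            "Result": result,
        }
        try:
            table.create_entity(entity=entity)
            return True  # first writer wins
        except ResourceExistsError:
            return False  # "entity already exists": the work was already done
    ```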

    We also use a similar strategy to implement an EventStore in Table.

    It seems to work for us and we’d rather waste the processor cycles than give the user a bad experience.

    Also the fee for Table is so low that we don’t notice this cost and can actually leverage this information in other areas of our system.

    Thoughts?

    Harry

  • Alexey

    I think what you propose is much harder to implement and much trickier to maintain (and probably that’s why this solution is not common).

    You say “We don’t actually have to incur the cost of de-duplication at transaction commit time. Instead we can amortize the de-duplication cost at read time.”

    Maybe it works for simple “saving” scenarios, but what if the result of the transaction is a calculation of some kind? Now I need to be able to distinguish a “unique” calculation result from a “duplicated” one. So I need what? CausationID? Or a predictable way to generate a transaction ID based on something? Or perhaps both (because one command can cause multiple calculations)?

    I could do it while reading the commands themselves, but then I cannot just use RabbitMQ or something to deliver commands; I need something on top that stores all the commands somewhere and reads from that storage with de-duplication.
    And I don’t see how you avoid that growing log here.

    Actually, most of the time you don’t need to read before writing anyway. Just try to write with the same ID, and blow up if the ID is not unique. Any database can do that for you, including a file system.

    Any thoughts?

  • http://twitter.com/yreynhout Yves Reynhout

    Hi Kelly,

    Here are some things that came to mind.

    With regard to “The list of transaction IDs is unbounded”, might I suggest, in a non-condescending way, rereading Pat Helland’s paper “Life beyond Distributed Transactions: an Apostate’s Opinion” (specifically Section 5, “Ensuring At-Most-Once Acceptance via Activities”). My take-away is that the message (and therefore transaction) identifiers that need to be tracked are not one huge list, but rather a list kept per entity, per partner. Surely for most entities this list will be much shorter and manageable for the entire lifecycle of the entity. Only long-lived entities with lots of interactions might require a different kind of approach, IMO.

    With regard to “It requires every write to do a read before the write”, in most OLTP systems I’ve seen, reads happen anyway, since most of the time a decision is made based on history or current state. If transaction identifiers are part of this read operation (as extra data), then except for the bandwidth and size concerns, I don’t see the harm in doing so.

    I don’t work on solutions/problems that require a lot of scale. Hence to be taken with a grain of salt.

    Regards,
    Yves.

  • kellabyte

    Yves,

    I remember that paper, it’s a great one!

    First thing I should mention is that I wouldn’t automatically go to this choice. If I have a write-heavy system that needs de-duplication because my operations aren’t idempotent and the system has to protect itself, then perhaps I’d consider this option along with others. I wrote this to stir up some thoughts.

    While many applications do reads before writes, that doesn’t mean we should introduce more. If you can reduce random disk or random memory seeks, it’s generally a good idea in write-heavy systems. Garbage collection (of transaction IDs) needs to be carefully tuned, though, because it’s also a form of write amplification.

    You can certainly make better tradeoffs if you are certain transactions don’t span entities as you say. But like most decisions, you’re trading off something else.

    Something to keep in mind: in this model you cannot ensure serializability, because you don’t have a total order. The strongest isolation level you can provide in this model is Read Committed, which can certainly cause incorrectness in system state if your OLTP workload is vulnerable at that isolation level.

    I don’t disagree with anything you’ve said, just pointing out that in my opinion it’s more complicated than that, especially if you’re building a generic transactional system. Calvin transactions have a problem similar to de-dupe’s with unbounded transaction ID growth.

    There is recent research (like Calvin, Granola, HATs and RAMP) that is making great gains in providing stronger consistency even in large distributed systems. There is a lot of creative work happening out there right now; it’s exciting!

    TPC-C is dominated by big-iron ACID databases because no others can complete the workload, due to the consistency requirements. A $30M Oracle system is the king of this workload. Any day now I expect this decades-long trend to fall, because modern transactional approaches will finally beat these results, handily and at much lower cost.