The Rules for Data Processing Pipeline Builders
You know that data is eating the world. And whenever any reasonable amount of data needs processing, a complicated multi-stage data processing pipeline will be involved.
This all adds up to quite a complex system! And just as with any other engineering system, unless carefully maintained, pipelines tend to turn into a house of cards – failing daily, requiring manual data fixes and constant monitoring.
For this reason, I want to share certain good engineering practices with you, ones that make it possible to build scalable data processing pipelines from composable steps. While some engineers understand such rules intuitively, I had to learn them by doing, making mistakes, fixing, sweating and fixing things again…
Each data point is a JSON Object (aka hash table); and those data points are accumulated in large files (aka batches), containing a single JSON Object per line. Every batch file is, say, about 10GB.
First, you want to validate the keys and values of every object; next, apply a couple of transformations to each object; and finally, store a clean result into an output file.
Validation takes about 10% of the time, the first transformation takes about 70% of the time and the rest takes 20%.
This is how we process our data here at Bumble
Now imagine your startup is growing, there are hundreds if not thousands of batches already processed… and then you realise there’s a bug in the data processing logic, in its final step, and because of that broken 20%, you’ll have to rerun all of it.
- steps are easy to understand;
- every step can be tested separately;
- it’s easy to cache intermediate results or put broken data aside;
- the system is easy to extend with error handling;
- transformations can be reused in other pipelines.
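To make the list above concrete, here is a minimal Python sketch of a pipeline split into small steps. Everything in it is an assumption made for illustration: the helper names (`read_batch`, `write_batch`, `run_step`), the `user_id` and `country` fields, and the validation rule are hypothetical, not taken from any real pipeline.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Iterator

Record = dict
Step = Callable[[Iterable[Record]], Iterator[Record]]


def read_batch(path: Path) -> Iterator[Record]:
    """Yield one JSON object per non-empty line of a batch file."""
    with path.open(encoding="utf-8") as src:
        for line in src:
            if line.strip():
                yield json.loads(line)


def write_batch(path: Path, records: Iterable[Record]) -> None:
    """Write records back as one JSON object per line."""
    with path.open("w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record, sort_keys=True) + "\n")


def validate(records: Iterable[Record]) -> Iterator[Record]:
    # Hypothetical rule: keep only records that carry an integer "user_id".
    for record in records:
        if isinstance(record.get("user_id"), int):
            yield record


def normalise_country(records: Iterable[Record]) -> Iterator[Record]:
    # Hypothetical transformation: upper-case the "country" field.
    for record in records:
        yield {**record, "country": str(record.get("country", "")).upper()}


def run_step(step: Step, src: Path, dst: Path) -> None:
    """Run one small step: read src, apply the step, write dst (src != dst)."""
    write_batch(dst, step(read_batch(src)))
```

Each step reads one file and writes another, for example `run_step(validate, Path("batch.json"), Path("batch-1.json"))` followed by `run_step(normalise_country, Path("batch-1.json"), Path("batch-2.json"))`. If a bug turns up in the last step, only that step has to be rerun from its intermediate input, and each function can be unit-tested on a handful of records.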
Or worse, the data will only be partially transformed, and further pipeline steps will have no way of knowing that. At the end of the pipe, you’ll only get partial data. Not good.
Ideally, you want the data to be in one of the two states: to-be-transformed or already-transformed. This property is called atomicity. An atomic step either happened, or it did not:
In transactional database systems, this can be achieved using – you guessed it – transactions, which make it super easy to compose complex atomic operations on data. So, if you can use such a database – please do so.
POSIX-compatible and POSIX-like file systems have atomic operations (say, mv or ln), which can be used to imitate transactions:
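A minimal Python sketch of this write-then-rename pattern, assuming the temporary file lives in the same directory (and therefore on the same file system) as the final output. The function name `atomic_step` and the per-line `transform` callback are hypothetical; `os.replace` is the standard-library equivalent of an atomic `mv` on POSIX systems.

```python
import os
from pathlib import Path
from typing import Callable


def atomic_step(transform: Callable[[str], str], src: Path, dst: Path) -> None:
    """Apply transform to every line of src; dst appears only on full success."""
    # Write next to the destination so the final rename stays on one file system.
    tmp = dst.with_name(dst.name + ".tmp")
    with src.open(encoding="utf-8") as fin, tmp.open("w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(transform(line))
        # Push the data to disk before publishing the file under its final name.
        fout.flush()
        os.fsync(fout.fileno())
    # Atomic on POSIX: readers see either the old dst (or none) or the complete new one.
    os.replace(tmp, dst)
```

If `transform` raises halfway through, the rename never happens: the destination keeps its previous content (or stays absent), and only the `.tmp` file holds the partial result.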
In the example above, broken intermediate data will end up in a *.tmp file, which can be introspected for debugging purposes, or just garbage collected later.
At Bumble – the parent company operating Badoo and Bumble apps – we apply hundreds of data transforming steps while processing our data sources: a high volume of user-generated events, production databases and external systems.
Notice, by the way, how this integrates nicely with the Rule of Small Steps, as small steps are much easier to make atomic.
In imperative programming, a subroutine with side effects is idempotent if the system state remains the same after one or several calls. – Wikipedia on Idempotence
The Rule of Idempotence is a bit more subtle: running a transformation on the same input data one or more times should give you the same result.
I repeat: you run your step twice on a batch, and the result is the same. You run it 10 times, and the result is still the same. Let’s tweak our example to illustrate the idea:
We had our /input/batch.json as input, it ended up in /output/batch.json as output. And no matter how many times we apply the transformation – we should end up with the same output data:
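Here is one way this could look in Python; treat it as a sketch rather than the real step. The names `transform_batch`, `output_path_for` and `digest`, as well as the `name` field, are made up for illustration. The two things that matter are that the output path is derived only from the input path, and that the output file is overwritten rather than appended to.

```python
import hashlib
import json
from pathlib import Path


def output_path_for(src: Path, output_dir: Path) -> Path:
    # The output location depends only on the input name, never on time or run count.
    return output_dir / src.name


def transform_batch(src: Path, output_dir: Path) -> Path:
    """Rewrite the whole output file from scratch on every run."""
    output_dir.mkdir(parents=True, exist_ok=True)
    dst = output_path_for(src, output_dir)
    # Mode "w" truncates any previous result; mode "a" would duplicate rows on reruns.
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue
            record = json.loads(line)
            # Hypothetical clean-up: trim and lower-case the "name" field.
            record["name"] = str(record.get("name", "")).strip().lower()
            fout.write(json.dumps(record, sort_keys=True) + "\n")
    return dst


def digest(path: Path) -> str:
    """Content hash, handy for checking that two runs produced identical output."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Running `transform_batch` twice on the same input and comparing `digest` of the results is a cheap idempotence test. A production step would also write through a temporary file and rename it, which is left out here for brevity.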
Note that implicit input can sneak through in very unexpected ways. If you’ve ever heard of reproducible builds, then you know the usual suspects: time, file system paths and other flavours of hidden global state.
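To illustrate how time can sneak in, compare the two hypothetical helpers below (both names and the `processed_on` field are assumptions). The first reads the wall clock, so every rerun produces different output; the second takes the date as an explicit argument, so the caller decides it once per batch.

```python
from datetime import date, datetime, timezone


def stamp_hidden(record: dict) -> dict:
    # Hidden input: the result depends on when the step happens to run.
    return {**record, "processed_on": datetime.now(timezone.utc).isoformat()}


def stamp_explicit(record: dict, batch_date: date) -> dict:
    # Explicit input: same record and same batch_date always give the same result.
    return {**record, "processed_on": batch_date.isoformat()}
```

The same trick applies to the other usual suspects: pass paths, random seeds and configuration into the step as arguments instead of letting it read them from the environment.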
Why is idempotency important? Firstly, for its ease of use! This feature makes it easy to reload subsets of data whenever something is tweaked in the transformation, or in the data in /input/batch.json. Your data will end up in the same paths, database tables or table partitions, etc.
Remember, though, that some things simply cannot be idempotent by definition, e.g. it’s meaningless to be idempotent when you flush an external buffer. But those cases should still be pretty isolated, Small and Atomic.
One more thing: delay deleting intermediate data for as long as possible. I’d also suggest having slow, cheap storage for raw incoming data, if possible:
So, you should keep raw data in batch.json and clean data in output/batch.json for as long as possible, and batch-1.json, batch-2.json, batch-3.json at least until the pipeline finishes a work cycle.
You’ll thank me when analysts decide to change the algorithm for calculating some kind of derived metric and there will be months of data to fix.
- split your pipeline into isolated and testable Smallest Steps;
- strive to make the steps both Atomic and Idempotent;
- introduce as much data Redundancy as reasonably possible.
The data goes through hundreds of carefully crafted, tiny step transformations, 99% of which are Atomic, Small and Idempotent. We can afford plenty of Redundancy as we use cold data storage, hot data storage and even a superhot intermediate data cache.
In retrospect, the Rules might feel very natural, almost obvious. You might even sort of follow them intuitively. But understanding the reasoning behind them does help to identify their applicability limits, and to step over them if necessary.