Jeffrey has worked in data processing for about seven years on systems ranging from clunky bash scripts to complicated Hadoop jobs. He has maintained and reverse-engineered legacy C++ monoliths, and built several smaller systems in Ruby, Python, and C. He is only too aware that data is indifferent to your suffering.
Jeffrey currently works at Zendesk as part of the data engineering team, building and maintaining the infrastructure for training thousands of machine learning models. When not stressing about data systems, he hangs out with his cat and is occasionally approached in the street by total strangers who want to tell him their life story.
YOW! Data 2016 Sydney
Intermediate Datasets and Complex Problems
We often build data processing systems by starting from simple cases and slowly adding functionality. Usually, a system like this is thought of as a set of operations that take raw data and create meaningful output. These operations tend to organically grow into monoliths, which become very hard to debug and reason about. As such a system expands, development of new features tends to slow down and the cost of maintenance dramatically increases. One way to manage this complexity is to produce denormalized intermediate datasets which can then be reused for both automated processes and ad hoc querying. This separates the knowledge of how the data is connected from the process of extracting information, and allows these parts to be tested separately and more thoroughly. While there are disadvantages to this approach, there are many reasons to consider it. If this technique applies to you, it makes the hard things easy, and the impossible things merely hard.
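A minimal sketch of the idea, using a hypothetical users/orders schema and Python's built-in sqlite3 (the table and column names are invented for illustration): the join logic that connects raw tables is run once to materialise a flat intermediate dataset, and downstream consumers, automated or ad hoc, query that flat table without repeating or even knowing the join.

```python
import sqlite3

# Hypothetical normalized "raw" tables: users and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Ada', 'AU'), (2, 'Grace', 'US');
    INSERT INTO orders VALUES (10, 1, 30.0), (11, 1, 12.5), (12, 2, 99.0);
""")

# Step 1: materialise a denormalized intermediate dataset. The knowledge
# of how users relate to orders lives here, and only here, so it can be
# tested in isolation.
conn.executescript("""
    CREATE TABLE order_facts AS
    SELECT o.id AS order_id, o.amount, u.name, u.country
    FROM orders o JOIN users u ON u.id = o.user_id;
""")

# Step 2: consumers (batch jobs, ad hoc queries) read the flat table
# directly, with no join logic of their own to get wrong.
revenue_by_country = dict(conn.execute(
    "SELECT country, SUM(amount) FROM order_facts GROUP BY country"
))
```

The trade-off the abstract alludes to shows up even at this scale: `order_facts` duplicates user attributes across rows and must be rebuilt when the raw tables change, but every consumer becomes a simple scan over one table.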