Exploring Spotify’s Luigi to build an ETL pipeline

for a movie streaming system in production

Jinglun


Problems with data workflow in production

Data is the foundation of modern business, but raw data on its own is rarely useful: it usually sits in an application database that was not designed for analytics. Companies routinely have to move and transform data so that it can be used for testing, quality assurance, and monitoring in production. That data often comes from multiple sources, and combining those sources is a major challenge for any data workflow.

Scheduled data integration, or ETL (Extraction, Transformation, Loading), addresses this by consolidating data from multiple sources and transforming it into a useful format. With ETL tools, data from different sources can be gathered in a single place where analytics programs can act on it and surface key insights.

For example, suppose you have plenty of logs stored on AWS S3 and telemetry data spread across multiple partitions of a SQL database, and you want to periodically take that data, extract and aggregate the meaningful information, and store the result in an analytics DB (e.g., Redshift). Performing these tasks manually is fine at the beginning, but sooner or later they need some form of automation, for example cron jobs triggered by CloudWatch. However, cron jobs alone are not enough to guarantee stable and robust performance in production: cron has no notion of dependencies between jobs and does not recover from a failed step on its own.

That is where a tool like Luigi comes in. It helps you design and execute workflows so that analysts and data scientists have access to data from multiple application systems, while the pipeline stays robust in production.
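To make the three ETL stages concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the log format (`timestamp,user,movie`), the `RAW_LOGS` sample lines standing in for files pulled from S3, the `movie_views` table, and SQLite standing in for an analytics DB such as Redshift.

```python
import sqlite3
from collections import Counter

# Hypothetical log lines standing in for files fetched from S3.
RAW_LOGS = [
    "2020-03-01T10:00:00,user1,inception",
    "2020-03-01T10:05:00,user2,up",
    "this line is malformed",
    "2020-03-01T10:07:00,user3,inception",
]


def extract(lines):
    """Extraction: parse raw lines into records, skipping malformed ones."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) == 3:
            yield {"ts": parts[0], "user": parts[1], "movie": parts[2]}


def transform(records):
    """Transformation: aggregate the records into view counts per movie."""
    return Counter(record["movie"] for record in records)


def load(counts, conn):
    """Loading: write the aggregated counts into the analytics table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movie_views "
        "(movie TEXT PRIMARY KEY, views INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO movie_views VALUES (?, ?)",
        sorted(counts.items()),
    )
    conn.commit()


def run_etl(lines, conn):
    """Run the three stages in order."""
    load(transform(extract(lines)), conn)
```

Calling `run_etl(RAW_LOGS, sqlite3.connect(":memory:"))` runs the whole flow once. What a script like this does not give you is scheduling, dependency tracking between steps, or recovery when one step fails halfway, which is the gap a workflow tool is meant to fill.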

What is Luigi

Luigi is a Python (2.7, 3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more. https://luigi.readthedocs.io/en/stable

In short, Luigi is an open-source execution framework developed by Spotify that lets you build complex data pipelines entirely in Python. It is designed to solve the pipeline problems associated with long-running batch processes by stitching long-running tasks together into pipelines, with a wide toolbox of task templates (e.g. Hadoop…
