Exploring Spotify’s Luigi to build an ETL pipeline
Problems with data workflows in production
Nowadays, data is the foundation of the modern business world. Data on its own is not very useful; it is often stored in some form of application database that isn’t easy to use for analytics. It’s very common for a company to have to move and transform data for testing, quality assurance, and monitoring in production. This data often comes from multiple sources, and analyzing it across those sources creates significant challenges for the data workflow.
To make this data useful, scheduled data integration, or ETL (Extract, Transform, Load), is introduced to consolidate data from multiple sources and transform it into a usable format. With ETL tools, data from different sources can be brought together in a single place for analytics programs to act on and surface key insights.
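The extract/transform/load steps can be sketched in plain Python. The sources, field names, and the dict standing in for an analytics store are all made up for illustration; a real pipeline would read from databases or S3 and write to a warehouse:

```python
# Minimal ETL sketch: consolidate records from two hypothetical sources,
# transform them into one shape, and load them into a single store.

def extract():
    # Extract: pull raw records from two different sources
    # (hard-coded here; normally an app DB and a log stream).
    app_db_rows = [{"user": "alice", "logins": 3}, {"user": "bob", "logins": 5}]
    log_events = [{"user": "alice", "errors": 1}, {"user": "bob", "errors": 0}]
    return app_db_rows, log_events

def transform(app_db_rows, log_events):
    # Transform: join both sources on the user key into one
    # analytics-friendly record per user.
    errors_by_user = {e["user"]: e["errors"] for e in log_events}
    return [
        {"user": r["user"], "logins": r["logins"],
         "errors": errors_by_user.get(r["user"], 0)}
        for r in app_db_rows
    ]

def load(records, store):
    # Load: write the consolidated records into the target store
    # (a dict standing in for an analytics database).
    for rec in records:
        store[rec["user"]] = rec

warehouse = {}
load(transform(*extract()), warehouse)
print(warehouse["alice"])  # {'user': 'alice', 'logins': 3, 'errors': 1}
```

Tools like Luigi wrap each of these steps in a task with declared dependencies, so the scheduler can run them in order and resume cleanly after a failure.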
For example, suppose you have plenty of logs stored somewhere on AWS S3 and telemetry data spread across multiple partitions of a SQL database, and you want to periodically extract and aggregate meaningful information from them and store it in an analytics DB (e.g., Redshift). Performing these tasks manually might be fine at the beginning, but they soon call for some form of automation, for example cron jobs triggered by CloudWatch. However, cron jobs are not enough to guarantee stable and robust performance in…