Data Science Pipelines - Pulling Data
tl;dr Every dataset should be pulled from a database or flat file exactly once. This will make the code easier to read and maintain, more performant and easier to hand off to a colleague.
Ok, so you have an algorithm that produces some scores and you want to run it nightly.
This post details some best practices to make maintenance of new and existing ML pipelines easier.
The coding skills of Data Scientists are all over the place, ranging from "I can barely write SQL" to "I can write an operating system". This post is intended for Data Scientists who are less familiar with Software Engineering practices.
What does a standard nightly ML process look like?
Let's assume the code, a mix of Python, R and SQL, transforms the Products and Customers datasets and scores them with a model on a nightly basis. There is always business logic, and the code will grow as that logic is bolted onto the model execution.
The code runs nightly, the product team likes the scores, the data scientists do some QA and testing and are satisfied that the scores are correct.
Everyone is happy, as long as it runs every night.
What types of changes can I expect?
- Input Data Locations
  - Changes in database or table names
  - Data Engineering Managers love shiny new storage systems and will move everything from one system to another every couple of years: SQL Server to Greenplum to Redshift to Snowflake. There is always something better coming out.
- Input Data Formats
  - Data Engineers love star schemas and de-normalization, and additional joins may be required to build the input data
  - Changes from a relational database to flat files may occur
- Product Changes
  - As the business changes, so too do the business rules
Code Updated ✅
Wonderful! The model is a success and folks want updates, changes and additional functionality!
Oops, we are pulling data in twice in different locations
This will cause issues sooner or later.
- Updates to filtering of Products datasets will be inconsistent
- Unavailable Inventory data will cause the pipeline to fail in the middle after potentially running for hours.
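The first problem can be sketched in a few lines. This is only an illustration: the `products` table, its `active` column, and the two function names are hypothetical, and an in-memory SQLite database stands in for whatever storage system the pipeline really uses.

```python
import sqlite3


def build_features(conn):
    # First read of products: the filter was updated to drop inactive rows.
    return conn.execute(
        "SELECT product_id FROM products WHERE active = 1"
    ).fetchall()


def apply_business_rules(conn):
    # Second read of products, elsewhere in the code: the filter update was missed.
    return conn.execute("SELECT product_id FROM products").fetchall()


# Made-up data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, active INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [(1, 1), (2, 1), (3, 0)])

scored = {row[0] for row in build_features(conn)}
ruled = {row[0] for row in apply_business_rules(conn)}
print(ruled - scored)  # products the two steps disagree on: {3}
```

Nothing fails here, which is the trouble: the two steps quietly work on different versions of the same dataset.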
Pull in each Dataset Exactly Once
It is best to pull each dataset in exactly once, before any other work starts.
- Any unavailable or corrupted dataset will cause the pipeline to fail immediately.
  - This is mostly useful during development, as waiting 10 minutes for the pipeline to fail is really annoying.
  - It is also nice for nightly jobs, as a support team would immediately know something failed and could quickly fix and re-run.
- Updates to transformations or filtering of an input dataset are applied on read.
  - One piece of code won't be using filtered data while another is using unfiltered data.
- Performance improves because large datasets are not pulled in multiple times.
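One way to structure this is sketched below, using only the Python standard library. Everything specific in it is an assumption made for illustration: the table and column names, the `active` filter, the stand-in scoring logic, and the in-memory SQLite database that takes the place of the real data source. The idea it shows is a single `load_inputs` step that reads and filters every dataset up front and hands the results to the rest of the pipeline.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Inputs:
    """Every input dataset the pipeline needs, each read exactly once."""
    products: list
    customers: list


def load_products(conn):
    # Hypothetical table, columns, and filter. The filter is applied on read,
    # so every downstream step sees the same filtered rows.
    rows = conn.execute(
        "SELECT product_id, price FROM products WHERE active = 1"
    ).fetchall()
    if not rows:
        raise ValueError("products returned no rows")
    return rows


def load_customers(conn):
    rows = conn.execute("SELECT customer_id, segment FROM customers").fetchall()
    if not rows:
        raise ValueError("customers returned no rows")
    return rows


def load_inputs(conn):
    # The only place in the pipeline that touches the database.
    return Inputs(products=load_products(conn), customers=load_customers(conn))


def score(inputs):
    # Stand-in for the real model: one score per (customer, product) pair.
    return {
        (customer_id, product_id): price * (2.0 if segment == "premium" else 1.0)
        for customer_id, segment in inputs.customers
        for product_id, price in inputs.products
    }


def apply_business_rules(scores, inputs):
    # Reuses the already-loaded products instead of querying again.
    max_price = max(price for _, price in inputs.products)
    return {key: min(value, max_price) for key, value in scores.items()}


def run_pipeline(conn):
    inputs = load_inputs(conn)  # a missing or empty table fails here, up front
    return apply_business_rules(score(inputs), inputs)


def demo_connection():
    # In-memory SQLite database with made-up rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (product_id INTEGER, price REAL, active INTEGER)")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, segment TEXT)")
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                     [(1, 10.0, 1), (2, 25.0, 1), (3, 99.0, 0)])
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(100, "premium"), (200, "standard")])
    return conn


if __name__ == "__main__":
    print(run_pipeline(demo_connection()))
```

If the storage system or a table name changes, only the `load_*` functions need to be touched. If a table is missing, `run_pipeline` raises before any scoring has started.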
Every dataset should be pulled from a database or flat file exactly once. This will make the code easier to read and maintain, more performant and easier to hand off to a colleague.