Stephen Pettinato - Data Professional - (he/him): January 2021

Data Science Pipelines - Pulling Data

tl;dr Every dataset should be pulled from a database or flat file exactly once. This will make the code easier to read and maintain, more performant and easier to hand off to a colleague.

Ok, so you have an algorithm that produces some scores and you want to run it nightly.

This post details some best practices to make maintenance of new and existing ML pipelines easier.

The coding skills of Data Scientists are all over the place, ranging from "I can barely write SQL" to "I can write an operating system". This post is intended for an audience of Data Scientists who are less familiar with Software Engineering practices.

What does a standard nightly ML process look like?

Let's assume the code transforms datasets Products and Customers and scores them with a model on a nightly basis. There is always business logic and the code will grow as the business logic is bolted onto the model execution.

The code is a mix of Python, R and SQL.

The code runs nightly, the product team likes the scores, the data scientists do some QA and testing and are satisfied that the scores are correct.

Everyone is happy, as long as it runs every night.

What types of changes can I expect?

Input Data Locations

Changes in database or table name
Data Engineering Managers love new shiny storage systems and every couple of years will move everything from one system to another. SQL Server to Greenplum to Redshift to Snowflake: there is always something better coming out.

Input Data Formats

Data Engineers love star schemas and de-normalization, so additional joins may be required to build the input data.
Changes from an RDS to flat files may occur.

Product Changes

As the business changes, so too do the business rules.

Code Updated ✅

Wonderful! The model is a success and folks want updates, changes and additional functionality!

So the code gets updated and tested, and everyone is happy again!

Oops, we are pulling data in twice in different locations

This will cause issues sooner or later.

Updates to the filtering of the Products dataset will be inconsistent: a filter changed in one pull is easily missed in the other.
Unavailable Inventory data will cause the pipeline to fail in the middle, after potentially running for hours.
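To make the first problem concrete, here is a minimal sketch of what pulling Products twice can look like. Everything in it is an assumption for illustration: the in-memory SQLite database stands in for the real warehouse, and the table, columns and function names (products, active, build_features, apply_business_rules) are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database standing in for the real warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER, name TEXT, active INTEGER)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "widget", 1), (2, "gadget", 0), (3, "gizmo", 1)],
    )
    return conn


def build_features(conn):
    # First pull: this query was updated to drop inactive products.
    return conn.execute(
        "SELECT id FROM products WHERE active = 1 ORDER BY id"
    ).fetchall()


def apply_business_rules(conn):
    # Second pull, added later somewhere else: the new filter never made it here.
    return conn.execute("SELECT id FROM products ORDER BY id").fetchall()


conn = make_demo_db()
feature_ids = [row[0] for row in build_features(conn)]
rule_ids = [row[0] for row in apply_business_rules(conn)]

print(feature_ids)  # products the model sees
print(rule_ids)     # products the business rules see
```

The two steps now disagree about which products exist, and nothing in the code warns anyone about it.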

Pull in each Dataset Exactly Once

It is best to pull each dataset in exactly once, before any other work starts.
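One possible layout, sketched under assumptions: a single load step runs first and everything downstream receives the loaded data instead of a database connection. The Inputs class, the table and column names, the toy scoring rule and the price cut-off are all hypothetical stand-ins, and SQLite is used only so the example runs on its own.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Inputs:
    """Every dataset the pipeline needs, pulled exactly once."""
    products: tuple
    customers: tuple


def load_inputs(conn):
    # The only place in the pipeline that talks to the database.
    products = conn.execute(
        "SELECT id, price FROM products WHERE active = 1 ORDER BY id"
    ).fetchall()
    customers = conn.execute(
        "SELECT id, budget FROM customers ORDER BY id"
    ).fetchall()
    return Inputs(products=tuple(products), customers=tuple(customers))


def score(inputs):
    # Stand-in for the model: 1.0 if the customer can afford the product.
    return {
        (cust_id, prod_id): 1.0 if budget >= price else 0.0
        for cust_id, budget in inputs.customers
        for prod_id, price in inputs.products
    }


def apply_business_rules(inputs, scores):
    # Hypothetical rule: only keep products priced under 40.
    # It reuses the data already in memory instead of re-querying.
    allowed = {prod_id for prod_id, price in inputs.products if price < 40}
    return {key: s for key, s in scores.items() if key[1] in allowed}


def main(conn):
    inputs = load_inputs(conn)  # pull once, up front
    scores = score(inputs)      # downstream steps take `inputs`, not `conn`
    return apply_business_rules(inputs, scores)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, price REAL, active INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER, budget REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, 10.0, 1), (2, 5.0, 0), (3, 50.0, 1)],
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(100, 20.0)])

final_scores = main(conn)
print(final_scores)
```

Because only load_inputs touches the database, a move to a new storage system or a renamed table means changing one function.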

Benefits

Any unavailable or corrupted datasets will cause the pipeline to immediately fail.

Mostly useful during development, as waiting 10 minutes for the pipeline to fail is really annoying.
Nice for nightly jobs, as a support team would immediately know something failed and could quickly fix and re-run.
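A rough sketch of the fail-fast behaviour, assuming the same kind of up-front load step. The table names and the run_pipeline helper are hypothetical, and the Inventory table is left out on purpose to simulate unavailable data.

```python
import sqlite3

# Hypothetical list of everything the pipeline needs.
REQUIRED_TABLES = ("products", "customers", "inventory")


def load_inputs(conn):
    """Pull every required table up front; a missing table raises right here."""
    # Table names come from the constant above, never from user input.
    return {
        table: conn.execute(f"SELECT * FROM {table}").fetchall()
        for table in REQUIRED_TABLES
    }


def run_pipeline(conn, expensive_step):
    inputs = load_inputs(conn)  # fails before any expensive work starts
    return expensive_step(inputs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER)")
# The inventory table is deliberately missing.

calls = []  # records whether the expensive step ever ran
try:
    run_pipeline(conn, lambda inputs: calls.append("scored"))
    failed = False
except sqlite3.OperationalError as exc:
    failed = True
    error_message = str(exc)
    print(error_message)
```

The error surfaces in the load step, so the hours-long scoring step never starts and the failure message points straight at the missing dataset.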

Updates to transformations or filtering of an input dataset are applied on read

One piece of code won't be using filtered data while another is using unfiltered.
Helps with performance, since large datasets are not pulled in multiple times.
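The same idea works for flat files. In this sketch the filter lives inside the one function that reads the file, so every consumer gets the filtered rows and the file is read a single time. The column names, the "discontinued" filter and the two consumer functions are assumptions, and io.StringIO stands in for a real file on disk.

```python
import csv
import io

read_count = 0  # counts how often the "file" is actually read


def read_products(handle):
    """Read the Products flat file and apply the filter at read time."""
    global read_count
    read_count += 1
    rows = csv.DictReader(handle)
    # Hypothetical filter: discontinued products never enter the pipeline.
    return [row for row in rows if row["status"] != "discontinued"]


def feature_ids(products):
    # Consumer 1: the model's feature step.
    return [p["id"] for p in products]


def report_ids(products):
    # Consumer 2: a reporting step that must agree with the model.
    return [p["id"] for p in products]


# io.StringIO stands in for a real flat file on disk.
flat_file = io.StringIO("id,status\n1,active\n2,discontinued\n3,active\n")
products = read_products(flat_file)  # the one and only read

for_model = feature_ids(products)
for_report = report_ids(products)
print(for_model, for_report, read_count)
```

Changing the filter now means editing read_products and nothing else.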

Conclusion

Every dataset should be pulled from a database or flat file exactly once. This will make the code easier to read and maintain, more performant and easier to hand off to a colleague.
