About Me

My photo
An experienced Data Professional with experience in Data Science and Data Engineering interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2022-08-06

Data Engineering Tips and Principles

Data Engineering Tips and Principles

Software Engineering patterns are pretty well established within the industry.

  • Develop locally and deploy to production
  • Lock down production data, systems and processes
  • Use unit tests, QA and integration tests to validate code changes
  • Do code reviews, design reviews and use appropriate standards for the technologies used

 Data folks are often different because

  • Sometimes there is just 1 database with the tables and columns, so development happens on this production system
  • SQL is commonly used which makes automated tests difficult to achieve
  • Real data is used to catch corner cases and validate transformations during development

A lot flows out of these points that makes Data Engineering a different discipline than Software Engineering. It becomes hard to lock down production data and systems and if SQL is the primary language used, then data engineering teams will sometimes not have any tests at all.

Below I've tried to outline a few points that can help bring more stringent development standards into Data Engineering while still maintaining the differences between Data and Software Engineering.

  • Develop with small data
  • Save tests for future use
  • Have a clear process to deploy something to production

These can help to bring more discipline to a scrappy data engineering team.

Development

The first step to developing a data transformation is to understand the real data. Understanding the real data can help to develop faster and to have reproducible tests.

Like a lot of folks, I've sat there and watched my SQL query run for 45 seconds, made a small change, and iterated, and iterated, and lost a day just watching my query run. Thinking about the final result and making an effort to understand the data beforehand can drastically speed up the development and well as create more robust and higher quality code.

I suggest the following approach,

  1. Understand the real data - Document it including common queries and corner cases
  2. Recreate a tiny dataset that covers the real data including corner cases
  3. Consider the output of your transformation on this tiny dataset.
  4. Develop your code with this tiny dataset until your code matches your expectations
  5. Run the code on the full dataset, and if there are failures or discrepancies then your original understanding of the dataset is incomplete, so return to step 1 and iterate.
This allows rapid development working with small data, while still having the emphasis on the real data.

Testing

Developers typically run tests on their transformations to see if they are correct. These tests should be stored with the final output and be re-usable. Typically Software Engineers use automated unit tests, but SQL can also be tested. Whether the tests are SQL statements, Python unit tests or a checklist, they are an artifact that allows code to be updated with confidence and should be maintained and used before deployment.

Generally software engineers have it right here. Run code locally (or in a container) and every manual test should be documented as an automated test.

This allows for a pyramid of dependencies to be created. Does the code compile? Do the tests pass? Are guidelines followed? If yes, then the code is ready for production.

Deployment

Deploying to production should be an isolated, reviewed and approved button click.

Standard Software Engineering code reviews should be used here to catch any obvious issues and to ensure code guidelines and standards are followed.

If deploying to production means manually changing 6 different configuration values, s3 buckets, database endpoints or table names, then a system should be devised to deploy to production via some automated process.

The automation can be built on to enforce standards. Folks breaking production because their tests don't pass? Update the automated deploy to disallow this.

Administration

Data Engineers often have to maintain a database with permissions, tables and views and other changes that can occur without an actual deploy. A view can be created through a SQL client without having to go through any process.

Some of these should have checklists and guidelines and some should required enhanced permissions. Need to update permissions so that a user can query a table? Go through the checklist and follow standards. Need to build a new schema? Talk to an administrator with access to a super user account. No need to recreate the wheel every time you need to add permissions, no need to allow engineer to do anything on the system.

Need to build a view? Great, follow the standards across the team for code storage, table location and permissions.

Most administration work is very routine and it's a waste of an engineer's time to start from scratch every time they are taking on this type of work.

Conclusion

A robust system is constructed step by step on 

  • A standard development method
  • Followed consistently
  • Updated when necessary
  • And agreed upon by the team

Generally with Engineers this means

  • Work on small data that is well understood
  • Track validation and tests as part of production code
  • Document standard methods and approaches so anyone can work on anything
  • Collaboratively review each others work

Don't lose sight that we are engineers building a structure step by step and the structure is only as strong as it's weakest point. Standards, Tests, Process, Reviews help keep the structure strong.