Stephen Pettinato - Data Professional - (he/him): January 2022

Simple Tips towards Clean Code for Data Folks

A few years ago I was a Software Engineer and my team lead asked me to take a look at a piece of code. This code interacted with a 3rd party library and we wanted to upgrade that library.

My team lead knew this code was quite old and understood that it might be hard to work with. He asked me to evaluate the level of effort to upgrade vs level of effort to re-write.

So I took a look and found a single C++ file

With a single class, with a single static function

With ~1200 lines of code
With dozens of local variables
With local variables reassigned 300 lines after initially being set

With

no comments - literally 0, not one comment in the entire file
no unit tests
no tests of any sort

So I told my team lead that it would be easier to re-write the code than to refactor it. But any one of the following could have tipped the balance back toward refactoring:

Comments would have helped to understand what was happening in the code
Tests would have allowed me to modify the code with confidence that I was maintaining the original functionality
Smaller functions would have helped increase the code readability
Constant local variables would have dramatically increased the readability of the code

Data professionals are not software engineers. Data Scientists and Data Analysts are definitely not, and Data Engineers are a mix: some come from data science and data analysis, and some from software engineering.

To that end, here are a few tips and tricks that should help make your code easier to read and maintain while still allowing the focus to be on the data and not the code.

Functions should fit on your Monitor

Top to bottom, all the code inside the function should fit on your screen. This drastically increases the readability of any particular piece of code as each function can be quickly read and understood.

This also applies to Jupyter notebooks and SQL statements. For example, I could remove all the section headers in this post, but that would make it a lot harder to read.

🔑 This is the main point here. All code becomes 10x easier to understand if it's chunked up into bite-sized pieces.
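As a rough sketch of what "bite-sized" can look like in Python, here is a small cleaning step split into short functions. The row layout (a `customer` string and an `amount` string) and every function name are hypothetical, made up for illustration only.

```python
def parse_amount(raw: str) -> float:
    """Turn a string like ' $1,200.50 ' into a float."""
    return float(raw.strip().replace("$", "").replace(",", ""))


def is_valid(row: dict) -> bool:
    """Keep only rows that have a customer and a positive amount."""
    return bool(row["customer"]) and row["amount"] > 0


def clean_rows(raw_rows: list[dict]) -> list[dict]:
    """Parse the amount on every row, then drop the invalid rows."""
    parsed = [{**row, "amount": parse_amount(row["amount"])} for row in raw_rows]
    return [row for row in parsed if is_valid(row)]


def total_by_customer(rows: list[dict]) -> dict[str, float]:
    """Sum the cleaned amounts per customer."""
    totals: dict[str, float] = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals
```

Each function fits on a screen with room to spare, and the top-level flow reads as `total_by_customer(clean_rows(raw_rows))`.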

Comments

At a minimum I would suggest

"Why does this file exist?" as a comment at the top of every file
"Why does this function/SQL exist?" as a comment for every function or SQL statement
"What are the data types of the arguments and the return value?" as a comment for every function in Python

Just remember, you will be picking up this code in ~6 months, and you can set yourself up for an easy reminder of what this code does, or a tough one.
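Here is a minimal sketch of those three comments in a Python file. The file's purpose, the function name `churn_rate`, and its arguments are all invented for the example, so swap in whatever your own code actually does.

```python
"""Why does this file exist?

Hypothetical example: it computes weekly churn numbers for a retention report.
"""


def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Why does this function exist? The report needs churn as a fraction.

    Args:
        customers_at_start (int): customers active at the start of the week.
        customers_lost (int): customers who cancelled during the week.

    Returns:
        float: customers_lost divided by customers_at_start, or 0.0 when
        there were no customers at the start of the week.
    """
    if customers_at_start == 0:
        return 0.0
    return customers_lost / customers_at_start
```

The type hints in the signature repeat what the docstring says, which is fine: one is for tools, the other is for you in 6 months.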

Unit Tests

Any piece of code is tested by the developer. They run it through various scenarios with a range of data values. So it gets tested, but sometimes these tests don't get archived.

Formalizing these tests and putting them into the codebase takes little time, but can save a tremendous amount of time at a later date.

Imagine picking up a piece of code that has 2-3 tests vs a piece of code with 0 tests. The first is a lot easier and faster to pick up and modify.
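To give a sense of how small "2-3 tests" can be, here is a sketch for a hypothetical `normalize_email` helper (the function and the sample addresses are made up). The tests are plain functions with `assert` statements, so they can be collected by pytest or simply called by hand.

```python
def normalize_email(raw: str) -> str:
    """Lower-case an email address and strip surrounding whitespace."""
    return raw.strip().lower()


# The scenarios a developer would try by hand anyway, written down once.
def test_strips_whitespace():
    assert normalize_email("  ana@example.com ") == "ana@example.com"


def test_lowercases():
    assert normalize_email("Ana@Example.COM") == "ana@example.com"


def test_clean_input_is_unchanged():
    assert normalize_email("ana@example.com") == "ana@example.com"
```

If someone later changes `normalize_email`, these three functions tell them within seconds whether the original behavior still holds.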

Constant Local Variables

This can be really impactful for Jupyter Notebooks and for refactoring existing code.

For notebooks, with dozens of cells and variables, everything is a global variable and having confidence that variables are not in flux makes the code easier to read and maintain.

For refactoring, data code often ends up in large functions with dozens of local variables. Making each variable constant, so that it is assigned once and never reassigned, can make long functions easier to read and refactor.
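Python has no true constants for local variables, so the sketch below relies on a convention: every step gets its own name and is never reassigned. The `Final` annotation from the standard `typing` module is optional. It is only a hint that a type checker such as mypy can use to flag a reassignment, and Python itself does not enforce it when the code runs. The function name and the readings data are hypothetical.

```python
from typing import Final


def average_valid_reading(raw_readings: list[str]) -> float:
    """Average the numeric readings, ignoring blanks and negative values."""
    # Each name is assigned exactly once, so `parsed` always means "parsed",
    # no matter how far down the function you are reading.
    non_blank: Final = [r for r in raw_readings if r.strip()]
    parsed: Final = [float(r) for r in non_blank]
    valid: Final = [x for x in parsed if x >= 0]

    if not valid:
        return 0.0
    return sum(valid) / len(valid)
```

Compare this with a version that reuses a single `data` variable for all three steps: to know what `data` holds on any given line, you would have to trace every assignment above it.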

Conclusion

These tips are for you. You will be the one working on your own code in 6 months, or someone on your team will go on to a new exciting opportunity, and you will be asked to work on their code.

Wouldn't you rather work on this code?

Short Functions
Comments
Unit Tests
Constant Local Variables
