Simple Tips towards Clean Code for Data Folks
A few years ago I was a Software Engineer and my team lead asked me to take a look at a piece of code. This code interacted with a 3rd party library and we wanted to upgrade that library.
My team lead knew this code was quite old and understood that it might be hard to work with. He asked me to evaluate the level of effort to upgrade vs level of effort to re-write.
So I took a look and found a single C++ file:
- With a single class, with a single static function
- With ~1200 lines of code
- With dozens of local variables
- With local variables reassigned 300 lines after initially being set
- With no comments: literally 0, not one comment in the entire file
- With no unit tests
- With no tests of any sort
So I told my team lead that it would be easier to re-write the code than to refactor it. But any one of the following could have tipped the balance back toward refactoring:
- Comments would have helped to understand what was happening in the code
- Tests would have allowed me to modify the code with confidence that I was maintaining the original functionality
- Smaller functions would have helped increase the code readability
- Constant local variables would have dramatically increased the readability of the code
Data professionals are not software engineers. Data Scientists and Data Analysts are definitely not, and Data Engineers are a mix: some come from data science and data analysis, and some come from software engineering.
To that end, here are a few tips and tricks that should help make your code easier to read and maintain while still allowing the focus to be on the data and not the code.
Functions should fit on your Monitor
Top to bottom, all the code inside the function should fit on your screen. This drastically increases the readability of any particular piece of code as each function can be quickly read and understood.
This also applies to Jupyter notebooks and SQL statements. For example, I could remove all the section headers in this post, but that would make it a lot harder to read.
🔑 This is the main point here. All code becomes 10x easier to understand if it's chunked up into bite-sized pieces.
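As a rough illustration of chunking, here is a minimal sketch of one data task split into functions that each fit on a screen. The function names, the `amount` column, and the CSV layout are all made up for this example and are not taken from any real project.

```python
import csv
import io
from statistics import mean


def parse_rows(raw_csv: str) -> list[dict[str, str]]:
    """Turn raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def drop_missing_amounts(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only the rows that have a non-empty 'amount' value."""
    return [row for row in rows if row["amount"].strip()]


def average_amount(rows: list[dict[str, str]]) -> float:
    """Average the 'amount' column across the given rows."""
    return mean(float(row["amount"]) for row in rows)


def summarize(raw_csv: str) -> float:
    """Read like a table of contents: parse, clean, then aggregate."""
    rows = parse_rows(raw_csv)
    valid_rows = drop_missing_amounts(rows)
    return average_amount(valid_rows)
```

Each helper can be read and checked on its own, and `summarize` shows the overall flow in three lines.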
Comments
At a minimum, I would suggest:
- "Why does this file exist?" as a comment at the top of every file
- "Why does this function/SQL exist?" as a comment for every function or SQL statement
- "What are the data types of the arguments and the return value?" as a comment for every function in Python
Just remember, you will be picking up this code in ~6 months, and you can set yourself up for an easy reminder of what this code does, or a tough one.
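Here is a minimal sketch of what those three comments might look like in a Python file. The file's purpose, the function `daily_revenue`, and the sales report it mentions are hypothetical and exist only to show the shape of the comments.

```python
# Why this file exists: helpers that turn raw order records into daily
# revenue totals for a (hypothetical) weekly sales report.

from collections import defaultdict


def daily_revenue(orders: list[tuple[str, float]]) -> dict[str, float]:
    """Sum order amounts per day.

    Why this exists: the report needs one revenue figure per day, but
    orders arrive as one record per purchase.

    Args:
        orders: list of (date, amount) tuples, e.g. ("2024-01-01", 9.5).

    Returns:
        dict mapping each date string to its total amount as a float.
    """
    totals: dict[str, float] = defaultdict(float)
    for day, amount in orders:
        totals[day] += amount
    return dict(totals)
```

The type hints in the signature and the docstring overlap on purpose: the hints are easy for tools to check, and the docstring is easy for people to read.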
Unit Tests
Any piece of code is tested by the developer. They run it through various scenarios with a range of data values. So it gets tested, but sometimes these tests don't get archived.
Formalizing these tests and adding them to the codebase takes little time, but it can save a tremendous amount of time at a later date.
Imagine picking up a piece of code that has 2-3 tests vs a piece of code with 0 tests. The first is a lot easier and faster to pick up and modify.
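To show how small "2-3 tests" can be, here is a minimal sketch using plain `assert` statements inside `test_` functions, a style that pytest can collect and that also runs by calling the functions directly. The function `percent_change` is a made-up example, not code from the story above.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        raise ValueError("old must be non-zero")
    return (new - old) / old * 100


def test_increase():
    # The kind of check you would otherwise run once by hand and forget.
    assert percent_change(50, 75) == 50.0


def test_decrease():
    assert percent_change(200, 150) == -25.0


def test_zero_baseline_is_rejected():
    # A zero baseline has no meaningful percentage change.
    try:
        percent_change(0, 10)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero baseline")
```

These are the same scenarios a developer would try while writing the function; saving them just means the next person does not have to rediscover them.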
Constant Local Variables
This can be really impactful for Jupyter Notebooks and for refactoring existing code.
For notebooks, with dozens of cells and variables, everything is a global variable and having confidence that variables are not in flux makes the code easier to read and maintain.
For refactoring, data code often ends up in large functions with dozens of local variables. Assigning each variable once and never reassigning it can make long functions easier to read and refactor.
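Python has no true constant locals, so the sketch below relies on convention: each intermediate result gets its own name and is assigned exactly once. The `Final` annotation from `typing` is only a hint for type checkers such as mypy and is not enforced at runtime. The function and its data are hypothetical.

```python
from typing import Final


def top_customers(spend_by_customer: dict[str, float], limit: int) -> list[str]:
    """Return the names of the highest-spending customers, best first."""
    # Each step gets a new name instead of overwriting an earlier one,
    # so any name means the same thing wherever it appears.
    active: Final = {
        name: spend for name, spend in spend_by_customer.items() if spend > 0
    }
    ranked: Final = sorted(active, key=active.get, reverse=True)
    return ranked[:limit]
```

The same habit carries over to notebooks: writing `clean_df = raw_df.dropna()` instead of reusing `df` in every cell means a cell can be re-run without wondering which version of the variable it sees.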
These tips are for you. You will be the one working on your own code in 6 months, or someone on your team will go on to a new exciting opportunity, and you will be asked to work on their code.
Wouldn't you rather work on code that has:
- Short Functions
- Unit Tests
- Constant Local Variables