About Me

An experienced Data Professional with a background in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2022-08-06

Data Engineering Tips and Principles

Software Engineering patterns are pretty well established within the industry.

  • Develop locally and deploy to production
  • Lock down production data, systems and processes
  • Use unit tests, QA and integration tests to validate code changes
  • Do code reviews, design reviews and use appropriate standards for the technologies used

Data folks often work differently because

  • Sometimes there is just one database holding all the tables and columns, so development happens directly on the production system
  • SQL is commonly used, which makes automated tests difficult to set up
  • Real data is used to catch corner cases and validate transformations during development

A lot flows from these points and makes Data Engineering a different discipline from Software Engineering. It becomes hard to lock down production data and systems, and if SQL is the primary language used, then data engineering teams will sometimes not have any tests at all.

Below I've tried to outline a few points that can help bring more stringent development standards into Data Engineering while still maintaining the differences between Data and Software Engineering.

  • Develop with small data
  • Save tests for future use
  • Have a clear process to deploy something to production

These can help to bring more discipline to a scrappy data engineering team.

Development

The first step in developing a data transformation is to understand the real data. That understanding helps you develop faster and build reproducible tests.

Like a lot of folks, I've sat there and watched my SQL query run for 45 seconds, made a small change, and iterated, and iterated, and lost a day just watching my query run. Thinking about the final result and making an effort to understand the data beforehand can drastically speed up development as well as create more robust, higher quality code.

I suggest the following approach,

  1. Understand the real data - Document it including common queries and corner cases
  2. Recreate a tiny dataset that covers the real data including corner cases
  3. Consider the output of your transformation on this tiny dataset.
  4. Develop your code with this tiny dataset until your code matches your expectations
  5. Run the code on the full dataset, and if there are failures or discrepancies then your original understanding of the dataset is incomplete, so return to step 1 and iterate.

This allows rapid development on small data while keeping the emphasis on the real data.
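
As a concrete sketch of steps 2 and 3 (the table, column names and transformation here are hypothetical, not from a real system), a tiny hand-built dataset in pandas might look like this:

```python
import pandas as pd

# A tiny dataset that covers the corner cases we care about:
# a normal sale, a refund, and a row with a missing amount.
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "amount": [100.0, -20.0, None],
        "created_at": pd.to_datetime(["2022-07-02", "2022-07-15", "2022-07-20"]),
    }
)

def monthly_revenue(df: pd.DataFrame) -> float:
    """Sum order amounts, treating missing amounts as zero."""
    return df["amount"].fillna(0).sum()

# We know what the answer should be before touching the full dataset.
assert monthly_revenue(orders) == 80.0
```

Once the code matches expectations on this tiny dataset, the same transformation can be pointed at the full table in step 5.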

Testing

Developers typically run tests on their transformations to see if they are correct. These tests should be stored with the final output and be re-usable. Typically Software Engineers use automated unit tests, but SQL can also be tested. Whether the tests are SQL statements, Python unit tests or a checklist, they are an artifact that allows code to be updated with confidence and should be maintained and used before deployment.

Generally software engineers have it right here. Run code locally (or in a container), and document every manual test as an automated test.

This allows for a pyramid of dependencies to be created. Does the code compile? Do the tests pass? Are guidelines followed? If yes, then the code is ready for production.
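
As a minimal sketch of what a reusable SQL test can look like (the orders table and revenue query here are hypothetical), using an in-memory SQLite database and the standard library's unittest:

```python
import sqlite3
import unittest

REVENUE_SQL = """
    SELECT customer_id, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""

class TestRevenueSql(unittest.TestCase):
    def test_sums_revenue_per_customer(self):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [(1, 100.0), (1, -20.0), (2, 50.0)],
        )
        rows = conn.execute(REVENUE_SQL).fetchall()
        self.assertEqual(rows, [(1, 80.0), (2, 50.0)])

if __name__ == "__main__":
    unittest.main()
```

The same tiny dataset from development becomes the test fixture, so every manual check done while writing the query is preserved for the next person.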

Deployment

Deploying to production should be an isolated, reviewed and approved button click.

Standard Software Engineering code reviews should be used here to catch any obvious issues and to ensure code guidelines and standards are followed.

If deploying to production means manually changing 6 different configuration values, S3 buckets, database endpoints or table names, then a system should be devised to deploy to production via an automated process.

The automation can then be extended to enforce standards. Folks breaking production because their tests don't pass? Update the automated deploy to disallow this.
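
A minimal sketch of that kind of gate, assuming a hypothetical deploy.py entry point that refuses to ship unless the test suite passes:

```python
import subprocess
import sys

def deploy() -> None:
    """Hypothetical one-click deploy: run the tests, then ship."""
    tests = subprocess.run([sys.executable, "-m", "unittest", "discover"])
    if tests.returncode != 0:
        sys.exit("Tests failed, refusing to deploy to production.")
    # Replace this placeholder with the real deployment step
    # (copy artifacts, apply migrations, promote configuration, ...).
    print("Deploying to production")

if __name__ == "__main__":
    deploy()
```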

Administration

Data Engineers often have to maintain a database with permissions, tables and views and other changes that can occur without an actual deploy. A view can be created through a SQL client without having to go through any process.

Some of these should have checklists and guidelines, and some should require enhanced permissions. Need to update permissions so that a user can query a table? Go through the checklist and follow standards. Need to build a new schema? Talk to an administrator with access to a super user account. No need to reinvent the wheel every time you need to add permissions, and no need to allow every engineer to do anything on the system.

Need to build a view? Great, follow the standards across the team for code storage, table location and permissions.

Most administration work is very routine and it's a waste of an engineer's time to start from scratch every time they are taking on this type of work.

Conclusion

A robust system is constructed step by step on 

  • A standard development method
  • Followed consistently
  • Updated when necessary
  • And agreed upon by the team

Generally with Engineers this means

  • Work on small data that is well understood
  • Track validation and tests as part of production code
  • Document standard methods and approaches so anyone can work on anything
  • Collaboratively review each other's work

Don't lose sight that we are engineers building a structure step by step, and the structure is only as strong as its weakest point. Standards, Tests, Process and Reviews help keep the structure strong.

2022-05-13

Book Suggestions

Book Suggestions for Data Practitioners

This is a collection of books that I've either read or partially read that I think are valuable for people who work in Data.

These books are for a range of skill levels from beginners to advanced data folks.

A lot of these books receive continued updates, so be sure to get the newest version.

These 2 don't really fit in any other section, but are probably the most useful.

Data Visualizations

Engineering

Machine Learning

Statistics and Analytics




2022-04-01

Why does my startup need a Data Team anyway?

This is a reasonable question. Any business should have a healthy skepticism toward hiring. Trying to hire your way out of a problem is feasible, but it can be a really expensive proposition for any company.

Typical questions,

  • Can't the engineers, the accountants and finance people handle data stuff?
  • What benefit do I get from hiring data analysts, data engineers and data scientists?

The engineers, accountants and finance folks can take a business a long way. They can make plots and charts and determine if the business is making money or going bankrupt. They can support quarterly reports to a board of directors.

A Data Team tackles faster moving, more predictive, larger and messier datasets than accounting and finance, and frees up the engineers to do actual engineering work. Having data-specific folks is a specialization of skill and allows focus on a bounded domain.

Let's start with an Exercise

Let's calculate Revenue over the last month from a table with columns

  • customer id
  • revenue
  • revenue date time

This is super easy right?

  • Filter by date time to the "last month"
  • Sum the revenue, and you're done!
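
As a sketch (the table name and exact column names are hypothetical stand-ins for the list above, here written in pandas), the naive version really is just a filter and a sum:

```python
import pandas as pd

# A hypothetical table with the three columns listed above.
revenue = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "revenue": [100.0, 250.0, 75.0],
        "revenue_date_time": pd.to_datetime(
            ["2022-06-28", "2022-07-03", "2022-07-19"]
        ),
    }
)

# Filter to "last month" and sum, and you're done.
last_month = revenue[
    (revenue["revenue_date_time"] >= "2022-07-01")
    & (revenue["revenue_date_time"] < "2022-08-01")
]
print(last_month["revenue"].sum())  # 325.0
```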

In practice, the tables are never this simple. They tend to have dozens of columns, and revenue data can be spread across another dozen tables. Trying to get a handle on 12 columns times 12 tables times the number of records takes some time and effort, even when the number of records is small.

Typically a revenue calculation would involve,

  • refund amount
  • item sold
  • type of item sold
  • different revenue streams that may have their own tables
    • or might even be stored in the same table in different columns
  • etc, etc, more and more columns and tables, on and on and on

Accountants and finance folks can handle these types of problems in Excel with little difficulty. An engineer can run some SQL, hand the data off to accounting, and they can load it into Excel. Excel is great, but it does have size limitations. There is only so much memory on a laptop, and the business complexity will continue to increase.

More tables, more SQL joins, and more corner cases cause businesses to increase the amount of time required to support basic revenue calculations. At some point, it becomes a half time or full time position for a dedicated person.

Is there a data team?

Accounting and finance may gradually ramp up the work for the engineer to support their efforts. Does your business have an engineer spending more than 50% of their time supporting data pulls? Does it have 2 engineers spending 25% each?

Then your business already has a data team, it just hasn't been verbalized. That's not a great position to be in. Best to have clear communication, goals and expectations for folks.

I would propose that the equivalent of one-half of an engineer's time spent supporting data pulls for accounting and finance is a data team.

What do I mean by "Team"?

In the above example accounting and finance were the core of the data team, but not all businesses have such a complex accounting or finance group. Some tech startups have more questions about their business operations than about their revenue and tax commitments.

  • How many people visited the website today?
  • Where did our new visitors come from?
  • How can we encourage them to stay?

This can lead a business into data analytics. Analysts focus on business questions and business use cases. But their focus is so oriented towards business problems that they are not necessarily the strongest engineers. In order to move faster and have more regularity in their delivered reports they will need support, and thus a business may hire a Data Engineer or Analyst Engineer.

This can cascade into a full data team with analysts, engineers and scientists - What are all these data people doing?

Data teams grow gradually, one person at a time until someone stands up and says "Look Over Here! We have a data team".

Conclusion

Maybe your organization doesn't need a full data team. A single accountant can support a large, complex business, and maybe that's enough.

Maybe you already have a data team, but it just hasn't been said. Folks supporting accounting and finance, folks writing SQL, making plots and charts and just generally wrangling data to provide value to the business.

Being clear with data goals and expectations can help accelerate the value that the data can provide. Daily updated dashboards can help a business understand where it is right now. Historical data analysis can help them see where they have been and forecasting can give a business an idea of where it might be in 6 months.

2022-02-18

Anti-Patterns in Experimentation

Running experiments on websites involves showing 2 or more different experiences to similar but distinct groups of users and measuring the differences to determine which experience works best for the website. There are a variety of methods of setting up and running experiments and industry standards are unfortunately not yet clear.

Often experiments on websites start with a single developer putting in a single "if" statement and experimentation only grows from here.

The intent of this post is to highlight a few anti-patterns I've seen in the hopes that the next team implementing an experiment might avoid some of the common pitfalls.

Here are the 3 most common anti-patterns in AB Testing that I've seen,

  1. Experimentation is a Library
  2. All Users are in all Experiments
  3. A pile of data dropped on an Analyst

Experimentation

At its core experimentation is

  • Giving different experiences to different groups of users
  • Measuring differences
  • Moving forward with the experience that provides the best outcome

An experiment has to be implemented in a website, an app or an email send. There must also be tracking of some sort to measure which experience each user was shown.

An Experiment Assignment is defined as a single user's experience and should include,

  • User id
  • Experiment Name
  • Variation Name
    • Which variation of the website did this user experience? For example, it could be the "red_button" or the "blue_button"
  • Timestamp
    • Ideally the first time the user encountered the experience.
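
As a sketch of what that record and a deterministic user split can look like (the class, function and variation names here are hypothetical, not any specific vendor's API):

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ExperimentAssignment:
    user_id: str
    experiment_name: str
    variation_name: str
    timestamp: datetime

def assign(user_id: str, experiment_name: str, variations: list[str]) -> ExperimentAssignment:
    """Bucket a user by hashing user id + experiment name, so the split is stable."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    variation = variations[int(digest, 16) % len(variations)]
    return ExperimentAssignment(
        user_id=user_id,
        experiment_name=experiment_name,
        variation_name=variation,
        timestamp=datetime.now(timezone.utc),
    )

# The same user always lands in the same variation for a given experiment,
# and the assignment record is what gets logged for the analyst.
print(assign("user-123", "button_color", ["red_button", "blue_button"]))
```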

Anti-Patterns

An anti-pattern is a pattern of code development or structure that seems like a good idea, but ultimately turns out poorly. When discussing software systems, it's sometimes just as important to point out how things can go wrong as it is to discuss best practices.

Experimentation is a Library

Experimentation is a library! Call a function to split the users. Send the Experiment Assignments downstream for an analyst to use. No problem.

But wait, for the analyst, what does the data mean? How is it produced? Where did it come from? Why are we running this experiment at all?

All these questions should come out before the experiment is run so that methods and data can be aligned to generate the best outcome.

Experimentation is a Process from hypothesis to development, analysis and cleanup and typically looks like,

  1. An experimenter hypothesizes that a new experience would be better by some measure
  2. An engineer develops the new experience to be able to run in parallel to the existing experience
  3. Users enter and interact with the 2+ experiences
  4. An analyst evaluates and compares the two experiences quantitatively
  5. An engineer codes the website to exclusively display the experience that is quantitatively better

Experimentation is a House of Cards that can return unusable or inaccurate results if any particular piece of this process fails to properly coordinate with the other pieces.

  1. The experimenter must have a reasonable hypothesis that can be quantitatively evaluated
  2. The engineer must properly, randomly split the users and present 2+ high quality experiences
  3. Users must be given time to enter and interact with the 2+ experiences
  4. The analyst must understand the data production and apply a rigorous analysis
  5. The engineer must prioritize and properly cleanup the codebase to guarantee that the best experience is shown to all users from now on

Any step can have minor bugs, but if there is a disconnect or any major bugs the whole system will topple. The experiment may produce incorrect results without anyone ever knowing that the results were incorrect.

Experimentation is NOT a library. It's a system, it's a process, it's a puzzle and the pieces must fit properly to provide high quality results.

All Users are in all Experiments

Every user that has ever landed on the website or that visits during an experiment gets an Experiment Assignment logged for the analyst to use.

Did they actually see the experiment? Who knows, doesn't matter, the analyst can figure it out.

😳

I have worked with outstanding analysts and am 100% positive they can figure this out and tease apart the raw data into some usable answers.

But how long will this take? How many experiments are we intending to run?

Analysts are busy and analytics takes time. Is it worth it to the business to have an analyst spending a week to produce an analysis for every experiment?

This pattern is easy on the Engineering team, but hard on an Analytics team. They will feel underserved and unappreciated, and will likely be overworked when asked to keep up with a fast-paced experimentation culture.

It also makes automation really hard as each test's analysis needs this extra bit of meta-data that may or may not capture the nuance of when the user actually saw the experiment.

Best to Devise a System of Data Logging that analysts can easily work with and that can allow for some automation.

A pile of data dropped on an Analyst

The old joke is that when you present a pile of data to an analyst and ask "What does this all mean?" the analyst will inevitably say "Not much". Pulling strong signals from experimentation requires up-front planning from analysts. They can catch obvious bugs in the design, point out metrics that can't be calculated, and will recall existing similar experiments that have been run to help catch duplicate work.

Analysts are smart, hard working and resourceful. They will find some signal in whatever dataset they are presented with but it can be hard for an organization to measure and realize that they are spending an enormous amount of time generating conclusions from experimentation.

It's also impossible for an organization to measure the quality of a single experiment analysis. The system as a whole must be healthy for an organization to rely on the results.

Pull Analysts Into the Process Early to give your analysts some heads up and make sure they are part of the experiment process at the first step.

Conclusion

Avoiding some of the anti-patterns highlighted here will allow a team to move briskly into a future full of high quality, trusted experimentation.

Instead of Experimentation is a Library, recognize that Experimentation is a Process involving multiple teams and diverse points of view.

Instead of All Users are in all Experiments, work with analysts to Devise a System of Data Logging with experimentation in mind. Something that is intended for analysts to work with and will produce trusted experimentation results.

Instead of A pile of data dropped on an Analyst, Pull Analysts Into the Process Early. Work with your analysts to produce a system of experimentation that is easy to work with and trusted to produce quality results.

Go forth and keep trying new ideas that delight your users and help your website grow!

Further reading

I recommend the book "Trustworthy Online Controlled Experiments (A Practical Guide to A/B Testing)" by Ron Kohavi, Diane Tang and Ya Xu.

2022-01-10

Clean Code for Data Professionals

Simple Tips towards Clean Code for Data Folks

A few years ago I was a Software Engineer and my team lead asked me to take a look at a piece of code. This code interacted with a 3rd party library and we wanted to upgrade that library.

My team lead knew this code was quite old and understood that it might be hard to work with. He asked me to evaluate the level of effort to upgrade vs level of effort to re-write.

So I took a look and found a single C++ file

  • With a single class, with a single static function
    • With ~1200 lines of code
    • With dozens of local variables
    • With local variables reassigned 300 lines after initially being set
  • With
    • no comments - literally 0, not one comment in the entire file
    • no unit tests
    • no tests of any sort

So I told my team lead that it would be easier to re-write the code than to refactor it. But any of the following could have tipped the balance back toward refactoring,

  • Comments would have helped to understand what was happening in the code
  • Tests would have allowed me to modify the code with confidence that I was maintaining the original functionality
  • Smaller functions would have helped increase the code readability
  • Constant local variables would have dramatically increased the readability of the code

Data professionals are not software engineers. Data Scientists and Data Analysts are definitely not, and Data Engineers are a mix as some come from data science and data analysis, and some from software engineering.

To that end, here are a few tips and tricks that should help make your code easier to read and maintain while still allowing the focus to be on the data and not the code.

Functions should fit on your Monitor

Top to bottom, all the code inside the function should fit on your screen. This drastically increases the readability of any particular piece of code as each function can be quickly read and understood.

This also applies to Jupyter notebooks and SQL statements. For example, I could remove all the section headers in this post, but that would make it a lot harder to read.

🔑 This is the main point here. All code becomes 10x easier to understand if it's chunked up into bite-sized pieces.

Comments

At a minimum I would suggest

  • "Why does this file exist?" as a comment at the top of every file
  • "Why does this function/SQL exist?" as a comment for every function or SQL statement
  • "What are the data types of the arguments and the return value?" as a comment for every function in Python

Just remember, you will be picking up this code in ~6 months, and you can set yourself up for an easy reminder of what this code does, or a tough one.
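
A minimal sketch of what those three comments look like in Python (the file and function are hypothetical examples):

```python
"""Why does this file exist?

Turns raw order amounts into the daily revenue figure shown on the finance dashboard.
"""

def daily_revenue(amounts: list[float]) -> float:
    """Why does this function exist?

    Sums one day's order amounts so the dashboard query stays simple.
    Takes a list of dollar amounts and returns the total in dollars.
    """
    return sum(amounts)
```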

Unit Tests

Any piece of code is tested by the developer. They run it through various scenarios with a range of data values. So it gets tested, but sometimes these tests don't get archived.

The time to formalize these tests and to put them into the codebase is small, but can save a tremendous amount of time at a later date.

Imagine picking up a piece of code that has 2-3 tests vs a piece of code with 0 tests. One is a lot easier and faster to pick up and modify.
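
Formalizing those ad-hoc checks can be as small as this, a sketch that reuses the hypothetical daily_revenue function from the Comments sketch above:

```python
import unittest

def daily_revenue(amounts: list[float]) -> float:
    """Hypothetical function under test (see the Comments sketch above)."""
    return sum(amounts)

class TestDailyRevenue(unittest.TestCase):
    def test_sums_amounts(self):
        self.assertEqual(daily_revenue([10.0, 2.5]), 12.5)

    def test_empty_day_is_zero(self):
        self.assertEqual(daily_revenue([]), 0)

if __name__ == "__main__":
    unittest.main()
```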

Constant Local Variables

This can be really impactful for Jupyter Notebooks and for refactoring existing code.

For notebooks, with dozens of cells and variables, everything is effectively a global variable, and having confidence that variables are not in flux makes the code easier to read and maintain.

For refactoring, often data code ends up in large functions with dozens of local variables. Updating the variables to be unchanged or constant can make long functions easier to read and refactor.
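
A small before-and-after sketch of what this means in practice (the loader and column names are hypothetical):

```python
import pandas as pd

def load_orders() -> pd.DataFrame:
    """Hypothetical loader standing in for a real query."""
    return pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, None, 50.0]})

# Before: one name is reassigned, so its meaning drifts as you read.
data = load_orders()                                 # raw rows
data = data.dropna()                                 # now cleaned rows
data = data.groupby("customer_id")["amount"].sum()   # now an aggregate

# After: each value gets its own name and is never reassigned.
raw_orders = load_orders()
clean_orders = raw_orders.dropna()
revenue_by_customer = clean_orders.groupby("customer_id")["amount"].sum()
```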

Conclusion

These tips are for you. You will be the one working on your own code in 6 months, or someone on your team will go on to a new exciting opportunity, and you will be asked to work on their code.

Wouldn't you rather work on this code?

  • Short Functions
  • Comments
  • Unit Tests
  • Constant Local Variables