About Me

An experienced Data Professional with a background in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2022-08-06

Data Engineering Tips and Principles

Software Engineering patterns are pretty well established within the industry.

  • Develop locally and deploy to production
  • Lock down production data, systems and processes
  • Use unit tests, QA and integration tests to validate code changes
  • Do code reviews, design reviews and use appropriate standards for the technologies used

Data folks often work differently because

  • Sometimes there is just one database holding all the tables and columns, so development happens directly on the production system
  • SQL is commonly used, which makes automated tests difficult to set up
  • Real data is used to catch corner cases and validate transformations during development

A lot flows from these points and makes Data Engineering a different discipline from Software Engineering. It becomes hard to lock down production data and systems, and if SQL is the primary language used, then data engineering teams will sometimes not have any tests at all.

Below I've tried to outline a few points that can help bring more stringent development standards into Data Engineering while still maintaining the differences between Data and Software Engineering.

  • Develop with small data
  • Save tests for future use
  • Have a clear process to deploy something to production

These can help to bring more discipline to a scrappy data engineering team.

Development

The first step in developing a data transformation is to understand the real data. That understanding helps you develop faster and build reproducible tests.

Like a lot of folks, I've sat there and watched my SQL query run for 45 seconds, made a small change, and iterated, and iterated, and lost a day just watching my query run. Thinking about the final result and making an effort to understand the data beforehand can drastically speed up development as well as create more robust, higher quality code.

I suggest the following approach,

  1. Understand the real data - Document it including common queries and corner cases
  2. Recreate a tiny dataset that covers the real data including corner cases
  3. Consider the output of your transformation on this tiny dataset.
  4. Develop your code with this tiny dataset until your code matches your expectations
  5. Run the code on the full dataset, and if there are failures or discrepancies then your original understanding of the dataset is incomplete, so return to step 1 and iterate.

This allows rapid development on small data while keeping the emphasis on the real data.
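
As a concrete sketch of steps 2 and 3 (the table, column names and transformation here are hypothetical, not from a real system), a tiny hand-built dataset in pandas might look like this:

```python
import pandas as pd

# A tiny dataset that covers the corner cases we care about:
# a normal sale, a refund, and a row with a missing amount.
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "amount": [100.0, -20.0, None],
        "created_at": pd.to_datetime(["2022-07-02", "2022-07-15", "2022-07-20"]),
    }
)

def monthly_revenue(df: pd.DataFrame) -> float:
    """Sum order amounts, treating missing amounts as zero."""
    return df["amount"].fillna(0).sum()

# We know what the answer should be before touching the full dataset.
assert monthly_revenue(orders) == 80.0
```

Once the code matches expectations on this tiny dataset, the same transformation can be pointed at the full table in step 5.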

Testing

Developers typically run tests on their transformations to see if they are correct. These tests should be stored with the final output and be re-usable. Typically Software Engineers use automated unit tests, but SQL can also be tested. Whether the tests are SQL statements, Python unit tests or a checklist, they are an artifact that allows code to be updated with confidence and should be maintained and used before deployment.

Generally software engineers have it right here. Run code locally (or in a container), and document every manual test as an automated test.

This allows for a pyramid of dependencies to be created. Does the code compile? Do the tests pass? Are guidelines followed? If yes, then the code is ready for production.
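
As a minimal sketch of what a reusable SQL test can look like (the orders table and revenue query here are hypothetical), using an in-memory SQLite database and the standard library's unittest:

```python
import sqlite3
import unittest

REVENUE_SQL = """
    SELECT customer_id, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""

class TestRevenueSql(unittest.TestCase):
    def test_sums_revenue_per_customer(self):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [(1, 100.0), (1, -20.0), (2, 50.0)],
        )
        rows = conn.execute(REVENUE_SQL).fetchall()
        self.assertEqual(rows, [(1, 80.0), (2, 50.0)])

if __name__ == "__main__":
    unittest.main()
```

The same tiny dataset from development becomes the test fixture, so every manual check done while writing the query is preserved for the next person.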

Deployment

Deploying to production should be an isolated, reviewed and approved button click.

Standard Software Engineering code reviews should be used here to catch any obvious issues and to ensure code guidelines and standards are followed.

If deploying to production means manually changing 6 different configuration values, S3 buckets, database endpoints or table names, then a system should be devised to deploy to production via an automated process.

The automation can then be extended to enforce standards. Folks breaking production because their tests don't pass? Update the automated deploy to disallow this.
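
A minimal sketch of that kind of gate, assuming a hypothetical deploy.py entry point that refuses to ship unless the test suite passes:

```python
import subprocess
import sys

def deploy() -> None:
    """Hypothetical one-click deploy: run the tests, then ship."""
    tests = subprocess.run([sys.executable, "-m", "unittest", "discover"])
    if tests.returncode != 0:
        sys.exit("Tests failed, refusing to deploy to production.")
    # Replace this placeholder with the real deployment step
    # (copy artifacts, apply migrations, promote configuration, ...).
    print("Deploying to production")

if __name__ == "__main__":
    deploy()
```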

Administration

Data Engineers often have to maintain a database with permissions, tables and views and other changes that can occur without an actual deploy. A view can be created through a SQL client without having to go through any process.

Some of these should have checklists and guidelines, and some should require enhanced permissions. Need to update permissions so that a user can query a table? Go through the checklist and follow standards. Need to build a new schema? Talk to an administrator with access to a super user account. No need to reinvent the wheel every time you need to add permissions, and no need to allow every engineer to do anything on the system.

Need to build a view? Great, follow the standards across the team for code storage, table location and permissions.

Most administration work is very routine and it's a waste of an engineer's time to start from scratch every time they are taking on this type of work.

Conclusion

A robust system is constructed step by step on 

  • A standard development method
  • Followed consistently
  • Updated when necessary
  • And agreed upon by the team

Generally with Engineers this means

  • Work on small data that is well understood
  • Track validation and tests as part of production code
  • Document standard methods and approaches so anyone can work on anything
  • Collaboratively review each other's work

Don't lose sight that we are engineers building a structure step by step, and the structure is only as strong as its weakest point. Standards, Tests, Process and Reviews help keep the structure strong.

2022-05-13

Book Suggestions

Book Suggestions for Data Practitioners

This is a collection of books that I've either read or partially read that I think are valuable for people who work in Data.

These books are for a range of skill levels from beginners to advanced data folks.

A lot of these books receive continued updates, so be sure to get the newest version.

These 2 don't really fit in any other section, but are probably the most useful.

Data Visualizations

Engineering

Machine Learning

Statistics and Analytics




2022-04-01

Why does my startup need a Data Team anyway?

This is a reasonable question. Any business should have a healthy skepticism toward hiring. Trying to hire your way out of a problem is feasible, but it can be a really expensive proposition for any company.

Typical questions,

  • Can't the engineers, the accountants and finance people handle data stuff?
  • What benefit do I get from hiring data analysts, data engineers and data scientists?

The engineers, accountants and finance folks can take a business a long way. They can make plots and charts and determine if the business is making money or going bankrupt. They can support quarterly reports to a board of directors.

A Data Team tackles faster moving, more predictive, larger and messier datasets than accounting and finance, and frees up the engineers to do actual engineering work. Having data-specific folks is a specialization of skill and allows focus on a bounded domain.

Let's start with an Exercise

Let's calculate Revenue over the last month from a table with columns

  • customer id
  • revenue
  • revenue date time

This is super easy right?

  • Filter by date time to the "last month"
  • Sum the revenue, and you're done!
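
As a sketch (the table name and exact column names are hypothetical stand-ins for the list above, here written in pandas), the naive version really is just a filter and a sum:

```python
import pandas as pd

# A hypothetical table with the three columns listed above.
revenue = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "revenue": [100.0, 250.0, 75.0],
        "revenue_date_time": pd.to_datetime(
            ["2022-06-28", "2022-07-03", "2022-07-19"]
        ),
    }
)

# Filter to "last month" and sum, and you're done.
last_month = revenue[
    (revenue["revenue_date_time"] >= "2022-07-01")
    & (revenue["revenue_date_time"] < "2022-08-01")
]
print(last_month["revenue"].sum())  # 325.0
```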

In practice, the tables are never this simple. They tend to have dozens of columns, and revenue data can be spread across another dozen tables. Trying to get a handle on 12 columns times 12 tables times the number of records takes some time and effort, even when the number of records is small.

Typically a revenue calculation would involve,

  • refund amount
  • item sold
  • type of item sold
  • different revenue streams that may have their own tables
    • or might even be stored in the same table in different columns
  • etc, etc, more and more columns and tables, on and on and on

Accountants and finance folks can handle these types of problems in Excel with little difficulty. An engineer can run some SQL, hand the data off to accounting, and they can load it into Excel. Excel is great, but it does have size limitations. There is only so much memory on a laptop, and the business complexity will continue to increase.

More tables, more SQL joins, and more corner cases cause businesses to increase the amount of time required to support basic revenue calculations. At some point, it becomes a half time or full time position for a dedicated person.

Is there a data team?

Accounting and finance may gradually ramp up the work for the engineer to support their efforts. Does your business have an engineer spending more than 50% of their time supporting data pulls? Does it have 2 engineers spending 25% each?

Then your business already has a data team, it just hasn't been verbalized. That's not a great position to be in. Best to have clear communication, goals and expectations for folks.

I would propose that the equivalent of one-half of an engineer's time spent supporting data pulls for accounting and finance is a data team.

What do I mean by "Team"?

In the above example accounting and finance were the core of the data team, but not all businesses have such a complex accounting or finance group. Some tech startups have more questions about their business operations than about their revenue and tax commitments.

  • How many people visited the website today?
  • Where did our new visitors come from?
  • How can we encourage them to stay?

This can lead a business into data analytics. Analysts focus on business questions and business use cases. But their focus is so oriented towards business problems that they are not necessarily the strongest engineers. In order to move faster and have more regularity in their delivered reports they will need support, and thus a business may hire a Data Engineer or Analyst Engineer.

This can cascade into a full data team with analysts, engineers and scientists - What are all these data people doing?

Data teams grow gradually, one person at a time until someone stands up and says "Look Over Here! We have a data team".

Conclusion

Maybe your organization doesn't need a full data team. A single accountant can support a large, complex business, and maybe that's enough.

Maybe you already have a data team, but it just hasn't been said. Folks supporting accounting and finance, folks writing SQL, making plots and charts and just generally wrangling data to provide value to the business.

Being clear with data goals and expectations can help accelerate the value that the data can provide. Daily updated dashboards can help a business understand where it is right now. Historical data analysis can help them see where they have been and forecasting can give a business an idea of where it might be in 6 months.

2022-02-18

Anti-Patterns in Experimentation

Running experiments on websites involves showing 2 or more different experiences to similar but distinct groups of users and measuring the differences to determine which experience works best for the website. There are a variety of methods of setting up and running experiments and industry standards are unfortunately not yet clear.

Often experiments on websites start with a single developer putting in a single "if" statement and experimentation only grows from here.

The intent of this post is to highlight a few anti-patterns I've seen in the hopes that the next team implementing an experiment might avoid some of the common pitfalls.

Here are the 3 most common anti-patterns in AB Testing that I've seen,

  1. Experimentation is a Library
  2. All Users are in all Experiments
  3. A pile of data dropped on an Analyst

Experimentation

At its core experimentation is

  • Giving different experiences to different groups of users
  • Measuring differences
  • Moving forward with the experience that provides the best outcome

An experiment has to be implemented in a website, an app or an email send. There must also be tracking of some sort to measure which experience each user was shown.

An Experiment Assignment is defined as a single user's experience and should include,

  • User id
  • Experiment Name
  • Variation Name
    • Which variation of the website did this user experience? For example, it could be the "red_button" or the "blue_button"
  • Timestamp
    • Ideally the first time the user encountered the experience.
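
As a sketch of what that record and a deterministic user split can look like (the class, function and variation names here are hypothetical, not any specific vendor's API):

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ExperimentAssignment:
    user_id: str
    experiment_name: str
    variation_name: str
    timestamp: datetime

def assign(user_id: str, experiment_name: str, variations: list[str]) -> ExperimentAssignment:
    """Bucket a user by hashing user id + experiment name, so the split is stable."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    variation = variations[int(digest, 16) % len(variations)]
    return ExperimentAssignment(
        user_id=user_id,
        experiment_name=experiment_name,
        variation_name=variation,
        timestamp=datetime.now(timezone.utc),
    )

# The same user always lands in the same variation for a given experiment,
# and the assignment record is what gets logged for the analyst.
print(assign("user-123", "button_color", ["red_button", "blue_button"]))
```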

Anti-Patterns

An anti-pattern is a pattern of code development or structure that seems like a good idea, but ultimately turns out poorly. When discussing software systems, it's sometimes just as important to point out how things can go wrong as it is to discuss best practices.

Experimentation is a Library

Experimentation is a library! Call a function to split the users. Send the Experiment Assignments downstream for an analyst to use. No problem.

But wait, for the analyst, what does the data mean? How is it produced? Where did it come from? Why are we running this experiment at all?

All these questions should come out before the experiment is run so that methods and data can be aligned to generate the best outcome.

Experimentation is a Process from hypothesis to development, analysis and cleanup and typically looks like,

  1. An experimenter hypothesizes that a new experience would be better by some measure
  2. An engineer develops the new experience to be able to run in parallel to the existing experience
  3. Users enter and interact with the 2+ experiences
  4. An analyst evaluates and compares the two experiences quantitatively
  5. An engineer codes the website to exclusively display the experience that is quantitatively better

Experimentation is a House of Cards that can return unusable or inaccurate results if any particular piece of this process fails to properly coordinate with the other pieces.

  1. The experimenter must have a reasonable hypothesis that can be quantitatively evaluated
  2. The engineer must properly, randomly split the users and present 2+ high quality experiences
  3. Users must be given time to enter and interact with the 2+ experiences
  4. The analyst must understand the data production and apply a rigorous analysis
  5. The engineer must prioritize and properly cleanup the codebase to guarantee that the best experience is shown to all users from now on

Any step can have minor bugs, but if there is a disconnect or any major bugs the whole system will topple. The experiment may produce incorrect results without anyone ever knowing that the results were incorrect.

Experimentation is NOT a library. It's a system, it's a process, it's a puzzle and the pieces must fit properly to provide high quality results.

All Users are in all Experiments

Every user that has ever landed on the website or that visits during an experiment gets an Experiment Assignment logged for the analyst to use.

Did they actually see the experiment? Who knows, doesn't matter, the analyst can figure it out.

😳

I have worked with outstanding analysts and am 100% positive they can figure this out and tease apart the raw data into some usable answers.

But how long will this take? How many experiments are we intending to run?

Analysts are busy and analytics takes time. Is it worth it to the business to have an analyst spending a week to produce an analysis for every experiment?

This pattern is easy on the Engineering team, but hard on an Analytics team. They will feel underserved and unappreciated, and will likely be overworked when asked to keep up with a fast-paced experimentation culture.

It also makes automation really hard as each test's analysis needs this extra bit of meta-data that may or may not capture the nuance of when the user actually saw the experiment.

Best to Devise a System of Data Logging that analysts can easily work with and that can allow for some automation.

A pile of data dropped on an Analyst

The old joke is that when you present a pile of data to an analyst and ask "What does this all mean?" the analyst will inevitably say "Not much". Pulling strong signals from experimentation requires up-front planning from analysts. They can catch obvious bugs in the design, point out metrics that can't be calculated, and will recall existing similar experiments that have been run to help catch duplicate work.

Analysts are smart, hard working and resourceful. They will find some signal in whatever dataset they are presented with but it can be hard for an organization to measure and realize that they are spending an enormous amount of time generating conclusions from experimentation.

It's also impossible for an organization to measure the quality of a single experiment analysis. The system as a whole must be healthy for an organization to rely on the results.

Pull Analysts Into the Process Early to give your analysts some heads up and make sure they are part of the experiment process at the first step.

Conclusion

Avoiding some of the anti-patterns highlighted here will allow a team to move briskly into a future full of high quality, trusted experimentation.

Instead of Experimentation is a Library, recognize that Experimentation is a Process involving multiple teams and diverse points of view.

Instead of All Users are in all Experiments, work with analysts to Devise a System of Data Logging with experimentation in mind. Something that is intended for analysts to work with and will produce trusted experimentation results.

Instead of A pile of data dropped on an Analyst, Pull Analysts Into the Process Early. Work with your analysts to produce a system of experimentation that is easy to work with and trusted to produce quality results.

Go forth and keep trying new ideas that delight your users and help your website grow!

Further reading

I recommend the book "Trustworthy Online Controlled Experiments (A Practical Guide to A/B Testing)" by Ron Kohavi, Diane Tang and Ya Xu.

2022-01-10

Clean Code for Data Professionals

Simple Tips towards Clean Code for Data Folks

A few years ago I was a Software Engineer and my team lead asked me to take a look at a piece of code. This code interacted with a 3rd party library and we wanted to upgrade that library.

My team lead knew this code was quite old and understood that it might be hard to work with. He asked me to evaluate the level of effort to upgrade vs level of effort to re-write.

So I took a look and found a single C++ file

  • With a single class, with a single static function
    • With ~1200 lines of code
    • With dozens of local variables
    • With local variables reassigned 300 lines after initially being set
  • With
    • no comments - literally 0, not one comment in the entire file
    • no unit tests
    • no tests of any sort

So I told my team lead that it would be easier to re-write the code than to refactor it. But any of the following could have tipped the balance back toward refactoring,

  • Comments would have helped to understand what was happening in the code
  • Tests would have allowed me to modify the code with confidence that I was maintaining the original functionality
  • Smaller functions would have helped increase the code readability
  • Constant local variables would have dramatically increased the readability of the code

Data professionals are not software engineers. Data Scientists and Data Analysts are definitely not, and Data Engineers are a mix as some come from data science and data analysis, and some from software engineering.

To that end, here are a few tips and tricks that should help make your code easier to read and maintain while still allowing the focus to be on the data and not the code.

Functions should fit on your Monitor

Top to bottom, all the code inside the function should fit on your screen. This drastically increases the readability of any particular piece of code as each function can be quickly read and understood.

This also applies to Jupyter notebooks and SQL statements. For example, I could remove all the section headers in this post, but that would make it a lot harder to read.

🔑 This is the main point here. All code becomes 10x easier to understand if it's chunked up into bite-sized pieces.

Comments

At a minimum I would suggest

  • "Why does this file exist?" as a comment at the top of every file
  • "Why does this function/SQL exist?" as a comment for every function or SQL statement
  • "What are the data types of the arguments and the return value?" as a comment for every function in Python

Just remember, you will be picking up this code in ~6 months, and you can set yourself up for an easy reminder of what this code does, or a tough one.
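
A minimal sketch of what those three comments look like in Python (the file and function are hypothetical examples):

```python
"""Why does this file exist?

Turns raw order amounts into the daily revenue figure shown on the finance dashboard.
"""

def daily_revenue(amounts: list[float]) -> float:
    """Why does this function exist?

    Sums one day's order amounts so the dashboard query stays simple.
    Takes a list of dollar amounts and returns the total in dollars.
    """
    return sum(amounts)
```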

Unit Tests

Any piece of code is tested by the developer. They run it through various scenarios with a range of data values. So it gets tested, but sometimes these tests don't get archived.

The time to formalize these tests and to put them into the codebase is small, but can save a tremendous amount of time at a later date.

Imagine picking up a piece of code that has 2-3 tests vs a piece of code with 0 tests. One is a lot easier and faster to pick up and modify.
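
Formalizing those ad-hoc checks can be as small as this, a sketch that reuses the hypothetical daily_revenue function from the Comments sketch above:

```python
import unittest

def daily_revenue(amounts: list[float]) -> float:
    """Hypothetical function under test (see the Comments sketch above)."""
    return sum(amounts)

class TestDailyRevenue(unittest.TestCase):
    def test_sums_amounts(self):
        self.assertEqual(daily_revenue([10.0, 2.5]), 12.5)

    def test_empty_day_is_zero(self):
        self.assertEqual(daily_revenue([]), 0)

if __name__ == "__main__":
    unittest.main()
```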

Constant Local Variables

This can be really impactful for Jupyter Notebooks and for refactoring existing code.

For notebooks, with dozens of cells and variables, everything is effectively a global variable, and having confidence that variables are not in flux makes the code easier to read and maintain.

For refactoring, often data code ends up in large functions with dozens of local variables. Updating the variables to be unchanged or constant can make long functions easier to read and refactor.
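
A small before-and-after sketch of what this means in practice (the loader and column names are hypothetical):

```python
import pandas as pd

def load_orders() -> pd.DataFrame:
    """Hypothetical loader standing in for a real query."""
    return pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, None, 50.0]})

# Before: one name is reassigned, so its meaning drifts as you read.
data = load_orders()                                 # raw rows
data = data.dropna()                                 # now cleaned rows
data = data.groupby("customer_id")["amount"].sum()   # now an aggregate

# After: each value gets its own name and is never reassigned.
raw_orders = load_orders()
clean_orders = raw_orders.dropna()
revenue_by_customer = clean_orders.groupby("customer_id")["amount"].sum()
```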

Conclusion

These tips are for you. You will be the one working on your own code in 6 months, or someone on your team will go on to a new exciting opportunity, and you will be asked to work on their code.

Wouldn't you rather work on this code?

  • Short Functions
  • Comments
  • Unit Tests
  • Constant Local Variables