About Me

An experienced Data Professional working in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2022-12-02

Two Types of Data Engineers

There are two types of data engineers: those who come from a data/SQL/warehouse background and those who come from a software engineering background.

Both are necessary to keep a modern data system running and to keep a company supplied with the data it needs, when it needs it.

What does a Data Engineer "do"?

From "What are all these Data people doing anyway?":

Raw data can be messy and/or inaccessible. Data Engineers are responsible for building pipelines to pull and archive the raw data and to systematically clean datasets for use in downstream reporting. Other teams have data needs with requirements around how often the data is refreshed and data engineers have the skills to fulfill those requests.

  • Make clean data accessible with requirements around
    • When
    • Where
    • How
  • Optimize data flows for
    • Speed to delivery
    • Cost to operate

 Example Projects

  • Data Warehousing
    • Usually the flagship project of Data Engineering teams. Store the data in an archival format so that reporting, analytics and modeling can run on top of it.
  • Real Time Server Side User Profiles
    • Make user data accessible in real time for modeling, analytics or for display on a website

Two Types

There are two complementary types of Data Engineers:

  • Analytics Engineers
  • Software Data Engineers

Typically one is focused on Analytics problems and building clean, aggregated tables, while the other is focused on APIs and moving data from one location to another.

Both are essential to a well-functioning Data Engineering team, which will recognize the complementary skill sets required to pull and move data and to prepare it for analysis and reporting.

Analytics Engineers

Analytics Engineers love to clean datasets for generic use. As companies grow, analysts realize that their reporting dashboards are getting slower and slower and increasingly hard to maintain. This is often due to the dashboard querying raw data while joining dozens of tables and dealing with hundreds of corner cases.

Analytics Engineers are here to be the middle ground between "the business is asking for 16 things simultaneously" and "this query is running slow, perhaps a different sortkey would help". They build clean, aggregated datasets used to power dozens of different dashboards each with a different perspective on a dataset.

Typically, Analytics Engineers work heavily in SQL systems and run nightly jobs that clean and aggregate incoming data into datasets that analysts can easily use. They are engineers and follow engineering principles, but they are business focused and listen carefully to analyst problems and needs. Part of being an Analytics Engineer is looking at bigger issues and finding places where 3 different problems could be solved by building a single, clean dataset.
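As a rough illustration of what such a nightly job might look like, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a real warehouse. The table names (raw_revenue_events, daily_revenue), the columns and the sample rows are all made up for the example, and a real job would have many more cleaning rules.

```python
import sqlite3


def build_daily_revenue(conn: sqlite3.Connection) -> None:
    """Rebuild a clean, aggregated table from raw events (hypothetical schema)."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS daily_revenue;
        CREATE TABLE daily_revenue AS
        SELECT date(event_ts) AS day,
               SUM(amount)    AS revenue,
               COUNT(*)       AS n_events
        FROM raw_revenue_events
        WHERE amount IS NOT NULL      -- drop obviously broken rows
        GROUP BY date(event_ts);
        """
    )


# Tiny in-memory example with invented data, including one broken row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_revenue_events (event_ts TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_revenue_events VALUES (?, ?)",
    [
        ("2022-12-01 09:15:00", 10.0),
        ("2022-12-01 17:40:00", 5.5),
        ("2022-12-02 08:00:00", None),
        ("2022-12-02 11:30:00", 20.0),
    ],
)

build_daily_revenue(conn)
rows = conn.execute(
    "SELECT day, revenue, n_events FROM daily_revenue ORDER BY day"
).fetchall()
for row in rows:
    print(row)
```

Dashboards would then query daily_revenue instead of the raw events, so the cleaning logic lives in one place.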

Example projects

  • Clean Revenue data - dataset with 1 row per revenue event
    • Seems simple, but could easily involve joining 10+ datasets and accounting for that one time in 2018 when we had a 24 hour website bug that caused all collected revenue to be recorded in the raw data as 10x the actual revenue.
  • Clean Event data - dataset with 1 row per user action
    • Once again, seems simple and a "SELECT * FROM atable" seems like it would work, but it almost never does in practice.
    • Sometimes there are 10 duplicated events, each with a timestamp that differs by milliseconds.
    • Sometimes events are collected in 2 or 3 different locations and each has its own bugs and idiosyncrasies that need to be accounted for before joining into a single clean dataset.
    • Each corner case has to be accounted for and differs from company to company and project to project.
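To make the duplicated-events case concrete, here is a small Python sketch of one possible rule: treat an event as a duplicate if it shows up within a short window of the last kept event for the same user and action. The field names (user_id, action, ts) and the one second window are assumptions for illustration only. As noted above, the right rule differs from company to company.

```python
from datetime import datetime, timedelta


def dedupe_events(events, window=timedelta(seconds=1)):
    """Keep the first event in each burst of same-user, same-action events.

    An event is dropped when it arrives within `window` of the last kept
    event with the same (user_id, action) key. Field names are hypothetical.
    """
    last_kept = {}  # (user_id, action) -> timestamp of the last kept event
    clean = []
    for event in sorted(events, key=lambda e: e["ts"]):
        key = (event["user_id"], event["action"])
        previous = last_kept.get(key)
        if previous is not None and event["ts"] - previous < window:
            continue  # near-identical timestamp: treat as a duplicate
        last_kept[key] = event["ts"]
        clean.append(event)
    return clean


# Invented sample: three near-simultaneous clicks, one later click, one other user.
base = datetime(2022, 12, 2, 12, 0, 0)
raw = [
    {"user_id": 1, "action": "click", "ts": base},
    {"user_id": 1, "action": "click", "ts": base + timedelta(milliseconds=3)},
    {"user_id": 1, "action": "click", "ts": base + timedelta(milliseconds=7)},
    {"user_id": 1, "action": "click", "ts": base + timedelta(seconds=30)},
    {"user_id": 2, "action": "click", "ts": base + timedelta(milliseconds=5)},
]

clean = dedupe_events(raw)
print(len(raw), "raw events ->", len(clean), "clean events")
```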

Software Data Engineers

It's all fine and good to build nice datasets for analytics use, but first you have to have data ready to go.

That's the job of the Software Data Engineer.

Sometimes you have a 3rd party vendor with an API that returns a base64 encoded JSON string containing a bunch of encrypted keys with serialized results that need yet another API call to decode 😧. No problem for a software data engineer. They will pull this data, massage it, throw out the obvious junk, check that it looks good and load it for use by analysts, data scientists, and of course, analytics engineers.
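A stripped-down sketch of the "pull, massage, throw out the junk" step might look like this in Python. It only covers the base64 and JSON layers. The payload shape and the id field are invented for the example, and the encrypted keys and the extra decoding API call are left out because they depend entirely on the vendor.

```python
import base64
import json


def decode_vendor_payload(payload):
    """Decode a base64-encoded JSON array and drop obviously bad records.

    The record shape (dicts with a non-null "id") is a hypothetical example.
    """
    raw_bytes = base64.b64decode(payload)
    records = json.loads(raw_bytes)
    return [
        record
        for record in records
        if isinstance(record, dict) and record.get("id") is not None
    ]


# Build a fake vendor response so the example is self-contained.
fake_records = [
    {"id": 1, "value": "a"},
    {"id": None, "value": "junk"},
    "garbage",
    {"id": 2, "value": "b"},
]
payload = base64.b64encode(json.dumps(fake_records).encode("utf-8")).decode("ascii")

good = decode_vendor_payload(payload)
print(good)
```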

They also make data available for use by other teams via APIs. In some ways they seem to be backend engineers and sometimes there is overlap, but their focus is on data problems. Reporting, machine learning and other analytics systems are what they are most concerned with.

Example projects 

  • Copy production datasets for backend use
  • Pull data from 16 different vendors into a single system that can combine them all together
  • Work closely with Data Scientists to make model results available via APIs
    • There is an overlap here with Machine Learning Engineers who focus exclusively on this problem.

What's the difference between a Data Engineer and a Backend Engineer?

Often there isn't much difference. Backend engineers will run data transformations and batch jobs that clean data and make it accessible to the front end of a website. Both of these specialties involve a lot of software engineering, API development and movement of clean datasets from one system to another.

The difference between these jobs tends to be who their primary stakeholders are. If they are primarily responsible to other engineers and product managers they tend to be lumped in as backend engineers. In this case they often run in sprints with other product engineers and their focus is on clean production datasets.

On the other hand, if they are primarily responsible to data scientists and data analysts, they are typically lumped in with "data" engineers. These teams will often run separately from teams who are focused exclusively on user experience in websites and apps, as their responsibilities are more around reporting and supporting machine learning models.

Conclusion

What exactly is the difference here? SQL, Python and Java are just programming languages, and these folks are all Engineers, so why are there 2 types?

I think it is a historical accident. Database Administrators (DBAs) have been a job function for decades and still exist at larger companies. Their primary focus is building SQL systems that analysts can use to do reporting.

But then data science arose in the last 15 years, and data scientists wanted to do more. They wanted more data and messier data, and they wanted to use it all to produce machine learning scores in a timely manner. So Software Data Engineers were created to support them.

This led to a confluence of events where companies look at their data teams and say "we just need some analysts and engineers, right?" They lump Data Science and Data Analysis together, as well as Analytics Engineering and Backend Engineering, and get an awkward combination.

It's our job to tease apart these differences to build well-rounded teams focused on whatever the business needs.