About Me

An experienced Data Professional with experience in Data Science and Data Engineering interested in the intersection of Machine Learning and Engineering.



What are all these Data people doing anyway?

Data Organization Specialties

Companies have lots of people working in data. This is distinct from the people working on websites, warehouses, machinery and design. Usually it's easy to figure out who is working on data, but it can be hard to identify what they are doing.

Everyone has a specialty, an emphasis. Outlined here are the 4 data specialties that I've encountered in data organizations.

4 Data Specialties

Data organizations have pretty diverse projects and requirements. They range from "How many users visited the website yesterday?" to "Our collider collects too much data to store, what should we discard?"

No single person ever understands all the data, nor is there some unicorn who can do everything. Teams of people with data specialties coordinate on solving the diverse problems that organizations have.

These teams tend to have:

  • Analysts answering questions
  • Engineers storing data
  • Librarians organizing metadata
  • Scientists building predictions

There is always overlap, and everybody on a team does a little of everything, but people specialize so that an organization can coordinate and solve ever larger problems.


Data Analysts are responsible for finding actionable insights in the data. Building reports is often their primary responsibility, but the work can go much deeper. One-off analyses can involve simple tables of numbers, but they can also involve constructing a model, engineering features, and explaining both the model and the importance of its features.

  • Reporting
    • Dashboards
    • Defining and implementing metrics
  • One-off Analyses
    • Pulling numbers
    • Building and explaining models

 Example Projects

  • Ongoing Reporting
    • On an ongoing basis, report on a variety of metrics
    • Involves a lot of documentation and discussion on what metrics to report, how to calculate the metrics and how to make them accessible to the business as a whole
  • One-off Analyses
    • Does having a social media presence on a particular platform help bring in new website users at a reasonable cost?
    • Which users should receive coupons in this upcoming email?
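
As a rough illustration of what "defining and implementing a metric" can look like, here is a minimal Python sketch of a daily unique visitors count, the kind of number behind "How many users visited the website yesterday?". The event layout (`user_id`, visit date) and the sample rows are assumptions invented for this example; a real report would read from whatever the organization's logging actually produces.

```python
from collections import defaultdict
from datetime import date

# Hypothetical page-view events as (user_id, visit_date) pairs.
# These rows are made up purely for illustration.
page_views = [
    ("u1", date(2024, 1, 1)),
    ("u2", date(2024, 1, 1)),
    ("u1", date(2024, 1, 1)),  # repeat visit by the same user on the same day
    ("u3", date(2024, 1, 2)),
]


def daily_unique_visitors(events):
    """Count distinct users per day; repeat visits by one user count once."""
    users_by_day = defaultdict(set)
    for user_id, day in events:
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in sorted(users_by_day.items())}


metric = daily_unique_visitors(page_views)
for day, count in metric.items():
    print(day.isoformat(), count)
```

Most of the real work is in the definition rather than the code: whether a repeat visit counts, which time zone "a day" is in, and whether logged-out visitors are included are exactly the things that need documenting and discussing.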


Raw data can be messy and/or inaccessible. Data Engineers are responsible for building pipelines to pull and archive the raw data and to systematically clean datasets for use in downstream reporting. Other teams have data needs, with requirements around how often the data is refreshed, and data engineers have the skills to fulfill those requests.

  • Make clean data accessible with requirements around
    • When
    • Where
    • How
  • Optimize data flows for
    • Speed to delivery
    • Cost to operate

 Example Projects

  • Data Warehousing
    • Usually the flagship project of Data Engineering teams. Store the data in an archival format such that reporting, analytics and modeling can run on top
  • Real Time Server Side User Profiles
    • Make user data accessible in real time for modeling, analytics or for display on a website


The Librarian maintains what goes into datasets and how to interpret the data stored within them. It is an often overlooked role that tends to get lumped in with, or distributed across, the rest of the data organization, but it is a distinct and essential function that has to be done. It requires close coordination with the business and with engineers to define datasets and to maintain a dictionary of terms related to those datasets. Librarians organize and store the metadata that enables the rest of the data organization to understand the data.

  • Maintains complex data dictionaries
  • Negotiates with data creators on
    • Vocabulary
    • Fields
  • Trains data creators and people performing data entry tasks

 Example Projects

  • WebLogs
    • Define a consistent naming scheme and definitions for capturing events on a website for downstream use.
  • Taxonomy
    • Organize what is known about products and the available category names and definitions.
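
A data dictionary does not have to be elaborate to be useful. The Python sketch below shows one possible shape for it, plus a check that flags fields nobody has defined yet. The `FieldDefinition` class, the three entries, and the `undocumented_fields` helper are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """One entry in a (hypothetical) data dictionary."""
    name: str
    dtype: str
    description: str


# Illustrative entries only; real definitions are negotiated with data creators.
DATA_DICTIONARY = {
    d.name: d
    for d in [
        FieldDefinition("event_name", "string", "Snake-case name of the website event"),
        FieldDefinition("user_id", "integer", "Internal identifier of the logged-in user"),
        FieldDefinition("occurred_at", "timestamp", "UTC time at which the event happened"),
    ]
}


def undocumented_fields(record):
    """Return the field names in the record that the dictionary does not define."""
    return sorted(set(record) - set(DATA_DICTIONARY))


event = {"event_name": "page_view", "user_id": 42, "ts": "2024-01-05T10:00:00Z"}
print(undocumented_fields(event))  # ['ts']
```

Here the event uses `ts` where the dictionary expects `occurred_at`, which is the kind of vocabulary mismatch a librarian would catch and then sort out with whoever creates the data.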


The Data Scientist builds and deploys mathematical models that help the business solve specific problems. Sometimes the model is a simple data transformation or heuristic, sometimes a t-test, and sometimes a complex model (or models) that pushes the limits of computation. Data scientists do a little bit of everything, each with their own emphases and with a focus on business problems. Individual projects often look a lot like the work that Data Analysts and Data Engineers do, but there is typically a range of projects with a heavy foundation in mathematics.

  • One-off analyses
    • Pulling numbers
    • Building and explaining models
  • Predictions
    • Granular predictions in support of business projects

 Example Projects

  • Recommendations
    • Deliver automated, reliable recommendations on a periodic basis
  • User Scoring
    • Use churn, LTV or other user scores to optimize handing out coupons
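
As a sketch of the user scoring idea, the Python below ranks users by churn probability times LTV and gives coupons to the top few. Everything here is an assumption for illustration: the scores are made-up numbers standing in for model output, and "value at risk" is just one simple heuristic for combining them, not the only or best one.

```python
# Hypothetical per-user scores. In practice these would come from models;
# the numbers below are invented for illustration.
users = [
    {"user_id": "a", "churn_prob": 0.80, "ltv": 50.0},
    {"user_id": "b", "churn_prob": 0.05, "ltv": 400.0},
    {"user_id": "c", "churn_prob": 0.60, "ltv": 200.0},
    {"user_id": "d", "churn_prob": 0.05, "ltv": 30.0},
]


def value_at_risk(user):
    """Expected value lost if the user churns: churn probability times LTV."""
    return user["churn_prob"] * user["ltv"]


def pick_coupon_recipients(users, budget):
    """Return the ids of the `budget` users with the most value at risk."""
    ranked = sorted(users, key=value_at_risk, reverse=True)
    return [u["user_id"] for u in ranked[:budget]]


recipients = pick_coupon_recipients(users, budget=2)
print(recipients)  # ['c', 'a']
```

Note that user "b" has the highest LTV but is unlikely to churn, so the coupon goes elsewhere. A fuller treatment would also ask whether a coupon actually changes a user's behavior, which is a harder modeling problem.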


Lots of people work in data. What's your role?

  • I answer questions asked by non-technical people - Analyst
  • I build platforms to store and allow retrieval of data - Engineer
  • I organize what the stuff in the data means - Librarian
  • I do math and build granular predictions - Scientist

Maybe you do 2, 3 or all 4 of these types of work! That's common, but I think you will find that you enjoy one type of work more than the others.

Look around you. What about the people you work with?

The Excel power users, the SQL wizard, the genius in finance - they all fit into the picture of this world we call data.