About Me

My photo
An experienced Data Professional with experience in Data Science and Data Engineering interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2024-10-31

On Data Flows for Machine Learning Services

On Data Flows for Machine Learning Services

Building data flows for Machine Learning systems is the mundane plumbing that ensures that a system is

  • transparent and testable
  • behaves as expected
  • has clear ownership
  • is scalable to high volume
  • is scalable to many ML services

A system is typically setup that will some id (typically a user id) and return some results such as a churn score or a set of recommendations. This is often an Rest or gRPC API call into a separate service running next to a website or web apps. The core of a system of this sort is a machine learning model that takes some data in and calculates some features and then returns a result.

Generally something like,

This is pretty straightforward, the core problem then is Where Does the Machine Learning Model get it's Features from?

  1. The application - ⚠️ Anti-pattern ⚠️
  2. Some other system
  3. The application as raw data - ðŸ’« Best Practice 💫

It's so easy to think of machine learning as just a thing that works and provides these amazing insights and personalization, but without having the data sets in sync and set up properly, the actual experience of developing, debugging, monitoring, and extending can stall out or require excessive maintenance.

Data Flows

In order to check all these boxes,
  • transparent and testable
  • behaves as expected
  • has clear ownership
  • is scalable to high volume
  • is scalable to many ML services

A system must

  • Be usable in a non-production environment
  • Return production quality results in a non-production environment
  • Have clear boundaries which team is responsible for what part of the system
  • Scale to hundreds, thousands and millions of users
  • Scale to 2, 3, 60, 500 machine learning services

Features come from the application - ⚠️ Anti-pattern ⚠️

In this system, the application will make a call to the machine learning system, and it'll give it everything it needs with the call to generate a result.

Pros & Cons

A system of this sort is very easy to develop and deploy, but hard to maintain.

  • ✅ Be usable in a non-production environment
  • ✅ Return production quality results in a non-production environment
  • 🚫 Have clear boundaries which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • 🚫 Scale to 2, 3, 60, 500 machine learning services

The team developing the features is not typically the same team as is developing the machine learning model so there can be misunderstandings and bugs. There can also be limitations in the features to be developed. Machine learning models often take large amounts of data over time and applications don't always have this so readily available.

This type of system tightly couples the machine learning system to the application, as well as denies the ability for the data used to calculate the features to be genetically available to other machine, learning models.

Features come from some other system

 

In this system, the application will make a call to the machine learning system, and it will just give some user id that requires a result for.

Pros & Cons

A system of this sort is very easy to develop, deploy and maintain but hard to test.
  • 🚫 Be usable in a non-production environment
  • 🚫 Return production quality results in a non-production environment
  • ✅ Have clear boundaries which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • ✅ Scale to 2, 3, 60, 500 machine learning services

These types of systems are also fairly easy to set up often as batch mode systems that calculate the features separately from the application.

This is a really good place for teams to start if your team or organization has never spun up a machine learning model. It's very understandable, easy to roll out to, but requires more thinking on the testing side.

The main con here is the application and the machine learning responses are not in sync. So a user id in one system may not be useful in another system. This makes testing difficult in non-production environments as either the application will have a user id that will match to a different machine learning result, or the application will have a user id that will match to no result.

Features come from the application as raw data - 💫 Best Practice 💫

 

In this system, the application will send raw data into an aggregator (typically called a feature store), this data will then be made available to the machine learning model for use in calculating a score or recommendations.

Pros & Cons 

A system of this sort is very hard to develop and deploy, but easy to test and scale.
  • ✅ Be usable in a non-production environment
  • ✅ Return production quality results in a non-production environment
  • ✅ Have clear boundaries which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • ✅ Scale to 2, 3, 60, 500 machine learning services

This is the most robust pattern available to an organization. It allows for multiple models to leverage the same set of features, as well as separates out the concerns of who maintains the features and how they are calculated from who uses the machine learning result.

This type of system also allows for testing in non-production environments since all the data comes directly from an application.

This type of system should not be the first machine learning system an organization spins up. It takes a bit more sophistication, a bit more raw data being flowing out of the system and a whole other set of aggregations separate from the application owned by another team.
 
An organization should definitely spin up a system like this if it anticipates a few dozen machine learning systems to be spun up, or it has tight requirements on robustness, speed or quality.

Bonus: ⚠️ Anti-pattern ⚠️ Features come from both the application and another system 🪢

 


In this system, the application will make a call to the machine learning system, and it'll give it some of the data it needs with the call to generate a result, with other data coming from some other system.

Pros & Cons

A system of this sort is very hard to develop and deploy, and test and maintain.
  • 🚫 Be usable in a non-production environment
  • 🚫 Return production quality results in a non-production environment
  • 🚫 Have clear boundaries which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • 🚫 Scale to 2, 3, 60, 500 machine learning services

🤪 This type of system design is excellent if you'd like to tie yourself in and not stay up all night and be very unsure if your system is performing as you expected to 🤪

There are no pros to a system of this sort.

This system not only has issues with separation of concern, data being mismatched, but also has the fun added bonus of another method of data being mismatched. Not only may the user ids be mismatched but the features may be as well leading to a situation where instead of recommending books, the system may be recommending books, movies, tables and gibberish.

Notes on input datasets

There are just so many different ways to build a machine learning algorithm that the above examples are wrapped around "user id" being the input to the system. Really it could be any identification as a single field or multiple, simple or complex. The state matters here as much as the entity being evaluated.

For example it could "user id + book" indicating the model should generate recommendations for this user limited to "books".

Conclusion

Overall, it's best practice if the people and teams that own in the machine learning model also own the construction of the features that power of the model. Systems can be set up in such a way to support this goal or to detract from it.

Goal should be a generic system that powers not one machine learning model, but dozens, allowing for an organization to scale.