About Me

An experienced Data Professional with a background in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2024-10-31

On Data Flows for Machine Learning Services

Building data flows for Machine Learning systems is the mundane plumbing that ensures that a system

  • is transparent and testable
  • behaves as expected
  • has clear ownership
  • is scalable to high volume
  • is scalable to many ML services

A system is typically set up that takes some id (typically a user id) and returns some result, such as a churn score or a set of recommendations. This is often a REST or gRPC API call into a separate service running next to a website or web app. The core of a system of this sort is a machine learning model that takes some data in, calculates some features, and returns a result.

Generally, something like the sketch below.
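A minimal sketch in Python; `build_features` and `model` are hypothetical stand-ins, not a real API, and a real service would sit behind the REST or gRPC endpoint:

```python
# Hypothetical churn-scoring handler: some id in, a result out.
# `model` stands in for any trained model with a predict() method.

def build_features(user_id: str) -> list[float]:
    """The core question of this post: where does this data come from?"""
    raise NotImplementedError

def get_churn_score(user_id: str, model) -> float:
    features = build_features(user_id)
    return model.predict([features])[0]
```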

This is pretty straightforward; the core problem then is: Where Does the Machine Learning Model Get Its Features From?

  1. The application - ⚠️ Anti-pattern ⚠️
  2. Some other system
  3. The application as raw data - 💫 Best Practice 💫

It's so easy to think of machine learning as just a thing that works and provides these amazing insights and personalization, but without the data sets being in sync and set up properly, the actual experience of developing, debugging, monitoring, and extending can stall out or require excessive maintenance.

Data Flows

In order to check all these boxes,

  • is transparent and testable
  • behaves as expected
  • has clear ownership
  • is scalable to high volume
  • is scalable to many ML services

a system must

  • Be usable in a non-production environment
  • Return production quality results in a non-production environment
  • Have clear boundaries for which team is responsible for what part of the system
  • Scale to hundreds, thousands and millions of users
  • Scale to 2, 3, 60, 500 machine learning services

Features come from the application - ⚠️ Anti-pattern ⚠️

In this system, the application makes a call to the machine learning system and hands it everything it needs to generate a result.
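A minimal sketch of this pattern (hypothetical field names; the point is that the application has already computed every feature the model needs):

```python
# Anti-pattern sketch: the application computes the features itself
# and ships them with the request. Field names are illustrative.
request = {
    "user_id": "u-123",
    "features": {
        "days_since_signup": 412,    # computed by the application team...
        "purchases_last_30d": 3,     # ...not by the team that owns the model
        "support_tickets_open": 1,
    },
}

def score(request: dict, model) -> float:
    # The ML service just scores whatever the application sent.
    return model.predict([list(request["features"].values())])[0]
```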

Pros & Cons

A system of this sort is very easy to develop and deploy, but hard to maintain.

  • ✅ Be usable in a non-production environment
  • ✅ Return production quality results in a non-production environment
  • 🚫 Have clear boundaries for which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • 🚫 Scale to 2, 3, 60, 500 machine learning services

The team developing the features is not typically the same team that is developing the machine learning model, so there can be misunderstandings and bugs. There can also be limitations in the features that can be developed: machine learning models often take large amounts of data over time, and applications don't always have this readily available.

This type of system tightly couples the machine learning system to the application, and denies the ability for the data used to calculate the features to be generically available to other machine learning models.

Features come from some other system


In this system, the application makes a call to the machine learning system and just gives it some user id to fetch a result for.
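A sketch under the assumption that another system has batch-computed a feature table keyed by user id (the table and column names here are hypothetical):

```python
# Sketch: features are precomputed in batch by a separate system and
# looked up by id at request time. Table and column names are illustrative.
import sqlite3

def get_features(user_id: str, conn: sqlite3.Connection) -> list[float] | None:
    row = conn.execute(
        "SELECT days_since_signup, purchases_last_30d, support_tickets_open "
        "FROM user_features WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return list(row) if row else None  # None: the id exists in one system only

def score(user_id: str, conn: sqlite3.Connection, model) -> float | None:
    features = get_features(user_id, conn)
    return model.predict([features])[0] if features else None
```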

Pros & Cons

A system of this sort is very easy to develop, deploy and maintain but hard to test.
  • 🚫 Be usable in a non-production environment
  • 🚫 Return production quality results in a non-production environment
  • ✅ Have clear boundaries for which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • ✅ Scale to 2, 3, 60, 500 machine learning services

These types of systems are also fairly easy to set up, often as batch-mode systems that calculate the features separately from the application.

This is a really good place to start if your team or organization has never spun up a machine learning model. It's very understandable and easy to roll out, but requires more thinking on the testing side.

The main con here is that the application and the machine learning responses are not in sync, so a user id in one system may not be useful in another. This makes testing difficult in non-production environments: either the application will have a user id that matches a different machine learning result, or a user id that matches no result at all.

Features come from the application as raw data - 💫 Best Practice 💫


In this system, the application sends raw data into an aggregator (typically called a feature store), and this data is then made available to the machine learning model for use in calculating a score or recommendations.
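A minimal sketch, assuming an in-memory aggregator; in practice this would be a streaming pipeline or a dedicated feature store, and the event and feature names are illustrative:

```python
# Sketch: the application emits raw events; a separate aggregator
# (the "feature store") rolls them up; the model reads the aggregates.
from collections import defaultdict

feature_store: defaultdict = defaultdict(lambda: defaultdict(float))

def on_raw_event(event: dict) -> None:
    """Aggregator side: consume raw application events, update features."""
    user = event["user_id"]
    if event["type"] == "purchase":
        feature_store[user]["purchases_last_30d"] += 1
    elif event["type"] == "support_ticket":
        feature_store[user]["support_tickets_open"] += 1

def score(user_id: str, model) -> float:
    """ML service side: read the same aggregates in every environment."""
    features = feature_store[user_id]
    return model.predict([[features["purchases_last_30d"],
                           features["support_tickets_open"]]])[0]
```

Because the aggregates are derived purely from application events, pointing a non-production application at a non-production aggregator yields production-quality features for testing.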

Pros & Cons 

A system of this sort is very hard to develop and deploy, but easy to test and scale.
  • ✅ Be usable in a non-production environment
  • ✅ Return production quality results in a non-production environment
  • ✅ Have clear boundaries for which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • ✅ Scale to 2, 3, 60, 500 machine learning services

This is the most robust pattern available to an organization. It allows for multiple models to leverage the same set of features, as well as separates out the concerns of who maintains the features and how they are calculated from who uses the machine learning result.

This type of system also allows for testing in non-production environments since all the data comes directly from an application.

This type of system should not be the first machine learning system an organization spins up. It takes a bit more sophistication, a bit more raw data flowing out of the application, and a whole other set of aggregations, separate from the application and owned by another team.
 
An organization should definitely spin up a system like this if it anticipates spinning up a few dozen machine learning systems, or if it has tight requirements on robustness, speed, or quality.

Bonus: ⚠️ Anti-pattern ⚠️ Features come from both the application and another system 🪢

In this system, the application makes a call to the machine learning system and gives it some of the data it needs with the call, with the rest of the data coming from some other system.
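A sketch of the shape this takes (hypothetical names again); note that nothing guarantees the two halves describe the same user at the same moment:

```python
# Anti-pattern sketch: half the features arrive with the request,
# half are looked up in another system. The halves can silently disagree.
def score(request: dict, feature_table: dict, model) -> float:
    app_features = request["features"]                  # from the application
    stored = feature_table.get(request["user_id"], {})  # from another system
    merged = {**stored, **app_features}                 # which half wins on conflict?
    return model.predict([list(merged.values())])[0]
```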

Pros & Cons

A system of this sort is very hard to develop, deploy, test, and maintain.
  • 🚫 Be usable in a non-production environment
  • 🚫 Return production quality results in a non-production environment
  • 🚫 Have clear boundaries for which team is responsible for what part of the system
  • ✅ Scale to hundreds, thousands and millions of users
  • 🚫 Scale to 2, 3, 60, 500 machine learning services

🤪 This type of system design is excellent if you'd like to tie yourself in knots, stay up all night, and be very unsure whether your system is performing as you expected it to 🤪

There are no pros to a system of this sort.

This system not only has issues with separation of concerns and mismatched data, but also has the fun added bonus of a second way for data to be mismatched. Not only may the user ids be mismatched, but the features may be as well, leading to a situation where instead of recommending books, the system may be recommending books, movies, tables, and gibberish.

Notes on input datasets

There are so many different ways to build a machine learning algorithm that the above examples are wrapped around "user id" being the input to the system. Really, it could be any identification, as a single field or multiple, simple or complex. The state matters here as much as the entity being evaluated.

For example, it could be "user id + book", indicating the model should generate recommendations for this user limited to "books".
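One way to model such a composite key, as a minimal sketch with hypothetical names:

```python
# Sketch: the "id" can be a composite key. Hypothetical example where
# recommendations are scoped to a category such as "books".
from dataclasses import dataclass

@dataclass(frozen=True)
class RecommendationKey:
    user_id: str
    category: str  # e.g. "books": limit recommendations to this category

key = RecommendationKey(user_id="u-123", category="books")
```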

Conclusion

Overall, it's best practice if the people and teams that own the machine learning model also own the construction of the features that power the model. Systems can be set up in such a way as to support this goal or to detract from it.

The goal should be a generic system that powers not one machine learning model but dozens, allowing an organization to scale.

 

2024-09-25

Documentation for Data Systems

How do I, as a data professional, communicate with my colleagues, my peers, myself in six months, and the person we are going to hire next year?

How do I leave something behind for my colleagues to learn from, and to access the thoughts I had when I built something, investigated something, learned something, or understood something?

I write this from the perspective of a data professional, but I really think this is broadly applicable to most companies where people have to ask other people questions to get the information they need to do their job. So really, a lot of companies, probably 🤪

Basically, I think this boils down to four super-related things, in two pairs, that are needed to write and contribute documentation for your company and your colleagues.
  1. Accessible & Findable - Make sure it's open to the team that needs it, and that they can find it
  2. Incomplete & Living - Make sure you spend the appropriate amount of time and not a second more, but also that others can extend and pick up where you left off

Accessible & Findable

Make sure it's open to the team that needs it, and that they can find it

In order for documentation to be used, it has to be open and findable to the team needing it.

This may seem simple, but many a time I have tried to open a document and found the permissions locked, or looked for a document I knew existed but been unable to find it and had to ask around.

The companies and teams that move the fastest have open access to the data and information they need to accomplish their mission.

Incomplete & Living

  • Don't write everything
  • Don't try to capture everything
  • Don't try to solve every problem
  • Don't try to store every detail
  • Don't try to tell everybody everything
  • But also
    • Don't think that communicating status and state is not your problem

Do write what you know right now and/or what is at your fingertips, and then be done, leaving TODOs and open questions.

A company, a project is like a quantum state. You can't actually know what's happening at any given point in time even if you had all the information. And having all the information would alter the system to an unknowable state.

Just aim for 20 minutes, leave open questions, make sure the doc is accessible and open for updates, and come back! Next week! Next year! Tinker, add a line here and there, and see where it grows 🌱

Wrap it up

TODO: there is more to say on wrapping together Accessible, Findable, Incomplete, and Living.

Conclusion

Spending 20 minutes writing a piece of documentation is usually going to be a waste of time 😧

but

when you hit, you hit big. Write 10 docs that take you 3 hours, and one of those will save someone between a day and a week of time.

In conclusion ⤵️