About Me

My photo
An experienced Data Professional with a background in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2025-12-05

A Primer on Eventing and Domain Events

A Primer on Eventing and Domain Events

Generally speaking, an event is a record of something happening. Software records something to a log, and that's the event. Events must be:

  • Write-only (appended once, never updated or deleted)
  • Stamped with a timestamp
  • Tagged with an event type
  • Accompanied by metadata specific to that event type (see the sketch below)
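
As a concrete sketch - all table and column names here are purely illustrative, not from any particular system - an append-only events table might look like this:

CREATE TABLE events (
    event_id    BIGINT,        -- unique identifier for the event
    event_type  VARCHAR(100),  -- e.g. 'order_placed', 'order_cancelled'
    occurred_at TIMESTAMP,     -- when the thing happened
    metadata    VARCHAR(4000)  -- type-specific payload, often JSON
);

-- Rows are only ever inserted, never updated or deleted.
INSERT INTO events (event_id, event_type, occurred_at, metadata)
VALUES (1, 'order_placed', '2025-12-01 10:15:00',
        '{"order_id": 4444, "customer_id": 11}');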

Events can be used in two ways:

  1. Real-time systems
  2. Back-end analysis

Events present a particular difficulty in that events can interact with and override each other. For an ecommerce site, an order may come in as an event, but then an order-cancel event may also come in, and together there is no order.

But there was an order. For a period of time this order existed in the system, between the moment the order was made and the moment it was cancelled.

Systems that process events in real time need to be able to handle these kinds of cases, and back-end systems that run reporting should include the order during the time it existed.
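
As a rough sketch of the back-end side - assuming a hypothetical order_events table with order_id, event_type, and occurred_at columns - a report can count the orders that existed at a given point in time:

-- Orders that existed as of midnight on 2025-12-01:
-- placed before the cutoff and not yet cancelled by that time.
SELECT COUNT(*) AS open_orders
FROM order_events placed
WHERE placed.event_type = 'order_placed'
  AND placed.occurred_at <= TIMESTAMP '2025-12-01 00:00:00'
  AND NOT EXISTS (
      SELECT 1
      FROM order_events cancelled
      WHERE cancelled.event_type = 'order_cancelled'
        AND cancelled.order_id = placed.order_id
        AND cancelled.occurred_at <= TIMESTAMP '2025-12-01 00:00:00'
  );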

So why do I, as a software engineer or data professional, care about events, and how do I start to deal with this type of dataset?

What is an Event?

An event is simply a record of something happening. This may be something a user did, something the system did, or, more trivially, logging of database access, security events, or exceptions.

Typically they are stored and transmitted in a variety of ways, including

  • Ad hoc storage to Database Tables
  • Logging to flat files or Log aggregation systems
  • Sent to a queuing system such as Kafka

Examples

Some typical examples are

Order was Placed - What items were ordered, when and by which customer.

Database was queried - A log that may be captured and used by an engineer to debug a production system.

Product Recommendations successfully re-generated and available for real time use.

Domain Events?

Domain Events are more specific: they are geared towards the significant and important events within a system, the ones that can be used to understand the main goals and calculate key KPIs.

For example, a user placing an order would absolutely be a domain event, but querying a database would not be a domain event. These are very specific to the team responsible for implementing and tracking these events and although some things like orders are likely to be applicable from company to company, every company is going to set these up differently.

As a side note, domain thinking by itself can help a company break their system up into manageable pieces where each is responsible for a single domain.

Domain events come into play because a domain must send data out for other teams to use, and those teams don't care about "querying a database" but absolutely care about "a user placing an order". This approach to eventing lets other systems listen for exactly the events they need without having to sort through a bunch of unrelated data.
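
To make this concrete: a downstream consumer that only cares about orders reads just the events it needs. In a streaming setup this would be a topic subscription; against the illustrative events table sketched earlier it's simply a filter on event_type:

-- Pull only the order-related domain events and ignore everything else
-- (database query logs, security events, and so on).
SELECT occurred_at, event_type, metadata
FROM events
WHERE event_type IN ('order_placed', 'order_cancelled')
ORDER BY occurred_at;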

Web Engineering

One of the biggest impacts of using domain events widely is that web teams can implement their underlying databases however they want to.

A common pattern with reporting at web companies is that the web engineering team transmits the entire database downstream for use in analytics and reporting. I've seen this mechanism implemented in a few ways, but the problem is that if the web team wants to make a simple change - add a column, rename a table, or split a table - they have to negotiate with the downstream users of that table.

Events help get away from this anti-pattern by enabling downstream users to rely on contractually obligated events being emitted. These events are separate from the database, allowing an application team to optimize their queries and table structures without having to negotiate with other teams.

This is both an opportunity and a bit daunting to implement.

For the order example above, a web team may start to implement a domain event for when an order occurs and then find out that orders occur in multiple places across the system. On the one hand this makes the domain event hard to implement, but on the other hand, large complex systems can be cleaned up and unified in such a way that the entire "orders" domain is more easily understood and easier to maintain.

Data Engineering

Data engineering or analytics engineering teams often benefit greatly from eventing because it helps them to fulfill their obligations to both reporting and data science teams.

One of the core requirements on a data engineering team is to know the state of the system at any given point in time over the past few years. This becomes exceptionally complex when the source of your data is operational databases that have updates applied. What did the table look like yesterday? If the table receives updates, teams often need to store an entire copy of yesterday's table just to answer that.

This presents an enormous cost, either in development time to clean this up or in storage costs to keep multiple copies of databases.

Events are a solution to this. A streaming set of things happening in the system can simply be stored in its raw form. Since there are no updates, there's no worry about storing duplicate data, nor is there huge complexity in cleaning up the data set for use in reporting or machine learning.
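
For example, assuming a hypothetical order_events stream that carries an order_id, a status, and an occurred_at timestamp, "what did orders look like yesterday?" is just the latest event per order up to that cutoff - no copy of yesterday's table required:

-- State of every order as of the end of 2025-12-04,
-- reconstructed from the raw, append-only event stream.
SELECT order_id, status
FROM (
    SELECT order_id,
           status,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY occurred_at DESC
           ) AS rn
    FROM order_events
    WHERE occurred_at <= TIMESTAMP '2025-12-04 23:59:59'
) AS latest
WHERE rn = 1;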

Data Science

A core part of building an understandable ML model is knowing what the state of the system was at points in the past. Eventing is one methodology that allows data science teams both to have high quality data sets where the state of the system is known for every point in time, and to build ML models that operate on streaming data.

Does your LLM need to know the customer's most recent order? Just store it and use it when necessary. Does your model need to know how many times a user has refunded an order in the last month? That's just a trivial count.

This allows data science teams to build models that are low in leakage because they are able to understand exactly what data came in exactly when.
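
As a sketch of a leakage-free feature - assuming a hypothetical refund_events table with customer_id and occurred_at columns (interval syntax varies a bit by SQL dialect) - the "refunds in the last month" count is computed relative to the prediction timestamp rather than relative to now, so training and serving see exactly the same window:

-- Refunds by customer 22 in the 30 days before the prediction time.
SELECT COUNT(*) AS refunds_last_month
FROM refund_events
WHERE customer_id = 22
  AND occurred_at >  TIMESTAMP '2025-12-01 00:00:00' - INTERVAL '30' DAY
  AND occurred_at <= TIMESTAMP '2025-12-01 00:00:00';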

This does increase the complexity of building features for an ML model. Many times I've had conversations with people over what exactly the training set is - no, really, what exactly is the training set? Data from the last 6 months is not the same as data from the last 6 months for all users who have relevant data. Building models off of streaming data means teams have to think very carefully up front about features in order to ensure that they're fulfilling the requirements of the model and aligning with its future implementation.

These types of eventing systems give data science teams both an opportunity to build high quality models and a less straightforward method of building feature sets.

Data Analytics

A common problem on reporting teams is when data changes upstream. The web team renames a column and suddenly the entire reporting suite is broken. Analytics teams traditionally struggle with being held accountable for changes that are outside of their purview.

If you are an analyst, how many times in the last month has one of your reports broken due to upstream changes?

Events help with this because the very nature of eventing forces teams to think deeply about what data is being produced and what reporting needs from it. Analytics gets a seat at the table to define events; instead of just being a downstream user of the data, they are a stakeholder.

Conclusion

I barely mentioned Kafka. Kafka, eventing, and domain events are often conflated, but they're really different things, and Kafka is just one system that could be used for moving events. Eventing is a way of thinking about data flows and a methodology for scaling an engineering organization into many independent, encapsulated teams.

At your company, ask yourself, 

  • Have we built systems that are flexible and enable teams to work independently and at scale?
  • Can we build systems to allow advanced machine learning use cases to run robustly and at scale?
  • If we think of all our dependencies, how can we enable those teams and projects to work independently and at scale?

There is a funny bit here: these types of systems are old, and companies still struggle to even understand whether or not they want to adopt them. I've tried to illustrate the justification for adopting them, as opposed to the mechanics that a lot of books and blog posts focus on.

2025-06-26

Types Of Logs

Types Of Logs

Start with "what's a log"? Let's take the definition from here

  an append-only, totally-ordered sequence of records ordered by time.

I would extend this definition to:

an append-only, totally-ordered sequence of records ordered by time generated by engineering systems

Engineering systems - the systems supporting websites, games, jets - produce data that is required for many use cases across a company, including

  • Machine Learning
  • Reporting
  • Debugging 

The above definition differs from a lot of other data sets used in websites in that it's "append only". The words "log" and "logging" get many definitions, and conversations can easily get muddled. My attempt here is to tease out the differences between the different types of logs so communication is clearer. I hope by the end to have nothing called simply a "log", as that term would apply to all of the categories below.

I propose that there are 3 categories of logs

| Log Category | Typical Stakeholder | Fields                           | Example              |
|--------------|---------------------|----------------------------------|----------------------|
| Metrics      | Engineering         | Name, Timestamp, Value           | Endpoint latency     |
| Errors       | Engineering         | Timestamp, Text                  | Exception in code    |
| Events       | Data folks          | Name, Timestamp, <custom values> | New account creation |

Metrics

Metrics are a series of records that follow a typical pattern:

  • Name
  • Timestamp
  • Value

Standard examples are 

  • API latency
  • Count of failures

Development teams often monitor the behavior of a system for uptime and to decide where changes need to be made.

Is the system working well? Look at the plot of errors over time.

Which endpoint is slow? Look at the latency of all endpoints and see which one has the highest value.
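
As an illustration, assuming a metrics table with name, recorded_at, and value columns mirroring the fields above (the names are just placeholders), "which endpoint is slow?" becomes a simple aggregation:

-- Average latency per endpoint over the last hour, slowest first.
SELECT name, AVG(value) AS avg_latency_ms
FROM metrics
WHERE name LIKE 'endpoint_latency.%'
  AND recorded_at >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY name
ORDER BY avg_latency_ms DESC;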

Metric logging is an essential part of understanding the real time behavior of a system. 

Errors

If a piece of software hits an exception and throws an error, the error message and stack trace are needed to fix it.

Oftentimes software systems have a variety of logging levels such as DEBUG, INFO and ERROR. In practice, production systems generally keep none of these standard logs except for ERROR, which typically means an exception has occurred and should be corrected as soon as possible.

Sometimes metrics are calculated on top of errors, but metrics should be standalone and independent, which improves the speed and efficiency and reduces the cost of the metrics being stored. Why calculate the metric from thousands of strings when you could just have the value "12789" sitting there?
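
To illustrate the cost difference, here is a minimal sketch assuming a hypothetical error_logs table of raw log lines alongside the metrics table from the previous section (DATE_TRUNC is Postgres-style; the syntax varies by dialect). The same number can be derived by scanning strings or read directly as a stored metric.

-- Expensive: derive the error count by scanning raw log text.
SELECT DATE_TRUNC('minute', logged_at) AS minute, COUNT(*) AS errors
FROM error_logs
WHERE level = 'ERROR'
GROUP BY DATE_TRUNC('minute', logged_at);

-- Cheap: the same number already stored as a standalone metric.
SELECT recorded_at, value
FROM metrics
WHERE name = 'error_count'
ORDER BY recorded_at;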

Error logging is an essential part of correcting fatal errors in a system. 

Events

Events are a record of what happened, when it happened, and any associated data required to understand the full extent of the event.

Such as

  • New user created - userid, timestamp, user type
  • Recommendations are available for loading - in systems where product recommendations are produced in batch mode, this can be an efficient method of informing a production system to load new data in.

Event-driven architectures are commonplace and typically powered by a queuing system such as Kafka, which allows many listeners to pick and choose which events are required to power their specific service.

These systems are very powerful and uniquely suited to power many data architectures, such as

  • Real time data science model execution
  • Reporting decoupled from production application databases 

These events are almost always needed in perpetuity, both for data science teams to build new models and for reporting to understand systems and metrics.

In many new AI/LLM-powered architectures, event logging can also be used to track what actually happened: what input caused what output, using which LLM. This can greatly aid in understanding AI systems that make many thousands of decisions.

Conclusion

Append-only data sets are commonly used across websites and are often called "logs" or "logging", when in reality development teams are generally talking about Metrics, Errors, or Events.

Each type of log has its own unique use cases and stakeholders and should be considered an essential building block of production systems.

Being able to break down the type of record being asked for into the type of log that is required makes understanding and implementation much more straightforward, and the approach generalizes well.

2025-03-25

Filter on the Last Step

Filter on the Last Step

So you've written a bunch of code to build a data set for a model or report. Your code has joins and filters, and it's complex. Maybe hundreds of lines, or maybe just lots of intricate joins and CTEs.
 
That's fabulous. Stakeholders are happy. Your code is done. You're happy. Everybody's happy.
 
And then you get a question "How many users in Georgia are filtered out?" or "Why is this particular customer filtered out?"
 
These become hard questions to answer because there's a series of steps and each step makes a decision. It's hard to reason about how all the final filters interact without having them all together in a final dataset.

Prefer adding columns instead of filtering on the fly. Do the filtering at the end together in a way that allows for investigations of the interactions between filters.
| customer_id | col1 | col2 | filter_col1 | filter_col2 |
|-------------|------|------|-------------|-------------|
| 11          | ...  | ...  | 10          | California  |
| 22          | ...  | ...  | 11          | Georgia     |
| 33          | ...  | ...  | 113         | Washington  |
| 44          | ...  | ...  | 134         | California  |
| 55          | ...  | ...  | 12          | California  |
Below I'll walk through a couple of different examples and reason through this in a bit more detail.

How does this work?

  1. Build your data set as normal - joins, CTEs, etc.
  2. Instead of adding filters, add a column
    1. Instead of WHERE col1 > 10
    2. Add a column "col1". You can then have all the filter logic together at the end.
  3. Instead of an inner join, do an outer join
  4. Be sure to keep track of names, as the additional columns can cause confusion
  5. At the end you've got your full data set
    1. You can play with it, look at it, do analytics on it, store it, etc.
  6. Then to pull the final data set
    1. Filter as appropriate
    2. Add a DISTINCT
    3. and you're done ✅

Filter As you Go

Suppose you have 2 datasets and you want to build a new dataset that removes customers who

  • Are not in California or Washington
  • Have not made an order in the last month

Customer Information "customers"

| customer_id | state      |
|-------------|------------|
| 11          | Washington |
| 22          | California |
| 33          | New York   |
| 44          | California |
| 55          | New York   |

Order Information "orders"

| customer_id | order_id | days_since_order |
|-------------|----------|------------------|
| 11          | 4444     | 10               |
| 11          | 4445     | 200              |
| 22          | 4449     | 11               |
| 33          | 4441     | 10               |
| 44          | 4440     | 200              |
| 44          | 4439     | 10               |

A Single Query would work fine

SELECT DISTINCT customer_id
FROM customers
JOIN orders USING (customer_id)
WHERE state IN ('California', 'Washington')
AND days_since_order <= 30

So what happens now?

You've built your data set. You're done, right? Well, until the questions start coming in:
  • Why did customer 44 get filtered?
  • How many orders were filtered?

These become difficult to answer because you've just run the query. It's easy enough with this toy example to run a couple of different queries, but as SQL statements become longer and more complex, it becomes harder to determine the answers to these questions.

Filter Last

Instead build a full dataset like this

| customer_id | order_id | days_since_order | state      |
|-------------|----------|------------------|------------|
| 11          | 4444     | 10               | Washington |
| 11          | 4445     | 200              | Washington |
| 22          | 4449     | 11               | California |
| 33          | 4441     | 10               | New York   |
| 44          | 4440     | 200              | California |
| 44          | 4439     | 10               | California |
| 55          | 4433     | 250              | New York   |
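
A sketch of how that full dataset might be built from the customers and orders tables above - an outer join instead of an inner join, with the would-be filter columns kept instead of applied:

CREATE TABLE full_dataset AS
SELECT
    c.customer_id,
    o.order_id,
    o.days_since_order,  -- kept as a column rather than applied as a WHERE clause
    c.state              -- likewise kept, for filtering at the very end
FROM customers c
LEFT JOIN orders o
    ON o.customer_id = c.customer_id;
-- The outer join keeps rows that an inner join would silently drop.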

Then filter like this,

SELECT DISTINCT customer_id
FROM full_dataset
WHERE state IN ('California', 'Washington')
AND days_since_order <= 30

Now the questions are easy to answer

  • Why did customer 44 get filtered?
    • SELECT * FROM full_dataset WHERE customer_id = 44
  • How many orders were filtered?
    • SELECT COUNT(DISTINCT order_id)
      FROM full_dataset
      WHERE NOT (state IN ('California', 'Washington')
                 AND days_since_order <= 30)

Performance Considerations

If I'm building a large data set, is that going to suck up resources on the database? Is it going to make the queries slower? Is it going to make a production run of this data set slower if I run this often?

This comes back to the core item - do you need to answer questions and investigate the dataset you have built? This is almost always yes, so then the question is whether you can optimize the compute required to do these investigations.
 
Options are generally:

  • Make a view with the unfiltered dataset that can be used for investigations
  • Cache the unfiltered data and run investigations on that dataset

It's a trade-off between compute and storage; both options are sketched below.
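
A minimal sketch of both options, reusing the illustrative full-dataset query from above; which one makes sense depends on how often investigations run versus what storage costs:

-- Option 1: a view - no extra storage, recomputed each time it's queried.
CREATE VIEW full_dataset_v AS
SELECT c.customer_id, o.order_id, o.days_since_order, c.state
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;

-- Option 2: a cached table - extra storage, but investigations are cheap to rerun.
CREATE TABLE full_dataset_cached AS
SELECT * FROM full_dataset_v;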

Conclusion

Filtering as you go has the benefit of thinking through the process as you're going through it. It feels quick. It feels easy. It feels like the way we were taught to do it in school and in our early jobs.
 
Filtering later has the benefit of being able to reason about the data set and ask questions of the full, unfiltered data without having to write a whole bunch of custom code. You have the final, unfiltered data set. You can ask questions of it and build plots, charts, reports - anything needed to support the final project.