About Me

An experienced Data Professional with a background in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2025-06-26

Types Of Logs


Start with "what's a log?" Let's take the definition from here:

  an append-only, totally-ordered sequence of records ordered by time.

I would extend this definition to:

an append-only, totally-ordered sequence of records ordered by time generated by engineering systems

Engineering systems, the systems supporting websites, games, and jets, produce data that is required for many use cases across a company, including

  • Machine Learning
  • Reporting
  • Debugging 

The above definition differs from a lot of other data sets used in websites in that it's "append only". The words "log" and "logging" get many definitions, and conversations can easily get muddled. My attempt here is to tease out the differences between the different types of logs so that communication is clearer. By the end, I hope to avoid calling anything simply a "log", since that term applies to all of the categories below.

I propose that there are 3 categories of logs

| Log Category | Typical Stakeholder | Fields                           | Example              |
|--------------|---------------------|----------------------------------|----------------------|
| Metrics      | Engineering         | Name, Timestamp, Value           | Endpoint latency     |
| Errors       | Engineering         | Timestamp, Text                  | Exception in code    |
| Events       | Data folks          | Name, Timestamp, <custom values> | New account creation |

Metrics

Metrics are a series of records that follow a typical pattern:

  • Name
  • Timestamp
  • Value

Standard examples are 

  • API latency
  • Count of failures
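
As a rough sketch, a metrics store can be as simple as a three-column table (the table and column names here are just illustrative):

CREATE TABLE metrics (
    metric_name  VARCHAR(255)     NOT NULL,  -- Name, e.g. 'checkout_latency_ms'
    recorded_at  TIMESTAMP        NOT NULL,  -- Timestamp
    metric_value DOUBLE PRECISION NOT NULL   -- Value
);

-- "Which endpoint is slow?" then becomes a simple aggregation.
SELECT metric_name, AVG(metric_value) AS avg_latency_ms
FROM metrics
WHERE metric_name LIKE '%_latency_ms'
GROUP BY metric_name
ORDER BY avg_latency_ms DESC;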

Development teams often monitor the behavior of a system for uptime and to decide where changes need to be made.

Is the system working well? Look at the plot of errors over time.

Which endpoint is slow? Look at the latency of all endpoints and see which one has the highest value.

Metric logging is an essential part of understanding the real time behavior of a system. 

Errors

If a piece of software hits an exception and throws an error, the error message and stack trace are needed to fix the problem.

Software systems often have a variety of logging levels such as DEBUG, INFO, and ERROR. In production, there are generally none of these standard logs except for ERROR, which typically means an exception has occurred and should be corrected as soon as possible.

Sometimes metrics are calculated on top of error logs, but metrics should be standalone and independent, which improves the speed and efficiency of the metrics and reduces the cost of storing them. Why derive a metric from thousands of strings when you could just have the number "12789" sitting there?
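
As a rough illustration of that point (the error_logs table here is hypothetical; metrics follows the sketch above), compare deriving the count from raw error text with reading a metric that was recorded directly:

-- Deriving the count from raw error text: scans and parses many rows of strings.
SELECT COUNT(*) AS timeout_errors
FROM error_logs
WHERE message LIKE '%TimeoutError%'
  AND logged_at >= CURRENT_DATE;

-- Reading a metric that was recorded directly: a handful of small numeric rows.
SELECT SUM(metric_value) AS timeout_errors
FROM metrics
WHERE metric_name = 'timeout_error_count'
  AND recorded_at >= CURRENT_DATE;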

Error logging is an essential part of correcting fatal errors in a system. 

Events

Events are a record of what happened, when it happened, and any associated data required to understand the full extent of the event.

Such as

  • New user created - userid, timestamp, user type
  • Recommendations are available for loading - in systems where product recommendations are produced in batch mode, this can be an efficient method of informing a production system to load new data in.
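
As a sketch, an events table can look like the following (the names are illustrative, and the JSON payload is just one way to hold the custom values; in practice the events often travel over a queue rather than living only in a table):

CREATE TABLE events (
    event_name VARCHAR(255) NOT NULL,  -- e.g. 'user_created'
    emitted_at TIMESTAMP    NOT NULL,
    payload    JSON                    -- custom values, e.g. {"user_id": 11, "user_type": "trial"}
);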

Event driven architectures are commonplace and typically powered by a queuing system such as Kafka which allows many listeners to pick and choose which events are required to power their specific service.

These systems are very powerful and uniquely suited to fulfill many data architectures such as 

  • Real time data science model execution
  • Reporting decoupled from production application databases 

These events are almost always needed in perpetuity, both for data science teams to build new models and for reporting to understand systems and metrics.

In many new AI/LLM-powered architectures, event logging can also be used to track what actually happened: which input produced which output using which LLM. This can greatly aid in understanding AI systems that make many thousands of decisions.
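
Continuing the sketch above (the field names are purely illustrative), such an LLM decision could be captured as just another event:

INSERT INTO events (event_name, emitted_at, payload)
VALUES (
    'llm_decision',
    CURRENT_TIMESTAMP,
    '{"model": "example-llm-v1", "prompt_id": 123, "response_id": 456}'
);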

Conclusion

Append-only data sets are commonly used across websites and are often called "logs" or "logging", when in reality development teams are generally talking about Metrics, Errors, or Events.

Each type of log has its own unique use cases and stakeholders and should be considered an essential building block of production systems.

Being able to break down the type of record being asked for into the type of log that is required makes understanding and implementation much more straightforward, and the approach generalizes well.

2025-03-25

Filter on the Last Step


So you've written a bunch of code to build a data set for a model or report. Your code has joins and filters, and it's complex. Maybe hundreds of lines, or maybe just lots of intricate joins and CTEs.
 
That's fabulous. Stakeholders are happy. Your code is done. You're happy. Everybody's happy.
 
And then you get a question "How many users in Georgia are filtered out?" or "Why is this particular customer filtered out?"
 
These become hard questions to answer because there's a series of steps and each step makes a decision. It's hard to reason about how all the filters interact without having them together in a final dataset.

Prefer adding columns instead of filtering on the fly. Do the filtering at the end, together, in a way that allows you to investigate how the filters interact. The unfiltered dataset then looks something like this:
| customer_id | col1 | col2 | filter_col1 | filter_col2 |
|-------------|------|------|-------------|-------------|
| 11          | ...  | ...  | 10          | California  |
| 22          | ...  | ...  | 11          | Georgia     |
| 33          | ...  | ...  | 113         | Washington  |
| 44          | ...  | ...  | 134         | California  |
| 55          | ...  | ...  | 12          | California  |
Below I'll walk through a couple of examples and reason through this in a bit more detail.

How does this work?

  1. Build your data set as normal - joins, CTEs, etc.
  2. Instead of adding filters, add a column (see the sketch after this list)
    1. Instead of WHERE col1 > 10
    2. Add a column "col1". You can then have all the filter logic together at the end.
  3. Instead of an inner join, do an outer join
  4. Be sure to keep track of names, as the additional columns can cause confusion
  5. Then at the end you've got your full data set.
    1. You can play with it, look at it, do analytics on it, store it, etc.
  6. Then to pull the final data set
    1. Filter as appropriate
    2. Add a distinct
    3. and you're done ✅
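
As a minimal sketch of step 2 (some_table and col1 are just placeholders), the shift looks like this:

-- Filter as you go: the decision is buried in the WHERE clause and is gone afterwards.
SELECT customer_id
FROM some_table
WHERE col1 > 10;

-- Filter last: carry col1 through so the filter can be applied (and inspected) at the end.
SELECT customer_id,
       col1 AS filter_col1
FROM some_table;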

Filter As You Go

Suppose you have 2 datasets and you want to build a new dataset that removes customers who

  • Are not in California or Washington
  • Have not made an order in the last month

Customer Information "customers"

| customer_id | state      |
|-------------|------------|
| 11          | Washington |
| 22          | California |
| 33          | New York   |
| 44          | California |
| 55          | New York   |

Order Information "orders"

| customer_id | order_id | days_since_order |
|-------------|----------|------------------|
| 11          | 4444     | 10               |
| 11          | 4445     | 200              |
| 22          | 4449     | 11               |
| 33          | 4441     | 10               |
| 44          | 4440     | 200              |
| 44          | 4439     | 10               |
| 55          | 4433     | 250              |

A single query would work fine:

SELECT DISTINCT customer_id
FROM customers
JOIN orders USING (customer_id)
WHERE state IN ('California', 'Washington')
AND days_since_order <= 30

So what happens now?

You've built your data set. You're done, right? Well, until the questions start coming in:
  • Why did customer 33 get filtered?
  • How many orders were filtered?

These become difficult to answer because all you have is the output of the final query. It's easy enough with this toy example to just run a couple of different queries, but as SQL statements become longer and more complex, it becomes harder to determine the answers to these questions.

Filter Last

Instead, build a full dataset like this:

| customer_id | order_id | days_since_order | state      |
|-------------|----------|------------------|------------|
| 11          | 4444     | 10               | Washington |
| 11          | 4445     | 200              | Washington |
| 22          | 4449     | 11               | California |
| 33          | 4441     | 10               | New York   |
| 44          | 4440     | 200              | California |
| 44          | 4439     | 10               | California |
| 55          | 4433     | 250              | New York   |
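
A sketch of how full_dataset could be built: keep the would-be filter columns (state, days_since_order) and use a left join so customers without orders aren't silently dropped.

SELECT c.customer_id,
       o.order_id,
       o.days_since_order,
       c.state
FROM customers AS c
LEFT JOIN orders AS o
    ON o.customer_id = c.customer_id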

Then filter like this,

SELECT DISTINCT customer_id
FROM full_dataset
WHERE state IN ('California', 'Washington')
AND days_since_order <= 30

Now the questions are easy to answer

  • Why did customer 33 get filtered?
    • SELECT * FROM full_dataset WHERE customer_id = 33
  • How many orders were filtered?
    • SELECT COUNT(DISTINCT order_id)
      FROM full_dataset
      WHERE state NOT IN ('California', 'Washington')
      OR days_since_order > 30

Performance Considerations

If I'm building a large data set, is that going to suck up resources on the database? Is it going to make the queries slower? Is it going to make a production run of this data set slower if I run this often?

This comes back to the core item - do you need to answer questions and investigate the dataset you have built? This is almost always yes, so then the question is whether you can optimize the compute required to do these investigations.
 
Options are generally
  • Make a view with the unfiltered dataset that can be used for investigations
  • Cache the unfiltered data and run investigations on this dataset
It's a trade-off between compute and storage.
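
As a rough sketch of those two options (exact syntax varies by database):

-- Option 1: a view - no extra storage, recomputed each time it's queried.
CREATE VIEW full_dataset AS
SELECT c.customer_id, o.order_id, o.days_since_order, c.state
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.customer_id;

-- Option 2: a cached table - uses storage, but repeated investigations are cheap.
CREATE TABLE full_dataset_cached AS
SELECT * FROM full_dataset;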

Conclusion

Filtering as you go has the benefit of thinking through the process as you're going through it. It feels quick. It feels easy. It feels like the way we were taught to do it in school and in our early jobs.
 
Filtering later has the benefit of being able to reason about the data set and ask questions of the full, unfiltered data without having to write a whole bunch of custom code. You have the final, unfiltered data set. You can ask questions of it and build plots, charts, reports, and anything else needed to support the final project.