Filter on the Last Step
So you've written a bunch of code to build a data set for a model or report. Your code has joins and filters, and it's complex. Maybe hundreds of lines, or maybe just lots of intricate joins and CTEs.
That's fabulous. Stakeholders are happy. Your code is done. You're happy. Everybody's happy.
And then you get a question: "How many users in Georgia are filtered out?" or "Why is this particular customer filtered out?"
These are hard questions to answer because there's a series of steps, and each step makes a decision. It's hard to reason about how all of the final filters interact without having them together in a final dataset.
Prefer adding columns instead of filtering on the fly. Do the filtering at the end, all together, in a way that lets you investigate the interactions between filters. The result is a dataset like this:
| customer_id | col1 | col2 | filter_col1 | filter_col2 |
|-------------|------|------|-------------|-------------|
| 11 | ... | ... | 10 | California |
| 22 | ... | ... | 11 | Georgia |
| 33 | ... | ... | 113 | Washington |
| 44 | ... | ... | 134 | California |
| 55 | ... | ... | 12 | California |
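With a dataset shaped like this, every filter lives in one place at the very end. A minimal sketch of that final step, assuming the unfiltered table above is saved as full_dataset and using hypothetical thresholds:

```sql
-- All of the filter logic together in one final step
SELECT DISTINCT customer_id
FROM full_dataset
WHERE filter_col1 > 10
  AND filter_col2 IN ('California', 'Washington')
```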
Below I'll walk through a couple of different examples and reason through this in a bit more detail.
How does this work?
- Build your data set as normal: joins, CTEs, etc.
- Instead of adding filters, add a column
- Instead of WHERE col1 > 10, add a column "col1". You can then have all the filter logic together at the end (see the sketch after this list).
- Instead of an inner join, do an outer join
- Be sure to keep track of names, as the additional columns can cause confusion
- Then at the end you've got your full data set.
- You can play with it, look at it, do analytics on it, store it, etc.
- Then to pull the final data set
- Filter as appropriate
- Add a distinct
- and you're done ✅
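Putting those steps together, the overall shape looks something like the sketch below. The table and column names here are placeholders, not a real schema:

```sql
-- Build the full, unfiltered dataset: outer joins, columns instead of filters
WITH full_dataset AS (
    SELECT
        a.id,
        a.col1,              -- kept as a column instead of WHERE col1 > 10
        b.col2               -- kept as a column instead of a join-time filter
    FROM table_a a
    LEFT JOIN table_b b      -- outer join instead of an inner join
        ON b.id = a.id
)
-- Investigate full_dataset as needed, then pull the final data set
SELECT DISTINCT id
FROM full_dataset
WHERE col1 > 10
```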
Filter As You Go
Suppose you have 2 datasets and you want to build a new dataset that removes customers who:
- Are not in California or Washington
- Have not made an order in the last month
Customer Information "customers"
| customer_id | state |
|-------------|------------|
| 11 | Washington |
| 22 | California |
| 33 | New York |
| 44 | California |
| 55 | New York |
Order Information "orders"
| customer_id | order_id | days_since_order |
|-------------|----------|------------------|
| 11 | 4444 | 10 |
| 11 | 4445 | 200 |
| 22 | 4449 | 11 |
| 33 | 4441 | 10 |
| 44 | 4440 | 200 |
| 44 | 4439 | 10 |
| 55 | 4433 | 250 |
A single query would work fine:
```sql
SELECT DISTINCT customer_id
FROM customers
JOIN orders ON orders.customer_id = customers.customer_id
WHERE state IN ('California', 'Washington')
  AND days_since_order <= 30
```
So what happens now?
You've built your data set. You're done, right? Well, until the questions start coming in:
- Why did customer 44 get filtered?
- How many orders were filtered?
These are difficult to answer because the filtering already happened inside the query. With this toy example it's easy enough to run a couple of extra queries, but as SQL statements get longer and more complex, it gets harder to work out the answers to these questions.
Filter Last
Instead, build a full dataset like this:
| customer_id | order_id | days_since_order | state |
|-------------|----------|------------------|------------|
| 11 | 4444 | 10 | Washington |
| 11 | 4445 | 200 | Washington |
| 22 | 4449 | 11 | California |
| 33 | 4441 | 10 | New York |
| 44 | 4440 | 200 | California |
| 44 | 4439 | 10 | California |
| 55 | 4433 | 250 | New York |
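One way to build that full dataset is a single outer join that keeps the would-be filter columns; a minimal sketch using the customers and orders tables above (store the result however suits your setup):

```sql
-- Outer join so no customers are dropped; state and days_since_order
-- stay as columns instead of being filtered on
SELECT
    c.customer_id,
    o.order_id,
    o.days_since_order,
    c.state
FROM customers c
LEFT JOIN orders o
    ON o.customer_id = c.customer_id
```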
Then filter like this:
```sql
SELECT DISTINCT customer_id
FROM full_dataset
WHERE state IN ('California', 'Washington')
  AND days_since_order <= 30
```
Now the questions are easy to answer
- Why did customer 44 get filtered?
```sql
SELECT * FROM full_dataset WHERE customer_id = 44
```
- How many orders were filtered?
```sql
-- An order is filtered out if it fails either condition
SELECT COUNT(DISTINCT order_id)
FROM full_dataset
WHERE state NOT IN ('California', 'Washington')
   OR days_since_order > 30
```
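And if you want to see how the two filters interact, the same dataset supports a breakdown by reason; a rough sketch (ordinal GROUP BY works in most warehouses, adjust for your dialect):

```sql
-- Count orders by which filter(s) they fail
SELECT
    CASE WHEN state IN ('California', 'Washington') THEN 0 ELSE 1 END AS fails_state_filter,
    CASE WHEN days_since_order <= 30 THEN 0 ELSE 1 END AS fails_recency_filter,
    COUNT(DISTINCT order_id) AS orders
FROM full_dataset
GROUP BY 1, 2
```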
Performance Considerations
If I'm building a large data set, is that going to suck up resources on the database? Is it going to make the queries slower? Is it going to make a production run of this data set slower if I run this often?
This comes back to the core question: do you need to answer questions about and investigate the dataset you have built? The answer is almost always yes, so the real question is whether you can optimize the compute required for these investigations.
Options are generally
- Make a view with the unfiltered dataset that can be used for investigations
- Cache the unfiltered data and run investigations on this dataset
It's a trade-off between compute and storage.
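Both options can sit on top of the same unfiltered query; a sketch, with syntax that may need tweaking for your warehouse:

```sql
-- Option 1: a view. No extra storage, recomputed whenever it's queried.
CREATE VIEW full_dataset AS
SELECT c.customer_id, o.order_id, o.days_since_order, c.state
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;

-- Option 2: a cached table. Pays for storage, but investigations are cheap.
CREATE TABLE full_dataset_cached AS
SELECT * FROM full_dataset;
```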
Conclusion
Filtering as you go has the benefit of letting you think through the process as you work through it. It feels quick. It feels easy. It feels like the way we were taught to do it in school and in our early jobs.
Filtering later has the benefit of letting you reason about the data set and ask questions of the full, unfiltered data without having to write a whole bunch of custom code. You have the final, unfiltered data set. You can ask questions of it and build plots, charts, reports, or anything else needed to support the final project.