About Me

A Data Professional experienced in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2021-08-23

How to decrease the time it takes to do an analysis

Rapid Analysis Development

How we do analysis can determine how productive we are. I've spent a large amount of time writing data transformations against real data, and often most of that time is spent waiting for code to run. There are methods for speeding this up.

Building an analysis is not typically thought of as code development, but analysts spend a lot of time writing code. Whether it's SQL, Python, R, Excel, or building dashboards in some Business Intelligence vendor's proprietary software, the analyst writes code to load, transform, aggregate and display data.

Often this data is large, so the individual queries can take 30 seconds, a minute or many minutes to execute.

How can we speed this up? How can we make the queries run faster while exploring the data and developing the analysis?

Typical Analytics Process

A typical analyst will follow a process like this to explore data while building plots and charts or calculating metrics.

  1. Load some data
  2. Transform and Join
  3. Calculate Aggregations
  4. Build Visualizations

The analyst will often jump around these steps to fiddle with inputs, try multiple visualizations and generally iterate on the code until results are clear and accurate.

The analyst wants real results, so by default they work with the full dataset.
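
As a rough sketch of what that loop looks like in code (a hypothetical pandas example, with made-up file and column names), the full-dataset version might be:

    import pandas as pd
    import matplotlib.pyplot as plt

    # 1. Load some data (the full weblog extract; file names are hypothetical)
    events = pd.read_csv("weblog_events.csv", parse_dates=["event_time"])
    users = pd.read_csv("users.csv")

    # 2. Transform and join
    events["event_date"] = events["event_time"].dt.date
    joined = events.merge(users, on="user_id", how="left")

    # 3. Calculate aggregations
    daily_visits = joined.groupby("event_date")["session_id"].nunique()

    # 4. Build visualizations
    daily_visits.plot(kind="line", title="Daily unique sessions")
    plt.show()

Every trip around this loop re-reads and re-processes the full dataset, which is where the waiting comes from.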

Pros

Working with the full dataset allows corner cases to be handled as they arise and helps fine-tune the logic of the transformations.

Preliminary counts and sums can be checked against the analyst's prior knowledge of the dataset, which increases confidence in the final product and helps find bugs in the code.

Cons

Each code iteration can take a while to run. Changing the color of a plot in Python requires rebuilding the plot and can take seconds or minutes to run. This may not seem like a lot but can add up quickly.

Executing a piece of code 10 times can sometimes take 5 minutes. If the analyst gets distracted by questions or by reading the news, this time can balloon rapidly.

Rapid Analysis Development

Here is a more rapid code development process for working with large datasets.

  1. Jump around these steps during development:
    1. Load a sample of the data
    2. Transform and Join
    3. Calculate Aggregations
    4. Build Visualizations
  2. When everything looks good, move on
  3. Remove the sample from step 1
  4. Re-run all the steps
  5. Update code to handle corner cases and fix errors
  6. Re-run again to get the final results

In this case, iterating within step 1 is faster because the dataset is smaller. There will be weird bugs here: joins won't work as expected, and groupby aggregations may return invalid results. This can often be alleviated by taking a sample of only the largest dataset (usually something visitor- or weblog-based) while leaving the fact tables it is joined into unsampled.
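
A minimal sketch of that idea, reusing the hypothetical tables from the example above: sample only the large weblog table and keep the smaller tables whole, so joins still find their keys.

    import pandas as pd

    SAMPLE_FRACTION = 0.01  # small enough that each iteration runs in seconds

    # Sample only the largest dataset (the visitor/weblog table)...
    events = pd.read_csv("weblog_events.csv", parse_dates=["event_time"])
    events_sample = events.sample(frac=SAMPLE_FRACTION, random_state=42)

    # ...but keep the smaller tables whole so the joins behave sensibly
    users = pd.read_csv("users.csv")

    joined = events_sample.merge(users, on="user_id", how="left")
    daily_visits = joined.groupby(joined["event_time"].dt.date)["session_id"].nunique()

Sampling by user or visitor id (for example with a modulo on the id) instead of by row keeps each visitor's full history in the sample, which matters if the analysis relies on session- or user-level logic.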

The results won't be accurate until the full dataset is run, so aggregations may be obviously incorrect.

Pros

Each code iteration can be really fast, maybe 5 seconds or less, so the analyst won't get distracted and can focus more on the results and less on watching queries run.

Cons

The full analysis will still need to be run with the full input datasets, and this will still take time: even when the code looks ready, it won't quite be. Additional work will be required to double-check that joins are working properly with the full input dataset and to fix any issues that arise.

Any visualizations produced will need changes when run with the full input dataset. Max and min ranges may not align, or histograms may look totally corrupt due to outliers that weren't in the original data sample.

Tips & Tricks

For analyses run in notebooks, a parameter can be placed at the top of the notebook indicating whether a sample should be taken, so re-running with the full dataset is as easy as updating a single variable and clicking "run all".
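
For example (variable names are just illustrative), a cell at the top of the notebook might look like:

    import pandas as pd

    # Top-of-notebook switch: set to False and "run all" for the final pass
    USE_SAMPLE = True
    SAMPLE_FRACTION = 0.01

    events = pd.read_csv("weblog_events.csv", parse_dates=["event_time"])
    if USE_SAMPLE:
        events = events.sample(frac=SAMPLE_FRACTION, random_state=42)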

In Python and R it's easy to add assert statements to verify individual steps and joins while initial development is happening on the sample datasets.
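
Continuing the hypothetical pandas example from above, a couple of such checks might look like:

    # A left join on a unique key shouldn't change the row count
    assert len(joined) == len(events_sample), "join changed the row count"

    # Most events should find a matching user; "country" is a made-up column
    # that would come from the users table
    assert joined["country"].notna().mean() > 0.95, "too many unmatched users"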

SQL tables can be constructed from samples, which can make complex joins faster to develop.
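
As a rough sketch (table names are made up, and the best sampling syntax varies by database), a sample table can be materialized once and then reused across queries; here via Python's built-in sqlite3 module:

    import sqlite3

    conn = sqlite3.connect("analytics.db")  # hypothetical database file

    # Materialize a ~1% sample of the big table once, then develop the complex
    # joins against the much smaller sample table
    conn.executescript("""
        DROP TABLE IF EXISTS weblog_events_sample;
        CREATE TABLE weblog_events_sample AS
        SELECT *
        FROM weblog_events
        WHERE abs(random()) % 100 = 0;  -- keeps roughly 1% of rows
    """)
    conn.commit()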

Conclusion

Working with large datasets can be very time consuming. It's fastest to work with small datasets when developing new code. This is a fairly standard software engineering technique and can be applied to data analysis as well. Samples can drastically speed up an analysis.

Next time you build a project, really think about how much time you spend watching the code run. Add it up and consider whether a sample could help you save time.

Your time is valuable; don't waste it watching queries run if the results will just be discarded.