About Me

My photo
An experienced Data Professional with experience in Data Science and Data Engineering interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2023-08-07

Data Stored an In Code Calculation

Data Stored as In Code Calculation

Websites store a lot of information within code calculations. At the simplest this might be a function like

def year_of(a_date):
    return a_date.split('-')[0]

Instead of storing the year separately from the date, we have to piece different data together to determine the year. In this case it's simple, but you could imagine a function that takes multiple pieces of data and makes some determination to return a single data point.

Software does this all the time.

  • A user logs into the system, is there account "in good standing?" is calculated on the fly from a variety of database inputs as some custom calculation.
  • "Do they need a password reset?" is calculated on the fly.
  • "What's the most recent order?" is calculated on the fly from an orders database.

This is great, it keeps the software running, and provides a high quality experience for the user. However, there are a few drawbacks that can be encountered.

  1. Reporting: When a business person needs to know "how many users need a password reset?"
  2. Efficiency: When calculating the value takes a lot of resources
  3. Accuracy: When multiple places in the code calculate the value differently
  4. Functionality: When the business wants to force a decision for some subset of users

The easiest way to fix all these is by storing these values in the code and keeping them up to date through some in-code or batch process.

Why would a data point need to move from an in code calculation into a database lookup?

Sorting data in code is often a default. From just a few simple values, a tremendous amount of data can be derived. There is a balance between a more complex system using more resources and a simpler system calculating everything on the fly.

⚖️ As with everything, middle ground between "storing everything" and "storing a bare minimum" is the best solution.

Reporting

Businesses change, new questions get asked and business folks will ask questions that can only be answered by executing the code on every user or by recreating the logic in some backend SQL system. Recreating the logic can work for some situations, but there are extreme cases where the decision is complex, information is unavailable to the backend system to make the decision or where reporting and actual are desired to be 100% the same.

Best to align with existing dashboard and reporting systems and to realize when a data point is too complex to calculate using duplicate logic in some reporting system.

Not everything needs to be reported on and new data points should really only be provided if the business really needs this data. When storing new data in code it's important to ask whether this data needs to be reported on.

Efficiency

Caching is a tried and true method within software engineering and realizing when a data point needs to be cached is beyond the scope of this document.

But it's good to take a step back when caching a new value and ask whether it would be helpful to have this data point in reporting and whether this data point is calculated in multiple locations in the code.

Accuracy

Oh no, there are too many places in the code calculating the value slightly differently πŸ”€. It's not always possible to clean the code to have all the values point to some new logic or library since differences need to be investigated before porting everything to a standard library.

If the clean, robust data was stored for all the users, it can be easier to update existing code to use this since the differences in calculation are easier to compare.

Plus, a junior developer might not re-implement effective_to if it's stored in the database. Putting data into a database is a method of communicating "this data is available" around a complex, large codebase in a way that is separate from building a standard library.

Getting one place to store a robust data point is easy, but rolling it out can be hard and require different strategies and coordination.

Functionality

Oftentimes a business will want to have some complex one-off logic to force something to happen on the website.

Perhaps it's some onboarding module that only pops up in the user's first session, or the business needs to reset passwords en-mass.

The code logic can be updated, but if the value is stored in the database, a simple query could change all the values en-mass and therefore change the user experience without a code deploy.

New Architecture

Moving from calculating some data on the fly to storing it and looking it up again can be quite challenging.

  • What if half the data is stored in the database, but not the other half?
  • How exactly do I populate the database while maintaining a production system?
  • How do I know if the data needs to be updated?

In these situations it's important to ask 2 questions,

  • How stale can the data be? 1 day? 1 second?
  • Under what situations does the data need to be updated?

Migration - Example - Do they need a password reset?

In the standard case where users are forced to update their passwords every 6 months, this data can be quite stale.

It's easy enough to store an effective_to date in a database and to update it nightly or weekly.

This can allow business folks to do reporting on this data with questions such as

  • How many users have passwords that currently need resetting?
  • How many users have passwords that need to be reset in the next month?

And to allow a single location for the production software to lookup whether the password needs resetting.

Migration - Complex Pattern

In general a complex system may need a series of steps

  1. Update to calculate the data on the fly, and to store to a database
  2. Run a one-off transformation to execute this code for all users to populate the database
  3. Run a periodic job to update this data for all users
    1. How often this job needs to run is dependent on how stale this data can be
      1. May need to be run
      2. nightly
      3. asynchronously based on some event
      4. Both nightly and asynchronously
  4. Update the on the fly calculation to pull from the database and compare to the in-code calculation
    1. This can be required in the case where small differences can have large effects
  5. Update to no longer do the calculation on the fly 

πŸ’‘A simple system may be able to skip or combine these steps.

Conclusion

Storing data in code is a common, preferred case but often reaches some limit in

  • Reporting
  • Efficiency
  • Accuracy
  • Functionality

There are methods of moving a data point from an in-code calculation into a database, but they have to be carefully considered for functionality and staleness.

🚚 Companies do this all the time! It's part of maintaining and extending a system to meet new news of the business.

🌱It's so common you may wonder why I bothered to write this down at all πŸ’«