Software Design Patterns for Data Practitioners
Note: This is the first of 2 posts on Software Design Patterns for Data Practitioners. In these 2 posts I'm going to talk about the Singleton and Factory design patterns, as those are the ones that I've seen commonly used in data products.
What is a design pattern?
Software Engineers commonly read and know Design Patterns. When I was a software engineer a common text was "Design Patterns: Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson and Vlissides. This book is also known as "The Gang of Four book".
Amazingly this book has 23!!! different design patterns and really only scratches the surface of the most common Software Engineering design patterns. It includes different ways to put together Object Oriented code in such a way that a piece of software can be
- Easily Maintainable
- Easily Extendable
A design pattern is a common solution to a recurring problem. Many design patterns within data are algorithmic-ey, such as NLP removal of stop words, or system-ey, such as dashboards for refreshed data visualizations. The concept of a design pattern is pretty generic. If you've ever solved 2 problems using basically the same approach or code, that's a design pattern.
At a high level, a Singleton is a single, globally accessible instance of an object. In data projects the simplest analogue is a variable declared in the first cell of a Notebook. It's available in every successive scope, and if it changes, then it's changed from that point on.
Simply assuming that a variable is in scope is a fairly simplistic implementation, and heavy use of global variables is discouraged in a wide variety of software engineering texts. This is because it's hard to know what the value of the variable is at any particular point in time, so it's hard to diagnose what actually happened in the code. Global variables also make testing difficult, because the code is sometimes set up in a way that makes changing the value of the variable hard.
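To make the debugging concern concrete, here is a small sketch. The names `threshold`, `keep_large`, and `tune` are made up for illustration. The same call on the same data gives a different answer depending on what ran earlier, which is what makes global state hard to reason about in a notebook.

```python
# Hypothetical names, for illustration only.
threshold = 10  # imagine this is set in the first notebook cell


def keep_large(values):
    # Reads the global, so the result depends on whatever set it last
    return [v for v in values if v > threshold]


def tune():
    # Some later cell quietly changes the global
    global threshold
    threshold = 50


data = [5, 20, 60]
before = keep_large(data)  # threshold is 10 here
tune()
after = keep_large(data)   # same call, same data, threshold is now 50
print(before, after)       # [20, 60] [60]
```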
The Singleton pattern is a middle ground between using global variables and hardcoding.
So why would I use a Singleton in a Data application?
For data applications, Singletons should be small, self-contained, and deterministic. The most common use of a singleton in a data application is a database connection. In high-throughput systems, database connections can take hundreds of milliseconds to build, so it's best to only do it once.
Another example is to avoid hardcoding configuration values while simultaneously avoiding passing the configuration values around from function to function.
Singletons are not intended to store data, but are best for variables that are used in multiple locations, or are based on some input configuration and are intended to be constant for the lifetime of the application.
The Singleton pattern enables
- accessible objects similar to a global variable
  - they don't have to be passed into every function that needs them
- easy unit testing
  - the singleton can easily be created with specific values or mocked as appropriate
- easy to run in a production or development environment
  - objects can be set at the start of execution and used everywhere
- For Database Connections
  - easy to save time when creating the database connection
  - easy to make various queries across the application while having a standard implementation of logging and error handling
- For Configurations
  - easy to avoid "magic numbers" and hardcoded strings in code
    - any specific values used can be hardcoded into the config with a name and some documentation
  - easy to wrap a variety of configurations together instead of having 12 different variables to track
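The unit testing point can be sketched as follows. This is a minimal illustration, not a prescribed approach: it uses a small config singleton with the same shape as the configuration example later in this post, and `output_path` and `test_output_path` are hypothetical names. Resetting `_instance` before and after the test is an assumed convention that lets the test control the values.

```python
from collections import namedtuple


class Configs(object):
    _instance = None

    def __new__(cls, cfg1=None, cfg2=None):
        if cls._instance is None:
            cls._instance = namedtuple('Configs', ['cfg1', 'cfg2'])(cfg1, cfg2)
        return cls._instance


def output_path(filename):
    # Code under test: reads the singleton instead of taking a config argument
    return Configs().cfg2 + '/' + filename


def test_output_path():
    # Reset the cached instance so this test decides which values are used
    Configs._instance = None
    Configs(cfg1=1, cfg2='/tmp/test_out')
    try:
        assert output_path('a.csv') == '/tmp/test_out/a.csv'
    finally:
        # Leave no state behind for other tests
        Configs._instance = None


test_output_path()
```

The same idea works for development versus production runs: the entry point creates the singleton with the values for that environment, and the rest of the code is unchanged.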
Example - Database Connection
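    class DBConnection(object):
        _instance = None

        def __new__(cls, *credentials):
            if cls._instance is None:
                print('Creating the object')
                # create_dbconnection is a placeholder for your database
                # driver's connect call
                cls._instance = create_dbconnection(*credentials)
            return cls._instance

    # Connect once, up front
    DBConnection('username', 'password')

    # queries is assumed to be a list of query strings defined elsewhere
    query_results = {}
    for query in queries:
        # A tight loop won't require reconnecting to the database on every query
        query_results[query] = DBConnection().query(query)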
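For a version that runs end to end, here is a sketch of the same pattern using Python's built-in `sqlite3` module with an in-memory database. The `events` table and the two queries are invented for the example, and `sqlite3` only stands in for whatever database driver you actually use.

```python
import sqlite3


class DBConnection(object):
    _instance = None

    def __new__(cls, database=':memory:'):
        if cls._instance is None:
            print('Creating the object')
            # sqlite3 stands in for a real database driver here
            cls._instance = sqlite3.connect(database)
        return cls._instance


# First call opens the connection
conn = DBConnection()
conn.execute('CREATE TABLE events (name TEXT)')
conn.execute("INSERT INTO events VALUES ('click'), ('view')")

queries = ['SELECT COUNT(*) FROM events', 'SELECT MIN(name) FROM events']
query_results = {}
for query in queries:
    # Each call returns the already-open connection
    query_results[query] = DBConnection().execute(query).fetchall()

print(query_results)
```

'Creating the object' is printed only once, even though `DBConnection()` is called on every pass through the loop.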
Example - Configuration Storage
At the start of any processing, whether pipeline, glue code, or API, it's common to set certain standard configurations:
- input locations - databases, cloud storage locations, etc
- output locations
- model configurations
  - model version to use
  - maybe model hyperparameters
This can be done using the Singleton pattern, for example:
    from collections import namedtuple

    class Configs(object):
        _instance = None

        def __new__(cls, cfg1=None, cfg2=None, cfg3=None):
            if cls._instance is None:
                print('Creating the object')
                # As an example, hardcode the configs as member variables
                cls._instance = namedtuple(
                    'Configs', ['cfg1', 'cfg2', 'cfg3'])(cfg1, cfg2, cfg3)
            return cls._instance

    cfgA = Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
    print(cfgA)       # Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
    print(cfgA.cfg1)  # 12
    print(cfgA.cfg2)  # dir1
    print(cfgA.cfg3)  # dir2

    cfgB = Configs()
    print(cfgB)       # Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
The values are set when cfgA is initialized. Then when cfgB is initialized, it already has all the values from cfgA.
One variable, lots of configurations all centrally located.
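One consequence worth seeing directly: with this style of implementation, `cfgA` and `cfgB` are the very same object, and any arguments passed after the first call are silently ignored. The sketch below repeats the class so it runs on its own. `cfgC` is an extra name added for illustration.

```python
from collections import namedtuple


class Configs(object):
    _instance = None

    def __new__(cls, cfg1=None, cfg2=None, cfg3=None):
        if cls._instance is None:
            cls._instance = namedtuple(
                'Configs', ['cfg1', 'cfg2', 'cfg3'])(cfg1, cfg2, cfg3)
        return cls._instance


cfgA = Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
cfgB = Configs()
cfgC = Configs(cfg1=99)  # arguments are ignored once the instance exists

print(cfgA is cfgB)  # True
print(cfgC.cfg1)     # 12
```

This is why the values are best set once, at the start of execution, and treated as constant from then on.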
Here is an excellent discussion of the Singleton pattern in Python as compared to the Gang of Four book - https://python-patterns.guide/gang-of-four/singleton/
There are many, many, many Software Engineering design patterns. Enough to fill a book or 10. The Singleton pattern is pretty generically useful for data products, as it helps enable the creation of maintainable, extensible, testable, and readable code.
Software Engineering often extends into the data domain, but it's hard for a data professional to wrap their head around the whole of it. What aspects of Software Engineering should a Data Analyst embrace? How about a Data Engineer or Data Scientist?
As a data professional, which software design patterns are useful to learn about and understand?
➡️➡️ Singleton ⬅️⬅️