Software Design Patterns for Data Practitioners

Note: This is the second of 2 posts on Software Design Patterns for Data Practitioners. In these 2 posts I'm going to talk about the Singleton and Factory design patterns, as those are the ones that I've seen commonly used in data products.

What is a design pattern?

Software Engineers commonly read and know Design Patterns. When I was a software engineer a common text was "Design Patterns: Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson and Vlissides. This book is also known as "The Gang of Four book".

Amazingly, this book has 23!!! different design patterns and really only scratches the surface of the most common Software Engineering design patterns. It includes different ways to put together Object Oriented code so that a piece of software can be:

Easily Maintainable
Easily Extendable
Performant
Modular

A design pattern is a common solution to a recurring problem. Many design patterns within data are algorithmic, such as removing stop words in NLP, or system-level, such as a dashboard that shows refreshed data visualizations. The concept of a design pattern is pretty generic. If you've ever solved 2 problems using basically the same approach or code, that's a design pattern.

Factory

The factory pattern allows for construction of multiple objects that have an identical interface but are constructed and implemented differently.

This is a pattern where multiple objects expose the same functions, so they can all be operated on by the same set of code, plus a factory function that creates the objects.

Something like,

car
bike
skateboard

Each object has operations

accelerate
brake
park

A factory function then creates the objects given some set of inputs like cost, size, and number of wheels. The benefit is that the same set of code can then be used to drive any of the vehicles to a location.

vehicle = factory(cost, dimensions, wheel count)
while not at destination
  vehicle.accelerate
  if stop sign
    vehicle.brake
vehicle.park

So the same set of code can operate on multiple objects.
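The pseudocode above can be made concrete in Python. This is only a minimal sketch: the class bodies, the `vehicle_factory` and `drive` names, the selection rule based on cost and wheel count (dimensions are left out for brevity), and the route representation are all assumptions made for illustration.

```python
class Car:
    def accelerate(self):
        return "car accelerates"

    def brake(self):
        return "car brakes"

    def park(self):
        return "car parks"


class Bike:
    def accelerate(self):
        return "bike accelerates"

    def brake(self):
        return "bike brakes"

    def park(self):
        return "bike parks"


class Skateboard:
    def accelerate(self):
        return "skateboard accelerates"

    def brake(self):
        return "skateboard brakes"

    def park(self):
        return "skateboard parks"


def vehicle_factory(cost, wheel_count):
    """Hypothetical selection rule; the inputs and threshold are made up."""
    if wheel_count == 2:
        return Bike()
    if wheel_count == 4 and cost < 500:
        return Skateboard()
    if wheel_count == 4:
        return Car()
    raise ValueError(f"No vehicle with {wheel_count} wheels")


def drive(vehicle, route):
    """Drive any vehicle; each route item is True when that step has a stop sign."""
    actions = []
    for has_stop_sign in route:
        actions.append(vehicle.accelerate())
        if has_stop_sign:
            actions.append(vehicle.brake())
    actions.append(vehicle.park())
    return actions


# The same drive() code works for every vehicle the factory returns.
for vehicle in (vehicle_factory(20_000, 4), vehicle_factory(300, 2)):
    print(drive(vehicle, [False, True]))
```

Note that `drive` never checks which class it was given. It only relies on the three shared method names.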

Common Data Pattern

An example seems best to illustrate the usefulness of the Factory pattern as applied to a data project. A common data pattern I've seen is pulling from different data sources and combining them into a new data product. For example, an algorithm like:

Pull from a database
  Validate the data pulled
  Transform the data
Pull from a second database
  Validate the data pulled
  Transform the data
Pull from a flat file
  Validate the data pulled
  Transform the data
Do some processing and combining of the 3 datasets
Store the 3 transformed datasets and the newly created dataset (4 in total)

This could be generalized into an algorithm like:

Build n objects from inputted configuration values
  credentials
  input location
  output storage location
For each object
  pull
  validate
  transform
Combine the n datasets as appropriate
For each object
  store

The factory pattern + duck typing is a nice solution to this.
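To make the duck typing half of that concrete, here is a minimal sketch of the generalized algorithm. Everything in it is an assumption for illustration: the `DataSource` protocol, the in-memory `ListSource` stand-in, and the `run_pipeline` function are hypothetical names, and "combining" is reduced to concatenating rows.

```python
from typing import Protocol


class DataSource(Protocol):
    """The methods every source is expected to offer (duck typing)."""

    def pull(self): ...
    def validate(self): ...
    def transform(self): ...
    def store(self): ...


class ListSource:
    """Hypothetical stand-in source that 'pulls' rows from an in-memory list."""

    def __init__(self, rows):
        self.rows = rows
        self.data = None
        self.stored = False

    def pull(self):
        self.data = list(self.rows)

    def validate(self):
        if not self.data:
            raise ValueError("no rows pulled")

    def transform(self):
        self.data = [row * 2 for row in self.data]

    def store(self):
        # A real source would write to its output storage location here.
        self.stored = True


def run_pipeline(sources: list[DataSource]):
    """Pull, validate, and transform each source, combine, then store."""
    for source in sources:
        source.pull()
        source.validate()
        source.transform()
    # Placeholder for "combine the n datasets as appropriate"
    combined = [row for source in sources for row in source.data]
    for source in sources:
        source.store()
    return combined


sources = [ListSource([1, 2]), ListSource([3])]
print(run_pipeline(sources))
```

`ListSource` never inherits from `DataSource`. Any object with these four methods can be passed to `run_pipeline`, which is what lets a factory hand back very different classes.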

Example

This example illustrates how to use a factory function to produce and leverage multiple wrappers together.

The MyDB object pulls from some sort of database and requires credentials, whereas the MyJSON object pulls data from some JSON file somewhere. Together the data can be pulled, used and archived in a standard way. For simplicity I left off a validate function in the two objects, but it could easily be added along with additional common functions.

This example includes a factory function that takes in a configuration dictionary and returns an object that is all set up to pull and store a dataset while also making the dataset accessible within the object.

class MyDB:
  def __init__(self, schema, table, credentials):
    self.schema = schema
    self.table = table
    self.credentials = credentials
  def pull(self):
    """Pull from a database schema/table using the credentials"""
    self.data = 'dataset from the database'
    return self
  def store(self):
    """Store to an archival location"""
    return self

class MyJSON:
  def __init__(self, location):
    self.location = location
  def pull(self):
    """Pull from the JSON file"""
    self.data = 'dataset from the JSON file'
    return self
  def store(self):
    """Store to an archival location"""
    return self

def obj_factory(config):
  """Build the data object that matches the config's 'type'."""
  if config['type'] == 'database':
    return MyDB(config['schema'], config['table'], config['credentials'])
  elif config['type'] == 'json':
    return MyJSON(config['location'])
  # Fail loudly instead of silently returning None for an unknown type
  raise ValueError(f"Unknown data source type: {config['type']!r}")

# These configurations can come in from an external source.
# The objects in the list don't necessarily need to be hardcoded.
# This can make it really easy to add an additional data source
configs = [
  dict(type='database', schema='schema1', table='table1', credentials='c1'),
  dict(type='database', schema='schema2', table='table2', credentials='c2'),
  dict(type='json', location='/abucket/adir/stuff.json')
]

# Build objects
data_objs = [obj_factory(config) for config in configs]

# Pull data
for data_obj in data_objs:
  data_obj.pull()

# Other processing ...

# Store
for data_obj in data_objs:
  data_obj.store()

Then when a new dataset needs to be pulled into this processing, for example from an API call, it's just a matter of building a new object with pull and store functions, adding it to the factory, and setting the configurations.
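As a rough sketch of what adding an API source might look like: the `MyAPI` class, its `url` parameter, the example URL, and the `REGISTRY` dictionary below are hypothetical, and the pull is stubbed out with no network call. Note that this version swaps the if/elif chain for a dictionary lookup, which is a variation on the factory and not a requirement of the pattern.

```python
class MyJSON:
    """Same shape as the JSON wrapper in the example (pull is stubbed out)."""

    def __init__(self, location):
        self.location = location

    def pull(self):
        self.data = 'dataset from the JSON file'
        return self

    def store(self):
        return self


class MyAPI:
    """Hypothetical new source; a real version would call the endpoint in pull()."""

    def __init__(self, url):
        self.url = url

    def pull(self):
        self.data = 'dataset from the API'
        return self

    def store(self):
        return self


# Maps each config 'type' to the class that handles it.
# Supporting a new source means adding one class and one entry here.
REGISTRY = {'json': MyJSON, 'api': MyAPI}


def obj_factory(config):
    params = dict(config)  # copy so the caller's config is left untouched
    source_type = params.pop('type')
    try:
        cls = REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"Unknown data source type: {source_type!r}") from None
    # The remaining config keys must match the class's __init__ parameters
    return cls(**params)


configs = [
    dict(type='json', location='/abucket/adir/stuff.json'),
    dict(type='api', url='https://example.com/data'),
]

data_objs = [obj_factory(config) for config in configs]
for data_obj in data_objs:
    data_obj.pull()
```

The pull and store loops do not change at all when the new source is added. Only the new class and its configuration are new.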

I've used this pattern multiple times to pull in a variety of sources and combine them in a lightweight, flexible, testable and extendable framework.

The alternative is typically a few functions with a series of if statements. This can be hard to extend as it involves tampering with existing functions, it can be hard to test as every if statement in a function would ideally have some test, and it can be hard to understand as the code for a single datatype could be spread across a codebase.
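For contrast, that alternative might look something like the sketch below. The function names and return strings are made up for illustration. The point is only that the logic for one source type ends up split across several functions.

```python
def pull(config):
    """Pull a dataset; every source type needs its own branch here."""
    if config['type'] == 'database':
        return 'dataset from the database'
    elif config['type'] == 'json':
        return 'dataset from the JSON file'
    raise ValueError(f"Unknown data source type: {config['type']!r}")


def store(config, data):
    """Store a dataset; the same branching is repeated a second time."""
    if config['type'] == 'database':
        return f"archived {data} for {config['schema']}.{config['table']}"
    elif config['type'] == 'json':
        return f"archived {data} for {config['location']}"
    raise ValueError(f"Unknown data source type: {config['type']!r}")


# Supporting an 'api' type would mean editing both functions above
# (and validate, transform, ... if those existed as well).
config = dict(type='json', location='/abucket/adir/stuff.json')
print(store(config, pull(config)))
```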

With the factory, the result is one flexible framework with "like" functionality coupled together that is easy to understand, read, test, and extend.

Conclusion

There are many, many, many Software Engineering design patterns. Enough to fill a book or 10. The Factory pattern is pretty generically useful for data products as it helps enable the creation of maintainable, extensible, testable, and readable code.

Software Engineering often extends into the data domain, but it's hard for a data professional to wrap their head around the whole of it. What aspects of Software Engineering should a Data Analyst embrace? How about a Data Engineer or Data Scientist?

As a data professional, which software design patterns are useful to learn about and understand?

➡️➡️ Factory ⬅️⬅️

Stephen Pettinato - Data Professional - (he/him)

2021-11-11