About Me

An experienced Data Professional with experience in Data Science and Data Engineering interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2021-11-11

Software Design Patterns for Data Practitioners - Factory

Software Design Patterns for Data Practitioners

Note: This is the second of 2 posts on Software Design Patterns for Data Practitioners. In these 2 posts I'm going to talk about the Singleton and Factory design patterns, as those are the ones that I've seen commonly used in data products.

What is a design pattern?

Software Engineers commonly read and know Design Patterns. When I was a software engineer a common text was "Design Patterns: Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson and Vlissides. This book is also known as "The Gang of Four book".

Amazingly, this book has 23!!! different design patterns and really only scratches the surface of the most common Software Engineering design patterns. It includes different ways to put together Object Oriented code in such a way that a piece of software can be:

  • Easily Maintainable
  • Easily Extendable
  • Performant
  • Modular

A design pattern is a common solution to a recurring problem. Many design patterns within data are algorithmic-ey, such as NLP removal of stop words, or system-ey, such as a dashboard for refreshed data visualizations. The concept of a design pattern is pretty generic. If you've ever solved 2 problems using basically the same approach or code, that's a design pattern.

Factory

The factory pattern allows for construction of multiple objects that have an identical interface but are constructed and implemented differently.

In other words, there are multiple objects with the same functions, so they can all be operated on by the same set of code, plus some factory function that creates the objects.

Something like,

  • car
  • bike
  • skateboard

Each object has operations

  • accelerate
  • brake
  • park

A factory function then creates the objects given some set of inputs like cost, size, and number of wheels. The benefit is that the same set of code can then be used to drive any of the vehicles to a location.

  • vehicle = factory(cost, dimensions, wheel count)
  • while(not at destination)
    • vehicle.accelerate
    • if stop sign
      • vehicle.brake
  • vehicle.park

So the same set of code can operate on multiple objects.
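The pseudocode above could be sketched in Python roughly as follows. This is only an illustration: the class names, the `vehicle_factory` and `drive` functions, and the rule that maps cost and wheel count to a vehicle are all made up for the example.

```python
class Car:
  def __init__(self):
    self.speed, self.parked = 0, False
  def accelerate(self):
    self.speed += 10
  def brake(self):
    self.speed = 0
  def park(self):
    self.speed, self.parked = 0, True

class Bike:
  def __init__(self):
    self.speed, self.parked = 0, False
  def accelerate(self):
    self.speed += 3
  def brake(self):
    self.speed = 0
  def park(self):
    self.speed, self.parked = 0, True

class Skateboard:
  def __init__(self):
    self.speed, self.parked = 0, False
  def accelerate(self):
    self.speed += 1
  def brake(self):
    self.speed = 0
  def park(self):
    self.speed, self.parked = 0, True

def vehicle_factory(cost, wheel_count):
  """Hypothetical rule for picking a vehicle from its description"""
  if wheel_count == 2:
    return Bike()
  if wheel_count == 4:
    return Skateboard() if cost < 500 else Car()
  raise ValueError(f"No vehicle with {wheel_count} wheels")

def drive(vehicle, distance, stop_signs=()):
  """Drive any vehicle to the destination using only the shared interface"""
  position = 0
  while position < distance:
    vehicle.accelerate()
    position += vehicle.speed
    if position in stop_signs:
      vehicle.brake()
  vehicle.park()
  return position

# The same drive() code works for every vehicle the factory can build
for cost, wheels in [(20000, 4), (300, 2), (50, 4)]:
  drive(vehicle_factory(cost, wheels), distance=100)
```

Note that `drive` never checks which kind of vehicle it was given. It only relies on `accelerate`, `brake`, and `park` being there.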

Common Data Pattern

An example seems best to illustrate the usefulness of the Factory pattern as applied to a data project. A common data pattern I've seen is pulling from different data sources and combining them into a new data product, with an algorithm like:

  1. Pull from a database
    1. Validate the data pulled
    2. Transform the data
  2. Pull from a second database
    1. Validate the data pulled
    2. Transform the data
  3. Pull from a flat file
    1. Validate the data pulled
    2. Transform the data
  4. Do some processing and combining of the 3 datasets
  5. Store each of the 3 transformed datasets and the newly created combined dataset

These steps could be combined into an algorithm like:

  1. Build n objects from inputted configuration values
    1. credentials
    2. input location
    3. output storage location
  2. For each object
    1. pull
    2. validate
    3. transform
  3. Combine the n datasets as appropriate
  4. For each object
    1. store

The factory pattern + duck typing is a nice solution to this.
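As a rough sketch of steps 2 through 4 of the combined algorithm, the loop below relies on duck typing: any object with pull, validate, transform, and store functions will work. The `InMemorySource` class and `run_pipeline` function are hypothetical stand-ins invented for this illustration, and building the objects from configuration (step 1) is left out here.

```python
class InMemorySource:
  """Hypothetical stand-in for a real source such as a database or a file"""
  def __init__(self, name, rows):
    self.name = name
    self.rows = rows
    self.data = None
    self.stored = False
  def pull(self):
    self.data = list(self.rows)
  def validate(self):
    if not self.data:
      raise ValueError(f"{self.name}: no rows pulled")
  def transform(self):
    self.data = [row.strip().lower() for row in self.data]
  def store(self):
    # A real object would write self.data to its output location
    self.stored = True

def run_pipeline(sources):
  # Step 2: every object gets the same calls, whatever its concrete type
  for source in sources:
    source.pull()
    source.validate()
    source.transform()
  # Step 3: combine the n datasets (plain concatenation in this sketch)
  combined = [row for source in sources for row in source.data]
  # Step 4: store each object
  for source in sources:
    source.store()
  return combined

sources = [
  InMemorySource("first", [" A ", "b"]),
  InMemorySource("second", ["C"]),
]
combined = run_pipeline(sources)
```

Nothing in `run_pipeline` needs to change when a new kind of source is added, as long as the new object offers the same four functions.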

Example

This example illustrates how to use the Factory method to produce and leverage multiple wrappers together.

The MyDB object pulls from some sort of database and requires credentials, whereas the MyJSON object pulls data from some JSON file somewhere. Together the data can be pulled, used, and archived in a standard way. For simplicity I left off a validate function in the two objects, but it could easily be added along with additional common functions.

This example includes a Factory method that takes in a configuration dictionary and returns an object that is all set up to pull and store a dataset while also making the dataset accessible within the object.

class MyDB:
  def __init__(self, schema, table, credentials):
    self.schema = schema
    self.table = table
    self.credentials = credentials
  def pull(self):
    """Pull from a DataBase schema/table using the credentials"""
    self.data = 'dataset from the database'
    return self
  def store(self):
    """Store to an archival location"""
    return self

class MyJSON:
  def __init__(self, location):
    self.location = location
  def pull(self):
    """Pull from the JSON file"""
    self.data = 'dataset from the JSON file'
    return self
  def store(self):
    """Store to an archival location"""
    return self

def obj_factory(config):
  """Build the data object that matches config['type']"""
  if config['type'] == 'database':
    return MyDB(config['schema'], config['table'], config['credentials'])
  elif config['type'] == 'json':
    return MyJSON(config['location'])
  # Fail loudly rather than silently returning None for an unknown type
  raise ValueError(f"Unknown data source type: {config['type']}")

# These configurations can come in from an external source.
# The objects in the list don't necessarily need to be hardcoded.
# This can make it really easy to add an additional data source.
configs = [
  dict(type='database', schema='schema1', table='table1', credentials='c1'),
  # A second database reuses the 'database' type with its own settings
  dict(type='database', schema='schema2', table='table2', credentials='c2'),
  dict(type='json', location='/abucket/adir/stuff.json')
]

# Build objects
data_objs = [obj_factory(config) for config in configs]

# Pull data
for data_obj in data_objs:
  data_obj.pull()

# Other processing ...

# Store
for data_obj in data_objs:
  data_obj.store()

Then when a new dataset needs to be pulled into this processing, for example from an API call, it's just a matter of writing the new object's pull and store functions and setting its configuration.
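As a hedged sketch of what adding such a source might look like, the `MyAPI` class below and its `url` and `token` settings are made up for illustration, and `pull` is a placeholder rather than a real API call. Only the new factory branch is shown so the snippet runs on its own.

```python
class MyAPI:
  """Hypothetical wrapper for a dataset served by an API"""
  def __init__(self, url, token):
    self.url = url
    self.token = token
  def pull(self):
    """Placeholder: a real version would call the API here"""
    self.data = 'dataset from the API'
    return self
  def store(self):
    """Store to an archival location"""
    return self

def obj_factory(config):
  # Only the new branch is shown. In the full example it would sit
  # alongside the existing 'database' and 'json' branches.
  if config['type'] == 'api':
    return MyAPI(config['url'], config['token'])
  raise ValueError(f"Unknown data source type: {config['type']}")

# The new source is then just one more configuration entry
new_config = dict(type='api', url='https://example.com/data', token='t1')
api_obj = obj_factory(new_config).pull()
api_obj.store()
```

The code that builds, pulls, and stores the objects stays untouched. The only changes are the new class, one factory branch, and one configuration entry.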

I've used this pattern multiple times to pull in a variety of sources and combine them in a lightweight, flexible, testable, and extendable framework.

The alternative is typically a few functions with a series of if statements. This can be hard to extend as it involves tampering with existing functions, it can be hard to test as every if statement in a function would ideally have some test, and it can be hard to understand as the code for a single datatype could be spread across a codebase.

The result is one flexible framework that keeps "like" functionality coupled together and is easy to understand, read, test, and extend.

Conclusion

There are many, many, many Software Engineering design patterns. Enough to fill a book or 10. The Factory pattern is pretty generically useful for data products as it helps enable the creation of maintainable, extensible, testable, and readable code.

Software Engineering often extends into the data domain, but it's hard for a data professional to wrap their head around the whole of it. What aspects of Software Engineering should a Data Analyst embrace? How about a Data Engineer or Data Scientist?

As a data professional, which software design patterns are useful to learn about and understand?

➡️➡️ Factory ⬅️⬅️