2021-11-11

Software Design Patterns for Data Practitioners - Factory

Software Design Patterns for Data Practitioners

Note: This is the second of 2 posts on Software Design Patterns for Data Practitioners. In these 2 posts I talk about the Singleton and Factory design patterns, as those are the ones I've seen most commonly used in data products.

What is a design pattern?

Software Engineers commonly read and know design patterns. When I was a software engineer, a standard text was "Design Patterns: Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson, and Vlissides, also known as "the Gang of Four book".

Amazingly, this book contains 23 (!) different design patterns and still only scratches the surface of common Software Engineering design patterns. It presents different ways to put together Object Oriented code so that a piece of software can be

  • Easily Maintainable
  • Easily Extendable
  • Performant
  • Modular

A design pattern is a common solution to a recurring problem. Many design patterns within data are algorithmic-ey, such as removing stop words in NLP, or system-ey, such as a dashboard of refreshed data visualizations. The concept of a design pattern is pretty generic. If you've ever solved 2 problems using basically the same approach or code, that's a design pattern.

Factory

The factory pattern allows for construction of multiple objects that have an identical interface but are constructed and implemented differently.

This is a pattern where there are multiple objects with the same functions that can all be operated on by the same set of code, plus some factory function that creates the objects.

Something like,

  • car
  • bike
  • skateboard

Each object has operations

  • accelerate
  • brake
  • park

Then a factory function creates the objects given some set of inputs like cost, size, and number of wheels. The benefit comes because the same set of code can then be used to drive any of the vehicles to a location.

  • vehicle = factory(cost, dimensions, wheel count)
  • while(not at destination)
    • vehicle.accelerate
    • if stop sign
      • vehicle.brake
  • vehicle.park

So the same set of code can operate on multiple objects.
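As a concrete toy sketch of this in Python, the classes and the wheel-count rule below are purely illustrative:

```python
# A toy sketch of the vehicle factory; the classes and the
# wheel-count rule are illustrative, not a real API.

class Car:
  def accelerate(self):
    print('car accelerating')

  def brake(self):
    print('car braking')

  def park(self):
    print('car parked')


class Bike:
  def accelerate(self):
    print('bike accelerating')

  def brake(self):
    print('bike braking')

  def park(self):
    print('bike parked')


def vehicle_factory(wheel_count):
  """Choose a vehicle type from some input, here just the wheel count."""
  if wheel_count >= 4:
    return Car()
  return Bike()


vehicle = vehicle_factory(wheel_count=2)
vehicle.accelerate()  # the same driving code works for any vehicle type
vehicle.brake()
vehicle.park()
```

The driving code never needs to know which concrete class it received, only that the object answers to accelerate, brake, and park.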

Common Data Pattern

An example seems best to illustrate the usefulness of the Factory pattern as applied to a data project. A common data pattern I've seen is pulling from different data sources and combining into a new data product. Such as an algorithm like

  1. Pull from a database
    1. Validate the data pulled
    2. Transform the data
  2. Pull from a second database
    1. Validate the data pulled
    2. Transform the data
  3. Pull from a flat file
    1. Validate the data pulled
    2. Transform the data
  4. Do some processing and combining of the 3 datasets
  5. Store each of the 3 transformed datasets and the newly created dataset
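Sketched as straight-line code, this repetitive version might look like the following; the pull, validate, and transform functions are hypothetical stand-ins:

```python
# A sketch of the repetitive version; pull_from_db, pull_from_file,
# validate, and transform are hypothetical stand-ins.

def pull_from_db(schema, table):
  return f'data from {schema}.{table}'

def pull_from_file(path):
  return f'data from {path}'

def validate(data):
  return data  # pretend the data passed validation

def transform(data):
  return data.upper()

# The same pull -> validate -> transform chain is repeated per source
db1 = transform(validate(pull_from_db('schema1', 'table1')))
db2 = transform(validate(pull_from_db('schema2', 'table2')))
file1 = transform(validate(pull_from_file('/some/file.csv')))

combined = ' | '.join([db1, db2, file1])
```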

Could be combined into an algorithm like,

  1. Build n objects from inputted configuration values
    1. credentials
    2. input location
    3. output storage location
  2. For each object
    1. pull
    2. validate
    3. transform
  3. Combine the n datasets as appropriate
  4. For each object
    1. store

The factory pattern + duck typing is a nice solution to this.

Example

This example illustrates how to use the Factory method to produce and leverage multiple wrappers together.

The MyDB object pulls from some sort of database and requires credentials, whereas the MyJSON object pulls data from some JSON file somewhere. Together the data can be pulled, used, and archived in a standard way. For simplicity I left off a validate function in the two objects, but it could easily be added along with additional common functions.

This example includes a Factory method that takes in a configuration dictionary and returns an object that is all setup to pull and store a dataset while also making the dataset accessible within the object.

class MyDB:
  def __init__(self, schema, table, credentials):
    self.schema = schema
    self.table = table
    self.credentials = credentials
  def pull(self):
    """Pull from a DataBase schema/table using the credentials"""
    self.data = 'dataset from the database'
    return self
  def store(self):
    """Store to an archival location"""
    return self

class MyJSON:
  def __init__(self, location):
    self.location = location
  def pull(self):
    """Pull from the JSON file"""
    self.data = 'dataset from the JSON file'
    return self
  def store(self):
    """Store to an archival location"""
    return self

def obj_factory(config):
  if config['type'] == 'database':
    return MyDB(config['schema'], config['table'], config['credentials'])
  elif config['type'] == 'json':
    return MyJSON(config['location'])
  raise ValueError(f"Unknown config type: {config['type']}")

# These configurations can come in from an external source.
# The objects in the list don't necessarily need to be hardcoded.
# This can make it really easy to add an additional data source
configs = [
  dict(type='database', schema='schema1', table='table1', credentials='c1'),
  dict(type='database', schema='schema2', table='table2', credentials='c2'),
  dict(type='json', location='/abucket/adir/stuff.json')
]

# Build objects
data_objs = [obj_factory(config) for config in configs]

# Pull data
_ = [data_obj.pull() for data_obj in data_objs]
 
# Other processing ...

# Store
_ = [data_obj.store() for data_obj in data_objs]

Then when a new dataset needs to be pulled into this processing, for example from an API call, it's just a matter of building a new object with pull and store functions and setting its configuration.
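A sketch of such an extension: the MyAPI class, its endpoint parameter, and the ValueError fallback below are illustrative, not part of the code above. The new source only needs the same duck-typed pull/store interface and one more branch in the factory; the driving loop doesn't change.

```python
class MyAPI:
  """Hypothetical new source with the same duck-typed interface."""
  def __init__(self, endpoint):
    self.endpoint = endpoint

  def pull(self):
    """Pull from the API endpoint"""
    self.data = 'dataset from the API'
    return self

  def store(self):
    """Store to an archival location"""
    return self

def obj_factory(config):
  # ... the existing 'database' and 'json' branches stay as-is ...
  if config['type'] == 'api':
    return MyAPI(config['endpoint'])
  raise ValueError(f"Unknown config type: {config['type']}")

# Adding the source is then just one more configuration entry
obj = obj_factory(dict(type='api', endpoint='https://example.com/data'))
obj.pull()
```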

I've used this pattern multiple times to pull in a variety of sources and combine them in a lightweight, flexible, testable, and extendable framework.

The alternative is typically a few functions with a series of if statements. This can be hard to extend, as it involves tampering with existing functions; hard to test, as every if branch in a function would ideally have its own test; and hard to understand, as the code for a single data type can end up spread across a codebase.

The factory pattern gives one flexible framework, with "like" functionality coupled together, that is easy to understand, read, test, and extend.

Conclusion

There are many, many, many Software Engineering design patterns. Enough to fill a book or 10. The Factory pattern is pretty generically useful for data products as it helps enable the creation of maintainable, extensible, testable, and readable code.

Software Engineering often extends into the data domain but it's hard for a data professional to wrap their head around the whole of it. What aspects of Software Engineering should a Data Analyst embrace? How about a Data Engineer or Data scientist?

As a data professional, which software design patterns are useful to learn about and understand?

➡️➡️ Factory ⬅️⬅️

Software Design Patterns for Data Practitioners - Singleton

Software Design Patterns for Data Practitioners

Note: This is the first of 2 posts on Software Design Patterns for Data Practitioners. In these 2 posts I talk about the Singleton and Factory design patterns, as those are the ones I've seen most commonly used in data products.

What is a design pattern?

Software Engineers commonly read and know design patterns. When I was a software engineer, a standard text was "Design Patterns: Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson, and Vlissides, also known as "the Gang of Four book".

Amazingly, this book contains 23 (!) different design patterns and still only scratches the surface of common Software Engineering design patterns. It presents different ways to put together Object Oriented code so that a piece of software can be

  • Easily Maintainable
  • Easily Extendable
  • Performant
  • Modular

A design pattern is a common solution to a recurring problem. Many design patterns within data are algorithmic-ey, such as removing stop words in NLP, or system-ey, such as a dashboard of refreshed data visualizations. The concept of a design pattern is pretty generic. If you've ever solved 2 problems using basically the same approach or code, that's a design pattern.

Singleton

At a high level, a Singleton is a global instance of a variable. In data projects the simplest example is a variable declared in the first cell of a notebook. It's available in every successive cell, and if it changes, it's changed from that point on.

It's fairly simplistic to just assume that a variable is in scope, and heavily leveraging global variables is discouraged in a wide variety of software engineering texts. This is because it's hard to know the value of a global variable at any particular point in time, so it's hard to diagnose what actually happened in the code. Global variables also make testing difficult, because the code is often set up in a way that makes changing the variable's value difficult.

The Singleton pattern is a middle ground between using global variables and hardcoding.

So why would I use a Singleton in a Data application?

For data applications, Singletons should be small, self-contained, and deterministic. The most common use of a Singleton in a data application is a database connection. In high-throughput systems, database connections can take hundreds of milliseconds to build, so it's best to only do it once.

Another example is to avoid hardcoding configuration values while simultaneously avoiding passing them around from function to function.

Singletons are not intended to store data, but are best for variables that are used in multiple locations, or are based on some input configuration and are intended to be constant for the lifetime of the application.

The Singleton pattern enables

  • accessible objects similar to a global variable
    • they don't have to be passed into every function that needs them
  • easy unit testing
    • the singleton can easily be created with specific values or mocked as appropriate
  • easy to run in a production or development environment
    • objects can be set at the start of execution and used everywhere
  • For Database Connections
    • easy to save time when creating the database connection
    • easy to make various queries across the application while having a standard implementation of logging and error handling
  • For Configurations
    • easy to avoid "magic numbers" and hardcoded strings in code
    • any specific values used can be hardcoded into the config with a name and some documentation
    • easy to wrap a variety of configurations together instead of having 12 different variables to track
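The unit-testing point above can be sketched with a toy Singleton: resetting the class-level _instance lets a test inject known values. (The Settings class here is illustrative, not from any particular library.)

```python
class Settings(object):
  """Toy Singleton holding one config value, for illustration."""
  _instance = None

  def __new__(cls, value=None):
    if cls._instance is None:
      cls._instance = super().__new__(cls)
      cls._instance.value = value
    return cls._instance

# In a unit test, reset the class-level instance to inject a known value
Settings._instance = None
test_settings = Settings(value=42)
assert Settings().value == 42  # later accesses all see the injected value
```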

Example - Database Connection

class DBConnection(object):
  _instance = None

  def __new__(cls, credentials=None):
    if cls._instance is None:
      print('Creating the object')
      # create_dbconnection stands in for your database driver's connect call
      cls._instance = create_dbconnection(credentials)
    return cls._instance

DBConnection(credentials=('username', 'password'))
for query in queries:
  # A tight loop won't require reconnecting to the database on every query
  query_results[query] = DBConnection().query(query)


Example - Configuration Storage

At the start of any processing, whether a pipeline, glue code, or an API, it's common to set certain standard configurations,

  • input locations - databases, cloud storage locations, etc
  • output locations
  • model configurations
    • model version to use
    • maybe model hyperparameters
  • credentials

This can be done using the Singleton pattern such as,

from collections import namedtuple

class Configs(object):
  _instance = None

  def __new__(cls, cfg1=None, cfg2=None, cfg3=None):
    if cls._instance is None:
      print('Creating the object')
      # As an example, hardcode the configs as member variables
      cls._instance = namedtuple(
        'Configs', ['cfg1', 'cfg2', 'cfg3'])(cfg1, cfg2, cfg3)
    return cls._instance

cfgA = Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
print(cfgA)  # Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
print(cfgA.cfg1)  # 12
print(cfgA.cfg2)  # dir1
print(cfgA.cfg3)  # dir2

cfgB = Configs()
print(cfgB)  # Configs(cfg1=12, cfg2='dir1', cfg3='dir2')

The values are set when cfgA is initialized. Then when cfgB is created, the existing instance is returned, so it already has all the values from cfgA.

One variable, lots of configurations all centrally located.

Additional Resources

Here is an excellent discussion of the Singleton pattern in Python as compared to the Gang of Four book - https://python-patterns.guide/gang-of-four/singleton/

Conclusion

There are many, many, many Software Engineering design patterns. Enough to fill a book or 10. The Singleton pattern is pretty generically useful for data products as it helps enable the creation of maintainable, extensible, testable, and readable code.

Software Engineering often extends into the data domain but it's hard for a data professional to wrap their head around the whole of it. What aspects of Software Engineering should a Data Analyst embrace? How about a Data Engineer or Data scientist?

As a data professional, which software design patterns are useful to learn about and understand?

➡️➡️ Singleton ⬅️⬅️