About Me

An experienced Data Professional with a background in Data Science and Data Engineering, interested in the intersection of Machine Learning and Engineering.

Most Recent Post

2021-12-02

Retention Model Usage

What can we do with a Retention Model?

Data teams commonly build a retention model, or even a couple. It seems like a great idea. Which members are likely to leave, and which are likely to stay? There are obviously a ton of use cases that could be derived from this.

I'm going to run through a few common use cases and two pitfalls that can arise from retention models.

As a prerequisite, a retention model only models the company's definition of retention. This is a complex question, but well worth digging into before the model is constructed, because the definition has meaningful impacts on how the scores are interpreted. Is the model "will a new member retain after 30 days?" or "will any member retain after 6 months?". This initial definition guides the interpretation of the model and its usage.

Retention or Churn model?

Essentially both are the same, just inverted. A retention model predicts how likely members are to stay, and a churn model predicts how likely members are to leave. As a matter of preference, I recommend building a retention model, because the developer will be discussing this model endlessly.

Would you rather talk about members staying or about members leaving? Personally I find it depressing to spend a chunk of my life thinking about the more negative aspects of churn, but I find it delightful to discuss all the positive aspects of retention.

Projects

Here are a few examples of projects I've seen come up in industry. All of these projects can be hard to implement because businesses have existing processes and methods and can be hesitant to change. Go slow, iterate in tiny chunks, and do what you can to align with higher-level projects and goals.

All these projects require,

  • Raw numeric scores from the retention classification model, generated per member periodically (usually nightly or weekly)
  • AB testing to determine the effect of using the scores

Offers

This is overwhelmingly the most popular project I've seen discussed. Companies often spend $$$ sending out coupons and discounts, and a lot of that is wasted. There are members who are highly likely to stay anyway and don't need a coupon, and there are members who are highly likely to leave who will use the coupon right before they cancel.

Businesses can use retention scores to better target the members in the "middle": those who might cancel and might not.

Often the scores are split into deciles so that the top decile can be safely considered as retaining and the bottom decile can be safely considered as churning. The middle 8 deciles can then be targeted with coupons.

This allows for a couple of nice knobs and dials. A campaign can balance the number of people targeted against the budget of the offer. AB testing can be used to measure the impact of any particular strategy and allows for optimization of the strategy.
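
As a rough sketch of the decile split described above (the scores dataframe and values are made up for illustration):

import pandas as pd

# Hypothetical nightly scores: one retention probability per member
scores = pd.DataFrame({
  'member_id': range(1, 11),
  'retention_score': [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95],
})

# Split scores into deciles: 1 = most likely to churn, 10 = most likely to retain
scores['decile'] = pd.qcut(scores['retention_score'], 10, labels=False) + 1

# Target the middle 8 deciles with the offer; leave the top and bottom deciles alone
target = scores[scores['decile'].between(2, 9)]
print(target['member_id'].tolist())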

Marketing

Let's say marketing runs 5 campaigns over a week. Which one is better?

How could this be measured? The number of members signed up is one way; another is the quality of the members signed up. Enter a retention model. With retention scores, a business can compare a marketing campaign that signs up 10,000 members who intend to leave quickly against one that signs up 1,500 members who really seem to like the service.

Having the retention score can help a business balance the goals of acquiring and keeping members.
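
A minimal sketch of that comparison in pandas, with made-up campaign data:

import pandas as pd

# Hypothetical new sign-ups joined to their retention scores
signups = pd.DataFrame({
  'campaign': ['A', 'A', 'A', 'B', 'B'],
  'retention_score': [0.20, 0.25, 0.30, 0.80, 0.85],
})

# Compare campaigns on both volume and the quality of the members acquired
summary = signups.groupby('campaign')['retention_score'].agg(['count', 'mean'])
print(summary)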

Membership Count Prediction

It's a lot easier to predict the number of members you will have in 6 months if you have an idea of how many existing members are likely to quit in the near-term, and how many are likely to stay.

A straight time series analysis can work well here too, but coupling it with a simulation driven by the retention scores can make something more convincing. Maybe it will be more accurate too!
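
One hedged way to fold the scores into a member-count projection is to treat each score as a probability of retaining over the forecast horizon and sum them. The numbers below are made up:

import pandas as pd

# Hypothetical retention scores for the current member base
members = pd.DataFrame({
  'member_id': [1, 2, 3, 4],
  'retention_score': [0.9, 0.7, 0.4, 0.1],  # probability of retaining over the horizon
})

# Expected number of current members still around at the end of the horizon
expected_retained = members['retention_score'].sum()

# Combine with a separate forecast of new sign-ups (e.g. from a time series model)
forecast_new_members = 500  # placeholder value
print(expected_retained + forecast_new_members)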

Pitfalls

Questions

Endless analysis with no use of the model. Sometimes business folks are incredibly concerned with the change this model presents to their processes. They will want to know more about the retention scores, and more, and more, until 6 months have passed and the scores are still not being used.

It's important to have some guardrails around analysis vs. implementation. The business has good questions, and it's tempting to want to answer them all, but then the scores can take a long time to start providing value. Instead, consider using AB testing as the safety net in lieu of having every question about the scores answered up front.

AB Testing

A lack of AB Testing infrastructure means the effect of the scores on business metrics will be unknown. Why would a business pay for an in-house model if the impact is attributed to other sources?

On the plus side here, setting up good AB Testing infrastructure can immediately help any process without any retention scores or modeling at all.

Conclusion

Retention scores are often desired by a business, but it's important to have commitments on how they will be used before the model is constructed.

Often a super simple v1 can provide a lot of value with little effort in modeling as long as the AB Testing system is well understood.

Go! Build a retention model! It's usually pretty easy to convince your manager and your business partners, but be sure to do some diligence on measurement beforehand.

2021-11-11

Software Design Patterns for Data Practitioners - Factory

Software Design Patterns for Data Practitioners

Note: This is the second of 2 posts on Software Design Patterns for Data Practitioners. In these 2 posts I talk about the Singleton and Factory design patterns, as those are the ones I've seen commonly used in data products.

What is a design pattern?

Software Engineers commonly read and know Design Patterns. When I was a software engineer a common text was "Design Patterns: Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson and Vlissides. This book is also known as "The Gang of Four book".

Amazingly this book has 23!!! different design patterns and really only scratches the surface of the most common Software Engineering design patterns. It includes different ways to put together Object Oriented code in such a way that a piece of software can be

  • Easily Maintainable
  • Easily Extendable
  • Performant
  • Modular

A design pattern is a common solution to a recurring problem. Many design patterns within data are algorithmic-ey, such as NLP removal of stop words, or system-ey, such as dashboards for refreshed data visualizations. The concept of a design pattern is pretty generic. If you've ever solved 2 problems using basically the same approach or code, that's a design pattern.

Factory

The factory pattern allows for construction of multiple objects that have an identical interface but are constructed and implemented differently.

This is a pattern where there are multiple objects with the same functions, all of which can be operated on by the same set of code, plus some factory function that creates the objects.

Something like,

  • car
  • bike
  • skateboard

Each object has operations

  • accelerate
  • brake
  • park

A factory function then creates the objects given some set of inputs like cost, size, and number of wheels. The benefit comes from the fact that the same set of code can then be used to drive any of the vehicles to a location.

  • vehicle = factory(cost, dimensions, wheel count)
  • while(not at destination)
    • vehicle.accelerate
    • if stop sign
      • vehicle.brake
  • vehicle.park

So the same set of code can operate on multiple objects.
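
A minimal Python sketch of that idea (the classes and the factory rule are illustrative only):

class Car:
  def accelerate(self):
    print('car accelerates')
  def brake(self):
    print('car brakes')
  def park(self):
    print('car parks')

class Bike:
  def accelerate(self):
    print('bike accelerates')
  def brake(self):
    print('bike brakes')
  def park(self):
    print('bike parks')

def vehicle_factory(wheel_count):
  # Pick a concrete type from the inputs; the driving code never needs to know which one it got
  return Car() if wheel_count >= 4 else Bike()

vehicle = vehicle_factory(wheel_count=2)
vehicle.accelerate()
vehicle.brake()
vehicle.park()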

Common Data Pattern

An example seems best to illustrate the usefulness of the Factory pattern as applied to a data project. A common data pattern I've seen is pulling from different data sources and combining them into a new data product, with an algorithm like

  1.  Pull from a database
    1. Validate the data pulled
    2. transform the data
  2. Pull from a second database
    1. Validate the data pulled
    2. transform the data
  3. Pull from a flat file 
    1. Validate the data pulled
    2. transform the data
  4. Do some processing and combining of the 3 datasets
  5.  Store each of the transformed datasets and the newly created datasets

Could be combined into an algorithm like,

  1. Build n objects from inputted configuration values
    1. credentials
    2. input location
    3. output storage location
  2. For each object
    1. pull
    2. validate
    3. transform
  3. Combine the n datasets as appropriate
  4. For each object
    1. store

The factory pattern + duck typing is a nice solution to this.

Example

This example illustrates how to use the Factory method to produce and leverage multiple wrappers together.

The MyDB object pulls from some sort of database and requires credentials, whereas the MyJSON object pulls data from a JSON file somewhere. Together the data can be pulled, used, and archived in a standard way. For simplicity I left a validate function out of the two objects, but it could easily be added along with additional common functions.

This example includes a factory function that takes in a configuration dictionary and returns an object that is all set up to pull and store a dataset, while also making the dataset accessible within the object.

class MyDB:
  def __init__(self, schema, table, credentials):
    self.schema = schema
    self.table = table
    self.credentials = credentials
  def pull(self):
    """Pull from a DataBase schema/table using the credentials"""
    self.data = 'dataset from the database'
    return self
  def store(self):
    """Store to an archival location"""
    return self

class MyJSON:
  def __init__(self, location):
    self.location = location
  def pull(self):
    """Pull from the JSON file"""
    self.data = 'dataset from the JSON file'
    return self
  def store(self):
    """Store to an archival location"""
    return self

def obj_factory(config):
  if config['type'] == 'database':
    return MyDB(config['schema'], config['table'], config['credentials'])
  elif config['type'] == 'json':
    return MyJSON(config['location'])
  else:
    # Fail loudly on an unknown type instead of silently returning None
    raise ValueError(f"Unknown config type: {config['type']}")

# These configurations can come in from an external source.
# The objects in the list don't necessarily need to be hardcoded.
# This can make it really easy to add an additional data source
configs = [
  dict(type='database', schema='schema1', table='table1', credentials='c1'),
  dict(type='database', schema='schema2', table='table2', credentials='c2'),
  dict(type='json', location='/abucket/adir/stuff.json')
]

# Build objects
data_objs = [obj_factory(config) for config in configs]

# Pull data
_ = [data_obj.pull() for data_obj in data_objs]
 
# Other processing ...

# Store
_ = [data_obj.store() for data_obj in data_objs]

Then when a new dataset needs to be pulled into this processing, for example from an API call, it's just a matter of building an object with pull and store functions and adding its configuration.
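
For instance, a hypothetical MyAPI wrapper (sketched here with the standard library, with no real auth or error handling) would slot straight into the same loop:

import json
from urllib.request import urlopen

class MyAPI:
  def __init__(self, endpoint):
    self.endpoint = endpoint
  def pull(self):
    """Pull from the API endpoint"""
    # A real client would handle auth, retries and pagination
    self.data = json.loads(urlopen(self.endpoint).read())
    return self
  def store(self):
    """Store to an archival location"""
    return self

# The factory only needs one more branch:
#   elif config['type'] == 'api':
#     return MyAPI(config['endpoint'])

A new entry in configs with type='api' then picks it up without touching the rest of the pipeline.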

I've used this pattern multiple times to pull in a variety of sources and combine them in a lightweight, flexible, testable, and extendable framework.

The alternative is typically a few functions with a series of if statements. This can be hard to extend because it involves tampering with existing functions, hard to test because every if branch ideally needs its own test, and hard to understand because the code for a single data type can end up spread across a codebase.

The factory approach gives one flexible framework with "like" functionality coupled together that is easy to understand, read, test, and extend.

Conclusion

There are many, many, many Software Engineering design patterns. Enough to fill a book or 10. The Factory pattern is pretty generically useful for data products as it helps enable the creation of maintainable, extensible, testable, and readable code.

Software Engineering often extends into the data domain but it's hard for a data professional to wrap their head around the whole of it. What aspects of Software Engineering should a Data Analyst embrace? How about a Data Engineer or Data scientist?

As a data professional, which software design patterns are useful to learn about and understand?

➡️➡️ Factory ⬅️⬅️

Software Design Patterns for Data Practitioners - Singleton

Software Design Patterns for Data Practitioners

Note: This is the first of 2 posts on Software Design Patterns for Data Practitioners. In these 2 posts I talk about the Singleton and Factory design patterns, as those are the ones I've seen commonly used in data products.

What is a design pattern?

Software Engineers commonly read and know Design Patterns. When I was a software engineer a common text was "Design Patterns: Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson and Vlissides. This book is also known as "The Gang of Four book".

Amazingly this book has 23!!! different design patterns and really only scratches the surface of the most common Software Engineering design patterns. It includes different ways to put together Object Oriented code in such a way that a piece of software can be

  • Easily Maintainable
  • Easily Extendable
  • Performant
  • Modular

A design pattern is a common solution to a recurring problem. Many design patterns within data are algorithmic-ey, such as NLP removal of stop words, or system-ey, such as dashboards for refreshed data visualizations. The concept of a design pattern is pretty generic. If you've ever solved 2 problems using basically the same approach or code, that's a design pattern.

Singleton

At a high level, a Singleton is a single, global instance of a variable. In data projects the simplest example is a variable declared in the first cell of a notebook: it's available in every subsequent cell, and if it changes, it's changed from that point on.

It's a fairly simplistic implementation to just assume a variable is in scope, and heavily leveraging global variables is discouraged in a wide variety of software engineering texts. This is because it's hard to know the value of the variable at any particular point in time, which makes it hard to diagnose what actually happened in the code. Global variables also make testing difficult because the code is sometimes set up in a way that makes changing the value of the variable difficult.

The Singleton pattern is a middle ground between using global variables and hardcoding.

So why would I use a Singleton in a Data application?

For data applications Singletons should be small, self contained, and deterministic. The most common use of a singleton in a data application is a database connection. In high throughput systems, database connections can take hundreds of milliseconds to build, so it's best to only do it once.

Another example is using a Singleton to avoid hardcoding configuration values while simultaneously avoiding passing the configuration values from function to function.

Singletons are not intended to store data, but are best for variables that are used in multiple locations, or are based on some input configuration and are intended to be constant for the lifetime of the application.

The Singleton pattern enables

  • accessible objects similar to a global variable
    • they don't have to be passed into every function that needs them
  • easy unit testing
    • the singleton can easily be created with specific values or mocked as appropriate
  • easy to run in a production or development environment
    • objects can be set at the start of execution and used everywhere
  • For Database Connections
    • easy to save time when creating the database connection
    • easy to make various queries across the application while having a standard implementation of logging and error handling
  • For Configurations
    • easy to avoid "magic numbers" and hardcoded strings in code
    • any specific values used can be hardcoded into the config with a name and some documentation
    • easy to wrap a variety of configurations together instead of having 12 different variables to track

Example - Database Connection

class DBConnection(object):
  _instance = None

  def __new__(cls, credentials=None):
    if cls._instance is None:
      print('Creating the object')
      # create_dbconnection is a placeholder for the real connection setup
      cls._instance = create_dbconnection(credentials)
    return cls._instance

# The first call creates the connection; later calls return the same instance
DBConnection(credentials=('username', 'password'))
query_results = {}
for query in queries:
  # A tight loop won't require reconnecting to the database on every query
  query_results[query] = DBConnection().query(query)


Example - Configuration Storage

At the start of any processing whether pipeline, glue code or API, it's common to set certain standard configurations,

  • input locations - databases, cloud storage locations, etc
  • output locations
  • model configurations
    • model version to use
    • maybe model hyperparameters
  • credentials

This can be done using the Singleton pattern such as,

from collections import namedtuple

class Configs(object):
  _instance = None

  def __new__(cls, cfg1=None, cfg2=None, cfg3=None):
    if cls._instance is None:
      print('Creating the object')
      # As an example, hardcode the configs as member variables
      cls._instance = namedtuple(
        'Configs', ['cfg1', 'cfg2', 'cfg3'])(cfg1, cfg2, cfg3)
    return cls._instance

cfgA = Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
print(cfgA)  # Configs(cfg1=12, cfg2='dir1', cfg3='dir2')
print(cfgA.cfg1)  # 12
print(cfgA.cfg2)  # dir1
print(cfgA.cfg3)  # dir2

cfgB = Configs()
print(cfgB)  # Configs(cfg1=12, cfg2='dir1', cfg3='dir2')

The values are set when cfgA is initialized. Then when cfgB is initialized, it already has all the values from cfgA.

One variable, lots of configurations all centrally located.

Additional Resources

Here is an excellent discussion of the Singleton pattern in Python as compared to the Gang of Four book - https://python-patterns.guide/gang-of-four/singleton/

Conclusion

There are many, many, many Software Engineering design patterns. Enough to fill a book or 10. The Singleton pattern is pretty generically useful for data products as it helps enable the creation of maintainable, extensible, testable, and readable code.

Software Engineering often extends into the data domain but it's hard for a data professional to wrap their head around the whole of it. What aspects of Software Engineering should a Data Analyst embrace? How about a Data Engineer or Data scientist?

As a data professional, which software design patterns are useful to learn about and understand?

➡️➡️ Singleton ⬅️⬅️

2021-08-23

How to decrease the time it takes to do an analysis

Rapid Analysis Development

How we do analysis determines how productive we are. I've spent a large amount of time writing data transformations against real data, and often most of that time is spent waiting for code to run. There are ways to speed this up.

Building an analysis is not typically thought of as code development, but analysts spend a lot of time writing code. Whether it's SQL, Python, R, Excel, or dashboards in some Business Intelligence vendor's proprietary software, the analyst writes code to load, transform, aggregate, and display data.

Often this data is large, so the individual queries can take 30 seconds, a minute or many minutes to execute.

How can we speed this up? How can we make the queries run faster while exploring the data and developing the analysis?

Typical Analytics Process

A typical analyst will follow a process like this to explore data while building plots, charts or calculating metrics.

  1. Load some data
  2. Transform and Join
  3. Calculate Aggregations
  4. Build Visualizations

The analyst will often jump around these steps to fiddle with inputs, try multiple visualizations and generally iterate on the code until results are clear and accurate.

The analyst wants real results, so by default they work with the full dataset.

Pros

Working with the full dataset allows for handling of corner cases as they arise and helps fine-tune the logic of the transformations.

Preliminary counts and sums can be evaluated against the analyst's prior knowledge of the dataset which helps increase confidence of the final product and helps find bugs in the code.

Cons

Each code iteration can take a while to run. Changing the color of a plot in Python requires rebuilding the plot and can take seconds or minutes to run. This may not seem like a lot but can add up quickly.

Executing a piece of code 10 times can easily take 5 minutes, and if the analyst gets distracted with questions or reading the news this time balloons rapidly.

Rapid Analysis Development

This is a more rapid code development process when working with large datasets.

  1. Jump around these steps doing the development
    1. Load a sample of the data
    2. Transform and Join
    3. Calculate Aggregations
    4. Build Visualizations
  2. When everything looks good move on
  3. Remove the sample from step 1
  4. Re-run all the steps
  5. Update code to handle corner cases and fix errors
  6. And Re-run again to get the final results

In this case, the jumping around in step 1 is faster to iterate on because the dataset is smaller. There will be weird bugs here, joins won't work as expected and groupby aggregations may return invalid results. This can often be alleviated by taking a sample of the largest dataset (usually something visitor or weblog based) but not taking samples of fact tables that this data is joined into.

The results won't be accurate until the full set is run so aggregations may be obviously incorrect.

Pros

Each code iteration can be really fast. Maybe 5 seconds or less, so the analyst won't get distracted and can focus more on the results and less on watching queries run.

Cons

The full analysis will still need to be run with the full input datasets, and this will still take time because when the code looks ready, it won't be. Additional work will be required to double check that joins are working properly with the full input dataset and to fix any issues that arise.

Any visualizations produced will need changes when run with the full input dataset. Max and min ranges may not align, or histograms may look totally corrupt due to outliers that weren't in the original data sample.

Tips & Tricks

For analyses run in notebooks, a parameter can be placed at the top of the notebook indicating whether a sample should be taken, so re-running with the full dataset is as easy as updating a single variable and clicking "run all" (see the sketch after these tips).

In Python and R it's easy to add assert statements to verify the individual steps and joins while the initial development is happening on the sample datasets.

SQL tables can be constructed from samples which can make complex joins faster to develop.
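
Here's a minimal sketch of the first two tips in a Python notebook; the table names, columns, and checks are hypothetical.

import pandas as pd

# Top-of-notebook switch: flip to False and "run all" for the final, full-data pass
USE_SAMPLE = True

visits = pd.read_parquet('visits.parquet')    # hypothetical large weblog table
members = pd.read_parquet('members.parquet')  # hypothetical fact table, left unsampled

if USE_SAMPLE:
  visits = visits.sample(frac=0.01, random_state=0)

joined = visits.merge(members, on='member_id', how='left')

# Cheap asserts catch join problems early, on the sample as well as the full run
assert len(joined) == len(visits), 'join unexpectedly changed the row count'
assert joined['signup_date'].notna().all(), 'found visits with no matching member'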

Conclusion

Working with large datasets can be very time consuming. It's fastest to work with small datasets when developing new code. This is a fairly standard software engineering technique and can be applied to data analysis as well. Samples can drastically speed up an analysis.

Next time you build a project, really think about how much time you spend watching the code run. Add it up and consider whether a sample could help you save time.

Your time is valuable, don't waste it watching queries running if the results will just be discarded.

2021-06-25

How do I get work done?

Data Science Process

I've been asked how I get stuff done, and since Data Science does not have a rigid, industry-wide process, it seems worthwhile to write up some notes and tips here.

A lot of advice around having an effective Data Science process is quite high level: "Do POCs" or "Get Stakeholder Buy-in" 🙄. This post discusses my own personal process at a lower level, with the aim of hitting somewhere between high-level advice like "Communicate with Stakeholders" and low-level advice like "Don't reuse variable names".

Overall I try to focus on the task at hand by splitting my work into different types of tickets. This helps me to iterate on small units of work and to see and communicate progress. My manager and stakeholders know where I've spent my time lately because I can show finished and in progress tickets and demonstrate code functionality.

  • Design
  • Analysis
  • Develop
  • Productionalize
  • Maintenance

Where do I use this process?

Everywhere. Any particular project I work on uses some aspect of this framework to keep track of where I'm at in the project. Some projects I do end-to-end, some projects I just do part of so I only use part of this process.

What do I mean by Project?

For me a Project is a unit of work that starts with an idea and ends with a finished product that the business can use in some substantial manner. Projects don't end there however, they need continuing effort to keep running and I try to track that continuing effort as well as the initial implementation of the project.

Why do I use this process?

This process helps me go home on time.

Because I can standardize a process of incremental work and communication,

  1. My manager and stakeholders know what I'm doing
  2. The initial design phase gives me built-in time to document a timeline
  3. If I'm spending a bunch of time on maintenance or bugs, it's documented and easy to discuss with my manager.

Focus on the Task at Hand

Building a model or doing an analysis often has no clear ending. When is the accuracy "good enough" for production? Are "enough" corner cases handled? What does production mean for this project? When I'm doing some maintenance on a project, should I also update the code to use a new feature?

A project is like a forest: it's surprisingly easy to get lost and spend weeks, months, or years iterating without having an effect on the business.

To this goal of staying focused, I write detailed tickets with clear items to be done and I don't do anything without a ticket. I try to have less than 3 tickets in progress at any given time to help me maintain my focus. Tickets help to focus and orient the work towards the end goal as well as communicate progress to wider stakeholders.

Tickets keep me honest about when a particular unit of work is complete, provide guardrails as to what should not be done with this work and help communicate status to team members, management, and more importantly, me in the future.

Tickets are also nicely incremental and allow for an iterative approach to a project, one step at a time from design to maintenance.

Design

At the start of any project I spend time determining the goals. I try to answer the following questions and write them down for review with stakeholders.

  • What is this project?
  • What does success look like?
  • How will it work? - system architecture, datasets, algorithms, documented analysis etc.
  • What does the timeline look like?

During Design I don't build a model, I don't do analysis plots or charts, I don't write more than a few dozen lines of SQL. I answer questions and write down what the project will entail.

This task is 95% documentation, but it often takes a few bits of SQL to check assumptions. Anything beyond a couple of ad-hoc queries needs to go into its own Analysis work, and I note these open questions in the design. It's easy to accidentally do too much analysis in the design work, and I try to not answer all the questions at this step.

With some projects the questions outnumber the current understanding of the data; in that case the design can end up incomplete and may need to be finished after some analysis is done. But usually the analysis just answers small points and the existing design only needs minor updates. It's good to note that the design is never complete until after the project is deployed, but writing down the initial design helps establish guardrails and a direction for the project.

I always review my designs with stakeholders and my peers. They always always always help to polish the initial design and give me confidence that I'm on the right track and that the timeline for the project is realistic.

Analysis

Analysis pops up a bunch of times in any given project.

  • After design
    • Check design assumptions
  • After development
    • Do the results make any sense?
    • Are the outputs usable as they were expected to be?
  • After productionalization
    • Is this reproducible?
    • Is it working as originally designed?

Any analysis can easily spiral out of control, and it's essential to use tickets to ensure that analysis is only done when there are specific questions, and that the analysis is completed when those specific questions are answered.

Overall there is a balance here between being too rigid and too loose. If I'm too rigid I end up not answering necessary questions or writing too many tickets, and if I'm too loose I end up spending too much time on analysis and may not meet a deadline. There is no correct answer here, I just try my best to balance these concerns.

Any project will often have multiple analysis tickets, but it's best to iterate a few questions at a time rather than try to answer 10 questions all at the same time.

Usually Analysis tickets get reviewed with stakeholders and peers, but sometimes they only provide a greater understanding of the data that is used during development.

Develop

Make a model, write some code, make some features. I am a Data Scientist and I got into this job to build models, so it's tempting to spend way too much time on this step; following the design keeps me in check.

My design is always really specific and agreed upon by peers and stakeholders, so I just implement the design. I don't iterate on the algorithm beyond what the design ticket says, I don't design or analyze, I just focus on the engineering aspect of the project and aim towards what the design says.

Since this is primarily a develop task, I don't worry about what the data says, or what the final results might be. I implement using small datasets and only use the full dataset to verify that corner cases are handled appropriately.

Of course I look at any final metrics or results before moving on, but primarily as a sanity check on the implementation.

Productionalize

Wow, not a lot of projects get to this point. Often stakeholders change their minds, or the initially designed product can't be built, or the data needed for the project doesn't exist, or the accuracy isn't good enough, or the project is already done and just needs maintenance.

This step is usually obvious for the project at hand, but once again having a ticket and focusing on just the productionalization as designed helps to ensure that I finish this step in a timely manner.

Some projects require a clean presentation to stakeholders, and some are pure engineering, but either way this step is the final polish on the project before it's reviewed and actionable by stakeholders.

Maintenance

Some items are outside of my control, and updates are required to keep projects running beyond their original implementation. Bugs will arise and need fixing. Systems, processes, and data change, and code that runs nightly (or a completed analysis) may need to be updated for the current state of the business and infrastructure.

It's good to mark these tickets as maintenance (or bug) as tracking this type of work allows me to properly communicate to my manager (and myself) where my time is spent.

These tickets can range all over the place, from writing SQL to building visualizations to developing code and working with other teams to understand existing functionality. The ticket is doubly important here, as without a ticket indicating what needs doing, there is literally no other reason to do this work.

Tickets are also nice for maintenance as sometimes they are unimportant and don't really need doing, but someone is asking for it to be done. In that case having a ticket allows for clear communication as to the priorities of this maintenance work compared to other projects and other work.

Occasionally a maintenance ticket will end up being a large iteration on a project. In this case, it's important to recognize this and close the maintenance ticket to start a new design ticket if this work is of high priority.

Conclusion

Focusing on the task to be done helps me split my work up into multiple bite-sized pieces that can be tackled one after another. It allows me to document and communicate my work and saves me the energy of constant decision churn of "should I do this other thing instead of what I'm currently doing?".

Conceptually these tickets don't have a clear dividing line. Where does design end and analysis start? How about develop and productionalize? Arbitrarily adding in a line helps focus time and energy into an iterative process instead of just having one large ticket "Do Project".

Even these types of work themselves don't have clear definitions. Design might mean "System Architecture" or "Establish Business Use Case" or something else entirely. Data projects are always blends of engineering, analytics, and product, so it's hard to decide on a framework for how to tackle a project. Should engineering best practices be used? Or product processes, or the analytical process from academia?

There is no rigid system of "do data science like this" nor is the vocabulary for this type of process settled upon. There is no clear answer, find what works best for you and adapt it for your organization, or adapt your organization's process to your own.

Be incremental, be focused on one task at a time, iterate, communicate and go home on time.

2021-06-16

How do I Scale My Personal Performance?

Scale My Personal Performance

Quick post on my personal process as a programmer and my own method of improving.

When I first started programming, I would

  1. Write some code
  2. Do some ad-hoc tests to make sure it was correct
  3. Move on

This is not scalable for working on a team and maintaining projects for months or years at a time. I was nervous to change any of the code 10 minutes after I did the ad-hoc tests, and even more nervous to update code 6 months after the initial development. I didn't want to break anything, so my ability to experiment with the libraries I was using was drastically limited.

This initial process wasn't even a good way to learn to program. Most of my programs as an undergrad didn't even compile. It wasn't until I started working as a Software Engineer and writing unit tests that I became a competent programmer.

Nowadays I

  1. Write some code
  2. Write some Unit Tests to validate code
  3. Iterate on the code for efficiency and readability

Iterating on existing code without changing the functionality has allowed me to have a deeper understanding of the libraries that I'm working with. Having the unit tests is a guarantee that the modified code works as expected with every change.

Compared to ad-hoc testing, writing unit tests doesn't even take extra time. There is a learning curve to unit testing in Python, but the ability to re-run those checks 5, 10, 50 times ultimately saves development time.
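
As a minimal sketch of what that looks like in Python (the function under test is made up for illustration):

import unittest

def normalize_name(name):
  """Hypothetical helper: trim whitespace and title-case a name."""
  return name.strip().title()

class TestNormalizeName(unittest.TestCase):
  def test_strips_and_title_cases(self):
    self.assertEqual(normalize_name('  ada lovelace '), 'Ada Lovelace')

  def test_handles_already_clean_input(self):
    self.assertEqual(normalize_name('Grace Hopper'), 'Grace Hopper')

if __name__ == '__main__':
  unittest.main()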

I Take Notes

I maintain notes on a variety of subjects,

  • Git/Github
    • Checkout new branch from an existing remote branch
    • How to set the upstream branch
    • etc
  • Python
    • Basics of unit testing
    • Common libraries I use a lot - mock, unittest, requests 
    • etc
  • Pandas
    • How to do a groupby transform to add a new column
    • How to fill `na` values depending on other columns
    • etc
  • PySpark
    • How to read a single CSV file
    • PySpark Pipelines
    • etc
  • Specific Vendors
    • Each vendor and service has its own quirks, and I like to jot them down
  • ML Algorithms
  • OSX, bash, brew, conda, etc.

Each one is 1-3 pages (currently) and contains code snippets and/or explanations of things I've learned at work that are generic to Data Science in general, i.e. no company secrets and absolutely no copy/paste of any code developed at work for work.
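
For a flavor of what goes into the Pandas note, a snippet like this (with made-up column names) is typical:

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1.0, None, 3.0]})

# Groupby transform to add a new column with the group mean
df['group_mean'] = df.groupby('group')['value'].transform('mean')

# Fill na values depending on another column (here, with that group's mean)
df['value'] = df['value'].fillna(df['group_mean'])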

Conclusion

Over time this process has helped me to become a better programmer and faster at accomplishing routine tasks.

Hopefully there is something in this that is helpful for you!

2021-03-03

What are all these Data people doing anyway?

Data Organization Specialties

Companies have lots of people working in data. This is distinct from the people working on websites, warehouses, machinery and design. Usually it's easy to figure out who is working on data, but it can be hard to identify what they are doing.

Everyone has a specialty, an emphasis. Outlined here are the 4 data specialties I've encountered in data organizations.

4 Data Specialties

Data organizations have pretty diverse projects and requirements. Everything from "How many users visited the website yesterday?" to "Our collider collects too much data to store, what should we discard?".

There is no single person who ever understands all the data, nor is there some unicorn who can do everything. Teams of people with data specialties coordinate on solving the diverse problems that organizations have.

These teams tend to have

  • Analysts answering questions
  • Engineers storing data
  • Librarians organizing metadata
  • Scientists building predictions

There is always overlap, and everybody on a team always does a little of everything but people specialize in order to allow an organization to coordinate and solve ever larger problems.

Analyst

Data Analysts are responsible for finding actionable insights in the data. Often a primary responsibility of data analysts is building reports but they can go much deeper. One off analyses can involve simple tables of numbers, but could also involve model construction, feature engineering and explaining the model constructed along with the importance of the features.

  • Reporting
    • Dashboards
    • Defining and implementing metrics
  • One off Analyses
    • Pulling numbers
    • Building and explaining models

 Example Projects

  • Ongoing Reporting
    • On an ongoing basis, report on a variety of metrics
    • Involves a lot of documentation and discussion on what metrics to report, how to calculate the metrics and how to make them accessible to the business as a whole
  • One Off Analyses
    • Does having a social media presence on a particular platform help bring in new website users at a reasonable cost?
    • Which users should receive coupons in this upcoming email?

Engineer

Raw data can be messy and/or inaccessible. Data Engineers are responsible for building pipelines to pull and archive the raw data and to systematically clean datasets for use in downstream reporting. Other teams have data needs with requirements around how often the data is refreshed and data engineers have the skills to fulfill those requests.

  • Make clean data accessible with requirements around
    • When
    • Where
    • How
  • Optimize data flows for
    • Speed to delivery
    • Cost to operate

 Example Projects

  • Data Warehousing
    • Usually the flagship project of Data Engineering teams. Store the data in an archival format such that reporting, analytics and modeling can run on top
  • Real Time Server Side User Profiles
    • Make user data accessible in real time for modeling, analytics or for display on a website

Librarian

The Librarian is the maintainer of what goes into datasets and how to interpret the data stored within. The librarian is an oft-overlooked role that tends to get lumped in with or distributed across the rest of the data organization, but it is a distinct function. It requires close coordination with the business and with engineers to define datasets and to maintain a dictionary of terms related to those datasets. Librarians organize and store the metadata that enables the rest of the data organization to understand the data. Without a dedicated role this function gets spread throughout the organization anyway, but it's an essential function that has to be done.

  • Maintains complex data dictionaries
  • Negotiates with data creators on
    • Vocabulary
    • Fields
  • Trains data creators and people performing data entry tasks

 Example Projects

  • WebLogs
    • Define a consistent naming scheme and definitions for capturing events on a website for downstream use.
  • Taxonomy
    • Organize what is known about products and the available category names and definitions.

Scientist

The Data Scientist builds and deploys mathematical models that help the business solve specific problems. Sometimes the models are a simple data transformation or heuristic, sometimes a t-test, and sometimes they are a complex model (or models) that push the limits of computation. Data scientists are a little bit of everything, with their own emphases and a focus on business problems. Often individual projects look a lot like the work that Data Analysts and Data Engineers do, but there is typically a range of projects with a heavy foundation in mathematics.

  • One off analyses
    • Pulling numbers
    • Building and explaining models
  •  Predictions
    • Granular predictions in support of business projects

 Example Projects

  • Recommendations
    • Deliver automated, reliable recommendations on a periodic basis
  • User Scoring
    • Use churn, LTV or other user scores to optimize handing out coupons

Conclusion

Lots of people work in data. What's your role?

  • I answer questions asked by non-technical people - Analyst
  • I build platforms to store and allow retrieval of data - Engineer
  • I organize what the stuff in the data means - Librarian
  • I do math and build granular predictions - Scientist

Maybe you do 2, 3 or all 4 of these types of work! That's common, but I think you will find that you enjoy doing one type of work better than the others.

Look around you, what about the people you work with?

The Excel power users, the SQL wizard, the genius in finance - they all fit into the picture of this world we call data.

2021-02-05

I'm on the Data Team, what does Engineering mean when they say "prod" and "dev"?

Engineering Prod vs Dev for Data Folks and vice-versa

So I'm on the Data Team doing data stuff and I'm working closely with some Engineers who run the website. They keep talking about "dev" and "prod" and I don't really know what they mean. I work with 1 database with real data that I can query.

What's up with the Engineering setup and how does it compare to the Data setup?

Overall Differences

Engineering best practices followed by engineers include separate environments for development and production systems. In mature organizations Data Engineers adhere to these standards as well but they are not the only folks developing data transformations and running queries.

Real data has to be available to data professionals for them to be successful. Engineering teams strive to lock down real production data for security and deployment reasons, whereas a data organization can often get by fine with a single database.

Data organizations include non-engineers such as Data Analysts, BI Analysts, Statisticians and Data Scientists who develop transformations and support the business but are not trained engineers and only leverage some engineering practices.

Engineering - Dev/Prod

The Engineering team has strong requirements to keep the website (or App or other code) running as close to 100% of the time as possible. They also have requirements around keeping PII locked up and often severely restrict who has the permissions to query real production data. With this in mind they want a place where they can test code functionality before the code is deployed to production without using real production data.

Dev

The Development environment is basically a free for all. There are often multiple environments for multiple teams to each test their slice of the code. Everything here is small and fast. Easy to deploy code, and easy to wipe out and start over. Sometimes this is as simple as running the entire system on the developers laptop.

To develop code a developer needs to query a database and the database needs something in it to return. So developers put mock data in a database with only a few hundred rows. That way the teams can use this test data to validate real functionality while still keeping the database small enough to fit on a laptop. This data is often updated manually in order to test individual use cases.

For Example, consider the construction of a web page to view orders. The developer may want to test a customer who has made zero orders, 3 orders, 1000 orders and various other corner cases to make sure the page looks as expected. In order to view how the webpage will look and to test the full page end to end system there needs to be data in the database to view.

Prod

Production runs across thousands of machines and queries databases containing Terabytes of data. A piece of code with invalid syntax can sometimes take down an entire website costing a business thousands or millions of dollars.

Typically code is not deployed to production unless it has been verified to have the desired functionality and to not break any existing code.

Often Prod has a little sibling: staging. Staging is the place where code about to be deployed can be tested. The Engineers want it to be the same as production but with less data and fewer machines running, so it runs cheaply. Staging is often a fragile environment, but it's always stable right before a production deploy.


Engineering Dev/Prod

Environment   Data   Code                 Cost
Development   MBs    Fragile              Low
Staging       MBs    Fragile and Stable   Low
Production    TBs    Stable               High

Data - Dev/Prod

Data teams typically work with 1 database filled with real data - one instance of Snowflake or Redshift or Greenplum. There is real data in that machine with all the benefits and risks that entails. This is the production environment. It's just the one database that all queries run against and where all the data is stored.

From the Engineering perspective this sounds risky. How is the data kept stable and consistent? How is code deployed and tested? How is PII locked down and inaccessible to query?

Dev/Ad-Hoc Queries

Data folks run queries against real data on a near constant basis. Both ETL development and ad-hoc queries have to be run against real data in order to account for historical idiosyncrasies. Essentially for both tasks, a bunch of queries are run and queries that need to run on a schedule are then scheduled.

Occasionally queries are run that are resource heavy and can slow down the production database, but this is rare and modern Data Engineers can cancel queries and restrict resource usage of individual accounts or optimize the performance of a scheduled query.

So what keeps the developer from accidentally overwriting the production data?

  • Naming Conventions
    • Query results are usually stored in an ad-hoc location in the database, somewhere that can be deleted without repercussions to production scheduled jobs.
  • Permissions
    • ETL development and ad-hoc queries are typically run under the individual's own account, which has limited permissions and cannot modify production data.
  • Additional Servers
    • Some data setups include a copy of the production database where development can occur.
  •  Snapshots
    • In the event of a catastrophic failure, production (or sometimes just a single table) could be recreated from a snapshot.
 

Summary

Data and Engineering teams have a different set of requirements and skill sets. This translates into different environments and a different system of developing, testing and deploying.

It's important to understand each other to maintain a relationship of trust and productivity.

2021-01-04

Data Science Pipelines - Pulling Data

Data Science Pipelines - Pulling Data

tl;dr Every dataset should be pulled from a database or flat file exactly once. This will make the code easier to read and maintain, more performant, and easier to hand off to a colleague.

Ok, so you have an algorithm that produces some scores and you want to run it nightly.
 
This post details some best practices to make maintenance of new and existing ML pipelines easier.

The coding skills of Data Scientists are all over the place. Everything from "I can barely write SQL" to "I can write an operating system". This post is intended for an audience of Data Scientists who are less familiar with Software Engineering practices.
 

What does a standard nightly ML process look like?

Let's assume the code transforms the Products and Customers datasets and scores them with a model on a nightly basis. There is always business logic, and the code will grow as that business logic is bolted onto the model execution. The code is usually a mix of Python, R, and SQL.

The code runs nightly, the product team likes the scores, and the data scientists do some QA and testing and are satisfied that the scores are correct.

Everyone is happy! As long as it runs every night.

What types of changes can I expect?

  1. Input Data Locations
    • Changes in Data Base or Table name
    • Data Engineering Managers love new shiny storage systems and every couple of years will move everything from one system to another. SQL Server to Greenplum to Redshift to Snowflake, there is always something better coming out.
  2. Input Data Formats
    • Data Engineers love star schemas and de-normalization and additional joins may be required to build the input data
    • Changes from an RDS to flat files may occur
  3. Product Changes
    • As the business changes so too do business rules

Code Updated

Wonderful! The model is a success and folks want updates, changes and additional functionality!

So, updates, testing and everyone is happy again!

Oops we are pulling data in twice in different locations

This will cause issues sooner or later.

  • Updates to the filtering of the Products dataset will be inconsistent between the two pulls
  • Unavailable Inventory data will cause the pipeline to fail in the middle, after potentially running for hours

 

Pull in each Dataset Exactly Once

Best to pull each dataset in exactly once.

Benefits

  • Any unavailable or corrupted dataset will cause the pipeline to fail immediately.
    • Mostly useful during development, as waiting 10 minutes for the pipeline to fail is really annoying.
    • Also nice for nightly jobs, as a support team would immediately know something failed and could quickly fix and re-run.
  • Updates to transformations or filtering of an input dataset are applied on read.
    • One piece of code won't be using filtered data while another is using unfiltered data.
    • This also helps with performance, since large datasets are not pulled in multiple times.
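
A minimal sketch of this structure, with hypothetical table names and a hypothetical scoring step:

import pandas as pd

def load_inputs():
  """Pull every input dataset exactly once, up front; fail fast if anything is missing."""
  # Hypothetical sources; in practice these would be database pulls or flat-file reads
  products = pd.read_parquet('products.parquet')
  customers = pd.read_parquet('customers.parquet')
  inventory = pd.read_parquet('inventory.parquet')
  return products, customers, inventory

def run_pipeline():
  products, customers, inventory = load_inputs()
  # Every downstream step works from these in-memory datasets,
  # so filters and transformations are applied in one place only
  scores = score_model(products, customers, inventory)  # hypothetical scoring step
  return scores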

Conclusion

Every dataset should be pulled from a database or flat file exactly once. This will make the code easier to read and maintain, more performant and easier to hand off to a colleague.