Data Science Process
I've been asked how I get stuff done, and since Data Science does not have a rigid, industry-wide process, it seems worthwhile to write up some notes and tips here.
A lot of advice around having an effective Data Science process is quite high level: "Do POCs" or "Get Stakeholder Buy-in" 🙄. This post discusses my own personal process at a lower level, aiming to land somewhere between high-level advice like "Communicate with Stakeholders" and low-level advice like "Don't reuse variable names".
Overall I try to focus on the task at hand by splitting my work into different types of tickets. This helps me iterate on small units of work and see and communicate progress. My manager and stakeholders know where I've spent my time lately because I can show finished and in-progress tickets and demonstrate code functionality. The ticket types are:
- Design
- Analysis
- Develop
- Productionalize
- Maintenance
Where do I use this process?
Everywhere. Any particular project I work on uses some aspect of this framework to keep track of where I'm at in the project. Some projects I do end-to-end; others I only do part of, so I only use part of this process.
What do I mean by Project?
For me a Project is a unit of work that starts with an idea and ends with a finished product that the business can use in some substantial manner. Projects don't end there, however: they need continuing effort to keep running, and I try to track that continuing effort as well as the initial implementation of the project.
Why do I use this process?
This process helps me go home on time.
Because I standardize on a process of incremental work and communication:
- My manager and stakeholders know what I'm doing.
- The initial design for the project gives me built-in time to document a timeline.
- If I'm spending a bunch of time on maintenance or bugs, it's documented and easy to discuss with my manager.
Focus on the Task at Hand
Building a model or doing an analysis often has no clear ending. When is the accuracy "good enough" for production? Are "enough" corner cases handled? What does production mean for this project? When I'm doing some maintenance on a project, should I also update the code to use a new feature?
A project is like a forest: it's surprisingly easy to get lost in it and spend weeks, months or years iterating without having an effect on the business.
To stay focused, I write detailed tickets with clear items to be done and I don't do anything without a ticket. I try to have fewer than 3 tickets in progress at any given time to help me maintain my focus. Tickets help to focus and orient the work towards the end goal as well as communicate progress to wider stakeholders.
Tickets keep me honest about when a particular unit of work is complete, provide guardrails as to what should not be done with this work and help communicate status to team members, management, and more importantly, me in the future.
Tickets are also nicely incremental and allow for an iterative approach to a project, one step at a time from design to maintenance.
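As a rough illustration of the rules above (typed tickets, clear completion items, fewer than 3 in progress), here is a minimal Python sketch. The names `Ticket`, `TicketType`, `Status`, `start` and `MAX_IN_PROGRESS` are hypothetical and not tied to any real ticketing tool, and the example tickets are made up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TicketType(Enum):
    DESIGN = "design"
    ANALYSIS = "analysis"
    DEVELOP = "develop"
    PRODUCTIONALIZE = "productionalize"
    MAINTENANCE = "maintenance"


class Status(Enum):
    TODO = "todo"
    IN_PROGRESS = "in_progress"
    DONE = "done"


# Assumption: "fewer than 3 in progress" means at most 2 at once.
MAX_IN_PROGRESS = 2


@dataclass
class Ticket:
    title: str
    kind: TicketType
    done_when: List[str]  # clear items that define when the work is complete
    status: Status = Status.TODO


def start(ticket: Ticket, board: List[Ticket]) -> bool:
    """Move a ticket to in-progress only if the limit allows it."""
    in_progress = sum(t.status is Status.IN_PROGRESS for t in board)
    if in_progress >= MAX_IN_PROGRESS:
        return False
    ticket.status = Status.IN_PROGRESS
    return True


# Illustrative board with made-up tickets.
board = [
    Ticket("Write project design", TicketType.DESIGN, ["Goals reviewed with stakeholders"]),
    Ticket("Check design assumptions", TicketType.ANALYSIS, ["Open questions answered"]),
    Ticket("Build features", TicketType.DEVELOP, ["Features match the design"]),
]
started = [start(t, board) for t in board]
```

With this limit, the third ticket stays in `TODO` until one of the first two is finished, which is the point of the rule: the limit forces a choice rather than letting work pile up.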
Design
At the start of any project I spend time determining the goals. I try to answer the following questions and write them down for review with stakeholders.
- What is this project?
- What does success look like?
- How will it work? (System architecture, datasets, algorithms, documented analysis, etc.)
- What does the timeline look like?
During Design I don't build a model, I don't do analysis plots or charts, I don't write more than a few dozen lines of SQL. I answer questions and write down what the project will entail.
This task is 95% documentation, but it often takes a few bits of SQL to check assumptions. Anything beyond a couple of ad-hoc queries needs to go into its own Analysis work, and I note these open questions in the design. It's easy to accidentally do too much analysis in the design work, and I try not to answer all questions at this step.
With some projects the questions outnumber the current understanding of the data. In that case the design can end up incomplete, and it may need to be finished after some analysis is done. But usually the analysis just answers small points and the existing design only needs minor updates. It's worth noting that the design is never complete until after the project is deployed, but writing down the initial design helps establish guardrails and a direction for the project.
I always review my designs with stakeholders and my peers. They always always always help to polish the initial design and give me confidence that I'm on the right track and that the timeline for the project is realistic.
Analysis
Analysis pops up a bunch of times in any given project.
- After design
  - Check design assumptions
- After development
  - Do the results make any sense?
  - Are the outputs usable as they were expected to be?
- After productionalization
  - Is this reproducible?
  - Is it working as originally designed?
Any analysis can easily spiral out of control, and it's essential to use tickets to ensure that analysis is only done when there are specific questions, and that the analysis is completed when those specific questions are answered.
Overall there is a balance here between being too rigid and too loose. If I'm too rigid I end up not answering necessary questions or writing too many tickets, and if I'm too loose I end up spending too much time on analysis and may not meet a deadline. There is no correct answer here; I just try my best to balance these concerns.
Any project will often have multiple analysis tickets, but it's best to iterate on a few questions at a time rather than try to answer 10 questions all at once.
Usually Analysis tickets get reviewed with stakeholders and peers, but sometimes they only provide a greater understanding of the data that is used during development.
Develop
Make a model, write some code, make some features. I stick to the design here: I am a Data Scientist and I got into this job to build models, so it's tempting to spend way too much time on this one.
My design is always really specific and agreed upon by peers and stakeholders, so I just implement the design. I don't iterate on the algorithm beyond what the design ticket says, I don't design or analyze, I just focus on the engineering aspect of the project and aim towards what the design says.
Since this is primarily a develop task, I don't worry about what the data says, or what the final results might be. I implement using small datasets and only use the full dataset to verify that corner cases are handled appropriately.
Of course I look at any final metrics or results before moving on, but primarily as a sanity check on the implementation.
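The "small data first, full data for corner cases" workflow could look something like the sketch below. Everything in it is hypothetical: `sample_rows`, `find_failures`, the `days_active` feature and the toy rows exist only to illustrate the idea.

```python
import random


def sample_rows(rows, n, seed=0):
    """Return a small, repeatable sample for fast development runs."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def days_active(row):
    """Hypothetical feature under development."""
    if row["last_seen"] is None or row["first_seen"] is None:
        return None  # corner case already handled: missing values
    return row["last_seen"] - row["first_seen"]


def find_failures(rows, feature):
    """Run the feature over every row and collect the rows that raise."""
    failures = []
    for row in rows:
        try:
            feature(row)
        except Exception as exc:
            failures.append((row, repr(exc)))
    return failures


# Toy stand-in for the full dataset.
full = [
    {"first_seen": 1, "last_seen": 5},
    {"first_seen": 2, "last_seen": None},
    {"first_seen": 3},  # corner case not yet handled: missing key
]

small = sample_rows(full, 2)                 # iterate quickly on this
failures = find_failures(full, days_active)  # then verify on everything
```

Here the full run surfaces the row with the missing `last_seen` key, which a small sample may or may not happen to include. The full pass is used only to find unhandled cases, not to study what the data says.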
Productionalize
Wow, not a lot of projects get to this point. Often stakeholders change their minds, or the product as initially designed can't be built, or the data to do the project doesn't exist, or the accuracy isn't good enough, or the project is already done and just needs maintenance.
This step is usually obvious for the project at hand, but once again having a ticket and focusing on just the productionalization as designed helps to ensure that I finish this step in a timely manner.
Some projects require a clean presentation to stakeholders, and some are engineering work, but either way this step is the final polish on the project before stakeholders review it and act on it.
Maintenance
Some items are outside of my control and updates are required to keep projects running beyond their original implementation. Bugs will arise and need fixing. Systems, processes and data change, and code that's running nightly or a completed analysis may need to be updated for the current state of the business and infrastructure.
It's good to mark these tickets as maintenance (or bug) as tracking this type of work allows me to properly communicate to my manager (and myself) where my time is spent.
These tickets can range all over the place, from writing SQL, to building visualizations, to developing code and working with other teams to understand existing functionality. The ticket is doubly important here: without a ticket indicating what needs doing, there is literally no other reason to do this work.
Tickets are also nice for maintenance because sometimes the work is unimportant and doesn't really need doing, but someone is asking for it to be done. In that case having a ticket allows for clear communication about the priority of this maintenance work compared to other projects and other work.
Occasionally a maintenance ticket will end up being a large iteration on a project. In this case, it's important to recognize this and close the maintenance ticket to start a new design ticket if this work is of high priority.
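Once tickets are labeled by type, a summary of where the time went is a few lines of code. The sketch below assumes each ticket carries a rough hours estimate, which is an assumption and not something the process above requires. `hours_by_type` and the log entries are made up for illustration.

```python
from collections import defaultdict


def hours_by_type(entries):
    """Return each ticket type's share of total logged hours."""
    totals = defaultdict(float)
    for ticket_type, hours in entries:
        totals[ticket_type] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {t: round(h / grand_total, 2) for t, h in totals.items()}


# Illustrative log of (ticket type, hours) pairs.
log = [("design", 4), ("develop", 10), ("maintenance", 5), ("bug", 1)]
shares = hours_by_type(log)
```

In this made-up log, maintenance and bugs together account for 30% of the hours, which is the kind of number that makes a conversation with a manager concrete.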
Conclusion
Focusing on the task to be done helps me split my work up into multiple bite-sized pieces that can be tackled one after another. It allows me to document and communicate my work and saves me the energy of constant decision churn of "should I do this other thing instead of what I'm currently doing?".
Conceptually these tickets don't have a clear dividing line. Where does design end and analysis start? How about develop and productionalize? Arbitrarily adding in a line helps focus time and energy into an iterative process instead of just having one large ticket "Do Project".
Even these types of work themselves don't have clear definitions. Design might mean "System Architecture" or "Establish Business Use Case" or something else entirely. Data projects are always blends of engineering, analytics and product, so it's hard to decide upon a framework for tackling a project. Should engineering best practices be used? Or product practices, or analytical processes from academia?
There is no rigid system of "do data science like this", nor is the vocabulary for this type of process settled. There is no clear answer: find what works best for you and adapt it for your organization, or adapt your organization's process to your own.
Be incremental, be focused on one task at a time, iterate, communicate and go home on time.