Engineering Prod vs Dev for Data Folks and vice-versa
So I'm on the Data Team doing data stuff and I'm working closely with some Engineers who run the website. They keep talking about "dev" and "prod" and I don't really know what they mean. I work with 1 Database with real data that I can ask query.
What's up with the Engineering setup and how does it compare to the Data setup?
Engineering best practices followed by engineers include separate environments for development and production systems. In mature organizations Data Engineers adhere to these standards as well but they are not the only folks developing data transformations and running queries.
Real data has to be available to data professionals in order to be successful, whereas engineering teams strive to lock the real production data for security and deployment reasons, a data organization can often get by fine with a single database.
Data organizations include non-engineers such as Data Analysts, BI Analysts, Statisticians and Data Scientists who develop transformations and support the business but are not trained engineers and only leverage some engineering practices.
Engineering - Dev/Prod
The Engineering team has strong requirements to keep the website (or App or other code) running as close to 100% of the time as possible. They also have requirements around keeping PII locked up and often severely restrict who has the permissions to query real production data. With this in mind they want a place where they can test code functionality before the code is deployed to production without using real production data.
The Development environment is basically a free for all. There are often multiple environments for multiple teams to each test their slice of the code. Everything here is small and fast. Easy to deploy code, and easy to wipe out and start over. Sometimes this is as simple as running the entire system on the developers laptop.
To develop code a developer needs to query a database and the database needs something in it to return. So developers put mock data in a database with only a few hundred rows. That way the teams can use this test data to validate real functionality while still keeping the database small enough to fit on a laptop. This data is often updated manually in order to test individual use cases.
For Example, consider the construction of a web page to view orders. The developer may want to test a customer who has made zero orders, 3 orders, 1000 orders and various other corner cases to make sure the page looks as expected. In order to view how the webpage will look and to test the full page end to end system there needs to be data in the database to view.
Production runs across thousands of machines and queries databases containing Terabytes of data. A piece of code with invalid syntax can sometimes take down an entire website costing a business thousands or millions of dollars.
Typically code is not deployed to production unless it has been verified to have the desired functionality and to not break any existing code.
Often Prod has a little sibling - staging. Staging is the place where code about to be deployed can be tested. The Engineers want to be sure it's the same as production but with less data and less machines running so as to run cheaply. Staging is often a fragile environment but it's always stable right before a production deploy.
Data - Dev/Prod
Data teams typically work with 1 database filled with real data - one instance of Snowflake or Redshift or Greenplum. There is real data in that machine with all the benefits and risks that entails. This is the production environment. It's just the one database that all queries run against and where all the data is stored.
From the Engineering perspective this sounds risky. How is the data kept stable and consistent? How is code deployed and tested? How is PII locked down and inaccessible to query?
Data folks run queries against real data on a near constant basis. Both ETL development and ad-hoc queries have to be run against real data in order to account for historical idiosyncrasies. Essentially for both tasks, a bunch of queries are run and queries that need to run on a schedule are then scheduled.
Occasionally queries are run that are resource heavy and can slow down the production database, but this is rare and modern Data Engineers can cancel queries and restrict resource usage of individual accounts or optimize the performance of a scheduled query.
So what keeps the developer from accidentally overwriting the production data?
- Naming Conventions
- Storage of query results are usually stored to an ad-hoc location in the database, somewhere that can be deleted without repercussions to production, scheduled jobs.
- ETL Development and Ah-Hoc queries are typically run on the individual's own account which has limited permissions and cannot modify production data.
- Additional Servers
- Some data setups include a copy of the production database where development can occur.
- In the event of a catastrophic failure, production (or sometimes just a single table) could be recreated from a snapshot.