New development tools often go hand in hand with current best-practice methodologies. Methods and software are so intertwined it can seem as if you can't use one without the other. Upgrading to the latest and greatest tool can certainly ease the transition to a better process, but projects can't replace their entire technical stack every other year just to keep up with market trends.
For Data Engineering and Data Operations projects, DataOps is the new standard. DataOps combines elements from Agile Development, DevOps, and Lean Manufacturing, such as Continuous Integration/Continuous Deployment (CI/CD), Version Control, and Statistical Process Control, and adapts them to the challenges of developing and managing data solutions. New tools exist to manage all the software required for data operations and to implement DataOps practices across an entire project. While this might be the best solution in an ideal world, it requires upfront costs and refactoring that can impede ongoing development priorities. Instead, you can begin integrating DataOps concepts into your existing environment, using free or affordable software to support what you already have.
Let’s look at a practical example: Definitive Logic manages an on-premises database supporting a custom application. The database holds 20 GB of data in almost 200 tables across 6 schemas, with dozens of supporting views, triggers, stored procedures, and functions. It supports daily extract-load-transform (ELT) processes that feed both a transactional web application and a data analytics tool. Because of the interdependency between the database and the web application, most database structure changes cannot be deployed without the accompanying application changes.
Here’s where it gets even more complicated: the Development environment (Dev) is on a different network than the Test and Production (Prod) environments, and code changes cannot be moved directly between the networks. This makes continuous, automated integration impossible; some level of manual interaction is required to transfer code from Development and implement it in Test and Prod. The goal, then, is not a fully automated CI/CD pipeline but minimizing the time to deploy given these constraints.
What to do, what to do? We reviewed available automated database deployment tools such as SDT and Flyway and found they would add more overhead than manually executing the deployments. Adding a few columns could generate 1,000 lines of code. With deployments including dozens or hundreds of changes, troubleshooting errors would become a full-time job. For this reason, we decided to track development changes to data and database structure in an Excel spreadsheet and execute the changes manually. Deploying to an environment could take anywhere from 30 to 90 minutes, depending on the number of steps, but the source of any error would be immediately apparent and could be corrected without having to update and rerun the entire deployment.
Then came a light at the end of the tunnel. As the project matured and it became obvious it would not move to a cloud platform anytime soon, we revisited the possibility of integrating DataOps practices. We realized automating most of the deployment process didn’t require a new tool; it could be done with any standard shell script. We chose PowerShell because it is widely used and already available in all project environments. We then adapted the Excel file storing deployment steps into a .csv, simplifying it to include only the information required by the PowerShell script and a record of when each step was executed. The PowerShell script determines the location and action required for each step based on data in the .csv, executes it, and records the time of completion. It also creates an audit log to capture any errors that arise during deployment.
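The actual script is PowerShell, but the control flow is simple enough to sketch in a few lines. Here is a minimal Python sketch of the same idea, assuming a hypothetical .csv with `step`, `script_path`, and `executed_at` columns and a `sqlcmd`-style command for running SQL files; the project's real column names and commands will differ:

```python
import csv
import datetime
import subprocess

def run_deployment(steps_file, audit_log, runner=None):
    """Execute each pending step listed in the .csv, recording completion
    times and writing any errors to an audit log."""
    # Default runner shells out to a SQL CLI (assumed here to be sqlcmd).
    runner = runner or (lambda path: subprocess.run(
        ["sqlcmd", "-i", path], check=True))

    with open(steps_file, newline="") as f:
        steps = list(csv.DictReader(f))

    for step in steps:
        if step.get("executed_at"):          # skip steps already completed
            continue
        try:
            runner(step["script_path"])
            step["executed_at"] = datetime.datetime.now().isoformat()
        except Exception as exc:
            # Record the failing step so its source is immediately apparent,
            # then stop so it can be fixed without rerunning everything.
            with open(audit_log, "a") as log:
                log.write(f"{datetime.datetime.now().isoformat()} "
                          f"step {step['step']} failed: {exc}\n")
            break

    # Write completion times back so a rerun resumes where it left off.
    if steps:
        with open(steps_file, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=steps[0].keys())
            writer.writeheader()
            writer.writerows(steps)
    return steps
```

Because completed steps are stamped in the .csv itself, a failed deployment can be corrected and relaunched, and only the remaining steps will run.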
The team’s database developers are still responsible for recording required steps in the .csv throughout development, moving the required scripts from one environment to another, and launching the PowerShell script. But the bulk of the deployment work, opening and running scripts, is now handled automatically. This has cut the average time to deploy to a new environment from 45 minutes to 5.
Not only is the actual deployment process 90% faster, but this speed enables additional improvements that were previously impractical or impossible. Since the deployment process only takes a few minutes, it makes sense to run a test deployment in the development environment first to identify and resolve any issues. This shortens the actual deployment even further and reduces the risk of issues and potential application downtime. And now that each additional deployment step adds only fractions of a second, dozens of data tests can run as part of every deployment without increasing the runtime. These tests help ensure data integrity and catch potential issues before they reach Production.
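As a sketch of what such data tests might look like (the table names and checks below are hypothetical illustrations, not the project's actual test suite), each test is just a query expected to return a known value, typically zero rows of bad data:

```python
def run_data_tests(query, checks):
    """Run each named test query and return the checks that failed.
    `query` is a callable that executes SQL and returns a single scalar."""
    failures = []
    for name, sql, expected in checks:
        actual = query(sql)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures

# Each check: (test name, query returning one value, expected value).
CHECKS = [
    ("no orphaned orders",
     "SELECT COUNT(*) FROM orders o LEFT JOIN customers c "
     "ON o.customer_id = c.id WHERE c.id IS NULL", 0),
    ("no null invoice dates",
     "SELECT COUNT(*) FROM invoices WHERE invoice_date IS NULL", 0),
]
```

Appending checks like these as rows in the same deployment .csv means every deployment, test run included, verifies the data it just changed.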
One of the most daunting questions when deciding to implement DataOps, or any shift in methodology, is where to start. When the answer is “everywhere,” the scope of required changes can be overwhelming. But if the answer is “anywhere,” you soon realize one change can quickly cascade into several, and what once seemed unachievable is suddenly in progress, without having to replace the entire tech stack.
Technical Lead at Definitive Logic