My career as a Data Architect began exactly the way you would expect. I first earned both Bachelor’s and Master’s degrees in History. All joking aside, it’s not often that I see parallels between my near career as a Historian and my actual career as a Data Architect but one similarity is that certain problems need to be revisited from time to time to see if old assumptions still fit with the latest research and techniques. In History, these take the form of archeological finds and linguistic and DNA analysis. In IT, you guessed it, it’s the Cloud that should make you take another look.
In Data Analytics, the most critical architectural dilemma is how best to offload data for reporting and data science related tasks. There are three basic options:
1. Create read replicas of your source OTLP databases
2. Build a classic Kimball style Star Schema in a data warehouse or data mart
3. Offload as much potentially relevant data as possible into a Data Lake
Five years ago, I would have declared creating read replicas a non-starter, and possibly even a red flag about any organization using them. Today, I recommend that you explore that option first. To understand why let’s briefly revisit the pros and cons of these three approaches.
Star Schemas rely on an ETL process (Extract Transform Load) to roll up and aggregate source data into large, denormalized Fact tables and supporting attribute tables based on predetermined reporting and analytics criteria.
This has many advantages when successfully implemented. The data will be cleansed and homogenized and come from several sources. It also makes the subsequent reporting queries as simple as possible for an end-user who is not an expert in SQL and who will more than likely be using a visualization tool such as Qlik or Tableau to perform their work. These queries are also more efficient for the SQL engines that support them.
But designing and building Star Schemas is time-consuming and expensive and requires the maintenance of dozens of separate ETL processes that also have to be synchronized. Further, the databases required to Store Schemas must be configured separately from their source OTLP systems. In the Cloud, this can be done with a few clicks, but frequently Data Warehouse optimized solutions cost twice as much per unit as their OTLP counterparts.
The long-term rewards for a Star Schema are high, but so are the startup costs.
Data Lakes relies on the ETL process’s fraternal twin, the ELT process (Extract Load Transform). Data is extracted and loaded into any number of data stores (traditional relational databases, NoSQL databases, Spark, data bricks, etc.) then cleansed and transformed in place to make it useable. Powerful software engines are then used to render the data as needed.
This also has several advantages. It is turnkey. Loading data is fairly simple (sometimes it’s even drag and drop) and can be used for reporting and analytics almost immediately. It is also MUCH simpler to incorporate data from outside of traditional OTLP systems, such as spreadsheets, Access databases, and even Social Media feeds into the reporting environment. This simplicity in turn promotes data sharing among siloed organizations. The data access layers also tend to play into the strengths of developers who may not necessarily be database specialists reducing per hour labor costs.
However, Data Lakes are still relatively new and finding teams that get the most out of them can be a challenge. It also pushes much of the work of report generation over the reporting and visualization teams. These are by no means insurmountable obstacles, but they do need to be taken into consideration.
As the name implies, Read Replicas are simply copies of source OTLP systems stripped of Write access to maintain the integrity of the data and exposed to reporting users and tools to remove any potential impact on the actual business process.
This approach has two obvious drawbacks. It limits actual queries to one data source and data stored in such database will most like be normalized to ensure the integrity of the data, not to facility efficient queries.]
So why has the Cloud turned this into the first approach that you should try? Simply put, Elasticity!
Properly configured replicas can scale to meet demand. If real-time reporting is needed, CPU cycles, memory, etc. are ramped up during business hours and scaled back at night. If latency is acceptable and data is being cached on a server for later viewing, then the reverse happens. The system is scaled up at night wound down during the day.
In addition, no coding whatsoever is required to implement a read replica. In the cloud, it will most likely be an option that one can select when configuring the actual transaction processing system. More effort is moved over the business analysts join together the data, but they will benefit from being able to leverage existing data dictionaries and data models.
Also, most modern Visualization and Reporting tools allow for the blending of data from multiple sources either when the data is cached or when it is rendered on the front end so reports and dashboards from multiple read replica sources are now a viable option. If nothing else, this approach can be rolled out at the launch of a new system and replaced with a Star Schema or Data Lake later on in the life cycle. In 2021, Read Replicas are definitely the first solution you should try in your Cloud-based analytics environment.
Data Architect at Definitive Logic