103: Build a Cloud-Based Data Warehouse for Analytics

25 July 2023

15 minutes reading time

Discover the benefits of building a data warehouse for analytics. Explore cloud-based solutions, partitioned and clustered tables, normalization vs. denormalization, analytics-ready data products, and the data mesh approach.

103 Series: Build a Cloud-Based Data Warehouse for Analytics

Building a data warehouse can be daunting, especially when choosing the right technologies and architectures to support your data needs. As organizations seek to harness the power of their data, understanding the purpose of a data warehouse becomes crucial in designing an effective solution. In this introductory article, we aim to demystify the process and provide answers to common questions about building and managing a data warehouse. 

Why invest in building a data warehouse?

There are plenty of good reasons to create and maintain a warehouse solution in your business. Here are our top three:

  • Data Control: Build your data warehouse for complete control over data management, ensuring integrity, security, and accessibility.
  • Scalability & Flexibility: Scale your infrastructure as data grows, accommodating increased volumes and complex queries while adapting to evolving business needs.
  • Advanced Analytics: Centralize data to perform complex queries, data mining, and machine learning, uncovering valuable insights for data-driven decision-making and business intelligence.

Understanding data warehouses

Data warehouses are powerful systems that organize and process large amounts of business data for complex analytical queries, and they are typically built on OLAP (Online Analytical Processing) technology. It's important to distinguish them from Online Transaction Processing (OLTP) systems, which handle real-time transactions and are optimized for quick user requests and frequent writes. In most cases, the choice for OLAP is already made for you, but since the distinction can be confusing, let's spend some time making things clear. In the zoo of OLAP-based data warehouses, you can find creatures such as Google BigQuery, Amazon Redshift, Azure Analysis Services, IBM Cognos, and quite a few more. These systems are all meant for analytical data. This article focuses on OLAP systems, as they serve analytics and machine learning purposes better than traditional transactional databases. Here's why:

  • Retrieval Optimization: OLAP databases are designed to efficiently retrieve large amounts of data for specific fields, thanks to their columnar storage architecture. On the other hand, OLTP systems are optimized for real-time transactions and frequent data writes.
  • Storage Capacity: OLAP systems require larger storage capacities, often in petabytes, to analyze trends and identify patterns. In contrast, OLTP systems have smaller storage requirements.
  • Data Usage: OLTP databases handle real-time transactional data, while OLAP databases focus on historical aggregated data.

It’s worth noting that this article doesn’t cover NoSQL databases, which are more suitable for unstructured data and application runtime data handling. Now that we understand the importance and benefits of data warehousing, let’s explore how cloud-based solutions have revolutionized the landscape. Cloud data warehouses, such as Google BigQuery and Amazon Redshift, have emerged as powerful tools specifically designed for handling analytics workloads. With their optimized architectures, flexible pricing models, and self-management capabilities, these cloud-based solutions offer organizations an efficient and scalable way to unlock the full potential of their data.

Building your data warehouse in the Cloud

In recent years, analytics data warehouses have made significant strides. As our ability to collect data has grown, historical data has become abundant, so it's no surprise that systems like Google BigQuery and Amazon Redshift, which were purpose-built for this kind of workload, have gained prominence in the analytics scene. Other benefits include:

  • These systems are based on architectures optimized for analytics, making them ideal for handling analytical workloads.
  • They offer a low barrier to entry, providing generous free tiers and a pay-as-you-go pricing model.
  • They are self-managed, eliminating the need for your organization to worry about the traditional IT system overhead.

However, you might wonder about compliance issues when storing your analytics data in the cloud. Here are a few points to consider:

  • Cloud-based data warehouse solutions provide fine-grained access control, ensuring the secure handling of datasets at the column or row level (see the sketch after this list).
  • Most data warehouses offer data encryption at rest, adding an extra layer of protection.
  • Conducting analysis doesn’t necessarily require storing personally identifiable information (PII), which can be advantageous when designing datasets meant to be shared across departments.
  • Integration with Virtual Private Cloud (VPC) service perimeters and tools like Google Analytics Hub helps prevent unauthorized data exfiltration from your projects.
  • You have the flexibility to choose the data center where your data is stored, ensuring compliance with geographic data storage requirements.
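To illustrate the first point, here is a minimal sketch of row-level access control in BigQuery, run through the Python client. The table name, policy name, and group address are hypothetical; column-level control works along similar lines via policy tags.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Restrict a hypothetical shared table so that members of the sales group
# only see rows belonging to their own department.
sql = """
    CREATE ROW ACCESS POLICY sales_only
    ON `my-project.analytics.daily_conversions`
    GRANT TO ('group:sales-analysts@example.com')
    FILTER USING (department = 'sales')
"""
client.query(sql).result()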

With these factors in mind, you can be confident that cloud-based projects offer significant benefits in terms of both compliance and operational efficiency.

Designing Your Data Warehouse

Now that we've got a grasp of why data warehouses benefit your organization and which cloud solutions are available, let's dive into how to set up your data warehouse. At the end of the day, your analytics capabilities depend heavily on how you design it. The rest of this article goes over some important topics that can guide you when deciding how to organize your data. This advice is valid whether you are supporting analytics teams with their dashboarding efforts or data science teams with their training and prediction workloads, and it is based on the following two metrics:

  • Bytes scanned. Analytics warehouses spend most of their effort scanning data, so reducing the amount of data scanned is always beneficial, both in terms of cost and in terms of performance (see the dry-run sketch after this list).
  • Computations performed. It helps to know which operations are tricky for the underlying architecture of a data warehouse to perform, so we will reason about a few data structures that can help with that.
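To make the first metric concrete, here is a minimal sketch of estimating how many bytes a query would scan in BigQuery before actually running it, using a dry run. The project, dataset, table, and column names are hypothetical; other warehouses expose similar cost estimates.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT department, COUNT(*) AS conversions
    FROM `my-project.analytics.events`   -- hypothetical table
    WHERE event_date BETWEEN '2023-07-01' AND '2023-07-31'
    GROUP BY department
"""

# A dry run validates the query and estimates its cost without executing it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"This query would scan {job.total_bytes_processed / 1e9:.2f} GB")
```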

Let’s talk tables…

Partitioned vs. Clustered tables

Partitioned tables are a way to divide your data into smaller segments, making it easier for your queries to quickly scan and retrieve the relevant information. For example, if you have a table with a time column, you can partition it based on time intervals. This reduces the number of bytes scanned, especially when you only need to analyze a specific date range. Partitioning can also help with data lifecycle management by automatically deleting partitions after a certain period without new data. Additionally, in systems like BigQuery, you can enforce that queries targeting a table must filter on the partition column, which guarantees that analysts are at least aware of this optimization strategy before they start querying the underlying data.

Clustered tables are organized into storage blocks based on the attached metadata and the values of the clustering columns. This arrangement reduces the amount of data scanned when filtering or grouping by those columns, and since the storage blocks are already sorted, it also improves query performance when ordering by the clustered columns. When choosing clustering fields, focus on those frequently used for filtering or grouping in your queries.

Both partitioned and clustered tables optimize query performance and reduce the amount of data scanned. Partitioning is beneficial when you have time-based data or want to manage data lifecycles easily. Clustering is advantageous when your queries frequently involve filtering or grouping by specific fields, though its benefits are more pronounced with larger tables and a higher number of entries per storage block.
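As an illustration, here is a minimal sketch of creating a day-partitioned, clustered table with the BigQuery Python client. The project, dataset, table, and column names are hypothetical, and the expiration value is only an example.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("department", "STRING"),
    bigquery.SchemaField("brand", "STRING"),
    bigquery.SchemaField("conversions", "INTEGER"),
]

table = bigquery.Table("my-project.analytics.daily_conversions", schema=schema)

# Partition by the date column and drop partitions after roughly two years.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
    expiration_ms=2 * 365 * 24 * 60 * 60 * 1000,
)

# Force queries to filter on the partition column so analysts cannot
# accidentally scan the whole table.
table.require_partition_filter = True

# Co-locate rows with the same dimension values to reduce scanned bytes
# when filtering or grouping on these fields.
table.clustering_fields = ["department", "brand"]

client.create_table(table)
```

If you prefer to keep table definitions in SQL, the same result can be achieved with a CREATE TABLE statement using PARTITION BY and CLUSTER BY clauses.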

Normalized vs. Denormalized data

Normalization is a method used to organize data into multiple tables, reducing redundancy and ensuring data consistency and integrity. Denormalization, on the other hand, involves merging data from multiple tables into a single table, which speeds up data retrieval.

When your data is normalized, your queries often involve joins. This can lead to data being shuffled across different computing units during query execution, resulting in increased complexity and slower query performance. In some cases, it may even exceed the computation-to-scanned-data ratio limit of the BigQuery free plan, especially when using cross joins.

To overcome these limitations, denormalization is a useful approach. Denormalizing your data involves keeping nested and repeated fields within your dataset, eliminating the need for joins. For example, in a normalized data model, the products sold in each transaction would be stored in a separate table linked by a transaction ID. In a denormalized model, a single table would include a repeated (array) field containing the products sold within each transaction. This simplifies query execution and improves performance.
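To make this concrete, here is a minimal sketch of a denormalized transactions table with a repeated products field, again using the BigQuery Python client. The table and field names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# One row per transaction; the products sold are nested inside it,
# so no join against a separate products table is needed.
schema = [
    bigquery.SchemaField("transaction_id", "STRING"),
    bigquery.SchemaField("transaction_date", "DATE"),
    bigquery.SchemaField(
        "products",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("product_id", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]
client.create_table(bigquery.Table("my-project.analytics.transactions", schema=schema))

# Querying the nested field only requires UNNEST, not a join.
sql = """
    SELECT t.transaction_id, p.product_id, p.quantity
    FROM `my-project.analytics.transactions` AS t, UNNEST(t.products) AS p
    WHERE t.transaction_date = '2023-07-01'
"""
rows = client.query(sql).result()
```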

Analytics-ready data products

Analytics-ready tables are a crucial component in data analytics. They are created by transforming raw data from transactional systems like SAP HANA into tables that contain dimensions (such as user, department, and brand) and aggregated metrics (such as conversions). When designing analytics-ready tables, there are important considerations to keep in mind:

  1. Time resolution: Choose a time resolution that aligns with the granularity of insights needed by your teams. Daily resolution strikes a balance between data readiness and dashboard performance. Hourly resolution can provide insights into user behavior or transaction trends within a day. Avoid weekly or monthly aggregations, as those can be performed directly in the dashboard without affecting performance.
  2. Relevant aggregations: Select dimensions that are meaningful in the business context of your questions and insights. Including unnecessary dimensions only clutters your dashboard and increases query complexity. Focus on dimensions related to the specific analysis you’re conducting, such as conversions over departments or brands.
  3. Access to raw data: In large organizations, data needs may change over time. To allow teams to explore and analyze the data further, provide restricted access to the raw data alongside analytics-ready tables. This enables teams to discover additional trends and patterns. While this may require additional storage, the costs are typically outweighed by the benefits gained from analysis.

By considering these factors, you can design analytics-ready tables that meet the needs of your organization while providing flexibility for various use cases.
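As a sketch of what such a data product could look like, the query below materializes a daily, partitioned, and clustered analytics-ready table from a hypothetical raw events table. All project, dataset, table, and column names here are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Aggregate raw events into a daily analytics-ready table with a few
# business-relevant dimensions and one aggregated metric.
sql = """
    CREATE OR REPLACE TABLE `my-project.analytics.daily_conversions`
    PARTITION BY event_date
    CLUSTER BY department, brand
    AS
    SELECT
      DATE(event_timestamp) AS event_date,
      department,
      brand,
      COUNT(*)              AS conversions
    FROM `my-project.raw.events`
    WHERE event_type = 'conversion'
    GROUP BY event_date, department, brand
"""
client.query(sql).result()  # waits for the table to be (re)built
```

In practice, a statement like this would run on a schedule as part of your ETL tooling, which is the topic of the upcoming articles in this series.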

Make it ready for cross-team collaboration

In the previous section, we covered the practical aspects of developing analytical data tables that capture insights and analysis. But what about insights from other departments? Who designs those tables? Do we rely solely on a centralized data team for this? Now we need to start thinking about working in a cross-team environment. Teams often rely on a centralized group of engineers, such as a data platform or business intelligence team, to handle data flows and analysis. However, this can create a bottleneck, as these engineers need to be knowledgeable about various domains within the organization.

The concept of data mesh, introduced by Zhamak Dehghani, offers a solution for achieving data independence within teams. In this approach, teams take responsibility for ingesting, processing, and analyzing their own data, and they can publish data products within a shared data platform. A central data platform team manages the underlying technology of the platform, including query engines, databases, automated policies, and access management, and provides essential services and guidelines to ensure interoperability and compliance.

With the support of these central services and clear guidelines, domain-driven teams can create their own data products within their designated areas of the data platform, such as AWS accounts or Google Cloud projects. The data platform and governance teams oversee interoperability, security, compliance, and more.

Key components of a data mesh architecture

  • Domain ownership: Data is owned and maintained by the organizational domain that is closest to the data and the definitions. 
  • Data products: Organizational domains offer readily consumable data products, ensuring easy accessibility and up-to-date information for consumers. 
  • Federated data governance: A central team oversees the main data architecture, while domains take responsibility for maintaining, updating, and ensuring the accuracy of their data. They adhere to company-wide rules for data storage and processing.

What to look out for when working in a data mesh

Implementing a data mesh comes with its technical challenges: ETL pipelines need adjustment, data product tables must be designed to serve various use cases, and user management should accommodate access divisions between public and in-process data. However, the biggest hurdle lies in the organizational aspect.

Let’s consider an example: your customer care center holds valuable data on customer satisfaction and pain points, but its employees lack the necessary skills to take ownership of this data, as their focus is on operational tasks. Integrating data capabilities into the team is not a quick decision, and knowledge sharing with the central data team may be limited. Additionally, data-oriented employees may not possess the domain-specific knowledge needed for context. Separating business ownership from technical ownership of data can be beneficial here.

However, adopting a data mesh architecture requires careful consideration. Ask yourself:

  • What problem am I solving? Identify the disconnect between data capabilities and domain knowledge in the organization.
  • Is the investment worthwhile? Implementing a data mesh involves significant technical and organizational changes.
  • Is my organization ready? Data should be valued as an asset throughout the company. Without a culture of data appreciation, decentralization may lead to ownership issues rather than increased value.

These factors should be carefully evaluated before adopting a data mesh architecture.

Built-in products to support a data mesh environment

Fortunately, cloud providers offer tools that simplify the technical aspect of implementing a data mesh architecture. For example, the Google Analytics Hub platform, built on BigQuery, enables federated data exchange. The data platform and governance teams can jointly manage the platform, allowing domain projects to create data exchanges with shared datasets. Through data contracts, known as data exchanges in the Analytics Hub, domain-driven teams can collaborate by publishing and subscribing to data. In AWS, the Glue product helps catalog and document data, while Lake Formation assists in governing data sources and enhancing security. These tools are particularly useful for implementing a data mesh within a data lake environment, although they are less focused on structuring the architecture within a data warehouse (Redshift) environment.

Next steps

In the upcoming articles, we’ll explore the tools that can help you manage and run ETL pipelines within your cloud data warehouse environment. We’ll discuss user-friendly options that require no coding experience, as well as more technical tools that offer advanced capabilities. Whether you prefer a simple, intuitive solution or have a deeper technical understanding, we’ve got you covered. Stay tuned to learn about the various options available for orchestrating and executing your ETL pipelines effectively. If you want to leave nothing to chance, our data experts will be glad to organize a one-to-one call and help you set up a valuable data warehouse solution for your organization.

Key Takeaways

Building a data warehouse provides complete control over data management, scalability, and advanced analytics for data-driven decision-making. Here are our five key takeaways from this article:

  • Cloud-based data warehouse solutions like Google BigQuery and Amazon Redshift offer optimized architectures, flexible pricing models, and self-management capabilities.
  • Partitioned tables divide data into segments, improving query performance by reducing the amount of data scanned. Clustered tables organize data into ordered storage blocks, optimizing filtering and grouping operations.
  • Normalization reduces redundancy and ensures data integrity, while denormalization speeds up data retrieval by eliminating the need for joins.
  • Analytics-ready data products should consider time resolution, relevant aggregations, and access to raw data for flexibility and collaboration within the organization.
  • Implementing a data mesh architecture enables domain-driven teams to take ownership of their data, publish data products, and collaborate, but it requires careful consideration of technical and organizational challenges. Cloud providers offer tools like Google Analytics Hub and AWS Glue to support data mesh environments.

 

Contact us

Curious about how we could help you? Please feel free to get in touch. We'd love to hear about your data and how you're using it within your organization.