Building a cloud analytics data warehouse 104: How to orchestrate your ETL in your Cloud Data Warehouse
In previous parts of this series, we've covered what it means to build up a data warehouse, how it can be structured to provide the most value, and what use cases you could potentially cover with such an environment.
An important topic we haven't covered yet is integrating data into your data warehouse or data lake environment and preparing it to the point where it can actually be used for reporting, analysis, and other use cases. This part of our series dives deeper into this topic, covering a number of tools and methods that can help you integrate data into your cloud data warehouse and prepare it once it resides in your environment.
As the main focus of this series is still BigQuery and GCP, we'll concentrate on GCP-native tooling, alongside a few platform-independent alternatives.
Data Integration
The concept of data integration focuses on getting the right data into your data warehouse environment. While it's obviously possible to just take an Excel or Google Sheets file and upload it to your environment, the main idea of data integration is that data is imported periodically in an automated way. This means integrating with APIs, file storage environments, or other databases.
Google Cloud Functions
Google Cloud Functions is, as Google explains it themselves, a 'scalable, pay-as-you-go functions as a service (FaaS) product'. It can be used to create custom-code applications in many languages, including Python (often used for data processing applications).
Google Cloud Functions are meant for relatively short workloads (roughly 10 or 60 minutes at most, depending on how the function is set up and triggered), making them interesting for not-too-heavy ETL jobs. On top of this, costs for Cloud Functions are generally quite low, as you only pay while a job is actually running.
An interesting aspect of Cloud Functions is that they can be triggered by many different types of events. Apart from simple scheduled triggers, they can be fired by e.g. a file being added to a storage bucket or even the update of a BigQuery table (this is done through Google's Eventarc). The biggest drawbacks of Cloud Functions are the lack of overview of jobs that have run and the limited ability to group jobs together into pipelines.
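To make this a bit more concrete, below is a minimal sketch of a 2nd gen Cloud Function that loads a freshly uploaded CSV file into BigQuery, triggered by a Cloud Storage 'finalize' event delivered through Eventarc. The bucket, dataset and table names are placeholders you would replace with your own.

```python
import functions_framework
from google.cloud import bigquery


@functions_framework.cloud_event
def load_new_file(cloud_event):
    """Triggered whenever a new object is finalized in the configured bucket."""
    data = cloud_event.data
    uri = f"gs://{data['bucket']}/{data['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    # 'raw_data.sales_events' is a placeholder destination table
    load_job = client.load_table_from_uri(uri, "raw_data.sales_events", job_config=job_config)
    load_job.result()  # block until the load job has finished
```

BigQuery does the heavy lifting of the actual load here; the function itself only reacts to the event and kicks off the job, which keeps its runtime (and cost) low.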
Google Cloud Dataproc
Cloud Dataproc is the managed Spark service of the Google Cloud Platform. One of its biggest benefits is that the Dataproc environment is highly scalable, which, in combination with the parallel nature of Spark jobs, can be an important performance reason for companies to use it for their ETL pipelines.
While the performance benefits of Cloud Dataproc can be significant, Spark in general has quite a steep learning curve, making it harder for less technical individuals to get into. Costs may also become an issue in Dataproc environments, as clusters can grow quite big when larger amounts of data are being handled.
Luckily, Dataproc also provides a 'serverless' option, where you don't have to worry about scaling a Spark cluster up and down yourself. With this option, you only pay for the compute time actually used by the jobs you've run, which makes it especially interesting for high-performance data pipelines that only run for a relatively short amount of time each day. In a more 'data lake' oriented environment, a Dataproc cluster is also often used for data preparation, as the main internal preparations are often file-based.
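As an illustration, a Dataproc Serverless batch could be as simple as the PySpark sketch below, which deduplicates a set of raw JSON files from a landing bucket and writes the result to BigQuery. The bucket and table names are placeholders, and the sketch assumes the Spark BigQuery connector that Dataproc Serverless ships with.

```python
from pyspark.sql import SparkSession

# Minimal PySpark batch job: clean raw order files and push them to BigQuery.
spark = SparkSession.builder.appName("orders-prep").getOrCreate()

# 'gs://my-landing-bucket/orders/' is a placeholder location
orders = spark.read.json("gs://my-landing-bucket/orders/*.json")

cleaned = (
    orders
    .dropDuplicates(["order_id"])
    .filter(orders.status.isNotNull())
)

(
    cleaned.write
    .format("bigquery")               # Spark BigQuery connector
    .option("writeMethod", "direct")  # write directly, without a temporary GCS bucket
    .mode("overwrite")
    .save("my_project.staging.orders")  # placeholder destination table
)
```

Submitted as a serverless batch, a job like this only incurs costs for the minutes it actually runs, which matches the pay-per-use argument made above.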
Google Cloud Dataflow
Dataflow is, as Google describes it, a service that provides unified stream and batch data processing at scale. It is a managed execution environment for the open source Apache Beam framework, and provides a number of standard ETL templates out of the box. While it can handle batch use cases from file storage or databases very well, its biggest advantage over other tools is its ability to work seamlessly with streaming use cases. For example, it has a number of standard templates that can handle incoming streams from Google Cloud's Pub/Sub messaging system, and it can even provide some analytics functionality over a window of the data if needed. Although always-on streaming pipelines can run up some costs, the platform itself is still reasonably priced.
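To give an idea of what such a streaming pipeline looks like, here is a rough Apache Beam sketch that reads messages from a Pub/Sub subscription, groups them into one-minute windows and appends them to a BigQuery table. The subscription and table names are placeholders, and in practice you would pass the Dataflow runner, project and region as command-line flags when submitting the job.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def run():
    # Runner, project and region are supplied as flags when deploying to Dataflow
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/click-events-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "OneMinuteWindows" >> beam.WindowInto(FixedWindows(60))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.click_events",  # placeholder table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```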
The biggest drawback of Dataflow is that it doesn't handle API connections out of the box, which is something that is needed in most cases when integrating data from external sources. There is always room for custom templates, but this requires a bit more development work.
Google Cloud Composer (Apache Airflow)
Going from one managed open source Apache product to another, let's take a look at Cloud Composer, the managed version of Apache Airflow. In recent years, Airflow has grown to become a standard platform for larger data environments, and it is clear why: the tool provides a very strong foundation for technically oriented people to create jobs that interact with whatever source is necessary. Moreover, the strength of Cloud Composer is that it can group different jobs together in what is called a DAG (Directed Acyclic Graph): a pipeline where one job can depend on the successful termination of another.
Even though there is still a learning curve to consider, the strong documentation and easy-to-use operators for any kind of application (Bash, Python, or even GCP products like Cloud Functions) make it a very solid platform to work with.
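A small, hypothetical DAG to illustrate the concept: two tasks, where the second only starts after the first has finished successfully. The task logic and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_orders():
    # Placeholder for the actual extraction logic (API call, file download, ...)
    print("extracting orders")


with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",  # run every day at 06:00
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = BashOperator(task_id="load_to_bigquery", bash_command="echo 'loading staged files'")

    # 'load' only runs once 'extract' has terminated successfully
    extract >> load
```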
The biggest consideration for actually using the platform is costs. Even a relatively small environment can easily reach a few hundred dollars per month in costs, and it depends on your environment if that is going to be worthwhile.
Managed ETL tooling
After a number of tools that all expect a certain level of technical knowledge, it is important to also cover the alternative: managed ETL tools. There is quite a wide variety of tools with pre-built connectors to a large number of external APIs that tend to be important for e.g. marketing insights, such as Meta, Google Ads and TikTok. Most tools offer upwards of 500 connectors, and it is usually easy to set up a first connection to e.g. your BigQuery platform.
Some alternatives to consider here are Funnel, Adverity, Fivetran and Supermetrics, all of which provide a similar, high-quality service. The biggest considerations in using these tools are, on the one hand, a costs-versus-benefits argument (are the additional costs for such a platform worth the ease of use, or do we prefer a more technical solution) and, on the other hand, how complete the solution is in terms of your connectivity needs. If you mostly need custom connections or file handling, rather than gathering API data from marketing-related platforms, these tools might not be for you.
When to use which?
While some statements have already been made about the (dis)advantages of the tools above, the choice will mostly come down to the amount of technical knowledge available, the size of the data landscape and the pipelines around it, and the actual data that needs to be integrated. If the data landscape is not too big and there is some technical capacity, why not start with Cloud Functions? If there are over 100 tables in your data warehouse that need to be fed with data from all sorts of sources, it might be worth the costs of spinning up a Cloud Composer environment.
If you mainly want to retrieve some API data for your central marketing dashboard and the data engineering team cannot fit you in until October next year, why not try a Funnel subscription?
Data Preparation
Once data has entered your data warehouse environment (BigQuery specifically in this blog), you can potentially use it for any possible use case. The problem with the majority of data sources, though, is that they have not been created to directly suit your use case, be it a dashboard or an integration with another system. This means that some preparation is usually required to turn your data source into something valuable. Again, in this domain, there are a number of tools and methods out there that can assist you in preparing your data.
Scheduled queries and views
For small preparation efforts, you can always resort to a view or a scheduled query. Views look like tables, with a defined schema and all, but do not actually contain any data. Instead, they are essentially a stored SQL query, which gets executed whenever the view is queried by e.g. a Looker or Tableau dashboard. Be careful when using multiple views on larger data sources, though, as query costs can increase significantly (because the underlying query is executed every time the view gets referenced).
Scheduled queries are slightly more elegant in the sense that they can feed an actual prepared table: you write a single query and run it daily, for example, to keep updating your reporting table from the raw source. The biggest drawbacks are that you are limited to one query per schedule and that it is harder to keep an overview of which updates are actually running.
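For reference, scheduled queries can also be created programmatically through the BigQuery Data Transfer Service. The sketch below uses the Python client for that service; the project, dataset, query and schedule are all placeholders.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="reporting",            # placeholder dataset
    display_name="Daily orders report refresh",
    data_source_id="scheduled_query",
    schedule="every 24 hours",
    params={
        "query": (
            "SELECT order_date, SUM(amount) AS revenue "
            "FROM `my-project.raw.orders` GROUP BY order_date"
        ),
        "destination_table_name_template": "orders_daily",
        "write_disposition": "WRITE_TRUNCATE",
    },
)

created = client.create_transfer_config(
    parent=client.common_project_path("my-project"),  # placeholder project
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {created.name}")
```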
Google Cloud Functions
Already mentioned before, but yes, Cloud Functions can also be used for data preparation, for example by directly interacting with the BigQuery API to execute queries. The biggest benefit here is the flexibility to execute entire SQL scripts instead of being limited to a single query. This does mean, however, that you'll have to write this functionality yourself.
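A rough sketch of what that could look like: an HTTP-triggered Cloud Function that sends a multi-statement SQL script to BigQuery and waits for it to finish. The dataset and table names are again placeholders.

```python
import functions_framework
from google.cloud import bigquery

# Multi-statement script: BigQuery executes the statements sequentially in a single job.
PREPARATION_SCRIPT = """
CREATE OR REPLACE TABLE reporting.orders_clean AS
SELECT * FROM raw.orders WHERE status IS NOT NULL;

CREATE OR REPLACE TABLE reporting.orders_daily AS
SELECT order_date, SUM(amount) AS revenue
FROM reporting.orders_clean
GROUP BY order_date;
"""


@functions_framework.http
def prepare_reporting_tables(request):
    client = bigquery.Client()
    job = client.query(PREPARATION_SCRIPT)
    job.result()  # wait until the whole script has finished
    return "reporting tables refreshed", 200
```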
DBT
DBT is, as the company describes itself, an intuitive, collaborative platform that lets you reliably transform data using SQL and Python code. DBT is meant for data transformations, and its biggest benefit is that it handles a lot of validation and documentation for you, which immediately makes your data preparation pipelines more reliable and clearer for other users in your organization. It has both an open source variant that you can host yourself and an online interface that you can use on a subscription basis (the first developer seat is free; after that you pay $100 per month per developer in a team).
For data transformations within BigQuery, the main technical knowledge you need for using it is SQL. One main limitation is that any specific job needs to be a single query, which makes it harder to write longer scripts that handle multiple transformations.
Dataform
Dataform can be seen as the Google Cloud native alternative to DBT. It is primarily aimed at BigQuery, which means that using any other cloud data warehouse eliminates Dataform as a viable option. The tool is also still a bit less mature than DBT, but it is free to use and you only pay for the underlying BigQuery usage, which makes it an interesting tool to try out for your data transformations in Google Cloud Platform.
Cloud composer
Also mentioned before, but since Cloud Composer has a built-in BigQuery operator that lets you execute all kinds of SQL statements in a pipeline, it's good to mention it in this section as well.
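For example, a transformation step inside a Composer DAG could look like the sketch below, using the BigQuery operator from the Google provider package; the SQL and task name are placeholders. Such a task can simply be chained to integration tasks with the >> syntax shown earlier.

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

build_report = BigQueryInsertJobOperator(
    task_id="build_orders_report",
    configuration={
        "query": {
            "query": (
                "CREATE OR REPLACE TABLE reporting.orders_daily AS "
                "SELECT order_date, SUM(amount) AS revenue "
                "FROM raw.orders GROUP BY order_date"
            ),
            "useLegacySql": False,
        }
    },
)
```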
When to use which?
The main questions to ask here are similar to those in the data integration part.
If you have only a few data preparation steps to execute, go for a simpler approach like scheduled queries or views. If your environment is not that big but you need more options to run your queries dynamically, set up something like Cloud Functions.
If your environment is a bit bigger and you want to properly maintain all your data transformations, check out DBT or Dataform.
And finally, if it's important to have both your data integration and preparation in the same pipelines, it might be worth going for Cloud Composer. It is worth mentioning that Cloud Composer can also work alongside DBT and Dataform, with Cloud Composer handling the invocation of jobs within these tools.
Workflow orchestration
Depending on the combination of tools you've chosen from the data integration and preparation steps, you might end up with a lot of individual jobs that are related to, or even dependent on, each other. Most tools provide at least some form of scheduling, but it can be hard to schedule a job to start exactly when another job in a different system finishes. If the job is scheduled too early, a longer upstream run might cause the whole flow to run out of sync, whereas a very generous schedule might result in data not being available when the business needs it. To fix this, there are several tools that can orchestrate various tasks into a single pipeline, a couple of which we'll list below.
Cloud composer
Yes, Cloud Composer for a third time. The fact that orchestration is a native part of Cloud Composer, combined with it being an actual technical environment to create your data integrations and preparations, is what makes the tool so valuable for many end users. The only slight disadvantage is that triggering outside of standard scheduling usually requires a custom script. There is a standard way of organizing these in so-called 'sensors', but unfortunately Cloud Composer cannot natively listen to e.g. GCP's Eventarc.
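As a small illustration of such a sensor, the sketch below holds a DAG until an upstream export has landed in Cloud Storage. The sensor class comes from the Google provider package; the bucket and object names are placeholders.

```python
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

wait_for_export = GCSObjectExistenceSensor(
    task_id="wait_for_crm_export",
    bucket="my-landing-bucket",     # placeholder bucket
    object="crm/daily_export.csv",  # placeholder object
    poke_interval=300,              # re-check every 5 minutes
    timeout=6 * 60 * 60,            # give up after 6 hours
)
```

Downstream tasks chained after this sensor will only start once the file actually exists, which avoids the scheduling guesswork described above.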
Google Workflows
Whereas Cloud Composer can handle both orchestration and execution, Google Workflows is limited to the orchestration part. Workflows merely listens to several types of triggers, including schedules and Eventarc events, and starts a workflow consisting of HTTP calls. This means that Workflows always has to be combined with a separate mechanism that does the actual execution, such as a Cloud Function. The biggest benefit of Workflows is the low cost tied to its usage.
Mage
Mage is a more recent contender in the workflow orchestration field, and is often seen as an easier-to-use alternative to Apache Airflow. There are fewer options for customization than in Apache Airflow, but workflows and pipelines can be created more easily through a drag-and-drop interface.
Mage is, as of yet, only available as open source. This means it is free to use, but you have to do all the hosting and maintenance yourself. You could still host Mage on Google Cloud Platform using e.g. Compute Engine or Cloud Run, in which case you would incur the costs of those services.
When to use which?
Here the main question is how complex and vast your data environment is.
When you want some more ability to manage your pipelines, but it's not worth the costs of a more extensive environment, you might want to try out Google workflows.
If your environment is more extensive to manage but the use cases themselves aren't too complex, and you're willing to manage deployments yourself, maybe consider using Mage.
If the vastness and complexity of your environment needs a proper tool that can handle everything, consider using Cloud Composer.
Up next
After getting into what is needed to integrate and transform data within your data warehouse, we'll use the following part of this series to dive deeper into what is needed to activate your data for e.g. marketing and personalization purposes, and what tooling you can consider here.