If you’re unfamiliar with dbt, I highly recommend reading
this introductory blog post by the CEO of Fishtown Analytics (the creators of dbt).
What is BigQuery ML?
BigQuery is Google’s serverless data warehouse offering, and one of the easiest, most user-friendly data warehousing platforms out there.
Machine learning usually requires extensive knowledge of specialized frameworks and complex infrastructure, but BigQuery ML lets you use the SQL you already know and love to build a fully operational ML pipeline within minutes, on top of the data you already have in BigQuery.
As of January 2020, BigQuery ML supports the following machine learning algorithms:
- Linear Regression: Used for forecasting using a linear combination of variables.
- Logistic Regression: Both binary and multiclass logistic regression for classification tasks.
- K-Means Clustering: An unsupervised learning technique used for data segmentation/clustering; for example, identifying customer segments.
- TensorFlow: Import a TensorFlow model into BigQuery ML.
- Matrix Factorization: A recommendation algorithm that can predict a user’s preference for items they have not interacted with before using the preferences of similar users.
- AutoML Regressor: Creates a regression model using Google’s AutoML Tables service, effectively trying multiple regression algorithms and tuning hyperparameters automatically.
- AutoML Classifier: Creates a classification model using Google’s AutoML Tables service, effectively trying multiple classification algorithms and tuning hyperparameters automatically.
- Boosted Tree Classifier: Uses XGBoost for classification.
- Boosted Tree Regressor: Uses XGBoost for regression.
- DNN Classifier: Creates a Deep Neural Network Classifier model.
- DNN Regressor: Creates a Deep Neural Network Regressor model.
- ARIMA: Builds a univariate ARIMA-based time series model for time series forecasting.
As you can see, you have a great degree of flexibility and can skip building a whole bunch of machine learning infrastructure and hit the ground running!
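All of these models follow the same SQL-first pattern. As a minimal sketch (the dataset, table, and column names below are placeholders, not from a real project):

```sql
-- Training is a single DDL statement: the SELECT defines the features,
-- and the column named `label` is used as the target by default.
CREATE OR REPLACE MODEL `mydataset.my_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  feature_1,
  feature_2,
  target AS label
FROM `mydataset.my_table`;

-- Once trained, getting predictions is just another SELECT:
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.my_model`,
  (SELECT feature_1, feature_2 FROM `mydataset.my_table`));
```

Swapping algorithms is mostly a matter of changing the `model_type` option.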
Manual BigQuery ML Workflow
To understand what dbt is doing under the hood, it helps to see what a vanilla BigQuery ML workflow looks like without dbt. We’re going to use the popular Titanic survivorship dataset, where we want to predict whether a passenger survived based on features such as age, sex, and ticket class.
If you want to follow along with this manual way of working with BigQuery ML, you can perform the steps as I list them. However, later in the post I’m going to go over how to do this with dbt in an example project, so feel free to simply read this section instead!
Step 1 – Upload the Dataset
Download the Titanic dataset from this gist and upload it to BigQuery using the BigQuery console. If you need some help with this, check out the BigQuery docs for the details.
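If you prefer the command line over the console, the `bq` CLI can load the CSV directly. This is a sketch assuming you have the Google Cloud SDK installed and authenticated; the dataset and table names (`titanic.passengers`) are placeholders you should adapt:

```
bq load \
  --autodetect \
  --source_format=CSV \
  titanic.passengers \
  ./titanic.csv
```

The `--autodetect` flag lets BigQuery infer the schema from the CSV header, which is usually good enough for this dataset.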
Step 2 – Create The Model
You can create a simple model by including all the input features in your SELECT and naming your target variable “label”. Here, I’m training only on the first 800 passengers because I’m holding out the remaining 91 for evaluating model performance in Step 4.
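The training query can be sketched along these lines. The dataset and table names, and the exact feature list, are assumptions for illustration rather than the post’s original query:

```sql
-- Logistic regression on the Titanic data; `Survived` becomes the label.
CREATE OR REPLACE MODEL `titanic.survivorship_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  Pclass,
  Sex,
  Age,
  Fare,
  Survived AS label
FROM `titanic.passengers`
WHERE PassengerId <= 800;  -- hold out the remaining 91 rows for evaluation
```

Note that BigQuery ML handles the encoding of categorical columns like `Sex` automatically, so no manual one-hot encoding is needed.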