Data analytics involves analyzing raw data to extract valuable insights, which are frequently used to aid future decision-making.
The data being analyzed is usually stored in one or more formats: flat or text files (comma-separated, tab-separated), spreadsheets (Excel, Google Sheets), databases, and so on.
My first experience with Structured Query Language (SQL) was with an early version of MySQL, sometime around the year 2000. At that time, I had to visit the site to download both the installer (I was using Windows) and the documentation. It was from there that I learned about Data Manipulation Language (DML) and Data Definition Language (DDL).
Back then, I had to download and install the software myself, decide which storage engine to run, and set up replication on my own if I wanted it. I had to concern myself with performance tuning, scaling when my load increased, and figuring out what to do if I ran out of storage space.
Over the years, I have worked with other databases, including offerings from Microsoft, Oracle, and IBM.
While many database vendors have moved on to provide managed services, one stood out for me: BigQuery from Google. The Google Cloud Platform as a whole is a fantastic offering, but let's just focus on BigQuery for now.
BigQuery is a serverless offering from Google, which means you do not need to provision or manage it. You don't need to specify how much storage you intend to use, or what sort of processor you want to provision for your database server. Instead, it expands to accommodate your working needs.
It is able to store massive amounts of data. However, it is designed for data analytics workloads, meaning it doesn't expect the data you store to change frequently. If you come from the world of traditional databases, it is designed for Online Analytical Processing (OLAP), not Online Transaction Processing (OLTP).
BigQuery will store thousands of terabytes of data without any preparation on your part, and data retrieval is already fast without you having to do anything. You can help your case further by implementing a partitioning strategy, which will speed up retrieval considerably.
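As a minimal sketch, a partitioned table can be declared when it is created. The dataset, table, and column names below are hypothetical, and the partitioning column is assumed to be a DATE field:

CREATE TABLE my_dataset.orders (
  order_id STRING,
  order_date DATE,
  amount NUMERIC
)
PARTITION BY order_date;

Queries that filter on order_date will then scan only the matching partitions, which reduces both cost and response time.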
You can get started using BigQuery by taking a look at the publicly available datasets that they host. You can find those here.
Importing your own data into BigQuery is a breeze. From the BigQuery console, you can click on your project name and create a dataset. Once you have a dataset, you can proceed to create tables or import data into tables.
Queries require you to specify the project, the dataset, as well as the table name. Here is an example of a query.
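The following sketch queries one of the public datasets; the fully qualified table name specifies the project (bigquery-public-data), the dataset (usa_names), and the table (usa_1910_2013):

SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name
ORDER BY total DESC
LIMIT 10;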
The image above shows the result of running that query.
Holding down Command (or Ctrl on Windows) and clicking on the name of the table inside the query will bring up the table details. You will also have access to the table schema.
In the bar under the query editor, there is a validator that tells you whether or not the query is valid, and how much data the query would process if you were to execute it.
One popular way of importing data into BigQuery is by uploading text files to a Google Cloud Storage bucket. From there, you can point BigQuery at the file and import it into a table.
Once you have your data imported into BigQuery, you may go ahead and run your queries.
Queries are fun. They tell you what has already happened and give you clarity into the past; this is descriptive analytics. How many items did I sell? How often do orders come in? On what days of the week do particular items sell the most?
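As a sketch, that last question could be answered with a simple aggregate query. The project, dataset, table, and column names here are hypothetical:

SELECT
  item,
  FORMAT_DATE('%A', order_date) AS day_of_week,
  COUNT(*) AS items_sold
FROM `my_project.sales.orders`
GROUP BY item, day_of_week
ORDER BY item, items_sold DESC;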
Can queries tell you what will happen in the future? Can you predict sales based on time of day, while taking seasonality into consideration?
We are getting into the domain of predictive analytics, more formally called Machine Learning. I wrote an introductory article on Machine Learning which you can read here.
We need to extract insights from our data which will let us make predictions about previously unseen data, but first, the problem we are trying to solve needs to be framed. Machine Learning itself falls into three broad categories:
supervised learning
unsupervised learning
reinforcement learning.
This article will be concerned with the first two.
If you would like to know whether a problem is suitable for Machine Learning, please check out this article.
Supervised Learning deals with situations in which what you would like to predict is contained in the data that you have already collected. Consider the situation in which you would like to predict the number of bottles of water that a vendor might sell based on the time of day and the date. This is called Regression. The value you are trying to predict is called the label or target, and the pieces of information you use to make the prediction are called the features.
Consider another scenario in which what you are trying to predict is whether or not visitors to an e-commerce site will make a purchase. This is called Classification. This is still supervised learning because you need historical data that shows whether or not a visitor to your site made a purchase.
Unsupervised Learning deals with the situation in which you have historical data but no labels. For example, you have data about customer transactions and you would like to put those transactions into clusters for the purpose of marketing or new product development.
One approach to training Machine Learning models involves minimizing loss. You can get a quick primer on that here. A lot of math and statistics goes into all of this, along with jargon specific to the industry such as optimizers, gradient descent, learning rates, over-fitting and under-fitting, the bias-variance tradeoff, and hyperparameter tuning: a lot of words that the average data analyst might not want to get involved with. There is also the fact that dealing with a programming language is mandatory.
The BigQuery team at Google felt that it would be nice if data analysts could train Machine Learning models without having to deal with most of the math and jargon while using only SQL. This gave rise to BigQuery ML.
BigQuery ML (also BQML) supports training the following types of models:
Linear Regression
Binary Logistic Regression (between two classes)
Multiclass Logistic Regression (belongs to one of a number of classes)
k-means clustering (unsupervised learning for data segmentation).
More models are being worked on, and you can import models trained using TensorFlow for use in BigQuery.
BigQuery ML is surprisingly simple. The first object you need to learn about is the model. Similar to tables and views, models are stored in datasets.
Creating the model is a two-part statement. The first part specifies the model parameters, including the dataset, the model name, and the model type. The second part specifies the data to use for training and includes the label. Every column that is not the label is taken to be a feature.
Consider the following query.
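As a sketch, the statement looks like this; the dataset name (bqml_tutorial), the model name (sample_model), and the particular feature columns are assumptions, while the source is the Google Analytics sample public dataset:

CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, '') AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, '') AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170630';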
The first line defines the model by specifying the dataset and the model name. The second line specifies the model type as Logistic Regression. The subsequent lines provide the training data by fetching rows from a BigQuery public dataset that hosts data from an e-commerce site. Notice that the SELECT has a column called label, which is what we are trying to predict. Also notice that it is set to either 0 or 1, because computers like dealing with numbers. The two new things in the query are the wildcard in the table name and the table suffix range; these are BigQuery features and are easy to learn.
The logistic regression model gives the probability that the data falls into one class, which we represent as 1. If the probability is below 50%, then the data falls into the alternate class, which is represented as 0. That 50% is a threshold, which you can change to suit your needs. You can learn more about it here.
One thing to bear in mind when creating a model is that the label column can’t be NULL.
The create model syntax follows:
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(model_option_list)]
[AS query_statement]
The model_type can be logistic_reg, linear_reg, or kmeans.
If you know a bit about Machine Learning, you may want to specify the following:
l1_reg
l2_reg
max_iterations
learn_rate_strategy
learn_rate
early_stop
data_split_method
data_split_col
ls_init_learn_rate
warm_start.
These parameters are all explained here.
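As a sketch, a handful of these options can be passed in the OPTIONS clause of the CREATE MODEL statement, reusing the assumed model name from the example above. The values below are arbitrary choices for illustration, not recommendations:

CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(
  model_type='logistic_reg',
  l2_reg=0.1,
  max_iterations=10,
  early_stop=true,
  data_split_method='AUTO_SPLIT'
) AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(totals.pageviews, 0) AS pageviews,
  device.isMobile AS is_mobile
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170630';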
When the model is done training, you will get a confirmation, and the model will appear in your dataset.
You might be interested in the training options, which are visible when you scroll down the description page.
You may click on the Training button to open the tab and see something similar to the following:
The information is also available in tabular form. On each iteration, you can see how the loss reduces for both the training data and the evaluation data. You can also see the learning rate that was used for that training iteration.
The evaluation metrics are also available. You can find out more about precision and recall here. You can find out more about ROC here. The confusion matrix tells you the accuracy of the model, which is the percentage of predictions that it got right.
Models need to be evaluated using something called an evaluation dataset. Essentially, you do not use all of your data for training a model. Instead, you would set aside between 20% and 30% of your data for evaluating the quality of your model. The syntax for model evaluation follows:
ML.EVALUATE(MODEL model_name
[, {TABLE table_name | (query_statement)}]
[, STRUCT(<T> AS threshold)])
To evaluate the model, you call the ML.EVALUATE() function. The model name is mandatory. You may optionally pass in an evaluation dataset, as well as a threshold value if you don’t want to use 50%. Here is how we would evaluate the model we created above.
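Continuing the sketch from earlier (the model name and the date range used for the evaluation data are assumptions):

SELECT *
FROM ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IF(totals.transactions IS NULL, 0, 1) AS label,
    IFNULL(device.operatingSystem, '') AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, '') AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'));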
The result of model evaluation looks similar to the following:
When dealing with a binary classification model, you might be interested in something called the Receiver Operating Characteristic curve. The syntax for that follows:
ML.ROC_CURVE(MODEL model_name
[, {TABLE table_name | (query_statement)}]
[, GENERATE_ARRAY(thresholds)])
This is how we would generate the ROC curve for our model.
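A sketch, reusing the same assumed model and evaluation data, with an arbitrary set of thresholds supplied through GENERATE_ARRAY:

SELECT *
FROM ML.ROC_CURVE(MODEL `bqml_tutorial.sample_model`,
  (SELECT
     IF(totals.transactions IS NULL, 0, 1) AS label,
     IFNULL(device.operatingSystem, '') AS os,
     device.isMobile AS is_mobile,
     IFNULL(geoNetwork.country, '') AS country,
     IFNULL(totals.pageviews, 0) AS pageviews
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
   WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'),
  GENERATE_ARRAY(0.0, 1.0, 0.1));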
The result of running the above query looks similar to the following:
Finally, when dealing with classification models, you may be interested in the confusion matrix. This is a cross-tabulation of what was rightly classified, and what was not. For example, with binary classification, we get four cells: true positive, true negative, false positive, and false negative. The syntax for this follows:
ML.CONFUSION_MATRIX(MODEL model_name
[, {TABLE table_name | (query_statement)}]
[, STRUCT(<T> AS threshold)])
This is what the query for our model would look like when generating the confusion matrix:
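A sketch along the same lines, again with assumed names and date range:

SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IF(totals.transactions IS NULL, 0, 1) AS label,
    IFNULL(device.operatingSystem, '') AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, '') AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'));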
The confusion matrix looks similar to the following:
We are finally ready to make use of our model for predictions using ML.PREDICT(). You can find out more about the behavior of this function here. The syntax follows:
ML.PREDICT(MODEL model_name,
{TABLE table_name | (query_statement)}
[, STRUCT<threshold FLOAT64> settings])
The following query shows how we would use our model to make a prediction. The query without a label is used as the input for predictions.
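A sketch of such a prediction query, again with the assumed model name; the input subquery selects only the feature columns and no label column:

SELECT *
FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IFNULL(device.operatingSystem, '') AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, '') AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170801' AND '20170831'));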
The output will contain a predicted_label column, which is the prediction that the model has made. With classification models, you also get a predicted_label_probs column containing the probability for each class.
In conclusion, BQML makes it possible to create Machine Learning models for regression, classification, and clustering. While this article will get you on your way to building models, it won't make you good at it by itself. For that, you will need at least a good understanding of Machine Learning theory.
You need to know what the learning rate is, how gradient descent works, and concepts such as the bias-variance trade-off, regularization, and early stopping, among other mechanics. A good place to learn about these things is Google's Machine Learning Crash Course, which is available here.
The most important thing to learn about is feature engineering. You need an understanding of your data and of what the correlations between features mean. It is feature engineering that ultimately leads to good models.
This article was originally posted on my Medium blog. Please subscribe to stay up to date on my articles.