Introduction

To help personalize messaging and ads, we predict the products that each customer is most likely to buy for the first time. Producing these predictions requires a historical dataset containing information about customers, stores, products, and so on, which we use to train a model of what a customer buys next. This document provides an overview of the training dataset and the methodology we use.

Dataset Creation

Any dataset we create to train a model must reflect what was known about a customer (and store, products, etc.) at a particular point in time. To create such a dataset, we snapshot each customer immediately before they make a purchase, and for each snapshot we calculate the values of various features (see below) for that customer.
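
As a sketch of this snapshot logic (the orders table and its columns below are hypothetical stand-ins, not our actual schema), a window function can compute point-in-time feature values without leaking information from after the purchase:

    -- Sketch only: orders, customer_id, product_id, and order_ts are
    -- hypothetical names, not our actual schema.
    -- Each row is a purchase; features may only use data from strictly
    -- before that purchase, so no future information leaks in.
    SELECT
      customer_id,
      product_id,
      order_ts AS snapshot_ts,
      -- Example feature: how many orders this customer placed before this one
      COUNT(*) OVER (
        PARTITION BY customer_id
        ORDER BY order_ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
      ) AS prior_order_count
    FROM orders;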

We then supplement that intermediate dataset with the same values and features for a randomly selected set of products that each customer did not buy, giving the model negative examples to learn from. We size this negative subset using the probability that an arbitrary customer in the store buys an arbitrary product.
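
A minimal sketch of this negative sampling, assuming a hypothetical candidate_products table of (customer, product) pairs the customer has not purchased; the 0.001 rate is a stand-in for the computed purchase probability:

    -- Sketch only: candidate_products is hypothetical, and 0.001 stands in
    -- for the computed probability that an arbitrary customer buys an
    -- arbitrary product.
    SELECT
      customer_id,
      product_id,
      0 AS label  -- negative example: the customer did not buy this product
    FROM candidate_products
    WHERE RAND() < 0.001;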

We then combine the negative and positive examples and randomly sample 3 million records from the combined dataset to train the model on (3 million was chosen as roughly the largest dataset we can work with without running into resource exhaustion limits on BigQuery).
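
A sketch of this assembly step, assuming the positive and negative examples have been materialized into hypothetical positive_examples and negative_examples tables that already carry a label column:

    -- Sketch only: project, dataset, and table names are hypothetical.
    CREATE OR REPLACE TABLE `project.dataset.training_data` AS
    SELECT *
    FROM (
      SELECT * FROM `project.dataset.positive_examples`
      UNION ALL
      SELECT * FROM `project.dataset.negative_examples`
    )
    ORDER BY RAND()  -- shuffle, then keep a random 3M-row sample
    LIMIT 3000000;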

The model then predicts the probability that a given customer buys any new product they have not yet purchased.

This methodology captures a point-in-time record for each customer, allowing our model to learn which values and features are most associated with future purchases, and it does so efficiently: it does not require us to keep creating arbitrary snapshots of every customer every X days.

Model Training

We use BigQuery ML to train an XGBoost model on the training dataset. Both the dataset and the model are created with SQL sent directly to the BigQuery platform. By constructing our datasets and training the model this way, we offload all of the feature transformation and model scalability work to Google's scalable systems.
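
A sketch of the training and scoring statements, using standard BigQuery ML syntax (BOOSTED_TREE_CLASSIFIER is BigQuery ML's XGBoost-based classifier); the project, dataset, and table names are hypothetical:

    -- Sketch only: project/dataset/table names are hypothetical.
    CREATE OR REPLACE MODEL `project.dataset.next_product_model`
    OPTIONS (
      model_type = 'BOOSTED_TREE_CLASSIFIER',  -- BigQuery ML's XGBoost classifier
      input_label_cols = ['label']             -- 1 = purchased, 0 = sampled negative
    ) AS
    SELECT * FROM `project.dataset.training_data`;

    -- Scoring: probability that each (customer, product) candidate pair
    -- results in a first-time purchase.
    SELECT
      customer_id,
      product_id,
      (SELECT p.prob
       FROM UNNEST(predicted_label_probs) AS p
       WHERE p.label = 1) AS purchase_probability
    FROM ML.PREDICT(
      MODEL `project.dataset.next_product_model`,
      (SELECT * FROM `project.dataset.scoring_candidates`)
    );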

Features

Below is the list of features used to train the model and to make predictions.


Modeling Process

In January 2022, we switched our learning model from XGBoost to logistic regression. The spreadsheet below captures the comparison we used to decide on this change.

Logistic Regression vs XGBoost.xlsx
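
For reference, in BigQuery ML this switch largely amounts to changing the model_type option on the CREATE MODEL statement (names hypothetical, as in the training sketch above):

    -- Sketch only: same hypothetical names as the training sketch above.
    CREATE OR REPLACE MODEL `project.dataset.next_product_model`
    OPTIONS (
      model_type = 'LOGISTIC_REG',  -- previously 'BOOSTED_TREE_CLASSIFIER'
      input_label_cols = ['label']
    ) AS
    SELECT * FROM `project.dataset.training_data`;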