
Applying Machine Learning for A/B Hypothesis Testing

A/B testing has equipped companies with the ability to make decisions based on statistics and sound prediction rather than on a “feeling”, which can be detrimental to their results. Keep reading to learn about A/B testing and how using machine learning for A/B testing can achieve better results.


What Is A/B Testing?

A/B testing, also known as split testing, refers to a randomized experimentation process wherein two or more versions of a variable (web page, advertisement, etc.) are shown to different segments of users or customers at the same time to determine which version leaves the maximum impact and drives business metrics. There are three types of A/B testing we will take a look at in this blog.

Objective of the Project

An advertising company is running an online ad for a client with the intention of increasing brand awareness. The advertiser company earns money by charging the client based on user engagements with the ad it designed and serves via different platforms. To increase its market competitiveness, the advertising company provides a further service that quantifies the increase in brand awareness as a result of the ads it shows to online users. The main objective of this project is to test if the ads that the advertising company runs resulted in a significant lift in brand awareness.

Problem Formulation

A reliable hypothesis-testing algorithm is needed to determine whether a recent advertising campaign resulted in a significant lift in brand awareness. We have data gathered, via a survey questionnaire, from users who saw either the dummy (control) or the creative (exposed) advertisement. Let’s use regression algorithms to predict the success rates of the creative advertisement for our target variable, brand awareness.

Given Data

We were given collected questionnaire data with a total of 8,077 rows and the columns shown below:

Auction_id, Experiment, date, hour, device_make, platform_os, browser, yes, no.

Out of this data, only 1,243 rows were answered with either yes or no.
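As a minimal sketch of that filtering step (the toy rows below are illustrative, not the project's actual data; the real dataset would be loaded from the questionnaire CSV, whose file name is not given in the article):

```python
import pandas as pd

# Toy rows in the shape described above (a subset of the listed columns).
df = pd.DataFrame({
    "auction_id": [1, 2, 3, 4],
    "experiment": ["exposed", "control", "exposed", "control"],
    "yes": [1, 0, 0, 0],
    "no":  [0, 1, 0, 0],
})

# Keep only rows where the user actually answered the questionnaire
# (ticked either yes or no); in the project this reduced 8,077 rows to 1,243.
answered = df[(df["yes"] == 1) | (df["no"] == 1)]
print(len(answered))  # 2
```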

Tools Used in the Project

MLflow is a framework that plays an essential role in any end-to-end machine learning lifecycle. It helps you track your ML experiments, including your models, model parameters, datasets, and hyperparameters, and reproduce them when needed.

  • It is easy to set up a model tracking mechanism in MLflow.
  • It offers very intuitive APIs for serving.
  • It covers the whole workflow: data collection, data preparation, model training, and taking the model to production.

Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that we are familiar with (Git, CI/CD, etc.).

  • Along with data versioning, DVC also allows model and pipeline tracking.
  • With DVC, you don't need to rebuild previous models or data modeling techniques to achieve the same past state of results.
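A typical DVC workflow sketch (the file names below are illustrative; this is not the project's actual repository layout):

```shell
# Initialize Git and DVC side by side.
git init && dvc init

# Version the raw questionnaire data alongside the code.
dvc add data/ad_campaign_data.csv
git add data/ad_campaign_data.csv.dvc data/.gitignore
git commit -m "Track raw A/B test data with DVC"

# Later, reproduce any past state of data + code together:
git checkout <old-commit> && dvc checkout
```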

Continuous Machine Learning (CML) is a set of tools and practices that brings widely used CI/CD processes to a Machine Learning workflow.


Classical A/B Testing

Classical A/B testing follows these steps:

  • Define the baseline conversion rate and minimum detectable effect (MDE)
  • Calculate the sample size needed for a meaningful experiment using the metrics in step one along with statistical power and significance level.
  • Drive traffic to your variations until you reach the target sample for each variation
  • Finally, evaluate the results of your A/B test.

If the difference in performance between variations reaches or exceeds the MDE, your experiment's hypothesis is supported; otherwise, it’s necessary to start the test from scratch.
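The sample-size calculation in step two can be sketched with the standard two-proportion formula (the baseline and MDE values below are examples, not the project's figures):

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variation(baseline, mde, alpha=0.05, power=0.8):
    """Approximate sample size per variation for a two-proportion test."""
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# e.g. 10% baseline conversion rate, 2-point minimum detectable effect:
print(sample_size_per_variation(0.10, 0.02))  # ~3839 users per variation
```

Note how halving the MDE roughly quadruples the required sample, which is why classical tests on small effects get expensive.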

Limitations and Challenges of Classical Testing


  • Can take lots of time and resources.
  • Only works for specific goals.
  • Offers little insight beyond the specific experiment environment.
  • Can lead to constant testing, since data gathered for one test is rarely reusable afterwards.



Project result of using classical A/B Testing for our data set

While using classical A/B testing on our data, we obtained a significance level higher than the p-value, and the statistical power was too low, which indicates a possibility of a type 2 error. In order to reach the desired significance level, we need a larger dataset.

Sequential A/B Testing

Sequential A/B testing allows experimenters to analyze data while the test is running in order to determine whether an early decision can be made. Sequential sampling works in a very non-traditional way; instead of a fixed sample size, you choose one item (or a few) at a time and then test your hypothesis. We will use the sequential probability ratio test (SPRT) algorithm, which is based on the likelihood ratio statistic, for our dataset.

General steps of conditional SPRT

  1. Calculate critical upper and lower decision boundaries
  2. Perform a cumulative sum of the observations
  3. Calculate the test statistic (likelihood ratio) for each observation
  4. Calculate upper and lower limits for the exposed group
  5. Apply the stopping rule



Advantages and Disadvantages of Sequential A/B testing

Advantages

  • Understanding complex systems - traditional A/B testing cannot handle complex systems with multiple variables.
  • Provides a direction and magnitude for the experiment's effect.
  • Allows better calculation of the feature importance of each variable.
  • Incredibly useful for clustering audiences into different segments (clusters) - audience data can be segmented into similar groups across a range of dimensions and used to perform focused A/B testing on more granular audience groupings.

Disadvantages

  • Overcomplicates A/B testing for a simple dataset (system).

Project result of using sequential A/B Testing for our data set

We used the conditional SPRT algorithm to calculate the critical upper and lower limits, performed a cumulative sum of the observations, and plotted the figure and statistical results below, which clearly show that more samples are needed.

[Figure: conditional SPRT test results]

Machine Learning-Based A/B Testing (Significance Testing)

Using machine learning for A/B testing is not just about calculating the difference between the exposed and control variants and identifying which one wins; rather, it is about identifying which parameter (variable) in the data has the most significance in determining the outcome, hence the name significance testing. To perform this testing, we chose four machine learning models and computed each model's accuracy score and confusion matrix. Below are the results and plots of each model.

Significance testing using Logistic Regression

using sklearn's LogisticRegression Model
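A sketch of the approach on synthetic stand-in data (the real project used the questionnaire columns such as experiment, hour, device_make, platform_os, and browser; the features and labels below are simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: the target depends mostly on the first feature,
# so that feature should come out as the most significant.
n = 1000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.2 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:\n", confusion_matrix(y_test, pred))
# |coefficient| serves as a simple feature-importance proxy:
print("importance:", abs(model.coef_[0]))
```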

feature importance


confusion matrix


Accuracy and Other metrics


Significance testing with Decision Tree

using sklearn's algorithm
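The same exercise with a decision tree, again on simulated stand-in data (the tree's built-in `feature_importances_` replaces the coefficient proxy used for logistic regression):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)  # only the first feature is informative

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
pred = tree.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:\n", confusion_matrix(y_test, pred))
print("importance:", tree.feature_importances_)  # built-in importances
```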

feature importance


confusion matrix


accuracy and other metrics


Significance testing with Random Forest

using sklearn's algorithm

feature importance


confusion matrix


accuracy and other metrics

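The block of figures above (feature importance, confusion matrix, accuracy) corresponds to the random forest model; a sketch of that run, again on simulated stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)  # only the first feature is informative

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
pred = forest.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:\n", confusion_matrix(y_test, pred))
# Importances are averaged over the ensemble's trees:
print("importance:", forest.feature_importances_)
```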

Significance testing with XGBoost

using the xgboost library's scikit-learn-compatible XGBClassifier
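A sketch of the XGBoost run on simulated stand-in data; XGBoost is a separate library with a scikit-learn-compatible API, so the code below falls back to sklearn's GradientBoostingClassifier (a comparable boosted-tree model) if xgboost is not installed:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier  # scikit-learn-compatible wrapper
    model = XGBClassifier(n_estimators=50, max_depth=3)
except ImportError:
    # Fallback: sklearn's own gradient-boosted trees (not XGBoost itself).
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=50, max_depth=3)

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)  # only the first feature is informative

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("importance:", model.feature_importances_)
```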

Feature Importance


confusion matrix


accuracy and other metrics


Conclusion

Through using ML for A/B testing, we have seen that the accuracy of the logistic regression and XGBoost models is almost identical. This could be a consequence of not having enough data. As the confusion matrices show, the models are far from perfect, with very high false negatives, and the small dataset is the likely cause of this problem. Conclusion: more data is needed.




