A Bayesian Approach to Sequential A/B Testing: Multi-Armed Contextual Bandits in Stan

Bob Carpenter

3 February 2018

Abstract

This case study shows how to perform Bayesian inference for sequential A/B testing with Stan. A/B testing compares two (or more) alternatives along some dimension by gathering data about them and performing statistical inferences. Sequential testing allows the experimenter to decide which alternative to test at each iteration. The goal is to design an experimental policy that minimizes the amount of data needed to perform the required inferences. In the multi-armed bandit setting, each alternative provides a stochastic reward (like a one-armed bandit, aka slot machine) and the goal is to minimize the amount of time exploring alternatives in order to exploit the rewards from the best alternative. The multi-armed bandit problem is a simple example of a sequential decision process and of reinforcement learning. This case study explores the Thompson sampling policy (Thompson 1933), wherein a bandit is chosen at random with probability equal to its posterior probability of providing the highest expected return. The case study begins with simple Bernoulli A/B tests, extends to the sequential bandit setting with Bernoulli bandits, then generalizes to context-dependent rewards modeled with a generalized linear model.

Static A/B Tests

A concrete example will help ground our terminology and show how A/B tests are used in practice.

In all of these cases, the data is exactly the same—number of successes out of number of trials for a number of items. The question is also the same—which item has the highest chance of success?

In the general case, results do not have to be either success (1) or failure (0)—in general the return can be a count value, be continuous, or even be multivariate. Nevertheless, in A/B testing, the question remains the same—which is better?

Bernoulli A/B tests

To keep things simple in the first example, we will suppose that the outcome of a trial is a simple success (return of 1) or failure (return of 0). We will assume that there are \(K > 0\) possible items in the comparison set. For each item \(k \in 1:K\), there will be \(N_k \geq 0\) trials, out of which \(y_k \in 0:N_k\) will be successes. This data specification translates directly to Stan code.

data {
  int<lower = 1> K;
  int<lower = 0> N[K];
  int<lower = 0> y[K];
}

The model has a parameter \(\theta_k \in (0, 1)\) for each item \(k \in 1:K\) representing its chance of success. Our eventual goal is to determine which item is best in the sense of having the highest chance of success. As with the data, the Stan program follows the mathematical definition.

parameters {
  vector<lower = 0, upper = 1>[K] theta;
}

To keep things simple, we will assume uniform priors for \(k \in 1:K\),

\[ \theta_k \sim \mathsf{Uniform}(0, 1). \]

The likelihood is a simple binomial, which is the distribution of the number of successes in a given number of independent trials with a given chance of success. For each \(k \in 1:K\),

\[ y_k \sim \mathsf{Binomial}(N_k, \theta_k). \]

In Stan, uniform priors are the implicit default prior and thus do not need to be coded. The binomial likelihood can be coded using vectorization, so that the entire joint log density is expressed as

model {
  y ~ binomial(N, theta);
}

This vectorized sampling statement is more efficient than the equivalent loop.

  for (k in 1:K)
    y[k] ~ binomial(N[k], theta[k]);

Finally, we want to know which item is the best. The probability that item \(k\) is the best given the data \(y\) is given by

\[ \begin{array}{rcl} \mbox{Pr}\left[ k \mbox{ is best} \right] & = & \displaystyle \mathbb{E}\!\left[ \, \mathrm{I}\!\left[ \theta_k \geq \max \theta \right] \, \right] \\[6pt] & = & \displaystyle \int \mathrm{I}\left[ \theta_k \geq \max \theta \right] \ p(\theta \mid y) \ \mathrm{d}\theta \\[6pt] & \approx & \displaystyle \frac{1}{M} \sum_{m=1}^M \, \mathrm{I}\!\left[ \theta_k^{(m)} \geq \max \theta^{(m)} \right], \end{array} \]

where \(\theta^{(m)}\) are draws from the posterior \(p(\theta | y)\).

Stan’s posterior analysis tools compute posterior expectations for us, so all that remains is to define the indicator variable \(\mathrm{I}\left[ \theta_k^{(m)} \geq \max \theta^{(m)} \right]\). That can be done in the generated quantities block following the mathematical definition.

generated quantities {
  int<lower = 0, upper = 1> is_best[K];
  {
    real best_prob = max(theta);
    for (k in 1:K)
      is_best[k] = (theta[k] >= best_prob);
  }
}
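As a usage sketch, the posterior probability that each item is best can be read off as the posterior mean of is_best using RStan. The file name bernoulli-ab.stan is chosen here for illustration and the data are made up for three items.

library(rstan)
# hypothetical data: successes y out of N trials for K = 3 items
ab_data <- list(K = 3, N = c(100, 100, 100), y = c(22, 31, 27))
# assumes the program above is saved as bernoulli-ab.stan
fit <- stan("bernoulli-ab.stan", data = ab_data, refresh = 0)
# Pr[item k is best] = posterior mean of the indicator is_best[k]
summary(fit, pars = "is_best", probs = c())$summary[ , "mean"]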

Static A/B test designs

The general problem of A/B testing is that of having two (or more) alternatives and some data with which to tell them apart. The data come from a number of trials, such as asking consumers which of several products they would be more likely to buy.

Traditionally, an experiment would be designed based on the sample size needed to detect significant differences in the alternatives being tested.1 This is called a power analysis because it is based on the power of a data set to derive significant results given assumptions about effect sizes and population variation. Then data on the alternatives would be collected, and statistical inference performed.

Because the arms are exchangeable, such a design would amount to choosing how many times to pull each arm.2 Being exchangeable doesn’t mean the arms are identical, just that we have no information with which to distinguish them a priori. Then the experiment would be run and the best result selected.

Multi-armed bandits

The multi-armed bandit problem involves a fixed number of one-armed bandits. Each bandit represents a different option (A, B, etc.) being tested and each provides a different return when played.

The key assumption is that each bandit provides independent and identically distributed (i.i.d.) returns.3 This is the fundamental assumption of multi-armed A/B testing. That is, no matter when or how many times a given bandit is played, the probability of a payout does not change. Each pull of the arm, so to speak, is an independent trial.

Exploration and exploitation

A player must explore the distribution of returns of the bandits and then exploit this knowledge to play the bandit with the best return. How exploration is carried out is determined by a general policy, here defined by a decision rule for which arm to pull next.4 Our decision rules can be non-deterministic. Viewed this way, the multi-armed bandit problem is a form of reinforcement learning.

Sequential designs

Rather than a static policy of choosing arms, a sequential design decides at each step of the experiment which piece of data to collect next.5 In the machine learning literature, sequential design is sometimes called “active learning.” In this case, that amounts to a decision about which arm to pull based on the arms that have been pulled previously and the results.

Regret

Different policies are typically compared based on expected regret, the difference between the expected return of always pulling the optimal arm minus the actual return.

Bernoulli Bandits

In this section, we consider the simplest form of the multi-armed bandit problem. First, we assume there are a total of \(K \geq 1\) bandits. Next, we assume there will be \(N \geq 0\) rounds of play. In each round, \(n \in 1:N\), the player selects a bandit \(z_n \in 1:K\) and receives a subsequent real valued reward \(y_n \in \mathbb{R}\). The basic probabilistic assumption we make is that each bandit provides the same distribution of returns each time it is pulled, independently of any history. That is, the returns for any given bandit are independent and identically distributed (i.i.d.).

A Bernoulli bandit has a fixed probability of returning a unit (1) reward and otherwise returns nothing.6 The Bernoulli distribution is defined for \(y \in \{ 0, 1 \}\) and \(\theta \in [0, 1]\) by \[\mathsf{Bernoulli}(y \mid \theta) = \left\{ \begin{array}{cc} \theta & \mbox{if } y = 1 \\ 1 - \theta & \mbox{if } y = 0 \end{array}\right.\] We have assumed the rewards from a bandit are i.i.d., so given the arm \(z_n\) played in round \(n\), the reward can be summarized as follows.7 The notation \(u \sim \mathsf{Foo}(\phi)\) implies that \(u\) is conditionally independent of other variables given \(\phi\) and has distribution \(\mathsf{Foo}(\phi)\).

\[ y_n \sim \mathsf{Bernoulli}(\theta_{z_n}). \]

Player Policies

A player is defined in terms of a strategic policy for deciding which arm to play based on the behavior of the bandits in previous rounds. To be effective, policies need to balance the exploration of the bandits’ returns and the exploitation of bandits with high expected returns. Mathematically, a policy may be stochastic and is expressed as a distribution \(p(z_{n + 1} \mid y_{1:n}, \, z_{1:n})\) over choice of bandit conditioned on the history of the previous \(n\) rounds.

Types of policies

A Markovian policy only depends on the return of the previous play, \(p(z_{n + 1} \mid y_n, z_n)\).

A memoryless policy does not depend on any previous history and may be expressed as a distribution \(p(z_{n + 1})\).

A deterministic policy has a delta function for a distribution and may be expressed more simply as a function \(z_{n + 1} = f(y_{1:n}, \, z_{1:n})\). A deterministic policy may depend on the history in an arbitrary way, or it may be Markovian or memoryless.

Round robin policy

The simplest policy corresponds to a balanced, non-sequential design, where each arm is played in equal proportion. Playing such a policy for \(N\) rounds is equivalent to a classical balanced design of size \(N\). It corresponds to generating the sequence

\[ z = 1, 2, \ldots, K, \, 1, 2, \ldots, K, \ldots \]

The round robin policy may be defined with the function

\[ z_n = \left( (n - 1) \bmod K \right) + 1. \]
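Anticipating the R policy interface used later in this case study (a function of the history y, z and the number of arms K), the round robin rule might be sketched as follows; the name round_robin_policy is ours.

round_robin_policy <-
  function(y, z, K) (length(z) %% K) + 1

At the start of round \(n\), the history holds \(n - 1\) previous pulls, so length(z) is \(n - 1\) and the formula above is reproduced exactly.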

Uniform random policy

Choosing an arm randomly each round is a memoryless, stochastic policy,

\[ p(z_n \mid y_{1:n-1}, \, z_{1:n-1}) \ = \ \mathsf{Categorical}\left( \textstyle \frac{1}{K}, \ldots, \frac{1}{K} \right). \]

Tit-for-tat policy

Robbins (1952, 1956) analyzed a deterministic, Markovian policy for choosing which arm to pull next. Each arm is pulled until it returns 0, then the next arm in sequence is pulled.8 Our naming is an homage to the tit-for-tat strategy for the iterated prisoner’s dilemma, to which Robbins’ policy bears a passing similarity.

The first pull will be of the first bandit, \(z_1 = 1\). Each subsequent pull plays the same arm as the previous pull if that pull was successful, and otherwise advances one bandit, rolling over to 1 after receiving no return from playing the \(K\)-th bandit.

\[ z_{n+1} = \begin{cases} z_n & \mbox{if } y_n = 1 \\[4pt] z_n + 1 & \mbox{if } y_n = 0 \mbox{ and } z_n < K \\[4pt] 1 & \mbox{if } y_n = 0 \mbox{ and } z_n = K \end{cases} \]

In the limit as \(\theta_k \rightarrow 0\), the tit-for-tat policy approaches the round-robin policy, because nearly every pull fails and the policy advances to the next arm in sequence.
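In the same policy interface, Robbins’ tit-for-tat rule might be sketched as follows; the name tit_for_tat_policy is ours.

tit_for_tat_policy <- function(y, z, K) {
  n <- length(z)
  if (n == 0) return(1)                     # first pull: arm 1
  if (y[n] == 1) z[n] else (z[n] %% K) + 1  # stay on success, advance on failure
}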

Probability matching policy

Thompson (1933) introduced a Bayesian stochastic policy which employs the entire history of returns. In Thompson’s probability matching policy, a bandit’s arm is pulled with probability equal to its probability of being the best bandit conditioned on the entire history of previous turns. Given parameters \(\theta = (\theta_1, \ldots, \theta_K)\), bandit \(k\) will be the best bandit if \(\max(\theta) = \theta_k\). Given a Bayesian model, the posterior probability after \(n\) trials that bandit \(k\) is the best is then defined by

\[ \phi_{n,k} \ = \ \mbox{Pr}\left[ \theta_k = \max \theta \ \big| \ y_{1:n}, \, z_{1:n} \right]. \]

The probability matching policy then selects the next bandit to play, \(z_{n+1}\), according to

\[ z_{n + 1} \sim \mathsf{Categorical}(\phi_n), \]

where \(\phi_n = (\phi_{n,1}, \ldots, \phi_{n,K})\) is a simplex.9 A simplex is a vector of non-negative values summing to one; here \(\sum_{k=1}^K \phi_{n,k} = 1\).

Given the simple Bernoulli sampling distribution for bandit returns and the exchangeability assumption about the bandits, we use independent, exchangeable, and symmetric Beta priors on each bandit’s probability of return,

\[ \theta_k \sim \mathsf{Beta}(\alpha, \alpha) \]

With \(\alpha = 1\), we have uniform priors; with \(\alpha > 1\) the prior begins to concentrate around 0.5. We would expect \(\alpha > 1\) to be a bit slower to converge to the best bandit but to have lower variance. The bandits are almost always assumed to be exchangeable.10 Exchangeable means the probability is invariant under permutation, i.e., \(p(\theta) = p(\pi(\theta))\) for any permutation \(\pi\).

Given a prior, we can formulate the posterior probability that a given bandit is the best as an expectation, which may easily be calculated by Stan by sampling \(\theta^{(1)}, \ldots, \theta^{(M)}\) from the posterior \(p(\theta \, | \, y, z)\),

\[ \begin{array}{rcl} \phi_{n,k} & = & \mbox{Pr}\left[ \theta_k = \max \theta \ \big| \ y_{1:n}, z_{1:n} \right] \\[6pt] & = & \mathbb{E}\!\left[ \, \mathrm{I}\left[ \theta_k = \max \theta \right] \ \big| \ y_{1:n}, \ z_{1:n} \, \right] \\[6pt] & = & \displaystyle \int_{\Theta} \mathrm{I}[\theta_k = \max \theta] \ \ p(\theta \mid y_{1:n}, \, z_{1:n}) \ \mathrm{d} \theta \\[6pt] & \approx & \displaystyle \frac{1}{M} \sum_{m=1}^M \mathrm{I}\!\left[ \theta_k^{(m)} = \max \theta^{(m)} \right], \end{array} \]

where \(\theta^{(1)}, \ldots, \theta^{(M)}\) are posterior draws produced by Stan according to the posterior \(p(\theta \mid z_{1:n}, \, y_{1:n})\). Working out the sampling, \(\phi_{n,k}\) is just the proportion of posterior draws \(\theta^{(m)}\) in which bandit \(k\) had the highest estimated payout probability \(\theta_k^{(m)}\).
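If the posterior draws were pulled into R directly, the same estimate could be sketched as follows, assuming fit holds a Stan fit of the bandit model to the first \(n\) rounds.

theta_draws <- rstan::extract(fit, pars = "theta")$theta  # M x K matrix of draws
row_max <- apply(theta_draws, 1, max)                     # per-draw maximum over arms
phi <- colMeans(theta_draws == row_max)                   # proportion of draws where arm k is best

This is exactly the computation that the generated quantities block in the next section performs inside Stan.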

Bernoulli Bandits in Stan

Coding the model in Stan directly mirrors the mathematical definition above. We are coding the model for the observations up to trial N, so the data block looks as follows.

data {
  int<lower = 1> K;                       // num arms
  int<lower = 0> N;                       // num trials
  int<lower = 1, upper = K> z[N];         // arm on trial n
  int<lower = 0, upper = 1> y[N];         // reward on trial n
}

The parameters consist of a chance of success for each bandit.

parameters {
  vector<lower = 0, upper = 1>[K] theta;  // arm return prob
}

The Stan program uses vectorization for the prior and sampling distribution.

model {
  theta ~ beta(1, 1);                     // uniform
  y ~ bernoulli(theta[z]);                // i.i.d. by arm
}

Vectorizations just distribute through, so the vectorized sampling distribution has the same effect as the following loop.

for (n in 1:N)
  y[n] ~ bernoulli(theta[z[n]]);

Finally, the generated quantities block is used to define the per-draw indicator of which bandit is best, whose posterior mean is the simplex \(\phi_n\). It is declared as a simplex, with the draw index being implicit as usual in Stan’s treatment of random variables. Here, a local block is introduced with braces to allow the local variable best_prob to be defined without being saved.

generated quantities {
  simplex[K] is_best;  // one hot or uniform with ties
  {
    real best_prob = max(theta);
    for (k in 1:K)
      is_best[k] = (theta[k] >= best_prob);
    is_best /= sum(is_best);  // uniform for ties
  }
}

The final subtlety in the definition is that with rounding to floating point, it is conceivable that we get a tie for best; in that case, the probability is shared over all of the options. In the end, the simplex divides 1 by all the tied options, with zero elsewhere, so it’s guaranteed to sum to 1. As usual, these are being treated like indicator variables in order to compute the appropriate expectation (as expressed in the integral above) for \(\phi_{n,k}\).

Sufficient Statistics

This program is going to be very slow as it loops over observations, evaluating log densities for each of them and summing. This can be improved by instead replacing the Bernoulli with binomial sufficient statistics. All we need is the count of the number of turns an arm was pulled and the number of those turns that it returned a 1.

The data, parameters, and generated quantities blocks remain the same as in the Bernoulli formulation. The transformed data block will be used to calculate the sufficient statistics.11 The advantage of computing in the transformed data block is that it is only executed once, just after the data is loaded.

transformed data {
  int<lower = 0> successes[K] = rep_array(0, K);
  int<lower = 0> trials[K] = rep_array(0, K);
  for (n in 1:N) {
    trials[z[n]] += 1;
    successes[z[n]] += y[n];
  }
}

The variables are declared and initialized to zero, then incremented each round using z[n], the index of the bandit played in round n. After defining the transformed data, the model block itself is a vectorized binomial.12 The uniform distribution on the parameter vector theta is left implicit.

model {
  successes ~ binomial(trials, theta);
}

Conjugacy

This model is so simple that the posterior may be computed analytically given the sufficient statistics. This allows us to radically reformulate the model by moving the parameter declarations to the generated quantities block and replacing the model block with random number generation in the generated quantities block.
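Concretely, with a uniform \(\mathsf{Beta}(1, 1)\) prior and the binomial sufficient statistics defined above, beta-binomial conjugacy gives the posterior for each arm in closed form,

\[ \theta_k \mid y, z \ \sim \ \mathsf{Beta}\!\left(1 + \mathrm{successes}_k, \ 1 + \mathrm{trials}_k - \mathrm{successes}_k \right), \]

which is exactly the distribution drawn from by beta_rng below.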

  vector<lower = 0, upper = 1>[K] theta;
  for (k in 1:K)
    theta[k] = beta_rng(1 + successes[k], 1 + trials[k] - successes[k]);

This implementation in Stan produces a standard Monte Carlo sampler.13 Unlike Markov chain Monte Carlo samplers, a standard Monte Carlo sampler draws each element of the sample independently so that the effective sample size equals the number of iterations. As such, it can be run with no warmup and a single chain, which will consist of independent draws.

This implementation is blazingly fast as it only needs to draw \(K\) random Beta variates each iteration and then normalize with \(K - 1\) additions and \(K\) divisions and assignments.
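Putting the pieces together, the complete conjugate program might look like the following sketch, assuming it is saved as bernoulli-bandits-conjugate.stan as in the R driver below.

data {
  int<lower = 1> K;                       // num arms
  int<lower = 0> N;                       // num trials
  int<lower = 1, upper = K> z[N];         // arm on trial n
  int<lower = 0, upper = 1> y[N];         // reward on trial n
}
transformed data {
  int<lower = 0> successes[K] = rep_array(0, K);
  int<lower = 0> trials[K] = rep_array(0, K);
  for (n in 1:N) {
    trials[z[n]] += 1;
    successes[z[n]] += y[n];
  }
}
generated quantities {
  vector<lower = 0, upper = 1>[K] theta;  // conjugate posterior draws
  simplex[K] is_best;                     // one hot or uniform with ties
  for (k in 1:K)
    theta[k] = beta_rng(1 + successes[k], 1 + trials[k] - successes[k]);
  {
    real best_prob = max(theta);
    for (k in 1:K)
      is_best[k] = (theta[k] >= best_prob);
    is_best /= sum(is_best);              // uniform for ties
  }
}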

Probability matching policy

Given the model implemented in Stan, we can write a driver function in R.

model <- stan_model("bernoulli-bandits-conjugate.stan")
K <- 2
theta <- c(0.5, 0.4)
N <- 1000
p_best <- matrix(0, N, K)
r_hat <- matrix(0, N, K)
y <- array(0.0, 0)
z <- array(0.0, 0)
prefix <- function(y, n) array(y, dim = n - 1)
for (n in 1:N) {
  data <- list(K = K, N = n - 1, y = prefix(y, n), z = prefix(z, n))
  fit <- sampling(model, data = data, algorithm = "Fixed_param",
                  warmup = 0, chains = 1, iter = 1000, refresh = 0)
  p_best[n, ] <-
    summary(fit, pars="is_best", probs = c())$summary[ , "mean"]
  r_hat[n, ] <-
    summary(fit, pars="theta", probs = c())$summary[ , "Rhat"]
  z[n] <- sample(K, 1, replace = TRUE, p_best[n, ])
  y[n] <- rbinom(1, 1, theta[z[n]])
}

To make sure we’re converging, here’s a histogram of all the relevant \(\hat{R}\) convergence statistics for the \(\theta\) variates,

Histogram of Rhat values for estimators in the previous bandit policy simulation.

A single chain with no warmup is used here; because the conjugate model is run with the Fixed_param algorithm, there is no adaptation and the draws are independent.

Learning rate of Bernoulli bandit for theta = (0.5, 0.4).

An R Framework for Policy Simulation

We can abstract this approach to simulating policies over bandits and encapsulate it in a general simulation function based on the concept of a policy.

Bandits, histories, and policies

A bandit is just a random number generator—it generates a reward (real number) every time it is pulled. A play history is the sequence of arms that were pulled and the corresponding rewards \((y_{1:n}, z_{1:n})\) for \(n \geq 0\). A policy is a function from histories to the index of the next arm to pull.

Representing bandits

In R, a bandit will be represented as a nullary function that generates a double value representing the next return. A Bernoulli bandit with a 50% chance of winning can be represented by the function

flip_bandit <- function() rbinom(1, 1, 0.5)

To play the bandit, just call it like a function,

flip_bandit(); flip_bandit()
[1] 1
[1] 0

We can go further and write a factory function to produce Bernoulli bandits with a specified chance of success.14 This is an example of a higher-order function which takes a double-valued argument and returns a function. It can be called with bernoulli_bandit_factory(0.3)(). R’s syntax function(x) is essentially a lambda-abstraction, though it can take tuples. The result is a fairly clean but verbose syntax for higher-order functions.

bernoulli_bandit_factory <- function(theta) function() rbinom(1, 1, theta)

The flip bandit we defined previously could be defined with the factory as

flip_bandit <- bernoulli_bandit_factory(0.5)

It is then played as before,

flip_bandit(); flip_bandit()

Representing histories

The sequence of arms pulled z will be represented as an array of integers in 1:K; the sequence of returns y must be an array of double values of the same length. Usually the entries in y are non-negative. Together, the history of arms pulled and subsequent returns will be represented as a list with components y and z.

Representing policies

A policy will be represented by a function from histories and the number of bandits to integers in 1:K representing arm selections.

Random policy. In R, we can represent the policy that randomly selects a bandit as follows.

random_policy <-
  function(y, z, K) sample(1:K, 1)

Balanced policy. A balanced policy that begins on the first bandit then cycles through them may be defined as follows.

cyclic_policy <-
  function(y, z, K) ifelse(length(z) == 0,
                           1,
                           (z[length(z)] %% K) + 1)

Probability matching policy. The probability matching policy can be implemented by refactoring the code above. To keep the code modular, we first break out a function to fit the model.

model_conjugate <- stan_model("bernoulli-bandits-conjugate.stan")
fit_bernoulli_bandit <- function(y, z, K) {
  data <- list(K = K, N = length(y), y = y, z = z)
  sampling(model_conjugate, algorithm = "Fixed_param", data = data,
           warmup = 0, chains = 1, iter=1000, refresh = 0)}

And then a function to compute posterior expectations from the result of fitting.

expectation <- function(fit, param) {
  posterior_summary <- summary(fit, pars = param, probs = c())
  posterior_summary$summary[ , "mean"]
}

Finally, we define the Bayesian probability matching policy, which chooses a bandit with probability equal to the posterior estimate of it being the best bandit given previous data. This strategy was introduced by Thompson (1933) and is commonly known as Thompson sampling. Thompson sampling amounts to fitting the model to existing data (y[n] is the reward for pulling arm z[n] from among K bandits), calculating the posterior probability of each bandit providing the best probability of reward (the posterior expectation of the indicator that it is the best), then sampling a value according to that distribution p_best to return.

thompson_sampling_policy <- function(y, z, K) {
  posterior <- fit_bernoulli_bandit(y, z, K)
  p_best <- expectation(posterior, "is_best")
  sample(K, 1, replace = TRUE, p_best)
}

Simulating policies

Finally, we can put this all together and write a general simulator for any class of bandits or policies.

prefix <- function(y, n) array(y, dim = n)

sim_policy <- function(N, policy, bandits) {
  K <- length(bandits)
  y <- array(0.0, N)
  z <- array(0, N)
  for (n in 1:N) {
    # temp to avoid aliasing
    k <- policy(prefix(y, n - 1), prefix(z, n - 1), K)
    y[n] <- bandits[[k]]()
    z[n] <- k
  }
  data.frame(y = y, z = z)
}

Putting it all together

Our previous simulation can now be pieced together as a particular instance of this framework.
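A sketch of that instantiation, using the same arm probabilities \(\theta = (0.5, 0.4)\) as before; the yz data frame is the play history consumed by the plotting code below.

bandits <- list(bernoulli_bandit_factory(0.5),
                bernoulli_bandit_factory(0.4))
K <- length(bandits)
N <- 1000
yz <- sim_policy(N, thompson_sampling_policy, bandits)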

We can now plot the percentage of pulls assigned to each arm. Without keeping track of the internal calculations of the Thompson sampling policy, we no longer have access to the probability assessments.

library(reshape2)  # provides melt(); may already be loaded elsewhere
counts <- rep(0, K)
theta <- matrix(0, N + 1, K)
theta[1, ] <- rep(1.0 / K, K)
for (n in 1:N) {
  counts[yz$z[n]] <- counts[yz$z[n]] + 1
  theta[n + 1, ] <- counts / n
}
theta_melted <- melt(theta)
df <- data.frame(iteration = theta_melted[ , 1],
                 bandit = factor(theta_melted[ , 2]),
                 theta = theta_melted[, 3])

Proportion of pulls allocated to each Bernoulli bandit for theta = (0.5, 0.4). The proportions over bandits must sum to one, so with two bandits, the lines mirror one another.

Regret

The regret for a sequence of rewards is the amount by which it trails the expected return from playing the optimal policy.15 Thus the regret may be negative if the choices randomly outperform expectations for the optimal policy.

Optimal Policy

Suppose there are \(K\) bandits. Let \(Y_k\) be the reward for a single pull of arm \(k\). Given that the returns are i.i.d. for playing the bandit multiple times, the optimal policy is to always play the arm \(k^*\) with the highest expected return,

\[ k^* \ = \ \mathrm{argmax}_k \ \mathbb{E}[Y_k] \]

where

\[ \mathbb{E}[Y_k] \ = \ \int u \cdot p_{Y_k}(u) \ \mathrm{d}u. \]

The expected return for \(M\) iterations of the optimal policy is thus \(M \times \mathbb{E}\left[ Y_{k^*} \right]\).

Regret

If playing a policy results in returns \(y_1, \ldots, y_M\), the regret is defined as the difference from the expected value of the optimal strategy,16 The regret can be negative here if the sum of the returns is greater than the expected return of the optimal policy.

\[ \begin{array}{rcl} \mathrm{regret}(y) & = & M \times \mathbb{E}\left[ Y_{k^*} \right] - \sum_{m=1}^M y_m \\[8pt] & = & M \times \left( \mathbb{E}\left[ Y_{k^*} \right] - \bar{y} \right), \end{array} \]

where \(\bar{y} = \frac{1}{M} \sum_{m=1}^M y_m\) is the sample mean of \(y\).

In the simulation we have been doing with \(\theta = (0.5, 0.4)\), the best strategy is to always play the first bandit, resulting in an expected return of \(0.5\) per round.17 If \(U \sim \mathsf{Bernoulli}(\theta)\), then \(\mathbb{E}[U] = \theta\).
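For example, the regret of the simulated history yz above might be computed as in the following sketch, using \(\mathbb{E}[Y_{k^*}] = 0.5\).

M <- nrow(yz)                  # number of rounds played
regret <- M * 0.5 - sum(yz$y)  # expected optimal return minus actual return

Always playing the second bandit instead would incur an expected regret of \(M \times (0.5 - 0.4) = 0.1 M\), or about 100 over 1000 rounds.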

Expedience and power

Studies of Bayesian bandits often include plots of the cumulative regret over turns (e.g., Scott 2010; Chapelle and Li 2011). The interest is in finding policies that quickly converge to optimal behavior, incurring a small total regret over a finite horizon of simulated turns. Adaptive policies such as probability matching are motivated in real-world A/B testing because they minimize the number of trials required to reach a given conclusion.18 In the language of classical statistics, a sequential design based on probability matching has more “power” than a balanced design to distinguish the best bandit after a given number of trials. An adaptive sequential design reaches the same conclusion faster, thus allowing fewer resources to be consumed by A/B testing before exploiting the results.

References

   Agrawal, Shipra and Navin Goyal. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. Proceedings of the 25th Annual Conference on Learning Theory (COLT).

   Chapelle, Olivier and Lihong Li. 2011. An empirical evaluation of Thompson sampling. Neural Information Processing Systems 24 (NIPS).

   Robbins, Herbert. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society. 58:527–535.

   Robbins, Herbert. 1956. A sequential decision problem with a finite memory. Proceedings of the National Academy of Sciences. 42:920–923.

   Scott, Steve L. 2010. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26(6):639–658.

   Thompson, William R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.

Appendix: Objection to the term “bandits”

Andrew Gelman sent along the following comments. In the spirit of his blog, I repeat those comments here with some commentary.

First, each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

Licenses

  Code © 2017–2018, Trustees of Columbia University in New York, licensed under BSD-3.
Text © 2017–2018, Bob Carpenter, licensed under CC-BY-NC 4.0.