A bandit algorithm is used for making decisions when we have multiple actions we can take and imperfect information about the reward we will receive after each action.

  • arm: an option or action we can take
  • bandit: a collection of arms
  • trial (play): each chance to pull an arm
  • exploit: play the arm with the highest estimated value based on previous plays,
    focusing on short-term reward at the risk of missing a better reward in the long term
  • explore: play an arm other than the exploitation choice (not the highest estimated value),
    trying to find a better reward
  • reward: the objective of a play, such as profit, CTR, or CVR
    • should be clearly quantitative
    • more reward is better than less reward
  • anneal: an algorithm anneals if it explores less over time
  • temperature: a parameter that shifts the balance between exploitation and exploration;
    an algorithm anneals by lowering the temperature, which reduces exploration
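As an illustration of temperature, one common way to turn estimated arm values into selection probabilities is a softmax (the arm values below are hypothetical). A high temperature flattens the distribution (more exploration), while a low temperature concentrates it on the best arm (more exploitation):

```python
import math

def softmax_probs(values, temperature):
    """Convert estimated arm values into selection probabilities.
    Higher temperature -> flatter distribution (more exploration);
    lower temperature -> concentrated on the best arm (exploitation)."""
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

values = [1.0, 0.5, 0.2]                        # hypothetical estimated rewards
high = softmax_probs(values, temperature=10.0)  # near-uniform: explore
low = softmax_probs(values, temperature=0.1)    # near-greedy: exploit
```

Lowering the temperature over time is one concrete way an algorithm can anneal.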


A/B testing

In typical A/B testing of a web site, we explore for a certain period with a fixed ratio for each option.
For example:

  1. The site starts with design A
  2. Run A/B testing for 2 weeks (explore)
    1. Allocate 50% of traffic to the old design A
    2. Allocate 25% of traffic to the new design B
    3. Allocate 25% of traffic to the new design C
  3. Collect and summarise 2 weeks of click-through rate (CTR) data
  4. Switch the site to the design with the highest CTR (reward), say design B (exploit)
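The steps above can be sketched as a small simulation. The designs and their true click-through rates here are hypothetical, chosen only to mirror the example:

```python
import random

def ab_test(designs, weights, trials, true_ctr, seed=0):
    """Explore phase: allocate traffic by fixed weights and record clicks,
    then exploit: return the design with the highest observed CTR."""
    rng = random.Random(seed)
    shown = {d: 0 for d in designs}
    clicks = {d: 0 for d in designs}
    for _ in range(trials):
        d = rng.choices(designs, weights=weights)[0]
        shown[d] += 1
        if rng.random() < true_ctr[d]:  # simulate a click
            clicks[d] += 1
    return max(designs, key=lambda d: clicks[d] / shown[d] if shown[d] else 0.0)

# Hypothetical true click-through rates for designs A, B, C
true_ctr = {"A": 0.05, "B": 0.09, "C": 0.06}
winner = ab_test(["A", "B", "C"], [0.5, 0.25, 0.25], 100_000, true_ctr)
```

With enough trials, the observed CTR converges to the true CTR and the best design is identified.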

So this A/B testing can be seen as a special case of a bandit algorithm in which we

  • focus on exploration for a fixed period (very high temperature)
  • keep the ratio of each option (arm) unchanged during the test, even if an arm shows a bad reward
  • once we choose one arm after the test, use it 100% of the time, even though the rewards may change

With a bandit algorithm, we can mix the exploration and exploitation phases and strike a balance between them.
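A minimal sketch of that mixing: instead of separate phases, decide between explore and exploit on every play. The arms and click-through rates below are hypothetical:

```python
import random

def run_bandit(true_ctr, plays, explore_prob=0.1, seed=0):
    """On each play: with probability explore_prob pick a random arm (explore),
    otherwise pick the arm with the highest estimated reward so far (exploit)."""
    rng = random.Random(seed)
    arms = list(true_ctr)
    shown = {a: 0 for a in arms}
    clicks = {a: 0 for a in arms}
    estimate = lambda a: clicks[a] / shown[a] if shown[a] else 0.0
    total_reward = 0
    for _ in range(plays):
        if rng.random() < explore_prob:
            arm = rng.choice(arms)          # explore
        else:
            arm = max(arms, key=estimate)   # exploit
        shown[arm] += 1
        reward = 1 if rng.random() < true_ctr[arm] else 0  # simulate a click
        clicks[arm] += reward
        total_reward += reward
    return max(arms, key=estimate), total_reward

best, reward = run_bandit({"A": 0.05, "B": 0.09, "C": 0.06}, 100_000)
```

Because estimates keep updating, traffic shifts toward the better arm during the run rather than only after a fixed test period.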


Bandit: Epsilon-Greedy