Overview
Bandit algorithms are used to make decisions when we have multiple actions to choose from and imperfect information about the reward each action will yield.
- arm: an option or action we can take
- bandit: a collection of arms
- trial (play): each chance to pull an arm
- exploit: play the arm with the highest estimated value based on previous plays, focusing on short-term reward at the risk of missing a better reward in the long term
- explore: play an arm other than the one with the highest estimated value, trying to find a better reward
- reward: the quantity a play is measured by, such as profit, CTR, or CVR
  - should be clearly quantitative
  - more reward is better than less reward
- anneal: an algorithm anneals if it explores less over time
- temperature: a parameter that balances exploitation and exploration; the algorithm anneals as the temperature decreases and exploration shrinks
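The temperature idea above can be sketched with softmax (Boltzmann) arm selection. The function names and the decay schedule below are assumptions for illustration, not something specified in these notes:

```python
import math
import random

def softmax_select(estimates, temperature):
    """Pick an arm index. A high temperature gives a near-uniform choice
    (more exploration); a low temperature concentrates probability on the
    arm with the highest estimated value (more exploitation)."""
    weights = [math.exp(v / temperature) for v in estimates]
    total = sum(weights)
    r, acc = random.random(), 0.0
    for i, w in enumerate(weights):
        acc += w / total
        if r < acc:
            return i
    return len(weights) - 1  # guard against floating-point rounding

def annealed_temperature(trial, start=1.0, rate=0.01):
    """One possible annealing schedule (an assumption, not from the notes):
    temperature decays as trials accumulate, so exploration shrinks."""
    return start / (1.0 + rate * trial)
```

With `temperature=0.05` and estimates `[0.1, 0.9]`, almost all plays go to the second arm; with `temperature=10.0` the arms are played nearly uniformly.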
A/B testing
In typical A/B testing of a web site, we explore for a certain period with a fixed traffic ratio for each option.
For example:
- The site currently uses design A
- Run an A/B test for 2 weeks (explore)
- Allocate 50% of traffic to the old design A
- Allocate 25% of traffic to the new design B
- Allocate 25% of traffic to the new design C
- Collect and summarise the click-through rate (CTR) over the 2 weeks
- Switch the site to the design with the highest CTR (reward), here design B (exploit)
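The fixed-ratio test above can be sketched as follows. The 50/25/25 split comes from the example; the function and variable names are hypothetical:

```python
import random

# Traffic split from the example: 50% A, 25% B, 25% C.
ALLOCATION = {"A": 0.50, "B": 0.25, "C": 0.25}

clicks = {d: 0 for d in ALLOCATION}
views = {d: 0 for d in ALLOCATION}

def assign_design(allocation=ALLOCATION):
    """Assign a visitor to a design with fixed probabilities, regardless
    of how each design is performing so far (pure exploration)."""
    r, acc = random.random(), 0.0
    for design, share in allocation.items():
        acc += share
        if r < acc:
            return design
    return design  # guard against floating-point rounding

def record(design, clicked):
    """Log one page view and whether it produced a click."""
    views[design] += 1
    clicks[design] += int(clicked)

def ctr(design):
    """Click-through rate: clicks divided by views."""
    return clicks[design] / views[design] if views[design] else 0.0
```

After the 2-week window, the site would simply switch 100% of traffic to `max(ALLOCATION, key=ctr)` and stop exploring, which is exactly the limitation discussed next.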
So this A/B testing can be seen as a special case of a bandit algorithm in which:
- we focus on exploration for a certain period (an extremely high temperature)
- the traffic ratio of each option (arm) is not changed during the test, even if an option shows a poor reward
- once we choose one arm after the test, we use it 100% of the time, even if its reward changes later
With a bandit algorithm, we can instead mix the exploration and exploitation phases and strike a balance between them.
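One common way to mix the two phases is epsilon-greedy selection. It is not named in these notes, so treat this as a sketch of one such balancing scheme, with assumed names:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """With probability epsilon, explore a random arm; otherwise exploit
    the arm with the highest estimated reward. Unlike fixed-ratio A/B
    testing, this keeps exploring (a little) forever while mostly
    exploiting the current best arm."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update(estimates, counts, arm, reward):
    """Incrementally update the running-average reward estimate
    for the arm that was just played."""
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

Setting `epsilon` high behaves like the pure-exploration A/B phase; setting it to zero behaves like the pure-exploitation phase after the test; values in between give the balance the notes describe.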