Overview
Bandit algorithms are used to make decisions when we have multiple actions to choose from and imperfect information about the reward each action will yield.
- arm: an option or action we can take
- bandit: a collection of arms
- trial (play): each chance to pull an arm
- exploit: play the arm with the highest estimated value based on previous plays, focusing on short-term reward at the risk of missing a better reward in the long term
- explore: play an arm other than the one with the highest estimated value, trying to find a better reward
- reward: the quantity a play is measured by, such as profit, CTR, or CVR
  - should be clearly quantitative
  - more reward is better than less reward
- anneal: an algorithm anneals if it explores less over time
- temperature: a parameter that balances exploitation and exploration; the algorithm anneals as the temperature decreases and exploration shrinks
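The temperature idea above can be sketched with softmax (Boltzmann) arm selection. The function names and the decay schedule below are assumptions for illustration, not something specified in these notes:

```python
import math
import random

def softmax_select(estimates, temperature):
    """Pick an arm index. A high temperature gives a near-uniform choice
    (more exploration); a low temperature concentrates probability on the
    arm with the highest estimated value (more exploitation)."""
    weights = [math.exp(v / temperature) for v in estimates]
    total = sum(weights)
    r, acc = random.random(), 0.0
    for i, w in enumerate(weights):
        acc += w / total
        if r < acc:
            return i
    return len(weights) - 1  # guard against floating-point rounding

def annealed_temperature(trial, start=1.0, rate=0.01):
    """One possible annealing schedule (an assumption, not from the notes):
    temperature decays as trials accumulate, so exploration shrinks."""
    return start / (1.0 + rate * trial)
```

With `temperature=0.05` and estimates `[0.1, 0.9]`, almost all plays go to the second arm; with `temperature=10.0` the arms are played nearly uniformly.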
A/B testing
In typical A/B testing of a web site, we explore for a certain period with a fixed traffic ratio for each option.
For example:
- The site currently uses design A
- Run an A/B test for 2 weeks (explore)
- Allocate 50% of traffic to the old design A
- Allocate 25% of traffic to the new design B
- Allocate 25% of traffic to the new design C
- Collect and summarise the click-through rate (CTR) over the 2 weeks
- Switch the site to the design with the highest CTR (reward), here design B (exploit)
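The fixed-ratio test above can be sketched as follows. The 50/25/25 split comes from the example; the function and variable names are hypothetical:

```python
import random

# Traffic split from the example: 50% A, 25% B, 25% C.
ALLOCATION = {"A": 0.50, "B": 0.25, "C": 0.25}

clicks = {d: 0 for d in ALLOCATION}
views = {d: 0 for d in ALLOCATION}

def assign_design(allocation=ALLOCATION):
    """Assign a visitor to a design with fixed probabilities, regardless
    of how each design is performing so far (pure exploration)."""
    r, acc = random.random(), 0.0
    for design, share in allocation.items():
        acc += share
        if r < acc:
            return design
    return design  # guard against floating-point rounding

def record(design, clicked):
    """Log one page view and whether it produced a click."""
    views[design] += 1
    clicks[design] += int(clicked)

def ctr(design):
    """Click-through rate: clicks divided by views."""
    return clicks[design] / views[design] if views[design] else 0.0
```

After the 2-week window, the site would simply switch 100% of traffic to `max(ALLOCATION, key=ctr)` and stop exploring, which is exactly the limitation discussed next.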
So this A/B testing can be seen as a special case of a bandit algorithm in which:
- we focus on exploration for a certain period (an extremely high temperature)
- the traffic ratio of each option (arm) is not changed during the test, even if an option shows a poor reward
- once we choose one arm after the test, we use it 100% of the time, even if its reward changes later
With a bandit algorithm, we can instead mix the exploration and exploitation phases and strike a balance between them.
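One common way to mix the two phases is epsilon-greedy selection. It is not named in these notes, so treat this as a sketch of one such balancing scheme, with assumed names:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """With probability epsilon, explore a random arm; otherwise exploit
    the arm with the highest estimated reward. Unlike fixed-ratio A/B
    testing, this keeps exploring (a little) forever while mostly
    exploiting the current best arm."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update(estimates, counts, arm, reward):
    """Incrementally update the running-average reward estimate
    for the arm that was just played."""
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

Setting `epsilon` high behaves like the pure-exploration A/B phase; setting it to zero behaves like the pure-exploitation phase after the test; values in between give the balance the notes describe.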