Dynamic resource allocation as an estimation problem
An agent sequentially chooses actions from a set of possible options. Each action yields a stochastic reward, whose distribution is unknown. What dynamic allocation rule should he choose so as to maximize his cumulative reward?
The study of "bandit problems" (the term refers to the paradigmatic situation of a gambler facing a row of slot machines and deciding which one to play in order to maximize his rewards), dating back to the 1930s, was originally motivated by medical trials. It has recently attracted renewed interest because of computer-driven applications, from computer experiments to recommender systems and Big Data.
One possible solution, called UCB, relies on upper confidence bounds for the expected rewards associated with all actions. In this presentation, I shall explain why tight confidence bounds are required in order to obtain efficient allocation rules, and I shall present the statistical challenges involved in their construction.
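To make the idea concrete, here is a minimal sketch of the classical UCB1 index rule on Bernoulli arms: at each round, the agent plays the action whose empirical mean plus an exploration bonus is largest. The function name `ucb1`, the Bernoulli reward model, and all parameter choices are illustrative assumptions, not the specific constructions discussed in the presentation.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run the UCB1 allocation rule for `horizon` rounds on Bernoulli arms.

    `arm_means` holds the success probabilities, which are unknown to the
    agent and used only to simulate rewards. Returns the total reward
    collected and the number of times each arm was played.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k    # pulls per arm
    est = [0.0] * k     # empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: play each arm once
        else:
            # upper confidence bound: empirical mean + exploration bonus
            arm = max(range(k),
                      key=lambda i: est[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        est[arm] += (reward - est[arm]) / counts[arm]  # incremental mean
        total += reward
    return total, counts
```

The bonus term shrinks as an arm is played more often, so the rule naturally balances exploration of poorly sampled arms against exploitation of apparently good ones; sharper confidence bounds yield smaller bonuses and hence less wasted exploration.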