Thompson Sampling for Multinomial Logit Contextual Bandits

View Researcher's Other Codes

Disclaimer: The provided code links for this paper are external links. Science Nest has no responsibility for the accuracy, legality or content of these links. Also, by downloading this code(s), you agree to comply with the terms of use as set out by the author(s) of the code(s).

Please contact us in case of a broken link from here

Authors Garud Iyengar, Min-Hwan Oh
Journal/Conference Name NeurIPS 2019 12
Paper Category
Paper Abstract We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments that maximizes the expected cumulative revenue, or alternatively, minimize the expected regret. The feedback here is the item that the user picks from the assortment. The distinguishing feature in this work is that this feedback has a multinomial logistic distribution. The utility of each item is a dynamic function of contextual information of both the item and the user. We propose two Thompson sampling algorithms for this multinomial logit contextual bandit. Our first algorithm maintains a posterior distribution of the true parameter and establishes $\tilde{O}(d\sqrt{T})$ Bayesian regret over $T$ rounds with $d$ dimensional context vector. The worst-case computational complexity of this algorithm could be high when the prior distribution is not a conjugate. The second algorithm approximates the posterior by a Gaussian distribution, and uses a new optimistic sampling procedure to address the issues that arise in worst-case regret analysis. This algorithm achieves $\tilde{O}(d^{3/2}\sqrt{T})$ worst-case (frequentist) regret bound. The numerical experiments show that the practical performance of both methods is in line with the theoretical guarantees.
Date of publication 2019
Code Programming Language Unspecified
Comment

Copyright Researcher 2022