Multi-Armed Bandit Task

The multi-armed bandit task is a classic problem in reinforcement learning and decision theory. It is named after the gambling scenario of slot machines, or "one-armed bandits," each of which has an arm you pull to play. The objective is to maximize your total return over time by strategically choosing which arms to pull, even though you don't know in advance which machines are more likely to give high rewards. This is not just about gambling: it is a fundamental issue in many real-world situations where we must make a series of choices under uncertainty.

Understanding the multi-armed bandit problem helps us develop strategies for making good decisions, balancing exploration (gathering information about uncertain options) with exploitation (choosing the option that currently looks best).

Key Concepts

Before we dive into the task itself, let's go over a few key concepts:

  1. Bandit: In this context, a bandit is a slot machine. Imagine walking into a casino with a row of slot machines, each offering different rewards.
  2. Arm: Each lever on these slot machines is referred to as an "arm."
  3. Reward: This is the payoff or return you get from pulling an arm. In our context, it is often modeled as a probabilistic binary outcome.
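To make these concepts concrete, here is a minimal sketch of a bandit with binary rewards. The arm payout probabilities (`ARM_PROBS`) are illustrative assumptions, not values from any particular problem:

```python
import random

# Hypothetical setup: each arm pays 1 with its own (unknown to the player) probability.
ARM_PROBS = [0.2, 0.5, 0.8]  # assumed payout probabilities for three arms

def pull(arm: int) -> int:
    """Pull an arm and receive a binary reward (1 = win, 0 = loss)."""
    return 1 if random.random() < ARM_PROBS[arm] else 0

# Pulling the best arm many times pays off roughly 80% of the time.
rewards = [pull(2) for _ in range(10_000)]
print(sum(rewards) / len(rewards))  # close to 0.8
```

The challenge of the task, of course, is that the player does not see `ARM_PROBS` and must estimate each arm's value from the rewards alone.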

These concepts lay the foundation for understanding the multi-armed bandit task and its importance in making informed decisions in unknown environments.

Q-learning

Q-learning is a model-free reinforcement learning algorithm that aims to find the optimal policy for an agent by learning the value of taking a given action in a given state, known as the Q-value. It is a form of temporal difference learning where the agent updates its knowledge based on the difference between predicted and actual rewards.

Key Concepts:

  1. Q-Value [Q(s,a)]: Represents the expected cumulative future reward of taking action a in state s and following the optimal policy thereafter.
  2. Learning Rule: The agent updates the Q-values using the following rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

where α is the learning rate, γ is the discount factor, r is the reward observed after taking action a in state s, and s' is the resulting next state.
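The update rule can be sketched directly in code. This is an illustrative implementation under assumed names (`q_update`, a dict-of-dicts `Q` table, example states "s0" and "s1"), not a fixed API:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update to Q(s, a)."""
    best_next = max(Q[s_next].values())        # max_a' Q(s', a')
    td_target = r + gamma * best_next          # r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (td_target - Q[s][a])   # move Q(s, a) toward the target

# Example: a tiny Q-table with two states and two actions, all values at 0.
Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 0.0}}
q_update(Q, "s0", "right", 1.0, "s1")
print(Q["s0"]["right"])  # 0.1, i.e. alpha * (1 + gamma * 0 - 0)
```

With all next-state values at zero, the update moves Q("s0", "right") a fraction α of the way toward the observed reward, which is exactly the "difference between predicted and actual rewards" described above.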

How Q-Learning Works

  1. Initialization: Initialize the Q-values Q(s,a) for all state-action pairs to arbitrary values.