What Is Reinforcement Learning?
Reinforcement learning (RL) is the branch of machine learning in which an agent learns by interacting with an environment. Unlike supervised learning (where you have labeled data) or unsupervised learning (where you find patterns), RL learns through trial and error — taking actions, receiving rewards, and adjusting its strategy.
RL powers some of the most impressive AI achievements: AlphaGo, ChatGPT's RLHF training, robotic manipulation, autonomous driving, and game-playing agents that surpass human performance.
Core Concepts
Agent and Environment
The agent is the learner — it observes the environment, takes actions, and receives rewards. The environment is everything the agent interacts with. Think of it like a game: the player (agent) plays in a world (environment) and tries to maximize their score (reward).
States, Actions, and Rewards
- State (s): The current situation — what the agent observes. In a chess game, it's the board position.
- Action (a): What the agent can do. In chess, it's the set of legal moves.
- Reward (r): The feedback signal. In chess, for example: +1 for winning, -1 for losing, 0 for every other move.
- Policy (π): The agent's strategy — a mapping from states to actions.
The Goal
Maximize the cumulative discounted reward: R = r₁ + γr₂ + γ²r₃ + ... where γ (gamma) is the discount factor (typically 0.9-0.99). Because each future reward is multiplied by another factor of γ, the agent values immediate rewards more than distant ones; the closer γ is to 1, the more far-sighted the agent.
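The discounted sum above is easy to compute directly. A minimal sketch (the reward list and γ value here are just illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma once per step into the future."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A reward ten steps away is worth only a fraction of one received now:
print(discounted_return([0.0] * 10 + [1.0], gamma=0.9))  # 0.9**10 ≈ 0.349
```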
Q-Learning: Your First RL Algorithm
Q-learning learns a "quality" value for each state-action pair: Q(s, a) = how good is it to take action a in state s?
The Update Rule
Q(s, a) ← Q(s, a) + α × (r + γ × maxₐ′ Q(s′, a′) − Q(s, a))
Where α is the learning rate, γ is the discount factor, and maxₐ′ Q(s′, a′) is the value of the best action available in the next state s′.
The agent uses a Q-table that stores a value for every state-action pair. Over many episodes — given sufficient exploration and a suitable learning rate — the table converges to the optimal values.
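The update rule above can be sketched in a few lines. This is a minimal illustration, not a full training loop; the states, actions, and reward here are hypothetical:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)            # unseen state-action pairs start at 0.0
actions = ["left", "right"]
q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
print(Q[(0, "right")])            # 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

Run inside an episode loop, this same update is applied at every step, gradually propagating reward information backward through the table.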
Exploration vs. Exploitation
The agent faces a dilemma: should it exploit what it already knows (pick the highest Q-value action) or explore new actions that might lead to better outcomes? The ε-greedy strategy handles this: with probability ε, take a random action; otherwise, take the best known action. Start with high ε (explore a lot) and decrease it over time.
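The ε-greedy rule and a simple decay schedule can be sketched like this (the decay constants are illustrative choices, not prescribed values):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon take a random action; otherwise exploit the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Decay schedule: start fully exploratory, end mostly greedy.
epsilon, eps_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    epsilon = max(eps_min, epsilon * decay)
```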
Deep Q-Networks (DQN)
When the state space is too large for a table (e.g., Atari game pixels), replace the Q-table with a neural network that approximates Q-values. This is how DeepMind's DQN beat human players at Atari games in 2015.
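The core change is that the table lookup Q[(s, a)] becomes a forward pass through a network that outputs one Q-value per action. A minimal numpy sketch, with hypothetical sizes (a 4-dimensional state and 2 actions, as in CartPole) and no training code:

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, HIDDEN, N_ACTIONS = 4, 32, 2   # hypothetical problem sizes
W1 = rng.normal(0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """Forward pass: state -> one Q-value per action, replacing the table lookup."""
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2

state = np.array([0.1, -0.2, 0.05, 0.0])
action = int(np.argmax(q_values(state)))   # greedy action from the network
```

A real DQN adds experience replay and a slowly-updated target network to stabilize training, but the state-in, Q-values-out interface is the same.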
Policy Gradient Methods
Instead of learning Q-values, directly learn the policy — the probability of taking each action in each state. Policy gradients can handle continuous action spaces (like controlling a robot arm) where Q-learning struggles.
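A minimal sketch of this idea: a softmax policy over discrete actions plus one REINFORCE-style update, which nudges the policy toward actions that led to a high return G. The linear parameterization and sizes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

theta = rng.normal(0, 0.1, (3, 4))     # hypothetical: 3 actions, 4-dim state

def sample_action(state):
    """Sample an action from the policy's probability distribution."""
    probs = softmax(theta @ state)
    return rng.choice(len(probs), p=probs), probs

def reinforce_step(state, action, probs, G, lr=0.01):
    """One REINFORCE update: scale the log-prob gradient by the return G."""
    grad_logp = -np.outer(probs, state)    # d log pi(a|s) / d theta for all rows...
    grad_logp[action] += state             # ...plus the chosen action's extra term
    return theta + lr * G * grad_logp
```

Because the policy outputs probabilities (or, for continuous actions, distribution parameters), the same machinery extends to action spaces where enumerating a max over Q-values is impossible.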
RLHF: How ChatGPT Learns
Reinforcement Learning from Human Feedback (RLHF) is how modern LLMs are aligned with human preferences. The process:
- Train a base language model on text data
- Have humans rank model outputs by quality
- Train a reward model on these rankings
- Fine-tune the LLM with RL (typically the PPO algorithm) to maximize the reward model's score
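Step 3 — turning rankings into a reward model — is commonly trained with a pairwise (Bradley-Terry-style) loss: the model should score the human-preferred response above the rejected one. A toy sketch with a hypothetical linear scorer over made-up feature vectors:

```python
import numpy as np

def pairwise_loss(w, x_chosen, x_rejected):
    """-log sigmoid(score(chosen) - score(rejected)), in a numerically stable form."""
    margin = w @ x_chosen - w @ x_rejected
    return np.log1p(np.exp(-margin))

w = np.zeros(3)                                    # toy linear reward model
x_good = np.array([1.0, 0.5, 0.0])                 # features of the preferred response
x_bad = np.array([0.2, 0.1, 0.3])                  # features of the rejected response

# Gradient descent: raise the chosen response's score relative to the rejected one.
for _ in range(100):
    margin = w @ x_good - w @ x_bad
    grad = -(1.0 / (1.0 + np.exp(margin))) * (x_good - x_bad)
    w -= 0.5 * grad

print(pairwise_loss(w, x_good, x_bad) < 0.1)       # loss shrinks as the ranking is learned
```

In step 4, the scalar this reward model assigns to each LLM output plays the role of the environment reward in the RL fine-tuning loop.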
Practice It
Try our Q-Learning Grid World exercise to implement Q-learning from scratch. Then explore our Reinforcement Learning lesson for a comprehensive deep dive with quizzes and examples.
Ready for a structured learning journey? The AI Foundations path covers RL alongside other essential AI concepts. For production applications, check the Production AI Engineer path.