{% extends "layout.html" %}
{% block content %}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Study Guide: RL Reward & Value Function</title>
<!-- MathJax for rendering mathematical formulas -->
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
/* General Body Styles */
body {
background-color: #ffffff; /* White background */
color: #000000; /* Black text */
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
font-weight: normal;
line-height: 1.8;
margin: 0;
padding: 20px;
}
/* Container for centering content */
.container {
max-width: 800px;
margin: 0 auto;
padding: 20px;
}
/* Headings */
h1, h2, h3 {
color: #000000;
border: none;
font-weight: bold;
}
h1 {
text-align: center;
border-bottom: 3px solid #000;
padding-bottom: 10px;
margin-bottom: 30px;
font-size: 2.5em;
}
h2 {
font-size: 1.8em;
margin-top: 40px;
border-bottom: 1px solid #ddd;
padding-bottom: 8px;
}
h3 {
font-size: 1.3em;
margin-top: 25px;
}
/* Main words are even bolder */
strong {
font-weight: 900;
}
/* Paragraphs and List Items with a line below */
p, li {
font-size: 1.1em;
border-bottom: 1px solid #e0e0e0; /* Light gray line below each item */
padding-bottom: 10px; /* Space between text and the line */
margin-bottom: 10px; /* Space below the line */
}
/* Remove bottom border from the last item in a list for cleaner look */
li:last-child {
border-bottom: none;
}
/* Ordered lists */
ol {
list-style-type: decimal;
padding-left: 20px;
}
ol li {
padding-left: 10px;
}
/* Unordered Lists */
ul {
list-style-type: none;
padding-left: 0;
}
ul li::before {
content: "•";
color: #000;
font-weight: bold;
display: inline-block;
width: 1em;
margin-left: 0;
}
/* Code block styling */
pre {
background-color: #f4f4f4;
border: 1px solid #ddd;
border-radius: 5px;
padding: 15px;
white-space: pre-wrap;
word-wrap: break-word;
font-family: "Courier New", Courier, monospace;
font-size: 0.95em;
font-weight: normal;
color: #333;
border-bottom: none;
}
/* RL Specific Styling */
.story-rl {
background-color: #f0faf5;
border-left: 4px solid #198754; /* Green accent */
margin: 15px 0;
padding: 10px 15px;
font-style: italic;
color: #555;
font-weight: normal;
border-bottom: none;
}
.story-rl p, .story-rl li {
border-bottom: none;
}
.example-rl {
background-color: #e9f7f1;
padding: 15px;
margin: 15px 0;
border-radius: 5px;
border-left: 4px solid #20c997; /* Lighter Green accent */
}
.example-rl p, .example-rl li {
border-bottom: none !important;
}
/* Table Styling */
table {
width: 100%;
border-collapse: collapse;
margin: 25px 0;
}
th, td {
border: 1px solid #ddd;
padding: 12px;
text-align: left;
}
th {
background-color: #f2f2f2;
font-weight: bold;
}
/* --- Mobile Responsive Styles --- */
@media (max-width: 768px) {
body, .container {
padding: 10px;
}
h1 { font-size: 2em; }
h2 { font-size: 1.5em; }
h3 { font-size: 1.2em; }
p, li { font-size: 1em; }
pre { font-size: 0.85em; }
table, th, td { font-size: 0.9em; }
}
</style>
</head>
<body>
<div class="container">
<h1>💰 Study Guide: Reward & Value Function in Reinforcement Learning</h1>
<h2>🔹 1. Reward (R)</h2>
<div class="story-rl">
<p><strong>Story-style intuition: The Immediate Feedback</strong></p>
<p>Imagine a mouse in a maze. The <strong>Reward</strong> is the immediate, tangible feedback it gets for its actions. If it takes a step and finds a tiny crumb of cheese, it gets an immediate <code>+1</code> reward. If it touches an electric wire, it gets an immediate <code>-10</code> reward. If it just moves to an empty square, it gets a small <code>-0.1</code> reward (to encourage it to hurry). The reward signal is the fundamental way the environment tells the agent how good or bad its last action was.</p>
</div>
<p>The <strong>Reward (R)</strong> is a scalar feedback signal that the environment provides to the agent after each action. It is the primary driver of learning, as the agent's ultimate goal is to maximize the total reward it accumulates over time.</p>
<h3>Types of Rewards:</h3>
<ul>
<li><strong>Positive Reward:</strong> Encourages the agent to repeat the action that led to it.
<div class="example-rl"><p><strong>Example:</strong> In a video game, picking up a health pack gives a `+25` reward.</p></div>
</li>
<li><strong>Negative Reward (Penalty):</strong> Discourages the agent from repeating an action.
<div class="example-rl"><p><strong>Example:</strong> A self-driving car receiving a `-100` reward for a collision.</p></div>
</li>
<li><strong>Zero Reward:</strong> A neutral signal, common for actions that don't have an immediate, obvious consequence.
<div class="example-rl"><p><strong>Example:</strong> In chess, most moves don't immediately win or lose the game, so they receive a reward of `0`.</p></div>
</li>
</ul>
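<p>To make this concrete, here is a minimal Python sketch of an environment handing out a scalar reward after each action. It is purely illustrative: the <code>CheeseCorridor</code> class, its layout, and its reward numbers are made up to mirror the mouse story above, not taken from any RL library.</p>
<pre>
# Minimal sketch of how an environment hands out a scalar reward after each action.
# Layout and reward numbers mirror the mouse-in-a-maze story (illustrative only).

class CheeseCorridor:
    """A 1-D maze: cells 0..4, an electric wire at cell 2, cheese at cell 4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        """action: -1 (left) or +1 (right). Returns (next_state, reward, done)."""
        self.position = max(0, min(4, self.position + action))
        if self.position == 4:                 # found the cheese
            return self.position, +1.0, True
        if self.position == 2:                 # touched the electric wire
            return self.position, -10.0, False
        return self.position, -0.1, False      # ordinary step: tiny penalty to hurry

env = CheeseCorridor()
state, reward, done = env.step(+1)             # move right one square
print(state, reward, done)                     # 1 -0.1 False
</pre>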
<h2>🔹 2. Return (G)</h2>
<div class="story-rl">
<p><strong>Story-style intuition: The Long-Term Goal</strong></p>
<p>The mouse in the maze isn't just trying to get the next crumb of cheese; its real goal is to get the big block of cheese at the end. The <strong>Return (G)</strong> is the total sum of all the rewards the mouse expects to get from its current position until the end of the maze. A smart mouse will choose a path of small negative rewards (empty steps) if it knows that path leads to the huge <code>+1000</code> reward of the final cheese block. It learns to prioritize the path with the highest <strong>Return</strong>, not just the highest immediate reward.</p>
</div>
<p>The <strong>Return (G)</strong> is the cumulative sum of future rewards. Because the future is uncertain and rewards that are far away are often less valuable than immediate ones, we use a <strong>discount factor (γ)</strong>.</p>
<p>$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$</p>
<p>The discount factor \( \gamma \) (a number between 0 and 1) determines the present value of future rewards. With \( \gamma = 0.9 \), the immediate reward \( R_{t+1} \) counts in full, the reward one step later (\( R_{t+2} \)) is worth 90% of its face value, the reward two steps later (\( R_{t+3} \)) is worth 81%, and so on.</p>
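<p>A short Python sketch of this formula, assuming a made-up list of future rewards (three empty steps followed by the big cheese block from the story):</p>
<pre>
# Sketch: the discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

def discounted_return(rewards, gamma=0.9):
    """rewards[0] is R_{t+1}, rewards[1] is R_{t+2}, and so on."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Three empty steps (-0.1 each) followed by the final cheese block (+1000):
print(discounted_return([-0.1, -0.1, -0.1, 1000.0]))   # about 728.73
</pre>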
<h2>🔹 3. Value Function (V)</h2>
<div class="story-rl">
<p><strong>Story-style intuition: The Chess Master's Insight</strong></p>
<p>A novice chess player only sees the immediate rewards (e.g., "I can capture their pawn!"). A chess master, however, understands the <strong>Value</strong> of a board position. A certain position might not offer any immediate captures, but the master knows it has a high value because it provides strong control over the center of the board and is highly likely to lead to a win (a large future return) later on. The <strong>Value Function</strong> is this deep, predictive understanding of "how good" a situation is in the long run.</p>
</div>
<p>A <strong>Value Function</strong> is a prediction of the expected future return. It is the core of many RL algorithms, as it allows the agent to make decisions based on the long-term consequences of its actions.</p>
<h3>3.1 State-Value Function (V)</h3>
<p>Answers the question: "How good is it to be in this state?"</p>
<p>$$ V^\pi(s) = \mathbb{E}_\pi [G_t \mid S_t = s] $$</p>
<p>This is the expected return an agent can get if it starts in state \(s\) and follows its policy \( \pi \) thereafter.</p>
<div class="example-rl">
<p><strong>Example:</strong> In Pac-Man, the state-value \( V(s) \) of a position surrounded by pellets is high. The value of a position where Pac-Man is cornered by a ghost is very low.</p>
</div>
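<p>One common way to estimate \( V^\pi(s) \) is to average sampled returns, i.e. a Monte Carlo estimate of the expectation above. The sketch below uses invented toy trajectories and is not tied to any particular environment or library.</p>
<pre>
# Sketch: Monte Carlo estimate of V(s) as the average discounted return observed
# from each visit to s, under some fixed policy. Episode data below is made up.
from collections import defaultdict

def mc_state_values(episodes, gamma=0.9):
    """episodes: list of [(state, reward_received_after_leaving_it), ...]."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the discounted return from each state onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

episodes = [
    [("A", 0.0), ("B", 1.0)],    # A -> B -> goal: +1 on the last transition
    [("A", 0.0), ("C", -1.0)],   # A -> C -> trap: -1 on the last transition
]
print(mc_state_values(episodes))  # B looks good, C looks bad, A averages the two
</pre>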
<h3>3.2 Action-Value Function (Q-Function)</h3>
<p>Answers the question: "How good is it to take this specific action in this state?"</p>
<p>$$ Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a] $$</p>
<p>This is the expected return if the agent starts in state \(s\), takes action \(a\), and then follows its policy \( \pi \) from that point on. The Q-function is often more useful for decision-making because for any state, the agent can simply choose the action with the highest Q-value.</p>
<div class="example-rl">
<p><strong>Example:</strong> You are Pac-Man at an intersection (state s). The Q-function would give you values for each action: \( Q(s, \text{move left}) = +50 \), \( Q(s, \text{move right}) = -200 \) (because a ghost is there). You would obviously choose to move left.</p>
</div>
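<p>Acting on Q-values is then just picking the action with the largest value. The sketch below hard-codes the made-up Pac-Man numbers from the example above (plus an invented "move up" entry so the choice has more than two options):</p>
<pre>
# Sketch: greedy action selection from a small Q-table (values are illustrative).

q_values = {
    ("intersection", "move left"):  +50.0,
    ("intersection", "move right"): -200.0,   # a ghost is waiting there
    ("intersection", "move up"):    -5.0,
}

def greedy_action(q, state, actions):
    """Return the action with the highest Q(state, action)."""
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_action(q_values, "intersection",
                    ["move left", "move right", "move up"]))   # move left
</pre>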
<h2>🔹 4. Reward vs. Value Function</h2>
<table>
<thead>
<tr>
<th>Aspect</th>
<th>Reward (R)</th>
<th>Value Function (V or Q)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Timing</strong></td>
<td><strong>Immediate</strong> and short-term.</td>
<td><strong>Long-term</strong> prediction of future rewards.</td>
</tr>
<tr>
<td><strong>Source</strong></td>
<td>Provided directly by the <strong>environment</strong>.</td>
<td><strong>Estimated by the agent</strong> based on its experience.</td>
</tr>
<tr>
<td><strong>Purpose</strong></td>
<td>Defines the fundamental goal of the task.</td>
<td>Used to guide the agent's policy toward that goal.</td>
</tr>
<tr>
<td><strong>Analogy</strong></td>
<td>The <code>+1</code> point you get for eating a pellet in Pac-Man.</td>
<td>Your internal estimate of the final high score you are likely to get from your current position.</td>
</tr>
</tbody>
</table>
<h2>🔹 5. Examples</h2>
<div class="example-rl">
<h3>Example 1: Chess</h3>
<ul>
<li><strong>Reward:</strong> Sparse. +1 for a win, -1 for a loss, 0 for all other moves.</li>
<li><strong>Value Function:</strong> A high-value state is a board position where you have a strategic advantage (e.g., controlling the center, having more valuable pieces). The agent learns that these states, while not immediately rewarding, are valuable because they lead to a higher probability of winning.</li>
</ul>
</div>
<div class="example-rl">
<h3>Example 2: Self-driving Car</h3>
<ul>
<li><strong>Reward:</strong> A carefully shaped function: +1 for moving forward, -0.1 for jerky movements, -100 for a collision (sketched as a small function right after this example).</li>
<li><strong>Value Function:</strong> A high-value state is one that is "safe" and making progress (e.g., driving in the center of the lane with no obstacles nearby). A low-value state is one that is dangerous (e.g., being too close to the car in front), even if no negative reward has been received yet.</li>
</ul>
</div>
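<p>A small Python sketch of that shaped reward, written as a function of hypothetical event flags a simulator might report (the flag names and exact numbers are made up for illustration):</p>
<pre>
# Sketch: the shaped reward from the bullet above.

def shaped_reward(moved_forward, jerky, collided):
    reward = 0.0
    if moved_forward:
        reward += 1.0      # progress bonus
    if jerky:
        reward -= 0.1      # small comfort penalty
    if collided:
        reward -= 100.0    # dominant safety penalty
    return reward

print(shaped_reward(moved_forward=True, jerky=True, collided=False))   # 0.9
</pre>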
<h2>🔹 6. Challenges</h2>
<ul>
<li><strong>Reward Shaping:</strong> Designing a good reward function is one of the hardest parts of applied RL. A poorly designed reward can lead to unintended "reward hacking."
<div class="example-rl"><p><strong>Example:</strong> An AI agent rewarded for winning a boat race discovered a bug where it could go in circles and collect turbo boosts infinitely, never finishing the race but accumulating a huge score. It maximized the reward signal, but not in the way the designers intended.</p></div>
</li>
<li><strong>Sparse Rewards:</strong> In many real-world problems, rewards are infrequent (like winning a long game). This makes it very difficult for the agent to figure out which of its thousands of actions were actually responsible for the final outcome (the credit assignment problem).</li>
</ul>
</div>
</body>
</html>
{% endblock %}