Assignment 4 — Answers
Answer 1 — Deterministic policy gradient vs. TD backup in continuous control
1. Absence of an explicit action integral in deterministic policy gradients
For a deterministic policy \(a=\mu_\theta(s)\), the Bellman equation for the action-value function is \[ q^\mu(s,a)=\mathbb{E}\left[r(s,a) + \gamma q^\mu \left(s', \mu(s')\right) \middle | S=s, A=a \right]. \]
In stochastic policy gradient methods, the objective typically involves an expectation over actions sampled from a policy distribution, \[ J(\theta)=\mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\bigl[ Q^{\pi_\theta}(s,a) \bigr], \] which leads to gradients that include an explicit expectation (or integral) over the action space.
By contrast, a deterministic policy induces a degenerate action distribution, \[ \pi_\theta(a \mid s) = \delta \left(a - \mu_\theta(s)\right), \] so the action is no longer a random variable. The expectation over actions disappears, and the policy gradient reduces to a chain-rule expression, \[ \nabla_\theta J(\theta)=\mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} \right]. \]
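As a minimal sketch of this chain-rule update (PyTorch-style; the `actor` and `critic` modules and their optimizer are illustrative assumptions, not part of the assignment), the actor step requires no sampling or integration over actions; automatic differentiation composes \(\nabla_\theta \mu_\theta\) with \(\nabla_a Q\) directly:

```python
import torch

def dpg_actor_step(actor, critic, states, actor_optimizer):
    """One deterministic policy gradient step (sketch).

    `actor` maps states to continuous actions, `critic` maps (state, action)
    pairs to scalar values; both are assumed to be torch.nn.Module instances.
    """
    actions = actor(states)                      # a = mu_theta(s), differentiable in theta
    actor_loss = -critic(states, actions).mean() # maximize Q  <=>  minimize -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()                        # chain rule: grad_theta mu * grad_a Q
    actor_optimizer.step()
```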
2. Limitations of TD(0) state-value learning for continuous control
A TD(0) update for a state-value function has the form \[ V(S_t) \leftarrow V(S_t) + \alpha \Bigl( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \Bigr). \]
This update estimates expected return from a state but does not specify which action should be taken. In a continuous action setting, deriving an action from a state-value function would require an additional optimization step, \[ a^{*}(s) \in \arg\max_{a \in \mathcal{A}} \mathbb{E}\left[ r(s,a) + \gamma V(S') \middle| S=s, A=a \right], \]
which is generally nontrivial without an explicit transition model or an action-value function. Consequently, TD(0) value learning alone does not yield a direct mapping from states to continuous control commands.
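A tabular sketch makes the limitation concrete (the dictionary-based value table and the step-size and discount values below are illustrative assumptions): the update improves a value estimate but never emits a control command.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular TD(0) prediction step (sketch).

    V is a dict mapping states to value estimates. The update refines V(s),
    but extracting an action would still require a transition model or an
    action-value function on top of it.
    """
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error  # no action is produced anywhere in this update
```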
Answer 2 — TD critic targets and stability considerations
1. Comparison of TD targets
In DDPG, the critic target is
\[ y_{\text{DDPG}} = r + \gamma Q_{\phi'}\left(s', \mu_{\theta'}(s')\right). \]
For classical TD(0) value prediction, \[ y_{\text{TD(0)}} = r + \gamma V(s'), \] while in Q-learning,
\[ y_{\text{Q-learning}}=r + \gamma \max_{a'} Q(s',a'). \]
The DDPG target is tied to the learned policy through the target actor \(\mu_{\theta'}\), whereas Q-learning uses a maximization over actions that defines an implicit greedy policy. TD(0) prediction does not involve action selection and therefore does not define a control update. Target networks in DDPG slow the evolution of the bootstrap target and reduce instability caused by rapidly changing function approximators when the actor and critic are trained simultaneously.
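The three targets can be contrasted in a short sketch (the networks `V`, `Q`, `target_critic`, `target_actor`, and the discretized `action_grid` are illustrative assumptions; only Q-learning needs an explicit maximization):

```python
import torch

def td0_target(reward, next_state, V, gamma=0.99):
    # Prediction only: no action selection is involved.
    return reward + gamma * V(next_state)

def q_learning_target(reward, next_state, Q, action_grid, gamma=0.99):
    # Greedy control: maximize over a (finite) set of candidate actions.
    q_values = torch.stack([Q(next_state, a) for a in action_grid], dim=-1)
    return reward + gamma * q_values.max(dim=-1).values

def ddpg_target(reward, next_state, target_critic, target_actor, gamma=0.99):
    # Policy-tied bootstrap: the target action comes from the slow target actor.
    with torch.no_grad():
        next_action = target_actor(next_state)
        return reward + gamma * target_critic(next_state, next_action)
```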
2. Stability implications in sim-to-real settings
The maximization operator in Q-learning can introduce overestimation bias under function approximation, because noise in value estimates is amplified by the \(\max\) operation. DDPG avoids explicit maximization, but it still relies on bootstrapping, which can be unstable if the target changes too quickly.
Target networks mitigate this nonstationarity by ensuring that the bootstrap target evolves smoothly over time. In contrast, TD(0) value prediction is often more stable because it avoids both action maximization and coupled actor updates, but it does not produce a control policy.
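A common way to implement this smoothing is Polyak averaging of the target parameters, sketched below for PyTorch modules (the value of \(\tau\) is an illustrative assumption):

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging of target-network parameters (sketch).

    A small tau keeps the bootstrap target slowly moving, damping the
    nonstationarity introduced by training actor and critic simultaneously.
    """
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)
```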
3. Importance of critic stability for real robots
In actor–critic methods, the actor update depends directly on the gradient of the critic with respect to the action, \[ \nabla_\theta J(\theta)= \mathbb{E} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q_\phi(s,a)\big|_{a=\mu_\theta(s)} \right]. \]
If the critic is unstable, these gradients can induce erratic control updates. For TurtleBot3, this may translate into oscillatory motion, collisions, or actuator saturation. Ensuring critic stability is therefore a prerequisite for safe sim-to-real transfer.
Answer 3 — Reward structure, exploration, and conservative behavior
1. Effect of reward shaping on actor updates
In collision-avoidance tasks, rewards penalize collisions and near-collision states and provide sparse positive feedback for goal progress. This shapes the critic so that many actions have strongly negative values near obstacles and relatively flat value landscapes elsewhere.
The actor update follows the gradient of the critic with respect to the action, \[ \Delta \theta \propto \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}. \]
If higher velocities or sharper turns consistently reduce the critic’s value, the learned policy is driven toward slower or more cautious actions.
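A hypothetical reward of this kind might look as follows (all constants, thresholds, and signal names are illustrative assumptions, not the assignment's reward): strongly negative values near obstacles depress the critic there, and the actor is pulled away through \(\nabla_a Q\).

```python
def navigation_reward(min_obstacle_dist, reached_goal, collided,
                      collision_penalty=-100.0, goal_bonus=100.0,
                      safety_margin=0.3, proximity_penalty=-1.0):
    """Hypothetical shaped reward for a collision-avoidance task (sketch)."""
    if collided:
        return collision_penalty
    if reached_goal:
        return goal_bonus
    reward = 0.0
    # Penalize proximity to obstacles; elsewhere the landscape is flat.
    if min_obstacle_dist < safety_margin:
        reward += proximity_penalty * (safety_margin - min_obstacle_dist) / safety_margin
    return reward
```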
2. Conservative behavior under weak exploration
Deterministic policies rely on externally injected noise for exploration, \[ a_t = \mu_\theta(s_t) + \varepsilon_t. \]
If \(\varepsilon_t\) is too small or decays too quickly, the agent explores only a narrow region of the action space. Combined with strong penalties for unsafe behavior, this can lead to convergence toward conservative, low-variance actions that are locally safe but suboptimal for task completion.
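A minimal sketch of such injected noise, assuming a decaying Gaussian schedule (the original DDPG work used Ornstein–Uhlenbeck noise; the constants below are illustrative):

```python
import numpy as np

class DecayingGaussianNoise:
    """Additive exploration noise for a deterministic policy (sketch).

    If sigma is too small or decays too fast, a_t = mu(s_t) + eps_t stays
    close to the deterministic action and exploration collapses early.
    """
    def __init__(self, action_dim, sigma=0.2, sigma_min=0.05, decay=0.999):
        self.action_dim = action_dim
        self.sigma = sigma
        self.sigma_min = sigma_min
        self.decay = decay

    def sample(self):
        eps = np.random.normal(0.0, self.sigma, size=self.action_dim)
        self.sigma = max(self.sigma_min, self.sigma * self.decay)
        return eps
```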
3. Comparison with stochastic policy exploration
In stochastic policies, exploration is intrinsic, since actions are sampled from a distribution \(\pi(a \mid s)\). This randomness persists even without explicit noise injection and can lead to broader exploration patterns.
Deterministic policies decouple action selection from exploration, placing greater responsibility on the design and tuning of the exploration process. This explains why exploration behavior can differ qualitatively between deterministic and stochastic TD-based methods.
4. Implications of the deterministic policy gradient framework
Deterministic policy gradient methods assume that sufficient exploration is provided externally. In real-time navigation, this assumption must be reconciled with safety constraints, since excessive exploration can cause collisions, while insufficient exploration can stall learning. Managing this trade-off is central to practical deployment.
Answer 4 — Demonstrations, critics, and the limits of pure TD learning
1. Incorporating demonstrations into DDPG
Given demonstration data \((s_t, a_t, r_t, s_{t+1})\), the actor can be initialized through behavior cloning, \[ \min_\theta \mathbb{E}\left[ \left\| \mu_\theta(s) - a \right\|^2 \right], \] while the critic can be pretrained by regressing toward bootstrapped targets derived from the demonstrations. This initialization reduces the need for unsafe exploration early in training.
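A behavior-cloning pretraining step might be sketched as follows (PyTorch-style; the `actor`, `optimizer`, and batch tensors are assumptions):

```python
import torch
import torch.nn.functional as F

def behavior_cloning_step(actor, demo_states, demo_actions, optimizer):
    """One behavior-cloning step on demonstration data (sketch).

    Minimizes ||mu_theta(s) - a||^2 over demonstrated (s, a) pairs, giving the
    actor a reasonable starting policy before TD-based fine-tuning.
    """
    predicted = actor(demo_states)
    loss = F.mse_loss(predicted, demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```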
2. Correction of demonstrated actions through TD learning
The critic update \[ \phi \leftarrow \phi + \alpha \bigl( y - Q_\phi(s,a) \bigr)\nabla_\phi Q_\phi(s,a), \qquad y = r + \gamma Q_{\phi'}\left(s', \mu_{\theta'}(s')\right), \] allows the value assigned to demonstrated actions to be adjusted based on long-term outcomes. These corrected values influence the actor through gradients with respect to the action, enabling systematic improvement beyond imitation.
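A corresponding critic step on demonstration or replay transitions could be sketched as follows (the batch layout, networks, and done-mask handling are assumptions):

```python
import torch
import torch.nn.functional as F

def ddpg_critic_step(critic, target_critic, target_actor, batch, optimizer, gamma=0.99):
    """One critic regression step toward the DDPG bootstrap target (sketch).

    `batch` is assumed to contain tensors (s, a, r, s_next, done); transitions
    may come from demonstrations, the replay buffer, or both.
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    optimizer.zero_grad()
    critic_loss.backward()
    optimizer.step()
    return critic_loss.item()
```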
3. Insufficiency of pure TD(0) value learning
Learning only a state-value function with the semi-gradient TD(0) update, \[ \psi \leftarrow \psi + \alpha \bigl( r + \gamma V_\psi(s') - V_\psi(s) \bigr)\nabla_\psi V_\psi(s), \] does not provide guidance on how to adjust continuous actions. Without an explicit actor or an action-value function, there is no direct mechanism for refining control commands, and additional planning or optimization layers would be required.
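To make the missing mechanism explicit, the sketch below recovers an action from \(V\) alone only by adding such an extra optimization layer over a hypothetical one-step model `model(s, a) -> (r, s_next)` and a set of candidate actions; none of this machinery is needed when an actor or critic supplies \(\nabla_a Q\):

```python
def action_from_value_function(V, model, state, action_samples, gamma=0.99):
    """Illustrative action recovery from a state-value function only (sketch).

    `model` is a hypothetical transition/reward model; `action_samples` is an
    assumed set of candidate continuous actions. Candidates are scored by
    r + gamma * V(s') and the best one is returned.
    """
    best_action, best_score = None, float("-inf")
    for a in action_samples:
        r, s_next = model(state, a)           # hypothetical one-step lookahead
        score = float(r + gamma * V(s_next))  # score = r + gamma * V(s')
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```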
4. When demonstrations plus DDPG are advantageous
Combining demonstrations with an actor–critic approach is particularly effective when the action space is continuous, rewards are sparse, and safety constraints limit exploration. In such cases, the critic provides actionable gradients that a pure value-based TD method cannot supply.
Answer 5 — DDPG and VLA generalization
1. Fine-tuning the action head of a VLA model
Let a pretrained perception–language encoder produce features \[ z = f_{\text{enc}}(s), \] and let the action head output \[ a = \mu_\theta(z). \]
With the encoder fixed, deterministic policy gradients update only the action head, \[ \nabla_\theta J(\theta)= \mathbb{E} \left[ \nabla_\theta \mu_\theta \left(f_{\text{enc}}(s)\right) \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(f_{\text{enc}}(s))} \right]. \]
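A sketch of this restricted update (encoder evaluated without gradients, and only the action head's parameters registered with the optimizer; all module names are assumptions):

```python
import torch

def finetune_action_head_step(encoder, action_head, critic, states, optimizer):
    """DPG fine-tuning of the action head with a frozen encoder (sketch).

    Because the encoder runs under no_grad and only the head's parameters are
    in `optimizer`, the gradient grad_theta mu * grad_a Q flows into the
    action head alone.
    """
    with torch.no_grad():
        z = encoder(states)                 # z = f_enc(s), frozen features
    actions = action_head(z)                # a = mu_theta(z)
    loss = -critic(states, actions).mean()  # ascend Q by descending -Q
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```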
2. Advantages of deterministic gradients for calibrated action outputs
When supervised pretraining already yields reasonable actions, reinforcement learning often serves to make small, reward-aligned corrections. Deterministic policy gradients provide low-variance updates and direct control over continuous actions, which is advantageous when precise refinement is required.
3. Correcting systematic biases under distribution shift
Under distribution shift, a pretrained action head may be overly cautious or overly aggressive. The critic, trained via TD learning from sparse or delayed rewards, reshapes the action-value landscape so that gradients with respect to the action encourage adjustments that improve long-horizon performance.
This highlights the complementary roles of supervised pretraining, which matches demonstrated behavior, and reinforcement learning fine-tuning, which optimizes behavior with respect to task-level reward objectives.