Assignment 4

1. Overview and Reading List

This assignment connects deterministic policy gradient methods (in particular DDPG), temporal-difference (TD) learning, and recent work on vision–language–action (VLA) models and sim-to-real navigation with TurtleBot3.

Required readings:

  1. D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic Policy Gradient Algorithms,” in Proceedings of the 31st International Conference on Machine Learning (ICML), PMLR, vol. 32, no. 1, pp. 387–395, June 2014.

  2. H. Niu, Z. Ji, F. Arvin, B. Lennox, H. Yin, and J. Carrasco, “Accelerated Sim-to-Real Deep Reinforcement Learning: Learning Collision Avoidance from Human Player,” arXiv:2102.11312, 2021.

  3. J. Liu et al., “What Can RL Bring to VLA Generalization? An Empirical Study,” arXiv [cs.LG], 2025.

  4. The TurtleBot3 DRL navigation repository:

    • https://github.com/tomasvr/turtlebot3_drlnav

You are expected to be familiar with the basic actor–critic and TD learning framework as presented, for example, in Sutton and Barto’s Reinforcement Learning: An Introduction.

Throughout, we consider an infinite-horizon discounted Markov decision process (MDP) with state space \(\mathcal{S}\), action space \(\mathcal{A}\), discount factor \(\gamma \in [0,1)\), transition kernel \(p(s' \mid s,a)\), and reward \(r(s,a)\).

The state-value function of a policy \(\pi\) is defined as

\[ v_\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \;\middle|\; S_0 = s \right], \]

and the action-value function as

\[ q_\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \;\middle|\; S_0 = s, A_0 = a \right]. \]

The Bellman expectation equations are

\[ v_\pi(s) = \mathbb{E}_{A \sim \pi(\cdot \mid s),\, S' \sim p(\cdot \mid s,A)} \left[ R(s,A) + \gamma v_\pi(S') \right], \]

\[ q_\pi(s,a) = \mathbb{E}_{S' \sim p(\cdot \mid s,a)} \left[ R(s,a) + \gamma \mathbb{E}_{A' \sim \pi(\cdot \mid S')} q_\pi(S',A') \right]. \]

For deterministic policies \(\mu_\theta : \mathcal{S} \to \mathcal{A}\), the deterministic policy gradient theorem (Silver et al., 2014) states that, under suitable regularity assumptions,

\[ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a q^{\mu}(s,a)\big\rvert_{a = \mu_\theta(s)} \right], \]

where \(\rho^\mu\) denotes the discounted state visitation distribution under policy \(\mu\).
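
In code, this gradient is usually realized by backpropagating the critic’s output through the actor. The following is a minimal PyTorch sketch under assumed, illustrative network sizes; it is not taken from any of the cited papers or from the TurtleBot3 repository.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2   # illustrative sizes, e.g. a small feature vector and (v, w)

# Deterministic actor mu_theta(s) and critic Q_phi(s, a) as simple MLPs.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)   # a batch of states, e.g. sampled from a replay buffer

# Ascend Q(s, mu_theta(s)) in theta, i.e. descend its negative; autograd applies the chain
# rule grad_theta mu_theta(s) * grad_a Q(s, a)|_{a = mu_theta(s)} automatically.
actor_loss = -critic(torch.cat([states, actor(states)], dim=1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()   # only the actor's parameters are updated here; the critic is trained separately
```

The minibatch average over states stands in for the expectation over \(\rho^\mu\).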


2. Questions

Question 1 (Deterministic policy gradient vs TD backup in continuous control)

In the TurtleBot3 DRL navigation setting, the robot’s action space is continuous (for example, linear and angular velocities). This motivates the use of deterministic policy gradient methods such as DDPG rather than purely value-based TD methods.

  1. Starting from the Bellman equation for the action-value function under a deterministic policy \(\mu\),

    \[ q^\mu(s,a) = \mathbb{E}_{S' \sim p(\cdot \mid s,a)} \left[ R(s,a) + \gamma q^\mu(S', \mu(S')) \right], \]

    explain why the deterministic policy gradient update

    \[ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a q^\mu(s,a)\big\rvert_{a=\mu_\theta(s)} \right] \]

    does not require an explicit integral over the action space, in contrast to the stochastic policy gradient that involves an expectation (\(\mathbb{E}_{a \sim \pi(\cdot \mid s)}\)).

  2. Compare this to a TD(0) state-value update of the form

    \[ V(S_t) \leftarrow V(S_t) + \alpha \bigl( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr), \]

    and explain why such a TD update, by itself, cannot directly produce continuous control commands for TurtleBot3.

  3. Relate your discussion to the navigation task in the TurtleBot3 repository, where the policy must output smooth velocity commands. Explain why a deterministic actor is a natural choice for this control problem.
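
As a concrete illustration of item 3, a deterministic actor is simply a smooth function from the state to a bounded velocity command. The sketch below is hypothetical: the input dimension, hidden sizes, and velocity limits are assumed for illustration and are not the values used in the TurtleBot3 repository.

```python
import torch
import torch.nn as nn

class VelocityActor(nn.Module):
    """Hypothetical deterministic actor mapping a state vector to (linear, angular) velocity."""

    def __init__(self, state_dim: int = 26, max_lin: float = 0.2, max_ang: float = 2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.Tanh(),    # bounded outputs in (-1, 1)
        )
        # Assumed velocity limits (m/s, rad/s); replace with the robot's actual limits.
        self.register_buffer("scale", torch.tensor([max_lin, max_ang]))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state) * self.scale   # a = mu_theta(s), smooth in both s and theta

actor = VelocityActor()
state = torch.randn(1, 26)    # e.g. a downsampled laser scan plus goal distance and heading
print(actor(state))           # a single deterministic (v, w) command, with no argmax over actions
```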


Question 2 (TD critic target in DDPG vs TD(0) and stability)

In DDPG, the critic is typically trained with a TD target of the form

\[ y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s')), \]

where \(\phi'\) and \(\theta'\) denote the parameters of slowly updated target networks.
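
For reference, the target above and the accompanying soft (“Polyak”) update of the target networks can be written in a few lines. This is a generic DDPG-style sketch with illustrative network sizes and hyperparameters, not code from the TurtleBot3 repository.

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
gamma, tau = 0.99, 0.005    # discount factor and soft-update rate (illustrative values)

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

def critic_td_target(r, s_next, done):
    """y = r + gamma * Q_{phi'}(s', mu_{theta'}(s')), computed with the slow target networks."""
    with torch.no_grad():
        a_next = actor_target(s_next)
        return r + gamma * (1.0 - done) * critic_target(torch.cat([s_next, a_next], dim=1))

def soft_update(online, target):
    """Polyak averaging phi' <- tau * phi + (1 - tau) * phi' after each gradient step."""
    for p, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

The factor \((1 - \text{done})\) simply zeroes the bootstrap term at terminal transitions such as collisions.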

  1. Compare this TD target to that used in classical TD(0) value prediction and in value-based control methods such as fitted Q-iteration or Q-learning,

    \[ y_{\text{TD(0)}} = r + \gamma V(s'), \qquad y_{\text{Q-learning}} = r + \gamma \max_{a'} Q(s',a'). \]

    Highlight the key differences in terms of (i) dependence on the current policy, (ii) use of a max-operator, and (iii) the role of target networks.

  2. Discuss how these differences affect stability, particularly in the context of sim-to-real collision avoidance as considered by Niu et al. (2021). In your answer, address:

    • Overestimation bias arising from the \(\max_{a'}\) operator (a short numerical illustration follows this question).
    • The effect of target networks on reducing nonstationarity in the bootstrap target.
    • Why TD(0) value prediction is often more stable but cannot, on its own, specify a control policy.

  3. From a robotics perspective, explain why stability of the critic is especially important when transferring policies from simulation to a real TurtleBot3 platform. Consider the consequences of critic instability for real-world navigation safety.
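
The overestimation-bias bullet in item 2 above can be illustrated with a short numerical experiment: even when each individual Q-estimate is unbiased, the maximum over several noisy estimates is biased upward. The numbers below are artificial and serve only to illustrate the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)    # ten candidate actions, all with true value 0
noise_std = 1.0          # zero-mean estimation noise on each Q-value

# Estimate each Q-value with noise many times, then take the max over actions each time.
estimates = true_q + rng.normal(0.0, noise_std, size=(100_000, 10))
print("average of max_a Q_hat(s, a):", estimates.max(axis=1).mean())   # roughly +1.5
print("max_a of the true Q(s, a):   ", true_q.max())                   # exactly 0
```

DDPG avoids the explicit \(\max_{a'}\), but the actor’s maximization of an imperfect critic can still produce a related bias, which is part of why slowly updated target networks help.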


Question 3 (Reward structure, exploration, and conservative behaviors)

Consider a collision-avoidance task for TurtleBot3 in which the reward function penalizes collisions and near-collision states (e.g., small distances to obstacles) and gives sparse positive rewards for progress toward a goal.
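
To make this setup concrete, a reward of this general shape might look like the sketch below. The thresholds and coefficients are invented for illustration and are not the reward actually used in the TurtleBot3 repository or in Niu et al. (2021).

```python
def reward(min_obstacle_dist, goal_dist, prev_goal_dist, collided, reached_goal,
           safe_dist=0.4, collision_penalty=-100.0, goal_reward=100.0):
    """Hypothetical collision-avoidance reward: dense penalties near obstacles,
    a sparse bonus for reaching the goal, and small shaping for progress."""
    if collided:
        return collision_penalty
    if reached_goal:
        return goal_reward
    r = 0.0
    if min_obstacle_dist < safe_dist:           # near-collision penalty grows as the gap closes
        r -= 10.0 * (safe_dist - min_obstacle_dist)
    r += 5.0 * (prev_goal_dist - goal_dist)     # positive only when the robot makes progress
    return r
```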

  1. Explain how such a reward structure interacts with a deterministic policy \(\mu_\theta\) in DDPG. Discuss how the shape of the critic \(Q^\mu(s,a)\) with respect to \(a\) determines the direction and magnitude of the gradient \(\nabla_a Q^\mu(s,a)\) that drives the actor updates.

  2. Describe why deterministic policies can sometimes converge to conservative, low-variance behaviors (for example, very small linear velocities or minimal turning), especially if the exploration mechanism (e.g., Ornstein–Uhlenbeck or Gaussian noise added to actions; see the sketch after this question) is weak or poorly tuned.

  3. Contrast this with exploration in stochastic TD methods, where randomness in the policy \(\pi(a \mid s)\) directly affects the sampled actions. Explain how this stochasticity, combined with TD learning, can lead to qualitatively different exploration patterns than those induced by additive noise around a deterministic policy.

  4. Relate your analysis to Silver et al. (2014), focusing on how the deterministic policy gradient framework assumes that exploration is provided externally, and how this assumption plays out in a real-time navigation task.
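
As a concrete example of the additive exploration noise discussed in item 2, an Ornstein–Uhlenbeck process can be sampled as follows; the parameters are common illustrative defaults, not values from the TurtleBot3 repository.

```python
import numpy as np

class OUNoise:
    """Temporally correlated Ornstein-Uhlenbeck noise added to deterministic actions."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(action_dim, mu, dtype=float)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        self.x += (self.theta * (self.mu - self.x) * self.dt
                   + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        return self.x.copy()

noise = OUNoise(action_dim=2)
# Exploration is external to the policy: the behaviour action is a_t = mu_theta(s_t) + noise.sample().
```

If \(\sigma\) is small relative to the action scale, the behaviour policy stays close to \(\mu_\theta\) and the conservative behaviors described in item 2 can persist.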


Question 4 (Demonstrations, TD critic, and the limits of pure TD learning)

Niu et al. (2021) study accelerated sim-to-real deep reinforcement learning for collision avoidance using human demonstrations, followed by RL fine-tuning (e.g., with DDPG) in simulation and on the real robot.

  1. Suppose that a set of human demonstration trajectories \(\{(s_t, a_t, r_t, s_{t+1})\}\) is collected. Explain how these data can be used to initialize or pretrain the critic \(Q_\phi(s,a)\) and the actor \(\mu_\theta(s)\) in a DDPG framework (a minimal pretraining sketch in code follows this question).

  2. Show mathematically how TD critic updates of the form

    \[ \phi \leftarrow \phi + \alpha \bigl( y - Q_\phi(s,a) \bigr) \nabla_\phi Q_\phi(s,a), \quad y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s')), \]

    allow the critic to gradually correct errors in the values assigned to demonstrated actions. Explain how these corrected value estimates influence the actor updates via \(\nabla_a Q_\phi(s,a)\).

  3. Consider a hypothetical alternative in which only a state-value function \(V_\psi(s)\) is learned with TD(0), without any explicit actor:

    \[ \psi \leftarrow \psi + \alpha \bigl( r + \gamma V_\psi(s') - V_\psi(s) \bigr) \nabla_\psi V_\psi(s). \]

    Discuss why, in a continuous control environment such as TurtleBot3 collision avoidance, this pure TD approach is generally insufficient to produce an improved controller from demonstrations alone. In particular, address:

    • The absence of a direct mapping from value estimates to continuous actions.
    • The difficulty of deriving a policy from a state-value function without a separate optimization or planning step.
    • The role of the critic’s action-derivative \(\nabla_a Q(s,a)\) in providing actionable gradients to refine the controller.

  4. Summarize conditions under which combining demonstrations with a DDPG-style actor–critic (as in sim-to-real TurtleBot3 navigation) is expected to outperform pure TD value learning in terms of both performance and sample efficiency.
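
As noted in item 1, one minimal way to use demonstrations is to warm-start the actor by behavior cloning and the critic by TD regression on the demonstrated transitions. The sketch below is a generic illustration under assumed network sizes; it omits target networks and does not reproduce the training pipeline of Niu et al. (2021).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, gamma = 8, 2, 0.99   # illustrative sizes
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def pretrain_step(s, a, r, s_next):
    """One step on a batch of demonstration transitions (s, a, r, s'), all tensors with a batch dim."""
    # Behavior cloning: pull mu_theta(s) toward the demonstrated action a.
    bc_loss = F.mse_loss(actor(s), a)

    # TD regression of the critic toward y = r + gamma * Q(s', mu_theta(s')).
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1))
    td_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), y)

    opt.zero_grad()
    (bc_loss + td_loss).backward()
    opt.step()
```

During subsequent DDPG fine-tuning, the deterministic policy gradient then refines the cloned actor through \(\nabla_a Q_\phi(s,a)\).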


Question 5 (DDPG and VLA generalization)

In Liu et al. (2025), VLA models map perceptual inputs and natural-language instructions to continuous robot actions. The paper studies how reinforcement learning fine-tuning can improve the generalization of such models beyond supervised pretraining.

Consider a VLA model whose perception and language components are pretrained and approximately fixed, and whose action head outputs continuous control commands \(a = \mu_\theta(s)\) (e.g., linear and angular velocities).

  1. Explain how the deterministic policy gradient

    \[ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s,a)\big\rvert_{a=\mu_\theta(s)} \right] \]

    can be used to fine-tune only the action head of a VLA model while treating the upstream perception and language encoders as fixed feature extractors (a minimal code sketch of this setup appears at the end of this question).

  2. Argue why deterministic policy gradients (as in DDPG) may be particularly suitable in this setting, compared to fully stochastic policies, when the VLA model already provides a reasonably calibrated action distribution but requires precise, reward-aligned corrections.

  3. Discuss how TD critic signals \(Q^\mu(s,a)\), learned from sparse or delayed rewards, can correct systematic biases in the action head under distribution shift. Give concrete examples of such biases (e.g., overly conservative or overly aggressive actions) and explain qualitatively how the gradient \(\nabla_a Q^\mu(s,a)\) reshapes the action head’s mapping.

Your explanation should clearly distinguish between supervised pretraining on demonstration data and RL fine-tuning using TD-based critic updates.
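
To make item 1 concrete (as flagged there), the sketch below freezes a stand-in pretrained encoder and applies the deterministic policy gradient only to the action head. All module names, sizes, and the critic are hypothetical; none of this is taken from Liu et al. (2025).

```python
import torch
import torch.nn as nn

feat_dim, action_dim = 512, 2   # hypothetical feature and action dimensions

encoder = nn.Sequential(nn.Linear(1024, feat_dim), nn.ReLU())   # stand-in for frozen VLA encoders
action_head = nn.Sequential(nn.Linear(feat_dim, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(feat_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))

for p in encoder.parameters():          # perception/language backbone stays fixed
    p.requires_grad_(False)

head_opt = torch.optim.Adam(action_head.parameters(), lr=1e-4)

obs = torch.randn(16, 1024)             # batch of fused (image + instruction) features, hypothetical
feats = encoder(obs)                    # fixed feature extractor: no gradient reaches the encoder
actions = action_head(feats)            # a = mu_theta(s); only the head's parameters are trainable here

# Deterministic policy gradient applied to the action head alone.
head_loss = -critic(torch.cat([feats, actions], dim=1)).mean()
head_opt.zero_grad()
head_loss.backward()
head_opt.step()
```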