Reference: Gallici, M., Fellows, M., Ellis, B., Pou, B., Masmitja, I., Foerster, J. N., & Martin, M. Simplifying Deep Temporal Difference Learning. ICLR 2025.
Temporal difference (TD) methods can be simple and efficient, but are notably unstable when combined with neural networks or off-policy sampling.
All of the following methods were developed to stabilise TD whilst using deep neural networks (NN):
Replay buffer (batched learning)
Target networks
Trust region methods
DDQN
Maximum entropy methods (SAC)
Ensembling
PPO seems to have been the de-facto method for many scenarios, but it is still quite unstable and hard to configure, with many implementation details and tricks needed to implement it efficiently.
PPO has no provable convergence properties when used with NN.
Parallelising interaction using vectorised environments with multithreading is a standard way to speed up training.
More recent GPU-based frameworks (e.g. IsaacGym, Craftax, etc.) are vectorized using batched tensors, which lets the agent interact with thousands of environments at once.
Allows compilation of the whole training loop (environment steps and agent updates) into a single GPU program (a sketch of batched interaction follows this list).
Not possible to do with DQN because of:
The replay buffer: keeping it in GPU memory is practically impossible since it would take up most of the VRAM.
Convergence: off-policy methods usually only converge with a low update-to-data (UTD) ratio (e.g. a UTD of 1 for traditional DQN).
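To make the batched-tensor idea concrete, here is a minimal NumPy sketch where the state of all environments lives in a single array and one `step()` call advances all of them at once. The toy environment, its dynamics, and all sizes are invented for illustration and are not from the paper:

```python
import numpy as np

# Toy stand-in for a GPU-vectorized environment: one array row per environment,
# so a single step() advances thousands of environments with batched operations.
class BatchedToyEnv:
    def __init__(self, num_envs: int, obs_dim: int, seed: int = 0):
        self.num_envs, self.obs_dim = num_envs, obs_dim
        self.rng = np.random.default_rng(seed)
        self.state = self.rng.normal(size=(num_envs, obs_dim))

    def step(self, actions: np.ndarray):
        # Invented dynamics: drift by the action plus noise, random terminations.
        self.state = self.state + 0.01 * actions[:, None] \
            + 0.01 * self.rng.normal(size=self.state.shape)
        rewards = -np.linalg.norm(self.state, axis=1)        # (num_envs,)
        dones = self.rng.random(self.num_envs) < 0.01        # (num_envs,)
        self.state[dones] = self.rng.normal(size=(int(dones.sum()), self.obs_dim))
        return self.state, rewards, dones

env = BatchedToyEnv(num_envs=1024, obs_dim=8)
actions = np.zeros(1024, dtype=np.int64)      # e.g. output of an eps-greedy policy
obs, rew, done = env.step(actions)            # shapes: (1024, 8), (1024,), (1024,)
```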
Usually, parallelization of Q-learning looks like:
One process continuously trains the agent while another quickly samples new transitions.
BatchNorm alone doesn’t stabilize TD learning and can even degrade it in some cases; only CrossQ, which combines it with multiple tricks (double Q-learning, BatchRenorm), manages to stabilize training. However, BatchNorm does seem to improve results when applied early in the network.
LayerNorm + L2-regularized TD can stabilise TD by mitigating the effects of nonlinearity and off-policy sampling, as demonstrated by the extensive theoretical analysis in the paper. The authors suggest starting with LayerNorm and L2 regularisation as a strong baseline to stabilize TD algorithms.
Their analysis demonstrates that L2 regularization should be used sparingly: only when LayerNorm alone cannot stabilize learning in the environment, and initially only over the final-layer weights. However, LayerNorm without any L2 regularization cannot completely stabilize TD learning in all domains.
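A minimal PyTorch sketch of this suggested baseline (the network sizes, learning rate, and weight-decay coefficient are illustrative assumptions, not values from the paper): LayerNorm after each hidden layer, and L2 regularization applied only to the final layer via a separate optimizer parameter group.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Q-network with LayerNorm after every hidden layer.
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, obs):
        return self.head(self.body(obs))  # (batch, num_actions)

q_net = QNetwork(obs_dim=8, num_actions=4)
optimizer = torch.optim.Adam([
    {"params": q_net.body.parameters(), "weight_decay": 0.0},
    {"params": q_net.head.parameters(), "weight_decay": 1e-4},  # L2 on final layer only
], lr=3e-4)

def td_loss(obs, actions, targets):
    # Regress Q(s, a) towards fixed targets (e.g. the lambda-returns described below).
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, targets)
```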
\(\pi_{Explore}\) (an \(\varepsilon\)-greedy policy) is rolled out for a short trajectory of length \(T\), \(\tau = (s_i, a_i, r_i, s_{i+1}, \dots, s_{i+T})\), after which the \(\lambda\)-returns are computed backwards as follows (a code sketch of this computation follows the list):
(Last return) Start with \(R_{i+T}^{\lambda} = \max_{a'} Q_{\theta}(s_{i+T}, a')\)
(Recursion) For \(t = i+T-1, \dots, i\), set \(R_t^{\lambda} = r_t + \gamma\left[\lambda R_{t+1}^{\lambda} + (1-\lambda)\max_{a'} Q_{\theta}(s_{t+1}, a')\right]\)
(Terminal state) If \(s_t\) is a terminal state, replace the target with \(R_t^{\lambda} = r_t\)
To compute the \(\lambda\)-returns, only a small buffer containing the transition tuples of the current trajectory for each agent is needed, which is much smaller than the usual replay buffer of 1M transitions per agent since \(T < Length_{buffer}\).
The special case of \(\lambda = 0\) and \(T = 1\) is equivalent to traditional Q-learning (an update at each transition).
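A minimal NumPy sketch of the backward \(\lambda\)-return computation for a single environment's trajectory (the function name and array layout are my own; \(\gamma = 0.99\) is an assumed discount, \(\lambda = 0.65\) matches the value highlighted in the ablations):

```python
import numpy as np

# rewards[t], dones[t] for t = 0..T-1, and max_next_q[t] = max_a Q(s_{t+1}, a).
def lambda_returns(rewards, dones, max_next_q, gamma=0.99, lam=0.65):
    T = len(rewards)
    returns = np.zeros(T)
    next_return = max_next_q[-1]        # bootstrap: R_{i+T} = max_a Q(s_{i+T}, a)
    for t in reversed(range(T)):
        if dones[t]:
            returns[t] = rewards[t]      # terminal: target is just the reward
        else:
            # R_t = r_t + gamma * [lam * R_{t+1} + (1 - lam) * max_a Q(s_{t+1}, a)]
            returns[t] = rewards[t] + gamma * (
                lam * next_return + (1.0 - lam) * max_next_q[t]
            )
        next_return = returns[t]
    return returns
```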
Similarly to PPO, for improved sample efficiency, PQN divides the collected experiences into multiple minibatches and updates on them multiple times over a few epochs.
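A rough sketch of that update scheme, assuming the transitions and \(\lambda\)-returns of the current batch are already computed (batch size, number of epochs, and number of minibatches are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_epochs, num_minibatches = 4096, 4, 8
indices = np.arange(batch_size)
for epoch in range(num_epochs):
    rng.shuffle(indices)
    for mb in np.array_split(indices, num_minibatches):
        # one gradient step on the TD loss over (obs[mb], actions[mb], returns[mb])
        ...
```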
PQN is an off-policy algorithm since it uses two different policies:
An \(\varepsilon\)-greedy exploration policy to select actions at the current timestep
\(\varepsilon = 1\) at the start of training. This implies that we are initially optimizing the value function of a fully random policy, which requires normalization to avoid training instability, as proved in the paper. (A small sketch of this exploration policy follows the list.)
The greedy policy (the \(\max\) over the current Q-values) to bootstrap the value of the next step
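A minimal NumPy sketch of the exploration side, assuming a hypothetical linear annealing schedule starting at \(\varepsilon = 1\) (the schedule endpoints and step count are not from the paper):

```python
import numpy as np

def epsilon(step, anneal_steps=100_000, eps_start=1.0, eps_end=0.05):
    # Linear anneal from a fully random policy towards mostly greedy behaviour.
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, eps, rng=np.random.default_rng(0)):
    # q_values: (num_envs, num_actions) -> one action per parallel environment.
    num_envs, num_actions = q_values.shape
    greedy = q_values.argmax(axis=1)
    random = rng.integers(num_actions, size=num_envs)
    explore = rng.random(num_envs) < eps
    return np.where(explore, random, greedy)
```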
Table 1 summarizes the advantages of this algorithm. The closest algorithm in terms of these characteristics is PPO, but it requires numerous interacting implementation details and more hyperparameters to tune, making it harder to use.
Multi-agent learning version
The algorithm can also be adapted to cooperative multi-agent scenarios by adopting Value Decomposition Networks (VDN). In other words, the joint action-value function is optimized as the sum of the individual agents' action-values, \(Q_{tot}(s, \mathbf{a}) = \sum_i Q_i(s, a_i)\).
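A minimal PyTorch sketch of this VDN-style decomposition (network sizes, shapes, and names are illustrative, not the paper's implementation): each agent has its own Q-network, and the sum of the chosen per-agent action-values is regressed towards a shared team target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_agents, obs_dim, num_actions = 3, 8, 4
agent_q_nets = nn.ModuleList([
    nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
    for _ in range(num_agents)
])

def vdn_loss(obs, actions, team_targets):
    # obs: (batch, num_agents, obs_dim), actions: (batch, num_agents),
    # team_targets: (batch,) lambda-returns computed from the shared team reward.
    per_agent_q = [
        net(obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)  # (batch,)
        for i, net in enumerate(agent_q_nets)
    ]
    q_tot = torch.stack(per_agent_q, dim=0).sum(dim=0)            # Q_tot = sum_i Q_i
    return F.mse_loss(q_tot, team_targets)
```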
Benefits of online Q-learning with vectorized environments #
The parallelized setting can help exploration, since the natural stochasticity in the dynamics means that even a greedy policy will explore several different states in parallel.
Taking multiple actions in multiple states enables PQN's sampling distribution to be a good approximation of the true stationary distribution under the current policy. As illustrated in Figure 3, sampling from DQN's replay buffer is instead roughly equivalent to sampling from an average of older stationary distributions under varying policies.
Baird's counterexample is an environment designed to be provably divergent for TD methods. PQN with LayerNorm and L2 regularization diverges much less than its unnormalized counterpart.
Craftax is an environment where the agent has to solve multiple tasks before completion. PQN with an RNN is more sample-efficient than PPO with an RNN and performs slightly better, while both methods take a similar time to train.
SMAX, Overcooked and Hanabi are three multi-agent environments. PQN-VDN outperforms or is more sample-efficient than all evaluated algorithms in those environments while not requiring a huge replay buffer.
Input Normalization. BatchNorm is mostly useful as input normalization before the first layer, and adding CrossQ's tricks worsens performance. Instead of BatchNorm, which can sometimes hurt performance, applying LayerNorm throughout the network seems to have the same effect while being more stable.
Varying \(\lambda\). Experiments show that \(\lambda = 0.65\) performs significantly better than not using \(\lambda\)-returns (\(\lambda = 0\)), demonstrating that they are an important design choice.
Replay Buffer: Figure 6d) demonstrates that PQN reaches the same performance roughly 6x faster than a variant that stores 1M transitions in a huge replay buffer in GPU memory.
Number of environments: PQN can learn even with very few environments, but as shown in Figure 6e), using many more environments is strongly encouraged since training is much quicker.