DDPG
Deep Deterministic Policy Gradient (DDPG) is a powerful actor-critic algorithm designed for environments with continuous action spaces. It combines the strengths of deterministic policy gradients and Q-learning, enabling effective learning in high-dimensional control tasks. DDPG maintains separate networks for the actor and critic, along with their respective target networks, which contribute to training stability. It leverages experience replay and soft target updates to reduce variance and improve sample efficiency. Although it is more sensitive to hyperparameters than some newer methods, DDPG remains a strong baseline in continuous control benchmarks. Our implementation closely follows the design and structure outlined by CleanRL.
Continuous state - continuous action
The ddpg_classical.py and ddpg_quantum.py implementations have the following features:
- ✅ Work with continuous observation space
- ✅ Work with continuous action space
- ✅ Work with envs like Pendulum-v1
- ✅ Multiple Vectorized Environments
- ✅ Single file implementation
Implementation details
The key difference between the classical and the quantum algorithm is the ddpgAgentQuantum class, as shown below.
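The snippet below is a minimal, illustrative sketch of what such a class can look like, not the exact code of ddpg_quantum.py. It assumes a hypothetical QNode quantum_circuit (an ansatz of the kind sketched further below) and only shows the trainable parameter sets that also appear in the optimizer snippet later in this section; all other names and shapes are assumptions.
# Illustrative sketch only; apart from ddpgAgentQuantum, input_scaling,
# output_scaling and weights, the names and shapes are assumptions.
import numpy as np
import torch
import torch.nn as nn


class ddpgAgentQuantum(nn.Module):
    def __init__(self, num_layers, num_qubits, num_actions):
        super().__init__()
        self.num_layers = num_layers
        self.num_qubits = num_qubits
        # The three trainable parameter sets that later get their own learning rates.
        self.input_scaling = nn.Parameter(torch.ones(num_layers, num_qubits))
        self.output_scaling = nn.Parameter(torch.ones(num_actions))
        self.weights = nn.Parameter(torch.rand(num_layers, num_qubits * 2) * 2 * np.pi)
        # Classical variance parameter, shared across all action dimensions (see text below).
        self.log_std = nn.Parameter(torch.zeros(1))

    def forward(self, state):
        # Action mean = output-scaled expectation values of the parameterized quantum circuit.
        expvals = quantum_circuit(state, self.input_scaling, self.weights,
                                  self.num_layers, self.num_qubits)
        if isinstance(expvals, (list, tuple)):  # PennyLane may return one tensor per measurement
            expvals = torch.stack(list(expvals), dim=-1)
        return self.output_scaling * expvals.reshape(state.shape[0], -1)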
Additionally, we need to specify a function for the ansatz of the parameterized quantum circuit.
In our implementation, the mean of the continuous action is based on the expectation value of the parameterized quantum circuit, while the variance is an additional classical trainable parameter that is shared across all action dimensions. For additional information, we refer to Variational Quantum Circuit Design for Quantum Reinforcement Learning on Continuous Environments.
Our implementation incorporates some key novelties proposed by Skolik et al. in Quantum agents in the Gym.
data reuploading
: In our ansatz, the features of the states are encoded via RX rotation gates. Instead of only encoding the features in the first layer, this process is repeated in each layer. This has been shown to improve training performance by increasing the expressivity of the ansatz.

input scaling
: In our implementation, we define another set of trainable parameters that scale the features before they are encoded into the quantum circuit. This has also been shown to improve training performance.

output scaling
: In our implementation, we define a final set of trainable parameters that scales the expectation values that the quantum circuit "outputs". This has also been shown to improve training performance.

A sketch of an ansatz combining these three techniques follows below.
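The snippet below is a minimal, illustrative sketch of such an ansatz; the exact gate layout, measured observables, and qubit/layer counts in ddpg_quantum.py may differ.
# Illustrative sketch of an ansatz with data reuploading, input scaling,
# and (classically applied) output scaling; all concrete numbers are examples.
import pennylane as qml

num_qubits = 3    # e.g. the 3 observation features of Pendulum-v1
num_actions = 1   # e.g. the single torque action of Pendulum-v1
num_layers = 5

dev = qml.device("default.qubit", wires=num_qubits)


@qml.qnode(dev, interface="torch", diff_method="backprop")
def quantum_circuit(x, input_scaling, weights, num_layers, num_qubits):
    for layer in range(num_layers):
        # Data reuploading + input scaling: the state features are re-encoded
        # in every layer via RX rotations, scaled by trainable parameters.
        for i in range(num_qubits):
            qml.RX(input_scaling[layer, i] * x[:, i], wires=i)
        # Trainable variational rotations.
        for i in range(num_qubits):
            qml.RY(weights[layer, i], wires=i)
            qml.RZ(weights[layer, i + num_qubits], wires=i)
        # Ring of entangling gates.
        for i in range(num_qubits):
            qml.CZ(wires=[i, (i + 1) % num_qubits])
    # One expectation value per continuous action dimension; output scaling is
    # applied classically to these values outside the circuit.
    return [qml.expval(qml.PauliZ(i)) for i in range(num_actions)]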
We also provide the option to select different learning rates for the different parameter sets:
optimizer = optim.Adam(
[
{"params": agent.input_scaling, "lr": lr_input_scaling},
{"params": agent.output_scaling, "lr": lr_output_scaling},
{"params": agent.weights, "lr": lr_weights},
]
)
You can also use a faster PennyLane backend for your simulations:
pennylane-lightning
: We enable the use of the lightning simulation backend provided by PennyLane, which speeds up simulation.
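For example (illustrative, with num_qubits as a placeholder), only the device line of the QNode needs to change:
import pennylane as qml

num_qubits = 3
# C++-based statevector simulator from the pennylane-lightning package.
dev = qml.device("lightning.qubit", wires=num_qubits)
# Note: lightning does not support backprop differentiation; the adjoint method
# is typically used instead, e.g.:
# @qml.qnode(dev, interface="torch", diff_method="adjoint")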
We also add an observation wrapper called ArctanNormalizationWrapper at the very beginning of the file. Because we encode the features of the states as rotations, we need to ensure that the features do not fall outside the interval [-π, π], due to the periodicity of the rotation gates. For more details on wrappers, see Advanced Usage.
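A minimal sketch of such a wrapper, assuming it builds on gymnasium's ObservationWrapper (the version in the file may differ), looks like this:
import gymnasium as gym
import numpy as np


class ArctanNormalizationWrapper(gym.ObservationWrapper):
    """Squash every observation feature into (-pi/2, pi/2) via arctan."""

    def observation(self, obs):
        return np.arctan(obs)
It can then be applied when the environments are created, for example env = ArctanNormalizationWrapper(gym.make("Pendulum-v1")).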
Experiment results
Coming Soon!