Reinforcement Learning Algorithms: Deep Deterministic Policy Gradient (DDPG)


张建站

2026/5/18 15:16:10

10 min read

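1. Technical Analysis

1.1 DDPG Overview

DDPG (Deep Deterministic Policy Gradient) is a deep reinforcement learning algorithm for continuous action spaces.

DDPG characteristics:
- Deterministic policy: outputs a specific action rather than a probability distribution
- Actor-Critic architecture: combines a policy with a value function
- Off-policy: learns from an experience replay buffer

Core innovations:
- Deterministic policy gradient
- Target networks
- Exploration noise

1.2 DDPG Components

| Component | Role | Update method |
|---|---|---|
| Actor | Outputs a deterministic action | Policy gradient |
| Critic | Estimates the Q-value | TD learning |
| Target networks | Stabilize training | Soft update |

1.3 DDPG Advantages

| Property | DDPG | PPO (continuous) |
|---|---|---|
| Action space | Continuous | Continuous |
| Exploration | Noise injection | Stochastic policy |
| Sample efficiency | High | Medium |

2. Core Implementation

2.1 The DDPG Algorithm

```python
import random
from collections import deque

import numpy as np


class DDPG:
    def __init__(self, actor, critic, target_actor, target_critic,
                 actor_optimizer, critic_optimizer,
                 replay_buffer_size=100000, batch_size=64,
                 gamma=0.99, tau=0.001, noise_scale=0.1):
        self.actor = actor
        self.critic = critic
        self.target_actor = target_actor
        self.target_critic = target_critic
        self.actor_optimizer = actor_optimizer
        self.critic_optimizer = critic_optimizer
        self.replay_buffer = deque(maxlen=replay_buffer_size)
        self.batch_size = batch_size
        self.gamma = gamma      # discount factor
        self.tau = tau          # soft-update rate for the target networks
        self.noise_scale = noise_scale

    def select_action(self, state, add_noise=True):
        action = self.actor(state)
        if add_noise:
            # Gaussian exploration noise added to the deterministic action
            action = action + np.random.normal(0, self.noise_scale, size=action.shape)
        return np.clip(action, -1, 1)

    def add_to_replay(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def update_target_networks(self):
        self._soft_update(self.target_actor, self.actor)
        self._soft_update(self.target_critic, self.critic)

    def _soft_update(self, target, source):
        for target_param, source_param in zip(target.parameters(), source.parameters()):
            # Write in place so the target network's own arrays change
            target_param[...] = self.tau * source_param + (1 - self.tau) * target_param

    def train(self, env, episodes=1000):
        for episode in range(episodes):
            state = env.reset()
            done = False
            total_reward = 0
            while not done:
                action = self.select_action(state)
                # Assumes a simplified env whose step() returns a 3-tuple
                next_state, reward, done = env.step(action)
                self.add_to_replay(state, action, reward, next_state, done)
                self._train_step()
                state = next_state
                total_reward += reward
                self.update_target_networks()

    def _train_step(self):
        # Wait until the buffer holds at least one full batch
        if len(self.replay_buffer) < self.batch_size:
            return
        batch = random.sample(self.replay_buffer, self.batch_size)
        states = np.array([exp[0] for exp in batch])
        actions = np.array([exp[1] for exp in batch])
        rewards = np.array([exp[2] for exp in batch])
        next_states = np.array([exp[3] for exp in batch])
        dones = np.array([exp[4] for exp in batch], dtype=np.float32)

        # TD target computed from the target networks
        target_actions = self.target_actor(next_states)
        target_q_values = self.target_critic(next_states, target_actions)
        targets = rewards + self.gamma * target_q_values * (1 - dones)

        # NumPy has no autograd: the optimizers are assumed to take a loss
        # value and handle gradient computation themselves
        critic_loss = np.mean((self.critic(states, actions) - targets) ** 2)
        self.critic_optimizer.step(critic_loss)
        actor_loss = -np.mean(self.critic(states, self.actor(states)))
        self.actor_optimizer.step(actor_loss)
```

2.2 DDPG Networks

```python
class DDPGActorNetwork:
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, hidden_dim) * 0.01
        self.b2 = np.zeros(hidden_dim)
        self.W3 = np.random.randn(hidden_dim, action_dim) * 0.01
        self.b3 = np.zeros(action_dim)

    def forward(self, state):
        h1 = np.maximum(0, state @ self.W1 + self.b1)   # ReLU
        h2 = np.maximum(0, h1 @ self.W2 + self.b2)      # ReLU
        action = np.tanh(h2 @ self.W3 + self.b3)        # bounded to [-1, 1]
        return action

    # Lets the DDPG class call the network as self.actor(state)
    __call__ = forward

    def parameters(self):
        return [self.W1, self.b1, self.W2, self.b2, self.W3, self.b3]


class DDPGCriticNetwork:
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        self.W1 = np.random.randn(state_dim + action_dim, hidden_dim) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, hidden_dim) * 0.01
        self.b2 = np.zeros(hidden_dim)
        self.W3 = np.random.randn(hidden_dim, 1) * 0.01
        self.b3 = np.zeros(1)

    def forward(self, state, action):
        # Concatenate along the last axis so single samples and batches both work
        x = np.concatenate([state, action], axis=-1)
        h1 = np.maximum(0, x @ self.W1 + self.b1)
        h2 = np.maximum(0, h1 @ self.W2 + self.b2)
        q_value = h2 @ self.W3 + self.b3
        return q_value[..., 0]

    __call__ = forward

    def parameters(self):
        return [self.W1, self.b1, self.W2, self.b2, self.W3, self.b3]
```

2.3 Exploration Noise

```python
class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, action_dim, mu=0, theta=0.15, sigma=0.2):
        self.action_dim = action_dim
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.state = np.ones(action_dim) * mu

    def reset(self):
        self.state = np.ones(self.action_dim) * self.mu

    def noise(self):
        x = self.state
        # Pull toward mu, plus a random perturbation
        dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(self.action_dim)
        self.state = x + dx
        return self.state


class GaussianNoise:
    def __init__(self, action_dim, mean=0, std=0.1):
        self.action_dim = action_dim
        self.mean = mean
        self.std = std

    def noise(self):
        return np.random.normal(self.mean, self.std, size=self.action_dim)


class AdaptiveNoise:
    """Gaussian noise whose standard deviation decays down to min_std."""

    def __init__(self, action_dim, initial_std=0.1, decay_rate=0.995, min_std=0.01):
        self.action_dim = action_dim
        self.std = initial_std
        self.decay_rate = decay_rate
        self.min_std = min_std

    def noise(self):
        noise = np.random.normal(0, self.std, size=self.action_dim)
        self.std = max(self.min_std, self.std * self.decay_rate)
        return noise
```

3. Performance Comparison

3.1 DDPG Variants

| Variant | Performance | Stability | Complexity |
|---|---|---|---|
| DDPG | Baseline | Medium | Low |
| TD3 | +10% | High | Medium |
| SAC | +15% | Very high | High |

3.2 DDPG vs PPO (continuous)

| Metric | DDPG | PPO |
|---|---|---|
| Sample efficiency | High | Medium |
| Stability | Medium | High |
| Hyperparameter sensitivity | High | Low |

3.3 Effect of Noise Type

| Noise type | Exploration quality | Stability |
|---|---|---|
| OU noise | Medium | High |
| Gaussian noise | High | Medium |
| Adaptive noise | Very high | High |

4. Best Practices

4.1 DDPG Configuration

```python
def configure_ddpg(env_type):
    configs = {
        "simple": {
            "noise_type": "gaussian",
            "noise_std": 0.1,
            "tau": 0.001,
            "gamma": 0.99,
        },
        "complex": {
            "noise_type": "adaptive",
            "noise_std": 0.2,
            "tau": 0.001,
            "gamma": 0.99,
        },
    }
    # Unknown environment types fall back to the "simple" preset
    return configs.get(env_type, configs["simple"])


class DDPGConfigGenerator:
    @staticmethod
    def from_environment(env_type):
        return configure_ddpg(env_type)
```

4.2 TD3 Improvements

```python
class TD3Improvements:
    @staticmethod
    def twin_critics():
        return {"use_twin_critics": True}

    @staticmethod
    def delayed_policy_update(delay=2):
        return {"policy_delay": delay}

    @staticmethod
    def target_policy_smoothing(sigma=0.2, clip=0.5):
        return {"target_smoothing": True, "sigma": sigma, "clip": clip}
```

5. Summary

DDPG is a classic algorithm for continuous control:
- Deterministic policy: outputs a specific action
- Target networks: stabilize training
- Exploration noise: OU or Gaussian noise
- TD3: an improved version of DDPG

The comparisons above indicate that:
- TD3 is more stable than DDPG
- SAC outperforms DDPG
- Adaptive noise works better than fixed noise
- TD3 is the recommended upgrade from DDPG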
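Section 4.2 lists the TD3 improvements only as configuration flags. As a rough illustration of what two of them mean in practice, the sketch below computes a TD3-style target using twin critics (taking the smaller of two Q estimates) and target policy smoothing (adding clipped noise to the target action). The helper `td3_target` and the toy actor and critic callables are hypothetical names introduced here for illustration; they are not part of the article's code. Delayed policy updates are not shown.

```python
import numpy as np


def td3_target(rewards, dones, next_states, target_actor,
               target_critic_1, target_critic_2,
               gamma=0.99, sigma=0.2, clip=0.5, rng=None):
    """Hypothetical helper: TD3-style targets with smoothing and twin critics."""
    rng = np.random.default_rng() if rng is None else rng
    next_actions = target_actor(next_states)
    # Target policy smoothing: perturb the target action with clipped Gaussian noise
    noise = np.clip(rng.normal(0.0, sigma, size=next_actions.shape), -clip, clip)
    next_actions = np.clip(next_actions + noise, -1.0, 1.0)
    # Twin critics: use the element-wise minimum to curb Q-value overestimation
    q1 = target_critic_1(next_states, next_actions)
    q2 = target_critic_2(next_states, next_actions)
    return rewards + gamma * (1.0 - dones) * np.minimum(q1, q2)


# Toy stand-ins for the target networks (assumptions, for illustration only)
def toy_actor(s):
    return np.tanh(s.sum(axis=1, keepdims=True))


def toy_critic_a(s, a):
    return s.sum(axis=1) + a[:, 0]


def toy_critic_b(s, a):
    return s.sum(axis=1) - a[:, 0]


next_states = np.array([[0.5, 0.5], [-1.0, 0.0]])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])   # the second transition is terminal

targets = td3_target(rewards, dones, next_states,
                     toy_actor, toy_critic_a, toy_critic_b,
                     rng=np.random.default_rng(0))
print(targets)
```

For terminal transitions the `(1.0 - dones)` factor removes the bootstrapped term, so the target reduces to the immediate reward. In a full TD3 agent both critics would be regressed toward this shared target, while the actor would typically be updated only every `policy_delay` critic steps.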
