Reinforcement Learning Algorithms: Q-Learning and SARSA
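1. Technical Analysis

1.1 Q-Learning Overview

Q-learning is a model-free reinforcement learning algorithm.

Q-learning characteristics:

- Model-free: it does not need to know the transition probabilities.
- Off-policy: it learns the optimal policy, regardless of the policy used to collect experience.
- Value function: Q(s,a) is the value of taking action a in state s.
- Core update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Here α is the learning rate, γ is the discount factor, R is the reward, s' is the next state, and a' ranges over the actions available in s'.

1.2 SARSA Overview

SARSA is an on-policy reinforcement learning algorithm.

SARSA characteristics:

- On-policy: it learns the value of the policy it is currently following.
- Update tuple: each update uses state, action, reward, next state, and next action (S, A, R, S', A'), which gives the algorithm its name.
- Core update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R + \gamma Q(s',a') - Q(s,a) \right]$$

Here a' is the action actually selected in s' by the current policy, so the only difference from Q-learning is that the max over a' is replaced by the chosen action.

1.3 Q-Learning vs SARSA

| Feature | Q-learning | SARSA |
|---|---|---|
| Policy type | Off-policy | On-policy |
| Learning target | Optimal policy | Current policy |
| Exploration vs exploitation | Bold | Conservative |

2. Core Implementation

2.1 Q-Learning Algorithm

    import random

    import numpy as np


    class QLearning:
        """Tabular Q-learning (off-policy TD control)."""

        def __init__(self, states, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
            self.states = states
            self.actions = actions
            self.alpha = alpha      # learning rate
            self.gamma = gamma      # discount factor
            self.epsilon = epsilon  # exploration rate
            # Q-table: Q[state][action], initialised to zero
            self.Q = {state: {action: 0.0 for action in actions} for state in states}

        def choose_action(self, state):
            # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
            if random.random() < self.epsilon:
                return random.choice(self.actions)
            return max(self.Q[state], key=self.Q[state].get)

        def update(self, state, action, reward, next_state, done):
            if done:
                target = reward
            else:
                # Bootstrap from the best action in the next state
                target = reward + self.gamma * max(self.Q[next_state].values())
            self.Q[state][action] += self.alpha * (target - self.Q[state][action])

        def train(self, env, episodes=1000):
            for _ in range(episodes):
                state = env.reset()
                done = False
                while not done:
                    action = self.choose_action(state)
                    # env.step is expected to return (next_state, reward, done)
                    next_state, reward, done = env.step(action)
                    self.update(state, action, reward, next_state, done)
                    state = next_state

        def get_policy(self):
            # Greedy policy with respect to the learned Q-table
            return {state: max(self.Q[state], key=self.Q[state].get)
                    for state in self.states}

2.2 SARSA Algorithm

    class SARSA:
        """Tabular SARSA (on-policy TD control). Uses `random` imported in 2.1."""

        def __init__(self, states, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
            self.states = states
            self.actions = actions
            self.alpha = alpha
            self.gamma = gamma
            self.epsilon = epsilon
            self.Q = {state: {action: 0.0 for action in actions} for state in states}

        def choose_action(self, state):
            if random.random() < self.epsilon:
                return random.choice(self.actions)
            return max(self.Q[state], key=self.Q[state].get)

        def train(self, env, episodes=1000):
            for _ in range(episodes):
                state = env.reset()
                action = self.choose_action(state)
                done = False
                while not done:
                    next_state, reward, done = env.step(action)
                    # The next action is chosen before the update and then executed,
                    # so the target reflects the policy actually being followed.
                    # Terminal states must therefore also be keys of the Q-table.
                    next_action = self.choose_action(next_state)
                    if done:
                        target = reward
                    else:
                        target = reward + self.gamma * self.Q[next_state][next_action]
                    self.Q[state][action] += self.alpha * (target - self.Q[state][action])
                    state = next_state
                    action = next_action

        def get_policy(self):
            return {state: max(self.Q[state], key=self.Q[state].get)
                    for state in self.states}

2.3 ε-Greedy Policy

    class EGreedyPolicy:
        def __init__(self, epsilon=0.1):
            self.epsilon = epsilon

        def choose_action(self, Q_state):
            # Q_state maps each action to its Q-value in the current state
            if random.random() < self.epsilon:
                return random.choice(list(Q_state.keys()))
            return max(Q_state, key=Q_state.get)

        def decay_epsilon(self, rate=0.995):
            # Multiplicative decay with a floor of 0.01
            self.epsilon = max(0.01, self.epsilon * rate)


    class SoftmaxPolicy:
        def __init__(self, temperature=1.0):
            self.temperature = temperature

        def choose_action(self, Q_state):
            actions = list(Q_state.keys())
            values = np.array(list(Q_state.values()), dtype=float)
            # Subtract the maximum before exponentiating to avoid overflow
            exp_values = np.exp((values - values.max()) / self.temperature)
            probs = exp_values / exp_values.sum()
            # Sample an index so that actions of any hashable type are returned unchanged
            return actions[np.random.choice(len(actions), p=probs)]

3. Performance Comparison

The ratings in this section are qualitative.

3.1 Q-Learning vs SARSA

| Metric | Q-learning | SARSA |
|---|---|---|
| Convergence speed | Medium | Fast |
| Final performance | High | Medium |
| Stability | Medium | High |

3.2 Effect of the Learning Rate

| α | Convergence speed | Stability |
|---|---|---|
| 0.1 | Medium | High |
| 0.5 | Fast | Medium |
| 0.9 | Very fast | Low |

3.3 Effect of the ε-Greedy Policy

| ε | Exploration | Convergence speed | Final performance |
|---|---|---|---|
| 0.01 | Low | Fast | Medium |
| 0.1 | Medium | Medium | High |
| 0.5 | High | Slow | Very high |

4. Best Practices

4.1 Algorithm Selection

    def choose_q_learning_algorithm(environment_type):
        if environment_type == 'deterministic':
            return 'Q-learning'
        elif environment_type == 'stochastic':
            return 'SARSA'
        else:
            return 'Q-learning'


    class QAlgorithmSelector:
        @staticmethod
        def select(config):
            # Copy the config and remove the selector-only key, since neither
            # constructor accepts an `off_policy` argument.
            params = dict(config)
            off_policy = params.pop('off_policy', True)
            if off_policy:
                return QLearning(**params)
            return SARSA(**params)

4.2 Parameter Tuning

    class ParameterTuner:
        def evaluate(self, agent, env, episodes=10, max_steps=1000):
            """Average return of the agent's greedy policy.

            QLearning and SARSA expose no evaluate(), so the score is computed here.
            max_steps guards against greedy policies that never terminate.
            """
            policy = agent.get_policy()
            total = 0.0
            for _ in range(episodes):
                state, done, steps = env.reset(), False, 0
                while not done and steps < max_steps:
                    state, reward, done = env.step(policy[state])
                    total += reward
                    steps += 1
            return total / episodes

        def tune(self, env, algorithm_class):
            """Grid search over alpha, gamma, and epsilon."""
            best_reward = float('-inf')
            best_params = {}
            alphas = [0.01, 0.1, 0.5]
            gammas = [0.9, 0.99, 0.999]
            epsilons = [0.01, 0.1, 0.5]
            for alpha in alphas:
                for gamma in gammas:
                    for epsilon in epsilons:
                        agent = algorithm_class(
                            states=env.states,
                            actions=env.actions,
                            alpha=alpha,
                            gamma=gamma,
                            epsilon=epsilon,
                        )
                        agent.train(env, episodes=100)
                        reward = self.evaluate(agent, env)
                        if reward > best_reward:
                            best_reward = reward
                            best_params = {'alpha': alpha, 'gamma': gamma,
                                           'epsilon': epsilon}
            return best_params

5. Summary

Q-learning and SARSA are classic reinforcement learning algorithms:

- Q-learning: off-policy, learns the optimal policy.
- SARSA: on-policy, learns the current policy.
- ε-greedy: balances exploration and exploitation.
- Parameter choice: affects convergence and stability.

The comparisons lead to the following conclusions:

- Q-learning reaches higher final performance, but SARSA is more stable.
- α = 0.1 is a safe choice of learning rate.
- ε = 0.1 is a commonly used exploration rate.
- Start with Q-learning and switch to SARSA if training is unstable.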
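To try the two update rules side by side, the sketch below trains both on a tiny corridor task. Everything in it is an illustrative assumption rather than part of the article's code: `ChainEnv`, `epsilon_greedy`, `train`, and `greedy_policy` are hypothetical names, and the environment follows the `reset()` / `step(action) -> (next_state, reward, done)` interface used in Section 2. On a task this simple both algorithms would be expected to learn the same greedy policy; differences tend to show up only when exploratory moves are costly.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor: start in state 0, goal in state 4.

    Every step costs -1, so shorter paths earn a higher return.
    """

    states = [0, 1, 2, 3, 4]
    actions = ["left", "right"]

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == "right" else -1
        # Clamp to the corridor; moving left in state 0 stays in state 0
        self.state = min(max(self.state + move, 0), 4)
        done = self.state == 4
        return self.state, -1.0, done


def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(Q[state], key=Q[state].get)


def train(env, on_policy, episodes=2000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular TD control: SARSA if on_policy is True, otherwise Q-learning."""
    Q = {s: {a: 0.0 for a in env.actions} for s in env.states}
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, env.actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # Behaviour action for the next step (also SARSA's bootstrap action)
            next_action = epsilon_greedy(Q, next_state, env.actions, epsilon)
            if done:
                target = reward
            elif on_policy:
                # SARSA: bootstrap from the action that will actually be taken
                target = reward + gamma * Q[next_state][next_action]
            else:
                # Q-learning: bootstrap from the best action in the next state
                target = reward + gamma * max(Q[next_state].values())
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q


def greedy_policy(Q):
    return {s: max(Q[s], key=Q[s].get) for s in Q}


if __name__ == "__main__":
    random.seed(0)
    env = ChainEnv()
    for name, on_policy in [("Q-learning", False), ("SARSA", True)]:
        print(name, greedy_policy(train(env, on_policy)))
```

The constant step cost of -1 is a deliberate choice in this sketch: with a zero-initialised table, every action that has been tried looks worse than one that has not, which pushes the agent to explore the corridor without needing a large ε.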