Search: 通过直接结合 - ai.jp.net

Research Paper #Reinforcement Learning, Policy Optimization, Sample Complexity 🔬 ResearchAnalyzed: Jan 3, 2026 16:51

Sample Complexity of Policy Mirror Descent with TD Learning

Published:Dec 30, 2025 07:57

•

1 min read

•

ArXiv

Analysis

This paper investigates the sample complexity of Policy Mirror Descent (PMD) with Temporal Difference (TD) learning in reinforcement learning, specifically under the Markovian sampling model. It addresses limitations in existing analyses by considering TD learning directly, without requiring explicit approximation of action values. The paper introduces two algorithms, Expected TD-PMD and Approximate TD-PMD, and provides sample complexity guarantees for achieving epsilon-optimality. The results are significant because they contribute to the theoretical understanding of PMD methods in a more realistic setting (Markovian sampling) and provide insights into the sample efficiency of these algorithms.

Key Takeaways

•Investigates sample complexity of Policy Mirror Descent (PMD) with Temporal Difference (TD) learning under Markovian sampling.
•Introduces Expected TD-PMD and Approximate TD-PMD algorithms.
•Provides $ ilde{O}(\varepsilon^{-2})$ and $O(\varepsilon^{-2})$ sample complexity guarantees for average-time and last-iterate $\varepsilon$-optimality, respectively.
•Addresses limitations of existing PMD sample complexity analyses by directly incorporating TD learning.

Reference

“The paper establishes $ ilde{O}(\varepsilon^{-2})$ and $O(\varepsilon^{-2})$ sample complexities for achieving average-time and last-iterate $\varepsilon$-optimality, respectively.”

Permalink ArXiv

Sample Complexity of Policy Mirror Descent with TD Learning

Analysis

Key Takeaways

📬 Get AI News Delivered

Browse by Category

Trending Topics

📬 Get AI News Delivered

Browse by Category

Trending Topics