Research Paper · Reinforcement Learning, Policy Optimization, Sample Complexity · 🔬 Research · Analyzed: Jan 3, 2026 16:51
Sample Complexity of Policy Mirror Descent with TD Learning
Published: Dec 30, 2025 07:57 • 1 min read • ArXiv
Analysis
This paper investigates the sample complexity of Policy Mirror Descent (PMD) combined with Temporal Difference (TD) learning in reinforcement learning, specifically under the Markovian sampling model. It addresses a limitation of existing analyses by incorporating TD learning directly, rather than assuming access to explicit approximations of the action values. The paper introduces two algorithms, Expected TD-PMD and Approximate TD-PMD, and provides sample complexity guarantees for achieving $\varepsilon$-optimality. The results are significant because they advance the theoretical understanding of PMD methods in a more realistic setting (Markovian sampling) and give insight into the sample efficiency of these algorithms.
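To make the setup concrete, the sketch below shows one way a PMD outer loop can be paired with TD-based policy evaluation in a tabular MDP. It is an illustrative stand-in, not the paper's Expected TD-PMD or Approximate TD-PMD: action values are estimated with SARSA-style TD(0) updates along a single Markovian trajectory, and the policy is then updated with a KL-mirror-map PMD step (an exponentiated update). The environment interface (`n_states`, `n_actions`, `reset`, `step`) and all step sizes are hypothetical.

```python
# Minimal sketch: Policy Mirror Descent with TD-style policy evaluation in a
# tabular MDP. Illustrative only -- not the paper's exact algorithms.
import numpy as np

def td_pmd(env, n_outer=50, n_td_steps=2000, alpha=0.1, eta=1.0, gamma=0.99):
    nS, nA = env.n_states, env.n_actions          # hypothetical env attributes
    policy = np.full((nS, nA), 1.0 / nA)          # start from the uniform policy
    for _ in range(n_outer):
        # --- Policy evaluation: TD(0)-style estimate of Q^pi from one Markovian trajectory ---
        Q = np.zeros((nS, nA))
        s = env.reset()                           # hypothetical: returns a state index
        a = np.random.choice(nA, p=policy[s])
        for _ in range(n_td_steps):
            s_next, r = env.step(s, a)            # hypothetical: returns (next state, reward)
            a_next = np.random.choice(nA, p=policy[s_next])
            # SARSA-style TD(0) update toward r + gamma * Q(s', a')
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
        # --- PMD step with KL mirror map: pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(eta * Q(s,a)) ---
        logits = np.log(policy + 1e-12) + eta * Q
        policy = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy /= policy.sum(axis=1, keepdims=True)
    return policy
```

With a KL mirror map, the PMD step reduces to the familiar multiplicative (softmax) update shown above; the paper's question is how many Markovian samples the TD-based evaluation needs for the overall scheme to reach $\varepsilon$-optimality.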
Key Takeaways
- •Investigates sample complexity of Policy Mirror Descent (PMD) with Temporal Difference (TD) learning under Markovian sampling.
- •Introduces Expected TD-PMD and Approximate TD-PMD algorithms.
- •Provides $\tilde{O}(\varepsilon^{-2})$ and $O(\varepsilon^{-2})$ sample complexity guarantees for average-time and last-iterate $\varepsilon$-optimality, respectively.
- •Addresses limitations of existing PMD sample complexity analyses by directly incorporating TD learning.
Reference
“The paper establishes $\tilde{O}(\varepsilon^{-2})$ and $O(\varepsilon^{-2})$ sample complexities for achieving average-time and last-iterate $\varepsilon$-optimality, respectively.”
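For readability, the two optimality criteria quoted above are usually formalized as follows; this is a standard interpretation supplied here, not notation taken from the paper (here $\rho$ denotes an initial-state distribution and $\pi^*$ an optimal policy):

```latex
% Average-time \varepsilon-optimality over T PMD iterations:
% the running average of the optimality gaps is at most \varepsilon.
\frac{1}{T}\sum_{t=1}^{T}\Bigl(V^{\pi^*}(\rho) - V^{\pi_t}(\rho)\Bigr) \le \varepsilon

% Last-iterate \varepsilon-optimality:
% the gap of the final policy \pi_T itself is at most \varepsilon.
V^{\pi^*}(\rho) - V^{\pi_T}(\rho) \le \varepsilon
```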