Gap-K%: A Novel Approach to Detecting Pretraining Data in Large Language Models
Analysis
This paper introduces Gap-K%, a method for detecting whether a given text was part of a large language model's (LLM's) pretraining data. The approach scores texts using the log-probability gap between the model's top-1 prediction and the actual target token at each position, and reports state-of-the-art detection performance.
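To make the core quantity concrete, here is a minimal sketch of the per-token gap between the model's top-1 log probability and the target token's log probability, using Hugging Face `transformers`. The model name (`gpt2`) is a placeholder and the exact gap definition (top-1 minus target) is an assumption for illustration, not the paper's reference implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration; not the model evaluated in the paper.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_gaps(text: str) -> torch.Tensor:
    """For each target token, return the gap between the model's top-1
    log probability and the log probability of the actual token.
    (Assumed definition: gap = top1_logprob - target_logprob, always >= 0.)"""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts next token
    targets = ids[:, 1:]                                    # shifted targets
    target_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    top1_lp = log_probs.max(dim=-1).values
    return (top1_lp - target_lp).squeeze(0)  # 0 when the target is the top-1

gaps = token_gaps("The quick brown fox jumps over the lazy dog.")
```

Intuitively, memorized pretraining text should show small gaps, since the model's top prediction tends to coincide with the true next token.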
Key Takeaways
- Gap-K% focuses on the discrepancy between the model's top-1 predictions and the target tokens.
- It uses a sliding-window strategy to capture local correlations between neighboring tokens (sketched after this list).
- The method achieves state-of-the-art performance on benchmark datasets.
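A hedged sketch of how the sliding-window and K% steps might combine: per-token gaps are averaged within fixed-size windows to capture local correlations, and the final score keeps the K% of windows with the smallest means, by analogy with the Min-K% Prob detector. The window size, selection direction, and decision rule below are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def gap_k_score(gaps: torch.Tensor, window: int = 8, k_pct: float = 20.0) -> float:
    """Illustrative Gap-K%-style score (assumed aggregation: windowed means,
    then average the k% smallest; smaller score => more likely a member)."""
    # Sliding-window means over per-token gaps to capture local correlation.
    win_means = gaps.unfold(0, window, 1).mean(dim=-1)   # (num_windows,)
    k = max(1, int(len(win_means) * k_pct / 100))
    lowest = torch.topk(win_means, k, largest=False).values
    return lowest.mean().item()

# Usage: lower scores suggest the text appeared in pretraining data;
# a concrete threshold would be calibrated on labeled member/non-member data.
score = gap_k_score(gaps)
```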
Reference / Citation
View Original"In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining."
ArXiv ML, Jan 29, 2026 05:00
* Cited for critical analysis under Article 32.