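Analysis
DeepSeek V4's architecture, particularly the Engram memory system, points to a potential breakthrough in large language model (LLM) technology. The prospect of significantly lower VRAM consumption and more stable inference across long context windows is exciting. If the leaked benchmark results are accurate, DeepSeek V4 could redefine industry standards.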
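News, research, and updates about benchmarks. Automatically curated by an AI engine.
"Foody said, “Gemini 3.1 Pro is now at the top of the APEX-Agents leaderboard,” adding that the model's impressive results show “how quickly agents are improving at real knowledge work.”"
"Gemini 3.1 Pro achieved an ARC-AGI-2 score of 77.1%, about 24% higher than GPT-5.2."
"Extensive evaluations show that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines."
"Intern-S1-Pro, an advanced open-source multimodal LLM for highly specialized scientific domains, was released on February 4 by the Shanghai AI Laboratory in China."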
"My question: what concrete criteria or benchmarks would allow us to choose between: (1) a multimodal LLM + post-training + tool-use will eventually cover the essentials vs (2) a non-generative world model architecture is needed to take a leap (prediction, constraints, physical interaction)"
"The study highlights the importance of creating robust metrics, paving the way for more accurate evaluations of AI's burgeoning abilities."
"The new Ryzen AI Max+ 392 has popped up on Geekbench with a single-core score of 2,917 points and a multi-core score of 18,071 points, posting impressive results across the board that match high-end desktop SKUs."
"A shift from static benchmarks to dynamic evaluations is a key requirement of modern AI systems."
"Marktechpost has released AI2025Dev, its 2025 analytics platform (available to AI Devs and Researchers without any signup or login) designed to convert the year’s AI activity into a queryable dataset spanning model releases, openness, training scale, benchmark performance, and ecosystem participants."
"Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison"