MiniMax developed MiniMax-M1, a large reasoning model leveraging a hybrid Mixture-of-Experts architecture with Lightning Attention to enable efficient scaling of test-time compute for extremely long contexts. It supports up to 1 million input tokens and 80,000 generation tokens while achieving substantial FLOPs reduction and competitive performance across various reasoning benchmarks.
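To make the attention side concrete, below is a minimal sketch of the causal linear-attention recurrence that lightning-style attention builds on; the feature map and single-head NumPy formulation are simplifying assumptions of this sketch, not the paper's block-wise kernel.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Toy causal linear attention: O(n * d^2) instead of O(n^2 * d).

    Keeps a running state S = sum_j phi(k_j) v_j^T and normalizer z = sum_j phi(k_j),
    so each query only touches the accumulated state, never the full history.
    """
    n, d = Q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map (an assumption)
    S = np.zeros((d, d))
    z = np.zeros(d)
    out = np.zeros_like(V, dtype=float)
    for t in range(n):
        k, v, q = phi(K[t]), V[t], phi(Q[t])
        S += np.outer(k, v)          # accumulate key-value interactions
        z += k                       # accumulate the normalizer
        out[t] = (q @ S) / (q @ z)   # attend using only the running state
    return out
```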
A new reinforcement learning fine-tuning method introduces entropy-based advantage shaping to enhance large language model reasoning, explicitly encouraging exploration during training. This approach leads to more robust, multi-step reasoning, achieving substantial gains in Pass@K scores on challenging mathematical benchmarks.
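As a rough illustration of shaping per-token advantages with an entropy term (the coefficient, clipping, and function names here are assumptions, not the paper's exact formulation):

```python
import torch

def entropy_shaped_advantages(advantages, logits, alpha=0.1, clip=2.0):
    """Add an entropy-based bonus to per-token advantages (values are illustrative).

    advantages: (batch, seq) per-token advantages from the RL objective
    logits:     (batch, seq, vocab) policy logits at each generated token
    High-entropy (exploratory) steps receive a small positive shaping term,
    so the optimizer is less eager to collapse them prematurely.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq)
    bonus = torch.clamp(alpha * entropy, max=clip)          # bound the shaping term
    return advantages + bonus.detach()                       # shape advantages, no gradient through the bonus
```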
NVIDIA researchers developed AceReason-Nemotron 1.1, a Qwen2.5-7B-based model, by systematically investigating the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance math and code reasoning. The model achieved new state-of-the-art performance for its size on challenging benchmarks like AIME25 (64.8%) and LiveCodeBench v6 (52.1%).
Researchers from Microsoft and UCLA developed Direct Reasoning Optimization (DRO), an RL-based framework that allows large language models to self-refine their reasoning for open-ended tasks by using a novel internal Reasoning Reflection Reward (R3). This method leverages the model's self-certainty about a reference outcome given its Chain-of-Thought, leading to superior reasoning quality on tasks like paragraph revision and comparable performance to explicit verifiers on structured QA.
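A minimal sketch of the self-certainty idea behind an R3-style reward, assuming a Hugging Face-style causal LM interface; the function name and the averaging choice are illustrative rather than the paper's exact reward:

```python
import torch

def reasoning_reflection_reward(model, tokenizer, prompt, chain_of_thought, reference):
    """Score a sampled chain of thought by how confidently the same model then
    predicts the reference outcome: reward = mean log p(reference | prompt, CoT)."""
    ctx_ids = tokenizer(prompt + chain_of_thought, return_tensors="pt").input_ids
    ref_ids = tokenizer(reference, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, ref_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                    # (1, total_len, vocab)
    # logits at position i predict token i+1, so these positions predict the reference tokens
    ref_logits = logits[:, ctx_ids.size(1) - 1 : -1, :]
    log_probs = torch.log_softmax(ref_logits, dim=-1)
    token_lp = log_probs.gather(-1, ref_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()                           # higher = CoT better supports the reference
```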
This survey details discrete diffusion models as an emerging alternative to autoregressive models in large language and multimodal contexts, demonstrating competitive performance with up to 10x faster inference speeds through parallel decoding. The work consolidates the field's mathematical foundations, key models, and advanced techniques, highlighting broad applicability across generative AI tasks.
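The parallel-decoding claim is easiest to see in the masked-diffusion setting; the sketch below is a generic iterative-unmasking loop, with the model interface and the unmasking schedule as assumptions rather than any specific model's decoder:

```python
import torch

def parallel_unmask_decode(model, length, mask_id, num_steps=8):
    """Start from an all-[MASK] sequence; at each step predict every position in
    parallel, commit the most confident predictions, and re-mask the rest."""
    seq = torch.full((1, length), mask_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = seq == mask_id
        if still_masked.sum() == 0:
            break
        logits = model(seq).logits                       # (1, length, vocab), all positions at once
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        # unmask a growing share of the remaining masked positions each step
        k = max(1, int(still_masked.sum().item() * (step + 1) / num_steps))
        conf = conf.masked_fill(~still_masked, -1.0)     # only compete among masked slots
        topk = conf.flatten().topk(k).indices
        seq.view(-1)[topk] = pred.view(-1)[topk]
    return seq
```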
Kuaishou's OneRec introduces an end-to-end generative encoder-decoder architecture for recommender systems, unifying retrieval and ranking to address limitations of traditional multi-stage approaches. This system achieves a 5.2x improvement in training MFU (Model FLOPs Utilization) and a 10.6% reduction in operating expenses compared to existing pipelines, demonstrating consistent gains in user engagement and business metrics in production.
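A toy sketch of the "generate items directly" idea, with layer sizes and a flat item-ID vocabulary as placeholders rather than OneRec's actual design:

```python
import torch.nn as nn

class ToyGenerativeRecommender(nn.Module):
    """Encoder reads the user's interaction history; the decoder autoregressively
    emits item tokens, replacing separate retrieve-then-rank stages."""
    def __init__(self, num_items=1000, d_model=64):
        super().__init__()
        self.item_embed = nn.Embedding(num_items, d_model)
        self.backbone = nn.Transformer(d_model=d_model, nhead=4,
                                       num_encoder_layers=2, num_decoder_layers=2,
                                       batch_first=True)
        self.head = nn.Linear(d_model, num_items)

    def forward(self, history_ids, slate_prefix_ids):
        src = self.item_embed(history_ids)         # (batch, hist_len, d_model)
        tgt = self.item_embed(slate_prefix_ids)    # (batch, slate_len, d_model)
        return self.head(self.backbone(src, tgt))  # logits over the item vocabulary
```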
Microsoft Research Asia introduces CoT-Pass@K, a new metric that redefines how reasoning in large language models is evaluated, demanding correctness in both the answer and the reasoning path. The study shows that Reinforcement Learning with Verifiable Rewards (RLVR) genuinely improves an LLM's logical reasoning quality, an improvement previously obscured by traditional Pass@K metrics, which credit correct final answers even when they are reached through flawed reasoning.
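A small sketch of how such a metric can reuse the standard unbiased Pass@K estimator while tightening what counts as a success (the variable names are mine, not the paper's):

```python
from math import comb

def cot_pass_at_k(n, c_strict, k):
    """Unbiased Pass@K estimator, but c_strict counts only samples whose final
    answer AND reasoning chain were both judged correct."""
    if n - c_strict < k:
        return 1.0
    return 1.0 - comb(n - c_strict, k) / comb(n, k)

# e.g. 16 samples, 6 with a correct answer but only 4 with sound reasoning:
# ordinary pass@4 uses c=6, while CoT-pass@4 uses c_strict=4 and is strictly lower.
```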
Show-o2 develops a native unified multimodal model capable of understanding and generating text, images, and videos through a 3D causal VAE space and a two-stage training pipeline. The model achieves state-of-the-art performance across multiple multimodal understanding and visual generation benchmarks.
Truncated Proximal Policy Optimization (T-PPO) significantly enhances the training efficiency of large language models for complex reasoning tasks by enabling progressive policy updates on incomplete trajectories. The method achieved up to 2.5x higher training efficiency, corresponding to a 60% reduction in wall-clock time, while reaching a pass@1 score of 62 on the AIME 2024 benchmark.
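One way to see why incomplete trajectories are usable is that a value estimate at the truncation point can bootstrap the rest of the return; the sketch below shows a generic truncated advantage computation under that assumption, not T-PPO's exact estimator:

```python
import numpy as np

def truncated_advantages(rewards, values, bootstrap_value, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over a rollout cut at a window boundary:
    the critic's value at the cut point bootstraps the remaining return,
    so policy updates can proceed before the full generation finishes."""
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, gae = bootstrap_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```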
This paper presents the Autoregressive U-Net (AU-Net), an architecture that processes language directly from raw bytes, learning hierarchical representations without a predefined vocabulary. AU-Net matches or surpasses the performance of BPE-based Transformer baselines on various NLP tasks, while demonstrating improved multilingual generalization and superior character-level manipulation capabilities.
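A toy sketch of the byte-to-coarser-unit pooling idea, where the random embedding table and the whitespace split rule are stand-ins for AU-Net's learned stages:

```python
import numpy as np

def byte_to_word_pooling(text, d_model=16, seed=0):
    """Embed raw bytes, then mean-pool the bytes between whitespace boundaries
    into coarser 'word-level' vectors for a deeper stage to process."""
    rng = np.random.default_rng(seed)
    byte_embed = rng.normal(size=(256, d_model))   # one vector per possible byte value
    raw = text.encode("utf-8")
    byte_vecs = byte_embed[list(raw)]              # (num_bytes, d_model)
    words, start = [], 0
    for i, b in enumerate(raw):
        if b == ord(" "):
            if i > start:
                words.append(byte_vecs[start:i].mean(axis=0))
            start = i + 1
    if start < len(raw):
        words.append(byte_vecs[start:].mean(axis=0))
    return np.stack(words)                          # (num_words, d_model)
```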