
Discover new, recommended papers

01 Aug 2025
reasoning · chain-of-thought · reinforcement-learning
ByteDance Seed AI4Math developed Seed-Prover and Seed-Geometry, a pair of automated theorem-proving systems for formal mathematics. Integrating large language models with Lean 4 and employing lemma-style proving and multi-tiered inference strategies, the system solves 5 of 6 problems from IMO 2025 and achieves 99.6% on the MiniF2F benchmark.
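To make "lemma-style proving" concrete, here is a toy illustration in Lean 4 of the general idea: the goal is decomposed into a named intermediate lemma that is proved once and reused, rather than being derived inline in one monolithic tactic block. The statements and names are ours, not from the Seed-Prover paper, and the sketch assumes a recent Lean 4 toolchain where the `omega` tactic is available.

```lean
-- Illustrative only: a helper lemma stated and proved on its own,
-- then reused by the main goal (the lemma-style decomposition).
theorem add_self_eq_two_mul (n : Nat) : n + n = 2 * n := by
  omega

-- The main statement invokes the helper instead of re-deriving it.
theorem sum_doubles (a b : Nat) : (a + a) + (b + b) = 2 * (a + b) := by
  rw [add_self_eq_two_mul a, add_self_eq_two_mul b]
  omega
```

At Seed-Prover's scale, the payoff of this style is that intermediate lemmas become checkable, cacheable units the search can verify and build on independently of the final proof.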
31 Jul 2025
synthetic-data · reasoning · chain-of-thought
CoT-Self-Instruct, developed by FAIR at Meta, introduces a method for generating high-quality synthetic data for Large Language Models by combining Chain-of-Thought reasoning for instruction creation with robust, automated filtering mechanisms. This approach enables models trained on the synthetic data to achieve superior performance on both reasoning and general instruction-following benchmarks, often surpassing existing synthetic methods and human-annotated datasets.
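A minimal sketch of the two-stage shape of such a pipeline follows. The prompt wording, the majority-vote self-consistency filter for verifiable prompts, and the thresholds are our illustrative assumptions, not the paper's exact recipe; `generate` stands in for any text-completion callable.

```python
from collections import Counter
from typing import Callable

def synthesize_instruction(generate: Callable[[str], str], seed: str) -> str:
    # Stage 1: have the model reason (chain-of-thought) about the seed
    # task before writing a new instruction of the same type.
    prompt = (
        "Here is an example task:\n"
        f"{seed}\n\n"
        "First, reason step by step about what makes this task useful and "
        "difficult. Then write ONE new task of the same type.\n"
        "New task:"
    )
    return generate(prompt)

def keep_by_self_consistency(
    generate: Callable[[str], str],
    instruction: str,
    n_samples: int = 8,
    min_agreement: float = 0.5,
) -> bool:
    # Stage 2: automated filtering. Sample several answers and keep the
    # instruction only if a majority converges on one final answer.
    answers = [generate(f"{instruction}\nFinal answer:") for _ in range(n_samples)]
    _, count = Counter(answers).most_common(1)[0]
    return count / n_samples >= min_agreement
```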
01 Aug 2025
generative-models · transformers · text-generation
DAEDAL, a training-free denoising strategy developed by researchers at The Chinese University of Hong Kong and Shanghai AI Laboratory, enables Diffusion Large Language Models (DLLMs) to dynamically adjust their output length during inference. This approach achieves performance comparable to or superior to meticulously tuned fixed-length baselines while significantly improving computational efficiency on mathematical reasoning and code generation tasks.
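The following schematic shows what a dynamic-length denoising loop of this kind could look like. The expansion signal used here (low model confidence that the sequence is complete at its current length) and the insertion rule (appending extra mask tokens mid-denoising) are our stand-ins for the criteria DAEDAL actually defines; vocabulary ids and the model interface are hypothetical.

```python
import torch

MASK_ID, EOS_ID = 0, 1  # hypothetical vocabulary ids

def daedal_style_denoise(model, init_len=32, max_len=512, steps=64, expand_by=32):
    seq = torch.full((1, init_len), MASK_ID)
    for _ in range(steps):
        logits = model(seq)                      # assumed shape (1, L, vocab)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)
        # Unmask the positions the model is currently most confident about.
        masked = seq.eq(MASK_ID)
        if masked.any():
            k = max(1, int(masked.sum()) // 4)
            idx = torch.topk(conf.masked_fill(~masked, -1.0), k, dim=-1).indices
            seq.scatter_(1, idx, pred.gather(1, idx))
        # Dynamic length: if the model doubts the answer fits, grow the canvas
        # by appending fresh mask tokens instead of restarting at a new length.
        eos_conf = probs[0, -1, EOS_ID]
        if eos_conf < 0.5 and seq.shape[1] + expand_by <= max_len:
            pad = torch.full((1, expand_by), MASK_ID)
            seq = torch.cat([seq, pad], dim=1)
    return seq
```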
29 Jul 2025
transformers · representation-learning · mechanistic-interpretability
Researchers at Anthropic introduce an automated pipeline for extracting "persona vectors" from large language models' activation spaces, enabling both the monitoring and causal control of character traits. They demonstrate that these vectors can prevent undesirable persona shifts during finetuning and effectively screen training data to predict and mitigate the induction of negative traits like evil, sycophancy, or hallucination.
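The core extraction step follows the broader activation-steering literature: contrast activations from trait-eliciting and neutral prompts and take the mean difference as the persona direction. The sketch below is our simplified reading of that pipeline; function names are ours, and it assumes you can collect per-prompt residual-stream activations at one layer.

```python
import torch

def persona_vector(acts_with_trait: torch.Tensor,
                   acts_without_trait: torch.Tensor) -> torch.Tensor:
    # acts_*: (n_prompts, d_model) activations collected under
    # trait-eliciting vs. neutral system prompts. The persona
    # direction is the normalized mean difference.
    v = acts_with_trait.mean(0) - acts_without_trait.mean(0)
    return v / v.norm()

def steer(hidden: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    # Add (alpha > 0) or subtract (alpha < 0) the direction during the
    # forward pass to amplify or suppress the trait; projecting training
    # data onto v gives the screening signal described in the summary.
    return hidden + alpha * v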
01 Aug 2025
cs.RO
Researchers from Columbia University and Toyota Research Institute developed a method that treats pre-trained video generative models as robot policies: the robot learns tasks by synthesizing future visual states and decoding actions from those predicted frames. This improves generalization and significantly reduces the need for action-annotated data.
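Abstracting the control loop, the policy "imagines" a short clip of the task being completed and then recovers actions from consecutive predicted frames. The interfaces below (`video_model.generate`, the inverse-dynamics `action_decoder`) are hypothetical stand-ins for the paper's components.

```python
import torch

def act(video_model, action_decoder, obs: torch.Tensor, task: str,
        horizon: int = 8) -> torch.Tensor:
    # 1) Imagine the future: synthesize frames of the task being done,
    #    conditioned on the current observation and a task prompt.
    frames = video_model.generate(first_frame=obs, prompt=task, num_frames=horizon)
    # 2) Decode actions from pairs of consecutive predicted frames with a
    #    small inverse-dynamics model (the only action-supervised part).
    actions = [action_decoder(frames[t], frames[t + 1]) for t in range(horizon - 1)]
    return torch.stack(actions)  # executed open-loop or replanned each step
```

Because action labels are needed only for the lightweight decoder, the video backbone can be pre-trained on ordinary, action-free video, which is where the data-efficiency gain comes from.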
01 Aug 2025
vision-language-models · inference-optimization · model-compression
HiPrune introduces a training-free and model-agnostic visual token pruning method for Vision-Language Models (VLMs) by leveraging the hierarchical attention within vision encoders. It achieves significant computational savings, such as an 8.8x FLOPs reduction for LLaVA-NeXT-7B, while preserving or even slightly boosting VLM performance, retaining over 99% of original accuracy on various benchmarks with up to 78% fewer tokens.
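HiPrune's actual criterion exploits hierarchical attention across encoder layers; the sketch below reduces it to a single-layer [CLS]-attention top-k rule just to show the training-free shape of the operation. The 22% keep ratio mirrors the "up to 78% fewer tokens" figure above.

```python
import torch

def prune_tokens(vis_tokens: torch.Tensor, cls_attn: torch.Tensor,
                 keep_ratio: float = 0.22) -> torch.Tensor:
    # vis_tokens: (n, d) patch tokens from the vision encoder.
    # cls_attn:   (n,) attention the [CLS] token pays to each patch
    #             in a chosen encoder layer (a proxy for informativeness).
    k = max(1, int(keep_ratio * vis_tokens.shape[0]))
    keep = torch.topk(cls_attn, k).indices.sort().values  # preserve spatial order
    return vis_tokens[keep]
```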
30 Jul 2025
reinforcement-learning · deep-reinforcement-learning · agents
RLVMR, developed by Tencent, trains Large Language Model agents to perform complex, long-horizon tasks by providing dense, verifiable meta-reasoning rewards during reinforcement learning. This approach leads to enhanced task success and generalization while significantly reducing inefficient exploration, such as repetitive and invalid actions, on benchmarks like ALFWorld and ScienceWorld.
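To give a feel for what "dense, verifiable" rewards mean here, the toy shaping function below penalizes behaviors that can be checked mechanically (invalid or repeated actions) and rewards checkable reasoning steps. The specific terms and weights are our illustration, not RLVMR's exact reward definition.

```python
def step_reward(action: str, history: list[str], valid_actions: set[str],
                tagged_reasoning: bool) -> float:
    r = 0.0
    if action not in valid_actions:
        r -= 0.5    # verifiably invalid action in the environment
    if history and action == history[-1]:
        r -= 0.25   # verifiably repetitive action (wasted exploration)
    if tagged_reasoning:
        r += 0.1    # agent emitted a checkable plan/reflection step
    return r        # added per step to the sparse task-success reward
```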
04 Aug 2025
generative-models · transformers · multi-modal-learning
We present AudioGen-Omni, a unified approach based on multimodal diffusion transformers (MMDit) capable of generating high-fidelity audio, speech, and songs coherently synchronized with an input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, yielding a model that generates semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. A unified lyrics-transcription encoder encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations, which are fused by an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni avoids the semantic constraints of text-frozen paradigms and enables effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on text-to-audio, text-to-speech, and text-to-song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
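One way to read the PAAPI idea is that rotary position embeddings are applied only to streams with a meaningful time axis (video and audio frames), keeping cross-modal attention phase-aligned in time while other tokens stay unrotated. The sketch below shows that selective application only; the rotary math is standard, and all implementation details beyond "RoPE on temporal modalities" are our assumptions.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Standard rotary embedding over the last dim; positions = frame index.
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)
    ang = torch.arange(t, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def position_infuse(tokens: torch.Tensor, is_temporal: bool) -> torch.Tensor:
    # Rotate only modalities with a temporal structure (video/audio frames);
    # leave non-temporal token streams unrotated.
    return rope(tokens) if is_temporal else tokens
```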
01 Aug 2025
agents · agentic-frameworks · continual-learning
This survey provides a comprehensive analysis of self-evolving agents, outlining their components, mechanisms, temporal aspects, and application domains. It proposes a structured framework for understanding how agents can continuously learn and adapt, addressing the limitations of static AI models and offering tailored evaluation methodologies for these dynamic systems.
30 Jul 2025
few-shot-learning · transformers · reasoning
Researchers at the University of Maryland, College Park, empirically demonstrated a "demos' position in prompt" (DPP) bias in large language models, showing that merely repositioning in-context demonstrations within a prompt can shift accuracy by up to 20 percentage points and flip nearly half of a model's predictions. The study reveals a consistent advantage for early demo placements and finds that placing demonstrations at the very end of the user message often degrades performance.
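The manipulation itself is simple to reproduce: hold the demos and query fixed and vary only where the demo block sits. The four placements below (start/end of the system message, start/end of the user message) mirror the kind of repositioning the study describes; the exact prompt templates are ours.

```python
def build_messages(demos: str, query: str, position: str) -> list[dict]:
    sys = "You are a helpful assistant."
    placements = {
        "sys_start":  ([{"role": "system", "content": demos + "\n" + sys}],
                       query),
        "sys_end":    ([{"role": "system", "content": sys + "\n" + demos}],
                       query),
        "user_start": ([{"role": "system", "content": sys}],
                       demos + "\n" + query),
        "user_end":   ([{"role": "system", "content": sys}],
                       query + "\n" + demos),  # the placement found weakest
    }
    system_msgs, user_content = placements[position]
    return system_msgs + [{"role": "user", "content": user_content}]
```

Sweeping `position` over the four keys while scoring a fixed evaluation set is enough to surface the accuracy swings the paper reports.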