alphaXiv

Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

Papers Benchmarks Models

generative-models

2,735

01 Nov 2025

generative-models attention-mechanisms computer-science

Kimi Linear: An Expressive, Efficient Attention Architecture

Moonshot AI

Kimi Linear introduces a hybrid attention architecture combining a novel linear attention module with full attention layers, providing superior performance and efficiency for large language models. The architecture achieved up to 6x faster decoding throughput and 75% less KV cache usage compared to full attention baselines, while consistently matching or surpassing their quality across various tasks.
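The KV cache figure above can be illustrated with simple arithmetic. The sketch below assumes a hybrid stack in which three linear-attention layers are interleaved with each full-attention layer, and that only the full-attention layers keep a cache that grows with sequence length. That 3:1 ratio and the function name `kv_cache_fraction` are assumptions for illustration; the summary itself does not state the layer ratio.

```python
def kv_cache_fraction(linear_layers_per_full_layer: int) -> float:
    """Fraction of a full-attention model's KV cache kept by a hybrid stack.

    Assumption: linear-attention layers hold a fixed-size state (ignored here),
    so only the full-attention layers contribute a length-dependent cache.
    """
    return 1.0 / (linear_layers_per_full_layer + 1)


# Hypothetical 3:1 interleaving of linear to full-attention layers.
fraction_kept = kv_cache_fraction(3)
reduction = 1.0 - fraction_kept
print(f"cache kept: {fraction_kept:.0%}, reduction: {reduction:.0%}")
```

Under this assumed ratio, one layer in four keeps a growing cache, which is consistent with the 75% reduction quoted in the summary.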

270

1,197

31 Oct 2025

generative-models computer-science artificial-intelligence

Continuous Autoregressive Language Models

Continuous Autoregressive Language Models (CALM) replace discrete next-token prediction with continuous next-vector prediction to enhance efficiency in Large Language Models. A CALM-M model with 371M parameters achieved comparable performance to a 281M Transformer-S baseline, while reducing training FLOPs by 44% and inference FLOPs by 34%.

934

30 Oct 2025

generative-models computer-science computer-vision-and-pattern-recognition

Emu3.5: Native Multimodal Models are World Learners

BAAI

BAAI's Emu3.5 introduces a native multimodal model trained extensively on internet videos, capable of understanding and generating interleaved vision-language sequences across long horizons. The model achieves strong performance in various multimodal tasks, including complex image generation with accurate text rendering and embodied AI scenarios, while accelerating image inference by approximately 20x using a novel Discrete Diffusion Adaptation (DiDA) technique.

555

223

30 Oct 2025

generative-models agents autonomous-vehicles

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

NVIDIA's Alpamayo-R1 (AR1) introduces a vision-language-action model that integrates causally-grounded reasoning with trajectory planning for autonomous driving. This approach enhances performance and safety in complex long-tail scenarios, achieving real-time inference and improved decision-making consistency.

489

28 Oct 2025

generative-models computer-science computer-vision-and-pattern-recognition

Uniform Discrete Diffusion with Metric Path for Video Generation

Chinese Academy of Sciences

Zhejiang University; Beijing Academy of Artificial Intelligence; National Laboratory of Pattern Recognition, CASIA; Key Laboratory of Intelligent Information Processing, ICT, CAS

URSA presents a uniform discrete diffusion framework that incorporates a metric probability path for video generation, enabling iterative global refinement in discrete token space. This framework achieves performance competitive with state-of-the-art continuous diffusion models across text-to-video, image-to-video, and text-to-image benchmarks, while enhancing scalability and multi-task capabilities.

289

29 Oct 2025

generative-models computer-science machine-learning

$\pi_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

πRL introduces an open-source framework that enables online reinforcement learning (RL) for flow-based Vision-Language-Action (VLA) models, a class previously incompatible with policy gradient methods due to technical challenges. The framework achieves substantial performance gains over supervised fine-tuning baselines, improving success rates on benchmarks like LIBERO and ManiSkill, particularly in few-shot and multi-task scenarios.

890

29 Oct 2025

generative-models chain-of-thought computer-science

RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

RAVR, a framework developed by Zhejiang University and Alibaba Group, enhances Large Language Model reasoning by employing a reference-answer-guided variational approach. It leverages answer-conditioned reasoning to efficiently explore high-utility reasoning paths, leading to consistent performance improvements across general and mathematical reasoning tasks, and more robust reasoning behaviors.

117

28 Oct 2025

generative-models autonomous-vehicles computer-science

World Simulation with Video Foundation Models for Physical AI

NVIDIA researchers developed Cosmos-Predict2.5 and Cosmos-Transfer2.5, video foundation models tailored for high-fidelity world simulation in Physical AI applications. These models achieve competitive performance against larger alternatives in video generation benchmarks and enable policies trained with synthetic data to achieve 24/30 successes in real-robot manipulation tasks under novel conditions.

288

575

28 Oct 2025

generative-models computer-science continual-learning

SPICE: Self-Play In Corpus Environments Improves Reasoning

National University of Singapore

Meta FAIR at Meta

Chuanyang Jin

A research team from FAIR at Meta and NUS developed SPICE, a reinforcement learning framework that enables large language models to continuously enhance their reasoning abilities through adversarial self-play grounded in a vast external document corpus. The method consistently improved performance on mathematical and general reasoning benchmarks by up to 11.9% compared to base models.

1,567

24 Oct 2025

generative-models computer-science artificial-intelligence

The Principles of Diffusion Models

Authored by leading researchers from Sony AI, OpenAI, and Stanford, this monograph synthesizes the rapidly evolving field of diffusion models by clarifying their theoretical foundations and unifying diverse formulations into a single continuous-time generative framework. It systematically covers the origins, unifies variational, score-based, and flow-based perspectives, and outlines advancements in sampling and generation techniques.

10 Nov 2025

generative-models agents computer-science

A Step Toward World Models: A Survey on Robotic Manipulation

Autonomous agents are increasingly expected to operate in complex, dynamic, and uncertain environments, performing tasks such as manipulation, navigation, and decision-making. Achieving these capabilities requires agents to understand the underlying mechanisms and dynamics of the world, moving beyond reactive control or simple replication of observed states. This motivates the development of world models as internal representations that encode environmental states, capture dynamics, and support prediction, planning, and reasoning. Despite growing interest, the definition, scope, architectures, and essential capabilities of world models remain ambiguous. In this survey, we go beyond prescribing a fixed definition and limiting our scope to methods explicitly labeled as world models. Instead, we examine approaches that exhibit the core capabilities of world models through a review of methods in robotic manipulation. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a fully realized world model should possess. Building on this analysis, we aim to motivate further development toward generalizable and practical world models for robotics.

219

31 Oct 2025

generative-models computer-science artificial-intelligence

The End of Manual Decoding: Towards Truly End-to-End Language Models

The Chinese University of Hong Kong, Shenzhen; Tencent AI Lab

The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
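To make the decoding step concrete, here is a minimal sketch of temperature scaling followed by top-p (nucleus) filtering, applied with per-step values. It is not the AutoDeco implementation: the function names are hypothetical, and the `temperature` and `top_p` arguments simply stand in for whatever the model's lightweight heads would predict at that step.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature, top_p, rng=random):
    """Sample one token id using step-specific temperature and top-p.

    In the setting described above, both values would come from prediction
    heads rather than being fixed by hand (assumed here as plain arguments).
    """
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# A low temperature and small top-p collapse sampling onto the top token.
token = sample_token([5.0, 1.0, 0.0], temperature=0.5, top_p=0.1,
                     rng=random.Random(0))
print(token)
```

With static decoding, `temperature` and `top_p` are chosen once for the whole generation; the approach summarized above instead varies them token by token.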

220

28 Oct 2025

generative-models computer-science artificial-intelligence

MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation

An open-source framework, MiniOneRec, validates large language model-like scaling laws for generative recommendation on public benchmarks while providing an efficient post-training recipe that achieves superior performance and transferability over existing methods.

325

28 Oct 2025

generative-models computer-science artificial-intelligence

An efficient probabilistic hardware architecture for diffusion-like models

Extropic Corporation presents a probabilistic hardware architecture that integrates Denoising Thermodynamic Models (DTMs) with a CMOS-compatible, all-transistor random number generator, achieving an estimated 10,000-fold energy reduction compared to GPU-based diffusion models for generative tasks.

29 Oct 2025

generative-models computer-science artificial-intelligence

LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency

LGCC enhances flow matching-based text-guided image editing by introducing local Gaussian coupling and a content consistency loss, leading to improved detail preservation and reduced over-editing. The framework achieves a 1.60% improvement in local detail scores and a 2x-5x speedup in inference time compared to prior state-of-the-art models.

29 Oct 2025

generative-models ai-for-health computer-science

Retrieval-Augmented Multimodal Depression Detection

A Retrieval-Augmented Generation (RAG) framework enhances multimodal depression detection by dynamically retrieving semantically relevant emotional content and employing a large language model to create an "Emotion Prompt" as an auxiliary modality. This method establishes new state-of-the-art results on the AVEC 2019 dataset, achieving a Concordance Correlation Coefficient (CCC) of 0.593 and a Mean Absolute Error (MAE) of 3.95.

29 Oct 2025

generative-models computer-science artificial-intelligence

Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

BRAIN-IT presents a framework for reconstructing images from fMRI, achieving state-of-the-art performance by producing reconstructions that are both semantically accurate and structurally faithful to the perceived images. The method also demonstrates highly efficient transfer learning, enabling high-quality reconstructions from as little as 1 hour of fMRI data from a new subject, significantly reducing data requirements.

29 Oct 2025

generative-models computer-science computer-vision-and-pattern-recognition

SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

Harvard University; Institut Polytechnique de Paris; École des Ponts; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Harvard AI and Robotics Lab

Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at this https URL.

31 Oct 2025

generative-models ai-for-health computer-science

MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design

The MOLCHORD framework from USTC and Zhongguancun Academy, a 4-billion-parameter system, performs structure-based drug design by aligning protein structures and sequences. It achieved a state-of-the-art Vina Dock score of -9.29 and 83.7% High Affinity on the CrossDocked2020 dataset, while also maintaining desirable drug-likeness and synthetic accessibility.

31 Oct 2025

generative-models computer-science artificial-intelligence

EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities

EBT-Policy, an implicit policy for visuomotor control utilizing Energy-Based Transformers, achieves superior performance and significantly reduces computational cost compared to diffusion-based policies. The method demonstrates emergent zero-shot recovery from failed action sequences in robotic manipulation tasks.
