Vincent's Arxiv FrontPage


Generated on 2025-05-23.


This frontpage is built by scraping arXiv and running a sentence model that detects whether an abstract describes a paper on a topic of interest. One cool feature: it pretty much all runs via GitHub Actions.
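As a rough illustration of the sentence-level detection described above, here is a minimal sketch. It assumes the abstract is split into sentences and each sentence is scored independently, with the score shown next to any sentence that clears a threshold. The splitter, the `keyword_score` function (a hypothetical stand-in for the real sentence model), and the threshold value are assumptions for illustration, not the actual pipeline.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(abstract: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]


def keyword_score(sentence: str) -> float:
    # Hypothetical stand-in for the sentence model:
    # the fraction of cue words that occur in the sentence.
    cues = ("dataset", "benchmark", "release")
    text = sentence.lower()
    return sum(cue in text for cue in cues) / len(cues)


def flag_sentences(
    abstract: str,
    score: Callable[[str], float] = keyword_score,
    threshold: float = 0.3,  # assumed cut-off, not the real one
) -> List[Tuple[str, float]]:
    # Keep only the sentences whose score reaches the threshold.
    flagged = []
    for sentence in split_sentences(abstract):
        value = score(sentence)
        if value >= threshold:
            flagged.append((sentence, round(value, 3)))
    return flagged


if __name__ == "__main__":
    demo = "We study planning. We release a new dataset and benchmark."
    for sentence, value in flag_sentences(demo):
        print(f"{value:.3f}  {sentence}")
```

Swapping `keyword_score` for a trained classifier would leave the rest of the sketch unchanged, since it only needs a function that maps a sentence to a number.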


New Datasets

2025-05-22

ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust, and Reasoning

Federated Learning (FL) has emerged as a promising paradigm for collaborative model training while preserving data privacy across decentralized participants. As FL adoption grows, numerous techniques have been proposed to tackle its practical challenges. However, the lack of standardized evaluation across key dimensions hampers systematic progress and fair comparison of FL methods. In this work, we introduce ATR-Bench, a unified framework for analyzing federated learning through three foundational dimensions: Adaptation, Trust, and Reasoning. We provide an in-depth examination of the conceptual foundations, task formulations, and open research challenges associated with each theme. We have extensively benchmarked representative methods and datasets for adaptation to heterogeneous clients and trustworthiness in adversarial or unreliable environments. Due to the lack of reliable metrics and models for reasoning in FL, we only provide literature-driven insights for this dimension. ATR-Bench lays the groundwork for a systematic and holistic evaluation of federated learning with real-world relevance. We will make our complete codebase publicly accessible, along with a curated repository that continuously tracks new developments and research in the FL literature. 0.731

link

2025-05-22

PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues

Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data. 0.751

link

2025-05-22

A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization

Machine learning (ML) has demonstrated considerable potential in supporting model design and optimization for combinatorial optimization (CO) problems. However, much of the progress to date has been evaluated on small-scale, synthetic datasets, raising concerns about the practical effectiveness of ML-based solvers in real-world, large-scale CO scenarios. Additionally, many existing CO benchmarks lack sufficient training data, limiting their utility for evaluating data-driven approaches. To address these limitations, we introduce FrontierCO, a comprehensive benchmark that covers eight canonical CO problem types and evaluates 16 representative ML-based solvers -- including graph neural networks and large language model (LLM) agents. FrontierCO features challenging instances drawn from industrial applications and frontier CO research, offering both realistic problem difficulty and abundant training data. Our empirical results provide critical insights into the strengths and limitations of current ML methods, helping to guide more robust and practically relevant advances at the intersection of machine learning and combinatorial optimization. Our data is available at https://huggingface.co/datasets/CO-Bench/FrontierCO. 0.868

link

2025-05-22

MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning

Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA both at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multiple-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. 0.897 We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.

link

2025-05-22

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA $\leftrightarrow$ HIP) and assembly-level (Nvidia SASS $\leftrightarrow$ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. 0.797 Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on HuggingFace (https://huggingface.co/datasets/MBZUAI/cass), with code on GitHub (https://github.com/GustavoStahl/CASS).

link

2025-05-22

Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation

Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. 0.8 Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at https://github.com/mona4399/FeatureMixing.

link

2025-05-22

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls -- particularly in multi-turn conversations -- remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. 0.795 T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.

link

2025-05-22

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. 0.9 Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

link

2025-05-22

SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1. 0.725

link

2025-05-21

Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking

Malocclusion is a major challenge in orthodontics, and its complex presentation and diverse clinical manifestations make accurate localization and diagnosis particularly important. Currently, one of the major shortcomings facing the field of dental image analysis is the lack of large-scale, accurately labeled datasets dedicated to malocclusion issues, which limits the development of automated diagnostics in the field of dentistry and leads to a lack of diagnostic accuracy and efficiency in clinical practice. Therefore, in this study, we propose the Oral and Maxillofacial Natural Images (OMNI) dataset, a novel and comprehensive dental image dataset aimed at advancing the study of analyzing dental images for issues of malocclusion. Specifically, the dataset contains 4166 multi-view images collected from 384 participants and annotated by professional dentists. 0.802 In addition, we performed a comprehensive validation of the created OMNI dataset, including three CNN-based methods, two Transformer-based methods, and one GNN-based method, and conducted automated diagnostic experiments for malocclusion issues. The experimental results show that the OMNI dataset can facilitate the automated diagnosis research of malocclusion issues and provide a new benchmark for the research in this field. Our OMNI dataset and baseline code are publicly available at https://github.com/RoundFaceJ/OMNI. 0.758

link

2025-05-21

FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models

Fine-grained detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision methods often rely on costly pixel-level annotations; and (3) No large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. 0.724 Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.

link

2025-05-21

RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction

Reconstructing high-fidelity underwater scenes remains a challenging task due to light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and geometric accuracy of deep underwater rendering. We propose decoupled learning for RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. Additionally, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. 0.837 Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods with PSNR gains up to 1.90dB, delivering superior perceptual quality and robustness, and offering promising directions for marine robotics and underwater visual analytics.

link

2025-05-21

Who "Controls" Where Work Shall be Done? State-of-Practice in Post-Pandemic Remote Work Regulation

The COVID-19 pandemic has permanently altered workplace structures, making remote work a widespread practice. While many employees advocate for flexibility, many employers reconsider their attitude toward remote work and opt for structured return-to-office mandates. Media headlines repeatedly emphasize that the corporate world is returning to full-time office work. This study examines how companies employing software engineers and supporting roles regulate work location, whether corporate policies have evolved in the last five years, and, if so, how and why. We collected data on remote work regulation from corporate HR and/or management representatives from 68 corporate entities that vary in size, location, and orientation towards remote or office work. 0.747 Our findings reveal that although many companies prioritize office-centred working (50%), most companies in our sample permit hybrid working to varying degrees (85%). Remote work regulation does not reveal any particular new "best practice" as policies differ greatly, but the single most popular arrangement was three in-office days per week. More than half of the companies (51%) encourage or mandate office days, and more than a quarter (28%) have changed regulations, gradually increasing the mandatory office presence or implementing differentiated conditions. Although no companies have increased flexibility, only four companies are returning to full-time office work. Our key recommendation for office-oriented companies is to consider a trust-based alternative to strict office presence mandates, while for companies oriented toward remote working, we warn about the points of no (or hard) return. Finally, the current state of policies is clearly not final, as companies continue to experiment and adjust their work regulation.

link

2025-05-21

Long-Form Information Alignment Evaluation Beyond Atomic Facts

Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore. 0.741

link

2025-05-21

Long LEM Query in BWT-Runs Space

In this paper, we describe a new type of match between a pattern and a text that isn't necessarily maximal in the query, but still contains useful matching information: locally maximal exact matches (LEMs). There are usually a large number of LEMs, so we only consider those above some length threshold $\mathcal{L}$. These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore, efficient algorithms for finding long LEMs are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. 0.739 Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an $O(m+occ)$ expected time algorithm that relies on an $O(r)$ words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here $m$ is the length of the query, $occ$ is the number of long LEMs outputted, and $r$ is the number of runs in the BWT of the text. The $O(r)$ space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei. We are able to support $LCP[i]$ queries in constant time given $SA[i]$. In other words, we answer $PLCP[i]$ queries in constant time. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.

link

2025-05-20

Automated, Cross-Layer Root Cause Analysis of 5G Video-Conferencing Quality Degradation

5G wireless networks are complex, leveraging layers of scheduling, retransmission, and adaptation mechanisms to maximize their efficiency. But these mechanisms interact to produce significant fluctuations in uplink and downlink capacity and latency. This markedly impacts the performance of real-time applications, such as video-conferencing, which are particularly sensitive to such fluctuations, resulting in lag, stuttering, distorted audio, and low video quality. This paper presents a cross-layer view of 5G networks and their impact on and interaction with video-conferencing applications. We conduct novel, detailed measurements of both Private CBRS and commercial carrier cellular network dynamics, capturing physical- and link-layer events and correlating them with their effects at the network and transport layers, and the video-conferencing application itself. Our two datasets comprise days of low-rate campus-wide Zoom telemetry data, and hours of high-rate, correlated WebRTC-network-5G telemetry data. 0.768 Based on these data, we trace performance anomalies back to root causes, identifying 24 previously unknown causal event chains that degrade 5G video conferencing. Armed with this knowledge, we build Domino, a tool that automates this process and is user-extensible to future wireless networks and interactive applications.

link

2025-05-20

GUARD: Constructing Realistic Two-Player Matrix and Security Games for Benchmarking Game-Theoretic Algorithms

Game-theoretic algorithms are commonly benchmarked on recreational games, classical constructs from economic theory such as congestion and dispersion games, or entirely random game instances. While the past two decades have seen the rise of security games -- grounded in real-world scenarios like patrolling and infrastructure protection -- their practical evaluation has been hindered by limited access to the datasets used to generate them. In particular, although the structural components of these games (e.g., patrol paths derived from maps) can be replicated, the critical data defining target values -- central to utility modeling -- remain inaccessible. In this paper, we introduce a flexible framework that leverages open-access datasets to generate realistic matrix and security game instances. 0.703 These include animal movement data for modeling anti-poaching scenarios and demographic and infrastructure data for infrastructure protection. 0.774 Our framework allows users to customize utility functions and game parameters, while also offering a suite of preconfigured instances. We provide theoretical results highlighting the degeneracy and limitations of benchmarking on random games, and empirically compare our generated games against random baselines across a variety of standard algorithms for computing Nash and Stackelberg equilibria, including linear programming, incremental strategy generation, and self-play with no-regret learners.

link

2025-05-20

R2MED: A Benchmark for Reasoning-Driven Medical Retrieval

Current medical retrieval benchmarks primarily emphasize lexical or shallow semantic similarity, overlooking the reasoning-intensive demands that are central to clinical decision-making. In practice, physicians often retrieve authoritative medical evidence to support diagnostic hypotheses. Such evidence typically aligns with an inferred diagnosis rather than the surface form of a patient's symptoms, leading to low lexical or semantic overlap between queries and relevant documents. To address this gap, we introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval. These tasks are drawn from five representative medical scenarios and twelve body systems, capturing the complexity and diversity of real-world medical information needs. We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10, demonstrating the benchmark's difficulty. Classical re-ranking and generation-augmented retrieval methods offer only modest improvements. Although large reasoning models improve performance via intermediate inference generation, the best results still peak at 41.4 nDCG@10. These findings underscore a substantial gap between current retrieval techniques and the reasoning demands of real clinical tasks. We release R2MED as a challenging benchmark to foster the development of next-generation medical retrieval systems with enhanced reasoning capabilities. Data and code are available at https://github.com/R2MED/R2MED 0.701

link
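
For readers unfamiliar with the nDCG@10 figures quoted in the R2MED abstract, the metric can be sketched as follows. This is the standard textbook definition with linear gain, not code from the paper. It assumes the input list holds the graded relevance of every judged document for a query, in the order the system ranked them; benchmark tables usually report the value multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    # Discounted cumulative gain: each gain is divided by log2(rank + 1),
    # with ranks starting at 1.
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    # Normalize by the DCG of the ideal (descending) ordering.
    # Assumption: `relevances` covers all judged documents for the query.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # One relevant document, ranked second instead of first.
    print(round(100 * ndcg_at_k([0, 1, 0]), 1))
```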

2025-05-20

3D Reconstruction from Sketches

We consider the problem of reconstructing a 3D scene from multiple sketches. We propose a pipeline which involves (1) stitching together multiple sketches through use of correspondence points, (2) converting the stitched sketch into a realistic image using a CycleGAN, and (3) estimating that image's depth-map using a pre-trained convolutional neural network based architecture called MegaDepth. Our contribution includes constructing a dataset of image-sketch pairs, the images for which are from the Zurich Building Database, and sketches have been generated by us. 0.912 We use this dataset to train a CycleGAN for our pipeline's second step. We end up with a stitching process that does not generalize well to real drawings, but the rest of the pipeline that creates a 3D reconstruction from a single sketch performs quite well on a wide variety of drawings.

link

2025-05-20

KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models

Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food-related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generate recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities and retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe-related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL. 0.733

link

2025-05-20

Beyond Words: Multimodal LLM Knows When to Speak

While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. 0.831 This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.

link

2025-05-20

UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept $\langle bo\rangle$, generating "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of its hat. We call this kind of generation personalized knowledge-driven generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: https://github.com/arctanxarc/UniCTokens. 0.711

link

2025-05-20

Language Models use Lookbacks to Track Beliefs

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality?This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs.We analyze Llama-3-70B-Instruct's ability to reason about characters' beliefs using causal mediation and abstraction.We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other's actions. 0.734Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary.The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs) in low rank subspaces of the state token's residual stream.When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI and then an answer lookback retrieves the state token.When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs.In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs.Our work provides insights into the LM's belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.

link

2025-05-19

CHAD-KG: A Knowledge Graph for Representing Cultural Heritage Objects and Digitisation Paradata

This paper presents CHAD-KG, a knowledge graph designed to describe bibliographic metadata and digitisation paradata of cultural heritage objects in exhibitions, museums, and collections. 0.765 It also documents the related data model and materialisation engine. Originally based on two tabular datasets, the data was converted into RDF according to CHAD-AP, an OWL application profile built on standards like CIDOC-CRM, LRMoo, CRMdig, and Getty AAT. A reproducible pipeline, developed with a Morph-KGC extension, was used to generate the graph. CHAD-KG now serves as the main metadata source for the Digital Twin of the temporary exhibition titled "The Other Renaissance - Ulisse Aldrovandi and The Wonders Of The World", and other collections related to the digitisation work under development in a nationwide funded project, i.e. Project CHANGES (https://fondazionechanges.org). 0.79 To ensure accessibility and reuse, it offers a SPARQL endpoint, a user interface, and open documentation, and is published on Zenodo under a CC0 license. The project improves the semantic interoperability of cultural heritage data, with future work aiming to extend the data model and materialisation pipeline to better capture the complexities of acquisition and digitisation, further enrich the dataset and broaden its relevance to similar initiatives.

link

2025-05-19

I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models

Large language models are increasingly integrated into news recommendation systems, raising concerns about their role in spreading misinformation.In humans, visual content is known to boost credibility and shareability of information, yet its effect on vision-language models (VLMs) remains unclear.We present the first study examining how images influence VLMs' propensity to reshare news content, whether this effect varies across model families, and how persona conditioning and content attributes modulate this behavior.To support this analysis, we introduce two methodological contributions: a jailbreaking-inspired prompting strategy that elicits resharing decisions from VLMs while simulating users with antisocial traits and political alignments; and a multimodal dataset of fact-checked political news from PolitiFact, paired with corresponding images and ground-truth veracity labels.Experiments across model families reveal that image presence increases resharing rates by 4.8% for true news and 15.0% for false news.Persona conditioning further modulates this effect: Dark Triad traits amplify resharing of false news, whereas Republican-aligned profiles exhibit reduced veracity sensitivity.Of all the tested models, only Claude-3-Haiku demonstrates robustness to visual misinformation.These findings highlight emerging risks in multimodal model behavior and motivate the development of tailored evaluation frameworks and mitigation strategies for personalized AI systems.Code and dataset are available at: https://github.com/3lis/misinfo_vlm 0.852

link

2025-05-19

eStonefish-scenes: A synthetically generated dataset for underwater event-based optical flow prediction tasks

The combined use of event-based vision and Spiking Neural Networks (SNNs) is expected to significantly impact robotics, particularly in tasks like visual odometry and obstacle avoidance.While existing real-world event-based datasets for optical flow prediction, typically captured with Unmanned Aerial Vehicles (UAVs), offer valuable insights, they are limited in diversity, scalability, and are challenging to collect.Moreover, there is a notable lack of labelled datasets for underwater applications, which hinders the integration of event-based vision with Autonomous Underwater Vehicles (AUVs).To address this, synthetic datasets could provide a scalable solution while bridging the gap between simulation and reality.In this work, we introduce eStonefish-scenes, a synthetic event-based optical flow dataset based on the Stonefish simulator.Along with the dataset, we present a data generation pipeline that enables the creation of customizable underwater environments. 0.816This pipeline allows for simulating dynamic scenarios, such as biologically inspired schools of fish exhibiting realistic motion patterns, including obstacle avoidance and reactive navigation around corals.Additionally, we introduce a scene generator that can build realistic reef seabeds by randomly distributing coral across the terrain.To streamline data accessibility, we present eWiz, a comprehensive library designed for processing event-based data, offering tools for data loading, augmentation, visualization, encoding, and training data generation, along with loss functions and performance metrics.

link

2025-05-19

Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. 0.71 Furthermore, we build VirtualMobile, an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs (we use "tools" and "APIs" interchangeably; there are no significant differences between them in this paper). Taking advantage of these artifacts, we conduct a comprehensive evaluation on 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still cannot use tools well over long horizons.

link

2025-05-19

Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation

Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data that integrates contextual reasoning with paralinguistic information. 0.822 It consists of a pseudo paralinguistic label-based data condensation of in-the-wild speech and LLM-based Contextual Paralinguistic QA (CPQA) generation. The effectiveness is validated by a strong correlation in evaluations of the Qwen2-Audio-7B-Instruct model on a dataset created by our framework and a human-generated CPQA dataset. The results also reveal the speech-LLM's limitations in handling empathetic reasoning tasks, highlighting the need for such datasets and more robust models. The proposed framework is the first of its kind and has potential in training more robust speech-LLMs with paralinguistic reasoning capabilities.

link

2025-05-19

Detect and Correct: A Selective Noise Correction Method for Learning with Noisy Labels

Falsely annotated samples, also known as noisy labels, can significantly harm the performance of deep learning models.Two main approaches for learning with noisy labels are global noise estimation and data filtering.Global noise estimation approximates the noise across the entire dataset using a noise transition matrix, but it can unnecessarily adjust correct labels, leaving room for local improvements.Data filtering, on the other hand, discards potentially noisy samples but risks losing valuable data.Our method identifies potentially noisy samples based on their loss distribution.We then apply a selection process to separate noisy and clean samples and learn a noise transition matrix to correct the loss for noisy samples while leaving the clean data unaffected, thereby improving the training process.Our approach ensures robust learning and enhanced model performance by preserving valuable information from noisy samples and refining the correction process.We applied our method to standard image datasets (MNIST, CIFAR-10, and CIFAR-100) and a biological scRNA-seq cell-type annotation dataset. 0.723We observed a significant improvement in model accuracy and robustness compared to traditional methods.
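
As a rough illustration of the selective correction idea in the abstract above, the Python sketch below applies a transition-matrix correction only to samples flagged as noisy and leaves clean samples on the plain cross-entropy loss. It uses the standard forward loss correction as a stand-in; the abstract does not specify the paper's actual selection rule or correction formula, and the names `selective_loss` and `is_noisy`, as well as the example matrix, are hypothetical.

```python
import math
from typing import Sequence


def selective_loss(
    probs: Sequence[float],
    label: int,
    is_noisy: bool,
    transition: Sequence[Sequence[float]],
) -> float:
    """Negative log-likelihood with forward correction for noisy samples only.

    probs: model's predicted distribution over the true classes.
    label: observed (possibly wrong) label of the sample.
    is_noisy: hypothetical flag produced by some loss-based detection step.
    transition: transition[i][j] = P(observed label j | true label i).
    """
    if is_noisy:
        # Forward correction: probability of the observed label after
        # pushing the clean prediction through the noise model.
        p = sum(probs[i] * transition[i][label] for i in range(len(probs)))
    else:
        # Clean samples are left untouched.
        p = probs[label]
    return -math.log(p)


if __name__ == "__main__":
    T = [[0.9, 0.1], [0.3, 0.7]]  # illustrative values, not from the paper
    print(selective_loss([0.8, 0.2], 1, True, T))
    print(selective_loss([0.8, 0.2], 1, False, T))
```

In this toy case the corrected loss for the flagged sample is smaller than the uncorrected one, because the noise model explains part of the disagreement between the prediction and the observed label.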

link

2025-05-19

A large-scale analysis of public-facing, community-built chatbots on Character.AI

This paper presents the first large-scale analysis of public-facing chatbots on Character.AI, a rapidly growing social media platform where users create and interact with chatbots. Character.AI is distinctive in that it merges generative AI with user-generated content, enabling users to build bots, often modeled after fictional or public personas, for others to engage with. It is also popular, with over 20 million monthly active users, and impactful, with recent headlines detailing significant issues with youth engagement on the site. Character.AI is thus of interest to study both substantively and conceptually. To this end, we present a descriptive overview of the site using a dataset of 2.1 million English-language prompts (or "greetings") for chatbots on the site, created by around 1 million users. 0.87 Our work explores the prevalence of different fandoms on the site, broader tropes that persist across fandoms, and how dynamics of power intersect with gender within greetings. Overall, our findings illuminate an emerging form of online (para)social interaction that toes a unique and important intersection between generative AI and user-generated content.

link

2025-05-19

Granary: Speech Recognition and Translation Dataset in 25 European Languages

Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. 0.913 This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approximately 50% less data. The dataset will be made available at https://hf.co/datasets/nvidia/Granary 0.924

link

2025-05-19

FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning

Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person's emotional state based on facial data.Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights.However, traditional methods often struggle with limited interpretability, constrained generalization and reasoning abilities.Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs.To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench.Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks.Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks.The dataset and code will be available at https://github.com/953206211/FEALLM. 0.761

link

2025-05-19

MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions.A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps.To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework.We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data.Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. 0.711Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling.The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks.Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance.MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems.We release all our codes and data at https://github.com/ModalMinds/MM-PRM.
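
The Best-of-N setup mentioned in the abstract above can be sketched as follows: each candidate reasoning path carries per-step scores from a process reward model, the scores are aggregated into one path score, and the highest-scoring path is kept. This is only a minimal sketch; the abstract does not say how step scores are aggregated, so the default `min` here is an assumption, and `select_best_path` and the example scores are hypothetical.

```python
from math import prod
from typing import Callable, Sequence, Tuple

ScoredPath = Tuple[str, Sequence[float]]


def select_best_path(
    candidates: Sequence[ScoredPath],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> str:
    """Return the candidate whose aggregated step scores are highest.

    candidates: (path_text, step_scores) pairs, where step_scores would
        come from a process reward model; every path needs at least one step.
    aggregate: how step scores become a path score (assumed; e.g. min or prod).
    """
    if not candidates:
        raise ValueError("need at least one candidate path")
    best_path, _ = max(candidates, key=lambda item: aggregate(item[1]))
    return best_path


if __name__ == "__main__":
    paths = [
        ("path A", [0.9, 0.2, 0.8]),  # one weak step
        ("path B", [0.6, 0.7, 0.65]),  # uniformly decent steps
    ]
    print(select_best_path(paths))
    print(select_best_path(paths, aggregate=prod))
```

Aggregating with `min` or a product penalizes a single bad step, which is one way step-level supervision can prefer logically consistent paths over ones that only look good on average.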

link

2025-05-19

GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation

We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from PRISM, a novel large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. 0.707 We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35% achieved by the next best alternative. GraspMolmo also successfully demonstrates the ability to predict semantically correct bimanual grasps zero-shot. We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation, which, along with videos, are available at https://abhaybd.github.io/GraspMolmo/.

link

2025-05-19

CIE: Controlling Language Model Text Generations Using Continuous Signals

Aligning language models with user intent is becoming increasingly relevant to enhance user experience. This calls for designing methods that can allow users to control the properties of the language that LMs generate. For example, controlling the length of the generation, the complexity of the language that gets chosen, the sentiment, tone, etc. Most existing work attempts to integrate users' control by conditioning LM generations on natural language prompts or discrete control signals, which are often brittle and hard to scale. In this work, we are interested in continuous control signals, ones that exist along a spectrum that can't easily be captured in a natural language prompt or via existing techniques in conditional generation. Through a case study in controlling the precise response-length of generations produced by LMs, we demonstrate how after fine-tuning, behaviors of language models can be controlled via continuous signals, as vectors that are interpolated between a "low" and a "high" token embedding. Our method more reliably exerts response-length control than in-context learning methods or fine-tuning methods that represent the control signal as a discrete signal. Our full open-sourced code and datasets are available at https://github.com/vsamuel2003/CIE. 0.735
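
To make the "interpolated between a low and a high token embedding" idea above concrete, here is a minimal Python sketch that maps a requested response length onto a point between two embedding vectors. The linear interpolation, the clamping to the valid range, and the names `control_vector`, `min_len`, and `max_len` are all assumptions for illustration; the abstract does not describe the exact parameterization.

```python
from typing import List, Sequence


def control_vector(
    low: Sequence[float],
    high: Sequence[float],
    target_len: float,
    min_len: float,
    max_len: float,
) -> List[float]:
    """Linearly interpolate between a "low" and a "high" embedding.

    target_len is normalized into [0, 1] over [min_len, max_len] and
    clamped, so out-of-range requests map to one of the two endpoints.
    """
    if len(low) != len(high):
        raise ValueError("embeddings must have the same dimension")
    if max_len <= min_len:
        raise ValueError("max_len must be greater than min_len")
    alpha = (target_len - min_len) / (max_len - min_len)
    alpha = min(1.0, max(0.0, alpha))
    return [(1.0 - alpha) * lo + alpha * hi for lo, hi in zip(low, high)]


if __name__ == "__main__":
    low_emb = [0.0, 2.0]   # stand-in for the "low" token embedding
    high_emb = [4.0, 6.0]  # stand-in for the "high" token embedding
    print(control_vector(low_emb, high_emb, 50, 0, 100))
```

Under these assumptions, the resulting vector would be fed to the fine-tuned model in place of a discrete control token, so that any length between the two extremes has its own representation.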

link

2025-05-15

Logos as a Well-Tempered Pre-train for Sign Language Recognition

This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, despite the availability of a number of datasets, the amount of data for most individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive ISLR dataset by the number of signers and one of the largest available datasets, while also being the largest RSL dataset in size and vocabulary. 0.811 It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available. 0.713

link

2025-05-15

CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models.Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples.CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures.Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons.Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models.Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. 0.854Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at https://raman1121.github.io/CheXGenBench/

link

2025-05-15

Style Customization of Text-to-Vector Generation with Image Diffusion Priors

Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure.Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics.Extending existing T2V methods for style customization poses certain challenges.Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity.On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data. To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors.In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities.In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models.By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner.The effectiveness of our method has been validated through extensive experiments.The project page is https://customsvg.github.io. 0.765

link

2025-05-15

MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. 0.869 Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder. 0.702

link

2025-05-14

FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released soon. 0.877

link

2025-05-14

Radon Exposure Dataset

Exposure to elevated radon levels in the home is one of the leading causes of lung cancer in the world.The following study describes the creation of a comprehensive, state-level dataset designed to enable the modeling and prediction of household radon concentrations at Zip Code Tabulation Area (ZCTA) and sub-kilometer scales.Details include the data collection and processing involved in compiling physical and demographic factors for Pennsylvania and Utah.Attempting to mitigate this risk requires identifying the underlying geological causes and the populations that might be at risk.This work focuses on identifying at-risk populations throughout Pennsylvania and Utah, where radon levels are some of the highest in the country.The resulting dataset harmonizes geological and demographic factors from various sources and spatial resolutions, including temperature, geochemistry, and soil characteristics. 0.86Demographic variables such as the household heating fuel used, the age of building, and the housing type provide further insight into which populations could be most susceptible in areas with potentially high radon levels.This dataset also serves as a foundational resource for two other studies conducted by the authors.The resolution of the data provides a novel approach to predicting potential radon exposure, and the data processing conducted for these states can be scaled up to larger spatial resolutions (e.g., the Contiguous United States [CONUS]) and allow for a broad reclassification of radon exposure potential in the United States.

link

2025-05-14

GlobalMood: A cross-cultural benchmark for music emotion recognition

Human annotations of mood in music are essential for music generation and recommender systems. However, existing datasets predominantly focus on Western songs with mood terms derived from English, which may limit generalizability across diverse linguistic and cultural backgrounds. To address this, we introduce "GlobalMood", a novel cross-cultural benchmark dataset comprising 1,180 songs sampled from 59 countries, with large-scale annotations collected from 2,519 individuals across five culturally and linguistically distinct locations: U.S., France, Mexico, S. Korea, and Egypt. 0.824 Rather than imposing predefined mood categories, we implement a bottom-up, participant-driven approach to organically elicit culturally specific music-related mood terms. We then recruit another pool of human participants to collect 988,925 ratings for these culture-specific descriptors. Our analysis confirms the presence of a valence-arousal structure shared across cultures, yet also reveals significant divergences in how certain mood terms, despite being dictionary equivalents, are perceived cross-culturally. State-of-the-art multimodal models benefit substantially from fine-tuning on our cross-culturally balanced dataset, as evidenced by improved alignment with human evaluations, particularly in non-English contexts. More broadly, our findings inform the ongoing debate on the universality versus cultural specificity of emotional descriptors, and our methodology can contribute to other multimodal and cross-lingual research.

link

2025-05-14

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. 0.862 Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.

link

2025-05-14

MIGRATION-BENCH: Repository-Level Code Migration Benchmark from Java 8

With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, but they primarily focus on problem-solving and issue-resolution tasks. In contrast, we introduce a new coding benchmark, MIGRATION-BENCH, with a distinct focus: code migration. MIGRATION-BENCH aims to serve as a comprehensive benchmark for migration from Java 8 to the latest long-term support (LTS) versions (Java 17, 21). It includes a full dataset and a selected subset, with 5,102 and 300 repositories respectively. The selected subset is a representative subset curated for complexity and difficulty, offering a versatile resource to support research in the field of code migration. Additionally, we provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of LLMs on this challenging task. We further propose SD-Feedback and demonstrate that LLMs can effectively tackle repository-level code migration to Java 17. For the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves 62.33% and 27.00% success rate (pass@1) for minimal and maximal migration respectively. The benchmark dataset and source code are available at: https://huggingface.co/collections/AmazonScience and https://github.com/amazon-science/self_debug respectively. 0.7

link

Data Quality

2025-05-21

A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO

In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. 0.648Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection.Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression.This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models.As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.

link

2025-05-21

Privacy-Preserving Conformal Prediction Under Local Differential Privacy

Conformal prediction (CP) provides sets of candidate classes with a guaranteed probability of containing the true class.However, it typically relies on a calibration set with clean labels.We address privacy-sensitive scenarios where the aggregator is untrusted and can only access a perturbed version of the true labels. 0.602We propose two complementary approaches under local differential privacy (LDP).In the first approach, users do not access the model but instead provide their input features and a perturbed label using a k-ary randomized response.In the second approach, which enforces stricter privacy constraints, users add noise to their conformity score by binary search response.This method requires access to the classification model but preserves both data and label privacy.Both approaches compute the conformal threshold directly from noisy data without accessing the true labels.We prove finite-sample coverage guarantees and demonstrate robust coverage even under severe randomization.This approach unifies strong local privacy with predictive uncertainty control, making it well-suited for sensitive applications such as medical imaging or large language model queries, regardless of whether users can (or are willing to) compute their own scores.
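
The k-ary randomized response mentioned in the first approach can be sketched as follows. This is a minimal illustration of the standard mechanism (keep the true label with probability e^ε / (e^ε + k − 1), otherwise report one of the other k − 1 labels uniformly at random), which satisfies ε-local differential privacy. The function name and parameters are hypothetical, and the abstract does not say how the paper calibrates the conformal threshold on top of the perturbed labels.

```python
import math
import random


def k_ary_randomized_response(true_label, num_classes, epsilon, rng=random):
    """Perturb a label in {0, ..., num_classes - 1} with k-ary randomized response.

    The true label is kept with probability e^eps / (e^eps + k - 1);
    otherwise one of the remaining k - 1 labels is drawn uniformly.
    """
    keep_prob = math.exp(epsilon) / (math.exp(epsilon) + num_classes - 1)
    if rng.random() < keep_prob:
        return true_label
    # Draw from the k - 1 other labels by skipping over the true one.
    other = rng.randrange(num_classes - 1)
    return other if other < true_label else other + 1
```

A small ε makes the reported label close to uniform noise, while a large ε returns the true label almost always; the aggregator only ever sees the perturbed value.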

link

2025-05-19

Detect and Correct: A Selective Noise Correction Method for Learning with Noisy Labels

Falsely annotated samples, also known as noisy labels, can significantly harm the performance of deep learning models. 0.852Two main approaches for learning with noisy labels are global noise estimation and data filtering. 0.632Global noise estimation approximates the noise across the entire dataset using a noise transition matrix, but it can unnecessarily adjust correct labels, leaving room for local improvements.Data filtering, on the other hand, discards potentially noisy samples but risks losing valuable data.Our method identifies potentially noisy samples based on their loss distribution. 0.645We then apply a selection process to separate noisy and clean samples and learn a noise transition matrix to correct the loss for noisy samples while leaving the clean data unaffected, thereby improving the training process.Our approach ensures robust learning and enhanced model performance by preserving valuable information from noisy samples and refining the correction process.We applied our method to standard image datasets (MNIST, CIFAR-10, and CIFAR-100) and a biological scRNA-seq cell-type annotation dataset.We observed a significant improvement in model accuracy and robustness compared to traditional methods.

link

2025-05-15

End-to-End Vision Tokenizer Tuning

Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering.The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics.This decoupled paradigm introduces a critical misalignment:The loss of the vision tokenization can be the representation bottleneck for target tasks.For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating them. 0.707To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks.Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook, and optimizes the vision tokenizers end-to-end with both reconstruction and caption objectives.ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications.Our ETT is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models.Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability.We hope this very simple and strong method can empower multimodal foundation models besides image generation and understanding.

link

2025-05-14

Evaluation Metrics for Misinformation Warning Interventions: Challenges and Prospects

Misinformation has become a widespread issue in the 21st century, impacting numerous areas of society and underscoring the need for effective intervention strategies.Among these strategies, user-centered interventions, such as warning systems, have shown promise in reducing the spread of misinformation.Many studies have used various metrics to evaluate the effectiveness of these warning interventions.However, no systematic review has thoroughly examined these metrics in all studies.This paper provides a comprehensive review of existing metrics for assessing the effectiveness of misinformation warnings, categorizing them into four main groups: behavioral impact, trust and credulity, usability, and cognitive and psychological effects. 0.67Through this review, we identify critical challenges in measuring the effectiveness of misinformation warnings, including inconsistent use of cognitive and attitudinal metrics, the lack of standardized metrics for affective and emotional impact, variations in user trust, and the need for more inclusive warning designs. 0.611We present an overview of these metrics and propose areas for future research.

link

2025-05-12

Chronocept: Instilling a Sense of Time in Machines

Human cognition is deeply intertwined with a sense of time, known as Chronoception.This sense allows us to judge how long facts remain valid and when knowledge becomes outdated.Despite progress in vision, language, and motor control, AI still struggles to reason about temporal validity.We introduce Chronocept, the first benchmark to model temporal validity as a continuous probability distribution over time.Using skew-normal curves fitted along semantically decomposed temporal axes, Chronocept captures nuanced patterns of emergence, decay, and peak relevance.It includes two datasets: Benchmark I (atomic facts) and Benchmark II (multi-sentence passages).Annotations show strong inter-annotator agreement (84% and 89%). 0.636Our baselines predict curve parameters - location, scale, and skewness - enabling interpretable, generalizable learning and outperforming classification-based approaches.Chronocept fills a foundational gap in AI's temporal reasoning, supporting applications in knowledge grounding, fact-checking, retrieval-augmented generation (RAG), and proactive agents.Code and data are publicly available.

link

2025-05-08

Performance Estimation in Binary Classification Using Calibrated Confidence

Model monitoring is a critical component of the machine learning lifecycle, safeguarding against undetected drops in the model's performance after deployment.Traditionally, performance monitoring has required access to ground truth labels, which are not always readily available.This can result in unacceptable latency or render performance monitoring altogether impossible.Recently, methods designed to estimate the accuracy of classifier models without access to labels have shown promising results. 0.627However, there are various other metrics that might be more suitable for assessing model performance in many cases.Until now, none of these important metrics has received similar interest from the scientific community.In this work, we address this gap by presenting CBPE, a novel method that can estimate any binary classification metric defined using the confusion matrix.In particular, we choose four metrics from this large family: accuracy, precision, recall, and F$_1$, to demonstrate our method.CBPE treats the elements of the confusion matrix as random variables and leverages calibrated confidence scores of the model to estimate their distributions.The desired metric is then also treated as a random variable, whose full probability distribution can be derived from the estimated confusion matrix.CBPE is shown to produce estimates that come with strong theoretical guarantees and valid confidence intervals.
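
The idea of estimating confusion-matrix metrics from calibrated confidence can be illustrated with a simplified sketch. It assumes each score is a calibrated probability of the positive class, so a predicted positive with score p contributes p to the expected true positives and 1 − p to the expected false positives. Only plug-in point estimates are computed here; the full probability distributions and confidence intervals described in the abstract are not reproduced, and all names are hypothetical.

```python
def expected_confusion_matrix(calibrated_probs, threshold=0.5):
    """Return expected (tp, fp, tn, fn) without using any ground-truth labels."""
    tp = fp = tn = fn = 0.0
    for p in calibrated_probs:
        if p >= threshold:  # model predicts the positive class
            tp += p
            fp += 1.0 - p
        else:  # model predicts the negative class
            fn += p
            tn += 1.0 - p
    return tp, fp, tn, fn


def estimated_metrics(calibrated_probs, threshold=0.5):
    """Plug-in estimates of accuracy, precision, recall and F1."""
    tp, fp, tn, fn = expected_confusion_matrix(calibrated_probs, threshold)
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total > 0 else nan,
        "precision": tp / (tp + fp) if tp + fp > 0 else nan,
        "recall": tp / (tp + fn) if tp + fn > 0 else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn > 0 else nan,
    }
```

Note that a ratio of expected counts is not the same as the expected value of the ratio, which is one reason a full distributional treatment is more informative than this sketch.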

link

2025-05-07

Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts

The quality of natural language texts in fine-tuning datasets plays a critical role in the performance of generative models, particularly in computational creativity tasks such as poem or song lyric generation.Fluency defects in generated poems significantly reduce their value.However, training texts are often sourced from internet-based platforms without stringent quality control, posing a challenge for data engineers to manage defect levels effectively. To address this issue, we propose the use of automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets for creative models. 0.674In this paper, we present a comprehensive comparison of unsupervised and supervised text anomaly detection approaches, utilizing both synthetic and human-labeled datasets.We also introduce the RUPOR dataset, a collection of Russian-language human-labeled poems designed for cross-sentence grammatical error detection, and provide the full evaluation code.Our work aims to empower the community with tools and insights to improve the quality of training datasets for generative models in creative domains.

link

2025-05-06

RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT

Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal.However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. 0.726To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group dual-student, semi-supervised framework.Each group contains two student models guided by a shared teacher network.By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model.Specifically, RAIL introduces two instructive mechanisms.Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas.In the unsupervised phase, Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training.This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels.Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation.Our code will be available at https://github.com/Tournesol-Saturday/RAIL.

link

Benchmarks

2025-05-22

LARES: Latent Reasoning for Sequential Recommendation

Sequential recommender systems have become increasingly important in real-world applications that model user behavior sequences to predict their preferences. However, existing sequential recommendation methods predominantly rely on non-reasoning paradigms, which may limit the model's computational capacity and result in suboptimal recommendation performance. To address these limitations, we present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation that enhances the model's representation capabilities by increasing the computation density of parameters through depth-recurrent latent reasoning. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity, thereby effectively capturing dynamic and intricate user interest patterns. A key difference of LARES lies in refining all input tokens at each implicit reasoning step to improve the computation utilization. To fully unlock the model's reasoning potential, we design a two-phase training strategy: (1) Self-supervised pre-training (SPT) with dual alignment objectives; (2) Reinforcement post-training (RPT). During the first phase, we introduce trajectory-level alignment and step-level alignment objectives, which enable the model to learn recommendation-oriented latent reasoning patterns without requiring supplementary annotated data. The subsequent phase utilizes reinforcement learning (RL) to harness the model's exploratory ability, further refining its reasoning capabilities. Comprehensive experiments on real-world benchmarks demonstrate our framework's superior performance. 0.704 Notably, LARES exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.

link

2025-05-22

Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks

Recent advances in Large Language Models (LLMs) have shown promise in function-level code generation, yet repository-level software engineering tasks remain challenging.Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization.This paper investigates whether open-source LLMs can effectively address repository-level tasks without requiring agent-based approaches.We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies.To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM's attention mechanism and map node attributes to the LLM's input space using a specialized adapter.When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark using the open-source Qwen2.5-72B model.This performance ranks first among open weight models, second among methods with open-source systems, and eighth overall, surpassing the previous best open-source model-based method by 12.33%. 0.678

link

2025-05-22

Risk-Averse Reinforcement Learning with Itakura-Saito Loss

Risk-averse reinforcement learning finds application in various high-stakes fields.Unlike classical reinforcement learning, which aims to maximize expected returns, risk-averse agents choose policies that minimize risk, occasionally sacrificing expected value.These preferences can be framed through utility theory.We focus on the specific case of the exponential utility function, where we can derive the Bellman equations and employ various reinforcement learning algorithms with few modifications.However, these methods suffer from numerical instability due to the need for exponent computation throughout the process.To address this, we introduce a numerically stable and mathematically sound loss function based on the Itakura-Saito divergence for learning state-value and action-value functions.We evaluate our proposed loss function against established alternatives, both theoretically and empirically. 0.614In the experimental section, we explore multiple financial scenarios, some with known analytical solutions, and show that our loss function outperforms the alternatives.
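
For reference, the Itakura-Saito divergence named above has a simple closed form for two positive numbers: d(x, y) = x/y − log(x/y) − 1. The sketch below only evaluates this standard definition. How the paper builds a loss for state-value and action-value functions from it is not stated in the abstract, and the function name is hypothetical.

```python
import math


def itakura_saito_divergence(x, y):
    """Itakura-Saito divergence between two positive scalars.

    It is non-negative, equals zero only when x == y, and is not symmetric.
    """
    if x <= 0 or y <= 0:
        raise ValueError("both arguments must be strictly positive")
    ratio = x / y
    return ratio - math.log(ratio) - 1.0
```

Because the divergence depends only on the ratio x/y, it is invariant to rescaling both arguments by the same factor, a property that is often useful when the quantities being compared span many orders of magnitude.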

link

2025-05-22

Efficient Correlation Volume Sampling for Ultra-High-Resolution Optical Flow Estimation

Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume.This results in quadratic computational and memory complexity in the number of pixels.Although an alternative memory-efficient implementation with on-demand cost computation exists, this is slower in practice and therefore prior methods typically process images at reduced resolutions, missing fine-grained details. To address this, we propose a more efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT.Our approach outperforms on-demand sampling by up to 90% while maintaining low memory usage, and performs on par with the default implementation with up to 95% lower memory usage.As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 50% savings for the total end-to-end model inference in memory-constrained environments.Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an additional inference-time modification of the recent SEA-RAFT method.With this, we achieve state-of-the-art results at high resolutions both in accuracy and efficiency. 0.603

link

2025-05-22

BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation

Text segmentation based on the semantic meaning of sentences is a fundamental task with broad utility in many downstream applications.In this paper, we propose a graphical model-based unsupervised learning approach, named BP-Seg for efficient text segmentation.Our method not only considers local coherence, capturing the intuition that adjacent sentences are often more related, but also effectively groups sentences that are distant in the text yet semantically similar.This is achieved through belief propagation on the carefully constructed graphical models.Experimental results on both an illustrative example and a dataset with long-form documents demonstrate that our method performs favorably compared to competing approaches. 0.635

link

2025-05-22

Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources.However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point.This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant.We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives.Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation.Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. 0.625The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
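
The nDCG@10 figures quoted above come from a standard ranking metric, which the following sketch illustrates. It uses the linear-gain form of DCG and assumes the input list holds the graded relevance of every judged passage in ranked order; the function names are hypothetical and this is not the evaluation code used for BEIR or AIR-Bench.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A false negative in the training data is a passage with true relevance 1 that is labelled 0, so relabeling it changes which rankings the model is rewarded for during training.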

link

2025-05-22

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. 0.6 To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on HuggingFace (https://huggingface.co/datasets/MBZUAI/cass), with code on GitHub (https://github.com/GustavoStahl/CASS).

link

2025-05-22

3D Equivariant Visuomotor Policy Learning via Spherical Projection

Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin.However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace.This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro.This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere.This enables us to reason about symmetries in SO(3) without explicitly reconstructing a point cloud.We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. 0.69Our work is the first SO(3)-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.

link

2025-05-22

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications.Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers.To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS.(1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation.(2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. 0.656(3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension.Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. 0.605MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.

link

2025-05-22

Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple can be as low as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. 0.623 Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.

link

2025-05-22

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources. 0.605

link

2025-05-22

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. 0.656 Code available at: https://github.com/mbzuai-oryx/ARB

link

2025-05-21

Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI

The rise of IoT has increased the need for on-edge machine learning, with TinyML emerging as a promising solution for resource-constrained devices such as MCUs. However, evaluating their performance remains challenging due to diverse architectures and application scenarios. 0.601 Current solutions have many non-negligible limitations. This work introduces an alternative benchmarking methodology that integrates energy and latency measurements while distinguishing three execution phases: pre-inference, inference, and post-inference. 0.667 Additionally, the setup ensures that the device operates without being powered by an external measurement unit, while automated testing can be leveraged to enhance statistical significance. To evaluate our setup, we tested the STM32N6 MCU, which includes an NPU for executing neural networks. Two configurations were considered: high-performance and low-power. The variation of the EDP was analyzed separately for each phase, providing insights into the impact of hardware configurations on energy efficiency. Each model was tested 1000 times to ensure statistically relevant results. Our findings demonstrate that reducing the core voltage and clock frequency improves the efficiency of pre- and post-processing without significantly affecting network execution performance. This approach can also be used for cross-platform comparisons to determine the most efficient inference platform and to quantify how pre- and post-processing overhead varies across different hardware implementations. 0.653
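
The per-phase analysis described above can be illustrated with a short sketch. It assumes that EDP denotes the energy-delay product, computed as energy multiplied by latency, which is the usual meaning of the abbreviation but is not spelled out in the abstract. The phase names follow the abstract; the function names and all numbers are hypothetical, not measurements from the paper.

```python
def energy_delay_product(energy_joules, latency_seconds):
    """Energy-delay product in joule-seconds; lower means more efficient."""
    return energy_joules * latency_seconds


def edp_by_phase(measurements):
    """Map each phase name to its EDP.

    `measurements` maps a phase name to an (energy_joules, latency_seconds) pair.
    """
    return {
        phase: energy_delay_product(energy, latency)
        for phase, (energy, latency) in measurements.items()
    }


# Hypothetical example values for one run of one model.
example_run = {
    "pre-inference": (0.004, 0.002),
    "inference": (0.020, 0.010),
    "post-inference": (0.001, 0.001),
}
phase_edp = edp_by_phase(example_run)
```

Keeping the phases separate makes it possible to see whether a change in core voltage or clock frequency affects pre- and post-processing differently from network execution.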

link

2025-05-21

Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning

Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. 0.677 This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.

link

2025-05-21

Relationship Analysis of Image-Text Pair in SNS Posts

Social networking services (SNS) contain vast amounts of image-text posts, necessitating effective analysis of their relationships for improved information retrieval. This study addresses the classification of image-text pairs in SNS, overcoming prior limitations in distinguishing relationships beyond similarity. We propose a graph-based method to classify image-text pairs into similar and complementary relationships. Our approach first embeds images and text using CLIP, followed by clustering. Next, we construct an Image-Text Relationship Clustering Line Graph (ITRC-Line Graph), where clusters serve as nodes. Finally, edges and nodes are swapped in a pseudo-graph representation. A Graph Convolutional Network (GCN) then learns node and edge representations, which are fused with the original embeddings for final classification. Experimental results on a publicly available dataset demonstrate the effectiveness of our method. 0.674

link

2025-05-21

Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks

Neural Architecture Search (NAS) accelerates progress in deep learning through systematic refinement of model architectures. The downside is increasingly large energy consumption during the search process. Surrogate-based benchmarking mitigates the cost of full training by querying a pre-trained surrogate to obtain an estimate for the quality of the model. Specifically, energy-aware benchmarking aims to make it possible for NAS to favourably trade off model energy consumption against accuracy. Towards this end, we propose three design principles for such energy-aware benchmarks: (i) reliable power measurements, (ii) a wide range of GPU usage, and (iii) holistic cost reporting. 0.631 We analyse EA-HAS-Bench based on these principles and find that the choice of GPU measurement API has a large impact on the quality of results. 0.607 Using the Nvidia System Management Interface (SMI) on top of its underlying library influences the sampling rate during the initial data collection, returning faulty low-power estimations. This results in poor correlation with accurate measurements obtained from an external power meter. With this study, we bring to attention several key considerations when performing energy-aware surrogate-based benchmarking and derive first guidelines that can help design novel benchmarks. 0.674 We show a narrow usage range of the four GPUs attached to our device, ranging from 146 W to 305 W in a single-GPU setting, and narrowing down even further when using all four GPUs. To improve holistic energy reporting, we propose calibration experiments over assumptions made in popular tools, such as Code Carbon, thus achieving reductions in the maximum inaccuracy from 10.3 % to 8.9 % without and to 6.6 % with prior estimation of the expected load on the device.

link

2025-05-21

Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search

Nearest neighbor search is central in machine learning, information retrieval, and databases. For high-dimensional datasets, graph-based methods such as HNSW, DiskANN, and NSG have become popular thanks to their empirical accuracy and efficiency. These methods construct a directed graph over the dataset and perform beam search on the graph to find nodes close to a given query. While significant work has focused on practical refinements and theoretical understanding of graph-based methods, many questions remain. We propose a new distance-based termination condition for beam search to replace the commonly used condition based on beam width. We prove that, as long as the search graph is navigable, our resulting Adaptive Beam Search method is guaranteed to approximately solve the nearest-neighbor problem, establishing a connection between navigability and the performance of graph-based search. We also provide extensive experiments on our new termination condition for both navigable graphs and approximately navigable graphs used in practice, such as HNSW and Vamana graphs. We find that Adaptive Beam Search outperforms standard beam search over a range of recall values, data sets, graph constructions, and target number of nearest neighbors. It thus provides a simple and practical way to improve the performance of popular methods. 0.623
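
To make the idea of a distance-based stopping rule concrete, here is a minimal sketch of graph search that stops once the closest unexpanded candidate is more than (1 + gamma) times farther from the query than the current k-th best result. The abstract does not state the exact rule, so this threshold, the parameter `gamma`, and the function name are assumptions for illustration rather than the authors' method.

```python
import heapq
import math


def adaptive_beam_search(graph, points, query, start, k=1, gamma=0.5):
    """Greedy graph search with a distance-based stop (illustrative sketch).

    graph:  dict mapping node id -> iterable of neighbour ids
    points: sequence of coordinate tuples, indexed by node id
    Returns up to k (distance, node) pairs, closest first.
    """
    def dist(i):
        return math.dist(points[i], query)

    visited = {start}
    candidates = [(dist(start), start)]  # min-heap: discovered, not yet expanded
    best = [(-dist(start), start)]       # max-heap (negated): k closest so far

    while candidates:
        d, node = heapq.heappop(candidates)
        kth = -best[0][0] if len(best) == k else math.inf
        # Assumed termination rule: stop when even the nearest unexpanded
        # candidate is far beyond the current k-th best distance.
        if d > (1 + gamma) * kth:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            heapq.heappush(candidates, (dn, nb))
            if len(best) < k:
                heapq.heappush(best, (-dn, nb))
            elif dn < -best[0][0]:
                heapq.heapreplace(best, (-dn, nb))

    return sorted((-neg_d, i) for neg_d, i in best)
```

Under this rule the amount of exploration adapts to the query: a larger `gamma` explores more of the graph before stopping, whereas a fixed beam width caps the candidate list regardless of how distances are distributed.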

link

2025-05-21

Breaking Barriers for Distributed MIS by Faster Degree Reduction

We study the problem of finding a maximal independent set (MIS) in the standard LOCAL model of distributed computing. Classical algorithms by Luby [JACM'86] and Alon, Babai, and Itai [JALG'86] find an MIS in $O(\log n)$ rounds in $n$-node graphs with high probability. Despite decades of research, the existence of any $o(\log n)$-round algorithm for general graphs remains one of the major open problems in the field. Interestingly, the hard instances for this problem must contain constant-length cycles. This is because there exists a sublogarithmic-round algorithm for graphs with super-constant girth; i.e., graphs where the length of the shortest cycle is $\omega(1)$, as shown by Ghaffari [SODA'16]. Thus, resolving this $\approx 40$-year-old open problem requires understanding the family of graphs that contain $k$-cycles for some constant $k$. In this work, we come very close to resolving it by presenting a sublogarithmic-round algorithm for graphs that can contain $k$-cycles for all $k > 6$. Specifically, our algorithm finds an MIS in $O\left(\frac{\log \Delta}{\log(\log^* \Delta)} + \mathrm{poly}(\log\log n)\right)$ rounds, as long as the graph does not contain cycles of length $\leq 6$, where $\Delta$ is the maximum degree of the graph. 0.651 As a result, we push the limit on the girth of graphs that admit sublogarithmic-round algorithms from $k = \omega(1)$ all the way down to a small constant $k=7$. This also implies an $o(\sqrt{\log n})$-round algorithm for MIS in trees, refuting a conjecture from the book by Barenboim and Elkin.
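
For readers unfamiliar with the classical baseline mentioned above, the following sketch simulates the random-priority variant of Luby's algorithm sequentially. It illustrates only the well-known $O(\log n)$-round approach, not the new algorithm from this paper; the function name and the adjacency-dict representation are choices made for this example.

```python
import random


def luby_mis(adj, seed=0):
    """Sequential simulation of a Luby-style randomized MIS algorithm.

    adj: dict mapping each node to a list of its neighbours
         (assumed to be a simple undirected graph without self-loops).
    Returns (mis, rounds): the independent set and the number of rounds used.
    """
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    rounds = 0
    while active:
        rounds += 1
        # Every active node draws an independent random priority.
        prio = {v: rng.random() for v in active}
        # A node joins the MIS if it beats all of its active neighbours.
        winners = {
            v for v in active
            if all(prio[v] < prio[u] for u in adj[v] if u in active)
        }
        mis |= winners
        # Winners and their neighbours leave the graph.
        removed = set(winners)
        for v in winners:
            removed.update(u for u in adj[v] if u in active)
        active -= removed
    return mis, rounds


# A 7-cycle, the smallest cycle length still allowed by the paper's girth condition
cycle = {i: [(i - 1) % 7, (i + 1) % 7] for i in range(7)}
mis, rounds = luby_mis(cycle)
print(sorted(mis), rounds)
```

In the LOCAL model each loop iteration corresponds to a constant number of communication rounds, since a node only needs the priorities of its direct neighbours.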

link

2025-05-21

Graph Conditional Flow Matching for Relational Data Generation

Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by the foreign-key relationships. We do this by learning a deep generative model of the content of the whole relational database by flow matching, where the neural network trained to denoise records leverages a graph neural network to obtain information from connected records. Our method is flexible, as it can support relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity. 0.621

link

2025-05-21

Enhancing Monte Carlo Dropout Performance for Uncertainty Quantification

Knowing the uncertainty associated with the output of a deep neural network is of paramount importance in making trustworthy decisions, particularly in high-stakes fields like medical diagnosis and autonomous systems. Monte Carlo Dropout (MCD) is a widely used method for uncertainty quantification, as it can be easily integrated into various deep architectures. However, conventional MCD often struggles with providing well-calibrated uncertainty estimates. To address this, we introduce an innovative framework that enhances MCD by integrating different search solutions, namely Grey Wolf Optimizer (GWO), Bayesian Optimization (BO), and Particle Swarm Optimization (PSO), as well as an uncertainty-aware loss function, thereby improving the reliability of uncertainty quantification. We conduct comprehensive experiments using different backbones, namely DenseNet121, ResNet50, and VGG16, on various datasets, including Cats vs. Dogs, Myocarditis, Wisconsin, and a synthetic dataset (Circles). Our proposed algorithm outperforms the MCD baseline by 2-3% on average in terms of both conventional accuracy and uncertainty accuracy while achieving significantly better calibration. 0.679 These results highlight the potential of our approach to enhance the trustworthiness of deep learning models in safety-critical applications.
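
The baseline technique named above, Monte Carlo Dropout, can be sketched in a few lines: dropout stays active at prediction time, and the spread of repeated stochastic forward passes serves as an uncertainty estimate. The toy one-layer logistic model, its weights, and the function names below are assumptions for illustration; none of the optimizers or the loss function from the paper are implemented here.

```python
import math
import random


def dropout_forward(x, weights, p_drop, rng):
    """One stochastic pass of a toy logistic model with inverted dropout on inputs."""
    kept = [
        xi / (1.0 - p_drop) if rng.random() >= p_drop else 0.0
        for xi in x
    ]
    z = sum(w * xi for w, xi in zip(weights, kept))
    return 1.0 / (1.0 + math.exp(-z))


def mc_dropout_predict(x, weights, p_drop=0.5, n_samples=200, seed=0):
    """Return (mean, variance) of the predicted probability over n_samples passes."""
    rng = random.Random(seed)
    probs = [dropout_forward(x, weights, p_drop, rng) for _ in range(n_samples)]
    mean = sum(probs) / n_samples
    var = sum((p - mean) ** 2 for p in probs) / n_samples
    return mean, var


# Hypothetical input and weights
mean, var = mc_dropout_predict([1.0, -2.0, 0.5], [0.3, 0.1, -0.4])
print(f"mean probability = {mean:.3f}, predictive variance = {var:.4f}")
```

The dropout rate `p_drop` is the kind of hyperparameter that a search procedure could tune for calibration, which appears to be the role the abstract assigns to GWO, BO, and PSO.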

link

2025-05-21

Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning

Multimodal pathological image understanding has garnered widespread interest due to its potential to improve diagnostic accuracy and enable personalized treatment through integrated visual and textual data. However, existing methods exhibit limited reasoning capabilities, which hamper their ability to handle complex diagnostic scenarios. Additionally, the enormous size of pathological images leads to severe computational burdens, further restricting their practical deployment. To address these limitations, we introduce a novel bilateral reinforcement learning framework comprising two synergistic branches. One reinforcement branch enhances the reasoning capability by enabling the model to learn task-specific decision processes, i.e., pathology rationales, directly from labels without explicit reasoning supervision. The other branch dynamically allocates a tailored number of tokens to different images based on both their visual content and task context, thereby optimizing computational efficiency. We apply our method to various pathological tasks such as visual question answering, cancer subtyping, and lesion detection. Extensive experiments show an average +41.7 absolute performance improvement with 70.3% lower inference costs over the base models, achieving both reasoning accuracy and computational efficiency. 0.632

link

2025-05-21

Toward Open Earth Science as Fast and Accessible as Natural Language

Is natural-language-driven earth observation data analysis now feasible with the assistance of Large Language Models (LLMs)? For open science in service of public interest, feasibility requires reliably high accuracy, interactive latencies, low (sustainable) costs, open LLMs, and openly maintainable software -- hence, the challenge. What are the techniques and programming system requirements necessary for satisfying these constraints, and what is the corresponding development and maintenance burden in practice? This study lays the groundwork for exploring these questions, introducing an impactful earth science use-case, and providing a software framework with evaluation data and metrics, along with initial results from employing model scaling, prompt-optimization, and inference-time scaling optimization techniques. While we attain high accuracy (near 100%) across 10 of 11 metrics, the analysis further considers cost (token-spend), latency, and maintainability across this space of techniques. 0.602 Finally, we enumerate opportunities for further research, general programming and evaluation framework development, and ongoing work for a comprehensive, deployable solution. This is a call for collaboration and contribution.

link

2025-05-21

Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. 0.613 Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.

link

2025-05-21

Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives

Recent advances in reinforcement learning (RL) have renewed focus on the design of reward functions that shape agent behavior. Manually designing reward functions is tedious and error-prone. A principled alternative is to specify behaviors in a formal language that can be automatically translated into rewards. Omega-regular languages are a natural choice for this purpose, given their established role in formal verification and synthesis. However, existing methods using omega-regular specifications typically rely on discounted reward RL in episodic settings, with periodic resets. This setup misaligns with the semantics of omega-regular specifications, which describe properties over infinite behavior traces. In such cases, the average reward criterion and the continuing setting -- where the agent interacts with the environment over a single, uninterrupted lifetime -- are more appropriate. To address the challenges of infinite-horizon, continuing tasks, we focus on absolute liveness specifications -- a subclass of omega-regular languages that cannot be violated by any finite behavior prefix, making them well-suited to the continuing setting. We present the first model-free RL framework that translates absolute liveness specifications to average-reward objectives. Our approach enables learning in communicating MDPs without episodic resetting. We also introduce a reward structure for lexicographic multi-objective optimization, aiming to maximize an external average-reward objective among the policies that also maximize the satisfaction probability of a given omega-regular specification. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full knowledge of the environment, thus enabling model-free RL. Empirical results show our average-reward approach in the continuing setting outperforms discount-based methods across benchmarks. 0.61

link

2025-05-21

HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases

Large Language Models (LLMs) have demonstrated their potential in hardware design tasks, such as Hardware Description Language (HDL) generation and debugging. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of code lines is hindered. To this end, we propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs, introducing HDL-specific graph representations by incorporating Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to capture both the code graph view and the hardware graph view. HDLxGraph utilizes a dual-retrieval mechanism that not only mitigates the limited recall issues inherent in similarity-based semantic retrieval by incorporating structural information, but also enhances its extensibility to various real-world tasks through task-specific retrieval finetuning. Additionally, to address the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, a multi-granularity evaluation dataset derived from real-world repository-level projects. Experimental results demonstrate that HDLxGraph significantly improves average search accuracy, debugging efficiency and completion quality by 12.04%, 12.22% and 5.04% compared to similarity-based RAG, respectively. The code of HDLxGraph and the collected HDLSearch benchmark are available at https://github.com/Nick-Zheng-Q/HDLxGraph. 0.614

link

2025-05-21

LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing

Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model the sequential editing as a constrained stochastic programming. Given the challenges posed by the cumulative preservation error constraint and the gradually revealed editing tasks, LyapLock is proposed. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained programming into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotic optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. 0.701 Our code is released on https://github.com/caskcsg/LyapLock.

link

2025-05-21

Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling

Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we propose a novel automated data synthesis pipeline. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions: comprehensiveness, professionalism, authenticity, and safety. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark. 0.604

link

2025-05-21

Multi-modal Integration Analysis of Alzheimer's Disease Using Large Language Models and Knowledge Graphs

We propose a novel framework for integrating fragmented multi-modal data in Alzheimer's disease (AD) research using large language models (LLMs) and knowledge graphs. While traditional multimodal analysis requires matched patient IDs across datasets, our approach demonstrates population-level integration of MRI, gene expression, biomarkers, EEG, and clinical indicators from independent cohorts. Statistical analysis identified significant features in each modality, which were connected as nodes in a knowledge graph. LLMs then analyzed the graph to extract potential correlations and generate hypotheses in natural language. This approach revealed several novel relationships, including a potential pathway linking metabolic risk factors to tau protein abnormalities via neuroinflammation (r>0.6, p<0.001), and unexpected correlations between frontal EEG channels and specific gene expression profiles (r=0.42-0.58, p<0.01). Cross-validation with independent datasets confirmed the robustness of major findings, with consistent effect sizes across cohorts (variance <15%). The reproducibility of these findings was further supported by expert review (Cohen's κ=0.82) and computational validation. 0.634 Our framework enables cross-modal integration at a conceptual level without requiring patient ID matching, offering new possibilities for understanding AD pathology through fragmented data reuse and generating testable hypotheses for future research.

link

2025-05-21

Lean-SMT: An SMT tactic for discharging proof goals in Lean

Lean is an increasingly popular proof assistant based on dependent type theory. Despite its success, it still lacks important automation features present in more seasoned proof assistants, such as the Sledgehammer tactic in Isabelle/HOL. A key aspect of Sledgehammer is the use of proof-producing SMT solvers to prove a translated proof goal and the reconstruction of the resulting proof into valid justifications for the original goal. We present Lean-SMT, a tactic providing this functionality in Lean. We detail how the tactic converts Lean goals into SMT problems and, more importantly, how it reconstructs SMT proofs into native Lean proofs. We evaluate the tactic on established benchmarks used to evaluate Sledgehammer's SMT integration, with promising results. 0.645 We also evaluate Lean-SMT as a standalone proof checker for proofs of SMT-LIB problems. We show that Lean-SMT offers a smaller trusted core without sacrificing too much performance.

link

2025-05-21

VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. 0.61 These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. 0.62 Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.

link

2025-05-21

Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization

Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image are then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. 0.646 Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference). Our source code is available at https://github.com/satoshi-kosugi/powerful-attention.

link

2025-05-20

Multi-agent Reinforcement Learning vs. Fixed-Time Control for Traffic Signal Optimization: A Simulation Study

Urban traffic congestion, particularly at intersections, significantly impacts travel time, fuel consumption, and emissions. Traditional fixed-time signal control systems often lack the adaptability to manage dynamic traffic patterns effectively. This study explores the application of multi-agent reinforcement learning (MARL) to optimize traffic signal coordination across multiple intersections within a simulated environment. Utilizing Pygame, a simulation was developed to model a network of interconnected intersections with randomly generated vehicle flows to reflect realistic traffic variability. A decentralized MARL controller was implemented, in which each traffic signal operates as an autonomous agent, making decisions based on local observations and information from neighboring agents. Performance was evaluated against a baseline fixed-time controller using metrics such as average vehicle wait time and overall throughput. The MARL approach demonstrated statistically significant improvements, including reduced average waiting times and improved throughput. 0.614 These findings suggest that MARL-based dynamic control strategies hold substantial promise for improving urban traffic management efficiency. More research is recommended to address scalability and real-world implementation challenges.

link

2025-05-20

PSMOA: Policy Support Multi-Objective Optimization Algorithm for Decentralized Data Replication

Efficient data replication in decentralized storage systems must account for diverse policies, especially in multi-organizational, data-intensive environments. This work proposes PSMOA, a novel Policy Support Multi-objective Optimization Algorithm for decentralized data replication that dynamically adapts to varying organizational requirements. PSMOA integrates NSGA-III with Entropy Weighted TOPSIS to optimize replication objectives such as minimization or maximization of replication time, storage cost, replication based on content popularity, and load balancing while respecting policy constraints. Our simulations demonstrate PSMOA's superior performance, with load balancing improving by 4-7% relative to baseline, while other metrics show stable performance between 98% and 103%. 0.67 PSMOA outperforms NSGA-II and NSGA-III in both Generational Distance (20.29 vs 148.74 and 67.74) and Inverted Generational Distance (0.78 vs 3.76 and 5.61), indicating better convergence and solution distribution. These results validate PSMOA's novelty in optimizing data replication in multi-organizational environments.
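
The two quality indicators quoted above can be sketched as follows. Several definitions of Generational Distance and Inverted Generational Distance are in use; this sketch assumes the simple variant that averages Euclidean nearest-neighbour distances, which may differ from the variant used in the paper. The example fronts are made up.

```python
import math


def _mean_min_distance(from_set, to_set):
    """Average distance from each point in from_set to its nearest point in to_set."""
    return sum(min(math.dist(a, b) for b in to_set) for a in from_set) / len(from_set)


def generational_distance(approx_front, reference_front):
    """GD: how far the obtained solutions lie from the reference front (lower is better)."""
    return _mean_min_distance(approx_front, reference_front)


def inverted_generational_distance(approx_front, reference_front):
    """IGD: how well the obtained solutions cover the reference front (lower is better)."""
    return _mean_min_distance(reference_front, approx_front)


# Hypothetical two-objective fronts
reference = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
approx = [(0.0, 1.0)]
print(generational_distance(approx, reference))           # on the front, so GD is 0
print(inverted_generational_distance(approx, reference))  # poor coverage, so IGD > 0
```

The example shows why both indicators are usually reported together: a single well-converged point yields a perfect GD while leaving most of the reference front uncovered, which only IGD penalizes.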

link

2025-05-20

Let LLMs Break Free from Overthinking via Self-Braking Tuning

Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models. 0.624

link

2025-05-20

Electrostatics from Laplacian Eigenbasis for Neural Network Interatomic Potentials

Recent advances in neural network interatomic potentials have emerged as a promising research direction. However, popular deep learning models often lack auxiliary constraints grounded in physical laws, which could accelerate training and improve fidelity through physics-based regularization. In this work, we introduce $\Phi$-Module, a universal plugin module that enforces Poisson's equation within the message-passing framework to learn electrostatic interactions in a self-supervised manner. Specifically, each atom-wise representation is encouraged to satisfy a discretized Poisson's equation, making it possible to acquire a potential $\boldsymbol{\phi}$ and a corresponding charge density $\boldsymbol{\rho}$ linked to the learnable Laplacian eigenbasis coefficients of a given molecular graph. We then derive an electrostatic energy term, crucial for improved total energy predictions. This approach integrates seamlessly into any existing neural potential with insignificant computational overhead. Experiments on the OE62 and MD22 benchmarks confirm that models combined with $\Phi$-Module achieve robust improvements over baseline counterparts. For OE62, the error reduction ranges from 4.5% to 17.8%, and for MD22, the baseline equipped with $\Phi$-Module achieves the best results in 5 out of 14 cases. 0.613 Our results underscore how embedding a first-principles constraint in neural interatomic potentials can significantly improve performance while remaining hyperparameter-friendly, memory-efficient and lightweight in training. Code will be available at https://github.com/dunnolab/phi-module.

link

2025-05-20

MMD-Newton Method for Multi-objective Optimization

Maximum mean discrepancy (MMD) has been widely employed to measure the distance between probability distributions. 0.602 In this paper, we propose using MMD to solve continuous multi-objective optimization problems (MOPs). For solving MOPs, a common approach is to minimize the distance (e.g., Hausdorff) between a finite approximate set of the Pareto front and a reference set. Viewing these two sets as empirical measures, we propose using MMD to measure the distance between them. 0.619 To minimize the MMD value, we provide the analytical expression of its gradient and Hessian matrix w.r.t. the search variables, and use them to devise a novel set-oriented, MMD-based Newton (MMDN) method. Also, we analyze the theoretical properties of MMD's gradient and Hessian, including the first-order stationary condition and the eigenspectrum of the Hessian, which are important for verifying the correctness of MMDN. To solve complicated problems, we propose hybridizing MMDN with multiobjective evolutionary algorithms (MOEAs), where we first execute an EA for several iterations to get close to the global Pareto front and then warm-start MMDN with the result of the MOEA to efficiently refine the approximation. We empirically test the hybrid algorithm on 11 widely used benchmark problems, and the results show the hybrid (MMDN + MOEA) can achieve a much better optimization accuracy than EA alone with the same computation budget. 0.719

link
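
As a rough illustration of the distance used in the MMD-Newton entry above, the sketch below computes the (biased) squared MMD between two finite point sets treated as empirical measures. The Gaussian kernel, the bandwidth `sigma`, and the function names are assumptions made for this example; the abstract does not say which kernel the authors use, and the Newton step itself is not shown.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two non-empty point sets.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_kernel(a_set, b_set):
        total = sum(gaussian_kernel(a, b, sigma) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical data: an approximation set and a reference set in objective space.
approximation = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
reference = [(0.0, 0.9), (0.4, 0.4), (0.9, 0.0)]
distance = mmd_squared(approximation, reference)
```

The value is zero when both sets coincide and grows as the approximation set drifts away from the reference set, which is what makes it usable as a minimization target.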

2025-05-20

SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. 0.621 SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.

link
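
To make the "search-based" nature of SAT in the SATBench entry above concrete, here is a minimal brute-force satisfiability check. It is only a sketch: the clause encoding (signed integers, DIMACS-style) and the function name are assumptions for illustration, and it is not the solver or the generation pipeline used by the benchmark.

```python
from itertools import product


def is_satisfiable(clauses, num_vars):
    """Return True if some assignment satisfies every clause of a CNF formula.

    Each clause is a list of non-zero ints: literal k means variable k is True,
    literal -k means variable k is False (variables are numbered from 1).
    """
    for assignment in product([False, True], repeat=num_vars):
        if all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return True
    return False


# (x1 OR x2) AND (NOT x1) is satisfiable with x1=False, x2=True.
sat_example = is_satisfiable([[1, 2], [-1]], num_vars=2)
# x1 AND (NOT x1) has no satisfying assignment.
unsat_example = is_satisfiable([[1], [-1]], num_vars=1)
```

Exhaustive search costs 2^n assignments, so adding clauses or variables is a natural difficulty knob; showing that a formula is UNSAT requires ruling out every assignment, which is consistent with the hard UNSAT split being the most difficult.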

2025-05-20

KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models

Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food-related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generates recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities and retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe-related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL. 0.653

link

2025-05-20

Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable on various modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while maintaining tolerable accuracy degradation relative to the full-precision model. 0.674

link
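
The weight-storage half of the W4A8 scheme in the entry above can be pictured with a generic symmetric 4-bit quantizer. This is a sketch under stated assumptions, not the DPQ algorithm: the per-tensor scale, the round-to-nearest rule, and the function names are illustrative choices, and the 8-bit floating-point compute path is not modeled.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization of floats to the int4 range [-8, 7].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 7; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from int4 codes."""
    return [c * scale for c in codes]


# Hypothetical weights for illustration.
weights = [0.7, -0.35, 0.1, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
```

With this rule the reconstruction error of each weight is at most half a quantization step (`scale / 2`), which is the accuracy loss that more careful algorithms try to reduce.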

2025-05-20

Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits

We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. 0.675 We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: https://github.com/tiantiaf0627/vox-profile-release.

link

2025-05-20

Explainable AI for Securing Healthcare in IoT-Integrated 6G Wireless Networks

As healthcare systems increasingly adopt advanced wireless networks and connected devices, securing medical applications has become critical. The integration of Internet of Medical Things devices, such as robotic surgical tools, intensive care systems, and wearable monitors, has enhanced patient care but introduced serious security risks. Cyberattacks on these devices can lead to life-threatening consequences, including surgical errors, equipment failure, and data breaches. While the ITU IMT 2030 vision highlights 6G's transformative role in healthcare through AI and cloud integration, it also raises new security concerns. This paper explores how explainable AI techniques like SHAP, LIME, and DiCE can uncover vulnerabilities, strengthen defenses, and improve trust and transparency in 6G-enabled healthcare. We support our approach with experimental analysis and highlight promising results. 0.678

link

2025-05-20

Training-Free Watermarking for Autoregressive Image Generation

Invisible image watermarking can protect image ownership and prevent malicious misuse of visual generative models. However, existing generative watermarking methods are mainly designed for diffusion models while watermarking for autoregressive image generation models remains largely underexplored. We propose IndexMark, a training-free watermarking framework for autoregressive image generation models. IndexMark is inspired by the redundancy property of the codebook: replacing autoregressively generated indices with similar indices produces negligible visual differences. The core component in IndexMark is a simple yet effective match-then-replace method, which carefully selects watermark tokens from the codebook based on token similarity, and promotes the use of watermark tokens through token replacement, thereby embedding the watermark without affecting the image quality. Watermark verification is achieved by calculating the proportion of watermark tokens in generated images, with precision further improved by an Index Encoder. Furthermore, we introduce an auxiliary validation scheme to enhance robustness against cropping attacks. Experiments demonstrate that IndexMark achieves state-of-the-art performance in terms of image quality and verification accuracy, and exhibits robustness against various perturbations, including cropping, noises, Gaussian blur, random erasing, color jittering, and JPEG compression. 0.614

link
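
The verification step described in the IndexMark entry above, counting the proportion of watermark tokens, might look roughly like the sketch below. The function names, the example index values, and the decision threshold of 0.6 are hypothetical; the abstract does not give a threshold, and the Index Encoder and cropping validation are not modeled.

```python
def watermark_fraction(token_indices, watermark_indices):
    """Fraction of an image's codebook indices that belong to the watermark set."""
    if not token_indices:
        return 0.0
    hits = sum(1 for t in token_indices if t in watermark_indices)
    return hits / len(token_indices)


def is_watermarked(token_indices, watermark_indices, threshold=0.6):
    """Flag an image as watermarked when the fraction reaches the threshold.

    The threshold here is an illustrative assumption, not a value from the paper.
    """
    return watermark_fraction(token_indices, watermark_indices) >= threshold


# Hypothetical codebook indices for two images.
watermark_set = {1, 2, 3}
marked_image = [1, 2, 3, 4]    # 3 of 4 tokens are in the watermark set
plain_image = [5, 6, 1, 7]     # 1 of 4 tokens is in the watermark set
```

An unwatermarked image is expected to hit the watermark set only by chance, so a fraction well above that chance level is evidence that token replacement was applied.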

LLMs

2025-05-22

AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. 0.655 Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. 0.72 While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. 0.75 In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. 0.698 AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. 0.623 We use AgentIF to systematically evaluate existing advanced LLMs. 0.744 We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. 0.747 We have released the code and data to facilitate future research.

link

2025-05-22

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. 0.649 Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. 0.673 As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. 0.632 We provide our code and models at https://github.com/insait-institute/MixAT.

link
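
One plausible reading of the ALO-ASR metric named in the MixAT entry above is the fraction of prompts on which at least one attack succeeds. The sketch below implements that reading; the input layout (one list of per-prompt booleans per attack), the attack names, and the function name are assumptions, and the paper's exact definition may differ.

```python
def alo_asr(results):
    """At-least-one attack success rate.

    `results` maps an attack name to a list of booleans, one per prompt,
    where True means the attack produced a harmful generation for that prompt.
    A prompt counts as broken if any attack succeeded on it.
    """
    per_attack = list(results.values())
    if not per_attack:
        return 0.0
    num_prompts = len(per_attack[0])
    if any(len(outcomes) != num_prompts for outcomes in per_attack):
        raise ValueError("every attack must report one outcome per prompt")
    if num_prompts == 0:
        return 0.0
    broken = sum(
        1 for i in range(num_prompts) if any(outcomes[i] for outcomes in per_attack)
    )
    return broken / num_prompts


# Hypothetical outcomes for two attacks on four prompts.
example = {
    "attack_a": [True, False, False, False],
    "attack_b": [False, True, False, False],
}
score = alo_asr(example)
```

In this example each attack alone succeeds on 25% of prompts, but the union covers 50%, which is why the metric is a worst-case view rather than an average over attacks.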

2025-05-22

Cracking Aegis: An Adversarial LLM-based Game for Raising Awareness of Vulnerabilities in Privacy Protection

Traditional methods for raising awareness of privacy protection often fail to engage users or provide hands-on insights into how privacy vulnerabilities are exploited. To address this, we incorporate an adversarial mechanic in the design of the dialogue-based serious game Cracking Aegis. Leveraging LLMs to simulate natural interactions, the game challenges players to impersonate characters and extract sensitive information from an AI agent, Aegis. 0.711 A user study (n=22) revealed that players employed diverse deceptive linguistic strategies, including storytelling and emotional rapport, to manipulate Aegis. After playing, players reported connecting in-game scenarios with real-world privacy vulnerabilities, such as phishing and impersonation, and expressed intentions to strengthen privacy control, such as avoiding oversharing personal information with AI systems. This work highlights the potential of LLMs to simulate complex relational interactions in serious games, while demonstrating how an adversarial game strategy provides unique insights for designs for social good, particularly privacy protection. 0.665

link

2025-05-22

Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models

Large Language Models (LLMs) are increasingly equipped with capabilities of real-time web search and integrated with protocols like Model Context Protocol (MCP). 0.649 This extension could introduce new security vulnerabilities. We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts through malicious font injection in external resources like webpages, where attackers manipulate code-to-glyph mapping to inject deceptive content that is invisible to users. 0.69 We evaluate two critical attack scenarios: (1) "malicious content relay" and (2) "sensitive data leakage" through MCP-enabled tools. Our experiments reveal that indirect prompts with injected malicious font can bypass LLM safety mechanisms through external resources, achieving varying success rates based on data sensitivity and prompt design. 0.746 Our research underscores the urgent need for enhanced security measures in LLM deployments when processing external content. 0.774

link

2025-05-22

MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning

Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA both at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multiple-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. 0.641 The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.

link

2025-05-22

Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by $2.35\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. 0.634 Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.

link
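
The nDCG@10 figures quoted in the entry above follow the standard ranking metric, sketched below for one query. The function names are illustrative, and the ideal ranking is computed from the retrieved list only, which is a simplifying assumption (benchmark tooling normally uses all judged passages for the query).

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels in ranked order: 1 = relevant, 0 = not relevant.
perfect = ndcg_at_k([1, 0, 0])    # relevant passage ranked first
demoted = ndcg_at_k([0, 1, 0])    # relevant passage ranked second
```

The metric depends entirely on the relevance labels, so a relevant passage that is wrongly labeled 0 distorts both training signals and evaluation, which is the motivation for relabeling false negatives.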

2025-05-22

VeriFastScore: Speeding up long-form factuality evaluation

Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. 0.674 To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. 0.617 We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. 0.648 However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.

link

2025-05-22

SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g. code completion, bug fixing, and document generation. 0.69 However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set makes a 7B model comparable to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/justLittleWhite/SWE-Dev.

link

2025-05-22

HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation

Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. 0.685 In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from small sets of positive and negative examples and generated in Backus-Naur Form. 0.673 To explore this, we introduced a novel dataset comprising 540 structured grammar generation challenges, devised 6 metrics, and evaluated 8 LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. 0.737 To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs. 0.687

link

2025-05-22

Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design

Single-agent LLMs hit hard limits--finite context, role overload, and brittle domain transfer. 0.738 Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators--no ever-larger monoliths required.

link

2025-05-22

Beyond Correlation: Towards Causal Large Language Model Agents in Biomedicine

Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. 0.604 This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. 0.64 Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, and enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.

link

2025-05-22

LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding

Large Language Models (LLMs) are primarily designed for batch processing. 0.707 Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. 0.717 This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. 0.737 While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. 0.684 Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.

link

2025-05-22

UFT: Unifying Supervised and Reinforcement Fine-Tuning

Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). 0.645 The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

link

2025-05-22

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. 0.691 However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.

link

2025-05-22

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. 0.753 Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. 0.684 (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and we invite contributions from the broader open-source community.

link

2025-05-22

$\text{R}^2\text{ec}$: Towards Large Recommender Models with Reasoning

Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. 0.69 Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. 0.693 However, such decoupled designs are limited by significant resource cost and suboptimal joint optimization. To address these issues, we propose $\text{R}^2\text{ec}$, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of $\text{R}^2\text{ec}$ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of $\text{R}^2\text{ec}$, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at https://github.com/YRYangang/RRec.

link

2025-05-22

X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. 0.755 However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. 0.677 This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. 0.724 We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. 0.73 As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. 0.66 Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. 0.728 Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems. 0.67

link

2025-05-22

Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. 0.679 Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. 0.713 In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. 0.739 From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. 0.645 Additionally, we curate formal-related training data to further enhance small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. 0.641 Our code and reports are available at https://github.com/jiangjin1999/FormalEval.

link

2025-05-22

R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning

Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. 0.678 Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. 0.68 R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at https://github.com/RUCAIBox/R1-Searcher-plus.

link

2025-05-22

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratic computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. 0.676 This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, for which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
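
The pooling and visual-to-visual cross-attention steps described above can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the paper's implementation: it uses mean pooling, a single attention head, and no learned projections, and every name (`mean_pool`, `cross_attention`, the 2-d toy tokens) is hypothetical.

```python
import math

def mean_pool(tokens, window):
    """Average consecutive groups of `window` token vectors into one vector each."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention; keys and values share the same vectors."""
    dim = len(keys_values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim) for kv in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)])
    return outputs

# Eight made-up 2-d "visual tokens"; pooling with window 4 leaves two tokens.
visual = [[float(i), float(i % 2)] for i in range(8)]
pooled = mean_pool(visual, window=4)
refined = cross_attention(pooled, visual)  # pooled tokens act as queries over the originals
print(len(visual), "->", len(refined))
```

The point of the sketch is the shape of the computation: the sequence passed downstream has the pooled length, while each pooled token can still read from the full original token set.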

link

Developer Research

2025-05-13

Enhancing Software Development with Context-Aware Conversational Agents: A User Study on Developer Interactions with Chatbots

Software development is a cognitively intensive process requiring multitasking, adherence to evolving workflows, and continuous learning. 0.631 With the rise of large language model (LLM)-based tools, such as conversational agents (CAs), there is growing interest in supporting developers through natural language interaction. However, little is known about the specific features developers seek in these systems. We conducted a user study with 29 developers using a prototype text-based chatbot to investigate preferred functionalities. Our findings reveal strong interest in task automation, version control support, and contextual adaptability, especially the need to tailor assistance for both novice and experienced users. We highlight the importance of deep contextual understanding, historical interaction awareness, and personalized support in CA design. This study contributes to the development of context-aware chatbots that enhance productivity and satisfaction, and it outlines opportunities for future research on human-AI collaboration in software engineering.

link

2025-05-12

Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding

Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. However, LLM-generated code is still plagued with a wide range of functional errors, especially for complex programming tasks that LLMs have not seen before. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs, diminishing their productivity and trust in LLM-based code generation. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding. Our approach facilitates iterative grounding by interleaving code generation, inline comment generation, and contextualized user feedback through editable comments to align generated code with developer intent. We evaluated our approach on two popular benchmarks and demonstrated that our approach significantly improved multiple state-of-the-art LLMs, e.g., 17.1% pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) interacting with a multi-step code generation paradigm called Multi-Turn Program Synthesis. Participants completed the given programming tasks 16.7% faster and with 10.5% improvement in task success rate when using our approach. Both results show that interactively refining code comments enables the collaborative establishment of mutual grounding, leading to more accurate code generation and higher developer confidence. 0.668
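
As background on the pass@1 figure quoted above: the abstract does not say how the metric was computed, but a widely used choice is the unbiased pass@k estimator introduced with HumanEval, where n samples are drawn per problem and c of them pass the unit tests. The sketch below assumes that estimator; the sample counts are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass, k: budget of attempts.
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up counts: 10 samples for one problem, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))  # for k = 1 this reduces to c / n
```

The benchmark-level score is then the mean of this value over all problems.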

link

2025-05-06

Empc: Effective Path Prioritization for Symbolic Execution with Path Cover

Symbolic execution is a powerful program analysis technique that can formally reason about the correctness of program behaviors and detect software bugs. 0.626 It can systematically explore the execution paths of the tested program. But it suffers from an inherent limitation: path explosion. Path explosion occurs when symbolic execution encounters an overwhelming number (exponential to the program size) of paths that need to be reasoned about symbolically. It severely impacts the scalability and performance of symbolic execution. To tackle this problem, previous works leverage various heuristics to prioritize paths for symbolic execution. They rank the exponential number of paths using static rules or heuristics and explore the paths with the highest rank. However, in practice, these works often fail to generalize to diverse programs. In this work, we propose a novel and effective path prioritization technique with path cover, named Empc. Our key insight is that not all paths need to be reasoned about symbolically. Unlike traditional path prioritization, our approach leverages a small subset of paths as a minimum path cover (MPC) that can cover all code regions of the tested programs. To encourage diversity in path prioritization, we compute multiple MPCs. We then guide the search for symbolic execution on the small number of paths inside multiple MPCs rather than the exponential number of paths. We implement our technique Empc based on KLEE. We conduct a comprehensive evaluation of Empc to investigate its performance in code coverage, bug finding, and runtime overhead. The evaluation shows that Empc can cover 19.6% more basic blocks than KLEE's best search strategy and 24.4% more lines compared to the state-of-the-art work cgs. Empc also finds 24 more security violations than KLEE's best search strategy. Meanwhile, Empc can significantly reduce the memory usage of KLEE by up to 93.5%.
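
To give a feel for what a minimum path cover is, here is a sketch of the textbook construction for a directed acyclic graph: a minimum set of vertex-disjoint paths covering every node, found via bipartite matching (the number of paths equals the number of nodes minus the size of a maximum matching). This is a swapped-in classic algorithm, not Empc's own procedure, which the abstract does not detail and which may well allow overlapping paths; the toy control-flow graph is made up.

```python
def min_path_cover(num_nodes, edges):
    """Return a minimum set of vertex-disjoint paths covering every node of a DAG."""
    successors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        successors[u].append(v)

    match_next = [None] * num_nodes  # match_next[u]: node that follows u on its path
    match_prev = [None] * num_nodes  # match_prev[v]: node that precedes v on its path

    def try_augment(u, visited):
        """Kuhn's augmenting-path step: try to give u a successor of its own."""
        for v in successors[u]:
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current predecessor can be re-routed.
            if match_prev[v] is None or try_augment(match_prev[v], visited):
                match_next[u] = v
                match_prev[v] = u
                return True
        return False

    for u in range(num_nodes):
        try_augment(u, set())

    # Every node without a matched predecessor starts one path.
    paths = []
    for start in range(num_nodes):
        if match_prev[start] is None:
            path, node = [], start
            while node is not None:
                path.append(node)
                node = match_next[node]
            paths.append(path)
    return paths

# Made-up diamond-shaped control-flow graph: 0 branches to 1 and 2, both rejoin at 3.
cfg_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(min_path_cover(5, cfg_edges))  # [[0, 1, 3, 4], [2]]
```

Two paths cover all five blocks here, whereas exhaustive exploration grows exponentially with the number of branches. Note that in this vertex-disjoint variant a path such as `[2]` is not a full entry-to-exit execution, which is one reason a tool working on real programs would need a different formulation.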

link

Data Annotation Techniques