Vincent's Arxiv FrontPage
Generated on 2025-06-17. This frontpage is made by scraping arXiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions. |
|
New Datasets |
|
2025-06-16 |
A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy
Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered through analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only the verified medical sources were included in the dataset, while those from non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases, and the remaining columns represent symptoms. 0.819 Each symptom cell contains a binary value (1 or 0), indicating whether a symptom is associated with a disease (1 for presence, 0 for absence). Thereby, this structured representation makes the dataset very useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there are some advancements in the field of disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. 0.714 This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. 0.749 Further developments should include region-specific diseases and further fine-tuning of symptom associations for better diagnostic performance. |
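The tabular layout described in the abstract above (one disease column followed by 0/1 symptom columns) can be sketched as follows. This is a minimal illustration, not the dataset's actual schema: the symptom names, the disease rows, and both helper functions are hypothetical.

```python
# Hypothetical column names and rows; the real dataset's columns are not given in the abstract.
SYMPTOMS = ["fever", "cough", "rash"]
ROWS = [
    # disease name first, then one binary flag per symptom (1 = present, 0 = absent)
    ["influenza", 1, 1, 0],
    ["measles", 1, 0, 1],
]


def symptoms_by_disease(symptom_names, rows):
    """Map each disease to the list of symptoms whose cell is 1."""
    table = {}
    for disease, *flags in rows:
        table[disease] = [name for name, flag in zip(symptom_names, flags) if flag == 1]
    return table


def diseases_with(symptom, symptom_names, rows):
    """Return the diseases whose row has a 1 in the given symptom column."""
    idx = symptom_names.index(symptom)
    return [row[0] for row in rows if row[1 + idx] == 1]


if __name__ == "__main__":
    print(symptoms_by_disease(SYMPTOMS, ROWS))
    print(diseases_with("cough", SYMPTOMS, ROWS))
```

A binary matrix like this can be passed directly to most classifiers as a feature matrix, which is presumably why the abstract highlights machine learning-based disease prediction as a use case.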
2025-06-16 |
PeakWeather: MeteoSwiss Weather Station Measurements for Spatiotemporal Deep Learning
Accurate weather forecasts are essential for supporting a wide range of activities and decision-making processes, as well as mitigating the impacts of adverse weather events. While traditional numerical weather prediction (NWP) remains the cornerstone of operational forecasting, machine learning is emerging as a powerful alternative for fast, flexible, and scalable predictions. We introduce PeakWeather, a high-quality dataset of surface weather observations collected every 10 minutes over more than 8 years from the ground stations of the Federal Office of Meteorology and Climatology MeteoSwiss's measurement network. 0.933 The dataset includes a diverse set of meteorological variables from 302 station locations distributed across Switzerland's complex topography and is complemented with topographical indices derived from digital height models for context. 0.9 Ensemble forecasts from the currently operational high-resolution NWP model are provided as a baseline forecast against which to evaluate new approaches. The dataset's richness supports a broad spectrum of spatiotemporal tasks, including time series forecasting at various scales, graph structure learning, imputation, and virtual sensing. As such, PeakWeather serves as a real-world benchmark to advance both foundational machine learning research, meteorology, and sensor-based applications. |
2025-06-16 |
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. 0.848 Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week. |
2025-06-16 |
Lecture Video Visual Objects (LVVO) Dataset: A Benchmark for Visual Object Detection in Educational Videos
We introduce the Lecture Video Visual Objects (LVVO) dataset, a new benchmark for visual object detection in educational video content. The dataset consists of 4,000 frames extracted from 245 lecture videos spanning biology, computer science, and geosciences. 0.946 A subset of 1,000 frames, referred to as LVVO_1k, has been manually annotated with bounding boxes for four visual categories: Table, Chart-Graph, Photographic-image, and Visual-illustration. Each frame was labeled independently by two annotators, resulting in an inter-annotator F1 score of 83.41%, indicating strong agreement. To ensure high-quality consensus annotations, a third expert reviewed and resolved all cases of disagreement through a conflict resolution process. To expand the dataset, a semi-supervised approach was employed to automatically annotate the remaining 3,000 frames, forming LVVO_3k. 0.728 The complete dataset offers a valuable resource for developing and evaluating both supervised and semi-supervised methods for visual content detection in educational videos. The LVVO dataset is publicly available to support further research in this domain. 0.736 |
2025-06-16 |
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions
The quality of the video dataset (image quality, resolution, and fine-grained caption) greatly influences the performance of the video generation model. The growing demand for video applications sets higher requirements for high-quality video generation models. For example, the generation of movie-level Ultra-High Definition (UHD) videos and the creation of 4K short video content. However, the existing public datasets cannot support related research and applications. In this paper, we first propose a high-quality open-sourced UHD-4K (22.4% of which are 8K) text-to-video dataset named UltraVideo, which contains a wide range of topics (more than 100 kinds), and each video has 9 structured captions with one summarized caption (average of 824 words). 0.798 Specifically, we carefully design a highly automated curation process with four stages to obtain the final high-quality dataset: i) collection of diverse and high-quality video clips; ii) statistical data filtering; iii) model-based data purification; iv) generation of comprehensive, structured captions. In addition, we expand Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation. We believe that this work can make a significant contribution to future research on UHD video generation. UltraVideo dataset and UltraWan models are available at https://xzc-zju.github.io/projects/UltraVideo. |
2025-06-16 |
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients' medical conditions. However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. If the model can provide appropriate comfort and empathy based on the patient's negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process. To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient's emotions while addressing their concerns. 0.772 The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients' questions. Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model's ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers. |
2025-06-12 |
RationalVLA: A Rational Vision-Language-Action Model with Dual System
A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. 0.776 We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms that integrates the high-level vision-language model with the low-level manipulation policy by introducing learnable latent space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications. Our project page is https://irpn-eai.github.io/rationalvla. |
2025-06-12 |
Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. 0.701 We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing. |
2025-06-11 |
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. 0.718 EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration. |
2025-06-11 |
MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion
Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an individual's genuine emotional state. Their analysis has attracted considerable interest due to its promising applications in fields such as healthcare, criminal investigation, and human-computer interaction. However, existing ME research is limited to a single visual modality, overlooking the rich emotional information conveyed by other physiological modalities, resulting in ME recognition and spotting performance far below practical application needs. Therefore, exploring the cross-modal association mechanism between ME visual features and physiological signals (PS), and developing a multimodal fusion framework, represents a pivotal step toward advancing ME analysis. This study introduces a novel ME dataset, MMME, which, for the first time, enables synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). 0.722 By overcoming the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS, establishing a robust foundation for investigating ME neural mechanisms and conducting multimodal fusion-based analyses. Extensive experiments validate the dataset's reliability and provide benchmarks for ME analysis, demonstrating that integrating MEs with PS significantly enhances recognition and spotting performance. To the best of our knowledge, MMME is the most comprehensive ME dataset to date in terms of modality diversity. It provides critical data support for exploring the neural mechanisms of MEs and uncovering the visual-physiological synergistic effects, driving a paradigm shift in ME research from single-modality visual analysis to multimodal fusion. The dataset will be publicly available upon acceptance of this paper. 0.866 |
2025-06-11 |
Dataset of News Articles with Provenance Metadata for Media Relevance Assessment
Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. The existing methods attempting to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. 0.898 We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We identify that, while the zero-shot performance on LOR is promising, the performance on DTOR lags, leaving room for specialized architectures and future work. |
2025-06-11 |
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. 0.904 While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. 0.762 The dataset is made publicly available here: https://github.com/c-patsch/OKCV. 0.912 |
2025-06-11 |
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. 0.788 Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released. |
2025-06-11 |
Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages
Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. 0.859 Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities. |
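For readers less familiar with the metrics named in the abstract above (precision, recall, F1 score, accuracy, false positive rate), the sketch below computes them from binary confusion-matrix counts using their standard definitions. The function name and the example counts are illustrative assumptions, not figures from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "toxic" treated as the positive class.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "false_positive_rate": false_positive_rate,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(classification_metrics(tp=80, fp=20, fn=20, tn=80))
```

Because F1 is the harmonic mean of precision and recall, a change that raises recall while adding false positives (as the abstract reports for added context) can still improve F1 as long as the precision loss is smaller than the recall gain.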
2025-06-11 |
Text-Aware Image Restoration with Diffusion Models
Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. 0.753 Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/ |
2025-06-11 |
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. 0.715 Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness. Experiments show that SCM gains 0.95+ in macro F1 score that is comparable to full detection, by only seeing the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and lead to a higher harmlessness score than DPO. |
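The partial-detection idea in the abstract above (follow the output stream and stop early once harmfulness is detected) can be sketched generically as below. This is not the paper's SCM: `monitor_stream`, the `score_prefix` callback, the 0.5 threshold, and the keyword-based `toy_scorer` are all hypothetical stand-ins for a trained token-level monitor.

```python
from typing import Callable, Iterable, List, Tuple


def monitor_stream(
    tokens: Iterable[str],
    score_prefix: Callable[[List[str]], float],
    threshold: float = 0.5,
) -> Tuple[List[str], bool]:
    """Consume tokens one by one and stop as soon as the prefix is judged harmful.

    Returns the tokens seen so far and a flag saying whether generation was stopped.
    """
    seen: List[str] = []
    for token in tokens:
        seen.append(token)
        # Judge the partial output after every new token instead of waiting
        # for the complete response.
        if score_prefix(seen) >= threshold:
            return seen, True
    return seen, False


def toy_scorer(prefix: List[str]) -> float:
    """Placeholder scorer: flags a prefix that contains a blocked word."""
    return 1.0 if "attack" in prefix else 0.0


if __name__ == "__main__":
    print(monitor_stream(["how", "to", "attack", "a", "server"], toy_scorer))
    print(monitor_stream(["how", "to", "bake", "bread"], toy_scorer))
```

The latency benefit comes from the early return: when a harmful prefix is flagged, the remaining tokens are never generated or inspected.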
2025-06-10 |
WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos
To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings. To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse https://www.synapse.org/Synapse:syn66401174/files. 0.928 |
2025-06-10 |
WIP: Large Language Model-Enhanced Smart Tutor for Undergraduate Circuit Analysis
This research-to-practice work-in-progress (WIP) paper presents an AI-enabled smart tutor designed to provide homework assessment and feedback for students in an undergraduate circuit analysis course. We detail the tutor's design philosophy and core components, including open-ended question answering and homework feedback generation. The prompts are carefully crafted to optimize responses across different problems. The smart tutor was deployed on the Microsoft Azure platform and is currently in use in an undergraduate circuit analysis course at the School of Electrical and Computer Engineering in a large, public, research-intensive institution in the Southeastern United States. Beyond offering personalized instruction and feedback, the tutor collects student interaction data, which is summarized and shared with the course instructor. To evaluate its effectiveness, we collected student feedback, with 90.9% of responses indicating satisfaction with the tutor. Additionally, we analyze a subset of collected data on preliminary circuit analysis topics to assess tutor usage frequency for each problem and identify frequently asked questions. These insights help instructors gain real-time awareness of student difficulties, enabling more targeted classroom instruction. In future work, we will release a full analysis once the complete dataset is available after the Spring 2025 semester. 0.892 We also explore the potential applications of this smart tutor across a broader range of engineering disciplines by developing improved prompts, diagram-recognition methods, and database management strategies, which remain ongoing areas of research. |
2025-06-10 |
ORIDa: Object-centric Real-world Image Composition Dataset
Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models. However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. 0.84 ORIDa has two types of data: factual-counterfactual sets and factual-only scenes. The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene. The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition. Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing. |
2025-06-10 |
Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks -- such as shot-chain execution tasks and single-screen grounding tasks -- while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks to a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly sacrificing inference overhead. The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io. 0.723 |
2025-06-10 |
FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents
We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography) corpus. 0.706 It consists of 18 bilingual speakers, who produced speech in their native language (L1), second language (L2), and imitated L2 (fake foreign accent). The new corpus enables research into language variability from phonetic and technological points of view. Accordingly, we include two preliminary case studies to demonstrate both perspectives. The first case study explores the impact of L2 and imitated L2 on the performance of an automatic speaker verification system, while the second illustrates the articulatory patterns of one speaker in L1, L2, and a fake accent. |
2025-06-10 |
Employing self-supervised learning models for cross-linguistic child speech maturity classification
Speech technology systems struggle with many downstream tasks for child speech due to small training corpora and the difficulties that child speech poses. We apply a novel dataset, SpeechMaturity, to state-of-the-art transformer models to address a fundamental classification task: identifying child vocalizations. Unlike previous corpora, our dataset captures maximally ecologically-valid child vocalizations across an unprecedented sample, comprising children acquiring 25+ languages in the U.S., Bolivia, Vanuatu, Papua New Guinea, Solomon Islands, and France. 0.796 The dataset contains 242,004 labeled vocalizations, magnitudes larger than previous work. 0.815 Models were trained to distinguish between cry, laughter, mature (consonant+vowel), and immature speech (just consonant or vowel). Models trained on the dataset outperform state-of-the-art models trained on previous datasets, achieved classification accuracy comparable to humans, and were robust across rural and urban settings. |
2025-06-10 |
Princeton365: A Diverse Dataset with Accurate Camera Pose
We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. 0.786 Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. Unlike existing metrics such as Average Trajectory Error (ATE), our new metric allows the performance of SLAM methods to be compared across scenes, allowing researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit https://princeton365.cs.princeton.edu for the dataset, code, videos, and submission. 0.883 |
2025-06-09 |
Egocentric Event-Based Vision for Ping Pong Ball Trajectory Prediction
In this paper, we present a real-time egocentric trajectory prediction system for table tennis using event cameras. Unlike standard cameras, which suffer from high latency and motion blur at fast ball speeds, event cameras provide higher temporal resolution, allowing more frequent state updates, greater robustness to outliers, and accurate trajectory predictions using just a short time window after the opponent's impact. We collect a dataset of ping-pong game sequences, including 3D ground-truth trajectories of the ball, synchronized with sensor data from the Meta Project Aria glasses and event streams. 0.798 Our system leverages foveated vision, using eye-gaze data from the glasses to process only events in the viewer's fovea. This biologically inspired approach improves ball detection performance and significantly reduces computational latency, as it efficiently allocates resources to the most perceptually relevant regions, achieving a reduction factor of 10.81 on the collected trajectories. Our detection pipeline has a worst-case total latency of 4.5 ms, including computation and perception - significantly lower than a frame-based 30 FPS system, which, in the worst case, takes 66 ms solely for perception. Finally, we fit a trajectory prediction model to the estimated states of the ball, enabling forecasting of the ball's future 3D trajectory. To the best of our knowledge, this is the first approach to predict table tennis trajectories from an egocentric perspective using event cameras. |
2025-06-09 |
FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity
In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. In this paper, we propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on the inefficient PINN losses. Extensive experiments on three public datasets and a newly collected challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. 0.734 Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels in training. |
2025-06-09 |
CrosswalkNet: An Optimized Deep Learning Framework for Pedestrian Crosswalk Detection in Aerial Images with High-Performance Computing
With the increasing availability of aerial and satellite imagery, deep learning presents significant potential for transportation asset management, safety analysis, and urban planning. This study introduces CrosswalkNet, a robust and efficient deep learning framework designed to detect various types of pedestrian crosswalks from 15-cm resolution aerial images. CrosswalkNet incorporates a novel detection approach that improves upon traditional object detection strategies by utilizing oriented bounding boxes (OBB), enhancing detection precision by accurately capturing crosswalks regardless of their orientation. Several optimization techniques, including Convolutional Block Attention, a dual-branch Spatial Pyramid Pooling-Fast module, and cosine annealing, are implemented to maximize performance and efficiency. A comprehensive dataset comprising over 23,000 annotated crosswalk instances is utilized to train and validate the proposed framework. 0.702 The best-performing model achieves an impressive precision of 96.5% and a recall of 93.3% on aerial imagery from Massachusetts, demonstrating its accuracy and effectiveness. CrosswalkNet has also been successfully applied to datasets from New Hampshire, Virginia, and Maine without transfer learning or fine-tuning, showcasing its robustness and strong generalization capability. Additionally, the crosswalk detection results, processed using High-Performance Computing (HPC) platforms and provided in polygon shapefile format, have been shown to accelerate data processing and detection, supporting real-time analysis for safety and mobility applications. This integration offers policymakers, transportation engineers, and urban planners an effective instrument to enhance pedestrian safety and improve urban mobility. |
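The precision and recall reported above can be combined into a single F1 score. The following is a small worked example rather than a figure from the paper: the abstract does not state an F1 value, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted in the abstract for the Massachusetts imagery.
f1 = f1_score(0.965, 0.933)
print(f"F1 = {f1:.3f}")  # F1 = 0.949
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, which is why it lands slightly below the arithmetic midpoint of 96.5% and 93.3%.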
2025-06-09 |
FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling
Recent advances in generative modeling -- particularly diffusion models and flow matching -- have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce FunDiff, a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at https://github.com/sifanexisted/fundiff. 0.851 |
2025-06-09 |
WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLM), focusing on domain-specific tasks like math and visual perception, a critical question remains: How can we achieve general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. 0.909 (3) A comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance. |
2025-06-09 |
Creating a Historical Migration Dataset from Finnish Church Records, 1800-1920
This article presents a large-scale effort to create a structured dataset of internal migration in Finland between 1800 and 1920 using digitized church moving records. These records, maintained by Evangelical-Lutheran parishes, document the migration of individuals and families and offer a valuable source for studying historical demographic patterns. The dataset includes over six million entries extracted from approximately 200,000 images of handwritten migration records. 0.926 The data extraction process was automated using a deep learning pipeline that included layout analysis, table detection, cell classification, and handwriting recognition. The complete pipeline was applied to all images, resulting in a structured dataset suitable for research. The dataset can be used to study internal migration, urbanization, family migration, and the spread of disease in preindustrial Finland. 0.805 A case study from the Elimäki parish shows how local migration histories can be reconstructed. The work demonstrates how large volumes of handwritten archival material can be transformed into structured data to support historical and demographic research. |
2025-06-09 |
Exposing Hidden Backdoors in NFT Smart Contracts: A Static Security Analysis of Rug Pull Patterns
The explosive growth of Non-Fungible Tokens (NFTs) has revolutionized digital ownership by enabling the creation, exchange, and monetization of unique assets on blockchain networks. However, this surge in popularity has also given rise to a disturbing trend: the emergence of rug pulls - fraudulent schemes where developers exploit trust and smart contract privileges to drain user funds or invalidate asset ownership. Central to many of these scams are hidden backdoors embedded within NFT smart contracts. Unlike unintentional bugs, these backdoors are deliberately coded and often obfuscated to bypass traditional audits and exploit investor confidence. In this paper, we present a large-scale static analysis of 49,940 verified NFT smart contracts using Slither, a static analysis framework, to uncover latent vulnerabilities commonly linked to rug pulls. We introduce a custom risk scoring model that classifies contracts into high, medium, or low risk tiers based on the presence and severity of rug pull indicators. Our dataset was derived from verified contracts on the Ethereum mainnet, and we generate multiple visualizations to highlight red flag clusters, issue prevalence, and co-occurrence of critical vulnerabilities. 0.832 While we do not perform live exploits, our results reveal how malicious patterns often missed by simple reviews can be surfaced through static analysis at scale. We conclude by offering mitigation strategies for developers, marketplaces, and auditors to enhance smart contract security. By exposing how hidden backdoors manifest in real-world smart contracts, this work contributes a practical foundation for detecting and mitigating NFT rug pulls through scalable automated analysis. |
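A severity-weighted scoring model of the kind described above could look roughly like the sketch below. The abstract does not give the actual indicators, weights, or thresholds, so every indicator name, weight, and cutoff here is hypothetical and is not taken from the paper or from Slither's detector list.

```python
# Hypothetical indicator names and severity weights, for illustration only.
SEVERITY_WEIGHTS = {
    "owner_can_mint_unlimited": 5,
    "unrestricted_withdraw": 5,
    "hidden_fee_change": 4,
    "owner_can_pause_transfers": 3,
    "missing_event_emission": 1,
}


def risk_score(findings):
    """Sum the weights of the distinct indicators found in one contract."""
    return sum(SEVERITY_WEIGHTS.get(name, 0) for name in set(findings))


def risk_tier(score, high=8, medium=4):
    """Map a score to a tier; the thresholds are assumed, not the paper's."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


findings = ["owner_can_mint_unlimited", "hidden_fee_change"]
print(risk_tier(risk_score(findings)))
```

Deduplicating findings with `set` means a contract is scored on which indicators are present, not on how many times a static analyzer reports each one.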
2025-06-09 |
UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References
6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object's appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references. We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image. For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model. Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions. This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing robustness in pose estimation accuracy and improving object completeness. We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. 0.719 Experimental results demonstrate significant performance improvements over existing methods, particularly when object observations are incomplete or partially captured. Project page: https://minfenli.github.io/UA-Pose/ |
2025-06-09 |
Audio-Sync Video Generation with Multi-Stream Temporal Control
Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively -- resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. 0.794 DEMIX is structured into five overlapping subsets, enabling scalable multi-stage training for diverse generation scenarios. Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment. Project page: https://hjzheng.net/projects/MTV/. |
2025-06-09 |
Dreamland: Controllable World Creation with Simulator and Generative Models
Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, hindering their use in editing scenes and training embodied AI agents. We propose Dreamland, a hybrid world generation framework combining the granular control of a physics-based simulator and the photorealistic content output of large-scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation to bridge the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models. We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality and 17.9% stronger controllability, and has great potential to enhance embodied agent training. Code and data will be made available. 0.767 |
2025-06-05 |
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LLMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. 0.841 This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. 0.757 Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model. |
2025-06-05 |
Search Arena: Analyzing Search-Augmented LLMs
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. 0.711 Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. 0.777 Our dataset and code are available at: https://github.com/lmarena/search-arena. 0.744 |
2025-06-05 |
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. 0.892 To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo. |
Data Quality |
|
2025-06-16 |
CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations. To address this, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. In addition, we design mixed-label supervision to mitigate the error accumulation during distillation. 0.607 Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/. |
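To make the decoding idea concrete, here is a toy sketch of Jacobi decoding with an optional early exit. It is not the authors' implementation: `toy_next_token` stands in for a real model, and capping the number of iterations via `max_iters` is only one assumed way of relaxing the convergence condition, since the abstract does not say which criterion the paper uses.

```python
def toy_next_token(prefix):
    """Stand-in for a greedy model step: next token = last token + 1 (mod 10)."""
    return (prefix[-1] + 1) % 10


def jacobi_decode(next_token, prompt, n_tokens, max_iters=None, pad=0):
    """Refine all n_tokens positions together until a fixed point is reached.

    Each iteration recomputes every position from the previous iteration's
    guess. A real system would do this in one parallel forward pass; the list
    comprehension below only simulates that. If max_iters is set, decoding
    stops early even if the guess has not converged.
    """
    guess = [pad] * n_tokens
    iters = 0
    while True:
        new = [next_token(prompt + guess[:i]) for i in range(n_tokens)]
        iters += 1
        if new == guess:  # fixed point: identical to greedy autoregressive output
            break
        guess = new
        if max_iters is not None and iters >= max_iters:
            break  # early exit with a possibly unconverged guess
    return guess, iters


print(jacobi_decode(toy_next_token, [3], 4))
print(jacobi_decode(toy_next_token, [3], 4, max_iters=2))
```

In this toy model only one additional token becomes correct per iteration, which illustrates the "lengthy iterations" problem: without a model trained to fix several tokens per step, Jacobi decoding needs about as many iterations as autoregressive decoding needs steps.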
2025-06-12 |
Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include disagreement in labeling among annotators and imbalanced data distributions. 0.789 This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2, evaluated on the MSP-Podcast dataset. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender, and including "Other" (O) and "No Agreement" (X) samples in the training set. Our system's results secured both first and second places in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble. |
2025-06-10 |
UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags
The present study extends recent work on Universal Dependencies annotations for second-language (L2) Korean by introducing a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns those constructions with corresponding UPOS categories. We also broaden the existing L2-Korean corpus by annotating 2,998 new sentences from argumentative essays. To evaluate the impact of XPOS-UPOS alignments, we fine-tune L2-Korean morphosyntactic analysis models on datasets both with and without these alignments, using two NLP toolkits. Our results indicate that the aligned dataset not only improves consistency across annotation layers but also enhances morphosyntactic tagging and dependency-parsing accuracy, particularly in cases of limited annotated data. 0.62 |
2025-06-09 |
Rethinking Crowd-Sourced Evaluation of Neuron Explanations
Interpreting individual neurons or directions in activation space is an important component of mechanistic interpretability. As such, many algorithms have been proposed to automatically produce neuron explanations, but it is often not clear how reliable these explanations are, or which methods produce the best explanations. This can be measured via crowd-sourced evaluations, but they can often be noisy and expensive, leading to unreliable results. In this paper, we carefully analyze the evaluation pipeline and develop a cost-effective and highly accurate crowdsourced evaluation strategy. In contrast to previous human studies that only rate whether the explanation matches the most highly activating inputs, we estimate whether the explanation describes neuron activations across all inputs. To estimate this effectively, we introduce a novel application of importance sampling to determine which inputs are the most valuable to show to raters, leading to around 30x cost reduction compared to uniform sampling. We also analyze the label noise present in crowd-sourced evaluations and propose a Bayesian method to aggregate multiple ratings, leading to a further ~5x reduction in the number of ratings required for the same accuracy. 0.614 Finally, we use these methods to conduct a large-scale study comparing the quality of neuron explanations produced by the most popular methods for two different vision models. |
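The importance-sampling idea can be illustrated with a generic estimator of a population mean. This is a minimal sketch of the standard technique, not the paper's estimator: the per-input `values`, the `proposal` weights, and the function name are assumptions made for illustration.

```python
import random


def importance_sampling_mean(values, proposal, n_samples, seed=0):
    """Estimate the uniform mean of `values` from samples drawn from `proposal`.

    Each sampled value is reweighted by p(i) / q(i), where p is uniform over
    all inputs and q is the normalized proposal, which keeps the estimate
    unbiased even though some inputs are shown far more often than others.
    """
    total = sum(proposal)
    q = [w / total for w in proposal]
    p = 1.0 / len(values)
    rng = random.Random(seed)
    idx = rng.choices(range(len(values)), weights=q, k=n_samples)
    return sum(values[i] * p / q[i] for i in idx) / n_samples


# Hypothetical per-input scores. Sampling proportionally to the score itself is
# the ideal case: every reweighted sample equals the true mean.
values = [1.0, 2.0, 3.0, 4.0]
print(importance_sampling_mean(values, proposal=values, n_samples=5))
```

The cost saving comes from variance reduction: when the proposal concentrates on inputs that contribute most to the quantity being estimated, far fewer rated samples are needed for the same accuracy than under uniform sampling.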
2025-06-05 |
LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO (Rafailov et al., 2024), to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose Lean Preference Optimization (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. 0.604 Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs. |
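A reference-free reward defined as the average likelihood of a response can be sketched as the mean per-token log-probability under the policy. The snippet below is only an illustration of that reading: the abstract does not give LeanPO's exact formula or loss, and the token log-probabilities are made-up numbers.

```python
def average_log_likelihood(token_logprobs):
    """Length-normalized log-likelihood: (1/|y|) * sum_t log pi(y_t | x, y_<t)."""
    return sum(token_logprobs) / len(token_logprobs)


def reward_margin(winning_logprobs, losing_logprobs):
    """Difference between the rewards of the winning and losing responses."""
    return average_log_likelihood(winning_logprobs) - average_log_likelihood(losing_logprobs)


# Hypothetical per-token log-probabilities for y_w (3 tokens) and y_l (5 tokens).
y_w = [-0.2, -0.4, -0.3]
y_l = [-1.0, -0.8, -1.2, -0.9, -1.1]
print(reward_margin(y_w, y_l))
```

Normalizing by length keeps the reward from simply favoring shorter responses, and because it uses only the policy's own log-probabilities, no reference model is required.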
2025-06-04 |
Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems
Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). 0.632 Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the "Extra" scheme yielded a 5.5% WER, an 11.3% relative improvement over the "Pure" scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription. |
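The 11.3% figure follows directly from the two reported error rates, as the short check below shows. The `wer` helper is a standard word-level edit-distance implementation included for context, not the challenge's scoring script, and the example sentences are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - improved) / baseline


# Dropping a hesitation counts as one deletion out of four reference words.
print(wer("i um think so", "i think so"))         # 0.25
print(f"{relative_improvement(6.2, 5.5):.1%}")    # 11.3%
```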
2025-06-03 |
Causal Estimation of Tokenisation Bias
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. 0.647 In this work, we quantify one particular type of tokenisation bias: the effect of including or excluding a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling. |
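A regression discontinuity estimate around the vocabulary cutoff can be sketched as two local linear fits whose predictions are compared at the cutoff. This is a generic illustration of the design, not the paper's estimator: the bandwidth, the synthetic outcomes, and the assumption that ranks below the cutoff are in the vocabulary are all chosen here for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def rdd_effect(ranks, outcomes, cutoff, bandwidth):
    """Jump in the outcome at the cutoff: in-vocabulary side minus out-of-vocabulary side."""
    inside = [(r, y) for r, y in zip(ranks, outcomes) if cutoff - bandwidth <= r < cutoff]
    outside = [(r, y) for r, y in zip(ranks, outcomes) if cutoff <= r < cutoff + bandwidth]
    slope_in, icpt_in = fit_line(*zip(*inside))
    slope_out, icpt_out = fit_line(*zip(*outside))
    return (icpt_in + slope_in * cutoff) - (icpt_out + slope_out * cutoff)


# Synthetic, noise-free data: a smooth trend in rank plus a jump of 3.0 for
# subwords ranked before the cutoff K = 100 (i.e., included in the vocabulary).
K = 100
ranks = list(range(90, 110))
outcomes = [0.5 + 0.01 * r + (3.0 if r < K else 0.0) for r in ranks]
print(round(rdd_effect(ranks, outcomes, cutoff=K, bandwidth=10), 6))
```

Fitting each side separately lets the smooth dependence on rank be absorbed by the trend lines, so only the discontinuity at the cutoff is attributed to vocabulary membership.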
2025-05-29 |
Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection
Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. 0.677 Our contributions include: first, demonstrating that incorporating reading text through prompting benefits verbatim transcription performance over fine-tuning, and second, showing that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies -- children's read-aloud and adult atypical speech -- and found that our proposed strategies improve verbatim transcription and miscue detection compared to current state-of-the-art. |
2025-05-29 |
FMG-Det: Foundation Model Guided Robust Object Detection
Collecting high quality data for object detection tasks is challenging due to the inherent subjectivity in labeling the boundaries of an object.This makes it difficult to not only collect consistent annotations across a dataset but also to validate them, as no two annotators are likely to label the same object using the exact same coordinates. 0.75These challenges are further compounded when object boundaries are partially visible or blurred, which can be the case in many domains.Training on noisy annotations significantly degrades detector performance, rendering them unusable, particularly in few-shot settings, where just a few corrupted annotations can impact model performance. 0.774In this work, we propose FMG-Det, a simple, efficient methodology for training models with noisy annotations.More specifically, we propose combining a multiple instance learning (MIL) framework with a pre-processing pipeline that leverages powerful foundation models to correct labels prior to training.This pre-processing pipeline, along with slight modifications to the detector head, results in state-of-the-art performance across a number of datasets, for both standard and few-shot scenarios, while being much simpler and more efficient than other approaches. |
Benchmarks |
|
2025-06-16 |
Omni-AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented for Efficient Long Video Understanding
Multimodal Large Language Models (MLLMs) struggle with long videos due to fixed context windows and weak long-term dependency modeling.Existing Retrieval-Augmented Generation (RAG) methods for videos use static retrieval strategies, leading to inefficiencies for simple queries and information loss for complex tasks.To address this, we propose AdaVideoRAG, a novel framework that dynamically adapts retrieval granularity based on query complexity using a lightweight intent classifier.Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs, enabling optimal resource allocation across tasks.We also introduce the HiVU benchmark for comprehensive evaluation. 0.703Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.AdaVideoRAG establishes a new paradigm for adaptive retrieval in video analysis.Codes will be open-sourced at https://github.com/xzc-zju/AdaVideoRAG. |
2025-06-16 |
EBS-CFL: Efficient and Byzantine-robust Secure Clustered Federated Learning
Despite federated learning (FL)'s potential in collaborative learning, its performance deteriorates under the data heterogeneity of distributed users. Recently, clustered federated learning (CFL) has emerged to address this challenge by partitioning users into clusters according to their similarity. However, CFL faces difficulties in training when users are unwilling to share their cluster identities due to privacy concerns. To address these issues, we present an innovative Efficient and Robust Secure Aggregation scheme for CFL, dubbed EBS-CFL. The proposed EBS-CFL supports effective CFL training while keeping users' cluster identities confidential. Moreover, it detects potential poisoning attacks without compromising individual client gradients by discarding negatively correlated gradients and aggregating positively correlated ones using a weighted approach. The server also authenticates correct gradient encoding by clients. EBS-CFL has high efficiency with client-side overhead O(ml + m^2) for communication and O(m^2l) for computation, where m is the number of cluster identities, and l is the gradient size. When m = 1, EBS-CFL's client-side computational efficiency is at least O(log n) times better than comparison schemes, where n is the number of clients. 0.604 In addition, we validate the scheme through extensive experiments. Finally, we theoretically prove the scheme's security. |
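The filtering step described above (discard negatively correlated gradients, weight the positively correlated ones) can be illustrated with a minimal plaintext sketch. The abstract does not say what the gradients are correlated against or how the weights are chosen, so the `reference` direction and the cosine-similarity weights below are assumptions made for illustration only. The sketch also ignores the secure-aggregation and cluster-identity parts of EBS-CFL.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def robust_aggregate(gradients, reference):
    """Weighted average that drops gradients pointing away from `reference`.

    `reference` is a hypothetical trusted direction (an assumption, not
    something the abstract specifies). A gradient's weight is its cosine
    similarity to the reference, clipped at zero, so negatively correlated
    gradients contribute nothing.
    """
    weights = [max(0.0, cosine(g, reference)) for g in gradients]
    total = sum(weights)
    if total == 0.0:
        return [0.0] * len(reference)
    return [
        sum(w * g[i] for w, g in zip(weights, gradients)) / total
        for i in range(len(reference))
    ]
```

For example, with gradients `[1, 0]`, `[2, 0]` and `[-5, 0]` and reference `[1, 0]`, the third gradient gets weight zero and the result is the average of the first two.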
2025-06-16 |
Delay-optimal Congestion-aware Routing and Computation Offloading in Arbitrary Network
Emerging edge computing paradigms enable heterogeneous devices to collaborate on complex computation applications. However, for arbitrary heterogeneous edge networks, delay-optimal forwarding and computation offloading remains an open problem. In this paper, we jointly optimize data/result routing and computation placement in arbitrary networks with heterogeneous node capabilities, and congestion-dependent nonlinear transmission and processing delay. Despite the non-convexity of the formulated problem, based on analyzing the KKT conditions, we provide a set of sufficient optimality conditions that solve the problem globally. To provide insight into this global optimality, we show that the proposed non-convex problem is geodesic-convex under mild assumptions. We also show that the proposed sufficient optimality condition leads to a lower hemicontinuous solution set, providing stability against user-input perturbation. We then extend the framework to incorporate utility-based congestion control and fairness. A fully distributed algorithm is developed to converge to the global optimum. Numerical results demonstrate significant improvements over multiple baseline algorithms. 0.821 |
2025-06-16 |
Global Convergence of Adjoint-Optimized Neural PDEs
Many engineering and scientific fields have recently become interested in modeling terms in partial differential equations (PDEs) with neural networks. The resulting neural-network PDE model, being a function of the neural network parameters, can be calibrated to available data by optimizing over the PDE using gradient descent, where the gradient is evaluated in a computationally efficient manner by solving an adjoint PDE. These neural-network PDE models have emerged as an important research area in scientific machine learning. In this paper, we study the convergence of the adjoint gradient descent optimization method for training neural-network PDE models in the limit where both the number of hidden units and the training time tend to infinity. Specifically, for a general class of nonlinear parabolic PDEs with a neural network embedded in the source term, we prove convergence of the trained neural-network PDE solution to the target data (i.e., a global minimizer). The global convergence proof poses a unique mathematical challenge that is not encountered in finite-dimensional neural network convergence analyses due to (1) the neural network training dynamics involving a non-local neural network kernel operator in the infinite-width hidden layer limit where the kernel lacks a spectral gap for its eigenvalues and (2) the nonlinearity of the limit PDE system, which leads to a non-convex optimization problem, even in the infinite-width hidden layer limit (unlike in typical neural network training cases where the optimization problem becomes convex in the large neuron limit). The theoretical results are illustrated and empirically validated by numerical studies. 0.703 |
2025-06-16 |
DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models
Multimodal large language models (MLLMs) have streamlined front-end interface development by automating code generation.However, these models also introduce challenges in ensuring code quality.Existing approaches struggle to maintain both visual consistency and functional completeness in the generated components.Moreover, they lack mechanisms to assess the fidelity and correctness of the rendered pages.To address these issues, we propose DesignCoder, a novel hierarchical-aware and self-correcting automated code generation framework.Specifically, we introduce UI Grouping Chains, which enhance MLLMs' capability to understand and predict complex nested UI hierarchies.Subsequently, DesignCoder employs a hierarchical divide-and-conquer approach to generate front-end code.Finally, we incorporate a self-correction mechanism to improve the model's ability to identify and rectify errors in the generated code.Extensive evaluations on a dataset of UI mockups collected from both open-source communities and industry projects demonstrate that DesignCoder outperforms state-of-the-art baselines in React Native, a widely adopted UI framework.Our method achieves a 37.63%, 9.52%, 12.82% performance increase in visual similarity metrics (MSE, CLIP, SSIM) and significantly improves code structure similarity in terms of TreeBLEU, Container Match, and Tree Edit Distance by 30.19%, 29.31%, 24.67%. 0.609Furthermore, we conducted a user study with professional developers to assess the quality and practicality of the generated code.Results indicate that DesignCoder aligns with industry best practices, demonstrating high usability, readability, and maintainability.Our approach provides an efficient and practical solution for agile front-end development, enabling development teams to focus more on core functionality and product innovation. |
2025-06-16 |
Prefix-Tuning+: Modernizing Prefix-Tuning through Attention Independent Prefix Data
Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks.Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead.However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited.In this work, we demonstrate empirically that Prefix-Tuning underperforms on LLMs because of an inherent tradeoff between input and prefix significance within the attention head.This motivates us to introduce Prefix-Tuning+, a novel architecture that generalizes the principles of Prefix-Tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself.We further provide an overview of our construction process to guide future users when constructing their own context-based methods.Our experiments show that, across a diverse set of benchmarks, Prefix-Tuning+ consistently outperforms existing Prefix-Tuning methods. 0.655Notably, it achieves performance on par with the widely adopted LoRA method on several general benchmarks, highlighting the potential modern extension of Prefix-Tuning approaches. 0.671Our findings suggest that by overcoming its inherent limitations, Prefix-Tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation. |
2025-06-16 |
Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models
Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications.Recently, Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling.The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation.This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence.First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines.Second, comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters. 0.65Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported.Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading.We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity. |
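For readers unfamiliar with the samplers being compared, here is a minimal sketch of the three truncation rules applied to a next-token distribution. It follows the commonly cited definitions (top-k keeps the k most likely tokens, top-p keeps the smallest set whose cumulative probability reaches p, min-p keeps tokens whose probability is at least a fraction of the top token's probability). The function names and the toy distribution are illustrative assumptions, not code from either paper.

```python
import random


def _renormalize(probs, kept):
    """Zero out tokens not in `kept` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, kept)


def min_p_filter(probs, p_base):
    """Keep tokens with probability >= p_base * (probability of the top token)."""
    threshold = p_base * max(probs)
    kept = {i for i, q in enumerate(probs) if q >= threshold}
    return _renormalize(probs, kept)


def sample(probs, rng=random):
    """Draw one token index from a (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With the toy distribution `[0.5, 0.3, 0.15, 0.05]`, `min_p_filter(probs, 0.2)` uses a threshold of 0.1 and drops only the last token, whereas `top_k_filter(probs, 2)` drops the last two. Each sampler has one truncation hyperparameter here, which is the kind of quantity the re-analysis says must be controlled for when comparing them.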
2025-06-16 |
MARCO: Hardware-Aware Neural Architecture Search for Edge Devices with Multi-Agent Reinforcement Learning and Conformal Prediction Filtering
This paper introduces MARCO (Multi-Agent Reinforcement learning with Conformal Optimization), a novel hardware-aware framework for efficient neural architecture search (NAS) targeting resource-constrained edge devices.By significantly reducing search time and maintaining accuracy under strict hardware constraints, MARCO bridges the gap between automated DNN design and CAD for edge AI deployment.MARCO's core technical contribution lies in its unique combination of multi-agent reinforcement learning (MARL) with Conformal Prediction (CP) to accelerate the hardware/software co-design process for deploying deep neural networks.Unlike conventional once-for-all (OFA) supernet approaches that require extensive pretraining, MARCO decomposes the NAS task into a hardware configuration agent (HCA) and a Quantization Agent (QA).The HCA optimizes high-level design parameters, while the QA determines per-layer bit-widths under strict memory and latency budgets using a shared reward signal within a centralized-critic, decentralized-execution (CTDE) paradigm.A key innovation is the integration of a calibrated CP surrogate model that provides statistical guarantees (with a user-defined miscoverage rate) to prune unpromising candidate architectures before incurring the high costs of partial training or hardware simulation.This early filtering drastically reduces the search space while ensuring that high-quality designs are retained with a high probability.Extensive experiments on MNIST, CIFAR-10, and CIFAR-100 demonstrate that MARCO achieves a 3-4x reduction in total search time compared to an OFA baseline while maintaining near-baseline accuracy (within 0.3%). 0.607Furthermore, MARCO also reduces inference latency.Validation on a MAX78000 evaluation board confirms that simulator trends hold in practice, with simulator estimates deviating from measured values by less than 5%. |
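The pruning idea can be sketched with standard split conformal prediction. This is a generic illustration, not MARCO's surrogate: the choice of absolute residuals as the nonconformity score, the accuracy floor, and all names below are assumptions.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of calibration residuals |true - predicted|.

    Under exchangeability, a new candidate's true value lies within
    `prediction +/- q` with probability at least 1 - alpha, where alpha is
    the user-defined miscoverage rate.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(residuals)[rank - 1]


def keep_candidate(predicted_accuracy, q, accuracy_floor):
    """Keep a candidate unless even the top of its interval misses the floor."""
    return predicted_accuracy + q >= accuracy_floor
```

A candidate is pruned only when its upper bound `predicted_accuracy + q` falls below the required accuracy, so a design that would actually meet the floor is discarded with probability at most alpha (under the exchangeability assumption).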
2025-06-12 |
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraints. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. 0.618 Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems. |
2025-06-12 |
Evaluating Large Language Models on Non-Code Software Engineering Tasks
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation; however, their effectiveness on non-code Software Engineering (SE) tasks remains underexplored. We present the first comprehensive benchmark, which we name `Software Engineering Language Understanding' (SELU), for evaluating LLMs on 17 non-code tasks, spanning from identifying whether a requirement is functional or non-functional to estimating the effort and complexity of backlog items. SELU covers classification, regression, Named Entity Recognition (NER), and Masked Language Modeling (MLM) targets, with data drawn from diverse sources such as code repositories, issue tracking systems, and developer forums. We fine-tune 22 open-source LLMs, prompt two proprietary alternatives, and train two baselines. Performance is measured using metrics such as F1-macro, SMAPE, F1-micro, and accuracy, and compared via the Bayesian signed-rank test. 0.719 Our results show that moderate-scale decoder-only models consistently form a top tier, exhibiting high mean performance and low across-task variance, while domain adaptation via code-focused pre-training might yield only modest improvements. These insights guide model selection for non-code SE workflows and highlight directions for expanding SELU to generative and design-oriented scenarios. |
2025-06-12 |
A voice for minorities: diversity in approval-based committee elections under incomplete or inaccurate information
We study diversity in approval-based committee elections with incomplete or inaccurate information. As standard in the literature on approval-based multi-winner voting, we define diversity according to the maximum coverage problem, which is known to be NP-complete, with a best attainable polynomial time approximation ratio of $1-1/e$. In the incomplete information model, voters can vote on only a small portion of the candidates. We suggest a greedy algorithm and a local search algorithm that query voters and use the query responses to approximate the total population's opinion. For both algorithms, we prove an upper bound on the number of queries required to get a close to $(1-1/e)$-approximate solution with high probability. We also provide a lower bound for the query complexity of non-adaptive algorithms, which cannot adapt their querying strategy to readily obtained information. In the inaccurate information setting, voters' responses are corrupted with a probability $p\in(0,\frac{1}{2})$. We provide both an upper and a lower bound for the number of queries required to attain a $(1-1/e)$-approximate solution with high probability. Finally, using real data from Polis, we see that our algorithms perform remarkably better than the theoretical results suggest, both with incomplete and inaccurate information. 0.615 |
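As background for the $1-1/e$ ratio, here is a minimal sketch of the classic greedy algorithm for maximum coverage, applied to approval ballots with full information. It is not the authors' query-based variant: the paper's algorithms estimate these counts from a limited number of voter queries, whereas this sketch assumes every approval set is known. The data layout and names are illustrative assumptions.

```python
def greedy_committee(approvals, k):
    """Pick up to k candidates, each time adding the one that covers the most
    voters not yet covered (a voter is covered if the committee contains at
    least one candidate they approve).

    `approvals` maps each candidate to the set of voters approving them.
    Ties are broken by insertion order of the dictionary.
    """
    covered = set()
    committee = []
    remaining = dict(approvals)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        committee.append(best)
        covered |= remaining.pop(best)
    return committee, covered
```

For instance, with `{"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}` and `k = 2`, the greedy choice is `a` followed by `c`, covering all six voters.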
2025-06-12 |
Precise Zero-Shot Pointwise Ranking with LLMs through Post-Aggregated Global Context Information
Recent advancements have successfully harnessed the power of Large Language Models (LLMs) for zero-shot document ranking, exploring a variety of prompting strategies.Comparative approaches like pairwise and listwise achieve high effectiveness but are computationally intensive and thus less practical for larger-scale applications. 0.681Scoring-based pointwise approaches exhibit superior efficiency by independently and simultaneously generating the relevance scores for each candidate document. 0.624However, this independence ignores critical comparative insights between documents, resulting in inconsistent scoring and suboptimal performance.In this paper, we aim to improve the effectiveness of pointwise methods while preserving their efficiency through two key innovations: (1) We propose a novel Global-Consistent Comparative Pointwise Ranking (GCCP) strategy that incorporates global reference comparisons between each candidate and an anchor document to generate contrastive relevance scores. 0.641We strategically design the anchor document as a query-focused summary of pseudo-relevant candidates, which serves as an effective reference point by capturing the global context for document comparison.(2) These contrastive relevance scores can be efficiently Post-Aggregated with existing pointwise methods, seamlessly integrating essential Global Context information in a training-free manner (PAGC).Extensive experiments on the TREC DL and BEIR benchmark demonstrate that our approach significantly outperforms previous pointwise methods while maintaining comparable efficiency. 0.688Our method also achieves competitive performance against comparative methods that require substantially more computational resources. 0.714More analyses further validate the efficacy of our anchor construction strategy. |
2025-06-12 |
Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment
Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients.However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses.To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment.First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses.Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details.To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt.Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications. 0.604 |
2025-06-12 |
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA$^{2}$), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%. 0.607 |
2025-06-12 |
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs).However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. 0.64In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges.To tackle these issues, our pipeline integrates three core automated components.First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency.Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers.Finally, we automate the fail2pass validation process using these reliable exit code signals.Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance.We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00.We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation.Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory. |
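The exit-code-based grading and fail2pass check described above can be sketched as follows. This is a minimal illustration under the usual convention that exit code 0 means the test command succeeded. The function names, the timeout, and the way the command is run are assumptions, not the SWE-Factory implementation.

```python
import subprocess


def run_tests(command, cwd=None, timeout=600):
    """Run a test command (list of arguments) and return its exit code."""
    completed = subprocess.run(
        command, cwd=cwd, timeout=timeout, capture_output=True
    )
    return completed.returncode


def grade(exit_code):
    """Grade a run from its exit code alone, with no log parsing."""
    return "pass" if exit_code == 0 else "fail"


def is_fail2pass(exit_before_patch, exit_after_patch):
    """A task instance is valid if tests fail before the fix and pass after."""
    return grade(exit_before_patch) == "fail" and grade(exit_after_patch) == "pass"
```

Because the grade depends only on the exit code, the same check works across test frameworks and languages without a custom output parser, which is the property the abstract highlights.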
2025-06-12 |
ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems
There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models.Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs.In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods.Given a candidate solution $\hat{x}$ produced by an algorithm of the user's choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS.We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling.Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. 0.694We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold.To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS. |
2025-06-12 |
Rethinking Losses for Diffusion Bridge Samplers
Diffusion bridges are a promising class of deep-learning methods for sampling from unnormalized distributions.Recent works show that the Log Variance (LV) loss consistently outperforms the reverse Kullback-Leibler (rKL) loss when using the reparametrization trick to compute rKL-gradients.While the on-policy LV loss yields identical gradients to the rKL loss when combined with the log-derivative trick for diffusion samplers with non-learnable forward processes, this equivalence does not hold for diffusion bridges or when diffusion coefficients are learned.Based on this insight we argue that for diffusion bridges the LV loss does not represent an optimization objective that can be motivated like the rKL loss via the data processing inequality.Our analysis shows that employing the rKL loss with the log-derivative trick (rKL-LD) does not only avoid these conceptual problems but also consistently outperforms the LV loss.Experimental results with different types of diffusion bridges on challenging benchmarks show that samplers trained with the rKL-LD loss achieve better performance. 0.625From a practical perspective we find that rKL-LD requires significantly less hyperparameter optimization and yields more stable training behavior. |
2025-06-11 |
Investigating the Perception of Translational Shape-Changing Haptic Interfaces
Shape-changing haptic interfaces (SCHIs) are a promising and emerging field.However, compared to more established stimulus modalities, such as vibration, there is sparse literature on the perception of dynamic shapes.Furthermore, the influence of properties such as grasp types and displacement magnitude/direction has not been formally evaluated.This work attempts to initiate a formal perceptual evaluation of SCHIs via a psychophysical user study involving a 1-DOF translational shape-changing interface that can move its body with 1.25-micrometer resolution.Participants completed a Method of Constant Stimulus study while holding the device with three different grasps.Stimuli direction occurred both toward and away from the thumb, while the standard stimuli varied between small (0.48 mm) and large (6 mm).Our results indicate that translational SCHIs should maximize the translation magnitude rather than the number of fingers in contact.We also demonstrated how to apply our findings to real-world applications via a simple 'paddle game', where we compared conventional linear mapping with non-linear mapping derived from our perceptual experiment outcomes between the device position and its represented value.Results indicate that the non-linear mapping was more effective, with improved error distribution. 0.634We hope this work inspires further formal perceptual investigation into other SCHI morphologies. |
2025-06-11 |
A Weighted Loss Approach to Robust Federated Learning under Data Heterogeneity
Federated learning (FL) is a machine learning paradigm that enables multiple data holders to collaboratively train a machine learning model without sharing their training data with external parties. In this paradigm, workers locally update a model and share with a central server their updated gradients (or model parameters). While FL seems appealing from a privacy perspective, it opens a number of threats from a security perspective as (Byzantine) participants can contribute poisonous gradients (or model parameters) harming model convergence. Byzantine-resilient FL addresses this issue by ensuring that the training proceeds as if Byzantine participants were absent. To this end, common strategies ignore outlier gradients during model aggregation, assuming that Byzantine gradients deviate more from honest gradients than honest gradients do from each other. However, in heterogeneous settings, honest gradients may differ significantly, making it difficult to distinguish honest outliers from Byzantine ones. In this paper, we introduce the Worker Label Alignment Loss (WoLA), a weighted loss that aligns honest worker gradients despite data heterogeneity, which facilitates the identification of Byzantines' gradients. This approach significantly outperforms state-of-the-art methods in heterogeneous settings. 0.658 We provide both theoretical insights and empirical evidence of its effectiveness. 0.628 |
2025-06-11 |
Error-Guided Pose Augmentation: Enhancing Rehabilitation Exercise Assessment through Targeted Data Generation
Effective rehabilitation assessment is essential for monitoring patient progress, particularly in home-based settings.Existing systems often face challenges such as data imbalance and difficulty detecting subtle movement errors.This paper introduces Error-Guided Pose Augmentation (EGPA), a method that generates synthetic skeleton data by simulating clinically relevant movement mistakes.Unlike standard augmentation techniques, EGPA targets biomechanical errors observed in rehabilitation.Combined with an attention-based graph convolutional network, EGPA improves performance across multiple evaluation metrics.Experiments demonstrate reductions in mean absolute error of up to 27.6 percent and gains in error classification accuracy of 45.8 percent. 0.651Attention visualizations show that the model learns to focus on clinically significant joints and movement phases, enhancing both accuracy and interpretability.EGPA offers a promising approach for improving automated movement quality assessment in both clinical and home-based rehabilitation contexts. |
2025-06-11 |
OctoNav: Towards Generalist Embodied Navigation
Embodied navigation stands as a foundational pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of multi-modal and multi-capability. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. 0.655Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains specifically designed learning policies and rewards. Importantly, for TBA-SFT and Nav-GRPO designs, we are inspired by the OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer. Thus, we aim to investigate how to achieve thinking-before-action in the embodied navigation field, to improve the model's reasoning ability toward generalists. Specifically, we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a cold-start phase and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods. 0.668 |
2025-06-11 |
Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition
Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting.Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets.To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data.By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts.Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme.Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. 0.635We open source our code at https://github.com/pkaliosis/fada. |
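To make the distribution-alignment idea above concrete, here is a small sketch that compares the character frequency distribution of a predicted string against a target distribution. It assumes a 0/1 ground metric between distinct characters, under which the 1-Wasserstein distance reduces to the total variation distance. The abstract does not state the ground metric or the differentiable form used for training, so this is an illustration only and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def char_distribution(text: str) -> Dict[str, float]:
    """Empirical character frequency distribution of a string."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: k / total for ch, k in counts.items()}


def discrete_wasserstein(p: Dict[str, float], q: Dict[str, float]) -> float:
    """1-Wasserstein distance under a 0/1 ground metric (assumed here).

    With that metric the distance equals total variation:
    half the L1 difference between the two distributions.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0)) for ch in support)


# Hypothetical target derived from "training" text, compared with a prediction.
target = char_distribution("aab")
predicted = char_distribution("abb")
penalty = discrete_wasserstein(predicted, target)
```

A penalty like this could also rank candidate transcriptions at inference time, in the spirit of the guided decoding the abstract mentions.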
2025-06-11 |
IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments
We present IntPhys 2, a video benchmark designed to evaluate the intuitive physics understanding of deep learning models.Building on the original IntPhys benchmark, IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity.These conditions are inspired by research into intuitive physical understanding emerging during early childhood.IntPhys 2 offers a comprehensive suite of tests, based on the violation of expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments.Alongside the benchmark, we provide performance evaluations of several state-of-the-art models. 0.759Our findings indicate that while these models demonstrate basic visual understanding, they face significant challenges in grasping intuitive physics across the four principles in complex scenes, with most models performing at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy.This underscores the gap between current models and human-like intuitive physics understanding, highlighting the need for advancements in model architectures and training methodologies. |
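The 50% chance level quoted above is what a pairwise violation-of-expectation score gives for a model that cannot tell possible from impossible events. The sketch below assumes the model assigns a plausibility score to each video and that videos come in matched possible/impossible pairs. The exact scoring protocol of IntPhys 2 is not given in the abstract, so both assumptions are illustrative.

```python
from typing import List, Tuple


def pairwise_accuracy(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs where the possible video scores higher.

    Each pair is (plausibility of possible video, plausibility of impossible
    video). Ties count as half, so a constant scorer lands exactly at 0.5.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = 0.0
    for possible, impossible in pairs:
        if possible > impossible:
            correct += 1.0
        elif possible == impossible:
            correct += 0.5
    return correct / len(pairs)


# Hypothetical plausibility scores for four matched pairs.
scores = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5), (0.7, 0.3)]
accuracy = pairwise_accuracy(scores)
```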
2025-06-11 |
Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation
Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS).OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions.We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time.Vireo builds upon the frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features.To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction.Comprehensive evaluation on these components demonstrates the effectiveness of our designs. 0.623Our proposed Vireo achieves the state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments.Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo. |
2025-06-11 |
Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs
We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions.Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning.Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models.To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions.Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. 0.804The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection. |
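The abstract describes scoring a response by a kernel-based distributional distance between prompt and response hidden states. One standard distance of that kind is the maximum mean discrepancy (MMD). The sketch below uses a fixed RBF kernel on plain Python lists. Treating MMD as the distance, the fixed bandwidth, and the toy vectors are all assumptions here, and the paper's trainable deep kernels are not reproduced.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: List[Sequence[float]], ys: List[Sequence[float]],
                bandwidth: float = 1.0) -> float:
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D "hidden states" for a prompt and two candidate responses.
prompt_states = [[0.0, 0.0], [1.0, 0.0]]
near_response = [[0.1, 0.0], [1.1, 0.0]]
far_response = [[5.0, 5.0], [6.0, 5.0]]
near_score = mmd_squared(prompt_states, near_response)
far_score = mmd_squared(prompt_states, far_response)
```

Following the abstract's finding, the smaller deviation (here `near_score`) would be the one more suggestive of hallucination.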
2025-06-11 |
CEM-FBGTinyDet: Context-Enhanced Foreground Balance with Gradient Tuning for tiny Objects
Tiny object detection (TOD) reveals a fundamental flaw in feature pyramid networks: high-level features (P5-P6) frequently receive zero positive anchors under standard label assignment protocols, leaving their semantic representations untrained due to exclusion from loss computation. This creates dual deficiencies: (1) stranded high-level features become semantic dead-ends without gradient updates, while (2) low-level features lack essential semantic context for robust classification. To address these issues, we propose E-FPN-BS, a novel architecture that integrates multi-scale feature enhancement and adaptive optimization, systematically converting wasted high-level semantics into low-level feature enhancements. First, our Context Enhancement Module (CEM) employs dual-branch processing to align and compress high-level features for effective global-local fusion. Second, the Foreground-Background Separation Module (FBSM) generates spatial gating masks that dynamically amplify discriminative regions. To address gradient imbalance across object scales, we further propose a Dynamic Gradient-Balanced Loss (DCLoss) that automatically modulates loss contributions via scale-aware gradient equilibrium. Extensive experiments across multiple benchmark datasets demonstrate the outstanding performance and generalization ability of our approach. 0.747 |
2025-06-11 |
Discrete Scale-invariant Metric Learning for Efficient Collaborative Filtering
Metric learning has attracted extensive interest for its ability to provide personalized recommendations based on the importance of observed user-item interactions. Current metric learning methods aim to push negative items away from the corresponding users and positive items by an absolute geometrical distance margin. However, items may come from imbalanced categories with different intra-class variations. Thus, the absolute distance margin may not be ideal for estimating the difference between user preferences over imbalanced items. To this end, we propose a new method, named discrete scale-invariant metric learning (DSIML), by adding binary constraints to users and items, which maps users and items into binary codes of a shared Hamming subspace to speed up the online recommendation. Specifically, we first propose a scale-invariant margin based on angles at the negative item points in the shared Hamming subspace. Then, we derive a scale-invariant triple hinge loss based on the margin. To capture more preference difference information, we integrate a pairwise ranking loss into the scale-invariant loss in the proposed model. Due to the difficulty of directly optimizing the mixed integer optimization problem formulated with log-sum-exp functions, we seek to optimize its variational quadratic upper bound and learn hash codes with an alternating optimization strategy. Experiments on benchmark datasets clearly show that our proposed method is superior to competitive metric learning and hashing-based baselines for recommender systems. 0.66The implementation code is available at https://github.com/AnonyFeb/dsml. |
2025-06-11 |
Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering
Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task.For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs).However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation.To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features.Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph.We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted.Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. 0.639Our code is available at https://github.com/jhqi/SSGCO-EGAEL. |
2025-06-11 |
Faster-than-Nyquist Signaling is Good for Single-Carrier ISAC: An Analytical Study
In this paper, we provide an analytical study of single-carrier faster-than-Nyquist (FTN) signaling for integrated sensing and communications (ISAC). Our derivations show that FTN is advantageous for ISAC, and reveal that these advantages come from the fact that FTN signaling can effectively avoid the spectral aliasing due to the mismatch between the symbol rate and the bandwidth of the shaping pulse. Specifically, the communication spectral efficiency advantages of FTN signaling over time-invariant multipath channels are analytically shown, where both upper and lower bounds on the spectral efficiency are derived. We show that the gap between these two bounds corresponds to the potential signal-to-noise ratio (SNR) variation due to the presence of multipath delay and spectral aliasing, which diminishes as the symbol rate grows higher. In particular, in the limiting case, this SNR variation disappears while the degrees of freedom (DoF) of the system attain their maximum. Furthermore, the sensing advantages of FTN signals are verified in terms of the expected normalized squared ambiguity function. We show that FTN signals generally enjoy a more robust ranging performance. More importantly, we prove that FTN signaling can effectively avoid the undesired peaks in the considered ambiguity function along the Doppler dimension, thereby reducing the ambiguities in velocity estimation. All these conclusions are explicitly verified by numerical results. 0.64 |
2025-06-11 |
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprising 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. 0.626The dataset is made publicly available here: https://github.com/c-patsch/OKCV. |
2025-06-11 |
Tight Paths and Tight Pairs in Weighted Directed Graphs
We state the graph-theoretic computational problem of finding tight paths in a directed, edge-weighted graph, as well as its simplification of finding tight pairs. These problems are motivated by the need for algorithms that find so-called basic antecedents in closure spaces, in one specific approach to data analysis. We discuss and compare several algorithms for solving these problems. 0.652 |
2025-06-11 |
Locomotion on Constrained Footholds via Layered Architectures and Model Predictive Control
Computing stabilizing and optimal control actions for legged locomotion in real time is difficult due to the nonlinear, hybrid, and high dimensional nature of these robots.The hybrid nature of the system introduces a combination of discrete and continuous variables which causes issues for numerical optimal control.To address these challenges, we propose a layered architecture that separates the choice of discrete variables and a smooth Model Predictive Controller (MPC).The layered formulation allows for online flexibility and optimality without sacrificing real-time performance through a combination of gradient-free and gradient-based methods.The architecture leverages a sampling-based method for determining discrete variables, and a classical smooth MPC formulation using these fixed discrete variables.We demonstrate the results on a quadrupedal robot stepping over gaps and onto terrain with varying heights.In simulation, we demonstrate the controller on a humanoid robot for gap traversal.The layered approach is shown to be more optimal and reliable than common heuristic-based approaches and faster to compute than pure sampling methods. 0.614 |
2025-06-11 |
Text-Aware Image Restoration with Diffusion Models
Image restoration aims to recover degraded images.However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images.Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination.In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity.To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances.Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training.This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps.Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. 0.609See our project page: https://cvlab-kaist.github.io/TAIR/ |
2025-06-10 |
Preference-Driven Multi-Objective Combinatorial Optimization with Conditional Computation
Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector.However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space and thus leading to suboptimal performance.To overcome the limitation, we propose POCCO, a novel plug-and-play framework that enables adaptive selection of model structures for subproblems, which are subsequently optimized based on preference signals rather than explicit reward values.Specifically, we design a conditional computation block that routes subproblems to specialized neural architectures.Moreover, we propose a preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions.We evaluate the efficacy and versatility of POCCO by applying it to two state-of-the-art neural methods for MOCOPs.Experimental results across four classic MOCOP benchmarks demonstrate its significant superiority and strong generalization. 0.649 |
2025-06-10 |
MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis
Artificial intelligence (AI) has become a fundamental tool for assisting clinicians in analyzing ophthalmic images, such as optical coherence tomography (OCT).However, developing AI models often requires extensive annotation, and existing models tend to underperform on independent, unseen data.Foundation models (FMs), large AI models trained on vast unlabeled datasets, have shown promise in overcoming these challenges.Nonetheless, available FMs for ophthalmology lack extensive validation, especially for segmentation tasks, and focus on a single imaging modality.In this context, we propose MIRAGE, a novel multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO) images.Additionally, we propose a new evaluation benchmark with OCT/SLO classification and segmentation tasks. 0.632The comparison with general and specialized FMs and segmentation methods shows the superiority of MIRAGE in both types of tasks, highlighting its suitability as a basis for the development of robust AI systems for retinal OCT image analysis.Both MIRAGE and the evaluation benchmark are publicly available: https://github.com/j-morano/MIRAGE. 0.659 |
2025-06-10 |
Inherently Faithful Attention Maps for Vision Transformers
We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction.Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds.At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context.To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information.Both stages are trained jointly, allowing stage 2 to refine stage 1.Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. 0.61 |
2025-06-10 |
Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions
Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning -- akin to the success observed in language models -- via distillation and reinforcement learning.But what about the non-reasoning models already trained and deployed across the internet?Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces -- without any additional training or supervision?In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model's output stream.We show that framing reasoning as a search process -- where subquestions act as latent decisions within a broader inference trajectory -- helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces in non-reasoning models.We evaluate our method across three benchmarks and observe consistent improvements. 0.825Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts. |
2025-06-10 |
FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation
Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. 0.726The code is available at https://github.com/DeepLearnXMU/Faithful-RAG |
2025-06-10 |
Effective Data Pruning through Score Extrapolation
Training advanced machine learning models demands massive datasets, resulting in prohibitive computational costs.To address this challenge, data pruning techniques identify and remove redundant training samples while preserving model performance.Yet, existing pruning techniques predominantly require a full initial training pass to identify removable samples, negating any efficiency benefits for single training runs.To overcome this limitation, we introduce a novel importance score extrapolation framework that requires training on only a small subset of data.We present two initial approaches in this framework - k-nearest neighbors and graph neural networks - to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset.We demonstrate the effectiveness of our approach for 2 state-of-the-art pruning methods (Dynamic Uncertainty and TDDS), 4 different datasets (CIFAR-10, CIFAR-100, Places-365, and ImageNet), and 3 training paradigms (supervised, unsupervised, and adversarial).Our results indicate that score extrapolation is a promising direction to scale expensive score calculation methods, such as pruning, data attribution, or other tasks. 0.632 |
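As a rough illustration of the k-nearest-neighbors variant mentioned above, the sketch below predicts an importance score for each unscored sample by averaging the scores of its nearest scored neighbors in some feature space. The feature representation, the choice of Euclidean distance, plain averaging, and all names are assumptions. The abstract does not specify these details.

```python
import math
from typing import List, Sequence


def knn_extrapolate(scored_features: List[Sequence[float]],
                    scores: List[float],
                    query_features: List[Sequence[float]],
                    k: int = 3) -> List[float]:
    """Predict importance scores for queries from a scored subset.

    Each prediction is the mean score of the k nearest scored samples
    (Euclidean distance in feature space).
    """
    if len(scored_features) != len(scores) or not scores:
        raise ValueError("scored_features and scores must be non-empty and aligned")
    k = min(k, len(scores))
    predictions = []
    for query in query_features:
        by_distance = sorted(
            (math.dist(query, feat), score)
            for feat, score in zip(scored_features, scores)
        )
        nearest = by_distance[:k]
        predictions.append(sum(score for _, score in nearest) / k)
    return predictions


# Hypothetical subset with known scores and two unscored samples.
subset = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
subset_scores = [1.0, 3.0, 10.0, 20.0]
predicted = knn_extrapolate(subset, subset_scores, [[0.0, 0.5], [10.0, 10.4]], k=2)
```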
2025-06-10 |
Deep Reinforcement Learning-Based RAN Slicing with Efficient Inter-Slice Isolation in Tactical Wireless Networks
The next generation of tactical networks (TNs) is poised to further leverage the key enablers of 5G and beyond 5G (B5G) technology, such as radio access network (RAN) slicing and the open RAN (O-RAN) paradigm, to unlock multiple architectural options and opportunities for a wide range of innovative applications. RAN slicing and the O-RAN paradigm are considered game changers in TNs, where the former makes it possible to tailor user services to users' requirements, and the latter brings openness and intelligence to the management of the RAN. In TNs, bandwidth scarcity requires a dynamic bandwidth slicing strategy. Although this type of strategy ensures efficient bandwidth utilization, it compromises RAN slicing isolation in terms of quality of service (QoS) performance. To deal with this challenge, we propose a deep reinforcement learning (DRL)-based RAN slicing mechanism that achieves a trade-off between efficient RAN bandwidth sharing and appropriate inter- and intra-slice isolation. The proposed mechanism performs bandwidth allocation in two stages. In the first stage, the bandwidth is allocated to the RAN slices. In the second stage, each slice partitions its bandwidth among its associated users. In both stages, the slicing operation is constrained by several considerations related to improving the QoS of slices and users, which in turn foster inter- and intra-slice isolation. The proposed RAN slicing mechanism is based on DRL algorithms to perform the bandwidth sharing operation in each stage. We propose to deploy the mechanism in an O-RAN architecture and describe the O-RAN functional blocks and the main DRL model lifecycle management phases involved. We also develop three different implementations of the proposed mechanism, each based on a different DRL algorithm, and evaluate their performance against multiple baselines across various parameters. 0.667 |
2025-06-10 |
MagCache: Fast Video Generation with Magnitude-Aware Cache
Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically, steadily over most timesteps and rapidly over the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR, under comparable computational budgets. 0.675 |
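One way to picture a magnitude-aware skipping rule is sketched below. Given calibrated magnitude ratios between successive residual outputs, it reuses the cache while the compounded ratio stays close to 1 and recomputes once the estimated error or the number of consecutive skips exceeds a limit. The specific error model (absolute deviation of the compounded ratio from 1), the reset behavior, and all parameter names are assumptions. The abstract only states that MagCache uses an error modeling mechanism and an adaptive caching strategy.

```python
from typing import List


def plan_skips(ratios: List[float], threshold: float, max_consecutive: int) -> List[bool]:
    """Return one flag per timestep: True means reuse the cached residual.

    `ratios[i]` is the (calibrated) magnitude ratio of the residual at step i
    to the residual at step i - 1. The first step is always computed.
    """
    plan = []
    compounded = 1.0   # product of ratios since the last full computation
    consecutive = 0    # number of skips since the last full computation
    for i, ratio in enumerate(ratios):
        if i == 0:
            plan.append(False)
            continue
        candidate = compounded * ratio
        estimated_error = abs(1.0 - candidate)
        if estimated_error < threshold and consecutive < max_consecutive:
            plan.append(True)
            compounded = candidate
            consecutive += 1
        else:
            plan.append(False)
            compounded = 1.0
            consecutive = 0
    return plan


# Hypothetical ratios: slow decay for most steps, then a sharp drop at the end.
example_plan = plan_skips([1.0, 0.99, 0.99, 0.99, 0.8], threshold=0.025, max_consecutive=2)
```

Under this rule the sharp drop in the final steps forces recomputation, matching the intuition that late timesteps matter more.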
2025-06-10 |
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing?We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. 0.619Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution.Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons.Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations.Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities.This highlights the need for this benchmark to foster future AI advancements. |
LLMs |
|
2025-06-16 |
Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
We develop a framework to quantify the time-to-unsafe-sampling - the number of large language model (LLM) generations required to trigger an unsafe (e.g., toxic) response. 0.61Estimating this quantity is challenging, since unsafe responses are exceedingly rare in well-aligned LLMs, potentially occurring only once in thousands of generations. 0.704As a result, directly estimating time-to-unsafe-sampling would require collecting training data with a prohibitively large number of generations per prompt.However, with realistic sampling budgets, we often cannot generate enough responses to observe an unsafe outcome for every prompt, leaving the time-to-unsafe-sampling unobserved in many cases, making the estimation and evaluation tasks particularly challenging.To address this, we frame this estimation problem as one of survival analysis and develop a provably calibrated lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt, leveraging recent advances in conformal prediction.Our key innovation is designing an adaptive, per-prompt sampling strategy, formulated as a convex optimization problem.The objective function guiding this optimized sampling allocation is designed to reduce the variance of the estimators used to construct the LPB, leading to improved statistical efficiency over naive methods that use a fixed sampling budget per prompt.Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models. |
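For intuition about what a lower predictive bound on time-to-unsafe-sampling means, consider the idealized case where generations are independent and the per-generation unsafe probability p is known. The time T to the first unsafe response is then geometric, with P(T >= L) = (1 - p)^(L - 1), and the largest L satisfying P(T >= L) >= 1 - alpha has a closed form. This sketch assumes p is known and does not reproduce the paper's conformal, survival-analysis construction, which handles unknown p and censored observations.

```python
import math


def geometric_lower_bound(p_unsafe: float, alpha: float) -> int:
    """Largest L with P(T >= L) >= 1 - alpha for geometric T.

    T is the index of the first unsafe generation, so
    P(T >= L) = (1 - p_unsafe) ** (L - 1).
    """
    if not 0.0 < p_unsafe < 1.0:
        raise ValueError("p_unsafe must be in (0, 1)")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    return 1 + math.floor(math.log(1.0 - alpha) / math.log(1.0 - p_unsafe))


# Hypothetical prompt that yields an unsafe response once per 1000 generations.
bound = geometric_lower_bound(p_unsafe=0.001, alpha=0.1)
```

With p this small, verifying the bound empirically would take on the order of hundreds of generations per prompt, which illustrates why the abstract treats sampling budgets as the bottleneck.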
2025-06-16 |
FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding
Semantic querying in complex 3D scenes through free-form language presents a significant challenge. Existing 3D scene understanding methods use large-scale training data and CLIP to align text queries with 3D semantic features. However, their reliance on predefined vocabulary priors from training data hinders free-form semantic querying. Besides, recent advanced methods rely on LLMs for scene understanding but lack comprehensive 3D scene-level information and often overlook the potential inconsistencies in LLM-generated outputs. 0.655In our paper, we propose FreeQ-Graph, which enables Free-form Querying with a semantic consistent scene Graph for 3D scene understanding. The core idea is to encode free-form queries from a complete and accurate 3D scene graph without predefined vocabularies, and to align them with 3D consistent semantic labels, which is accomplished through three key steps. We begin by constructing a complete and accurate 3D scene graph that maps free-form objects and their relations through LLM and LVLM guidance, entirely free from training data or predefined priors. Most importantly, we align graph nodes with accurate semantic labels by leveraging 3D semantic aligned features from merged superpoints, enhancing 3D semantic consistency. To enable free-form semantic querying, we then design an LLM-based reasoning algorithm that combines scene-level and object-level information to perform intricate reasoning. 0.689We conducted extensive experiments on 3D semantic grounding, segmentation, and complex querying tasks, while also validating the accuracy of graph generation. Experiments on 6 datasets show that our model excels in both complex free-form semantic queries and intricate relational reasoning. |
2025-06-16 |
DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models
Model editing aims to efficiently update a pre-trained model's knowledge without the need for time-consuming full retraining.While existing pioneering editing methods achieve promising results, they primarily focus on editing single-modal language models (LLMs). 0.657However, for vision-language models (VLMs), which involve multiple modalities, the role and impact of each modality on editing performance remain largely unexplored.To address this gap, we explore the impact of textual and visual modalities on model editing and find that: (1) textual and visual representations reach peak sensitivity at different layers, reflecting their varying importance; and (2) editing both modalities can efficiently update knowledge, but this comes at the cost of compromising the model's original capabilities.Based on our findings, we propose DualEdit, an editor that modifies both textual and visual modalities at their respective key layers.Additionally, we introduce a gating module within the more sensitive textual modality, allowing DualEdit to efficiently update new knowledge while preserving the model's original information.We evaluate DualEdit across multiple VLM backbones and benchmark datasets, demonstrating its superiority over state-of-the-art VLM editing baselines as well as adapted LLM editing methods on different evaluation metrics. |
2025-06-16 |
An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability
As large language models (LLMs) continue to advance, reliable evaluation methods are essential, particularly for open-ended, instruction-following tasks. 0.683 LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. 0.747 In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present. |
2025-06-16 |
EvolvTrip: Enhancing Literary Character Understanding with Temporal Theory-of-Mind Graphs
A compelling portrayal of characters is essential to the success of narrative writing.For readers, appreciating a character's traits requires the ability to infer their evolving beliefs, desires, and intentions over the course of a complex storyline, a cognitive skill known as Theory-of-Mind (ToM).Performing ToM reasoning in prolonged narratives requires readers to integrate historical context with current narrative information, a task at which humans excel but Large Language Models (LLMs) often struggle. 0.666To systematically evaluate LLMs' ToM reasoning capability in long narratives, we construct LitCharToM, a benchmark of character-centric questions across four ToM dimensions from classic literature. 0.65Further, we introduce EvolvTrip, a perspective-aware temporal knowledge graph that tracks psychological development throughout narratives.Our experiments demonstrate that EvolvTrip consistently enhances performance of LLMs across varying scales, even in challenging extended-context scenarios. 0.72EvolvTrip proves to be particularly valuable for smaller models, partially bridging the performance gap with larger LLMs and showing great compatibility with lengthy narratives. 0.614Our findings highlight the importance of explicit representation of temporal character mental states in narrative comprehension and offer a foundation for more sophisticated character understanding.Our data and code are publicly available at https://github.com/Bernard-Yang/EvolvTrip. |
2025-06-16 |
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the vision and speech to the text based on their relationships. 0.681 For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience. |
2025-06-16 |
We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems
The development of large language models (LLMs) has entered an experience-driven era, marked by the emergence of environment feedback-driven learning via reinforcement learning and tool-using agents. 0.668 This encourages the emergence of the model context protocol (MCP), which defines the standard for how an LLM should interact with external services, such as APIs and data. 0.7 However, as MCP becomes the de facto standard for LLM agent systems, it also introduces new safety risks. 0.681 In particular, MCP introduces third-party services, which are not controlled by the LLM developers, into the agent systems. 0.709 These third-party MCP service providers are potentially malicious and have economic incentives to exploit vulnerabilities and sabotage user-agent interactions. In this position paper, we urge the LLM safety research community to pay close attention to the new safety risks introduced by MCP, and to develop new techniques to build safe MCP-powered agent systems. 0.702 To establish our position, we argue in three key parts. (1) We first construct \framework, a controlled framework to examine safety issues in MCP-powered agent systems. (2) We then conduct a series of pilot experiments to demonstrate that the safety risks in MCP-powered agent systems are a real threat and that defending against them is not trivial. (3) Finally, we give our outlook by showing a roadmap to build safe MCP-powered agent systems. In particular, we call for researchers to pursue the following research directions: red teaming, MCP safe LLM development, MCP safety evaluation, MCP safety data accumulation, MCP service safeguard, and MCP safe ecosystem construction. We hope this position paper can raise the awareness of the research community in MCP safety and encourage more researchers to join this important research direction. Our code is available at https://github.com/littlelittlenine/SafeMCP.git. |
2025-06-16 |
Prefix-Tuning+: Modernizing Prefix-Tuning through Attention Independent Prefix Data
Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks.Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead.However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited. 0.749In this work, we demonstrate empirically that Prefix-Tuning underperforms on LLMs because of an inherent tradeoff between input and prefix significance within the attention head. 0.691This motivates us to introduce Prefix-Tuning+, a novel architecture that generalizes the principles of Prefix-Tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself.We further provide an overview of our construction process to guide future users when constructing their own context-based methods.Our experiments show that, across a diverse set of benchmarks, Prefix-Tuning+ consistently outperforms existing Prefix-Tuning methods.Notably, it achieves performance on par with the widely adopted LoRA method on several general benchmarks, highlighting the potential modern extension of Prefix-Tuning approaches.Our findings suggest that by overcoming its inherent limitations, Prefix-Tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation. 0.699 |
2025-06-16 |
Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models
Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications.Recently, Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling. 0.624The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation.This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence.First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines.Second, comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters.Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. 0.643Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading.We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity. |
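For readers unfamiliar with the sampler under discussion, here is a minimal sketch of the min-p truncation rule as it is commonly described: tokens whose probability falls below a base threshold scaled by the top token's probability are discarded, and the remainder is renormalized. The function name and the plain-list interface are illustrative assumptions, not the original implementation.

```python
def min_p_filter(probs, p_base):
    """Apply min-p truncation to a next-token probability distribution.

    probs:  list of token probabilities (non-negative, summing to 1).
    p_base: base threshold in [0, 1]; the effective cutoff is
            p_base * max(probs), so it adapts to model confidence.
    Returns a renormalized distribution over the surviving tokens.
    """
    threshold = p_base * max(probs)
    # Zero out tokens below the scaled threshold; the top token always survives.
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Example: with p_base = 0.2 and a top probability of 0.5, the cutoff is 0.1.
filtered = min_p_filter([0.5, 0.3, 0.15, 0.05], p_base=0.2)
```

Note that `p_base` is one hyperparameter among several (temperature, for instance, is another), which is relevant to the paper's point about controlling for the number of hyperparameters when comparing samplers.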
2025-06-16 |
An LLM's Apology: Outsourcing Awkwardness in the Age of AI
A key part of modern social dynamics is flaking at short notice.However, anxiety in coming up with believable and socially acceptable reasons to do so can instead lead to 'ghosting', awkwardness, or implausible excuses, risking emotional harm and resentment in the other party.The ability to delegate this task to a Large Language Model (LLM) could substantially reduce friction and enhance the flexibility of user's social life while greatly minimising the aforementioned creative burden and moral qualms. 0.72We introduce FLAKE-Bench, an evaluation of models' capacity to effectively, kindly, and humanely extract themselves from a diverse set of social, professional and romantic scenarios.We report the efficacy of 10 frontier or recently-frontier LLMs in bailing on prior commitments, because nothing says "I value our friendship" like having AI generate your cancellation texts. 0.671We open-source FLAKE-Bench at github.com/Cloakless/flake-bench to support future research. |
2025-06-16 |
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients' medical conditions.However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation.If the model can provide appropriate comfort and empathy based on the patient's negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process.To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process.We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient's emotions while addressing their concerns.The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients' questions.Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model's ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers. 0.664 |
2025-06-16 |
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs
Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments. 0.637 Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference. In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs. 0.656 While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss. Our method is especially effective in extracting task-relevant subgraphs -- so-called "circuits" -- which can represent core functions (e.g., indirect object identification). Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). All in all, we gather these techniques as a uniform holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety. Our code is publicly available at https://github.com/erfanhatefi/SparC3. |
2025-06-16 |
BanditWare: A Contextual Bandit-based Framework for Hardware Prediction
Distributed computing systems are essential for meeting the demands of modern applications, yet transitioning from single-system to distributed environments presents significant challenges. 0.608 Misallocating resources in shared systems can lead to resource contention, system instability, degraded performance, priority inversion, inefficient utilization, increased latency, and environmental impact. We present BanditWare, an online recommendation system that dynamically selects the most suitable hardware for applications using a contextual multi-armed bandit algorithm. BanditWare balances exploration and exploitation, gradually refining its hardware recommendations based on observed application performance while continuing to explore potentially better options. Unlike traditional statistical and machine learning approaches that rely heavily on large historical datasets, BanditWare operates online, learning and adapting in real-time as new workloads arrive. We evaluated BanditWare on three workflow applications: Cycles (an agricultural science scientific workflow), BurnPro3D (a web-based platform for fire science), and a matrix multiplication application. Designed for seamless integration with the National Data Platform (NDP), BanditWare enables users of all experience levels to optimize resource allocation efficiently. |
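To make the exploration/exploitation idea concrete, the sketch below shows a generic epsilon-greedy contextual bandit that picks a hardware option per workload type and learns from observed runtimes. This is an assumption-laden illustration only: the abstract does not specify BanditWare's algorithm, and the class name, the discrete `context` key, and the runtime-based reward are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyHardwareSelector:
    """Generic epsilon-greedy contextual bandit (not BanditWare's actual method)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)            # candidate hardware configurations
        self.epsilon = epsilon            # probability of exploring at random
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> number of runs
        self.mean_runtime = defaultdict(float)  # (context, arm) -> mean runtime

    def select(self, context):
        # Try every arm once per context before trusting the estimates.
        untried = [a for a in self.arms if self.counts[(context, a)] == 0]
        if untried:
            return untried[0]
        # Explore: occasionally pick a random arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        # Exploit: pick the arm with the lowest mean observed runtime.
        return min(self.arms, key=lambda a: self.mean_runtime[(context, a)])

    def update(self, context, arm, runtime):
        # Incremental running-mean update, so no history needs to be stored.
        key = (context, arm)
        self.counts[key] += 1
        self.mean_runtime[key] += (runtime - self.mean_runtime[key]) / self.counts[key]
```

A typical loop would call `select` when a workload arrives, run it on the chosen hardware, and feed the measured runtime back through `update`. Because estimates are updated incrementally, the selector learns online without a large historical dataset.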
2025-06-16 |
Instruction Following by Boosting Attention of Large Language Models
Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. 0.725While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. 0.728However, subsequent studies revealed latent steering's effectiveness to be limited, often underperforming simple instruction prompting.To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques.Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model's attention during generation.InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions.Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering. |
2025-06-16 |
Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability
Phishing attacks remain one of the most prevalent and persistent cybersecurity threats, with attackers continuously evolving and intensifying tactics to evade the general detection system. Despite significant advances in artificial intelligence and machine learning, faithfully reproducing the interpretable reasoning with classification and explainability that underpin phishing judgments remains challenging. Due to recent advancements in Natural Language Processing, Large Language Models (LLMs) show a promising direction and potential for improving domain-specific phishing classification tasks. 0.641 However, enhancing the reliability and robustness of classification models requires not only accurate predictions from LLMs but also consistent and trustworthy explanations aligning with those predictions. 0.615 Therefore, a key question remains: can LLMs not only classify phishing emails accurately but also generate explanations that are reliably aligned with their predictions and internally self-consistent? 0.692 To answer this question, we have fine-tuned transformer-based models, including BERT, Llama models, and Wizard, to improve domain relevance and make them more tailored to phishing-specific distinctions, using Binary Sequence Classification, Contrastive Learning (CL) and Direct Preference Optimization (DPO). To that end, we examined their performance in phishing classification and explainability by applying the ConsistenCy measure based on SHAPley values (CC SHAP), which measures prediction explanation token alignment to test the model's internal faithfulness and consistency and uncover the rationale behind its predictions and reasoning. Overall, our findings show that Llama models exhibit stronger prediction explanation token alignment with higher CC SHAP scores despite lacking reliable decision making accuracy, whereas Wizard achieves better prediction accuracy but lower CC SHAP scores. 0.626 |
2025-06-16 |
Steering LLM Thinking with Budget Guidance
Recent deep-thinking large language models often reason extensively to improve performance, but such lengthy reasoning is not always desirable, as it incurs excessive inference costs with disproportionate performance gains.Controlling reasoning length without sacrificing performance is therefore important, but remains challenging, especially under tight thinking budgets.We propose budget guidance, a simple yet effective method for steering the reasoning process of LLMs toward a target budget without requiring any LLM fine-tuning. 0.731Our approach introduces a lightweight predictor that models a Gamma distribution over the remaining thinking length during next-token generation.This signal is then used to guide generation in a soft, token-level manner, ensuring that the overall reasoning trace adheres to the specified thinking budget.Budget guidance enables natural control of the thinking length, along with significant token efficiency improvements over baseline methods on challenging math benchmarks.For instance, it achieves up to a 26% accuracy gain on the MATH-500 benchmark under tight budgets compared to baseline methods, while maintaining competitive accuracy with only 63% of the thinking tokens used by the full-thinking model.Budget guidance also generalizes to broader task domains and exhibits emergent capabilities, such as estimating question difficulty.The source code is available at: https://github.com/UMass-Embodied-AGI/BudgetGuidance. |
2025-06-16 |
Discrete Diffusion in Large Language and Multimodal Models: A Survey
In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output controllability, and dynamic, response-aware perception. These capabilities were previously difficult to achieve with AR models. Recently, a growing number of industrial-scale proprietary d(M)LLMs, as well as a large number of open-source academic d(M)LLMs, have demonstrated performance comparable to their autoregressive counterparts, while achieving up to 10x acceleration in inference speed. 0.683 The advancement of discrete diffusion LLMs and MLLMs has been largely driven by progress in two domains. 0.641 The first is the development of autoregressive LLMs and MLLMs, which has accumulated vast amounts of data, benchmarks, and foundational infrastructure for training and inference. The second contributing domain is the evolution of the mathematical models underlying discrete diffusion. Together, these advancements have catalyzed a surge in dLLM and dMLLM research in early 2025. In this work, we present a comprehensive overview of the research in the dLLM and dMLLM domains. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models. We further analyze key techniques for training and inference, and summarize emerging applications across language, vision-language, and biological domains. We conclude by discussing future directions for research and deployment. Paper collection: https://github.com/LiQiiiii/DLLM-Survey |
2025-06-12 |
Build the web for agents, not agents for the web
Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents -- AI systems capable of autonomously navigating and completing tasks within web environments. 0.621While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. 0.704Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions.This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities.To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website.We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders.This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community. |
2025-06-12 |
ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. 0.718However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope.We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data.Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. 0.679In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs.Code and data are available at https://github.com/zjunlp/ChineseHarm-bench. |
2025-06-12 |
MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models.Knowledge images have been central to human civilization and to the mechanisms of human learning--a fact underscored by dual-coding theory and the picture-superiority effect.Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals.To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. 0.602To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation.Each KG explicitly delineates a target image's core entities and their dependencies.We further introduce MMMG-Score to evaluate generated knowledge images.This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment.Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits--low entity fidelity, weak relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty.To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs. |
2025-06-12 |
GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation
Robotic manipulation in real-world settings remains challenging, especially regarding robust generalization.Existing simulation platforms lack sufficient support for exploring how policies adapt to varied instructions and scenarios.Thus, they lag behind the growing interest in instruction-following foundation models like LLMs, whose adaptability is crucial yet remains underexplored in fair comparisons. 0.707To bridge this gap, we introduce GenManip, a realistic tabletop simulation platform tailored for policy generalization studies.It features an automatic pipeline via LLM-driven task-oriented scene graph to synthesize large-scale, diverse tasks using 10K annotated 3D object assets. 0.603To systematically assess generalization, we present GenManip-Bench, a benchmark of 200 scenarios refined via human-in-the-loop corrections.We evaluate two policy types: (1) modular manipulation systems integrating foundation models for perception, reasoning, and planning, and (2) end-to-end policies trained through scalable data collection.Results show that while data scaling benefits end-to-end methods, modular systems enhanced with foundation models generalize more effectively across diverse scenarios.We anticipate this platform to facilitate critical insights for advancing policy generalization in realistic conditions.Project Page: https://genmanip.axi404.top/. |
2025-06-12 |
Farseer: A Refined Scaling Law in Large Language Models
Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. 0.658 To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, improving upon Chinchilla's law by reducing extrapolation error by 433%. This allows for the reliable evaluation of competing training strategies across all $(N,D)$ settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. 0.633 To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. 0.674 We are comprehensively open-sourcing all models, data, results, and logs at https://github.com/Farseer-Scaling-Law/Farseer to foster further research. |
2025-06-12 |
AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. 0.632LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. 0.602Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks.In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. 0.622Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines.Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science. |
Developer Research |
|
2025-06-16 |
DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models
Multimodal large language models (MLLMs) have streamlined front-end interface development by automating code generation. However, these models also introduce challenges in ensuring code quality. Existing approaches struggle to maintain both visual consistency and functional completeness in the generated components. Moreover, they lack mechanisms to assess the fidelity and correctness of the rendered pages. To address these issues, we propose DesignCoder, a novel hierarchy-aware and self-correcting automated code generation framework. Specifically, we introduce UI Grouping Chains, which enhance MLLMs' capability to understand and predict complex nested UI hierarchies. Subsequently, DesignCoder employs a hierarchical divide-and-conquer approach to generate front-end code. Finally, we incorporate a self-correction mechanism to improve the model's ability to identify and rectify errors in the generated code. Extensive evaluations on a dataset of UI mockups collected from both open-source communities and industry projects demonstrate that DesignCoder outperforms state-of-the-art baselines in React Native, a widely adopted UI framework. Our method achieves performance increases of 37.63%, 9.52%, and 12.82% in the visual similarity metrics MSE, CLIP, and SSIM, respectively, and significantly improves code structure similarity in terms of TreeBLEU, Container Match, and Tree Edit Distance by 30.19%, 29.31%, and 24.67%, respectively. Furthermore, we conducted a user study with professional developers to assess the quality and practicality of the generated code. [0.677] Results indicate that DesignCoder aligns with industry best practices, demonstrating high usability, readability, and maintainability. Our approach provides an efficient and practical solution for agile front-end development, enabling development teams to focus more on core functionality and product innovation. |
2025-06-10 |
On The Impact of Merge Request Deviations on Code Review Practices
Code review is a key practice in software engineering, ensuring quality and collaboration. [0.708] However, industrial Merge Request (MR) workflows often deviate from standardized review processes, with many MRs serving non-review purposes (e.g., drafts, rebases, or dependency updates). We term these cases deviations and hypothesize that ignoring them biases analytics and undermines ML models for review analysis. We identify seven deviation categories, occurring in 37.02% of MRs, and propose a few-shot learning detection method (91% accuracy). By excluding deviations, ML models predicting review completion time improve performance in 53.33% of cases (up to 2.25x) and exhibit significant shifts in feature importance (47% overall, 60% top-*k*). Our contributions include: (1) a taxonomy of MR deviations, (2) an AI-driven detection approach, and (3) empirical evidence of their impact on ML-based review analytics. This work aids practitioners in optimizing review efforts and ensuring reliable insights. |
2025-06-09 |
Execution-Aware Program Reduction for WebAssembly via Record and Replay
WebAssembly (Wasm) programs may trigger bugs in their engine implementations. To aid debugging, program reduction techniques try to produce a smaller variant of the input program that still triggers the bug. [0.605] However, existing execution-unaware program reduction techniques struggle with large and complex Wasm programs, because they rely on static information and apply syntactic transformations, while ignoring the valuable information offered by the input program's execution behavior. We present RR-Reduce and Hybrid-Reduce, novel execution-aware program reduction techniques that leverage execution behaviors via record and replay. RR-Reduce identifies a bug-triggering function as the target function, isolates that function from the rest of the program, and generates a reduced program that replays only the interactions between the target function and the rest of the program. Hybrid-Reduce combines a complementary execution-unaware reduction technique with RR-Reduce to further reduce program size. We evaluate RR-Reduce and Hybrid-Reduce on 28 Wasm programs that trigger a diverse set of bugs in three engines. On average, RR-Reduce reduces the programs to 1.20 percent of their original size in 14.5 minutes, which outperforms the state of the art by 33.15 times in terms of reduction time. Hybrid-Reduce reduces the programs to 0.13 percent of their original size in 3.5 hours, which outperforms the state of the art by 3.42 times in terms of reduced program size and 2.26 times in terms of reduction time. We envision RR-Reduce as the go-to tool for rapid, on-demand debugging in minutes, and Hybrid-Reduce for scenarios where developers require the smallest possible programs. |
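To make "program reduction" concrete, here is a minimal sketch of an execution-unaware, delta-debugging-style reducer of the kind the abstract contrasts with. It is a simplified variant that only tries removing chunks, and it is not RR-Reduce or Hybrid-Reduce. The function name `reduce_items` and the `still_fails` oracle are hypothetical; in practice the oracle would rebuild the program from the remaining pieces and check whether the engine bug still triggers.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reduce_items(items: Sequence[T], still_fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `items` while `still_fails` keeps returning True.

    Repeatedly removes one contiguous chunk at a time, starting with large
    chunks and falling back to smaller ones when no removal preserves the failure.
    """
    current = list(items)
    chunks = 2  # number of chunks the current list is split into
    while len(current) >= 2:
        size = max(1, len(current) // chunks)
        reduced = False
        for start in range(0, len(current), size):
            # Candidate = current list with one chunk removed.
            candidate = current[:start] + current[start + size:]
            if candidate and still_fails(candidate):
                current = candidate
                chunks = max(chunks - 1, 2)  # coarsen again after progress
                reduced = True
                break
        if not reduced:
            if size == 1:
                break  # no single element can be removed; stop
            chunks = min(chunks * 2, len(current))  # try finer chunks
    return current
```

Because the oracle is called once per candidate, the cost is dominated by how long each bug check takes, which is one reason purely syntactic reducers can be slow on large programs.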
2025-06-09 |
ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols
Recent advances in Large Language Models (LLMs) have shown promising capabilities in generating code for general-purpose programming languages. In contrast, their applicability for hardware description languages, particularly for generating synthesizable and functionally correct designs, remains significantly underexplored. HDLs such as SystemVerilog are logic-oriented and demand strict adherence to timing semantics, concurrency, and synthesizability constraints. Moreover, HDL-based design flows encompass a broad set of tasks beyond structural code generation, including testbench development, assertion-based verification, timing closure, and protocol-level integration for on-chip communication. The objective of our paper is to analyze the capabilities of state-of-the-art LLMs in generating SystemVerilog implementations of standard communication protocols, a core component of embedded and System-on-Chip (SoC) architectures. This paper introduces the first benchmark suite targeting four widely used protocols: SPI, I2C, UART, and AXI. We define code generation tasks that capture varying levels of design abstraction and prompt specificity. [0.614] The generated designs are assessed for syntactic correctness, synthesizability, and functional fidelity via waveform simulation and test benches. |
Data Annotation Techniques |