Vincent's Arxiv FrontPage
Generated on 2025-07-05. This frontpage is made by scraping arXiv and running a sentence model that detects whether an abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions. |
|
New Datasets |
|
2025-06-18 |
Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper
Automatic lyrics transcription (ALT) remains a challenging task in the field of music information retrieval, despite great advances in automatic speech recognition (ASR) brought about by transformer-based architectures in recent years. One of the major challenges in ALT is the high amplitude of interfering audio signals relative to conventional ASR due to musical accompaniment. Recent advances in music source separation have enabled automatic extraction of high-quality separated vocals, which could potentially improve ALT performance. However, the effect of source separation has not been systematically investigated in order to establish best practices for its use. This work examines the impact of source separation on ALT using Whisper, a state-of-the-art open source ASR model. We evaluate Whisper's performance on original audio, separated vocals, and vocal stems across short-form and long-form transcription tasks. For short-form, we suggest a concatenation method that results in a consistent reduction in Word Error Rate (WER). For long-form, we propose an algorithm using source separation as a vocal activity detector to derive segment boundaries, which results in a consistent reduction in WER relative to Whisper's native long-form algorithm. Our approach achieves state-of-the-art results for an open source system on the Jam-ALT long-form ALT benchmark, without any training or fine-tuning. We also publish MUSDB-ALT, the first dataset of long-form lyric transcripts following the Jam-ALT guidelines for which vocal stems are publicly available. 0.746 |
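The abstract above reports its results as Word Error Rate (WER). As a rough illustration of the textbook definition (word-level edit distance divided by the number of reference words), here is a minimal sketch. It is not the paper's evaluation code: the function name is hypothetical, and it assumes plain whitespace tokenization with no text normalization, which real lyric-transcription benchmarks typically add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.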
2025-06-18 |
DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. 0.776 Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG 0.73 |
2025-06-18 |
Sekai: A Video Dataset towards World Exploration
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. 0.753 It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. |
2025-06-18 |
Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, code, and websites are publicly available at our project page https://embodied-web-agent.github.io/. 0.773 |
2025-06-17 |
VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning
Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. 0.958 The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, our fine-tuned BLIP model achieves a final loss of 0.0028, with a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87. This dataset and model framework emphasize the theme "Prevention is Better than Cure", showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito |
2025-06-17 |
3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting
3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore, we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. 0.788 Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS-specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K. 0.826 |
2025-06-17 |
DDS-NAS: Dynamic Data Selection within Neural Architecture Search via On-line Hard Example Mining applied to Image Classification
In order to address the scalability challenge within Neural Architecture Search (NAS), we speed up NAS training via dynamic hard example mining within a curriculum learning framework. By utilizing an autoencoder that enforces an image similarity embedding in latent space, we construct an efficient kd-tree structure to order images by furthest neighbour dissimilarity in a low-dimensional embedding. From a given query image from our subsample dataset, we can identify the most dissimilar image within the global dataset in logarithmic time. 0.748 Via curriculum learning, we then dynamically re-formulate an unbiased subsample dataset for NAS optimisation, upon which the current NAS solution architecture performs poorly. We show that our DDS-NAS framework speeds up gradient-based NAS strategies by up to 27x without loss in performance. By maximising the contribution of each image sample during training, we reduce the duration of a NAS training cycle and the number of iterations required for convergence. |
2025-06-17 |
DiFuse-Net: RGB and Dual-Pixel Depth Estimation using Window Bi-directional Parallax Attention and Cross-modal Transfer Learning
Depth estimation is crucial for intelligent systems, enabling applications from autonomous navigation to augmented reality. While traditional stereo and active depth sensors have limitations in cost, power, and robustness, dual-pixel (DP) technology, ubiquitous in modern cameras, offers a compelling alternative. This paper introduces DiFuse-Net, a novel modality decoupled network design for disentangled RGB and DP based depth estimation. DiFuse-Net features a window bi-directional parallax attention mechanism (WBiPAM) specifically designed to capture the subtle DP disparity cues unique to smartphone cameras with small apertures. A separate encoder extracts contextual information from the RGB image, and these features are fused to enhance depth prediction. We also propose a Cross-modal Transfer Learning (CmTL) mechanism to utilize large-scale RGB-D datasets in the literature to cope with the limitations of obtaining a large-scale RGB-DP-D dataset. Our evaluation and comparison of the proposed method demonstrate its superiority over the DP and stereo-based baseline methods. Additionally, we contribute a new, high-quality, real-world RGB-DP-D training dataset, named Dual-Camera Dual-Pixel (DCDP) dataset, created using our novel symmetric stereo camera hardware setup, stereo calibration and rectification protocol, and AI stereo disparity estimation method. 0.703 |
2025-06-17 |
Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed datasets. We will release the model, dataset, and code. 0.856 |
2025-06-16 |
A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy
Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered through analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only verified medical sources were included in the dataset, while those from non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases, and the remaining columns represent symptoms. 0.819 Each symptom cell contains a binary value (1 or 0), indicating whether a symptom is associated with a disease (1 for presence, 0 for absence). This structured representation makes the dataset very useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there are some advancements in the field of disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. 0.714 This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. 0.749 Further developments should include region-specific diseases and further fine-tuning of symptom associations for better diagnostic performance. |
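The tabular layout described in the abstract above (diseases in the first column, one binary column per symptom) can be sketched as follows. This is only an illustration of the stated structure: the disease and symptom names are placeholders, the values are made up, and the helper functions are hypothetical rather than anything shipped with the dataset.

```python
# Placeholder schema: first field is the disease, then one 0/1 flag per symptom.
SYMPTOMS = ["symptom_1", "symptom_2", "symptom_3"]
ROWS = [
    ("disease_a", [1, 1, 0]),
    ("disease_b", [1, 0, 1]),
]


def symptoms_of(disease: str) -> list[str]:
    """Return the symptoms flagged with 1 for the given disease."""
    for name, flags in ROWS:
        if name == disease:
            return [s for s, flag in zip(SYMPTOMS, flags) if flag == 1]
    raise KeyError(disease)


def diseases_with(symptom: str) -> list[str]:
    """Return every disease whose row has a 1 in the symptom's column."""
    column = SYMPTOMS.index(symptom)
    return [name for name, flags in ROWS if flags[column] == 1]
```

Each row doubles as a fixed-length binary feature vector, which is what makes this layout convenient for the machine-learning uses the abstract mentions.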
2025-06-16 |
PeakWeather: MeteoSwiss Weather Station Measurements for Spatiotemporal Deep Learning
Accurate weather forecasts are essential for supporting a wide range of activities and decision-making processes, as well as mitigating the impacts of adverse weather events. While traditional numerical weather prediction (NWP) remains the cornerstone of operational forecasting, machine learning is emerging as a powerful alternative for fast, flexible, and scalable predictions. We introduce PeakWeather, a high-quality dataset of surface weather observations collected every 10 minutes over more than 8 years from the ground stations of the Federal Office of Meteorology and Climatology MeteoSwiss's measurement network. 0.933 The dataset includes a diverse set of meteorological variables from 302 station locations distributed across Switzerland's complex topography and is complemented with topographical indices derived from digital height models for context. 0.9 Ensemble forecasts from the currently operational high-resolution NWP model are provided as a baseline forecast against which to evaluate new approaches. The dataset's richness supports a broad spectrum of spatiotemporal tasks, including time series forecasting at various scales, graph structure learning, imputation, and virtual sensing. As such, PeakWeather serves as a real-world benchmark to advance foundational machine learning research, meteorology, and sensor-based applications. |
2025-06-16 |
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. 0.848 Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week. |
2025-06-16 |
Lecture Video Visual Objects (LVVO) Dataset: A Benchmark for Visual Object Detection in Educational Videos
We introduce the Lecture Video Visual Objects (LVVO) dataset, a new benchmark for visual object detection in educational video content. The dataset consists of 4,000 frames extracted from 245 lecture videos spanning biology, computer science, and geosciences. 0.946 A subset of 1,000 frames, referred to as LVVO_1k, has been manually annotated with bounding boxes for four visual categories: Table, Chart-Graph, Photographic-image, and Visual-illustration. Each frame was labeled independently by two annotators, resulting in an inter-annotator F1 score of 83.41%, indicating strong agreement. To ensure high-quality consensus annotations, a third expert reviewed and resolved all cases of disagreement through a conflict resolution process. To expand the dataset, a semi-supervised approach was employed to automatically annotate the remaining 3,000 frames, forming LVVO_3k. 0.728 The complete dataset offers a valuable resource for developing and evaluating both supervised and semi-supervised methods for visual content detection in educational videos. The LVVO dataset is publicly available to support further research in this domain. 0.736 |
2025-06-16 |
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions
The quality of the video dataset (image quality, resolution, and fine-grained caption) greatly influences the performance of the video generation model. The growing demand for video applications sets higher requirements for high-quality video generation models. Examples include the generation of movie-level Ultra-High Definition (UHD) videos and the creation of 4K short video content. However, the existing public datasets cannot support related research and applications. In this paper, we first propose a high-quality open-sourced UHD-4K (22.4% of which are 8K) text-to-video dataset named UltraVideo, which contains a wide range of topics (more than 100 kinds), and each video has 9 structured captions with one summarized caption (average of 824 words). 0.798 Specifically, we carefully design a highly automated curation process with four stages to obtain the final high-quality dataset: i) collection of diverse and high-quality video clips; ii) statistical data filtering; iii) model-based data purification; iv) generation of comprehensive, structured captions. In addition, we expand Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation. We believe that this work can make a significant contribution to future research on UHD video generation. UltraVideo dataset and UltraWan models are available at https://xzc-zju.github.io/projects/UltraVideo. |
2025-06-16 |
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients' medical conditions. However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. If the model can provide appropriate comfort and empathy based on the patient's negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process. To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient's emotions while addressing their concerns. 0.772 The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients' questions. Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model's ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers. |
2025-06-12 |
RationalVLA: A Rational Vision-Language-Action Model with Dual System
A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. 0.776 We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms that integrates the high-level vision-language model with the low-level manipulation policy by introducing learnable latent space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications. Our project page is https://irpn-eai.github.io/rationalvla. |
2025-06-12 |
Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. 0.701 We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing. |
2025-06-11 |
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. 0.718 EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensity. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration. |
2025-06-11 |
MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion
Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an individual's genuine emotional state. Their analysis has attracted considerable interest due to its promising applications in fields such as healthcare, criminal investigation, and human-computer interaction. However, existing ME research is limited to a single visual modality, overlooking the rich emotional information conveyed by other physiological modalities, resulting in ME recognition and spotting performance far below practical application needs. Therefore, exploring the cross-modal association mechanism between ME visual features and physiological signals (PS), and developing a multimodal fusion framework, represents a pivotal step toward advancing ME analysis. This study introduces a novel ME dataset, MMME, which, for the first time, enables synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). 0.722 By overcoming the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS, establishing a robust foundation for investigating ME neural mechanisms and conducting multimodal fusion-based analyses. Extensive experiments validate the dataset's reliability and provide benchmarks for ME analysis, demonstrating that integrating MEs with PS significantly enhances recognition and spotting performance. To the best of our knowledge, MMME is the most comprehensive ME dataset to date in terms of modality diversity. It provides critical data support for exploring the neural mechanisms of MEs and uncovering the visual-physiological synergistic effects, driving a paradigm shift in ME research from single-modality visual analysis to multimodal fusion. The dataset will be publicly available upon acceptance of this paper. 0.866 |
2025-06-11 |
Dataset of News Articles with Provenance Metadata for Media Relevance Assessment
Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. The existing methods attempting to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. 0.898 We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We identify that, while the zero-shot performance on LOR is promising, the performance on DTOR falls short, leaving room for specialized architectures and future work. |
2025-06-11 |
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. 0.904 While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. 0.762 The dataset is made publicly available here: https://github.com/c-patsch/OKCV. 0.912 |
2025-06-11 |
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. 0.788 Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released. |
2025-06-11 |
Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages
Online toxic language causes real harm, especially in regions with limited moderation tools.In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data.We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. 0.859Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented.We measured precision, recall, F1 score, accuracy and false positive rates.Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives.The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms.We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration.These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities. |
2025-06-11 |
Text-Aware Image Restoration with Diffusion Models
Image restoration aims to recover degraded images.However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images.Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination.In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity.To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. 0.753Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training.This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps.Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy.See our project page: https://cvlab-kaist.github.io/TAIR/ |
2025-06-11 |
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products.Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency.Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance.In this paper, we explore how to form a data-and-model solution that natively supports partial detection.For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. 0.715Then, we propose the streaming content monitor, which is trained with dual supervision of response- and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness.Experiments show that SCM gains 0.95+ in macro F1 score that is comparable to full detection, by only seeing the first 18% of tokens in responses on average.Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and lead to a higher harmlessness score than DPO. |
2025-06-10 |
WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos
To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology.Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability.Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education.Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings.To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment.WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures.These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks.By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics.This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training.The dataset and annotations are publicly available in Synapse https://www.synapse.org/Synapse:syn66401174/files. 0.928 |
2025-06-10 |
WIP: Large Language Model-Enhanced Smart Tutor for Undergraduate Circuit Analysis
This research-to-practice work-in-progress (WIP) paper presents an AI-enabled smart tutor designed to provide homework assessment and feedback for students in an undergraduate circuit analysis course.We detail the tutor's design philosophy and core components, including open-ended question answering and homework feedback generation.The prompts are carefully crafted to optimize responses across different problems.The smart tutor was deployed on the Microsoft Azure platform and is currently in use in an undergraduate circuit analysis course at the School of Electrical and Computer Engineering in a large, public, research-intensive institution in the Southeastern United States.Beyond offering personalized instruction and feedback, the tutor collects student interaction data, which is summarized and shared with the course instructor.To evaluate its effectiveness, we collected student feedback, with 90.9% of responses indicating satisfaction with the tutor.Additionally, we analyze a subset of collected data on preliminary circuit analysis topics to assess tutor usage frequency for each problem and identify frequently asked questions.These insights help instructors gain real-time awareness of student difficulties, enabling more targeted classroom instruction.In future work, we will release a full analysis once the complete dataset is available after the Spring 2025 semester. 0.892We also explore the potential applications of this smart tutor across a broader range of engineering disciplines by developing improved prompts, diagram-recognition methods, and database management strategies, which remain ongoing areas of research. |
2025-06-10 |
ORIDa: Object-centric Real-world Image Composition Dataset
Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models.However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios.We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. 0.84ORIDa has two types of data: factual-counterfactual sets and factual-only scenes.The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene.The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments.To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition.Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing. |
2025-06-10 |
Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices.However, prior work has predominantly focused on atomic tasks -- such as shot-chain execution tasks and single-screen grounding tasks -- while overlooking the generalization to compositional tasks, which are indispensable for real-world applications.This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive.UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps.It comprises 100 interactive task templates with an average optimal step count of 14.05.Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges.Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing visible atomic-to-compositional generalization gap.Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks.AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks to a series of self-contained atomic subtasks.AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly sacrificing inference overhead.The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io. 0.723 |
2025-06-10 |
FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents
We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography) corpus. 0.706It consists of 18 bilingual speakers, who produced speech in their native language (L1), second language (L2), and imitated L2 (fake foreign accent).The new corpus enables research into language variability from phonetic and technological points of view.Accordingly, we include two preliminary case studies to demonstrate both perspectives.The first case study explores the impact of L2 and imitated L2 on the performance of an automatic speaker verification system, while the second illustrates the articulatory patterns of one speaker in L1, L2, and a fake accent. |
2025-06-10 |
Employing self-supervised learning models for cross-linguistic child speech maturity classification
Speech technology systems struggle with many downstream tasks for child speech due to small training corpora and the difficulties that child speech poses. We apply a novel dataset, SpeechMaturity, to state-of-the-art transformer models to address a fundamental classification task: identifying child vocalizations. Unlike previous corpora, our dataset captures maximally ecologically-valid child vocalizations across an unprecedented sample, comprising children acquiring 25+ languages in the U.S., Bolivia, Vanuatu, Papua New Guinea, Solomon Islands, and France. 0.796 The dataset contains 242,004 labeled vocalizations, magnitudes larger than previous work. 0.815 Models were trained to distinguish between cry, laughter, mature (consonant+vowel), and immature speech (just consonant or vowel). Models trained on the dataset outperformed state-of-the-art models trained on previous datasets, achieved classification accuracy comparable to humans, and were robust across rural and urban settings. |
2025-06-10 |
Princeton365: A Diverse Dataset with Accurate Camera Pose
We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. 0.786 Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to existing metrics such as Average Trajectory Error (ATE), our new metric allows for comparison between the performance of SLAM methods across scenes, enabling researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit https://princeton365.cs.princeton.edu for the dataset, code, videos, and submission. 0.883 |
2025-06-09 |
Creating a Historical Migration Dataset from Finnish Church Records, 1800-1920
This article presents a large-scale effort to create a structured dataset of internal migration in Finland between 1800 and 1920 using digitized church moving records. These records, maintained by Evangelical-Lutheran parishes, document the migration of individuals and families and offer a valuable source for studying historical demographic patterns. The dataset includes over six million entries extracted from approximately 200,000 images of handwritten migration records. 0.926 The data extraction process was automated using a deep learning pipeline that included layout analysis, table detection, cell classification, and handwriting recognition. The complete pipeline was applied to all images, resulting in a structured dataset suitable for research. The dataset can be used to study internal migration, urbanization, family migration, and the spread of disease in preindustrial Finland. 0.805 A case study from the Elimäki parish shows how local migration histories can be reconstructed. The work demonstrates how large volumes of handwritten archival material can be transformed into structured data to support historical and demographic research. |
2025-06-09 |
Exposing Hidden Backdoors in NFT Smart Contracts: A Static Security Analysis of Rug Pull Patterns
The explosive growth of Non-Fungible Tokens (NFTs) has revolutionized digital ownership by enabling the creation, exchange, and monetization of unique assets on blockchain networks.However, this surge in popularity has also given rise to a disturbing trend: the emergence of rug pulls - fraudulent schemes where developers exploit trust and smart contract privileges to drain user funds or invalidate asset ownership.Central to many of these scams are hidden backdoors embedded within NFT smart contracts.Unlike unintentional bugs, these backdoors are deliberately coded and often obfuscated to bypass traditional audits and exploit investor confidence.In this paper, we present a large-scale static analysis of 49,940 verified NFT smart contracts using Slither, a static analysis framework, to uncover latent vulnerabilities commonly linked to rug pulls.We introduce a custom risk scoring model that classifies contracts into high, medium, or low risk tiers based on the presence and severity of rug pull indicators.Our dataset was derived from verified contracts on the Ethereum mainnet, and we generate multiple visualizations to highlight red flag clusters, issue prevalence, and co-occurrence of critical vulnerabilities. 0.832While we do not perform live exploits, our results reveal how malicious patterns often missed by simple reviews can be surfaced through static analysis at scale.We conclude by offering mitigation strategies for developers, marketplaces, and auditors to enhance smart contract security.By exposing how hidden backdoors manifest in real-world smart contracts, this work contributes a practical foundation for detecting and mitigating NFT rug pulls through scalable automated analysis. |
2025-06-09 |
UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References
6D object pose estimation has shown strong generalizability to novel objects.However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object.Estimating 6D poses from partial references, which capture only fragments of an object's appearance and geometry, remains challenging.To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references.We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image.For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model.Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions.This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing robustness in pose estimation accuracy and improving object completeness.We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. 0.719Experimental results demonstrate significant performance improvements over existing methods, particularly when object observations are incomplete or partially captured.Project page: https://minfenli.github.io/UA-Pose/ |
2025-06-09 |
Audio-Sync Video Generation with Multi-Stream Temporal Control
Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies).Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., Podcasts or historical recordings).However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types.In this work, we introduce MTV, a versatile framework for audio-sync video generation.MTV explicitly separates audios into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively -- resulting in fine-grained and semantically aligned video generation.To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. 0.794DEMIX is structured into five overlapped subsets, enabling scalable multi-stage training for diverse generation scenarios.Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment.Project page: https://hjzheng.net/projects/MTV/. |
2025-06-09 |
Dreamland: Controllable World Creation with Simulator and Generative Models
Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, hindering their use in editing scenes and training embodied AI agents.We propose Dreamland, a hybrid world generation framework combining the granular control of a physics-based simulator and the photorealistic content output of large-scale pretrained generative models.In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation to bridge the simulator and the generative model.This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models.We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines.Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality, 17.9% stronger controllability, and has great potential to enhance embodied agent training.Code and data will be made available. 0.767 |
Data Quality |
|
2025-06-18 |
GFLC: Graph-based Fairness-aware Label Correction for Fair Classification
Fairness in machine learning (ML) is critically important for building trustworthy machine learning systems as artificial intelligence (AI) systems increasingly impact various aspects of society, including healthcare decisions and legal judgments. Moreover, numerous studies demonstrate evidence of unfair outcomes in ML and the need for more robust fairness-aware methods. However, the data we use to train and develop debiasing techniques often contains biased and noisy labels. As a result, the label bias in the training data affects model performance and misrepresents the fairness of classifiers during testing. 0.64 To tackle this problem, our paper presents Graph-based Fairness-aware Label Correction (GFLC), an efficient method for correcting label noise while preserving demographic parity in datasets. 0.635 In particular, our approach combines three key components: prediction confidence measure, graph-based regularization through Ricci-flow-optimized graph Laplacians, and explicit demographic parity incentives. Our experimental findings demonstrate the effectiveness of our proposed approach and show significant improvements in the trade-off between performance and fairness metrics compared to the baseline. |
2025-06-16 |
CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities.Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks.While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations.To address it, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration.Besides, we design mixed-label supervision to mitigate the error accumulation during distillation. 0.607Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck.To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency.Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks.These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics.Our project page is available at https://irpn-eai.github.io/CEED-VLA/. |
2025-06-12 |
Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include disagreement in labeling among annotators and imbalanced data distributions. 0.789 This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2, evaluated on the MSP-Podcast dataset. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender, and including "Other" (O) and "No Agreement" (X) samples in the training set. Our system's results secured both first and second places in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble. |
2025-06-10 |
UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags
The present study extends recent work on Universal Dependencies annotations for second-language (L2) Korean by introducing a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns those constructions with corresponding UPOS categories.We also broaden the existing L2-Korean corpus by annotating 2,998 new sentences from argumentative essays.To evaluate the impact of XPOS-UPOS alignments, we fine-tune L2-Korean morphosyntactic analysis models on datasets both with and without these alignments, using two NLP toolkits.Our results indicate that the aligned dataset not only improves consistency across annotation layers but also enhances morphosyntactic tagging and dependency-parsing accuracy, particularly in cases of limited annotated data. 0.62 |
2025-06-09 |
Rethinking Crowd-Sourced Evaluation of Neuron Explanations
Interpreting individual neurons or directions in activation space is an important component of mechanistic interpretability. As such, many algorithms have been proposed to automatically produce neuron explanations, but it is often not clear how reliable these explanations are, or which methods produce the best explanations. This can be measured via crowd-sourced evaluations, but they can often be noisy and expensive, leading to unreliable results. In this paper, we carefully analyze the evaluation pipeline and develop a cost-effective and highly accurate crowdsourced evaluation strategy. In contrast to previous human studies that only rate whether the explanation matches the most highly activating inputs, we estimate whether the explanation describes neuron activations across all inputs. To estimate this effectively, we introduce a novel application of importance sampling to determine which inputs are the most valuable to show to raters, leading to a roughly 30x cost reduction compared to uniform sampling. We also analyze the label noise present in crowd-sourced evaluations and propose a Bayesian method to aggregate multiple ratings, leading to a further ~5x reduction in the number of ratings required for the same accuracy. 0.614 Finally, we use these methods to conduct a large-scale study comparing the quality of neuron explanations produced by the most popular methods for two different vision models. |
2025-06-05 |
LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO (Rafailov et al., 2024), to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose Lean Preference Optimization (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. 0.604 Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs. |
2025-06-04 |
Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems
Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback.However, many ASR systems discard or generalize hesitations, losing important acoustic details.We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data.We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). 0.632Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra).Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the "Extra" scheme yielded a 5.5% WER, an 11.3% relative improvement over the "Pure" scheme (6.2% WER).This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription. |
2025-06-03 |
Causal Estimation of Tokenisation Bias
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. 0.647 In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling. |
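The regression discontinuity idea in the abstract above can be illustrated with a minimal sketch. It assumes subwords are ranked 1, 2, ... and that ranks up to the cutoff K enter the vocabulary. The function names, the bandwidth, and the synthetic outcome values are hypothetical and not taken from the paper. The sketch fits a straight line on each side of the cutoff and reports the gap between the two fitted values at the cutoff.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rd_estimate(ranks, outcomes, cutoff, bandwidth):
    """Estimated jump in the outcome at the vocabulary cutoff.

    Assumption: ranks <= cutoff are in the vocabulary (treated) and
    ranks > cutoff are not (control). Only ranks within `bandwidth`
    of the cutoff are used.
    """
    left = [(r - cutoff, y) for r, y in zip(ranks, outcomes)
            if -bandwidth <= r - cutoff <= 0]
    right = [(r - cutoff, y) for r, y in zip(ranks, outcomes)
             if 0 < r - cutoff <= bandwidth]
    # With ranks centred on the cutoff, each intercept is the fitted
    # outcome exactly at the cutoff.
    _, left_at_cutoff = fit_line(*zip(*left))
    _, right_at_cutoff = fit_line(*zip(*right))
    return left_at_cutoff - right_at_cutoff


# Synthetic, made-up data: a smooth trend in rank plus a jump of 2.0
# for subwords that made it into the vocabulary.
K = 100
ranks = list(range(1, 201))
outcomes = [-0.01 * (r - K) + (2.0 if r <= K else 0.0) for r in ranks]
effect = rd_estimate(ranks, outcomes, cutoff=K, bandwidth=50)
```

On this toy data the estimate recovers the built-in jump because the trend is exactly linear. With real outcomes, the choice of bandwidth and functional form would need care.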
Benchmarks |
|
2025-06-18 |
Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge
Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks.However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them.Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices.To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi).Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor.Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor.Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor.Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability.Experiments show that EmoBi outperforms all baseline methods on four datasets. 0.626Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset.These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection. |
2025-06-18 |
RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information.However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs.We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining.RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions.A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization.This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass.We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates.On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918.This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU. 0.661RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications. |
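The comparison described in this abstract (a query-only path versus a retrieval-augmented path, scored by KL divergence) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the direction of the divergence, the toy distributions, and the threshold value are all assumptions made for the example, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def flag_memorization(parametric, augmented, threshold):
    """Return (divergence, flagged).

    A divergence below the threshold means the retrieved context barely
    changed the output distribution, which the abstract reads as a sign
    of possible memorization. Direction KL(augmented || parametric) is
    an assumption made for this sketch.
    """
    divergence = kl_divergence(augmented, parametric)
    return divergence, divergence < threshold


# Hypothetical next-token distributions over a three-token vocabulary.
parametric = [0.70, 0.20, 0.10]   # query only
unchanged = [0.69, 0.21, 0.10]    # query + context, almost no change
shifted = [0.10, 0.15, 0.75]      # query + context, clearly different

THRESHOLD = 0.05  # hypothetical; the paper ties the threshold to PAC-style guarantees

d_same, flag_same = flag_memorization(parametric, unchanged, THRESHOLD)
d_shift, flag_shift = flag_memorization(parametric, shifted, THRESHOLD)

print(f"unchanged: KL={d_same:.4f}, flagged={flag_same}")
print(f"shifted:   KL={d_shift:.4f}, flagged={flag_shift}")
```

In a real system the two distributions would come from the model's outputs on the two prompts; only those outputs are needed, which matches the black-box setting the abstract describes.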
2025-06-18 |
Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper
Automatic lyrics transcription (ALT) remains a challenging task in the field of music information retrieval, despite great advances in automatic speech recognition (ASR) brought about by transformer-based architectures in recent years.One of the major challenges in ALT is the high amplitude of interfering audio signals relative to conventional ASR due to musical accompaniment.Recent advances in music source separation have enabled automatic extraction of high-quality separated vocals, which could potentially improve ALT performance.However, the effect of source separation has not been systematically investigated in order to establish best practices for its use.This work examines the impact of source separation on ALT using Whisper, a state-of-the-art open source ASR model.We evaluate Whisper's performance on original audio, separated vocals, and vocal stems across short-form and long-form transcription tasks.For short-form, we suggest a concatenation method that results in a consistent reduction in Word Error Rate (WER).For long-form, we propose an algorithm using source separation as a vocal activity detector to derive segment boundaries, which results in a consistent reduction in WER relative to Whisper's native long-form algorithm.Our approach achieves state-of-the-art results for an open source system on the Jam-ALT long-form ALT benchmark, without any training or fine-tuning. 0.612We also publish MUSDB-ALT, the first dataset of long-form lyric transcripts following the Jam-ALT guidelines for which vocal stems are publicly available. |
2025-06-18 |
Real-Time Initialization of Unknown Anchors for UWB-aided Navigation
This paper presents a framework for the real-time initialization of unknown Ultra-Wideband (UWB) anchors in UWB-aided navigation systems.The method is designed for localization solutions where UWB modules act as supplementary sensors.Our approach enables the automatic detection and calibration of previously unknown anchors during operation, removing the need for manual setup.By combining an online Positional Dilution of Precision (PDOP) estimation, a lightweight outlier detection method, and an adaptive robust kernel for non-linear optimization, our approach significantly improves robustness and suitability for real-world applications compared to state-of-the-art.In particular, we show that our metric which triggers an initialization decision is more conservative than current ones commonly based on initial linear or non-linear initialization guesses.This allows for better initialization geometry and subsequently lower initialization errors.We demonstrate the proposed approach on two different mobile robots: an autonomous forklift and a quadcopter equipped with a UWB-aided Visual-Inertial Odometry (VIO) framework.The results highlight the effectiveness of the proposed method with robust initialization and low positioning error. 0.636We open-source our code in a C++ library including a ROS wrapper. |
2025-06-18 |
Atys: An Efficient Profiling Framework for Identifying Hotspot Functions in Large-scale Cloud Microservices
To handle the high volume of requests, large-scale services are comprised of thousands of instances deployed in clouds. These services utilize diverse programming languages and are distributed across various nodes as encapsulated containers. Given their vast scale, even minor performance enhancements can lead to significant cost reductions. In this paper, we introduce Atys, an efficient profiling framework specifically designed to identify hotspot functions within large-scale distributed services. Atys presents four key features. First, it implements a language-agnostic adaptation mechanism for multilingual microservices. Second, a two-level aggregation method is introduced to provide a comprehensive overview of flamegraphs. Third, we propose a function selective pruning (FSP) strategy to enhance the efficiency of aggregating profiling results. Finally, we develop a frequency dynamic adjustment (FDA) scheme that dynamically modifies sampling frequency based on service status, effectively minimizing profiling cost while maintaining accuracy. Cluster-scale experiments on two benchmarks show that the FSP strategy achieves a 6.8% reduction in time with a mere 0.58% mean average percentage error (MAPE) in stack traces aggregation. 0.635 Additionally, the FDA scheme ensures that the mean squared error (MSE) remains on par with that at high sampling rates, while achieving an 87.6% reduction in cost. |
2025-06-18 |
An efficient construction of Raz's two-source randomness extractor with improved parameters
Randomness extractors are algorithms that distill weak random sources into near-perfect random numbers.Two-source extractors enable this distillation process by combining two independent weak random sources.Raz's extractor (STOC '05) was the first to achieve this in a setting where one source has linear min-entropy (i.e., proportional to its length), while the other has only logarithmic min-entropy in its length.However, Raz's original construction is impractical due to a polynomial computation time of at least degree 4.Our work solves this problem by presenting an improved version of Raz's extractor with quasi-linear computation time, as well as a new analytic theorem with reduced entropy requirements.We provide comprehensive analytical and numerical comparisons of our construction with others in the literature, and we derive strong and quantum-proof versions of our efficient Raz extractor.Additionally, we offer an easy-to-use, open-source code implementation of the extractor and a numerical parameter calculation module. 0.629 |
2025-06-18 |
Task-Agnostic Experts Composition for Continual Learning
Compositionality is one of the fundamental abilities of the human reasoning process, allowing a complex problem to be decomposed into simpler elements. Such a property is also crucial for neural networks, especially when aiming for a more efficient and sustainable AI framework. We propose a compositional approach by ensembling zero-shot a set of expert models, assessing our methodology using a challenging benchmark designed to test compositionality capabilities. We show that our Expert Composition method is able to achieve a much higher accuracy than baseline algorithms while requiring less computational resources, hence being more efficient. 0.627 |
2025-06-18 |
CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies
This work introduces a GPU storage expansion solution utilizing CXL, featuring a novel GPU system design with multiple CXL root ports for integrating diverse storage media (DRAMs and/or SSDs).We developed and siliconized a custom CXL controller integrated at the hardware RTL level, achieving two-digit nanosecond roundtrip latency, the first in the field.This study also includes speculative read and deterministic store mechanisms to efficiently manage read and write operations to hide the endpoint's backend media latency variation.Performance evaluations reveal our approach significantly outperforms existing methods, marking a substantial advancement in GPU storage technology. 0.675 |
2025-06-18 |
Estimate Hitting Time by Hitting Probability for Elitist Evolutionary Algorithms
Drift analysis is a powerful tool for analyzing the time complexity of evolutionary algorithms.However, it requires manual construction of drift functions to bound hitting time for each specific algorithm and problem.To address this limitation, general linear drift functions were introduced for elitist evolutionary algorithms.But calculating linear bound coefficients effectively remains a problem.This paper proposes a new method called drift analysis of hitting probability to compute these coefficients. 0.602Each coefficient is interpreted as a bound on the hitting probability of a fitness level, transforming the task of estimating hitting time into estimating hitting probability.A novel drift analysis method is then developed to estimate hitting probability, where paths are introduced to handle multimodal fitness landscapes.Explicit expressions are constructed to compute hitting probability, significantly simplifying the estimation process.One advantage of the proposed method is its ability to estimate both the lower and upper bounds of hitting time and to compare the performance of two algorithms in terms of hitting time. 0.703To demonstrate this application, two algorithms for the knapsack problem, each incorporating feasibility rules and greedy repair respectively, are compared.The analysis indicates that neither constraint handling technique consistently outperforms the other. |
2025-06-18 |
GFLC: Graph-based Fairness-aware Label Correction for Fair Classification
Fairness in machine learning (ML) is critically important for building trustworthy machine learning systems as artificial intelligence (AI) systems increasingly impact various aspects of society, including healthcare decisions and legal judgments. Moreover, numerous studies demonstrate evidence of unfair outcomes in ML and the need for more robust fairness-aware methods. However, the data we use to train and develop debiasing techniques often contains biased and noisy labels. As a result, the label bias in the training data affects model performance and misrepresents the fairness of classifiers during testing. To tackle this problem, our paper presents Graph-based Fairness-aware Label Correction (GFLC), an efficient method for correcting label noise while preserving demographic parity in datasets. In particular, our approach combines three key components: prediction confidence measure, graph-based regularization through Ricci-flow-optimized graph Laplacians, and explicit demographic parity incentives. Our experimental findings show the effectiveness of our proposed approach and show significant improvements in the trade-off between performance and fairness metrics compared to the baseline. 0.676 |
2025-06-18 |
Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement
Recent advancements in large reasoning models (LRMs) have significantly enhanced language models' capabilities in complex problem-solving by emulating human-like deliberative thinking.However, these models often exhibit overthinking (i.e., the generation of unnecessarily verbose and redundant content), which hinders efficiency and inflates inference cost.In this work, we explore the representational and behavioral origins of this inefficiency, revealing that LRMs inherently possess the capacity for more concise reasoning.Empirical analyses show that correct reasoning paths vary significantly in length, and the shortest correct responses often suffice, indicating untapped efficiency potential.Exploiting these findings, we propose two lightweight methods to enhance LRM efficiency. 0.636First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction in the model's representation space.Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity by rewarding concise correct solutions.Extensive experiments on seven LRM backbones across multiple mathematical reasoning benchmarks demonstrate that our methods significantly reduce reasoning length while preserving or improving task performance.Our results highlight that reasoning efficiency can be improved by leveraging and guiding the intrinsic capabilities of existing models in a self-guided manner. |
2025-06-17 |
Enhancing Symbolic Machine Learning by Subsymbolic Representations
The goal of neuro-symbolic AI is to integrate symbolic and subsymbolic AI approaches, to overcome the limitations of either. Prominent systems include Logic Tensor Networks (LTN) or DeepProbLog, which offer neural predicates and end-to-end learning. The versatility of systems like LTNs and DeepProbLog, however, makes them less efficient in simpler settings, for instance, for discriminative machine learning, in particular in domains with many constants. Therefore, we follow a different approach: We propose to enhance symbolic machine learning schemes by giving them access to neural embeddings. In the present paper, we show this for TILDE and embeddings of constants used by TILDE in similarity predicates. The approach can be fine-tuned by further refining the embeddings depending on the symbolic theory. In experiments in three real-world domains, we show that this simple, yet effective, approach outperforms all other baseline methods in terms of the F1 score. 0.672 The approach could be useful beyond this setting: Enhancing symbolic learners in this way could be extended to similarities between instances (effectively working like kernels within a logical language), for analogical reasoning, or for propositionalization. |
2025-06-17 |
TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization
Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models.However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem.To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived.Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward.This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards.Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. 0.634Code is available at https://github.com/dvlab-research/TGDPO. |
2025-06-17 |
SCISSOR: Mitigating Semantic Bias through Cluster-Aware Siamese Networks for Robust Classification
Shortcut learning undermines model generalization to out-of-distribution data.While the literature attributes shortcuts to biases in superficial features, we show that imbalances in the semantic distribution of sample embeddings induce spurious semantic correlations, compromising model robustness.To address this issue, we propose SCISSOR (Semantic Cluster Intervention for Suppressing ShORtcut), a Siamese network-based debiasing approach that remaps the semantic space by discouraging latent clusters exploited as shortcuts.Unlike prior data-debiasing approaches, SCISSOR eliminates the need for data augmentation and rewriting.We evaluate SCISSOR on 6 models across 4 benchmarks: Chest-XRay and Not-MNIST in computer vision, and GYAFC and Yelp in NLP tasks.Compared to several baselines, SCISSOR reports +5.3 absolute points in F1 score on GYAFC, +7.3 on Yelp, +7.7 on Chest-XRay, and +1 on Not-MNIST. 0.63SCISSOR is also highly advantageous for lightweight models with ~9.5% improvement on F1 for ViT on computer vision datasets and ~11.9% for BERT on NLP.Our study redefines the landscape of model generalization by addressing overlooked semantic biases, establishing SCISSOR as a foundational framework for mitigating shortcut learning and fostering more robust, bias-resistant AI systems. |
2025-06-17 |
Expressive Score-Based Priors for Distribution Matching with Geometry-Preserving Regularization
Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation.Non-parametric DM methods struggle with scalability and adversarial DM approaches suffer from instability and mode collapse.While likelihood-based methods are a promising alternative, they often impose unnecessary biases through fixed priors or require explicit density models (e.g., flows) that can be challenging to train.We address this limitation by introducing a novel approach to training likelihood-based DM using expressive score-based prior distributions.Our key insight is that gradient-based DM training only requires the prior's score function -- not its density -- allowing us to train the prior via denoising score matching.This approach eliminates biases from fixed priors (e.g., in VAEs), enabling more effective use of geometry-preserving regularization, while avoiding the challenge of learning an explicit prior density model (e.g., a flow-based prior).Our method also demonstrates better stability and computational efficiency compared to other diffusion-based priors (e.g., LSGM). 0.666Furthermore, experiments demonstrate superior performance across multiple tasks, establishing our score-based method as a stable and effective approach to distribution matching. 0.661Source code available at https://github.com/inouye-lab/SAUB. |
2025-06-17 |
GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors
Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs.However, their performance is limited by the small number of trainable parameters.Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations remain in hindering the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity.To mitigate these gaps, we propose GuiLoMo, a fine-grained layer-wise expert numbers and ranks allocation strategy with GuidedSelection Vectors (GSVs).GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks.Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. 0.612Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration.Our code is available at https://github.com/Liar406/Gui-LoMo.git. |
2025-06-17 |
A Systematic Replicability and Comparative Study of BSARec and SASRec for Sequential Recommendation
This study aims at comparing two sequential recommender systems: Self-Attention based Sequential Recommendation (SASRec), and Beyond Self-Attention based Sequential Recommendation (BSARec) in order to check the improvement frequency enhancement - the added element in BSARec - has on recommendations. The models in the study have been re-implemented with a common base-structure from EasyRec, with the aim of obtaining a fair and reproducible comparison. 0.605 The results obtained displayed how BSARec, by including bias terms for frequency enhancement, does indeed outperform SASRec, although the increases in performance obtained are not as high as those presented by the authors. 0.652 This work aims at offering an overview on existing methods, and most importantly at underlining the importance of implementation details for performance comparison. 0.763 |
2025-06-17 |
Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion
Cameras and LiDAR are essential sensors for autonomous vehicles.The fusion of camera and LiDAR data addresses the limitations of individual sensors but relies on precise extrinsic calibration.Recently, numerous end-to-end calibration methods have been proposed; however, most predict extrinsic parameters in a single step and lack iterative optimization capabilities.To address the increasing demand for higher accuracy, we propose a versatile iterative framework based on surrogate diffusion. 0.624This framework can enhance the performance of any calibration method without requiring architectural modifications. 0.604Specifically, the initial extrinsic parameters undergo iterative refinement through a denoising process, in which the original calibration method serves as a surrogate denoiser to estimate the final extrinsics at each step.For comparative analysis, we selected four state-of-the-art calibration methods as surrogate denoisers and compared the results of our diffusion process with those of two other iterative approaches. 0.706Extensive experiments demonstrate that when integrated with our diffusion model, all calibration methods achieve higher accuracy, improved robustness, and greater stability compared to other iterative techniques and their single-step counterparts. 0.632 |
2025-06-17 |
Joint Error Correction and Fading Channel Estimation Enhancement Leveraging GRAND
We present a novel method for error correction in the presence of fading channel estimation errors (CEE).When such errors are significant, considerable performance losses can be observed if the wireless transceiver is not adapted.Instead of refining the estimate by increasing the pilot sequence length or improving the estimation algorithm, we propose two new approaches based on Guessing Random Additive Noise Decoding (GRAND) decoders.The first method involves testing multiple candidates for the channel estimate located in the complex neighborhood around the original pilot-based estimate.All these candidates are employed in parallel to compute log-likelihood ratios (LLR).These LLRs are used as soft input to Ordered Reliability Bits GRAND (ORBGRAND).Posterior likelihood formulas associated with ORBGRAND are then computed to determine which channel candidate leads to the most probable codeword.The second method is a refined version of the first approach accounting for the presence of residual CEE in the LLR computation. 0.647The performance of these two techniques is evaluated for [128,112] 5G NR CA-Polar and CRC codes.For the considered settings, block error rate (BLER) gains of several dBs are observed compared to cases where CEE is ignored. 0.62 |
2025-06-17 |
RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills
Endowing robots with tool design abilities is critical for enabling them to solve complex manipulation tasks that would otherwise be intractable. While recent generative frameworks can automatically synthesize task settings, such as 3D scenes and reward functions, they have not yet addressed the challenge of tool-use scenarios. Simply retrieving human-designed tools might not be ideal since many tools (e.g., a rolling pin) are difficult for robotic manipulators to handle. Furthermore, existing tool design approaches either rely on predefined templates with limited parameter tuning or apply generic 3D generation methods that are not optimized for tool creation. To address these limitations, we propose RobotSmith, an automated pipeline that leverages the implicit physical knowledge embedded in vision-language models (VLMs) alongside the more accurate physics provided by physics simulations to design and use tools for robotic manipulation. Our system (1) iteratively proposes tool designs using collaborative VLM agents, (2) generates low-level robot trajectories for tool use, and (3) jointly optimizes tool geometry and usage for task performance. We evaluate our approach across a wide range of manipulation tasks involving rigid, deformable, and fluid objects. Experiments show that our method consistently outperforms strong baselines in terms of both task success rate and overall performance. 0.761 Notably, our approach achieves a 50.0% average success rate, significantly surpassing other baselines such as 3D generation (21.4%) and tool retrieval (11.1%). Finally, we deploy our system in real-world settings, demonstrating that the generated tools and their usage plans transfer effectively to physical execution, validating the practicality and generalization capabilities of our approach. |
2025-06-17 |
Scaling-Up the Pretraining of the Earth Observation Foundation Model PhilEO to the MajorTOM Dataset
Today, Earth Observation (EO) satellites generate massive volumes of data, with the Copernicus Sentinel-2 constellation alone producing approximately 1.6TB per day.To fully exploit this information, it is essential to pretrain EO Foundation Models (FMs) on large unlabeled datasets, enabling efficient fine-tuning for several different downstream tasks with minimal labeled data.In this work, we present the scaling-up of our recently proposed EO Foundation Model, PhilEO Geo-Aware U-Net, on the unlabeled 23TB dataset MajorTOM, which covers the vast majority of the Earth's surface, as well as on the specialized subset FastTOM 2TB that does not include oceans and ice.We develop and study various PhilEO model variants with different numbers of parameters and architectures.Finally, we fine-tune the models on the PhilEO Bench for road density estimation, building density pixel-wise regression, and land cover semantic segmentation, and we evaluate the performance.Our results demonstrate that for all n-shots for road density regression, the PhilEO 44M MajorTOM 23TB model outperforms PhilEO Globe 0.5TB 44M.We also show that for most n-shots for road density estimation and building density regression, PhilEO 200M FastTOM outperforms all the other models.The effectiveness of both dataset and model scaling is validated using the PhilEO Bench. 0.607We also study the impact of architecture scaling, transitioning from U-Net Convolutional Neural Networks (CNN) to Vision Transformers (ViT). |
2025-06-16 |
Omni-AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented for Efficient Long Video Understanding
Multimodal Large Language Models (MLLMs) struggle with long videos due to fixed context windows and weak long-term dependency modeling.Existing Retrieval-Augmented Generation (RAG) methods for videos use static retrieval strategies, leading to inefficiencies for simple queries and information loss for complex tasks.To address this, we propose AdaVideoRAG, a novel framework that dynamically adapts retrieval granularity based on query complexity using a lightweight intent classifier.Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs, enabling optimal resource allocation across tasks.We also introduce the HiVU benchmark for comprehensive evaluation. 0.703Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.AdaVideoRAG establishes a new paradigm for adaptive retrieval in video analysis.Codes will be open-sourced at https://github.com/xzc-zju/AdaVideoRAG. |
2025-06-16 |
EBS-CFL: Efficient and Byzantine-robust Secure Clustered Federated Learning
Despite federated learning (FL)'s potential in collaborative learning, its performance has deteriorated due to the data heterogeneity of distributed users.Recently, clustered federated learning (CFL) has emerged to address this challenge by partitioning users into clusters according to their similarity.However, CFL faces difficulties in training when users are unwilling to share their cluster identities due to privacy concerns.To address these issues, we present an innovative Efficient and Robust Secure Aggregation scheme for CFL, dubbed EBS-CFL.The proposed EBS-CFL supports effectively training CFL while maintaining users' cluster identity confidentially.Moreover, it detects potential poisonous attacks without compromising individual client gradients by discarding negatively correlated gradients and aggregating positively correlated ones using a weighted approach.The server also authenticates correct gradient encoding by clients.EBS-CFL has high efficiency with client-side overhead O(ml + m^2) for communication and O(m^2l) for computation, where m is the number of cluster identities, and l is the gradient size.When m = 1, EBS-CFL's computational efficiency of client is at least O(log n) times better than comparison schemes, where n is the number of clients. 0.604In addition, we validate the scheme through extensive experiments.Finally, we theoretically prove the scheme's security. |
2025-06-16 |
Delay-optimal Congestion-aware Routing and Computation Offloading in Arbitrary Network
Emerging edge computing paradigms enable heterogeneous devices to collaborate on complex computation applications. However, for arbitrary heterogeneous edge networks, delay-optimal forwarding and computation offloading remains an open problem. In this paper, we jointly optimize data/result routing and computation placement in arbitrary networks with heterogeneous node capabilities and congestion-dependent nonlinear transmission and processing delay. Despite the non-convexity of the formulated problem, based on analyzing the KKT conditions, we provide a set of sufficient optimality conditions that solve the problem globally. To provide insight into such global optimality, we show that the proposed non-convex problem is geodesic-convex under mild assumptions. We also show that the proposed sufficient optimality condition leads to a lower hemicontinuous solution set, providing stability against user-input perturbation. We then extend the framework to incorporate utility-based congestion control and fairness. A fully distributed algorithm is developed to converge to the global optimum. Numerical results demonstrate significant improvements over multiple baseline algorithms. 0.821 |
2025-06-16 |
Global Convergence of Adjoint-Optimized Neural PDEs
Many engineering and scientific fields have recently become interested in modeling terms in partial differential equations (PDEs) with neural networks. The resulting neural-network PDE model, being a function of the neural network parameters, can be calibrated to available data by optimizing over the PDE using gradient descent, where the gradient is evaluated in a computationally efficient manner by solving an adjoint PDE. These neural-network PDE models have emerged as an important research area in scientific machine learning. In this paper, we study the convergence of the adjoint gradient descent optimization method for training neural-network PDE models in the limit where both the number of hidden units and the training time tend to infinity. Specifically, for a general class of nonlinear parabolic PDEs with a neural network embedded in the source term, we prove convergence of the trained neural-network PDE solution to the target data (i.e., a global minimizer). The global convergence proof poses a unique mathematical challenge that is not encountered in finite-dimensional neural network convergence analyses due to (1) the neural network training dynamics involving a non-local neural network kernel operator in the infinite-width hidden layer limit where the kernel lacks a spectral gap for its eigenvalues and (2) the nonlinearity of the limit PDE system, which leads to a non-convex optimization problem, even in the infinite-width hidden layer limit (unlike in typical neural network training cases where the optimization problem becomes convex in the large neuron limit). The theoretical results are illustrated and empirically validated by numerical studies. 0.703 |
2025-06-16 |
DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models
Multimodal large language models (MLLMs) have streamlined front-end interface development by automating code generation. However, these models also introduce challenges in ensuring code quality. Existing approaches struggle to maintain both visual consistency and functional completeness in the generated components. Moreover, they lack mechanisms to assess the fidelity and correctness of the rendered pages. To address these issues, we propose DesignCoder, a novel hierarchy-aware and self-correcting automated code generation framework. Specifically, we introduce UI Grouping Chains, which enhance MLLMs' capability to understand and predict complex nested UI hierarchies. Subsequently, DesignCoder employs a hierarchical divide-and-conquer approach to generate front-end code. Finally, we incorporate a self-correction mechanism to improve the model's ability to identify and rectify errors in the generated code. Extensive evaluations on a dataset of UI mockups collected from both open-source communities and industry projects demonstrate that DesignCoder outperforms state-of-the-art baselines in React Native, a widely adopted UI framework. Our method achieves performance increases of 37.63%, 9.52%, and 12.82% in the visual similarity metrics (MSE, CLIP, SSIM), respectively, and significantly improves code structure similarity in terms of TreeBLEU, Container Match, and Tree Edit Distance by 30.19%, 29.31%, and 24.67%, respectively. 0.609 Furthermore, we conducted a user study with professional developers to assess the quality and practicality of the generated code. Results indicate that DesignCoder aligns with industry best practices, demonstrating high usability, readability, and maintainability. Our approach provides an efficient and practical solution for agile front-end development, enabling development teams to focus more on core functionality and product innovation. |
2025-06-16 |
Prefix-Tuning+: Modernizing Prefix-Tuning through Attention Independent Prefix Data
Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks.Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead.However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited.In this work, we demonstrate empirically that Prefix-Tuning underperforms on LLMs because of an inherent tradeoff between input and prefix significance within the attention head.This motivates us to introduce Prefix-Tuning+, a novel architecture that generalizes the principles of Prefix-Tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself.We further provide an overview of our construction process to guide future users when constructing their own context-based methods.Our experiments show that, across a diverse set of benchmarks, Prefix-Tuning+ consistently outperforms existing Prefix-Tuning methods. 0.655Notably, it achieves performance on par with the widely adopted LoRA method on several general benchmarks, highlighting the potential modern extension of Prefix-Tuning approaches. 0.671Our findings suggest that by overcoming its inherent limitations, Prefix-Tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation. |
2025-06-16 |
Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models
Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications.Recently, Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling.The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation.This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence.First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines.Second, comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters. 0.65Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported.Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading.We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity. |
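For readers unfamiliar with the sampler under discussion, the following is a minimal sketch of min-p truncation as it is commonly described: tokens whose probability falls below a fraction `p_base` of the top token's probability are dropped, and the remainder is renormalized before sampling. The function name `min_p_filter` and the default `p_base` value are illustrative assumptions, not taken from either paper's code.

```python
def min_p_filter(probs, p_base=0.1):
    """Return a renormalized distribution after min-p truncation.

    probs  : list of token probabilities that sum to 1
    p_base : fraction of the top probability used as the cutoff (assumed default)
    """
    # The cutoff scales with the model's confidence in its top token.
    threshold = p_base * max(probs)
    # Zero out every token below the cutoff; the top token always survives.
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


if __name__ == "__main__":
    print(min_p_filter([0.5, 0.3, 0.15, 0.05], p_base=0.2))
```

Because the cutoff depends on the top probability, the sampler has one tunable hyperparameter on top of temperature, which is relevant to the abstract's point about controlling for the number of hyperparameters when comparing samplers.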
2025-06-16 |
MARCO: Hardware-Aware Neural Architecture Search for Edge Devices with Multi-Agent Reinforcement Learning and Conformal Prediction Filtering
This paper introduces MARCO (Multi-Agent Reinforcement learning with Conformal Optimization), a novel hardware-aware framework for efficient neural architecture search (NAS) targeting resource-constrained edge devices.By significantly reducing search time and maintaining accuracy under strict hardware constraints, MARCO bridges the gap between automated DNN design and CAD for edge AI deployment.MARCO's core technical contribution lies in its unique combination of multi-agent reinforcement learning (MARL) with Conformal Prediction (CP) to accelerate the hardware/software co-design process for deploying deep neural networks.Unlike conventional once-for-all (OFA) supernet approaches that require extensive pretraining, MARCO decomposes the NAS task into a hardware configuration agent (HCA) and a Quantization Agent (QA).The HCA optimizes high-level design parameters, while the QA determines per-layer bit-widths under strict memory and latency budgets using a shared reward signal within a centralized-critic, decentralized-execution (CTDE) paradigm.A key innovation is the integration of a calibrated CP surrogate model that provides statistical guarantees (with a user-defined miscoverage rate) to prune unpromising candidate architectures before incurring the high costs of partial training or hardware simulation.This early filtering drastically reduces the search space while ensuring that high-quality designs are retained with a high probability.Extensive experiments on MNIST, CIFAR-10, and CIFAR-100 demonstrate that MARCO achieves a 3-4x reduction in total search time compared to an OFA baseline while maintaining near-baseline accuracy (within 0.3%). 0.607Furthermore, MARCO also reduces inference latency.Validation on a MAX78000 evaluation board confirms that simulator trends hold in practice, with simulator estimates deviating from measured values by less than 5%. |
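The conformal filtering step can be illustrated with standard split conformal prediction. This is a sketch under stated assumptions, not MARCO's implementation: the surrogate is any function that predicts accuracy, `residuals` are absolute errors of that surrogate on a held-out calibration set, and a candidate is pruned only when even its optimistic upper bound misses the accuracy target. All names here are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of calibration residuals for miscoverage rate alpha."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return float("inf")
    return sorted(residuals)[rank - 1]


def prune_candidates(candidates, predict, q, min_accuracy):
    """Keep candidates whose upper bound (prediction + q) can still reach the target."""
    return [c for c in candidates if predict(c) + q >= min_accuracy]


if __name__ == "__main__":
    residuals = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
    q = conformal_quantile(residuals, alpha=0.25)
    predicted = {"x": 0.90, "y": 0.80, "z": 0.95}  # hypothetical surrogate outputs
    print(prune_candidates(predicted, predicted.get, q, min_accuracy=0.92))
```

Under the usual exchangeability assumption, the true accuracy exceeds `prediction + q` with probability at most `alpha`, which is what lets the filter discard candidates before any partial training or hardware simulation.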
2025-06-12 |
Evaluating Large Language Models on Non-Code Software Engineering Tasks
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation; however, their effectiveness on non-code Software Engineering (SE) tasks remains underexplored.We present the first comprehensive benchmark, which we name `Software Engineering Language Understanding' (SELU), for evaluating LLMs on 17 non-code tasks, spanning from identifying whether a requirement is functional or non-functional to estimating the effort and complexity of backlog items.SELU covers classification, regression, Named Entity Recognition (NER), and Masked Language Modeling (MLM) targets, with data drawn from diverse sources such as code repositories, issue tracking systems, and developer forums.We fine-tune 22 open-source LLMs, prompt two proprietary alternatives, and train two baselines.Performance is measured using metrics such as F1-macro, SMAPE, F1-micro, and accuracy, and compared via the Bayesian signed-rank test. 0.719Our results show that moderate-scale decoder-only models consistently form a top-tier, exhibiting high mean performance and low across-task variance, while domain adaptation via code-focused pre-training might yield only modest improvements.These insights guide model selection for non-code SE workflows and highlight directions for expanding SELU to generative and design-oriented scenarios. |
2025-06-12 |
A voice for minorities: diversity in approval-based committee elections under incomplete or inaccurate information
We study diversity in approval-based committee elections with incomplete or inaccurate information. As standard in the literature on approval-based multi-winner voting, we define diversity according to the maximum coverage problem, which is known to be NP-complete, with a best attainable polynomial time approximation ratio of $1-1/e$. In the incomplete information model, voters can vote on only a small portion of the candidates. We suggest a greedy algorithm and a local search algorithm that query voters and use the query responses to approximate the total population's opinion. For both algorithms, we prove an upper bound on the number of queries required to get a close to $(1-1/e)$-approximate solution with high probability. We also provide a lower bound for the query complexity of non-adaptive algorithms, which cannot adapt their querying strategy to readily obtained information. In the inaccurate information setting, voters' responses are corrupted with a probability $p\in(0,\frac{1}{2})$. We provide both an upper and a lower bound for the number of queries required to attain a $(1-1/e)$-approximate solution with high probability. Finally, using real data from Polis, we see that our algorithms perform remarkably better than the theoretical results suggest, both with incomplete and inaccurate information. 0.615 |
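As background for the $(1-1/e)$ ratio, here is a minimal sketch of the classic greedy algorithm for maximum coverage applied to approval ballots. It assumes full, accurate information (every voter's approvals are known), so it does not reproduce the query-based variants studied in the paper; the function name and data layout are illustrative.

```python
def greedy_committee(approvals, k):
    """Pick k candidates greedily to maximize the number of covered voters.

    approvals : dict mapping candidate -> set of voters who approve that candidate
    k         : committee size
    Returns (committee, covered_voters).
    """
    remaining = dict(approvals)
    covered = set()
    committee = []
    for _ in range(min(k, len(remaining))):
        # Choose the candidate approved by the most not-yet-covered voters.
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        committee.append(best)
        covered |= remaining.pop(best)
    return committee, covered


if __name__ == "__main__":
    ballots = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}
    print(greedy_committee(ballots, k=2))
```

Ties are broken by dictionary insertion order here, which is an arbitrary choice made only to keep the sketch deterministic.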
2025-06-12 |
Precise Zero-Shot Pointwise Ranking with LLMs through Post-Aggregated Global Context Information
Recent advancements have successfully harnessed the power of Large Language Models (LLMs) for zero-shot document ranking, exploring a variety of prompting strategies.Comparative approaches like pairwise and listwise achieve high effectiveness but are computationally intensive and thus less practical for larger-scale applications. 0.681Scoring-based pointwise approaches exhibit superior efficiency by independently and simultaneously generating the relevance scores for each candidate document. 0.624However, this independence ignores critical comparative insights between documents, resulting in inconsistent scoring and suboptimal performance.In this paper, we aim to improve the effectiveness of pointwise methods while preserving their efficiency through two key innovations: (1) We propose a novel Global-Consistent Comparative Pointwise Ranking (GCCP) strategy that incorporates global reference comparisons between each candidate and an anchor document to generate contrastive relevance scores. 0.641We strategically design the anchor document as a query-focused summary of pseudo-relevant candidates, which serves as an effective reference point by capturing the global context for document comparison.(2) These contrastive relevance scores can be efficiently Post-Aggregated with existing pointwise methods, seamlessly integrating essential Global Context information in a training-free manner (PAGC).Extensive experiments on the TREC DL and BEIR benchmark demonstrate that our approach significantly outperforms previous pointwise methods while maintaining comparable efficiency. 0.688Our method also achieves competitive performance against comparative methods that require substantially more computational resources. 0.714More analyses further validate the efficacy of our anchor construction strategy. |
2025-06-12 |
Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment
Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients.However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses.To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment.First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses.Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details.To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt.Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications. 0.604 |
2025-06-12 |
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%. 0.607 |
2025-06-12 |
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. 0.64 In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. Our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory. |
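The exit-code-based grading and fail2pass check described above can be sketched as follows. This is an illustration of the idea only, assuming the test suite is invoked as a single command whose exit code is 0 on success; the helper names are hypothetical and are not taken from the SWE-Factory code base.

```python
import subprocess
import sys


def run_tests(command):
    """Run a test command and return its exit code (0 is treated as a pass)."""
    return subprocess.run(command, capture_output=True).returncode


def is_resolved(exit_code):
    """Grade a run from the exit code alone, with no log parsing."""
    return exit_code == 0


def is_fail2pass(exit_before_patch, exit_after_patch):
    """A task instance is valid if tests fail before the fix and pass after it."""
    return not is_resolved(exit_before_patch) and is_resolved(exit_after_patch)


if __name__ == "__main__":
    # Stand-ins for "tests before the patch" and "tests after the patch".
    before = run_tests([sys.executable, "-c", "raise SystemExit(1)"])
    after = run_tests([sys.executable, "-c", "raise SystemExit(0)"])
    print(is_fail2pass(before, after))
```

Relying on the exit code keeps the grader independent of each test framework's log format, which is the motivation the abstract gives for avoiding custom parsers.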
2025-06-12 |
ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems
There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models.Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs.In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods.Given a candidate solution $\hat{x}$ produced by an algorithm of the user's choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS.We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling.Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. 0.694We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold.To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS. |
2025-06-12 |
Rethinking Losses for Diffusion Bridge Samplers
Diffusion bridges are a promising class of deep-learning methods for sampling from unnormalized distributions.Recent works show that the Log Variance (LV) loss consistently outperforms the reverse Kullback-Leibler (rKL) loss when using the reparametrization trick to compute rKL-gradients.While the on-policy LV loss yields identical gradients to the rKL loss when combined with the log-derivative trick for diffusion samplers with non-learnable forward processes, this equivalence does not hold for diffusion bridges or when diffusion coefficients are learned.Based on this insight we argue that for diffusion bridges the LV loss does not represent an optimization objective that can be motivated like the rKL loss via the data processing inequality.Our analysis shows that employing the rKL loss with the log-derivative trick (rKL-LD) does not only avoid these conceptual problems but also consistently outperforms the LV loss.Experimental results with different types of diffusion bridges on challenging benchmarks show that samplers trained with the rKL-LD loss achieve better performance. 0.625From a practical perspective we find that rKL-LD requires significantly less hyperparameter optimization and yields more stable training behavior. |
LLMs |
|
2025-06-18 |
Context-Informed Grounding Supervision
Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. 0.701In such cases, we expect the model to generate responses by grounding its response in the provided external context.However, prior work has shown that simply appending context at inference time does not ensure grounded generation.To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context.Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models.In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques.In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. 0.646This improved grounding comes without degradation in general downstream performance.Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context. |
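The loss masking described for CINGS can be sketched in a framework-free way: the context tokens are part of the input sequence but contribute nothing to the loss, which is averaged over response tokens only. The function below is a hypothetical illustration that takes per-token log-probabilities as plain numbers; it is not the authors' training code.

```python
def masked_nll(token_log_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_log_probs : log-probability the model assigned to each target token
    is_response     : parallel list of booleans, False for context tokens
    """
    if len(token_log_probs) != len(is_response):
        raise ValueError("inputs must have the same length")
    # Context positions are masked out; only response positions are scored.
    scored = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not scored:
        raise ValueError("no response tokens to compute a loss over")
    return sum(scored) / len(scored)


if __name__ == "__main__":
    log_probs = [-1.0, -2.0, -3.0, -0.5]  # two context tokens, then two response tokens
    mask = [False, False, True, True]
    print(masked_nll(log_probs, mask))
```

In common deep-learning frameworks the same effect is usually obtained by assigning an ignore label to the context positions before computing cross-entropy.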
2025-06-18 |
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). 0.66However, efficient, high-quality automated process annotation remains a significant challenge.To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation.We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning.We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs.Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation.The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility. |
2025-06-18 |
Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain: A CoT-Enhanced Prompt Engineering Approach
Large Language Models have radically changed how students learn remotely, among other aspects of educational activity. Current retrieval of remote learning resources lacks the contextual depth needed to provide comprehensive information for complex student queries. This work proposes a novel approach to enhancing remote learning retrieval by integrating GPT-based models within the LangChain framework. We build this system in a more intuitive and productive manner using chain-of-thought (CoT) reasoning and prompt engineering. The proposed framework emphasizes increasing the precision and relevance of the retrieval results, returning comprehensive and contextually enriched explanations and resources that best suit each student's needs. We also assess the effectiveness of our approach against paradigmatic LLMs and report improvements in user satisfaction and learning outcomes. 0.702 |
2025-06-18 |
RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information. 0.647However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs.We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining.RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions.A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization.This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass.We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates.On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918.This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU.RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications. |
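The comparison at the core of RePCS can be sketched as a KL divergence between the two paths' output distributions, followed by a threshold test. This is a minimal illustration under assumptions: the distributions are given as plain probability lists over a shared vocabulary, the direction of the divergence (retrieval-augmented relative to parametric) is chosen arbitrarily, and the threshold value is a placeholder rather than one derived from the paper's PAC-style bounds.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def flags_memorization(parametric, augmented, threshold=0.05):
    """Flag a query when retrieval barely changes the output distribution."""
    return kl_divergence(augmented, parametric) < threshold


if __name__ == "__main__":
    query_only = [0.7, 0.2, 0.1]       # hypothetical parametric-path distribution
    with_context = [0.68, 0.21, 0.11]  # hypothetical retrieval-augmented distribution
    print(flags_memorization(query_only, with_context))
```

A low divergence means the retrieved context had little effect on the prediction, which the abstract interprets as a sign of potential memorization.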
2025-06-18 |
Lessons from Training Grounded LLMs with Verifiable Rewards
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). 0.652While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available.In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. 0.672We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations.Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses.A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal.Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks.Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs. 0.667 |
2025-06-18 |
PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. 0.691However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences.This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. 0.667We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. 0.617To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates-or even eliminates-this delay through speculative decoding at input time.PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay.Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time-computation that would otherwise go unused. |
2025-06-18 |
Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents
Failure Analysis (FA) is a highly intricate and knowledge-intensive process.The integration of AI components within the computational infrastructure of FA labs has the potential to automate a variety of tasks, including the detection of non-conformities in images, the retrieval of analogous cases from diverse data sources, and the generation of reports from annotated images.However, as the number of deployed AI models increases, the challenge lies in orchestrating these components into cohesive and efficient workflows that seamlessly integrate with the FA process. This paper investigates the design and implementation of a Large Language Model (LLM)-based Planning Agent (LPA) to assist FA engineers in solving their analysis cases.The LPA integrates LLMs with advanced planning capabilities and external tool utilization, enabling autonomous processing of complex queries, retrieval of relevant data from external systems, and generation of human-readable responses. 0.661Evaluation results demonstrate the agent's operational effectiveness and reliability in supporting FA tasks. |
2025-06-18 |
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. 0.653 Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models. 0.683 |
2025-06-18 |
A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds
Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. 0.608 To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research. |
2025-06-18 |
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). 0.602 To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. However, this long-standing assumption has notable limitations, particularly in large-scale, heterogeneous GPU clusters where bandwidth distribution among GPUs is irregular. In this paper, we introduce LiteGD, a lightweight and dynamic GPU dispatching system based on global perspectives. To tackle the difficulty of storing massive GPU topology information, LiteGD adopts a computation-aware design that leverages a lightweight Transformer network trained on sampled data. Our customized design for network structure ensures both transferability and scalability. LiteGD also employs a bidirectional tree search approach to find the optimal GPU dispatching in the data generated in the previous step, which can identify near-optimal solutions while reducing search overhead. We implement and evaluate LiteGD in both real and simulated GPU clusters with homogeneous and heterogeneous interconnects, respectively. Experimental results demonstrate that LiteGD consistently achieves high GPU bandwidth efficacy (approximately 90%) across various cluster configurations and 80% in a real-world H100 cluster, significantly outperforming conventional default and interconnect topology-aware dispatching methods, particularly in large-scale heterogeneous environments. |
2025-06-18 |
From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns
While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention, particularly regarding cases where generation fails. 0.616 This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. 0.629 This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design. |
2025-06-18 |
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Large Language Models (LLMs) have become indispensable in real-world applications. 0.677 However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. 0.608 Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to the extrapolation moving LLM parameters to a flatter zone, which makes them less sensitive to perturbations. 0.603 The code is available at github.com/VITA-Group/LoX. |
2025-06-18 |
From Block to Byte: Transforming PCIe SSDs with CXL Memory Protocol and Instruction Annotation
This paper explores how Compute Express Link (CXL) can transform PCIe-based block storage into a scalable, byte-addressable working memory. We address the challenges of adapting block storage to CXL's memory-centric model by emphasizing cacheability as a key enabler and advocating for Type 3 endpoint devices, referred to as CXL-SSDs. 0.605 To validate our approach, we prototype a CXL-SSD on a custom FPGA platform and propose annotation mechanisms, Determinism and Bufferability, to enhance performance while preserving data persistency. Our simulation-based evaluation demonstrates that CXL-SSD achieves 10.9x better performance than PCIe-based memory expanders and further reduces latency by 5.4x with annotation enhancements. In workloads with high locality, CXL-SSD approaches DRAM-like performance due to efficient on-chip caching. 0.622 This work highlights the feasibility of integrating block storage into CXL's ecosystem and provides a foundation for future memory-storage convergence. |
2025-06-18 |
The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games
Large Language Models (LLMs) have shown promise as decision-makers in dynamic settings, but their stateless nature necessitates creating a natural language representation of history. 0.678 We present a unifying framework for systematically constructing natural language "state" representations for prompting LLM agents in repeated multi-agent games. 0.655 Previous work on games with LLM agents has taken an ad hoc approach to encoding game history, which not only obscures the impact of state representation on agents' behavior, but also limits comparability between studies. 0.663 Our framework addresses these gaps by characterizing methods of state representation along three axes: action informativeness (i.e., the extent to which the state representation captures actions played); reward informativeness (i.e., the extent to which the state representation describes rewards obtained); and prompting style (or natural language compression, i.e., the extent to which the full text history is summarized). We apply this framework to a dynamic selfish routing game, chosen because it admits a simple equilibrium both in theory and in human subject experiments \cite{rapoport_choice_2009}. Despite the game's relative simplicity, we find that there are key dependencies of LLM agent behavior on the natural language state representation. 0.673 In particular, we observe that representations which provide agents with (1) summarized, rather than complete, natural language representations of past history; (2) information about regrets, rather than raw payoffs; and (3) limited information about others' actions lead to behavior that more closely matches game theoretic equilibrium predictions, and with more stable game play by the agents. By contrast, other representations can exhibit either large deviations from equilibrium, higher variation in dynamic game play over time, or both. |
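To make the "regrets rather than raw payoffs" distinction concrete, here is a minimal sketch of how a per-agent regret could be computed for one round of a routing game. Everything specific in it is an assumption made for illustration and is not taken from the paper: the linear cost model (a route costs as many units as it has users), the function names, and the wording of the summary string.

```python
from typing import List


def regrets(choices: List[int], n_routes: int) -> List[float]:
    """Per-agent regret for one round of a toy routing game.

    Assumed cost model (hypothetical): the cost of a route equals the
    number of agents that picked it. An agent's regret is its realized
    cost minus the best cost it could have obtained by unilaterally
    switching routes while everyone else stayed put.
    """
    counts = [choices.count(r) for r in range(n_routes)]
    result = []
    for choice in choices:
        own_cost = float(counts[choice])
        # Switching to another route adds this agent to that route's load.
        best_cost = min(
            float(counts[r]) if r == choice else float(counts[r] + 1)
            for r in range(n_routes)
        )
        result.append(own_cost - best_cost)
    return result


def summarize_for_prompt(agent: int, choices: List[int], n_routes: int) -> str:
    """Build a short, summarized state string for one agent (illustrative wording)."""
    regret = regrets(choices, n_routes)[agent]
    return f"Last round you chose route {choices[agent]}; your regret was {regret:.1f}."


# Three agents crowd route 0, one takes route 1.
print(regrets([0, 0, 0, 1], n_routes=2))
print(summarize_for_prompt(0, [0, 0, 0, 1], n_routes=2))
```

In this toy model a balanced split such as `[0, 0, 1, 1]` gives every agent zero regret, which is the equilibrium condition: no one can lower their cost by switching alone.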
2025-06-18 |
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. 0.654 However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. 0.758 To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. 0.68 This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. 0.663 Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities. 0.723 |
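One plausible reading of "ordered coverage" can be sketched as follows: a generated sentence counts as a hit only if every concept appears and the concepts occur in the requested order. The abstract does not give the benchmark's matching rules, so the crude prefix match used here in place of lemmatization, and the function names, are assumptions for illustration only.

```python
import re
from typing import List, Sequence


def follows_order(sentence: str, concepts: Sequence[str]) -> bool:
    """Return True if all concepts occur in the sentence in the given order.

    Matching is a rough stand-in for lemmatization (an assumption): a token
    matches a concept when it starts with the lowercase concept string.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    position = 0
    for concept in concepts:
        target = concept.lower()
        for index in range(position, len(tokens)):
            if tokens[index].startswith(target):
                position = index + 1  # later concepts must come after this one
                break
        else:
            return False  # concept missing, or only present out of order
    return True


def ordered_coverage(sentences: List[str], concept_lists: List[Sequence[str]]) -> float:
    """Fraction of sentences that cover their concepts in the requested order."""
    if not sentences:
        return 0.0
    hits = sum(follows_order(s, c) for s, c in zip(sentences, concept_lists))
    return hits / len(sentences)


sentence = "The dog catches a frisbee in the park."
print(follows_order(sentence, ["dog", "catch", "frisbee"]))
print(follows_order(sentence, ["frisbee", "dog", "catch"]))
```

The second call uses the same concepts in a different order, so the same sentence no longer counts, which is the property that separates this measure from plain concept coverage.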
2025-06-18 |
SR-NCL: an Area-/Energy-Efficient Resilient NCL Architecture Based on Selective Redundancy
Duplication-based redundancy schemes have proven to be effective in designing fully-resilient Quasi-delay Insensitive (QDI) asynchronous circuits. The complete resiliency, however, is accompanied by significant energy, latency, and area overhead. This paper presents a novel error-tolerant Null Convention Logic (NCL) architecture based on selective redundancy. 0.613 Results demonstrate the efficacy of the proposed method in terms of area and energy utilization as compared to existing duplication-based NCL designs, targeting an image processing application. |
2025-06-18 |
deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses
Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused. 0.627 Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. 0.615 To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. 0.618 deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library. Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. 0.697 We evaluated deepSURF on 27 real-world Rust crates, successfully rediscovering 20 known memory safety bugs and uncovering 6 previously unknown vulnerabilities, demonstrating clear improvements over state-of-the-art tools. |
2025-06-18 |
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Phishing websites continue to pose a significant cybersecurity threat, often leveraging deceptive structures, brand impersonation, and social engineering tactics to evade detection. While recent advances in large language models (LLMs) have enabled improved phishing detection through contextual understanding, most existing approaches rely on single-agent classification, which faces the risk of hallucination and lacks interpretability or robustness. 0.625 To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection. 0.701 PhishDebate employs four specialized agents to independently analyze different textual aspects of a webpage (URL structure, HTML composition, semantic content, and brand impersonation) under the coordination of a Moderator and a final Judge. Through structured debate and divergent thinking, the framework delivers more accurate and interpretable decisions. Extensive evaluations on commercial LLMs demonstrate that PhishDebate achieves 98.2% recall and 98.2% True Positive Rate (TPR) on a real-world phishing dataset, and outperforms single-agent and Chain of Thought (CoT) baselines. 0.683 Additionally, its modular design allows agent-level configurability, enabling adaptation to varying resource and application requirements. |
2025-06-18 |
CC-LEARN: Cohort-based Consistency Learning
Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. 0.605 To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs. 0.669 |
2025-06-18 |
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types that differ in vocabulary size, token splits, and token index ordering. 0.64 To address this limitation of distillation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs. |
2025-06-18 |
PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have become more severe, making LLM-generated text detection now of unprecedented importance. 0.709 Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. 0.705 Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. 0.74 To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. 0.709 Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%. |
2025-06-17 |
AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
While knowledge distillation has become a mature field for compressing large language models (LLMs) into smaller ones by aligning their outputs or internal representations, the distillation of LLM-based agents, which involve planning, memory, and tool use, remains relatively underexplored. 0.667 Existing agent distillation methods typically replay full teacher trajectories or imitate step-by-step teacher tool usage, but they often struggle to train student agents to dynamically plan and act in novel environments. We propose AgentDistill, a novel, training-free agent distillation framework that enables efficient and scalable knowledge transfer via direct reuse of Model-Context-Protocols (MCPs), which are structured and reusable task-solving modules autonomously generated by teacher agents. The reuse of these distilled MCPs enables student agents to generalize their capabilities across domains and solve new problems with minimal supervision or human intervention. Experiments on biomedical and mathematical benchmarks demonstrate that our distilled student agents, built on small language models, can achieve performance comparable to advanced systems using large LLMs such as OctoTools (GPT-4o), highlighting the effectiveness of our framework in building scalable and cost-efficient intelligent agents. |
2025-06-17 |
Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters
Dynamic resource management is essential for optimizing computational efficiency in modern high-performance computing (HPC) environments, particularly as systems scale. While research has demonstrated the benefits of malleability in resource management systems (RMS), the adoption of such techniques in production environments remains limited due to challenges in standardization, interoperability, and usability. Addressing these gaps, this paper extends our prior work on the Dynamic Management of Resources (DMR) framework, which provides a modular and user-friendly approach to dynamic resource allocation. 0.613 Building upon the original DMRlib reconfiguration runtime, this work integrates new methodology from the Malleability Module (MaM) of the Proteo framework, further enhancing reconfiguration capabilities with new spawning strategies and data redistribution methods. In this paper, we explore new malleability strategies in HPC dynamic workloads, such as merging MPI communicators and asynchronous reconfigurations, which offer new opportunities for dramatically reducing memory overhead. The proposed enhancements are rigorously evaluated on a world-class supercomputer, demonstrating improved resource utilization and workload efficiency. Results show that dynamic resource management can reduce the workload completion time by 40% and increase the resource utilization by over 20%, compared to static resource allocation. |
2025-06-17 |
Joint Error Correction and Fading Channel Estimation Enhancement Leveraging GRAND
We present a novel method for error correction in the presence of fading channel estimation errors (CEE). When such errors are significant, considerable performance losses can be observed if the wireless transceiver is not adapted. Instead of refining the estimate by increasing the pilot sequence length or improving the estimation algorithm, we propose two new approaches based on Guessing Random Additive Noise Decoding (GRAND) decoders. The first method involves testing multiple candidates for the channel estimate located in the complex neighborhood around the original pilot-based estimate. All these candidates are employed in parallel to compute log-likelihood ratios (LLR). These LLRs are used as soft input to Ordered Reliability Bits GRAND (ORBGRAND). 0.682 Posterior likelihood formulas associated with ORBGRAND are then computed to determine which channel candidate leads to the most probable codeword. The second method is a refined version of the first approach accounting for the presence of residual CEE in the LLR computation. The performance of these two techniques is evaluated for [128,112] 5G NR CA-Polar and CRC codes. For the considered settings, block error rate (BLER) gains of several dB are observed compared to cases where CEE is ignored. |
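For readers unfamiliar with GRAND, the core idea is to guess the noise rather than the codeword: candidate error patterns are tried from most to least likely, and the first one that turns the received word into a valid codeword is accepted. The sketch below is the basic hard-decision variant, ordered by Hamming weight, on a (7,4) Hamming code. It is not the paper's method: it does not use LLRs, the reliability ordering of ORBGRAND, or any channel-estimate candidates, and the code choice and function names are assumptions for illustration.

```python
from itertools import combinations
from typing import List, Optional, Sequence

# Parity-check matrix of the (7,4) Hamming code; column i is the binary
# representation of i + 1, so every single-bit error has a distinct syndrome.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def is_codeword(word: Sequence[int]) -> bool:
    """A word is a codeword when every parity check sums to zero modulo 2."""
    return all(sum(h * w for h, w in zip(row, word)) % 2 == 0 for row in H)


def grand_decode(received: Sequence[int], max_weight: int = 3) -> Optional[List[int]]:
    """Hard-decision GRAND: guess error patterns by increasing Hamming weight.

    On a binary symmetric channel with crossover probability below 0.5,
    lighter patterns are more likely, so this ordering tests the most
    probable noise first. Returns None if no guess up to max_weight works.
    """
    n = len(received)
    for weight in range(max_weight + 1):
        for flips in combinations(range(n), weight):
            candidate = list(received)
            for index in flips:
                candidate[index] ^= 1  # remove the guessed noise bit
            if is_codeword(candidate):
                return candidate
    return None


sent = [1, 1, 1, 0, 0, 0, 0]      # a valid codeword of the code above
received = [1, 1, 1, 0, 1, 0, 0]  # the same word with bit 4 flipped
print(grand_decode(received))
```

The decoder never uses the code's structure beyond the membership test, which is why GRAND-style decoders can be applied to different codes, such as the CA-Polar and CRC codes mentioned in the abstract, by swapping the codeword check.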
Developer Research |
|
2025-06-17 |
Issue Retrieval and Verification Enhanced Supplementary Code Comment Generation
Issue reports have been recognized to contain rich information for retrieval-augmented code comment generation. However, how to minimize hallucinations in the generated comments remains a significant challenge. In this paper, we propose IsComment, an issue-based LLM retrieval and verification approach for generating a method's design rationale, usage directives, and so on as supplementary code comments. We first identify five main types of code supplementary information that issue reports can provide through code-comment-issue analysis. 0.606 Next, we retrieve issue sentences containing these types of supplementary information and generate candidate code comments. To reduce hallucinations, we filter out those candidate comments that are irrelevant to the code or unverifiable by the issue report, making the code comment generation results more reliable. Our experiments indicate that compared with LLMs, IsComment increases the coverage of manual supplementary comments from 33.6% to 72.2% for ChatGPT, from 35.8% to 88.4% for GPT-4o, and from 35.0% to 86.2% for DeepSeek-V3. Compared with existing work, IsComment can generate richer and more useful supplementary code comments for program understanding, which is quantitatively evaluated through the MESIA metric on both methods with and without manual code comments. |
2025-06-17 |
Unified Software Engineering agent as AI Software Engineer
The growth of Large Language Model (LLM) technology has raised expectations for automated coding. However, software engineering is more than coding and is concerned with activities including maintenance and evolution of a project. 0.638 In this context, the concept of LLM agents, which utilize LLMs as reasoning engines to invoke external tools autonomously, has gained traction. But is an LLM agent the same as an AI software engineer? In this paper, we seek to understand this question by developing a Unified Software Engineering agent or USEagent. Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities. This gives the agent the promise of handling complex scenarios in software development such as fixing an incomplete patch, adding new features, or taking over code written by others. We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans. To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising myriad tasks such as coding, testing, and patching. USEbench is a judicious mixture of tasks from existing benchmarks such as SWE-bench, SWT-bench, and REPOCOD. In an evaluation on USEbench consisting of 1,271 repository-level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent. There exist gaps in the capabilities of USEagent for certain coding tasks, which provide hints on further developing the AI Software Engineer of the future. |
2025-06-16 |
DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models
Multimodal large language models (MLLMs) have streamlined front-end interface development by automating code generation. However, these models also introduce challenges in ensuring code quality. Existing approaches struggle to maintain both visual consistency and functional completeness in the generated components. Moreover, they lack mechanisms to assess the fidelity and correctness of the rendered pages. To address these issues, we propose DesignCoder, a novel hierarchy-aware and self-correcting automated code generation framework. Specifically, we introduce UI Grouping Chains, which enhance MLLMs' capability to understand and predict complex nested UI hierarchies. Subsequently, DesignCoder employs a hierarchical divide-and-conquer approach to generate front-end code. Finally, we incorporate a self-correction mechanism to improve the model's ability to identify and rectify errors in the generated code. Extensive evaluations on a dataset of UI mockups collected from both open-source communities and industry projects demonstrate that DesignCoder outperforms state-of-the-art baselines in React Native, a widely adopted UI framework. Our method achieves performance increases of 37.63%, 9.52%, and 12.82% in visual similarity metrics (MSE, CLIP, SSIM, respectively) and significantly improves code structure similarity in terms of TreeBLEU, Container Match, and Tree Edit Distance by 30.19%, 29.31%, and 24.67%, respectively. Furthermore, we conducted a user study with professional developers to assess the quality and practicality of the generated code. 0.677 Results indicate that DesignCoder aligns with industry best practices, demonstrating high usability, readability, and maintainability. Our approach provides an efficient and practical solution for agile front-end development, enabling development teams to focus more on core functionality and product innovation. |
2025-06-10 |
On The Impact of Merge Request Deviations on Code Review Practices
Code review is a key practice in software engineering, ensuring quality and collaboration. 0.708 However, industrial Merge Request (MR) workflows often deviate from standardized review processes, with many MRs serving non-review purposes (e.g., drafts, rebases, or dependency updates). We term these cases deviations and hypothesize that ignoring them biases analytics and undermines ML models for review analysis. We identify seven deviation categories, occurring in 37.02% of MRs, and propose a few-shot learning detection method (91% accuracy). By excluding deviations, ML models predicting review completion time improve performance in 53.33% of cases (up to 2.25x) and exhibit significant shifts in feature importance (47% overall, 60% top-*k*). Our contributions include: (1) a taxonomy of MR deviations, (2) an AI-driven detection approach, and (3) empirical evidence of their impact on ML-based review analytics. This work aids practitioners in optimizing review efforts and ensuring reliable insights. |
2025-06-09 |
Execution-Aware Program Reduction for WebAssembly via Record and Replay
WebAssembly (Wasm) programs may trigger bugs in their engine implementations. To aid debugging, program reduction techniques try to produce a smaller variant of the input program that still triggers the bug. 0.605 However, existing execution-unaware program reduction techniques struggle with large and complex Wasm programs, because they rely on static information and apply syntactic transformations, while ignoring the valuable information offered by the input program's execution behavior. We present RR-Reduce and Hybrid-Reduce, novel execution-aware program reduction techniques that leverage execution behaviors via record and replay. RR-Reduce identifies a bug-triggering function as the target function, isolates that function from the rest of the program, and generates a reduced program that replays only the interactions between the target function and the rest of the program. Hybrid-Reduce combines a complementary execution-unaware reduction technique with RR-Reduce to further reduce program size. We evaluate RR-Reduce and Hybrid-Reduce on 28 Wasm programs that trigger a diverse set of bugs in three engines. On average, RR-Reduce reduces the programs to 1.20 percent of their original size in 14.5 minutes, which outperforms the state of the art by 33.15 times in terms of reduction time. Hybrid-Reduce reduces the programs to 0.13 percent of their original size in 3.5 hours, which outperforms the state of the art by 3.42 times in terms of reduced program size and 2.26 times in terms of reduction time. We envision RR-Reduce as the go-to tool for rapid, on-demand debugging in minutes, and Hybrid-Reduce for scenarios where developers require the smallest possible programs. |
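As background for the contrast drawn in this abstract, the sketch below shows the general shape of an execution-unaware reducer: it repeatedly deletes one piece of the program and keeps the deletion whenever the bug still reproduces. This is a generic greedy loop, not RR-Reduce, Hybrid-Reduce, or any specific tool; treating a program as a list of named functions and the `triggers_bug` oracle are assumptions for illustration.

```python
from typing import Callable, List


def reduce_program(
    items: List[str],
    triggers_bug: Callable[[List[str]], bool],
) -> List[str]:
    """Greedy one-at-a-time reduction that knows nothing about execution.

    `items` stands in for program parts (for example, function names) and
    `triggers_bug` is a hypothetical oracle that would rebuild and rerun
    the candidate program on the engine under test.
    """
    if not triggers_bug(items):
        raise ValueError("the input program must trigger the bug")
    current = list(items)
    changed = True
    while changed:
        changed = False
        for index in range(len(current)):
            candidate = current[:index] + current[index + 1:]
            if triggers_bug(candidate):
                current = candidate  # keep the smaller program and rescan
                changed = True
                break
    return current


# Toy oracle: the bug reproduces whenever the function "bad" is present.
program = ["init", "helper", "bad", "cleanup"]
print(reduce_program(program, lambda parts: "bad" in parts))
```

Each candidate costs one oracle call, and with a real engine each call is a full run, which is the cost that grows with program size. The toy oracle here also ignores dependencies between functions, which a real reducer has to respect.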
2025-06-09 |
ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols
Recent advances in Large Language Models (LLMs) have shown promising capabilities in generating code for general-purpose programming languages. In contrast, their applicability for hardware description languages, particularly for generating synthesizable and functionally correct designs, remains significantly underexplored. HDLs such as SystemVerilog are logic-oriented and demand strict adherence to timing semantics, concurrency, and synthesizability constraints. Moreover, HDL-based design flows encompass a broad set of tasks beyond structural code generation, including testbench development, assertion-based verification, timing closure, and protocol-level integration for on-chip communication. The objective of our paper is to analyze the capabilities of state-of-the-art LLMs in generating SystemVerilog implementations of standard communication protocols, a core component of embedded and System-on-Chip (SoC) architectures. This paper introduces the first benchmark suite targeting four widely used protocols: SPI, I2C, UART, and AXI. We define code generation tasks that capture varying levels of design abstraction and prompt specificity. 0.614 The generated designs are assessed for syntactic correctness, synthesizability, and functional fidelity via waveform simulation and test benches. |
Data Annotation Techniques |