Vincent's Arxiv FrontPageGenerated on 2025-04-07. This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions. |
|
New Datasets |
|
2025-04-03 |
Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole
We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences to English and Portuguese. 0.779This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah's Witnesses), but also a small amount of general domain data (from a dictionary). 0.911This mirrors the typical resource availability of many low resource languages.We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain.We find that adding even 300 sentences from the target domain when training substantially improves the translation performance, highlighting the importance and need for data collection for low-resource languages, even on a small-scale.We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers.Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general. |
2025-04-03 |
EvoChain: A Framework for Tracking and Visualizing Smart Contract Evolution
Tracking the evolution of smart contracts is challenging due to their immutable nature and complex upgrade mechanisms.We introduce EvoChain, a comprehensive framework and dataset designed to track and visualize smart contract evolution.Building upon data from our previous empirical study, EvoChain models contract relationships using a Neo4j graph database and provides an interactive web interface for exploration.The framework consists of a data layer, an API layer, and a user interface layer.EvoChain allows stakeholders to analyze contract histories, upgrade paths, and associated vulnerabilities by leveraging these components.Our dataset encompasses approximately 1.3 million upgradeable proxies and nearly 15,000 historical versions, enhancing transparency and trust in blockchain ecosystems by providing an accessible platform for understanding smart contract evolution. 0.84 |
2025-04-03 |
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Imitation learning has emerged as a promising approach towards building generalist robots.However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations.Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available.This data provides a rich source of information about real-world dynamics and agent-environment interactions. 0.789Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation required for most contemporary methods.In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning.Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality.We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator.Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies.Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling.Videos and code are available at https://weirdlabuw.github.io/uwm/. |
2025-04-03 |
MegaMath: Pushing the Limits of Open Math Corpora
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs).However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training.We present MegaMath, an open dataset curated from diverse, math-focused sources through following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet. 0.767(2) Recalling Math-related code data: We identified high quality math-related code from large code training corpus, Stack-V2, further enhancing data diversity.(3) Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data.By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets. |
2025-04-03 |
Concept Lancet: Image Editing with Compositional Representation Transplant
Diffusion models are widely used for image editing tasks.Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space.However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task.Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error.To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing.At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts.This allows us to accurately estimate the presence of concepts in each image, which informs the edit.Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction.To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. 0.742Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation. |
2025-04-02 |
SOLAQUA: SINTEF Ocean Large Aquaculture Robotics Dataset
This paper presents a dataset gathered with an underwater robot in a sea-based aquaculture setting. 0.71Data was gathered from an operational fish farm and includes data from sensors such as the Waterlinked A50 DVL, the Nortek Nucleus 1000 DVL, Sonardyne Micro Ranger 2 USBL, Sonoptix Mulitbeam Sonar, mono and stereo cameras, and vehicle sensor data such as power usage, IMU, pressure, temperature, and more. 0.864Data acquisition is performed during both manual and autonomous traversal of the net pen structure.The collected vision data is of undamaged nets with some fish and marine growth presence, and it is expected that both the research community and the aquaculture industry will benefit greatly from the utilization of the proposed SOLAQUA dataset. |
2025-04-02 |
DISINFOX: an open-source threat exchange platform serving intelligence on disinformation and influence operations
This paper introduces DISINFOX, an open-source threat intelligence exchange platform for the structured collection, management, and dissemination of disinformation incidents and influence operations.Analysts can upload and correlate information manipulation and interference incidents, while clients can access and analyze the data through an interactive web interface or programmatically via a public API.This facilitates integration with other vendors, providing a unified view of cybersecurity and disinformation events. The solution is fully containerized using Docker, comprising a web-based frontend for user interaction, a backend REST API for managing core functionalities, and a public API for structured data retrieval, enabling seamless integration with existing Cyber Threat Intelligence (CTI) workflows.In particular, DISINFOX models the incidents through DISARM Tactics, Techniques, and Procedures (TTPs), a MITRE ATT&CK-like framework for disinformation, with a custom data model based on the Structured Threat Information eXpression (STIX2) standard. As an open-source solution, DISINFOX provides a reproducible and extensible hub for researchers, analysts, and policymakers seeking to enhance the detection, investigation, and mitigation of disinformation threats.The intelligence generated from a custom dataset has been tested and utilized by a local instance of OpenCTI, a mature CTI platform, via a custom-built connector, validating the platform with the exchange of more than 100 disinformation incidents. 0.742 |
2025-04-02 |
YourBench: Easy Custom Evaluation Sets for Everyone
Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow.This hinders timely or domain-specific assessment, crucial for real-world applications.We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents.We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark.To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. 0.713Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments.We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation. 0.71 |
2025-04-02 |
Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks
Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLM) to assist them with their coding tasks.This makes it crucial to align these tools with human values to prevent malicious misuse.In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain.We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently, create a dataset of prompts based on this taxonomy. 0.762To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs both open-source and closed-source models, as well as general-purpose and code-specific LLMs.Furthermore, we investigate the impact of models size, architecture family, and alignment strategies on their tendency to generate harmful content.The results show significant disparities in the alignment of various LLMs for harmlessness.We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts.Notably, some fine-tuned models perform significantly worse than their base-models due to their design choices.On the other side, we find that larger models tend to be more helpful and are less likely to respond with harmful information.These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area. |
2025-04-02 |
Extending MovieLens-32M to Provide New Evaluation Objectives
Offline evaluation of recommender systems has traditionally treated the problem as a machine learning problem.In the classic case of recommending movies, where the user has provided explicit ratings of which movies they like and don't like, each user's ratings are split into test and train sets, and the evaluation task becomes to predict the held out test data using the training data.This machine learning style of evaluation makes the objective to recommend the movies that a user has watched and rated highly, which is not the same task as helping the user find movies that they would enjoy if they watched them.This mismatch in objective between evaluation and task is a compromise to avoid the cost of asking a user to evaluate recommendations by watching each movie.As a resource available for download, we offer an extension to the MovieLens-32M dataset that provides for new evaluation objectives. 0.772Our primary objective is to predict the movies that a user would be interested in watching, i.e. predict their watchlist.To construct this extension, we recruited MovieLens users, collected their profiles, made recommendations with a diverse set of algorithms, pooled the recommendations, and had the users assess the pools.Notably, we found that the traditional machine learning style of evaluation ranks the Popular algorithm, which recommends movies based on total number of ratings in the system, in the middle of the twenty-two recommendation runs we used to build the pools.In contrast, when we rank the runs by users' interest in watching movies, we find that recommending popular movies as a recommendation algorithm becomes one of the worst performing runs.It appears that by asking users to assess their personal recommendations, we can alleviate the popularity bias issues created by using information retrieval effectiveness measures for the evaluation of recommender systems. |
2025-04-02 |
STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. 0.764Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs.Specifically, we begin by integrating existing open-source safety datasets from diverse sources.Then, we curate safety policies to generate policy-grounded deliberative reasoning samples.Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices.Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks.Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs.Our project page is https://ucsc-vlaa.github.io/STAR-1. |
2025-04-02 |
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Since the advent of reasoning-based large language models, many have found great success from distilling reasoning capabilities into student models.Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks.Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering and subsequent training.To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes.Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning.We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity.We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness.Finally, we also analyze the token efficiency and reasoning patterns utilized by these models.We will open-source these datasets and distilled models to the community. 0.802 |
2025-04-02 |
Image Difference Grounding with Natural Language
Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations.This limits their applicability in real-world scenarios like automatic surveillance, where detecting subtle but meaningful visual differences across multiple images is crucial.Besides, previous work on image difference understanding (IDU) has either focused on detecting all change regions without cross-modal text guidance, or on providing coarse-grained descriptions of differences.Therefore, to push towards finer-grained vision-language perception, we propose Image Difference Grounding (IDG), a task designed to precisely localize visual differences based on user instructions.We introduce DiffGround, a large-scale and high-quality dataset for IDG, containing image pairs with diverse visual variations along with instructions querying fine-grained differences. 0.835Besides, we present a baseline model for IDG, DiffTracker, which effectively integrates feature differential enhancement and common suppression to precisely locate differences.Experiments on the DiffGround dataset highlight the importance of our IDG dataset in enabling finer-grained IDU.To foster future research, both DiffGround data and DiffTracker model will be publicly released. |
2025-04-02 |
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Referring expression segmentation (RES) aims at segmenting the entities' masks that match the descriptive language expression.While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single object or part-level references.This introduces great challenges due to the diverse and nuanced ways users describe targets.However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the necessary data resources and unified frameworks for the more practical multi-grained RES.In this paper, we take a step further towards visual granularity unified RES task.To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding.In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. 0.809To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks.UniRES++ incorporates targeted designs for fine-grained visual feature exploration.With the joint model architecture and parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, RefCOCOg for classic RES.To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset and model UniRES++ will be publicly available at https://github.com/Rubics-Xuan/MRES. |
2025-03-31 |
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics
The rapid and unrestrained advancement of generative artificial intelligence (AI) presents a double-edged sword: while enabling unprecedented creativity, it also facilitates the generation of highly convincing deceptive content, undermining societal trust.As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task: it necessitates interpretable, context-aware methodologies that enhance trustworthiness and transparency.However, existing detection models primarily focus on classification, offering limited explanatory insights into image authenticity.In this work, we propose FakeScope, an expert multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies AI-synthetic images with high accuracy but also provides rich, interpretable, and query-driven forensic insights.We first construct FakeChain dataset that contains linguistic authenticity reasoning based on visual trace evidence, developed through a novel human-machine collaborative framework.Building upon it, we further present FakeInstruct, the largest multimodal instruction tuning dataset containing 2 million visual instructions tailored to enhance forensic awareness in LMMs. 0.761FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios.It can distinguish synthetic images with high accuracy while offering coherent and insightful explanations, free-form discussions on fine-grained forgery attributes, and actionable enhancement strategies.Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative capability on detection, enabled by our proposed token-based probability estimation strategy.Furthermore, FakeScope exhibits strong generalization and in-the-wild ability, ensuring its applicability in real-world scenarios. |
2025-03-31 |
Visual Acoustic Fields
Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties.Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS).Our approach features two key modules: sound generation and sound localization.The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds.Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources.To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds.To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. 0.835Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources.Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/. |
2025-03-31 |
Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy.Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms.This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification.The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024.The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency.The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences.The efficiency component tests the latency of algorithm inference.The challenge was conducted as a part of MICCAI EndoVis 2024.In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day.This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery.In this paper we summarize the design, submissions, and results from the challenge.The challenge dataset is available here: https://zenodo.org/records/14803158 , and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics 0.749 |
2025-03-31 |
Faster Releases, Fewer Risks: A Study on Maven Artifact Vulnerabilities and Lifecycle Management
In modern software ecosystems, dependency management plays a critical role in ensuring secure and maintainable applications.However, understanding the relationship between release practices and their impact on vulnerabilities and update cycles remains a challenge.In this study, we analyze the release histories of 10,000 Maven artifacts, covering over 203,000 releases and 1.7 million dependencies. 0.72We evaluate how release speed affects software security and lifecycle.Our results show an inverse relationship between release speed and dependency outdatedness.Artifacts with more frequent releases maintain significantly shorter outdated times.We also find that faster release cycles are linked to fewer CVEs in dependency chains, indicating a strong negative correlation.These findings emphasize the importance of accelerated release strategies in reducing security risks and ensuring timely updates.Our research provides valuable insights for software developers, maintainers, and ecosystem managers. |
2025-03-31 |
InstructRestore: Region-Customized Image Restoration with Human Instructions
Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions.In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions.To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. 0.746With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. 0.924We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image details enhancement.Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions.Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement.Our work advances the investigation of interactive image restoration and enhancement techniques.Data, code, and models will be found at https://github.com/shuaizhengliu/InstructRestore.git. |
2025-03-31 |
Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation
We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation.Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance.To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness.These enriched features are then decoded to produce precise and robust segmentation.We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. 0.842Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20\% on average in the 1\% and 10\% data settings.Our method achieves ∼77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications. |
2025-03-27 |
Dataset and Analysis of Long-Term Skill Acquisition in Robot-Assisted Minimally Invasive Surgery
Objective: We aim to investigate long-term robotic surgical skill acquisition among surgical residents and the effects of training intervals and fatigue on performance.Methods: For six months, surgical residents participated in three training sessions once a month, surrounding a single 26-hour hospital shift.In each shift, they participated in training sessions scheduled before, during, and after the shift.In each training session, they performed three dry-lab training tasks: Ring Tower Transfer, Knot-Tying, and Suturing.We collected a comprehensive dataset, including videos synchronized with kinematic data, activity tracking, and scans of the suturing pads. 0.84Results:We collected a dataset of 972 trials performed by 18 residents of different surgical specializations.Participants demonstrated consistent performance improvement across all tasks.In addition, we found variations in between-shift learning and forgetting across metrics and tasks, and hints for possible effects of fatigue.Conclusion: The findings from our first analysis shed light on the long-term learning processes of robotic surgical skills with extended intervals and varying levels of fatigue.Significance: This study lays the groundwork for future research aimed at optimizing training protocols and enhancing AI applications in surgery, ultimately contributing to improved patient outcomes.The dataset will be made available upon acceptance of our journal submission. 0.938 |
2025-03-27 |
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards.Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphic user interface (GUI) action prediction tasks.To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. 0.74We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO).Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks.Specifically, on the ID benchmark AndroidControl, the action type accuracy improves by 15%, while grounding accuracy increases by 10.3%, compared with the base model (i.e. Qwen2.5-VL-3B).On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K data.These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. |
2025-03-27 |
The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection
In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point.This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results.We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images.It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects.We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO.Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts.We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (https://benchmark.mvtec.com/).All image data is available at https://www.mvtec.com/company/research/datasets/mvtec-ad-2. 0.712 |
2025-03-27 |
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities.Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture realworld language nuances.Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text.To address these challenges, We introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. 0.771The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation.We evaluate LLMs on these tasks using COMILINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities.COMI-LINGUA is publically availabe at: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA. 0.767 |
2025-03-27 |
JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai Community
This paper introduces JiraiBench, the first bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content across Chinese and Japanese social media communities.Focusing on the transnational "Jirai" (landmine) online subculture that encompasses multiple forms of self-destructive behaviors including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework incorporating both linguistic and cultural dimensions.Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts with multidimensional annotation along three behavioral categories, achieving substantial inter-annotator agreement. 0.842Experimental evaluations across four state-of-the-art models reveal significant performance variations based on instructional language, with Japanese prompts unexpectedly outperforming Chinese prompts when processing Chinese content.This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks.Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between these language systems without explicit target language training.These findings highlight the need for culturally-informed approaches to multilingual content moderation and provide empirical evidence for the importance of cultural context in developing more effective detection systems for vulnerable online communities. |
2025-03-27 |
CMED: A Child Micro-Expression Dataset
Micro-expressions are short bursts of emotion that are difficult to hide.Their detection in children is an important cue to assist psychotherapists in conducting better therapy.However, existing research on the detection of micro-expressions has focused on adults, whose expressions differ in their characteristics from those of children.The lack of research is a direct consequence of the lack of a child-based micro-expressions dataset as it is much more challenging to capture children's facial expressions due to the lack of predictability and controllability.This study compiles a dataset of spontaneous child micro-expression videos, the first of its kind, to the best of the authors knowledge.The dataset is captured in the wild using video conferencing software. 0.906This dataset enables us to then explore key features and differences between adult and child micro-expressions.This study also establishes a baseline for the automated spotting and recognition of micro-expressions in children using three approaches comprising of hand-created and learning-based approaches. |
2025-03-27 |
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity.Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024×1024 images. 0.921Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance.To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation.Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%).Our codes, models, datasets, and demo are publicly available. 0.804 |
2025-03-27 |
Video-R1: Reinforcing Video Reasoning in MLLMs
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs).However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data.To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning.Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process.We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. 0.904Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc.Notably, Video-R1-7B attains a 35.8% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o.All codes, models, data are released. |
2025-03-26 |
Collaborative Storytelling and LLM: A Linguistic Analysis of Automatically-Generated Role-Playing Game Sessions
Role-playing games (RPG) are games in which players interact with one another to create narratives.The role of players in the RPG is largely based on the interaction between players and their characters.This emerging form of shared narrative, primarily oral, is receiving increasing attention.In particular, many authors investigated the use of an LLM as an actor in the game.In this paper, we aim to discover to what extent the language of Large Language Models (LLMs) exhibit oral or written features when asked to generate an RPG session without human interference.We will conduct a linguistic analysis of the lexical and syntactic features of the generated texts and compare the results with analyses of conversations, transcripts of human RPG sessions, and books. 0.789We found that LLMs exhibit a pattern that is distinct from all other text categories, including oral conversations, human RPG sessions and books.Our analysis has shown how training influences the way LLMs express themselves and provides important indications of the narrative capabilities of these tools. |
2025-03-26 |
AccidentSim: Generating Physically Realistic Vehicle Collision Videos from Real-World Accident Reports
Collecting real-world vehicle accident videos for autonomous driving research is challenging due to their rarity and complexity.While existing driving video generation methods may produce visually realistic videos, they often fail to deliver physically realistic simulations because they lack the capability to generate accurate post-collision trajectories.In this paper, we introduce AccidentSim, a novel framework that generates physically realistic vehicle collision videos by extracting and utilizing the physical clues and contextual information available in real-world vehicle accident reports.Specifically, AccidentSim leverages a reliable physical simulator to replicate post-collision vehicle trajectories from the physical and contextual information in the accident reports and to build a vehicle collision trajectory dataset.This dataset is then used to fine-tune a language model, enabling it to respond to user prompts and predict physically consistent post-collision trajectories across various driving scenarios based on user descriptions. 0.808Finally, we employ Neural Radiance Fields (NeRF) to render high-quality backgrounds, merging them with the foreground vehicles that exhibit physically realistic trajectories to generate vehicle collision videos.Experimental results demonstrate that the videos produced by AccidentSim excel in both visual and physical authenticity. |
2025-03-26 |
ARMO: Autoregressive Rigging for Multi-Category Objects
Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation.However, most existing methods focus primarily on generating static 3D models, overlooking the potentially dynamic nature of certain shapes, such as humanoids, animals, and insects.To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models.In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. 0.783Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses.Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner.By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens.A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation.Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation.Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories.The code and dataset will be made public for academic use upon acceptance. |
2025-03-26 |
Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy
The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework.However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems.While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks.We consider that these hallucinations arise from an absence of clear self-awareness within the models.To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks.This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. 0.82Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks.Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination.Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations.Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms close-source models in terms of various evaluation metrics. |
2025-03-26 |
MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams
Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements.Unlike natural images, their inherently symbolic and abstract nature poses significant challenges for Multimodal Large Language Models (MLLMs).However, current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition.To address this gap, we introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs.MATHGLANCE comprises 1.2K images and 1.6K carefully curated questions spanning four perception tasks: shape classification, object counting, relationship identification, and object grounding, covering diverse domains including plane geometry, solid geometry, and graphical representations.Our evaluation of MLLMs reveals that their ability to understand diagrams is notably limited, particularly in fine-grained grounding tasks.In response, we construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs explicitly annotated with geometric primitives and precise spatial relationships. 0.746Training MLLM on GeoPeP leads to significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning.Our benchmark and dataset establish critical standards for evaluating and advancing multimodal mathematical understanding, providing valuable resources and insights to foster future MLLM research. |
2025-03-26 |
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation.BASKET contains 4,477 hours of video capturing 32,232 basketball players from all over the world.Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis.Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills.Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline.We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities.In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others.Dataset and code are available at https://github.com/yulupan00/BASKET. 0.85 |
2025-03-25 |
Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings
Advancements in computer-assisted surgical procedures heavily rely on accurate visual data interpretation from camera systems used during surgeries.Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos with less than 100K images.To address these constraints, a new dataset called Surg-3M has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. 0.811Featuring an extensive collection of over 4K surgical videos and more than 3 million high-quality images from multiple procedure types, Surg-3M offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel tasks.To demonstrate the effectiveness of this dataset, we present SurgFM, a self-supervised foundation model pretrained on Surg-3M that achieves impressive results in downstream tasks such as surgical phase recognition, action recognition, and tool presence detection.Combining key components from ConvNeXt, DINO, and an innovative augmented distillation method, SurgFM exhibits exceptional performance compared to specialist architectures across various benchmarks.Our experimental results show that SurgFM outperforms state-of-the-art models in multiple downstream tasks, including significant gains in surgical phase recognition (+8.9pp, +4.7pp, and +3.9pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), action recognition (+3.1pp of mAP in CholecT50) and tool presence detection (+4.6pp of mAP in Cholec80).Moreover, even when using only half of the data, SurgFM outperforms state-of-the-art models in AutoLaparo and achieves state-of-the-art performance in Cholec80.Both Surg-3M and SurgFM have significant potential to accelerate progress towards developing autonomous robotic surgery systems. |
2025-03-25 |
BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts
Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility.The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding.However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance.This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism.Specifically, we leverage SAM's inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics.This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct "experts," and the similarity scoring serves as a rudimentary "gating network."We conducted extensive evaluations on both the Endovis17 medical dataset and RefCOCO series natural image datasets. 0.822On Endovis17, BiPrompt-SAM achieved 89.55\% mDice and 81.46\% mIoU, comparable to state-of-the-art specialized medical segmentation models.On the RefCOCO series datasets, our method attained 87.1\%, 86.5\%, and 85.8\% IoU, significantly outperforming existing approaches.Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions.BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion. |
2025-03-25 |
Outsourcing an Information Operation: A Complete Dataset of Tenet Media's Podcasts on Rumble
Tenet Media, a U.S.-based, right-wing media company, hired six established podcasters to create content related to U.S. politics and culture during the 2024 U.S. presidential election cycle.After publishing content on YouTube and Rumble for nearly a year, Tenet Media was declared by the U.S. government to be funded entirely by Russia -- making it effectively an outsourced state-sponsored information operation (SSIO).We present a complete dataset of the 560 podcast videos published by the Tenet Media channel on the video-sharing platform Rumble between November 2023 and September 2024. 0.909Our dataset includes video metadata and user comments, as well as high-quality video transcriptions, representing over 300 hours of video content. 0.891This dataset provides researchers with material to study a Russian SSIO, and notably on Rumble, which is an understudied platform in SSIO scholarship. 0.848 |
2025-03-25 |
LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset
Low-light image enhancement is crucial for a myriad of applications, from night vision and surveillance, to autonomous driving.However, due to the inherent limitations that come in hand with capturing images in low-illumination environments, the task of enhancing such scenes still presents a formidable challenge.To advance research in this field, we introduce our Low Exposure Night Vision (LENVIZ) Dataset, a comprehensive multi-exposure benchmark dataset for low-light image enhancement comprising of over 230K frames showcasing 24K real-world indoor and outdoor, with-and without human, scenes. 0.793Captured using 3 different camera sensors, LENVIZ offers a wide range of lighting conditions, noise levels, and scene complexities, making it the largest publicly available up-to 4K resolution benchmark in the field.LENVIZ includes high quality human-generated ground truth, for which each multi-exposure low-light scene has been meticulously curated and edited by expert photographers to ensure optimal image quality.Furthermore, we also conduct a comprehensive analysis of current state-of-the-art low-light image enhancement techniques on our dataset and highlight potential areas of improvement. |
2025-03-25 |
Scaling Down Text Encoders of Text-to-Image Diffusion Models
Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL.Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters.Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power.Therefore, it raises an important question: "Do we really need such a large text encoder?"In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models.To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. 0.711Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size.This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible. |
2025-03-24 |
SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world.To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions.SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities.We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality.We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. 0.709Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications. |
2025-03-24 |
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding.This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs.We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. 0.745Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes.Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks. |
Data Quality |
|
2025-04-02 |
Buggin: Automatic intrinsic bugs classification model using NLP and ML
Recent studies have shown that bugs can be categorized into intrinsic and extrinsic types.Intrinsic bugs can be backtracked to specific changes in the version control system (VCS), while extrinsic bugs originate from external changes to the VCS and lack a direct bug-inducing change.Using only intrinsic bugs to train bug prediction models has been reported as beneficial to improve the performance of such models.However, there is currently no automated approach to identify intrinsic bugs.To bridge this gap, our study employs Natural Language Processing (NLP) techniques to automatically identify intrinsic bugs.Specifically, we utilize two embedding techniques, seBERT and TF-IDF, applied to the title and description text of bug reports.The resulting embeddings are fed into well-established machine learning algorithms such as Support Vector Machine, Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors.The primary objective of this paper is to assess the performance of various NLP and machine learning techniques in identifying intrinsic bugs using the textual information extracted from bug reports. 0.605The results demonstrate that both seBERT and TF-IDF can be effectively utilized for intrinsic bug identification.The highest performance scores were achieved by combining TF-IDF with the Decision Tree algorithm and utilizing the bug titles (yielding an F1 score of 78%).This was closely followed by seBERT, Support Vector Machine, and bug titles (with an F1 score of 77%).In summary, this paper introduces an innovative approach that automates the identification of intrinsic bugs using textual information derived from bug reports. |
2025-03-25 |
RCC-PFL: Robust Client Clustering under Noisy Labels in Personalized Federated Learning
We address the problem of cluster identity estimation in a personalized federated learning (PFL) setting in which users aim to learn different personal models.The backbone of effective learning in such a setting is to cluster users into groups whose objectives are similar.A typical approach in the literature is to achieve this by training users' data on different proposed personal models and assign them to groups based on which model achieves the lowest value of the users' loss functions.This process is to be done iteratively until group identities converge.A key challenge in such a setting arises when users have noisy labeled data, which may produce misleading values of their loss functions, and hence lead to ineffective clustering. 0.635To overcome this challenge, we propose a label-agnostic data similarity-based clustering algorithm, coined RCC-PFL, with three main advantages: the cluster identity estimation procedure is independent from the training labels; it is a one-shot clustering algorithm performed prior to the training; and it requires fewer communication rounds and less computation compared to iterative-based clustering methods.We validate our proposed algorithm using various models and datasets and show that it outperforms multiple baselines in terms of average accuracy and variance reduction. |
2025-03-13 |
Learning Disease State from Noisy Ordinal Disease Progression Labels
Learning from noisy ordinal labels is a key challenge in medical imaging.In this work, we ask whether ordinal disease progression labels (better, worse, or stable) can be used to learn a representation allowing to classify disease state.For neovascular age-related macular degeneration (nAMD), we cast the problem of modeling disease progression between medical visits as a classification task with ordinal ranks.To enhance generalization, we tailor our model to the problem setting by (1) independent image encoding, (2) antisymmetric logit space equivariance, and (3) ordinal scale awareness.In addition, we address label noise by learning an uncertainty estimate for loss re-weighting. 0.657Our approach learns an interpretable disease representation enabling strong few-shot performance for the related task of nAMD activity classification from single images, despite being trained only on image pairs with ordinal disease progression labels. |
2025-03-13 |
More Than Just Warnings:Exploring the Ways of Communicating Credibility Assessment on Social Media
Reducing the spread of misinformation is challenging.AI-based fact verification systems offer a promising solution by addressing the high costs and slow pace of traditional fact-checking.However, the problem of how to effectively communicate the results to users remains unsolved.Warning labels may seem an easy solution, but they fail to account for fuzzy misinformation that is not entirely fake. 0.748Additionally, users' limited attention spans and social media information should be taken into account while designing the presentation.The online experiment (n = 537) investigates the impact of sources and granularity on users' perception of information veracity and the system's usefulness and trustworthiness.Findings show that fine-grained indicators enhance nuanced opinions, information awareness, and the intention to use fact-checking systems.Source differences had minimal impact on opinions and perceptions, except for informativeness.Qualitative findings suggest the proposed indicators promote critical thinking.We discuss implications for designing concise, user-friendly AI fact-checking feedback. |
2025-03-13 |
Unlock the Power of Unlabeled Data in Language Driving Model
Recent Vision-based Large Language Models~(VisionLLMs) for autonomous driving have seen rapid advancements.However, such promotion is extremely dependent on large-scale high-quality annotated data, which is costly and labor-intensive.To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner.Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data.Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. 0.647By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods.Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets.In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark. |
Benchmarks |
|
2025-04-03 |
BECAME: BayEsian Continual Learning with Adaptive Model MErging
Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting.A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks).While representative gradient projection methods ensure stability, they often limit plasticity.Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters.In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits.Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks.To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging.Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies. 0.625 |
2025-04-03 |
SCMPPI: Supervised Contrastive Multimodal Framework for Predicting Protein-Protein Interactions
Protein-Protein Interaction (PPI) prediction is a key task in uncovering cellular functional networks and disease mechanisms.However, traditional experimental methods are time-consuming and costly, and existing computational models face challenges in cross-modal feature fusion, robustness, and false-negative suppression.In this paper, we propose a novel supervised contrastive multimodal framework, SCMPPI, for PPI prediction.By integrating protein sequence features (AAC, DPC, CKSAAP-ESMC) with PPI network topology information (Node2Vec graph embedding), and combining an improved supervised contrastive learning strategy, SCMPPI significantly enhances PPI prediction performance.For the PPI task, SCMPPI introduces a negative sample filtering mechanism and modifies the contrastive loss function, effectively optimizing multimodal features.Experiments on eight benchmark datasets, including yeast, human, and H.pylori, show that SCMPPI outperforms existing state-of-the-art methods (such as DF-PPI and TAGPPI) in key metrics such as accuracy ( 98.01%) and AUC (99.62%), and demonstrates strong generalization in cross-species prediction (AUC > 99% on multi-species datasets). 0.604Furthermore, SCMPPI has been successfully applied to CD9 networks, the Wnt pathway, and cancer-specific networks, providing a reliable tool for disease target discovery.This framework also offers a new paradigm for multimodal biological information fusion and contrastive learning in collaborative optimization for various combined predictions. |
2025-04-03 |
Pushing the Limit of PPG Sensing in Sedentary Conditions by Addressing Poor Skin-sensor Contact
Photoplethysmography (PPG) is a widely used non-invasive technique for monitoring cardiovascular health and various physiological parameters on consumer and medical devices.While motion artifacts are well-known challenges in dynamic settings, suboptimal skin-sensor contact in sedentary conditions - a critical issue often overlooked in existing literature - can distort PPG signal morphology, leading to the loss or shift of essential waveform features and therefore degrading sensing performance.In this work, we propose CP-PPG, a novel approach that transforms Contact Pressure-distorted PPG signals into ones with the ideal morphology.CP-PPG incorporates a novel data collection approach, a well-crafted signal processing pipeline, and an advanced deep adversarial model trained with a custom PPG-aware loss function.We validated CP-PPG through comprehensive evaluations, including 1) morphology transformation performance on our self-collected dataset, 2) downstream physiological monitoring performance on public datasets, and 3) in-the-wild performance.Extensive experiments demonstrate substantial and consistent improvements in signal fidelity (Mean Absolute Error: 0.09, 40% improvement over the original signal) as well as downstream performance across all evaluations in Heart Rate (HR), Heart Rate Variability (HRV), Respiration Rate (RR), and Blood Pressure (BP) estimation (on average, 21% improvement in HR; 41-46% in HRV; 6% in RR; and 4-5% in BP). 0.604These findings highlight the critical importance of addressing skin-sensor contact issues for accurate and dependable PPG-based physiological monitoring.Furthermore, CP-PPG can serve as a generic, plug-in API to enhance PPG signal quality. |
2025-04-03 |
Curbing the Ramifications of Authorship Abuse in Science
Research performance is often measured using bibliometric indicators, such as publication count, total citations, and h-index. 0.614These metrics influence career advancements, salary adjustments, administrative opportunities, funding prospects, and professional recognition.However, the reliance on these metrics has also made them targets for manipulation, misuse, and abuse.One primary ethical concern is authorship abuse, which includes paid, ornamental, exploitative, and cartel authorships.These practices are prevalent because they artificially enhance multiple bibliometric indicators all at once.Our study confirms a significant rise in the mean and median number of authors per publication across multiple disciplines over the last 34 years.While it is important to identify the cases of authorship abuse, a thorough investigation of every paper proves impractical.In this study, we propose a credit allocation scheme based on the reciprocals of the Fibonacci numbers, designed to adjust credit for individual contributions while systematically reducing credit for potential authorship abuse.The proposed scheme aligns with rigorous authorship guidelines from scientific associations, which mandate significant contributions across most phases of a study, while accommodating more lenient guidelines from scientific publishers, which recognize authorship for minimal contributions.We recalibrate the traditional bibliometric indicators to emphasize author contribution rather than participation in publications.Additionally, we propose a new indicator, T′-index, to assess researchers' leading and contributing roles in their publications.Our proposed credit allocation scheme mitigates the effects of authorship abuse and promotes a more ethical scientific ecosystem. |
2025-04-03 |
A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
The detection and intervention of mental health issues represent a critical global research focus, and social media data has been recognized as an important resource for mental health research.However, how to utilize Large Language Models (LLMs) for mental health problem detection on social media poses significant challenges.Hence, this paper aims to explore the potential of LLM applications in social media data analysis, focusing not only on the most common psychological disorders such as depression and anxiety but also incorporating psychotic disorders and externalizing disorders, summarizing the application methods of LLM from different dimensions, such as text data analysis and detection of mental disorders, and revealing the major challenges and shortcomings of current research.In addition, the paper provides an overview of popular datasets, and evaluation metrics. 0.625The survey in this paper provides a comprehensive frame of reference for researchers in the field of mental health, while demonstrating the great potential of LLMs in mental health detection to facilitate the further application of LLMs in future mental health interventions. |
2025-04-02 |
Budget-Feasible Contracts
The problem of computing near-optimal contracts in combinatorial settings has recently attracted significant interest in the computer science community.Previous work has provided a rich body of structural and algorithmic insights into this problem. 0.604However, most of these results rely on the assumption that the principal has an unlimited budget for incentivizing agents, an assumption that is often unrealistic in practice.This motivates the study of the optimal contract problem under budget constraints.We study multi-agent contracts with budget constraints under both binary and combinatorial actions.For binary actions, our contribution is threefold.First, we generalize all previously known approximation guarantees on the principal's revenue to budgeted settings.Second, through the lens of budget constraints, we uncover insightful connections between the standard objective of the principal's revenue and other objectives.We identify a broad class of objectives, which we term BEST objectives, including reward, social welfare, and revenue, and show that they are all equivalent (up to a constant factor), leading to approximation guarantees for all BEST objectives.Third, we introduce the price of frugality, which quantifies the loss due to budget constraints, and establish near-tight bounds on this measure, providing deeper insights into the tradeoffs between budgets and incentives.For combinatorial actions, we establish a strong negative result.Specifically, we show that in a budgeted setting with submodular rewards, no finite approximation is possible to any BEST objective.This stands in contrast to the unbudgeted setting with submodular rewards, where a polynomial-time constant-factor approximation is known for revenue.On the positive side, for gross substitutes rewards, we recover our binary-actions results, obtaining a constant-factor approximation for all BEST objectives. |
2025-04-02 |
Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Artificial Intelligence (AI) in skin disease diagnosis has improved significantly, but a major concern is that these models frequently show biased performance across subgroups, especially regarding sensitive attributes such as skin color.To address these issues, we propose a novel generative AI-based framework, namely, Dermatology Diffusion Transformer (DermDiT), which leverages text prompts generated via Vision Language Models and multimodal text-image learning to generate new dermoscopic images.We utilize large vision language models to generate accurate and proper prompts for each dermoscopic image which helps to generate synthetic images to improve the representation of underrepresented groups (patient, disease, etc.) in highly imbalanced datasets for clinical diagnoses.Our extensive experimentation showcases the large vision language models providing much more insightful representations, that enable DermDiT to generate high-quality images.Our code is available at https://github.com/Munia03/DermDiT 0.612 |
2025-04-02 |
LARGE: Legal Retrieval Augmented Generation Evaluation Tool
Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice.Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis which emphasizes the importance of making decisions based on (retrieved) prior documents.However, the overall performance of RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. 0.608Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain.LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy.We validated LRAGE using multilingual legal benches including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above.The source code is available at https://github.com/hoorangyee/LRAGE. |
2025-04-02 |
Enhanced Diffusion Sampling via Extrapolation with Multiple ODE Solutions
Diffusion probabilistic models (DPMs), while effective in generating high-quality samples, often suffer from high computational costs due to their iterative sampling process.To address this, we propose an enhanced ODE-based sampling method for DPMs inspired by Richardson extrapolation, which reduces numerical error and improves convergence rates.Our method, RX-DPM, leverages multiple ODE solutions at intermediate time steps to extrapolate the denoised prediction in DPMs.This significantly enhances the accuracy of estimations for the final sample while maintaining the number of function evaluations (NFEs). 0.637Unlike standard Richardson extrapolation, which assumes uniform discretization of the time grid, we develop a more general formulation tailored to arbitrary time step scheduling, guided by local truncation error derived from a baseline sampling method.The simplicity of our approach facilitates accurate estimation of numerical solutions without significant computational overhead, and allows for seamless and convenient integration into various DPMs and solvers. 0.611Additionally, RX-DPM provides explicit error estimates, effectively demonstrating the faster convergence as the leading error term's order increases. 0.647Through a series of experiments, we show that the proposed method improves the quality of generated samples without requiring additional sampling iterations. 0.637 |
2025-04-02 |
Buggin: Automatic intrinsic bugs classification model using NLP and ML
Recent studies have shown that bugs can be categorized into intrinsic and extrinsic types.Intrinsic bugs can be backtracked to specific changes in the version control system (VCS), while extrinsic bugs originate from external changes to the VCS and lack a direct bug-inducing change.Using only intrinsic bugs to train bug prediction models has been reported as beneficial to improve the performance of such models.However, there is currently no automated approach to identify intrinsic bugs.To bridge this gap, our study employs Natural Language Processing (NLP) techniques to automatically identify intrinsic bugs.Specifically, we utilize two embedding techniques, seBERT and TF-IDF, applied to the title and description text of bug reports.The resulting embeddings are fed into well-established machine learning algorithms such as Support Vector Machine, Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors.The primary objective of this paper is to assess the performance of various NLP and machine learning techniques in identifying intrinsic bugs using the textual information extracted from bug reports.The results demonstrate that both seBERT and TF-IDF can be effectively utilized for intrinsic bug identification.The highest performance scores were achieved by combining TF-IDF with the Decision Tree algorithm and utilizing the bug titles (yielding an F1 score of 78%). 0.645This was closely followed by seBERT, Support Vector Machine, and bug titles (with an F1 score of 77%).In summary, this paper introduces an innovative approach that automates the identification of intrinsic bugs using textual information derived from bug reports. |
2025-04-02 |
A Diffusion-Based Framework for Occluded Object Movement
Seamlessly moving objects within a scene is a common requirement for image editing, but it is still a challenge for existing editing methods.Especially for real-world images, the occlusion situation further increases the difficulty.The main difficulty is that the occluded portion needs to be completed before movement can proceed.To leverage the real-world knowledge embedded in the pre-trained diffusion models, we propose a Diffusion-based framework specifically designed for Occluded Object Movement, named DiffOOM.The proposed DiffOOM consists of two parallel branches that perform object de-occlusion and movement simultaneously.The de-occlusion branch utilizes a background color-fill strategy and a continuously updated object mask to focus the diffusion process on completing the obscured portion of the target object.Concurrently, the movement branch employs latent optimization to place the completed object in the target location and adopts local text-conditioned guidance to integrate the object into new surroundings appropriately.Extensive evaluations demonstrate the superior performance of our method, which is further validated by a comprehensive user study. 0.779 |
2025-04-02 |
TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables
Humans continuously make new discoveries, and understanding temporal sequence of events leading to these breakthroughs is essential for advancing science and society.This ability to reason over time allows us to identify future steps and understand the effects of financial and political decisions on our lives.However, large language models (LLMs) are typically trained on static datasets, limiting their ability to perform effective temporal reasoning.To assess the temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES dataset, which comprises 3,971 questions derived from over 14,000 tables, spanning 1,238 entities across multiple time periods.We introduce a template-based question-generation pipeline that harnesses LLMs to refine both templates and questions.Additionally, we establish baseline results using state-of-the-art LLMs to create a benchmark. 0.693We also introduce novel modeling strategies centered around task decomposition, enhancing LLM performance. |
2025-04-02 |
Multi-fidelity Parameter Estimation Using Conditional Diffusion Models
We present a multi-fidelity method for uncertainty quantification of parameter estimates in complex systems, leveraging generative models trained to sample the target conditional distribution.In the Bayesian inference setting, traditional parameter estimation methods rely on repeated simulations of potentially expensive forward models to determine the posterior distribution of the parameter values, which may result in computationally intractable workflows.Furthermore, methods such as Markov Chain Monte Carlo (MCMC) necessitate rerunning the entire algorithm for each new data observation, further increasing the computational burden.Hence, we propose a novel method for efficiently obtaining posterior distributions of parameter estimates for high-fidelity models given data observations of interest.The method first constructs a low-fidelity, conditional generative model capable of amortized Bayesian inference and hence rapid posterior density approximation over a wide-range of data observations.When higher accuracy is needed for a specific data observation, the method employs adaptive refinement of the density approximation. 0.625It uses outputs from the low-fidelity generative model to refine the parameter sampling space, ensuring efficient use of the computationally expensive high-fidelity solver.Subsequently, a high-fidelity, unconditional generative model is trained to achieve greater accuracy in the target posterior distribution.Both low- and high- fidelity generative models enable efficient sampling from the target posterior and do not require repeated simulation of the high-fidelity forward model.We demonstrate the effectiveness of the proposed method on several numerical examples, including cases with multi-modal densities, as well as an application in plasma physics for a runaway electron simulation model. |
2025-04-02 |
Build Code Needs Maintenance Too: A Study on Refactoring and Technical Debt in Build Systems
In modern software engineering, build systems play the crucial role of facilitating the conversion of source code into software artifacts.Recent research has explored high-level causes of build failures, but has largely overlooked the structural properties of build files.Akin to source code, build systems face technical debt challenges that hinder maintenance and optimization.While refactoring is often seen as a key tool for addressing technical debt in source code, there is a significant research gap regarding the specific refactoring changes developers apply to build code and whether these refactorings effectively address technical debt.In this paper, we address this gap by examining refactorings applied to build scripts in open-source projects, covering the widely used build systems of Gradle, Ant, and Maven.Additionally, we investigate whether these refactorings are used to tackle technical debts in build systems.Our analysis was conducted on \totalCommits examined build-file-related commits.We identified \totalRefactoringCategories build-related refactorings, which we divided into \totalCategories main categories.These refactorings are organized into the first empirically derived taxonomy of build system refactorings.Furthermore, we investigate how developers employ these refactoring types to address technical debts via a manual commit-analysis and a developer survey.In this context, we identified \totalTechnicalDebts technical debts addressed by these refactorings and discussed their correlation with the different refactorings.Finally, we introduce BuildRefMiner, an LLM-powered tool leveraging GPT-4o to automate the detection of refactorings within build systems.We evaluated its performance and found that it achieves an F1 score of \toolFoneScore across all build systems. 0.622 |
2025-04-02 |
Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework
Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research.In this work, we present an evaluation framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy.The proposed approach employs a holdout-based benchmarking strategy that facilitates quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. 0.709The framework supports various data types and structures, including sequential and contextual information, and enables interpretable quality diagnostics through a set of standardized metrics.These contributions aim to support reproducibility and methodological consistency in benchmarking of synthetic data generation techniques.The code of the framework is available at https://github.com/mostly-ai/mostlyai-qa. |
2025-04-02 |
Client Selection in Federated Learning with Data Heterogeneity and Network Latencies
Federated learning (FL) is a distributed machine learning paradigm where multiple clients conduct local training based on their private data, then the updated models are sent to a central server for global aggregation.The practical convergence of FL is challenged by multiple factors, with the primary hurdle being the heterogeneity among clients.This heterogeneity manifests as data heterogeneity concerning local data distribution and latency heterogeneity during model transmission to the server.While prior research has introduced various efficient client selection methods to alleviate the negative impacts of either of these heterogeneities individually, efficient methods to handle real-world settings where both these heterogeneities exist simultaneously do not exist.In this paper, we propose two novel theoretically optimal client selection schemes that can handle both these heterogeneities.Our methods involve solving simple optimization problems every round obtained by minimizing the theoretical runtime to convergence.Empirical evaluations on 9 datasets with non-iid data distributions, 2 practical delay distributions, and non-convex neural network models demonstrate that our algorithms are at least competitive to and at most 20 times better than best existing baselines. 0.601 |
2025-04-02 |
Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection
While AI agents have shown remarkable performance at various tasks, they still struggle with complex multi-modal applications, structured generation and strategic planning.Improvements via standard fine-tuning is often impractical, as solving agentic tasks usually relies on black box API access without control over model parameters.Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. 0.626However, BON lacks iterative feedback integration mechanism.Hence, we propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.IAD differs in how feedback is designed and integrated, specifically optimized to extract maximal signal from reward scores.We conduct a detailed comparison of baselines across key metrics on Sketch2Code, Text2SQL, and Webshop where IAD consistently outperforms baselines, achieving 3--6% absolute gains on Sketch2Code and Text2SQL (with and without LLM judges) and 8--10% gains on Webshop across multiple metrics.To better understand the source of IAD's gains, we perform controlled experiments to disentangle the effect of adaptive feedback from stochastic sampling, and find that IAD's improvements are primarily driven by verifier-guided refinement, not merely sampling diversity.We also show that both IAD and BON exhibit inference-time scaling with increased compute when guided by an optimal verifier.Our analysis highlights the critical role of verifier quality in effective inference-time optimization and examines the impact of noisy and sparse rewards on scaling behavior.Together, these findings offer key insights into the trade-offs and principles of effective inference-time optimization. |
2025-04-02 |
Diffusion-Guided Gaussian Splatting for Large-Scale Unconstrained 3D Reconstruction and Novel View Synthesis
Recent advancements in 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have achieved impressive results in real-time 3D reconstruction and novel view synthesis.However, these methods struggle in large-scale, unconstrained environments where sparse and uneven input coverage, transient occlusions, appearance variability, and inconsistent camera settings lead to degraded quality.We propose GS-Diff, a novel 3DGS framework guided by a multi-view diffusion model to address these limitations.By generating pseudo-observations conditioned on multi-view inputs, our method transforms under-constrained 3D reconstruction problems into well-posed ones, enabling robust optimization even with sparse data.GS-Diff further integrates several enhancements, including appearance embedding, monocular depth priors, dynamic object modeling, anisotropy regularization, and advanced rasterization techniques, to tackle geometric and photometric challenges in real-world settings.Experiments on four benchmarks demonstrate that GS-Diff consistently outperforms state-of-the-art baselines by significant margins. 0.699 |
2025-04-02 |
Learning from Streaming Video with Orthogonal Gradients
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner.This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms.When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance.We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction.To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training.The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. 0.617Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks.On three scenarios (DoRA, VideoMAE, future prediction), we show our orthogonal optimizer outperforms the strong AdamW in all three scenarios. |
2025-03-31 |
Spatio-temporal Prediction of Fine-Grained Origin-Destination Matrices with Applications in Ridesharing
Accurate spatial-temporal prediction of network-based travelers' requests is crucial for the effective policy design of ridesharing platforms.Having knowledge of the total demand between various locations in the upcoming time slots enables platforms to proactively prepare adequate supplies, thereby increasing the likelihood of fulfilling travelers' requests and redistributing idle drivers to areas with high potential demand to optimize the global supply-demand equilibrium.This paper delves into the prediction of Origin-Destination (OD) demands at a fine-grained spatial level, especially when confronted with an expansive set of local regions.While this task holds immense practical value, it remains relatively unexplored within the research community.To fill this gap, we introduce a novel prediction model called OD-CED, which comprises an unsupervised space coarsening technique to alleviate data sparsity and an encoder-decoder architecture to capture both semantic and geographic dependencies.Through practical experimentation, OD-CED has demonstrated remarkable results.It achieved an impressive reduction of up to 45% reduction in root-mean-square error and 60% in weighted mean absolute percentage error over traditional statistical methods when dealing with OD matrices exhibiting a sparsity exceeding 90%. 0.652 |
2025-03-31 |
Advanced Quantum Annealing Approach to Vehicle Routing Problems with Time Windows
In this paper, we explore the potential for quantum annealing to solve realistic routing problems.We focus on two NP-Hard problems, including the Traveling Salesman Problem with Time Windows and the Capacitated Vehicle Routing Problem with Time Windows.We utilize D-Wave's Quantum Annealer and Constrained Quadratic Model (CQM) solver within a hybrid framework to solve these problems.We demonstrate that while the CQM solver effectively minimizes route costs, it struggles to maintain time window feasibility as the problem size increases.To address this limitation, we implement a heuristic method that fixes infeasible solutions through a series of swapping operations.Testing on benchmark instances shows our method achieves promising results with an average optimality gap of 3.86%. 0.759 |
2025-03-31 |
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning
We consider a decentralized wireless network with several source-destination pairs sharing a limited number of orthogonal frequency bands.Sources learn to adapt their transmissions (specifically, their band selection strategy) over time, in a decentralized manner, without sharing information with each other.Sources can only observe the outcome of their own transmissions (i.e., success or collision), having no prior knowledge of the network size or of the transmission strategy of other sources.The goal of each source is to maximize their own throughput while striving for network-wide fairness.We propose a novel fully decentralized Reinforcement Learning (RL)-based solution that achieves fairness without coordination.The proposed Fair Share RL (FSRL) solution combines: (i) state augmentation with a semi-adaptive time reference; (ii) an architecture that leverages risk control and time difference likelihood; and (iii) a fairness-driven reward structure.We evaluate FSRL in more than 50 network settings with different number of agents, different amounts of available spectrum, in the presence of jammers, and in an ad-hoc setting.Simulation results suggest that, when we compare FSRL with a common baseline RL algorithm from the literature, FSRL can be up to 89.0% fairer (as measured by Jain's fairness index) in stringent settings with several sources and a single frequency band, and 48.1% fairer on average. 0.615 |
2025-03-31 |
Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy.Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms.This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification.The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024.The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency.The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. 0.6The efficiency component tests the latency of algorithm inference.The challenge was conducted as a part of MICCAI EndoVis 2024.In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day.This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery.In this paper we summarize the design, submissions, and results from the challenge.The challenge dataset is available here: https://zenodo.org/records/14803158 , and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics |
2025-03-31 |
BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models
In this research, we introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs).Building upon the BEATS framework, we present a bias benchmark for LLMs that measure performance across 29 distinct metrics. 0.619These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality related misinformation risk.These metrics enable a quantitative assessment of the extent to which LLM generated responses may perpetuate societal prejudices that reinforce or expand systemic inequities.To achieve a high score on this benchmark a LLM must show very equitable behavior in their responses, making it a rigorous standard for responsible AI evaluation.Empirical results based on data from our experiment show that, 37.65\% of outputs generated by industry leading models contained some form of bias, highlighting a substantial risk of using these models in critical decision making systems.BEATS framework and benchmark offer a scalable and statistically rigorous methodology to benchmark LLMs, diagnose factors driving biases, and develop mitigation strategies.With the BEATS framework, our goal is to help the development of more socially responsible and ethically aligned AI models. |
2025-03-31 |
Sample-Optimal Private Regression in Polynomial Time
We consider the task of privately obtaining prediction error guarantees in ordinary least-squares regression problems with Gaussian covariates (with unknown covariance structure).We provide the first sample-optimal polynomial time algorithm for this task under both pure and approximate differential privacy.We show that any improvement to the sample complexity of our algorithm would violate either statistical-query or information-theoretic lower bounds.Additionally, our algorithm is robust to a small fraction of arbitrary outliers and achieves optimal error rates as a function of the fraction of outliers. 0.639In contrast, all prior efficient algorithms either incurred sample complexities with sub-optimal dimension dependence, scaling with the condition number of the covariates, or obtained a polynomially worse dependence on the privacy parameters. Our technical contributions are two-fold: first, we leverage resilience guarantees of Gaussians within the sum-of-squares framework.As a consequence, we obtain efficient sum-of-squares algorithms for regression with optimal robustness rates and sample complexity.Second, we generalize the recent robustness-to-privacy framework[HKMN23, (arXiv:2212.05015)] to account for the geometry induced by the covariance of the input samples.This framework crucially relies on the robust estimators to be sum-of-squares algorithms, and combining the two steps yields a sample-optimal private regression algorithm.We believe our techniques are of independent interest, and we demonstrate this by obtaining an efficient algorithm for covariance-aware mean estimation, with an optimal dependence on the privacy parameters. |
2025-03-31 |
Contextual Preference Collaborative Measure Framework Based on Belief System
To reduce the human intervention in the preference measure process,this article proposes a preference collaborative measure framework based on an updated belief system,which is also capable of improving the accuracy and efficiency of preferen-ce measure algorithms.Firstly,the distance of rules and the average internal distance of rulesets are proposed for specifying the relationship between the rules.For discovering the most representative preferences that are common in all users,namely common preference,a algorithm based on average internal distance of ruleset,PRA algorithm,is proposed,which aims to finish the discoveryprocess with minimum information loss rate.Furthermore,the concept of Common belief is proposed to update the belief system,and the common preferences are the evidences of updated belief system.Then,under the belief system,the proposed belief degree and deviation degree are used to determine whether a rule confirms the belief system or not and classify the preference rules into two kinds(generalized or personalized),and eventually filters out Top-K interesting rules relying on belief degree and deviation degree.Based on above,a scalable interestingness calculation framework that can apply various formulas is proposed for accurately calculating interestingness in different conditions.At last,IMCos algorithm and IMCov algorithm are proposed as exemplars to verify the accuracy and efficiency of the framework by using weighted cosine similarity and correlation coefficients as belief degree.In experiments,the proposed algorithms are compared to two state-of-the-art algorithms and the results show that IMCos and IMCov outperform than the other two in most aspects. 0.636 |
2025-03-31 |
Accelerated Approximate Optimization of Multi-Commodity Flows on Directed Graphs
We provide m1+o(1)kϵ−1-time algorithms for computing multiplicative (1−ϵ)-approximate solutions to multi-commodity flow problems with k-commodities on m-edge directed graphs, including concurrent multi-commodity flow and maximum multi-commodity flow. To obtain our results, we provide new optimization tools of potential independent interest.First, we provide an improved optimization method for solving ℓq,p-regression problems to high accuracy. 0.605This method makes O~q,p(k) queries to a high accuracy convex minimization oracle for an individual block, where O~q,p(⋅) hides factors depending only on q, p, or poly(logm), improving upon the O~q,p(k2) bound of [Chen-Ye, ICALP 2024].As a result, we obtain the first almost-linear time algorithm that solves ℓq,p flows on directed graphs to high accuracy.Second, we present optimization tools to reduce approximately solving composite ℓ1,∞-regression problems to solving mo(1)ϵ−1 instances of composite ℓq,p-regression problem.The method builds upon recent advances in solving box-simplex games [Jambulapati-Tian, NeurIPS 2023] and the area convex regularizer introduced in [Sherman, STOC 2017] to obtain faster rates for constrained versions of the problem.Carefully combining these techniques yields our directed multi-commodity flow algorithm. |
2025-03-31 |
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets.In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model.This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths.In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction.Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning.We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion.By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction.Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets.Our code is publicly available for research purpose at https://easi3r.github.io/ 0.606 |
2025-03-27 |
Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing
Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse.We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code.Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation.Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code.Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8%) compared to our LLM jury system (73.2%), with manual verification confirming that single-judge assessments often accept incomplete implementations. 0.634Moreover, we demonstrate that our distributed architecture improves SRs by 12% over the non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts. |
2025-03-27 |
Audio-driven Gesture Generation via Deviation Feature in the Latent Space
Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions.While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations.We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation.Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation.By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production.Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques. 0.65 |
2025-03-27 |
The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection
In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point.This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results.We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images.It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects.We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO. 0.707Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts.We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (https://benchmark.mvtec.com/).All image data is available at https://www.mvtec.com/company/research/datasets/mvtec-ad-2. |
2025-03-27 |
ClusterSC: Advancing Synthetic Control with Donor Selection
In causal inference with observational studies, synthetic control (SC) has emerged as a prominent tool.SC has traditionally been applied to aggregate-level datasets, but more recent work has extended its use to individual-level data.As they contain a greater number of observed units, this shift introduces the curse of dimensionality to SC.To address this, we propose Cluster Synthetic Control (ClusterSC), based on the idea that groups of individuals may exist where behavior aligns internally but diverges between groups.ClusterSC incorporates a clustering step to select only the relevant donors for the target.We provide theoretical guarantees on the improvements induced by ClusterSC, supported by empirical demonstrations on synthetic and real-world datasets.The results indicate that ClusterSC consistently outperforms classical SC approaches. 0.636 |
2025-03-27 |
A Bespoke Design Approach to Low-Power Printed Microprocessors for Machine Learning Applications
Printed electronics have gained significant traction in recent years, presenting a viable path to integrating computing into everyday items, from disposable products to low-cost healthcare.However, the adoption of computing in these domains is hindered by strict area and power constraints, limiting the effectiveness of general-purpose microprocessors.This paper proposes a bespoke microprocessor design approach to address these challenges, by tailoring the design to specific applications and eliminating unnecessary logic.Targeting machine learning applications, we further optimize core operations by integrating a SIMD MAC unit supporting 4 precision configurations that boost the efficiency of microprocessors.Our evaluation across 6 ML models and the large-scale Zero-Riscy core, shows that our methodology can achieve improvements of 22.2%, 23.6%, and 33.79% in area, power, and speed, respectively, without compromising accuracy. 0.6Against state-of-the-art printed processors, our approach can still offer significant speedups, but along with some accuracy degradation. 0.654This work explores how such trade-offs can enable low-power printed microprocessors for diverse ML applications. |
2025-03-27 |
AMA-SAM: Adversarial Multi-Domain Alignment of Segment Anything Model for High-Fidelity Histology Nuclei Segmentation
Accurate segmentation of cell nuclei in histopathology images is essential for numerous biomedical research and clinical applications.However, existing cell nucleus segmentation methods only consider a single dataset (i.e., primary domain), while neglecting to leverage supplementary data from diverse sources (i.e., auxiliary domains) to reduce overfitting and enhance the performance.Although incorporating multiple datasets could alleviate overfitting, it often exacerbates performance drops caused by domain shifts.In this work, we introduce Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM) that extends the Segment Anything Model (SAM) to overcome these obstacles through two key innovations.First, we propose a Conditional Gradient Reversal Layer (CGRL), a multi-domain alignment module that harmonizes features from diverse domains to promote domain-invariant representation learning while preserving crucial discriminative features for the primary dataset.Second, we address SAM's inherent low-resolution output by designing a High-Resolution Decoder (HR-Decoder), which directly produces fine-grained segmentation maps in order to capture intricate nuclei boundaries in high-resolution histology images.To the best of our knowledge, this is the first attempt to adapt SAM for multi-dataset learning with application to histology nuclei segmentation.We validate our method on several publicly available datasets, demonstrating consistent and significant improvements over state-of-the-art approaches. 0.645 |
2025-03-27 |
Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance
Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences.We propose cFreD, a metric based on the notion of Conditional Fr\'echet Distance that explicitly accounts for both visual fidelity and text-prompt alignment.Existing metrics such as Inception Score (IS), Fr\'echet Inception Distance (FID) and CLIPScore assess either image quality or image-text alignment but not both which limits their correlation with human preferences.Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs.Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences.Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field.We release our evaluation toolkit and benchmark in the appendix. 0.621 |
2025-03-27 |
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Ensuring the reliability and effectiveness of software release decisions is critical, particularly in safety-critical domains like automotive systems.Precise analysis of release validation data, often presented in tabular form, plays a pivotal role in this process.However, traditional methods that rely on manual analysis of extensive test datasets and validation metrics are prone to delays and high costs.Large Language Models (LLMs) offer a promising alternative but face challenges in analytical reasoning, contextual understanding, handling out-of-scope queries, and processing structured test data consistently; limitations that hinder their direct application in safety-critical scenarios.This paper introduces GateLens, an LLM-based tool for analyzing tabular data in the automotive domain.GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code.It outperforms the baseline system on benchmarking datasets, achieving higher F1 scores and handling complex and ambiguous queries with greater robustness. 0.601Ablation studies confirm the critical role of the RA module, with performance dropping sharply when omitted.Industrial evaluations reveal that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability.As demonstrated by presented results, GateLens achieved high performance without relying on few-shot examples, showcasing strong generalization across various query types from diverse company roles.Insights from deploying GateLens with a partner automotive company offer practical guidance for integrating AI into critical workflows such as release validation.Results show that by automating test result analysis, GateLens enables faster, more informed, and dependable release decisions, and can thus advance software scalability and reliability in automotive systems. |
2025-03-27 |
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity.Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024×1024 images.Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance.To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation.Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). 0.604Our codes, models, datasets, and demo are publicly available. |
2025-03-27 |
A Unified Framework for Diffusion Bridge Problems: Flow Matching and Schrödinger Matching into One
The bridge problem is to find an SDE (or sometimes an ODE) that bridges two given distributions.The application areas of the bridge problem are enormous, among which the recent generative modeling (e.g., conditional or unconditional image generation) is the most popular.Also the famous Schr\"{o}dinger bridge problem, a widely known problem for a century, is a special instance of the bridge problem.Two most popular algorithms to tackle the bridge problems in the deep learning era are: (conditional) flow matching and iterative fitting algorithms, where the former confined to ODE solutions, and the latter specifically for the Schr\"{o}dinger bridge problem.The main contribution of this article is in two folds: i) We provide concise reviews of these algorithms with technical details to some extent; ii) We propose a novel unified perspective and framework that subsumes these seemingly unrelated algorithms (and their variants) into one. 0.618In particular, we show that our unified framework can instantiate the Flow Matching (FM) algorithm, the (mini-batch) optimal transport FM algorithm, the (mini-batch) Schr\"{o}dingerbridge FM algorithm, and the deep Schr\"{o}dinger bridge matching (DSBM) algorithm as its special cases.We believe that this unified framework will be useful for viewing the bridge problems in a more general and flexible perspective, and in turn can help researchers and practitioners to develop new bridge algorithms in their fields. |
2025-03-27 |
Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying
Open-vocabulary querying in 3D Gaussian Splatting aims to identify semantically relevant regions within a 3D Gaussian representation based on a given text query.Prior work, such as LangSplat, addressed this task by retrieving these regions in the form of segmentation masks on 2D renderings.More recently, OpenGaussian introduced point-level querying, which directly selects a subset of 3D Gaussians.In this work, we propose a point-level querying method that builds upon LangSplat's framework.Our approach improves the framework in two key ways: (a) we leverage masklets from the Segment Anything Model 2 (SAM2) to establish semantic consistent ground-truth for distilling the language Gaussians; (b) we introduces a novel two-step querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians.Experimental evaluations on three benchmark datasets demonstrate that the proposed method achieves better performance compared to state-of-the-art approaches. 0.75For instance, our method achieves an mIoU improvement of +20.42 on the 3D-OVS dataset. |
2025-03-27 |
Optimal Stepsize for Diffusion Sampling
Diffusion models achieve remarkable generation quality but suffer from computational intensive sampling due to suboptimal step discretization.While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules.This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories.By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation.Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules.Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval.Our code is available at https://github.com/bebebe666/OptimalSteps. 0.691 |
2025-03-27 |
X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows.Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality.In this paper, We propose X2-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning.Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization.To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization.Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. 0.611By unifying continuous motion modeling with hardware-free period learning, X2-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging.Project website at: https://x2-gaussian.github.io/. |
2025-03-26 |
Benchmarking and optimizing organism wide single-cell RNA alignment methods
Many methods have been proposed for removing batch effects and aligning single-cell RNA (scRNA) datasets.However, performance is typically evaluated based on multiple parameters and few datasets, creating challenges in assessing which method is best for aligning data at scale. 0.646Here, we introduce the K-Neighbors Intersection (KNI) score, a single score that both penalizes batch effects and measures accuracy at cross-dataset cell-type label prediction alongside carefully curated small (scMARK) and large (scREF) benchmarks comprising 11 and 46 human scRNA studies respectively, where we have standardized author labels.Using the KNI score, we evaluate and optimize approaches for cross-dataset single-cell RNA integration.We introduce Batch Adversarial single-cell Variational Inference (BA-scVI), as a new variant of scVI that uses adversarial training to penalize batch-effects in the encoder and decoder, and show this approach outperforms other methods.In the resulting aligned space, we find that the granularity of cell-type groupings is conserved, supporting the notion that whole-organism cell-type maps can be created by a single model without loss of information. |
2025-03-26 |
SChanger: Change Detection from a Semantic Change and Spatial Consistency Perspective
Change detection is a key task in Earth observation applications.Recently, deep learning methods have demonstrated strong performance and widespread application.However, change detection faces data scarcity due to the labor-intensive process of accurately aligning remote sensing images of the same area, which limits the performance of deep learning algorithms.To address the data scarcity issue, we develop a fine-tuning strategy called the Semantic Change Network (SCN).We initially pre-train the model on single-temporal supervised tasks to acquire prior knowledge of instance feature extraction.The model then employs a shared-weight Siamese architecture and extended Temporal Fusion Module (TFM) to preserve this prior knowledge and is fine-tuned on change detection tasks.The learned semantics for identifying all instances is changed to focus on identifying only the changes.Meanwhile, we observe that the locations of changes between the two images are spatially identical, a concept we refer to as spatial consistency.We introduce this inductive bias through an attention map that is generated by large-kernel convolutions and applied to the features from both time points.This enhances the modeling of multi-scale changes and helps capture underlying relationships in change detection semantics.We develop a binary change detection model utilizing these two strategies.The model is validated against state-of-the-art methods on six datasets, surpassing all benchmark methods and achieving F1 scores of 92.87%, 86.43%, 68.95%, 97.62%, 84.58%, and 93.20% on the LEVIR-CD, LEVIR-CD+, S2Looking, CDD, SYSU-CD, and WHU-CD datasets, respectively. 0.62 |
2025-03-26 |
Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework
This paper introduces a theoretical framework for a Transformer-augmented, sectional Mixture-of-Experts (MoE) architecture that aims to enhance computational efficiency while preserving model scalability.Unlike conventional MoE models, which route entire token embeddings to selected experts, our approach portions the embedding dimension itself -- assigning segments of each token's representation to dedicated experts.To combat losses in token representation, we utilize a pre-expert transformer layer to recompute attention across tokens and reduce the sequence length dimensionality.We extend our theory by deriving optimal scaling laws that a non-linear relationship between the number of experts and factors such as model dimensionality, sequence length, and system overhead.These formulations yield closed-form and numerically-solvable expressions for identifying the optimal expert count under given architectural and hardware constraints.As a result, our framework not only provides theoretical bounds for computing efficiency with varying frameworks but also guides practical design choices for scaling large models effectively.While empirical validation is pending, we present a comprehensive experimental road map to evaluate the framework's efficiency, scalability, and practicality in future work. 0.667 |
2025-03-26 |
ASGO: Adaptive Structured Gradient Optimization
Training deep neural networks (DNNs) is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than simple vectors.Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block-wise diagonal.These structured properties are crucial for designing efficient optimization algorithms but may not be utilized by current popular optimizers like Adam.In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients.By fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. 0.612Based on the convergence theory, we further demonstrate that ASGO can benefit from the low-rank and block-wise diagonal properties.We also discuss practical modifications of ASGO and empirically verify the effectiveness of the algorithm on language model tasks. |
2025-03-26 |
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training.To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing.AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound.It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation.We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes.To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits.AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. 0.627Results are available at https://genjib.github.io/project_page/AVED/index.html |
LLMs |
|
2025-04-03 |
Affordable AI Assistants with Knowledge Graph of Thoughts
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. 0.665However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. 0.679To address these issues, we propose the Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). 0.629KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts.Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively.For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini, while reducing costs by over 36x compared to GPT-4o.Improvements for recent reasoning models are similar, e.g., 36% and 37.5% for Qwen2.5-32B and Deepseek-R1-70B, respectively.KGoT offers a scalable, affordable, and high-performing solution for AI assistants. |
2025-04-03 |
LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems
Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning.Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans to solve.Despite advancements in AI, particularly with large language models (LLMs) in various reasoning tasks, FPs remain relatively under-explored.This work conducted an exploratory study to examine the capabilities and limitations of LLMs in solving FPs. 0.72We first evaluated the overall performance of three advanced LLMs using a publicly available FP dataset. 0.7We designed prompts according to the recently proposed TELeR taxonomy, including a zero-shot scenario.Results indicated that all three LLMs achieved a fp_score (range between 0 - 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. 0.653To further investigate, we categorized FPs into standard and specific questions, hypothesizing that LLMs would perform better on standard questions, which are characterized by clarity and conciseness, than on specific ones. 0.669Comparative experiments confirmed this hypothesis, demonstrating that LLMs performed better on standard FPs in terms of both accuracy and efficiency. 0.723 |
2025-04-03 |
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations.However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages.Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanism generalize to multilingual settings.To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. 0.697We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints.Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. 0.662These findings underscore the need for language-specific fine-tuning to ensure fair, reliable and robust multilingual alignment.Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages. 0.752 |
2025-04-03 |
TeleMoM: Consensus-Driven Telecom Intelligence via Mixture of Models
Large language models (LLMs) face significant challenges in specialized domains like telecommunication (Telecom) due to technical complexity, specialized terminology, and rapidly evolving knowledge. 0.701Traditional methods, such as scaling model parameters or retraining on domain-specific corpora, are computationally expensive and yield diminishing returns, while existing approaches like retrieval-augmented generation, mixture of experts, and fine-tuning struggle with accuracy, efficiency, and coordination.To address this issue, we propose Telecom mixture of models (TeleMoM), a consensus-driven ensemble framework that integrates multiple LLMs for enhanced decision-making in Telecom.TeleMoM employs a two-stage process: proponent models generate justified responses, and an adjudicator finalizes decisions, supported by a quality-checking mechanism. 0.606This approach leverages strengths of diverse models to improve accuracy, reduce biases, and handle domain-specific complexities effectively.Evaluation results demonstrate that TeleMoM achieves a 9.7\% increase in answer accuracy, highlighting its effectiveness in Telecom applications. |
2025-04-03 |
Computing High-dimensional Confidence Sets for Arbitrary Distributions
We study the problem of learning a high-density region of an arbitrary distribution over Rd. Given a target coverage parameter δ, and sample access to an arbitrary distribution D, we want to output a confidence set S⊂Rd such that S achieves δ coverage of D, i.e., $\mathbb{P}_{y \sim D} \left[ y \in S \right] \ge \delta,andthevolumeofS$ is as small as possible. 0.611This is a central problem in high-dimensional statistics with applications in finding confidence sets, uncertainty quantification, and support estimation. In the most general setting, this problem is statistically intractable, so we restrict our attention to competing with sets from a concept class C with bounded VC-dimension.An algorithm is competitive with class C if, given samples from an arbitrary distribution D, it outputs in polynomial time a set that achieves δ coverage of D, and whose volume is competitive with the smallest set in C with the required coverage δ.This problem is computationally challenging even in the basic setting when C is the set of all Euclidean balls.Existing algorithms based on coresets find in polynomial time a ball whose volume is exp(O~(d/logd))-factor competitive with the volume of the best ball. Our main result is an algorithm that finds a confidence set whose volume is exp(O~(d2/3)) factor competitive with the optimal ball having the desired coverage.The algorithm is improper (it outputs an ellipsoid).Combined with our computational intractability result for proper learning balls within an exp(O~(d1−o(1))) approximation factor in volume, our results provide an interesting separation between proper and (improper) learning of confidence sets. |
2025-04-03 |
ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. 0.648Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks.In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. 0.631Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy.Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency. 0.703 |
2025-04-03 |
Why do LLMs attend to the first token?
Large Language Models (LLMs) tend to attend heavily to the first token in the sequence -- creating a so-called attention sink. 0.655Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it.Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention.Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? 0.73In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. 0.672We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour.We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training. 0.675 |
2025-04-03 |
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study
Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. 0.679Existing methods to enhance LLM robustness are primarily focused on perturbed data samples, whereas improving resiliency to perturbations of task-level instructions has remained relatively underexplored. 0.697In this work, we focus on character- and word-level edits of task-specific instructions, which substantially degrade downstream performance.We experiment with a variety of techniques to enhance the robustness of LLMs, including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2) and instructions (both task-oriented and role-oriented). 0.676We find that, on average, self-denoising -- whether performed by a frozen LLM or a fine-tuned model -- achieves substantially higher performance gains than alternative strategies, including more complex baselines such as ensembling and supervised methods. |
2025-04-03 |
How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices?
The spread of scientific knowledge depends on how researchers discover and cite previous work.The adoption of large language models (LLMs) in the scientific research process introduces a new layer to these citation practices. 0.686However, it remains unclear to what extent LLMs align with human citation practices, how they perform across domains, and may influence citation dynamics. 0.734Here, we show that LLMs systematically reinforce the Matthew effect in citations by consistently favoring highly cited papers when generating references. 0.694This pattern persists across scientific domains despite significant field-specific variations in existence rates, which refer to the proportion of generated references that match existing records in external bibliometric databases.Analyzing 274,951 references generated by GPT-4o for 10,000 papers, we find that LLM recommendations diverge from traditional citation patterns by preferring more recent references with shorter titles and fewer authors. 0.622Emphasizing their content-level relevance, the generated references are semantically aligned with the content of each paper at levels comparable to the ground truth references and display similar network effects while reducing author self-citations.These findings illustrate how LLMs may reshape citation practices and influence the trajectory of scientific discovery by reflecting and amplifying established trends. 0.744As LLMs become more integrated into the scientific research process, it is important to understand their role in shaping how scientific communities discover and build upon prior work. 0.775 |
2025-04-03 |
MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages, 6 linguistic phenomena and containing more than 125,000 minimal pairs.Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph.MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages. 0.702 |
2025-04-03 |
BT-ACTION: A Test-Driven Approach for Modular Understanding of User Instruction Leveraging Behaviour Trees and LLMs
Natural language instructions are often abstract and complex, requiring robots to execute multiple subtasks even for seemingly simple queries.For example, when a user asks a robot to prepare avocado toast, the task involves several sequential steps.Moreover, such instructions can be ambiguous or infeasible for the robot or may exceed the robot's existing knowledge.While Large Language Models (LLMs) offer strong language reasoning capabilities to handle these challenges, effectively integrating them into robotic systems remains a key challenge. 0.689To address this, we propose BT-ACTION, a test-driven approach that combines the modular structure of Behavior Trees (BT) with LLMs to generate coherent sequences of robot actions for following complex user instructions, specifically in the context of preparing recipes in a kitchen-assistance setting.We evaluated BT-ACTION in a comprehensive user study with 45 participants, comparing its performance to direct LLM prompting. 0.647Results demonstrate that the modular design of BT-ACTION helped the robot make fewer mistakes and increased user trust, and participants showed a significant preference for the robot leveraging BT-ACTION.The code is publicly available at https://github.com/1Eggbert7/BT_LLM. |
2025-04-03 |
From Consumption to Collaboration: Measuring Interaction Patterns to Augment Human Cognition in Open-Ended Tasks
The rise of Generative AI, and Large Language Models (LLMs) in particular, is fundamentally changing cognitive processes in knowledge work, raising critical questions about their impact on human reasoning and problem-solving capabilities. 0.701As these AI systems become increasingly integrated into workflows, they offer unprecedented opportunities for augmenting human thinking while simultaneously risking cognitive erosion through passive consumption of generated answers.This tension is particularly pronounced in open-ended tasks, where effective solutions require deep contextualization and integration of domain knowledge.Unlike structured tasks with established metrics, measuring the quality of human-LLM interaction in such open-ended tasks poses significant challenges due to the absence of ground truth and the iterative nature of solution development. 0.674To address this, we present a framework that analyzes interaction patterns along two dimensions: cognitive activity mode (exploration vs. exploitation) and cognitive engagement mode (constructive vs. detrimental).This framework provides systematic measurements to evaluate when LLMs are effective tools for thought rather than substitutes for human cognition, advancing theoretical understanding and practical guidance for developing AI systems that protect and augment human cognitive capabilities. 0.696 |
2025-04-03 |
A Framework for Robust Cognitive Evaluation of LLMs
Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. 0.63A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimen-tal pipelines have not yet been established. 0.69To address this gap we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. 0.693The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates.Our experiments demonstrate that these features lead to more robust experimental outcomes.Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state of the art LLMs. 0.636CognitivEval will be released publicly to foster broader collaboration within the cognitive science community. |
2025-04-03 |
A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
The detection and intervention of mental health issues represent a critical global research focus, and social media data has been recognized as an important resource for mental health research.However, how to utilize Large Language Models (LLMs) for mental health problem detection on social media poses significant challenges. 0.632Hence, this paper aims to explore the potential of LLM applications in social media data analysis, focusing not only on the most common psychological disorders such as depression and anxiety but also incorporating psychotic disorders and externalizing disorders, summarizing the application methods of LLM from different dimensions, such as text data analysis and detection of mental disorders, and revealing the major challenges and shortcomings of current research. 0.615In addition, the paper provides an overview of popular datasets, and evaluation metrics.The survey in this paper provides a comprehensive frame of reference for researchers in the field of mental health, while demonstrating the great potential of LLMs in mental health detection to facilitate the further application of LLMs in future mental health interventions. 0.669 |
2025-04-03 |
MegaMath: Pushing the Limits of Open Math Corpora
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs).However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. 0.625We present MegaMath, an open dataset curated from diverse, math-focused sources through following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet.(2) Recalling Math-related code data: We identified high quality math-related code from large code training corpus, Stack-V2, further enhancing data diversity.(3) Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data.By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets. |
2025-04-03 |
Generative Evaluation of Complex Reasoning in Large Language Models
With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? 0.707Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. 0.629To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. 0.706KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. 0.672Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization.We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. 0.716Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. 0.725Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities. 0.654 |
2025-04-03 |
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs).In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations.Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy).Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder, directly steer output from multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. 0.608These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs. |
Developer Research |
|
2025-04-03 |
The Myth of Immutability: A Multivocal Review on Smart Contract Upgradeability
The immutability of smart contracts on blockchain platforms like Ethereum promotes security and trustworthiness but presents challenges for updates, bug fixes, or adding new features post-deployment.These limitations can lead to vulnerabilities and outdated functionality, impeding the evolution and maintenance of decentralized applications.Despite various upgrade mechanisms proposed in academic research and industry, a comprehensive analysis of their trade-offs and practical implications is lacking.This study aims to systematically identify, classify, and evaluate existing smart contract upgrade mechanisms, bridging the gap between theoretical concepts and practical implementations.It introduces standardized terminology and evaluates the trade-offs of different approaches using software quality attributes.We conducted a Multivocal Literature Review (MLR) to analyze upgrade mechanisms from both academic research and industry practice.We first establish a unified definition of smart contract upgradeability and identify core components essential for understanding the upgrade process.Based on this definition, we classify existing methods into full upgrade and partial upgrade approaches, introducing standardized terminology to harmonize the diverse terms used in the literature.We then characterize each approach and assess its benefits and limitations using software quality attributes such as complexity, flexibility, security, and usability.The analysis highlights significant trade-offs among upgrade mechanisms, providing valuable insights into the benefits and limitations of each approach.These findings guide developers and researchers in selecting mechanisms tailored to specific project requirements. 0.611 |
2025-04-02 |
The Factors Influencing Well-Being in Software Engineers: A Cross-Country Mixed-Method Study
The well-being of software engineers is increasingly under strain due to the high-stress nature of their roles, which involve complex problem-solving, tight deadlines, and the pressures of rapidly evolving technologies. Despite increasing recognition of mental health challenges in software engineering, few studies focus on the factors that sustain or undermine well-being. 0.609Existing research often overlooks the interaction between personal, collaborative, and organisational influences on this unique population.This study fills this gap by investigating the specific factors affecting the well-being of software engineers. 0.63We conducted 15 qualitative interviews and complemented them with a confirmatory cross-country survey to validate and extend our findings to a broader population.Our mixed-methods approach provides a robust framework to identify key factors influencing well-being, including personal perceptions of well-being, interpersonal and collaborative dynamics, workplace support and recognition, organisational culture, and specific stressors inherent to software engineering. By offering a detailed, context-specific exploration of these factors, our study builds on existing literature and provides actionable insights for improving well-being in software engineering. 0.605We conclude with policy recommendations to inform organisational strategies and develop targeted interventions that address the specific challenges of this field, contributing to more sustainable and supportive work environments. |
2025-04-02 |
Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks
Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLM) to assist them with their coding tasks.This makes it crucial to align these tools with human values to prevent malicious misuse.In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain.We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently, create a dataset of prompts based on this taxonomy. 0.614To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs both open-source and closed-source models, as well as general-purpose and code-specific LLMs.Furthermore, we investigate the impact of models size, architecture family, and alignment strategies on their tendency to generate harmful content.The results show significant disparities in the alignment of various LLMs for harmlessness.We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts.Notably, some fine-tuned models perform significantly worse than their base-models due to their design choices.On the other side, we find that larger models tend to be more helpful and are less likely to respond with harmful information.These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area. |
2025-04-02 |
Build Code Needs Maintenance Too: A Study on Refactoring and Technical Debt in Build Systems
In modern software engineering, build systems play the crucial role of facilitating the conversion of source code into software artifacts. 0.652Recent research has explored high-level causes of build failures, but has largely overlooked the structural properties of build files.Akin to source code, build systems face technical debt challenges that hinder maintenance and optimization. 0.618While refactoring is often seen as a key tool for addressing technical debt in source code, there is a significant research gap regarding the specific refactoring changes developers apply to build code and whether these refactorings effectively address technical debt. 0.642In this paper, we address this gap by examining refactorings applied to build scripts in open-source projects, covering the widely used build systems of Gradle, Ant, and Maven.Additionally, we investigate whether these refactorings are used to tackle technical debts in build systems.Our analysis was conducted on \totalCommits examined build-file-related commits.We identified \totalRefactoringCategories build-related refactorings, which we divided into \totalCategories main categories.These refactorings are organized into the first empirically derived taxonomy of build system refactorings. 0.625Furthermore, we investigate how developers employ these refactoring types to address technical debts via a manual commit-analysis and a developer survey. 0.661In this context, we identified \totalTechnicalDebts technical debts addressed by these refactorings and discussed their correlation with the different refactorings.Finally, we introduce BuildRefMiner, an LLM-powered tool leveraging GPT-4o to automate the detection of refactorings within build systems.We evaluated its performance and found that it achieves an F1 score of \toolFoneScore across all build systems. |
2025-03-31 |
MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
Modern code generation has made significant strides in functional correctness and execution efficiency.However, these systems often overlook a critical dimension in real-world software development: maintainability.To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution.It integrates Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and improve adaptability.We also introduce MaintainBench, a benchmark comprising requirement changes and corresponding dynamic metrics on maintainance effort.Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. 0.622In contrast, MaintainCoder improves maintainability metrics by 14-30% with even higher correctness, i.e. pass@k.Our work not only provides the foundation of maintainable code generation, but also highlights the need for more holistic code quality research.Resources: https://github.com/IAAR-Shanghai/MaintainCoder. |
2025-03-27 |
Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs
Repository-level software repair faces challenges in bridging semantic gaps between issue descriptions and code patches. 0.641Existing approaches, which mostly depend on large language models (LLMs), suffer from semantic ambiguities, limited structural context understanding, and insufficient reasoning capability.To address these limitations, we propose KGCompass with two innovations: (1) a novel repository-aware knowledge graph (KG) that accurately links repository artifacts (issues and pull requests) and codebase entities (files, classes, and functions), allowing us to effectively narrow down the vast search space to only 20 most relevant functions with accurate candidate bug locations and contextual information, and (2) a path-guided repair mechanism that leverages KG-mined entity path, tracing through which allows us to augment LLMs with relevant contextual information to generate precise patches along with their explanations.Experimental results in the SWE-Bench-Lite demonstrate that KGCompass achieves state-of-the-art repair performance (45.67%) and function-level localization accuracy (51.33%) across open-source approaches, costing only $0.20 per repair.Our analysis reveals that among successfully localized bugs, 69.7% require multi-hop traversals through the knowledge graph, without which LLM-based approaches struggle to accurately locate bugs.The knowledge graph built in KGCompass is language agnostic and can be incrementally updated, making it a practical solution for real-world development environments. |
2025-03-25 |
SLA-Awareness for AI-assisted coding
The integration of AI-assisted coding tools within development environments drastically reduces development time, and allows developers to focus more on creative and critical aspects of software engineering through the use of Code Large Language Models (CodeLLMs). 0.626These coding assistants automate repetitive and time-consuming coding tasks such as code generation, code completion, code summarization, and code translation.Responsiveness is a crucial requirement of these coding assistants to maintain real-time interactivity, such that their use does not impede the developers' workflows.Different coding tasks have unique characteristics and latency requirements: Time-To-First-Token (TTFT) latency is essential for code completion tasks, while End-To-End (E2E) latency is crucial for code translation tasks.Managing these varying requirements simultaneously while optimizing resource usage poses significant challenges.Existing work adopts the Model-as-a-Service paradigm for serving individual CodeLLMs, but cannot effectively manage latency requirements of concurrent coding tasks and sequences of CodeLLM inference calls, due to a lack of end-to-end latency awareness.Another challenge is keeping resource utilization high, when the serving system is deployed on a shared cluster environment.To address these challenges, we propose Coding Assistant Task Orchestrator (CATO), a runtime system designed to serve a diverse assortment of coding tasks while meeting latency requirements and maximizing resource utilization.Our experiments demonstrate that when all types of coding tasks were served simultaneously, for TTFT-critical tasks, CATO improves overall Goodput rate and resource utilization by up to 10% and 41.1%, respectively.P95 E2E latency was also reduced by 18% for code summarization tasks, and P95 TTFT for code generation tasks were reduced by 14% compared against state-of-the-art systems. |
2025-03-18 |
MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration
Maintaining and scaling software systems relies heavily on effective code refactoring, yet this process remains labor-intensive, requiring developers to carefully analyze existing codebases and prevent the introduction of new defects. 0.652Although recent advancements have leveraged Large Language Models (LLMs) to automate refactoring tasks, current solutions are constrained in scope and lack mechanisms to guarantee code compilability and successful test execution.In this work, we introduce MANTRA, a comprehensive LLM agent-based framework that automates method-level refactoring.MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning to emulate human decision-making during refactoring while preserving code correctness and readability.Our empirical study, conducted on 703 instances of "pure refactorings" (i.e., code changes exclusively involving structural improvements), drawn from 10 representative Java projects, covers the six most prevalent refactoring operations.Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model (RawGPT ), achieving an 82.8% success rate (582/703) in producing code that compiles and passes all tests, compared to just 8.7% (61/703) with RawGPT.Moreover, in comparison to IntelliJ's LLM-powered refactoring tool (EM-Assist), MANTRA exhibits a 50% improvement in generating Extract Method transformations.A usability study involving 37 professional developers further shows that refactorings performed by MANTRA are perceived to be as readable and reusable as human-written code, and in certain cases, even more favorable. 0.652These results highlight the practical advantages of MANTRA and emphasize the growing potential of LLM-based systems in advancing the automation of software refactoring tasks. |
Data Annotation Techniques |