Vincent's Arxiv FrontPage


Generated on 2025-04-01.


This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions.
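As a rough illustration of the filtering step described above, here is a minimal sketch. It assumes each abstract is split into sentences, each sentence gets a score, and the abstract is kept when its best sentence clears a threshold. The names `split_sentences`, `keyword_score`, `best_sentence`, and the 0.7 threshold are all hypothetical; the real page uses a trained sentence-model, for which the keyword check below is only a stand-in.

```python
import re
from typing import Callable, List, Optional, Tuple


def split_sentences(abstract: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # It will also split on abbreviations such as "e.g. ", which is fine for a sketch.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def keyword_score(sentence: str) -> float:
    # Hypothetical stand-in for the sentence-model: high score if the
    # sentence mentions a dataset, low score otherwise.
    return 0.9 if "dataset" in sentence.lower() else 0.1


def best_sentence(
    abstract: str,
    score: Callable[[str], float],
    threshold: float = 0.7,  # assumed cut-off, not taken from the page
) -> Optional[Tuple[float, str]]:
    # Score every sentence and keep the abstract only if the top score
    # reaches the threshold; return that (score, sentence) pair.
    scored = [(score(s), s) for s in split_sentences(abstract)]
    top = max(scored, default=(0.0, ""))
    return top if top[0] >= threshold else None
```

Swapping `keyword_score` for a real model only requires a function that maps a sentence to a number between 0 and 1.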


New Datasets

2025-03-31

FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics

The rapid and unrestrained advancement of generative artificial intelligence (AI) presents a double-edged sword: while enabling unprecedented creativity, it also facilitates the generation of highly convincing deceptive content, undermining societal trust. As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task: it necessitates interpretable, context-aware methodologies that enhance trustworthiness and transparency. However, existing detection models primarily focus on classification, offering limited explanatory insights into image authenticity. In this work, we propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies AI-synthetic images with high accuracy but also provides rich, interpretable, and query-driven forensic insights. We first construct the FakeChain dataset, which contains linguistic authenticity reasoning based on visual trace evidence, developed through a novel human-machine collaborative framework. Building upon it, we further present FakeInstruct, the largest multimodal instruction tuning dataset containing 2 million visual instructions tailored to enhance forensic awareness in LMMs. (score: 0.761) FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. It can distinguish synthetic images with high accuracy while offering coherent and insightful explanations, free-form discussions on fine-grained forgery attributes, and actionable enhancement strategies. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative capability on detection, enabled by our proposed token-based probability estimation strategy. Furthermore, FakeScope exhibits strong generalization and in-the-wild ability, ensuring its applicability in real-world scenarios.

2025-03-31

Visual Acoustic Fields

Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. (score: 0.835) Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.

2025-03-31

Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge

Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: https://zenodo.org/records/14803158, and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics (score: 0.749)

2025-03-31

Faster Releases, Fewer Risks: A Study on Maven Artifact Vulnerabilities and Lifecycle Management

In modern software ecosystems, dependency management plays a critical role in ensuring secure and maintainable applications. However, understanding the relationship between release practices and their impact on vulnerabilities and update cycles remains a challenge. In this study, we analyze the release histories of 10,000 Maven artifacts, covering over 203,000 releases and 1.7 million dependencies. (score: 0.72) We evaluate how release speed affects software security and lifecycle. Our results show an inverse relationship between release speed and dependency outdatedness. Artifacts with more frequent releases maintain significantly shorter outdated times. We also find that faster release cycles are linked to fewer CVEs in dependency chains, indicating a strong negative correlation. These findings emphasize the importance of accelerated release strategies in reducing security risks and ensuring timely updates. Our research provides valuable insights for software developers, maintainers, and ecosystem managers.

2025-03-31

InstructRestore: Region-Customized Image Restoration with Human Instructions

Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. (score: 0.746) With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. (score: 0.924) We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image detail enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be available at https://github.com/shuaizhengliu/InstructRestore.git.

2025-03-31

Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. (score: 0.842) Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves ~77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.

2025-03-27

Dataset and Analysis of Long-Term Skill Acquisition in Robot-Assisted Minimally Invasive Surgery

Objective: We aim to investigate long-term robotic surgical skill acquisition among surgical residents and the effects of training intervals and fatigue on performance. Methods: For six months, surgical residents participated in three training sessions once a month, surrounding a single 26-hour hospital shift. In each shift, they participated in training sessions scheduled before, during, and after the shift. In each training session, they performed three dry-lab training tasks: Ring Tower Transfer, Knot-Tying, and Suturing. We collected a comprehensive dataset, including videos synchronized with kinematic data, activity tracking, and scans of the suturing pads. (score: 0.84) Results: We collected a dataset of 972 trials performed by 18 residents of different surgical specializations. Participants demonstrated consistent performance improvement across all tasks. In addition, we found variations in between-shift learning and forgetting across metrics and tasks, and hints of possible effects of fatigue. Conclusion: The findings from our first analysis shed light on the long-term learning processes of robotic surgical skills with extended intervals and varying levels of fatigue. Significance: This study lays the groundwork for future research aimed at optimizing training protocols and enhancing AI applications in surgery, ultimately contributing to improved patient outcomes. The dataset will be made available upon acceptance of our journal submission. (score: 0.938)

2025-03-27

UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphical user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. (score: 0.74) We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, the action type accuracy improves by 15%, while grounding accuracy increases by 10.3%, compared with the base model (i.e., Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K data. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.
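For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses sampled for the same prompt. The sketch below shows that normalization step only. It is not the UI-R1 implementation: the function name is hypothetical, the example rewards are made up, and using the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    # Normalize each reward against the mean and spread of its own group,
    # so no separate value network is needed to estimate a baseline.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four sampled actions on one task:
# 1.0 if the predicted action passed the rule check, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages come out as zero, so that group contributes no learning signal.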

2025-03-27

The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection

In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point. This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results. We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images. It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects. We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO. Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts. We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (https://benchmark.mvtec.com/). All image data is available at https://www.mvtec.com/company/research/datasets/mvtec-ad-2. (score: 0.712)

2025-03-27

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. (score: 0.771) The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA. (score: 0.767)

2025-03-27

JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai Community

This paper introduces JiraiBench, the first bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content across Chinese and Japanese social media communities. Focusing on the transnational "Jirai" (landmine) online subculture that encompasses multiple forms of self-destructive behaviors including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework incorporating both linguistic and cultural dimensions. Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts with multidimensional annotation along three behavioral categories, achieving substantial inter-annotator agreement. (score: 0.842) Experimental evaluations across four state-of-the-art models reveal significant performance variations based on instructional language, with Japanese prompts unexpectedly outperforming Chinese prompts when processing Chinese content. This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks. Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between these language systems without explicit target language training. These findings highlight the need for culturally informed approaches to multilingual content moderation and provide empirical evidence for the importance of cultural context in developing more effective detection systems for vulnerable online communities.

2025-03-27

CMED: A Child Micro-Expression Dataset

Micro-expressions are short bursts of emotion that are difficult to hide. Their detection in children is an important cue to assist psychotherapists in conducting better therapy. However, existing research on the detection of micro-expressions has focused on adults, whose expressions differ in their characteristics from those of children. The lack of research is a direct consequence of the lack of a child-based micro-expressions dataset, as it is much more challenging to capture children's facial expressions due to the lack of predictability and controllability. This study compiles a dataset of spontaneous child micro-expression videos, the first of its kind, to the best of the authors' knowledge. The dataset is captured in the wild using video conferencing software. (score: 0.906) This dataset enables us to then explore key features and differences between adult and child micro-expressions. This study also establishes a baseline for the automated spotting and recognition of micro-expressions in children using three approaches comprising hand-created and learning-based approaches.

2025-03-27

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on DeepSeek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024×1024 images. (score: 0.921) Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our code, models, datasets, and demo are publicly available. (score: 0.804)

2025-03-27

Video-R1: Reinforcing Video Reasoning in MLLMs

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. (score: 0.904) Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains a 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.

2025-03-26

Collaborative Storytelling and LLM: A Linguistic Analysis of Automatically-Generated Role-Playing Game Sessions

Role-playing games (RPG) are games in which players interact with one another to create narratives. The role of players in the RPG is largely based on the interaction between players and their characters. This emerging form of shared narrative, primarily oral, is receiving increasing attention. In particular, many authors have investigated the use of an LLM as an actor in the game. In this paper, we aim to discover to what extent the language of Large Language Models (LLMs) exhibits oral or written features when asked to generate an RPG session without human interference. We conduct a linguistic analysis of the lexical and syntactic features of the generated texts and compare the results with analyses of conversations, transcripts of human RPG sessions, and books. (score: 0.789) We found that LLMs exhibit a pattern that is distinct from all other text categories, including oral conversations, human RPG sessions and books. Our analysis has shown how training influences the way LLMs express themselves and provides important indications of the narrative capabilities of these tools.

2025-03-26

AccidentSim: Generating Physically Realistic Vehicle Collision Videos from Real-World Accident Reports

Collecting real-world vehicle accident videos for autonomous driving research is challenging due to their rarity and complexity. While existing driving video generation methods may produce visually realistic videos, they often fail to deliver physically realistic simulations because they lack the capability to generate accurate post-collision trajectories. In this paper, we introduce AccidentSim, a novel framework that generates physically realistic vehicle collision videos by extracting and utilizing the physical clues and contextual information available in real-world vehicle accident reports. Specifically, AccidentSim leverages a reliable physical simulator to replicate post-collision vehicle trajectories from the physical and contextual information in the accident reports and to build a vehicle collision trajectory dataset. This dataset is then used to fine-tune a language model, enabling it to respond to user prompts and predict physically consistent post-collision trajectories across various driving scenarios based on user descriptions. (score: 0.808) Finally, we employ Neural Radiance Fields (NeRF) to render high-quality backgrounds, merging them with the foreground vehicles that exhibit physically realistic trajectories to generate vehicle collision videos. Experimental results demonstrate that the videos produced by AccidentSim excel in both visual and physical authenticity.

2025-03-26

ARMO: Autoregressive Rigging for Multi-Category Objects

Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potentially dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. (score: 0.783) Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made public for academic use upon acceptance.

2025-03-26

Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. (score: 0.82) Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics.

2025-03-26

MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams

Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Unlike natural images, their inherently symbolic and abstract nature poses significant challenges for Multimodal Large Language Models (MLLMs). However, current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition. To address this gap, we introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. MATHGLANCE comprises 1.2K images and 1.6K carefully curated questions spanning four perception tasks: shape classification, object counting, relationship identification, and object grounding, covering diverse domains including plane geometry, solid geometry, and graphical representations. Our evaluation of MLLMs reveals that their ability to understand diagrams is notably limited, particularly in fine-grained grounding tasks. In response, we construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs explicitly annotated with geometric primitives and precise spatial relationships. (score: 0.746) Training an MLLM on GeoPeP leads to significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning. Our benchmark and dataset establish critical standards for evaluating and advancing multimodal mathematical understanding, providing valuable resources and insights to foster future MLLM research.

2025-03-26

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains 4,477 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. Dataset and code are available at https://github.com/yulupan00/BASKET. (score: 0.85)

2025-03-25

Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings

Advancements in computer-assisted surgical procedures heavily rely on accurate visual data interpretation from camera systems used during surgeries. Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos with less than 100K images. To address these constraints, a new dataset called Surg-3M has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. (score: 0.811) Featuring an extensive collection of over 4K surgical videos and more than 3 million high-quality images from multiple procedure types, Surg-3M offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel tasks. To demonstrate the effectiveness of this dataset, we present SurgFM, a self-supervised foundation model pretrained on Surg-3M that achieves impressive results in downstream tasks such as surgical phase recognition, action recognition, and tool presence detection. Combining key components from ConvNeXt, DINO, and an innovative augmented distillation method, SurgFM exhibits exceptional performance compared to specialist architectures across various benchmarks. Our experimental results show that SurgFM outperforms state-of-the-art models in multiple downstream tasks, including significant gains in surgical phase recognition (+8.9pp, +4.7pp, and +3.9pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), action recognition (+3.1pp of mAP in CholecT50) and tool presence detection (+4.6pp of mAP in Cholec80). Moreover, even when using only half of the data, SurgFM outperforms state-of-the-art models in AutoLaparo and achieves state-of-the-art performance in Cholec80. Both Surg-3M and SurgFM have significant potential to accelerate progress towards developing autonomous robotic surgery systems.

2025-03-25

BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance. This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism. Specifically, we leverage SAM's inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct "experts," and the similarity scoring serves as a rudimentary "gating network." We conducted extensive evaluations on both the Endovis17 medical dataset and RefCOCO series natural image datasets. 0.822 On Endovis17, BiPrompt-SAM achieved 89.55% mDice and 81.46% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series datasets, our method attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions. BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion.
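The explicit selection step described in this abstract can be illustrated with a minimal sketch. The abstract only says that candidates are chosen "based on similarity metrics"; using IoU as that metric, representing masks as sets of pixel coordinates, and the names iou and select_mask are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) of a foreground pixel


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two masks given as sets of foreground pixels."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def select_mask(candidates: List[Set[Pixel]], text_mask: Set[Pixel]) -> int:
    """Return the index of the candidate most similar to the text-guided mask.

    IoU is an assumed similarity metric; the abstract does not name one.
    """
    scores = [iou(c, text_mask) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Hypothetical toy data: three point-prompt candidates and one text-guided mask.
candidates = [
    {(0, 0)},                                      # covers too little
    {(0, 0), (0, 1), (1, 0), (1, 1)},              # close to the object
    {(r, c) for r in range(3) for c in range(3)},  # covers too much
]
text_mask = {(0, 0), (0, 1), (1, 0)}

best = select_mask(candidates, text_mask)
print(best, iou(candidates[best], text_mask))
```

In this toy case the second candidate wins because it overlaps the text-guided mask most closely relative to its size, which is the "gating" role the abstract assigns to the similarity score.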

link

2025-03-25

Outsourcing an Information Operation: A Complete Dataset of Tenet Media's Podcasts on Rumble

Tenet Media, a U.S.-based, right-wing media company, hired six established podcasters to create content related to U.S. politics and culture during the 2024 U.S. presidential election cycle. After publishing content on YouTube and Rumble for nearly a year, Tenet Media was declared by the U.S. government to be funded entirely by Russia -- making it effectively an outsourced state-sponsored information operation (SSIO). We present a complete dataset of the 560 podcast videos published by the Tenet Media channel on the video-sharing platform Rumble between November 2023 and September 2024. 0.909 Our dataset includes video metadata and user comments, as well as high-quality video transcriptions, representing over 300 hours of video content. 0.891 This dataset provides researchers with material to study a Russian SSIO, and notably on Rumble, which is an understudied platform in SSIO scholarship. 0.848

link

2025-03-25

LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset

Low-light image enhancement is crucial for a myriad of applications, from night vision and surveillance to autonomous driving. However, due to the inherent limitations that go hand in hand with capturing images in low-illumination environments, the task of enhancing such scenes still presents a formidable challenge. To advance research in this field, we introduce our Low Exposure Night Vision (LENVIZ) Dataset, a comprehensive multi-exposure benchmark dataset for low-light image enhancement comprising over 230K frames showcasing 24K real-world indoor and outdoor scenes, with and without humans. 0.793 Captured using 3 different camera sensors, LENVIZ offers a wide range of lighting conditions, noise levels, and scene complexities, making it the largest publicly available benchmark in the field at up to 4K resolution. LENVIZ includes high-quality human-generated ground truth, for which each multi-exposure low-light scene has been meticulously curated and edited by expert photographers to ensure optimal image quality. Furthermore, we also conduct a comprehensive analysis of current state-of-the-art low-light image enhancement techniques on our dataset and highlight potential areas of improvement.

link

2025-03-25

Scaling Down Text Encoders of Text-to-Image Diffusion Models

Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Although T5 series encoders are trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with a T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. This raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. 0.711 Our results demonstrate a scaling-down pattern: the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.

link

2025-03-24

EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos

Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery. 0.713

link

2025-03-24

CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research

Data are crucial in various computer-related fields, including music information retrieval (MIR), an interdisciplinary area bridging computer science and music. This paper introduces CCMusic, an open and diverse database comprising multiple datasets specifically designed for tasks related to Chinese music, highlighting our focus on this culturally rich domain. 0.761 The database integrates both published and unpublished datasets, with steps taken such as data cleaning, label refinement, and data structure unification to ensure data consistency and create ready-to-use versions. We conduct benchmark evaluations for all datasets using a unified evaluation framework developed specifically for this purpose. This publicly available framework supports both classification and detection tasks, ensuring standardized and reproducible results across all datasets. The database is hosted on HuggingFace and ModelScope, two open and multifunctional data and model hosting platforms, ensuring ease of accessibility and usability.

link

2025-03-24

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA. 0.808

link

2025-03-24

SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. 0.709 Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.

link

2025-03-24

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs. We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. 0.745 Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes. Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.

link

2025-03-20

Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data

Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose Chain of Functions (CoF), a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. CoF provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: eliminating reliance on extremely large models. Employing CoF, we construct the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. 0.838 The fine-grained evaluation on ChartCoF reveals varying performance across question taxonomies for each MLLM, and the experiments also show that finetuning with ChartCoF achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks. Furthermore, the novel paradigm of function-governed rationale generation in CoF could inspire broader applications beyond charts.

link

2025-03-20

A Dataset of Performance Measurements and Alerts from Mozilla (Data Artifact)

Performance regressions in software systems can lead to significant financial losses and degraded user satisfaction, making their early detection and mitigation critical. Despite the importance of practices that capture performance regressions early, there is a lack of publicly available datasets that comprehensively capture real-world performance measurements, expert-validated alerts, and associated metadata such as bugs and testing conditions. To address this gap, we introduce a unique dataset to support various research studies in performance engineering, anomaly detection, and machine learning. This dataset was collected from Mozilla Firefox's performance testing infrastructure and comprises 5,655 performance time series, 17,989 performance alerts, and detailed annotations of resulting bugs collected from May 2023 to May 2024. By publishing this dataset, we provide researchers with an invaluable resource for studying performance trends, developing novel change point detection methods, and advancing performance regression analysis across diverse platforms and testing environments. The dataset is available at https://doi.org/10.5281/zenodo.14642238 0.831

link

2025-03-20

NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. 0.811 Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.

link

2025-03-20

Panoptic-CUDAL Technical Report: Rural Australia Point Cloud Dataset in Rainy Conditions

Existing autonomous driving datasets are predominantly oriented towards well-structured urban settings and favorable weather conditions, leaving the complexities of rural environments and adverse weather conditions largely unaddressed. Although some datasets encompass variations in weather and lighting, bad weather scenarios do not appear often. Rainfall can significantly impair sensor functionality, introducing noise and reflections in LiDAR and camera data and reducing the system's capabilities for reliable environmental perception and safe navigation. We introduce the Panoptic-CUDAL dataset, a novel dataset purpose-built for panoptic segmentation in rural areas subject to rain. 0.848 By recording high-resolution LiDAR, camera, and pose data, Panoptic-CUDAL offers a diverse, information-rich dataset in a challenging scenario. We present analysis of the recorded data and provide baseline results for panoptic and semantic segmentation methods on LiDAR point clouds. The dataset can be found here: https://robotics.sydney.edu.au/our-research/intelligent-transportation-systems/ 0.838

link

2025-03-20

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perception, including occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset are available at https://github.com/chenchen235/SA-Occ. 0.867

link

2025-03-20

GAEA: A Geolocation Aware Conversational Model

Image geolocalization, in which, traditionally, an AI model predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of proprietary and open-source large multimodal models (LMMs), researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, LMMs struggle with more specialized downstream tasks, one of which is geolocalization. In this work, we propose to solve this problem by introducing a conversational model GAEA that can provide information regarding the location of an image, as required by a user. No large-scale dataset enabling the training of such a model exists. Thus we propose a comprehensive dataset GAEA with 800K images and around 1.6M question answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. 0.907 For quantitative evaluation, we propose a diverse benchmark comprising 4K image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%. Our dataset, model and codes are available 0.957

link

2025-03-19

aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion

Repository-level code completion aims to complete code based on the long contexts of the repository. Existing studies extract long contexts from the repository as inputs and leverage Large Language Models (LLMs) to generate code. However, we reveal a severe limitation of LLMs, i.e., LLMs may ignore the information within long contexts in code completion. In other words, even when the contexts contain useful information (e.g., relevant APIs or similar code), LLMs may fail to utilize this information. We think this limitation is caused by an inherent bias in LLMs, i.e., relying on nearby contexts and ignoring long-range contexts. To address this, we propose a novel fine-tuning approach named CoLT. The core idea of CoLT is to provide explicit supervision signals, which emphasize that long-range contexts may hold relevant information. Specifically, CoLT proposes a reinforcement learning-based training, which explicitly encourages models to utilize the information within long contexts and punishes models for ignoring long contexts. To support CoLT, we release CoLT-132K, a large-scale dataset with 132k samples across four languages, each containing long-context inputs. 0.932 We apply CoLT to a popular LLM - aiXcoder-7B and release aiXcoder-7B-v2. We conduct extensive experiments on CoLT-132K and a public benchmark - CrossCodeEval. Our experiments yield the following results: 1. Effectiveness. CoLT substantially improves aiXcoder-7B. aiXcoder-7B-v2 outperforms aiXcoder-7B by up to 44% in exact match. aiXcoder-7B-v2 becomes the state-of-the-art 7B model in code completion and even surpasses larger models. 2. Generalizability. The capability learned by CoLT can generalize to new languages. Besides, CoLT is model-agnostic and effectively improves multiple LLMs. 3. Enhanced Context Utilization Capability. CoLT significantly improves the capability of LLMs in utilizing the relevant information within long contexts.

link

2025-03-19

Genomic data processing with GenomeFlow

Advances in genome sequencing technologies generate massive amounts of sequence data that are increasingly analyzed and shared through public repositories. 0.719 On-demand infrastructure services on cloud computing platforms enable the processing of such large-scale genomic sequence data in distributed processing environments with a significant reduction in analysis time. However, parallel processing on cloud computing platforms presents many challenges to researchers, even skillful bioinformaticians. In particular, it is difficult to design a computing architecture optimized to reduce the cost of computing and disk storage as genomic data analysis pipelines often employ many heterogeneous tools with different resource requirements. To address these issues, we developed GenomeFlow, a tool for automated development of computing architecture and resource optimization on Google Cloud Platform, which allows users to process a large number of samples at minimal cost. We outline multiple use cases of GenomeFlow demonstrating its utility to significantly reduce computing time and cost associated with analyzing genomic and transcriptomic data from hundreds to tens of thousands of samples from several consortia. Here, we describe a step-by-step protocol on how to use GenomeFlow for a common genomic data processing task. We introduce this example protocol geared toward a bioinformatician with little experience in cloud computing.

link

2025-03-19

Towards efficient keyword spotting using spike-based time difference encoders

Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the Temporal Difference Encoder (TDE) performance in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors. We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. 0.741 We compare three Spiking Neural Network (SNN) architectures to learn and classify spatio-temporal signals. The proposed SNN architectures are made of three layers and differ in their hidden layer, which is composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits have a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on the accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than that of the feedforward CuBa-LIF network (71%) and close to that of the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same number of synapses. In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.

link

2025-03-19

Visual Persona: Foundation Model for Full-Body Human Customization

We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. 0.715 For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.

link

2025-03-19

Visual Position Prompt for MLLM based Visual Grounding

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggest probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. 0.754 Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets (~21M samples). The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance. 0.766

link

Data Quality

2025-03-25

RCC-PFL: Robust Client Clustering under Noisy Labels in Personalized Federated Learning

We address the problem of cluster identity estimation in a personalized federated learning (PFL) setting in which users aim to learn different personal models. The backbone of effective learning in such a setting is to cluster users into groups whose objectives are similar. A typical approach in the literature is to achieve this by training users' data on different proposed personal models and assign them to groups based on which model achieves the lowest value of the users' loss functions. This process is to be done iteratively until group identities converge. A key challenge in such a setting arises when users have noisy labeled data, which may produce misleading values of their loss functions, and hence lead to ineffective clustering. 0.635 To overcome this challenge, we propose a label-agnostic data similarity-based clustering algorithm, coined RCC-PFL, with three main advantages: the cluster identity estimation procedure is independent from the training labels; it is a one-shot clustering algorithm performed prior to the training; and it requires fewer communication rounds and less computation compared to iterative-based clustering methods. We validate our proposed algorithm using various models and datasets and show that it outperforms multiple baselines in terms of average accuracy and variance reduction.

link

2025-03-13

Learning Disease State from Noisy Ordinal Disease Progression Labels

Learning from noisy ordinal labels is a key challenge in medical imaging. In this work, we ask whether ordinal disease progression labels (better, worse, or stable) can be used to learn a representation that allows classification of disease state. For neovascular age-related macular degeneration (nAMD), we cast the problem of modeling disease progression between medical visits as a classification task with ordinal ranks. To enhance generalization, we tailor our model to the problem setting by (1) independent image encoding, (2) antisymmetric logit space equivariance, and (3) ordinal scale awareness. In addition, we address label noise by learning an uncertainty estimate for loss re-weighting. 0.657 Our approach learns an interpretable disease representation enabling strong few-shot performance for the related task of nAMD activity classification from single images, despite being trained only on image pairs with ordinal disease progression labels.
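One simple way to obtain the first two properties named in this abstract (independent image encoding and an antisymmetric logit) is to score each visit separately and take the difference. The sketch below is a toy illustration under that assumption, not the authors' model: the encoder, the convention that a higher score means more severe disease, the margin value, and all function names are hypothetical.

```python
from typing import Sequence


def encode(image: Sequence[float]) -> float:
    """Hypothetical stand-in encoder: maps ONE image to a scalar severity score.

    Assumption: a higher score means more severe disease.
    """
    return sum(image) / len(image)


def progression_logit(earlier: Sequence[float], later: Sequence[float]) -> float:
    """Each visit is encoded independently; the difference is antisymmetric,
    so swapping the two visits flips the sign of the logit."""
    return encode(later) - encode(earlier)


def progression_label(logit: float, margin: float = 0.1) -> str:
    """Map the logit to an ordinal label with a symmetric margin (assumed value)."""
    if logit > margin:
        return "worse"
    if logit < -margin:
        return "better"
    return "stable"


# Toy "images" as short lists of intensities.
visit_a = [0.2, 0.3, 0.1]
visit_b = [0.6, 0.7, 0.5]

forward = progression_logit(visit_a, visit_b)
backward = progression_logit(visit_b, visit_a)
print(progression_label(forward), progression_label(backward))
```

Because the pair logit is built from per-image scores, the same encoder can later be applied to a single image, which is consistent with the abstract's claim that the learned representation transfers to single-image classification.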

link

2025-03-13

More Than Just Warnings: Exploring the Ways of Communicating Credibility Assessment on Social Media

Reducing the spread of misinformation is challenging. AI-based fact verification systems offer a promising solution by addressing the high costs and slow pace of traditional fact-checking. However, the problem of how to effectively communicate the results to users remains unsolved. Warning labels may seem an easy solution, but they fail to account for fuzzy misinformation that is not entirely fake. 0.748 Additionally, users' limited attention spans and social media information should be taken into account while designing the presentation. The online experiment (n = 537) investigates the impact of sources and granularity on users' perception of information veracity and the system's usefulness and trustworthiness. Findings show that fine-grained indicators enhance nuanced opinions, information awareness, and the intention to use fact-checking systems. Source differences had minimal impact on opinions and perceptions, except for informativeness. Qualitative findings suggest the proposed indicators promote critical thinking. We discuss implications for designing concise, user-friendly AI fact-checking feedback.

link

2025-03-13

Unlock the Power of Unlabeled Data in Language Driving Model

Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such progress is extremely dependent on large-scale high-quality annotated data, which is costly and labor-intensive. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. 0.647 By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.

link

2025-03-12

Diff-CL: A Novel Cross Pseudo-Supervision Method for Semi-supervised Medical Image Segmentation

Semi-supervised learning utilizes insights from unlabeled data to improve model generalization, thereby reducing reliance on large labeled datasets. Most existing studies focus on limited samples and fail to capture the overall data distribution. We contend that combining distributional information with detailed information is crucial for achieving more robust and accurate segmentation results. On the one hand, with their robust generative capabilities, diffusion models (DM) learn data distribution effectively. However, they struggle with fine detail capture, leading to generated images with misleading details. Combining DM with convolutional neural networks (CNNs) enables the former to learn data distribution while the latter corrects fine details. However, capturing complete high-frequency details with CNNs requires substantial computational resources and is susceptible to local noise. On the other hand, given that both labeled and unlabeled data come from the same distribution, we believe that regions in unlabeled data whose overall class semantics are similar to those of labeled data are likely to belong to the same class, while regions with minimal similarity are less likely to. 0.622 This work introduces a semi-supervised medical image segmentation framework from the distribution perspective (Diff-CL). Firstly, we propose a cross-pseudo-supervision learning mechanism between diffusion and convolution segmentation networks. Secondly, we design a high-frequency mamba module to capture boundary and detail information globally. Finally, we apply contrastive learning for label propagation from labeled to unlabeled data. 0.653 Our method achieves state-of-the-art (SOTA) performance across three datasets, including left atrium, brain tumor, and NIH pancreas datasets.

link

2025-03-12

Double-Stage Feature-Level Clustering-Based Mixture of Experts Framework

The Mixture-of-Experts (MoE) model has succeeded in deep learning (DL).However, its complex architecture and advantages over dense models in image classification remain unclear.In previous studies, MoE performance has often been affected by noise and outliers in the input space.Some approaches incorporate input clustering for training MoE models, but most clustering algorithms lack access to labeled data, limiting their effectiveness.This paper introduces the Double-stage Feature-level Clustering and Pseudo-labeling-based Mixture of Experts (DFCP-MoE) framework, which consists of input feature extraction, feature-level clustering, and a computationally efficient pseudo-labeling strategy.This approach reduces the impact of noise and outliers while leveraging a small subset of labeled data to label a large portion of unlabeled inputs. 0.619We propose a conditional end-to-end joint training method that improves expert specialization by training the MoE model on well-labeled, clustered inputs.Unlike traditional MoE and dense models, the DFCP-MoE framework effectively captures input space diversity, leading to competitive inference results.We validate our approach on three benchmark datasets for multi-class classification tasks.

link

2025-03-11

How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?

Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch.In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width.We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples.Our results show that unlearning excels on overparameterized models, in terms of balancing between generalization and achieving the unlearning goal; although for bias removal this requires the unlearning method to use the unlearned examples.We further elucidate our error-based analysis by measuring how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. 0.645By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.

link

Benchmarks

2025-03-31

Spatio-temporal Prediction of Fine-Grained Origin-Destination Matrices with Applications in Ridesharing

Accurate spatial-temporal prediction of network-based travelers' requests is crucial for the effective policy design of ridesharing platforms. Having knowledge of the total demand between various locations in the upcoming time slots enables platforms to proactively prepare adequate supplies, thereby increasing the likelihood of fulfilling travelers' requests and redistributing idle drivers to areas with high potential demand to optimize the global supply-demand equilibrium. This paper delves into the prediction of Origin-Destination (OD) demands at a fine-grained spatial level, especially when confronted with an expansive set of local regions. While this task holds immense practical value, it remains relatively unexplored within the research community. To fill this gap, we introduce a novel prediction model called OD-CED, which comprises an unsupervised space coarsening technique to alleviate data sparsity and an encoder-decoder architecture to capture both semantic and geographic dependencies. Through practical experimentation, OD-CED has demonstrated remarkable results. It achieved a reduction of up to 45% in root-mean-square error and 60% in weighted mean absolute percentage error over traditional statistical methods when dealing with OD matrices exhibiting a sparsity exceeding 90%. 0.652

link
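The two error measures named in the OD-CED abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the WMAPE form used here (total absolute error divided by total absolute demand) is a common definition and is an assumption, since the abstract does not state the exact weighting. The toy vectors stand in for a flattened, mostly-zero OD matrix.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error over paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def wmape(actual, predicted):
    """Weighted MAPE: total absolute error divided by total absolute demand.

    Assumed definition; it avoids dividing by zero on empty OD cells,
    which matters when most cells of the matrix are zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a) for a in actual)
    if total == 0:
        raise ValueError("WMAPE is undefined when all actual values are zero")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total


# Hypothetical flattened OD matrix (8 origin-destination cells, mostly zero).
actual = [0, 0, 3, 0, 1, 0, 0, 2]
predicted = [0, 1, 2, 0, 1, 0, 0, 2]

print(rmse(actual, predicted))
print(wmape(actual, predicted))
```

Unlike a per-cell percentage error, this aggregate form stays defined on sparse matrices, which is presumably why a weighted variant is reported for data with more than 90% empty cells.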

2025-03-31

Advanced Quantum Annealing Approach to Vehicle Routing Problems with Time Windows

In this paper, we explore the potential for quantum annealing to solve realistic routing problems.We focus on two NP-Hard problems, including the Traveling Salesman Problem with Time Windows and the Capacitated Vehicle Routing Problem with Time Windows.We utilize D-Wave's Quantum Annealer and Constrained Quadratic Model (CQM) solver within a hybrid framework to solve these problems.We demonstrate that while the CQM solver effectively minimizes route costs, it struggles to maintain time window feasibility as the problem size increases.To address this limitation, we implement a heuristic method that fixes infeasible solutions through a series of swapping operations.Testing on benchmark instances shows our method achieves promising results with an average optimality gap of 3.86%. 0.759

link
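As a rough illustration of the ideas in the quantum annealing abstract, the sketch below checks a route for time-window feasibility, attempts a repair by swapping two stops, and computes an optimality gap. All function names, the single-swap strategy, and the toy instance are assumptions made for illustration; the abstract does not describe the actual heuristic beyond "a series of swapping operations".

```python
def optimality_gap(cost, best_known):
    """Percentage gap between a solution cost and the best known cost."""
    return 100.0 * (cost - best_known) / best_known


def first_violation(route, travel_time, windows):
    """Return the route index of the first missed time window, or None.

    route: node ids starting and ending at the depot.
    windows: (earliest, latest) arrival bounds per node id.
    Arriving early means waiting until the window opens.
    """
    t = 0.0
    for i in range(1, len(route)):
        prev, cur = route[i - 1], route[i]
        t += travel_time[prev][cur]
        earliest, latest = windows[cur]
        if t > latest:
            return i
        t = max(t, earliest)
    return None


def repair_by_swap(route, travel_time, windows):
    """Try every swap of two non-depot stops; return a feasible route or None."""
    if first_violation(route, travel_time, windows) is None:
        return list(route)
    for i in range(1, len(route) - 1):
        for j in range(i + 1, len(route) - 1):
            candidate = list(route)
            candidate[i], candidate[j] = candidate[j], candidate[i]
            if first_violation(candidate, travel_time, windows) is None:
                return candidate
    return None


# Hypothetical 3-node instance: node 0 is the depot.
travel_time = [[0, 2, 4], [2, 0, 3], [4, 3, 0]]
windows = [(0, 100), (0, 3), (5, 20)]
infeasible = [0, 2, 1, 0]  # reaches node 1 at t=8, after its window closes at 3

print(first_violation(infeasible, travel_time, windows))
print(repair_by_swap(infeasible, travel_time, windows))
print(optimality_gap(103.86, 100.0))
```

A real repair step would also weigh the route cost of each candidate swap; this sketch accepts the first feasible one.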

2025-03-31

Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning

We consider a decentralized wireless network with several source-destination pairs sharing a limited number of orthogonal frequency bands.Sources learn to adapt their transmissions (specifically, their band selection strategy) over time, in a decentralized manner, without sharing information with each other.Sources can only observe the outcome of their own transmissions (i.e., success or collision), having no prior knowledge of the network size or of the transmission strategy of other sources.The goal of each source is to maximize their own throughput while striving for network-wide fairness.We propose a novel fully decentralized Reinforcement Learning (RL)-based solution that achieves fairness without coordination.The proposed Fair Share RL (FSRL) solution combines: (i) state augmentation with a semi-adaptive time reference; (ii) an architecture that leverages risk control and time difference likelihood; and (iii) a fairness-driven reward structure.We evaluate FSRL in more than 50 network settings with different number of agents, different amounts of available spectrum, in the presence of jammers, and in an ad-hoc setting.Simulation results suggest that, when we compare FSRL with a common baseline RL algorithm from the literature, FSRL can be up to 89.0% fairer (as measured by Jain's fairness index) in stringent settings with several sources and a single frequency band, and 48.1% fairer on average. 0.615

link
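The FSRL abstract reports fairness using Jain's fairness index. For reference, the index for throughputs x_1..x_n is (sum x_i)^2 / (n * sum x_i^2), ranging from 1/n (one source takes everything) to 1 (all sources equal). The sketch below computes it; the throughput values are made up for illustration.

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_sq)


# Hypothetical per-source throughputs on a single shared band.
print(jain_index([1, 1, 1, 1]))  # perfectly fair share
print(jain_index([1, 0, 0, 0]))  # one source monopolizes the band
```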

2025-03-31

Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge

Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy.Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms.This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification.The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024.The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency.The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. 0.6The efficiency component tests the latency of algorithm inference.The challenge was conducted as a part of MICCAI EndoVis 2024.In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day.This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery.In this paper we summarize the design, submissions, and results from the challenge.The challenge dataset is available here: https://zenodo.org/records/14803158 , and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics

link

2025-03-31

BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models

In this research, we introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs). Building upon the BEATS framework, we present a bias benchmark for LLMs that measures performance across 29 distinct metrics. 0.619 These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality-related misinformation risk. These metrics enable a quantitative assessment of the extent to which LLM-generated responses may perpetuate societal prejudices that reinforce or expand systemic inequities. To achieve a high score on this benchmark, an LLM must show very equitable behavior in its responses, making it a rigorous standard for responsible AI evaluation. Empirical results based on data from our experiment show that 37.65% of outputs generated by industry-leading models contained some form of bias, highlighting a substantial risk of using these models in critical decision-making systems. The BEATS framework and benchmark offer a scalable and statistically rigorous methodology to benchmark LLMs, diagnose factors driving biases, and develop mitigation strategies. With the BEATS framework, our goal is to help the development of more socially responsible and ethically aligned AI models.

link

2025-03-31

Sample-Optimal Private Regression in Polynomial Time

We consider the task of privately obtaining prediction error guarantees in ordinary least-squares regression problems with Gaussian covariates (with unknown covariance structure).We provide the first sample-optimal polynomial time algorithm for this task under both pure and approximate differential privacy.We show that any improvement to the sample complexity of our algorithm would violate either statistical-query or information-theoretic lower bounds.Additionally, our algorithm is robust to a small fraction of arbitrary outliers and achieves optimal error rates as a function of the fraction of outliers. 0.639In contrast, all prior efficient algorithms either incurred sample complexities with sub-optimal dimension dependence, scaling with the condition number of the covariates, or obtained a polynomially worse dependence on the privacy parameters. Our technical contributions are two-fold: first, we leverage resilience guarantees of Gaussians within the sum-of-squares framework.As a consequence, we obtain efficient sum-of-squares algorithms for regression with optimal robustness rates and sample complexity.Second, we generalize the recent robustness-to-privacy framework[HKMN23, (arXiv:2212.05015)] to account for the geometry induced by the covariance of the input samples.This framework crucially relies on the robust estimators to be sum-of-squares algorithms, and combining the two steps yields a sample-optimal private regression algorithm.We believe our techniques are of independent interest, and we demonstrate this by obtaining an efficient algorithm for covariance-aware mean estimation, with an optimal dependence on the privacy parameters.

link

2025-03-31

Contextual Preference Collaborative Measure Framework Based on Belief System

To reduce human intervention in the preference measure process, this article proposes a preference collaborative measure framework based on an updated belief system, which is also capable of improving the accuracy and efficiency of preference measure algorithms. Firstly, the distance of rules and the average internal distance of rulesets are proposed for specifying the relationship between the rules. For discovering the most representative preferences that are common to all users, namely common preferences, an algorithm based on the average internal distance of rulesets, the PRA algorithm, is proposed, which aims to finish the discovery process with a minimum information loss rate. Furthermore, the concept of Common belief is proposed to update the belief system, and the common preferences are the evidence of the updated belief system. Then, under the belief system, the proposed belief degree and deviation degree are used to determine whether a rule confirms the belief system or not and to classify the preference rules into two kinds (generalized or personalized), eventually filtering out the Top-K interesting rules relying on belief degree and deviation degree. Based on the above, a scalable interestingness calculation framework that can apply various formulas is proposed for accurately calculating interestingness in different conditions. At last, the IMCos and IMCov algorithms are proposed as exemplars to verify the accuracy and efficiency of the framework by using weighted cosine similarity and correlation coefficients as belief degree. In experiments, the proposed algorithms are compared to two state-of-the-art algorithms, and the results show that IMCos and IMCov outperform the other two in most aspects. 0.636

link

2025-03-31

Accelerated Approximate Optimization of Multi-Commodity Flows on Directed Graphs

We provide $m^{1+o(1)}k\epsilon^{-1}$-time algorithms for computing multiplicative $(1 - \epsilon)$-approximate solutions to multi-commodity flow problems with $k$-commodities on $m$-edge directed graphs, including concurrent multi-commodity flow and maximum multi-commodity flow. To obtain our results, we provide new optimization tools of potential independent interest.First, we provide an improved optimization method for solving $\ell_{q, p}$-regression problems to high accuracy. 0.605This method makes $\tilde{O}_{q, p}(k)$ queries to a high accuracy convex minimization oracle for an individual block, where $\tilde{O}_{q, p}(\cdot)$ hides factors depending only on $q$, $p$, or $\mathrm{poly}(\log m)$, improving upon the $\tilde{O}_{q, p}(k^2)$ bound of [Chen-Ye, ICALP 2024].As a result, we obtain the first almost-linear time algorithm that solves $\ell_{q, p}$ flows on directed graphs to high accuracy.Second, we present optimization tools to reduce approximately solving composite $\ell_{1, \infty}$-regression problems to solving $m^{o(1)}\epsilon^{-1}$ instances of composite $\ell_{q, p}$-regression problem.The method builds upon recent advances in solving box-simplex games [Jambulapati-Tian, NeurIPS 2023] and the area convex regularizer introduced in [Sherman, STOC 2017] to obtain faster rates for constrained versions of the problem.Carefully combining these techniques yields our directed multi-commodity flow algorithm.

link

2025-03-31

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets.In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model.This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths.In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction.Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning.We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion.By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction.Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets.Our code is publicly available for research purpose at https://easi3r.github.io/ 0.606

link

2025-03-27

Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing

Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse.We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code.Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation.Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code.Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8%) compared to our LLM jury system (73.2%), with manual verification confirming that single-judge assessments often accept incomplete implementations. 0.634Moreover, we demonstrate that our distributed architecture improves SRs by 12% over the non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts.

link

2025-03-27

Audio-driven Gesture Generation via Deviation Feature in the Latent Space

Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions.While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations.We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation.Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation.By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production.Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques. 0.65

link

2025-03-27

The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection

In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point.This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results.We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images.It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects.We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO. 0.707Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts.We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (https://benchmark.mvtec.com/).All image data is available at https://www.mvtec.com/company/research/datasets/mvtec-ad-2.

link

2025-03-27

ClusterSC: Advancing Synthetic Control with Donor Selection

In causal inference with observational studies, synthetic control (SC) has emerged as a prominent tool.SC has traditionally been applied to aggregate-level datasets, but more recent work has extended its use to individual-level data.As they contain a greater number of observed units, this shift introduces the curse of dimensionality to SC.To address this, we propose Cluster Synthetic Control (ClusterSC), based on the idea that groups of individuals may exist where behavior aligns internally but diverges between groups.ClusterSC incorporates a clustering step to select only the relevant donors for the target.We provide theoretical guarantees on the improvements induced by ClusterSC, supported by empirical demonstrations on synthetic and real-world datasets.The results indicate that ClusterSC consistently outperforms classical SC approaches. 0.636

link

2025-03-27

A Bespoke Design Approach to Low-Power Printed Microprocessors for Machine Learning Applications

Printed electronics have gained significant traction in recent years, presenting a viable path to integrating computing into everyday items, from disposable products to low-cost healthcare.However, the adoption of computing in these domains is hindered by strict area and power constraints, limiting the effectiveness of general-purpose microprocessors.This paper proposes a bespoke microprocessor design approach to address these challenges, by tailoring the design to specific applications and eliminating unnecessary logic.Targeting machine learning applications, we further optimize core operations by integrating a SIMD MAC unit supporting 4 precision configurations that boost the efficiency of microprocessors.Our evaluation across 6 ML models and the large-scale Zero-Riscy core, shows that our methodology can achieve improvements of 22.2%, 23.6%, and 33.79% in area, power, and speed, respectively, without compromising accuracy. 0.6Against state-of-the-art printed processors, our approach can still offer significant speedups, but along with some accuracy degradation. 0.654This work explores how such trade-offs can enable low-power printed microprocessors for diverse ML applications.

link

2025-03-27

AMA-SAM: Adversarial Multi-Domain Alignment of Segment Anything Model for High-Fidelity Histology Nuclei Segmentation

Accurate segmentation of cell nuclei in histopathology images is essential for numerous biomedical research and clinical applications.However, existing cell nucleus segmentation methods only consider a single dataset (i.e., primary domain), while neglecting to leverage supplementary data from diverse sources (i.e., auxiliary domains) to reduce overfitting and enhance the performance.Although incorporating multiple datasets could alleviate overfitting, it often exacerbates performance drops caused by domain shifts.In this work, we introduce Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM) that extends the Segment Anything Model (SAM) to overcome these obstacles through two key innovations.First, we propose a Conditional Gradient Reversal Layer (CGRL), a multi-domain alignment module that harmonizes features from diverse domains to promote domain-invariant representation learning while preserving crucial discriminative features for the primary dataset.Second, we address SAM's inherent low-resolution output by designing a High-Resolution Decoder (HR-Decoder), which directly produces fine-grained segmentation maps in order to capture intricate nuclei boundaries in high-resolution histology images.To the best of our knowledge, this is the first attempt to adapt SAM for multi-dataset learning with application to histology nuclei segmentation.We validate our method on several publicly available datasets, demonstrating consistent and significant improvements over state-of-the-art approaches. 0.645

link

2025-03-27

Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance

Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric based on the notion of Conditional Fréchet Distance that explicitly accounts for both visual fidelity and text-prompt alignment. Existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID) and CLIPScore assess either image quality or image-text alignment but not both, which limits their correlation with human preferences. Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark in the appendix. 0.621

link
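For context on the cFreD entry, the standard (unconditional) Fréchet distance between two Gaussians, as used by FID on image features, is shown below. This is the well-known baseline formula only; how cFreD conditions the two distributions on the text prompt is not specified in the abstract, so no conditional form is given here. The symbols $\mu_i$ and $\Sigma_i$ denote the feature means and covariances of the real and generated image sets.

```latex
d_F^2\bigl(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\bigr)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\bigl(\Sigma_1 + \Sigma_2 - 2\,(\Sigma_1 \Sigma_2)^{1/2}\bigr)
```

Because this quantity depends only on image statistics, it cannot by itself penalize an image that is realistic but ignores its prompt, which is the gap the abstract says a conditional variant is meant to close.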

2025-03-27

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

Ensuring the reliability and effectiveness of software release decisions is critical, particularly in safety-critical domains like automotive systems.Precise analysis of release validation data, often presented in tabular form, plays a pivotal role in this process.However, traditional methods that rely on manual analysis of extensive test datasets and validation metrics are prone to delays and high costs.Large Language Models (LLMs) offer a promising alternative but face challenges in analytical reasoning, contextual understanding, handling out-of-scope queries, and processing structured test data consistently; limitations that hinder their direct application in safety-critical scenarios.This paper introduces GateLens, an LLM-based tool for analyzing tabular data in the automotive domain.GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code.It outperforms the baseline system on benchmarking datasets, achieving higher F1 scores and handling complex and ambiguous queries with greater robustness. 0.601Ablation studies confirm the critical role of the RA module, with performance dropping sharply when omitted.Industrial evaluations reveal that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability.As demonstrated by presented results, GateLens achieved high performance without relying on few-shot examples, showcasing strong generalization across various query types from diverse company roles.Insights from deploying GateLens with a partner automotive company offer practical guidance for integrating AI into critical workflows such as release validation.Results show that by automating test result analysis, GateLens enables faster, more informed, and dependable release decisions, and can thus advance software scalability and reliability in automotive systems.

link

2025-03-27

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity.Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images.Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance.To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation.Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). 0.604Our codes, models, datasets, and demo are publicly available.

link

2025-03-27

A Unified Framework for Diffusion Bridge Problems: Flow Matching and Schrödinger Matching into One

The bridge problem is to find an SDE (or sometimes an ODE) that bridges two given distributions. The application areas of the bridge problem are enormous, among which the recent generative modeling (e.g., conditional or unconditional image generation) is the most popular. Also, the famous Schrödinger bridge problem, a widely known problem for a century, is a special instance of the bridge problem. The two most popular algorithms for tackling bridge problems in the deep learning era are (conditional) flow matching and iterative fitting algorithms, where the former is confined to ODE solutions and the latter is specific to the Schrödinger bridge problem. The main contribution of this article is twofold: i) We provide concise reviews of these algorithms with technical details to some extent; ii) We propose a novel unified perspective and framework that subsumes these seemingly unrelated algorithms (and their variants) into one. 0.618 In particular, we show that our unified framework can instantiate the Flow Matching (FM) algorithm, the (mini-batch) optimal transport FM algorithm, the (mini-batch) Schrödinger bridge FM algorithm, and the deep Schrödinger bridge matching (DSBM) algorithm as its special cases. We believe that this unified framework will be useful for viewing the bridge problems in a more general and flexible perspective, and in turn can help researchers and practitioners to develop new bridge algorithms in their fields.

link

2025-03-27

Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying

Open-vocabulary querying in 3D Gaussian Splatting aims to identify semantically relevant regions within a 3D Gaussian representation based on a given text query. Prior work, such as LangSplat, addressed this task by retrieving these regions in the form of segmentation masks on 2D renderings. More recently, OpenGaussian introduced point-level querying, which directly selects a subset of 3D Gaussians. In this work, we propose a point-level querying method that builds upon LangSplat's framework. Our approach improves the framework in two key ways: (a) we leverage masklets from the Segment Anything Model 2 (SAM2) to establish semantically consistent ground truth for distilling the language Gaussians; (b) we introduce a novel two-step querying approach that first retrieves the distilled ground truth and subsequently uses the ground truth to query the individual Gaussians. Experimental evaluations on three benchmark datasets demonstrate that the proposed method achieves better performance compared to state-of-the-art approaches. 0.75 For instance, our method achieves an mIoU improvement of +20.42 on the 3D-OVS dataset.

link

2025-03-27

Optimal Stepsize for Diffusion Sampling

Diffusion models achieve remarkable generation quality but suffer from computationally intensive sampling due to suboptimal step discretization. While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation. Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval. Our code is available at https://github.com/bebebe666/OptimalSteps. 0.691
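To make the dynamic programming idea concrete, here is a minimal sketch of picking a k-step schedule over a fine grid of n+1 time indices. It assumes the total error decomposes into a sum of per-jump costs; the `cost(i, j)` callback and the function name are hypothetical stand-ins and are not taken from the paper, whose error measure is defined against reference trajectories.

```python
from typing import Callable, List


def best_schedule(n: int, k: int, cost: Callable[[int, int], float]) -> List[int]:
    """Return k+1 grid indices from 0 to n that minimize the summed jump cost."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    inf = float("inf")
    # best[s][j]: minimal cost of reaching grid index j with exactly s jumps
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[-1] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(j):
                c = best[s - 1][i] + cost(i, j)
                if c < best[s][j]:
                    best[s][j] = c
                    back[s][j] = i
    # Walk the back-pointers from the final index to recover the schedule.
    path = [n]
    j = n
    for s in range(k, 0, -1):
        j = back[s][j]
        path.append(j)
    return path[::-1]
```

With a toy quadratic cost such as `lambda i, j: (j - i) ** 2`, the optimum is the uniform schedule; a cost measured against a reference trajectory would generally yield non-uniform steps.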

link

2025-03-27

X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction

Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. 0.611 By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Project website at: https://x2-gaussian.github.io/.

link

2025-03-26

Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition

Implicit discourse relation recognition (IDRR) -- the task of identifying the implicit coherence relation between two text spans -- requires deep semantic understanding. Recent studies have shown that zero- or few-shot approaches significantly lag behind supervised models, but LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. 0.668 We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.

link

2025-03-26

Late Breaking Results: A RISC-V ISA Extension for Chaining in Scalar Processors

Modern general-purpose accelerators integrate a large number of programmable area- and energy-efficient processing elements (PEs), to deliver high performance while meeting stringent power delivery and thermal dissipation constraints. In this context, PEs are often implemented by scalar in-order cores, which are highly sensitive to pipeline stalls. Traditional software techniques, such as loop unrolling, mitigate the issue at the cost of increased register pressure, limiting flexibility. We propose scalar chaining, a novel hardware-software solution, to address this issue without incurring the drawbacks of traditional software-only techniques. We demonstrate our solution on register-limited stencil codes, achieving >93% FPU utilizations and a 4% speedup and 10% higher energy efficiency, on average, over highly-optimized baselines. Our implementation is fully open source and performance experiments are reproducible using free software. 0.647

link

2025-03-26

IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting

Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Class-Incremental Learning (MCIL) scenario in practice, where several classes and domains of multi-modal tasks arrive incrementally. Without access to previously learned tasks and unseen tasks, memory-constrained MCIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning techniques (PEFT), such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter setting (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MCIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) module enhances adaptation to new tasks while mitigating forgetting by dynamically assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. 0.696 Code can be found at https://github.com/FerdinandZJU/IAP.

link

2025-03-26

State-Aware Perturbation Optimization for Robust Deep Reinforcement Learning

Recently, deep reinforcement learning (DRL) has emerged as a promising approach for robotic control. However, the deployment of DRL in real-world robots is hindered by its sensitivity to environmental perturbations. While existing white-box adversarial attacks rely on local gradient information and apply uniform perturbations across all states to evaluate DRL robustness, they fail to account for temporal dynamics and state-specific vulnerabilities. To combat the above challenge, we first conduct a theoretical analysis of white-box attacks in DRL by establishing the adversarial victim-dynamics Markov decision process (AVD-MDP), to derive the necessary and sufficient conditions for a successful attack. Based on this, we propose a selective state-aware reinforcement adversarial attack method, named STAR, to optimize perturbation stealthiness and state visitation dispersion. STAR first employs a soft mask-based state-targeting mechanism to minimize redundant perturbations, enhancing stealthiness and attack effectiveness. Then, it incorporates an information-theoretic optimization objective to maximize mutual information between perturbations, environmental states, and victim actions, ensuring a dispersed state-visitation distribution that steers the victim agent into vulnerable states for maximum return reduction. Extensive experiments demonstrate that STAR outperforms state-of-the-art benchmarks. 0.665

link

2025-03-26

ProFed: a Benchmark for Proximity-based non-IID Federated Learning

In recent years, federated learning (FL) has gained significant attention within the machine learning community. Although various FL algorithms have been proposed in the literature, their performance often degrades when data across clients is non-independently and identically distributed (non-IID). This skewness in data distribution often emerges from geographic patterns, with notable examples including regional linguistic variations in text data or localized traffic patterns in urban environments. Such scenarios result in IID data within specific regions but non-IID data across regions. However, existing FL algorithms are typically evaluated by randomly splitting non-IID data across devices, disregarding their spatial distribution. To address this gap, we introduce ProFed, a benchmark that simulates data splits with varying degrees of skewness across different regions. We incorporate several skewness methods from the literature and apply them to well-known datasets, including MNIST, FashionMNIST, CIFAR-10, and CIFAR-100. Our goal is to provide researchers with a standardized framework to evaluate FL algorithms more effectively and consistently against established baselines. 0.637
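The "IID within a region, non-IID across regions" setup can be illustrated with a small sketch. It uses Dirichlet label skew, a partitioning scheme commonly used in the FL literature, to divide each class across regions, then splits every region's pool uniformly across its devices. The function name and parameters are hypothetical, and this is not ProFed's API; the benchmark's own skewness methods may differ.

```python
import numpy as np


def regional_split(labels, n_regions, devices_per_region, alpha, seed=0):
    """Return split[r][d]: sample indices held by device d of region r."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    region_indices = [[] for _ in range(n_regions)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Share of class c given to each region; a small alpha gives stronger skew.
        shares = rng.dirichlet(alpha * np.ones(n_regions))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for r, part in enumerate(np.split(idx, cuts)):
            region_indices[r].extend(part.tolist())
    split = []
    for r in range(n_regions):
        # Within a region, shuffle and deal samples evenly so devices are IID.
        pool = rng.permutation(np.array(region_indices[r], dtype=int))
        split.append([p.tolist() for p in np.array_split(pool, devices_per_region)])
    return split
```

Every sample lands on exactly one device, so the split can be compared directly against a random (region-agnostic) partition of the same dataset.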

link

2025-03-26

Semi-supervised Node Importance Estimation with Informative Distribution Modeling for Uncertainty Regularization

Node importance estimation, a classical problem in network analysis, underpins various web applications. 0.605 Previous methods either exploit intrinsic topological characteristics, e.g., graph centrality, or leverage additional information, e.g., data heterogeneity, for node feature enhancement. However, these methods follow the supervised learning setting, overlooking the fact that ground-truth node-importance data are usually partially labeled in practice. In this work, we propose the first semi-supervised node importance estimation framework, i.e., EASING, to improve learning quality for unlabeled data in heterogeneous graphs. Different from previous approaches, EASING explicitly captures uncertainty to reflect the confidence of model predictions. To jointly estimate the importance values and uncertainties, EASING incorporates DJE, a deep encoder-decoder neural architecture. DJE introduces distribution modeling for graph nodes, where the distribution representations derive both importance and uncertainty estimates. Additionally, DJE facilitates effective pseudo-label generation for the unlabeled data to enrich the training samples. Based on labeled and pseudo-labeled data, EASING develops effective semi-supervised heteroscedastic learning with varying node uncertainty regularization. Extensive experiments on three real-world datasets highlight the superior performance of EASING compared to competing methods. Codes are available via https://github.com/yankai-chen/EASING.

link

2025-03-26

Learning Straight Flows by Learning Curved Interpolants

Flow matching models typically use linear interpolants to define the forward/noise addition process. This, together with the independent coupling between noise and target distributions, yields a vector field which is often non-straight. Such curved fields lead to a slow inference/generation process. In this work, we propose to learn flexible (potentially curved) interpolants in order to learn straight vector fields to enable faster generation. We formulate this via a multi-level optimization problem and propose an efficient approximate procedure to solve it. 0.626 Our framework provides an end-to-end and simulation-free optimization procedure, which can be leveraged to learn straight line generative trajectories.
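For reference, the linear interpolant that the abstract takes as its starting point can be sketched as follows. The convention that t = 0 is noise and t = 1 is data is an assumption (the opposite convention is also common), and the function name is hypothetical; the learned curved interpolants proposed in the paper are not shown.

```python
import numpy as np


def linear_interpolant(x0, x1, t):
    """Return x_t = (1 - t) * x0 + t * x1 and the regression target x1 - x0."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    # Reshape t to (batch, 1, ..., 1) so it broadcasts over the feature axes.
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant in t along each (x0, x1) pair
    return x_t, v_target
```

Each individual path is straight, but with independently sampled pairs the marginal vector field obtained by averaging these targets is generally curved, which is the issue the abstract describes.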

link

2025-03-26

Semantic Communications via Features Identification

The development of the new generation of wireless technologies (6G) has led to an increased interest in semantic communication. Thanks also to recent developments in artificial intelligence and communication technologies, researchers in this field have defined new communication paradigms that go beyond those of syntactic communication to post-Shannon and semantic communication. However, there is still a need to define a clear and practical framework for semantic communication, as well as an effective structure of semantic elements that can be used in it. The aim of this work is to bridge the gap between two post-Shannon communication paradigms, and to define a robust and effective semantic communication strategy that focuses on a dedicated semantic element that can be easily derived from any type of message. Our work will take form as an innovative communication method called identification via semantic features, which aims at exploiting the ambiguities present in semantic messages, allowing for their identification instead of reproducing them bit by bit. Our approach has been tested through numerical simulations using a combination of machine learning and data analysis. 0.601 The proposed communication method showed promising results, demonstrating a clear and significant gain over traditional syntactic communication paradigms.

link

2025-03-26

Benchmarking and optimizing organism wide single-cell RNA alignment methods

Many methods have been proposed for removing batch effects and aligning single-cell RNA (scRNA) datasets. However, performance is typically evaluated based on multiple parameters and few datasets, creating challenges in assessing which method is best for aligning data at scale. 0.646 Here, we introduce the K-Neighbors Intersection (KNI) score, a single score that both penalizes batch effects and measures accuracy at cross-dataset cell-type label prediction alongside carefully curated small (scMARK) and large (scREF) benchmarks comprising 11 and 46 human scRNA studies respectively, where we have standardized author labels. Using the KNI score, we evaluate and optimize approaches for cross-dataset single-cell RNA integration. We introduce Batch Adversarial single-cell Variational Inference (BA-scVI), as a new variant of scVI that uses adversarial training to penalize batch-effects in the encoder and decoder, and show this approach outperforms other methods. In the resulting aligned space, we find that the granularity of cell-type groupings is conserved, supporting the notion that whole-organism cell-type maps can be created by a single model without loss of information.

link

2025-03-26

SChanger: Change Detection from a Semantic Change and Spatial Consistency Perspective

Change detection is a key task in Earth observation applications. Recently, deep learning methods have demonstrated strong performance and widespread application. However, change detection faces data scarcity due to the labor-intensive process of accurately aligning remote sensing images of the same area, which limits the performance of deep learning algorithms. To address the data scarcity issue, we develop a fine-tuning strategy called the Semantic Change Network (SCN). We initially pre-train the model on single-temporal supervised tasks to acquire prior knowledge of instance feature extraction. The model then employs a shared-weight Siamese architecture and extended Temporal Fusion Module (TFM) to preserve this prior knowledge and is fine-tuned on change detection tasks. The learned semantics for identifying all instances is changed to focus on identifying only the changes. Meanwhile, we observe that the locations of changes between the two images are spatially identical, a concept we refer to as spatial consistency. We introduce this inductive bias through an attention map that is generated by large-kernel convolutions and applied to the features from both time points. This enhances the modeling of multi-scale changes and helps capture underlying relationships in change detection semantics. We develop a binary change detection model utilizing these two strategies. The model is validated against state-of-the-art methods on six datasets, surpassing all benchmark methods and achieving F1 scores of 92.87%, 86.43%, 68.95%, 97.62%, 84.58%, and 93.20% on the LEVIR-CD, LEVIR-CD+, S2Looking, CDD, SYSU-CD, and WHU-CD datasets, respectively. 0.62

link

2025-03-26

Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework

This paper introduces a theoretical framework for a Transformer-augmented, sectional Mixture-of-Experts (MoE) architecture that aims to enhance computational efficiency while preserving model scalability. Unlike conventional MoE models, which route entire token embeddings to selected experts, our approach portions the embedding dimension itself -- assigning segments of each token's representation to dedicated experts. To combat losses in token representation, we utilize a pre-expert transformer layer to recompute attention across tokens and reduce the sequence length dimensionality. We extend our theory by deriving optimal scaling laws that describe a non-linear relationship between the number of experts and factors such as model dimensionality, sequence length, and system overhead. These formulations yield closed-form and numerically-solvable expressions for identifying the optimal expert count under given architectural and hardware constraints. As a result, our framework not only provides theoretical bounds for computing efficiency with varying frameworks but also guides practical design choices for scaling large models effectively. While empirical validation is pending, we present a comprehensive experimental road map to evaluate the framework's efficiency, scalability, and practicality in future work. 0.667
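The idea of portioning the embedding dimension across experts can be sketched as below. It assumes each expert is a single linear map applied to its own equal-width slice of every token embedding; the function name, the linear experts, and the equal-width split are illustrative assumptions, and the pre-expert transformer layer from the abstract is omitted.

```python
import numpy as np


def sectional_moe(x, expert_weights):
    """Apply expert e to the e-th slice of the last (embedding) axis of x.

    x has shape (batch, seq_len, d_model); d_model must be divisible by the
    number of experts, otherwise np.split raises a ValueError.
    """
    segments = np.split(x, len(expert_weights), axis=-1)
    # Every token visits every expert, but each expert sees only its slice.
    outputs = [seg @ w for seg, w in zip(segments, expert_weights)]
    return np.concatenate(outputs, axis=-1)
```

Unlike token routing, no gating decision is made per token here: the assignment of slices to experts is fixed by position in the embedding.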

link

2025-03-26

ASGO: Adaptive Structured Gradient Optimization

Training deep neural networks (DNNs) is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than simple vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block-wise diagonal. These structured properties are crucial for designing efficient optimization algorithms but may not be utilized by current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. 0.612 Based on the convergence theory, we further demonstrate that ASGO can benefit from the low-rank and block-wise diagonal properties. We also discuss practical modifications of ASGO and empirically verify the effectiveness of the algorithm on language model tasks.

link

2025-03-26

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. 0.627 Results are available at https://genjib.github.io/project_page/AVED/index.html

link

2025-03-25

OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations

3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is limited to closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark to evaluate 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for 23 scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we provide insights on feature precision, segmentation, and downstream capabilities. We evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. The benchmark is publicly available at: https://openlex3d.github.io/. 0.662

link

2025-03-25

BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance. This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism. Specifically, we leverage SAM's inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct "experts," and the similarity scoring serves as a rudimentary "gating network." We conducted extensive evaluations on both the Endovis17 medical dataset and RefCOCO series natural image datasets. On Endovis17, BiPrompt-SAM achieved 89.55% mDice and 81.46% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series datasets, our method attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. 0.682 Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions. BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion.
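The explicit selection step can be sketched in a few lines. The abstract only says that candidates are chosen "based on similarity metrics", so the use of IoU here is an assumption, and the function names are hypothetical; the masks are plain binary arrays standing in for the point-prompt candidates and the text-derived guidance mask.

```python
import numpy as np


def iou(a, b):
    """Intersection over union of two binary masks (0.0 if both are empty)."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def select_mask(candidates, text_mask):
    """Return the index of the candidate most similar to the text mask, plus all scores."""
    scores = [iou(c, text_mask) for c in candidates]
    return int(np.argmax(scores)), scores
```

In the MoE reading given in the abstract, `select_mask` plays the role of the rudimentary gating step: it picks one candidate rather than blending them.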

link

2025-03-25

Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion

Modern autonomous driving perception systems utilize complementary multi-modal sensors, such as LiDAR and cameras. Although sensor fusion architectures enhance performance in challenging environments, they still suffer significant performance drops under severe sensor failures, such as LiDAR beam reduction, LiDAR drop, limited field of view, camera drop, and occlusion. This limitation stems from inter-modality dependencies in current sensor fusion frameworks. In this study, we introduce an efficient and robust LiDAR-camera 3D object detector, referred to as MoME, which can achieve robust performance through a mixture of experts approach. Our MoME fully decouples modality dependencies using three parallel expert decoders, which use camera features, LiDAR features, or a combination of both to decode object queries, respectively. We propose a Multi-Expert Decoding (MED) framework, where each query is decoded selectively using one of three expert decoders. MoME utilizes an Adaptive Query Router (AQR) to select the most appropriate expert decoder for each query based on the quality of camera and LiDAR features. This ensures that each query is processed by the best-suited expert, resulting in robust performance across diverse sensor failure scenarios. We evaluated the performance of MoME on the nuScenes-R benchmark. 0.657 Our MoME achieved state-of-the-art performance in extreme weather and sensor failure conditions, significantly outperforming the existing models across various sensor failure scenarios.

link

2025-03-25

A comparative study of calibration techniques for finite strain elastoplasticity: Numerically-exact sensitivities for FEMU and VFM

Accurate identification of material parameters is crucial for predictive modeling in computational mechanics. The two primary approaches in the experimental mechanics community for calibration from full-field digital image correlation data are known as finite element model updating (FEMU) and the virtual fields method (VFM). In VFM, the objective function is a squared mismatch between internal and external virtual work or power. In FEMU, the objective function quantifies the weighted mismatch between model predictions and corresponding experimentally measured quantities of interest. It is minimized by iteratively updating the parameters of an FE model. While FEMU is seen as more flexible, VFM is commonly used instead because of FEMU's considerably greater computational expense. However, comparisons between the two methods usually involve approximations of gradients or sensitivities with finite difference schemes, thereby making direct assessments difficult. 0.719 Hence, in this study, we rigorously compare VFM and FEMU in the context of numerically-exact sensitivities obtained through local sensitivity analyses and the application of automatic differentiation software. To this end, both methods are tested on a finite strain elastoplasticity model. We conduct a series of test cases to assess both methods' robustness under practical challenges. 0.655

link

2025-03-25

SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation

Image generation technology has brought significant advancements across various fields but has also raised concerns about data misuse and potential rights infringements, particularly with respect to creating visual artworks. Current methods aimed at safeguarding artworks often employ adversarial attacks. However, these methods face challenges such as poor transferability, high computational costs, and the introduction of noticeable noise, which compromises the aesthetic quality of the original artwork. To address these limitations, we propose Structurally Imperceptible and Transferable Adversarial (SITA) attacks. SITA leverages a CLIP-based destylization loss, which decouples and disrupts the robust style representation of the image. This disruption hinders style extraction during stylized image generation, thereby impairing the overall stylization process. Importantly, SITA eliminates the need for a surrogate diffusion model, leading to significantly reduced computational overhead. The method's robust style feature disruption ensures high transferability across diverse models. Moreover, SITA introduces perturbations by embedding noise within the imperceptible structural details of the image. This approach effectively protects against style extraction without compromising the visual quality of the artwork. Extensive experiments demonstrate that SITA offers superior protection for artworks against unauthorized use in stylized generation. It significantly outperforms existing methods in terms of transferability, computational efficiency, and noise imperceptibility. 0.654 Code is available at https://github.com/A-raniy-day/SITA.

link

2025-03-25

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy

Meta-evaluation of automatic evaluation metrics -- assessing evaluation metrics themselves -- is crucial for accurately benchmarking natural language processing systems and has implications for scientific inquiry, production model development, and policy enforcement. While existing approaches to metric meta-evaluation focus on general statements about the absolute and relative quality of metrics across arbitrary system outputs, in practice, metrics are applied in highly contextual settings, often measuring the performance for a highly constrained set of system outputs. 0.644 For example, we may only be interested in evaluating a specific model or class of models. We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics. Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts. This observed variation highlights the importance of adopting context-specific metric evaluations over global ones.

link

2025-03-25

A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. 0.647 Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.

link

2025-03-25

Towards Online Multi-Modal Social Interaction Understanding

Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which prevents them from being applied to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenge of missing the useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. 0.632 The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.

link

2025-03-25

Extensions of regret-minimization algorithm for optimal design

We explore extensions and applications of the regret minimization framework introduced by \cite{design} for solving optimal experimental design problems. Specifically, we incorporate the entropy regularizer into this framework, leading to a novel sample selection objective and a provable sample complexity bound that guarantees a $(1+\epsilon)$-near optimal solution. We further extend the method to handle regularized optimal design settings. As an application, we use our algorithm to select a small set of representative samples from image classification datasets without relying on label information. To evaluate the quality of the selected samples, we train a logistic regression model and compare performance against several baseline sampling strategies. [0.641] Experimental results on MNIST, CIFAR-10, and a 50-class subset of ImageNet show that our approach consistently outperforms competing methods in most cases.

link

2025-03-25

RCC-PFL: Robust Client Clustering under Noisy Labels in Personalized Federated Learning

We address the problem of cluster identity estimation in a personalized federated learning (PFL) setting in which users aim to learn different personal models. The backbone of effective learning in such a setting is to cluster users into groups whose objectives are similar. A typical approach in the literature is to achieve this by training users' data on different proposed personal models and assigning them to groups based on which model achieves the lowest value of the users' loss functions. This process is to be done iteratively until group identities converge. A key challenge in such a setting arises when users have noisy labeled data, which may produce misleading values of their loss functions, and hence lead to ineffective clustering. To overcome this challenge, we propose a label-agnostic data similarity-based clustering algorithm, coined RCC-PFL, with three main advantages: the cluster identity estimation procedure is independent from the training labels; it is a one-shot clustering algorithm performed prior to the training; and it requires fewer communication rounds and less computation compared to iterative-based clustering methods. We validate our proposed algorithm using various models and datasets and show that it outperforms multiple baselines in terms of average accuracy and variance reduction. [0.763]

link

2025-03-25

In the Magma chamber: Update and challenges in ground-truth vulnerabilities revival for automatic input generator comparison

Fuzzing is a well-established technique for detecting bugs and vulnerabilities. With the surge of fuzzers and fuzzer platforms such as AFL and OSSFuzz comes the need to benchmark these tools' performance. A common problem is that vulnerability benchmarks are based on bugs in old software releases. For this very reason, Magma introduced the notion of forward-porting to reintroduce vulnerable code in current software releases. While their results are promising, the state-of-the-art lacks an update on the maintainability of this approach over time. [0.614] Indeed, adding the vulnerable code to a recent software version might either break its functionality or make the vulnerable code no longer reachable. We characterise the challenges with forward-porting by reassessing the portability of Magma's CVEs four years after its release and manually reintroducing the vulnerabilities in the current software versions. We find the straightforward process efficient for 17 of the 32 CVEs in our study. We further investigate why a trivial forward-porting process fails in the 15 other CVEs. This involves identifying the commits breaking the forward-porting process and reverting them in addition to the bug fix. While we manage to complete the process for nine of these CVEs, we provide an update on all 15 and explain the challenges we have been confronted with in this process. Thereby, we give the basis for future work towards a sustainable forward-ported fuzzing benchmark.

link

2025-03-25

CoLLM: A Large Language Model for Composed Image Retrieval

Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. [0.645] Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field. [0.687]

link

LLMs

2025-03-31

Implicit In-Context Learning: Evidence from Artificial Language Experiments

Humans acquire language through implicit learning, absorbing complex patterns without explicit awareness. While LLMs demonstrate impressive linguistic capabilities, it remains unclear whether they exhibit human-like pattern recognition during in-context learning at the inference level. [0.715] We adapted three classic artificial language learning experiments spanning morphology, morphosyntax, and syntax to systematically evaluate implicit learning at the inference level in two state-of-the-art OpenAI models: gpt-4o and o3-mini. Our results reveal linguistic domain-specific alignment between models and human behaviors: o3-mini aligns better in morphology, while both models align in syntax.

link

2025-03-31

Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms

Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers. [0.76] Large Language Models (LLMs) are extensively used as tooling platforms through structured output APIs, which ensure syntax compliance so that robust integration with existing software, such as agent systems, can be achieved. [0.668] However, the feature that enables grammar-guided structured output presents significant security vulnerabilities. In this work, we reveal a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a novel jailbreak class that weaponizes structured output constraints to bypass safety mechanisms. Unlike prior attacks focused on input prompts, CDA operates by embedding malicious intent in schema-level grammar rules (control-plane) while maintaining benign surface prompts (data-plane). We instantiate this with a proof-of-concept Chain Enum Attack, which achieves 96.2% attack success rates across proprietary and open-weight LLMs, including GPT-4o and Gemini-2.0-flash, on five safety benchmarks with a single query. [0.635] Our findings identify a critical security blind spot in current LLM architectures and urge a paradigm shift in LLM safety to address control-plane vulnerabilities, as current mechanisms focused solely on data-plane threats leave critical systems exposed. [0.763]

link

2025-03-31

Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval

In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. "Can you recommend some old classics for slow dancing?"). [0.614] In this setup, the recommended tracks are predicted by the LLM in an autoregressive way, i.e. the LLM generates the track titles one token at a time. While intuitive, this approach has several limitations. First, it is based on a general-purpose tokenization that is optimized for words rather than for track titles. Second, it necessitates an additional entity resolution layer that matches the track title to the actual track identifier. Third, the number of decoding steps scales linearly with the length of the track title, slowing down inference. In this paper, we propose to address the task of prompt-based music recommendation as a generative retrieval task. Within this setting, we introduce novel, effective, and efficient representations of track identifiers that significantly outperform commonly used strategies. We introduce Text2Tracks, a generative retrieval model that learns a mapping from a user's music recommendation prompt to the relevant track IDs directly. Through an offline evaluation on a dataset of playlists with language inputs, we find that (1) the strategy to create IDs for music tracks is the most important factor for the effectiveness of Text2Tracks, and semantic IDs significantly outperform commonly used strategies that rely on song titles as identifiers; and (2) provided with the right choice of track identifiers, Text2Tracks outperforms sparse and dense retrieval solutions trained to retrieve tracks from language prompts.

link

2025-03-31

Fermilab's Transition to Token Authentication

Fermilab is the first High Energy Physics institution to transition from X.509 user certificates to authentication tokens in production systems. All the experiments that Fermilab hosts are now using JSON Web Token (JWT) access tokens in their grid jobs. Many software components have been either updated or created for this transition, and most of the software is available to others as open source. The tokens are defined using the WLCG Common JWT Profile. Token attributes for all the tokens are stored in the Fermilab FERRY system, which generates the configuration for the CILogon token issuer. High security-value refresh tokens are stored in Hashicorp Vault configured by htvault-config, and JWT access tokens are requested by the htgettoken client through its integration with HTCondor. The Fermilab job submission system jobsub was redesigned to be a lightweight wrapper around HTCondor. [0.613] The grid workload management system GlideinWMS, which is also based on HTCondor, was updated to use tokens for pilot job submission. For automated job submissions, a managed tokens service was created to reduce duplication of effort and knowledge of how to securely keep tokens active. The existing Fermilab file transfer tool ifdh was updated to work seamlessly with tokens, as were the Fermilab POMS (Production Operations Management System), which is used to manage automatic job submission, and the RCDS (Rapid Code Distribution System), which is used to distribute analysis code via the CernVM FileSystem. The dCache storage system was reconfigured to accept tokens for authentication in place of X.509 proxy certificates. As some services and sites have not yet implemented token support, proxy certificates are still sent with jobs for backwards compatibility, but some experiments are beginning to transition to stop using them.
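For readers unfamiliar with JWT access tokens, the following minimal sketch shows how the claims inside such a token can be inspected with the Python standard library. It is not Fermilab's tooling: the helper names, the toy unsigned token, and the claim values are illustrative assumptions, and the code deliberately skips signature verification, which any real service must perform.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def decode_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    _header_b64, payload_b64, _signature = token.split(".")
    return json.loads(b64url_decode(payload_b64))


def make_unsigned_token(claims: dict) -> str:
    """Build a toy, unsigned token for demonstration purposes only."""
    def enc(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{enc({'alg': 'none', 'typ': 'JWT'})}.{enc(claims)}."


# Hypothetical claim values; real tokens are issued and signed by a token issuer.
token = make_unsigned_token(
    {"sub": "example-user", "scope": "storage.read:/ compute.create"}
)
claims = decode_claims(token)
print(claims["scope"].split())
```

A token is three base64url segments (header, payload, signature) joined by dots, so a job or service can read the granted scopes from the payload once the signature has been checked.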

link

2025-03-31

TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance

Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. [0.669] However, this enhanced reasoning capability results in an increased number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teachers' guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model's habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, offering a highly practical solution for efficient LLM deployment. [0.67]

link

2025-03-31

Synthetic News Generation for Fake News Classification

This study explores the generation and evaluation of synthetic fake news through fact-based manipulations using large language models (LLMs). [0.612] We introduce a novel methodology that extracts key facts from real articles, modifies them, and regenerates content to simulate fake news while maintaining coherence. To assess the quality of the generated content, we propose a set of evaluation metrics: coherence, dissimilarity, and correctness. The research also investigates the application of synthetic data in fake news classification, comparing traditional machine learning models with transformer-based models such as BERT. Our experiments demonstrate that transformer models, especially BERT, effectively leverage synthetic data for fake news detection, showing improvements with smaller proportions of synthetic data. Additionally, we find that fact verification features, which focus on identifying factual inconsistencies, provide the most promising results in distinguishing synthetic fake news. The study highlights the potential of synthetic data to enhance fake news detection systems, offering valuable insights for future research and suggesting that targeted improvements in synthetic data generation can further strengthen detection models.

link

2025-03-31

PAARS: Persona Aligned Agentic Retail Shoppers

In e-commerce, behavioral data is collected for decision making, which can be costly and slow. Simulation with LLM-powered agents is emerging as a promising alternative for representing human population behavior. [0.672] However, LLMs are known to exhibit certain biases, such as brand bias, review rating bias and limited representation of certain groups in the population, hence they need to be carefully benchmarked and aligned to user behavior. [0.703] Ultimately, our goal is to synthesise an agent population and verify that it collectively approximates a real sample of humans. To this end, we propose a framework that: (i) creates synthetic shopping agents by automatically mining personas from anonymised historical shopping data, (ii) equips agents with retail-specific tools to synthesise shopping sessions and (iii) introduces a novel alignment suite measuring distributional differences between humans and shopping agents at the group (i.e. population) level rather than the traditional "individual" level. Experimental results demonstrate that using personas improves performance on the alignment suite, though a gap remains to human behaviour. We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results. Finally, we discuss applications, limitations and challenges, setting the stage for impactful future work.

link

2025-03-31

What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. [0.632] However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.

link

2025-03-31

Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation

Large language models (LLMs) have made significant progress in general-purpose natural language processing tasks. [0.676] However, LLMs are still facing challenges when applied to domain-specific areas like telecommunications, which demands specialized expertise and adaptability to evolving standards. [0.786] This paper presents a novel framework that combines knowledge graph (KG) and retrieval-augmented generation (RAG) techniques to enhance LLM performance in the telecom domain. [0.675] The framework leverages a KG to capture structured, domain-specific information about network protocols, standards, and other telecom-related entities, comprehensively representing their relationships. By integrating KG with RAG, LLMs can dynamically access and utilize the most relevant and up-to-date knowledge during response generation. [0.746] This hybrid approach bridges the gap between structured knowledge representation and the generative capabilities of LLMs, significantly enhancing accuracy, adaptability, and domain-specific comprehension. [0.7] Our results demonstrate the effectiveness of the KG-RAG framework in addressing complex technical queries with precision. The proposed KG-RAG model attained an accuracy of 88% for question answering tasks on a frequently used telecom-specific dataset, compared to 82% for the RAG-only and 48% for the LLM-only approaches.

link

2025-03-31

Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning

We propose Rec-R1, a general reinforcement learning framework that bridges large language models (LLMs) with recommendation systems through closed-loop optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1 directly optimizes LLM generation using feedback from a fixed black-box recommendation model, without relying on synthetic SFT data from proprietary models such as GPT-4o. [0.613] This avoids the substantial cost and effort required for data distillation. To verify the effectiveness of Rec-R1, we evaluate it on two representative tasks: product search and sequential recommendation. Experimental results demonstrate that Rec-R1 not only consistently outperforms prompting- and SFT-based methods, but also achieves significant gains over strong discriminative baselines, even when used with simple retrievers such as BM25. Moreover, Rec-R1 preserves the general-purpose capabilities of the LLM, unlike SFT, which often impairs instruction-following and reasoning. [0.675] These findings suggest Rec-R1 as a promising foundation for continual task-specific adaptation without catastrophic forgetting.

link

2025-03-31

Is analogy enough to draw novel adjective-noun inferences?

Recent work (Ross et al., 2025, 2024) has argued that the ability of humans and LLMs respectively to generalize to novel adjective-noun combinations shows that they each have access to a compositional mechanism to determine the phrase's meaning and derive inferences. [0.604] We study whether these inferences can instead be derived by analogy to known inferences, without need for composition. We investigate this by (1) building a model of analogical reasoning using similarity over lexical items, and (2) asking human participants to reason by analogy. While we find that this strategy works well for a large proportion of the dataset of Ross et al. (2025), there are novel combinations for which both humans and LLMs derive convergent inferences but which are not well handled by analogy. [0.611] We thus conclude that the mechanism humans and LLMs use to generalize in these cases cannot be fully reduced to analogy, and likely involves composition. [0.637]

link

2025-03-31

A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG

This study presents a systematic comparison of three approaches for the analysis of mental health text using large language models (LLMs): prompt engineering, retrieval augmented generation (RAG), and fine-tuning. Using LLaMA 3, we evaluate these approaches on emotion classification and mental health condition detection tasks across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health conditions) but requires substantial computational resources and large training sets, while prompt engineering and RAG offer more flexible deployment with moderate performance (40-68% accuracy). Our findings provide practical insights for implementing LLM-based solutions in mental health applications, highlighting the trade-offs between accuracy, computational requirements, and deployment flexibility. [0.689]

link

2025-03-31

BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models

In this research, we introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs). [0.611] Building upon the BEATS framework, we present a bias benchmark for LLMs that measures performance across 29 distinct metrics. [0.656] These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality-related misinformation risk. These metrics enable a quantitative assessment of the extent to which LLM-generated responses may perpetuate societal prejudices that reinforce or expand systemic inequities. [0.682] To achieve a high score on this benchmark, an LLM must show very equitable behavior in its responses, making it a rigorous standard for responsible AI evaluation. [0.631] Empirical results based on data from our experiment show that 37.65% of outputs generated by industry-leading models contained some form of bias, highlighting a substantial risk of using these models in critical decision-making systems. The BEATS framework and benchmark offer a scalable and statistically rigorous methodology to benchmark LLMs, diagnose factors driving biases, and develop mitigation strategies. With the BEATS framework, our goal is to help the development of more socially responsible and ethically aligned AI models.

link

2025-03-31

ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion

Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving (i.e., constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. [0.649] However, existing methods face critical limitations in simultaneously achieving scalability and controllability. In this paper, we introduce ORAL, a novel conditional recurrent diffusion framework that addresses these challenges. ORAL incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that can seamlessly transfer across evolving foundation models. Our approach successfully scales to billions-of-parameter LLMs and maintains controllability. [0.678] Through extensive experiments across seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that ORAL generates high-quality LoRA parameters that achieve comparable or superior performance to vanilla trained counterparts. [0.683]

link

2025-03-31

SQuat: Subspace-orthogonal KV Cache Quantization

The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. [0.708] It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism's outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by a factor of 2.17 to 2.82, improves throughput by a factor of 2.45 to 3.60, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
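The role of the orthogonality condition can be illustrated with a small sketch. This is not SQuat's quantizer: it takes a crudely rounded key, removes the part of the rounding error that lies along a single query direction, and checks that the query-key dot product (the attention logit) is then unchanged. The adjusted key no longer sits on the rounding grid, so the code only demonstrates the geometric property; all vectors and helper names are made up for illustration.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_component(error, direction):
    """Project `error` onto the orthogonal complement of `direction`."""
    scale = dot(error, direction) / dot(direction, direction)
    return [e - scale * d for e, d in zip(error, direction)]


query = [1.0, 2.0, -1.0]          # spans a one-dimensional query subspace
key = [0.37, -1.24, 0.81]         # original key vector

rounded = [round(k * 2) / 2 for k in key]       # crude grid with step 0.5
error = [r - k for r, k in zip(rounded, key)]   # plain rounding error

# Keep only the part of the error that the query direction cannot "see".
ortho_error = remove_component(error, query)
adjusted = [k + e for k, e in zip(key, ortho_error)]

print("original logit:", dot(query, key))
print("rounded logit: ", dot(query, rounded))
print("adjusted logit:", dot(query, adjusted))
```

Because the remaining error is orthogonal to the query, the first and third logits agree up to floating-point error, while plain rounding shifts the logit.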

link

2025-03-31

Effectively Controlling Reasoning Models through Thinking Intervention

Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. [0.681] In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. [0.756] We conduct comprehensive evaluations across multiple tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs. [0.764]

link

2025-03-31

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. [0.62] Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

link

2025-03-31

Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). [0.683] While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors. In contrast, System 1 reasoning is computationally efficient but leads to suboptimal performance. Consequently, it is critical to balance the trade-off between performance (benefits) and computational costs (budgets), giving rise to the concept of reasoning economy. In this survey, we provide a comprehensive analysis of reasoning economy in both the post-training and test-time inference stages of LLMs, encompassing i) the cause of reasoning inefficiency, ii) behavior analysis of different reasoning patterns, and iii) potential solutions to achieve reasoning economy. [0.629] By offering actionable insights and highlighting open challenges, we aim to shed light on strategies for improving the reasoning economy of LLMs, thereby serving as a valuable resource for advancing research in this evolving area. [0.774] We also provide a public repository to continually track developments in this fast-evolving field.

link

2025-03-27

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

Alignment of Large Language Models (LLMs) is crucial for safe and trustworthy deployment in applications. [0.718] Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen the test-time performance with respect to the target task, we propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies. [0.658] Treating each prior policy as an agent in the spirit of mixture of agent collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. [0.615] This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. [0.67] Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrate the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
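The token-level selection idea can be sketched as a small decoding loop: every agent proposes a next token, a utility function scores each proposal, and the best one is appended. This is only an illustration under stated assumptions. The agents and the utility below are toy stand-ins (the paper uses aligned LLM policies and a long-term utility metric), and all names are hypothetical.

```python
from typing import Callable, List, Sequence

Agent = Callable[[List[str]], str]            # proposes the next token
Utility = Callable[[List[str], str], float]   # scores a candidate token


def collaborative_decode(agents: Sequence[Agent], utility: Utility,
                         prompt: List[str], max_tokens: int) -> List[str]:
    """At each step, keep the agent proposal with the highest utility."""
    output = list(prompt)
    for _ in range(max_tokens):
        proposals = [agent(output) for agent in agents]
        best = max(proposals, key=lambda tok: utility(output, tok))
        if best == "<eos>":
            break
        output.append(best)
    return output[len(prompt):]


# Toy stand-ins for aligned LLM policies (hypothetical).
def polite_agent(prefix: List[str]) -> str:
    return "please" if len(prefix) < 2 else "<eos>"


def terse_agent(prefix: List[str]) -> str:
    return "ok" if len(prefix) < 3 else "<eos>"


# Toy utility: prefers "please" first, then "ok", then stopping.
def toy_utility(prefix: List[str], token: str) -> float:
    preferred = {1: "please", 2: "ok"}.get(len(prefix), "<eos>")
    return 1.0 if token == preferred else 0.0


print(collaborative_decode([polite_agent, terse_agent], toy_utility, ["hi"], 5))
```

In this toy run the output switches from one agent to the other between tokens, which is the policy-switching behaviour the abstract describes.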

link

2025-03-27

Effective Skill Unlearning through Intervention and Abstention

Large Language Models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via intervention and abstention respectively: Neuron Adjust and Key Space Detection. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, Key Space Detection achieves over 80% relative performance drop on the forgetting skill and less than 10% relative performance drop on other skills and the model's general knowledge (MMLU) for most unlearning tasks. Our code is available at https://github.com/Trustworthy-ML-Lab/effective_skill_unlearning
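
The abstention idea (separating a skill's queries with a hypercube in key space) can be sketched as below. This is a minimal illustration under stated assumptions, not the released code: building the box as mean ± `alpha` standard deviations per dimension, the `alpha` value, and the 2-D toy vectors are all hypothetical choices.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_hypercube(skill_keys: List[Vector], alpha: float = 2.0) -> Tuple[List[float], List[float]]:
    """Fit an axis-aligned box around key vectors that trigger the target skill.

    Assumption: each side spans mean +/- alpha * std of that dimension.
    """
    lows: List[float] = []
    highs: List[float] = []
    for d in range(len(skill_keys[0])):
        column = [v[d] for v in skill_keys]
        center, spread = mean(column), pstdev(column)
        lows.append(center - alpha * spread)
        highs.append(center + alpha * spread)
    return lows, highs


def should_abstain(query_key: Vector, lows: List[float], highs: List[float]) -> bool:
    """Abstain when the query's key vector falls inside the skill's hypercube."""
    return all(lo <= x <= hi for x, lo, hi in zip(query_key, lows, highs))


# Hypothetical key vectors for queries that exercise the skill to be unlearned.
math_keys = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.0, 2.0)]
lows, highs = fit_hypercube(math_keys)

print(should_abstain((1.1, 1.9), lows, highs))  # query near the skill cluster
print(should_abstain((3.0, 2.0), lows, highs))  # query far from the cluster
```

A query that lands inside the box is treated as exercising the forgotten skill and is refused, while everything outside the box is answered normally, which is why performance on other skills can remain largely intact.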

link

2025-03-27

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

Ensuring the reliability and effectiveness of software release decisions is critical, particularly in safety-critical domains like automotive systems. Precise analysis of release validation data, often presented in tabular form, plays a pivotal role in this process. However, traditional methods that rely on manual analysis of extensive test datasets and validation metrics are prone to delays and high costs. Large Language Models (LLMs) offer a promising alternative but face challenges in analytical reasoning, contextual understanding, handling out-of-scope queries, and processing structured test data consistently; these limitations hinder their direct application in safety-critical scenarios. This paper introduces GateLens, an LLM-based tool for analyzing tabular data in the automotive domain. GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code. It outperforms the baseline system on benchmarking datasets, achieving higher F1 scores and handling complex and ambiguous queries with greater robustness. Ablation studies confirm the critical role of the RA module, with performance dropping sharply when omitted. Industrial evaluations reveal that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability. As the presented results demonstrate, GateLens achieved high performance without relying on few-shot examples, showcasing strong generalization across various query types from diverse company roles. Insights from deploying GateLens with a partner automotive company offer practical guidance for integrating AI into critical workflows such as release validation. Results show that by automating test result analysis, GateLens enables faster, more informed, and dependable release decisions, and can thus advance software scalability and reliability in automotive systems.
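
To make the query-to-Relational-Algebra-to-Python pipeline concrete, here is a small sketch of the middle and last steps. It is an illustration only, not GateLens itself: the `Select` and `Project` classes, the `run_plan` helper, and the column names are hypothetical, and the plan is written by hand where the tool would derive it from the natural language query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple, Union

Row = Dict[str, object]


@dataclass(frozen=True)
class Select:
    """Relational selection: keep rows that satisfy the predicate."""
    predicate: Callable[[Row], bool]


@dataclass(frozen=True)
class Project:
    """Relational projection: keep only the named columns."""
    columns: Tuple[str, ...]


Operator = Union[Select, Project]


def run_plan(plan: Sequence[Operator], rows: List[Row]) -> List[Row]:
    """Evaluate the RA operators in order over an in-memory table."""
    for op in plan:
        if isinstance(op, Select):
            rows = [r for r in rows if op.predicate(r)]
        elif isinstance(op, Project):
            rows = [{c: r[c] for c in op.columns} for r in rows]
        else:
            raise TypeError(f"unsupported operator: {op!r}")
    return rows


# Hypothetical release validation table.
results = [
    {"test_id": "T1", "release": "R1", "status": "failed"},
    {"test_id": "T2", "release": "R2", "status": "passed"},
    {"test_id": "T3", "release": "R2", "status": "failed"},
]

# Hand-written plan for the query "Which tests failed in release R2?"
plan = [
    Select(lambda r: r["release"] == "R2" and r["status"] == "failed"),
    Project(("test_id",)),
]

answer = run_plan(plan, results)
print(answer)
```

Keeping an explicit intermediate plan means the query logic can be inspected before any code runs, which may be one reason an RA step helps with ambiguous queries.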

link

2025-03-27

Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck

In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously suitable for (a) generative and (b) discriminative tasks, (c) is nearly lossless, and (d) is storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot is a "double-forward pass" training strategy, whereby, during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, used as a direct replacement for the image ones. The training signal is provided by two losses: an autoregressive one applied after the second pass that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot results in highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state-of-the-art on image retrieval and compositionality.

link

2025-03-27

MemInsight: Autonomous Memory Augmentation for LLM Agents

Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory capabilities, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By leveraging autonomous augmentation of historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios: conversational recommendation, question answering, and event summarization. On the LLM-REDIAL dataset, MemInsight boosts persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.

link

Developer Research

2025-03-31

MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and improve adaptability. We also introduce MaintainBench, a benchmark comprising requirement changes and corresponding dynamic metrics on maintenance effort. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves maintainability metrics by 14-30% with even higher correctness, i.e., pass@k. Our work not only provides the foundation of maintainable code generation, but also highlights the need for more holistic code quality research. Resources: https://github.com/IAAR-Shanghai/MaintainCoder.

link

2025-03-27

Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs

Repository-level software repair faces challenges in bridging semantic gaps between issue descriptions and code patches. Existing approaches, which mostly depend on large language models (LLMs), suffer from semantic ambiguities, limited structural context understanding, and insufficient reasoning capability. To address these limitations, we propose KGCompass with two innovations: (1) a novel repository-aware knowledge graph (KG) that accurately links repository artifacts (issues and pull requests) and codebase entities (files, classes, and functions), allowing us to effectively narrow down the vast search space to only the 20 most relevant functions with accurate candidate bug locations and contextual information, and (2) a path-guided repair mechanism that leverages KG-mined entity paths, tracing through which allows us to augment LLMs with relevant contextual information to generate precise patches along with their explanations. Experimental results on SWE-Bench-Lite demonstrate that KGCompass achieves state-of-the-art repair performance (45.67%) and function-level localization accuracy (51.33%) across open-source approaches, costing only $0.20 per repair. Our analysis reveals that among successfully localized bugs, 69.7% require multi-hop traversals through the knowledge graph, without which LLM-based approaches struggle to accurately locate bugs. The knowledge graph built in KGCompass is language agnostic and can be incrementally updated, making it a practical solution for real-world development environments.
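
The multi-hop localization idea can be sketched with a breadth-first traversal from an issue node to nearby function nodes. This is a simplified illustration, not KGCompass's algorithm: the toy graph, the `func:` naming convention, and ranking purely by hop count are assumptions made for the example.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple


def rank_functions(
    graph: Dict[str, List[str]],
    start: str,
    is_function: Callable[[str], bool],
    top_k: int = 20,
) -> List[Tuple[str, int]]:
    """Return up to top_k function nodes reachable from start, nearest first."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    # Sort by hop count, then by name so ties are deterministic.
    candidates = sorted((d, n) for n, d in dist.items() if is_function(n))
    return [(n, d) for d, n in candidates[:top_k]]


# Hypothetical knowledge graph linking an issue, a pull request, a file,
# and functions (adjacency lists of outgoing edges).
kg = {
    "issue#1": ["pr#7", "file:parser.py"],
    "pr#7": ["func:parse_header"],
    "file:parser.py": ["func:parse_body", "func:parse_header"],
    "func:parse_header": ["func:decode"],
}

ranked = rank_functions(kg, "issue#1", lambda n: n.startswith("func:"), top_k=2)
print(ranked)
```

In this toy graph no function is a direct neighbor of the issue, so every candidate requires at least two hops, mirroring the abstract's point that many bugs are only reachable through multi-hop traversal.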

link

2025-03-25

SLA-Awareness for AI-assisted coding

The integration of AI-assisted coding tools within development environments drastically reduces development time, and allows developers to focus more on creative and critical aspects of software engineering through the use of Code Large Language Models (CodeLLMs). These coding assistants automate repetitive and time-consuming coding tasks such as code generation, code completion, code summarization, and code translation. Responsiveness is a crucial requirement of these coding assistants to maintain real-time interactivity, such that their use does not impede the developers' workflows. Different coding tasks have unique characteristics and latency requirements: Time-To-First-Token (TTFT) latency is essential for code completion tasks, while End-To-End (E2E) latency is crucial for code translation tasks. Managing these varying requirements simultaneously while optimizing resource usage poses significant challenges. Existing work adopts the Model-as-a-Service paradigm for serving individual CodeLLMs, but cannot effectively manage latency requirements of concurrent coding tasks and sequences of CodeLLM inference calls, due to a lack of end-to-end latency awareness. Another challenge is keeping resource utilization high when the serving system is deployed on a shared cluster environment. To address these challenges, we propose Coding Assistant Task Orchestrator (CATO), a runtime system designed to serve a diverse assortment of coding tasks while meeting latency requirements and maximizing resource utilization. Our experiments demonstrate that when all types of coding tasks were served simultaneously, for TTFT-critical tasks, CATO improves overall Goodput rate and resource utilization by up to 10% and 41.1%, respectively. P95 E2E latency was also reduced by 18% for code summarization tasks, and P95 TTFT for code generation tasks was reduced by 14% compared against state-of-the-art systems.
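
The metrics named in this abstract can be made concrete with a short sketch. It assumes a common definition of goodput rate (the fraction of requests that meet their latency objective) and a nearest-rank P95; the SLO thresholds, the task-to-metric mapping in `SLOS`, and the sample requests are hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Request:
    task: str
    ttft_ms: float  # Time-To-First-Token latency
    e2e_ms: float   # End-To-End latency


# Hypothetical SLOs: which latency metric matters per task, and its limit.
SLOS: Dict[str, Tuple[str, float]] = {
    "completion": ("ttft_ms", 200.0),    # TTFT-critical task
    "translation": ("e2e_ms", 5000.0),   # E2E-critical task
}


def meets_slo(req: Request) -> bool:
    metric, limit_ms = SLOS[req.task]
    return getattr(req, metric) <= limit_ms


def goodput_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests that satisfy their task-specific latency SLO."""
    return sum(meets_slo(r) for r in requests) / len(requests)


def p95(values: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


served: List[Request] = [
    Request("completion", ttft_ms=150.0, e2e_ms=900.0),
    Request("completion", ttft_ms=250.0, e2e_ms=800.0),
    Request("translation", ttft_ms=400.0, e2e_ms=4500.0),
    Request("translation", ttft_ms=300.0, e2e_ms=6000.0),
]

print(goodput_rate(served))
print(p95([r.ttft_ms for r in served]))
```

Note that the second completion request finishes quickly end to end but still misses its objective because its first token arrives late, which is why a serving system needs to know which metric each task is judged on.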

link

2025-03-18

MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration

Maintaining and scaling software systems relies heavily on effective code refactoring, yet this process remains labor-intensive, requiring developers to carefully analyze existing codebases and prevent the introduction of new defects. Although recent advancements have leveraged Large Language Models (LLMs) to automate refactoring tasks, current solutions are constrained in scope and lack mechanisms to guarantee code compilability and successful test execution. In this work, we introduce MANTRA, a comprehensive LLM agent-based framework that automates method-level refactoring. MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning to emulate human decision-making during refactoring while preserving code correctness and readability. Our empirical study, conducted on 703 instances of "pure refactorings" (i.e., code changes exclusively involving structural improvements), drawn from 10 representative Java projects, covers the six most prevalent refactoring operations. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model (RawGPT), achieving an 82.8% success rate (582/703) in producing code that compiles and passes all tests, compared to just 8.7% (61/703) with RawGPT. Moreover, in comparison to IntelliJ's LLM-powered refactoring tool (EM-Assist), MANTRA exhibits a 50% improvement in generating Extract Method transformations. A usability study involving 37 professional developers further shows that refactorings performed by MANTRA are perceived to be as readable and reusable as human-written code, and in certain cases, even more favorable. These results highlight the practical advantages of MANTRA and emphasize the growing potential of LLM-based systems in advancing the automation of software refactoring tasks.

link

2025-03-12

Validity in Design Science

Researchers must ensure that the claims about the knowledge produced by their work are valid. However, validity is neither well-understood nor consistently established in design science, which involves the development and evaluation of artifacts (models, methods, instantiations, and theories) to solve problems. As a result, it is challenging to demonstrate and communicate the validity of knowledge claims about artifacts. This paper defines validity in design science and derives the Design Science Validity Framework and a process model for applying it. The framework comprises three high-level claim and validity types (criterion, causal, and context) as well as validity subtypes. The framework guides researchers in integrating validity considerations into projects employing design science and contributes to the growing body of research on design science methodology. It also provides a systematic way to articulate and validate the knowledge claims of design science projects. We apply the framework to examples from existing research and then use it to demonstrate the validity of knowledge claims about the framework itself.

link

2025-03-12

Automating Code Review: A Systematic Literature Review

Code review consists of assessing the code written by teammates with the goal of increasing code quality. Empirical studies have documented the benefits of this practice, which nevertheless comes at a cost in developers' time. For this reason, researchers have proposed techniques and tools to automate code review tasks such as reviewer selection (i.e., identifying suitable reviewers for a given code change) or the actual review of a given change (i.e., recommending improvements to the contributor as a human reviewer would do). Given the substantial number of papers recently published on the topic, it may be challenging for researchers and practitioners to get a complete overview of the state-of-the-art. We present a systematic literature review (SLR) featuring 119 papers concerning the automation of code review tasks. We provide: (i) a categorization of the code review tasks automated in the literature; (ii) an overview of the under-the-hood techniques used for the automation, including the datasets used for training data-driven techniques; (iii) publicly available techniques and datasets used for their evaluation, with a description of the evaluation metrics usually adopted for each task. The SLR concludes with a discussion of the current limitations of the state-of-the-art, with insights for future research directions.

link

2025-03-11

From Expert to Novice: An Empirical Study on Software Architecture Explanations

The sharing of knowledge about software architecture is crucial in software development, particularly during the onboarding of new developers. However, existing documentation often falls short due to issues like incompleteness and ambiguity. Consequently, oral explanations are used for knowledge transfer. This study investigates what constitutes a good explanation of software architecture through an empirical study. It aims to explore how software architecture explanations are conducted, identify the main challenges, and suggest improvements. It addresses five key areas: relevant architectural concerns, explanation plans, supporting artefacts, typical questions, and expectations. An exploratory field study was conducted using semi-structured interviews with 17 software professionals, including 9 architecture explainers and 8 explainees. The study finds that an explanation must balance both problem and technical domains while considering the explainee's role, experience, and the goal of the explanation. The concept of the explanation window, which adjusts the level of detail and scope, is introduced to address these variables. We also extend the Twin Peaks model to guide the interplay between problem and solution domains during architectural explanations by adding emphasis on the context surrounding both domains. Future research should focus on developing better tools and processes to support architecture explanations.

link

2025-03-11

GraphSense: Graph Embedding Based Code Suggestion Framework

Code suggestions have become an integral part of IDEs, and developers use code suggestions generated by IDEs all the time. These code suggestions are mostly for calling a method of an object or for using a function of a library, not for a possible next line of code. GPT-based models are too slow or resource intensive for real-time code suggestions in local environments. As a solution, GraphSense was introduced, which provides code suggestions in real time with minimal resource usage.

link

Data Annotation Techniques