Vincent's Arxiv FrontPage


Generated on 2025-03-25.


This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions.


New Datasets

2025-03-24

EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos

Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room.A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions.We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos.Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate.Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets.The dataset will be released at https://github.com/Fujiry0/EgoSurgery. 0.713

link

2025-03-24

CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research

Data are crucial in various computer-related fields, including music information retrieval (MIR), an interdisciplinary area bridging computer science and music.This paper introduces CCMusic, an open and diverse database comprising multiple datasets specifically designed for tasks related to Chinese music, highlighting our focus on this culturally rich domain. 0.761The database integrates both published and unpublished datasets, with steps taken such as data cleaning, label refinement, and data structure unification to ensure data consistency and create ready-to-use versions.We conduct benchmark evaluations for all datasets using a unified evaluation framework developed specifically for this purpose.This publicly available framework supports both classification and detection tasks, ensuring standardized and reproducible results across all datasets.The database is hosted on HuggingFace and ModelScope, two open and multifunctional data and model hosting platforms, ensuring ease of accessibility and usability.

link

2025-03-24

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering.To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts.However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability.This paper proposes the first multi-concept personalization paradigm, MC-LLaVA.Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step.To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens.Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities.To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset.We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity.Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants.The code and dataset will be publicly available at $\href{https://github.com/arctanxarc/MC-LLaVA}{https://github.com/arctanxarc/MC-LLaVA}$. 0.808

link

2025-03-24

SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world.To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions.SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities.We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality.We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. 0.709Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.

link

2025-03-24

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding.This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs.We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. 0.745Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes.Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.

link

2025-03-20

Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data

Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce.Existing methods leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity.In this paper, we propose \textit{Chain of Functions (CoF)}, a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity.Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. \textit{CoF} provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: eliminating reliance on extremely large models.Employing \textit{CoF}, we construct the \textit{ChartCoF} dataset, with 1.4k complex reasoning Q\&A for fine-grained analysis and 50k Q\&A for reasoning enhancement. 0.838The fine-grained evaluation on \textit{ChartCoF} reveals varying performance across question taxonomies for each MLLM, and the experiments also show that finetuning with \textit{ChartCoF} achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks.Furthermore, the novel paradigm of function-governed rationale generation in \textit{CoF} could inspire broader applications beyond charts.

link

2025-03-20

A Dataset of Performance Measurements and Alerts from Mozilla (Data Artifact)

Performance regressions in software systems can lead to significant financial losses and degraded user satisfaction, making their early detection and mitigation critical.Despite the importance of practices that capture performance regressions early, there is a lack of publicly available datasets that comprehensively capture real-world performance measurements, expert-validated alerts, and associated metadata such as bugs and testing conditions. To address this gap, we introduce a unique dataset to support various research studies in performance engineering, anomaly detection, and machine learning.This dataset was collected from Mozilla Firefox's performance testing infrastructure and comprises 5,655 performance time series, 17,989 performance alerts, and detailed annotations of resulting bugs collected from May 2023 to May 2024.By publishing this dataset, we provide researchers with an invaluable resource for studying performance trends, developing novel change point detection methods, and advancing performance regression analysis across diverse platforms and testing environments.The dataset is available at https://doi.org/10.5281/zenodo.14642238 0.831

link

2025-03-20

NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises.Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes.To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods.Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps.To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. 0.811Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.

link

2025-03-20

Panoptic-CUDAL Technical Report: Rural Australia Point Cloud Dataset in Rainy Conditions

Existing autonomous driving datasets are predominantly oriented towards well-structured urban settings and favorable weather conditions, leaving the complexities of rural environments and adverse weather conditions largely unaddressed.Although some datasets encompass variations in weather and lighting, bad weather scenarios do not appear often.Rainfall can significantly impair sensor functionality, introducing noise and reflections in LiDAR and camera data and reducing the system's capabilities for reliable environmental perception and safe navigation.We introduce the Panoptic-CUDAL dataset, a novel dataset purpose-built for panoptic segmentation in rural areas subject to rain. 0.848By recording high-resolution LiDAR, camera, and pose data, Panoptic-CUDAL offers a diverse, information-rich dataset in a challenging scenario.We present analysis of the recorded data and provide baseline results for panoptic and semantic segmentation methods on LiDAR point clouds.The dataset can be found here: https://robotics.sydney.edu.au/our-research/intelligent-transportation-systems/ 0.838

link

2025-03-20

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views.We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perceptions, involving occlusions and degraded performance in distant regions.To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views.Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame.Our code and newly curated dataset are available at https://github.com/chenchen235/SA-Occ. 0.867

link

2025-03-20

GAEA: A Geolocation Aware Conversational Model

Image geolocalization, in which, traditionally, an AI model predicts the precise GPS coordinates of an image is a challenging task with many downstream applications.However, the user cannot utilize the model to further their knowledge other than the GPS coordinate; the model lacks an understanding of the location and the conversational ability to communicate with the user.In recent days, with tremendous progress of large multimodal models (LMMs) proprietary and open-source researchers have attempted to geolocalize images via LMMs.However, the issues remain unaddressed; beyond general tasks, for more specialized downstream tasks, one of which is geolocalization, LMMs struggle.In this work, we propose to solve this problem by introducing a conversational model GAEA that can provide information regarding the location of an image, as required by a user.No large-scale dataset enabling the training of such a model exists.Thus we propose a comprehensive dataset GAEA with 800K images and around 1.6M question answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. 0.907For quantitative evaluation, we propose a diverse benchmark comprising 4K image-text pairs to evaluate conversational capabilities equipped with diverse question types.We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision by 25.69% and the best proprietary model, GPT-4o by 8.28%.Our dataset, model and codes are available 0.957

link

2025-03-19

TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification

LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability.In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created.To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs.Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed.To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. 0.802To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance.We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.

link

2025-03-19

SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes

Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes - offering richer spatial representation - remain underexplored.This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km2 with 21 classes. 0.8The dataset was created using our own annotation tool, which supports both face- and texture-based annotations with efficient interactive selection. 0.767We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. 0.759Our project page is available at https://tudelft3d.github.io/SUMParts/.

link

2025-03-19

aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion

Repository-level code completion aims to complete code based on the long contexts of the repository.Existing studies extract long contexts from the repository as inputs and leverage Large Language Models (LLMs) to generate code.However, we reveal a severe limitation of LLMs, i.e., LLMs may ignore the information within long contexts in code completion.In other words, even the contexts contain useful information (e.g., relevant APIs or similar code), LLMs may fail to utilize this information.We think this limitation is caused by an inherent bias in LLMs, i.e., relying on nearby contexts and ignoring long-range contexts.To address this, we propose a novel fine-tuning approach named CoLT.The core idea of CoLT is to provide explicit supervision signals, which emphasize that long-range contexts may hold relevant information.Specifically, CoLT proposes a reinforcement learning-based training, which explicitly encourages models to utilize the information within long contexts and punishes models for ignoring long contexts.To support CoLT, we release CoLT-132K, a large-scale dataset with 132k samples across four languages, each containing long-context inputs. 0.932We apply CoLT to a popular LLM - aiXcoder-7B and release aiXcoder-7B-v2.We conduct extensive experiments on CoLT-132K and a public benchmark - CrossCodeEval.Our experiments yield the results: 1.Effectiveness.CoLT substantially improves aiXcoder-7B. aiXcoder-7B-v2 outperforms aiXcoder-7B by up to 44% in exact match.aiXcoder-7B-v2 becomes the state-of-the-art 7B model in code completion and even surpasses larger models.2. Generalizability.The capability learned by CoLT can generalize to new languages.Besides, CoLT is model-agnostic and effectively improves multiple LLMs.3. Enhanced Context Utilization Capability.CoLT significantly improves the capability of LLMs in utilizing the relevant information within long contexts.

link

2025-03-19

Genomic data processing with GenomeFlow

Advances in genome sequencing technologies generate massive amounts of sequence data that are increasingly analyzed and shared through public repositories. 0.719On-demand infrastructure services on cloud computing platforms enable the processing of such large-scale genomic sequence data in distributed processing environments with a significant reduction in analysis time.However, parallel processing on cloud computing platforms presents many challenges to researchers, even skillful bioinformaticians.In particular, it is difficult to design a computing architecture optimized to reduce the cost of computing and disk storage as genomic data analysis pipelines often employ many heterogeneous tools with different resource requirements.To address these issues, we developed GenomeFlow, a tool for automated development of computing architecture and resource optimization on Google Cloud Platform, which allows users to process a large number of samples at minimal cost.We outline multiple use cases of GenomeFlow demonstrating its utility to significantly reduce computing time and cost associated with analyzing genomic and transcriptomic data from hundreds to tens of thousands of samples from several consortia.Here, we describe a step-by-step protocol on how to use GenomeFlow for a common genomic data processing task.We introduce this example protocol geared toward a bioinformatician with little experience in cloud computing.

link

2025-03-19

Towards efficient keyword spotting using spike-based time difference encoders

Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used.However, its deployment is often limited by the extreme low-power constraints of the target embedded systems.Here, we explore the Temporal Difference Encoder (TDE) performance in keyword spotting.This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors.We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. 0.741We compare three Spiking Neural Networks (SNNs) architectures to learn and classify spatio-temporal signals.The proposed SNN architectures are made of three layers with variation in its hidden layer composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons.We first show that the spike trains of the frequency-converted spoken digits have a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task.We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on the accuracy and synaptic operations.The resulting accuracy of the feedforward TDE network (89%) is higher than the feedforward CuBa-LIF network (71%) and close to the recurrent CuBa-LIF network (91%).However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same amount of synapses.In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset.Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.

link

2025-03-19

Visual Persona: Foundation Model for Full-Body Human Customization

We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions.Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations.Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain.To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. 0.715For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images.Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs.Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.

link

2025-03-19

Visual Position Prompt for MLLM based Visual Grounding

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding.This limitation arises from two key factors.First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations.Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability.To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability.VPP-LLaVA integrates two complementary mechanisms.The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues.The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggests probable object locations.We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. 0.754Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples).The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance. 0.766

link

2025-03-18

EnvBench: A Benchmark for Automated Environment Setup

Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in software engineering domain.In this work, we consider a cornerstone task for automating work with software repositories-environment setup, i.e., a task of configuring a repository-specific development environment on a system.Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice.To address this gap, we introduce a comprehensive environment setup benchmark EnvBench.It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts.To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages.We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, that we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini.The best approach manages to successfully configure 6.69% repositories for Python and 29.47% repositories for JVM, suggesting that EnvBench remains challenging for current approaches.Our benchmark suite is publicly available at https://github.com/JetBrains-Research/EnvBench.The dataset and experiment trajectories are available at https://jb.gg/envbench. 0.732

link

2025-03-18

Bolt3D: Generating 3D Scenes in Seconds

We present a latent diffusion model for fast feed-forward 3D scene generation.Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU.We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations.To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. 0.737Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

link

2025-03-17

Goal2Story: A Multi-Agent Fleet based on Privately Enabled sLLMs for Impacting Mapping on Requirements Elicitation

As requirements drift with rapid iterations, agile development becomes the dominant paradigm.Goal-driven Requirements Elicitation (RE) is a pivotal yet challenging task in agile project development due to its heavy tangling with adaptive planning and efficient collaboration.Recently, AI agents have shown promising ability in supporting requirements analysis by saving significant time and effort for stakeholders.However, current research mainly focuses on functional RE, and research works have not been reported bridging the long journey from goal to user stories.Moreover, considering the cost of LLM facilities and the need for data and idea protection, privately hosted small-sized LLM should be further utilized in RE.To address these challenges, we propose Goal2Story, a multi-agent fleet that adopts the Impact Mapping (IM) framework while merely using cost-effective sLLMs for goal-driven RE.Moreover, we introduce a StorySeek dataset that contains over 1,000 user stories (USs) with corresponding goals and project context information, as well as the semi-automatic dataset construction method. 0.803For evaluation, we proposed two metrics: Factuality Hit Rate (FHR) to measure consistency between the generated USs with the dataset and Quality And Consistency Evaluation (QuACE) to evaluate the quality of the generated USs.Experimental results demonstrate that Goal2Story outperforms the baseline performance of the Super-Agent adopting powerful LLMs, while also showcasing the performance improvements in key metrics brought by CoT and Agent Profile to Goal2Story, as well as its exploration in identifying latent needs.

link

2025-03-17

PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modelling and machine learning

Ease of access to data, tools and models expedites scientific research.In structural biology there are now numerous open repositories of experimental and simulated datasets. 0.705Being able to easily access and utilise these is crucial for allowing researchers to make optimal use of their research effort.The tools presented here are useful for collating existing public cryoEM datasets and/or creating new synthetic cryoEM datasets to aid the development of novel data processing and interpretation algorithms. 0.8In recent years, structural biology has seen the development of a multitude of machine-learning based algorithms for aiding numerous steps in the processing and reconstruction of experimental datasets and the use of these approaches has become widespread.Developing such techniques in structural biology requires access to large datasets which can be cumbersome to curate and unwieldy to make use of.In this paper we present a suite of Python software packages which we collectively refer to as PERC (profet, EMPIARreader and CAKED).These are designed to reduce the burden which data curation places upon structural biology research.The protein structure fetcher (profet) package allows users to conveniently download and cleave sequences or structures from the Protein Data Bank or Alphafold databases.EMPIARreader allows lazy loading of Electron Microscopy Public Image Archive datasets in a machine-learning compatible structure.The Class Aggregator for Key Electron-microscopy Data (CAKED) package is designed to seamlessly facilitate the training of machine learning models on electron microscopy data, including electron-cryo-microscopy-specific data augmentation and labelling.These packages may be utilised independently or as building blocks in workflows.All are available in open source repositories and designed to be easily extensible to facilitate more advanced workflows if required.

link

2025-03-17

One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs.Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures.To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models.Our method is based on training the student network to produce such images that a new fake ResShift model trained on them will coincide with the teacher model.RSD achieves single-step restoration and outperforms the teacher by a large margin.We show that our distillation method can surpass the other distillation-based method for ResShift - SinSR - making it on par with state-of-the-art diffusion-based SR distillation methods.Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory.We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K. 0.795

link

2025-03-17

Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions

Often, the needs and visual abilities differ between the annotator group and the end user group.Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain.Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat lacking by BLV standards.In this study, we ask sighted individuals to assess -- rather than produce -- diagram descriptions generated by vision-language models (VLM) that have been guided with latent supervision via a multi-pass inference.The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners.We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes and demonstrate their fine-tuning potential in various downstream tasks. 0.722

link

2025-03-17

Scale Efficient Training for Large Datasets

The rapid growth of dataset scales has been a key driver in advancing deep learning research.However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement.To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time.To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss.Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum.We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. 0.783SeTa reduces training costs by up to 50\% while maintaining or improving performance, with minimal degradation even at 70\% cost reduction.Furthermore, experiments on various scale real datasets across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach.Code is available at https://github.com/mrazhou/SeTa.

link

2025-03-17

Humanoid Policy ~ Human Policy

Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms.However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection which is difficult to scale.This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning.We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives.We collect an egocentric task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. 0.728We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT).The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions.Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision.We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency.Code and data: https://human-as-robot.github.io/

link

2025-03-13

OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

In this paper, we propose spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims at localizing spatially and temporally all targets mentioned in the textual query from videos.Compared to classic STVG locating only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query from the video, making it more flexible and practical in real scenarios for comprehensive understanding.In order to facilitate exploration of OmniSTVG, we introduce BOSTVG, a large-scale benchmark dedicated to OmniSTVG.Specifically, our BOSTVG consists of 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. 0.737Each sequence in BOSTVG, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10.To ensure high quality, each video is manually annotated with meticulous inspection and refinement.To our best knowledge, BOSTVG is to date the first and the largest benchmark for OmniSTVG.To encourage future research, we introduce a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results.By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG.Our benchmark, model, and results will be released at https://github.com/JellyYao3000/OmniSTVG.

link

2025-03-13

SySLLM: Generating Synthesized Policy Summaries for Reinforcement Learning Agents Using Large Language Models

Policies generated by Reinforcement Learning (RL) algorithms can be difficult to describe to users, as they result from the interplay between complex reward structures and neural network-based representations.This combination often leads to unpredictable behaviors, making policies challenging to analyze and posing significant obstacles to fostering human trust in real-world applications.Global policy summarization methods aim to describe agent behavior through a demonstration of actions in a subset of world-states.However, users can only watch a limited number of demonstrations, restricting their understanding of policies.Moreover, those methods overly rely on user interpretation, as they do not synthesize observations into coherent patterns.In this work, we present SySLLM (Synthesized Summary using LLMs), a novel method that employs synthesis summarization, utilizing large language models' (LLMs) extensive world knowledge and ability to capture patterns, to generate textual summaries of policies. 0.719Specifically, an expert evaluation demonstrates that the proposed approach generates summaries that capture the main insights generated by experts while not resulting in significant hallucinations.Additionally, a user study shows that SySLLM summaries are preferred over demonstration-based policy summaries and match or surpass their performance in objective agent identification tasks.

link

2025-03-13

AudioX: Diffusion Transformer for Anything-to-Audio Generation

Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs.In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation.Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio.Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations.To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. 0.838Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture.The code and datasets will be available at https://zeyuet.github.io/AudioX/ 0.807

link

2025-03-13

PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements.However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets.Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps.To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics.We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information.By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation.We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA.Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations.To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels.Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively.We will release the code, datasets, and benchmark. 0.755

link

2025-03-13

VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Vision-Language Models have made significant progress on many perception-focused tasks, however, their progress on reasoning-focused tasks seem to be limited due to the lack of high-quality and diverse training data.In this work, we aim to address the scarcity issue of reasoning-focused multimodal datasets.We propose VisualWebInstruct - a novel approach that leverages search engine to create a diverse, and high-quality dataset spanning multiple disciplines like math, physics, finance, chemistry, etc.Starting with meticulously selected 30,000 seed images, we employ Google Image search to identify websites containing similar images.We collect and process the HTMLs from over 700K unique URL sources.Through a pipeline of content extraction, filtering and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest as text QA pairs. 0.892Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, (2) training from MAmmoTH-VL shows 5% absoluate gain.Our best model MAmmoTH-VL2 shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%).These remarkable results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.

link

2025-03-13

OCCUQ: Exploring Efficient Uncertainty Quantification for 3D Occupancy Prediction

Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits.Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training.Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability.We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction.Our method dynamically calibrates model confidence using epistemic uncertainty estimates.Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data.We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments.Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout.Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios.Code and dataset are available at https://github.com/ika-rwth-aachen/OCCUQ . 0.886

link

2025-03-13

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks.However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge.Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks.Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities.In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning.To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textural representations, enabling precise language-based reasoning.Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. 0.817We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities.To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond.Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

link

2025-03-13

DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers.One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made.In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning.Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses.Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios.To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving.Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios.We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios.In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks.Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1. 0.706

link

2025-03-13

Charting and Navigating Hugging Face's Model Atlas

As there are now millions of publicly available neural networks, searching and analyzing large model repositories becomes increasingly important.Navigating so many models requires an atlas, but as most models are poorly documented charting such an atlas is challenging.To explore the hidden potential of model repositories, we chart a preliminary atlas representing the documented fraction of Hugging Face.It provides stunning visualizations of the model landscape and evolution.We demonstrate several applications of this atlas including predicting model attributes (e.g., accuracy), and analyzing trends in computer vision models.However, as the current atlas remains incomplete, we propose a method for charting undocumented regions. 0.729Specifically, we identify high-confidence structural priors based on dominant real-world model training practices.Leveraging these priors, our approach enables accurate mapping of previously undocumented areas of the atlas.We publicly release our datasets, code, and interactive atlas. 0.891

link

2025-03-13

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations.We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images.This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. 0.771To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module.Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines.Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments.GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent.To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.

link

2025-03-12

Got Compute, but No Data: Lessons From Post-training a Finnish LLM

As LLMs gain more popularity as chatbots and general assistants, methods have been developed to enable LLMs to follow instructions and align with human preferences.These methods have found success in the field, but their effectiveness has not been demonstrated outside of high-resource languages.In this work, we discuss our experiences in post-training an LLM for instruction-following for English and Finnish.We use a multilingual LLM to translate instruction and preference datasets from English to Finnish.We perform instruction tuning and preference optimization in English and Finnish and evaluate the instruction-following capabilities of the model in both languages.Our results show that with a few hundred Finnish instruction samples we can obtain competitive performance in Finnish instruction-following.We also found that although preference optimization in English offers some cross-lingual benefits, we obtain our best results by using preference data from both languages.We release our model, datasets, and recipes under open licenses at https://huggingface.co/LumiOpen/Poro-34B-chat-OpenAssistant 0.787

link

2025-03-12

CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection

Identifying vulnerabilities in source code is crucial, especially in critical software components.Existing methods such as static analysis, dynamic analysis, formal verification, and recently Large Language Models are widely used to detect security flaws.This paper introduces CASTLE (CWE Automated Security Testing and Low-Level Evaluation), a benchmarking framework for evaluating the vulnerability detection capabilities of different methods.We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.We propose the CASTLE Score, a novel evaluation metric to ensure fair comparison.Our results reveal key differences: ESBMC (a formal verification tool)minimizes false positives but struggles with vulnerabilities beyond model checking, such as weak cryptography or SQL injection.Static analyzers suffer from high false positives, increasing manual validation efforts for developers.LLMs perform exceptionally well in the CASTLE dataset when identifying vulnerabilities in small code snippets.However, their accuracy declines, and hallucinations increase as the code size grows.These results suggest that LLMs could play a pivotal role in future security solutions, particularly within code completion frameworks, where they can provide real-time guidance to prevent vulnerabilities.The dataset is accessible at https://github.com/CASTLE-Benchmark. 0.862

link

2025-03-12

Cost-Optimal Grouped-Query Attention for Long-Context LLMs

Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs.Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, as well as searched for the optimal compute allocation to train LLMs.However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference.In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost.Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference.Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs.Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios.We will publicly release our code and data. 0.812

link

2025-03-12

RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling

Score Distillation Sampling (SDS) has emerged as an effective technique for leveraging 2D diffusion priors for tasks such as text-to-3D generation.While powerful, SDS struggles with achieving fine-grained alignment to user intent.To overcome this, we introduce RewardSDS, a novel approach that weights noise samples based on alignment scores from a reward model, producing a weighted SDS loss.This loss prioritizes gradients from noise samples that yield aligned high-reward output.Our approach is broadly applicable and can extend SDS-based methods.In particular, we demonstrate its applicability to Variational Score Distillation (VSD) by introducing RewardVSD.We evaluate RewardSDS and RewardVSD on text-to-image, 2D editing, and text-to-3D generation tasks, showing significant improvements over SDS and VSD on a diverse set of metrics measuring generation quality and alignment to desired reward models, enabling state-of-the-art performance.Project page is available at https://itaychachy. 0.738github.io/reward-sds/.

link

2025-03-11

NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. 0.841While previous datasets relied on published literature, we leverage grant abstracts which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect.We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions in proposals.Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci.We use this dataset to evaluate 3 three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning.We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality.As the largest scientific claim dataset to date -- with an estimated 2.8 million claims across all STEM disciplines funded by the NSF -- NSF-SciFy enables new opportunities for claim verification and meta-scientific research. 0.802We publicly release all datasets, trained models, and evaluation code to facilitate further research.

link

2025-03-11

MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input

Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models.However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in Fig.1(a).To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks.Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset.This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. 0.885(2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies.This stage simplifies the input requirements while preserving garment texture and shape fidelity.Our framework achieves state-of-the-art (SOTA) performance regarding garment transfer accuracy and visual realism.Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark and demonstrating a substantial lead over previous approaches.For more details, visit our project page: https://zhenchenwan.github.io/MF-VITON/.

link

Data Quality

2025-03-13

Learning Disease State from Noisy Ordinal Disease Progression Labels

Learning from noisy ordinal labels is a key challenge in medical imaging.In this work, we ask whether ordinal disease progression labels (better, worse, or stable) can be used to learn a representation allowing to classify disease state.For neovascular age-related macular degeneration (nAMD), we cast the problem of modeling disease progression between medical visits as a classification task with ordinal ranks.To enhance generalization, we tailor our model to the problem setting by (1) independent image encoding, (2) antisymmetric logit space equivariance, and (3) ordinal scale awareness.In addition, we address label noise by learning an uncertainty estimate for loss re-weighting. 0.657Our approach learns an interpretable disease representation enabling strong few-shot performance for the related task of nAMD activity classification from single images, despite being trained only on image pairs with ordinal disease progression labels.

link

2025-03-13

More Than Just Warnings:Exploring the Ways of Communicating Credibility Assessment on Social Media

Reducing the spread of misinformation is challenging.AI-based fact verification systems offer a promising solution by addressing the high costs and slow pace of traditional fact-checking.However, the problem of how to effectively communicate the results to users remains unsolved.Warning labels may seem an easy solution, but they fail to account for fuzzy misinformation that is not entirely fake. 0.748Additionally, users' limited attention spans and social media information should be taken into account while designing the presentation.The online experiment (n = 537) investigates the impact of sources and granularity on users' perception of information veracity and the system's usefulness and trustworthiness.Findings show that fine-grained indicators enhance nuanced opinions, information awareness, and the intention to use fact-checking systems.Source differences had minimal impact on opinions and perceptions, except for informativeness.Qualitative findings suggest the proposed indicators promote critical thinking.We discuss implications for designing concise, user-friendly AI fact-checking feedback.

link

2025-03-13

Unlock the Power of Unlabeled Data in Language Driving Model

Recent Vision-based Large Language Models~(VisionLLMs) for autonomous driving have seen rapid advancements.However, such promotion is extremely dependent on large-scale high-quality annotated data, which is costly and labor-intensive.To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner.Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data.Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. 0.647By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods.Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets.In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.

link

2025-03-12

Diff-CL: A Novel Cross Pseudo-Supervision Method for Semi-supervised Medical Image Segmentation

Semi-supervised learning utilizes insights from unlabeled data to improve model generalization, thereby reducing reliance on large labeled datasets.Most existing studies focus on limited samples and fail to capture the overall data distribution.We contend that combining distributional information with detailed information is crucial for achieving more robust and accurate segmentation results.On the one hand, with its robust generative capabilities, diffusion models (DM) learn data distribution effectively.However, it struggles with fine detail capture, leading to generated images with misleading details.Combining DM with convolutional neural networks (CNNs) enables the former to learn data distribution while the latter corrects fine details.While capturing complete high-frequency details by CNNs requires substantial computational resources and is susceptible to local noise.On the other hand, given that both labeled and unlabeled data come from the same distribution, we believe that regions in unlabeled data similar to overall class semantics to labeled data are likely to belong to the same class, while regions with minimal similarity are less likely to. 0.622This work introduces a semi-supervised medical image segmentation framework from the distribution perspective (Diff-CL).Firstly, we propose a cross-pseudo-supervision learning mechanism between diffusion and convolution segmentation networks.Secondly, we design a high-frequency mamba module to capture boundary and detail information globally.Finally, we apply contrastive learning for label propagation from labeled to unlabeled data. 0.653Our method achieves state-of-the-art (SOTA) performance across three datasets, including left atrium, brain tumor, and NIH pancreas datasets.

link

2025-03-12

Double-Stage Feature-Level Clustering-Based Mixture of Experts Framework

The Mixture-of-Experts (MoE) model has succeeded in deep learning (DL).However, its complex architecture and advantages over dense models in image classification remain unclear.In previous studies, MoE performance has often been affected by noise and outliers in the input space.Some approaches incorporate input clustering for training MoE models, but most clustering algorithms lack access to labeled data, limiting their effectiveness.This paper introduces the Double-stage Feature-level Clustering and Pseudo-labeling-based Mixture of Experts (DFCP-MoE) framework, which consists of input feature extraction, feature-level clustering, and a computationally efficient pseudo-labeling strategy.This approach reduces the impact of noise and outliers while leveraging a small subset of labeled data to label a large portion of unlabeled inputs. 0.619We propose a conditional end-to-end joint training method that improves expert specialization by training the MoE model on well-labeled, clustered inputs.Unlike traditional MoE and dense models, the DFCP-MoE framework effectively captures input space diversity, leading to competitive inference results.We validate our approach on three benchmark datasets for multi-class classification tasks.

link

2025-03-11

How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?

Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch.In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width.We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples.Our results show that unlearning excels on overparameterized models, in terms of balancing between generalization and achieving the unlearning goal; although for bias removal this requires the unlearning method to use the unlearned examples.We further elucidate our error-based analysis by measuring how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. 0.645By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.

link

2025-03-10

Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions

Machine learning models often have uneven performance among subpopulations (a.k.a., groups) in the data distributions.This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment.To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups.However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. 0.606Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data.We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then tinkers the model by iteratively retraining its last layer on the reweighted data using influence functions.Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts.In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.

link

2025-03-05

Rethinking Deep Clustering Paradigms: Self-Supervision Is All You Need

The recent advances in deep clustering have been made possible by significant progress in self-supervised and pseudo-supervised learning.However, the trade-off between self-supervision and pseudo-supervision can give rise to three primary issues.The joint training causes Feature Randomness and Feature Drift, whereas the independent training causes Feature Randomness and Feature Twist.In essence, using pseudo-labels generates random and unreliable features. 0.646The combination of pseudo-supervision and self-supervision drifts the reliable clustering-oriented features.Moreover, moving from self-supervision to pseudo-supervision can twist the curved latent manifolds.This paper addresses the limitations of existing deep clustering paradigms concerning Feature Randomness, Feature Drift, and Feature Twist.We propose a new paradigm with a new strategy that replaces pseudo-supervision with a second round of self-supervision training.The new strategy makes the transition between instance-level self-supervision and neighborhood-level self-supervision smoother and less abrupt.Moreover, it prevents the drifting effect that is caused by the strong competition between instance-level self-supervision and clustering-level pseudo-supervision.Moreover, the absence of the pseudo-supervision prevents the risk of generating random features.With this novel approach, our paper introduces a Rethinking of the Deep Clustering Paradigms, denoted by R-DC.Our model is specifically designed to address three primary challenges encountered in Deep Clustering: Feature Randomness, Feature Drift, and Feature Twist.Experimental results conducted on six datasets have shown that the two-level self-supervision training yields substantial improvements.

link

Benchmarks

2025-03-24

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache

The growing adoption of long-context Large Language Models (LLMs) has introduced significant memory and computational challenges in autoregressive decoding due to the expanding Key-Value (KV) cache.KV cache quantization has emerged as a promising solution, with prior work showing that 4-bit or even 2-bit quantization can maintain model accuracy while reducing memory costs.However, despite these benefits, preliminary implementations for the low-bit KV cache struggle to deliver the expected speedup due to quantization and dequantization overheads and the lack of Tensor Cores utilization.In this work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with low-bit KV cache.Efficiently leveraging Tensor Cores for low-bit KV cache is challenging due to the dynamic nature of KV cache generation at each decoding step.BitDecoding addresses these challenges with a Tensor Cores-Centric BitFusion Scheme that ensures data layout compatibility to enable high utilization of Tensor Cores.Additionally, BitDecoding incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline, minimizing dequantization overhead and improving computational efficiency.Experiments show that BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, compared to FP16 FlashDecoding-v2.It also outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x. 0.618On LLaMA-3.1-8B with a 128K sequence length, BitDecoding reduces single-batch decoding latency by 3x, demonstrating its effectiveness in long-context generation scenarios.The code is available at https://github.com/DD-DuDa/BitDecoding.

link

2025-03-24

LGI-DETR: Local-Global Interaction for UAV Object Detection

UAV has been widely used in various fields.However, most of the existing object detectors used in drones are not end-to-end and require the design of various complex components and careful fine-tuning.Most of the existing end-to-end object detectors are designed for natural scenes.It is not ideal to apply them directly to UAV images.In order to solve the above challenges, we design an local-global information interaction DETR for UAVs, namely LGI-DETR.Cross-layer bidirectional low-level and high-level feature information enhancement, this fusion method is effective especially in the field of small objection detection.At the initial stage of encoder, we propose a local spatial enhancement module (LSE), which enhances the low-level rich local spatial information into the high-level feature, and reduces the loss of local information in the transmission process of high-level information.At the final stage of the encoder, we propose a novel global information injection module (GII) designed to integrate rich high-level global semantic representations with low-level feature maps.This hierarchical fusion mechanism effectively addresses the inherent limitations of local receptive fields by propagating contextual information across the feature hierarchy.Experimental results on two challenging UAV image object detection benchmarks, VisDrone2019 and UAVDT, show that our proposed model outperforms the SOTA model.Compared to the baseline model, AP and AP50 improved by 1.9% and 2.4%, respectively. 0.686

link

2025-03-24

CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research

Data are crucial in various computer-related fields, including music information retrieval (MIR), an interdisciplinary area bridging computer science and music.This paper introduces CCMusic, an open and diverse database comprising multiple datasets specifically designed for tasks related to Chinese music, highlighting our focus on this culturally rich domain.The database integrates both published and unpublished datasets, with steps taken such as data cleaning, label refinement, and data structure unification to ensure data consistency and create ready-to-use versions.We conduct benchmark evaluations for all datasets using a unified evaluation framework developed specifically for this purpose. 0.739This publicly available framework supports both classification and detection tasks, ensuring standardized and reproducible results across all datasets.The database is hosted on HuggingFace and ModelScope, two open and multifunctional data and model hosting platforms, ensuring ease of accessibility and usability.

link

2025-03-24

CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos

Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection.Due to the rarity and diversity of anomalies, existing methods only use easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner.Previous studies have shown that existing unsupervised VAD models are incapable of label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks.Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur.In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variable in unsupervised video normality learning.Specifically, building on the structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively.Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. 0.618Moreover, ablation studies and extension validation show that the CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

link

2025-03-24

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models.Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged.However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets.In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance.Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. 0.622We investigate the limitation of na\"ive fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge.Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings.To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data.This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space.We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere.Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.

link

2025-03-24

EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration.Our benchmarks consist of decision-making tasks derived from key problems in economics. 0.616To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels.Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents.Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior.Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing -- applications that should grow in importance as such agents are further integrated into the economy.

link

2025-03-24

DAGait: Generalized Skeleton-Guided Data Alignment for Gait Recognition

Gait recognition is emerging as a promising and innovative area within the field of computer vision, widely applied to remote person identification.Although existing gait recognition methods have achieved substantial success in controlled laboratory datasets, their performance often declines significantly when transitioning to wild datasets.We argue that the performance gap can be primarily attributed to the spatio-temporal distribution inconsistencies present in wild datasets, where subjects appear at varying angles, positions, and distances across the frames.To achieve accurate gait recognition in the wild, we propose a skeleton-guided silhouette alignment strategy, which uses prior knowledge of the skeletons to perform affine transformations on the corresponding silhouettes.To the best of our knowledge, this is the first study to explore the impact of data alignment on gait recognition.We conducted extensive experiments across multiple datasets and network architectures, and the results demonstrate the significant advantages of our proposed alignment strategy.Specifically, on the challenging Gait3D dataset, our method achieved an average performance improvement of 7.9% across all evaluated networks.Furthermore, our method achieves substantial improvements on cross-domain datasets, with accuracy improvements of up to 24.0%. 0.627

link

2025-03-24

Efficient and Accurate Scene Text Recognition with Cascaded-Transformers

In recent years, vision transformers with text decoder have demonstrated remarkable performance on Scene Text Recognition (STR) due to their ability to capture long-range dependencies and contextual relationships with high learning capacity.However, the computational and memory demands of these models are significant, limiting their deployment in resource-constrained applications.To address this challenge, we propose an efficient and accurate STR system.Specifically, we focus on improving the efficiency of encoder models by introducing a cascaded-transformers structure.This structure progressively reduces the vision token size during the encoding step, effectively eliminating redundant tokens and reducing computational cost.Our experimental results confirm that our STR system achieves comparable performance to state-of-the-art baselines while substantially decreasing computational requirements. 0.605In particular, for large-models, the accuracy remains same, 92.77 to 92.68, while computational complexity is almost halved with our structure.

link

2025-03-24

Multi-Physics Inverse Design of Varifocal Optical Devices using Data-Driven Surrogates and Differential Modeling

Designing a new varifocal architecture in AR glasses poses significant challenges due to the complex interplay of multiple physics disciplines, including innovated piezo-electric material, solid mechanics, electrostatics, and optics.Traditional design methods, which treat each physics separately, are insufficient for this problem as they fail to establish the intricate relationships among design parameters in such a large and sensitive space, leading to suboptimal solutions.To address this challenge, we propose a novel design pipeline, mPhDBBs (multi-Physics Differential Building Blocks), that integrates these diverse physics through a graph neural network-based surrogate model and a differentiable ray tracing model.A hybrid optimization method combining evolutionary and gradient approaches is employed to efficiently determine superior design variables that achieve desired optical objectives, such as focal length and focusing quality.Our results demonstrate the effectiveness of mPhDBBs, achieving high accuracy with minimal training data and computational resources, resulting in a speedup of at least 1000 times compared to non-gradient-based methods. 0.611This work offers a promising paradigm shift in product design, enabling rapid and accurate optimization of complex multi-physics systems, and demonstrates its adaptability to other inverse design problems.

link

2025-03-24

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications.In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process.To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline.CoMP uses a Continual Rotary Position Embedding to support native resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations.By three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation.Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5 mIoU on ADE20K under frozen chunk evaluation. 0.62

link

2025-03-20

OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection

The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs.This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts.OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories.We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. 0.622Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field.By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare.The repository is available at https://github.com/remic-othr/OpenMIBOOD.

link

2025-03-20

Rethinking Robustness in Machine Learning: A Posterior Agreement Approach

The robustness of algorithms against covariate shifts is a fundamental problem with critical implications for the deployment of machine learning algorithms in the real world.Current evaluation methods predominantly match the robustness definition to that of standard generalization, relying on standard metrics like accuracy-based scores, which, while designed for performance assessment, lack a theoretical foundation encompassing their application in estimating robustness to distribution shifts. 0.695In this work, we set the desiderata for a robustness metric, and we propose a novel principled framework for the robustness assessment problem that directly follows the Posterior Agreement (PA) theory of model validation.Specifically, we extend the PA framework to the covariate shift setting by proposing a PA metric for robustness evaluation in supervised classification tasks.We assess the soundness of our metric in controlled environments and through an empirical robustness analysis in two different covariate shift scenarios: adversarial learning and domain generalization.We illustrate the suitability of PA by evaluating several models under different nature and magnitudes of shift, and proportion of affected observations.The results show that the PA metric provides a sensible and consistent analysis of the vulnerabilities in learning algorithms, even in the presence of few perturbed observations.

link

2025-03-20

Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction.However, despite the critical role of 3D structural generation and understanding ({3D GU}) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored.To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates {3D GU} tasks via autoregressive prediction.At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures.It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures.We further propose two optimizations to enhance efficiency and effectiveness. 0.672The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x.The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance.By combining these strategies, Uni-3DAR successfully unifies diverse {3D GU} tasks within a single autoregressive framework.Extensive experiments across multiple microscopic {3D GU} tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility.Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256\% relative improvement while delivering inference speeds up to 21.8x faster.The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.

link

2025-03-20

PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification

Whole Slide Images (WSIs) are high-resolution digital scans widely used in medical diagnostics.WSI classification is typically approached using Multiple Instance Learning (MIL), where the slide is partitioned into tiles treated as interconnected instances.While attention-based MIL methods aim to identify the most informative tiles, they often fail to fully exploit the spatial relationships among them, potentially overlooking intricate tissue structures crucial for accurate diagnosis.To address this limitation, we propose Probabilistic Spatial Attention MIL (PSA-MIL), a novel attention-based MIL framework that integrates spatial context into the attention mechanism through learnable distance-decayed priors, formulated within a probabilistic interpretation of self-attention as a posterior distribution.This formulation enables a dynamic inference of spatial relationships during training, eliminating the need for predefined assumptions often imposed by previous approaches.Additionally, we suggest a spatial pruning strategy for the posterior, effectively reducing self-attention's quadratic complexity.To further enhance spatial modeling, we introduce a diversity loss that encourages variation among attention heads, ensuring each captures distinct spatial representations.Together, PSA-MIL enables a more data-driven and adaptive integration of spatial context, moving beyond predefined constraints.We achieve state-of-the-art performance across both contextual and non-contextual baselines, while significantly reducing computational costs. 0.609

link

2025-03-20

A parallel algorithm for the odd two-face shortest k-disjoint path problem

The shortest Disjoint Path problem (SDPP) requires us to find pairwise vertex disjoint paths between k designated pairs of terminal vertices such that the sum of the path lengths is minimum.The focus here is on SDPP restricted to planar graphs where all terminals are arbitrarily partitioned over two distinct faces with the additional restriction that each face is required to contain an odd number of terminals.We call this problem the Odd two-face planar SDPP.It is shown that this problem is solvable in randomized polynomial time and even in RNC.This is the first parallel (or even polynomial time) solution for the problem. Our algorithm combines ideas from the randomized solution for 2-SDPP by Bj\"orklund and Huslfeldt with its parallelization by Datta and Jaiswal along with the deterministic algorithm for One-face planar SDPP by Datta, Iyer, Kulkarni and Mukherjee. The proof uses a combination of two involutions to reduce a system of linear equations modulo a power of 2 to a system of triangular form that is, therefore, invertible.This, in turn, is proved by showing that the matrix of the equations, can be interpreted as (the adjacency matrix of) a directed acyclic graph (DAG).While our algorithm is primarily algebraic the proof remains combinatorial. We also give a parallel algorithm for the (A + B)-SDPP introduced by Hirai and Namba. 0.627

link

2025-03-20

HiQ-Lip: The First Quantum-Classical Hierarchical Method for Global Lipschitz Constant Estimation of ReLU Networks

Estimating the global Lipschitz constant of neural networks is crucial for understanding and improving their robustness and generalization capabilities.However, precise calculations are NP-hard, and current semidefinite programming (SDP) methods face challenges such as high memory usage and slow processing speeds. 0.601In this paper, we propose \textbf{HiQ-Lip}, a hybrid quantum-classical hierarchical method that leverages Coherent Ising Machines (CIMs) to estimate the global Lipschitz constant.We tackle the estimation by converting it into a Quadratic Unconstrained Binary Optimization (QUBO) problem and implement a multilevel graph coarsening and refinement strategy to adapt to the constraints of contemporary quantum hardware.Our experimental evaluations on fully connected neural networks demonstrate that HiQ-Lip not only provides estimates comparable to state-of-the-art methods but also significantly accelerates the computation process.In specific tests involving two-layer neural networks with 256 hidden neurons, HiQ-Lip doubles the solving speed and offers more accurate upper bounds than the existing best method, LiPopt.These findings highlight the promising utility of small-scale quantum devices in advancing the estimation of neural network robustness.

link

2025-03-20

An Evaluation Tool for Backbone Extraction Techniques in Weighted Complex Networks

Networks are essential for analyzing complex systems.However, their growing size necessitates backbone extraction techniques aimed at reducing their size while retaining critical features.In practice, selecting, implementing, and evaluating the most suitable backbone extraction method may be challenging.This paper introduces netbone, a Python package designed for assessing the performance of backbone extraction techniques in weighted networks.Its comparison framework is the standout feature of netbone.Indeed, the tool incorporates state-of-the-art backbone extraction techniques.Furthermore, it provides a comprehensive suite of evaluation metrics allowing users to evaluate different backbones techniques. 0.623We illustrate the flexibility and effectiveness of netbone through the US air transportation network analysis.We compare the performance of different backbone extraction techniques using the evaluation metrics.We also show how users can integrate a new backbone extraction method into the comparison framework.netbone is publicly available as an open-source tool, ensuring its accessibility to researchers and practitioners.Promoting standardized evaluation practices contributes to the advancement of backbone extraction techniques and fosters reproducibility and comparability in research efforts.We anticipate that netbone will serve as a valuable resource for researchers and practitioners enabling them to make informed decisions when selecting backbone extraction techniques to gain insights into the structural and functional properties of complex systems.

link

2025-03-20

Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences

Deep learning architectures such as convolutional neural networks and Transformers have revolutionized biological sequence modeling, with recent advances driven by scaling up foundation and task-specific models.The computational resources and large datasets required, however, limit their applicability in biological contexts.We introduce Lyra, a subquadratic architecture for sequence modeling, grounded in the biological framework of epistasis for understanding sequence-to-function relationships.Mathematically, we demonstrate that state space models efficiently capture global epistatic interactions and combine them with projected gated convolutions for modeling local relationships.We demonstrate that Lyra is performant across over 100 wide-ranging biological tasks, achieving state-of-the-art (SOTA) performance in many key areas, including protein fitness landscape prediction, biophysical property prediction (e.g. disordered protein region functions) peptide engineering applications (e.g. antibody binding, cell-penetrating peptide prediction), RNA structure analysis, RNA function prediction, and CRISPR guide design.It achieves this with orders-of-magnitude improvements in inference speed and reduction in parameters (up to 120,000-fold in our tests) compared to recent biology foundation models. 0.602Using Lyra, we were able to train and run every task in this study on two or fewer GPUs in under two hours, democratizing access to biological sequence modeling at SOTA performance, with potential applications to many fields.

link

2025-03-20

Deep Feynman-Kac Methods for High-dimensional Semilinear Parabolic Equations: Revisit

Deep Feynman-Kac method was first introduced to solve parabolic partial differential equations(PDE) by Beck et al.(SISC, V.43, 2021), named Deep Splitting method since they trained the Neural Networks step by step in the time direction.In this paper, we propose a new training approach with two different features.Firstly, neural networks are trained at all time steps globally, instead of step by step.Secondly, the training data are generated in a new way, in which the method is consistent with a direct Monte Carlo scheme when dealing with a linear parabolic PDE.Numerical examples show that our method has significant improvement both in efficiency and accuracy. 0.771

link

2025-03-20

Survey on Evaluation of LLM-based Agents

The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments.This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents.We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents.Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. 0.686We also identify critical gaps that future research must address-particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, and scalable evaluation methods.This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.

link

2025-03-19

SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models

In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive and domain-adaptive method for intent clustering without fine-tuning.Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets.Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning.Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem.A good solution to this problem is associated with better clustering performance.Accordingly, we propose a two-stage approach:First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder.Then, we apply a distance metric to select a pool of candidates close to the seed.Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed.Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed.We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. 0.671Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets.Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user's goals.

link

2025-03-19

Online Imitation Learning for Manipulation via Decaying Relative Correction through Teleoperation

Teleoperated robotic manipulators enable the collection of demonstration data, which can be used to train control policies through imitation learning.However, such methods can require significant amounts of training data to develop robust policies or adapt them to new and unseen tasks.While expert feedback can significantly enhance policy performance, providing continuous feedback can be cognitively demanding and time-consuming for experts.To address this challenge, we propose to use a cable-driven teleoperation system which can provide spatial corrections with 6 degree of freedom to the trajectories generated by a policy model.Specifically, we propose a correction method termed Decaying Relative Correction (DRC) which is based upon the spatial offset vector provided by the expert and exists temporarily, and which reduces the intervention steps required by an expert. 0.607Our results demonstrate that DRC reduces the required expert intervention rate by 30\% compared to a standard absolute corrective method.Furthermore, we show that integrating DRC within an online imitation learning framework rapidly increases the success rate of manipulation tasks such as raspberry harvesting and cloth wiping.

link

2025-03-19

Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data

Background: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews.Prior research using text-only models have struggled to address this problem in a reliable and scalable way due to (1) limited reasoning capabilities, (2) information loss from converting visual records to text, and (3) lack of a generic EHR integration to extract patient data. Methods: We introduce a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching using unprocessed documents extracted from EHRs.Our approach leverages (1) the new reasoning-LLM paradigm, enabling the assessment of even the most complex criteria, (2) visual capabilities of latest LLMs to interpret medical records without lossy image-to-text conversions, and (3) multimodal embeddings for efficient medical record search.The pipeline was validated on the n2c2 2018 cohort selection dataset (288 diabetic patients) and a real-world dataset composed of 485 patients from 30 different sites matched against 36 diverse trials. Results: On the n2c2 dataset, our method achieved a new state-of-the-art criterion-level accuracy of 93\%. 0.67In real-world trials, the pipeline yielded an accuracy of 87\%, undermined by the difficulty to replicate human decision-making when medical records lack sufficient information.Nevertheless, users were able to review overall eligibility in under 9 minutes per patient on average, representing an 80\% improvement over traditional manual chart reviews. Conclusion: This pipeline demonstrates robust performance in clinical trial patient matching without requiring custom integration with site systems or trial-specific tailoring, thereby enabling scalable deployment across sites seeking to leverage AI for patient matching.

link

2025-03-19

Online Matching under KIID: Enhanced Competitive Analysis through Ordinary Differential Equation Systems

We consider the (offline) vertex-weighted Online Matching problem under Known Identical and Independent Distributions (KIID) with integral arrival rates.We propose a meta-algorithm, denoted as $\mathsf{RTB}$, featuring Real-Time Boosting, where the core idea is as follows.Consider a bipartite graph $G=(I,J,E)$, where $I$ and $J$ represent the sets of offline and online nodes, respectively.Let $\mathbf{x}=(x_{ij})\in[0,1]^{|E|}$, where $x_{ij}$ for $(i,j) \in E$ represents the probability that edge $(i,j)$ is matched in an offline optimal policy (a.k.a. a clairvoyant optimal policy), typically obtained by solving a benchmark linear program (LP).Upon the arrival of an online node $j$ at some time $t \in[0,1]$, $\mathsf{RTB}$ samples a safe (available) neighbor $i \in I_{j,t}$ with probability $x_{ij}/\sum_{i' \in I_{j,t}} x_{i'j}$ and matches it to $j$, where $I_{j,t}$ denotes the set of safe offline neighbors of $j$. In this paper, we showcase the power of Real-Time Boosting by demonstrating that $\mathsf{RTB}$, when fed with $\mathbf{X}^*$, achieves a competitive ratio of $(2e^4 - 8e^2 + 21e - 27) /(2e^4) \approx 0.7341$, where $\mathbf{X}^* \in \{0,1/3,2/3\}^{|E|}$ is a random vector obtained by applying a customized dependent rounding technique due to Brubach et al.(Algorithmica, 2020).Our result improves upon the state-of-the-art ratios of 0.7299 by Brubach et al. 0.632(Algorithmica, 2020) and 0.725 by Jaillet and Lu (Mathematics of Operations Research, 2013).Notably, this improvement does not stem from the algorithm itself but from a new competitive analysis methodology: We introduce an Ordinary Differential Equation (ODE) system-based approach that enables a {holistic} analysis of $\mathsf{RTB}$. We anticipate that utilizing other well-structured vectors from more advanced rounding techniques could potentially yield further improvements in the competitiveness.

link

2025-03-19

LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence.Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals.However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability.We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning.LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features.This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency.Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings.With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. 0.624Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs.Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.

link

2025-03-19

Reducing Communication Overhead in Federated Learning for Network Anomaly Detection with Adaptive Client Selection

Communication overhead in federated learning (FL) poses a significant challenge for network anomaly detection systems, where diverse client configurations and network conditions impact efficiency and detection accuracy.Existing approaches attempt optimization individually but struggle to balance reduced overhead with performance. 0.616This paper presents an adaptive FL framework combining batch size optimization, client selection, and asynchronous updates for efficient anomaly detection.Using UNSW-NB15 for general network traffic and ROAD for automotive networks, our framework reduces communication overhead by 97.6% (700.0s to 16.8s) while maintaining comparable accuracy (95.10% vs. 95.12%).The Mann-Whitney U test confirms significant improvements (p < 0.05). 0.61Profiling analysis reveals efficiency gains via reduced GPU operations and memory transfers, ensuring robust detection across varying client conditions.

link

2025-03-19

Temporal Encoding Strategies for Energy Time Series Prediction

In contemporary power systems, energy consumption prediction plays a crucial role in maintaining grid stability and resource allocation enabling power companies to minimize energy waste and avoid overloading the grid.While there are several research works on energy optimization, they often fail to address the complexities of real-time fluctuations and the cyclic pattern of energy consumption.This work proposes a novel approach to enhance the accuracy of predictive models by employing sinusoidal encoding on periodic features of time-series data.To demonstrate the increase in performance, several statistical and ensemble machine learning models were trained on an energy demand dataset, using the proposed sinusoidal encoding.The performance of these models was then benchmarked against identical models trained on traditional encoding methods. 0.631The results demonstrated a 12.6% improvement of Root Mean Squared Error (from 0.5497 to 0.4802) and a 7.8% increase in the R^2 score (from 0.7530 to 0.8118), indicating that the proposed encoding better captures the cyclic nature of temporal patterns than traditional methods.The proposed methodology significantly improves prediction accuracy while maintaining computational efficiency, making it suitable for real-time applications in smart grid systems.

link

2025-03-19

From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment

Large language models (LLMs) have traditionally been aligned through one-size-fits-all approaches that assume uniform human preferences, fundamentally overlooking the diversity in user values and needs.This paper introduces a comprehensive framework for scalable personalized alignment of LLMs.We establish a systematic preference space characterizing psychological and behavioral dimensions, alongside diverse persona representations for robust preference inference in real-world scenarios.Building upon this foundation, we introduce \textsc{AlignX}, a large-scale dataset of over 1.3 million personalized preference examples, and develop two complementary alignment approaches: \textit{in-context alignment} directly conditioning on persona representations and \textit{preference-bridged alignment} modeling intermediate preference distributions.Extensive experiments demonstrate substantial improvements over existing methods, with an average 17.06\% accuracy gain across four benchmarks while exhibiting a strong adaptation capability to novel preferences, robustness to limited user data, and precise preference controllability. 0.624These results validate our framework's effectiveness, advancing toward truly user-adaptive AI systems.

link

2025-03-18

COPA: Comparing the Incomparable to Explore the Pareto Front

In machine learning (ML), it is common to account for multiple objectives when, e.g., selecting a model to deploy.However, it is often unclear how one should compare, aggregate and, ultimately, trade-off these objectives, as they might be measured in different units or scales. 0.618For example, when deploying large language models (LLMs), we might not only care about their performance, but also their CO2 consumption.In this work, we investigate how objectives can be sensibly compared and aggregated to navigate their Pareto front.To do so, we propose to make incomparable objectives comparable via their CDFs, approximated by their relative rankings. 0.602This allows us to aggregate them while matching user-specific preferences, allowing practitioners to meaningfully navigate and search for models in the Pareto front.We demonstrate the potential impact of our methodology in diverse areas such as LLM selection, domain generalization, and AutoML benchmarking, where classical ways to aggregate and normalize objectives fail.

link

2025-03-18

Wasserstein-based Kernels for Clustering: Application to Power Distribution Graphs

Many data clustering applications must handle objects that cannot be represented as vector data.In this context, the bag-of-vectors representation can be leveraged to describe complex objects through discrete distributions, and the Wasserstein distance can effectively measure the dissimilarity between them.Additionally, kernel methods can be used to embed data into feature spaces that are easier to analyze.Despite significant progress in data clustering, a method that simultaneously accounts for distributional and vectorial dissimilarity measures is still lacking.To tackle this gap, this work explores kernel methods and Wasserstein distance metrics to develop a computationally tractable clustering framework.The compositional properties of kernels allow the simultaneous handling of different metrics, enabling the integration of both vectors and discrete distributions for object representation.This approach is flexible enough to be applied in various domains, such as graph analysis and image processing.The framework consists of three main components.First, we efficiently approximate pairwise Wasserstein distances using multiple reference distributions.Second, we employ kernel functions based on Wasserstein distances and present ways of composing kernels to express different types of information.Finally, we use the kernels to cluster data and evaluate the quality of the results using scalable and distance-agnostic validity indices. 0.644A case study involving two datasets of 879 and 34,920 power distribution graphs demonstrates the framework's effectiveness and efficiency.

link

2025-03-18

EnvBench: A Benchmark for Automated Environment Setup

Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in software engineering domain.In this work, we consider a cornerstone task for automating work with software repositories-environment setup, i.e., a task of configuring a repository-specific development environment on a system.Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice.To address this gap, we introduce a comprehensive environment setup benchmark EnvBench.It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts.To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. 0.604We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, that we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini.The best approach manages to successfully configure 6.69% repositories for Python and 29.47% repositories for JVM, suggesting that EnvBench remains challenging for current approaches.Our benchmark suite is publicly available at https://github.com/JetBrains-Research/EnvBench. 0.695The dataset and experiment trajectories are available at https://jb.gg/envbench.

link

2025-03-18

Attribution Score Alignment in Explainable Data Management

Different attribution-scores have been proposed to quantify the relevance of database tuples for a query answer from a database.Among them, we find Causal Responsibility, the Shapley Value, the Banzhaf Power-Index, and the Causal Effect.They have been analyzed in isolation, mainly in terms of computational properties.In this work, we start an investigation into the alignment of these scores on the basis of the queries at hand; that is, on whether they induce compatible rankings of tuples. 0.611We are able to identify vast classes of queries for which some pairs of scores are always aligned, and others for which they are not.It turns out that the presence of exogenous tuples makes a crucial difference in this regard.

link

2025-03-18

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts.While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored.To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks.The benchmark comprises 765 test cases spanning 51 fine-grained tasks. 0.624To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs.Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks.Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities.Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence.Full data and evaluation code is released on https://github.com/open-compass/Creation-MMBench.

link

2025-03-18

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras.Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations.Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time.As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild.Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure.Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. 0.698

link

2025-03-18

Temporal Consistency for LLM Reasoning Process Error Identification

Verification is crucial for effective mathematical reasoning.We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment.Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy.Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. 0.711When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench.Notably, the distilled 14B model with our method achieves performance comparable to Deepseek-R1.Our codes are available at https://github.com/jcguo123/Temporal-Consistency

link

2025-03-18

Utilization of Neighbor Information for Image Classification with Different Levels of Supervision

We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering.Despite the overlap in motivation between these tasks, the methods themselves are restricted to a single task -- GCD methods are reliant on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage the labels efficiently.We connect the two regimes with an innovative approach that Utilizes Neighbor Information for Classification (UNIC) both in the unsupervised (clustering) and semisupervised (GCD) setting.State-of-the-art clustering methods already rely heavily on nearest neighbors.We improve on their results substantially in two parts, first with a sampling and cleaning strategy where we identify accurate positive and negative neighbors, and secondly by finetuning the backbone with clustering losses computed by sampling both types of neighbors. 0.607We then adapt this pipeline to GCD by utilizing the labelled images as ground truth neighbors.Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, Imagenet200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).

link

2025-03-18

Aligning Multimodal LLM with Human Preference: A Survey

Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training.Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data.However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed.This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals.Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. 0.604In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs.Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms.This work seeks to help researchers organize current advancements in the field and inspire better alignment methods.The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.

link

2025-03-17

LLM-Match: An Open-Sourced Patient Matching Model Based on Large Language Models and Retrieval-Augmented Generation

Patient matching is the process of linking patients to appropriate clinical trials by accurately identifying and matching their medical records with trial eligibility criteria.We propose LLM-Match, a novel framework for patient matching leveraging fine-tuned open-source large language models.Our approach consists of four key components.First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a vast pool of electronic health records (EHRs).Second, a prompt generation module constructs input prompts by integrating trial eligibility criteria (both inclusion and exclusion criteria), patient context, and system instructions.Third, a fine-tuning module with a classification head optimizes the model parameters using structured prompts and ground-truth labels.Fourth, an evaluation module assesses the fine-tuned model's performance on the testing datasets.We evaluated LLM-Match on four open datasets, n2c2, SIGIR, TREC 2021, and TREC 2022, using open-source models, comparing it against TrialGPT, Zero-Shot, and GPT-4-based closed models.LLM-Match outperformed all baselines. 0.625

link

2025-03-17

Hierarchical Multicriteria Shortest Path Search

This paper presents a novel multicriteria shortest path search algorithm called Hierarchical MLS.The distinguishing feature of the algorithm is the multilayered structure of compressed k-Path-Cover graphs it operates on.In addition to providing significant improvements in terms of time and memory consumption, the algorithm is notable for several other features.Due to the preprocessing phase requiring only several seconds, the algorithm can be successfully applied to scenarios with dynamic prices.Moreover, the algorithm does not employ bidirectional search, and can thus work on time-dependent metrics.We test the algorithm on multiple graphs and analyze its performance in terms of time and memory efficiency. 0.63The results prove Hierarchical MLS to be faster than its direct alternatives by at least 2 times in terms of query runtime and at least 20 times in terms of preprocessing.

link

2025-03-17

Reliable and Efficient Amortized Model-based Evaluation

Comprehensive evaluations of language models (LM) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnostic) as well as safety risks (e.g., racial bias, toxicity, or misinformation).The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice.Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical.A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. 0.746This approach, unfortunately, often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset. 0.686Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by careful controlling for question difficulty.Unfortunately, question difficulty is expensive to estimate.Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost.In addition, we leverage this difficulty predictor to further improve the evaluation efficiency through training a question generator given a difficulty level.This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimation of LLM performance.Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient compared to current common practice.

link

2025-03-17

Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research.However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b).Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate.To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns.In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms.Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation.Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data. 0.622

link

2025-03-17

A $(1+ε)$-Approximation for Ultrametric Embedding in Subquadratic Time

Efficiently computing accurate representations of high-dimensional data is essential for data analysis and unsupervised learning.Dendrograms, also known as ultrametrics, are widely used representations that preserve hierarchical relationships within the data.However, popular methods for computing them, such as linkage algorithms, suffer from quadratic time and space complexity, making them impractical for large datasets. The "best ultrametric embedding" (a.k.a. "best ultrametric fit") problem, which aims to find the ultrametric that best preserves the distances between points in the original data, is known to require at least quadratic time for an exact solution. Recent work has focused on improving scalability by approximating optimal solutions in subquadratic time, resulting in a $(\sqrt{2} + \epsilon)$-approximation (Cohen-Addad, de Joannis de Verclos and Lagarde, 2021). In this paper, we present the first subquadratic algorithm that achieves arbitrarily precise approximations of the optimal ultrametric embedding.Specifically, we provide an algorithm that, for any $c \geq 1$, outputs a $c$-approximation of the best ultrametric in time $\tilde{O}(n^{1 + 1/c})$. In particular, for any fixed $\epsilon > 0$, the algorithm computes a $(1+\epsilon)$-approximation in time $\tilde{O}(n^{2 - \epsilon + o(\epsilon^2)})$. Experimental results show that our algorithm improves upon previous methods in terms of approximation quality while maintaining comparable running times. 0.677

link

2025-03-17

DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective

Large Language Models (LLMs) have achieved remarkable success across diverse tasks, largely driven by well-designed prompts.However, crafting and selecting such prompts often requires considerable human effort, significantly limiting its scalability.To mitigate this, recent studies have explored automated prompt optimization as a promising solution.Despite these efforts, existing methods still face critical challenges in robustness, efficiency, and generalization. 0.601To systematically address these challenges, we first conduct an empirical analysis to identify the limitations of current reflection-based prompt optimization paradigm.Building on these insights, we propose 7 innovative approaches inspired by traditional deep learning paradigms for prompt optimization (DLPO), seamlessly integrating these concepts into text-based gradient optimization.Through these advancements, we progressively tackle the aforementioned challenges and validate our methods through extensive experimentation.We hope our study not only provides valuable guidance for future research but also offers a comprehensive understanding of the challenges and potential solutions in prompt optimization.Our code is available at https://github.com/sfasfaffa/DLPO.

link

2025-03-17

FLEX: A Framework for Learning Robot-Agnostic Force-based Skills Involving Sustained Contact Object Manipulation

Learning to manipulate objects efficiently, particularly those involving sustained contact (e.g., pushing, sliding) and articulated parts (e.g., drawers, doors), presents significant challenges.Traditional methods, such as robot-centric reinforcement learning (RL), imitation learning, and hybrid techniques, require massive training and often struggle to generalize across different objects and robot platforms.We propose a novel framework for learning object-centric manipulation policies in force space, decoupling the robot from the object.By directly applying forces to selected regions of the object, our method simplifies the action space, reduces unnecessary exploration, and decreases simulation overhead.This approach, trained in simulation on a small set of representative objects, captures object dynamics -- such as joint configurations -- allowing policies to generalize effectively to new, unseen objects.Decoupling these policies from robot-specific dynamics enables direct transfer to different robotic platforms (e.g., Kinova, Panda, UR5) without retraining.Our evaluations demonstrate that the method significantly outperforms baselines, achieving over an order of magnitude improvement in training efficiency compared to other state-of-the-art methods. 0.682Additionally, operating in force space enhances policy transferability across diverse robot platforms and object types.We further showcase the applicability of our method in a real-world robotic setting.For supplementary materials and videos, please visit: https://tufts-ai-robotics-group.github.io/FLEX/

link

2025-03-17

AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction

Autonomous driving requires an understanding of the infrastructure elements, such as lanes and crosswalks.To navigate safely, this understanding must be derived from sensor data in real-time and needs to be represented in vectorized form.Learned Bird's-Eye View (BEV) encoders are commonly used to combine a set of camera images from multiple views into one joint latent BEV grid.Traditionally, from this latent space, an intermediate raster map is predicted, providing dense spatial supervision but requiring post-processing into the desired vectorized form.More recent models directly derive infrastructure elements as polylines using vectorized map decoders, providing instance-level information.Our approach, Augmentation Map Network (AugMapNet), proposes latent BEV grid augmentation, a novel technique that significantly enhances the latent BEV representation.AugMapNet combines vector decoding and dense spatial supervision more effectively than existing architectures while remaining as straightforward to integrate and as generic as auxiliary supervision.Experiments on nuScenes and Argoverse2 datasets demonstrate significant improvements in vectorized map prediction performance up to 13.3% over the StreamMapNet baseline on 60m range and greater improvements on larger ranges.We confirm transferability by applying our method to another baseline and find similar improvements. 0.614A detailed analysis of the latent BEV grid confirms a more structured latent space of AugMapNet and shows the value of our novel concept beyond pure performance improvement.The code will be released soon.

link

2025-03-17

Less Biased Noise Scale Estimation for Threshold-Robust RANSAC

The gold-standard for robustly estimating relative pose through image matching is RANSAC.While RANSAC is powerful, it requires setting the inlier threshold that determines whether the error of a correspondence under an estimated model is sufficiently small to be included in its consensus set.Setting this threshold is typically done by hand, and is difficult to tune without a access to ground truth data.Thus, a method capable of automatically determining the optimal threshold would be desirable.In this paper we revisit inlier noise scale estimation, which is an attractive approach as the inlier noise scale is linear to the optimal threshold.We revisit the noise scale estimation method SIMFIT and find bias in the estimate of the noise scale.In particular, we fix underestimates from using the same data for fitting the model as estimating the inlier noise, and from not taking the threshold itself into account.Secondly, since the optimal threshold within a scene is approximately constant we propose a multi-pair extension of SIMFIT++, by filtering of estimates, which improves results.Our approach yields robust performance across a range of thresholds, shown in Figure 1. 0.639

link

LLMs

2025-03-24

Synthetic Function Demonstrations Improve Generation in Low-Resource Programming Languages

A key consideration when training an LLM is whether the target language is more or less resourced, whether this is English compared to Welsh, or Python compared to Excel. 0.741Typical training data for programming languages consist of real program demonstrations coupled with human-written comments.Here we present novel approaches to the creation of such data for low resource programming languages.We generate fully-synthetic, textbook-quality demonstrations of common library functions in an example domain of Excel formulas, using a teacher model.We then finetune an underperforming student model, and show improvement on 2 question-answering datasets recast into the Excel domain.We show advantages of finetuning over standard, off-the-shelf RAG approaches, which can offer only modest improvement due to the unfamiliar target domain.

link

2025-03-24

AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of large language models (LLMs) for 3D Cartesian space navigation.AlphaSpace employs a semantics-based tokenization strategy, encoding height information through specialized semantic tokens, and integrates primarily symbolic synthetic reasoning data.This approach enables LLMs to accurately manipulate objects by positioning them at specific 0.751[x, y, z] coordinates.Experimental results demonstrate that AlphaSpace significantly outperforms existing models on manipulation subtasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet.

link

2025-03-24

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache

The growing adoption of long-context Large Language Models (LLMs) has introduced significant memory and computational challenges in autoregressive decoding due to the expanding Key-Value (KV) cache. 0.606KV cache quantization has emerged as a promising solution, with prior work showing that 4-bit or even 2-bit quantization can maintain model accuracy while reducing memory costs.However, despite these benefits, preliminary implementations for the low-bit KV cache struggle to deliver the expected speedup due to quantization and dequantization overheads and the lack of Tensor Cores utilization.In this work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with low-bit KV cache.Efficiently leveraging Tensor Cores for low-bit KV cache is challenging due to the dynamic nature of KV cache generation at each decoding step.BitDecoding addresses these challenges with a Tensor Cores-Centric BitFusion Scheme that ensures data layout compatibility to enable high utilization of Tensor Cores.Additionally, BitDecoding incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline, minimizing dequantization overhead and improving computational efficiency.Experiments show that BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, compared to FP16 FlashDecoding-v2.It also outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x.On LLaMA-3.1-8B with a 128K sequence length, BitDecoding reduces single-batch decoding latency by 3x, demonstrating its effectiveness in long-context generation scenarios. 0.676The code is available at https://github.com/DD-DuDa/BitDecoding.

link

2025-03-24

REALM: A Dataset of Real-World LLM Use Cases

Large Language Models, such as the GPT series, have driven significant industrial applications, leading to economic and societal transformations.However, a comprehensive understanding of their real-world applications remains limited.To address this, we introduce REALM, a dataset of over 94,000 LLM use cases collected from Reddit and news articles. 0.679REALM captures two key dimensions: the diverse applications of LLMs and the demographics of their users. 0.715It categorizes LLM applications and explores how users' occupations relate to the types of applications they use. 0.724By integrating real-world data, REALM offers insights into LLM adoption across different domains, providing a foundation for future research on their evolving societal roles. 0.712A dedicated dashboard https://realm-e7682.web.app/ presents the data.

link

2025-03-24

NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting

Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have noticeably advanced photo-realistic novel view synthesis using images from densely spaced camera viewpoints.However, these methods struggle in few-shot scenarios due to limited supervision.In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations.Exploiting the inherent epipolar geometry of 3DGS, our method introduces a novel point cloud densification strategy that initializes 3DGS with a dense point cloud, reducing randomness in point placement while preventing over-smoothing and overfitting.Specifically, NexusGS comprises three key steps: Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning.These steps leverage optical flow and camera poses to compute accurate depth maps, while mitigating the inaccuracies often associated with optical flow.By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage and supports stable 3DGS training under sparse-view conditions.Experiments demonstrate that NexusGS 0.605significantly enhances depth accuracy and rendering quality, surpassing state-of-the-art methods by a considerable margin.Furthermore, we validate the superiority of our generated point clouds by substantially boosting the performance of competing methods.Project page: https://usmizuki.github.io/NexusGS/.

link

2025-03-24

Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code

In recent years, large language models (LLMs) have shown remarkable capabilities in various artificial intelligence problems. 0.665However, they fail to plan reliably, even when prompted with a detailed definition of the planning task.Attempts to improve their planning capabilities, such as chain-of-thought prompting, fine-tuning, and explicit "reasoning" still yield incorrect plans and usually fail to generalize to larger tasks.In this paper, we show how to use LLMs to generate correct plans, even for out-of-distribution tasks of increasing size. 0.711For a given planning domain, we ask an LLM to generate several domain-dependent heuristic functions in the form of Python code, evaluate them on a set of training tasks within a greedy best-first search, and choose the strongest one. 0.646The resulting LLM-generated heuristics solve many more unseen test tasks than state-of-the-art domain-independent heuristics for classical planning. 0.611They are even competitive with the strongest learning algorithm for domain-dependent planning.These findings are especially remarkable given that our proof-of-concept implementation is based on an unoptimized Python planner and the baselines all build upon highly optimized C++ code.In some domains, the LLM-generated heuristics expand fewer states than the baselines, revealing that they are not only efficiently computable, but sometimes even more informative than the state-of-the-art heuristics. 0.686Overall, our results show that sampling a set of planning heuristic function programs can significantly improve the planning capabilities of LLMs. 0.698

link

2025-03-24

Defeating Prompt Injections by Design

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment. 0.705However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. 0.755In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. 0.682To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. 0.711To further improve security, CaMeL relies on a notion of a capability to prevent the exfiltration of private data over unauthorized data flows.We demonstrate effectiveness of CaMeL by solving $67\%$ of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.

link

2025-03-24

Towards Responsible AI Music: an Investigation of Trustworthy Features for Creative Systems

Generative AI is radically changing the creative arts, by fundamentally transforming the way we create and interact with cultural artefacts.While offering unprecedented opportunities for artistic expression and commercialisation, this technology also raises ethical, societal, and legal concerns.Key among these are the potential displacement of human creativity, copyright infringement stemming from vast training datasets, and the lack of transparency, explainability, and fairness mechanisms.As generative systems become pervasive in this domain, responsible design is crucial. 0.608Whilst previous work has tackled isolated aspects of generative systems (e.g., transparency, evaluation, data), we take a comprehensive approach, grounding these efforts within the Ethics Guidelines for Trustworthy Artificial Intelligence produced by the High-Level Expert Group on AI appointed by the European Commission - a framework for designing responsible AI systems across seven macro requirements.Focusing on generative music AI, we illustrate how these requirements can be contextualised for the field, addressing trustworthiness across multiple dimensions and integrating insights from the existing literature.We further propose a roadmap for operationalising these contextualised requirements, emphasising interdisciplinary collaboration and stakeholder engagement.Our work provides a foundation for designing and evaluating responsible music generation systems, calling for collaboration among AI experts, ethicists, legal scholars, and artists.This manuscript is accompanied by a website: https://amresearchlab.github.io/raim-framework/.

link

2025-03-24

Learning Multi-Robot Coordination through Locality-Based Factorized Multi-Agent Actor-Critic Algorithm

In this work, we present a novel cooperative multi-agent reinforcement learning method called \textbf{Loc}ality based \textbf{Fac}torized \textbf{M}ulti-Agent \textbf{A}ctor-\textbf{C}ritic (Loc-FACMAC).Existing state-of-the-art algorithms, such as FACMAC, rely on global reward information, which may not accurately reflect the quality of individual robots' actions in decentralized systems.We integrate the concept of locality into critic learning, where strongly related robots form partitions during training.Robots within the same partition have a greater impact on each other, leading to more precise policy evaluation.Additionally, we construct a dependency graph to capture the relationships between robots, facilitating the partitioning process.This approach mitigates the curse of dimensionality and prevents robots from using irrelevant information.Our method improves existing algorithms by focusing on local rewards and leveraging partition-based learning to enhance training efficiency and performance.We evaluate the performance of Loc-FACMAC in three environments: Hallway, Multi-cartpole, and Bounded-Cooperative-Navigation. 0.606We explore the impact of partition sizes on the performance and compare the result with baseline MARL algorithms such as LOMAQ, FACMAC, and QMIX.The experiments reveal that, if the locality structure is defined properly, Loc-FACMAC outperforms these baseline algorithms up to 108\%, indicating that exploiting the locality structure in the actor-critic framework improves the MARL performance.

link

2025-03-24

EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration. 0.672Our benchmarks consist of decision-making tasks derived from key problems in economics.To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels.Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents. 0.699Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior. 0.69Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing -- applications that should grow in importance as such agents are further integrated into the economy. 0.664

link

2025-03-24

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering.To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts.However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability.This paper proposes the first multi-concept personalization paradigm, MC-LLaVA.Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. 0.603To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens.Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities.To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset.We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity.Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants.The code and dataset will be publicly available at $\href{https://github.com/arctanxarc/MC-LLaVA}{https://github.com/arctanxarc/MC-LLaVA}$.

link

2025-03-24

Structuring Scientific Innovation: A Framework for Modeling and Discovering Impactful Knowledge Combinations

The emergence of large language models offers new possibilities for structured exploration of scientific knowledge.Rather than viewing scientific discovery as isolated ideas or content, we propose a structured approach that emphasizes the role of method combinations in shaping disruptive insights.Specifically, we investigate how knowledge unit--especially those tied to methodological design--can be modeled and recombined to yield research breakthroughs.Our proposed framework addresses two key challenges.First, we introduce a contrastive learning-based mechanism to identify distinguishing features of historically disruptive method combinations within problem-driven contexts.Second, we propose a reasoning-guided Monte Carlo search algorithm that leverages the chain-of-thought capability of LLMs to identify promising knowledge recombinations for new problem statements. 0.654Empirical studies across multiple domains show that the framework is capable of modeling the structural dynamics of innovation and successfully highlights combinations with high disruptive potential.This research provides a new path for computationally guided scientific ideation grounded in structured reasoning and historical data modeling.

link

2025-03-24

Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design

The efficiency of Large Language Model~(LLM) inference is often constrained by substantial memory bandwidth and capacity demands. 0.65Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity and/or bandwidth consumption at the cost of slight degradation in inference quality.This paper introduces a design solution that further alleviates memory bottlenecks by enhancing the on-chip memory controller in AI accelerators to achieve two main objectives: (1) significantly reducing memory capacity and bandwidth usage through lossless block compression~(e.g., LZ4 and ZSTD) of model weights and key-value (KV) cache without compromising inference quality, and (2) enabling memory bandwidth and energy consumption to scale proportionally with context-dependent dynamic quantization.These goals are accomplished by equipping the on-chip memory controller with mechanisms to improve fine-grained bit-level accessibility and compressibility of weights and KV cache through LLM-aware configuration of in-memory placement and representation. 0.706Experimental results on publicly available LLMs demonstrate the effectiveness of this approach, showing memory footprint reductions of 25.2\% for model weights and 46.9\% for KV cache. 0.741In addition, our hardware prototype at 4\,GHz and 32 lanes (7\,nm) achieves 8\,TB/s throughput with a modest area overhead (under 3.8\,mm\(^2\)), which underscores the viability of LLM-aware memory control as a key to efficient large-scale inference. 0.713

link

2025-03-24

A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery

With the significant advances in generative AI (GAI) and the proliferation of mobile devices, providing high-quality AI-generated content (AIGC) services via wireless networks is becoming the future direction.However, the primary challenges of AIGC service delivery in wireless networks lie in unstable channels, limited bandwidth resources, and unevenly distributed computational resources.In this paper, we employ semantic communication (SemCom) in diffusion-based GAI models to propose a Resource-aware wOrkload-adjUstable TransceivEr (ROUTE) for AIGC delivery in dynamic wireless networks.Specifically, to relieve the communication resource bottleneck, SemCom is utilized to prioritize semantic information of the generated content. 0.628Then, to improve computational resource utilization in both edge and local and reduce AIGC semantic distortion in transmission, modified diffusion-based models are applied to adjust the computing workload and semantic density in cooperative content generation.Simulations verify the superiority of our proposed ROUTE in terms of latency and content quality compared to conventional AIGC approaches.

link

2025-03-24

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Large Language Models (LLMs) have achieved remarkable success in natural language processing. 0.673Recent advances have led to the developing of a new class of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. 0.688Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored.In this work, we employ Sparse Autoencoders (SAEs), a method to learn a sparse decomposition of latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models.First, we propose an approach to extract candidate ''reasoning features'' from SAE representations.We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities.Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. 0.697Code available at https://github.com/AIRI-Institute/SAE-Reasoning

link

2025-03-24

AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. 0.617However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents' communication topologies particularly important.Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance.Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks.Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness.We release our code at https://github.com/wangzx1219/AgentDropout.

link

2025-03-24

xKV: Cross-Layer SVD for KV-Cache Compression

Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). 0.685Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers which generally does not hold in practice.We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache.Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers.xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes.Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than state-of-the-art inter-layer technique while improving accuracy by 2.7%. 0.604Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rates on coding tasks without performance degradation.These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. 0.708Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.

link

2025-03-24

FFN Fusion: Rethinking Sequential Computation in Large Language Models

We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization.Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact.We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior.Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. 0.63Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning.Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.

link

2025-03-24

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training.However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases.We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. 0.603In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer.A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets.TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings.On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.

link

2025-03-24

Exploring Training and Inference Scaling Laws in Generative Retrieval

Generative retrieval has emerged as a novel paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers. 0.623Although promising, the mechanisms that underpin its performance and scalability remain largely unclear.We conduct a systematic investigation of training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence retrieval performance.To address the lack of suitable metrics, we propose a novel evaluation measure inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods.Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws, especially when paired with larger LLMs.Furthermore, increasing inference computation yields substantial performance gains, revealing that generative retrieval can significantly benefit from higher compute budgets at inference.Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. 0.615Taken together, our findings underscore that model sizes, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.

link

2025-03-24

Video-T1: Test-Time Scaling for Video Generation

With the scale capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains.Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. 0.641Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt.In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from Gaussian noise space to the target video distribution.Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide searching process.Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time.As full-step denoising all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner.Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos.Project page: https://liuff19.github.io/Video-T1

link

2025-03-24

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding.This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs. 0.639We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets.Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes.Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.

link

2025-03-20

The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination

Benchmark Data Contamination (BDC)-the inclusion of benchmark testing samples in the training set-has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability.To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them.However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking.In this paper, we design a systematic and controlled pipeline along with two novel metrics-fidelity and contamination resistance-to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies.Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions.Our metrics address this limitation by emphasizing question-level evaluation result matching.Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. 0.657These findings underscore the urgent need for designing more effective BDC mitigation strategies.Our code repository is available at https://github.com/ASTRAL-Group/BDC_mitigation_assessment.

link

2025-03-20

Survey on Evaluation of LLM-based Agents

The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. 0.741This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents.We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents.Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks.We also identify critical gaps that future research must address-particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, and scalable evaluation methods.This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.

link

2025-03-20

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. 0.7Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning.However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon".In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. 0.748Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. 0.703Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking.

link

Developer Research

2025-03-18

MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration

Maintaining and scaling software systems relies heavily on effective code refactoring, yet this process remains labor-intensive, requiring developers to carefully analyze existing codebases and prevent the introduction of new defects. 0.652Although recent advancements have leveraged Large Language Models (LLMs) to automate refactoring tasks, current solutions are constrained in scope and lack mechanisms to guarantee code compilability and successful test execution.In this work, we introduce MANTRA, a comprehensive LLM agent-based framework that automates method-level refactoring.MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning to emulate human decision-making during refactoring while preserving code correctness and readability.Our empirical study, conducted on 703 instances of "pure refactorings" (i.e., code changes exclusively involving structural improvements), drawn from 10 representative Java projects, covers the six most prevalent refactoring operations.Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model (RawGPT ), achieving an 82.8% success rate (582/703) in producing code that compiles and passes all tests, compared to just 8.7% (61/703) with RawGPT.Moreover, in comparison to IntelliJ's LLM-powered refactoring tool (EM-Assist), MANTRA exhibits a 50% improvement in generating Extract Method transformations.A usability study involving 37 professional developers further shows that refactorings performed by MANTRA are perceived to be as readable and reusable as human-written code, and in certain cases, even more favorable. 0.652These results highlight the practical advantages of MANTRA and emphasize the growing potential of LLM-based systems in advancing the automation of software refactoring tasks.

link

2025-03-12

Validity in Design Science

Researchers must ensure that the claims about the knowledge produced by their work are valid.However, validity is neither well-understood nor consistently established in design science, which involves the development and evaluation of artifacts (models, methods, instantiations, and theories) to solve problems.As a result, it is challenging to demonstrate and communicate the validity of knowledge claims about artifacts.This paper defines validity in design science and derives the Design Science Validity Framework and a process model for applying it.The framework comprises three high-level claim and validity types-criterion, causal, and context-as well as validity subtypes.The framework guides researchers in integrating validity considerations into projects employing design science and contributes to the growing body of research on design science methodology. 0.61It also provides a systematic way to articulate and validate the knowledge claims of design science projects.We apply the framework to examples from existing research and then use it to demonstrate the validity of knowledge claims about the framework itself.

link

2025-03-12

Automating Code Review: A Systematic Literature Review

Code Review consists in assessing the code written by teammates with the goal of increasing code quality. 0.621Empirical studies documented the benefits brought by such a practice that, however, has its cost to pay in terms of developers' time.For this reason, researchers have proposed techniques and tools to automate code review tasks such as the reviewers selection (i.e., identifying suitable reviewers for a given code change) or the actual review of a given change (i.e., recommending improvements to the contributor as a human reviewer would do). 0.655Given the substantial amount of papers recently published on the topic, it may be challenging for researchers and practitioners to get a complete overview of the state-of-the-art. We present a systematic literature review (SLR) featuring 119 papers concerning the automation of code review tasks.We provide: (i) a categorization of the code review tasks automated in the literature; (ii) an overview of the under-the-hood techniques used for the automation, including the datasets used for training data-driven techniques; (iii) publicly available techniques and datasets used for their evaluation, with a description of the evaluation metrics usually adopted for each task. The SLR is concluded by a discussion of the current limitations of the state-of-the-art, with insights for future research directions.

link

2025-03-11

From Expert to Novice: An Empirical Study on Software Architecture Explanations

The sharing of knowledge about software architecture is crucial in software development, particularly during the onboarding of new developers. 0.621However, existing documentation often falls short due to issues like incompleteness and ambiguity.Consequently, oral explanations are used for knowledge transfer.This study investigates what constitutes a good explanation of software architecture through an empirical study.It aims to explore how software architecture explanations are conducted, identify the main challenges, and suggest improvements.It addresses five key areas: relevant architectural concerns, explanation plans, supporting artefacts, typical questions, and expectations.An exploratory field study was conducted using semi-structured interviews with 17 software professionals, including 9 architecture explainers and 8 explainees.The study discovers that an explanation must balance both problem and technical domains while considering the explainee's role, experience, and the goal of the explanation.The concept of the explanation window, which adjusts the level of detail and scope, is introduced to address these variables.We also extend the Twin Peaks model to guide the interplay between problem and solution domains during architectural explanations by adding an emphasis to the context surrounding both domains.Future research should focus on developing better tools and processes to support architecture explanations.

link

2025-03-11

GraphSense: Graph Embedding Based Code Suggestion Framework

Code suggestions have become an integral part of IDEs and developers use code suggestions generated by IDEs all the time. 0.616These code suggestions are mostly for calling a method of an object or for using a function of a library and not for possible next line of the code.GPT based models are too slow or resource intensive for real-time code suggestions in local environments.As a solution to this GraphSense was introduced which provide code suggestions with minimum amount of resource usage in real-time.

link

2025-03-05

Enhancing the Accuracy and Comprehensibility in Architectural Tactics Detection via Small Model-Augmented Prompt Engineering

Architectural tactics (ATs), as the concrete implementation of architectural decisions in code, address non-functional requirements of software systems.Due to the implicit nature of architectural knowledge in code implementation, developers may risk inadvertently altering or removing these tactics during code modifications or optimizations. 0.663Such unintended changes can trigger architectural erosion, gradually undermining the system's original design.While many researchers have proposed machine learning-based methods to improve the accuracy of detecting ATs in code, the black-box nature and the required architectural domain knowledge pose significant challenges for developers in verifying the results.Effective verification requires not only accurate detection results but also interpretable explanations that enhance their comprehensibility.However, this is a critical gap in current research.Large language models (LLMs) can generate easily interpretable ATs detection comments if they have domain knowledge.Fine-tuning LLMs to acquire domain knowledge faces challenges such as catastrophic forgetting and hardware constraints.Thus, we propose Prmt4TD, a small model-augmented prompting framework to enhance the accuracy and comprehensibility of ATs detection.Combining fine-tuned small models with In-Context Learning can also reduce fine-tuning costs while equipping the LLM with additional domain knowledge.Prmt4TD can leverage the remarkable processing and reasoning capabilities of LLMs to generate easily interpretable ATs detection results.Our evaluation results demonstrate that Prmt4TD achieves accuracy (\emph{F1-score}) improvement of 13\%-23\% on the ATs balanced dataset and enhances the comprehensibility of the detection results.

link

2025-03-05

Robust Learning of Diverse Code Edits

Software engineering activities frequently involve edits to existing code. 0.69However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements.In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm.Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity.Today's code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning.To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) does a sparse projection onto the base model to avoid overfitting.Using our approach, we obtain a new series of models NextCoder (adapted from QwenCoder-2.5) that achieves strong results on five code-editing benchmarks, outperforming comparable size models and even several larger ones.We show the generality of our approach on two model families (DeepSeekCoder and QwenCoder), compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation abilities post adaptation.

link

2025-03-04

The Shift from Writing to Pruning Software: A Bonsai-Inspired IDE for Reshaping AI Generated Code

The rise of AI-driven coding assistants signals a fundamental shift in how software is built.While AI coding assistants have been integrated into existing Integrated Development Environments (IDEs), their full potential remains largely untapped.A key challenge is that these AI assistants can suffer from hallucinations, leading developers down decision paths that the AI should not dictate, sometimes even without the users awareness or consent.Moreover, current static-file IDEs lack the mechanisms to address critical issues such as tracking the provenance of AI-generated code and integrating version control in a way that aligns with the dynamic nature of AI-assisted development.As a result, developers are left without the necessary tools to manage, refine, and validate AI generated code systematically, making it difficult to ensure correctness, maintainability, and trust in the development process. 0.654Existing IDEs treat AI-generated code as static text, offering limited support for managing its evolution, refinement, or multiple alternative paths. Drawing inspiration from the ancient art of Japanese Bonsai gardening focused on balance, structure, and deliberate pruning: we propose a new approach to IDEs, where AI is allowed to generate in its true, unconstrained form, free from traditional file structures.This approach fosters a more fluid and interactive method for code evolution.We introduce the concept of a Bonsai-inspired IDE, structured as a graph of generated code snippets and multiple code paths, enabling developers to reshape AI generated code to suit their needs.Our vision calls for a shift away from a static file based model toward a dynamic, evolving system that allows for continuous refinement of generated code, with the IDE evolving alongside AI powered modifications rather than merely serving as a place to write and edit code.

link

Data Annotation Techniques