Vincent's Arxiv FrontPage


Generated on 2025-04-25.


This frontpage is made by scraping arXiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions.


New Datasets

2025-04-24

PTCL: Pseudo-Label Temporal Curriculum Learning for Label-Limited Dynamic Graph

Dynamic node classification is critical for modeling evolving systems like financial transactions and academic collaborations. In such systems, dynamically capturing node information changes is critical for dynamic node classification, which usually requires all labels at every timestamp. However, it is difficult to collect all dynamic labels in real-world scenarios due to high annotation costs and label uncertainty (e.g., ambiguous or delayed labels in fraud detection). In contrast, final timestamp labels are easier to obtain as they rely on complete temporal patterns and are usually maintained as a unique label for each user in many open platforms, without tracking the history data. To bridge this gap, we propose PTCL (Pseudo-label Temporal Curriculum Learning), a pioneering method addressing label-limited dynamic node classification where only final labels are available. PTCL introduces: (1) a temporal decoupling architecture separating the backbone (learning time-aware representations) and decoder (strictly aligned with final labels), which generate pseudo-labels, and (2) a Temporal Curriculum Learning strategy that prioritizes pseudo-labels closer to the final timestamp by assigning them higher weights using an exponentially decaying function. We contribute a new academic dataset (CoOAG), capturing long-range research interest in dynamic graph. [0.711] Experiments across real-world scenarios demonstrate PTCL's consistent superiority over other methods adapted to this task. Beyond methodology, we propose a unified framework FLiD (Framework for Label-Limited Dynamic Node Classification), consisting of a complete preparation workflow, training pipeline, and evaluation standards, and supporting various models and datasets. The code can be found at https://github.com/3205914485/FLiD.

link
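
The curriculum weighting described in the PTCL abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the decay rate of 0.5, and the weighted-mean loss are all assumptions made for the example. The abstract states only that pseudo-labels closer to the final timestamp receive higher weights through an exponentially decaying function.

```python
import math


def temporal_curriculum_weights(timestamps, final_time, decay_rate=0.5):
    """Weight pseudo-labels so that those nearer final_time count more.

    decay_rate is a hypothetical hyperparameter, not a value from the paper.
    """
    return [math.exp(-decay_rate * (final_time - t)) for t in timestamps]


def weighted_pseudo_label_loss(losses, weights):
    """Weighted mean of per-timestamp pseudo-label losses (assumed form)."""
    total_weight = sum(weights)
    return sum(loss * w for loss, w in zip(losses, weights)) / total_weight


# Five timestamps; the last one is the final timestamp and gets weight 1.
weights = temporal_curriculum_weights([0, 1, 2, 3, 4], final_time=4)
```

The weight decays with the distance to the final timestamp, so early pseudo-labels, which are presumably noisier, contribute less to the training signal.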

2025-04-24

PICO: Reconstructing 3D People In Contact with Objects

Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. [0.816] To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at https://pico.is.tue.mpg.de. [0.915]

link

2025-04-24

Hierarchical and Multimodal Data for Daily Activity Understanding

Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. [0.718] The overlap and unscripted nature of DARai allow counterfactual activities in the dataset. Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai's multi-sensor and counterfactual activity design setup. The code, documentation, and dataset are available at the dedicated DARai website: https://alregib.ece.gatech.edu/software-and-datasets/darai-daily-activity-recordings-for-artificial-intelligence-and-machine-learning/ [0.809]

link

2025-04-24

Network Sampling: An Overview and Comparative Analysis

Network sampling is a crucial technique for analyzing large or partially observable networks. However, the effectiveness of different sampling methods can vary significantly depending on the context. In this study, we empirically compare representative methods from three main categories: node-based, edge-based, and exploration-based sampling. We used two real-world datasets for our analysis: a scientific collaboration network and a temporal message-sending network. [0.867] Our results indicate that no single sampling method consistently outperforms the others in both datasets. Although advanced methods tend to provide better accuracy on static networks, they often perform poorly on temporal networks, where simpler techniques can be more effective. These findings suggest that the best sampling strategy depends not only on the structural characteristics of the network but also on the specific metrics that need to be preserved or analyzed. Our work offers practical insights for researchers in choosing sampling approaches that are tailored to different types of networks and analytical objectives.

link

2025-04-24

Step1X-Edit: A Practical Framework for General Image Editing

In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. [0.702] For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

link

2025-04-24

Dynamic Camera Poses and Where to Find Them

Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. [0.838] Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.

link

2025-04-23

Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures

Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. We discovered that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness. This suggests that developers can reduce repair costs by addressing shared root causes, enabling them to fix multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. [0.866] It contains 810 flaky tests, which we leveraged to perform a mixed-method empirical analysis of co-occurring flaky test failures. Systemic flakiness is significant and widespread. We performed agglomerative clustering of flaky tests based on their failure co-occurrence, finding that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, we demonstrated a lightweight alternative by training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness.

link
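
Clustering flaky tests by failure co-occurrence, as described in the abstract above, might look roughly like the sketch below. The abstract does not name a distance measure, linkage criterion, or cut-off, so the Jaccard distance, single linkage, and threshold of 0.5 used here are assumptions chosen for illustration, as are the test names and run IDs.

```python
from itertools import combinations


def jaccard_distance(runs_a, runs_b):
    """1 - |A & B| / |A | B| for two sets of failing run IDs."""
    union = runs_a | runs_b
    if not union:
        return 1.0
    return 1.0 - len(runs_a & runs_b) / len(union)


def cluster_flaky_tests(failures, threshold=0.5):
    """Single-linkage agglomerative clustering of flaky tests.

    failures maps a test name to the set of run IDs in which it failed.
    Merging stops once the closest pair of clusters is farther apart
    than threshold (a hypothetical cut-off).
    """
    clusters = [{name} for name in failures]

    def linkage(c1, c2):
        # Single linkage: distance between the closest members.
        return min(
            jaccard_distance(failures[a], failures[b]) for a in c1 for b in c2
        )

    while len(clusters) > 1:
        (i, j), best = min(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if best > threshold:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical failure history: three tests fail together, one fails alone.
failures = {
    "test_upload": {1, 4, 7},
    "test_download": {1, 4, 7, 9},
    "test_sync": {4, 7},
    "test_parse_date": {2},
}
clusters = cluster_flaky_tests(failures)
```

Here the three tests that fail in overlapping runs end up in one cluster, while `test_parse_date` stays on its own. A shared cluster is a hint, not proof, that the tests share a root cause.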

2025-04-23

High-Quality Cloud-Free Optical Image Synthesis Using Multi-Temporal SAR and Contaminated Optical Data

Addressing gaps caused by cloud cover and the long revisit cycle of satellites is vital for providing essential data to support remote sensing applications. This paper tackles the challenges of missing optical data synthesis, particularly in complex scenarios with cloud cover. We propose CRSynthNet, a novel image synthesis network that incorporates innovatively designed modules such as the DownUp Block and Fusion Attention to enhance accuracy. Experimental results validate the effectiveness of CRSynthNet, demonstrating substantial improvements in restoring structural details, preserving spectral consistency, and achieving superior visual effects that far exceed those produced by comparison methods. It achieves quantitative improvements across multiple metrics: a peak signal-to-noise ratio (PSNR) of 26.978, a structural similarity index measure (SSIM) of 0.648, and a root mean square error (RMSE) of 0.050. Furthermore, this study creates the TCSEN12 dataset, a valuable resource specifically designed to address cloud cover challenges in missing optical data synthesis study. [0.762] The dataset uniquely includes cloud-covered images and leverages earlier images to predict later images, offering a realistic representation of real-world scenarios. [0.711] This study offers a practical method and valuable resources for the optical satellite image synthesis task.

link

2025-04-23

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset

This paper presents our winning submission to the AI Mathematical Olympiad - Progress Prize 2 (AIMO-2) competition. Our recipe for building state-of-the-art mathematical reasoning models relies on three key pillars. First, we create a large-scale dataset comprising 540K unique high-quality math problems, including olympiad-level problems, and their 3.2M long-reasoning solutions. [0.726] Second, we develop a novel method to integrate code execution with long reasoning models through iterative training, generation, and quality filtering, resulting in 1.7M high-quality Tool-Integrated Reasoning solutions. Third, we create a pipeline to train models to select the most promising solution from many candidates. We show that such generative solution selection (GenSelect) can significantly improve upon the majority voting baseline. Combining these ideas, we train a series of models that achieve state-of-the-art results on mathematical reasoning benchmarks. To facilitate further research, we release our code, models, and the complete OpenMathReasoning dataset under a commercially permissive license.

link
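
For context, the majority voting baseline that the abstract compares against can be sketched in a few lines. The candidate answers below are made up for illustration. GenSelect itself relies on a trained model to choose among candidate solutions and is not reproduced here.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among candidate solutions."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled solutions.
candidates = ["42", "41", "42", "7", "42"]
chosen = majority_vote(candidates)
```

Majority voting looks only at the final answers, so it cannot prefer a well-reasoned minority solution over a popular wrong one, which is the gap a learned selector aims to close.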

2025-04-23

Procedural Dataset Generation for Zero-Shot Stereo Matching

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains largely unexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. [0.702] We collect the best settings to produce Infinigen-Stereo, a procedural generator specifically optimized for zero-shot stereo datasets. Models trained only on data from our system outperform robust baselines trained on a combination of existing synthetic datasets and have stronger zero-shot stereo matching performance than public checkpoints from prior works. We open source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets. [0.712]

link

2025-04-22

Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions

Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. [0.862] Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc. [0.882]

link

2025-04-22

W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models

The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding $\eta$ in the feed-forward neural (FFN) layer. Additionally, by eliminating the need for gradient computations, we optimize the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. [0.734] The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation.

link
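
One plausible reading of the second W-PCA proxy, the number of principal components whose cumulative contribution reaches $\eta$, is sketched below. It assumes the PCA eigenvalues (explained variances) of the FFN layer's hidden states have already been computed. The function name, the example eigenvalues, and the use of "at least $\eta$" as the stopping rule are assumptions. How the paper combines this count with the parameter count is not stated in the abstract and is not shown.

```python
def components_for_contribution(eigenvalues, eta):
    """Smallest k such that the top-k eigenvalues explain at least eta
    of the total variance (eta is a fraction between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= eta:
            return k
    return len(ordered)


# Hypothetical eigenvalues: cumulative contributions are 0.4, 0.7, 0.9, 1.0.
k = components_for_contribution([4.0, 3.0, 2.0, 1.0], eta=0.8)
```

With these values the first three components are needed to reach 80% of the variance, so `k` is 3. No gradients are involved, which matches the abstract's point that the proxy avoids gradient computation.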

2025-04-22

Trends in AI Supercomputers

Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in performance, power needs, hardware cost, ownership, and global distribution. [0.783] We find that the computational performance of AI supercomputers has doubled every nine months, while hardware acquisition cost and power needs both doubled every year. The leading system in March 2025, xAI's Colossus, used 200,000 AI chips, had a hardware cost of \$7B, and required 300 MW of power, as much as 250,000 households. As AI supercomputers evolved from tools for science to industrial machines, companies rapidly expanded their share of total AI supercomputer performance, while the share of governments and academia diminished. Globally, the United States accounts for about 75% of total performance in our dataset, with China in second place at 15%. If the observed trends continue, the leading AI supercomputer in 2030 will achieve $2\times10^{22}$ 16-bit FLOP/s, use two million AI chips, have a hardware cost of \$200 billion, and require 9 GW of power. Our analysis provides visibility into the AI supercomputer landscape, allowing policymakers to assess key AI trends like resource needs, ownership, and national competitiveness.

link
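
The doubling-time extrapolation in the abstract above can be checked with a little arithmetic. The sketch below assumes a horizon of exactly 60 months from the March 2025 figures and uses the rounded doubling times quoted in the abstract. Both are simplifications, so the outputs only approximate the abstract's own 2030 projections.

```python
def extrapolate(value, doubling_months, horizon_months):
    """Project a quantity forward, assuming it doubles every doubling_months."""
    return value * 2 ** (horizon_months / doubling_months)


HORIZON_MONTHS = 60  # assumed: five years after March 2025

power_mw = extrapolate(300, doubling_months=12, horizon_months=HORIZON_MONTHS)
cost_billion = extrapolate(7, doubling_months=12, horizon_months=HORIZON_MONTHS)
performance_factor = extrapolate(1, doubling_months=9, horizon_months=HORIZON_MONTHS)
```

Under these assumptions the power need comes out at 9,600 MW and the hardware cost at \$224 billion, the same order as the 9 GW and \$200 billion quoted in the abstract. Performance grows by a factor of roughly 100 over the same period.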

2025-04-22

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. [0.816] Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

link

2025-04-21

KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking

Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL. [0.836]

link

2025-04-21

Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform

The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. [0.809] By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline including shot boundary detection, OCR, border detection, motion filtering, and fine bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: https://tinytigerpan.github.io/tiger200k/

link

2025-04-21

Fully Bayesian Approaches to Topics over Time

The Topics over Time (ToT) model captures thematic changes in timestamped datasets by explicitly modeling publication dates jointly with word co-occurrence patterns. However, ToT was not approached in a fully Bayesian fashion, a flaw that makes it susceptible to stability problems. To address this issue, we propose a fully Bayesian Topics over Time (BToT) model via the introduction of a conjugate prior to the Beta distribution. This prior acts as a regularization that prevents the online version of the algorithm from unstable updates when a topic is poorly represented in a mini-batch. The characteristics of this prior to the Beta distribution are studied here for the first time. Still, this model suffers from a difference in scale between the single-time observations and the multiplicity of words per document. A variation of BToT, Weighted Bayesian Topics over Time (WBToT), is proposed as a solution. In WBToT, publication dates are repeated a certain number of times per document, which balances the relative influence of words and timestamps along the inference process. We have tested our models on two datasets: a collection of over 200 years of US state-of-the-union (SOTU) addresses and a large-scale COVID-19 Twitter corpus of 10 million tweets. [0.748] The results show that WBToT captures events better than Latent Dirichlet Allocation and other SOTA topic models like BERTopic: the median absolute deviation of the topic presence over time is reduced by $51\%$ and $34\%$, respectively. Our experiments also demonstrate the superior coherence of WBToT over BToT, which highlights the importance of balancing the time and word modalities. Finally, we illustrate the stability of the online optimization algorithm in WBToT, which allows the application of WBToT to problems that are intractable for standard ToT.

link

2025-04-21

CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench. [0.746]

link

2025-04-21

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. [0.792] Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.

link

2025-04-17

Hierarchical Feature Learning for Medical Point Clouds via State Space Model

Deep learning-based point cloud modeling has been widely investigated as an indispensable component of general shape analysis. Recently, transformers and state space models (SSMs) have shown promising capacities in point cloud learning. However, limited research has been conducted on medical point clouds, which have great potential in disease diagnosis and treatment. This paper presents an SSM-based hierarchical feature learning framework for medical point cloud understanding. Specifically, we down-sample the input into multiple levels through farthest point sampling. At each level, we perform a series of k-nearest neighbor (KNN) queries to aggregate multi-scale structural information. To assist the SSM in processing point clouds, we introduce coordinate-order and inside-out scanning strategies for efficient serialization of irregular points. Point features are calculated progressively from short neighbor sequences and long point sequences through vanilla and group Point SSM blocks, to capture both local patterns and long-range dependencies. To evaluate the proposed method, we build a large-scale medical point cloud dataset named MedPointS for anatomy classification, completion, and segmentation. Extensive experiments conducted on MedPointS demonstrate that our method achieves superior performance across all tasks. The dataset is available at https://flemme-docs.readthedocs.io/en/latest/medpoints.html. [0.788] Code is merged to a public medical imaging platform: https://github.com/wlsdzyzl/flemme.

link
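
Farthest point sampling, the down-sampling step mentioned in the abstract above, is a standard greedy procedure. The sketch below is a plain-Python illustration, not the authors' code: the function name, the fixed starting index, and the toy points are assumptions for the example.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick num_samples indices so that each new point is the one
    farthest from all points selected so far."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start_index]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


# Four toy 3D points on a line; sample three of them.
points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (5, 0, 0)]
sampled = farthest_point_sampling(points, num_samples=3)
```

Starting from the first point, the procedure picks the far end of the line and then the midpoint, skipping the point that sits next to the start. Repeating it with decreasing sample counts would give the multiple resolution levels the abstract refers to.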

2025-04-17

ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images

Recent studies have made significant progress in developing large language models (LLMs) in the medical domain, which can answer expert-level questions and demonstrate the potential to assist clinicians in real-world clinical scenarios. Studies have also witnessed the importance of integrating various modalities with the existing LLMs for a better understanding of complex clinical contexts, which are innately multi-faceted by nature. Although studies have demonstrated the ability of multimodal LLMs in histopathology to answer questions from given images, they lack in understanding of thorough clinical context due to the patch-level data with limited information from public datasets. Thus, developing WSI-level MLLMs is significant in terms of the scalability and applicability of MLLMs in histopathology. In this study, we introduce an expert-level MLLM for histopathology using WSIs, dubbed as ChatEXAONEPath. We present a retrieval-based data generation pipeline using 10,094 pairs of WSIs and histopathology reports from The Cancer Genome Atlas (TCGA). [0.785] We also showcase an AI-based evaluation protocol for a comprehensive understanding of the medical context from given multimodal information and evaluate generated answers compared to the original histopathology reports. We demonstrate the ability of diagnosing the given histopathology images using ChatEXAONEPath with the acceptance rate of 62.9% from 1,134 pairs of WSIs and reports. Our proposed model can understand pan-cancer WSIs and clinical context from various cancer types. We argue that our proposed model has the potential to assist clinicians by comprehensively understanding complex morphology of WSIs for cancer diagnosis through the integration of multiple modalities.

link

2025-04-17

How Large Language Models Are Changing MOOC Essay Answers: A Comparison of Pre- and Post-LLM Responses

The release of ChatGPT in late 2022 caused a flurry of activity and concern in the academic and educational communities. Some see the tool's ability to generate human-like text that passes at least cursory inspections for factual accuracy "often enough" as heralding a golden age of information retrieval and computer-assisted learning. Some, on the other hand, worry the tool may lead to unprecedented levels of academic dishonesty and cheating. In this work, we quantify some of the effects of the emergence of Large Language Models (LLMs) on online education by analyzing a multi-year dataset of student essay responses from a free university-level MOOC on AI ethics. Our dataset includes essays submitted both before and after ChatGPT's release. 0.854 We find that the launch of ChatGPT coincided with significant changes in both the length and style of student essays, mirroring observations in other contexts such as academic publishing. We also observe, as expected based on related public discourse, changes in the prevalence of key content words related to AI and LLMs, but not necessarily in the general themes or topics discussed in the student essays as identified through (dynamic) topic modeling.

link

2025-04-17

EventVAD: Training-Free Event-Aware Video Anomaly Detection

Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. 0.739 The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.

link

2025-04-17

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues.To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization.VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens.Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. 0.764Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.The code and data are available at https://github.com/HaroldChen19/VistaDPO.

link

2025-04-17

Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models.However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs.As models and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain.This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation.This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions.Our continuous DPO methodology yields remarkable results in reducing hallucinations.Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model.2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts.Across 35 vision language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work.Meanwhile, it also offers considerable support in the text-to-image domain.With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark.3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset. 0.826

link

2025-04-17

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers.FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures.We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. 0.726On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality.In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics).We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.FreshStack datasets are available at: https://fresh-stack.github.io. 0.955

link

2025-04-17

Science-T2I: Addressing Scientific Illusions in Image Synthesis

We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. 0.793 Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.

link

2025-04-17

IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design

This paper presents IMAGGarment-1, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement.Unlike existing methods that are limited to single-condition inputs, IMAGGarment-1 addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications.Specifically, IMAGGarment-1 employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference.In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter.In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency.To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. 0.861Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance.The code and model are available at https://github.com/muzishen/IMAGGarment-1.

link

2025-04-17

Perception Encoder: The best visual embeddings are not at the output of the network

We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning.Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization.Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks.There is only one caveat: these embeddings are hidden within the intermediate layers of the network.To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction.Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking.To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos. 0.753

link

2025-04-16

Self-Supervised Traversability Learning with Online Prototype Adaptation for Off-Road Autonomous Driving

Achieving reliable and safe autonomous driving in off-road environments requires accurate and efficient terrain traversability analysis. However, this task faces several challenges, including the scarcity of large-scale datasets tailored for off-road scenarios, the high cost and potential errors of manual annotation, the stringent real-time requirements of motion planning, and the limited computational power of onboard units. To address these challenges, this paper proposes a novel traversability learning method that leverages self-supervised learning, eliminating the need for manual annotation. For the first time, a Bird's-Eye View (BEV) representation is used as input, reducing computational burden and improving adaptability to downstream motion planning. During vehicle operation, the proposed method conducts online analysis of traversed regions and dynamically updates prototypes to adaptively assess the traversability of the current environment, effectively handling dynamic scene changes. We evaluate our approach against state-of-the-art benchmarks on both public datasets and our own dataset, covering diverse seasons and geographical locations. 0.705 Experimental results demonstrate that our method significantly outperforms recent approaches. Additionally, real-world vehicle experiments show that our method operates at 10 Hz, meeting real-time requirements, while a 5.5 km autonomous driving experiment further validates the generated traversability cost maps' compatibility with downstream motion planning.

link

2025-04-16

Towards LLM Agents for Earth Observation

Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains. Here we ask: Are AI systems ready for reliable Earth Observation? We introduce \datasetnamenospace, a benchmark of 140 yes/no questions from NASA Earth Observatory articles across 13 topics and 17 satellite sensors. 0.791 Using the Google Earth Engine API as a tool, LLM agents can only achieve an accuracy of 33% because the code fails to run over 58% of the time. We improve the failure rate for open models by fine-tuning on synthetic data, allowing much smaller models (Llama-3.1-8B) to achieve comparable accuracy to much larger ones (e.g., DeepSeek-R1). Taken together, our findings identify significant challenges to be solved before AI agents can automate earth observation, and suggest paths forward. The project page is available at https://iandrover.github.io/UnivEarth.

link

2025-04-16

CodingHomo: Bootstrapping Deep Homography With Video Coding

Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: https://github.com/liuyike422/CodingHomo 0.826

link

2025-04-16

RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning

Semantic 3D city models are easily accessible worldwide, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. 0.806 Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain robust radar features via an SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available at https://gpp-communication.github.io/RADLER .

link

2025-04-16

Uncertainty-Guided Coarse-to-Fine Tumor Segmentation with Anatomy-Aware Post-Processing

Reliable tumor segmentation in thoracic computed tomography (CT) remains challenging due to boundary ambiguity, class imbalance, and anatomical variability.We propose an uncertainty-guided, coarse-to-fine segmentation framework that combines full-volume tumor localization with refined region-of-interest (ROI) segmentation, enhanced by anatomically aware post-processing.The first-stage model generates a coarse prediction, followed by anatomically informed filtering based on lung overlap, proximity to lung surfaces, and component size.The resulting ROIs are segmented by a second-stage model trained with uncertainty-aware loss functions to improve accuracy and boundary calibration in ambiguous regions.Experiments on private and public datasets demonstrate improvements in Dice and Hausdorff scores, with fewer false positives and enhanced spatial interpretability.These results highlight the value of combining uncertainty modeling and anatomical priors in cascaded segmentation pipelines for robust and clinically meaningful tumor delineation.On the Orlando dataset, our framework improved Swin UNETR Dice from 0.4690 to 0.6447. 0.728Reduction in spurious components was strongly correlated with segmentation gains, underscoring the value of anatomically informed post-processing.
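
For readers unfamiliar with the Dice score reported in this abstract, a minimal sketch of the standard Dice coefficient on binary masks follows. The function name `dice_score`, the flat-list mask representation, and the convention of returning 1.0 for two empty masks are assumptions made for illustration; the paper's own evaluation code is not shown on this page.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A & B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical toy masks: one overlapping voxel, two predicted, one true.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1.0 means the prediction and the reference overlap perfectly, and 0.0 means they share no foreground voxels.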

link

2025-04-16

Coding-Prior Guided Diffusion Network for Video Deblurring

While recent video deblurring methods have advanced significantly, they often overlook two valuable sources of prior information: (1) motion vectors (MVs) and coding residuals (CRs) from video codecs, which provide efficient inter-frame alignment cues, and (2) the rich real-world knowledge embedded in pre-trained diffusion generative models. We present CPGDNet, a novel two-stage framework that effectively leverages both coding priors and generative diffusion priors for high-quality deblurring. First, our coding-prior feature propagation (CPFP) module utilizes MVs for efficient frame alignment and CRs to generate attention masks, addressing motion inaccuracies and texture variations. Second, a coding-prior controlled generation (CPC) module integrates coding priors into a pretrained diffusion model, guiding it to enhance critical regions and synthesize realistic details. Experiments demonstrate our method achieves state-of-the-art perceptual quality with up to 30% improvement in IQA metrics. Both the code and the coding-prior-augmented dataset will be open-sourced. 0.702

link

2025-04-16

Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning

Automatic speech recognition (ASR) is crucial for human-machine interaction in diverse applications like conversational agents, industrial robotics, call center automation, and automated subtitling. However, developing high-performance ASR models remains challenging, particularly for low-resource languages like Arabic, due to the scarcity of large, labeled speech datasets, which are costly and labor-intensive to produce. In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA), eliminating the need for costly manual transcriptions. 0.709 Despite the absence of human-verified labels, our approach attains state-of-the-art (SOTA) performance, exceeding all previous efforts in the field of Arabic ASR on the standard benchmarks. By demonstrating the effectiveness of weak supervision as a scalable, cost-efficient alternative to traditional supervised approaches, we pave the way for improved ASR systems in low-resource settings.

link

2025-04-16

How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions

We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input.Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook.To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. 0.748We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes.Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.

link

2025-04-15

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices.Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability.In this vision-based paradigm, the GUI instruction grounding, which maps user instruction to the location of corresponding element on the given screenshot, remains a critical challenge, particularly due to limited public training dataset and resource-intensive manual instruction data annotation.In this paper, we delve into unexplored challenges in this task including element-to-screen ratio, unbalanced element type, and implicit instruction.To address these challenges, we introduce a large-scale data synthesis pipeline UI-E2I-Synth for generating varying complex instruction datasets using GPT-4o instead of human annotators. 0.754Furthermore, we propose a new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects.Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the advancements of proposed data synthesis pipeline.The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding.We will release corresponding artifacts at https://colmon46.github.io/i2e-bench-leaderboard/

link

2025-04-15

Code Reborn AI-Driven Legacy Systems Modernization from COBOL to Java

This study investigates AI-driven modernization of legacy COBOL code into Java, addressing a critical challenge in aging software systems.Leveraging the Legacy COBOL 2024 Corpus -- 50,000 COBOL files from public and enterprise sources -- Java parses the code, AI suggests upgrades, and React visualizes gains. 0.715Achieving 93% accuracy, complexity drops 35% (from 18 to 11.7) and coupling 33% (from 8 to 5.4), surpassing manual efforts (75%) and rule-based tools (82%).The approach offers a scalable path to rejuvenate COBOL systems, vital for industries like banking and insurance.

link

2025-04-15

DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation

Data-driven design is emerging as a powerful strategy to accelerate engineering innovation. However, its application to vehicle wheel design remains limited due to the lack of large-scale, high-quality datasets that include 3D geometry and physical performance metrics. To address this gap, this study proposes a synthetic design-performance dataset generation framework using generative AI. The proposed framework first generates 2D rendered images using Stable Diffusion, and then reconstructs the 3D geometry through 2.5D depth estimation. Structural simulations are subsequently performed to extract engineering performance data. To further expand the design and performance space, topology optimization is applied, enabling the generation of a more diverse set of wheel designs. The final dataset, named DeepWheel, consists of over 6,000 photo-realistic images and 900 structurally analyzed 3D models. 0.822 This multi-modal dataset serves as a valuable resource for surrogate model training, data-driven inverse design, and design space exploration. The proposed methodology is also applicable to other complex design domains. The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and is available at https://www.smartdesignlab.org/datasets 0.955

link

2025-04-15

PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond

We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Check our Webpage! 0.792 https://research.nvidia.com/labs/toronto-ai/partfield-release/

link

2025-04-15

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

The capacity for complex mathematical reasoning is a key benchmark for artificial intelligence.While reinforcement learning (RL) applied to LLMs shows promise, progress is significantly hindered by the lack of large-scale training data that is sufficiently challenging, possesses verifiable answer formats suitable for RL, and is free from contamination with evaluation benchmarks.To address these limitations, we introduce DeepMath-103K, a new, large-scale dataset comprising approximately 103K mathematical problems, specifically designed to train advanced reasoning models via RL. 0.724DeepMath-103K is curated through a rigorous pipeline involving source analysis, stringent decontamination against numerous benchmarks, and filtering for high difficulty (primarily Levels 5-9), significantly exceeding existing open resources in challenge. 0.745Each problem includes a verifiable final answer, enabling rule-based RL, and three distinct R1-generated solutions suitable for diverse training paradigms like supervised fine-tuning or distillation.Spanning a wide range of mathematical topics, DeepMath-103K promotes the development of generalizable reasoning.We demonstrate that models trained on DeepMath-103K achieve significant improvements on challenging mathematical benchmarks, validating its effectiveness.We release DeepMath-103K publicly to facilitate community progress in building more capable AI reasoning systems: https://github.com/zwhe99/DeepMath.

link

Data Quality

2025-04-24

PTCL: Pseudo-Label Temporal Curriculum Learning for Label-Limited Dynamic Graph

Dynamic node classification is critical for modeling evolving systems like financial transactions and academic collaborations. In such systems, dynamically capturing node information changes is critical for dynamic node classification, which usually requires all labels at every timestamp. However, it is difficult to collect all dynamic labels in real-world scenarios due to high annotation costs and label uncertainty (e.g., ambiguous or delayed labels in fraud detection). 0.618 In contrast, final timestamp labels are easier to obtain as they rely on complete temporal patterns and are usually maintained as a unique label for each user in many open platforms, without tracking the history data. To bridge this gap, we propose PTCL (Pseudo-label Temporal Curriculum Learning), a pioneering method addressing label-limited dynamic node classification where only final labels are available. PTCL introduces: (1) a temporal decoupling architecture separating the backbone (learning time-aware representations) and decoder (strictly aligned with final labels), which generates pseudo-labels, and (2) a Temporal Curriculum Learning strategy that prioritizes pseudo-labels closer to the final timestamp by assigning them higher weights using an exponentially decaying function. We contribute a new academic dataset (CoOAG), capturing long-range research interest in dynamic graphs. Experiments across real-world scenarios demonstrate PTCL's consistent superiority over other methods adapted to this task. Beyond methodology, we propose a unified framework FLiD (Framework for Label-Limited Dynamic Node Classification), consisting of a complete preparation workflow, training pipeline, and evaluation standards, and supporting various models and datasets. The code can be found at https://github.com/3205914485/FLiD.
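
The weighting idea in point (2) of this abstract can be sketched as follows. This is only an illustration of an exponentially decaying weight over the distance to the final timestamp; the function name `temporal_weights`, the `decay` parameter, and the exact form `exp(-decay * (T - t))` are assumptions, not taken from the paper or the FLiD repository.

```python
import math

def temporal_weights(timestamps, final_time, decay=0.5):
    """Weight each timestamp t by exp(-decay * (final_time - t)).

    Pseudo-labels at the final timestamp get weight 1.0; earlier ones decay towards 0.
    """
    if decay < 0:
        raise ValueError("decay must be non-negative")
    return [math.exp(-decay * (final_time - t)) for t in timestamps]

# Hypothetical example: three snapshots, the last one being the final timestamp.
weights = temporal_weights([0, 1, 2], final_time=2, decay=1.0)
```

In a training loop, such weights would typically multiply the per-timestamp pseudo-label loss, so that supervision nearest the trusted final labels dominates.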

link

2025-04-22

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975).However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims.In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions.Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues.We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. 0.63Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

link

2025-04-17

Riemannian Patch Assignment Gradient Flows

This paper introduces patch assignment flows for metric data labeling on graphs.Labelings are determined by regularizing initial local labelings through the dynamic interaction of both labels and label assignments across the graph, entirely encoded by a dictionary of competing labeled patches and mediated by patch assignment variables. 0.629Maximal consistency of patch assignments is achieved by geometric numerical integration of a Riemannian ascent flow, as critical point of a Lagrangian action functional.Experiments illustrate properties of the approach, including uncertainty quantification of label assignments. 0.661

link

2025-04-16

Logits DeConfusion with CLIP for Few-Shot Learning

With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. Our MAF extracts features from different levels and fuses them uniformly to enhance feature representation. Our ICD learnably eliminates inter-class confusion in logits with a residual structure. Experimental results show that our method can significantly improve the classification performance and alleviate the inter-class confusion problem. 0.619 The code is available at https://github.com/LiShuo1001/LDC.

link

2025-04-14

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. 0.716 Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.

link

2025-04-10

Towards Micro-Action Recognition with Limited Annotations: An Asynchronous Pseudo Labeling and Training Approach

Micro-Action Recognition (MAR) aims to classify subtle human actions in video. However, annotating MAR datasets is particularly challenging due to the subtlety of actions. To this end, we introduce the setting of Semi-Supervised MAR (SSMAR), where only some of the samples are labeled. We first evaluate traditional Semi-Supervised Learning (SSL) methods on SSMAR and find that these methods tend to overfit on inaccurate pseudo-labels, leading to error accumulation and degraded performance. This issue primarily arises from the common practice of directly using the classifier's predictions as pseudo-labels to train the model. To solve this issue, we propose a novel framework, called Asynchronous Pseudo Labeling and Training (APLT), which explicitly separates the pseudo-labeling process from model training. Specifically, we introduce a semi-supervised clustering method during the offline pseudo-labeling phase to generate more accurate pseudo-labels. Moreover, a self-adaptive thresholding strategy is proposed to dynamically filter noisy labels of different classes. 0.657 We then build a memory-based prototype classifier based on the filtered pseudo-labels, which is fixed and used to guide the subsequent model training phase. By alternating the pseudo-labeling and model training phases in an asynchronous manner, the model can not only be learned with more accurate pseudo-labels but also avoid the overfitting issue. Experiments on three MAR datasets show that our APLT largely outperforms state-of-the-art SSL methods. For instance, APLT improves accuracy by 14.5% over FixMatch on the MA-12 dataset when using only 50% labeled data. Code will be publicly available.

link

2025-04-09

UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation

In medical imaging, the primary challenge is collecting large-scale labeled data due to privacy concerns, logistics, and high labeling costs. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 MRI 3D samples (equivalent to 17.9 million 2D images) and more than 1.37 billion 2D segmentation masks of 72 organs, all based on the UK Biobank MRI dataset. We utilize automatic labeling, introduce an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (referred to as UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by demonstrating zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from similar domains (e.g., abdominal MRI). To further mitigate the effect of noisy labels, we propose a novel method called Entropy Test-time Adaptation (ETTA) to refine the segmentation output. 0.63 We use UKBOB to train a foundation model, Swin-BOB, for 3D medical image segmentation based on the Swin-UNetr architecture, achieving state-of-the-art results in several benchmarks in 3D medical imaging, including the BRATS brain MRI tumor challenge (with a 0.4% improvement) and the BTCV abdominal CT scan benchmark (with a 1.3% improvement). The pre-trained models and the code are available at https://emmanuelleb985.github.io/ukbob, and the filtered labels will be made available with the UK Biobank.

link

2025-04-07

A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?

Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision. 0.677

link

Benchmarks

2025-04-24

Improving Open-World Object Localization by Discovering Background

Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly by proposing new objective functions (localization quality) or implicitly using object-centric auxiliary information, such as depth information or pixel/region affinity maps. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and constitute low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task. 0.777

link

2025-04-24

Integrated Sensing and Communications for Unsourced Random Access: A Spectrum Sharing Compressive Sensing Approach

This paper addresses the unsourced/uncoordinated random access problem in an integrated sensing and communications (ISAC) system, with a focus on uplink multiple access code design. Recent theoretical advancements highlight that an ISAC system will be overwhelmed by the increasing number of active devices, driven by the growth of massive machine-type communication (mMTC). To meet the demands of future mMTC networks, fundamental solutions are required that ensure robust capacity while maintaining favorable energy and spectral efficiency. One promising approach to support emerging massive connectivity is the development of systems based on the unsourced ISAC (UNISAC) framework. This paper proposes a spectrum-sharing compressive sensing-based UNISAC (SSCS-UNISAC) and offers insights into the practical design of UNISAC multiple access codes. In this framework, both communication signals (data transmission) and sensing signals (e.g., radar echoes) overlap within finite channel uses and are transmitted via the proposed UNISAC protocol. The proposed decoder exhibits robust performance, providing 20-30 dB capacity gains compared to conventional protocols such as TDMA and ALOHA. Numerical results validate the promising performance of the proposed scheme. 0.71

link

2025-04-24

CLIPSE -- a minimalistic CLIP-based image search engine for research

A brief overview of CLIPSE, a self-hosted image search engine with the main application of research, is provided. In general, CLIPSE uses CLIP embeddings to process the images and also the text queries. The overall framework is designed with simplicity to enable easy extension and usage. Two benchmark scenarios are described and evaluated, covering indexing and querying time. 0.707 It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.

link

2025-04-24

Aerial Image Classification in Scarce and Unconstrained Environments via Conformal Prediction

This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. 0.619 This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.

link
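The abstract above describes turning classifier outputs into label sets and scoring them by empirical coverage and average set size. A minimal sketch of split conformal prediction follows, assuming the simple nonconformity score `1 - p(true class)`; the function names and the toy data are illustrative and are not taken from the paper.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set (split conformal)."""
    # Nonconformity score: 1 minus the probability given to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every label is kept.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

def coverage_and_size(test_probs, test_labels, threshold):
    """Empirical coverage and average prediction set size on test data."""
    sets = [prediction_set(p, threshold) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    return coverage, avg_size

# Toy two-class example (made-up probabilities).
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
cal_labels = [0, 0, 0, 1]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
```

Temperature scaling, as compared in the abstract, would be applied to the logits before the probabilities enter `conformal_threshold`; it is left out here to keep the sketch short.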

2025-04-24

polyGen: A Learning Framework for Atomic-level Polymer Structure Generation

Synthetic polymeric materials underpin fundamental technologies in the energy, electronics, consumer goods, and medical sectors, yet their development still suffers from prolonged design timelines. Although polymer informatics tools have supported speedup, polymer simulation protocols continue to face a significant challenge: on-demand generation of realistic 3D atomic structures that respect the conformational diversity of polymer structures. Generative algorithms for 3D structures of inorganic crystals, bio-polymers, and small molecules exist, but have not addressed synthetic polymers. In this work, we introduce polyGen, the first latent diffusion model designed specifically to generate realistic polymer structures from minimal inputs such as the repeat unit chemistry alone, leveraging a molecular encoding that captures polymer connectivity throughout the architecture. Due to a scarce dataset of only 3855 DFT-optimized polymer structures, we augment our training with DFT-optimized molecular structures, showing improvement in joint learning between similar chemical structures. We also establish structure matching criteria to benchmark our approach on this novel problem. 0.675 polyGen effectively generates diverse conformations of both linear chains and complex branched structures, though its performance decreases when handling repeat units with a high atom count. Given these initial results, polyGen represents a paradigm shift in atomic-level structure generation for polymer science: the first proof-of-concept for predicting realistic atomic-level polymer conformations while accounting for their intrinsic structural flexibility.

link

2025-04-24

BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring

Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align "as-built" detected planes from the real-world environment with "as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. 0.656 On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user. 0.622

link

2025-04-24

Network Sampling: An Overview and Comparative Analysis

Network sampling is a crucial technique for analyzing large or partially observable networks. However, the effectiveness of different sampling methods can vary significantly depending on the context. In this study, we empirically compare representative methods from three main categories: node-based, edge-based, and exploration-based sampling. We used two real-world datasets for our analysis: a scientific collaboration network and a temporal message-sending network. Our results indicate that no single sampling method consistently outperforms the others in both datasets. 0.618 Although advanced methods tend to provide better accuracy on static networks, they often perform poorly on temporal networks, where simpler techniques can be more effective. These findings suggest that the best sampling strategy depends not only on the structural characteristics of the network but also on the specific metrics that need to be preserved or analyzed. Our work offers practical insights for researchers in choosing sampling approaches that are tailored to different types of networks and analytical objectives.

link
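To make two of the three categories named above concrete, here is a minimal sketch of node-based and edge-based sampling on an edge list. It assumes the simplest uniform variants (induced subgraph on random nodes, uniform random edges); the paper compares specific representative methods that are not listed in the abstract, so the function names and the toy graph are illustrative only.

```python
import random

def node_sample(edges, k, rng):
    """Node-based sampling: pick k nodes uniformly, keep the induced edges."""
    nodes = sorted({u for edge in edges for u in edge})
    kept = set(rng.sample(nodes, k))
    return [(u, v) for u, v in edges if u in kept and v in kept]

def edge_sample(edges, k, rng):
    """Edge-based sampling: pick k edges uniformly at random."""
    return rng.sample(edges, k)

# Toy undirected graph given as an edge list.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rng = random.Random(0)  # fixed seed so runs are repeatable
by_nodes = node_sample(edges, 3, rng)
by_edges = edge_sample(edges, 2, rng)
```

Exploration-based sampling (for example a random walk that records the edges it traverses) would follow the same pattern but depends on a starting node and a stopping rule.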

2025-04-24

Conformal Segmentation in Industrial Surface Defect Detection with Statistical Guarantees

In industrial settings, surface defects on steel can significantly compromise its service life and elevate potential safety risks. Traditional defect detection methods predominantly rely on manual inspection, which suffers from low efficiency and high costs. Although automated defect detection approaches based on Convolutional Neural Networks (e.g., Mask R-CNN) have advanced rapidly, their reliability remains challenged due to data annotation uncertainties during deep model training and overfitting issues. These limitations may lead to detection deviations when processing new test samples, rendering automated detection processes unreliable. To address this challenge, we first evaluate the detection model's practical performance through calibration data that satisfies the independent and identically distributed (i.i.d.) condition with test data. Specifically, we define a loss function for each calibration sample to quantify detection error rates, such as the complement of recall rate and false discovery rate. Subsequently, we derive a statistically rigorous threshold based on a user-defined risk level to identify high-probability defective pixels in test images, thereby constructing prediction sets (e.g., defect regions). This methodology ensures that the expected error rate (mean error rate) on the test set remains strictly bounded by the predefined risk level. Additionally, we observe a negative correlation between the average prediction set size and the risk level on the test set, establishing a statistically rigorous metric for assessing detection model uncertainty. Furthermore, our study demonstrates robust and efficient control over the expected test set error rate across varying calibration-to-test partitioning ratios, validating the method's adaptability and operational effectiveness. 0.64

link

2025-04-24

Towards Robust LLMs: an Adversarial Robustness Measurement Framework

The rise of Large Language Models (LLMs) has revolutionized artificial intelligence, yet these models remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications. While adversarial robustness in vision-based neural networks has been extensively studied, LLM robustness remains under-explored. We adapt the Robustness Measurement and Assessment (RoMA) framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters. By comparing RoMA's estimates to those of formal verification methods, we demonstrate its accuracy with minimal error margins while maintaining computational efficiency. 0.635 Our empirical evaluation reveals that robustness varies significantly not only between different models but also across categories within the same task and between various types of perturbations. This non-uniformity underscores the need for task-specific robustness evaluations, enabling practitioners to compare and select models based on application-specific robustness requirements. Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.

link

2025-04-24

CasualHDRSplat: Robust High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos

Recently, photo-realistic novel view synthesis methods based on multi-view images, such as neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS), have garnered widespread attention due to their superior performance. However, most works rely on low dynamic range (LDR) images, which limits the capture of richer scene details. Some prior works have focused on high dynamic range (HDR) scene reconstruction, but they typically require capturing sharp multi-view images with different exposure times, with the camera held at fixed positions during the exposures, which is time-consuming and challenging in practice. For more flexible data acquisition, we propose a one-stage method, CasualHDRSplat, to easily and robustly reconstruct the 3D HDR scene from casually captured videos with auto-exposure enabled, even in the presence of severe motion blur and varying unknown exposure time. CasualHDRSplat contains a unified differentiable physical imaging model which first applies a continuous-time trajectory constraint to the imaging process so that we can jointly optimize exposure time, camera response function (CRF), camera poses, and the sharp 3D HDR scene. Extensive experiments demonstrate that our approach outperforms existing methods in terms of robustness and rendering quality. 0.614 Our source code will be available at https://github.com/WU-CVGL/CasualHDRSplat

link

2025-04-24

Fitting Tree Metrics and Ultrametrics in Data Streams

Fitting distances to tree metrics and ultrametrics are two widely used methods in hierarchical clustering, primarily explored within the context of numerical taxonomy. Given a positive distance function $D:\binom{V}{2}\rightarrow\mathbb{R}_{>0}$, the goal is to find a tree (or ultrametric) $T$ including all elements of set $V$ such that the difference between the distances among vertices in $T$ and those specified by $D$ is minimized. In this paper, we initiate the study of ultrametric and tree metric fitting problems in the semi-streaming model, where the distances between pairs of elements from $V$ (with $|V|=n$), defined by the function $D$, can arrive in an arbitrary order. We study these problems under various distance norms. For the $\ell_0$ objective, we provide a single-pass polynomial-time $\tilde{O}(n)$-space $O(1)$ approximation algorithm for ultrametrics and prove that no single-pass exact algorithm exists, even with exponential time. Next, we show that the algorithm for $\ell_0$ implies an $O(\Delta/\delta)$ approximation for the $\ell_1$ objective, where $\Delta$ is the maximum and $\delta$ is the minimum absolute difference between distances in the input. 0.61 This bound matches the best-known approximation for the RAM model using a combinatorial algorithm when $\Delta/\delta=O(n)$. For the $\ell_\infty$ objective, we provide a complete characterization of the ultrametric fitting problem. We present a single-pass polynomial-time $\tilde{O}(n)$-space 2-approximation algorithm and show that no better than 2-approximation is possible, even with exponential time. We also show that, with an additional pass, it is possible to achieve a polynomial-time exact algorithm for ultrametrics. Finally, we extend the results for all these objectives to tree metrics by using only one additional pass through the stream and without asymptotically increasing the approximation factor. 0.609

link

2025-04-24

The Fourth Monocular Depth Estimation Challenge

This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. 0.616 The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.

link
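The "least-squares alignment with two degrees of freedom" mentioned above can be read as fitting one scale and one shift so that a prediction best matches the ground truth. A minimal sketch of that closed-form fit follows; whether the challenge aligns in depth or disparity space, and how invalid pixels are masked, is not stated in the abstract, so the function name and inputs here are assumptions.

```python
def align_scale_shift(pred, target):
    """Scale s and shift t minimizing sum((s * pred + t - target) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        # Constant prediction: only the shift can be identified.
        return 0.0, mean_t
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift

# Toy values: the target is exactly 2 * pred + 0.5.
pred = [1.0, 2.0, 3.0, 4.0]
target = [2.5, 4.5, 6.5, 8.5]
scale, shift = align_scale_shift(pred, target)
aligned = [scale * p + shift for p in pred]
```

After alignment, error metrics are computed on `aligned` rather than on the raw prediction, which is what lets scale- and shift-ambiguous methods be compared on the same footing.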

2025-04-23

Traffic-Oblivious Multi-Commodity Flow Network Design

We consider the Minimum Multi-Commodity Flow Subgraph (MMCFS) problem: given a directed graph $G$ with edge capacities $\mathit{cap}$ and a retention ratio $\alpha\in(0,1)$, find an edge-wise minimum subgraph $G' \subseteq G$ such that for all traffic matrices $T$ routable in $G$ using a multi-commodity flow, $\alpha\cdot T$ is routable in $G'$. This natural yet novel problem is motivated by recent research that investigates how the power consumption in backbone computer networks can be reduced by turning off connections during times of low demand without compromising the quality of service. Since the actual traffic demands are generally not known beforehand, our approach must be traffic-oblivious, i.e., work for all possible sets of simultaneously routable traffic demands in the original network. In this paper we present the problem, relate it to other known problems in literature, and show several structural results, including a reformulation, maximum possible deviations from the optimum, and NP-hardness (as well as a certain inapproximability) already on very restricted instances. The most significant contribution is a tight $\max(\frac{1}{\alpha}, 2)$-approximation based on an algorithmically surprisingly simple LP-rounding scheme. 0.608

link

2025-04-23

Simple Graph Contrastive Learning via Fractional-order Neural Diffusion Networks

Graph Contrastive Learning (GCL) has recently made progress as an unsupervised graph representation learning paradigm. GCL approaches can be categorized into augmentation-based and augmentation-free methods. The former relies on complex data augmentations, while the latter depends on encoders that can generate distinct views of the same input. Both approaches may require negative samples for training. In this paper, we introduce a novel augmentation-free GCL framework based on graph neural diffusion models. Specifically, we utilize learnable encoders governed by Fractional Differential Equations (FDE). Each FDE is characterized by an order parameter of the differential operator. We demonstrate that varying these parameters allows us to produce learnable encoders that generate diverse views, capturing either local or global information, for contrastive learning. Our model does not require negative samples for training and is applicable to both homophilic and heterophilic datasets. We demonstrate its effectiveness across various datasets, achieving state-of-the-art performance. 0.641

link

2025-04-23

QAOA-PCA: Enhancing Efficiency in the Quantum Approximate Optimization Algorithm via Principal Component Analysis

The Quantum Approximate Optimization Algorithm (QAOA) is a promising variational algorithm for solving combinatorial optimization problems on near-term devices. However, as the number of layers in a QAOA circuit increases, which is correlated with the quality of the solution, the number of parameters to optimize grows linearly. This results in more iterations required by the classical optimizer, which results in an increasing computational burden as more circuit executions are needed. To mitigate this issue, we introduce QAOA-PCA, a novel reparameterization technique that employs Principal Component Analysis (PCA) to reduce the dimensionality of the QAOA parameter space. By extracting principal components from optimized parameters of smaller problem instances, QAOA-PCA facilitates efficient optimization with fewer parameters on larger instances. Our empirical evaluation on the prominent MaxCut problem demonstrates that QAOA-PCA consistently requires fewer iterations than standard QAOA, achieving substantial efficiency gains. While this comes at the cost of a slight reduction in approximation ratio compared to QAOA with the same number of layers, QAOA-PCA almost always outperforms standard QAOA when matched by parameter count. 0.6 QAOA-PCA strikes a favorable balance between efficiency and performance, reducing optimization overhead without significantly compromising solution quality.

link

2025-04-23

Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

Understanding and analyzing video actions are essential for producing insightful and contextualized descriptions, especially for video-based applications like intelligent monitoring and autonomous systems. The proposed work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The suggested architecture makes use of ResNet50 to extract visual features from video frames that are taken from the Microsoft Research Video Description Corpus (MSVD) and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). In order to align textual and visual representations and guarantee high-quality description production, the system uses multi-head self-attention and cross-attention techniques. The model's efficacy is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. 0.619 The suggested framework outperforms traditional methods with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). 0.645 By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.

link

2025-04-23

Decoupled Global-Local Alignment for Improving Compositional Understanding

Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, it effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. 0.67 Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA

link

2025-04-23

Graph modification of bounded size to minor-closed classes as fast as vertex deletion

A replacement action is a function $\mathcal{L}$ that maps each graph $H$ to a collection of graphs of size at most $|V(H)|$. Given a graph class $\mathcal{H}$, we consider a general family of graph modification problems, called $\mathcal{L}$-Replacement to $\mathcal{H}$, where the input is a graph $G$ and the question is whether it is possible to replace some induced subgraph $H_1$ of $G$ on at most $k$ vertices by a graph $H_2$ in $\mathcal{L}(H_1)$ so that the resulting graph belongs to $\mathcal{H}$. $\mathcal{L}$-Replacement to $\mathcal{H}$ can simulate many graph modification problems including vertex deletion, edge deletion/addition/edition/contraction, vertex identification, subgraph complementation, independent set deletion, (induced) matching deletion/contraction, etc. We present two algorithms. 0.611 The first one solves $\mathcal{L}$-Replacement to $\mathcal{H}$ in time $2^{{\rm poly}(k)}\cdot |V(G)|^2$ for every minor-closed graph class $\mathcal{H}$, where ${\rm poly}$ is a polynomial whose degree depends on $\mathcal{H}$, under a mild technical condition on $\mathcal{L}$. This generalizes the results of Morelle, Sau, Stamoulis, and Thilikos [ICALP 2020, ICALP 2023] for the particular case of Vertex Deletion to $\mathcal{H}$ within the same running time. Our second algorithm is an improvement of the first one when $\mathcal{H}$ is the class of graphs embeddable in a surface of Euler genus at most $g$ and runs in time $2^{\mathcal{O}(k^{9})}\cdot |V(G)|^2$, where the $\mathcal{O}(\cdot)$ notation depends on $g$. To the best of our knowledge, these are the first parameterized algorithms with a reasonable parametric dependence for such a general family of graph modification problems to minor-closed classes.

link

2025-04-23

An Adaptive ML Framework for Power Converter Monitoring via Federated Transfer Learning

This study explores alternative framework configurations for adapting thermal machine learning (ML) models for power converters by combining transfer learning (TL) and federated learning (FL) in a piecewise manner. This approach inherently addresses challenges such as varying operating conditions, data sharing limitations, and security implications. The framework starts with a base model that is incrementally adapted by multiple clients using three state-of-the-art domain adaptation techniques: Fine-tuning, Transfer Component Analysis (TCA), and Deep Domain Adaptation (DDA). The Flower framework is employed for FL, using Federated Averaging for aggregation. Validation with field data demonstrates that fine-tuning offers a straightforward TL approach with high accuracy, making it suitable for practical applications. Benchmarking results reveal a comprehensive comparison of these methods, showcasing their respective strengths and weaknesses when applied in different scenarios. 0.794 Locally hosted FL enhances performance when data aggregation is not feasible, while cloud-based FL becomes more practical with a significant increase in the number of clients, addressing scalability and connectivity challenges.
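
The abstract names Federated Averaging as the aggregation rule. As a rough illustration only, the sketch below computes the sample-weighted average of client parameters in plain Python. It does not use the Flower API, and the function name `federated_average` and the toy client data are assumptions made for this example.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients; the second holds three times as much data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Weighting by sample count means a client with more local data pulls the global model further toward its own parameters.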

link

2025-04-23

High-Quality Cloud-Free Optical Image Synthesis Using Multi-Temporal SAR and Contaminated Optical Data

Addressing gaps caused by cloud cover and the long revisit cycle of satellites is vital for providing essential data to support remote sensing applications. This paper tackles the challenges of missing optical data synthesis, particularly in complex scenarios with cloud cover. We propose CRSynthNet, a novel image synthesis network that incorporates innovatively designed modules such as the DownUp Block and Fusion Attention to enhance accuracy. Experimental results validate the effectiveness of CRSynthNet, demonstrating substantial improvements in restoring structural details, preserving spectral consistency, and achieving superior visual effects that far exceed those produced by comparison methods. It achieves quantitative improvements across multiple metrics: a peak signal-to-noise ratio (PSNR) of 26.978, a structural similarity index measure (SSIM) of 0.648, and a root mean square error (RMSE) of 0.050. 0.676 Furthermore, this study creates the TCSEN12 dataset, a valuable resource specifically designed to address cloud cover challenges in the study of missing optical data synthesis. The dataset uniquely includes cloud-covered images and leverages earlier images to predict later images, offering a realistic representation of real-world scenarios. This study offers a practical method and valuable resources for the optical satellite image synthesis task.

link

2025-04-23

OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

Optimization plays a vital role in scientific research and practical applications, but formulating a concrete optimization problem described in natural language into a mathematical form and selecting a suitable solver to solve the problem requires substantial domain expertise. We introduce OptimAI, a framework for solving Optimization problems described in natural language by leveraging LLM-powered AI agents, achieving superior performance over current state-of-the-art methods. Our framework is built upon four key roles: (1) a formulator that translates natural language problem descriptions into precise mathematical formulations; (2) a planner that constructs a high-level solution strategy prior to execution; and (3) a coder and a code critic capable of interacting with the environment and reflecting on outcomes to refine future actions. Ablation studies confirm that all roles are essential; removing the planner or code critic results in $5.8\times$ and $3.1\times$ drops in productivity, respectively. Furthermore, we introduce UCB-based debug scheduling to dynamically switch between alternative plans, yielding an additional $3.3\times$ productivity gain. Our design emphasizes multi-agent collaboration, allowing us to conveniently explore the synergistic effect of combining diverse models within a unified system. Our approach attains 88.1% accuracy on the NLP4LP dataset and 71.2% on the Optibench (non-linear w/o table) subset, reducing error rates by 58% and 50% respectively over prior best results. 0.627
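
The abstract mentions UCB-based scheduling for switching between alternative plans but gives no formula. The sketch below uses the textbook UCB1 rule as a stand-in; it is not the paper's scheduler. The function name `select_plan`, the reward bookkeeping, and the exploration constant are all assumptions.

```python
import math


def select_plan(total_rewards, counts, c=math.sqrt(2)):
    """Pick the plan index with the highest UCB1 score.

    total_rewards[i] is the summed reward observed for plan i and
    counts[i] is how many times plan i has been tried so far.
    """
    # Try every plan once before comparing confidence bounds.
    for index, count in enumerate(counts):
        if count == 0:
            return index
    total_trials = sum(counts)
    scores = [
        reward / count + c * math.sqrt(math.log(total_trials) / count)
        for reward, count in zip(total_rewards, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Plan 0 has a slightly better mean, but plan 1 is barely explored.
choice = select_plan([5.0, 0.4], [10, 1])
print(choice)
```

The exploration bonus shrinks as a plan is tried more often, so a rarely debugged plan is revisited before it is abandoned.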

link

2025-04-23

IberBench: LLM Evaluation on Iberian Languages

Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. 0.604 With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.

link

2025-04-22

Integrating Non-Linear Radon Transformation for Diabetic Retinopathy Grading

Diabetic retinopathy is a serious ocular complication that poses a significant threat to patients' vision and overall health. Early detection and accurate grading are essential to prevent vision loss. Current automatic grading methods rely heavily on deep learning applied to retinal fundus images, but the complex, irregular patterns of lesions in these images, which vary in shape and distribution, make it difficult to capture subtle changes. This study introduces RadFuse, a multi-representation deep learning framework that integrates non-linear RadEx-transformed sinogram images with traditional fundus images to enhance diabetic retinopathy detection and grading. Our RadEx transformation, an optimized non-linear extension of the Radon transform, generates sinogram representations to capture complex retinal lesion patterns. By leveraging both spatial and transformed domain information, RadFuse enriches the feature set available to deep learning models, improving the differentiation of severity levels. We conducted extensive experiments on two benchmark datasets, APTOS-2019 and DDR, using three convolutional neural networks (CNNs): ResNeXt-50, MobileNetV2, and VGG19. RadFuse showed significant improvements over fundus-image-only models across all three CNN architectures and outperformed state-of-the-art methods on both datasets. For severity grading across five stages, RadFuse achieved a quadratic weighted kappa of 93.24%, an accuracy of 87.07%, and an F1-score of 87.17%. 0.624 In binary classification between healthy and diabetic retinopathy cases, the method reached an accuracy of 99.09%, precision of 98.58%, and recall of 99.6%, surpassing previously established models. These results demonstrate RadFuse's capacity to capture complex non-linear features, advancing diabetic retinopathy classification and promoting the integration of advanced mathematical transforms in medical image analysis.

link

2025-04-22

Branch-and-Bound Algorithms as Polynomial-time Approximation Schemes

Branch-and-bound algorithms (B&B) and polynomial-time approximation schemes (PTAS) are two seemingly distant areas of combinatorial optimization. We intend to (partially) bridge the gap between them while expanding the boundary of theoretical knowledge on the B&B framework. Branch-and-bound algorithms typically guarantee that an optimal solution is eventually found. However, we show that the standard implementation of branch-and-bound for certain knapsack and scheduling problems also exhibits PTAS-like behavior, yielding increasingly better solutions within polynomial time. Our findings are supported by computational experiments and comparisons with benchmark methods. 0.779 This paper is an extended version of a paper accepted at ICALP 2025.

link

2025-04-22

SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning

Recent work shows that reinforcement learning (RL) can markedly sharpen the reasoning ability of large language models (LLMs) by prompting them to "think before answering." Yet whether and how these gains transfer to audio-language reasoning remains largely unexplored. We extend the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to a Large Audio-Language Model (LALM), and construct a 32k sample multiple-choice corpus. Using a two-stage regimen, supervised fine-tuning on structured and unstructured chains-of-thought followed by curriculum-guided GRPO, we systematically compare implicit vs. explicit, and structured vs. free form reasoning under identical architectures. Our structured audio reasoning model, SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning), achieves a 16.35% improvement in average accuracy over the base model Qwen2-Audio-7B-Instruct. Furthermore, the variant built upon Qwen2.5-Omni reaches state-of-the-art performance of 67.08% on the MMAU test-mini benchmark. 0.602 Ablation experiments show that on the base model we use: (i) SFT warm-up is important for stable RL training, (ii) structured chains yield more robust generalization than unstructured ones, and (iii) easy-to-hard curricula accelerate convergence and improve final performance. These findings demonstrate that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.

link

2025-04-22

Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning

In Android GUI testing, generating an action sequence for a task that can be replayed as a test script is common. Generating sequences of actions and respective test scripts from task goals described in natural language can eliminate the need for manually writing test scripts. However, existing approaches based on large language models (LLMs) often struggle with identifying the final action, and either end prematurely or continue past the final screen. In this paper, we introduce VisiDroid, a multi-modal, LLM-based, multi-agent framework that iteratively determines the next action and leverages visual images of screens to detect the task's completeness. The multi-modal approach enhances our model in two significant ways. First, this approach enables it to avoid prematurely terminating a task when textual content alone provides misleading indications of task completion. Additionally, visual input helps the tool avoid errors when changes in the GUI do not directly affect functionality toward task completion, such as adjustments to font sizes or colors. Second, the multi-modal approach also ensures that the tool does not progress beyond the final screen, which might lack explicit textual indicators of task completion but could display a visual element indicating task completion, which is common in GUI apps. Our evaluation shows that VisiDroid achieves an accuracy of 87.3%, a relative improvement of 23.5% over the best baseline. 0.677 We also demonstrate that our multi-modal framework with images and texts enables the LLM to better determine when a task is completed.

link

2025-04-22

Language Models to Support Multi-Label Classification of Industrial Data

Multi-label requirements classification is a challenging task, especially when dealing with numerous classes at varying levels of abstraction. The difficulty increases when a limited number of requirements is available to train a supervised classifier. Zero-shot learning (ZSL) does not require training data and can potentially address this problem. This paper investigates the performance of zero-shot classifiers (ZSCs) on a multi-label industrial dataset. We focus on classifying requirements according to a taxonomy designed to support requirements tracing. We compare multiple variants of ZSCs using different embeddings, including 9 language models (LMs) with a reduced number of parameters (up to 3B), e.g., BERT, and 5 large LMs (LLMs) with a large number of parameters (up to 70B), e.g., Llama. Our ground truth includes 377 requirements and 1968 labels from 6 output spaces. For the evaluation, we adopt traditional metrics, i.e., precision, recall, F1, and $F_\beta$, as well as a novel label distance metric Dn. 0.607 This aims to better capture the classification's hierarchical nature and provides a more nuanced evaluation of how far the results are from the ground truth. 1) The top-performing model on 5 out of 6 output spaces is T5-xl, with maximum $F_\beta$ = 0.78 and Dn = 0.04, while BERT base outperformed the other models in one case, with maximum $F_\beta$ = 0.83 and Dn = 0.04. 2) LMs with smaller parameter size produce the best classification results compared to LLMs. Thus, addressing the problem in practice is feasible as limited computing power is needed. 3) The model architecture (autoencoding, autoregression, and sentence-to-sentence) significantly affects the classifier's performance. We conclude that using ZSL for multi-label requirements classification offers promising results. We also present a novel metric that can be used to select the top-performing model for this problem. 0.633
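
For reference, the $F_\beta$ score reported above combines precision $P$ and recall $R$ as $(1+\beta^2)PR/(\beta^2 P + R)$. The sketch below evaluates that standard formula. The function name `f_beta` and the example values are illustrative; the abstract does not state which $\beta$ the authors use, nor how scores are averaged across labels.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # Convention when both precision and recall are zero.
    return (1 + beta ** 2) * precision * recall / denominator


# With equal precision and recall the score equals that shared value for any beta.
score = f_beta(0.5, 0.5, beta=2.0)
print(score)
```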

link

2025-04-22

Automated Vulnerability Injection in Solidity Smart Contracts: A Mutation-Based Approach for Benchmark Development

The security of smart contracts is critical in blockchain systems, where even minor vulnerabilities can lead to substantial financial losses. Researchers have proposed several vulnerability detection tools evaluated using existing benchmarks. 0.647 However, most benchmarks are outdated and focus on a narrow set of vulnerabilities. This work evaluates whether mutation seeding can effectively inject vulnerabilities into Solidity-based smart contracts and whether state-of-the-art static analysis tools can detect the injected flaws. We aim to automatically inject vulnerabilities into smart contracts to generate large and wide benchmarks. We propose MuSe, a tool to generate vulnerable smart contracts by leveraging pattern-based mutation operators to inject six vulnerability types into real-world smart contracts. We analyzed these vulnerable smart contracts using Slither, a static analysis tool, to determine its capacity to identify them and assess their validity. The results show that each vulnerability has a different injection rate. Not all smart contracts can exhibit some vulnerabilities because they lack the prerequisites for injection. Furthermore, static analysis tools fail to detect all vulnerabilities injected using pattern-based mutations, underscoring the need for enhancements in static analyzers and demonstrating that benchmarks generated by mutation seeding tools can improve the evaluation of detection tools.

link

2025-04-22

A UAV-Aided Digital Twin Framework for IoT Networks with High Accuracy and Synchronization

With the continued growth of its core technologies, including the Internet of Things (IoT), artificial intelligence (AI), Big Data and data analytics, and edge computing, digital twin (DT) technology has witnessed a significant increase in industrial applications, helping the industry become more sustainable, smart, and adaptable. Hence, DT technology has emerged as a promising link between the physical and virtual worlds, enabling simulation, prediction, and real-time performance optimization. This work aims to explore the development of a high-fidelity digital twin framework, focusing on synchronization and accuracy between physical and digital systems to enhance data-driven decision making. To achieve this, we deploy several stationary UAVs in optimized locations to collect data from industrial IoT devices, which are used to monitor multiple physical entities and perform computations to evaluate their status. We consider a practical setup in which multiple IoT devices may monitor a single physical entity, and as a result, the measurements are combined and processed together to determine the status of the physical entity. The resulting status updates are subsequently uploaded from the UAVs to the base station, where the DT resides. In this work, we consider a novel metric based on the Age of Information (AoI), coined as the Age of Digital Twin (AoDT), to reflect the status freshness of the digital twin. Factoring AoDT in the problem formulation ensures that the DT reliably mirrors the physical system with high accuracy and synchronization. We formulate a mixed-integer non-convex program to maximize the total amount of data collected from all IoT devices while ensuring a constrained AoDT. Using successive convex approximations, we solve the problem, conduct extensive simulations and compare the results with baseline approaches to demonstrate the effectiveness of the proposed solution. 0.727

link

2025-04-22

ad-trait: A Fast and Flexible Automatic Differentiation Library in Rust

The Rust programming language is an attractive choice for robotics and related fields, offering highly efficient and memory-safe code. However, a key limitation preventing its broader adoption in these domains is the lack of high-quality, well-supported Automatic Differentiation (AD), a fundamental technique that enables convenient derivative computation by systematically accumulating data during function evaluation. In this work, we introduce ad-trait, a new Rust-based AD library. Our implementation overloads Rust's standard floating-point type with a flexible trait that can efficiently accumulate necessary information for derivative computation. The library supports both forward-mode and reverse-mode automatic differentiation, making it the first operator-overloading AD implementation in Rust to offer both options. Additionally, ad-trait leverages Rust's performance-oriented features, such as Single Instruction, Multiple Data acceleration in forward-mode AD, to enhance efficiency. Through benchmarking experiments, we show that our library is among the fastest AD implementations across several programming languages for computing derivatives. 0.637 Moreover, it is already integrated into a Rust-based robotics library, where we showcase its ability to facilitate fast optimization procedures. We conclude with a discussion of the limitations and broader implications of our work.

link

2025-04-22

W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models

The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding $\eta$ in the feed-forward neural (FFN) layer. Additionally, by eliminating the need for gradient computations, we optimize the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation. 0.61
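
One of the two proxies above is the number of principal components whose cumulative contribution exceeds $\eta$. The sketch below counts that number from a list of PCA eigenvalues (explained variances). How the eigenvalues are obtained from the FFN layer, and how this count is combined with the parameter count, are not given in the abstract, so the function name `pca_dim` and its inputs are assumptions.

```python
def pca_dim(eigenvalues, eta):
    """Smallest k such that the top-k eigenvalues carry a cumulative
    contribution (share of total variance) strictly greater than eta."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total > eta:
            return k
    # Only reached when eta >= 1: every component is needed.
    return len(ordered)


# Hypothetical spectrum: three components are needed to pass 80% of the variance.
k = pca_dim([1.0, 4.0, 3.0, 2.0], eta=0.8)
print(k)
```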

link

2025-04-22

Charting the Uncharted: The Landscape of Monero Peer-to-Peer Network

The Monero blockchain enables anonymous transactions through advanced cryptography in its peer-to-peer network, which underpins decentralization, security, and trustless interactions. However, privacy measures obscure peer connections, complicating network analysis. This study proposes a method to infer peer connections in Monero's latest protocol version, where timestamp data is unavailable. We collect peerlist data from TCP flows, validate our inference algorithm, and map the network structure. Our results show high accuracy, improving with longer observation periods. 0.65 This work is the first to reveal connectivity patterns in Monero's updated protocol, providing visualizations and insights into its topology. Our findings enhance the understanding of Monero's P2P network, including the role of supernodes, and highlight potential protocol and security improvements.

link

2025-04-22

Few-shot Hate Speech Detection Based on the MindSpore Framework

The proliferation of hate speech on social media poses a significant threat to online communities, requiring effective detection systems. While deep learning models have shown promise, their performance often deteriorates in few-shot or low-resource settings due to reliance on large annotated corpora. To address this, we propose MS-FSLHate, a prompt-enhanced neural framework for few-shot hate speech detection implemented on the MindSpore deep learning platform. The model integrates learnable prompt embeddings, a CNN-BiLSTM backbone with attention pooling, and synonym-based adversarial data augmentation to improve generalization. Experimental results on two benchmark datasets, HateXplain and HSOL, demonstrate that our approach outperforms competitive baselines in precision, recall, and F1-score. 0.618 Additionally, the framework shows high efficiency and scalability, suggesting its suitability for deployment in resource-constrained environments. These findings highlight the potential of combining prompt-based learning with adversarial augmentation for robust and adaptable hate speech detection in few-shot scenarios.

link

2025-04-22

AlphaGrad: Non-Linear Gradient Normalization Optimizer

We introduce AlphaGrad, a memory-efficient, conditionally stateless optimizer addressing the memory overhead and hyperparameter complexity of adaptive methods like Adam. AlphaGrad enforces scale invariance via tensor-wise L2 gradient normalization followed by a smooth hyperbolic tangent transformation, $g' = \tanh(\alpha \cdot \tilde{g})$, controlled by a single steepness parameter $\alpha$. Our contributions include: (1) the AlphaGrad algorithm formulation; (2) a formal non-convex convergence analysis guaranteeing stationarity; (3) extensive empirical evaluation on diverse RL benchmarks (DQN, TD3, PPO). 0.675 Compared to Adam, AlphaGrad demonstrates a highly context-dependent performance profile. While exhibiting instability in off-policy DQN, it provides enhanced training stability with competitive results in TD3 (requiring careful $\alpha$ tuning) and achieves substantially superior performance in on-policy PPO. These results underscore the critical importance of empirical $\alpha$ selection, revealing strong interactions between the optimizer's dynamics and the underlying RL algorithm. AlphaGrad presents a compelling alternative optimizer for memory-constrained scenarios and shows significant promise for on-policy learning regimes where its stability and efficiency advantages can be particularly impactful.
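
The update described above can be sketched for a single tensor as follows. This is an illustration under stated assumptions, not the authors' code: the gradient is taken to be a flat list of floats, $\tilde{g}$ is assumed to be $g / (\lVert g \rVert_2 + \epsilon)$, and the plain step $\theta \leftarrow \theta - \mathrm{lr} \cdot g'$, the function name `alphagrad_step`, and the default values are all hypothetical.

```python
import math


def alphagrad_step(params, grads, lr=0.01, alpha=1.0, eps=1e-8):
    """One stateless step: L2-normalize the gradient tensor, squash it with
    tanh(alpha * g_tilde), then take a plain gradient step."""
    norm = math.sqrt(sum(g * g for g in grads))
    normalized = [g / (norm + eps) for g in grads]  # assumed form of g_tilde
    transformed = [math.tanh(alpha * g) for g in normalized]
    return [p - lr * t for p, t in zip(params, transformed)]


# Rescaling the gradient by 10 leaves the step (almost) unchanged.
step_small = alphagrad_step([0.0, 0.0], [3.0, 4.0])
step_large = alphagrad_step([0.0, 0.0], [30.0, 40.0])
print(step_small, step_large)
```

Because the gradient is normalized before the tanh, no per-parameter moment estimates are stored, which is consistent with the memory-efficiency claim in the abstract.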

link

2025-04-22

A Comparative and Measurement-Based Study on Real-Time Network KPI Extraction Methods for 5G and Beyond Applications

Key performance indicators (KPIs), which can be extracted from the standardized interfaces of network equipment defined by current standards, constitute a primary data source that can be leveraged in the development of non-standardized new equipment, architectures, and computational tools. In next-generation technologies, the demand for data has evolved beyond the conventional log generation or export capabilities provided by existing licensed network monitoring tools. There is now a growing need to collect such data at specific time intervals and with defined granularities. At this stage, the development of real-time KPI extraction methods and enabling their exchange between both standardized/commercialized and non-standardized components or tools has become increasingly critical. This study presents a comprehensive evaluation of three distinct KPI extraction methodologies applied to two commercially available devices. The analysis aims to uncover the strengths, weaknesses, and overall efficacy of these approaches under varying conditions, and highlights critical insights into their practical capabilities and limitations. 0.602 The findings serve as a foundational guide for the seamless integration and robust testing of novel technologies and approaches within commercial telecommunication networks. This work aspires to bridge the gap between technological innovation and real-world applicability, fostering enhanced decision-making in network deployment and optimization.

link

2025-04-22

A Mysterious Connection Between Tolerant Junta Testing and Agnostically Learning Conjunctions

The main conceptual contribution of this paper is identifying a previously unnoticed connection between two central problems in computational learning theory and property testing: agnostically learning conjunctions and tolerantly testing juntas. Inspired by this connection, the main technical contribution is a pair of improved algorithms for these two problems. 0.621 In more detail: (1) We give a distribution-free algorithm for agnostically PAC learning conjunctions over $\{\pm 1\}^n$ that runs in time $2^{\widetilde{O}(n^{1/3})}$, for constant excess error $\varepsilon$. This improves on the fastest previously published algorithm, which runs in time $2^{\widetilde{O}(n^{1/2})}$ [KKMS08]. (2) Building on the ideas in our agnostic conjunction learner and using significant additional technical ingredients, we give an adaptive tolerant testing algorithm for $k$-juntas that makes $2^{\widetilde{O}(k^{1/3})}$ queries, for constant "gap parameter" $\varepsilon$ between the "near" and "far" cases. This improves on the best previous results, due to [ITW21, NP24], which make $2^{\widetilde{O}(\sqrt{k})}$ queries. Since there is a known $2^{\widetilde{\Omega}(\sqrt{k})}$ lower bound for non-adaptive tolerant junta testers, our result shows that adaptive tolerant junta testing algorithms provably outperform non-adaptive ones.

link

2025-04-22

A Markov Chain Monte Carlo Method for Efficient Finite-Length LDPC Code Design

Low-density parity-check (LDPC) codes are among the most prominent error-correction schemes. They are used to fortify various modern storage, communication, and computing systems. Protograph-based (PB) LDPC codes offer many degrees of freedom in the code design and enable fast encoding and decoding. In particular, spatially-coupled (SC) and multi-dimensional (MD) circulant-based codes are PB-LDPC codes with excellent performance. Efficient finite-length (FL) algorithms are required in order to effectively exploit the available degrees of freedom offered by SC partitioning, lifting, and MD relocations. In this paper, we propose a novel Markov chain Monte Carlo (MCMC or MC$^2$) method to perform this FL optimization, addressing the removal of short cycles. While iterating, we draw samples from a defined distribution where the probability decreases as the number of short cycles from the previous iteration increases. We analyze our MC$^2$ method theoretically as we prove the invariance of the Markov chain where each state represents a possible partitioning or lifting arrangement. Via our simulations, we then fit the distribution of the number of cycles resulting from a given arrangement to a Gaussian distribution. We derive estimates for cycle counts that are close to the actual counts. Furthermore, we derive the order of the expected number of iterations required by our approach to reach a local minimum as well as the size of the Markov chain recurrent class. Our approach is compatible with code design techniques based on gradient-descent. Numerical results show that our MC$^2$ method generates SC codes with remarkably fewer short cycles compared with the current state-of-the-art. Moreover, to reach the same number of cycles, our method requires orders of magnitude less overall time compared with the available literature methods. 0.635

link

2025-04-22

Describe Anything: Detailed Localized Image and Video Captioning

Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. 0.617 DAM sets a new state of the art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.

link

2025-04-22

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/. 0.649

link

2025-04-21

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. 0.638 In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. 0.65 We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. 0.602 Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.

link

2025-04-21

FlowReasoner: Reinforcing Query-Level Meta-Agents

This paper proposes a query-level meta-agent named FlowReasoner to automate the design of query-level multi-agent systems, i.e., one system per user query. Our core idea is to incentivize a reasoning-based meta-agent via external execution feedback. Concretely, by distilling DeepSeek R1, we first endow FlowReasoner with the basic reasoning ability needed to generate multi-agent systems. Then, we further enhance it via reinforcement learning (RL) with external execution feedback. A multi-purpose reward is designed to guide the RL training from aspects of performance, complexity, and efficiency. In this manner, FlowReasoner is enabled to generate a personalized multi-agent system for each user query via deliberative reasoning. Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner. Remarkably, it surpasses o1-mini by 10.52% accuracy across three benchmarks. 0.728 The code is available at https://github.com/sail-sg/FlowReasoner.

link

2025-04-21

Leveraging Language Models for Automated Patient Record Linkage

Objective: Healthcare data fragmentation presents a major challenge for linking patient data, necessitating robust record linkage to integrate patient records from diverse sources. This study investigates the feasibility of leveraging language models for automated patient record linkage, focusing on two key tasks: blocking and matching. Materials and Methods: We utilized real-world healthcare data from the Missouri Cancer Registry and Research Center, linking patient records from two independent sources using probabilistic linkage as a baseline. A transformer-based model, RoBERTa, was fine-tuned for blocking using sentence embeddings. For matching, several language models were evaluated under fine-tuned and zero-shot settings, assessing their performance against ground truth labels. Results: The fine-tuned blocking model achieved a 92% reduction in the number of candidate pairs while maintaining near-perfect recall. In the matching task, fine-tuned Mistral-7B achieved the best performance with only 6 incorrect predictions. Among zero-shot models, Mistral-Small-24B performed best, with a total of 55 incorrect predictions. Discussion: Fine-tuned language models achieved strong performance in patient record blocking and matching with minimal errors. However, they remain less accurate and efficient than a hybrid rule-based and probabilistic approach for blocking. 0.626 Additionally, reasoning models like DeepSeek-R1 are impractical for large-scale record linkage due to high computational costs. Conclusion: This study highlights the potential of language models for automating patient record linkage, offering improved efficiency by eliminating the manual efforts required to perform patient record linkage. Overall, language models offer a scalable solution that can enhance data integration, reduce manual effort, and support disease surveillance and research.

link

2025-04-21

Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning

Construction tasks are inherently unpredictable, with dynamic environments and safety-critical demands posing significant risks to workers. Exoskeletons offer potential assistance but falter without accurate intent recognition across diverse locomotion modes. This paper presents a locomotion prediction agent leveraging Large Language Models (LLMs) augmented with memory systems, aimed at improving exoskeleton assistance in such settings. Using multimodal inputs - spoken commands and visual data from smart glasses - the agent integrates a Perception Module, Short-Term Memory (STM), Long-Term Memory (LTM), and Refinement Module to predict locomotion modes effectively. Evaluation reveals a baseline weighted F1-score of 0.73 without memory, rising to 0.81 with STM, and reaching 0.90 with both STM and LTM, excelling with vague and safety-critical commands. Calibration metrics, including a Brier Score drop from 0.244 to 0.090 and ECE from 0.222 to 0.044, affirm improved reliability. 0.66 This framework supports safer, high-level human-exoskeleton collaboration, with promise for adaptive assistive systems in dynamic industries.
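
For readers unfamiliar with the two calibration metrics quoted above, here is a minimal sketch of how a Brier score and an expected calibration error (ECE) can be computed. It assumes a top-label formulation (one confidence per prediction plus a 0/1 correctness flag) and equal-width confidence bins. The paper may use a different variant, such as the multi-class Brier score, and the function names are illustrative.

```python
def brier_score(confidences, correct):
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

Lower is better for both metrics: a perfectly confident and always-correct predictor scores 0 on each.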

link

2025-04-21

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and the capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing the coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/. 0.627

link

LLMs

2025-04-24

Seamless Data Migration between Database Schemas with DAMI-Framework: An Empirical Study on Developer Experience

Many businesses depend on legacy systems, which often use outdated technology that complicates maintenance and updates. Therefore, software modernization is essential, particularly data migration between different database schemas. Established methodologies, like model transformation and ETL tools, facilitate this migration, but they require deep knowledge of database languages and both the source and target schemas. This necessity renders data migration an error-prone and cognitively demanding task. Our objective is to alleviate developers' workloads during schema evolution through our DAMI-Framework. This framework incorporates a domain-specific language (DSL) and a parser to facilitate data migration between database schemas. DAMI-DSL simplifies schema mapping while the parser automates SQL script generation. 0.618 We conduct an empirical evaluation with 21 developers to assess their experience of data migration using our DSL versus traditional SQL. The study allows us to measure their perceptions of the DSL properties and user experience. The participants praised DAMI-DSL for its readability and ease of use. The findings indicate that our DSL reduces data migration efforts compared to SQL scripts.

link

2025-04-24

Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics

Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. 0.711 However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. 0.716 In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs' generated programs in response to math reasoning tasks. 0.687 Our evaluation focuses on the extent to which LLMs ground their programs to math rules, and how that affects their end performance. 0.67 For this purpose, we assess the generations of five different LLMs, on two different math datasets, both manually and automatically. 0.674 Our results reveal that the distribution of grounding depends on LLMs' capabilities and the difficulty of math problems. 0.632 Furthermore, mathematical grounding is more effective for closed-source models, while open-source models fail to employ math rules in their solutions correctly. On MATH500, the percentage of grounded programs decreased to half, while the ungrounded generations doubled in comparison to ASDiv grade-school problems. Our work highlights the need for in-depth evaluation beyond execution accuracy metrics, toward a better understanding of code-assisted LLMs' capabilities and limits in the math domain. 0.696

link

2025-04-24

Towards a HIPAA Compliant Agentic AI System in Healthcare

Agentic AI systems powered by Large Language Models (LLMs) as their foundational reasoning engine are transforming clinical workflows such as medical report generation and clinical summarization by autonomously analyzing sensitive healthcare data and executing decisions with minimal human oversight. 0.638 However, their adoption demands strict compliance with regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA), particularly when handling Protected Health Information (PHI). This work-in-progress paper introduces a HIPAA-compliant Agentic AI framework that enforces regulatory compliance through dynamic, context-aware policy enforcement. Our framework integrates three core mechanisms: (1) Attribute-Based Access Control (ABAC) for granular PHI governance, (2) a hybrid PHI sanitization pipeline combining regex patterns and a BERT-based model to minimize leakage, and (3) immutable audit trails for compliance verification.
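
Two of the mechanisms listed above, regex-based PHI sanitization and attribute-based access control, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's framework: the three patterns cover only US-style SSNs, phone numbers, and email addresses, the BERT-based stage is omitted, and the attribute names (`role`, `department`, `contains_phi`) and the policy itself are hypothetical.

```python
import re

# Illustrative patterns only; real PHI detection needs far broader coverage.
PHI_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]


def sanitize(text):
    """Replace every pattern match with a bracketed placeholder label."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


def is_allowed(subject, resource, action):
    """Toy ABAC rule: only same-department clinicians may read PHI."""
    if action != "read":
        return False
    if resource["contains_phi"]:
        return (
            subject["role"] == "clinician"
            and subject["department"] == resource["department"]
        )
    return True
```

A regex pass like this is cheap and predictable but misses free-text identifiers such as names, which is presumably why the abstract pairs it with a learned model.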

link

2025-04-24

Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation

Training large language models (LLMs) requires massive computational resources, often necessitating the aggregation of geographically distributed data centers (i.e., cross-region training). 0.696 However, the high communication latency in wide-area networks severely degrades the efficiency of traditional distributed training. While methods like DiLoCo reduce communication frequency, they suffer from blocking synchronization. Streaming DiLoCo alleviates this issue via communication-computation overlapping but introduces update staleness and model inconsistency due to delayed global updates and partial synchronization. These factors impair convergence, especially when aggressive overlap is needed to mask high latency. We propose CoCoDC, a novel distributed training framework with communication-computation overlapping and delay compensation, to explicitly tackle these challenges. Within the CoCoDC framework, we specifically develop a novel Delay Compensation strategy based on Taylor expansion to effectively mitigate the staleness and an Adaptive Transmission strategy that dynamically schedules model fragment synchronization to optimize bandwidth usage and accelerate convergence. Extensive experiments highlight the superior performance of CoCoDC over both DiLoCo and Streaming DiLoCo regarding final accuracy and training speed. Specifically, CoCoDC reduces the training steps needed to reach a comparable perplexity by up to 21.0% compared to Streaming DiLoCo. Our work provides an effective solution for scalable and efficient cross-region LLM training. 0.675

link

2025-04-24

DTECM: Digital Twin Enabled Channel Measurement and Modeling in Terahertz Urban Macrocell

In this work, extensive channel measurements are conducted in the THz UMa, and an accurate channel model is developed by combining ray-tracing, computer vision (CV), and statistical methods. Specifically, substantial channel measurement campaigns with distances up to 410 m are conducted at 220 GHz, with nanosecond-level absolute time synchronization. Based on the measurement results, the propagation phenomena are analyzed in detail and the channel characteristics are calculated and statistically modeled. Furthermore, a digital twin enabled channel model (DTECM) is proposed, which generates THz channel responses in a hybrid manner. Specifically, the dominant paths are generated deterministically by using the ray-tracing technique and CV methods. Apart from the path gains determined by ray-tracing, the additional foliage loss is accurately modeled based on foliage information extracted from panoramic pictures. To maintain a low computational complexity for the DTECM, non-dominant paths are then generated statistically. Numerical results reveal that compared to the traditional statistical channel models, the DTECM reduces the path loss modeling error from 14 dB to 4 dB, showing its great superiority. Furthermore, a preliminary link performance evaluation using the DTECM indicates that THz UMa is feasible, though requiring high antenna gains and coverage extension techniques to achieve high spectral efficiencies and wide coverage. 0.63

link

2025-04-24

Energy Considerations of Large Language Model Inference and Efficiency Optimizations

As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. 0.681 Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. 0.672 Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure. 0.693

link

2025-04-24

INSIGHT: Bridging the Student-Teacher Gap in Times of Large Language Models

The rise of AI, especially Large Language Models, presents challenges and opportunities to integrate such technology into the classroom. AI has the potential to revolutionize education by helping teaching staff with various tasks, such as personalizing their teaching methods, but it also raises concerns, for example, about the degradation of student-teacher interactions and user privacy. This paper introduces INSIGHT, a proof of concept to combine various AI tools to assist teaching staff and students in the process of solving exercises. INSIGHT has a modular design that allows it to be integrated into various higher education courses. We analyze students' questions to an LLM by extracting keywords, which we use to dynamically build an FAQ from students' questions and provide new insights for the teaching staff to use for more personalized face-to-face support. 0.665 Future work could build upon INSIGHT by using the collected data to provide adaptive learning and adjust content based on student progress and learning styles to offer a more interactive and inclusive learning experience.

link

2025-04-24

Ensemble Bayesian Inference: Leveraging Small Language Models to Achieve LLM-level Accuracy in Profile Matching Tasks

This study explores the potential of small language model (SLM) ensembles to achieve accuracy comparable to proprietary large language models (LLMs). We propose Ensemble Bayesian Inference (EBI), a novel approach that applies Bayesian estimation to combine judgments from multiple SLMs, allowing them to exceed the performance limitations of individual models. Our experiments on diverse tasks (aptitude assessments and consumer profile analysis in both Japanese and English) demonstrate EBI's effectiveness. Notably, we analyze cases where incorporating models with negative Lift values into ensembles improves overall performance, and we examine the method's efficacy across different languages. These findings suggest new possibilities for constructing high-performance AI systems with limited computational resources and for effectively utilizing models with individually lower performance. Building on existing research on LLM performance evaluation, ensemble methods, and open-source LLM utilization, we discuss the novelty and significance of our approach. 0.689
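
One simple way to combine judgments from several models with Bayes' rule is sketched below. It is a generic naive-Bayes combiner, not necessarily the EBI formulation: it assumes binary judgments, conditional independence between models, and that each model's true-positive and false-positive rates were estimated beforehand on held-out data. The function name and the `(tpr, fpr)` representation are illustrative.

```python
def posterior_positive(votes, models, prior=0.5):
    """Posterior probability of the positive class given binary votes.

    votes  : list of 0/1 judgments, one per model
    models : list of (tpr, fpr) pairs, i.e. P(vote=1 | positive) and
             P(vote=1 | negative), assumed to be estimated on held-out data
    prior  : prior probability of the positive class
    """
    like_pos = prior
    like_neg = 1.0 - prior
    for vote, (tpr, fpr) in zip(votes, models):
        # Multiply in the likelihood of this vote under each hypothesis.
        like_pos *= tpr if vote == 1 else 1.0 - tpr
        like_neg *= fpr if vote == 1 else 1.0 - fpr
    return like_pos / (like_pos + like_neg)
```

Under this formulation a model that is wrong more often than chance still carries information, because its votes are informative in the opposite direction. That gives one intuition for why a weak model can still help an ensemble, as the abstract reports for models with negative Lift.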

link

2025-04-24

Multilingual Performance Biases of Large Language Models in Education

Large language models (LLMs) are increasingly being adopted in educational settings. 0.718 These applications expand beyond English, though current LLMs remain primarily English-centric. 0.739 In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. 0.734 We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment. 0.782

link

2025-04-24

Towards Robust LLMs: an Adversarial Robustness Measurement Framework

The rise of Large Language Models (LLMs) has revolutionized artificial intelligence, yet these models remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications. 0.65 While adversarial robustness in vision-based neural networks has been extensively studied, LLM robustness remains under-explored. We adapt the Robustness Measurement and Assessment (RoMA) framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters. 0.617 By comparing RoMA's estimates to those of formal verification methods, we demonstrate its accuracy with minimal error margins while maintaining computational efficiency. Our empirical evaluation reveals that robustness varies significantly not only between different models but also across categories within the same task and between various types of perturbations. This non-uniformity underscores the need for task-specific robustness evaluations, enabling practitioners to compare and select models based on application-specific robustness requirements. Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment. 0.72

link

2025-04-24

DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model

All-in-One image restoration aims to address multiple image degradation problems using a single model, significantly reducing training costs and deployment complexity compared to traditional methods that design dedicated models for each degradation type. Existing approaches typically rely on degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM) and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained modeling of complex degradation information and efficient global integration, while mitigating the loss of high-frequency details caused by task competition. 0.611 Specifically, the DP-SSM utilizes a pre-trained degradation extractor to capture fine-grained degradation features and dynamically incorporates them into the state space modeling process, enhancing the model's adaptability to diverse degradation types. Concurrently, the HEB supplements high-frequency information, effectively addressing the loss of critical details, such as edges and textures, in multi-task image restoration scenarios. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.

link

2025-04-24

Step1X-Edit: A Practical Framework for General Image Editing

In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user's editing instruction. 0.695 A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

link

2025-04-24

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. 0.686 To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) An isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications. 0.651

link

2025-04-24

Replay to Remember: Retaining Domain Knowledge in Streaming Language Models

Continual learning in large language models (LLMs) typically encounters the critical challenge of catastrophic forgetting, where previously acquired knowledge deteriorates upon exposure to new data. 0.684 While techniques like replay buffers and parameter-efficient tuning (e.g., Low-Rank Adaptation or LoRA) have been proposed, few studies investigate real-time domain adaptation under strict computational and data-stream constraints. In this paper, we demonstrate a lightweight method combining LoRA and a minimal replay mechanism in a realistic streaming setting across three diverse knowledge domains: medical question answering, genetics, and law. Using perplexity, semantic similarity, and GPT-based human-like evaluation metrics, we quantify the model's adaptation, forgetting, and recovery over time. Our experiments reveal that while catastrophic forgetting naturally occurs, even minimal replay significantly stabilizes and partially restores domain-specific knowledge. This study contributes practical insights for deploying adaptable LLMs in resource-constrained, real-world scenarios. 0.785
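
The "minimal replay mechanism" mentioned above can be pictured as a small buffer of past examples that is mixed into each new training batch. The sketch below shows only that data-handling side, under assumptions of my own: reservoir sampling as the retention policy, a fixed number of replayed examples per batch, and hypothetical names throughout. The LoRA fine-tuning step that would consume each batch is not shown, and the paper's actual buffer policy may differ.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer that keeps a uniform sample of everything seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Reservoir sampling: each seen example is retained with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batches(stream, buffer, batch_size=4, replay_per_batch=1):
    """Yield batches of new examples, each topped up with replayed old ones.

    A trailing partial batch is dropped to keep the sketch short.
    """
    batch = []
    for example in stream:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch + buffer.sample(replay_per_batch)
            for item in batch:
                buffer.add(item)
            batch = []
```

For instance, feeding a stream that switches from medical to legal examples would keep a few medical examples circulating in later batches, which is the effect the abstract credits with stabilizing earlier domain knowledge.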

link

2025-04-24

LiDPM: Rethinking Point Diffusion for Lidar Scene Completion

Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. This contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. 0.62 In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is https://astra-vision.github.io/LiDPM .

link

2025-04-23

Exploring How LLMs Capture and Represent Domain-Specific Knowledge

We study whether Large Language Models (LLMs) inherently capture domain-specific nuances in natural language. 0.657 Our experiments probe the domain sensitivity of LLMs by examining their ability to distinguish queries from different domains using hidden states generated during the prefill phase. 0.742 We reveal latent domain-related trajectories that indicate the model's internal recognition of query domains. We also study the robustness of these domain representations to variations in prompt styles and sources. Our approach leverages these representations for model selection, mapping the LLM that best matches the domain trace of the input query (i.e., the model with the highest performance on similar traces). 0.629 Our findings show that LLMs can differentiate queries for related domains, and that the fine-tuned model is not always the most accurate. 0.716 Unlike previous work, our interpretations apply to both closed and open-ended generative tasks.

link

2025-04-23

Context-Enhanced Vulnerability Detection Based on Large Language Model

Vulnerability detection is a critical aspect of software security. Accurate detection is essential to prevent potential security breaches and protect software systems from malicious attacks. Recently, vulnerability detection methods leveraging deep learning and large language models (LLMs) have garnered increasing attention. 0.608 However, existing approaches often focus on analyzing individual files or functions, which limits their ability to gather sufficient contextual information. Analyzing entire repositories to gather context introduces significant noise and computational overhead. To address these challenges, we propose a context-enhanced vulnerability detection approach that combines program analysis with LLMs. 0.706 Specifically, we use program analysis to extract contextual information at various levels of abstraction, thereby filtering out irrelevant noise. The abstracted context along with the source code is provided to the LLM for vulnerability detection. 0.677 We investigate how different levels of contextual granularity improve LLM-based vulnerability detection performance. 0.724 Our goal is to strike a balance between providing sufficient detail to accurately capture vulnerabilities and minimizing unnecessary complexity that could hinder model performance. Based on an extensive study using GPT-4, DeepSeek, and CodeLLaMA with various prompting strategies, our key findings include: (1) incorporating abstracted context significantly enhances vulnerability detection effectiveness; (2) different models benefit from distinct levels of abstraction depending on their code understanding capabilities; and (3) capturing program behavior through program analysis for general LLM-based code analysis tasks can be a direction that requires further attention. 0.624

link

2025-04-23

Enhancing Critical Thinking with AI: A Tailored Warning System for RAG Models

Retrieval-Augmented Generation (RAG) systems offer a powerful approach to enhancing large language model (LLM) outputs by incorporating fact-checked, contextually relevant information. 0.609 However, fairness and reliability concerns persist, as hallucinations can emerge at both the retrieval and generation stages, affecting users' reasoning and decision-making. 0.608 Our research explores how tailored warning messages -- whose content depends on the specific context of hallucination -- shape user reasoning and actions in an educational quiz setting. Preliminary findings suggest that while warnings improve accuracy and awareness of high-level hallucinations, they may also introduce cognitive friction, leading to confusion and diminished trust in the system. By examining these interactions, this work contributes to the broader goal of AI-augmented reasoning: developing systems that actively support human reflection, critical thinking, and informed decision-making rather than passive information consumption.

link

2025-04-23

Do Large Language Models know who did what to whom?

Large Language Models (LLMs) are commonly criticized for not understanding language. 0.733 However, many critiques focus on cognitive abilities that, in humans, are distinct from language processing. Here, we instead study a kind of understanding tightly linked to language: inferring who did what to whom (thematic roles) in a sentence. Does the central training objective of LLMs (word prediction) result in sentence representations that capture thematic roles? 0.636 In two experiments, we characterized sentence representations in four LLMs. 0.695 In contrast to human similarity judgments, in LLMs the overall representational similarity of sentence pairs reflected syntactic similarity but not whether their agent and patient assignments were identical vs. reversed. 0.685 Furthermore, we found little evidence that thematic role information was available in any subset of hidden units. However, some attention heads robustly captured thematic roles, independently of syntax. Therefore, LLMs can extract thematic roles but, relative to humans, this information influences their representations more weakly. 0.705

link

2025-04-23

Tracing Thought: Using Chain-of-Thought Reasoning to Identify the LLM Behind AI-Generated Text

In recent years, the detection of AI-generated text has become a critical area of research due to concerns about academic integrity, misinformation, and ethical AI deployment. This paper presents COT Fine-tuned, a novel framework for detecting AI-generated text and identifying the specific language model responsible for generating the text. We propose a dual-task approach, where Task A involves classifying text as AI-generated or human-written, and Task B identifies the specific LLM behind the text. The key innovation of our method lies in the use of Chain-of-Thought reasoning, which enables the model to generate explanations for its predictions, enhancing transparency and interpretability. Our experiments demonstrate that COT Fine-tuned achieves high accuracy in both tasks, with strong performance in LLM identification and human-AI classification. 0.603 We also show that the CoT reasoning process contributes significantly to the model's effectiveness and interpretability.

link

2025-04-23

IberBench: LLM Evaluation on Iberian Languages

Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. 0.687 Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. 0.659 IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. 0.73 Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. 0.706 IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.

link

Developer Research

2025-04-24

Modeling Communication Perception in Development Teams Using Monte Carlo Methods

Software development is a collaborative task involving diverse development teams, where toxic communication can negatively impact team mood and project success. 0.731 Mood surveys enable the early detection of underlying tensions or dissatisfaction within development teams, allowing communication issues to be addressed before they escalate, fostering a positive and productive work environment. The mood can be surveyed indirectly by analyzing the text-based communication of the team. However, emotional subjectivity leads to varying sentiment interpretations across team members; a statement perceived neutrally by one developer might be seen as problematic by another developer with a different conversational culture. 0.631 Early identification of perception volatility can help prevent misunderstandings and enhance team morale while safeguarding the project. This paper analyzes the diversity of perceptions within arbitrary development teams and determines how many team members should report their sentiment to accurately reflect the team's mood. Through a Monte Carlo experiment involving 45 developers, we present a preliminary mathematical model to calculate the minimum agreement among a subset of developers based on the whole team's agreement. This model can guide leadership in mood assessment, demonstrating that omitting even a single member in an average-sized 7-member team can misrepresent the overall mood. Therefore, including all developers in mood surveying is recommended to ensure a reliable evaluation of the team's mood.

link
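The Monte Carlo idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's model: the majority-vote rule, the label names, the function names, and the example team are all assumptions made for illustration. The sketch estimates how often a random subset of k members reports the same majority sentiment as the full team.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label; ties go to the alphabetically first label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def subset_match_rate(team_labels, k, trials=10_000, seed=0):
    """Estimate how often a random k-member subset reproduces the team majority."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    team_majority = majority(team_labels)
    hits = 0
    for _ in range(trials):
        subset = rng.sample(team_labels, k)  # sample members without replacement
        if majority(subset) == team_majority:
            hits += 1
    return hits / trials


# Hypothetical 7-member team rating one message (assumed data, not from the paper).
team = ["neutral"] * 4 + ["negative"] * 3

if __name__ == "__main__":
    for k in range(1, len(team) + 1):
        print(k, round(subset_match_rate(team, k), 3))
```

With a split as close as 4 to 3, smaller subsets disagree with the full team fairly often, which is the kind of effect the abstract describes for a 7-member team.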

2025-04-22

Automated Bug Report Prioritization in Large Open-Source Projects

Large open-source projects receive a large number of issues (known as bugs), including software defect (i.e., bug) reports and new feature requests from their user and developer communities at a fast rate. The often limited project resources do not allow them to deal with all issues. Instead, they have to prioritize them according to the project's priorities and the issues' severities. In this paper, we propose a novel approach to automated bug prioritization based on the natural language text of the bug reports that are stored in the open bug repositories of the issue-tracking systems. 0.602 We conduct topic modeling using a variant of LDA called TopicMiner-MTM and text classification with the BERT large language model to achieve a higher performance level compared to the state-of-the-art. Experimental results using an existing reference dataset containing 85,156 bug reports of the Eclipse Platform project indicate that we outperform existing approaches in terms of Accuracy, Precision, Recall, and F1-measure of the bug report priority prediction.

link

2025-04-22

Token-Aware Coding Flow: A Study with Nano Surge in Reasoning Model

With the widespread application of large-scale language models (LLMs) in software engineering, the Chain of Thought (CoT) approach has emerged as a crucial tool for driving automated code generation and optimization. However, despite the significant success of CoT methods in generating high-quality code, the issue of token inflation during the reasoning process remains a formidable challenge to model performance and efficiency, particularly when dealing with complex code smells. Code smells not only affect the maintainability and scalability of code but also significantly increase the computational burden during LLM inference, leading to excessive token consumption and, consequently, reduced reasoning efficiency. This paper introduces an innovative Token-Aware Coding Flow method, aimed at addressing the token inflation problem caused by smelly code in the CoT process. Through experimentation, we validate the synergistic effect of code refactoring and prompt engineering strategies, demonstrating that after eliminating code smells, token consumption during model inference is significantly reduced. 0.621 The experimental results show that refactored code, while maintaining functional consistency, can reduce token consumption by up to 50%. Additionally, by explicitly stating the type of code smell in the prompt and incorporating strategies such as context awareness and role constraints, we further optimize the reasoning process, achieving a 24.5% to 30% reduction in token consumption. These optimizations not only significantly enhance the model's reasoning efficiency and improve code generation quality but also provide new insights for addressing performance bottlenecks in complex code generation tasks.

link

2025-04-17

Automated Generation of Commit Messages in Software Repositories

Commit messages are crucial for documenting software changes, aiding in program comprehension and maintenance. 0.655 However, creating effective commit messages is often overlooked by developers due to time constraints and varying levels of documentation skills. 0.624 Our research presents an automated approach to generate commit messages using Machine Learning (ML) and Natural Language Processing (NLP) by developing models that use techniques such as Logistic Regression with TF-IDF and Word2Vec, as well as more sophisticated methods like LSTM. We trained and evaluated the ML/NLP models on the dataset of code changes and corresponding commit messages used by Liu et al., chosen because it is used extensively in previous research and therefore keeps our study comparable. The objective was to explore which ML/NLP techniques generate the most effective, clear, and concise commit messages that accurately reflect the code changes. We split the dataset into training, validation, and testing sets and used these sets to evaluate the performance of each model using qualitative and quantitative evaluation methods. Our results reveal a spectrum of effectiveness among these models, with the highest BLEU score achieved being 16.82, showcasing the models' capability in automating clear and concise commit message generation. Our paper offers insights into the comparative effectiveness of different machine learning models for automating commit message generation in software development, aiming to enhance the overall practice of code documentation. The source code is available at https://doi.org/10.5281/zenodo.10888106.

link
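For readers unfamiliar with the BLEU metric cited in the abstract above, here is a simplified sentence-level sketch: clipped n-gram precision combined by a geometric mean, times a brevity penalty, on a 0 to 100 scale. The abstract does not say which BLEU variant or tokenization the authors used, so the function name `simple_bleu`, the whitespace tokenization, the lack of smoothing, and the example messages are all assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU without smoothing, scaled to 0..100."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    generated = "fix null pointer check in parser"  # hypothetical model output
    written = "fix null pointer bug in parser"      # hypothetical human message
    print(round(simple_bleu(generated, written, max_n=2), 2))
```

Because there is no smoothing, short commit messages often score zero at `max_n=4`; published evaluations typically use a smoothed or corpus-level variant.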

2025-04-09

FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks

Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. 0.683 Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their ability to comprehend and effectively leverage diverse types of feedback remains insufficiently understood. To bridge this gap, we introduce FeedbackEval, a systematic benchmark for evaluating LLMs' feedback comprehension and performance in code repair tasks. We conduct a comprehensive empirical study on five state-of-the-art LLMs, including GPT-4o, Claude-3.5, Gemini-1.5, GLM-4, and Qwen2.5, to evaluate their behavior under both single-iteration and iterative code repair settings. Our results show that structured feedback, particularly in the form of test feedback, leads to the highest repair success rates, while unstructured feedback proves significantly less effective. Iterative feedback further enhances repair performance, though the marginal benefit diminishes after two or three rounds. Moreover, prompt structure is shown to be critical: incorporating docstrings, contextual information, and explicit guidelines substantially improves outcomes, whereas persona-based, chain-of-thought, and few-shot prompting strategies offer limited benefits in single-iteration scenarios. This work introduces a robust benchmark and delivers practical insights to advance the understanding and development of feedback-driven code repair using LLMs.

link

Data Annotation Techniques