Vincent's Arxiv FrontPage


Generated on 2025-02-22.


This frontpage is made by scraping arXiv and running a sentence-model that detects whether an abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions.
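A rough idea of how sentence-level flagging like this could work is sketched below. Everything in it is an assumption for illustration: `toy_score` is a keyword stand-in rather than the actual sentence-model, and the keyword list and `threshold` value are hypothetical.

```python
import re


def split_sentences(abstract):
    """Split an abstract on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def toy_score(sentence):
    """Hypothetical stand-in for a sentence-model: fraction of keywords present."""
    keywords = ("dataset", "benchmark", "corpus")  # assumed topic keywords
    hits = sum(1 for k in keywords if k in sentence.lower())
    return hits / len(keywords)


def flag_sentences(abstract, score_fn=toy_score, threshold=0.3):
    """Return (sentence, score) pairs whose score reaches the threshold."""
    scored = [(s, score_fn(s)) for s in split_sentences(abstract)]
    return [(s, score) for s, score in scored if score >= threshold]


flagged = flag_sentences("We study rice. We release a new dataset of images.")
```

An abstract would then appear on the page when at least one of its sentences is flagged, with the score shown next to that sentence.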


New Datasets

2025-02-20

Counter Pools: Counter Representation for Efficient Stream Processing

Due to the large data volume and number of distinct elements, space is often the bottleneck of many stream processing systems. The data structures used by these systems often consist of counters whose optimization yields significant memory savings. The challenge lies in balancing the size of the counters: too small, and they overflow; too large, and memory capacity limits their number. In this work, we suggest an efficient encoding scheme that sizes each counter according to its needs. (score: 0.729) Our approach uses fixed-sized pools of memory (e.g., a single memory word or 64 bits), where each pool manages a small number of counters. We pay special attention to performance and demonstrate considerable improvements for various streaming algorithms and workload characteristics.

link
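The idea in the Counter Pools abstract above, counters of different sizes sharing one fixed pool of bits, can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the paper's encoding scheme: the `CounterPool` class and its method names are hypothetical, and it only tracks how many bits each counter would need rather than packing them into a real machine word.

```python
class CounterPool:
    """Toy pool: several counters share one fixed bit budget (64 bits by default)."""

    def __init__(self, num_counters=4, pool_bits=64):
        self.pool_bits = pool_bits
        self.counters = [0] * num_counters

    def bits_used(self):
        # Each counter needs as many bits as its current value requires (at least 1).
        return sum(max(1, c.bit_length()) for c in self.counters)

    def increment(self, index, amount=1):
        """Add to one counter; return False and leave state unchanged on pool overflow."""
        old = self.counters[index]
        self.counters[index] = old + amount
        if self.bits_used() > self.pool_bits:
            self.counters[index] = old  # roll back: the pool cannot hold this value
            return False
        return True


pool = CounterPool(num_counters=4, pool_bits=64)
fits_large = pool.increment(0, 2**40)   # one large counter takes 41 bits
overflows = pool.increment(1, 2**30)    # 41 + 31 + 1 + 1 bits exceeds 64
fits_medium = pool.increment(1, 2**20)  # 41 + 21 + 1 + 1 bits fits exactly
```

With four fixed 16-bit counters in the same 64 bits, the value 2**40 could not be stored at all; letting one counter grow at the expense of its neighbours is what makes the shared pool attractive.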

2025-02-20

Entity Framing and Role Portrayal in the News

We introduce a novel multilingual hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. (score: 0.763) Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. (score: 0.891) Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.

link

2025-02-20

Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs

A common use of NLP is to facilitate the understanding of large document collections, with a shift from using traditional topic models to Large Language Models. Yet the effectiveness of using LLMs for large corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised, supervised LLM-based exploratory approaches or traditional topic models on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills. (score: 0.876)

link

2025-02-20

ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. (score: 0.852) Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

link

2025-02-20

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Our code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. (score: 0.727)

link
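The abstract above names Personalized PageRank as the algorithm HippoRAG 2 builds on. For readers unfamiliar with it, here is a minimal sketch of generic Personalized PageRank by power iteration. It is a textbook version, not HippoRAG's implementation; the function name, the toy graph, and the `damping` and `iterations` defaults are assumptions chosen for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for Personalized PageRank.

    graph: dict mapping each node to a list of out-neighbours (all must be keys).
    seeds: dict mapping seed nodes to positive restart weights.
    """
    nodes = list(graph)
    total = sum(seeds.values())
    # The random walk restarts at seed nodes in proportion to their weights.
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = personalized_pagerank(toy_graph, seeds={"a": 1.0})
```

Nodes close to the seed set receive higher scores, which is why the algorithm suits retrieval over a knowledge graph: seeding with the entities found in a query ranks the nodes most associated with them.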

2025-02-20

LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. (score: 0.82) Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V

link

2025-02-20

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. (score: 0.834) Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.

link

2025-02-20

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. (score: 0.828) Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models' capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. Our code is available at: https://github.com/mbzuai-oryx/TimeTravel.

link

2025-02-19

An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. (score: 0.885) Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system.

link

2025-02-19

VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare

Alignment techniques have become central to ensuring that Large Language Models (LLMs) generate outputs consistent with human values. However, existing alignment paradigms often model an averaged or monolithic preference, failing to account for the diversity of perspectives across cultures, demographics, and communities. This limitation is particularly critical in health-related scenarios, where plurality is essential due to the influence of culture, religion, personal values, and conflicting opinions. Despite progress in pluralistic alignment, no prior work has focused on health, likely due to the lack of publicly available datasets. To address this gap, we introduce VITAL, a new benchmark dataset comprising 13.1K value-laden situations and 5.4K multiple-choice questions focused on health, designed to assess and benchmark pluralistic alignment methodologies. (score: 0.731) Through extensive evaluation of eight LLMs of varying sizes, we demonstrate that existing pluralistic alignment techniques fall short in effectively accommodating diverse healthcare beliefs, underscoring the need for tailored AI alignment in specific domains. This work highlights the limitations of current approaches and lays the groundwork for developing health-specific alignment solutions.

link

2025-02-19

Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge

Estimating the construction year of buildings is of great importance for sustainability. Sustainable buildings minimize energy consumption and are a key part of responsible and sustainable urban planning and development to effectively combat climate change. By using Artificial Intelligence (AI) and recently proposed Transformer models, we are able to estimate the construction epoch of buildings from a multi-modal dataset. In this paper, we introduce a new benchmark multi-modal dataset, i.e. the Map your City Dataset (MyCD), containing top-view Very High Resolution (VHR) images, Earth Observation (EO) multi-spectral data from the Copernicus Sentinel-2 satellite constellation, and street-view images in many different cities in Europe, co-localized with respect to the building under study and labelled with the construction epoch. (score: 0.804) We assess EO generalization performance on new, previously unseen cities that have been held out from training and appear only during inference. In this work, we present the community-based data challenge we organized based on MyCD. (score: 0.77) The ESA AI4EO Challenge MapYourCity was opened in 2024 for 4 months. Here, we present the Top-4 performing models and the main evaluation results. During inference, we examine the performance of the models using all three input modalities and using only the two top-view modalities, i.e. without the street-view images. The evaluation results show that the models are effective and can achieve good performance on this difficult real-world task of estimating the age of buildings, even on previously unseen cities, and even when using only the two top-view modalities (i.e. VHR and Sentinel-2) during inference.

link

2025-02-19

ArtMentor: AI-Assisted Evaluation of Artworks to Explore Multimodal Large Language Models Capabilities

Can Multimodal Large Language Models (MLLMs), with capabilities in perception, recognition, understanding, and reasoning, function as independent assistants in art evaluation dialogues? Current MLLM evaluation methods, which rely on subjective human scoring or costly interviews, lack comprehensive coverage of various scenarios. This paper proposes a process-oriented Human-Computer Interaction (HCI) space design to facilitate more accurate MLLM assessment and development. This approach aids teachers in efficient art evaluation while also recording interactions for MLLM capability assessment. We introduce ArtMentor, a comprehensive space that integrates a dataset and three systems to optimize MLLM evaluation. The dataset consists of 380 sessions conducted by five art teachers across nine critical dimensions. (score: 0.797) The modular system includes agents for entity recognition, review generation, and suggestion generation, enabling iterative upgrades. Machine learning and natural language processing techniques ensure the reliability of evaluations. The results confirm GPT-4o's effectiveness in assisting teachers in art evaluation dialogues. Our contributions are available at https://artmentor.github.io/.

link

2025-02-19

MSVCOD: A Large-Scale Multi-Scene Dataset for Video Camouflage Object Detection

Video Camouflaged Object Detection (VCOD) is a challenging task that aims to identify objects that are seamlessly concealed within the background in videos. The dynamic properties of video enable detection of camouflaged objects through motion cues or varied perspectives. Previous VCOD datasets primarily contain animal objects, limiting the scope of research to wildlife scenarios. However, the applications of VCOD extend beyond wildlife and have significant implications in security, art, and medical fields. Addressing this problem, we construct a new large-scale multi-domain VCOD dataset, MSVCOD. (score: 0.716) To achieve high-quality annotations, we design a semi-automatic iterative annotation pipeline that reduces costs while maintaining annotation accuracy. Our MSVCOD is the largest VCOD dataset to date, introducing multiple object categories including human, animal, medical, and vehicle objects for the first time, while also expanding background diversity across various environments. This expanded scope increases the practical applicability of the VCOD task in camouflaged object detection. Alongside this dataset, we introduce a one-stream video camouflage object detection model that performs both feature extraction and information fusion without additional motion feature fusion modules. Our framework achieves state-of-the-art results on the existing VCOD animal dataset and the proposed MSVCOD. (score: 0.774) The dataset and code will be made publicly available. (score: 0.912)

link

2025-02-19

The NavINST Dataset for Multi-Sensor Autonomous Navigation

The NavINST Laboratory has developed a comprehensive multisensory dataset from various road-test trajectories in urban environments, featuring diverse lighting conditions, including indoor garage scenarios with dense 3D maps. This dataset includes multiple commercial-grade IMUs and a high-end tactical-grade IMU. (score: 0.72) Additionally, it contains a wide array of perception-based sensors, such as a solid-state LiDAR - making it one of the first datasets to do so - a mechanical LiDAR, four electronically scanning RADARs, a monocular camera, and two stereo cameras. The dataset also includes forward speed measurements derived from the vehicle's odometer, along with accurately post-processed high-end GNSS/IMU data, providing precise ground truth positioning and navigation information. The NavINST dataset is designed to support advanced research in high-precision positioning, navigation, mapping, computer vision, and multisensory fusion. It offers rich, multi-sensor data ideal for developing and validating robust algorithms for autonomous vehicles. Finally, it is fully integrated with the ROS, ensuring ease of use and accessibility for the research community. The complete dataset and development tools are available at https://navinst.github.io. (score: 0.88)

link

2025-02-19

The KnowWhereGraph: A Large-Scale Geo-Knowledge Graph for Interdisciplinary Knowledge Discovery and Geo-Enrichment

Global challenges such as food supply chain disruptions, public health crises, and natural hazard responses require access to and integration of diverse datasets, many of which are geospatial. Over the past few years, a growing number of (geo)portals have been developed to address this need. However, most existing (geo)portals are composed of separate or sparsely connected data "silos", impeding effective data consolidation. A new way of sharing and reusing geospatial data is therefore urgently needed. In this work, we introduce KnowWhereGraph, a knowledge graph-based data integration, enrichment, and synthesis framework that not only includes schemas and data related to human and environmental systems but also provides a suite of supporting tools for accessing this information. (score: 0.802) The KnowWhereGraph aims to address the challenge of data integration by building a large-scale, cross-domain, pre-integrated, FAIR-principles-based, and AI-ready data warehouse rooted in knowledge graphs. We highlight the design principles of KnowWhereGraph, emphasizing the roles of space, place, and time in bridging various data "silos". Additionally, we demonstrate multiple use cases where the proposed geospatial knowledge graph and its associated tools empower decision-makers to uncover insights that are often hidden within complex and poorly interoperable datasets.

link

2025-02-19

PSCon: Toward Conversational Product Search

Conversational Product Search (CPS) is confined to simulated conversations due to the lack of real-world CPS datasets that reflect human-like language. Additionally, current conversational datasets offer limited support for cross-market and multi-lingual usage. In this paper, we introduce a new CPS data collection protocol and present PSCon, a novel CPS dataset designed to assist product search via human-like conversations. (score: 0.747) The dataset is constructed using a coached human-to-human data collection protocol and supports two languages and dual markets. (score: 0.926) Also, the dataset enables thorough exploration of six subtasks of CPS: user intent detection, keyword extraction, system action prediction, question selection, item ranking, and response generation. We also offer an analysis of the dataset and propose a benchmark model on the proposed CPS dataset.

link

2025-02-19

DataSciBench: An LLM Agent Benchmark for Data Science

This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at https://github.com/THUDM/DataSciBench. (score: 0.808)

link

2025-02-19

Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. (score: 0.762) Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07% lower than human performance. Further quantitative analysis discusses several factors, such as input format of images, affecting the performance of LLMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.

link

2025-02-19

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors in visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate texts that are accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. (score: 0.744) Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations, and significant gains in vision-centric and general tasks. Notably, these improvements become increasingly pronounced in benchmarks with higher visual dependency. In short, S-VCO offers a significant enhancement of VLMs' visually dependent task performance while retaining or even improving the model's general abilities. We open-source our code at https://s-vco.github.io/

link

2025-02-19

Betsu-Betsu: Multi-View Separable 3D Reconstruction of Two Interacting Objects

Separable 3D reconstruction of multiple objects from multi-view RGB images -- resulting in two different 3D shapes for the two objects with a clear separation between them -- remains a sparsely researched problem. It is challenging due to severe mutual occlusions and ambiguities along the objects' interaction boundaries. This paper investigates the setting and introduces a new neuro-implicit method that can reconstruct the geometry and appearance of two objects undergoing close interactions while disjoining both in 3D, avoiding surface inter-penetrations and enabling novel-view synthesis of the observed scene. The framework is end-to-end trainable and supervised using a novel alpha-blending regularisation that ensures that the two geometries are well separated even under extreme occlusions. Our reconstruction method is markerless and can be applied to rigid as well as articulated objects. We introduce a new dataset consisting of close interactions between a human and an object and also evaluate on two scenes of humans performing martial arts. (score: 0.786) The experiments confirm the effectiveness of our framework and substantial improvements using 3D and novel view synthesis metrics compared to several existing approaches applicable in our setting.

link

2025-02-18

LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation

Popular Micro-videos, dominant on platforms like TikTok and YouTube, hold significant commercial value. The rise of high-quality AI-generated content has spurred interest in AI-driven micro-video creation. However, despite the advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek in text generation and reasoning, their potential to assist the creation of popular micro-videos remains largely unexplored. In this paper, we conduct an empirical study on LLM-assisted popular micro-video generation (LLMPopcorn). Specifically, we investigate the following research questions: (i) How can LLMs be effectively utilized to assist popular micro-video generation? (ii) To what extent can prompt-based enhancements optimize the LLM-generated content for higher popularity? (iii) How well do various LLMs and video generators perform in the popular micro-video generation task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3 enable micro-video generation to achieve popularity comparable to human-created content. Prompt enhancements further boost popularity, and benchmarking highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and HunyuanVideo lead in video generation. This pioneering work advances AI-assisted micro-video creation, uncovering new research opportunities. We will release the code and datasets to support future studies. (score: 0.817)

link

2025-02-18

You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations

Meeting summarization suffers from limited high-quality data, mainly due to privacy restrictions and expensive collection processes. We address this gap with FAME, a dataset of 500 meetings in English and 300 in German produced by MIMIC, our new multi-agent meeting synthesis framework that generates meeting transcripts on a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation, and orchestrating a large language model (LLM) debate. (score: 0.768) A modular post-processing step refines these outputs, mitigating potential repetitiveness and overly formal tones, ensuring coherent, credible dialogues at scale. We also propose a psychologically grounded evaluation framework assessing naturalness, social behavior authenticity, and transcript difficulties. Human assessments show that FAME approximates real-meeting spontaneity (4.5/5 in naturalness), preserves speaker-centric challenges (3/5 in spoken language), and introduces richer information-oriented difficulty (4/5 in difficulty). These findings highlight that FAME is a good and scalable proxy for real-world meeting conditions. It enables new test scenarios for meeting summarization research and other conversation-centric applications in tasks requiring conversation data or simulating social scenarios under behavioral constraints.

link

2025-02-18

Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms

Palms are ecological and economic indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and span several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. (score: 0.704) Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1m).

link

2025-02-18

WeedsGalore: A Multispectral and Multitemporal UAV-based Dataset for Crop and Weed Segmentation in Agricultural Maize Fields

Weeds are one of the major reasons for crop yield loss but current weeding practices fail to manage weeds in an efficient and targeted manner.Effective weed management is especially important for crops with high worldwide production such as maize, to maximize crop yield for meeting increasing global demands.Advances in near-sensing and computer vision enable the development of new tools for weed management.Specifically, state-of-the-art segmentation models, coupled with novel sensing technologies, can facilitate timely and accurate weeding and monitoring systems.However, learning-based approaches require annotated data and show a lack of generalization to aerial imaging for different crops.We present a novel dataset for semantic and instance segmentation of crops and weeds in agricultural maize fields.The multispectral UAV-based dataset contains images with RGB, red-edge, and near-infrared bands, a large number of plant instances, dense annotations for maize and four weed classes, and is multitemporal.We provide extensive baseline results for both tasks, including probabilistic methods to quantify prediction uncertainty, improve model calibration, and demonstrate the approach's applicability to out-of-distribution data.The results show the effectiveness of the two additional bands compared to RGB only, and better performance in our target domain than models trained on existing datasets.We hope our dataset advances research on methods and operational systems for fine-grained weed identification, enhancing the robustness and applicability of UAV-based weed management.The dataset and code are available at https://github.com/GFZ/weedsgalore 0.874

link

2025-02-18

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions.To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers.We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. 0.746We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model.Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding.

link

2025-02-18

RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises

Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. 0.803 In a comprehensive evaluation of 17 LLMs from 5 series over RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses on evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human accuracy of more than 90%.

link

2025-02-17

CriteoPrivateAd: A Real-World Bidding Dataset to Design Private Advertising Systems

In recent years, many proposals have emerged to address online advertising use-cases without access to third-party cookies. All these proposals leverage some privacy-enhancing technologies such as aggregation or differential privacy. Yet, no public and rich-enough ground truth is currently available to assess the relevancy of the aforementioned private advertising frameworks. We are releasing the largest, in terms of number of features, bidding dataset specifically built in alignment with the design of major browser vendors' proposals such as Chrome Privacy Sandbox. This dataset, named CriteoPrivateAd, is an anonymised version of Criteo production logs and provides sufficient data to learn bidding models commonly used in online advertising under many privacy constraints (delayed reports, display and user-level differential privacy, user signal quantisation or aggregated reports). 0.752 We ensured that this dataset, while being anonymised, is able to provide offline results close to production performance of adtech companies including Criteo, making it a relevant ground truth to design private advertising systems. The dataset is available on Hugging Face: https://huggingface.co/datasets/criteo/CriteoPrivateAd. 0.931

link

2025-02-13

Pixel-Level Reasoning Segmentation via Multi-turn Conversations

Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. 0.863 Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, which integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://github.com/ccccai239/PixelRIST.

link

2025-02-13

Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages. 0.925 Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available. 0.749

link

2025-02-13

Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited.We introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize and adhere to user preferences in a long-context conversational setting.PrefEval comprises 3,000 manually curated user preference and query pairs spanning 20 topics.PrefEval contains user personalization or preference information in both explicit and implicit forms, and evaluates LLM performance using a generation and a classification task.With PrefEval, we evaluated the aforementioned preference following capabilities of 10 open-source and proprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens.We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods.Our benchmarking effort reveals that state-of-the-art LLMs face significant challenges in proactively following users' preferences during conversations.In particular, in zero-shot settings, preference following accuracy falls below 10% at merely 10 turns (~3k tokens) across most evaluated models.Even with advanced prompting and retrieval methods, preference following still deteriorates in long-context conversations.Furthermore, we show that fine-tuning on PrefEval significantly improves performance.We believe PrefEval serves as a valuable resource for measuring, understanding, and enhancing LLMs' preference following abilities, paving the way for personalized conversational agents.Our code and dataset are available at https://prefeval.github.io/. 0.889

link

2025-02-13

GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis

The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of Remote Sensing (RS) images. Natural language presents an intuitive interface for accessing, querying, and interpreting the data from such archives. However, existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, exhibiting limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead focus solely on attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset provides a spatially and temporally balanced distribution, spanning the globe and covering the last 25 years. 0.879 GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks.

link

2025-02-13

Instance Segmentation of Scene Sketches Using Natural Image Priors

Sketch segmentation involves grouping pixels within a sketch that belong to the same object or instance.It serves as a valuable tool for sketch editing tasks, such as moving, scaling, or removing specific components.While image segmentation models have demonstrated remarkable capabilities in recent years, sketches present unique challenges for these models due to their sparse nature and wide variation in styles.We introduce SketchSeg, a method for instance segmentation of raster scene sketches.Our approach adapts state-of-the-art image segmentation and object detection models to the sketch domain by employing class-agnostic fine-tuning and refining segmentation masks using depth cues.Furthermore, our method organizes sketches into sorted layers, where occluded instances are inpainted, enabling advanced sketch editing applications.As existing datasets in this domain lack variation in sketch styles, we construct a synthetic scene sketch segmentation dataset featuring sketches with diverse brush strokes and varying levels of detail. 0.858We use this dataset to demonstrate the robustness of our approach and will release it to promote further research in the field. Project webpage: https://sketchseg.github.io/sketch-seg/

link

2025-02-12

Examining Spanish Counseling with MIDAS: a Motivational Interviewing Dataset in Spanish

Cultural and language factors significantly influence counseling, but Natural Language Processing research has not yet examined whether the findings of conversational analysis for counseling conducted in English apply to other languages.This paper presents a first step towards this direction.We introduce MIDAS (Motivational Interviewing Dataset in Spanish), a counseling dataset created from public video sources that contains expert annotations for counseling reflections and questions. 0.785Using this dataset, we explore language-based differences in counselor behavior in English and Spanish and develop classifiers in monolingual and multilingual settings, demonstrating its applications in counselor behavioral coding tasks.

link

2025-02-12

mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.However, the limited labeled multimodal data often hinders embedding performance.Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.In this work, we identify three criteria for high-quality synthetic multimodal data.First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios.Second, robust cross-modal alignment makes different modalities semantically consistent.Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability.Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. 0.754Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5.Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark.Our codes, datasets and models are released in https://github.com/haon-chen/mmE5.

link

2025-02-12

Salamandra Technical Report

This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters.The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code.Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. 0.731Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications.Additionally, we also share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family.Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models.We provide comprehensive evaluation results both on standard downstream tasks as well as key aspects related to bias and safety.With this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology.In addition to that, we deviate from the usual practice by making our training and evaluation scripts publicly accessible.We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.

link

2025-02-12

Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval

Video Moment Retrieval is a common task to evaluate the performance of visual-language models: it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. 98.4%) while retaining moment retrieval scores to within 3.87% Recall@1. Dataset splits and code are available at https://github.com/keflanagan/MomentofUntruth 0.774

link

2025-02-12

PulseCheck457: A Diagnostic Benchmark for Comprehensive Spatial Reasoning of Large Multimodal Models

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed with 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. 0.828 We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various LMMs on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.

link

2025-02-12

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to give users controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals, comprising rendered depth maps, camera trajectories and object class labels, serve as the guidance for a text-to-video diffusion model, ensuring that it generates the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. 0.733 Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and implements prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.

link

2025-02-12

SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation

Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation.However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement.Consequently, despite producing impressive sketches, these methods are limited in practical applications.In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second.SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution.Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes.To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. 0.784For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet.We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.

link

2025-02-11

Advancing climate model interpretability: Feature attribution for Arctic melt anomalies

The focus of our work is improving the interpretability of anomalies in climate models and advancing our understanding of Arctic melt dynamics. The Arctic and Antarctic ice sheets are experiencing rapid surface melting and increased freshwater runoff, contributing significantly to global sea level rise. Understanding the mechanisms driving snowmelt in these regions is crucial. ERA5, a widely used reanalysis dataset in polar climate studies, offers extensive climate variables and global data assimilation. 0.763 However, its snowmelt model employs an energy imbalance approach that may oversimplify the complexity of surface melt. In contrast, the Glacier Energy and Mass Balance (GEMB) model incorporates additional physical processes, such as snow accumulation, firn densification, and meltwater percolation/refreezing, providing a more detailed representation of surface melt dynamics. In this research, we focus on analyzing surface snowmelt dynamics of the Greenland Ice Sheet using feature attribution for anomalous melt events in ERA5 and GEMB models. We present a novel unsupervised attribution method leveraging a counterfactual explanation method to analyze detected anomalies in ERA5 and GEMB. Our anomaly detection results are validated using MEaSUREs ground-truth data, and the attributions are evaluated against established feature ranking methods, including XGBoost, Shapley values, and Random Forest. Our attribution framework identifies the physics behind each model and the climate features driving melt anomalies. These findings demonstrate the utility of our attribution method in enhancing the interpretability of anomalies in climate models and advancing our understanding of Arctic melt dynamics.

link

2025-02-11

WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here. 0.982
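
The "multiple trials with majority response selection" step in the abstract above can be sketched as a simple vote over repeated model answers. This is a minimal illustration, not the authors' evaluation code: the function name `majority_response` and the whitespace and case normalization are assumptions.

```python
from collections import Counter


def majority_response(responses):
    """Return the most frequent answer across repeated trials.

    Answers are stripped and lower-cased before counting (an assumed
    normalization); ties go to the answer that was seen first.
    """
    if not responses:
        raise ValueError("need at least one response")
    normalized = [r.strip().lower() for r in responses]
    return Counter(normalized).most_common(1)[0][0]
```

For example, if three trials name the butler twice and the gardener once, the selected answer is the butler.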

link

2025-02-11

MatSwap: Light-aware material transfers in images

We present MatSwap, a method to transfer materials to designated surfaces in an image photorealistically.Such a task is non-trivial due to the large entanglement of material appearance, geometry, and lighting in a photograph.In the literature, material editing methods typically rely on either cumbersome text engineering or extensive manual annotations requiring artist knowledge and 3D scene properties that are impractical to obtain.In contrast, we propose to directly learn the relationship between the input material -- as observed on a flat surface -- and its appearance within the scene, without the need for explicit UV mapping.To achieve this, we rely on a custom light- and geometry-aware diffusion model.We fine-tune a large-scale pre-trained text-to-image model for material transfer using our synthetic dataset, preserving its strong priors to ensure effective generalization to real images.As a result, our method seamlessly integrates a desired material into the target location in the photograph while retaining the identity of the scene.We evaluate our method on synthetic and real images and show that it compares favorably to recent work both qualitatively and quantitatively.We will release our code and data upon publication. 0.773

link

Data Quality

2025-02-19

Fine-grained Fallacy Detection with Human Label Variation

We introduce Faina, the first dataset for fallacy detection that embraces multiple plausible answers and natural disagreement.Faina includes over 11K span-level annotations with overlaps across 20 fallacy types on social media posts in Italian about migration, climate change, and public health given by two expert annotators.Through an extensive annotation study that allowed discussion over multiple rounds, we minimize annotation errors whilst keeping signals of human label variation. 0.699Moreover, we devise a framework that goes beyond "single ground truth" evaluation and simultaneously accounts for multiple (equally reliable) test sets and the peculiarities of the task, i.e., partial span matches, overlaps, and the varying severity of labeling errors.Our experiments across four fallacy detection setups show that multi-task and multi-label transformer-based approaches are strong baselines across all settings.We release our data, code, and annotation guidelines to foster research on fallacy detection and human label variation more broadly. 0.624

link

2025-02-17

Learning Generalizable Prompt for CLIP with Class Similarity Knowledge

In vision-language models (VLMs), prompt tuning has shown its effectiveness in adapting models to downstream tasks.However, learned prompts struggle to generalize to unseen classes, as they tend to overfit to the classes that are targeted during prompt tuning.Examining failure cases, we observed that learned prompts disrupt the semantics of unseen classes, generating text embeddings with incorrect semantic relationships among classes. 0.64To address this, we propose Similarity Alignment Regularization (SAR), which regularizes learnable prompts to preserve the semantic relationships among classes captured by hand-crafted prompts.Specifically, we first obtain novel classes related to base classes using ChatGPT-4o and utilize them as potential unseen classes during prompt tuning.Then, by targeting both base and novel classes, SAR aligns the similarity relationships among text embeddings generated by learnable prompts with the similarity relationships from hand-crafted prompts.Extensive experiments applying SAR to existing prompt tuning methods demonstrate its effectiveness in improving generalization to unseen classes.
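
The idea of aligning similarity relationships between two sets of text embeddings can be illustrated with a small sketch. It compares the pairwise cosine-similarity matrix of embeddings from learnable prompts with that of embeddings from hand-crafted prompts. The mean-squared-difference penalty, the function names, and the plain-list representation are all assumptions for illustration; the abstract does not state which alignment objective SAR actually uses.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarities between class embeddings (non-zero vectors)."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    return [
        [sum(a * b for a, b in zip(ei, ej)) / (ni * nj)
         for ej, nj in zip(embeddings, norms)]
        for ei, ni in zip(embeddings, norms)
    ]


def similarity_alignment_loss(learned, handcrafted):
    """Mean squared gap between the two similarity matrices (assumed objective)."""
    s_learned = cosine_similarity_matrix(learned)
    s_hand = cosine_similarity_matrix(handcrafted)
    n = len(s_learned)
    total = sum(
        (s_learned[i][j] - s_hand[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)
```

The loss is zero when the learned prompts reproduce the class-to-class similarity structure of the hand-crafted prompts, and grows as that structure is distorted.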

link

2025-02-17

Hypernym Bias: Unraveling Deep Classifier Training Dynamics through the Lens of Class Hierarchy

We investigate the training dynamics of deep classifiers by examining how hierarchical relationships between classes evolve during training.Through extensive experiments, we argue that the learning process in classification problems can be understood through the lens of label clustering. 0.645Specifically, we observe that networks tend to distinguish higher-level (hypernym) categories in the early stages of training, and learn more specific (hyponym) categories later.We introduce a novel framework to track the evolution of the feature manifold during training, revealing how the hierarchy of class relations emerges and refines across the network layers.Our analysis demonstrates that the learned representations closely align with the semantic structure of the dataset, providing a quantitative description of the clustering process.Notably, we show that in the hypernym label space, certain properties of neural collapse appear earlier than in the hyponym label space, helping to bridge the gap between the initial and terminal phases of learning.We believe our findings offer new insights into the mechanisms driving hierarchical learning in deep networks, paving the way for future advancements in understanding deep learning dynamics.

link

2025-02-11

Partial-Label Learning with Conformal Candidate Cleaning

Real-world data is often ambiguous; for example, human annotation produces instances with multiple conflicting class labels. 0.639Partial-label learning (PLL) aims at training a classifier in this challenging setting, where each instance is associated with a set of candidate labels and one correct, but unknown, class label.A multitude of algorithms targeting this setting exists and, to enhance their prediction quality, several extensions that are applicable across a wide range of PLL methods have been introduced.While many of these extensions rely on heuristics, this article proposes a novel enhancing method that incrementally prunes candidate sets using conformal prediction.To work around the missing labeled validation set, which is typically required for conformal prediction, we propose a strategy that alternates between training a PLL classifier to label the validation set, leveraging these predicted class labels for calibration, and pruning candidate labels that are not part of the resulting conformal sets.In this sense, our method alternates between empirical risk minimization and candidate set pruning.We establish that our pruning method preserves the conformal validity with respect to the unknown ground truth.Our extensive experiments on artificial and real-world data show that the proposed approach significantly improves the test set accuracies of several state-of-the-art PLL classifiers.
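
The pruning step described above can be sketched with standard split conformal prediction. This is a simplified illustration under stated assumptions, not the paper's algorithm: the nonconformity score `1 - p(label)`, the function names, and the fallback that keeps the original candidates when the intersection is empty are all hypothetical choices.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[k - 1]


def prune_candidates(probs, candidates, threshold):
    """Drop candidate labels that fall outside the conformal set.

    probs[y] is the classifier's probability for label y; the score
    1 - probs[y] is an assumed nonconformity measure.
    """
    conformal_set = {y for y, p in enumerate(probs) if 1.0 - p <= threshold}
    kept = set(candidates) & conformal_set
    # Assumed safeguard: never return an empty candidate set.
    return kept if kept else set(candidates)
```

In the alternating scheme the abstract describes, a routine like this would run after each round of classifier training, with calibration scores computed from the classifier's own predicted labels on the validation set.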

link
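The candidate-pruning step described in the partial-label learning abstract above can be sketched with standard split conformal prediction. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the calibration scores are taken as given (in the paper they come from labels predicted by the PLL classifier), the nonconformity score is assumed to be one minus the predicted class probability, and falling back to the original candidates when nothing survives is an added safeguard.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the split-conformal quantile of the calibration scores.

    cal_scores: nonconformity scores (here assumed to be 1 - p(label | x)).
    alpha: target miscoverage level, e.g. 0.2 for 80% coverage.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points: keep everything
    return sorted(cal_scores)[k - 1]


def prune_candidates(probs, candidates, threshold):
    """Keep only candidate labels that fall inside the conformal set.

    probs: dict mapping label -> predicted probability for one instance.
    candidates: the instance's current candidate label set.
    """
    kept = {y for y in candidates if 1.0 - probs[y] <= threshold}
    # Assumption: never return an empty candidate set.
    return kept if kept else set(candidates)


cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.2)
pruned = prune_candidates(
    {"cat": 0.7, "dog": 0.25, "fox": 0.05}, {"cat", "dog", "fox"}, threshold
)
```

In this toy run the label with the lowest predicted probability is pruned, while the two more plausible candidates remain for the next training round.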

2025-02-11

TMLC-Net: Transferable Meta Label Correction for Noisy Label Learning

The prevalence of noisy labels in real-world datasets poses a significant impediment to the effective deployment of deep learning models. 0.664 While meta-learning strategies have emerged as a promising approach for addressing this challenge, existing methods often suffer from limited transferability and task-specific designs. This paper introduces TMLC-Net, a novel Transferable Meta-Learner for Correcting Noisy Labels, designed to overcome these limitations. 0.681 TMLC-Net learns a general-purpose label correction strategy that can be readily applied across diverse datasets and model architectures without requiring extensive retraining or fine-tuning. 0.711 Our approach integrates three core components: (1) Normalized Noise Perception, which captures and normalizes training dynamics to handle distribution shifts; (2) Time-Series Encoding, which models the temporal evolution of sample statistics using a recurrent neural network; and (3) Subclass Decoding, which predicts a corrected label distribution based on the learned representations. We conduct extensive experiments on benchmark datasets with various noise types and levels, demonstrating that TMLC-Net consistently outperforms state-of-the-art methods in terms of both accuracy and robustness to label noise. Furthermore, we analyze the transferability of TMLC-Net, showcasing its adaptability to new datasets and noise conditions, and establishing its potential as a broadly applicable solution for robust deep learning in noisy environments.

link

2025-02-10

Prototype Contrastive Consistency Learning for Semi-Supervised Medical Image Segmentation

Medical image segmentation is a crucial task in medical image analysis, but it can be very challenging, especially when labeled data are scarce and unlabeled data are plentiful. Contrastive learning has proven to be effective for medical image segmentation in semi-supervised learning by constructing contrastive samples from partial pixels. However, although previous contrastive learning methods can mine semantic information from partial pixels within images, they ignore the whole-context information of unlabeled images, which is very important for precise segmentation. To solve this problem, we propose a novel prototype contrastive learning method called Prototype Contrastive Consistency Segmentation (PCCS) for semi-supervised medical image segmentation. The core idea is to pull prototypes of the same semantic class closer together and push prototypes of different semantic classes far apart. Specifically, we construct a signed distance map and an uncertainty map from unlabeled images. The signed distance map is used to construct prototypes for contrastive learning, and we then estimate the prototype uncertainty from the uncertainty map as a trade-off among prototypes. To obtain better prototypes, based on the student-teacher architecture, a new mechanism named prototype updating prototype is designed to assist in updating the prototypes for contrastive learning. In addition, we propose an uncertainty-consistency loss to mine more reliable information from unlabeled data. 0.617 Extensive experiments on medical image segmentation demonstrate that PCCS achieves better segmentation performance than the state-of-the-art methods. The code is available at https://github.com/comphsh/PCCS.

link

2025-02-05

Do Large Language Model Benchmarks Test Reliability?

When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. 0.791 Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks

link

2025-02-04

Addressing Label Shift in Distributed Learning via Entropy Regularization

We address the challenge of minimizing true risk in multi-node distributed learning. These systems are frequently exposed to both inter-node and intra-node label shifts, which present a critical obstacle to effectively optimizing model performance while ensuring that data remains confined to each node. To tackle this, we propose the Versatile Robust Label Shift (VRLS) method, which enhances the maximum likelihood estimation of the test-to-train label density ratio. 0.619 VRLS incorporates Shannon entropy-based regularization and adjusts the density ratio during training to better handle label shifts at test time. In multi-node learning environments, VRLS further extends its capabilities by learning and adapting density ratios across nodes, effectively mitigating label shifts and improving overall model performance. Experiments conducted on MNIST, Fashion MNIST, and CIFAR-10 demonstrate the effectiveness of VRLS, outperforming baselines by up to 20% in imbalanced settings. These results highlight the significant improvements VRLS offers in addressing label shifts. Our theoretical analysis further supports this by establishing high-probability bounds on estimation errors.

link

Benchmarks

2025-02-20

seqKAN: Sequence processing with Kolmogorov-Arnold Networks

Kolmogorov-Arnold Networks (KANs) have been recently proposed as a machine learning framework that is more interpretable and controllable than the multi-layer perceptron. Various network architectures have been proposed within the KAN framework targeting different tasks and application domains, including sequence processing. This paper proposes seqKAN, a new KAN architecture for sequence processing. Although multiple sequence processing KAN architectures have already been proposed, we argue that seqKAN is more faithful to the core concept of the KAN framework. Furthermore, we empirically demonstrate that it achieves better results. 0.645 The empirical evaluation is performed on generated data from a complex physics problem on an interpolation and an extrapolation task. Using this dataset we compared seqKAN against a prior KAN network for timeseries prediction, recurrent deep networks, and symbolic regression. seqKAN substantially outperforms all architectures, particularly on the extrapolation dataset, while also being the most transparent.

link

2025-02-20

Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup

Large language models have demonstrated excellent performance in many tasks, including Text-to-SQL, due to their powerful in-context learning capabilities. They are becoming the mainstream approach for Text-to-SQL. However, these methods still have a significant gap compared to human performance, especially on complex questions. As the complexity of questions increases, the gap between questions and SQL queries increases. We identify two important gaps: the structural mapping gap and the lexical mapping gap. To tackle these two gaps, we propose PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates these gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). AQP aims to obtain the structural pattern of the question by removing database-related information, which enables us to find structurally similar demonstrations. CSM aims to associate database-related text spans in the question with specific tables or columns in the database, which alleviates the lexical mapping gap. Experimental results on the Spider and BIRD datasets demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL + GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution accuracy of 87.9%, and achieves leading results on the BIRD dataset with an execution accuracy of 64.67%. 0.611

link

2025-02-20

CDGS: Confidence-Aware Depth Regularization for 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) has shown significant advantages in novel view synthesis (NVS), particularly in achieving high rendering speeds and high-quality results. However, its geometric accuracy in 3D reconstruction remains limited due to the lack of explicit geometric constraints during optimization. This paper introduces CDGS, a confidence-aware depth regularization approach developed to enhance 3DGS. We leverage multi-cue confidence maps of monocular depth estimation and sparse Structure-from-Motion depth to adaptively adjust depth supervision during the optimization process. Our method demonstrates improved geometric detail preservation in early training stages and achieves competitive performance in both NVS quality and geometric accuracy. Experiments on the publicly available Tanks and Temples benchmark dataset show that our method achieves more stable convergence behavior and more accurate geometric reconstruction results, with improvements of up to 2.31 dB in PSNR for NVS and consistently lower geometric errors in M3C2 distance metrics. 0.602 Notably, our method reaches comparable F-scores to the original 3DGS with only 50% of the training iterations. We expect this work will facilitate the development of efficient and accurate 3D reconstruction systems for real-world applications such as digital twin creation, heritage preservation, or forestry applications.

link

2025-02-20

SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition

RNN-Transducer (RNN-T) is a widely adopted architecture in speech recognition, integrating acoustic and language modeling in an end-to-end framework. However, the RNN-T predictor tends to over-rely on consecutive word dependencies in training data, leading to high deletion error rates, particularly with less common or out-of-domain phrases. Existing solutions, such as regularization and data augmentation, often compromise other aspects of performance. We propose SegAug, an alignment-based augmentation technique that generates contextually varied audio-text pairs with low sentence-level semantics. This method encourages the model to focus more on acoustic features while diversifying the learned textual patterns of its internal language model, thereby reducing deletion errors and enhancing overall performance. Evaluations on the LibriSpeech and Tedlium-v3 datasets demonstrate a relative WER reduction of up to 12.5% on small-scale and 6.9% on large-scale settings. Notably, most of the improvement stems from reduced deletion errors, with relative reductions of 45.4% and 18.5%, respectively. 0.671 These results highlight SegAug's effectiveness in improving RNN-T's robustness, offering a promising solution for enhancing speech recognition performance across diverse and challenging scenarios.

link

2025-02-20

Revisiting Near-Far Field Boundary in Dual-Polarized XL-MIMO Systems

Extremely large-scale multiple-input multiple-output (XL-MIMO) is expected to be an important technology in future sixth generation (6G) networks. Compared with conventional single-polarized XL-MIMO, where signals are transmitted and received in only one polarization direction, dual-polarized XL-MIMO systems achieve higher data rates by improving multiplexing performance, and thus are the focus of this paper. Due to the enlarged aperture, near-field regions become non-negligible in XL-MIMO communications, necessitating accurate near-far field boundary characterizations. However, existing boundaries developed for single-polarized systems only consider phase or power differences across array elements while ignoring cross-polarization discrimination (XPD) variances in dual-polarized XL-MIMO systems, which degrades transmit covariance optimization performance. In this paper, we revisit near-far field boundaries for dual-polarized XL-MIMO systems by taking XPD differences into account, which faces the following challenge. Unlike existing near-far field boundaries, which only need to consider co-polarized channel components, deriving boundaries for dual-polarized XL-MIMO systems requires modeling joint effects of co-polarized and cross-polarized components. To address this issue, we model XPD variations across antennas and introduce a non-uniform XPD distance to complement existing near-far field boundaries. Based on the new distance criterion, we propose an efficient scheme to optimize transmit covariance. Numerical results validate our analysis and demonstrate the proposed algorithm's effectiveness. 0.786

link

2025-02-20

General Uncertainty Estimation with Delta Variances

Decision makers may suffer from uncertainty induced by limited data. This may be mitigated by accounting for epistemic uncertainty, which is, however, challenging to estimate efficiently for large neural networks. To this end, we investigate Delta Variances, a family of algorithms for epistemic uncertainty quantification that are computationally efficient and convenient to implement. They can be applied to neural networks and to more general functions composed of neural networks. As an example, we consider a weather simulator with a neural-network-based step function inside; here Delta Variances empirically obtain competitive results at the cost of a single gradient computation. The approach is convenient as it requires no changes to the neural network architecture or training procedure. We discuss multiple ways to derive Delta Variances theoretically, noting that special cases recover popular techniques, and present a unified perspective on multiple related methods. 0.601 Finally, we observe that this general perspective gives rise to a natural extension and empirically show its benefit.

link

2025-02-20

Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective

In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. 0.643 Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. 0.657 Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63x and 2.23x on CPU and GPU, respectively.

link
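The arithmetic-progression allocation mentioned in the layer-wise sparsity abstract above can be illustrated with a few lines of Python. This is a rough sketch, not the authors' code: the function name is hypothetical, and centering the progression so that its mean equals the overall target sparsity is an assumption made here for illustration.

```python
def layerwise_sparsity(num_layers, target, diff):
    """Return per-layer sparsity rates as an increasing arithmetic progression.

    num_layers: number of layers to sparsify.
    target: desired average sparsity across layers (assumption: the mean
            of the progression is pinned to this value).
    diff: common difference between consecutive layers (the single
          hyperparameter to search over); assumed non-negative.
    """
    if num_layers < 1 or diff < 0:
        raise ValueError("need num_layers >= 1 and diff >= 0")
    # Center the progression on the target so the average stays fixed.
    first = target - diff * (num_layers - 1) / 2
    rates = [first + diff * i for i in range(num_layers)]
    if rates[0] < 0.0 or rates[-1] > 1.0:
        raise ValueError("diff too large: rates leave the [0, 1] range")
    return rates


# Example: 32 layers, 70% average sparsity, small common difference.
rates = layerwise_sparsity(num_layers=32, target=0.7, diff=0.005)
```

With this parameterization, searching over sparsity allocations reduces to trying a handful of values for `diff`, which matches the abstract's remark that only a few trials are needed.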

2025-02-20

Efficient Multivariate Robust Mean Estimation Under Mean-Shift Contamination

We study the algorithmic problem of robust mean estimation of an identity covariance Gaussian in the presence of mean-shift contamination. In this contamination model, we are given a set of points in $\mathbb{R}^d$ generated i.i.d. via the following process. For a parameter $\alpha<1/2$, the $i$-th sample $x_i$ is obtained as follows: with probability $1-\alpha$, $x_i$ is drawn from $\mathcal{N}(\mu, I)$, where $\mu \in \mathbb{R}^d$ is the target mean; and with probability $\alpha$, $x_i$ is drawn from $\mathcal{N}(z_i, I)$, where $z_i$ is unknown and potentially arbitrary. Prior work characterized the information-theoretic limits of this task. Specifically, it was shown that, in contrast to Huber contamination, in the presence of mean-shift contamination consistent estimation is possible. On the other hand, all known robust estimators in the mean-shift model have running times exponential in the dimension. Here we give the first computationally efficient algorithm for high-dimensional robust mean estimation with mean-shift contamination that can tolerate a constant fraction of outliers. In particular, our algorithm has near-optimal sample complexity, runs in sample-polynomial time, and approximates the target mean to any desired accuracy. 0.619 Conceptually, our result contributes to a growing body of work that studies inference with respect to natural noise models lying in between fully adversarial and random settings.

link
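To make the mean-shift contamination model in the abstract above concrete, the following sketch draws samples from it and shows how the plain sample mean is pulled away from the target. It only illustrates the data-generating process, not the paper's estimator; the function name and the `shift_fn` callback (which supplies each contaminated center $z_i$) are hypothetical.

```python
import random


def sample_mean_shift(n, mu, alpha, shift_fn, rng):
    """Draw n points from the mean-shift contamination model.

    With probability 1 - alpha a point is drawn from N(mu, I);
    with probability alpha it is drawn from N(z_i, I), z_i = shift_fn(i).
    """
    samples = []
    for i in range(n):
        center = shift_fn(i) if rng.random() < alpha else mu
        # Identity covariance: independent unit-variance coordinates.
        samples.append([rng.gauss(c, 1.0) for c in center])
    return samples


rng = random.Random(0)
mu = [0.0, 0.0]
samples = sample_mean_shift(
    n=5000, mu=mu, alpha=0.2, shift_fn=lambda i: [10.0, 10.0], rng=rng
)
# The naive sample mean is biased toward the shifted centers.
naive_mean = [sum(col) / len(samples) for col in zip(*samples)]
```

Here every contaminated point is shifted to the same center, so the naive mean lands near `alpha * 10 = 2` in each coordinate instead of near the target mean of 0.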

2025-02-20

Harnessing PDF Data for Improving Japanese Large Multimodal Models

Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. 0.691 Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs. We plan to make the source code and data publicly available upon acceptance.

link

2025-02-20

Real-Time Device Reach Forecasting Using HLL and MinHash Data Sketches

Predicting the right number of TVs (Device Reach) in real time based on user-specified targeting attributes is imperative for running a multi-million-dollar ads business. The traditional approach of SQL queries to join billions of records across multiple targeting dimensions is extremely slow. As a workaround, many applications have an offline process to crunch these numbers and present the results after many hours. In our case, the solution was an offline process taking 24 hours to onboard a customer, resulting in a potential loss of business. To solve this problem, we have built a new real-time prediction system using MinHash and HyperLogLog (HLL) data sketches to compute the device reach at runtime when a user makes a request. However, existing MinHash implementations do not solve the complex problem of multilevel aggregation and intersection. This work shows how we have solved this problem. In addition, we have improved the MinHash algorithm to run 4 times faster using Single Instruction Multiple Data (SIMD) vectorized operations, for high speed and accuracy with constant space when processing billions of records. Finally, through experiments, we demonstrate that the results are as accurate as the traditional offline prediction system, with an acceptable error rate of 5%. 0.624

link
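For readers unfamiliar with the MinHash sketches mentioned in the device-reach abstract above, here is a minimal textbook version in Python. It is not the system described in the paper (no SIMD, no multilevel aggregation); the helper names are hypothetical, and the exact union size stands in for what would normally be an HLL cardinality estimate.

```python
import hashlib


def _hash64(value, seed):
    """Hash a value to a 64-bit integer; the seed selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(
        str(value).encode("utf-8"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """Return the MinHash signature of a non-empty collection of items."""
    items = list(items)
    return [min(_hash64(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching minima."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Two hypothetical device-id sets with true Jaccard similarity 500 / 1500.
devices_a = range(0, 1000)
devices_b = range(500, 1500)
sig_a = minhash_signature(devices_a)
sig_b = minhash_signature(devices_b)
jaccard_est = estimate_jaccard(sig_a, sig_b)

# Intersection size = Jaccard * union size. The exact union size is used
# here as a stand-in for an HLL estimate of the union cardinality.
union_size = len(set(devices_a) | set(devices_b))
overlap_est = jaccard_est * union_size
```

The signature has a fixed length regardless of how many devices are in each set, which is what allows such sketches to answer intersection queries in constant space.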

2025-02-20

Ray-Tracing for Conditionally Activated Neural Networks

In this paper, we introduce a novel architecture for conditionally activated neural networks combining a hierarchical construction of multiple Mixture of Experts (MoEs) layers with a sampling mechanism that progressively converges to an optimized configuration of expert activation. This methodology enables the dynamic unfolding of the network's architecture, facilitating efficient path-specific training. Experimental results demonstrate that this approach achieves competitive accuracy compared to conventional baselines while significantly reducing the parameter count required for inference. 0.706 Notably, this parameter reduction correlates with the complexity of the input patterns, a property naturally emerging from the network's operational dynamics without necessitating explicit auxiliary penalty functions.

link

2025-02-20

Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs

While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. Through analysis of LLM internal representations from over 1,000 language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. Our code is publicly available (https://github.com/dannigt/mid-align). 0.622

link

2025-02-19

An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system. 0.608

link

2025-02-19

A consensus set for the aggregation of partial rankings: the case of the Optimal Set of Bucket Orders Problem

In rank aggregation problems (RAP), the solution is usually a consensus ranking that generalizes a set of input orderings. There are different variants that differ not only in terms of the type of rankings that are used as input and output, but also in terms of the objective function employed to evaluate the quality of the desired output ranking. 0.614 In contrast, in some machine learning tasks (e.g. subgroup discovery) or multimodal optimization tasks, attention is devoted to obtaining several models/results to account for the diversity in the input data or across the search landscape. Thus, in this paper we propose to provide, as the solution to an RAP, a set of rankings to better explain the preferences expressed in the input orderings. We exemplify our proposal through the Optimal Bucket Order Problem (OBOP), an RAP which consists in finding a single consensus ranking (with ties) that generalizes a set of input rankings codified as a precedence matrix. To address this, we introduce the Optimal Set of Bucket Orders Problem (OSBOP), a generalization of the OBOP that aims to produce not a single ranking as output but a set of consensus rankings. Experimental results are presented to illustrate this proposal, showing how, by providing a set of consensus rankings, the fitness of the solution significantly improves with respect to that of the original OBOP, without losing comprehensibility.

link

2025-02-19

From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education

Large Language Models (LLMs), such as GPT-4, have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. Current models fail to provide meaningful insights into the causes of student mistakes, limiting their utility in educational contexts. To address these challenges, we present three key contributions. First, we introduce MathCCS (Mathematical Classification and Constructive Suggestions), a multi-modal benchmark designed for systematic error analysis and tailored feedback. 0.644 MathCCS includes real-world problems, expert-annotated error categories, and longitudinal student data. Evaluations of state-of-the-art models, including Qwen2-VL, LLaVA-OV, Claude-3.5-Sonnet and GPT-4o, reveal that none achieved classification accuracy above 30% or generated high-quality suggestions (average scores below 4/10), highlighting a significant gap from human-level performance. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Finally, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-time refinement, enhancing error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education, bridging the gap between current AI capabilities and the demands of real-world teaching.

link

2025-02-19

Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge

Estimating the construction year of buildings is of great importance for sustainability. Sustainable buildings minimize energy consumption and are a key part of responsible and sustainable urban planning and development to effectively combat climate change. By using Artificial Intelligence (AI) and recently proposed Transformer models, we are able to estimate the construction epoch of buildings from a multi-modal dataset. In this paper, we introduce a new benchmark multi-modal dataset, i.e. the Map your City Dataset (MyCD), containing top-view Very High Resolution (VHR) images, Earth Observation (EO) multi-spectral data from the Copernicus Sentinel-2 satellite constellation, and street-view images in many different cities in Europe, co-localized with respect to the building under study and labelled with the construction epoch. We assess EO generalization performance on new, previously unseen cities that have been held out from training and appear only during inference. In this work, we present the community-based data challenge we organized based on MyCD. The ESA AI4EO Challenge MapYourCity was opened in 2024 for 4 months. Here, we present the Top-4 performing models, and the main evaluation results. 0.622 During inference, we examine the performance of the models both with all three input modalities and with only the two top-view modalities, i.e. without the street-view images. The evaluation results show that the models are effective and can achieve good performance on this difficult real-world task of estimating the age of buildings, even on previously unseen cities, and even when using only the two top-view modalities (i.e. VHR and Sentinel-2) during inference.

link

2025-02-19

Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning

Code verification has recently found great success as a critical component in training large scale reasoning models for coding. Synthetic techniques such as self-generated test cases and reward models provide a way to enhance code capabilities beyond predefined tests. Building on these advancements, we propose new benchmarks designed to systematically evaluate the impact of synthetic verification methods on assessing solution correctness. 0.629 We introduce HE-R, HE-R+, MBPP-R, and MBPP-R+, which transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. Using these benchmarks, we analyze synthetic verification methods in standard, reasoning-based, and reward-based LLMs. Our results show that recent reasoning models significantly improve test case generation and that scaling test cases enhances verification accuracy.

link

2025-02-19

In-Place Updates of a Graph Index for Streaming Approximate Nearest Neighbor Search

Indices for approximate nearest neighbor search (ANNS) are a basic component for information retrieval and are widely used in database, search, recommendation and RAG systems. In these scenarios, documents or other objects are inserted into and deleted from the working set at a high rate, requiring a stream of updates to the vector index. Algorithms based on proximity graph indices are the most efficient indices for ANNS, winning many benchmark competitions. However, it is challenging to update such a graph index at a high rate while supporting stable recall after many updates. Since the graph is singly-linked, deletions are hard because there is no fast way to find in-neighbors of a deleted vertex. Therefore, to update the graph, state-of-the-art algorithms such as FreshDiskANN accumulate deletions in a batch and periodically consolidate, removing edges to deleted vertices and modifying the graph to ensure recall stability. In this paper, we present IP-DiskANN (InPlaceUpdate-DiskANN), the first algorithm to avoid batch consolidation by efficiently processing each insertion and deletion in-place. Our experiments using standard benchmarks show that IP-DiskANN has stable recall over various lengthy update patterns in both high-recall and low-recall regimes. Further, its query throughput and update speed are better than those of the batch consolidation algorithm and HNSW. 0.615

link
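The difficulty with deletions noted in the abstract above (a singly-linked graph stores only out-edges, so in-neighbors are not directly available) can be seen in a tiny example. This sketch shows the naive full scan that such a deletion needs; it is not IP-DiskANN or FreshDiskANN, and the function names and dictionary-based graph layout are assumptions made for illustration.

```python
def in_neighbors(out_edges, target):
    """Find all vertices with an edge into target by scanning every vertex.

    out_edges maps each vertex to its list of out-neighbors, so this
    lookup touches the whole graph: cost grows with the number of edges.
    """
    return [u for u, nbrs in out_edges.items() if target in nbrs]


def delete_vertex_naive(out_edges, target):
    """Remove target and every edge pointing to it, using a full scan."""
    for u in in_neighbors(out_edges, target):
        out_edges[u] = [v for v in out_edges[u] if v != target]
    out_edges.pop(target, None)


graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
incoming = in_neighbors(graph, 2)
delete_vertex_naive(graph, 2)
```

Because each deletion scans the full adjacency structure, doing this per update is impractical at scale, which is the cost that batching amortizes and that an in-place method has to avoid.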

2025-02-19

Performance optimization of BLAS algorithms with band matrices for RISC-V processors

The rapid development of the RISC-V instruction set architecture presents new opportunities and challenges for software developers. Is it sufficient to simply recompile high-performance software optimized for x86-64 onto RISC-V CPUs? Are current compilers capable of effectively optimizing C and C++ code, or is it necessary to use intrinsics or assembler? Can we analyze and improve performance without well-developed profiling tools? 0.62 Do standard optimization techniques work? Are there specific RISC-V features that need to be considered? These and other questions require careful consideration. In this paper, we present our experience optimizing four BLAS algorithms for band matrix operations on RISC-V processors. We demonstrate how RISC-V-optimized implementations of OpenBLAS algorithms can be significantly accelerated through improved vectorization of computationally intensive loops. Experiments on Lichee Pi 4A and Banana Pi BPI-F3 devices, using the RVV 0.7.1 and RVV 1.0 vector instruction sets respectively, show speedups of 1.5x to 10x depending on the operation compared to the OpenBLAS baseline. In particular, the successful use of vector register grouping with RVV can lead to significant performance improvements.

link

2025-02-19

Mitigating Popularity Bias in Collaborative Filtering through Fair Sampling

Recommender systems often suffer from popularity bias, where frequently interacted items are overrepresented in recommendations.This bias stems from propensity factors influencing training data, leading to imbalanced exposure.In this paper, we introduce a Fair Sampling (FS) approach to address this issue by ensuring that both users and items are selected with equal probability as positive and negative instances.Unlike traditional inverse propensity score (IPS) methods, FS does not require propensity estimation, eliminating errors associated with inaccurate calculations.Our theoretical analysis demonstrates that FS effectively neutralizes the influence of propensity factors, achieving unbiased learning.Experimental results validate that FS outperforms state-of-the-art methods in both point-wise and pair-wise recommendation tasks, enhancing recommendation fairness without sacrificing accuracy. 0.637The implementation is available at https://anonymous.4open.science/r/Fair-Sampling.

link

2025-02-19

Performance Comparison of Graph Representations Which Support Dynamic Graph Updates

Research in graph-structured data has grown rapidly due to graphs' ability to represent complex real-world information and capture intricate relationships, particularly as many real-world graphs evolve dynamically through edge/vertex insertions and deletions.This has spurred interest in programming frameworks for managing, maintaining, and processing such dynamic graphs.In this report, we evaluate the performance of PetGraph (Rust), Stanford Network Analysis Platform (SNAP), SuiteSparse:GraphBLAS, cuGraph, Aspen, and our custom implementation in tasks including loading graphs from disk to memory, cloning loaded graphs, applying in-place edge deletions/insertions, and performing a simple iterative graph traversal algorithm.Our implementation demonstrates significant performance improvements: it outperforms PetGraph, SNAP, SuiteSparse:GraphBLAS, cuGraph, and Aspen by factors of 177x, 106x, 76x, 17x, and 3.3x in graph loading; 20x, 235x, 0.24x, 1.3x, and 0x in graph cloning; 141x/45x, 44x/25x, 13x/11x, 28x/34x, and 3.5x/2.2x in edge deletions/insertions; and 67x/63x, 86x/86x, 2.5x/2.6x, 0.25x/0.24x, and 1.3x/1.3x in traversal on updated graphs with deletions/insertions. 0.605

link

2025-02-19

MEX: Memory-efficient Approach to Referring Multi-Object Tracking

Referring Multi-Object Tracking (RMOT) is a relatively new concept that has rapidly gained traction as a promising research direction at the intersection of computer vision and natural language processing.Unlike traditional multi-object tracking, RMOT identifies and tracks objects and incorporates textual descriptions for object class names, making the approach more intuitive.Various techniques have been proposed to address this challenging problem; however, most require the training of the entire network due to their end-to-end nature.Among these methods, iKUN has emerged as a particularly promising solution.Therefore, we further explore its pipeline and enhance its performance.In this paper, we introduce a practical module dubbed Memory-Efficient Cross-modality -- MEX.This memory-efficient technique can be directly applied to off-the-shelf trackers like iKUN, resulting in significant architectural improvements.Our method proves effective during inference on a single GPU with 4 GB of memory.Among the various benchmarks, the Refer-KITTI dataset, which offers diverse autonomous driving scenes with relevant language expressions, is particularly useful for studying this problem.Empirically, our method demonstrates effectiveness and efficiency regarding HOTA tracking scores, substantially improving memory allocation and processing speed. 0.618

link

2025-02-19

PSCon: Toward Conversational Product Search

Conversational Product Search (CPS) is confined to simulated conversations due to the lack of real-world CPS datasets that reflect human-like language. Additionally, current conversational datasets offer limited support for cross-market and multi-lingual usage. In this paper, we introduce a new CPS data collection protocol and present PSCon, a novel CPS dataset designed to assist product search via human-like conversations. The dataset is constructed using a coached human-to-human data collection protocol and supports two languages and dual markets. The dataset also enables thorough exploration of six subtasks of CPS: user intent detection, keyword extraction, system action prediction, question selection, item ranking, and response generation. Furthermore, we offer an analysis of the dataset and propose a benchmark model on the proposed CPS dataset. 0.62

link

2025-02-19

DataSciBench: An LLM Agent Benchmark for Data Science

This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science.Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. 0.677In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics.We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics.This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics).Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules.Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered.This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses.Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models.We release all code and data at https://github.com/THUDM/DataSciBench.

link

2025-02-19

Partially Observable Gaussian Process Network and Doubly Stochastic Variational Inference

To reduce the curse of dimensionality for Gaussian processes (GP), they can be decomposed into a Gaussian Process Network (GPN) of coupled subprocesses with lower dimensionality. In some cases, intermediate observations are available within the GPN. However, intermediate observations are often indirect, noisy, and incomplete in most real-world systems. This work introduces the Partially Observable Gaussian Process Network (POGPN) to model real-world process networks. We model a joint distribution of latent functions of subprocesses and make inferences using observations from all subprocesses. POGPN incorporates observation lenses (observation likelihoods) into the well-established inference method of deep Gaussian processes. We also introduce two training methods for POGPN to make inferences on the whole network using node observations. The application to benchmark problems demonstrates how incorporating partial observations during training and inference can improve the predictive performance of the overall network, offering a promising outlook for its practical application. 0.621

link

2025-02-19

Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields.Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required.This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators.Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered.Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed.In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge.Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques.The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/ 0.655

link

2025-02-19

IC-D2S: A Hybrid Ising-Classical-Machines Data-Driven QUBO Solver Method

We present a heuristic algorithm designed to solve Quadratic Unconstrained Binary Optimization (QUBO) problems efficiently. The algorithm, referred to as IC-D2S, leverages a hybrid approach using Ising and classical machines to address very large problem sizes. Considering the practical limitation on the size of the Ising machine (IM), our algorithm partitions the QUBO problem into a collection of QUBO subproblems (called subQUBOs) and utilizes the IM to solve each subQUBO. Our proposed heuristic algorithm uses a set of control parameters to generate the subQUBOs and explore the search space. Also, it utilizes an annealer based on cosine waveform and applies a mutation operator at each step of the search to diversify the solution space and facilitate the process of finding the global minimum of the problem. We have evaluated the effectiveness of our IC-D2S algorithm on three large-sized problem sets and compared its efficiency in finding the (near-)optimal solution with three QUBO solvers. One of the solvers is a software-based algorithm (D2TS), while the other one (D-Wave) employs a similar approach to ours, utilizing both classical and Ising machines. The results demonstrate that for large-sized problems (>= 5000) the proposed algorithm identifies superior solutions. 0.607 Additionally, for smaller-sized problems (= 2500), IC-D2S efficiently finds the optimal solution in a significantly faster manner.
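
The subQUBO idea can be illustrated with a small sketch. It is a generic decomposition, not the IC-D2S algorithm: the subset choice, the function names, and the exhaustive search standing in for the Ising machine are all assumptions. Variables outside the chosen subset are held fixed, and their couplings fold into the diagonal of the subproblem because x squared equals x for binary x.

```python
from itertools import product


def energy(q, x):
    """QUBO objective: sum over i, j of q[i][j] * x[i] * x[j]."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def sub_qubo(q, x, subset):
    """Build the QUBO over `subset` with every other variable fixed at its value in x."""
    fixed = [j for j in range(len(x)) if j not in subset]
    sub = [[q[a][b] for b in subset] for a in subset]
    for k, a in enumerate(subset):
        # Couplings to fixed variables become linear (diagonal) terms.
        sub[k][k] += sum((q[a][j] + q[j][a]) * x[j] for j in fixed)
    return sub


def brute_force(q):
    """Stand-in for the Ising machine: exhaustive search, small sizes only."""
    return list(min(product((0, 1), repeat=len(q)), key=lambda z: energy(q, z)))


def improve(q, x, subset):
    """Re-optimize the variables in `subset` and write them back into a copy of x."""
    best = brute_force(sub_qubo(q, x, subset))
    y = list(x)
    for a, value in zip(subset, best):
        y[a] = value
    return y


# Hypothetical 4-variable chain problem.
q = [[-1, 2, 0, 0], [0, -1, 2, 0], [0, 0, -1, 2], [0, 0, 0, -1]]
x = [1, 1, 1, 1]
y = improve(q, x, [1, 2])
```

Because the current assignment of the subset is itself a candidate, one such step can never increase the full objective. Here it moves the energy from 2 to -2.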

link

2025-02-19

Bounded Synthesis of Synchronized Distributed Models from Lightweight Specifications

We present an approach to automatically synthesize synchronized models from lightweight formal specifications. Our approach takes as input a specification of a distributed system along with a global linear time constraint, which must be fulfilled by the interaction of the system's components. It produces executable models for the component specifications (in the style of the Promela language) whose concurrent execution satisfies the global constraint. The component specifications consist of a collection of actions described by means of pre- and post-conditions together with first-order relational formulas prescribing their behavior. We use the Alloy Analyzer to encode the component specifications and enumerate their potential implementations up to some bound, whose concurrent composition is model checked against the global property. Even though this approach is sound and complete up to the selected bound, it is impractical as the number of candidate implementations grows exponentially. To address this, we propose an algorithm that uses batches of counterexamples to prune the solution space. It has two main phases: exploration, where the algorithm collects a batch of counterexamples, and exploitation, where this knowledge is used to speed up the search. The approach is sound, while its completeness depends on the batches used. We present a prototype tool, describe some experiments, and compare it with related approaches. 0.649

link

2025-02-19

Where's the Bug? Attention Probing for Scalable Fault Localization

Ensuring code correctness remains a challenging problem even as large language models (LLMs) become increasingly capable at code-related tasks.While LLM-based program repair systems can propose bug fixes using only a user's bug report, their effectiveness is fundamentally limited by their ability to perform fault localization (FL), a challenging problem for both humans and LLMs.Existing FL approaches rely on executable test cases, require training on costly and often noisy line-level annotations, or demand resource-intensive LLMs.In this paper, we present Bug Attention Probe (BAP), a method which learns state-of-the-art fault localization without any direct localization labels, outperforming traditional FL baselines and prompting of large-scale LLMs.We evaluate our approach across a variety of code settings, including real-world Java bugs from the standard Defects4J dataset as well as seven other datasets which span a diverse set of bug types and languages.Averaged across all eight datasets, BAP improves by 34.6% top-1 accuracy compared to the strongest baseline and 93.4% over zero-shot prompting GPT-4o. 0.641BAP is also significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.

link

2025-02-18

Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing

Limited by the context window size of Large Language Models (LLMs), handling various tasks with input tokens exceeding the upper limit has been challenging, whether it is a simple direct retrieval task or a complex multi-hop reasoning task. Although various methods have been proposed to enhance the long-context processing capabilities of LLMs, they either incur substantial post-training costs, require additional tool modules (e.g., RAG), or have not shown significant improvement in realistic tasks. Our work observes the correlation between the attention distribution and generated answers across each layer, and establishes through experiments that the attention allocation aligns with retrieval-augmented capabilities. Drawing on the above insights, we propose a novel method, InfiniRetri, that leverages the LLM's own attention information to enable accurate retrieval across inputs of infinite length. Our evaluations indicate that InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model, surpassing other methods and larger models and setting a new state-of-the-art (SOTA). Moreover, our method achieves significant performance improvements on real-world benchmarks, with a maximum improvement of 288%. 0.697 In addition, InfiniRetri can be applied to any Transformer-based LLM without additional training and substantially reduces inference latency and compute overhead in long texts. In summary, our comprehensive studies show InfiniRetri's potential for practical applications and create a paradigm for retrieving information using LLMs' own capabilities under infinite-length tokens. Code will be released in link.

link

2025-02-18

Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search

Recent dense retrievers usually thrive on the emergent capabilities of Large Language Models (LLMs), using them to encode queries and documents into an embedding space for retrieval. These LLM-based dense retrievers have shown promising performance across various retrieval scenarios. However, relying on a single embedding to represent documents proves less effective in capturing different perspectives of documents for matching. In this paper, we propose Deliberate Thinking based Dense Retriever (DEBATER), which enhances these LLM-based retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. DEBATER introduces the Chain-of-Deliberation mechanism to iteratively optimize document representations using a continuous chain of thought. To consolidate information from various thinking steps, DEBATER also incorporates the Self Distillation mechanism, which identifies the most informative thinking steps and integrates them into a unified text embedding. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. 0.641 All code is available at https://github.com/OpenBMB/DEBATER.

link

2025-02-18

Efficient Learning Under Density Shift in Incremental Settings Using Cramér-Rao-Based Regularization

The continuous surge in data volume and velocity is often dealt with using data orchestration and distributed processing approaches, abstracting away the machine learning challenges that exist at the algorithmic level. With growing interest in automating the learning loop, training with data that arrive in a sequence rather than in the classical in-memory training data form will face a machine learning challenge because of evolving feature distributions across batches of training data biasing the cross-validation step (\cite{sugiyama2012machine}). This work takes a distributed density estimation angle to the problem where data are temporally distributed. It processes data in batches and allows a neural network to treat a batch as training data. The method accumulates knowledge about the data density via posterior probability absorption using the Fisher Information Matrix, which contains information about the local optimization gradients for the batch. This is then used as a regularizer for the loss in the following batch, and therefore the density estimate for the entire dataset becomes progressively more robust to the non-iid distribution shift. This requires only a pair of batches in memory at a time, so the space cost is not a function of the size of the complete, distributed dataset. We propose a novel regularization-based approach, Covariate Shift Correction ($C^{2}A$), that leverages Fisher information and Kullback-Leibler divergence to adapt to both natural and sequential covariate shift caused by dataset fragmentation. $C^{2}A$ achieves $19\%$ accuracy at maximum against state-of-the-art methods. 0.674
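
One common way to use Fisher information as a regularizer between consecutive batches is a quadratic penalty weighted by a diagonal Fisher estimate. The sketch below shows that generic form only. It is an assumption, not the paper's $C^{2}A$ objective, and the function names, the diagonal approximation, and the weight `lam` are hypothetical.

```python
def diagonal_fisher(per_example_grads):
    """Estimate the diagonal Fisher information as the mean squared gradient."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def regularized_loss(batch_loss, theta, theta_prev, fisher, lam):
    """Add a Fisher-weighted penalty for drifting away from the previous batch's parameters."""
    penalty = sum(f * (t - p) ** 2 for f, t, p in zip(fisher, theta, theta_prev))
    return batch_loss + 0.5 * lam * penalty


# Hypothetical numbers: gradients from the previous batch, parameters from both batches.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
loss = regularized_loss(0.5, theta=[1.0, 1.0], theta_prev=[0.0, 0.0], fisher=fisher, lam=2.0)
```

Only the previous batch's parameters and Fisher estimate are needed while training on the current batch, which matches the abstract's point that memory does not grow with the full dataset. The quadratic form is also the second-order approximation of a Kullback-Leibler divergence between nearby parameter settings.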

link

2025-02-18

Smoothed Analysis of Dynamic Graph Algorithms

Recent years have seen significant progress in the study of dynamic graph algorithms, and most notably, the introduction of strong lower bound techniques for them (e.g., Henzinger, Krinninger, Nanongkai and Saranurak, STOC 2015; Larsen and Yu, FOCS 2023).As worst-case analysis (adversarial inputs) may lead to the necessity of high running times, a natural question arises: in which cases are high running times really necessary, and in which cases these inputs merely manifest unique pathological cases? Early attempts to tackle this question were made by Nikoletseas, Reif, Spirakis and Yung (ICALP 1995) and by Alberts and Henzinger (Algorithmica 1998), who considered models with very little adversarial control over the inputs, and showed fast algorithms exist for them.The question was then overlooked for decades, until Henzinger, Lincoln and Saha (SODA 2022) recently addressed uniformly random inputs, and presented algorithms and impossibility results for several subgraph counting problems. To tackle the above question more thoroughly, we employ smoothed analysis, a celebrated framework introduced by Spielman and Teng (J. ACM, 2004). 0.645An input is proposed by an adversary but then a noisy version of it is processed by the algorithm instead.Parameterized by the amount of adversarial control, this input model fully interpolates between worst-case inputs and a uniformly random input.Doing so, we extend impossibility results for some problems to the smoothed model with only a minor quantitative loss.That is, we show that partially-adversarial inputs suffice to impose high running times for certain problems.In contrast, we show that other problems become easy even with the slightest amount of noise.In addition, we study the interplay between the adversary and the noise, leading to three natural models of smoothed inputs, for which we show a hierarchy of increasing complexity.
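
A simplified picture of the smoothed input model: an adversary proposes a graph, and each potential edge is then flipped independently with some probability. The sketch below uses a static graph and a single flip probability `p`. Both are assumptions for illustration, since the paper studies dynamic update sequences and three distinct noise models.

```python
import random
from itertools import combinations


def smooth(n, adversarial_edges, p, rng):
    """Flip each of the n*(n-1)/2 potential edges independently with probability p."""
    chosen = {frozenset(e) for e in adversarial_edges}
    noisy = set()
    for pair in combinations(range(n), 2):
        present = frozenset(pair) in chosen
        if rng.random() < p:
            present = not present
        if present:
            noisy.add(pair)
    return noisy


rng = random.Random(0)
path = [(0, 1), (1, 2), (2, 3)]
worst_case = smooth(4, path, 0.0, rng)   # no noise: the adversary's input survives
complement = smooth(4, path, 1.0, rng)   # every potential edge is flipped
uniform = smooth(4, path, 0.5, rng)      # each edge is present with probability 1/2
```

In this parameterization `p = 0` leaves the adversary in full control, while `p = 0.5` makes every edge present with probability one half regardless of the proposal, so the result is a uniformly random graph.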

link

2025-02-18

Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge

Large Language Models (LLMs) have significantly advanced medical question-answering by leveraging extensive clinical data and medical literature.However, the rapid evolution of medical knowledge and the labor-intensive process of manually updating domain-specific resources pose challenges to the reliability of these systems.To address this, we introduce Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence, such as PubMed and WikiSearch.By dynamically linking new findings and complex medical concepts, AMG-RAG not only improves accuracy but also enhances interpretability in medical queries. Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of 66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to 100 times larger. 0.617Notably, these improvements are achieved without increasing computational overhead, highlighting the critical role of automated knowledge graph generation and external evidence retrieval in delivering up-to-date, trustworthy medical insights.

link

2025-02-18

Fragility-aware Classification for Understanding Risk and Improving Generalization

Classification models play a critical role in data-driven decision-making applications such as medical diagnosis, user profiling, recommendation systems, and default detection.Traditional performance metrics, such as accuracy, focus on overall error rates but fail to account for the confidence of incorrect predictions, thereby overlooking the risk of confident misjudgments. 0.625This risk is particularly significant in cost-sensitive and safety-critical domains like medical diagnosis and autonomous driving, where overconfident false predictions may cause severe consequences.To address this issue, we introduce the Fragility Index (FI), a novel metric that evaluates classification performance from a risk-averse perspective by explicitly capturing the tail risk of confident misjudgments.To enhance generalizability, we define FI within the robust satisficing (RS) framework, incorporating data uncertainty.We further develop a model training approach that optimizes FI while maintaining tractability for common loss functions.Specifically, we derive exact reformulations for cross-entropy loss, hinge-type loss, and Lipschitz loss, and extend the approach to deep learning models.Through synthetic experiments and real-world medical diagnosis tasks, we demonstrate that FI effectively identifies misjudgment risk and FI-based training improves model robustness and generalizability.Finally, we extend our framework to deep neural network training, further validating its effectiveness in enhancing deep learning models.

link

2025-02-18

RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection

While recent low-cost radar-camera approaches have shown promising results in multi-modal 3D object detection, both sensors face challenges from environmental and intrinsic disturbances.Poor lighting or adverse weather conditions degrade camera performance, while radar suffers from noise and positional ambiguity.Achieving robust radar-camera 3D object detection requires consistent performance across varying conditions, a topic that has not yet been fully explored.In this work, we first conduct a systematic analysis of robustness in radar-camera detection on five kinds of noises and propose RobuRCDet, a robust object detection model in BEV.Specifically, we design a 3D Gaussian Expansion (3DGE) module to mitigate inaccuracies in radar points, including position, Radar Cross-Section (RCS), and velocity.The 3DGE uses RCS and velocity priors to generate a deformable kernel map and variance for kernel size adjustment and value distribution.Additionally, we introduce a weather-adaptive fusion module, which adaptively fuses radar and camera features based on camera signal confidence.Extensive experiments on the popular benchmark, nuScenes, show that our model achieves competitive results in regular and noisy conditions. 0.625

link

2025-02-18

Constrained Online Convex Optimization with Polyak Feasibility Steps

In this work, we study online convex optimization with a fixed constraint function $g : \mathbb{R}^d \rightarrow \mathbb{R}$.Prior work on this problem has shown $O(\sqrt{T})$ regret and cumulative constraint satisfaction $\sum_{t=1}^{T} g(x_t)\leq 0$, while only accessing the constraint value and subgradient at the played actions $g(x_t), \partial g(x_t)$. Using the same constraint information, we show a stronger guarantee of anytime constraint satisfaction $g(x_t)\leq 0\ \forall t\in [T]$, and matching $O(\sqrt{T})$ regret guarantees.These contributions are thanks to our approach of using Polyak feasibility steps to ensure constraint satisfaction, without sacrificing regret.Specifically, after each step of online gradient descent, our algorithm applies a subgradient descent step on the constraint function where the step-size is chosen according to the celebrated Polyak step-size.We further validate this approach with numerical experiments. 0.711
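
The two-step update described above can be sketched as follows: a gradient step on the loss, then a subgradient step on the constraint with the Polyak step size $g(x)/\|s\|^2$. This is a simplified illustration. It evaluates the constraint at the intermediate point and uses no safety margin, both of which are assumptions, so it does not reproduce the paper's anytime-feasibility guarantee for general constraints.

```python
def polyak_feasibility_step(x, g_value, g_subgrad):
    """Move x toward the set {g <= 0} using the Polyak step size g(x) / ||s||^2."""
    if g_value <= 0:
        return x  # already feasible, nothing to correct
    norm_sq = sum(s * s for s in g_subgrad)
    if norm_sq == 0:
        return x  # a zero subgradient gives no direction to move in
    step = g_value / norm_sq
    return [xi - step * si for xi, si in zip(x, g_subgrad)]


def ogd_with_polyak(x0, loss_grads, g, g_sub, eta):
    """Online gradient descent with a Polyak feasibility step after each update."""
    x = list(x0)
    played = []
    for grad in loss_grads:
        played.append(list(x))
        x = [xi - eta * gi for xi, gi in zip(x, grad)]
        x = polyak_feasibility_step(x, g(x), g_sub(x))
    return played, x


# Hypothetical instance: linear losses push outward, constraint is x0 + x1 <= 1.
g = lambda x: x[0] + x[1] - 1.0
g_sub = lambda x: [1.0, 1.0]
played, final = ogd_with_polyak([0.0, 0.0], [[-1.0, -1.0]] * 3, g, g_sub, eta=0.5)
```

For a linear constraint the Polyak step is an exact projection onto the boundary, which is why every played point in this toy run is feasible. For a nonlinear constraint a single step only moves toward the feasible set.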

link

2025-02-18

Theorem Prover as a Judge for Synthetic Data Generation

The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs).However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality.While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone.In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%.Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation.Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF).Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA. 0.666

link

2025-02-17

Algorithm Engineering of SSSP With Negative Edge Weights

Computing shortest paths is one of the most fundamental algorithmic graph problems. It has been known for decades that this problem can be solved in near-linear time if all weights are nonnegative. A recent breakthrough by [Bernstein, Nanongkai, Wulff-Nilsen '22] presented a randomized near-linear time algorithm for this problem. A subsequent improvement in [Bringmann, Cassis, Fischer '23] significantly reduced the number of logarithmic factors and thereby also simplified the algorithm. 0.609 It is surprising and exciting that both of these algorithms are combinatorial and do not contain any fundamental obstacles to being practical. We launch, to the best of our knowledge, the first extensive investigation towards a practical implementation of [Bringmann, Cassis, Fischer '23]. To this end, we give an accessible overview of the algorithm, discussing what adaptations are necessary to obtain a fast algorithm in practice. We manifest these adaptations in an efficient implementation. We test our implementation on a benchmark data set that is adapted to be more difficult for our implementation in order to allow for a fair comparison. 0.739 As there are multiple parameters to tune in [Bringmann, Cassis, Fischer '23] as well as in our implementation, we empirically evaluate their effect and thereby determine the best choices. 0.689 Our implementation is then extensively compared to one of the state-of-the-art algorithms for this problem [Goldberg, Radzik '93]. 0.689 On the hardest instance type, we are faster by up to almost two orders of magnitude.

link

2025-02-17

Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition

This paper investigates the integration of technical vocabulary in merged language models.We explore the knowledge transfer mechanisms involved when combining a general-purpose language-specific model with a domain-specific model, focusing on the resulting model's comprehension of technical jargon.Our experiments analyze the impact of this merging process on the target model's proficiency in handling specialized terminology.We present a quantitative evaluation of the performance of the merged model, comparing it with that of the individual constituent models. 0.621The findings offer insights into the effectiveness of different model merging methods for enhancing domain-specific knowledge and highlight potential challenges and future directions in leveraging these methods for cross-lingual knowledge transfer in Natural Language Processing.

link

2025-02-17

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans.However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation.In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability.In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. 0.655Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.

link

2025-02-17

Distributed Consensus Network: A Modularized Communication Framework and Reliability Probabilistic Analysis

In this paper, we propose a modularized framework for communication processes applicable to crash and Byzantine fault-tolerant consensus protocols.We abstract basic communication components and show that the communication process of the classic consensus protocols such as RAFT, single-decree Paxos, PBFT, and Hotstuff, can be represented by the combination of communication components.Based on the proposed framework, we develop an approach to analyze the consensus reliability of different protocols, where link loss and node failure are measured as a probability.We propose two latency optimization methods and implement a RAFT system to verify our theoretical analysis and the effectiveness of the proposed latency optimization methods. 0.606We also discuss decreasing consensus failure rate by adjusting protocol designs.This paper provides theoretical guidance for the design of future consensus systems with a low consensus failure rate and latency under the possible communication loss.
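
To illustrate what measuring link loss as a probability can look like, here is a minimal sketch for one leader-driven round. It is not the paper's model. It assumes a leader that needs acknowledgements from a majority of `n` nodes (counting itself), and that each follower's request/acknowledgement round trip succeeds independently with probability `p`.

```python
from math import comb


def quorum_probability(n, p):
    """Probability that a leader collects a majority in one round.

    A majority of n nodes is n // 2 + 1. The leader counts itself,
    so it needs n // 2 successful round trips out of n - 1 followers.
    """
    followers = n - 1
    need = n // 2
    return sum(
        comb(followers, k) * p**k * (1 - p) ** (followers - k)
        for k in range(need, followers + 1)
    )


three_nodes = quorum_probability(3, 0.5)
five_nodes = quorum_probability(5, 0.9)
```

With three nodes and a 50% round-trip success rate the round succeeds with probability 0.75. Per-round probabilities of this kind are the sort of building block a protocol-level reliability analysis would combine across communication phases.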

link
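
To make the idea of treating link loss and node failure as probabilities concrete, here is a minimal sketch of how a single Raft-style commit round could be scored. It is not the paper's model: the independence of failures, the two-message round trip per follower, and the names `commit_probability`, `link_loss`, and `node_failure` are all assumptions made for illustration.

```python
from math import comb


def commit_probability(n: int, link_loss: float, node_failure: float) -> float:
    """Probability that a leader gathers a majority in one round (toy model).

    Assumptions (not from the paper): the leader is alive, every follower
    fails independently with probability node_failure, and each of the two
    messages (request and acknowledgement) is lost independently with
    probability link_loss.
    """
    # A follower's acknowledgement arrives only if both messages survive
    # and the follower itself is up.
    q = (1 - link_loss) ** 2 * (1 - node_failure)
    followers = n - 1
    # The leader counts itself, so it needs n // 2 follower acks for a majority.
    needed = n // 2
    return sum(
        comb(followers, k) * q**k * (1 - q) ** (followers - k)
        for k in range(needed, followers + 1)
    )


if __name__ == "__main__":
    for loss in (0.0, 0.1, 0.3):
        print(loss, round(commit_probability(5, loss, 0.01), 4))
```

Under these assumptions the consensus failure rate of a round is simply one minus the returned value, which is the quantity a protocol designer would try to push down.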

2025-02-17

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. 0.601 There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.

link

2025-02-17

Minimal Ranks, Maximum Confidence: Parameter-efficient Uncertainty Quantification for LoRA

Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large language models by decomposing weight updates into low-rank matrices, significantly reducing storage and computational overhead. While effective, standard LoRA lacks mechanisms for uncertainty quantification, leading to overconfident and poorly calibrated models. Bayesian variants of LoRA address this limitation, but at the cost of a significantly increased number of trainable parameters, partially offsetting the original efficiency gains. Additionally, these models are harder to train and may suffer from unstable convergence. In this work, we propose a novel parameter-efficient Bayesian LoRA, demonstrating that effective uncertainty quantification can be achieved in very low-dimensional parameter spaces. The proposed method achieves strong performance with improved calibration and generalization while maintaining computational efficiency. 0.67 Our empirical findings show that, with the appropriate projection of the weight space: (1) uncertainty can be effectively modeled in a low-dimensional space, and (2) weight covariances exhibit low ranks.

link

2025-02-17

Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. 0.61 In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.

link

2025-02-17

Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening

We propose Diffusion-Sharpening, a fine-tuning approach that enhances downstream alignment by optimizing sampling trajectories. Existing RL-based fine-tuning methods focus on single training timesteps and neglect trajectory-level alignment, while recent sampling trajectory optimization methods incur significant inference NFE costs. Diffusion-Sharpening overcomes this by using a path integral framework to select optimal trajectories during training, leveraging reward feedback, and amortizing inference costs. Our method demonstrates superior training efficiency with faster convergence, and the best inference efficiency without requiring additional NFEs. 0.611 Extensive experiments show that Diffusion-Sharpening outperforms RL-based fine-tuning methods (e.g., Diffusion-DPO) and sampling trajectory optimization methods (e.g., Inference Scaling) across diverse metrics including text alignment, compositional capabilities, and human preferences, offering a scalable and efficient solution for future diffusion model fine-tuning. Code: https://github.com/Gen-Verse/Diffusion-Sharpening

link

2025-02-17

Diffusion Models without Classifier-free Guidance

This paper presents Model-guidance (MG), a novel objective for training diffusion models that addresses and removes the need for the commonly used Classifier-free guidance (CFG). Our approach goes beyond the standard modeling of the data distribution alone by also incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is simple yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieves exceptional quality that parallels and even surpasses concurrent diffusion models with CFG. Extensive experiments demonstrate its effectiveness, efficiency, and scalability on different models and datasets. 0.622 Finally, we establish state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.

link

LLMs

2025-02-20

Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis

With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. 0.641 Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. 0.678 To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.

link

2025-02-20

Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective

In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. 0.606 Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. 0.68 This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. 0.704 Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63× and 2.23× on CPU and GPU, respectively.

link
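
The arithmetic-progression allocation described above can be sketched in a few lines. This is an illustration rather than the authors' code: the function name `layer_sparsities`, the parameter `beta` for the common difference, and the choice to centre the progression so that the mean equals the target sparsity are all assumptions.

```python
def layer_sparsities(num_layers: int, target: float, beta: float) -> list[float]:
    """Per-layer sparsity rates forming an increasing arithmetic progression.

    Assumption (not stated in the abstract): the progression is centred so
    that the average over layers equals the overall target sparsity.
    beta is the single common-difference hyperparameter.
    """
    if beta <= 0:
        raise ValueError("beta must be positive for an increasing progression")
    start = target - beta * (num_layers - 1) / 2
    rates = [start + beta * i for i in range(num_layers)]
    if rates[0] < 0.0 or rates[-1] > 1.0:
        raise ValueError("beta is too large for this target and depth")
    return rates


if __name__ == "__main__":
    rates = layer_sparsities(num_layers=4, target=0.7, beta=0.02)
    print([round(r, 2) for r in rates])
```

Because earlier layers receive lower sparsity, less reconstruction error is injected where it would otherwise propagate through the most subsequent layers. Tuning then reduces to trying a few values of `beta`.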

2025-02-20

SurveyX: Academic Survey Automation via Large Language Models

Large Language Models (LLMs) have demonstrated exceptional comprehension capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for automated survey generation. 0.675 However, recent research related to automated survey generation remains constrained by some critical limitations like finite context window, lack of in-depth content discussion, and absence of systematic evaluation frameworks. Inspired by human writing processes, we propose SurveyX, an efficient and organized system for automated survey generation that decomposes the survey composing process into two phases: the Preparation and Generation phases. By innovatively introducing online reference retrieval, a pre-processing method called AttributeTree, and a re-polishing process, SurveyX significantly enhances the efficacy of survey composition. Experimental evaluation results show that SurveyX outperforms existing automated survey generation systems in content quality (0.259 improvement) and citation quality (1.76 enhancement), approaching human expert performance across multiple evaluation dimensions. Examples of surveys generated by SurveyX are available on www.surveyx.cn.

link

2025-02-20

Rapid Word Learning Through Meta In-Context Learning

Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word's usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. 0.694 These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.

link

2025-02-20

A Multi-Agent Perspective on Modern Information Retrieval

The rise of large language models (LLMs) has introduced a new era in information retrieval (IR), where queries and documents that were once assumed to be generated exclusively by humans can now also be created by automated agents. 0.669 These agents can formulate queries, generate documents, and perform ranking. This shift challenges some long-standing IR paradigms and calls for a reassessment of both theoretical frameworks and practical methodologies. We advocate for a multi-agent perspective to better capture the complex interactions between query agents, document agents, and ranker agents. Through empirical exploration of various multi-agent retrieval settings, we reveal the significant impact of these interactions on system performance. Our findings underscore the need to revisit classical IR paradigms and develop new frameworks for more effective modeling and evaluation of modern retrieval systems.

link

2025-02-20

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. 0.644 However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. 0.632 Our code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG.

link

2025-02-20

Optimizing Model Selection for Compound AI Systems

Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent-debate, achieve strong performance on many AI tasks. 0.628 We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? 0.718 We show that these LLM choices have a large effect on quality, but the search space is exponential. 0.706 We propose LLMSelector, an efficient framework for model selection in compound systems, which leverages two key empirical insights: (i) end-to-end performance is often monotonic in how well each module performs, with all other modules held fixed, and (ii) per-module performance can be estimated accurately by an LLM. Building upon these insights, LLMSelector iteratively selects one module and allocates to it the model with the highest module-wise performance, as estimated by an LLM, until no further gain is possible. 0.711 LLMSelector is applicable to any compound system with a bounded number of modules, and its number of API calls scales linearly with the number of modules, achieving high-quality model allocation both empirically and theoretically. 0.724 Experiments with popular compound systems such as multi-agent debate and self-refine using LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector confers 5%-70% accuracy gains compared to using the same LLM for all modules. 0.714

link
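
The iterative allocation loop described in this abstract can be sketched as a simple coordinate-wise search. This is a hedged illustration, not the LLMSelector implementation: `module_score` is a hypothetical callable standing in for the LLM-based estimator of module-wise performance, and `select_models` and `max_rounds` are names introduced here.

```python
from typing import Callable, Dict, List


def select_models(
    modules: List[str],
    models: List[str],
    module_score: Callable[[str, str, Dict[str, str]], float],
    max_rounds: int = 10,
) -> Dict[str, str]:
    """Greedy per-module model allocation (illustrative sketch).

    module_score(module, model, allocation) is a hypothetical estimator of how
    well `model` performs in `module` while the other modules keep their
    current allocation.
    """
    allocation = {module: models[0] for module in modules}
    for _ in range(max_rounds):
        changed = False
        for module in modules:
            current = allocation[module]
            best = max(models, key=lambda m: module_score(module, m, allocation))
            # Switch only on a strict improvement so the loop terminates.
            if module_score(module, best, allocation) > module_score(
                module, current, allocation
            ):
                allocation[module] = best
                changed = True
        if not changed:  # no module can be improved any further
            break
    return allocation


if __name__ == "__main__":
    # Toy score table standing in for the estimator; the values are made up.
    table = {
        ("generator", "model_a"): 0.5, ("generator", "model_b"): 0.9,
        ("critic", "model_a"): 0.8, ("critic", "model_b"): 0.3,
    }
    score = lambda module, model, alloc: table[(module, model)]
    print(select_models(["generator", "critic"], ["model_a", "model_b"], score))
```

Each pass evaluates every model once per module, so the number of estimator calls per round grows linearly with the number of modules instead of exponentially with the full search space.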

2025-02-20

Dynamic Low-Rank Sparse Adaptation for Large Language Models

Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it incurs significant performance degradation. 0.656 Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive way to counter this degradation, but it has two shortcomings: 0.648 1) the inability to integrate LoRA weights into sparse LLMs post-training, and 2) insufficient performance recovery at high sparsity ratios. 0.692 In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. 0.633 In addition, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby efficiently determining the layer-wise sparsity rates during fine-tuning. Building on this, LoSA adjusts the rank of the LoRA module based on the variability in layer-wise reconstruction errors, allocating an appropriate amount of fine-tuning to each layer to reduce the output discrepancies between dense and sparse LLMs. 0.637 Extensive experiments show that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inference burden. 0.693 For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by 68.73 and increased zero-shot accuracy by 16.32%, achieving a 2.60× speedup on CPU and 2.23× speedup on GPU, requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.

link
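
To illustrate why sparsifying the LoRA update with the weight mask makes post-training merging possible, here is a toy sketch on nested lists. It is not the LoSA code: the helper names `matmul` and `merge_masked_lora` are hypothetical, and inferring the mask from the non-zero entries of the pruned weight is an assumption.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_masked_lora(sparse_w: Matrix, lora_b: Matrix, lora_a: Matrix) -> Matrix:
    """Merge a low-rank update into a pruned weight without losing sparsity.

    Assumption: the sparsity mask is taken to be the set of non-zero entries
    of sparse_w. The update B @ A is dense, so adding it directly would fill
    the pruned positions; applying the mask first keeps them at zero.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + d if w != 0.0 else 0.0 for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(sparse_w, delta)
    ]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 2.0]]   # pruned weight, 50% sparse
    b = [[1.0], [2.0]]             # rank-1 LoRA factors
    a = [[0.5, 0.5]]
    print(merge_masked_lora(w, b, a))
```

The merged matrix has the same zero pattern as the pruned weight, so inference can keep using sparse kernels with no extra adapter branch.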

2025-02-20

eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables

Large Language Models (LLMs) have demonstrated exceptional versatility across diverse domains, yet their application in e-commerce remains underexplored due to a lack of domain-specific datasets. 0.648 To address this gap, we introduce eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, including detailed product attributes and user-specific queries. Leveraging eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to produce high-quality, attribute-specific product reviews from structured tabular data. 0.643 Fine-tuned models were rigorously evaluated using standard Table2Text metrics, alongside correctness, faithfulness, and fluency assessments. Our results demonstrate substantial improvements in generating contextually accurate reviews, highlighting the transformative potential of tailored datasets and fine-tuning methodologies in optimizing e-commerce workflows. This work highlights the potential of LLMs in e-commerce workflows and the essential role of domain-specific datasets in tailoring them to industry-specific challenges. 0.735

link

2025-02-20

Fundamental Limitations in Defending LLM Finetuning APIs

LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. 0.709 Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.

link

2025-02-20

Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs

While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. 0.662 Through analysis of LLM internal representations from over 1,000 language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. 0.679 Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. Our code is publicly available (https://github.com/dannigt/mid-align).

link

2025-02-20

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. 0.643 Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. 0.725 This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from the dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance. 0.653

link

2025-02-20

Revealing and Mitigating Over-Attention in Knowledge Editing

Large Language Models have demonstrated superior performance across a wide range of tasks, but they still exhibit undesirable errors due to incorrect knowledge learned from the training data. To avoid this, knowledge editing methods have emerged that precisely edit specific model knowledge by efficiently modifying a very small percentage of parameters. However, those methods can lead to the problem of Specificity Failure, where the existing knowledge and capabilities are severely degraded due to editing. Our preliminary analysis indicates that Specificity Failure primarily stems from the model's attention heads assigning excessive attention scores to entities related to the edited knowledge, thereby unduly focusing on specific snippets within the context, which we denote as the Attention Drift phenomenon. To mitigate this Attention Drift issue, we introduce a simple yet effective method, Selective Attention Drift Restriction (SADR), which introduces an additional regularization term during the knowledge editing process to restrict changes in the attention weight distribution, thereby preventing undue focus on the edited entity. Experiments on five frequently used strong LLMs demonstrate the effectiveness of our method, where SADR can significantly mitigate Specificity Failure in the predominant knowledge editing tasks. 0.7

link

2025-02-20

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. 0.608 With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.

link

2025-02-20

Red-Teaming LLM Multi-Agent Systems via Communication Attacks

Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. 0.648 While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. 0.702 Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. 0.674 Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems. 0.707

link

2025-02-20

GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks

Large Language Models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to efficiently construct reliable toolsets and are limited to single-task settings. 0.688 To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, TabMWP). Our results show that GATE achieves up to 4.3x faster milestone completion in Minecraft compared to the previous SOTA, and provides an average improvement of 9.23% over existing tool-making methods in code generation tasks and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, balancing tool quantity, complexity, and functionality while maintaining high efficiency. Code and data are available at https://github.com/ayanami2003/GATE.

link

2025-02-20

CLIPPER: Compression enables long-context synthetic data generation

LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. 0.702 We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).

link

2025-02-20

Prompt-to-Leaderboard

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. 0.694 This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM that takes natural language prompts as input and outputs a vector of Bradley-Terry coefficients, which are then used to predict the human preference vote. 0.631 The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. 0.664 In January 2025, the router we trained based on this methodology achieved the #1 spot in the Chatbot Arena leaderboard. Our code is available at this GitHub link: https://github.com/lmarena/p2l.

link
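
The Bradley-Terry step described in the abstract above can be illustrated with a minimal sketch. It assumes the standard Bradley-Terry form, where the probability that one model is preferred over another is the logistic function of the difference of their coefficients. The function names and the coefficient values are hypothetical and are not taken from the P2L code base.

```python
import math


def bt_win_probability(coef_a: float, coef_b: float) -> float:
    """Probability that model A is preferred over model B under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(coef_a - coef_b)))


def leaderboard(coefs: dict[str, float]) -> list[str]:
    """Rank models from highest to lowest coefficient."""
    return sorted(coefs, key=coefs.get, reverse=True)


# Hypothetical coefficients that a P2L-style model might emit for one prompt.
prompt_coefs = {"model_a": 1.2, "model_b": 0.4, "model_c": -0.3}

ranking = leaderboard(prompt_coefs)
p_a_over_b = bt_win_probability(prompt_coefs["model_a"], prompt_coefs["model_b"])
```

Because the coefficients depend on the prompt, the same pair of models can be ranked differently for different prompts, which is what makes routing on the top-ranked model plausible.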

2025-02-20

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. 0.669 To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12x speedup over the state-of-the-art speculative sampling method EAGLE-2.

link
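
As a rough illustration of the vocabulary compression idea in the abstract above, the sketch below ranks tokens by corpus frequency and evaluates the draft LM head only on the most frequent ones. It is a toy, pure-Python approximation under stated assumptions: the function names, the toy LM head, and the `keep` parameter are hypothetical, and the verification step with the full vocabulary (which is what would preserve the final output distribution) is not shown.

```python
from collections import Counter


def frequency_ranked_subset(token_ids, vocab_size, keep):
    """Return the ids of the `keep` most frequent tokens (ties broken by id)."""
    counts = Counter(token_ids)
    ranked = sorted(range(vocab_size), key=lambda t: (-counts[t], t))
    return sorted(ranked[:keep])


def draft_logits(hidden, lm_head, subset):
    """Compute LM-head logits only for the rows listed in `subset`."""
    return {t: sum(h * w for h, w in zip(hidden, lm_head[t])) for t in subset}


# Toy setup: 8-token vocabulary, 2-dimensional hidden state (all values made up).
corpus = [3, 3, 3, 1, 1, 7, 0]
subset = frequency_ranked_subset(corpus, vocab_size=8, keep=2)
lm_head = [[float(t), 1.0] for t in range(8)]  # one weight row per token
logits = draft_logits([1.0, 2.0], lm_head, subset)
best_draft_token = max(logits, key=logits.get)
```

Keeping a quarter of the rows would cut the number of LM-head dot products by 75%, which is the kind of saving the abstract reports; the exact subset size used by the authors is not stated here.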

2025-02-20

Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning

Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. 0.72 We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a "good" question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. 0.615 Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. 0.647 Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains. 0.738

link

2025-02-20

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. 0.607 To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. 0.63 This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. 0.671 This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. 0.665 Code is released at https://github.com/mit-han-lab/omniserve.

link

Developer Research

2025-02-18

Investigating Issues that Lead to Code Technical Debt in Machine Learning Systems

[Context] Technical debt (TD) in machine learning (ML) systems, much like its counterpart in software engineering (SE), holds the potential to lead to future rework, posing risks to productivity, quality, and team morale. Despite growing attention to TD in SE, the understanding of ML-specific code-related TD remains underexplored. [Objective] This paper aims to identify and discuss the relevance of code-related issues that lead to TD in ML code throughout the ML workflow. [Method] The study first compiled a list of 34 potential issues contributing to TD in ML code by examining the phases of the ML workflow, their typical associated activities, and problem types. This list was refined through two focus group sessions involving nine experienced ML professionals, where each issue was assessed based on its occurrence contributing to TD in ML code and its relevance. [Results] The list of issues contributing to TD in the source code of ML systems was refined from 34 to 30, with 24 of these issues considered highly relevant. The data pre-processing phase was the most critical, with 14 issues considered highly relevant. Shortcuts in code related to typical pre-processing tasks (e.g., handling missing values, outliers, inconsistencies, scaling, rebalancing, and feature selection) often result in "patch fixes" rather than sustainable solutions, leading to the accumulation of TD and increasing maintenance costs. 0.61 Relevant issues were also found in the data collection, model creation and training, and model evaluation phases. [Conclusion] We have made the final list of issues available to the community and believe it will help raise awareness about issues that need to be addressed throughout the ML workflow to reduce TD and improve the maintainability of ML code.

link

2025-02-18

Interactive Agents to Overcome Ambiguity in Software Engineering

AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) leveraging interactivity to improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c) asking targeted questions. Our findings reveal that models struggle to distinguish between well-specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user, leading to significant improvements in performance and underscoring the value of effective interaction. Our study highlights critical gaps in how current state-of-the-art models handle ambiguity in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements. 0.606

link

2025-02-11

Linting is People! Exploring the Potential of Human Computation as a Sociotechnical Linter of Data Visualizations

Traditionally, linters are code analysis tools that help developers by flagging potential issues from syntax and logic errors to enforcing syntactical and stylistic conventions. 0.611 Recently, linting has been taken as an interface metaphor, allowing it to be extended to more complex inputs, such as visualizations, which demand a broader perspective and alternative approach to evaluation. We explore a further extended consideration of linting inputs, and modes of evaluation, across the puritanical, neutral, and rebellious dimensions. We specifically investigate the potential for leveraging human computation in linting operations through Community Notes -- crowd-sourced contextual text snippets aimed at checking and critiquing potentially accurate or misleading content on social media. We demonstrate that human-powered assessments not only identify misleading or error-prone visualizations but that integrating human computation enhances traditional linting by offering social insights. As is required these days, we consider the implications of building linters powered by Artificial Intelligence.

link

2025-02-10

Combining Large Language Models with Static Analyzers for Code Review Generation

Code review is a crucial but often complex, subjective, and time-consuming activity in software development. 0.676 Over the past decades, significant efforts have been made to automate this process. Early approaches focused on knowledge-based systems (KBS) that apply rule-based mechanisms to detect code issues, providing precise feedback but struggling with complex, context-dependent cases. 0.604 More recent work has shifted toward fine-tuning pre-trained language models for code review, enabling broader issue coverage but often at the expense of precision. In this paper, we propose a hybrid approach that combines the strengths of KBS and learning-based systems (LBS) to generate high-quality, comprehensive code reviews. Our method integrates knowledge at three distinct stages of the language model pipeline: during data preparation (Data-Augmented Training, DAT), at inference (Retrieval-Augmented Generation, RAG), and after inference (Naive Concatenation of Outputs, NCO). We empirically evaluate our combination strategies against standalone KBS and LBS fine-tuned on a real-world dataset. Our results show that these hybrid strategies enhance the relevance, completeness, and overall quality of review comments, effectively bridging the gap between rule-based tools and deep learning models.

link

2025-02-10

On the Limitations of Combining Sentiment Analysis Tools in a Cross-Platform Setting

A positive working climate is essential in modern software development. 0.643 It enhances productivity since a satisfied developer tends to deliver better results. Sentiment analysis tools are a means to analyze and classify textual communication between developers according to the polarity of the statements. Most of these tools deliver promising results when used with test data from the domain they are developed for (e.g., GitHub). But the tools' outcomes lack reliability when used in a different domain (e.g., Stack Overflow). One possible way to mitigate this problem is to combine different tools trained in different domains. In this paper, we analyze a combination of three sentiment analysis tools in a voting classifier according to their reliability and performance. The tools are trained and evaluated using five already existing polarity data sets (e.g. from GitHub). The results indicate that this kind of combination of tools is a good choice in the within-platform setting. However, a majority vote does not necessarily lead to better results when applying in cross-platform domains. In most cases, the best individual tool in the ensemble is preferable. This is mainly due to the often large difference in performance of the individual tools, even on the same data set. However, this may also be due to the different annotated data sets.

link
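
The voting classifier mentioned in the abstract above can be sketched as a plain majority vote over the polarity labels returned by three tools. This is only an illustration: the label names, the function name, and the tie-breaking rule (falling back to "neutral" when no label has a strict majority) are assumptions, not details from the paper.

```python
from collections import Counter


def majority_vote(labels, fallback="neutral"):
    """Return the most common label, or `fallback` when the top two are tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


# Hypothetical outputs of three sentiment tools for two developer comments.
agreed = majority_vote(["positive", "positive", "negative"])
split = majority_vote(["positive", "negative", "neutral"])
```

A vote like this gives every tool equal weight, which is one way to read the paper's finding: when one tool is much stronger than the others on a platform, two weaker tools can outvote it.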

2025-02-06

SPRINT: An Assistant for Issue Report Management

Managing issue reports is essential for the evolution and maintenance of software systems. 0.626 However, manual issue management tasks such as triaging, prioritizing, localizing, and resolving issues are highly resource-intensive for projects with large codebases and users. To address this challenge, we present SPRINT, a GitHub application that utilizes state-of-the-art deep learning techniques to streamline issue management tasks. SPRINT assists developers by: (i) identifying existing issues similar to newly reported ones, (ii) predicting issue severity, and (iii) suggesting code files that likely require modification to solve the issues. 0.637 We evaluated SPRINT using existing datasets and methodologies, measuring its predictive performance, and conducted a user study with five professional developers to assess its usability and usefulness. The results show that SPRINT is accurate, usable, and useful, providing evidence of its effectiveness in assisting developers in managing issue reports. SPRINT is an open-source tool available at https://github.com/sea-lab-wm/sprint.

link

2025-02-06

Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps

Bug reports are essential for developers to confirm software problems, investigate their causes, and validate fixes. 0.734 Unfortunately, reports often miss important information or are written unclearly, which can cause delays, increased issue resolution effort, or even the inability to solve issues. One of the most common components of reports that are problematic is the steps to reproduce the bug(s) (S2Rs), which are essential to replicate the described program failures and reason about fixes. Given the proclivity for deficiencies in reported S2Rs, prior work has proposed techniques that assist reporters in writing or assessing the quality of S2Rs. However, automated understanding of S2Rs is challenging, and requires linking nuanced natural language phrases with specific, semantically related program information. Prior techniques often struggle to form such language to program connections - due to issues in language variability and limitations of information gleaned from program analyses. To more effectively tackle the problem of S2R quality annotation, we propose a new technique called AstroBR, which leverages the language understanding capabilities of LLMs to identify and extract the S2Rs from bug reports and map them to GUI interactions in a program state model derived via dynamic analysis. We compared AstroBR to a related state-of-the-art approach and we found that AstroBR annotates S2Rs 25.2% better (in terms of F1 score) than the baseline. Additionally, AstroBR suggests more accurate missing S2Rs than the baseline (by 71.4% in terms of F1 score).

link

2025-02-05

Leveraging Creativity as a Problem Solving Tool in Software Engineering

Today's software engineering (SE) complexities require a more diverse tool set going beyond technical expertise to be able to successfully tackle all challenges. Previous studies have indicated that creativity is a prime indicator for overcoming these hurdles. In this paper, we port results from creativity research in the field of cognitive psychology to the field of SE. After all, programming is a highly creative endeavour. We explore how to leverage creativity as a practical problem solving tool to wield for software developers. 0.693 The seven distinct but intertwined creative problem solving themes unfolded in this paper are accompanied with practical perspectives, specifically geared for software professionals. Just like technical skills such as knowledge of programming languages, we believe that creativity can be learned and improved with practice.

link

2025-02-05

Harnessing Large Language Models for Curated Code Reviews

In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes that ensure an efficient code review process. 0.707 Well-crafted comments not only streamline the code review itself but are also essential for subsequent tasks like code refinement, where the code is modified to satisfy the input review comment. Although various AI-based approaches aimed to automate comment generation, their effectiveness remains limited by the quality of the training data. Existing code review datasets are often noisy and unrefined, posing limitations to the learning potential of AI models and hindering the automation process. To address these challenges, we propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset. We begin by establishing an evaluation framework, incorporating specific criteria and categories to empirically study the initial quality of the dataset. Using a large language model (LLM)-driven approach, we then apply our curation pipeline to refine the dataset. A comparative analysis of the newly curated dataset, based on the same evaluation framework, demonstrates substantial improvements in the clarity and conciseness of the comments. Additionally, we assess the impact of the curated dataset on automating downstream tasks, specifically comment generation and code refinement. Our findings show that the curated dataset leads to enhanced model performance in generating more accurate comments. Curated comments are also more useful as they lead to more accurate code refinement.

link

2025-02-04

Innovating the software engineering class through multi-team development

Often software engineering classes have the student concentrate on designing and planning the project but stop short of actual student team development of code. 0.653 This leads to criticism by employers of new graduates that they are missing skills in working in teams and coordinating multiple overlapping changes to a code base. Additionally, students that are not actively experiencing team development are unprepared to understand and modify existing legacy-code bases written by others. 0.601 This paper presents a new approach to teaching undergraduate software engineering that emphasizes not only software engineering methodology but also experiencing development as a member of a team and modifying a legacy code base. 0.64 Our innovative software engineering course begins with learning the fundamentals of software engineering, followed by examining an existing framework of a social media application. The students are then grouped into multiple software teams, each focusing on a different aspect of the app. The separate teams must define requirements, design, and provide documentation on the services. Using an Agile development approach, the teams incrementally add to the code base and demonstrate features as the application evolves. Subsequent iterations of the class pick up the prior students' code base, providing experience working with a legacy code base. Preliminary results of using this approach at the university are presented in this paper including quantitative analysis. Analysis of student software submissions to the cloud-based code repository shows student engagement and contributions over the span of the course. Positive student evaluations show the effectiveness of applying the principles of software engineering to the development of a complex solution in a team environment. 0.626 Keywords: Software engineering, teaching, college computer science, innovative methods, agile.

link

Data Annotation Techniques