Vincent's Arxiv FrontPage


Generated on 2025-01-21.


This frontpage is made by scraping arXiv and running a sentence model that detects whether an abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions.
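As a rough illustration of the scoring step described above, the sketch below splits an abstract into sentences, scores each one, and keeps those above a threshold. Everything here is an assumption made for illustration: the function names, the keyword-based `toy_score` (a stand-in for the actual sentence model, which is not shown on this page), and the 0.7 threshold.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(abstract: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for the sentence model: 0.4 per keyword hit, capped at 1."""
    keywords = ("dataset", "corpus", "benchmark")
    hits = sum(k in sentence.lower() for k in keywords)
    return min(1.0, 0.4 * hits)


def flag_sentences(
    abstract: str,
    score: Callable[[str], float] = toy_score,
    threshold: float = 0.7,  # assumed cut-off, not taken from the page
) -> List[Tuple[str, float]]:
    """Return (sentence, score) pairs whose score reaches the threshold."""
    scored = [(s, score(s)) for s in split_sentences(abstract)]
    return [(s, p) for s, p in scored if p >= threshold]


if __name__ == "__main__":
    demo = "We study birds. We release a new dataset and benchmark. Results are good."
    for sentence, p in flag_sentences(demo):
        print(f"{p:.2f}  {sentence}")
```

A real pipeline would swap `toy_score` for a trained classifier over sentence embeddings; the surrounding split, score, and filter steps would stay the same.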


New Datasets

2025-01-16

Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis

Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at https://github.com/TingxuanSix/Surg-FTDA. 0.85


2025-01-16

Sequential PatchCore: Anomaly Detection for Surface Inspection using Synthetic Impurities

The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is an often-mentioned issue that causes degradation of automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method to include photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are further used to train an anomaly detection model and investigate the influence of water stains. 0.812 The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore - a method to build coresets sequentially and make training on large images using consumer-grade hardware tractable. This allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data for pre-training an explicit coreset anomaly model and the extended performance benefits of finetuning the coreset using real data. We observed how the impurities and labelling ambiguity lower the model performance and have additionally reported the defect-wise recall to provide an industrially relevant perspective on model performance.


2025-01-16

IFRA: a machine learning-based Instrumented Fall Risk Assessment Scale derived from Instrumented Timed Up and Go test in stroke patients

Effective fall risk assessment is critical for post-stroke patients. The present study proposes a novel, data-informed fall risk assessment method based on the instrumented Timed Up and Go (ITUG) test data, bringing in many mobility measures that traditional clinical scales fail to capture. IFRA, which stands for Instrumented Fall Risk Assessment, has been developed using a two-step process: first, features with the highest predictive power among those collected in an ITUG test have been identified using machine learning techniques; then, a strategy is proposed to stratify patients into low, medium, or high-risk strata. The dataset used in our analysis consists of 142 participants, out of which 93 were used for training (15 synthetically generated), 17 for validation and 32 to test the resulting IFRA scale (22 non-fallers and 10 fallers). 0.781 Features considered in the IFRA scale include gait speed, vertical acceleration during sit-to-walk transition, and turning angular velocity, which align well with established literature on the risk of fall in neurological patients. In a comparison with traditional clinical scales such as the traditional Timed Up & Go and the Mini-BESTest, IFRA demonstrates competitive performance, being the only scale to correctly assign more than half of the fallers to the high-risk stratum (Fisher's exact test p = 0.004). Despite the dataset's limited size, this is the first proof-of-concept study to pave the way for future evidence regarding the use of IFRA tool for continuous patient monitoring and fall prevention both in clinical stroke rehabilitation and at home post-discharge.


2025-01-16

CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding

In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. 0.764 Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system's suitability for industrial applications.


2025-01-16

The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead. 0.788
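To make "deduplicated with respect to other open datasets" concrete, here is a minimal sketch of exact-match deduplication against reference corpora. The abstract does not say which method The Heap uses, so the hashing scheme, the whitespace normalization, and the function names below are all assumptions; near-duplicate detection is not covered.

```python
import hashlib
from typing import Iterable, List


def fingerprint(source: str) -> str:
    """Hash a file after trimming trailing whitespace (assumed normalization)."""
    normalized = "\n".join(line.rstrip() for line in source.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_known(candidates: Iterable[str], reference_corpora: Iterable[Iterable[str]]) -> List[str]:
    """Keep candidate files that appear neither in any reference corpus nor earlier in the candidates."""
    seen = {fingerprint(f) for corpus in reference_corpora for f in corpus}
    kept = []
    for f in candidates:
        h = fingerprint(f)
        if h not in seen:
            seen.add(h)  # also drops repeats within the candidate set
            kept.append(f)
    return kept


if __name__ == "__main__":
    new_files = ["print(1)\n", "print(2)", "print(1)  "]
    open_datasets = [["print(2)\n"]]
    print(remove_known(new_files, open_datasets))
```

Exact hashing only catches byte-identical files after normalization; a production pipeline would likely add a near-duplicate pass on top.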


2025-01-16

Cueless EEG imagined speech for subject identification: dataset and benchmarks

Electroencephalogram (EEG) signals have emerged as a promising modality for biometric identification. While previous studies have explored the use of imagined speech with semantically meaningful words for subject identification, most have relied on additional visual or auditory cues. In this study, we introduce a cueless EEG-based imagined speech paradigm, where subjects imagine the pronunciation of semantically meaningful words without any external cues. This innovative approach addresses the limitations of prior methods by requiring subjects to select and imagine words from a predefined list naturally. The dataset comprises over 4,350 trials from 11 subjects across five sessions. 0.775 We assess a variety of classification methods, including traditional machine learning techniques such as Support Vector Machines (SVM) and XGBoost, as well as time-series foundation models and deep learning architectures specifically designed for EEG classification, such as EEG Conformer and Shallow ConvNet. A session-based hold-out validation strategy was employed to ensure reliable evaluation and prevent data leakage. Our results demonstrate outstanding classification accuracy, reaching 97.93%. These findings highlight the potential of cueless EEG paradigms for secure and reliable subject identification in real-world applications, such as brain-computer interfaces (BCIs).
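The session-based hold-out mentioned in the abstract can be sketched as follows: every trial from a held-out session goes to the test set, so no recording session contributes to both sides. The trial record layout and the choice to hold out session 5 are assumptions for illustration; the abstract does not state which sessions were held out.

```python
from typing import Dict, List, Set, Tuple

Trial = Dict[str, int]


def session_holdout(trials: List[Trial], test_sessions: Set[int]) -> Tuple[List[Trial], List[Trial]]:
    """Split trials so that whole sessions land entirely in train or entirely in test."""
    train = [t for t in trials if t["session"] not in test_sessions]
    test = [t for t in trials if t["session"] in test_sessions]
    return train, test


if __name__ == "__main__":
    # Toy data: 2 subjects, 5 sessions, one trial each (hypothetical layout).
    trials = [{"subject": s, "session": n} for s in range(2) for n in range(1, 6)]
    train, test = session_holdout(trials, {5})
    print(len(train), len(test))
```

Splitting by session rather than by trial is what guards against leakage: trials recorded in the same sitting share electrode placement and noise conditions, which a classifier could otherwise exploit.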


2025-01-16

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. 0.809 We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.


2025-01-15

MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation

Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101 - while also significantly improving computational and memory efficiency. 0.847


2025-01-15

CveBinarySheet: A Comprehensive Pre-built Binaries Database for IoT Vulnerability Analysis

Binary Static Code Analysis (BSCA) is a pivotal area in software vulnerability research, focusing on the precise localization of vulnerabilities within binary executables. Despite advancements in BSCA techniques, there is a notable scarcity of comprehensive and readily usable vulnerability datasets tailored for diverse environments such as IoT, UEFI, and MCU firmware. To address this gap, we present CveBinarySheet, a meticulously curated database containing 1033 CVE entries spanning from 1999 to 2024. 0.822 Our dataset encompasses 16 essential third-party components, including busybox and curl, and supports five CPU architectures: x86-64, i386, MIPS, ARMv7, and RISC-V64. Each precompiled binary is available at two compiler optimization levels (O0 and O3), facilitating comprehensive vulnerability analysis under different compilation scenarios. By providing detailed metadata and diverse binary samples, CveBinarySheet aims to accelerate the development of state-of-the-art BSCA tools, binary similarity analysis, and vulnerability matching applications.


2025-01-15

Profile and neighbourhood complexity of graphs with excluded minors and tree-structured graphs

The $r$-neighbourhood complexity of a graph $G$ is the function counting, for a given integer $k$, the largest possible number, over all vertex-subsets $A$ of size $k$, of subsets of $A$ realized as the intersection between the $r$-neighbourhood of some vertex and $A$. A refinement of this notion is the $r$-profile complexity, that counts the maximum number of distinct distance-vectors from any vertex to the vertices of $A$, ignoring distances larger than $r$. Typically, in structured graph classes such as graphs of bounded VC-dimension or chordal graphs, these functions are bounded, leading to insights into their structural properties and efficient algorithms. We improve existing bounds on the $r$-profile complexity (and thus on the $r$-neighbourhood complexity) for graphs in several structured graph classes. We show that the $r$-profile complexity of graphs excluding $K_h$ as a minor is in $O_h(r^{3h-3}k)$. For graphs of treewidth at most $t$ we give a bound in $O_t(r^{t+1}k)$, which is tight up to a function of $t$ as a factor. These bounds improve results and answer a question of Joret and Rambaud [Combinatorica, 2024]. 0.702 For outerplanar graphs, we can improve our treewidth bound by a factor of $r$ and conjecture that a similar improvement holds for graphs with bounded simple treewidth. For graphs of treelength at most $\ell$, we give the upper bound in $O(k(r^2(\ell+1)^k))$. Our bounds also imply relations between the order, diameter and metric dimension of graphs in these classes, improving results from [Beaudou et al., SIDMA 2017].
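The two definitions at the start of this abstract can be checked by brute force on small graphs. The sketch below is an illustration only, under two assumptions: the $r$-neighbourhood of a vertex is taken to be closed (it contains the vertex itself), and "ignoring distances larger than $r$" is modelled by capping every distance at $r+1$. The function names are hypothetical and the running time is exponential in $k$.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Hashable]]


def bfs_distances(adj: Graph, source: Hashable) -> Dict[Hashable, int]:
    """Unweighted shortest-path distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def complexities(adj: Graph, r: int, k: int) -> Tuple[int, int]:
    """Return (r-neighbourhood complexity, r-profile complexity) at subset size k."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    best_nbhd = best_profile = 0
    for A in combinations(sorted(adj), k):
        traces, profiles = set(), set()
        for v in adj:
            # Distances above r (or unreachable) are all recorded as r + 1.
            capped = tuple(min(dist[v].get(a, r + 1), r + 1) for a in A)
            profiles.add(capped)
            traces.add(frozenset(a for a, d in zip(A, capped) if d <= r))
        best_nbhd = max(best_nbhd, len(traces))
        best_profile = max(best_profile, len(profiles))
    return best_nbhd, best_profile


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # path on four vertices
    print(complexities(path, r=1, k=2))  # (3, 4)
```

Since each trace is determined by the corresponding capped distance vector, the profile count is never smaller than the neighbourhood count, which is why a bound on the former gives a bound on the latter.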


2025-01-15

Empowering Agricultural Insights: RiceLeafBD - A Novel Dataset and Optimal Model Selection for Rice Leaf Disease Diagnosis through Transfer Learning Technique

The number of people living in this agricultural nation of ours, which is surrounded by lush greenery, is growing on a daily basis. As a result of this, the level of arable land is decreasing, as well as residential houses and industrial factories. The food crisis is becoming the main threat for us in the upcoming days. Because on the one hand, the population is increasing, and on the other hand, the amount of food crop production is decreasing due to the attack of diseases. Rice is one of the most significant cultivated crops since it provides food for more than half of the world's population. Bangladesh is dependent on rice (Oryza sativa) as a vital crop for its agriculture, but it faces a significant problem as a result of the ongoing decline in rice yield brought on by common diseases. Early disease detection is the main difficulty in rice crop cultivation. In this paper, we proposed our own dataset, which was collected from the Bangladesh field, and also applied deep learning and transfer learning models for the evaluation of the datasets. 0.779 We elaborately explain our dataset and also give direction for further research work to serve society using this dataset. 0.704 We applied a light CNN model and pre-trained InceptionNet-V2, EfficientNet-V2, and MobileNet-V2 models, which achieved 91.5% performance for the EfficientNet-V2 model of this work. The results obtained outperformed other models and even exceeded approaches that are considered to be part of the state of the art. It has been demonstrated by this study that it is possible to precisely and effectively identify diseases that affect rice leaves using this unbiased dataset. After analysis of the performance of different models, the proposed datasets are significant for the society for research work to provide solutions for decreasing rice leaf disease.


2025-01-15

Modeling Melt Pool Features and Spatter Using Symbolic Regression and Machine Learning

Additive manufacturing (AM) is a rapidly evolving technology that has attracted applications across a wide range of fields due to its ability to fabricate complex geometries. However, one of the key challenges in AM is achieving consistent print quality. This inconsistency is often attributed to uncontrolled melt pool dynamics, partly caused by spatter which can lead to defects. Therefore, capturing and controlling the evolution of the melt pool is crucial for enhancing process stability and part quality. In this study, we developed a framework to support decision-making in AM operations, facilitating quality control and minimizing defects via machine learning (ML) and polynomial symbolic regression models. We implemented experimentally validated computational tools as a cost-effective approach to collect large datasets from laser powder bed fusion (LPBF) processes. For a dataset consisting of 281 process conditions, parameters such as melt pool dimensions (length, width, depth), melt pool geometry (area, volume), and volume indicated as spatter were extracted. 0.811 Using machine learning (ML) and polynomial symbolic regression models, a high R2 of over 95 % was achieved in predicting the melt pool dimensions and geometry features for both the training and testing datasets, with either process conditions (power and velocity) or melt pool dimensions as the model inputs. In the case of volume indicated as spatter, R2 improved after logarithmically transforming the model inputs, which were either the process conditions or the melt pool dimensions. Among the investigated ML models, the ExtraTree model achieved the highest R2 values of 96.7 % and 87.5 %.


2025-01-15

Learning Joint Denoising, Demosaicing, and Compression from the Raw Natural Image Noise Dataset

This paper introduces the Raw Natural Image Noise Dataset (RawNIND), a diverse collection of paired raw images designed to support the development of denoising models that generalize across sensors, image development workflows, and styles. 0.863 Two denoising methods are proposed: one operates directly on raw Bayer data, leveraging computational efficiency, while the other processes linear RGB images for improved generalization to different sensors, with both preserving flexibility for subsequent development. Both methods outperform traditional approaches which rely on developed images. Additionally, the integration of denoising and compression at the raw data level significantly enhances rate-distortion performance and computational efficiency. These findings suggest a paradigm shift toward raw data workflows for efficient and flexible image processing.


2025-01-15

Visual WetlandBirds Dataset: Bird Species Identification and Behavior Recognition in Videos

The current biodiversity loss crisis makes animal monitoring a relevant field of study. In light of this, data collected through monitoring can provide essential insights, and information for decision-making aimed at preserving global biodiversity. Despite the importance of such data, there is a notable scarcity of datasets featuring videos of birds, and none of the existing datasets offer detailed annotations of bird behaviors in video format. 0.709 In response to this gap, our study introduces the first fine-grained video dataset specifically designed for bird behavior detection and species classification. 0.724 This dataset addresses the need for comprehensive bird video datasets and provides detailed data on bird actions, facilitating the development of deep learning models to recognize these, similar to the advancements made in human action recognition. 0.826 The proposed dataset comprises 178 videos recorded in Spanish wetlands, capturing 13 different bird species performing 7 distinct behavior classes. 0.866 In addition, we also present baseline results using state of the art models on two tasks: bird behavior recognition and species classification.


2025-01-15

CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation

Localizing text descriptions in large-scale 3D scenes is inherently an ambiguous task. This nonetheless arises while describing general concepts, e.g. all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of distribution is required. In this paper, we generate the distribution of the camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses the noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions, using the pre-trained text encoders. The connection between text descriptions and pose distribution is established through pretrained Vision-Language-Model, i.e. CLIP. Furthermore, we demonstrate that the candidate poses for the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description, through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. Our proposed method consistently outperforms these baselines across all five large-scale datasets. Our source code and dataset will be made publicly available. 0.927


2025-01-15

CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. 0.866 The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.


2025-01-15

Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

As Large Language Models (LLMs) and generative AI become increasingly widespread, concerns about content safety have grown in parallel. Currently, there is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks and are usable for commercial applications. To bridge this gap, we propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. This taxonomy is designed to meet the diverse requirements of downstream users, offering more granular and flexible tools for managing various risk types. Using a hybrid data generation pipeline that combines human annotations with a multi-LLM "jury" system to assess the safety of responses, we obtain Aegis 2.0, a carefully curated collection of 34,248 samples of human-LLM interactions, annotated according to our proposed taxonomy. 0.814 To validate its effectiveness, we demonstrate that several lightweight models, trained using parameter-efficient techniques on Aegis 2.0, achieve performance competitive with leading safety models fully fine-tuned on much larger, non-commercial datasets. In addition, we introduce a novel training blend that combines safety with topic following data. This approach enhances the adaptability of guard models, enabling them to generalize to new risk categories defined during inference. We plan to open-source Aegis 2.0 data and models to the research community to aid in the safety guardrailing of LLMs.


2025-01-14

SAR Strikes Back: A New Hope for RSVQA

Remote sensing visual question answering (RSVQA) is a task that automatically extracts information from satellite images and processes a question to predict the answer from the images in textual form, helping with the interpretation of the image. While different methods have been proposed to extract information from optical images with different spectral bands and resolutions, no method has been proposed to answer questions from Synthetic Aperture Radar (SAR) images. SAR images capture electromagnetic information from the scene, and are less affected by atmospheric conditions, such as clouds. In this work, our objective is to introduce SAR in the RSVQA task, finding the best way to use this modality. In our research, we carry out a study on different pipelines for the task of RSVQA taking into account information from both SAR and optical data. To this purpose, we also present a dataset that allows for the introduction of SAR images in the RSVQA framework. 0.75 We propose two different models to include the SAR modality. The first one is an end-to-end method in which we add an additional encoder for the SAR modality. In the second approach, we build on a two-stage framework. First, relevant information is extracted from SAR and, optionally, optical data. This information is then translated into natural language to be used in the second step which only relies on a language model to provide the answer. We find that the second pipeline allows us to obtain good results with SAR images alone. We then try various types of fusion methods to use SAR and optical images together, finding that a fusion at the decision level achieves the best results on the proposed dataset. We show that SAR data offers additional information when fused with the optical modality, particularly for questions related to specific land cover classes, such as water areas.


2025-01-14

Bootstrapping Corner Cases: High-Resolution Inpainting for Safety Critical Detect and Avoid for Automated Flying

Modern machine learning techniques have shown tremendous potential, especially for object detection on camera images. For this reason, they are also used to enable safety-critical automated processes such as autonomous drone flights. We present a study on object detection for Detect and Avoid, a safety critical function for drones that detects air traffic during automated flights for safety reasons. An ill-posed problem is the generation of good and especially large data sets, since detection itself is the corner case. Most models suffer from limited ground truth in raw data, e.g., recorded air traffic or frontal flight with a small aircraft. It often leads to poor and critical detection rates. We overcome this problem by using inpainting methods to bootstrap the dataset such that it explicitly contains the corner cases of the raw data. 0.712 We provide an overview of inpainting methods and generative models and present an example pipeline given a small annotated dataset. We validate our method by generating a high-resolution dataset, which we make publicly available and present it to an independent object detector that was fully trained on real data.


2025-01-14

CG-MER: A Card Game-based Multimodal dataset for Emotion Recognition

The field of affective computing has seen significant advancements in exploring the relationship between emotions and emerging technologies. This paper presents a novel and valuable contribution to this field with the introduction of a comprehensive French multimodal dataset designed specifically for emotion recognition. 0.816 The dataset encompasses three primary modalities: facial expressions, speech, and gestures, providing a holistic perspective on emotions. 0.798 Moreover, the dataset has the potential to incorporate additional modalities, such as Natural Language Processing (NLP) to expand the scope of emotion recognition research. The dataset was curated through engaging participants in card game sessions, where they were prompted to express a range of emotions while responding to diverse questions. The study included 10 sessions with 20 participants (9 females and 11 males). The dataset serves as a valuable resource for furthering research in emotion recognition and provides an avenue for exploring the intricate connections between human emotions and digital technologies. 0.805


2025-01-14

OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. 0.823 This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. 0.829 The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. 0.745 Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.


2025-01-14

ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model's response to the knowledge base without penalising conversational elements. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, clinical, and non-clinical out-of-domain scenarios. We demonstrate that CF can predict human ratings of faithfulness better than existing definitions for conversational use cases. Furthermore, we show that evaluation using our triad consisting of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. Finally, using nine different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development. 0.762

link

2025-01-14

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. 0.761 We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and zero- and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.

link

2025-01-14

Exploring Robustness of LLMs to Sociodemographically-Conditioned Paraphrasing

Large Language Models (LLMs) have shown impressive performance in various NLP tasks. However, there are concerns about their reliability in different domains of linguistic variations. Many works have proposed robustness evaluation measures for local adversarial attacks, but we need globally robust models unbiased to different language styles. We take a broader approach to explore a wider range of variations across sociodemographic dimensions to perform structured reliability tests on the reasoning capacity of language models. We extend the SocialIQA dataset to create diverse paraphrased sets conditioned on sociodemographic styles. The assessment aims to provide a deeper understanding of LLMs in (a) their capability of generating demographic paraphrases with engineered prompts and (b) their reasoning capabilities in real-world, complex language scenarios. We also explore measures such as perplexity, explainability, and ATOMIC performance of paraphrases for fine-grained reliability analysis of LLMs on these sets. We find that demographic-specific paraphrasing significantly impacts the performance of language models, indicating that the subtleties of language variations remain a significant challenge. The code and dataset will be made available for reproducibility and future research. 0.816

link

2025-01-14

AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. 0.773 Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. 0.719 The datasets, individual annotations, and hate speech and offensive language lexicons are available on https://github.com/AfriHate/AfriHate 0.81

link

2025-01-14

Everybody Likes to Sleep: A Computer-Assisted Comparison of Object Naming Data from 30 Languages

Object naming - the act of identifying an object with a word or a phrase - is a fundamental skill in interpersonal communication, relevant to many disciplines, such as psycholinguistics, cognitive linguistics, or language and vision research. Object naming datasets, which consist of concept lists with picture pairings, are used to gain insights into how humans access and select names for objects in their surroundings and to study the cognitive processes involved in converting visual stimuli into semantic concepts. Unfortunately, object naming datasets often lack transparency and have a highly idiosyncratic structure. Our study tries to make current object naming data transparent and comparable by using a multilingual, computer-assisted approach that links individual items of object naming lists to unified concepts. Our current sample links 17 object naming datasets that cover 30 languages from 10 different language families. 0.937 We illustrate how the comparative dataset can be explored by searching for concepts that recur across the majority of datasets and comparing the conceptual spaces of covered object naming datasets with classical basic vocabulary lists from historical linguistics and linguistic typology. 0.827 Our findings can serve as a basis for enhancing cross-linguistic object naming research and as a guideline for future studies dealing with object naming tasks.

link

2025-01-14

GameFactory: Creating New Games with Generative Interactive Videos

Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and a small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. 0.732 Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/. 0.894

link

2025-01-14

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). 0.797 Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.

link

2025-01-14

PokerBench: Training Large Language Models to become Professional Poker Players

We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench. 0.838

link

2025-01-13

Empirical Comparison of Four Stereoscopic Depth Sensing Cameras for Robotics Applications

Depth sensing is an essential technology in robotics and many other fields. Many depth sensing (or RGB-D) cameras are available on the market and selecting the best one for your application can be challenging. In this work, we tested four stereoscopic RGB-D cameras that sense the distance by using two images from slightly different views. We empirically compared four cameras (Intel RealSense D435, Intel RealSense D455, StereoLabs ZED 2, and Luxonis OAK-D Pro) in three scenarios: (i) planar surface perception, (ii) plastic doll perception, (iii) household object perception (YCB dataset). We recorded and evaluated more than 3,000 RGB-D frames for each camera. For table-top robotics scenarios with distance to objects up to one meter, the best performance is provided by the D435 camera. For longer distances, the other three models perform better, making them more suitable for some mobile robotics applications. OAK-D Pro additionally offers integrated AI modules (e.g., object and human keypoint detection). ZED 2 is not a standalone device and requires a computer with a GPU for depth data acquisition. All data (more than 12,000 RGB-D frames) are publicly available at https://osf.io/f2seb. 0.908

link
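
The stereo cameras compared above recover distance from the shift (disparity) of a point between two views. A minimal sketch of the standard rectified-stereo relation, depth = focal length × baseline / disparity, may make the distance-dependent behaviour easier to picture. The function names and the numeric values (focal length, baseline) are illustrative assumptions, not specifications of any of the four cameras.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in metres of a point seen by a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error_per_pixel(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Approximate depth change caused by a one-pixel disparity error at a given depth."""
    return depth_m ** 2 / (focal_px * baseline_m)


# Hypothetical camera: 640 px focal length, 5 cm baseline (assumed values).
FOCAL_PX, BASELINE_M = 640.0, 0.05
near = depth_from_disparity(FOCAL_PX, BASELINE_M, 32.0)  # point at about 1 m
err_1m = depth_error_per_pixel(FOCAL_PX, BASELINE_M, 1.0)
err_2m = depth_error_per_pixel(FOCAL_PX, BASELINE_M, 2.0)
print(near, err_1m, err_2m)
```

Because the per-pixel error grows with the square of the distance and shrinks with the baseline, a wider baseline generally helps at longer range. This is one plausible reading of why different cameras suit different working distances, not a result taken from the paper.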

2025-01-13

A Novel Approach to Network Traffic Analysis: the HERA tool

Cybersecurity threats highlight the need for robust network intrusion detection systems to identify malicious behaviour. These systems rely heavily on large datasets to train machine learning models capable of detecting patterns and predicting threats. In the past two decades, researchers have produced a multitude of datasets; however, some widely utilised recent datasets generated with CICFlowMeter contain inaccuracies. These result in flow generation and feature extraction inconsistencies, leading to skewed results and reduced system effectiveness. Other tools in this context lack ease of use, customizable feature sets, and flow labelling options. In this work, we introduce HERA, a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features. 0.726 Validated and tested with the UNSW-NB15 dataset, HERA demonstrated accurate flow and label generation.

link

2025-01-13

TiEBe: A Benchmark for Assessing the Current Knowledge of Large Language Models

With a rapidly evolving knowledge landscape and the increasing adoption of large language models, a need has emerged to keep these models continuously updated with current events. While existing benchmarks evaluate general factual recall, they often overlook two critical aspects: the ability of models to integrate evolving knowledge through continual learning and the significant regional disparities in their performance. To address these gaps, we introduce the Timely Events Benchmark (TiEBe), a dataset containing over 11,000 question-answer pairs focused on globally and regionally significant events. 0.809 TiEBe leverages structured retrospective data from Wikipedia, enabling continuous updates to assess LLMs' knowledge of evolving global affairs and their understanding of events across different regions. Our benchmark demonstrates that LLMs exhibit substantial geographic disparities in factual recall, emphasizing the need for more balanced global knowledge representation. Furthermore, TiEBe serves as a tool for evaluating continual learning strategies, providing insights into models' ability to acquire new information without forgetting past knowledge.

link

2025-01-13

Evaluating Agent-based Program Repair at Google

Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench, a collection of bugs from highly-rated GitHub Python projects. In addition, various agentic approaches such as SWE-Agent have been proposed to solve bugs in this benchmark. This paper explores the viability of using an agentic approach to address bugs in an enterprise context. To investigate this, we curate an evaluation set of 178 bugs drawn from Google's issue tracking system. This dataset spans both human-reported (78) and machine-reported bugs (100). 0.7 To establish a repair performance baseline on this benchmark, we implement Passerine, an agent similar in spirit to SWE-Agent that can work within Google's development environment. We show that with 20 trajectory samples and Gemini 1.5 Pro, Passerine can produce a patch that passes bug tests (i.e., plausible) for 73% of machine-reported and 25.6% of human-reported bugs in our evaluation set. After manual examination, we found that 43% of machine-reported bugs and 17.9% of human-reported bugs have at least one patch that is semantically equivalent to the ground-truth patch. These results establish a baseline on an industrially relevant benchmark, which as we show, contains bugs drawn from a different distribution -- in terms of language diversity, size, and spread of changes, etc. -- compared to those in the popular SWE-Bench dataset.

link
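
The percentages in the program-repair abstract above can be turned back into approximate bug counts using the stated subset sizes (100 machine-reported, 78 human-reported). The sketch below does that arithmetic. The counts and the pooled rates are derived here from rounded percentages and are not figures reported in the abstract; the variable and function names are illustrative.

```python
# Subset sizes and rates as stated in the abstract.
MACHINE_TOTAL, HUMAN_TOTAL = 100, 78
plausible = {"machine": 0.73, "human": 0.256}
equivalent = {"machine": 0.43, "human": 0.179}


def counts(rates: dict) -> tuple:
    """Return (machine count, human count, pooled rate) implied by per-subset rates."""
    machine = round(rates["machine"] * MACHINE_TOTAL)
    human = round(rates["human"] * HUMAN_TOTAL)
    pooled = (machine + human) / (MACHINE_TOTAL + HUMAN_TOTAL)
    return machine, human, pooled


for name, rates in (("plausible", plausible), ("equivalent", equivalent)):
    machine, human, pooled = counts(rates)
    print(f"{name}: {machine} machine + {human} human, pooled {pooled:.1%}")
```

Under these assumptions the plausible patches cover roughly 52% of the 178 bugs overall and the semantically equivalent ones roughly 32%, with most of the successes coming from the machine-reported subset.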

2025-01-13

UnCommon Objects in 3D

We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full 360° coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. 0.714 In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.

link

2025-01-09

BRATI: Bidirectional Recurrent Attention for Time-Series Imputation

Missing data in time-series analysis poses significant challenges, affecting the reliability of downstream applications. Imputation, the process of estimating missing values, has emerged as a key solution. This paper introduces BRATI, a novel deep-learning model designed to address multivariate time-series imputation by combining Bidirectional Recurrent Networks and Attention mechanisms. BRATI processes temporal dependencies and feature correlations across long and short time horizons, utilizing two imputation blocks that operate in opposite temporal directions. Each block integrates recurrent layers and attention mechanisms to effectively resolve long-term dependencies. We evaluate BRATI on three real-world datasets under diverse missing-data scenarios: randomly missing values, fixed-length missing sequences, and variable-length missing sequences. 0.714 Our findings demonstrate that BRATI consistently outperforms state-of-the-art models, delivering superior accuracy and robustness in imputing multivariate time-series data.

link
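
The three missing-data scenarios named in the BRATI abstract above (randomly missing values, fixed-length gaps, variable-length gaps) can be pictured as boolean masks over a time axis. The following is a minimal sketch under assumed conventions: one gap per series, `True` meaning "missing", and hypothetical function names. The paper's actual masking protocol is not described in the abstract and may differ.

```python
import random


def random_missing(n: int, rate: float, rng: random.Random) -> list:
    """Each time step is dropped independently with probability `rate`."""
    return [rng.random() < rate for _ in range(n)]


def block_missing(n: int, length: int, rng: random.Random) -> list:
    """One contiguous gap of exactly `length` steps at a random position."""
    if not 0 < length <= n:
        raise ValueError("gap length must be between 1 and n")
    start = rng.randrange(n - length + 1)
    return [start <= t < start + length for t in range(n)]


def variable_block_missing(n: int, min_len: int, max_len: int, rng: random.Random) -> list:
    """One contiguous gap whose length is drawn uniformly from [min_len, max_len]."""
    return block_missing(n, rng.randint(min_len, max_len), rng)


rng = random.Random(0)
for mask in (random_missing(12, 0.25, rng),
             block_missing(12, 4, rng),
             variable_block_missing(12, 2, 6, rng)):
    # Print each mask as a string: 'x' marks a missing step, '.' an observed one.
    print("".join("x" if missing else "." for missing in mask))
```

Masks like these would be applied to fully observed series so that imputed values can be scored against the held-out ground truth.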

2025-01-09

A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics

Recent advances in digital pathology have demonstrated the effectiveness of foundation models across diverse applications. In this report, we present a novel vision foundation model based on the RudolfV approach. Our model was trained on a dataset comprising 1.2 million histopathology whole slide images, collected from two medical institutions: Mayo Clinic and Charité - Universitätsmedizin Berlin. 0.714 Comprehensive evaluations show that our model achieves state-of-the-art performance across twenty-one public benchmark datasets, even though it is neither the largest model by parameter count nor by training dataset size.

link

2025-01-09

Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation

Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation. 0.727

link

2025-01-09

Explainable AI-Enhanced Deep Learning for Pumpkin Leaf Disease Detection: A Comparative Analysis of CNN Architectures

Pumpkin leaf diseases are significant threats to agricultural productivity, requiring a timely and precise diagnosis for effective management. Traditional identification methods are laborious and susceptible to human error, emphasizing the necessity for automated solutions. This study employs the "Pumpkin Leaf Disease Dataset", which comprises 2000 high-resolution images separated into five categories. 0.848 The categories are downy mildew, powdery mildew, mosaic disease, bacterial leaf spot, and healthy leaves. The dataset was rigorously assembled from several agricultural fields to ensure a strong representation for model training. We explored many proficient deep learning architectures, including DenseNet201, DenseNet121, DenseNet169, Xception, ResNet50, ResNet101 and InceptionResNetV2, and observed that ResNet50 performed most effectively, with an accuracy of 90.5% and comparable precision, recall, and F1-Score. We used Explainable AI (XAI) approaches like Grad-CAM, Grad-CAM++, Score-CAM, and Layer-CAM to provide meaningful representations of model decision-making processes, which improved understanding and trust in automated disease diagnostics. These findings demonstrate ResNet50's potential to revolutionize pumpkin leaf disease detection, allowing for earlier and more accurate treatments.

link

Data Quality

2024-12-24

A region-wide, multi-year set of crop field boundary labels for Africa

African agriculture is undergoing rapid transformation. Annual maps of crop fields are key to understanding the nature of this transformation, but such maps are currently lacking and must be developed using advanced machine learning models trained on high resolution remote sensing imagery. To enable the development of such models, we delineated field boundaries in 33,746 Planet images captured between 2017 and 2023 across the continent using a custom labeling platform with built-in procedures for assessing and mitigating label error. We collected 42,403 labels, including 7,204 labels arising from tasks dedicated to assessing label quality (Class 1 labels), 32,167 from sites mapped once by a single labeller (Class 2) and 3,032 labels from sites where 3 or more labellers were tasked to map the same location (Class 4). 0.668 Class 1 labels were used to calculate labeller-specific quality scores, while Class 1 and 4 sites mapped by at least 3 labellers were used to further evaluate label uncertainty using a Bayesian risk metric. 0.672 Quality metrics showed that label quality was moderately high (0.75) for measures of total field extent, but low regarding the number of individual fields delineated (0.33), and the position of field edges (0.05). These values are expected when delineating small-scale fields in 3-5 m resolution imagery, which can be too coarse to reliably distinguish smaller fields, particularly in dense croplands, and therefore requires substantial labeller judgement. Nevertheless, previous work shows that such labels can train effective field mapping models. Furthermore, this large, probabilistic sample on its own provides valuable insight into regional agricultural characteristics, highlighting variations in the median field size and density. The imagery and vectorized labels along with quality information are available for download from two public repositories.

link
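
As a quick consistency check on the label counts quoted in the crop field boundary abstract above, the three class totals can be summed and expressed as shares of the 42,403 labels. The dictionary keys are shorthand introduced here for illustration; the counts themselves are taken from the abstract.

```python
# Label counts as stated in the abstract, keyed by an illustrative short name.
label_counts = {
    "class_1_quality_assessment": 7_204,
    "class_2_single_labeller": 32_167,
    "class_4_three_or_more_labellers": 3_032,
}
REPORTED_TOTAL = 42_403

total = sum(label_counts.values())
shares = {name: count / total for name, count in label_counts.items()}

print("sum of classes:", total, "matches reported total:", total == REPORTED_TOTAL)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The classes add up to the reported total, and by this arithmetic roughly three quarters of the labels come from sites mapped once by a single labeller, which is why the labeller-specific quality scores matter for most of the dataset.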

2024-12-17

Label Errors in the Tobacco3482 Dataset

Tobacco3482 is a widely used document classification benchmark dataset. However, our manual inspection of the entire dataset uncovers widespread ontological issues, especially large amounts of annotation label problems in the dataset. 0.665 We establish data label guidelines and find that 11.7% of the dataset is improperly annotated and should either have an unknown label or a corrected label, and 16.7% of samples in the dataset have multiple valid labels. 0.812 We then analyze the mistakes of a top-performing model and find that 35% of the model's mistakes can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. 0.747 Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.

link

Benchmarks

2025-01-16

A New Teacher-Reviewer-Student Framework for Semi-supervised 2D Human Pose Estimation

Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate the above problems by leveraging a large amount of unlabeled data along with a small portion of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation, ignoring crucial historical information from the previous training process. Therefore, we propose a novel semi-supervised 2D human pose estimation method by utilizing a newly designed Teacher-Reviewer-Student framework. Specifically, we first mimic the phenomenon that human beings constantly review previous knowledge for consolidation to design our framework, in which the teacher predicts results to guide the student's learning and the reviewer stores important historical parameters to provide additional supervision signals. Secondly, we introduce a Multi-level Feature Learning strategy, which utilizes the outputs from different stages of the backbone to estimate the heatmap to guide network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, i.e., Keypoint-Mix, to perturb pose information by mixing different keypoints, thus enhancing the network's ability to discern keypoints. Extensive experiments on publicly available datasets demonstrate that our method achieves significant improvements compared to the existing methods. 0.604

link

2025-01-16

Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation -- Extended Version

Measuring inter-dataset similarity is an important task in machine learning and data mining with various use cases and applications. Existing methods for measuring inter-dataset similarity are computationally expensive, limited, or sensitive to different entities and non-trivial choices for parameters. They also lack a holistic perspective on the entire dataset. In this paper, we propose two novel metrics for measuring inter-dataset similarity. We discuss the mathematical foundation and the theoretical basis of our proposed metrics. We demonstrate the effectiveness of the proposed metrics by investigating two applications in the evaluation of synthetic data and in the evaluation of feature selection methods. The theoretical and empirical studies conducted in this paper illustrate the effectiveness of the proposed metrics. 0.619

link

2025-01-16

IFRA: a machine learning-based Instrumented Fall Risk Assessment Scale derived from Instrumented Timed Up and Go test in stroke patients

Effective fall risk assessment is critical for post-stroke patients. The present study proposes a novel, data-informed fall risk assessment method based on the instrumented Timed Up and Go (ITUG) test data, bringing in many mobility measures that traditional clinical scales fail to capture. IFRA, which stands for Instrumented Fall Risk Assessment, has been developed using a two-step process: first, features with the highest predictive power among those collected in an ITUG test have been identified using machine learning techniques; then, a strategy is proposed to stratify patients into low, medium, or high-risk strata. The dataset used in our analysis consists of 142 participants, out of which 93 were used for training (15 synthetically generated), 17 for validation and 32 to test the resulting IFRA scale (22 non-fallers and 10 fallers). Features considered in the IFRA scale include gait speed, vertical acceleration during sit-to-walk transition, and turning angular velocity, which align well with established literature on the risk of fall in neurological patients. In a comparison with traditional clinical scales such as the traditional Timed Up & Go and the Mini-BESTest, IFRA demonstrates competitive performance, being the only scale to correctly assign more than half of the fallers to the high-risk stratum (Fisher's exact test p = 0.004). 0.605 Despite the dataset's limited size, this is the first proof-of-concept study to pave the way for future evidence regarding the use of the IFRA tool for continuous patient monitoring and fall prevention both in clinical stroke rehabilitation and at home post-discharge.

link

2025-01-16

ARMAX identification of low rank graphical models

In large-scale systems, complex internal relationships are often present. Such interconnected systems can be effectively described by low rank stochastic processes. When identifying a predictive model of low rank processes from sampling data, the rank-deficient property of spectral densities is often obscured by the inevitable measurement noise in practice. However, existing low rank identification approaches often do not take noise into explicit consideration, leading to non-negligible inaccuracies even under weak noise. In this paper, we address the identification issue of low rank processes under measurement noise. We find that the noisy measurement model admits a sparse plus low rank structure in latent-variable graphical models. Specifically, we first decompose the problem into a maximum entropy covariance extension problem, and a low rank graphical estimation problem based on an autoregressive moving-average with exogenous input (ARMAX) model. To identify the ARMAX low rank graphical models, we propose an estimation approach based on maximum likelihood. The identifiability and consistency of this approach are proven under certain conditions. Simulation results confirm the reliable performance of the entire algorithm in both the parameter estimation and noisy data filtering. 0.684

link

2025-01-16

NS-Gym: Open-Source Simulation Environments and Benchmarks for Non-Stationary Markov Decision Processes

In many real-world applications, agents must make sequential decisions in environments where conditions are subject to change due to various exogenous factors. These non-stationary environments pose significant challenges to traditional decision-making models, which typically assume stationary dynamics. Non-stationary Markov decision processes (NS-MDPs) offer a framework to model and solve decision problems under such changing conditions. However, the lack of standardized benchmarks and simulation tools has hindered systematic evaluation and advances in this field. 0.61 We present NS-Gym, the first simulation toolkit designed explicitly for NS-MDPs, integrated within the popular Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental parameters that characterize non-stationarity from the agent's decision-making module, allowing for modular and flexible adaptations to dynamic environments. We review prior work in this domain and present a toolkit encapsulating key problem characteristics and types in NS-MDPs. This toolkit is the first effort to develop a set of standardized interfaces and benchmark problems to enable consistent and reproducible evaluation of algorithms under non-stationary conditions. 0.703 We also benchmark six algorithmic approaches from prior work on NS-MDPs using NS-Gym. Our vision is that NS-Gym will enable researchers to assess the adaptability and robustness of their decision-making algorithms to non-stationary conditions.

link
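
The NS-Gym abstract above describes separating the evolution of environment parameters from the agent's decision-making. A toy illustration of that separation, written in plain Python, is sketched below. It does not use NS-Gym or Gymnasium, and every class and method name (`ParameterSchedule`, `SlipperyCorridor`) is hypothetical; the real toolkit's interfaces are not given in the abstract.

```python
import random


class ParameterSchedule:
    """Hypothetical update rule: a scalar parameter drifts upward until it hits a cap."""

    def __init__(self, start: float, drift: float, upper: float):
        self.value = start
        self.drift = drift
        self.upper = upper

    def step(self) -> float:
        # The parameter evolves on its own clock, independent of the agent's actions.
        self.value = min(self.value + self.drift, self.upper)
        return self.value


class SlipperyCorridor:
    """Toy MDP: move along a corridor; each move fails with the current slip probability."""

    def __init__(self, length: int, schedule: ParameterSchedule, seed: int = 0):
        self.length = length
        self.schedule = schedule
        self.rng = random.Random(seed)
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int):
        slip = self.schedule.step()  # the non-stationary parameter is updated first
        if self.rng.random() >= slip:  # the move succeeds with probability 1 - slip
            self.pos = max(0, min(self.length - 1, self.pos + action))
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {"slip": slip}


env = SlipperyCorridor(length=6, schedule=ParameterSchedule(0.0, 0.05, 0.6))
state, done, steps = env.reset(), False, 0
while not done and steps < 200:
    state, reward, done, info = env.step(+1)  # a fixed "always move right" policy
    steps += 1
print(steps, done, round(info["slip"], 2))
```

Swapping in a different `ParameterSchedule` changes how the dynamics drift without touching the policy loop, which is the kind of modularity the abstract points to.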

2025-01-16

On the Energy Consumption of Test Generation

Research in the area of automated test generation has seen remarkable progress in recent years, resulting in several approaches and tools for effective and efficient generation of test cases. In particular, the EvoSuite tool has been at the forefront of this progress embodying various algorithms for automated test generation of Java programs. EvoSuite has been used to generate test cases for a wide variety of programs as well. While there are a number of empirical studies that report results on the effectiveness, in terms of code coverage and other related metrics, of the various test generation strategies and algorithms implemented in EvoSuite, there are no studies, to the best of our knowledge, on the energy consumption associated with automated test generation. In this paper, we set out to investigate this aspect by measuring the energy consumed by EvoSuite when generating tests. We also measure the energy consumed in the execution of the test cases generated, comparing them with those manually written by developers. 0.665 The results show that the different test generation algorithms consumed different amounts of energy, in particular on classes with high cyclomatic complexity. Furthermore, we also observe that manual tests tend to consume more energy as compared to automatically generated tests, without necessarily achieving higher code coverage. Our results also give insight into the methods that consume significantly higher levels of energy, indicating potential points of improvement both for EvoSuite as well as the different programs under test. 0.622

link

2025-01-16

Reward-Guided Controlled Generation for Inference-Time Alignment in Diffusion Models: Tutorial and Review

This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models.While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures).In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning.This tutorial explores the foundational aspects of such inference-time algorithms.We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict from intermediate states to terminal rewards.Within this framework, we present several novel algorithms not yet covered in the literature. 0.609Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models.The code of this tutorial on protein design is available at https://github.com/masa-ue/AlignInversePro

link

2025-01-16

A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise

We study the problem of PAC learning $\gamma$-margin halfspaces in the presence of Massart noise.Without computational considerations, the sample complexity of this learning problem is known to be $\widetilde{\Theta}(1/(\gamma^2 \epsilon))$. Prior computationally efficient algorithms for the problem incur sample complexity $\tilde{O}(1/(\gamma^4 \epsilon^3))$ and achieve 0-1 error of $\eta+\epsilon$, where $\eta<1/2$ is the upper bound on the noise rate.Recent work gave evidence of an information-computation tradeoff, suggesting that a quadratic dependence on $1/\epsilon$ is required for computationally efficient algorithms.Our main result is a computationally efficient learner with sample complexity $\widetilde{\Theta}(1/(\gamma^2 \epsilon^2))$, nearly matching this lower bound.In addition, our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses. 0.691

link

2025-01-16

Cueless EEG imagined speech for subject identification: dataset and benchmarks

Electroencephalogram (EEG) signals have emerged as a promising modality for biometric identification.While previous studies have explored the use of imagined speech with semantically meaningful words for subject identification, most have relied on additional visual or auditory cues.In this study, we introduce a cueless EEG-based imagined speech paradigm, where subjects imagine the pronunciation of semantically meaningful words without any external cues.This innovative approach addresses the limitations of prior methods by requiring subjects to select and imagine words from a predefined list naturally.The dataset comprises over 4,350 trials from 11 subjects across five sessions.We assess a variety of classification methods, including traditional machine learning techniques such as Support Vector Machines (SVM) and XGBoost, as well as time-series foundation models and deep learning architectures specifically designed for EEG classification, such as EEG Conformer and Shallow ConvNet.A session-based hold-out validation strategy was employed to ensure reliable evaluation and prevent data leakage.Our results demonstrate outstanding classification accuracy, reaching 97.93%. 0.662These findings highlight the potential of cueless EEG paradigms for secure and reliable subject identification in real-world applications, such as brain-computer interfaces (BCIs).

link

2025-01-16

Comparative Insights from 12 Machine Learning Models in Extracting Economic Ideology from Political Text

This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology.As an evaluation benchmark, I use manifesto data spanning six elections in the United Kingdom and pre-annotated by expert and crowd coders.The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels.The results show that generative models such as GPT-4o and Gemini 1.5 Flash consistently outperform other models against all benchmarks.However, they pose issues of accessibility and resource availability.Fine-tuning yielded competitive performance and offers a reliable alternative through domain-specific optimization. 0.667But its dependency on training data severely limits scalability.Zero-shot models consistently face difficulties with identifying signals of economic ideology, often resulting in negative associations with human coding.Using general knowledge for the domain-specific task of ideology scaling proved to be unreliable.Other key findings include considerable within-party variation, fine-tuning benefiting from larger training data, and zero-shot's sensitivity to prompt content.The assessments include the strengths and limitations of each model and derive best-practices for automated analyses of political content.

link

2025-01-16

Parallel multi-objective metaheuristics for smart communications in vehicular networks

This article analyzes the use of two parallel multi-objective soft computing algorithms to automatically search for high-quality settings of the Ad hoc On Demand Vector routing protocol for vehicular networks. These methods are based on an evolutionary algorithm and on a swarm intelligence approach. The experimental analysis demonstrates that the configurations computed by our optimization algorithms outperform other state-of-the-art optimized ones. 0.636 In turn, the computational efficiency achieved by all the parallel versions is greater than 87%. Therefore, the line of work presented in this article represents an efficient framework to improve vehicular communications.

link

2025-01-16

KU AIGEN ICL EDI@BC8 Track 3: Advancing Phenotype Named Entity Recognition and Normalization for Dysmorphology Physical Examination Reports

The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms.However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms.To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step.Our pipeline resulted in an exact extraction and normalization F1 score 2.6\% higher than the mean score of all submissions received in response to the challenge.Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9\%. 0.694These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.

link

2025-01-16

Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models

Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks provide a convenient tool for writing machine learning pipelines and interactively observing outputs; however, maintaining Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging due to the length and complexity of the notebooks. Moreover, there is no existing benchmark related to developer edits on Jupyter notebooks. 0.647 To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observed that the edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code in repositories. While larger models outperform smaller counterparts in code editing, all models have low accuracy on our dataset even after finetuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models' capabilities in engineering machine learning code.

link

2025-01-15

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. 0.663 To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. 0.622 The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a finer granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR train set can effectively benefit the training process of multi-modal document retrieval, and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.

link

2025-01-15

Automatic tuning of communication protocols for vehicular ad hoc networks using metaheuristics

The emerging field of vehicular ad hoc networks (VANETs) deals with a set of communicating vehicles which are able to spontaneously interconnect without any pre-existing infrastructure. In this kind of network, it is crucial to find an optimal configuration of the communication protocols prior to the final network deployment. This way, a human designer can obtain an optimal QoS of the network beforehand. The problem we consider in this work lies in configuring the File Transfer protocol Configuration (FTC) with the aim of optimizing the transmission time, the number of lost packets, and the amount of data transferred in realistic VANET scenarios. We tackle the FTC problem with five representative state-of-the-art optimization techniques and compare their performance. 0.679 These algorithms are: Particle Swarm Optimization (PSO), Differential Evolution (DE), Genetic Algorithm (GA), Evolutionary Strategy (ES), and Simulated Annealing (SA). For our tests, two typical environment instances of VANETs for Urban and Highway scenarios have been defined. The experiments using ns-2 (a well-known realistic VANET simulator) reveal that PSO outperforms all the compared algorithms for both studied VANET instances.

link

2025-01-15

RouteNet-Gauss: Hardware-Enhanced Network Modeling with Machine Learning

Network simulation is pivotal in network modeling, assisting with tasks ranging from capacity planning to performance estimation.Traditional approaches such as Discrete Event Simulation (DES) face limitations in terms of computational cost and accuracy.This paper introduces RouteNet-Gauss, a novel integration of a testbed network with a Machine Learning (ML) model to address these challenges.By using the testbed as a hardware accelerator, RouteNet-Gauss generates training datasets rapidly and simulates network scenarios with high fidelity to real-world conditions.Experimental results show that RouteNet-Gauss significantly reduces prediction errors by up to 95% and achieves a 488x speedup in inference time compared to state-of-the-art DES-based methods.RouteNet-Gauss's modular architecture is dynamically constructed based on the specific characteristics of the network scenario, such as topology and routing.This enables it to understand and generalize to different network configurations beyond those seen during training, including networks up to 10x larger.Additionally, it supports Temporal Aggregated Performance Estimation (TAPE), providing configurable temporal granularity and maintaining high accuracy in flow performance metrics.This approach shows promise in improving both simulation efficiency and accuracy, offering a valuable tool for network operators. 0.6

link

2025-01-15

Incrementally Learning Multiple Diverse Data Domains via Multi-Source Dynamic Expansion Model

Continual Learning seeks to develop a model capable of incrementally assimilating new information while retaining prior knowledge.However, current research predominantly addresses a straightforward learning context, wherein all data samples originate from a singular data domain.This paper shifts focus to a more complex and realistic learning environment, characterized by data samples sourced from multiple distinct domains.We tackle this intricate learning challenge by introducing a novel methodology, termed the Multi-Source Dynamic Expansion Model (MSDEM), which leverages various pre-trained models as backbones and progressively establishes new experts based on them to adapt to emerging tasks.Additionally, we propose an innovative dynamic expandable attention mechanism designed to selectively harness knowledge from multiple backbones, thereby accelerating the new task learning.Moreover, we introduce a dynamic graph weight router that strategically reuses all previously acquired parameters and representations for new task learning, maximizing the positive knowledge transfer effect, which further improves generalization performance.We conduct a comprehensive series of experiments, and the empirical findings indicate that our proposed approach achieves state-of-the-art performance. 0.68

link

2025-01-15

Enhanced Multi-Scale Cross-Attention for Person Image Generation

In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task.Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities.Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively.Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement.This has not been considered by any other existing GAN-based image generation work.To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks.To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA).Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively.Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods.However, our method is significantly faster than diffusion-based methods in both training and inference. 0.625

link

2025-01-15

Lights, Camera, Matching: The Role of Image Illumination in Fair Face Recognition

Facial brightness is a key image quality factor impacting face recognition accuracy differentials across demographic groups.In this work, we aim to decrease the accuracy gap between the similarity score distributions for Caucasian and African American female mated image pairs, as measured by d' between distributions.To balance brightness across demographic groups, we conduct three experiments, interpreting brightness in the face skin region either as median pixel value or as the distribution of pixel values.Balancing based on median brightness alone yields up to a 46.8% decrease in d', while balancing based on brightness distribution yields up to a 57.6% decrease. 0.608In all three cases, the similarity scores of the individual distributions improve, with mean scores maximally improving 5.9% for Caucasian females and 3.7% for African American females. 0.609
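
The d' statistic mentioned above measures how far apart two score distributions are. The sketch below uses the common definition (absolute difference of means divided by the root of the averaged variances); whether the paper uses population or sample variance is not stated in the abstract, so the choice of population variance here is an assumption, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def d_prime(scores_a, scores_b):
    """Separation between two score distributions.

    Assumes the common form |mu_a - mu_b| / sqrt((var_a + var_b) / 2),
    using population variance (an assumption, not taken from the paper).
    """
    var_avg = (pvariance(scores_a) + pvariance(scores_b)) / 2.0
    return abs(mean(scores_a) - mean(scores_b)) / sqrt(var_avg)


def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0
```

A smaller d' between the two groups' mated-pair score distributions means the distributions overlap more, which is the sense in which the abstract reports a "decrease in d'".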

link

2025-01-15

Learning Joint Denoising, Demosaicing, and Compression from the Raw Natural Image Noise Dataset

This paper introduces the Raw Natural Image Noise Dataset (RawNIND), a diverse collection of paired raw images designed to support the development of denoising models that generalize across sensors, image development workflows, and styles.Two denoising methods are proposed: one operates directly on raw Bayer data, leveraging computational efficiency, while the other processes linear RGB images for improved generalization to different sensors, with both preserving flexibility for subsequent development.Both methods outperform traditional approaches which rely on developed images.Additionally, the integration of denoising and compression at the raw data level significantly enhances rate-distortion performance and computational efficiency. 0.617These findings suggest a paradigm shift toward raw data workflows for efficient and flexible image processing.

link

2025-01-15

Computing Approximated Fixpoints via Dampened Mann Iteration

Fixpoints are ubiquitous in computer science and when dealing with quantitative semantics and verification one is commonly led to consider least fixpoints of (higher-dimensional) functions over the nonnegative reals.We show how to approximate the least fixpoint of such functions, focusing on the case in which they are not known precisely, but represented by a sequence of approximating functions that converge to them.We concentrate on monotone and non-expansive functions, for which uniqueness of fixpoints is not guaranteed and standard fixpoint iteration schemes might get stuck at a fixpoint that is not the least.Our main contribution is the identification of an iteration scheme, a variation of Mann iteration with a dampening factor, which, under suitable conditions, is shown to guarantee convergence to the least fixpoint of the function of interest. 0.61We then argue that these results are relevant in the context of model-based reinforcement learning for Markov decision processes (MDPs), showing that the proposed iteration scheme instantiates to MDPs and allows us to derive convergence to the optimal expected return.More generally, we show that our results can be used to iterate to the least fixpoint almost surely for systems where the function of interest can be approximated with given probabilistic error bounds, as it happens for probabilistic systems, such as simple stochastic games, that can be explored via sampling.
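
To illustrate why dampening can help, here is a minimal sketch of a Mann-style iteration with a multiplicative dampening factor. The exact update rule and parameter conditions of the paper are not given in the abstract, so the form x_{n+1} = (1 - beta_n) * ((1 - alpha) * x_n + alpha * f(x_n)), the schedule beta_n = 1/(n+2), and the function name are all assumptions made for illustration. The sketch also uses a fixed function f rather than a sequence of approximating functions.

```python
def dampened_mann(f, x0, steps, alpha=1.0, beta=lambda n: 1.0 / (n + 2)):
    """Illustrative dampened Mann iteration on the nonnegative reals.

    Assumed update (not taken from the paper):
        x_{n+1} = (1 - beta(n)) * ((1 - alpha) * x_n + alpha * f(x_n))
    where beta(n) tends to 0 but its sum diverges.
    """
    x = x0
    for n in range(steps):
        mann_step = (1.0 - alpha) * x + alpha * f(x)  # plain Mann step
        x = (1.0 - beta(n)) * mann_step               # dampening pulls x down
    return x


def f(x):
    # Monotone and non-expansive; every x >= 1 is a fixpoint, the least is 1.
    return max(x, 1.0)


# Without dampening the iteration is stuck at the non-least fixpoint 5.0.
stuck = dampened_mann(f, 5.0, 1000, beta=lambda n: 0.0)
# With dampening it drifts down and settles near the least fixpoint 1.0.
least = dampened_mann(f, 5.0, 10000)
```

In this toy example the dampening keeps shrinking the iterate until it drops below the fixpoint set, after which the vanishing dampening factor lets it approach the least fixpoint from below.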

link

2025-01-15

Kolmogorov-Arnold Networks for Time Series Granger Causality Inference

We introduce Granger Causality Kolmogorov-Arnold Networks (GCKAN), an innovative architecture that extends the recently proposed Kolmogorov-Arnold Networks (KAN) to the domain of causal inference. By extracting base weights from KAN layers and incorporating a sparsity-inducing penalty along with ridge regularization, GCKAN infers Granger causality from time series while enabling automatic time lag selection. Additionally, we propose an algorithm leveraging time-reversed Granger causality to enhance inference accuracy. The algorithm compares prediction and sparsity-inducing losses derived from the original and time-reversed series, automatically selecting the causal relationship with the higher score or integrating the results to mitigate spurious connectivities. 0.633 Comprehensive experiments conducted on Lorenz-96, gene regulatory networks, fMRI BOLD signals, and VAR datasets demonstrate that the proposed model achieves performance competitive with state-of-the-art methods in inferring Granger causality from nonlinear, high-dimensional, and limited-sample time series.

link

2025-01-15

Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models

As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation.Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data.The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries.Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b).Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity.Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability.The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. 0.617Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility.Substantive validity was supported by correlations between note length and scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037).Discriminant validity distinguished high- from low-quality summaries (p < 0.001).The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
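
For readers unfamiliar with the internal-consistency statistic reported above, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The data layout (one list of scores per item) and the function name are assumptions for illustration; this is not the authors' analysis code.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    `items` is a list of k lists; items[i][j] is the score that
    respondent (or rated summary) j received on item i.
    """
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Total score per respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1.0 - item_variance_sum / variance(totals))
```

When all items rank respondents identically, alpha reaches 1; values near 0.88, as reported for the PDSQI-9, indicate that the items vary together strongly.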

link

2025-01-15

CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation

Localizing text descriptions in large-scale 3D scenes is inherently an ambiguous task.This nonetheless arises while describing general concepts, e.g. all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of distribution is required.In this paper, we generate the distribution of the camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses the noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions, using the pre-trained text encoders.The connection between text descriptions and pose distribution is established through pretrained Vision-Language-Model, i.e. CLIP.Furthermore, we demonstrate that the candidate poses for the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description, through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. 0.629Our proposed method consistently outperforms these baselines across all five large-scale datasets. 0.681Our source code and dataset will be made publicly available.

link

2025-01-14

FairTTTS: A Tree Test Time Simulation Method for Fairness-Aware Classification

Algorithmic decision-making has become deeply ingrained in many domains, yet biases in machine learning models can still produce discriminatory outcomes, often harming unprivileged groups.Achieving fair classification is inherently challenging, requiring a careful balance between predictive performance and ethical considerations.We present FairTTTS, a novel post-processing bias mitigation method inspired by the Tree Test Time Simulation (TTTS) method.Originally developed to enhance accuracy and robustness against adversarial inputs through probabilistic decision-path adjustments, TTTS serves as the foundation for FairTTTS.By building on this accuracy-enhancing technique, FairTTTS mitigates bias and improves predictive performance.FairTTTS uses a distance-based heuristic to adjust decisions at protected attribute nodes, ensuring fairness for unprivileged samples.This fairness-oriented adjustment occurs as a post-processing step, allowing FairTTTS to be applied to pre-trained models, diverse datasets, and various fairness metrics without retraining.Extensive evaluation on seven benchmark datasets shows that FairTTTS outperforms traditional methods in fairness improvement, achieving a 20.96% average increase over the baseline compared to 18.78% for related work, and further enhances accuracy by 0.55%. 0.632In contrast, competing methods typically reduce accuracy by 0.42%. 0.707These results confirm that FairTTTS effectively promotes more equitable decision-making while simultaneously improving predictive performance.

link

2025-01-14

Cube-based Isomorph-free Finite Model Finding

Complete enumeration of finite models of first-order logic (FOL) formulas is pivotal to universal algebra, which studies and catalogs algebraic structures.Efficient finite model enumeration is highly challenging because the number of models grows rapidly with their size but at the same time, we are only interested in models modulo isomorphism.While isomorphism cuts down the number of models of interest, it is nontrivial to take that into account computationally. This paper develops a novel algorithm that achieves isomorphism-free enumeration by employing isomorphic graph detection algorithm nauty, cube-based search space splitting, and compact model representations.We name our algorithm cube-based isomorph-free finite model finding algorithm (CBIF).Our approach contrasts with the traditional two-step algorithms, which first enumerate (possibly isomorphic) models and then filter the isomorphic ones out in the second stage.The experimental results show that CBIF is many orders of magnitude faster than the traditional two-step algorithms. 0.666CBIF enables us to calculate new results that are not found in the literature, including the extension of two existing OEIS sequences, thereby advancing the state of the art.

link

2025-01-14

Self-supervised Deep Hyperspectral Inpainting with the Plug and Play and Deep Image Prior Models

Hyperspectral images are typically composed of hundreds of narrow and contiguous spectral bands, each containing information regarding the material composition of the imaged scene. However, these images can be affected by various sources of noise, distortions, or data loss, which can significantly degrade their quality and usefulness. This paper introduces a convergence-guaranteed algorithm, LRS-PnP-DIP(1-Lip), which successfully addresses the instability issue of DHP that has been reported before. 0.624 The proposed algorithm extends the successful joint low-rank and sparse model to further exploit the underlying data structures beyond the conventional and sometimes restrictive unions of subspace models. A stability analysis guarantees the convergence of the proposed algorithm under mild assumptions, which is crucial for its application in real-world scenarios. 0.64 Extensive experiments demonstrate that the proposed solution consistently delivers visually and quantitatively superior inpainting results, establishing state-of-the-art performance.

link

2025-01-14

CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge, which poses considerable security risks when using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to address this but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. 0.633 To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework assesses code functionality and security simultaneously, using high-quality task specifications and outcome-driven test oracles which provide high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation of LLM-generated code, overcoming previous benchmarks' shortcomings. Through our evaluations, CWEval reveals a notable portion of functional but insecure code produced by LLMs, and shows a serious inaccuracy of previous evaluations, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at: https://github.com/Co1lin/CWEval .

link

2025-01-14

Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings

Large language models (LLMs) have shown significant improvements in many natural language processing (NLP) tasks, accelerating their rapid adoption across many industries.These models are resource-intensive, requiring extensive computational resources both during training and inference, leading to increased energy consumption and negative environmental impact.As their adoption accelerates, the sustainability of LLMs has become a critical issue, necessitating strategies to optimize their runtime efficiency without compromising performance.Hence, it is imperative to identify the parameters that significantly influence the performance and energy efficiency of LLMs.To that end, in this work, we investigate the effect of important parameters on the performance and energy efficiency of LLMs during inference and examine their trade-offs. First, we analyze how different types of models with varying numbers of parameters and architectures perform on tasks like text generation, question answering, and summarization by benchmarking LLMs such as Falcon-7B, Mistral-7B-v0.1, T5-3B, GPT-2, GPT-J-6B, and GPT-Neo-2.7B. Second, we study input and output sequence characteristics such as sequence length concerning energy consumption, performance, and throughput.Finally, we explore the impact of hardware-based power-saving techniques, i.e., Dynamic Voltage Frequency Scaling (DVFS), on the models' latency and energy efficiency.Our extensive benchmarking and statistical analysis reveal many interesting findings, uncovering how specific optimizations can reduce energy consumption while maintaining throughput and accuracy. 0.675This study provides actionable insights for researchers and practitioners to design energy-efficient LLM inference systems.

link

2025-01-14

Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines.With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2).However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. 0.616To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers.We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head.Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively.It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.

link

2025-01-14

A GPU-Accelerated Distributed Algorithm for Optimal Power Flow in Distribution Systems

We propose a GPU-accelerated distributed optimization algorithm for controlling multi-phase optimal power flow in active distribution systems with dynamically changing topologies. To handle varying network configurations and enable adaptable decomposition, we advocate a componentwise decomposition strategy. However, this approach can lead to a prolonged computation time mainly due to the excessive iterations required for achieving consensus among a large number of fine-grained components. To overcome this, we introduce a technique that segregates equality constraints from inequality constraints, enabling GPU parallelism to reduce per-iteration time by orders of magnitude, thereby significantly accelerating the overall computation. Numerical experiments on IEEE test systems ranging from 13 to 8500 buses demonstrate the superior scalability of the proposed approach compared to its CPU-based counterparts. 0.608

link

2025-01-14

A Similarity Measure Between Functions with Applications to Statistical Learning and Optimization

In this note, we present a novel measure of similarity between two functions. 0.63 It quantifies how the sub-optimality gaps of two functions convert to each other, and unifies several existing notions of functional similarity. We show that it has convenient operation rules, and illustrate its use in empirical risk minimization and non-stationary online optimization.

link

2025-01-13

Empirical Evaluation of the Implicit Hitting Set Approach for Weighted CSPs

SAT technology has proven to be surprisingly effective in a large variety of domains. However, for the Weighted CSP problem, dedicated algorithms have always been superior. One approach not well-studied so far is the use of SAT in conjunction with the Implicit Hitting Set approach. In this work, we explore some alternatives to the existing algorithm of reference. The alternatives, mostly borrowed from related boolean frameworks, consider trade-offs for the two main components of the IHS approach: the computation of low-cost hitting vectors, and their transformation into high-cost cores. For each one, we propose 4 levels of intensity. Since we also test the usefulness of cost function merging, our experiments consider 32 different implementations. 0.65 Our empirical study shows that for WCSP it is not easy to identify the best alternative. Nevertheless, the cost-function merging encoding and extracting maximal cores seem to be a robust approach.

link

2025-01-13

Big Atomics

In this paper, we give theoretically and practically efficient implementations of Big Atomics, i.e., $k$-word linearizable registers that support the load, store, and compare-and-swap (CAS) operations.While modern hardware supports $k = 1$ and sometimes $k = 2$ (e.g., double-width compare-and-swap in x86), our implementations support arbitrary $k$. Big Atomics are useful in many applications, including atomic manipulation of tuples, version lists, and implementing load-linked/store-conditional (LL/SC).We design fast, lock-free implementations of big atomics based on a novel fast-path-slow-path approach we develop.We then use them to develop an efficient concurrent hash table, as evidence of their utility. We experimentally validate the approach by comparing a variety of implementations of big atomics under a variety of workloads (thread counts, load/store ratios, contention, oversubscription, and number of atomics).The experiments compare two of our lock-free variants with C++ std::atomic, a lock-based version, a version using sequence locks, and an indirect version.The results show that our approach is close to the fastest under all conditions and far outperforms others under oversubscription. 0.697We also compare our big atomics based concurrent hash table to a variety of other state-of-the-art hash tables that support arbitrary length keys and values, including implementations from Intel's TBB, Facebook's Folly, libcuckoo, and a recent release from Boost.The results show that our approach of using big atomics in the design of hash tables is a promising direction.

link

2025-01-13

The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways

Evolutionary and bioinspired computation are crucial for efficiently addressing complex optimization problems across diverse application domains.By mimicking processes observed in nature, like evolution itself, these algorithms offer innovative solutions beyond the reach of traditional optimization methods.They excel at finding near-optimal solutions in large, complex search spaces, making them invaluable in numerous fields.However, both areas are plagued by challenges at their core, including inadequate benchmarking, problem-specific overfitting, insufficient theoretical grounding, and superfluous proposals justified only by their biological metaphor.This overview recapitulates and analyzes in depth the criticisms concerning the lack of innovation and rigor in experimental studies within the field.To this end, we examine the judgmental positions of the existing literature in an informed attempt to guide the research community toward directions of solid contribution and advancement in these areas.We summarize guidelines for the design of evolutionary and bioinspired optimizers, the development of experimental comparisons, and the derivation of novel proposals that take a step further in the field. 0.602We provide a brief note on automating the process of creating these algorithms, which may help align metaheuristic optimization research with its primary objective (solving real-world problems), provided that our identified pathways are followed.Our conclusions underscore the need for a sustained push towards innovation and the enforcement of methodological rigor in prospective studies to fully realize the potential of these advanced computational techniques.

link

2025-01-13

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D $s$-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) which attains a continuous performance trade-off between the two baseline algorithms. 0.604 We present a theoretical analysis which shows the convergence, computation, communication, and memory trade-offs between $s$-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. 0.608 Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of $5.3\times$ over $s$-step SGD and speedups up to $121\times$ over FedAvg when used to solve binary classification tasks using the convex, logistic regression model on datasets obtained from the LIBSVM repository.

link

2025-01-13

IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion

Facial video editing has become increasingly important for content creators, enabling the manipulation of facial expressions and attributes. However, existing models encounter challenges such as poor editing quality, high computational costs and difficulties in preserving facial identity across diverse edits. Additionally, these models are often constrained to editing predefined facial attributes, limiting their flexibility to diverse editing prompts. To address these challenges, we propose a novel facial video editing framework that leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models and fine-tunes them specifically for facial video editing tasks. Our approach introduces a targeted fine-tuning scheme that enables high-quality, localized, text-driven edits while ensuring identity preservation across video frames. Additionally, by using pre-trained T2I models during inference, our approach significantly reduces editing time by 80%, while maintaining temporal consistency throughout the video sequence. We evaluate the effectiveness of our approach through extensive testing across a wide range of challenging scenarios, including varying head poses, complex action sequences, and diverse facial expressions. Our method consistently outperforms existing techniques, demonstrating superior performance across a broad set of metrics and benchmarks. 0.766

link

2025-01-13

Evaluating Agent-based Program Repair at Google

Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench, a collection of bugs from highly-rated GitHub Python projects. In addition, various agentic approaches such as SWE-Agent have been proposed to solve bugs in this benchmark. 0.604 This paper explores the viability of using an agentic approach to address bugs in an enterprise context. To investigate this, we curate an evaluation set of 178 bugs drawn from Google's issue tracking system. This dataset spans both human-reported (78) and machine-reported bugs (100). To establish a repair performance baseline on this benchmark, we implement Passerine, an agent similar in spirit to SWE-Agent that can work within Google's development environment. We show that with 20 trajectory samples and Gemini 1.5 Pro, Passerine can produce a patch that passes bug tests (i.e., plausible) for 73% of machine-reported and 25.6% of human-reported bugs in our evaluation set. After manual examination, we found that 43% of machine-reported bugs and 17.9% of human-reported bugs have at least one patch that is semantically equivalent to the ground-truth patch. These results establish a baseline on an industrially relevant benchmark, which, as we show, contains bugs drawn from a different distribution -- in terms of language diversity, size, and spread of changes, etc. -- compared to those in the popular SWE-Bench dataset. 0.625
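As a rough cross-check of the figures in this abstract, the reported percentages can be converted back into approximate bug counts and combined into an overall rate. The per-category counts below are back-calculated by rounding, which is an assumption and not something the abstract states; the helper name `approx_count` is hypothetical.

```python
# Back-of-the-envelope arithmetic on the rates reported in the abstract.
# The per-category counts are inferred by rounding (assumption), not reported.

def approx_count(rate_percent: float, total: int) -> int:
    """Convert a reported percentage into the nearest whole number of bugs."""
    return round(rate_percent / 100 * total)

MACHINE_TOTAL, HUMAN_TOTAL = 100, 78  # evaluation set sizes from the abstract

# Bugs with at least one plausible patch (passes the bug tests).
plausible = approx_count(73, MACHINE_TOTAL) + approx_count(25.6, HUMAN_TOTAL)
# Bugs with at least one patch judged semantically equivalent to ground truth.
equivalent = approx_count(43, MACHINE_TOTAL) + approx_count(17.9, HUMAN_TOTAL)
total = MACHINE_TOTAL + HUMAN_TOTAL

overall_plausible = 100 * plausible / total
overall_equivalent = 100 * equivalent / total

print(f"plausible: {plausible}/{total} = {overall_plausible:.1f}%")
print(f"equivalent: {equivalent}/{total} = {overall_equivalent:.1f}%")
```

Under these rounding assumptions, roughly half of the 178 bugs would have a plausible patch and about a third a semantically equivalent one; the abstract itself only reports the per-category rates.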

link

2025-01-13

Dynamic Prototype Rehearsal for Continual Learning in ECG Arrhythmia Detection

Continual Learning (CL) methods aim to learn from a sequence of tasks while avoiding the challenge of forgetting previous knowledge.We present DREAM-CL, a novel CL method for ECG arrhythmia detection that introduces dynamic prototype rehearsal memory.DREAM-CL selects representative prototypes by clustering data based on learning behavior during each training session.Within each cluster, we apply a smooth sorting operation that ranks samples by training difficulty, compressing extreme values and removing outliers. 0.611The more challenging samples are then chosen as prototypes for the rehearsal memory, ensuring effective knowledge retention across sessions.We evaluate our method on time-incremental, class-incremental, and lead-incremental scenarios using two widely used ECG arrhythmia datasets, Chapman and PTB-XL.The results demonstrate that DREAM-CL outperforms the state-of-the-art in CL for ECG arrhythmia detection.Detailed ablation and sensitivity studies are performed to validate the different design choices of our method.

link

2025-01-13

Dataset Distillation via Committee Voting

Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources.Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets.In this work, we introduce ${\bf C}$ommittee ${\bf V}$oting for ${\bf D}$ataset ${\bf D}$istillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets.We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. 0.685By integrating distributions and predictions from a committee of models while generating high-quality soft labels, our method captures a wider spectrum of data features, reduces model-specific biases and the adverse effects of distribution shifts, leading to significant improvements in generalization.This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks.Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation.Code is available at: https://github.com/Jiacheng8/CV-DD.

link

2025-01-09

Relative Pose Estimation through Affine Corrections of Monocular Depth Priors

Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity from the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. 0.604 We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances on both modules. 0.655 Code is available at https://github.com/MarkYu98/madpose.

link

LLMs

2025-01-16

Managed-Retention Memory: A New Class of Memory for the AI Era

AI clusters today are one of the major uses of High Bandwidth Memory (HBM). However, HBM is suboptimal for AI workloads for several reasons. Analysis shows HBM is overprovisioned on write performance, but underprovisioned on density and read bandwidth, and also has significant energy per bit overheads. It is also expensive, with lower yield than DRAM due to manufacturing complexity. 0.616 We propose a new memory class: Managed-Retention Memory (MRM), which is more optimized to store key data structures for AI inference workloads. We believe that MRM may finally provide a path to viability for technologies that were originally proposed to support Storage Class Memory (SCM). 0.686 These technologies traditionally offered long-term persistence (10+ years) but provided poor IO performance and/or endurance. MRM makes different trade-offs, and by understanding the workload IO patterns, MRM forgoes long-term data retention and write performance for better potential performance on the metrics important for these workloads. 0.601

link

2025-01-16

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. 0.663 While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. 0.629 Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. 0.619 As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning. 0.628
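The notion of counterfactual invariance mentioned in this abstract can be illustrated with a toy check: alter an irrelevant variable (here, response length via appended filler) and measure how much a reward changes. This is a minimal sketch under stated assumptions; the two reward functions, the keyword set, and the padding intervention are hypothetical stand-ins and are not the reward model or regularizer used in the paper.

```python
# Toy check of counterfactual invariance for reward functions.
# Everything here (the rewards, the padding intervention) is a hypothetical
# stand-in, not the reward model or regularizer from the paper.

KEYWORDS = {"paris", "capital", "france"}
FILLER = " Hope this helps!"  # irrelevant text: changes length, not content

def length_biased_reward(response: str) -> float:
    """Spurious reward: longer responses score higher."""
    return 0.01 * len(response)

def content_reward(response: str) -> float:
    """Reward based only on task-relevant keywords in the response."""
    words = {w.strip(".,!?").lower() for w in response.split()}
    return len(words & KEYWORDS) / len(KEYWORDS)

def counterfactual(response: str) -> str:
    """Alter an irrelevant variable (length) while keeping the content."""
    return response + FILLER

def invariance_gap(reward, responses) -> float:
    """Mean squared change in reward under the irrelevant intervention."""
    gaps = [(reward(r) - reward(counterfactual(r))) ** 2 for r in responses]
    return sum(gaps) / len(gaps)

responses = ["Paris is the capital of France.", "I think it is Paris."]
print(invariance_gap(length_biased_reward, responses))  # positive
print(invariance_gap(content_reward, responses))        # zero
```

A gap of this kind could in principle be added to a training loss as a penalty term; whether the paper does so in this form is not stated in the abstract.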

link

2025-01-16

Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

In this work, we develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) specifically for wireless communication applications. 0.61 The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. By utilizing advanced language models for entity extraction and question generation, rigorous data curation processes are employed to maintain high quality and relevance. Additionally, we introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data with 2.24% and 1.31% performance boost for different models compared to baselines, respectively. To demonstrate the effectiveness of the fine-tuned models with the proposed methodologies on practical tasks, we also consider different tasks, including summarizing optimization problems from technical papers and solving the mathematical problems related to non-orthogonal multiple access (NOMA), which are generated by using the proposed multi-agent framework. Simulation results show significant performance gain in summarization tasks with 20.9% in the ROUGE-L metrics. We also study the scaling laws of fine-tuning LLMs and the challenges LLMs face in the field of wireless communications, offering insights into their adaptation to wireless communication tasks. 0.715 This dataset and fine-tuning methodology aim to enhance the training and evaluation of LLMs, contributing to advancements in LLMs for wireless communication research and applications. 0.718

link

2025-01-16

LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading

Recent advances in deep learning and large language models (LLMs) have facilitated the deployment of the mixture-of-experts (MoE) mechanism in the stock investment domain.While these models have demonstrated promising trading performance, they are often unimodal, neglecting the wealth of information available in other modalities, such as textual data.Moreover, the traditional neural network-based router selection mechanism fails to consider contextual and real-world nuances, resulting in suboptimal expert selection.To address these limitations, we propose LLMoE, a novel framework that employs LLMs as the router within the MoE architecture. 0.73Specifically, we replace the conventional neural network-based router with LLMs, leveraging their extensive world knowledge and reasoning capabilities to select experts based on historical price data and stock news. 0.626This approach provides a more effective and interpretable selection mechanism.Our experiments on multimodal real-world stock datasets demonstrate that LLMoE outperforms state-of-the-art MoE models and other deep neural network approaches.Additionally, the flexible architecture of LLMoE allows for easy adaptation to various downstream tasks. 0.669

link

2025-01-16

A Survey of Research in Large Language Models for Electronic Design Automation

Within the rapidly evolving domain of Electronic Design Automation (EDA), Large Language Models (LLMs) have emerged as transformative technologies, offering unprecedented capabilities for optimizing and automating various aspects of electronic design. 0.71This survey provides a comprehensive exploration of LLM applications in EDA, focusing on advancements in model architectures, the implications of varying model sizes, and innovative customization techniques that enable tailored analytical insights. 0.682By examining the intersection of LLM capabilities and EDA requirements, the paper highlights the significant impact these models have on extracting nuanced understandings from complex datasets. 0.627Furthermore, it addresses the challenges and opportunities in integrating LLMs into EDA workflows, paving the way for future research and application in this dynamic field. 0.752Through this detailed analysis, the survey aims to offer valuable insights to professionals in the EDA industry, AI researchers, and anyone interested in the convergence of advanced AI technologies and electronic design.

link

2025-01-16

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. 0.683 Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. 0.733 Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. 0.685 This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. 0.639 Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. 0.714 Therefore, train-time and test-time scaling combine to reveal a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. 0.727 We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. 0.682 We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.

link

2025-01-16

Domain Adaptation of Foundation LLMs for e-Commerce

We present the e-Llama models: 8 billion and 70 billion parameter large language models that are adapted towards the e-commerce domain.These models are meant as foundation models with deep knowledge about e-commerce, that form a base for instruction- and fine-tuning.The e-Llama models are obtained by continuously pretraining the Llama 3.1 base models on 1 trillion tokens of domain-specific data. 0.603We discuss our approach and motivate our choice of hyperparameters with a series of ablation studies.To quantify how well the models have been adapted to the e-commerce domain, we define and implement a set of multilingual, e-commerce specific evaluation tasks. We show that, when carefully choosing the training setup, the Llama 3.1 models can be adapted towards the new domain without sacrificing significant performance on general domain tasks. 0.634We also explore the possibility of merging the adapted model and the base model for a better control of the performance trade-off between domains.

link

2025-01-16

CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education

Many non-traditional students in cybersecurity programs often lack access to advice from peers, family members and professors, which can hinder their educational experiences.Additionally, these students may not fully benefit from various LLM-powered AI assistants due to issues like content relevance, locality of advice, minimum expertise, and timing. 0.705This paper addresses these challenges by introducing an application designed to provide comprehensive support by answering questions related to knowledge, skills, and career preparation advice tailored to the needs of these students.We developed a learning tool platform, CyberMentor, to address the diverse needs and pain points of students majoring in cybersecurity.Powered by agentic workflow and Generative Large Language Models (LLMs), the platform leverages Retrieval-Augmented Generation (RAG) for accurate and contextually relevant information retrieval to achieve accessibility and personalization. 0.634We demonstrated its value in addressing knowledge requirements for cybersecurity education and for career marketability, in tackling skill requirements for analytical and programming assignments, and in delivering real time on demand learning support.Using three use scenarios, we showcased CyberMentor in facilitating knowledge acquisition and career preparation and providing seamless skill-based guidance and support.We also employed the LangChain prompt-based evaluation methodology to evaluate the platform's impact, confirming its strong performance in helpfulness, correctness, and completeness.These results underscore the system's ability to support students in developing practical cybersecurity skills while improving equity and sustainability within higher education.Furthermore, CyberMentor's open-source design allows for adaptation across other disciplines, fostering educational innovation and broadening its potential impact.

link

2025-01-16

FLOL: Fast Baselines for Real-World Low-Light Enhancement

Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. 0.612 The problem of enhancing images captured at night or in dark environments has been well-studied in the image signal processing literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g. scenes with noise, saturated pixels, bad illumination). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our method, FLOL+, is one of the fastest models for this task, achieving state-of-the-art results on popular real scenes datasets such as LOL and LSRW. Moreover, we are able to process 1080p images in under 12 ms. Code and models at https://github.com/cidautai/FLOL

link

2025-01-16

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. 0.601 Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. 0.608 In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and that, given the complicated nature of images, combinations of the components in the framework can be specifically chosen to conform with different application scenarios.
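The search over initial noises described in this abstract can be sketched in its simplest form as best-of-N random search: draw several noise candidates, generate a sample from each, score the samples with a verifier, and keep the best. The sampler and verifier below are toy stand-ins chosen so the example runs on its own; they are assumptions for illustration and not a real diffusion model or any verifier from the paper.

```python
# Best-of-N random search over initial noises, scored by a verifier.
# The sampler and verifier below are toy stand-ins (assumptions), not a real
# diffusion model or any verifier used in the paper.
import random

def toy_sampler(noise: list[float]) -> list[float]:
    """Stand-in for a diffusion sampler: a deterministic map from noise to a sample."""
    return [0.5 * x for x in noise]

def toy_verifier(sample: list[float]) -> float:
    """Stand-in for a verifier: higher is better, best when the sample is all zeros."""
    return -sum(x * x for x in sample)

def search_noise(n_candidates: int, dim: int = 4, seed: int = 0):
    """Draw n_candidates noises, sample from each, and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_sample = float("-inf"), None
    for _ in range(n_candidates):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sample = toy_sampler(noise)
        score = toy_verifier(sample)
        if score > best_score:
            best_score, best_sample = score, sample
    return best_score, best_sample

# More candidates means more inference-time compute; with a fixed seed the
# candidate sets are nested, so the best score can only stay equal or improve.
for n in (1, 4, 16):
    score, _ = search_noise(n)
    print(n, round(score, 4))
```

Here the number of candidates plays the role of the inference-time compute budget. The abstract also mentions other search algorithms and verifier choices, which this sketch does not attempt to reproduce.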

link

2025-01-16

Enhancing Lexicon-Based Text Embeddings with Large Language Models

Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. 0.646 While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. 0.638 To address the inherent tokenization redundancy issue and unidirectional attention limitations in traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering, and investigates bidirectional attention and various pooling strategies. 0.661 Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. 0.634 Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e. BEIR).

link

2025-01-16

Distilling Multi-modal Large Language Models for Autonomous Driving

Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. 0.607 However, using LLMs at test time introduces high computational costs. 0.731 To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. 0.699 Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in longtail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.

link

2025-01-15

Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography

We often interact with untrusted parties. 0.601 Prioritization of privacy can limit the effectiveness of these interactions, as achieving certain goals necessitates sharing private data. Traditionally, addressing this challenge has involved either seeking trusted intermediaries or constructing cryptographic protocols that restrict how much data is revealed, such as multi-party computations or zero-knowledge proofs. While significant advances have been made in scaling cryptographic approaches, they remain limited in terms of the size and complexity of applications they can be used for. In this paper, we argue that capable machine learning models can fulfill the role of a trusted third party, thus enabling secure computations for applications that were previously infeasible. In particular, we describe Trusted Capable Model Environments (TCMEs) as an alternative approach for scaling secure computation, where capable machine learning model(s) interact under input/output constraints, with explicit information flow control and explicit statelessness. This approach aims to achieve a balance between privacy and computational efficiency, enabling private inference where classical cryptographic solutions are currently infeasible. We describe a number of use cases that are enabled by TCME, and show that even some simple classic cryptographic problems can already be solved with TCME. Finally, we outline current limitations and discuss the path forward in implementing them.

link

2025-01-15

Learning to Extract Cross-Domain Aspects and Understanding Sentiments Using Large Language Models

Aspect-based sentiment analysis (ABSA) is a refined approach to sentiment analysis that aims to extract and classify sentiments based on specific aspects or features of a product, service, or entity. Unlike traditional sentiment analysis, which assigns a general sentiment score to entire reviews or texts, ABSA focuses on breaking down the text into individual components or aspects (e.g., quality, price, service) and evaluating the sentiment towards each. This allows for a more granular level of understanding of customer opinions, enabling businesses to pinpoint specific areas of strength and improvement. The process involves several key steps, including aspect extraction, sentiment classification, and aspect-level sentiment aggregation for a review paragraph or any other form that the users have provided. ABSA has significant applications in areas such as product reviews, social media monitoring, customer feedback analysis, and market research. By leveraging techniques from natural language processing (NLP) and machine learning, ABSA facilitates the extraction of valuable insights, enabling companies to make data-driven decisions that enhance customer satisfaction and optimize offerings. As ABSA evolves, it holds the potential to greatly improve personalized customer experiences by providing a deeper understanding of sentiment across various product aspects. In this work, we have analyzed the strength of LLMs for a complete cross-domain aspect-based sentiment analysis with the aim of defining the framework for certain products and using it for other similar situations. 0.657 We argue that this is possible, reaching 92% accuracy on the Aspect Based Sentiment Analysis dataset of SemEval-2015 Task 12.

link

2025-01-15

Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models

As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. 0.71 Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. 0.709 The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. 0.654 Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). 0.628 Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high- from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows. 0.686
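For readers unfamiliar with the internal-consistency statistic reported above, Cronbach's alpha for k items is k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below computes it with the standard library; the ratings are made-up illustrative numbers, not PDSQI-9 data, and the function name is an assumption of this example.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, each row holding one rating per item."""
    k = len(rows[0])
    # Sample variance of each item (column) across all rated summaries.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the per-summary total score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: three summaries scored on two items.
example_rows = [[1, 2], [2, 1], [3, 3]]

if __name__ == "__main__":
    print(round(cronbach_alpha(example_rows), 3))
```

Values close to 1 indicate that the items vary together, which is the sense in which the abstract describes 0.879 as strong internal consistency.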

link

2025-01-15

Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

As Large Language Models (LLMs) and generative AI become increasingly widespread, concerns about content safety have grown in parallel. 0.693 Currently, there is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks and are usable for commercial applications. 0.725 To bridge this gap, we propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. This taxonomy is designed to meet the diverse requirements of downstream users, offering more granular and flexible tools for managing various risk types. Using a hybrid data generation pipeline that combines human annotations with a multi-LLM "jury" system to assess the safety of responses, we obtain Aegis 2.0, a carefully curated collection of 34,248 samples of human-LLM interactions, annotated according to our proposed taxonomy. 0.647 To validate its effectiveness, we demonstrate that several lightweight models, trained using parameter-efficient techniques on Aegis 2.0, achieve performance competitive with leading safety models fully fine-tuned on much larger, non-commercial datasets. In addition, we introduce a novel training blend that combines safety with topic following data. This approach enhances the adaptability of guard models, enabling them to generalize to new risk categories defined during inference. We plan to open-source Aegis 2.0 data and models to the research community to aid in the safety guardrailing of LLMs. 0.662

link

2025-01-15

Multimodal LLMs Can Reason about Aesthetics in Zero-Shot

We present the first study on how Multimodal LLMs' (MLLMs) reasoning ability shall be elicited to evaluate the aesthetics of artworks. 0.659 To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. ArtCoT is proposed, demonstrating that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code available at https://github.com/songrise/MLLM4Art.

link

Developer Research

2025-01-16

Clinicians don't know what explanations they need: A case study on eliciting AI software explainability requirements

This paper analyses how software developers elicit explainability requirements when creating a software application with an AI component, through a case study using AI in the medical context of predicting cerebral palsy (CP) risk in infants. 0.603 Following a small software development team at a Norwegian hospital, we observe their process of simultaneously developing the AI application and discovering what explanations clinicians require from the AI predictions. Since clinicians struggled to articulate their explainability needs before interacting with the system, an iterative approach proved effective: the team started with minimal explanations and refined these based on clinicians' responses during real patient examinations. Our preliminary findings from the first two iterations show that clinicians valued "interrogative explanations" - i.e., tools that let them explore and compare the AI predictions with their own assessments - over detailed technical explanations of the AI model's inner workings. Based on our analysis, we suggest that successful explainability requirements emerge through iterative collaboration between developers and users rather than being fully specified upfront. 0.664 To the best of our knowledge, this is the first empirical case study on eliciting explainability requirements in software engineering.

link

2025-01-16

Simulated Interactive Debugging

Debugging software, i.e., the localization of faults and their repair, is a main activity in software engineering. 0.724 Therefore, effective and efficient debugging is one of the core skills a software engineer must develop. 0.683 However, the teaching of debugging techniques is usually very limited or only taught in indirect ways, e.g., during software projects. As a result, most Computer Science (CS) students learn debugging only in an ad-hoc and unstructured way. In this work, we present our approach called Simulated Interactive Debugging that interactively guides students along the debugging process. 0.613 The guidance aims to empower the students to repair their solutions and have a proper "learning" experience. We envision that such guided debugging techniques can be integrated into programming courses early in the CS education curriculum. 0.635 To perform an initial evaluation, we developed a prototypical implementation using traditional fault localization techniques and large language models. Students can use features like the automated setting of breakpoints or an interactive chatbot. We designed and executed a controlled experiment that included this IDE-integrated tooling with eight undergraduate CS students. Based on the responses, we conclude that the participants liked the systematic guidance by the assisted debugger. 0.616 In particular, they rated the automated setting of breakpoints as the most effective, followed by the interactive debugging and chatting, and the explanations for how breakpoints were set. In our future work, we will improve our concept and implementation, add new features, and perform more intensive user studies.

link

2024-12-24

Automated Code Review In Practice

Code review is a widespread practice to improve software quality and transfer knowledge. 0.63 It is often seen as time-consuming due to the need for manual effort and potential delays. Several AI-assisted tools, such as Qodo, GitHub Copilot, and Coderabbit, provide automated reviews using large language models (LLMs). The effects of such tools in the industry are yet to be examined. This study examines the impact of LLM-based automated code review tools in an industrial setting. The study was conducted within a software development environment that adopted an AI-assisted review tool (based on open-source Qodo PR Agent). Around 238 practitioners across ten projects had access to the tool. We focused on three projects with 4,335 pull requests, 1,568 of which underwent automated reviews. Data collection comprised three sources: (1) a quantitative analysis of pull request data, including comment labels indicating whether developers acted on the automated comments, (2) surveys sent to developers regarding their experience with reviews on individual pull requests, and (3) a broader survey of 22 practitioners capturing their general opinions on automated reviews. 73.8% of automated comments were resolved. However, the average pull request closure duration increased from five hours 52 minutes to eight hours 20 minutes, with varying trends across projects. Most practitioners reported a minor improvement in code quality due to automated reviews. 0.6 The LLM-based tool proved useful in software development, enhancing bug detection, increasing awareness of code quality, and promoting best practices. However, it also led to longer pull request closure times and introduced drawbacks like faulty reviews, unnecessary corrections, and irrelevant comments.

link

2024-12-24

How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity. However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown. In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming tasks, covering 12 popular software development domains and 15 programming languages. Specifically, we perform in-depth research to identify these 12 application domains. Given that each domain may involve multiple technical frameworks, and that different frameworks present distinct challenges in the coding process, we categorize the commonly used frameworks and platforms within each domain. We then sample programming problems from GitHub repositories related to these subdomains. To ensure the quality of the tasks and mitigate data leakage issues, we invite annotators to rewrite the docstrings for each task in MultiCodeBench. Additionally, we build a static analysis-based dependency parsing tool to extract the dependencies in the ground truth for each task, enabling deeper performance analysis. Through extensive experiments on MultiCodeBench with eleven representative mainstream LLMs, we reveal the code generation performance of the LLMs across different application domains, providing practical insights for developers in downstream fields when selecting LLMs. Furthermore, we analyze the reasons behind the models' failures in completing software application development tasks, offering guidance for model developers to enhance domain-specific code generation capabilities. 0.603

link

Data Annotation Techniques