Vincent's Arxiv FrontPage


Generated on 2024-05-09.


This frontpage is made by scraping arXiv and by running a sentence-model that detects whether the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions.
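To make the idea concrete, here is a minimal sketch of sentence-level flagging, assuming the abstract is split into sentences and each sentence gets a score between 0 and 1. The splitter, the `keyword_score` function, and the 0.5 threshold are all hypothetical stand-ins chosen for illustration; the actual frontpage uses a trained sentence-model whose details are not given here.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(abstract: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]


def keyword_score(sentence: str) -> float:
    """Hypothetical stand-in for a trained sentence model.

    Counts dataset-related words and maps the count to a score in [0, 1].
    """
    topic_words = {"dataset", "datasets", "corpus", "benchmark"}
    words = re.findall(r"[a-z]+", sentence.lower())
    hits = sum(word in topic_words for word in words)
    return min(1.0, 0.5 * hits)


def flag_sentences(
    abstract: str,
    score: Callable[[str], float] = keyword_score,
    threshold: float = 0.5,  # assumed cut-off, not the frontpage's real value
) -> List[Tuple[str, float]]:
    """Return (sentence, score) pairs whose score reaches the threshold."""
    scored = [(s, score(s)) for s in split_sentences(abstract)]
    return [(s, p) for s, p in scored if p >= threshold]


if __name__ == "__main__":
    example = (
        "We study tables. "
        "We present a new dataset of 4909 pairs. "
        "Results are good."
    )
    for sentence, value in flag_sentences(example):
        print(f"{value:.2f}  {sentence}")
```

Swapping `keyword_score` for a real classifier only requires passing a different callable as `score`; the rest of the flow stays the same.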


New Datasets

2024-05-08

Impact of Tone-Aware Explanations in Recommender Systems

In recommender systems, the presentation of explanations plays a crucial role in supporting users' decision-making processes. Although numerous existing studies have focused on the effects (transparency or persuasiveness) of explanation content, explanation expression is largely overlooked. Tone, such as formal and humorous, is directly linked to expressiveness and is an important element in human communication. However, studies on the impact of tone on explanations within the context of recommender systems are insufficient. Therefore, this study investigates the effect of explanation tones through an online user study from three aspects: perceived effects, domain differences, and user attributes. We create a dataset using a large language model to generate fictional items and explanations with various tones in the domain of movies, hotels, and home products. [score: 0.81] Collected data analysis reveals different perceived effects of tones depending on the domains. Moreover, user attributes such as age and personality traits are found to influence the impact of tone. This research underscores the critical role of tones in explanations within recommender systems, suggesting that attention to tone can enhance user experience.

link

2024-05-08

QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs

Table summarization is a crucial task aimed at condensing information from tabular data into concise and comprehensible textual summaries. However, existing approaches often fall short of adequately meeting users' information and quality requirements and tend to overlook the complexities of real-world queries. In this paper, we propose a novel method to address these limitations by introducing query-focused multi-table summarization. Our approach, which comprises a table serialization module, a summarization controller, and a large language model (LLM), utilizes textual queries and multiple tables to generate query-dependent table summaries tailored to users' information needs. To facilitate research in this area, we present a comprehensive dataset specifically tailored for this task, consisting of 4909 query-summary pairs, each associated with multiple tables. [score: 0.889] Through extensive experiments using our curated dataset, we demonstrate the effectiveness of our proposed method compared to baseline approaches. Our findings offer insights into the challenges of complex table reasoning for precise summarization, contributing to the advancement of research in query-focused multi-table summarization.

link

2024-05-08

Identifying every building's function in large-scale urban areas with multi-modality remote-sensing data

Buildings, as fundamental man-made structures in urban environments, serve as crucial indicators for understanding various city function zones. Rapid urbanization has raised an urgent need for efficiently surveying building footprints and functions. In this study, we proposed a semi-supervised framework to identify every building's function in large-scale urban areas with multi-modality remote-sensing data. In detail, optical images, building height, and nighttime-light data are collected to describe the morphological attributes of buildings. Then, the area of interest (AOI) and building masks from the volunteered geographic information (VGI) data are collected to form sparsely labeled samples. Furthermore, the multi-modality data and weak labels are utilized to train a segmentation model with a semi-supervised strategy. Finally, results are evaluated by 20,000 validation points and statistical survey reports from the government. The evaluations reveal that the produced function maps achieve an OA of 82% and Kappa of 71% among 1,616,796 buildings in Shanghai, China. This study has the potential to support large-scale urban management and sustainable urban development. All collected data and produced maps are open access at https://github.com/LiZhuoHong/BuildingMap. [score: 0.864]

link

2024-05-08

MOTLEE: Collaborative Multi-Object Tracking Using Temporal Consistency for Neighboring Robot Frame Alignment

Knowing the locations of nearby moving objects is important for a mobile robot to operate safely in a dynamic environment. Dynamic object tracking performance can be improved if robots share observations of tracked objects with nearby team members in real-time. To share observations, a robot must make up-to-date estimates of the transformation from its coordinate frame to the frame of each neighbor, which can be challenging because of odometry drift. We present Multiple Object Tracking with Localization Error Elimination (MOTLEE), a complete system for a multi-robot team to accurately estimate frame transformations and collaboratively track dynamic objects. To accomplish this, robots use open-set image-segmentation methods to build object maps of their environment and then use our Temporally Consistent Alignment of Frames Filter (TCAFF) to align maps and estimate coordinate frame transformations without any initial knowledge of neighboring robot poses. We show that our method for aligning frames enables a team of four robots to collaboratively track six pedestrians with accuracy similar to that of a system with ground truth localization in a challenging hardware demonstration. The code and hardware dataset are available at https://github.com/mit-acl/motlee. [score: 0.821]

link

2024-05-08

BenthicNet: A global compilation of seafloor images for deep learning applications

Advances in underwater imaging enable the collection of extensive seafloor image datasets that are necessary for monitoring important benthic ecosystems. The ability to collect seafloor imagery has outpaced our capacity to analyze it, hindering expedient mobilization of this crucial environmental information. Recent machine learning approaches provide opportunities to increase the efficiency with which seafloor image datasets are analyzed, yet large and consistent datasets necessary to support development of such approaches are scarce. Here we present BenthicNet: a global compilation of seafloor imagery designed to support the training and evaluation of large-scale image recognition models. An initial set of over 11.4 million images was collected and curated to represent a diversity of seafloor environments using a representative subset of 1.3 million images. [score: 0.877] These are accompanied by 2.6 million annotations translated to the CATAMI scheme, which span 190,000 of the images. [score: 0.703] A large deep learning model was trained on this compilation and preliminary results suggest it has utility for automating large and small-scale image analysis tasks. The compilation and model are made openly available for use by the scientific community at https://doi.org/10.20383/103.0614.

link

2024-05-07

Novel View Synthesis with Neural Radiance Fields for Industrial Robot Applications

Neural Radiance Fields (NeRFs) have become a rapidly growing research field with the potential to revolutionize typical photogrammetric workflows, such as those used for 3D scene reconstruction. As input, NeRFs require multi-view images with corresponding camera poses as well as the interior orientation. In the typical NeRF workflow, the camera poses and the interior orientation are estimated in advance with Structure from Motion (SfM). But the quality of the resulting novel views, which depends on different parameters such as the number and distribution of available images, as well as the accuracy of the related camera poses and interior orientation, is difficult to predict. In addition, SfM is a time-consuming pre-processing step, and its quality strongly depends on the image content. Furthermore, the undefined scaling factor of SfM hinders subsequent steps in which metric information is required. In this paper, we evaluate the potential of NeRFs for industrial robot applications. We propose an alternative to SfM pre-processing: we capture the input images with a calibrated camera that is attached to the end effector of an industrial robot and determine accurate camera poses with metric scale based on the robot kinematics. We then investigate the quality of the novel views by comparing them to ground truth, and by computing an internal quality measure based on ensemble methods. For evaluation purposes, we acquire multiple datasets that pose challenges for reconstruction typical of industrial applications, like reflective objects, poor texture, and fine structures. [score: 0.702] We show that the robot-based pose determination reaches similar accuracy as SfM in non-demanding cases, while having clear advantages in more challenging scenarios. Finally, we present first results of applying the ensemble method to estimate the quality of the synthetic novel view in the absence of a ground truth.

link

2024-05-07

Towards Geographic Inclusion in the Evaluation of Text-to-Image Models

Rapid progress in text-to-image generative models coupled with their deployment for visual content creation has magnified the importance of thoroughly evaluating their performance and identifying potential biases. In pursuit of models that generate images that are realistic, diverse, visually appealing, and consistent with the given prompt, researchers and practitioners often turn to automated metrics to facilitate scalable and cost-effective performance profiling. However, commonly-used metrics often fail to account for the full diversity of human preference; often even in-depth human evaluations face challenges with subjectivity, especially as interpretations of evaluation criteria vary across regions and cultures. In this work, we conduct a large, cross-cultural study to examine how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images from state-of-the-art public APIs. We collect over 65,000 image annotations and 20 survey responses. [score: 0.808] We contrast human annotations with common automated metrics, finding that human preferences vary notably across geographic location and that current metrics do not fully account for this diversity. For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative. In addition, the utility of automatic evaluations is dependent on assumptions about their set-up, such as the alignment of feature extractors with human perception of object similarity or the definition of "appeal" captured in reference datasets used to ground evaluations. We recommend steps for improved automatic and human evaluations.

link

2024-05-07

ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning

Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system -- ChatHuman, which combines and integrates the skills of many different methods. [score: 0.773] To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful system for 3D human reasoning.

link

2024-05-07

Tactile-Augmented Radiance Fields

We present a scene representation, which we call a tactile-augmented radiance field (TaRF), that brings vision and touch into a shared 3D space. This representation can be used to estimate the visual and tactile signals for a given 3D position within a scene. We capture a scene's TaRF from a collection of photos and sparsely sampled touch probes. Our approach makes use of two insights: (i) common vision-based touch sensors are built on ordinary cameras and thus can be registered to images using methods from multi-view geometry, and (ii) visually and structurally similar regions of a scene share the same tactile features. We use these insights to register touch signals to a captured visual scene, and to train a conditional diffusion model that, provided with an RGB-D image rendered from a neural radiance field, generates its corresponding tactile signal. To evaluate our approach, we collect a dataset of TaRFs. [score: 0.729] This dataset contains more touch samples than previous real-world datasets, and it provides spatially aligned visual signals for each captured touch signal. [score: 0.868] We demonstrate the accuracy of our cross-modal generative model and the utility of the captured visual-tactile data on several downstream tasks. Project page: https://dou-yiming.github.io/TaRF

link

2024-05-06

UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images

Image safety classifiers play an important role in identifying and mitigating the spread of unsafe images online (e.g., images including violence, hateful rhetoric, etc.). At the same time, with the advent of text-to-image models and increasing concerns about the safety of AI models, developers are increasingly relying on image safety classifiers to safeguard their models. Yet, the performance of current image safety classifiers remains unknown for real-world and AI-generated images. To bridge this research gap, in this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.). [score: 0.794] Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers that are powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are not comprehensive and effective enough in mitigating the multifaceted problem of unsafe images. Also, we find that classifiers trained only on real-world images tend to have degraded performance when applied to AI-generated images. Motivated by these findings, we design and implement a comprehensive image moderation tool called PerspectiveVision, which effectively identifies 11 categories of real-world and AI-generated unsafe images. The best PerspectiveVision model achieves an overall F1-Score of 0.810 on six evaluation datasets, which is comparable with closed-source and expensive state-of-the-art models like GPT-4V. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI.

link

2024-05-06

Are Human Rules Necessary? Generating Reusable APIs with CoT Reasoning and In-Context Learning

Inspired by the great potential of Large Language Models (LLMs) for solving complex coding tasks, in this paper, we propose a novel approach, named Code2API, to automatically perform APIzation for Stack Overflow code snippets. Code2API does not require additional model training or any manual crafting rules and can be easily deployed on personal computers without relying on other external tools. Specifically, Code2API guides the LLMs through well-designed prompts to generate well-formed APIs for given code snippets. To elicit knowledge and logical reasoning from LLMs, we used chain-of-thought (CoT) reasoning and few-shot in-context learning, which can help the LLMs fully understand the APIzation task and solve it step by step in a manner similar to a developer. Our evaluations show that Code2API achieves a remarkable accuracy in identifying method parameters (65%) and return statements (66%) equivalent to human-generated ones, surpassing the current state-of-the-art approach, APIzator, by 15.0% and 16.5% respectively. Moreover, compared with APIzator, our user study demonstrates that Code2API exhibits superior performance in generating meaningful method names, even surpassing the human-level performance, and developers are more willing to use APIs generated by our approach, highlighting the applicability of our tool in practice. Finally, we successfully extend our framework to the Python dataset, achieving a comparable performance with Java, which verifies the generalizability of our tool. [score: 0.713]

link

2024-05-06

Language-Image Models with 3D Understanding

Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: as multi-turn question-answering. [score: 0.722] Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted, such as with a 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively. Cube-LLM also shows competitive results in general MLLM benchmarks such as refCOCO for 2D grounding with an average score of 87.0, as well as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.

link

2024-05-06

Large Language Models Reveal Information Operation Goals, Tactics, and Narrative Frames

Adversarial information operations can destabilize societies by undermining fair elections, manipulating public opinions on policies, and promoting scams. Despite their widespread occurrence and potential impacts, our understanding of influence campaigns is limited by manual analysis of messages and subjective interpretation of their observable behavior. In this paper, we explore whether these limitations can be mitigated with large language models (LLMs), using GPT-3.5 as a case-study for coordinated campaign annotation. We first use GPT-3.5 to scrutinize 126 identified information operations spanning over a decade. We utilize a number of metrics to quantify the close (if imperfect) agreement between LLM and ground truth descriptions. We next extract coordinated campaigns from two large multilingual datasets from X (formerly Twitter) that respectively discuss the 2022 French election and the 2023 Balikatan Philippine-U.S. military exercise. [score: 0.816] For each coordinated campaign, we use GPT-3.5 to analyze posts related to a specific concern and extract goals, tactics, and narrative frames, both before and after critical events (such as the date of an election). While GPT-3.5 sometimes disagrees with subjective interpretation, its ability to summarize and interpret demonstrates LLMs' potential to extract higher-order indicators from text to provide a more complete picture of the information campaigns compared to previous methods.

link

2024-05-06

Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/. [score: 0.72]

link

2024-05-02

Sparse multi-view hand-object reconstruction for unseen environments

Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real world recorded hand-object dataset with unseen objects. [score: 0.794] We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.

link

2024-05-02

UQA: Corpus for Urdu Question Answering

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. [score: 0.765] UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we have an F1 score of 85.99 and 74.56 EM. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA. [score: 0.812]

link

2024-05-02

WildChat: 1M ChatGPT Interaction Logs in the Wild

Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. [score: 0.891] We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. [score: 0.718] This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.

link

2024-05-02

V-FLUTE: Visual Figurative Language Understanding with Textual Explanations

Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. [score: 0.731] The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.

link

2024-05-02

MANTIS: Interleaved Multi-Image Instruction Tuning

Recent years have witnessed a great array of large multimodal models (LMMs) that effectively solve single-image vision language tasks. However, their ability to solve multi-image visual language tasks is yet to be improved. The existing multi-image LMMs (e.g. OpenFlamingo, Emu, Idefics, etc.) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim at building strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K instances from 14 multi-image datasets. [score: 0.862] We design Mantis-Instruct to cover different multi-image skills like co-reference, reasoning, comparing, and temporal understanding. We combine Mantis-Instruct with several single-image visual-language datasets to train our model Mantis to handle any interleaved image-text inputs. [score: 0.738] We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks. Though only requiring academic-level resources (i.e. 36 hours on 16xA100-40G), Mantis-8B can achieve state-of-the-art performance on all the multi-image benchmarks and beats the existing best multi-image LMM Idefics2-8B by an average of 9 absolute points. We observe that Mantis performs equivalently well on the held-in and held-out evaluation benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis can maintain a strong single-image performance on par with CogVLM and Emu2. Our results are particularly encouraging as they show that low-cost instruction tuning is indeed much more effective than intensive pre-training in terms of building multi-image LMMs.

link

2024-05-01

KVP10k: A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academia, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where the extraction process revolves around extracting information using a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, navigating through an array of diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce KVP10k, a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10707 richly annotated images. [score: 0.89] In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task. KVP10k sets itself apart with its extensive diversity in data and richly detailed annotations, paving the way for advancements in the field of information extraction from complex business documents.

link

2024-05-01

ChatBI: Towards Natural Language to Complex Business Intelligence SQL

The Natural Language to SQL (NL2SQL) technology provides non-expert users who are unfamiliar with databases the opportunity to use SQL for data analysis. Converting Natural Language to Business Intelligence (NL2BI) is a popular practical scenario for NL2SQL in actual production systems. Compared to NL2SQL, NL2BI introduces more challenges. In this paper, we propose ChatBI, a comprehensive and efficient technology for solving the NL2BI task. First, we analyze the interaction mode, an important module where NL2SQL and NL2BI differ in use, and design a smaller and cheaper model to match this interaction mode. In BI scenarios, tables contain a huge number of columns, making it impossible for existing NL2SQL methods that rely on Large Language Models (LLMs) for schema linking to proceed due to token limitations. The higher proportion of ambiguous columns in BI scenarios also makes schema linking difficult. ChatBI combines existing view technology in the database community to first decompose the schema linking problem into a Single View Selection problem and then uses a smaller and cheaper machine learning model to select the single view with a significantly reduced number of columns. The columns of this single view are then passed as the required columns for schema linking into the LLM. Finally, ChatBI proposes a phased process flow different from existing process flows, which allows ChatBI to generate SQL containing complex semantics and comparison relations more accurately. We have deployed ChatBI on Baidu's data platform and integrated it into multiple product lines for large-scale production task evaluation. [score: 0.704] The obtained results highlight its superiority in practicality, versatility, and efficiency. At the same time, compared with the current mainstream NL2SQL technology under our real BI scenario data tables and queries, it also achieved the best results.

link

2024-05-01

New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis

The emergence of multimodal data on social media platforms presents new opportunities to better understand user sentiments toward a given aspect. However, existing multimodal datasets for Aspect-Category Sentiment Analysis (ACSA) often focus on textual annotations, neglecting fine-grained information in images. Consequently, these datasets fail to fully exploit the richness inherent in multimodal data. To address this, we introduce a new Vietnamese multimodal dataset, named ViMACSA, which consists of 4,876 text-image pairs with 14,618 fine-grained annotations for both text and image in the hotel domain. 0.88 Additionally, we propose a Fine-Grained Cross-Modal Fusion Framework (FCMF) that effectively learns both intra- and inter-modality interactions and then fuses this information to produce a unified multimodal representation. Experimental results show that our framework outperforms SOTA models on the ViMACSA dataset, achieving the highest F1 score of 79.73%. We also explore characteristics and challenges in Vietnamese multimodal sentiment analysis, including misspellings, abbreviations, and the complexities of the Vietnamese language. This work contributes both a benchmark dataset and a new framework that leverages fine-grained multimodal information to improve multimodal aspect-category sentiment analysis. Our dataset is available for research purposes: https://github.com/hoangquy18/Multimodal-Aspect-Category-Sentiment-Analysis.

link

2024-05-01

Long-Term Human Trajectory Prediction using 3D Dynamic Scene Graphs

We present a novel approach for long-term human trajectory prediction, which is essential for long-horizon robot planning in human-populated environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage Large Language Models (LLMs) to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. 0.851 We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood (NLL) and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged baselines for a time horizon of 60s.

link

2024-05-01

Are Models Biased on Text without Gender-related Language?

Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine if the sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at https://ucinlp.github.io/unstereo-eval. 0.728

link

2024-05-01

Causal Evaluation of Language Models

Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the CaLM dataset, comprising 126,334 data samples, to provide curated sets of causal targets, adaptations, metrics, and errors, offering extensive coverage for diverse research pursuits. 0.932 Third, we conduct an extensive evaluation of 28 leading language models on a core set of 92 causal targets, 9 adaptations, 7 metrics, and 12 error types. Fourth, we perform detailed analyses of the evaluation results across various dimensions (e.g., adaptation, scale). Fifth, we present 50 high-level empirical findings across 9 dimensions (e.g., model), providing valuable guidance for future language model development. Finally, we develop a multifaceted platform, including a website, leaderboards, datasets, and toolkits, to support scalable and adaptable assessments. We envision CaLM as an ever-evolving benchmark for the community, systematically updated with new causal targets, adaptations, models, metrics, and error types to reflect ongoing research advancements. Project website is at https://opencausalab.github.io/CaLM.

link

2024-04-30

On Training a Neural Network to Explain Binaries

In this work, we begin to investigate the possibility of training a deep neural network on the task of binary code understanding. Specifically, the network would take, as input, features derived directly from binaries and output English descriptions of functionality to aid a reverse engineer in investigating the capabilities of a piece of closed-source software, be it malicious or benign. Given recent success in applying large language models (generative AI) to the task of source code summarization, this seems a promising direction. However, in our initial survey of the available datasets, we found nothing of sufficiently high quality and volume to train these complex models. Instead, we build our own dataset derived from a capture of Stack Overflow containing 1.1M entries. 0.824 A major result of our work is a novel dataset evaluation method using the correlation between two distances on sample pairs: one distance in the embedding space of inputs and the other in the embedding space of outputs. Intuitively, if two samples have inputs close in the input embedding space, their outputs should also be close in the output embedding space. We found this Embedding Distance Correlation (EDC) test to be highly diagnostic, indicating that our collected dataset and several existing open-source datasets are of low quality as the distances are not well correlated. We proceeded to explore the general applicability of EDC, applying it to a number of qualitatively known good datasets and a number of synthetically known bad ones, and found it to be a reliable indicator of dataset value.
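
The Embedding Distance Correlation idea described above can be illustrated with a minimal sketch. The abstract does not say which distance or correlation coefficient the authors use, so Euclidean distance and Pearson correlation are assumptions here, and the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("distances are constant; correlation is undefined")
    return cov / math.sqrt(var_x * var_y)


def embedding_distance_correlation(input_embs, output_embs):
    """Correlate pairwise input distances with pairwise output distances.

    input_embs[i] and output_embs[i] are the embeddings of the input and
    output of sample i. A value near 1 suggests that samples with similar
    inputs also have similar outputs.
    """
    if len(input_embs) != len(output_embs):
        raise ValueError("need one output embedding per input embedding")
    pairs = list(combinations(range(len(input_embs)), 2))
    d_in = [euclidean(input_embs[i], input_embs[j]) for i, j in pairs]
    d_out = [euclidean(output_embs[i], output_embs[j]) for i, j in pairs]
    return pearson(d_in, d_out)
```

In practice the embeddings would come from some encoder applied to each input and output; on a large dataset one would likely sample pairs rather than enumerate all of them.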

link

2024-04-30

VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% in six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaption, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generating cross-domain scene text spotting, in contrast to our VimTS model which requires significantly fewer parameters and data. The code and datasets will be made available at https://VimTextSpotter.github.io. 0.866

link

2024-04-30

Automated, Reliable, and Efficient Continental-Scale Replication of 7.3 Petabytes of Climate Simulation Data: A Case Study

We report on our experiences replicating 7.3 petabytes (PB) of Earth System Grid Federation (ESGF) climate simulation data from Lawrence Livermore National Laboratory (LLNL) in California to Argonne National Laboratory (ANL) in Illinois and Oak Ridge National Laboratory (ORNL) in Tennessee. 0.766 This movement of some 29 million files, twice, undertaken in order to establish new ESGF nodes at ANL and ORNL, was performed largely automatically by a simple replication tool, a script that invoked Globus to transfer large bundles of files while tracking progress in a database. Under the covers, Globus organized transfers to make efficient use of the high-speed Energy Sciences Network (ESnet) and the data transfer nodes deployed at participating sites, and also addressed security, integrity checking, and recovery from a variety of transient failures. This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure.

link

2024-04-30

DOCCI: Descriptions of Connected and Contrasting Images

Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. 0.92 We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.

link

2024-04-29

The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages

Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages: English, German, Spanish, Turkish, Arabic, and Dutch, covering 80 topics such as Culture, Politics, and News. 0.738 We thoroughly analyze how toxicity spikes within different communities in relation to specific topics. We observe consistent patterns of increased toxicity across languages for certain topics, while also noting significant variations within specific language communities.

link

2024-04-29

Survey on Datasets for Perception in Unstructured Outdoor Environments

Perception is an essential component of pipelines in field robotics. In this survey, we quantitatively compare publicly available datasets for unstructured outdoor environments. 0.735 We focus on datasets for common perception tasks in field robotics. Our survey categorizes and compares available research datasets. This survey also reports on relevant dataset characteristics to help practitioners determine which dataset fits best for their own application. We believe more consideration should be taken in choosing compatible annotation policies across the datasets in unstructured outdoor environments.

link

2024-04-29

A Universal Metric of Dataset Similarity for Cross-silo Federated Learning

Federated Learning (FL) is increasingly used in domains such as healthcare to facilitate collaborative model training without data-sharing. However, datasets located in different sites are often non-identically distributed, leading to degradation of model performance in FL. Most existing methods for assessing these distribution shifts are limited by being dataset or task-specific. Moreover, these metrics can only be calculated by exchanging data, a practice restricted in many FL scenarios. To address these challenges, we propose a novel metric for assessing dataset similarity. Our metric exhibits several desirable properties for FL: it is dataset-agnostic, is calculated in a privacy-preserving manner, and is computationally efficient, requiring no model training. In this paper, we first establish a theoretical connection between our metric and training dynamics in FL. Next, we extensively evaluate our metric on a range of datasets including synthetic, benchmark, and medical imaging datasets. 0.769 We demonstrate that our metric shows a robust and interpretable relationship with model performance and can be calculated in a privacy-preserving manner. As the first federated dataset similarity metric, we believe this metric can better facilitate successful collaborations between sites.

link

2024-04-29

OpenStreetView-5M: The Many Roads to Global Visual Geolocation

Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet, the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. 0.87 In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated codes and models can be found at https://github.com/gastruc/osv5m.

link

2024-04-29

A Multilevel Strategy to Improve People Tracking in a Real-World Scenario

The Palácio do Planalto, office of the President of Brazil, was invaded by protesters on January 8, 2023. Surveillance videos taken from inside the building were subsequently released by the Brazilian Supreme Court for public scrutiny. We used segments of such footage to create the UFPR-Planalto801 dataset for people tracking and re-identification in a real-world scenario. 0.739 This dataset consists of more than 500,000 images. 0.953 This paper presents a tracking approach targeting this dataset. The method proposed in this paper relies on the use of known state-of-the-art trackers combined in a multilevel hierarchy to correct the ID association over the trajectories. We evaluated our method using IDF1, MOTA, MOTP and HOTA metrics. The results show improvements for every tracker used in the experiments, with IDF1 score increasing by a margin up to 9.5%.

link

2024-04-25

CBRW: A Novel Approach for Cancelable Biometric Template Generation based on

Cancelable biometrics is a challenging research field in which the security of an original biometric image is ensured by transforming the original biometric into another irreversible domain. Several approaches have been suggested in the literature for generating cancelable biometric templates. In this paper, two novel and simple cancelable biometric template generation methods based on Random Walk (CBRW) have been proposed. By employing random walk and other steps given in the proposed two algorithms, viz. CBRW-BitXOR and CBRW-BitCMP, the original biometric is transformed into a cancelable template. The performance of the proposed methods is compared with other state-of-the-art methods. Experiments have been performed on eight publicly available gray and color datasets, i.e., CP (ear) (gray and color), UTIRIS (iris) (gray and color), ORL (face) (gray), IIT Delhi (iris) (gray and color), and AR (face) (color). 0.825 Performance of the generated templates is measured in terms of Correlation Coefficient (Cr), Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), Mean Absolute Error (MAE), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). The experimental results show that the proposed methods are superior to other state-of-the-art methods in qualitative as well as quantitative analysis. Furthermore, CBRW performs better on both gray and color images.

link

2024-04-25

Dataset of Quotation Attribution in German News Articles

Extracting who says what to whom is a crucial part in analyzing human communication in today's abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, creative-commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. 0.756 The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema enabling various downstream uses for the dataset. 0.895 The annotations not only specify who said what but also how, in which context, to whom and define the type of quotation. We specify our annotation schema, describe the creation of the dataset and provide a quantitative analysis. 0.753 Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset and outline use cases of our dataset in downstream tasks.

link

2024-04-25

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. 0.712 These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision and Claude-3-Opus) and emphasize the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at https://github.com/AILab-CVC/SEED-Bench. 0.787

link

2024-04-25

Meta-Transfer Derm-Diagnosis: Exploring Few-Shot Learning and Transfer Learning for Skin Disease Classification in Long-Tail Distribution

Addressing the challenges of rare diseases is difficult, especially with the limited number of reference images and a small patient population. This is more evident in rare skin diseases, where we encounter long-tailed data distributions that make it difficult to develop unbiased and broadly effective models. The diverse ways in which image datasets are gathered and their distinct purposes also add to these challenges. Our study conducts a detailed examination of the benefits and drawbacks of episodic and conventional training methodologies, adopting a few-shot learning approach alongside transfer learning. We evaluated our models using the ISIC2018, Derm7pt, and SD-198 datasets. 0.757 With minimal labeled examples, our models showed substantial information gains and better performance compared to previously trained models. Our research emphasizes the improved ability to represent features in DenseNet121 and MobileNetV2 models, achieved by using pre-trained models on ImageNet to increase similarities within classes. Moreover, our experiments, ranging from 2-way to 5-way classifications with up to 10 examples, showed a growing success rate for traditional transfer learning methods as the number of examples increased. The addition of data augmentation techniques significantly improved our transfer learning based model performance, leading to higher performances than existing methods, especially in the SD-198 and ISIC2018 datasets. All source code related to this work will be made publicly available soon at the provided URL.

link

2024-04-25

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. 0.791 We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

link

2024-04-25

Learning Visuotactile Skills with Two Multifingered Hands

Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. To tackle the first challenge, we develop HATO, a low-cost hands-arms teleoperation system that leverages off-the-shelf electronics, complemented with a software suite that enables efficient data collection; the comprehensive software suite also supports multimodal data processing, scalable policy learning, and smooth policy deployment. To tackle the latter challenge, we introduce a novel hardware adaptation by repurposing two prosthetic hands equipped with touch sensors for research. Using visuotactile data collected from our system, we learn skills to complete long-horizon, high-precision tasks which are difficult to achieve without multifingered dexterity and touch feedback. Furthermore, we empirically investigate the effects of dataset size, sensing modality, and visual input preprocessing on policy learning. Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data. Videos, code, and datasets can be found at https://toruowo.github.io/hato/. 0.77

link

2024-04-25

The Third Monocular Depth Estimation Challenge

This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. 0.702 As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set: 10 among them submitted a report describing their approach, highlighting a widespread use of foundational models such as Depth Anything at the core of their method. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.

link

Data Quality

2024-05-08

CARE-SD: Classifier-based analysis for recognizing and eliminating stigmatizing and doubt marker labels in electronic health records: model development and validation

Objective: To detect and classify features of stigmatizing and biased language in intensive care electronic health records (EHRs) using natural language processing techniques. Materials and Methods: We first created a lexicon and regular expression lists from literature-driven stem words for linguistic features of stigmatizing patient labels, doubt markers, and scare quotes within EHRs. The lexicon was further extended using Word2Vec and GPT 3.5, and refined through human evaluation. These lexicons were used to search for matches across 18 million sentences from the de-identified Medical Information Mart for Intensive Care-III (MIMIC-III) dataset. For each linguistic bias feature, 1000 sentence matches were sampled, labeled by expert clinical and public health annotators, and used to train supervised learning classifiers. Results: Lexicon development from expanded literature stem-word lists resulted in a doubt marker lexicon containing 58 expressions, and a stigmatizing labels lexicon containing 127 expressions. Classifiers for doubt markers and stigmatizing labels had the highest performance, with macro F1-scores of .84 and .79, positive-label recall and precision values ranging from .71 to .86, and accuracies aligning closely with human annotator agreement (.87). Discussion: This study demonstrated the feasibility of supervised classifiers in automatically identifying stigmatizing labels and doubt markers in medical text, and identified trends in stigmatizing language use in an EHR setting. 0.636 Additional labeled data may help improve lower scare quote model performance. Conclusions: Classifiers developed in this study showed high model performance and can be applied to identify patterns and target interventions to reduce stigmatizing labels and doubt markers in healthcare systems.

link

2024-05-06

Learning Robust Classifiers with Self-Guided Spurious Correlation Mitigation

Deep neural classifiers tend to rely on spurious correlations between spurious attributes of inputs and targets to make predictions, which could jeopardize their generalization capability. Training classifiers robust to spurious correlations typically relies on annotations of spurious correlations in data, which are often expensive to get. 0.616 In this paper, we tackle an annotation-free setting and propose a self-guided spurious correlation mitigation framework. Our framework automatically constructs fine-grained training labels tailored for a classifier obtained with empirical risk minimization to improve its robustness against spurious correlations. 0.662 The fine-grained training labels are formulated with different prediction behaviors of the classifier identified in a novel spuriousness embedding space. 0.662 We construct the space with automatically detected conceptual attributes and a novel spuriousness metric which measures how likely a class-attribute correlation is exploited for predictions. We demonstrate that training the classifier to distinguish different prediction behaviors reduces its reliance on spurious correlations without knowing them a priori and outperforms prior methods on five real-world datasets.

link

2024-05-06

Why is SAM Robust to Label Noise?

Sharpness-Aware Minimization (SAM) is best known for achieving state-of-the-art performance on natural image and language tasks. However, its most pronounced improvements (of tens of percent) are rather in the presence of label noise. 0.637 Understanding SAM's label noise robustness requires a departure from characterizing the robustness of minima lying in "flatter" regions of the loss landscape. 0.635 In particular, the peak performance under label noise occurs with early stopping, far before the loss converges. We decompose SAM's robustness into two effects: one induced by changes to the logit term and the other induced by changes to the network Jacobian. The first can be observed in linear logistic regression where SAM provably up-weights the gradient contribution from clean examples. Although this explicit up-weighting is also observable in neural networks, when we intervene and modify SAM to remove this effect, surprisingly, we see no visible degradation in performance. We infer that SAM's effect in deeper networks is instead explained entirely by the effect SAM has on the network Jacobian. We theoretically derive the implicit regularization induced by this Jacobian effect in two layer linear networks. Motivated by our analysis, we see that cheaper alternatives to SAM that explicitly induce these regularization effects largely recover the benefits in deep networks trained on real-world datasets.
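
For readers unfamiliar with SAM, the following is a minimal sketch of one SAM update for linear logistic regression, the setting the abstract mentions. It follows the commonly described two-step form of SAM (ascend to a perturbed point, then descend using the gradient taken there); it is not the authors' code, and the learning rate `lr` and radius `rho` are hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad(w, xs, ys):
    """Gradient of the mean logistic loss; labels ys are 0 or 1."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for k, xk in enumerate(x):
            grad[k] += (p - y) * xk / len(xs)
    return grad


def sam_step(w, xs, ys, lr=0.1, rho=0.05):
    """One SAM update: perturb the weights along the normalized gradient,
    then apply the gradient evaluated at the perturbed weights."""
    g = logistic_grad(w, xs, ys)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step of radius rho (lr and rho are illustrative assumptions).
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step from the original weights using the perturbed gradient.
    g_adv = logistic_grad(w_adv, xs, ys)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

The only difference from plain gradient descent is that the update uses `g_adv` instead of `g`; the abstract's logit-term and Jacobian effects both arise from that substitution.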

link
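
For the SAM entry above, the following is a minimal sketch of the standard SAM update applied to linear logistic regression, the setting in which the abstract says the up-weighting of clean examples can be observed. It is not the authors' code: the function names, the step size `lr`, the radius `rho`, and the synthetic data with 20% flipped labels are all assumptions made for illustration.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss for labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def sam_step(w, X, y, lr=0.1, rho=0.05):
    """One SAM update: move to a nearby high-loss point, then descend
    from the original weights using the gradient taken at that point."""
    g = logistic_grad(w, X, y)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
    g_perturbed = logistic_grad(w + eps, X, y)
    return w - lr * g_perturbed

# Hypothetical synthetic data: linearly separable labels with 20% flipped.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)
flip = rng.random(200) < 0.2
y_noisy = np.where(flip, 1.0 - y, y)

w = np.zeros(5)
for _ in range(100):
    w = sam_step(w, X, y_noisy)
```

Setting `rho=0.0` reduces `sam_step` to plain gradient descent, which gives a convenient baseline for comparing the two updates on the same noisy labels.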

2024-05-01

FMLFS: A federated multi-label feature selection based on information theory in IoT environment

In certain emerging applications such as health monitoring wearable and traffic monitoring systems, Internet-of-Things (IoT) devices generate or collect a huge amount of multi-label datasets.Within these datasets, each instance is linked to a set of labels.The presence of noisy, redundant, or irrelevant features in these datasets, along with the curse of dimensionality, poses challenges for multi-label classifiers. 0.652Feature selection (FS) proves to be an effective strategy in enhancing classifier performance and addressing these challenges.Yet, there is currently no existing distributed multi-label FS method documented in the literature that is suitable for distributed multi-label datasets within IoT environments.This paper introduces FMLFS, the first federated multi-label feature selection method.Here, mutual information between features and labels serves as the relevancy metric, while the correlation distance between features, derived from mutual information and joint entropy, is utilized as the redundancy measure.Following aggregation of these metrics on the edge server and employing Pareto-based bi-objective and crowding distance strategies, the sorted features are subsequently sent back to the IoT devices.The proposed method is evaluated through two scenarios: 1) transmitting reduced-size datasets to the edge server for centralized classifier usage, and 2) employing federated learning with reduced-size datasets.Evaluation across three metrics - performance, time complexity, and communication cost - demonstrates that FMLFS outperforms five other comparable methods in the literature and provides a good trade-off on three real-world datasets.

link
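
The FMLFS entry above uses mutual information between features and labels as its relevancy metric. As a rough illustration of that quantity only (not of the federated aggregation or the Pareto-based ranking), here is a sketch that estimates mutual information for two discrete sequences from their empirical counts. The function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

# A feature that copies the label carries one full bit; an independent one carries none.
informative = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
uninformative = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```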

2024-04-29

Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking

Data augmentation techniques apply transformations to existing texts to generate additional data.The transformations may produce low-quality texts, where the meaning of the text is changed and the text may even be mangled beyond human comprehension.Analyzing the synthetically generated texts and their corresponding labels is slow and demanding.To winnow out texts with incorrect labels, we develop INSPECTOR, a human-in-the-loop data inspection technique. 0.622INSPECTOR combines the strengths of provenance tracking techniques with assistive labeling.INSPECTOR allows users to group related texts by their transformation provenance, i.e., the transformations applied to the original text, or feature provenance, the linguistic features of the original text.For assistive labeling, INSPECTOR computes metrics that approximate data quality, and allows users to compare the corresponding label of each text against the predictions of a large language model.In a user study, INSPECTOR increases the number of texts with correct labels identified by 3X on a sentiment analysis task and by 4X on a hate speech detection task.The participants found grouping the synthetically generated texts by their common transformation to be the most useful technique.Surprisingly, grouping texts by common linguistic features was perceived to be unhelpful.Contrary to prior work, our study finds that no single technique obviates the need for human inspection effort.This validates the design of INSPECTOR which combines both analysis of data provenance and assistive labeling to reduce human inspection effort.

link

2024-04-18

KDk: A Defense Mechanism Against Label Inference Attacks in Vertical Federated Learning

Vertical Federated Learning (VFL) is a category of Federated Learning in which models are trained collaboratively among parties with vertically partitioned data. Typically, in a VFL scenario, the labels of the samples are kept private from all the parties except for the aggregating server, which is the label owner. Nevertheless, recent works discovered that by exploiting gradient information returned by the server to bottom models, with the knowledge of only a small set of auxiliary labels on a very limited subset of training data points, an adversary can infer the private labels. These attacks are known as label inference attacks in VFL. In our work, we propose a novel framework called KDk, which combines Knowledge Distillation and k-anonymity to provide a defense mechanism against potential label inference attacks in a VFL scenario. Through an exhaustive experimental campaign we demonstrate that by applying our approach, the performance of the analyzed label inference attacks decreases consistently, even by more than 60%, while the accuracy of the whole VFL remains almost unaltered. 0.61

link

Benchmarks

2024-05-08

Novel Actor-Critic Algorithm for Robust Decision Making of CAV under Delays and Loss of V2X Data

Current autonomous driving systems heavily rely on V2X communication data to enhance situational awareness and the cooperation between vehicles.However, a major challenge when using V2X data is that it may not be available periodically because of unpredictable delays and data loss during wireless transmission between road stations and the receiver vehicle.This issue should be considered when designing control strategies for connected and autonomous vehicles.Therefore, this paper proposes a novel 'Blind Actor-Critic' algorithm that guarantees robust driving performance in V2X environment with delayed and/or lost data.The novel algorithm incorporates three key mechanisms: a virtual fixed sampling period, a combination of Temporal-Difference and Monte Carlo learning, and a numerical approximation of immediate reward values.To address the temporal aperiodicity problem of V2X data, we first illustrate this challenge.Then, we provide a detailed explanation of the Blind Actor-Critic algorithm where we highlight the proposed components to compensate for the temporal aperiodicity problem of V2X data.We evaluate the performance of our algorithm in a simulation environment and compare it to benchmark approaches. 0.768The results demonstrate that training metrics are improved compared to conventional actor-critic algorithms.Additionally, testing results show that our approach provides robust control, even under low V2X network reliability levels.

link

2024-05-08

QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs

Table summarization is a crucial task aimed at condensing information from tabular data into concise and comprehensible textual summaries.However, existing approaches often fall short of adequately meeting users' information and quality requirements and tend to overlook the complexities of real-world queries.In this paper, we propose a novel method to address these limitations by introducing query-focused multi-table summarization.Our approach, which comprises a table serialization module, a summarization controller, and a large language model (LLM), utilizes textual queries and multiple tables to generate query-dependent table summaries tailored to users' information needs.To facilitate research in this area, we present a comprehensive dataset specifically tailored for this task, consisting of 4909 query-summary pairs, each associated with multiple tables.Through extensive experiments using our curated dataset, we demonstrate the effectiveness of our proposed method compared to baseline approaches. 0.659Our findings offer insights into the challenges of complex table reasoning for precise summarization, contributing to the advancement of research in query-focused multi-table summarization.

link

2024-05-08

Selective Classification Under Distribution Shifts

In selective classification (SC), a classifier abstains from making predictions that are likely to be wrong to avoid excessive errors.To deploy imperfect classifiers -- imperfect either due to intrinsic statistical noise of data or for robustness issue of the classifier or beyond -- in high-stakes scenarios, SC appears to be an attractive and necessary path to follow.Despite decades of research in SC, most previous SC methods still focus on the ideal statistical setting only, i.e., the data distribution at deployment is the same as that of training, although practical data can come from the wild.To bridge this gap, in this paper, we propose an SC framework that takes into account distribution shifts, termed generalized selective classification, that covers label-shifted (or out-of-distribution) and covariate-shifted samples, in addition to typical in-distribution samples, the first of its kind in the SC literature.We focus on non-training-based confidence-score functions for generalized SC on deep learning (DL) classifiers and propose two novel margin-based score functions.Through extensive analysis and experiments, we show that our proposed score functions are more effective and reliable than the existing ones for generalized SC on a variety of classification tasks and DL classifiers. 0.611

link
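
The selective classification entry above describes a classifier that abstains when its confidence score is too low. The sketch below shows that mechanism with a generic margin score, the gap between the two largest logits. This is an assumed stand-in for illustration and not the two margin-based score functions the paper proposes; the threshold and the logits are made up.

```python
import numpy as np

def margin_score(logits):
    """Confidence as the gap between the largest and second-largest logit per row."""
    top2 = np.sort(logits, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def selective_predict(logits, threshold):
    """Predicted class per row, or -1 where the classifier abstains."""
    preds = np.argmax(logits, axis=1)
    return np.where(margin_score(logits) >= threshold, preds, -1)

logits = np.array([[4.0, 1.0, 0.5],   # clear winner: predict class 0
                   [2.0, 1.9, 0.1],   # near tie: abstain
                   [0.2, 0.1, 3.0]])  # clear winner: predict class 2
decisions = selective_predict(logits, threshold=1.0).tolist()
print(decisions)  # [0, -1, 2]
```

Because the score is computed from the logits alone, it needs no retraining, which matches the "non-training-based" setting named in the abstract.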

2024-05-08

ProbRadarM3F: mmWave Radar based Human Skeletal Pose Estimation with Probability Map Guided Multi-Format Feature Fusion

Millimetre wave (mmWave) radar is a non-intrusive, privacy-preserving, and relatively convenient and inexpensive device, which has been demonstrated to be applicable in place of RGB cameras in human indoor pose estimation tasks. However, mmWave radar relies on the collection of reflected signals from the target, and the information contained in the radar signals is difficult to exploit fully. This has been a long-standing hindrance to the improvement of pose estimation accuracy. To address this major challenge, this paper introduces a probability map guided multi-format feature fusion model, ProbRadarM3F. This is a novel radar feature extraction framework using a traditional FFT method in parallel with a probability map based positional encoding method. ProbRadarM3F fuses the traditional heatmap features and the positional features, then effectively achieves the estimation of 14 keypoints of the human body. Experimental evaluation on the HuPR dataset proves the effectiveness of the model proposed in this paper, outperforming other methods experimented on this dataset with an AP of 69.9%. 0.618 The emphasis of our study is on position information in the radar signal that has not been exploited before. This provides direction to investigate other potential non-redundant information from mmWave radar.

link

2024-05-08

Picking watermarks from noise (PWFN): an improved robust watermarking model against intensive distortions

Digital watermarking is the process of embedding secret information by altering images in a way that is undetectable to the human eye.To increase the robustness of the model, many deep learning-based watermarking methods use the encoder-decoder architecture by adding different noises to the noise layer.The decoder then extracts the watermarked information from the distorted image.However, this method can only resist weak noise attacks.To improve the robustness of the algorithm against stronger noise, this paper proposes to introduce a denoise module between the noise layer and the decoder.The module is aimed at reducing noise and recovering some of the information lost during an attack.Additionally, the paper introduces the SE module to fuse the watermarking information pixel-wise and channel dimensions-wise, improving the encoder's efficiency.Experimental results show that our proposed method is comparable to existing models and outperforms state-of-the-art under different noise intensities. 0.662In addition, ablation experiments show the superiority of our proposed module.

link

2024-05-08

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

Diffusion models are a powerful generative framework, but come with expensive inference.Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime.In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps.Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction.Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. 0.649Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation.

link

2024-05-08

DiskGNN: Bridging I/O Efficiency and Model Accuracy for Out-of-Core GNN Training

Graph neural networks (GNNs) are machine learning models specialized for graph data and widely used in many applications. To train GNNs on large graphs that exceed CPU memory, several systems store data on disk and conduct out-of-core processing. However, these systems suffer from either read amplification when reading node features that are usually smaller than a disk page or degraded model accuracy by treating the graph as disconnected partitions. To close this gap, we build a system called DiskGNN, which achieves high I/O efficiency and thus fast training without hurting model accuracy. The key technique used by DiskGNN is offline sampling, which helps decouple graph sampling from model computation. In particular, by conducting graph sampling beforehand, DiskGNN acquires the node features that will be accessed by model computation, and such information is utilized to pack the target node features contiguously on disk to avoid read amplification. Besides, DiskGNN also adopts designs including a four-level feature store to fully utilize the memory hierarchy to cache node features and reduce disk access, batched packing to accelerate the feature packing process, and pipelined training to overlap disk access with other operations. We compare DiskGNN with Ginex and MariusGNN, which are state-of-the-art systems for out-of-core GNN training. The results show that DiskGNN can speed up the baselines by over 8x while matching their best model accuracy. 0.675

link

2024-05-08

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images.However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models.Existing works mainly adopt a retraining process to enhance DM efficiency.This is computationally expensive and not very scalable.To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining.Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. 0.607In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality.Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model.Project webpage: https://atedm.github.io.

link

2024-05-08

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations responding to very specific question formats -- typically a multiple-choice response regarding a particular object or attribute -- which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. 0.601 In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that improvements in existing metrics do not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.

link

2024-05-08

OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies

Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing.The difficulties in interpreting and annotating event data limit its scalability.While domain adaptation from images to event data can help to mitigate this issue, there exist data representational differences that require additional effort to resolve.In this work, for the first time, we synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world, annotation-efficient manner.We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams.To pursue better cross-modality adaptation, we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization.Experimental results on popular ESS benchmarks showed our approach outperforms existing methods. 0.752Notably, we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels.

link

2024-05-07

Temporal and Heterogeneous Graph Neural Network for Remaining Useful Life Prediction

Predicting Remaining Useful Life (RUL) plays a crucial role in the prognostics and health management of industrial systems that involve a variety of interrelated sensors.Given a constant stream of time series sensory data from such systems, deep learning models have risen to prominence at identifying complex, nonlinear temporal dependencies in these data.In addition to the temporal dependencies of individual sensors, spatial dependencies emerge as important correlations among these sensors, which can be naturally modelled by a temporal graph that describes time-varying spatial relationships.However, the majority of existing studies have relied on capturing discrete snapshots of this temporal graph, a coarse-grained approach that leads to loss of temporal information.Moreover, given the variety of heterogeneous sensors, it becomes vital that such inherent heterogeneity is leveraged for RUL prediction in temporal sensor graphs.To capture the nuances of the temporal and spatial relationships and heterogeneous characteristics in an interconnected graph of sensors, we introduce a novel model named Temporal and Heterogeneous Graph Neural Networks (THGNN).Specifically, THGNN aggregates historical data from neighboring nodes to accurately capture the temporal dynamics and spatial correlations within the stream of sensor data in a fine-grained manner.Moreover, the model leverages Feature-wise Linear Modulation (FiLM) to address the diversity of sensor types, significantly improving the model's capacity to learn the heterogeneity in the data sources.Finally, we have validated the effectiveness of our approach through comprehensive experiments. 0.671Our empirical findings demonstrate significant advancements on the N-CMAPSS dataset, achieving improvements of up to 19.2% and 31.6% in terms of two different evaluation metrics over state-of-the-art methods.

link

2024-05-07

Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality.To achieve this, some recent works have been proposed to simultaneously predict hand trajectories and object affordances on human egocentric videos.They are regarded as the representation of future hand-object interactions, indicating potential human motion and motivation.However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis.Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions.To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner.We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones.Motion features are further integrated into the conditional denoising process to enable Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction.The experimental results show that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our proposed new evaluation protocol. 0.741This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction.The code of Diff-IP2D will be released at https://github.com/IRMVLab/Diff-IP2D.

link

2024-05-07

Towards Stability of Parameter-free Optimization

Hyperparameter tuning, particularly the selection of an appropriate learning rate in adaptive gradient training methods, remains a challenge. To tackle this challenge, in this paper, we propose a novel parameter-free optimizer, AdamG (Adam with the golden step size), designed to automatically adapt to diverse optimization problems without manual tuning. The core technique underlying AdamG is our golden step size derived for the AdaGrad-Norm algorithm, which is expected to help AdaGrad-Norm preserve the tuning-free convergence and approximate the optimal step size in expectation w.r.t. various optimization scenarios. To better evaluate tuning-free performance, we propose a novel evaluation criterion, stability, to comprehensively assess the efficacy of parameter-free optimizers in addition to classical performance criteria. 0.65 Empirical results demonstrate that compared with other parameter-free baselines, AdamG achieves superior performance, which is consistently on par with Adam using a manually tuned learning rate across various optimization tasks.

link

2024-05-07

Parallelized Multi-Agent Bayesian Optimization in Lava

In parallel with the continuously increasing parameter space dimensionality, search and optimization algorithms should support distributed parameter evaluations to reduce cumulative runtime.Intel's neuromorphic optimization library, Lava-Optimization, was introduced as an abstract optimization system compatible with neuromorphic systems developed in the broader Lava software framework.In this work, we introduce Lava Multi-Agent Optimization (LMAO) with native support for distributed parameter evaluations communicating with a central Bayesian optimization system.LMAO provides an abstract framework for deploying distributed optimization and search algorithms within the Lava software framework.Moreover, LMAO introduces support for random and grid search along with process connections across multiple levels of mathematical precision.We evaluate the algorithmic performance of LMAO with a traditional non-convex optimization problem, a fixed-precision transductive spiking graph neural network for citation graph classification, and a neuromorphic satellite scheduling problem.Our results highlight LMAO's efficient scaling to multiple processes, reducing cumulative runtime and minimizing the likelihood of converging to local optima. 0.625

link

2024-05-07

Towards Continual Knowledge Graph Embedding via Incremental Distillation

Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG) with significant training costs when new knowledge emerges.To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while simultaneously preserving decent old knowledge.However, the explicit graph structure in KGs, which is critical for the above goal, has been heavily ignored by existing CKGE methods.On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs.On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively.In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which considers the full use of the explicit graph structure in KGs.First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning.By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on the graph structure features.Secondly, to preserve the old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old knowledge preservation.Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge influenced by under-trained new knowledge.Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. 0.62Notably, the incremental distillation mechanism contributes to improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score. 0.603

link
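
The IncDE entry above reports its gains in mean reciprocal rank (MRR). For readers unfamiliar with the metric, here is a minimal sketch of how it is computed from the 1-based rank assigned to the correct entity in each test query. The function name and the example ranks are illustrative only and do not come from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank is the 1-based position of the correct answer."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three hypothetical queries whose correct entities were ranked 1st, 2nd and 4th.
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```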

2024-05-07

Designing an Objective-Driven Test Method for the Comparative Performance Evaluation of Commercial DTI Solutions for Counter UAS systems

Unmanned Aerial Systems (UASs), or drones, are becoming increasingly cheap and commercially available. There has been much emphasis on developing and deploying Counter-UAS systems (C-UASs) with Detection Tracking and Identification (DTI) solutions. However, the capabilities of these systems are hard to benchmark. 0.627 Performance claims of these systems are currently not supported by evidence. In addition, no standard test methodologies are available for these DTI systems and different test methodologies make comparison of these systems hard or impossible. We report on the definition, development and verification of an objective-driven test method and corresponding comparative performance evaluation for commercial DTI solutions for C-UASs. The developed methodology is based on end-user scenarios that are operationally relevant. The test methodology is based on a generic DTI system lay-out and is detailed towards detection, tracking and identification, taking into account contextual information and end-user input. The comparative performance evaluation is developed to enable the use of the methodology in a relevant environment, thereby taking into account any potential environmental aspect that might influence DTI system performance. 0.657 Validation of the work in a relevant environment has been done in three operational trials. The operational trial results show that the method allows for performance evaluation at component level (i.e., detection, tracking or identification component) and at system level (combinations of these components and integrated DTI system of system solutions).

link

2024-05-07

Comparing Ways of Obtaining Candidate Orderings from Approval Ballots

To understand and summarize approval preferences and other binary evaluation data, it is useful to order the items on an axis which explains the data.In a political election using approval voting, this could be an ideological left-right axis such that each voter approves adjacent candidates, an analogue of single-peakedness.In a perfect axis, every approval set would be an interval, which is usually not possible, and so we need to choose an axis that gets closest to this ideal.The literature has developed algorithms for optimizing several objective functions (e.g., minimize the number of added approvals needed to get a perfect axis), but provides little help with choosing among different objectives. 0.626In this paper, we take a social choice approach and compare 5 different axis selection rules axiomatically, by studying the properties they satisfy.We establish some impossibility theorems, and characterize (within the class of scoring rules) the rule that chooses the axes that maximize the number of votes that form intervals, using the axioms of ballot monotonicity and resistance to cloning.Finally, we study the behavior of the rules on data from French election surveys, on the votes of justices of the US Supreme Court, and on synthetic data.

link
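
The approval-ballot entry above characterizes the rule that picks the axes maximizing the number of ballots that form intervals. A brute-force sketch of that rule for a handful of candidates could look as follows. It enumerates every ordering, so it is only practical for small examples, and the function names and the four toy ballots are assumptions made for illustration.

```python
from itertools import permutations

def is_interval(ballot, axis):
    """True if the approved candidates occupy consecutive positions on the axis."""
    if not ballot:
        return True
    positions = sorted(axis.index(c) for c in ballot)
    return positions[-1] - positions[0] == len(positions) - 1

def best_axes(candidates, ballots):
    """Try every ordering; return the best interval count and all axes that reach it."""
    best, winners = -1, []
    for axis in permutations(candidates):
        score = sum(is_interval(b, axis) for b in ballots)
        if score > best:
            best, winners = score, [axis]
        elif score == best:
            winners.append(axis)
    return best, winners

# Four ballots forming a cycle a-b-c-d-a: no line can make all four pairs adjacent.
ballots = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"a", "d"}]
score, axes = best_axes("abcd", ballots)
```

In this example the best any axis can do is three interval ballots out of four, which illustrates the abstract's point that a perfect axis is usually not attainable.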

2024-05-07

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Quantization can accelerate large language model (LLM) inference.Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4.Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving.We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs.To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache.QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin.QoQ is implemented by the QServe inference library that achieves measured speedup. 0.609The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.Building upon this insight, in QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM.Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization.In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency.We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization.As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM.Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100.Thus, QServe effectively reduces the dollar cost of LLM serving by 3x.Code is available at https://github.com/mit-han-lab/qserve.

link

2024-05-06

On the Influence of Data Resampling for Deep Learning-Based Log Anomaly Detection: Insights and Recommendations

Numerous DL-based approaches have garnered considerable attention in the field of software Log Anomaly Detection.However, a practical challenge persists: the class imbalance in the public data commonly used to train the DL models.This imbalance is characterized by a substantial disparity in the number of abnormal log sequences compared to normal ones, for example, anomalies represent less than 1% of one of the most popular datasets.Previous research has indicated that existing DLLAD approaches may exhibit unsatisfactory performance, particularly when confronted with datasets featuring severe class imbalances.Mitigating class imbalance through data resampling has proven effective for other software engineering tasks, however, it has been unexplored for LAD thus far.This study aims to fill this gap by providing an in-depth analysis of the impact of diverse data resampling methods on existing DLLAD approaches from two distinct perspectives.Firstly, we assess the performance of these DLLAD approaches across three datasets and explore the impact of resampling ratios of normal to abnormal data on ten data resampling methods.Secondly, we evaluate the effectiveness of the data resampling methods when utilizing optimal resampling ratios of normal to abnormal data.Our findings indicate that oversampling methods generally outperform undersampling and hybrid methods. 0.677Data resampling on raw data yields superior results compared to data resampling in the feature space.In most cases, certain undersampling and hybrid methods show limited effectiveness. 0.656Additionally, by exploring the resampling ratio of normal to abnormal data, we suggest generating more data for minority classes through oversampling while removing less data from majority classes through undersampling.In conclusion, our study provides valuable insights into the intricate relationship between data resampling methods and DLLAD.

link
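
The abstract above recommends oversampling the minority (abnormal) class toward a chosen normal-to-abnormal ratio. As a minimal sketch of that idea, the snippet below uses plain random oversampling, which is only one of many resampling methods and is not necessarily among the ten the paper studies. The function name `random_oversample`, the `target_ratio` parameter, and the toy data are all hypothetical.

```python
import math
import random


def random_oversample(normal, abnormal, target_ratio, seed=0):
    """Duplicate abnormal sequences until normal:abnormal is at most target_ratio.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not abnormal:
        raise ValueError("need at least one abnormal sequence to resample from")
    rng = random.Random(seed)
    # Number of abnormal samples needed to reach the requested ratio.
    target_count = max(len(abnormal), math.ceil(len(normal) / target_ratio))
    # Draw the missing samples with replacement from the existing abnormal ones.
    extra = [rng.choice(abnormal) for _ in range(target_count - len(abnormal))]
    return normal, abnormal + extra


# Toy data: 100 normal sequences and 2 abnormal ones (ratio 50:1).
normal = [f"normal-{i}" for i in range(100)]
abnormal = ["abnormal-0", "abnormal-1"]
_, resampled = random_oversample(normal, abnormal, target_ratio=4)
print(len(normal), len(resampled))  # 100 25
```

The majority class is left untouched here, in line with the abstract's suggestion to remove less data from majority classes.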

2024-05-06

Boosting Single Positive Multi-label Classification with Generalized Robust Loss

Multi-label learning (MLL) requires comprehensive multi-semantic annotations that are hard to fully obtain, thus often resulting in missing-label scenarios. In this paper, we investigate Single Positive Multi-label Learning (SPML), where each image is associated with merely one positive label. Existing SPML methods only focus on designing losses using mechanisms such as hard pseudo-labeling and robust losses, mostly leading to unacceptable false negatives. To address this issue, we first propose a generalized loss framework based on expected risk minimization to provide soft pseudo labels, and point out that the former losses can be seamlessly converted into our framework. In particular, we design a novel robust loss based on our framework, which enjoys flexible coordination between false positives and false negatives, and can additionally deal with the imbalance between positive and negative samples. Extensive experiments show that our approach can significantly improve SPML performance and outperform the vast majority of state-of-the-art methods on all four benchmarks. 0.699

link

2024-05-06

Low-light Object Detection

In this competition we employed a model fusion approach to achieve object detection results close to those of real images. Our method is based on the CO-DETR model, which was trained on two sets of data: one containing images under dark conditions and another containing images enhanced with low-light conditions. We used various enhancement techniques on the test data to generate multiple sets of prediction results. Finally, we applied a clustering aggregation method guided by IoU thresholds to select the optimal results. 0.603

link

2024-05-06

Optimizing Hand Region Detection in MediaPipe Holistic Full-Body Pose Estimation to Improve Accuracy and Avoid Downstream Errors

This paper addresses a critical flaw in MediaPipe Holistic's hand Region of Interest (ROI) prediction, which struggles with non-ideal hand orientations, affecting sign language recognition accuracy. We propose a data-driven approach to enhance ROI estimation, leveraging an enriched feature set including additional hand keypoints and the z-dimension. Our results demonstrate better estimates, with higher Intersection-over-Union compared to the current method. 0.726 Our code and optimizations are available at https://github.com/sign-language-processing/mediapipe-hand-crop-fix.

link
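
The abstract above reports its improvement in terms of Intersection-over-Union (IoU). For readers unfamiliar with the metric, here is a minimal sketch for axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name `iou` are assumptions made for illustration. MediaPipe's hand ROI may be rotated, a case this sketch does not cover.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption
    for the example, not the format used by the paper's code.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```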

2024-05-06

Deep Space Separable Distillation for Lightweight Acoustic Scene Classification

Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, these methods are currently not lightweight enough, and their performance is not satisfactory. 0.626 To solve these problems, we propose a deep space separable distillation network. Firstly, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Secondly, we specially design three lightweight operators for ASC, including Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% compared to the currently popular deep learning methods, while also having a smaller parameter count and lower computational complexity.

link

2024-05-06

GREEN: Generative Radiology Report Evaluation and Error Notation

Evaluating radiology reports is a challenging problem as factual correctness is extremely important due to the need for accurate medical communication about medical images. Existing automatic evaluation metrics either suffer from failing to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts, but simultaneously higher alignment with expert preferences when compared to previous approaches. 0.675

link

2024-05-06

Trackable Island-model Genetic Algorithms at Wafer Scale

Emerging ML/AI hardware accelerators, like the 850,000 processor Cerebras Wafer-Scale Engine (WSE), hold great promise to scale up the capabilities of evolutionary computation.However, challenges remain in maintaining visibility into underlying evolutionary processes while efficiently utilizing these platforms' large processor counts.Here, we focus on the problem of extracting phylogenetic information from digital evolution on the WSE platform.We present a tracking-enabled asynchronous island-based genetic algorithm (GA) framework for WSE hardware.Emulated and on-hardware GA benchmarks with a simple tracking-enabled agent model clock upwards of 1 million generations a minute for population sizes reaching 16 million.This pace enables quadrillions of evaluations a day.We validate phylogenetic reconstructions from these trials and demonstrate their suitability for inference of underlying evolutionary conditions.In particular, we demonstrate extraction of clear phylometric signals that differentiate wafer-scale runs with adaptive dynamics enabled versus disabled.Together, these benchmark and validation trials reflect strong potential for highly scalable evolutionary computation that is both efficient and observable. 0.642Kernel code implementing the island-model GA supports drop-in customization to support any fixed-length genome content and fitness criteria, allowing it to be leveraged to advance research interests across the community.

link

2024-05-06

A Controlled Experiment on the Energy Efficiency of the Source Code Generated by Code Llama

Context. Nowadays, 83% of software developers use Large Language Models (LLMs) to generate code. LLMs recently became essential to increase the productivity of software developers and decrease the time and cost of software development. Developers ranging from novices to experts use LLM tools not only to detect and patch bugs, but also to integrate generated code into their software. However, as of today there is no objective assessment of the energy efficiency of the source code generated by LLM tools. Released in August 2023, Code Llama is one of the most recent LLM tools. Goal. In this paper, we present an empirical study that assesses the energy efficiency of Code Llama with respect to human-written source code. Method. We design an experiment involving three human-written benchmarks implemented in C++, JavaScript, and Python. 0.648 We ask Code Llama to generate the code of the benchmarks using different prompts and temperatures. We then execute both implementations and profile their energy efficiency. Results. 0.667 Our study shows that the energy efficiency of code generated by Code Llama is heavily dependent on the chosen programming language and the specific code problem at hand. Also, human implementations tend to be more energy efficient overall, with generated JavaScript code outperforming its human counterpart. Moreover, explicitly asking Code Llama to generate energy-efficient code results in equal or worse energy efficiency, and using different temperatures does not seem to affect the energy efficiency of generated code. Conclusions. According to our results, code generated using Code Llama does not guarantee energy efficiency, even when prompted to do so. Therefore, software developers should evaluate the energy efficiency of generated code before integrating it into the software system under development.

link

2024-05-06

Collage: Light-Weight Low-Precision Strategy for LLM Training

Training large models is plagued by intense compute cost and limited hardware memory. A practical solution is low-precision representation, but it is troubled by loss of numerical accuracy and unstable training, rendering the model less useful. We argue that low-precision floating points can perform well provided the error is properly compensated at the critical locations in the training process. We propose Collage, which utilizes multi-component float representation in low precision to accurately perform operations with numerical errors accounted for. To understand the impact of imprecision on training, we propose a simple and novel metric which tracks the lost information during training as well as differentiates various precision strategies. Our method works with commonly used low-precision formats such as half-precision ($16$-bit floating points) and can be naturally extended to work with even lower precision such as $8$-bit. 0.646 Experimental results show that pre-training using Collage removes the requirement of using $32$-bit floating-point copies of the model and attains similar/better training performance compared to $(16, 32)$-bit mixed-precision strategy, with up to $3.7\times$ speedup and $\sim 15\%$ to $23\%$ less memory usage in practice.

link

2024-05-06

Cosine Annealing Optimized Denoising Diffusion Error Correction Codes

To address the issue of increased bit error rates during the later stages of linear search in denoising diffusion error correction codes, we propose a novel method that optimizes denoising diffusion error correction codes (ECC) using cosine annealing. In response to the challenge of decoding long codewords, the proposed method employs a variance adjustment strategy during the reverse diffusion process, rather than maintaining a constant variance. By leveraging cosine annealing, this method effectively lowers the bit error rate and enhances decoding efficiency. This letter extensively validates the approach through experiments and demonstrates significant improvements in bit error rate reduction and iteration efficiency compared to existing methods. 0.657 This advancement offers a promising solution for improving ECC decoding performance, potentially impacting secure digital communication practices.

link
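
The abstract above adjusts the variance during reverse diffusion with cosine annealing instead of holding it constant. The abstract does not give the exact rule, so the sketch below only shows the generic cosine annealing schedule, applied to a hypothetical variance that decays from `start` to `end` over `total_steps`. All names and values are assumptions.

```python
import math


def cosine_annealed(step, total_steps, start, end):
    """Generic cosine annealing from `start` (step 0) to `end` (last step).

    Illustrative only; the paper's actual variance rule is not specified
    in the abstract.
    """
    progress = step / total_steps
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


# Hypothetical variance schedule over 10 reverse-diffusion steps.
schedule = [cosine_annealed(t, 10, start=1.0, end=0.1) for t in range(11)]
print(schedule[0], schedule[-1])  # 1.0 0.1
```

The schedule changes slowly near both ends and fastest in the middle, which is the usual motivation for preferring it over a linear decay.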

2024-05-06

Classification of Breast Cancer Histopathology Images using a Modified Supervised Contrastive Learning Method

Deep neural networks have reached remarkable achievements in medical image processing tasks, specifically classifying and detecting various diseases. However, when confronted with limited data, these networks face a critical vulnerability, often succumbing to overfitting by excessively memorizing the limited information available. This work addresses the challenge mentioned above by improving the supervised contrastive learning method to reduce the impact of false positives. Unlike most existing methods that rely predominantly on fully supervised learning, our approach leverages the advantages of self-supervised learning in conjunction with employing the available labeled data. We evaluate our method on the BreakHis dataset, which consists of breast cancer histopathology images, and demonstrate an increase in classification accuracy by 1.45% at the image level and 1.42% at the patient level compared to the state-of-the-art method. This improvement corresponds to 93.63% absolute accuracy, highlighting our approach's effectiveness in leveraging data properties to learn a more appropriate representation space. 0.628

link

2024-05-06

Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation

We describe a method for recovering the irradiance underlying a collection of images corrupted by atmospheric turbulence.Since supervised data is often technically impossible to obtain, assumptions and biases have to be imposed to solve this inverse problem, and we choose to model them explicitly.Rather than initializing a latent irradiance ("template") by heuristics to estimate deformation, we select one of the images as a reference, and model the deformation in this image by the aggregation of the optical flow from it to other images, exploiting a prior imposed by Central Limit Theorem.Then with a novel flow inversion module, the model registers each image TO the template but WITHOUT the template, avoiding artifacts related to poor template initialization.To illustrate the robustness of the method, we simply (i) select the first frame as the reference and (ii) use the simplest optical flow to estimate the warpings, yet the improvement in registration is decisive in the final reconstruction, as we achieve state-of-the-art performance despite its simplicity.The method establishes a strong baseline that can be further improved by integrating it seamlessly into more sophisticated pipelines, or with domain-specific methods if so desired. 0.6

link

2024-05-06

Cutting through buggy adversarial example defenses: fixing 1 line of code breaks Sabre

Sabre is a defense to adversarial examples that was accepted at IEEE S&P 2024. We first reveal significant flaws in the evaluation that point to clear signs of gradient masking. We then show the cause of this gradient masking: a bug in the original evaluation code. By fixing a single line of code in the original repository, we reduce Sabre's robust accuracy to 0%. In response to this, the authors modify the defense and introduce a new defense component not described in the original paper. But this fix contains a second bug; modifying one more line of code reduces robust accuracy to below baseline levels. 0.657

link

2024-05-02

The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights

Bridging the significant gap between large language models' English and non-English performance presents a great challenge. While some previous studies attempt to mitigate this gap with translated training data, the recently proposed question alignment approach leverages the model's English expertise to improve multilingual performance with minimum usage of expensive, error-prone translation. In this paper, we explore how broadly this method can be applied by examining its effects in reasoning with executable code and reasoning with common sense. We also explore how to apply this approach efficiently to extremely large language models using proxy-tuning. Experiment results on multilingual reasoning benchmarks mGSM, mSVAMP and xCSQA demonstrate that the question alignment approach can be used to boost multilingual performance across diverse reasoning scenarios, model families, and sizes. For instance, when applied to the LLaMA2 models, our method brings an average accuracy improvement of 12.2% on mGSM even with the 70B model. 0.611 To understand the mechanism of its success, we analyze representation space, chain-of-thought and translation data scales, which reveals how question translation training strengthens language alignment within LLMs and shapes their working patterns.

link

2024-05-02

Community-Invariant Graph Contrastive Learning

Graph augmentation has received great attention in recent years for graph contrastive learning (GCL) to learn well-generalized node/graph representations. However, mainstream GCL methods often favor randomly disrupting graphs for augmentation, which shows limited generalization and inevitably leads to the corruption of high-level graph information, i.e., the graph community. Moreover, current knowledge-based graph augmentation methods can only focus on either topology or node features, causing the model to lack robustness against various types of noise. To address these limitations, this research investigated the role of the graph community in graph augmentation and figured out its crucial advantage for learnable graph augmentation. Based on our observations, we propose a community-invariant GCL framework to maintain graph community structure during learnable graph augmentation. By maximizing the spectral changes, this framework unifies the constraints of both topology and feature augmentation, enhancing the model's robustness. Empirical evidence on 21 benchmark datasets demonstrates the exclusive merits of our framework. 0.603 Code is released on GitHub (https://github.com/ShiyinTan/CI-GCL.git).

link

2024-05-02

Using Waste Factor to Optimize Energy Efficiency in Multiple-Input Single-Output (MISO) and Multiple-Input Multiple-Output (MIMO) Systems

This paper introduces Waste Factor (W) and Waste Figure (WF) to assess power efficiency in any multiple-input multiple-output (MIMO) or single-input multiple-output (SIMO) or multiple-input single-output (MISO) cascaded communication system.This paper builds upon the new theory of Waste Factor, which systematically models added wasted power in any cascade for parallel systems such as MISO, SIMO, and MIMO systems, which are prevalent in current wireless networks.Here, we also show the advantage of W compared to conventional metrics for quantifying and analyzing energy efficiency. 0.663This work explores the utility of W in assessing energy efficiency in communication channels, within Radio Access Networks (RANs).

link

2024-05-02

Invariant Risk Minimization Is A Total Variation Model

Invariant risk minimization (IRM) is an arising approach to generalize invariant features to different environments in machine learning.While most related works focus on new IRM settings or new application scenarios, the mathematical essence of IRM remains to be properly explained.We verify that IRM is essentially a total variation based on $L^2$ norm (TV-$\ell_2$) of the learning risk with respect to the classifier variable.Moreover, we propose a novel IRM framework based on the TV-$\ell_1$ model.It not only expands the classes of functions that can be used as the learning risk, but also has robust performance in denoising and invariant feature preservation based on the coarea formula.We also illustrate some requirements for IRM-TV-$\ell_1$ to achieve out-of-distribution generalization.Experimental results show that the proposed framework achieves competitive performance in several benchmark machine learning scenarios. 0.617

link

2024-05-02

Improving Domain Generalization on Gaze Estimation via Branch-out Auxiliary Regularization

Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes.Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications.This letter introduces Branch-out Auxiliary Regularization (BAR), an innovative method designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data.Specifically, BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features.These auxiliary pathways strengthen the core network and are integrated in a smooth, plug-and-play manner, facilitating easy adaptation to various other models.Comprehensive experimental evaluations on four cross-dataset tasks demonstrate the superiority of our approach. 0.616

link

2024-05-02

Test-time Assessment of a Model's Performance on Unseen Domains via Optimal Transport

Gauging the performance of ML models on data from unseen domains at test-time is essential yet a challenging problem due to the lack of labels in this setting.Moreover, the performance of these models on in-distribution data is a poor indicator of their performance on data from unseen domains.Thus, it is essential to develop metrics that can provide insights into the model's performance at test time and can be computed only with the information available at test time (such as their model parameters, the training data or its statistics, and the unlabeled test data).To this end, we propose a metric based on Optimal Transport that is highly correlated with the model's performance on unseen domains and is efficiently computable only using information available at test time.Concretely, our metric characterizes the model's performance on unseen domains using only a small amount of unlabeled data from these domains and data or statistics from the training (source) domain(s).Through extensive empirical evaluation using standard benchmark datasets, and their corruptions, we demonstrate the utility of our metric in estimating the model's performance in various practical applications. 0.74These include the problems of selecting the source data and architecture that leads to the best performance on data from an unseen domain and the problem of predicting a deployed model's performance at test time on unseen domains.Our empirical results show that our metric, which uses information from both the source and the unseen domain, is highly correlated with the model's performance, achieving a significantly better correlation than that obtained via the popular prediction entropy-based metric, which is computed solely using the data from the unseen domain.

link

2024-05-02

SATO: Stable Text-to-Motion Framework

Is the Text to Motion model robust?Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions.However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models.Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs.In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module.Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO).SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness trade-off.We present a methodology for constructing an SATO that satisfies the stability of attention and prediction.To verify the stability of the model, we introduced a new textual synonym perturbation dataset based on HumanML3D and KIT-ML.Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance. 0.601

link

2024-05-02

Common pitfalls to avoid while using multiobjective optimization in machine learning

Recently, there has been an increasing interest in exploring the application of multiobjective optimization (MOO) in machine learning (ML).The interest is driven by the numerous situations in real-life applications where multiple objectives need to be optimized simultaneously.A key aspect of MOO is the existence of a Pareto set, rather than a single optimal solution, which illustrates the inherent trade-offs between objectives.Despite its potential, there is a noticeable lack of satisfactory literature that could serve as an entry-level guide for ML practitioners who want to use MOO.Hence, our goal in this paper is to produce such a resource.We critically review previous studies, particularly those involving MOO in deep learning (using Physics-Informed Neural Networks (PINNs) as a guiding example), and identify misconceptions that highlight the need for a better grasp of MOO principles in ML.Using MOO of PINNs as a case study, we demonstrate the interplay between the data loss and the physics loss terms.We highlight the most common pitfalls one should avoid while using MOO techniques in ML.We begin by establishing the groundwork for MOO, focusing on well-known approaches such as the weighted sum (WS) method, alongside more complex techniques like the multiobjective gradient descent algorithm (MGDA).Additionally, we compare the results obtained from the WS and MGDA with one of the most common evolutionary algorithms, NSGA-II.We emphasize the importance of understanding the specific problem, the objective space, and the selected MOO method, while also noting that neglecting factors such as convergence can result in inaccurate outcomes and, consequently, a non-optimal solution. 0.62Our goal is to offer a clear and practical guide for ML practitioners to effectively apply MOO, particularly in the context of DL.

link
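
The abstract above names the weighted sum (WS) method as the basic entry point to multiobjective optimization. As a minimal sketch, the snippet below scalarizes two toy objectives, f1(x) = x^2 and f2(x) = (x - 2)^2, and minimizes the weighted sum by plain gradient descent. The objectives, the learning rate, and the function name `weighted_sum_minimizer` are assumptions chosen for illustration and are unrelated to the PINN case study in the paper.

```python
def weighted_sum_minimizer(w, steps=200, lr=0.1, x0=0.0):
    """Minimize w * f1(x) + (1 - w) * f2(x) for the toy objectives
    f1(x) = x**2 and f2(x) = (x - 2)**2 using gradient descent."""
    x = x0
    for _ in range(steps):
        grad = w * 2.0 * x + (1.0 - w) * 2.0 * (x - 2.0)
        x -= lr * grad
    return x


# Sweeping the weight traces out points on the Pareto set (here x in [0, 2]).
for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    x = weighted_sum_minimizer(w)
    print(f"w={w:.2f}  x={x:.4f}  f1={x**2:.4f}  f2={(x - 2)**2:.4f}")
```

Each weight yields one trade-off point, which is why a single scalarized run gives one solution rather than the whole Pareto set. Both toy objectives are convex; for non-convex problems, the weighted sum is known to miss parts of the Pareto front.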

2024-05-02

MANTIS: Interleaved Multi-Image Instruction Tuning

Recent years have witnessed a great array of large multimodal models (LMMs) that effectively solve single-image vision language tasks. However, their ability to solve multi-image visual language tasks is yet to be improved. The existing multi-image LMMs (e.g., OpenFlamingo, Emu, and Idefics) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim at building strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K instances from 14 multi-image datasets. We design Mantis-Instruct to cover different multi-image skills like co-reference, reasoning, comparing, and temporal understanding. We combine Mantis-Instruct with several single-image visual-language datasets to train our model Mantis to handle any interleaved image-text inputs. We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks. Though only requiring academic-level resources (i.e., 36 hours on 16xA100-40G), Mantis-8B can achieve state-of-the-art performance on all the multi-image benchmarks and beats the existing best multi-image LMM Idefics2-8B by an average of 9 absolute points. We observe that Mantis performs equivalently well on the held-in and held-out evaluation benchmarks. 0.63 We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis can maintain a strong single-image performance on par with CogVLM and Emu2. Our results are particularly encouraging as they show that low-cost instruction tuning is indeed much more effective than intensive pre-training in terms of building multi-image LMMs.

link

2024-05-02

Accelerating Convergence in Bayesian Few-Shot Classification

Bayesian few-shot classification has been a focal point in the field of few-shot learning. This paper seamlessly integrates mirror descent-based variational inference into Gaussian process-based few-shot classification, addressing the challenge of non-conjugate inference. By leveraging non-Euclidean geometry, mirror descent achieves accelerated convergence by providing the steepest descent direction along the corresponding manifold. It also exhibits the parameterization invariance property concerning the variational distribution. Experimental results demonstrate competitive classification accuracy, improved uncertainty quantification, and faster convergence compared to baseline models. 0.676 Additionally, we investigate the impact of hyperparameters and components. Code is publicly available at https://github.com/keanson/MD-BSFC.

link

2024-05-02

New Tools for Smoothed Analysis: Least Singular Value Bounds for Random Matrices with Dependent Entries

We develop new techniques for proving lower bounds on the least singular value of random matrices with limited randomness.The matrices we consider have entries that are given by polynomials of a few underlying base random variables.This setting captures a core technical challenge for obtaining smoothed analysis guarantees in many algorithmic settings. 0.615Least singular value bounds often involve showing strong anti-concentration inequalities that are intricate and much less understood compared to concentration (or large deviation) bounds. First, we introduce a general technique involving a hierarchical $\epsilon$-nets to prove least singular value bounds.Our second tool is a new statement about least singular values to reason about higher-order lifts of smoothed matrices, and the action of linear operators on them. Apart from getting simpler proofs of existing smoothed analysis results, we use these tools to now handle more general families of random matrices.This allows us to produce smoothed analysis guarantees in several previously open settings.These include new smoothed analysis guarantees for power sum decompositions, subspace clustering and certifying robust entanglement of subspaces, where prior work could only establish least singular value bounds for fully random instances or only show non-robust genericity guarantees.

link

2024-05-02

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs.However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations.On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment.Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness.To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements.Moreover, it is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. 0.638On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. 0.602Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.

link

2024-05-01

RST-LoRA: A Discourse-Aware Low-Rank Adaptation for Long Document Abstractive Summarization

For long document summarization, discourse structure is important to discern the key content of the text and the differences in importance level between sentences.Unfortunately, the integration of rhetorical structure theory (RST) into parameter-efficient fine-tuning strategies for long document summarization remains unexplored.Therefore, this paper introduces RST-LoRA and proposes four RST-aware variants to explicitly incorporate RST into the LoRA model.Our empirical evaluation demonstrates that incorporating the type and uncertainty of rhetorical relations can complementarily enhance the performance of LoRA in summarization tasks.Furthermore, the best-performing variant we introduced outperforms the vanilla LoRA and full-parameter fine-tuning models, as confirmed by multiple automatic and human evaluations, and even surpasses previous state-of-the-art methods. 0.603

link

2024-05-01

Self-Play Preference Optimization for Language Model Alignment

Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. 0.628 Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
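
As a rough illustration of why a Bradley-Terry model cannot express intransitive preferences, the sketch below assumes each response carries a single scalar reward and derives preference probabilities from reward differences. This is not the SPPO algorithm; the function name and the reward values are hypothetical.

```python
import math

def bt_prob(reward_a, reward_b):
    """Bradley-Terry probability that response a is preferred over response b."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Hypothetical scalar rewards for three responses.
rewards = {"a": 1.2, "b": 0.4, "c": -0.3}

p_ab = bt_prob(rewards["a"], rewards["b"])
p_bc = bt_prob(rewards["b"], rewards["c"])
p_ac = bt_prob(rewards["a"], rewards["c"])

# If a beats b and b beats c, the scalar ordering forces a to beat c.
print(p_ab > 0.5, p_bc > 0.5, p_ac > 0.5)
```

Because every preference is read off one scalar ordering, a cycle (a over b, b over c, c over a) has no representation in this model, which is the limitation the abstract points to.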

link

LLMs

2024-05-08

Seeds of Stereotypes: A Large-Scale Textual Analysis of Race and Gender Associations with Diseases in Online Sources

Background: Advancements in Large Language Models (LLMs) hold transformative potential in healthcare; however, recent work has raised concern about the tendency of these models to produce outputs that display racial or gender biases. Although training data is a likely source of such biases, exploration of disease and demographic associations in text data at scale has been limited. Methods: We conducted a large-scale textual analysis using a dataset comprising diverse web sources, including Arxiv, Wikipedia, and Common Crawl. The study analyzed the context in which various diseases are discussed alongside markers of race and gender. Given that LLMs are pre-trained on similar datasets, this approach allowed us to examine the potential biases that LLMs may learn and internalize. 0.708 We compared these findings with actual demographic disease prevalence as well as GPT-4 outputs in order to evaluate the extent of bias representation. Results: Our findings indicate that demographic terms are disproportionately associated with specific disease concepts in online texts. Gender terms are prominently associated with disease concepts, while racial terms are much less frequently associated. We find widespread disparities in the associations of specific racial and gender terms with the 18 diseases analyzed. Most prominently, we see an overall significant overrepresentation of Black race mentions in comparison to population proportions. Conclusions: Our results highlight the need for critical examination and transparent reporting of biases in LLM pretraining datasets. 0.665 Our study suggests the need to develop mitigation strategies to counteract the influence of biased training data in LLMs, particularly in sensitive domains such as healthcare. 0.706

link

2024-05-08

Conversational Topic Recommendation in Counseling and Psychotherapy with Decision Transformer and Large Language Models

Given the increasing demand for mental health assistance, artificial intelligence (AI), particularly large language models (LLMs), may be valuable for integration into automated clinical support systems. 0.643 In this work, we leverage a decision transformer architecture for topic recommendation in counseling conversations between patients and mental health professionals. The architecture is utilized for offline reinforcement learning, and we extract states (dialogue turn embeddings), actions (conversation topics), and rewards (scores measuring the alignment between patient and therapist) from previous turns within a conversation to train a decision transformer model. We demonstrate an improvement over baseline reinforcement learning methods, and propose a novel system of utilizing our model's output as synthetic labels for fine-tuning a large language model for the same task. Although our implementation based on LLaMA-2 7B has mixed results, future work can undoubtedly build on the design. 0.633

link

2024-05-08

Concerns on Bias in Large Language Models when Creating Synthetic Personae

This position paper explores the benefits, drawbacks, and ethical considerations of incorporating synthetic personae in HCI research, particularly focusing on the customization challenges beyond the limitations of current Large Language Models (LLMs). These perspectives are derived from the initial results of a sub-study employing vignettes to showcase the existence of bias within black-box LLMs and explore methods for manipulating them. 0.683 The study aims to establish a foundation for understanding the challenges associated with these models, emphasizing the necessity of thorough testing before utilizing them to create synthetic personae for HCI research.

link

2024-05-08

Air Gap: Protecting Privacy-Conscious Conversational Agents

The growing use of large language model (LLM)-based conversational agents to manage sensitive user data raises significant privacy concerns. 0.715 While these agents excel at understanding and acting on context, this capability can be exploited by malicious actors. We introduce a novel threat model where adversarial third-party apps manipulate the context of interaction to trick LLM-based agents into revealing private information not relevant to the task at hand. 0.665 Grounded in the framework of contextual integrity, we introduce AirGapAgent, a privacy-conscious agent designed to prevent unintended data leakage by restricting the agent's access to only the data necessary for a specific task. Extensive experiments using Gemini, GPT, and Mistral models as agents validate our approach's effectiveness in mitigating this form of context hijacking while maintaining core agent functionality. For example, we show that a single-query context hijacking attack on a Gemini Ultra agent reduces its ability to protect user data from 94% to 45%, while an AirGapAgent achieves 97% protection, rendering the same attack ineffective.

link

2024-05-08

Encoder-Decoder Framework for Interactive Free Verses with Generation with Controllable High-Quality Rhyming

Composing poetry or lyrics involves several creative factors, but a challenging aspect of generation is the adherence to a more or less strict metric and rhyming pattern. To address this challenge specifically, previous work on the task has mainly focused on reverse language modeling, which brings the critical selection of each rhyming word to the forefront of each verse. On the other hand, reversing the word order requires that models be trained from scratch with this task-specific goal and cannot take advantage of transfer learning from a Pretrained Language Model (PLM). 0.617 We propose a novel fine-tuning approach that prepends the rhyming word at the start of each lyric, which allows the critical rhyming decision to be made before the model commits to the content of the lyric (as during reverse language modeling), but maintains compatibility with the word order of regular PLMs as the lyric itself is still generated in left-to-right order. We conducted extensive experiments to compare this fine-tuning against the current state-of-the-art strategies for rhyming, finding that our approach generates more readable text and exhibits better rhyming capabilities. Furthermore, we furnish a high-quality dataset in English and 12 other languages, analyse the approach's feasibility in a multilingual context, provide extensive experimental results shedding light on good and bad practices for lyrics generation, and propose metrics to compare methods in the future.
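
To make the prepending idea concrete, here is a minimal sketch of how a fine-tuning example might be formatted. It assumes the rhyming word is simply the last word of the line and that a " | " separator marks the boundary; both choices, and the function name, are hypothetical and not taken from the paper.

```python
def prepend_rhyme_word(lyric_line, sep=" | "):
    """Return the line with its final word copied to the front.

    The model then sees the rhyming word first but still generates
    the lyric itself in ordinary left-to-right order.
    """
    words = lyric_line.strip().split()
    if not words:
        return lyric_line
    # Drop trailing punctuation so only the word itself is prepended.
    rhyme_word = words[-1].strip(".,;:!?\"'").lower()
    return f"{rhyme_word}{sep}{lyric_line.strip()}"

print(prepend_rhyme_word("The stars are burning bright."))
```

A real pipeline would likely use special tokens rather than a plain-text separator, so that the tokenizer treats the boundary unambiguously.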

link

2024-05-08

Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers

Large Language Models (LLMs) have profoundly changed the world. 0.717 Their self-attention mechanism is the key to the success of transformers in LLMs. 0.735 However, the quadratic computational cost $O(n^2)$ in the input sequence length $n$ is a notorious obstacle to further improvement and scalability in longer contexts. In this work, we leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention computation using convolution matrices. We propose a conv basis system, "similar" to the rank basis, and show that any lower triangular (attention) matrix can always be decomposed as a sum of $k$ structured convolution matrices in this basis system. We then design an algorithm to quickly decompose the attention matrix into $k$ convolution matrices. Thanks to Fast Fourier Transforms (FFT), the attention inference can be computed in $O(knd \log n)$ time, where $d$ is the hidden dimension. In practice, we have $d \ll n$, e.g., $d=3,072$ and $n=1,000,000$ for Gemma. Thus, when $kd = n^{o(1)}$, our algorithm achieves almost linear time, i.e., $n^{1+o(1)}$. Furthermore, the attention training forward pass and backward gradient can be computed in $n^{1+o(1)}$ time as well. Our approach can avoid explicitly computing the $n \times n$ attention matrix, which may largely alleviate the quadratic computational complexity. Furthermore, our algorithm works on any input matrices. This work provides a new paradigm for accelerating attention computation in transformers to enable their application to longer contexts.
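
The FFT speed-up can be illustrated on a single convolution matrix. The sketch below assumes a convolution matrix means a lower-triangular Toeplitz matrix with entries conv(b)[i][j] = b[i-j] for i >= j, and shows that multiplying it by a vector equals a truncated linear convolution, computable in O(n log n) instead of O(n^2). It is not the paper's decomposition algorithm; all function names are hypothetical.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def conv_matvec_fft(b, x):
    """y = conv(b) @ x via FFT, never building the n x n matrix."""
    n = len(x)
    size = 1
    while size < 2 * n:  # zero-pad so circular convolution equals linear
        size *= 2
    fb = fft([complex(v) for v in b] + [0j] * (size - n))
    fx = fft([complex(v) for v in x] + [0j] * (size - n))
    y = fft([p * q for p, q in zip(fb, fx)], invert=True)
    # The unnormalized inverse transform needs a 1/size factor.
    return [(v / size).real for v in y[:n]]

def conv_matvec_naive(b, x):
    """Reference O(n^2) product with the explicit lower-triangular structure."""
    n = len(x)
    return [sum(b[i - j] * x[j] for j in range(i + 1)) for i in range(n)]

b = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 0.0, -1.0, 2.0]
print(conv_matvec_naive(b, x))
print(conv_matvec_fft(b, x))
```

With $k$ such matrices and $d$ value columns, repeating this product $kd$ times gives the $O(knd \log n)$ inference cost quoted in the abstract.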

link

2024-05-08

SuFIA: Language-Guided Augmented Dexterity for Robotic Surgical Assistants

In this work, we present SuFIA, the first framework for natural language-guided augmented dexterity for robotic surgical assistants. SuFIA incorporates the strong reasoning capabilities of large language models (LLMs) with perception modules to implement high-level planning and low-level control of a robot for surgical sub-task execution. 0.639 This enables a learning-free approach to surgical augmented dexterity without any in-context examples or motion primitives. SuFIA uses a human-in-the-loop paradigm by restoring control to the surgeon in the case of insufficient information, mitigating unexpected errors for mission-critical tasks. We evaluate SuFIA on four surgical sub-tasks in a simulation environment and two sub-tasks on a physical surgical robotic platform in the lab, demonstrating its ability to perform common surgical sub-tasks through supervised autonomous operation under challenging physical and workspace conditions. Project website: orbit-surgical.github.io/sufia

link

2024-05-08

LLMs with Personalities in Multi-issue Negotiation Games

Powered by large language models (LLMs), AI agents have become capable of many human tasks. 0.672 Using the most canonical definitions of the Big Five personality, we measure the ability of LLMs to negotiate within a game-theoretical framework, as well as methodological challenges to measuring notions of fairness and risk. 0.659 Simulations (n=1,500) for both single-issue and multi-issue negotiation reveal that an increase in domain complexity with asymmetric issue valuations improves agreement rates but decreases surplus from aggressive negotiation. Through gradient-boosted regression and Shapley explainers, we find high openness, conscientiousness, and neuroticism are associated with fair tendencies; low agreeableness and low openness are associated with rational tendencies. Low conscientiousness is associated with high toxicity. These results indicate that LLMs may have built-in guardrails that default to fair behavior, but can be "jail broken" to exploit agreeable opponents. 0.729 We also offer pragmatic insight into how negotiation bots can be designed, and a framework for assessing negotiation behavior based on game theory and computational social science.

link

2024-05-08

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. 0.661 This is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. 0.608 Project webpage: https://atedm.github.io.

link

2024-05-08

Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. 0.664 However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. 0.6 This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. 0.67 This is a concern as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. 0.666 We find that some models offer competitive performance with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings. 0.699

link

2024-05-07

SmmPack: Obfuscation for SMM Modules with TPM Sealed Key

System Management Mode (SMM) is the highest-privileged operating mode of x86 and x86-64 processors. 0.646 Through SMM exploitation, attackers can tamper with the Unified Extensible Firmware Interface (UEFI) firmware, disabling the security mechanisms implemented by the operating system and hypervisor. Vulnerabilities enabling SMM code execution are often reported as Common Vulnerabilities and Exposures (CVEs); however, no security mechanisms currently exist to prevent attackers from analyzing those vulnerabilities. 0.604 To increase the cost of vulnerability analysis of SMM modules, we introduced SmmPack. 0.61 The core concept of SmmPack involves encrypting an SMM module with the key securely stored in a Trusted Platform Module (TPM). 0.625 We assessed the effectiveness of SmmPack in preventing attackers from obtaining and analyzing SMM modules using various acquisition methods. 0.651 Our results show that SmmPack significantly increases the cost by narrowing down the means of module acquisition. Furthermore, we demonstrated that SmmPack operates without compromising the performance of the original SMM modules. 0.645 We also clarified the management and adoption methods of SmmPack, as well as the procedure for applying BIOS updates, and demonstrated that the implementation of SmmPack is realistic.

link

2024-05-07

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

Augmenting Large Language Models (LLMs) with image-understanding capabilities has resulted in a boom of high-performing Vision-Language models (VLMs). 0.639 While studying the alignment of LLMs to human values has received widespread attention, the safety of VLMs has not received the same attention. 0.711 In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach. By comparing each VLM to its respective LLM backbone, we find that each VLM is more susceptible to jailbreaking. 0.626 We consider this an undesirable outcome of visual instruction-tuning, which imposes a forgetting effect on an LLM's safety guardrails. 0.66 Therefore, we provide recommendations for future work based on evaluation strategies that aim to highlight the weaknesses of a VLM, as well as take safety measures into account during visual instruction tuning.

link

2024-05-07

The Silicone Ceiling: Auditing GPT's Race and Gender Biases in Hiring

Large language models (LLMs) are increasingly being introduced in workplace settings, with the goals of improving efficiency and fairness. 0.724 However, concerns have arisen regarding these models' potential to reflect or exacerbate social biases and stereotypes. This study explores the potential impact of LLMs on hiring practices. 0.78 To do so, we conduct an algorithm audit of race and gender biases in one commonly-used LLM, OpenAI's GPT-3.5, taking inspiration from the history of traditional offline resume audits. We conduct two studies using names with varied race and gender connotations: resume assessment (Study 1) and resume generation (Study 2). In Study 1, we ask GPT to score resumes with 32 different names (4 names for each combination of the 2 gender and 4 racial groups) and two anonymous options across 10 occupations and 3 evaluation tasks (overall rating, willingness to interview, and hireability). We find that the model reflects some biases based on stereotypes. In Study 2, we prompt GPT to create resumes (10 for each name) for fictitious job candidates. When generating resumes, GPT reveals underlying biases; women's resumes had occupations with less experience, while Asian and Hispanic resumes had immigrant markers, such as non-native English and non-U.S. education and work experiences. Our findings contribute to a growing body of literature on LLM biases, in particular when used in workplace contexts. 0.723

link

2024-05-07

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Efficient use of GPU memory is essential for high throughput LLM inference. 0.716 Prior systems reserved memory for the KV-cache ahead-of-time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for KV-cache. 0.623 This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. 0.745 However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of KV-cache from contiguous virtual memory to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and the serving framework to implement a memory manager. Thus, the PagedAttention model leads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. 0.63 In contrast to PagedAttention, vAttention retains KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer.

link

2024-05-07

Towards Continual Knowledge Graph Embedding via Incremental Distillation

Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG) with significant training costs when new knowledge emerges. To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while simultaneously preserving decent old knowledge. However, the explicit graph structure in KGs, which is critical for the above goal, has been heavily ignored by existing CKGE methods. On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs. On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively. 0.603 In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which considers the full use of the explicit graph structure in KGs. First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning. By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on the graph structure features. Secondly, to preserve the old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old knowledge preservation. Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge influenced by under-trained new knowledge. Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. Notably, the incremental distillation mechanism contributes to improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score.

link

2024-05-07

Toward In-Context Teaching: Adapting Examples to Students' Misconceptions

When a teacher provides examples for a student to study, these examples must be informative, enabling a student to progress from their current state toward a target concept or skill. Good teachers must therefore simultaneously infer what students already know and adapt their teaching to students' changing state of knowledge. There is increasing interest in using computational models, particularly large language models, as pedagogical tools. As students, language models in particular have shown a remarkable ability to adapt to new tasks given small numbers of examples. But how effectively can these models adapt as teachers to students of different types? To study this question, we introduce a suite of models and evaluation methods we call AdapT. AdapT has two components: (1) a collection of simulated Bayesian student models that can be used for evaluation of automated teaching methods; (2) a platform for evaluation with human students, to characterize the real-world effectiveness of these methods. We additionally introduce (3) AToM, a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of future beliefs. In evaluations of simulated students across three learning domains (fraction arithmetic, English morphology, function learning), AToM systematically outperforms LLM-based and standard Bayesian teaching models. 0.658 In human experiments, both AToM and LLMs outperform non-adaptive random example selection. 0.652 Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.

link

2024-05-07

Unveiling Disparities in Web Task Handling Between Human and Web Agent

With the advancement of Large-Language Models (LLMs) and Large Vision-Language Models (LVMs), agents have shown significant capabilities in various tasks, such as data analysis, gaming, or code generation. 0.628 Recently, there has been a surge in research on web agents, capable of performing tasks within the web environment. However, the web poses unforeseeable scenarios, challenging the generalizability of these agents. This study investigates the disparities between human and web agents' performance in web tasks (e.g., information search) by concentrating on planning, action, and reflection aspects during task execution. We conducted a web task study with a think-aloud protocol, revealing distinct cognitive actions and operations on websites employed by humans. Comparative examination of existing agent structures and human behavior with thought processes highlighted differences in knowledge updating and ambiguity handling when performing the task. Humans demonstrated a propensity for exploring and modifying plans based on additional information and investigating reasons for failure. These findings offer insights into designing planning, reflection, and information discovery modules for web agents and designing the capturing method for implicit human knowledge in a web task.

link

2024-05-07

xLSTM: Extended Long Short-Term Memory

In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories; in particular, they constituted the first Large Language Models (LLMs). 0.674 However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? 0.665 Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.

link

2024-05-07

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Large language models (LLMs) have manifested strong ability to generate code for productive activities. 0.683 However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithm and data science, insufficiently satisfying challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty in creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Compared with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfactory on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.

link

2024-05-07

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. 0.615 We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. 0.646 Building upon this insight, in the QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. 0.672 Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. 0.611 Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. 0.676 Code is available at https://github.com/mit-han-lab/qserve.
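
For readers unfamiliar with what 4-bit weight quantization involves, the following is a generic sketch of symmetric INT4 quantization with one scale per weight group. It is not QoQ's progressive quantization and does not use the QServe library; the function names and sample weights are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit integers in [-8, 7] with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Clamp after rounding so every value fits in 4 bits.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; this is the step that costs runtime."""
    return [v * scale for v in q]

weights = [1.4, -0.5, 0.25, 0.0]  # hypothetical weight group
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

The dequantize step has to run before or during each matrix multiply, which is where the runtime overhead described in the abstract comes from.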

link

2024-05-07

ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning

Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system, ChatHuman, which combines and integrates the skills of many different methods. To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. 0.647 In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful system for 3D human reasoning.

link

Developer Research

2024-05-07

WALLETRADAR: Towards Automating the Detection of Vulnerabilities in Browser-based Cryptocurrency Wallets

Cryptocurrency wallets, acting as fundamental infrastructure to the blockchain ecosystem, have seen significant user growth, particularly among browser-based wallets (i.e., browser extensions). However, this expansion accompanies security challenges, making these wallets prime targets for malicious activities. Despite a substantial user base, there is not only a significant gap in comprehensive security analysis but also a pressing need for specialized tools that can aid developers in reducing vulnerabilities during the development process. 0.606 To fill the void, we present a comprehensive security analysis of browser-based wallets in this paper, along with the development of an automated tool designed for this purpose. We first compile a taxonomy of security vulnerabilities resident in cryptocurrency wallets by harvesting historical security reports. Based on this, we design WALLETRADAR, an automated detection framework that can accurately identify security issues based on static and dynamic analysis. Evaluation of 96 popular browser-based wallets shows WALLETRADAR's effectiveness, by successfully automating the detection process in 90% of these wallets with high precision. This evaluation has led to the discovery of 116 security vulnerabilities corresponding to 70 wallets. By the time of this paper, we have received confirmations of 10 vulnerabilities from 8 wallet developers, with over $2,000 bug bounties. Further, we observed that 12 wallet developers have silently fixed 16 vulnerabilities after our disclosure. WALLETRADAR can effectively automate the identification of security risks in cryptocurrency wallets, thereby enhancing software development quality and safety in the blockchain ecosystem.

link

2024-05-02

A Systematic Literature Review on Large Language Models for Automated Program Repair

Automated Program Repair (APR) attempts to patch software bugs and reduce manual debugging efforts. 0.631 Very recently, with the advances in Large Language Models (LLMs), an increasing number of APR techniques have been proposed, facilitating software development and maintenance and demonstrating remarkable performance. However, due to ongoing explorations in the LLM-based APR field, it is challenging for researchers to understand the current achievements, challenges, and potential opportunities. This work provides the first systematic literature review to summarize the applications of LLMs in APR between 2020 and 2024. We analyze 127 relevant papers from LLMs, APR and their integration perspectives. First, we categorize existing popular LLMs that are applied to support APR and outline three types of utilization strategies for their deployment. Besides, we detail some specific repair scenarios that benefit from LLMs, e.g., semantic bugs and security vulnerabilities. Furthermore, we discuss several critical aspects of integrating LLMs into APR research, e.g., input forms and open science. Finally, we highlight a set of challenges remaining to be investigated and the potential guidelines for future research. Overall, our paper provides a systematic overview of the research landscape to the APR community, helping researchers gain a comprehensive understanding of achievements and promote future research.

link

2024-05-01

Leveraging Stack Traces for Spectrum-based Fault Localization in the Absence of Failing Tests

Bug fixing is a crucial task in software maintenance to hold user trust. 0.638 Although various automated fault localization techniques exist, they often require specific conditions to be effective. For example, Spectrum-Based Fault Localization (SBFL) techniques need at least one failing test to identify bugs, which may not always be available. Bug reports, particularly those with stack traces, provide detailed information on system execution failures and are invaluable for developers. 0.629 This study focuses on utilizing stack traces from crash reports as fault-triggering tests for SBFL. Our findings indicate that only 3.33% of bugs have fault-triggering tests, limiting traditional SBFL efficiency. However, 98.3% of bugfix intentions align directly with exceptions in stack traces, and 78.3% of buggy methods are reachable within an average of 0.34 method calls, proving stack traces as a reliable source for locating bugs. We introduce a new approach, SBEST, that integrates stack trace data with test coverage to enhance fault localization. Our approach shows a significant improvement, increasing Mean Average Precision (MAP) by 32.22% and Mean Reciprocal Rank (MRR) by 17.43% over traditional stack trace ranking methods.
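
The two ranking metrics named in the abstract above, MAP and MRR, can be sketched as follows using their standard definitions. This is an illustration only, not the SBEST evaluation code; the method names (`m1`, `m2`, ...) and the two example bugs are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is ranked."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical data: (ranked suspicious methods, set of truly buggy methods).
bugs = [
    (["m3", "m1", "m2"], {"m1"}),              # buggy method ranked 2nd
    (["m5", "m4", "m6", "m7"], {"m5", "m7"}),  # buggy methods ranked 1st and 4th
]

mrr = mean(reciprocal_rank(ranked, buggy) for ranked, buggy in bugs)
map_score = mean(average_precision(ranked, buggy) for ranked, buggy in bugs)
print(f"MRR={mrr:.3f} MAP={map_score:.3f}")
```

MRR rewards placing the first buggy method near the top of the ranking, while MAP also accounts for how high the remaining buggy methods appear.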

link

2024-04-24

VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection

Deep Learning (DL)-based methods have proven to be effective for software vulnerability detection, with a potential for substantial productivity enhancements for detecting vulnerabilities. Current methods mainly focus on detecting single functions (i.e., intra-procedural vulnerabilities), ignoring the more complex inter-procedural vulnerability detection scenarios in practice. For example, developers routinely engage with program analysis to detect vulnerabilities that span multiple functions within repositories. 0.614 In addition, the widely-used benchmark datasets generally contain only intra-procedural vulnerabilities, leaving the assessment of inter-procedural vulnerability detection capabilities unexplored. To mitigate the issues, we propose a repository-level evaluation system, named VulEval, aiming at evaluating the detection performance of inter- and intra-procedural vulnerabilities simultaneously. Specifically, VulEval consists of three interconnected evaluation tasks: (1) Function-Level Vulnerability Detection, aiming at detecting intra-procedural vulnerability given a code snippet; (2) Vulnerability-Related Dependency Prediction, aiming at retrieving the most relevant dependencies from call graphs for providing developers with explanations about the vulnerabilities; and (3) Repository-Level Vulnerability Detection, aiming at detecting inter-procedural vulnerabilities by combining with the dependencies identified in the second task. VulEval also consists of a large-scale dataset, with a total of 4,196 CVE entries, 232,239 functions, and corresponding 4,699 repository-level source code in C/C++ programming languages. Our analysis highlights the current progress and future directions for software vulnerability detection.

link

2024-04-17

A Deep Dive into Large Language Models for Automated Bug Localization and Repair

Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR). In this study, we take a deep dive into automated bug fixing utilizing LLMs. In contrast to many deep learning-based APR methods that assume known bug locations, rely on line-level localization tools, or address bug prediction and fixing in one step, our approach uniquely employs LLMs to predict bug location at the token level and subsequently utilizes them for bug fixing. This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information and improved incorporation of inductive biases. We introduce Toggle: Token-Granulated Bug Localization and Repair, a comprehensive program repair framework that integrates a bug localization model, an adjustment unit, and a bug-fixing model. 0.629 Toggle takes a buggy function as input and generates a complete corrected function. We investigate various styles of prompting to the bug fixing model to identify the most effective prompts that better utilize the inductive bias and significantly outperform others. Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark, and exhibits better and comparable performance on several other widely-used APR datasets, including Defects4J.

link

Data Annotation Techniques