SemanticScuttle - klotz.me » klotz: sanitization+llm

LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs

"The paper provides the following contributions: (1) We introduce a novel three-stage architectural framework to identify erroneous instances in tabular data. This framework encompasses a comprehensive approach that combines the power of LLM models, context models, and data-cleaning tools. By leveraging this combined approach, our framework achieves significant improvements in both the effectiveness and efficiency of error detection compared to traditional tools, together with enhancing LLMClean’s ability to handle diverse and complex error patterns present in tabular data. (2) We present an innovative method that utilizes LLM models, such as Llama-2, GPT-3.5, and GPT-4, to autonomously generate context models directly from real-world data. (3) We propose an innovative prompt ensembling technique designed to enhance the stability of LLM models. (4) We develop an error detection tool that enforces a suite of OFD dependencies extracted from the automatically generated context models. (5) We conduct extensive experimental evaluation, comparing the performance of LLMClean against a range of baseline methods using three real-world datasets from different domains, including IoT, Industry 4.0, and healthcare. To the best of our knowledge, LLMClean is the first method that effectively leverages LLM models to enhance data cleaning tools through automatically generated context models."
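Contribution (4) concerns flagging rows that break a dependency between columns. The sketch below is not taken from the paper and is not LLMClean's implementation; it only illustrates the general idea of checking a dependency such as sensor_id -> unit, with a hand-written synonym map standing in for the ontological part of an OFD. The function name `find_fd_violations`, the column names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def find_fd_violations(rows, lhs, rhs, synonyms=None):
    """Return sorted indices of rows that break the dependency lhs -> rhs.

    rows:     list of dicts, one per table row
    lhs:      list of column names on the left-hand side
    rhs:      single column name on the right-hand side
    synonyms: optional dict mapping a value to its canonical form, so that
              values with the same meaning do not count as a conflict
              (a simplified stand-in for an ontology, assumed here)
    """
    synonyms = synonyms or {}
    # groups[lhs_key][canonical_rhs_value] -> list of row indices
    groups = defaultdict(lambda: defaultdict(list))
    for idx, row in enumerate(rows):
        key = tuple(row[col] for col in lhs)
        value = synonyms.get(row[rhs], row[rhs])
        groups[key][value].append(idx)

    violations = []
    for by_value in groups.values():
        if len(by_value) > 1:
            # Treat the most frequent value as correct and flag the rest.
            # On a tie, the value seen first is kept.
            majority = max(by_value.values(), key=len)
            for idxs in by_value.values():
                if idxs is not majority:
                    violations.extend(idxs)
    return sorted(violations)


# Hypothetical sample data: every sensor should report in a single unit.
rows = [
    {"sensor_id": "s1", "unit": "C"},
    {"sensor_id": "s1", "unit": "Celsius"},
    {"sensor_id": "s1", "unit": "F"},
    {"sensor_id": "s2", "unit": "kPa"},
]
synonyms = {"Celsius": "C"}

print(find_fd_violations(rows, ["sensor_id"], "unit", synonyms))
print(find_fd_violations(rows, ["sensor_id"], "unit"))
```

With the synonym map, only the row reporting "F" is flagged; without it, "Celsius" is also treated as a conflict, which is the kind of false positive that an ontology-aware dependency is meant to avoid.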

2024-05-26 Tags: llm, tabular data, sanitization, ontological functional dependencies by klotz
