QUIP: Query-driven Missing Value Imputation

Abstract

This paper develops QUIP, a query-time missing value imputation framework that minimizes the joint cost of imputation and query execution. QUIP achieves this by modifying how relational operators are processed: it adds a cost-based decision function to each operator that determines whether the operator should invoke imputation before execution or defer the imputation for downstream operators to resolve. QUIP implements a new approach to evaluating outer joins that preserves missing values during query processing, and a Bloom-filter-based index structure that reduces space and runtime overhead. We have implemented QUIP using ImputeDB, a specialized database engine for data cleaning. Extensive experiments on both real and synthetic data sets demonstrate the effectiveness and efficiency of QUIP, which outperforms the state-of-the-art ImputeDB by 2 to 10 times across different query sets and data sets, and achieves orders-of-magnitude improvement over the offline approach.
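As a rough illustration of the impute-or-defer decision described above, the following minimal sketch compares two estimated costs at a single operator. It is not QUIP's actual cost model: the function name `choose_action`, its parameters, and the linear cost formulas are all assumptions made for this example.

```python
def choose_action(n_missing, impute_cost, downstream_cost, selectivity, survival_rate):
    """Return "impute" or "defer" for an operator facing tuples with missing values.

    All parameters are hypothetical estimates, not quantities defined by QUIP:
      n_missing       - number of input tuples with a missing value in the operator's attribute
      impute_cost     - estimated cost of imputing one missing value
      downstream_cost - estimated cost of pushing one tuple through the remaining operators
      selectivity     - fraction of tuples expected to pass this operator once imputed
      survival_rate   - fraction of deferred tuples expected to survive the remaining operators
    """
    # Impute now: pay for every missing value, then only tuples that pass
    # this operator's predicate continue downstream.
    cost_now = n_missing * impute_cost + n_missing * selectivity * downstream_cost

    # Defer: every such tuple continues downstream, but only the tuples that
    # survive the later operators ever need to be imputed.
    cost_defer = n_missing * downstream_cost + n_missing * survival_rate * impute_cost

    return "impute" if cost_now <= cost_defer else "defer"


# Expensive imputation, cheap downstream work, few survivors: deferring wins.
print(choose_action(100, impute_cost=10.0, downstream_cost=1.0,
                    selectivity=0.5, survival_rate=0.1))   # defer

# Cheap imputation, expensive downstream work, selective predicate: impute now.
print(choose_action(100, impute_cost=1.0, downstream_cost=10.0,
                    selectivity=0.1, survival_rate=0.9))   # impute
```

The intuition this sketch tries to capture is that deferring pays off when later operators are likely to discard most tuples anyway, while imputing early pays off when the current predicate is selective and downstream processing is expensive.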

Publication
In arXiv preprint, 2022
Yiming Lin
PhD Student

My research interests include Data Cleaning, Query Processing, and Efficient Online Data Processing Systems.