Structure-aware Document Analytics

This blog describes one line of my research on accurate document analytics driven by document structures, conducted from Fall 2023 to the present at UC Berkeley. The vast majority (over 80%) of today’s data exists in unstructured formats, and documents make up a major portion of it. When analyzing documents, current systems treat them as plain text and send that text to AI models (e.g., LLMs) for synthesis. This ignores the underlying structure and limits both accuracy and performance. In this blog, we present a series of projects that pursue accurate document analytics by exploiting document structure. We demonstrate that discovering structure within documents can significantly improve downstream analytics. In particular, we identify three types of document structure that cover most of the real-world documents we have encountered: form-like templatized documents, hierarchically structured documents, and loose-metadata documents. For each type, we develop tools or systems to process the documents effectively for analytics.

BLIP

Large Language Models (LLMs) are powerful tools for processing data. However, LLMs are also complex black boxes: they return answers to queries on data without any indication of where the answer came from or whether it is trustworthy. We introduce the notion of provenance for data processing with LLMs. While existing heuristics (such as embedding similarity or directly asking an LLM) can provide some hints about where the answer was derived from, they provide no guarantee that the answer can be derived from the identified provenance, and indeed they are often incorrect. Instead, we propose the notion of verifiable provenance, in which we identify a subset of the input text that reproduces the same (or an equivalent) answer as the complete text, and the notion of minimality, in which the verifiable provenance is as small as possible. To identify such a provenance, a naive solution would check all possible subsets of the source data with the LLM, which is prohibitively expensive. We present BLIP, a bolt-on framework for efficiently inferring a small verifiable provenance for any LLM-powered data processing task, with any LLM. As part of BLIP, we introduce eight strategies, each guaranteed to find a minimal verifiable provenance, as well as an adaptive strategy that combines their strengths to reduce cost further. We further extend BLIP to produce multiple minimal verifiable provenances. Experiments on five datasets show that the provenance generated by BLIP is always guaranteed to reproduce the answer, and that it achieves over 30% higher accuracy than the best-performing baseline at a comparable provenance size. Moreover, BLIP incurs a low cost, comparable to that of the original query on the original data.
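
To make the definitions concrete, the following is a minimal sketch of a verifiable, minimal provenance computed by greedy sentence removal. It is not one of BLIP's eight strategies, which are not described here. The function `toy_answer` is a hypothetical stand-in for an LLM call, and "minimal" is assumed to mean that no single remaining sentence can be dropped without changing the answer.

```python
from typing import Callable, List


def greedy_provenance(
    sentences: List[str],
    answer_fn: Callable[[List[str]], str],
) -> List[str]:
    """Drop sentences one at a time while the answer stays the same.

    The loop repeats until no single sentence can be removed, so the
    returned subset still reproduces the answer on the full input.
    """
    target = answer_fn(sentences)
    kept = list(range(len(sentences)))
    changed = True
    while changed:
        changed = False
        for idx in list(kept):
            candidate = [i for i in kept if i != idx]
            if answer_fn([sentences[i] for i in candidate]) == target:
                kept = candidate
                changed = True
    return [sentences[i] for i in kept]


def toy_answer(sentences: List[str]) -> str:
    # Hypothetical stand-in for an LLM: answers "when was it founded?"
    for sentence in sentences:
        if "founded" in sentence:
            return sentence.split()[-1].rstrip(".")
    return "unknown"


doc = [
    "The company sells software.",
    "It was founded in 1999.",
    "Its office is in Berkeley.",
]
provenance = greedy_provenance(doc, toy_answer)
print(provenance)  # ['It was founded in 1999.']
```

Each check costs one call to `answer_fn`, so this baseline needs at least one call per sentence, which illustrates why cheaper strategies matter when every call is an LLM invocation.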

TWIX

Many documents are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial for supporting downstream analytical tasks. When extracting tables, or values for user-specified fields, current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and require significant human effort. The key insight of our tool, TWIX, is to predict the underlying template used to create such documents by modeling the visual and structural commonalities across them. Data extraction based on this predicted template is more principled, accurate, and efficient, and comes at a low cost. Comprehensive evaluations on 34 diverse real-world datasets show that uncovering the template is crucial for data extraction from templatized documents. TWIX achieves over 90% precision and recall on average, outperforming industry tools such as Textract and Azure Document Intelligence, as well as vision-based LLMs like GPT-4-Vision, by over 25% in precision and recall. TWIX scales easily to large datasets and is 734x faster and 5836x cheaper than vision-based LLMs when extracting data from a large document collection of 817 pages.
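
As a toy illustration of why a shared template helps, the sketch below treats any phrase that recurs in every document as a template field and reads the phrase that follows it as the value. This is only an assumption-laden simplification: TWIX's actual template prediction also uses visual layout, and the names `infer_fields` and `extract` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def infer_fields(docs: List[List[str]], min_share: float = 1.0) -> Set[str]:
    """Phrases present in at least `min_share` of the documents."""
    counts: Counter = Counter()
    for phrases in docs:
        counts.update(set(phrases))  # count each phrase once per document
    return {p for p, c in counts.items() if c / len(docs) >= min_share}


def extract(phrases: List[str], fields: Set[str]) -> Dict[str, str]:
    """Pair each field with the non-field phrase that follows it."""
    record: Dict[str, str] = {}
    for current, following in zip(phrases, phrases[1:]):
        if current in fields and following not in fields:
            record[current] = following
    return record


# Each document is the ordered list of phrases read off one page.
docs = [
    ["Name", "Alice", "Date", "2021-03-01", "Amount", "40"],
    ["Name", "Bob", "Date", "2021-03-02", "Amount", "15"],
]
fields = infer_fields(docs)
records = [extract(d, fields) for d in docs]
print(records[0])  # {'Name': 'Alice', 'Date': '2021-03-01', 'Amount': '40'}
```

Once the fields are known, extraction is a cheap pass over each document and needs no model call per page.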

ZenDB

Querying and extracting value from unstructured document collections remains a considerable challenge. While Large Language Models (LLMs) have made remarkable progress in document understanding, they fail to give high-accuracy results for analytical queries on documents, and they incur high costs. Retrieval-Augmented Generation (RAG) can reduce costs, but accuracy degrades further. Our key insight is that documents in a collection often follow similar templates that impart a common semantic structure. We therefore introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Extensive experiments on three real-world document collections demonstrate ZenDB’s benefits: it achieves up to 31 times cost savings compared to LLM-based baselines while maintaining or improving accuracy, and it surpasses RAG-based baselines by up to 61% in precision and 81% in recall at a marginally higher cost.
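
The idea of using a document's hierarchy to narrow what is sent to an LLM can be sketched as follows. This is a hypothetical illustration, not ZenDB's structure extraction or query engine: it assumes heading levels are already known, and the names `Section`, `build_tree`, and `find` are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Section:
    title: str
    level: int
    text: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def build_tree(lines: List[Tuple[int, str]]) -> Section:
    """Build a section tree; level 0 is body text, level >= 1 a heading."""
    root = Section("ROOT", 0)
    stack = [root]
    for level, content in lines:
        if level == 0:
            stack[-1].text.append(content)
            continue
        # Close sections at the same or a deeper level than the new heading.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(content, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def find(node: Section, title: str) -> Optional[Section]:
    """Depth-first search for a section by case-insensitive title."""
    if node.title.lower() == title.lower():
        return node
    for child in node.children:
        hit = find(child, title)
        if hit is not None:
            return hit
    return None


lines = [
    (1, "Project Report"),
    (0, "Prepared in 2023."),
    (2, "Budget"),
    (0, "Total cost: $1.2M."),
    (2, "Timeline"),
    (0, "Completion: June 2024."),
]
tree = build_tree(lines)
budget = find(tree, "budget")
context = " ".join(budget.text) if budget is not None else ""
print(context)  # Total cost: $1.2M.
```

A query about the budget would then pass only `context` to the model instead of the whole document, which is where the cost saving in this sketch comes from.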

PLAQUE

Predicate pushdown is a key optimization used to speed up query processing. Much of the existing practice is restricted to pushing down predicates that are explicitly listed in the query. In this work, we consider the challenge of learning predicates during query execution and then exploiting them to accelerate execution. Prior approaches with a similar goal are restricted (e.g., they learn only from join columns or from specific data statistics). We significantly expand the realm of predicates that can be learned from different query operators (aggregations, joins, grouping, etc.) and develop a system, called PLAQUE, that learns such predicates during query execution. Comprehensive evaluations on both synthetic and real datasets demonstrate that the learned-predicate approach adopted by PLAQUE can accelerate query execution by up to 33x, and by up to 100x when User-Defined Functions (UDFs) are used in queries.
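
One way to picture a predicate learned from an aggregation is a MIN query guarded by an expensive UDF: the running minimum becomes a filter that lets later rows skip the UDF. The sketch below is an illustrative assumption, not PLAQUE's algorithm, and the function name and row layout are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def min_price_with_learned_predicate(
    rows: Iterable[Tuple[float, str]],
    udf: Callable[[str], bool],
) -> Tuple[Optional[float], int]:
    """Compute MIN(price) over rows where udf(payload) holds.

    Returns the minimum and the number of UDF invocations.
    """
    best: Optional[float] = None
    udf_calls = 0
    for price, payload in rows:
        # Learned predicate "price < best": a row that cannot lower the
        # minimum is skipped before the expensive UDF runs.
        if best is not None and price >= best:
            continue
        udf_calls += 1
        if udf(payload):
            best = price
    return best, udf_calls


rows = [(30, "ok"), (50, "ok"), (20, "bad"), (10, "ok"), (40, "ok")]
result = min_price_with_learned_predicate(rows, lambda p: p == "ok")
print(result)  # (10, 3)
```

Here the UDF runs on three of the five rows while the answer is unchanged, which hints at why the benefit grows when UDFs dominate query cost.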

ZIP

This project develops a query-time missing value imputation framework, called ZIP, that modifies relational operators to be imputation-aware in order to minimize the joint cost of imputation and query processing. The modified operators use a cost-based decision function to determine whether to invoke imputation or to defer to downstream operators to resolve missing values. The modified query processing logic ensures that results with deferred imputations are identical to those produced if all missing values were imputed first. ZIP includes a novel outer-join-based approach to preserve missing values during execution, and a Bloom-filter-based index to reduce space and runtime overhead. Extensive experiments on both real and synthetic data sets demonstrate a 10 to 25 times improvement when augmenting the state-of-the-art technology, ImputeDB, with ZIP-based deferred imputation. ZIP also outperforms the offline approach by up to 19,607 times on a real data set.
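
The benefit of deferring imputation can be sketched with a conjunctive filter: predicates on known values are checked first, and a missing value is imputed only if the row is still a candidate. This is a simplified assumption for illustration. It omits ZIP's cost-based decision function, outer-join machinery, and index, and `impute` is a hypothetical stand-in for an expensive imputation model.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]
Preds = Dict[str, Callable[[float], bool]]


def eager_filter(rows: List[Row], preds: Preds,
                 impute: Callable[[str], float]) -> Tuple[List[Row], int]:
    """Impute every missing value first, then filter."""
    out, imputations = [], 0
    for row in rows:
        row = dict(row)
        for col in row:
            if row[col] is None:
                row[col] = impute(col)
                imputations += 1
        if all(pred(row[col]) for col, pred in preds.items()):
            out.append(row)
    return out, imputations


def lazy_filter(rows: List[Row], preds: Preds,
                impute: Callable[[str], float]) -> Tuple[List[Row], int]:
    """Check known values first; impute only for rows still in play."""
    out, imputations = [], 0
    for row in rows:
        row = dict(row)
        known = [c for c in preds if row[c] is not None]
        missing = [c for c in preds if row[c] is None]
        if not all(preds[c](row[c]) for c in known):
            continue  # rejected without any imputation
        keep = True
        for col in missing:
            row[col] = impute(col)
            imputations += 1
            if not preds[col](row[col]):
                keep = False
                break
        if keep:
            out.append(row)
    return out, imputations


DEFAULTS = {"age": 40, "income": 50000}  # toy imputation values
rows = [
    {"age": 25, "income": None},
    {"age": 45, "income": None},
    {"age": None, "income": 30000},
    {"age": None, "income": 60000},
]
preds = {"age": lambda a: a >= 30, "income": lambda v: v > 40000}
eager_rows, eager_cost = eager_filter(rows, preds, DEFAULTS.__getitem__)
lazy_rows, lazy_cost = lazy_filter(rows, preds, DEFAULTS.__getitem__)
print(eager_cost, lazy_cost)  # 4 2
```

Both versions return the same two rows, but the lazy one performs half as many imputations on this input.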

LOCATER

LOCATER explores the data cleaning challenges that arise in using WiFi connectivity data to map users to semantic indoor locations such as buildings, regions, and rooms. LOCATER has been deployed in more than 30 buildings at 3 universities (UCI, BSU, and Plaksha University in India) and at an elderly living facility, Walnut Village, located in Orange County. The system, which has been in operation for over 3 years at UC Irvine, applies data cleaning techniques to WiFi events to locate people inside buildings. It is a practical solution to passive, server-side indoor localization that leverages existing infrastructure (and hence has zero cost), yet it achieves roughly 85% accuracy, similar to that of the expensive, dedicated-hardware solutions available commercially today. See our demo for one location-based application, occupancy, which is built using LOCATER.

EnrichDB

EnrichDB is a system designed to support just-in-time data enrichment during query processing. EnrichDB is motivated by applications that consume (potentially large volumes of) raw data that must first be interpreted using expensive machine learning or signal processing functions before it can be queried or used in analysis. Executing such enrichment during data ingestion (to support real-time analytics) is difficult to scale, especially when datasets are very large and/or data arrives at a high velocity. EnrichDB addresses this challenge by supporting enrichment at all phases of data processing, including intermixing enrichment with query processing. It exploits query context to steer enrichment so that query results can be computed progressively. EnrichDB is implemented using a layered approach on top of PostgreSQL, though it can easily be layered on other databases.
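
A rough sketch of query-driven enrichment follows: the expensive function runs only on rows that survive a cheap predicate, results are yielded one at a time, and enriched values are cached for later queries. This is an assumption made for illustration rather than EnrichDB's design, and `LazyEnricher`, `query`, and `toy_label` are hypothetical names.

```python
from typing import Callable, Dict, Iterator, List


class LazyEnricher:
    """Runs an expensive enrichment function on demand and caches it."""

    def __init__(self, enrich_fn: Callable[[str], str]) -> None:
        self.enrich_fn = enrich_fn
        self.cache: Dict[int, str] = {}
        self.calls = 0

    def get(self, row_id: int, raw: str) -> str:
        if row_id not in self.cache:
            self.cache[row_id] = self.enrich_fn(raw)
            self.calls += 1
        return self.cache[row_id]


def query(rows: List[dict], enricher: LazyEnricher,
          cheap_pred: Callable[[dict], bool], wanted: str) -> Iterator[int]:
    """Yield matching row ids progressively, enriching only when needed."""
    for row in rows:
        if not cheap_pred(row):
            continue  # the query context rules this row out before enrichment
        if enricher.get(row["id"], row["raw"]) == wanted:
            yield row["id"]


def toy_label(raw: str) -> str:
    # Hypothetical stand-in for an expensive ML classifier.
    return "positive" if "good" in raw else "negative"


rows = [
    {"id": 1, "room": "A", "raw": "good signal"},
    {"id": 2, "room": "B", "raw": "good signal"},
    {"id": 3, "room": "A", "raw": "bad signal"},
]
enricher = LazyEnricher(toy_label)
in_room_a = lambda r: r["room"] == "A"
first = list(query(rows, enricher, in_room_a, "positive"))
calls_after_first = enricher.calls
second = list(query(rows, enricher, in_room_a, "positive"))
print(first, calls_after_first, enricher.calls)  # [1] 2 2
```

The second query reuses the cached labels, so no further enrichment calls are made.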

T-COVE

T-COVE is an exposure tracing and occupancy system based on cleaning WiFi events on organizational premises. First, it supports a real-time occupancy tracking application that displays the occupancy, i.e., the number of users, of locations at different granularities, such as building, floor, and region. T-COVE has been deployed in over 30 buildings on the UCI and BSU campuses and has been running since 2020. It is planned for installation at several other campuses and companies in the future. Another application supported by T-COVE is a passive exposure tracing system with potentially 100% adoption across campus, which can be used to track exposures as one of the COVID-19 protection policies at UCI. T-COVE is passive and off-the-shelf: it requires no new hardware or software to be installed, while achieving a very usable accuracy of around 90%.