LLM-powered Data System

This blog post describes one line of my research at UC Berkeley, conducted from Fall 2023 to the present: accurate document analytics driven by document structure. The vast majority (over 80%) of today’s data exists in unstructured formats, and documents make up a major portion of it. Current systems typically analyze documents by treating them as plain text and sending that text to AI models (e.g., LLMs) for synthesis. This ignores the documents' underlying structure and limits both accuracy and performance. In this post, we present a series of works that exploit document structure for accurate analytics, and we demonstrate that discovering the structure within documents can significantly improve downstream analytics. In particular, we explore and identify three types of document structure that cover most of the real-world documents we have encountered: form-like templatized documents, hierarchically structured documents, and loose-metadata documents. For each type, we develop tools or systems to process the documents effectively for analytics.