Yiming Lin

Postdoctoral Fellow

University of California, Berkeley

Biography

I’m moving to a new website. The current one is no longer maintained.

Yiming is currently a Postdoctoral Researcher at the University of California, Berkeley, working with Prof. Aditya Parameswaran. He earned a Ph.D. from the University of California, Irvine, under the supervision of Prof. Sharad Mehrotra, and a bachelor’s degree from Harbin Institute of Technology. He is broadly interested in building data management systems at the intersection of structured and unstructured data with Large Language Models (LLMs). His recent research develops LLM-powered data systems for multimodal analytics, with a focus on extracting structure from unstructured documents, supporting scalable AI computations, provenance in LLM-powered data processing, and agentic query engines.

He loves landscape photography and traveling, and shares his most recent photographs on 500px!

Download my resumé.

Interests

Unstructured Data Management
Data Cleaning
Data Preparation
Query Processing
Query Optimization

Experience

Research Intern at Microsoft Research

Microsoft

Jun 2022 – Sep 2022 Seattle

I worked with Yeye He during my internship at Microsoft Research (MSR). We developed an Auto Business Intelligence (Auto-BI) system that helps end users by accurately predicting BI models for a given set of input tables, i.e., by accurately discovering join columns. We formulated a principled graph-based optimization problem that considers both local join prediction and global schema-graph structure, and our approach achieves over 90% F1-score on real-world and TPC benchmarks. Our paper was accepted to the research track of PVLDB 2023.

Applied Scientist Intern

Amazon

Jun 2021 – Sep 2021 Seattle

I worked with Dmitri Kalashnikov and Vidit Bansal on a data cleaning project during my internship at Amazon. Specifically, this work resolves very dirty clusters produced by entity resolution (ER) algorithms, which contain multiple kinds of errors such as incorrect, missing, incomplete, or copied values. Our proposed algorithm, SCC, improves on the method previously used at Amazon by around 61 percentage points in precision (from 34.1% to 95.5%) and by around 52 percentage points in F1-score (from 42.4% to 94.7%).

Research Assistant

University of California, Irvine

Sep 2017 – Present California

I worked on several projects focused on data cleaning, query processing, and building efficient online data processing systems.

Projects

Structure-aware Document Analytics

This blog describes one line of my research on accurate document analytics driven by document structures, conducted from Fall 2023 to the present at UC Berkeley. The vast majority (over 80%) of today’s data exists in unstructured formats, with documents representing a major portion. When analyzing documents, current systems treat them as plain text sent to AI models (e.g., LLMs) for synthesis, ignoring underlying structures and thus leading to limited accuracy and performance. In this blog, we present a series of work that explores accurate document analytics by looking at document structures. We demonstrate that discovering structures within documents can significantly improve downstream analytics. In particular, we exhaustively explore and identify three types of document structures that encompass most real-world documents we have encountered: form-like templatized documents, hierarchically structured documents, and loose-metadata documents. For each type of document, we develop tools or systems to process them effectively for analytics.

BLIP

Large Language Models (LLMs) are powerful tools for processing data. However, LLMs are also complex black boxes, returning answers to queries on data without any indication of where the answer came from or whether it is trustworthy. We introduce the notion of provenance for data processing with LLMs. While existing heuristics (such as embedding similarity or directly asking an LLM) could provide some hints about where the answer was derived from, they provide no guarantees that the answer can be derived using the identified provenance, and indeed, are often incorrect. Instead, we propose the notion of verifiable provenance, wherein we identify a subset of the input text that reproduces the same (or equivalent) answer as that on the complete text, and introduce the notion of minimality, where the verifiable provenance is as small as possible. To identify such a provenance, a naive solution would require checking all possible subsets of the source data with the LLM, which is prohibitively expensive. We present BLIP, a bolt-on framework for efficiently inferring a small-sized verifiable provenance for any LLM-powered data processing task, with any LLM. As part of BLIP, we introduce eight strategies, each guaranteed to find a minimal verifiable provenance, as well as an adaptive strategy that combines their strengths to reduce cost further. We further extend BLIP to produce multiple minimal verifiable provenances. Experiments on five datasets show that the provenance generated by BLIP is always guaranteed to reproduce the answer, achieving over 30% higher accuracy than the best-performing baseline with a comparable provenance size. Moreover, BLIP incurs a low cost, comparable to the original query on the original data.
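
To make the idea of a minimal verifiable provenance concrete, here is a small sketch. It is not one of BLIP's eight strategies; it is a plain greedy deletion loop, and `minimal_provenance`, `toy_answer`, and the example sentences are all hypothetical. The function `toy_answer` is a deterministic stand-in for an LLM call, so the sketch runs without any model.

```python
from typing import Callable, List, Sequence


def minimal_provenance(
    sentences: Sequence[str],
    answer_fn: Callable[[List[str]], str],
) -> List[str]:
    """Greedily drop sentences while the answer stays the same.

    The result is verifiable (it reproduces the answer on the full input)
    and minimal in the sense that no single remaining sentence can be
    removed without changing the answer.
    """
    target = answer_fn(list(sentences))
    kept = list(range(len(sentences)))
    changed = True
    while changed:  # repeat until a full pass removes nothing
        changed = False
        for idx in list(kept):
            trial = [i for i in kept if i != idx]
            if answer_fn([sentences[i] for i in trial]) == target:
                kept = trial
                changed = True
    return [sentences[i] for i in kept]


def toy_answer(context: List[str]) -> str:
    """Hypothetical stand-in for an LLM: finds the founding year."""
    for sentence in context:
        if "founded" in sentence:
            return sentence.split()[-1].rstrip(".")
    return "unknown"


doc = [
    "The company sells software.",
    "It was founded in 1999.",
    "Its office is in Irvine.",
]
print(minimal_provenance(doc, toy_answer))
```

Each candidate removal costs one call to `answer_fn`, which is why a real system needs smarter strategies than this loop when every call is an LLM invocation.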

TWIX

Many documents are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial to supporting downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and often require significant human effort when extracting tables or values for user-specified fields from documents. The key insight of our tool, TWIX, is to predict the underlying template used to create such documents, modeling the visual and structural commonalities across documents. Data extraction based on this predicted template provides a more principled, accurate, and efficient solution at a low cost. Comprehensive evaluations on 34 diverse real-world datasets show that uncovering the template is crucial for data extraction from templatized documents. TWIX achieves over 90% precision and recall on average, outperforming tools from industry such as Textract and Azure Document Intelligence, and vision-based LLMs like GPT-4-Vision, by over 25% in precision and recall. TWIX scales easily to large datasets and is 734X faster and 5836X cheaper than vision-based LLMs for extracting data from a large document collection with 817 pages.

ZenDB

Querying and extracting value from unstructured document collections remains a considerable challenge. While Large Language Models (LLMs) have made remarkable progress in document understanding, they fail to give high-accuracy results for analytical queries on documents, and additionally incur high costs. While Retrieval-Augmented Generation (RAG) can reduce costs, accuracy degrades further. Our key insight is that documents in a collection often follow similar templates that impart a common semantic structure. We therefore introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Extensive experiments on three real-world document collections demonstrate ZenDB’s benefits, achieving up to 31 times cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 81% in recall, at a marginally higher cost.

PLAQUE

Predicate pushdown is a key optimization used to speed up query processing. Much of the existing practice is restricted to pushing down predicates explicitly listed in the query. In this paper, we consider the challenge of learning predicates during query execution, which are then exploited to accelerate execution. Prior approaches with a similar goal are restricted (e.g., they learn only from join columns or from specific data statistics). We significantly expand the realm of predicates that can be learned from different query operators (aggregations, joins, grouping, etc.) and develop a system, called PLAQUE, that learns such predicates during query execution. Comprehensive evaluations on both synthetic and real datasets demonstrate that the learned-predicate approach adopted by PLAQUE can accelerate query execution by up to 33x, and this improvement increases to up to 100x when User-Defined Functions (UDFs) are used in queries.
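
The simplest form of a predicate learned at query time is a range learned from a join column, which the abstract lists as the kind of restricted prior approach that PLAQUE generalizes. The sketch below illustrates only that basic idea and is not PLAQUE's algorithm; the function names and the toy rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def learn_range(build_rows: List[Row], key: str) -> Tuple[int, int]:
    """Learn the predicate lo <= key <= hi from one join input."""
    values = [row[key] for row in build_rows]
    return min(values), max(values)


def join_with_learned_predicate(
    build_rows: List[Row], probe_rows: List[Row], key: str
) -> Tuple[List[Row], int]:
    """Hash join that filters the probe side with the learned range.

    Returns the joined rows and the number of probe rows that survived
    the learned predicate (i.e., that still had to be probed).
    """
    if not build_rows:
        return [], 0
    lo, hi = learn_range(build_rows, key)
    index: Dict[int, List[Row]] = {}
    for row in build_rows:
        index.setdefault(row[key], []).append(row)
    # Rows outside [lo, hi] cannot match, so they are dropped early.
    survivors = [row for row in probe_rows if lo <= row[key] <= hi]
    joined = [
        {**build, **probe}
        for probe in survivors
        for build in index.get(probe[key], [])
    ]
    return joined, len(survivors)


build = [{"id": 3, "x": 1}, {"id": 5, "x": 2}]
probe = [{"id": 1, "y": 10}, {"id": 3, "y": 20},
         {"id": 4, "y": 30}, {"id": 9, "y": 40}]
rows, probed = join_with_learned_predicate(build, probe, "id")
print(rows, probed)
```

The saving grows when the filter on the probe side avoids expensive work, such as evaluating a UDF on rows that could never join.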

ZIP

This project develops a query-time missing value imputation framework, called ZIP, that modifies relational operators to be imputation-aware in order to minimize the joint cost of imputation and query processing. The modified operators use a cost-based decision function to determine whether to invoke imputation or to defer to downstream operators to resolve missing values. The modified query processing logic ensures that results with deferred imputations are identical to those produced if all missing values were imputed first. ZIP includes a novel outer-join-based approach to preserve missing values during execution, and a Bloom-filter-based index to reduce the space and running-time overhead. Extensive experiments on both real and synthetic data sets demonstrate a 10 to 25 times improvement when augmenting the state-of-the-art technology, ImputeDB, with ZIP-based deferred imputation. ZIP also outperforms the offline approach by up to 19,607 times on a real data set.
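
The benefit of deferring imputation can be seen in a toy selection query. The sketch below is only an illustration under simple assumptions, not ZIP's operators or cost model: the query, the rows, and the constant-valued imputer `make_counting_imputer` are hypothetical. It compares imputing every missing value up front against imputing only for rows that survive a cheaper predicate, and checks that both give the same result.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def make_counting_imputer(default_age: int) -> Tuple[Callable[[Row], int], Dict[str, int]]:
    """Hypothetical imputer that returns a constant and counts its calls."""
    calls = {"n": 0}

    def impute(row: Row) -> int:
        calls["n"] += 1
        return default_age

    return impute, calls


def eager_query(rows: List[Row], impute: Callable[[Row], int]) -> List[Row]:
    """Impute every missing age first, then filter."""
    filled = [
        dict(r, age=impute(r) if r["age"] is None else r["age"]) for r in rows
    ]
    return [r for r in filled if r["city"] == "Irvine" and r["age"] > 30]


def lazy_query(rows: List[Row], impute: Callable[[Row], int]) -> List[Row]:
    """Filter on city first, and impute age only for surviving rows."""
    out = []
    for r in rows:
        if r["city"] != "Irvine":
            continue  # the missing age never needs to be resolved
        age = impute(r) if r["age"] is None else r["age"]
        if age > 30:
            out.append(dict(r, age=age))
    return out


people = [
    {"name": "a", "city": "Irvine", "age": None},
    {"name": "b", "city": "Seattle", "age": None},
    {"name": "c", "city": "Irvine", "age": 25},
    {"name": "d", "city": "Berkeley", "age": None},
]
eager_impute, eager_calls = make_counting_imputer(40)
lazy_impute, lazy_calls = make_counting_imputer(40)
assert eager_query(people, eager_impute) == lazy_query(people, lazy_impute)
print(eager_calls["n"], lazy_calls["n"])
```

Here the predicate order is fixed by hand; deciding when to impute and when to defer is the job of the cost-based decision function described above.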

LOCATER

LOCATER explores the data cleaning challenges that arise in using WiFi connectivity data to map users to semantic indoor locations such as buildings, regions, and rooms. LOCATER has been deployed in more than 30 buildings at 3 universities (UCI, BSU, and Plaksha University in India) and at an elderly living facility, Walnut Village, located in Orange County. The system, which has been in operation for over 3 years at UC Irvine, uses data cleaning technologies over WiFi events to locate people inside buildings. It is a practical solution to passive, server-side indoor localization that leverages existing infrastructure (and hence has zero cost), yet it achieves roughly 85% accuracy, which is similar to that achieved by the expensive, dedicated-hardware-based solutions available commercially today. See our demo for one location-based application, occupancy, which is built using LOCATER.

EnrichDB

EnrichDB is a system designed to support just-in-time data enrichment during query processing. EnrichDB is motivated by applications that consume (potentially large volumes of) raw data that must first be interpreted using expensive machine learning or signal processing functions before being queried or used in analysis. Executing such enrichment during data ingestion (to support real-time analytics) is challenging to scale, especially when datasets are very large and/or when data arrives at a high velocity. EnrichDB addresses this challenge by supporting enrichment at all phases of data processing, including intermixing enrichment with query processing. It exploits query context to steer enrichment so that query results can be computed progressively. EnrichDB is implemented using a layered approach on top of PostgreSQL, though it can easily be layered on other databases.

T-COVE

T-COVE is an exposure tracing and occupancy system based on cleaning WiFi events on organizational premises. First, it supports a real-time occupancy tracking application that displays the real-time occupancy, i.e., the number of users, of locations at different granularities, such as buildings, floors, and regions. T-COVE has been deployed in over 30 buildings on the UCI and BSU campuses and has been running since 2020. T-COVE is planned to be installed on several other campuses and at several companies in the future. Another application supported in T-COVE is a passive exposure tracing system with potentially 100% adoption on campus, which could be used effectively to track exposures as one of the COVID-19 protection policies at UCI. T-COVE is passive and off-the-shelf, requiring no new hardware or software to be installed, while achieving a very usable accuracy of around 90%.

Selected Publications

Guangxue Zhang, Yiming Lin, Sharad Mehrotra (2026). Efficient and Effective Batch-Aware Model Selection for Large Language Models. In SIGMOD, 2026.

Yiming Lin, Sepanta Zeighami, Aditya G. Parameswaran (2025). Bolt-on, Verifiable Provenance for LLM-Powered Data Processing. Under Revision in VLDB 2026.

Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya Parameswaran (2025). LLM-Powered Proactive Data Systems. IEEE Data Engineering Bulletin March 2025 issue.

Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran (2025). TWIX: Automatically Reconstructing Structured Data from Templatized Documents. In SIGMOD 2026.

Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeighami, Aditya G. Parameswaran, Eugene Wu (2024). Towards Accurate and Efficient Document Analytics with Large Language Models. In ICDE, 2025.

Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J.D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, Eugene Wu (2024). SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines. In VLDB 2024 (Industry).

Yiming Lin, Sharad Mehrotra (2024). PLAQUE: Automated Predicate Learning at Query Time. In SIGMOD, 2024.

Yiming Lin, Sharad Mehrotra (2023). ZIP: Lazy Imputation during Query Processing. In PVLDB, 2024.

Yiming Lin, Yeye He, Surajit Chaudhuri (2023). Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. In PVLDB, 2023.

Rithwik Kerur, Yiming Lin (2023). Robust Occupancy Computation Based on WiFi Connectivity Events. In ASTRIDE@ICDE 2023. Winner of first place in the ASTRIDE workshop competition.

Yiming Lin, Pramod Khargonekar, Sharad Mehrotra, Nalini Venkatasubramanian (2021). T-cove: an exposure tracing system based on cleaning wi-fi events on organizational premises. In PVLDB 2021 (demo).

Yiming Lin, Daokun Jiang, Roberto Yus, Georgios Bouloukakis, Andrew Chio, Sharad Mehrotra, Nalini Venkatasubramanian (2020). Locater: Cleaning Wifi Connectivity Datasets for Semantic Localization. In PVLDB 2021.

Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao (2020). Efficient entity resolution on heterogeneous records. In ICDE, 2020.

Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao (2019). Data source selection for information integration in big data era. In Information Sciences 2019.