Yiming Lin

Postdoctoral Fellow

University of California, Berkeley

Biography

Yiming is currently a Postdoctoral Researcher at the University of California, Berkeley, working with Prof. Aditya Parameswaran. He earned his Ph.D. from the University of California, Irvine, under the supervision of Prof. Sharad Mehrotra. Before that, he earned master's and bachelor's degrees in computer science from Harbin Institute of Technology. His research focuses on building large-scale data management systems for structured and unstructured data, with an emphasis on optimizations in the query optimizer and query executor, and on methodologies for data preparation and cleaning.

He loves landscape photography and traveling, and shares his most recent photographs on 500px!

Download my resumé.

Interests
  • Unstructured Data Management
  • Data Cleaning
  • Data Preparation
  • Query Processing
  • Query Optimization

Experience

Research Intern at Microsoft Research
Microsoft
Jun 2022 – Sep 2022 Seattle
I worked with Yeye He during my internship at MSR. We developed an Auto Business Intelligence (BI) system that helps end users by accurately predicting BI models from a set of input tables, i.e., by discovering the join columns between them. We formulated this as a principled graph-based optimization problem that considers both local join predictions and the global schema-graph structure, achieving over 90% F1-score on real-world and TPC benchmarks. Our paper was accepted to the research track of PVLDB 2023.

Applied Scientist Intern
Amazon
Jun 2021 – Sep 2021 Seattle
I worked with Dmitri Kalashnikov and Vidit Bansal on a data cleaning project during my internship at Amazon. Specifically, this work resolves very dirty clusters produced by entity resolution (ER) algorithms, which contain multiple kinds of errors: incorrect, missing, incomplete, or copied values. Our proposed algorithm, SCC, improves on the method previously used at Amazon by around 61 percentage points in precision (from 34.1% to 95.5%) and by around 52 percentage points in F1-score (from 42.4% to 94.7%).

Research Assistant
University of California, Irvine
Sep 2017 – Present California
I worked on several projects focused on data cleaning, query processing, and building efficient online data processing systems.

Projects

T-COVE
T-COVE is an exposure tracing and occupancy system based on cleaning Wi-Fi events on organizational premises. It supports two applications. The first is real-time occupancy tracking, which displays the current occupancy (i.e., the number of users) of locations at different granularities, such as building, floor, or region. T-COVE has been deployed in over 30 buildings on the UCI and BSU campuses and has been running since 2020, and there are plans to install it on several other campuses and at companies in the future. The second application is a passive exposure tracing system with potentially 100% adoption across a campus, which can be used to track exposures as part of UCI's COVID-19 protection policies. T-COVE is passive and works off the shelf: it requires no new hardware or software to be installed, while achieving a usable accuracy of around 90%.

Selected Publications

(2025). TWIX: Automatically Reconstructing Structured Data from Templatized Documents. Under Review.

(2024). Towards Accurate and Efficient Document Analytics with Large Language Models. Under Review.

(2024). SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines. In VLDB 2024 (Industry).

(2024). PLAQUE: Automated Predicate Learning at Query Time. In SIGMOD, 2024.

(2023). ZIP: Lazy Imputation during Query Processing. In PVLDB, 2024.

(2023). Robust Occupancy Computation Based on WiFi Connectivity Events. In ASTRIDE@ICDE 2023. Winner of first place in the ASTRIDE workshop competition.

(2021). T-COVE: an exposure tracing system based on cleaning wi-fi events on organizational premises. In PVLDB 2021 (demo).

(2020). Locater: Cleaning Wifi Connectivity Datasets for Semantic Localization. In PVLDB 2021.

(2020). Efficient entity resolution on heterogeneous records (Extended Abstract). In ICDE, 2020.

(2019). Efficient entity resolution on heterogeneous records. In TKDE, 2019.

(2019). Data source selection for information integration in big data era. In Information Sciences 2019.

(2016). Efficient quality-driven source selection from massive data sources. In Journal of Systems and Software, 2016.

Contact

Hello From California!