Biography

Yiming has been a PhD student in the ISG group at the University of California, Irvine since 2017, under the supervision of Prof. Sharad Mehrotra. Before that, he earned his master's and bachelor's degrees in computer science from Harbin Institute of Technology. His research interests include data cleaning, query processing, and building efficient online data processing systems.

Download my resumé.

Interests
  • Data Cleaning
  • Query Processing
  • Efficient Online Data Processing Systems

Experience

Applied Scientist Intern
Amazon
Jun 2021 – Sep 2021 Seattle
During my Amazon internship, I worked on a data cleaning project. Specifically, the work resolves very dirty clusters produced by entity resolution (ER) algorithms, i.e., clusters that contain multiple kinds of errors such as incorrect, missing, incomplete, or copied values. Our proposed algorithm, SCC, improves on the method previously used at Amazon by around 61 percentage points in precision (from 34.1% to 95.5%) and by around 52 percentage points in F1 score (from 42.4% to 94.7%).
 
Research Assistant
University of California, Irvine
Sep 2017 – Present California
I have worked on several research projects focused on data cleaning, query processing, and building efficient online data processing systems.

Projects

LOCATER
LOCATER explores the data cleaning challenges that arise when using WiFi connectivity data to map users to semantic indoor locations such as buildings, regions, and rooms. WiFi connectivity data consists of sporadic connections between devices and nearby WiFi access points (APs), each of which may cover a relatively large area within a building. Our system, called semantic LOCATion cleanER (LOCATER), formulates semantic localization as a series of data cleaning tasks. First, it treats the problem of determining the AP to which a device is connected between any two of its connection events as a missing value detection and repair problem. It then associates the device with a semantic subregion (e.g., a conference room in the region) by formulating this step as a location disambiguation problem. LOCATER uses a bootstrapping semi-supervised learning method for coarse localization and a probabilistic method for fine localization. The paper shows that LOCATER achieves high accuracy at both the coarse and fine levels. Compared with localization techniques from the sensor network community, LOCATER is 1) off-the-shelf, i.e., it does not require installing any new hardware in buildings and thus could potentially be widely deployed; 2) passive, i.e., it does not require installing any new software on users' devices, such as phones or laptops; and 3) effective, i.e., it achieves around 90% accuracy, which is sufficient for many applications.
T-COVE
T-COVE is an exposure tracing and occupancy system based on cleaning WiFi events on organizational premises. It supports two applications. The first is a real-time occupancy tracking application that displays the current occupancy, i.e., the number of users, of locations at different granularities, such as a building, floor, or region. The second is a passive exposure tracing system with potentially 100% adoption on campus, which could be used effectively to trace exposures as one of UCI's COVID-19 protection policies. T-COVE has been deployed in over 30 buildings on the UCI and BSU campuses and has been running since 2020, and there are plans to install it on several other campuses and at companies. T-COVE is passive and off-the-shelf: it needs no new hardware or software, while achieving a usable accuracy of around 90%.

Publications

(2022). QUIP: Query-driven Missing Value Imputation. In arXiv preprint 2022.

PDF Cite

(2021). T-COVE: an exposure tracing system based on cleaning wi-fi events on organizational premises. In PVLDB 2021.

PDF Cite Code Poster Video DOI

(2020). LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization. In PVLDB 2021.

PDF Cite Code Slides Video DOI

(2020). Efficient entity resolution on heterogeneous records (Extended Abstract). In ICDE 2020.

PDF Cite

(2019). Demo Abstract: SemIoTic: Bridging the Semantic Gap in IoT Spaces. In BuildSys 2019.

PDF Cite

(2019). Efficient quality-driven source selection from massive data sources. In TKDE, 2019.

PDF Cite

(2019). Data source selection for information integration in big data era. In Information Sciences 2019.

PDF Cite

(2016). Efficient quality-driven source selection from massive data sources. In Journal of Systems and Software, 2016.

PDF Cite

Contact

Hi from California!