This Data Scientist’s Open-Source Tools Are Accelerating and Democratizing Data Preparation

23 Feb 2022

Data preparation is widely regarded as the most time-consuming part of data science and can take up to 80% of a data scientist’s time. SFU computing science professor Jiannan Wang’s mission is to speed up data science by greatly reducing the time spent on data preparation. To do this, he develops innovative technologies and open-source tools for data scientists to use.

Data preparation refers to the process of collecting, exploring, cleaning, transforming and integrating data into a form suitable for downstream analysis and modeling. The market for data preparation is estimated to exceed $13 billion by 2025.

“Data preparation is not a single problem,” says Wang.

“It consists of many challenging problems such as discovering, understanding, cleaning and integrating the data.”

These problems are often easier to solve with crowdsourcing and human intelligence than with full automation. For example, entity resolution is the task of identifying records that refer to the same real-world entity. It is central to data cleaning and integration, but algorithmic solutions are far from perfect. Wang built CrowdER, the first crowdsourced entity resolution system able to outperform the best human-only and machine-only systems. To reduce the human cost, he also developed the first quality-aware task assignment system for various data preparation tasks.

Wang proposed the SampleClean project to scale the expensive data cleaning process. The main idea is to have a human clean a small sample of the data, then have the machine use those results to learn the cleaning process and lessen the impact of unclean data on query results. The system was incorporated into the Berkeley Data Analytics Stack, one of the world’s most popular big data stacks at the time.

His mission to speed up data science, however, can perhaps best be seen in his string similarity join work. A string similarity join finds all pairs of strings whose similarity is above a user-specified threshold. Wang’s proposed algorithms made several major breakthroughs and ran 10 to 100 times faster than all other algorithms at the String Similarity Join/Search Competition hosted by EDBT in 2013, reducing the algorithm run time from hours to minutes.
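To make the definition concrete, here is a minimal Python sketch of a string similarity join. It is a brute-force baseline for illustration only, not Wang’s algorithm: it compares every pair of strings, which is exactly the quadratic cost that faster algorithms avoid. The choice of word-level Jaccard similarity, the function names `jaccard` and `similarity_join`, the sample records, and the use of "at or above" the threshold are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # treat two empty strings as identical
    return len(set_a & set_b) / len(set_a | set_b)


def similarity_join(strings, threshold):
    """Return every pair (s, t, similarity) whose similarity meets the threshold.

    Brute force: checks all n * (n - 1) / 2 pairs.
    Assumption: a pair qualifies when similarity >= threshold.
    """
    results = []
    for s, t in combinations(strings, 2):
        sim = jaccard(s, t)
        if sim >= threshold:
            results.append((s, t, sim))
    return results


# Hypothetical product records; the first two describe the same item.
records = [
    "Apple iPad 2 16GB WiFi White",
    "iPad 2 16GB WiFi White Apple",
    "Apple iPhone 4 16GB Black",
    "Samsung Galaxy Tab 16GB",
]

pairs = similarity_join(records, threshold=0.5)
for s, t, sim in pairs:
    print(f"{sim:.2f}  {s!r} ~ {t!r}")
```

With a threshold of 0.5, only the two iPad records are paired, because they contain the same words in a different order. Comparing every pair becomes impractical for millions of strings, which is why faster join algorithms matter in practice.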

In recognition of these research breakthroughs, Wang recently received a 2020 Outstanding Early Career Researcher Award from CS-Can|Info-Can. The organization’s mission is “To foster excellence in Computer Science research and higher education in Canada, drive innovation and benefit society.” The award follows the IEEE TCDE Rising Star Award, which Wang received in 2018.

“This award recognizes my past achievements, but in my opinion, research is a marathon,” says Wang, who also serves as the program director for the Master of Science in Professional Computer Science Program at SFU.

His long-term research goal can be seen in his new project DataPrep, an all-in-one data preparation system that aims to provide the easiest way for data scientists to prepare data in Python. Since the project began in May 2019, DataPrep has been downloaded over 120,000 times and has received positive feedback on forums such as Reddit.

Through his research, Wang hopes to build a community that is “equal, diverse and inclusive” while saving data scientists time during the crucial stage of data preparation.

“This award gives me more motivation to do excellent research and to focus on the impact of this research,” says Wang.

“There are millions of data scientists in the world that spend a lot of time on data preparation, so solving this problem could have a huge impact on society.”

Simon Fraser University’s School of Computing Science is mobilizing brilliant minds to create business and societal innovation for good. For more information, visit sfu.ca/csresearch