
Project: Data-centric AI - Automated data quality assessment


The success (and the cost) of a machine learning product or project depends to a great extent on the quality of the available data. If the data has significant flaws, the project may become much more expensive and time-consuming than planned, and may even fail entirely. On the other hand, if we can make it easier to discover good data, or to check the quality of data beforehand, we can substantially improve the chances that machine learning projects succeed.

This is closely related to data-centric AI, where the goal is to improve the data rather than the ML models.

There are several approaches that we want to explore.

Data readiness levels: Similar to nutrition labels or energy efficiency labels, these assign a simple A, B, C,... label to the 'readiness' of a dataset to be used in machine learning. 

Another is data performance metrics (DataPerf), where the goal is to measure the quality of data with informative tests.

In this project, the goal is to build an *automated* tool to assess the quality of data according to a range of tests and techniques. The tool will be tested in the real world by applying it to a large number of datasets, for example by automatically attaching a data readiness label to every dataset on OpenML.
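To make the idea concrete, here is a minimal sketch of what such an automated check could look like. It is not the project's method: the three tests (missing values, duplicate rows, constant columns), the thresholds, the `QualityReport` class, and the `assess_quality` function are all hypothetical choices made for illustration, and the A to D scale is only one possible way to define a readiness label.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    missing_fraction: float    # share of cells that are None
    duplicate_fraction: float  # share of rows that repeat an earlier row
    constant_columns: list     # columns with at most one distinct non-missing value
    label: str                 # hypothetical readiness label, "A" (best) to "D"


def assess_quality(rows, columns):
    """Run a few simple checks on a table given as a list of dicts.

    Assumes cell values are hashable and that missing values are None.
    """
    if not rows or not columns:
        raise ValueError("need at least one row and one column")

    # Test 1: fraction of missing cells.
    n_cells = len(rows) * len(columns)
    n_missing = sum(1 for row in rows for col in columns if row.get(col) is None)
    missing_fraction = n_missing / n_cells

    # Test 2: fraction of rows that exactly duplicate an earlier row.
    seen = set()
    n_duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            n_duplicates += 1
        else:
            seen.add(key)
    duplicate_fraction = n_duplicates / len(rows)

    # Test 3: columns that carry no information (one distinct value or none).
    constant_columns = [
        col for col in columns
        if len({row.get(col) for row in rows if row.get(col) is not None}) <= 1
    ]

    # Illustrative thresholds only: each failed test lowers the label one step.
    failed = 0
    if missing_fraction > 0.05:
        failed += 1
    if duplicate_fraction > 0.01:
        failed += 1
    if constant_columns:
        failed += 1
    label = "ABCD"[failed]

    return QualityReport(missing_fraction, duplicate_fraction, constant_columns, label)
```

Calling `assess_quality(rows, ["x", "y"])` on a table returns a report whose `label` field summarizes the individual tests. A real tool would need many more tests (label noise, class imbalance, outliers, leakage) and a principled way to combine them, which is exactly what this project sets out to explore.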

Potentially, this could also be combined with automatically improving these datasets.

Joaquin Vanschoren