There are an infinite number of ways to design a machine learning system, and many careful decisions need to be made based on prior experience. The field of automated machine learning (AutoML) aims to make these decisions in a data-driven, objective, and automated way.
There exists a range of AutoML tools (e.g. AutoGluon-Tabular, Auto-sklearn, H2O AutoML, GAMA, TPOT, ...). Many of these systems, however, expect rather clean data and can easily break when the data has certain imperfections. For instance, they may apply a one-hot encoder to a categorical feature with thousands of categories, exploding the feature space and causing the system to crash or hang. Many also do not handle string features well and therefore obtain very suboptimal results.
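As a concrete illustration of the high-cardinality problem, here is a minimal sketch (not taken from any of the tools above) of the kind of guard an AutoML bot could add: one-hot encode only when the number of categories is small, and fall back to a cheap frequency encoding otherwise. The function name and threshold are illustrative assumptions.

```python
def encode_categorical(values, max_onehot_categories=20):
    """Return (kind, encoding) for a list of categorical values.

    kind is 'onehot' when the cardinality is low enough, otherwise
    'frequency'. Threshold and fallback are illustrative choices.
    """
    categories = sorted(set(values))
    if len(categories) <= max_onehot_categories:
        # Safe to one-hot encode: one indicator column per category.
        index = {c: i for i, c in enumerate(categories)}
        rows = []
        for v in values:
            row = [0] * len(categories)
            row[index[v]] = 1
            rows.append(row)
        return "onehot", rows
    # Too many categories: map each value to its relative frequency
    # instead of exploding the feature space into thousands of columns.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return "frequency", [counts[v] / n for v in values]
```

A real bot would of course delegate to library encoders where possible; the point is only that the encoding choice should depend on the data, not be fixed in advance.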
Consider the following challenge: you are given an arbitrary (tabular) dataset, e.g. from OpenML or Kaggle, together with an associated task (e.g. classification), and your AutoML bot has to find reasonably good models without crashing. You can do anything you find reasonable to achieve this. Some suggestions:
You can evaluate your AutoML bot by pitting it against existing AutoML systems on a set of tricky datasets. You don't have to develop everything from scratch; you can build on GAMA, an extensible AutoML tool developed in our group.
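The evaluation described above can be sketched as a small benchmark harness: run each system on each dataset and record both the score and whether the run survived at all, since robustness is exactly what is being compared. Everything here (function names, the result format) is a hypothetical sketch, not the API of any real benchmark.

```python
def run_benchmark(systems, datasets):
    """Pit AutoML systems against each other on a set of datasets.

    systems:  dict mapping system name -> callable(dataset) that fits a
              model and returns a score, or raises on failure.
    datasets: dict mapping dataset name -> dataset object.
    Returns a dict keyed by (system, dataset) with status and score/error.
    """
    results = {}
    for sys_name, fit_and_score in systems.items():
        for ds_name, dataset in datasets.items():
            try:
                score = fit_and_score(dataset)
                results[(sys_name, ds_name)] = {"status": "ok", "score": score}
            except Exception as exc:
                # A crash on a tricky dataset is a data point in the
                # comparison, not a reason to abort the whole benchmark.
                results[(sys_name, ds_name)] = {"status": "error",
                                                "error": str(exc)}
    return results
```

In practice each callable would wrap a full AutoML run with a time budget (e.g. a GAMA fit on a train/test split), but the try/except pattern is the essential part: fragile systems fail per-dataset, and those failures show up in the final comparison.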