Data and AI cluster

Project: [ATOS] GenAI Evaluation Harness

Description

Problem Definition

GenAI systems deliver inconsistent quality (hallucinations, poor retrieval, variable latency). Customers want predictable KPIs and a reliable evaluation framework.

Objective

Develop an evaluation harness that automatically tests GenAI systems on:

* Groundedness & correctness

* Hallucination detection

* Retrieval quality (RAG)

* Latency & token costs

* Robustness under variations

* Regression testing during model updates
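
As an illustration of what two of these checks could look like, the following minimal sketch computes recall@k for retrieval quality and a crude token-overlap proxy for groundedness. Every name here (`EvalCase`, `recall_at_k`, `token_overlap_groundedness`) is hypothetical and not part of any existing ATOS code. The overlap score is only a stand-in: the project would likely replace it with a judge prompt or an entailment model.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One test case (hypothetical schema, assumed for illustration)."""
    question: str
    relevant_doc_ids: set      # gold document ids for the question
    retrieved_doc_ids: list    # ids returned by the RAG system, in rank order
    context: str               # retrieved text passed to the model
    answer: str                # model output to be evaluated


def recall_at_k(case, k):
    """Fraction of gold documents that appear in the top-k retrieved ids."""
    if not case.relevant_doc_ids:
        return 0.0
    top_k = set(case.retrieved_doc_ids[:k])
    return len(top_k & case.relevant_doc_ids) / len(case.relevant_doc_ids)


def token_overlap_groundedness(case):
    """Crude proxy: share of answer tokens that also occur in the context."""
    answer_tokens = case.answer.lower().split()
    if not answer_tokens:
        return 0.0
    context_tokens = set(case.context.lower().split())
    supported = sum(1 for token in answer_tokens if token in context_tokens)
    return supported / len(answer_tokens)


case = EvalCase(
    question="Where is the Eiffel Tower?",
    relevant_doc_ids={"d1", "d2"},
    retrieved_doc_ids=["d1", "d7", "d2"],
    context="the eiffel tower is in paris",
    answer="the tower is in berlin",
)

print(recall_at_k(case, 2))              # one of two gold documents in the top 2
print(token_overlap_groundedness(case))  # "berlin" is not supported by the context
```

The example also shows why a lexical proxy is not enough on its own: the answer is wrong, yet four of its five tokens appear in the context, so the score stays high. This is one reason the research questions ask for more reliable groundedness and hallucination metrics.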

Research Questions

* How do you define reliable groundedness/hallucination metrics?

* How do you systematically evaluate RAG systems?

* What are the best techniques for regression testing LLMs?

* Where can the biggest cost savings be found?

Deliverables

* Python eval framework (CLI + notebook)

* Benchmark dataset + judge prompts

* Test report generator (quality, risk, costs)

* CI integration for regression tests

* Final report + recommendations for tuning & prompting
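
The CI regression deliverable could, for example, compare a candidate model's metrics against a stored baseline and fail the build when any metric worsens beyond a tolerance. The sketch below assumes a hypothetical `find_regressions` helper, made-up metric names, and a 5% relative tolerance. None of these come from the project description.

```python
def find_regressions(baseline, candidate, tolerance=0.05,
                     lower_is_better=("latency_s", "cost_usd")):
    """Return metrics whose relative change is worse than `tolerance`.

    `baseline` and `candidate` map metric names to floats. Metrics listed
    in `lower_is_better` regress when they increase; all others regress
    when they decrease. Metrics missing from `candidate` are skipped.
    """
    regressions = {}
    for name, base in baseline.items():
        if name not in candidate:
            continue
        new = candidate[name]
        # Relative change; fall back to the absolute change if the baseline is 0.
        change = (new - base) / abs(base) if base else (new - base)
        if name not in lower_is_better:
            change = -change  # a drop in a quality metric counts as worse
        if change > tolerance:
            regressions[name] = (base, new)
    return regressions


# Illustrative numbers only, not measured results.
baseline = {"groundedness": 0.90, "recall_at_5": 0.80, "latency_s": 1.20}
candidate = {"groundedness": 0.84, "recall_at_5": 0.81, "latency_s": 1.21}

failed = find_regressions(baseline, candidate)
print(failed)
```

In a CI job, a non-empty result would typically be turned into a non-zero exit code so that the pipeline blocks the model update.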

Note: this is an industry project, so applicants must complete an interview with the company.

Details

Supervisor: Joaquin Vanschoren
External location: ATOS