Project: [ATOS] GenAI Evaluation Harness
Description
Problem Definition
GenAI systems deliver inconsistent quality (hallucinations, poor retrieval, variable latency). Customers want predictable KPIs and a reliable evaluation framework.
Objective
Develop an evaluation harness that automatically tests GenAI systems on:
* Groundedness & correctness
* Hallucination detection
* Retrieval quality (RAG)
* Latency & token costs
* Robustness under variations
* Regression testing during model updates
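As a rough illustration of how a few of these checks could be scored in one pass, the sketch below runs a system under test over a small set of cases and averages retrieval recall, a groundedness proxy, latency, and token usage. All names (`Case`, `Output`, `evaluate`, `support_ratio`) are hypothetical and not part of any existing project code, and the word-overlap score is only a crude stand-in for a real groundedness metric such as an LLM judge.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Case:
    question: str
    relevant_ids: set[str]     # gold document ids used to score retrieval


@dataclass
class Output:
    answer: str
    retrieved_ids: list[str]   # ranked ids returned by the retriever
    context: str               # text the answer is supposed to be grounded in
    tokens: int                # total tokens consumed by the call


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the gold documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def support_ratio(answer: str, context: str) -> float:
    """Crude groundedness proxy: share of answer words found in the context."""
    words = answer.lower().split()
    context_words = set(context.lower().split())
    if not words:
        return 0.0
    return sum(w in context_words for w in words) / len(words)


def evaluate(system: Callable[[str], Output], cases: Sequence[Case], k: int = 3) -> dict:
    """Run every case through the system and return the mean of each metric."""
    rows = []
    for case in cases:
        start = time.perf_counter()
        out = system(case.question)
        latency = time.perf_counter() - start
        rows.append({
            "recall_at_k": recall_at_k(out.retrieved_ids, case.relevant_ids, k),
            "support_ratio": support_ratio(out.answer, out.context),
            "latency_s": latency,
            "tokens": out.tokens,
        })
    if not rows:
        return {}
    return {metric: mean(row[metric] for row in rows) for metric in rows[0]}
```

Treating the system as a plain callable keeps the sketch independent of any particular model or retriever, so the same loop could be pointed at different GenAI systems.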
Research Questions
* How do you define reliable groundedness/hallucination metrics?
* How do you systematically evaluate RAG systems?
* What are the best techniques for regression testing LLMs?
* Cost optimization: where can the largest savings be found?
Deliverables
* Python eval framework (CLI + notebook)
* Benchmark dataset + judge prompts
* Test report generator (quality, risk, costs)
* CI integration for regression tests
* Final report + recommendations for tuning & prompting
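One possible shape for the CI regression check is a gate that compares the metrics of a new run against a stored baseline and reports anything that got worse by more than a tolerance. This is a minimal sketch under stated assumptions: the metric names, the `find_regressions` helper, and the 5% relative tolerance are all hypothetical choices, not project requirements.

```python
# Hypothetical metric names; True means a higher value is better.
HIGHER_IS_BETTER = {
    "recall_at_k": True,
    "support_ratio": True,
    "latency_s": False,
    "tokens": False,
}


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> list[str]:
    """Return a description of each metric that worsened beyond the tolerance.

    The tolerance is relative to the baseline value (0.05 means 5%).
    Metrics missing from either run are skipped.
    """
    regressions = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        if metric not in baseline or metric not in current:
            continue
        base, cur = baseline[metric], current[metric]
        allowed = abs(base) * tolerance
        # Positive "worse" means the metric moved in the bad direction.
        worse = (base - cur) if higher_is_better else (cur - base)
        if worse > allowed:
            regressions.append(f"{metric}: {base:.3f} -> {cur:.3f}")
    return regressions
```

In a CI job, a non-empty result could be printed and followed by `sys.exit(1)` so that the pipeline fails when a model or prompt update degrades quality or cost.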
Note: this is an industry project and requires an interview with the company.
Details
- Supervisor: Joaquin Vanschoren
- External location: ATOS