Project: TSAgentBench: Benchmarking LLM Agents on Time Series Tasks

Description

Background & Motivation

Large language model agents are increasingly being applied to time series tasks: forecasting, anomaly detection, classification, and root cause analysis. Systems that combine LLMs with tool use, retrieval, and code execution have shown promising results across domains including healthcare monitoring, energy management, predictive maintenance, and scientific data analysis. Yet rigorous, standardised evaluation of these agents on time series tasks is largely absent.

Existing benchmarks for LLM agents focus primarily on language understanding, coding, or general reasoning. Time series benchmarks, on the other hand, evaluate models rather than agents. They measure forecasting accuracy or classification performance without considering the multi-step reasoning, tool use, and contextual adaptation that characterise agentic behaviour. The result is a significant evaluation gap: we have no systematic way to assess how well LLM agents actually perform on the kinds of time series problems that arise in practice.

This project constructs TSAgentBench, a general-purpose benchmark for evaluating LLM agents on time series tasks across multiple domains. It is inspired by MLE-bench [1], which evaluates agents on realistic ML engineering tasks, and TemporalBench [2], which introduces contextual time series evaluation. TSAgentBench assesses agents on tasks that require reading and reasoning about sequential data, selecting appropriate analysis strategies, using computational tools, and adapting to unfamiliar data characteristics.


Research Objectives

The following objectives represent the broader research landscape for this project. The precise thesis scope will be defined collaboratively with the supervisor based on the student’s background and interests. Students are not expected to address all objectives.

1. Survey existing LLM agent evaluation frameworks (such as MLE-bench [1], TemporalBench [2], and BLADE [5]) alongside time series benchmarks, and identify gaps specific to agentic time series reasoning.

2. Design a task taxonomy covering core time series capabilities: pattern understanding, anomaly reasoning, forecasting, and contextual adaptation to unfamiliar data.

3. Construct evaluation tasks across at least two domains, such as healthcare, energy, or climate, with standardised pipelines, ground-truth labels, and automatic evaluation protocols.

4. Implement an evaluation harness supporting tool-augmented agents and direct LLM prompting baselines.

5. Benchmark representative LLM agents and produce a structured analysis of capability gaps and failure modes.
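
To make objectives 3 and 4 more concrete, the following is a minimal sketch of what an automatic evaluation loop could look like for a single task type (forecasting). It is not the project's actual design: the names `ForecastTask`, `Agent`, `naive_last_value`, and `evaluate` are hypothetical, the toy data is made up for illustration, and mean absolute error is only one possible metric. An LLM agent or a direct-prompting baseline would plug in wherever the naive baseline is used here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ForecastTask:
    """Hypothetical benchmark item: observed history plus ground-truth future."""
    task_id: str
    history: Sequence[float]
    target: Sequence[float]  # ground-truth values the agent must predict


# Assumption: an "agent" is any callable mapping (history, horizon) to a forecast.
# A tool-augmented LLM agent or a plain prompting baseline could sit behind it.
Agent = Callable[[Sequence[float], int], Sequence[float]]


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average absolute difference between prediction and ground truth."""
    if len(pred) != len(target):
        raise ValueError("prediction and target lengths differ")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def naive_last_value(history: Sequence[float], horizon: int) -> list[float]:
    """Non-LLM reference baseline: repeat the last observed value."""
    return [history[-1]] * horizon


def evaluate(agent: Agent, tasks: Sequence[ForecastTask]) -> dict[str, float]:
    """Run the agent on every task and score each forecast automatically."""
    scores: dict[str, float] = {}
    for task in tasks:
        forecast = agent(task.history, len(task.target))
        scores[task.task_id] = mean_absolute_error(forecast, task.target)
    return scores


# Toy tasks, invented purely for illustration.
tasks = [
    ForecastTask("toy-flat", [1.0, 1.0, 1.0], [1.0, 1.0]),
    ForecastTask("toy-trend", [1.0, 2.0, 3.0], [4.0, 5.0]),
]
results = evaluate(naive_last_value, tasks)
print(results)
```

Keeping the agent behind a single callable interface means the same scoring code can compare tool-augmented agents, direct prompting, and classical baselines; other task types such as anomaly detection would need their own task record and metric.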


Students with a strong interest in the topic, relevant hands-on experience, and the motivation to produce a publishable thesis are encouraged to apply. For more details, please contact the supervisors (s.deng@tue.nl or y.deng2@tue.nl).


References

[1] Chan, J., et al. “MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.” ICLR 2025.

[2] Weng, M., et al. “TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks.” arXiv:2602.13272, 2026.

[3] Xie, Q., et al. “FinBen: A Holistic Financial Benchmark for Large Language Models.” NeurIPS 2024.

[4] Jin, M., et al. “Time-LLM: Time Series Forecasting by Reprogramming Large Language Models.” ICLR 2024.

[5] Gu, K., et al. “BLADE: Benchmarking Language Model Agents for Data-Driven Science.” arXiv 2024.

[6] Wang, C., et al. “AutoTS: Automatic Time Series Analysis with LLM Agents.” arXiv 2025.

[7] Goswami, M., et al. “MOMENT: A Family of Open Time-Series Foundation Models.” ICML 2024.

Details

Supervisor: Amy Deng
Secondary supervisor: Yuanyuan Deng