Data and AI cluster

Project: [Closed] (PwC) LLMs for Data Analysis Pipelines

Description

LLMs have the potential to make data more accessible to non-technical audiences through prompt-based analytics. They can also make engineering teams more efficient by quickly producing a first draft of a data pipeline.

Both of these applications hinge on appropriate tagging and management of data:

• If we ask the LLM to plot quarterly sales for product X, how will the LLM know which sales field to pick if it is not appropriately tagged in the metadata layer?

• The same basic principle applies in an engineering context, but the considerations are more complex (e.g., anonymization policies and a variety of ingest, compute, and storage/publish options).

• For this, you could provide the LLM with the data engineering policies and principles that your company follows, so that it builds pipelines according to your approach.

• How do you set your metadata up to be able to leverage these new capabilities? How do you still build in a fail-safe / undo / human intervention step in the process?
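
To make the tagging question concrete, here is a minimal sketch of a metadata layer in which each field carries business tags, and a lookup that falls back to human review when a term matches no field or more than one. All names (FieldMeta, CATALOG, resolve_field, the table and column names, and the tags) are hypothetical and chosen only for illustration; a real catalog would live in a data catalog tool, not in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMeta:
    """One column in the (hypothetical) metadata layer."""
    table: str
    column: str
    tags: frozenset


# Illustrative catalog; tables, columns, and tags are assumptions.
CATALOG = [
    FieldMeta("sales_fact", "net_amount_eur", frozenset({"sales", "quarterly-reporting"})),
    FieldMeta("sales_fact", "gross_amount_eur", frozenset({"sales-gross"})),
    FieldMeta("crm_orders", "order_value", frozenset({"pipeline"})),
]


def resolve_field(term, catalog):
    """Return (field, needs_review) for a business term.

    A field is returned only when exactly one tagged field matches.
    Zero or multiple matches yield (None, True), which signals that
    a human should decide before the LLM builds a query or plot.
    """
    matches = [f for f in catalog if term in f.tags]
    if len(matches) == 1:
        return matches[0], False
    return None, True
```

With this catalog, the term "sales" resolves to a single field, while an untagged term such as "revenue" is routed to human review instead of being guessed.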

Based on the underlying metadata of the different systems and the transformation scripts (e.g., SQL), create an approach to automatically detect data lineage and support data discovery. This is a fundamental building block for setting up a good data management process. A specific use case could be ESG data, where we see that it takes a lot of time to gain insight into data lineage; a quick scan based on data from the systems would help.
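
As a rough illustration of what such a quick scan could look like, the sketch below derives table-level lineage from SQL scripts with regular expressions. It is a simplification under stated assumptions, not a proposed method: it only recognizes INSERT INTO and CREATE TABLE/VIEW targets and FROM/JOIN sources, splits statements naively on semicolons, and ignores CTEs, subquery aliases, and dialect-specific syntax. The function names and the example table names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed patterns: a statement's target follows INSERT INTO or
# CREATE [OR REPLACE] TABLE/VIEW; its sources follow FROM or JOIN.
TARGET_RE = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?(?:table|view))\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql_scripts):
    """Map each target table to the set of tables it reads from."""
    lineage = defaultdict(set)
    for script in sql_scripts:
        # Naive split: semicolons inside string literals are not handled.
        for statement in script.split(";"):
            targets = [t.lower() for t in TARGET_RE.findall(statement)]
            sources = {s.lower() for s in SOURCE_RE.findall(statement)}
            for target in targets:
                lineage[target].update(sources - {target})
    return dict(lineage)


def upstream(lineage, table):
    """Return all direct and indirect source tables of `table`."""
    seen = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for source in lineage.get(current, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    scripts = [
        "CREATE TABLE staging.emissions AS "
        "SELECT * FROM raw.meter_readings m JOIN raw.sites s ON m.site_id = s.id;",
        "INSERT INTO reporting.esg_summary "
        "SELECT site, SUM(co2) FROM staging.emissions GROUP BY site;",
    ]
    print(sorted(upstream(extract_lineage(scripts), "reporting.esg_summary")))
```

A production approach would more likely rely on a proper SQL parser and on system metadata (query logs, catalog tables) rather than regular expressions, but the resulting target-to-sources graph is the same kind of structure.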

Details

Supervisor: Bart Engelen
External location: PwC
Link: Thesis