LLMs have the potential to make data more accessible to non-technical audiences through prompt-based analytics. They can also help engineering teams work more efficiently by quickly producing a first draft of a data pipeline.
Both of these applications hinge on appropriate tagging and management of data:
• If we ask the LLM to plot quarterly sales for product X, how will the LLM know which sales field to pick if it is not appropriately tagged in the metadata layer?
• The same basic principle applies in an engineering context, but the metadata involved is more complex (e.g., anonymization policies, and the variety of ingest, compute, and storage/publish options).
• For this, you could provide the LLM with the data engineering policies and principles your company follows, so that it builds pipelines according to your approach.
• How do you set your metadata up to be able to leverage these new capabilities? How do you still build in a fail-safe / undo / human intervention step in the process?
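The tagging idea in the first bullet can be sketched concretely. Below is a minimal, hypothetical metadata layer: a glossary that maps business terms to governed physical fields, which could be injected into the LLM's prompt so the model resolves "quarterly sales for product X" to the right column instead of guessing. All table and field names here are illustrative assumptions, not a real schema.

```python
# Hypothetical sketch of a business-glossary metadata layer.
# Table/field names are made up for illustration.

METADATA = {
    "quarterly sales": {
        "table": "fct_sales",
        "field": "net_sales_amount",
        "grain": "quarter",
        "description": "Net sales aggregated per fiscal quarter",
    },
    "product": {
        "table": "dim_product",
        "field": "product_name",
        "grain": "product",
        "description": "Canonical product name from the product dimension",
    },
}

def resolve(term: str) -> dict:
    """Resolve a business term from a user prompt to a governed field.

    In practice, the matching entries would be included as context in the
    LLM prompt, so the model picks the tagged field rather than inventing
    a column name.
    """
    entry = METADATA.get(term.lower())
    if entry is None:
        # A fail-safe: unknown terms surface to a human instead of guessing.
        raise KeyError(f"No governed field tagged for term: {term!r}")
    return entry

print(resolve("quarterly sales")["field"])
```

Raising on unknown terms is one simple version of the fail-safe / human-intervention step: the system refuses to answer rather than silently picking an untagged field.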
Based on the underlying metadata of the different systems and the transformation scripts (e.g., SQL), create an approach to automatically detect data lineage / support data discovery. This is a fundamental building block for setting up a good data management process. A specific use case could be ESG data, where we see that it takes a lot of time to create insight into ESG data lineage; a quick scan based on data from the systems would help.
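A quick scan of a transformation script could look like the following sketch: heuristic regexes that pull source and target tables out of a SQL statement. This is deliberately rough (no dialect handling, no column-level lineage); a real implementation would use a proper SQL parser combined with the systems' catalog metadata. The ESG table names in the example are made up.

```python
import re

def scan_lineage(sql: str) -> dict:
    """Rough lineage scan of one SQL transformation script.

    Extracts target tables (INSERT INTO / CREATE TABLE) and source tables
    (FROM / JOIN) with simple regexes. Heuristic only: a production version
    would use a real SQL parser and the source systems' metadata.
    """
    sql = re.sub(r"--[^\n]*", "", sql)  # strip line comments
    targets = re.findall(r"(?:insert\s+into|create\s+table)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return {
        "targets": sorted(set(targets)),
        "sources": sorted(set(s for s in sources if s not in targets)),
    }

# Illustrative ESG transformation (all names are hypothetical):
example = """
CREATE TABLE esg_mart.emissions_summary AS
SELECT f.site_id, SUM(f.co2_tonnes) AS total_co2
FROM raw_esg.emissions f
JOIN ref.sites s ON s.site_id = f.site_id
GROUP BY f.site_id
"""
print(scan_lineage(example))
```

Running this scan over all scripts in a repository and stitching the per-script edges together would give the quick lineage graph described above, which a human can then review and correct.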