The AI craze is in hyperdrive, and it shows no clear indication of dying down anytime soon. The question is, do we jump on the bandwagon or do we hold back for sanity to prevail?

You know the answer. We are not just jumping; we are flying with whatever we have to see where we land. The horizon has never been so bleak, and it doesn't matter which side is up and which is down. We are flying alright!

Dealing with AI with the Data on Hand

At Etherion, I have been actively working with open-source data: cleaning it, analyzing it, and developing insights to share with the community. I have been eager to try new tools in the market, both AI-enabled and not, and it has been a lot of fun. Working with AI-integrated products is enjoyable, but as a co-founder and part of the leadership team, I stay mindful of the objective and goal we set out, because it is so easy to get lost in the flow.

Speaking of which, my technical flow looks something like this -

  • Acquire data - Kaggle, data.org, and web scraping
  • Cleaning and Exploratory Analysis
  • Data Normalization and Modeling
  • Data Warehouse & Pipeline
  • Data Visualization
  • Documenting the Insights and Dashboarding

It is a simple, proven step-by-step process that keeps me interested in the project and gives me some leeway to involve people as required. It also lets me try out new tools at each step. Maybe that is where the complication crept into the process.

My Cursor slipped onto Ollama, and the Rest is Magic.

While my process is modular and simple, I have been feeling quite adventurous lately, so I figured, why not just run the entire analytics through AI? It is the flavor of the times, and the tools are great and simple to jump into without any prior homework. The steps are simple -

  • Download Ollama and any model you want to try. Llama 3.2 is available for free.
  • Set up the project in VS Code. I am using uv to set up the environment.
  • Load libraries like LangChain and its subcomponents.

A few tutorials and some documentation later, I was able to feed the CSV files with rental data I picked up from Kaggle into Llama 3.2 and chat with it to explore the data.
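To give a rough idea of what "feeding a CSV to a local model" involves, here is a minimal sketch. It is not my LangChain setup: it skips LangChain and talks to Ollama's local HTTP endpoint directly, and it assumes Ollama is running on its default port with the llama3.2 model pulled. The column names and sample rows are hypothetical, as are the helper names rows_to_text, build_prompt, and ask_ollama.

```python
import csv
import io
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical sample rows standing in for the Kaggle rental data.
SAMPLE_CSV = "city,rent\nPune,18000\nMumbai,45000\n"


def rows_to_text(csv_text: str, limit: int = 20) -> str:
    """Render the first `limit` rows as 'column: value' lines, one block per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for index, row in enumerate(reader):
        if index >= limit:
            break
        blocks.append("\n".join(f"{col}: {val}" for col, val in row.items()))
    return "\n\n".join(blocks)


def build_prompt(csv_text: str, question: str) -> str:
    """Combine the rendered rows and the user's question into one prompt."""
    return (
        "You are analysing rental listings. Use only the rows below.\n\n"
        f"{rows_to_text(csv_text)}\n\nQuestion: {question}"
    )


def ask_ollama(prompt: str, model: str = "llama3.2") -> str:
    """Send one non-streaming request to a locally running Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, something like ask_ollama(build_prompt(SAMPLE_CSV, "What is the highest rent?")) would return the model's answer as a string. Pasting rows into the prompt only works for small files, which is one reason a vector store becomes useful for larger datasets.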

I was able to get the min, max, and average rent before it started to throw type errors. After a few normalization steps, I got it working again, and it was quick to respond to my prompts as well. The surprising bit was how it struggled wherever the data was not in the correct or acceptable format. It raised a question -

Can a lack of Data Governance derail the performance of a well-trained model?

In my experience working with multiple clients on data integration, migration, and modernization projects, the crux of the problem is rarely technical. It is more often process and business alignment issues that cause major projects to fail.
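To make the formatting problem concrete, the type errors I mentioned earlier are the kind you get when a numeric column arrives as text. The sketch below assumes a hypothetical rent column with currency symbols, thousands separators, and placeholder values. The function names parse_rent and summarize are illustrative, not part of any library, and this is not the exact normalization I ran.

```python
import re
import statistics
from typing import Dict, List, Optional


def parse_rent(raw: Optional[str]) -> Optional[float]:
    """Strip everything except digits and dots, then try to read a number.

    Returns None when nothing usable is left (for example 'N/A' or '').
    """
    cleaned = re.sub(r"[^0-9.]", "", raw or "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def summarize(raw_values: List[Optional[str]]) -> Dict[str, float]:
    """Compute min, max, and mean over the values that parse; count the rest."""
    rents = [r for r in map(parse_rent, raw_values) if r is not None]
    if not rents:
        raise ValueError("no usable rent values")
    return {
        "min": min(rents),
        "max": max(rents),
        "mean": statistics.mean(rents),
        "rejected": len(raw_values) - len(rents),
    }


# Hypothetical raw values, mixing formats the way scraped data often does.
print(summarize(["18,000", "₹ 45000", "N/A", "27000.0"]))
```

Keeping a count of rejected values matters as much as the cleaning itself: silently dropping rows hides exactly the quality problem that governance is supposed to surface.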

DAMA-DMBOK2 and the need for professional data expertise

Yes, pre-trained models are only as good as the data they are trained on, and they need good data to produce accurate insights.

What is DAMA-DMBOK?

DAMA-DMBOK stands for the DAMA Guide to the Data Management Body of Knowledge. It is the de facto global standard for best practices in data management, developed by DAMA International (Data Management Association).

DAMA also offers a series of certifications that help data professionals master the data management framework and work on large-scale data transformation projects in big organizations looking to take control of their data.

How does DAMA-DMBOK fit into the conversation with AI?

It's simple, isn't it? Any LLM trained on bad data will give out bad results, and any LLM working with bad data is going to hallucinate to varying degrees. Data sits at the crux of this advent of AI, and it requires professionals to manage it. DAMA-DMBOK provides structural clarity and much-needed alignment within the organization when it comes to the generation, access, use, and deprecation of data.

The Framework to enable AI

Data Governance provides a framework for organizations to assess, design, map, and quality-assure the data used to train LLMs, so that the organization can be driven by AI.

DAMA has its version of a data governance framework that manages the flow of information as well as the structure of the organization to maintain data integrity. Here is the DAMA Wheel for reference -

In my project above (I haven't lost track of it), I saw the need for a data quality framework to make sure that my interaction with the data stays efficient and accurate.

There are multiple data quality tools and frameworks to follow, but the quickest and simplest I have found so far is to integrate dbt into the pipeline. That, of course, means creating a data pipeline first, so that Llama 3.2 works with quality data.

I created a data pipeline that downloads the updated data from Kaggle, and wrote dbt tests to make sure the data arrives in the required format and quality. The clean data is then exported to another CSV and fed into LangChain, which uses Chroma to create a vector database for quick processing with Llama 3.2.

I do not want to go into the technical details here. I will write another post with step-by-step details on integrating any dataset with a local LLM and creating an AI chatbot to interact with it.
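Without going into the full setup, the kind of rule a data test enforces can be sketched in a few lines. The example below is plain Python, not dbt itself: it imitates three of dbt's built-in generic tests (unique, not_null, accepted_values) on a CSV. The column names (listing_id, rent, furnishing), the allowed values, and the sample rows are all hypothetical.

```python
import csv
import io
from typing import Dict, List, Set

Row = Dict[str, str]


def not_null(rows: List[Row], column: str) -> List[int]:
    """Return indexes of rows where the column is missing or blank."""
    return [i for i, row in enumerate(rows) if not (row.get(column) or "").strip()]


def unique(rows: List[Row], column: str) -> List[int]:
    """Return indexes of rows whose value was already seen in an earlier row."""
    seen: Set[str] = set()
    duplicates = []
    for i, row in enumerate(rows):
        value = row.get(column, "")
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


def accepted_values(rows: List[Row], column: str, allowed: Set[str]) -> List[int]:
    """Return indexes of rows whose value is outside the allowed set."""
    return [i for i, row in enumerate(rows) if row.get(column) not in allowed]


def run_checks(csv_text: str) -> Dict[str, List[int]]:
    """Run every check and map its name to the list of failing row indexes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    furnishing = {"furnished", "semi-furnished", "unfurnished"}
    return {
        "listing_id is unique": unique(rows, "listing_id"),
        "rent is not null": not_null(rows, "rent"),
        "furnishing is accepted": accepted_values(rows, "furnishing", furnishing),
    }


# Hypothetical rows: one blank rent, one duplicate id, one unknown furnishing value.
DIRTY_CSV = (
    "listing_id,rent,furnishing\n"
    "1,18000,furnished\n"
    "2,,unfurnished\n"
    "2,27000,partly\n"
)

for name, failures in run_checks(DIRTY_CSV).items():
    print(f"{name}: {'ok' if not failures else f'failed on rows {failures}'}")
```

In a real pipeline, a failing check would stop the export so that only rows passing every rule reach the vector store. In dbt, the same rules are declared on the model's columns rather than hand-written.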

Where do we stand with Data Governance at Etherion Consulting LLP?

Every time I work with a new dataset, and more so now when I work with local LLMs for a much smoother workflow, I go back to the same question: How can I make sure the data is accurate and of good quality?

The answer invariably takes me back to implementing a robust Data Governance Framework with integrated Data Security, Privacy, and Compliance measures.

At Etherion, we are working on implementing a governance-first approach whenever we work with clients who are looking to integrate AI into their workflow. There is no quick fix: we first have to assess the maturity of the organization's data and people's awareness of data compliance, security, and privacy.

Next Steps

Etherion is working on creating a playbook on implementing Data Governance within organizations that are looking to enable AI. It takes a top-down perspective on introducing AI into the organization, one that feels more seamless than a forced approach adopted because of current market trends.

We want to empower our clients with the expertise that we have garnered by studying the frameworks proposed by DAMA and efficiently drive this AI revolution.

Catch you next time with another post. Subscribe to get notified.
