Introduction
I'm learning about data analysis and wanted an easier way to work with CSV files. Learning to write pandas code in Python was taking me a long time, so I thought about using AI to help.
I built a tool called the Data Analysis Agent that lets you ask questions about your data in plain English. Instead of writing code, you can just ask things like "What's the average sales?" and get an answer.
In this blog post, I'll show you how I made this tool and how you can use it too.
What is the Data Analysis Agent?
The Data Analysis Agent is a tool I built that lets you analyze CSV files by asking questions in plain English. Instead of struggling to write complex pandas code, you can simply ask questions like:
- "What is the average sales?"
- "Show me the top 5 highest values in the salary column"
- "Count the number of rows"
- "What is the correlation between age and salary?"
- "Group by department and show the mean salary"
The system then uses a local AI model to understand your question and generate the appropriate pandas commands to answer it. It's like having a helpful assistant that knows how to work with data!
Key Features
1. Interactive Natural Language Interface
The coolest thing about this tool is that you can ask questions about your data in plain English. No need to learn complex programming syntax - just ask what you want to know!
2. Local AI Integration with Ollama
I chose to use Ollama to run AI models locally on your computer instead of using cloud services. This means:
- Privacy: Your data stays on your computer
- Speed: No waiting for internet connections
- Cost: Completely free to use
- Offline: Works even without internet
3. Multiple AI Model Support
You can use different AI models depending on what you have installed. Some popular ones include:
- Llama2: A good all-around model
- CodeLlama: Great for generating code
- Mistral: Another solid option
- Any model: You can use whatever you have in Ollama
4. Automatic CSV Detection
The tool automatically finds all your CSV files in the data/ folder and lets you pick which one to analyze. No need to remember file paths!
5. Pandas Integration
All the actual data analysis is done using pandas, which is the standard tool for data analysis in Python. This means you get reliable, well-tested results.
How It All Works (The Simple Version)
I tried to keep the code organized in a way that makes sense. Here's how the different parts work together:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ You ask a │───▶│ AI Model │───▶│ Pandas Code │
│ question │ │ (Ollama) │ │ gets made │
│ in English │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Pick which │ │ Load your │ │ Show you │
│ AI model │ │ CSV file │ │ the results │
└─────────────────┘ └─────────────────┘ └─────────────────┘
The Code Pieces (What I Built)
I broke the project into smaller pieces to make it easier to understand and work on. Here are the main parts:
1. Main Application (main.py)
This is where everything starts. It handles:
- Letting you pick which AI model to use
- Loading your dataset
- The main menu where you ask questions
- Running the whole show
def main():
    # Pick which AI model to use
    selected_model = select_model()

    # Load your dataset
    file_path, df = select_dataset()

    # Keep asking questions until you type 'exit'
    while True:
        query = input("Enter your query: ")
        if query.strip().lower() == "exit":
            break
        result = analyze_data(df, query, selected_model)
        print(result)
2. Data Analyzer (data_analyzer.py)
This is the brain of the operation. It:
- Takes your question and turns it into a prompt for the AI
- Gets the AI's response
- Runs the pandas code the AI generates
- Handles any errors that might happen
def analyze_data(df, query, selected_model):
    prompt = f"""
    System: You are a data analysis assistant. The dataset has columns: {list(df.columns)}.
    The user asked: "{query}"
    Suggest a pandas operation to answer the query. Return only the pandas command.
    """
    # Ask the AI for a pandas command
    command = query_ollama(prompt, selected_model).strip()
    # Run the code the AI gave us
    result = eval(command, {'df': df, 'pd': pd})
    return result
3. Ollama Client (ollama_client.py)
This talks to the AI model on your computer:
- Sends your questions to the AI
- Gets the AI's answers back
- Handles any connection problems
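To give a sense of what this module does, here's a minimal sketch built on Ollama's /api/generate endpoint. The function name matches the query_ollama used above, but treat the details (timeout value, error handling) as my assumptions rather than the exact code:

import requests

def query_ollama(prompt, model, base_url="http://localhost:11434"):
    # Ask the local Ollama server for a single, non-streaming completion
    response = requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    # Ollama returns the generated text in the "response" field
    return response.json()["response"]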
4. Model Selection (model_selection.py)
This helps you pick which AI model to use:
- Finds all the AI models you have installed
- Lets you choose which one to use
- Remembers your choice so it's faster next time
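Roughly, the discovery and selection could look like this. It uses Ollama's /api/tags endpoint plus a simple module-level cache so the list only gets fetched once; the real model_selection.py may do it differently:

import requests

_cached_models = None  # remembered between calls so we only ask Ollama once

def select_model(base_url="http://localhost:11434"):
    global _cached_models
    if _cached_models is None:
        tags = requests.get(f"{base_url}/api/tags").json()
        _cached_models = [model["name"] for model in tags["models"]]
    # Show a numbered menu of installed models and let the user pick one
    for i, name in enumerate(_cached_models, start=1):
        print(f"{i}. {name}")
    choice = int(input("Pick a model number: ")) - 1
    return _cached_models[choice]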
5. Data Management
- CSV Files (csv_files.py): Finds all your CSV files and lets you pick one
- Data Loader (data_loader.py): Loads your CSV file into pandas
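Here's a rough sketch of how those two pieces could fit together behind the select_dataset() call used in main(). The structure is my guess; the real files may split the work differently:

import glob
import os
import pandas as pd

def select_dataset(data_dir="data"):
    # Find every CSV in the data/ folder and let the user pick one
    csv_files = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    for i, path in enumerate(csv_files, start=1):
        print(f"{i}. {os.path.basename(path)}")
    choice = int(input("Pick a file number: ")) - 1
    file_path = csv_files[choice]
    # Load the chosen file into a pandas DataFrame
    return file_path, pd.read_csv(file_path)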
How It Works: Step-by-Step
Step 1: Model Discovery
The tool looks for AI models you have installed on your computer:
import requests

response = requests.get("http://localhost:11434/api/tags")
models = [model['name'] for model in response.json()['models']]
Step 2: Dataset Loading
It finds all your CSV files and loads the one you pick:
import glob
import os
import pandas as pd

csv_files = glob.glob(os.path.join("data", "*.csv"))
df = pd.read_csv(selected_file)  # selected_file is the CSV you picked
Step 3: Query Processing
When you ask a question, here's what happens:
- Makes a Prompt: It takes your question and adds information about your dataset
- Asks the AI: Sends your question to the AI model you picked
- Gets Code: The AI gives back pandas code to answer your question
- Runs the Code: It runs the pandas code on your data
- Shows Results: You get your answer!
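One detail worth calling out in the "Gets Code" step: models often wrap their answer in markdown backticks or add a few extra words, so it helps to clean up the response before running anything. A small illustrative helper (not necessarily exactly what the tool does) could look like this:

def clean_command(raw_response):
    # Strip markdown code fences and surrounding whitespace from the model's reply
    text = raw_response.strip()
    if text.startswith("```"):
        # Drop the opening fence (and an optional language tag) and the closing fence
        text = text.split("```")[1]
        if text.startswith("python"):
            text = text[len("python"):]
    return text.strip().strip("`")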
Step 4: Code Generation Example
If you ask "What is the average sales?", the AI might give you:
df['Sales'].mean()
If you ask "Show me the top 5 highest salaries":
df.nlargest(5, 'Salary')[['Name', 'Salary']]
What You Can Do With It
I included some sample datasets to help you get started:
- Sales Data: A dataset with sales records (Date, Product, Sales, Region)
- Employee Data: HR data with employee information (Name, Department, Salary, Experience)
- Voter Information: Electoral data with demographic details
- Candidate Data: Information about political candidates by state
Some Examples of What You Can Ask
You ask: "What is the average salary by department?"
The AI generates: df.groupby('Department')['Salary'].mean()
You ask: "Show me the top 3 highest sales"
The AI generates: df.nlargest(3, 'Sales')
You ask: "How many employees are in each department?"
The AI generates: df['Department'].value_counts()
Some Technical Stuff I Learned
Security Considerations
I tried to make the tool safer by controlling what the generated code can see when it runs:
local_vars = {'df': df, 'pd': pd}
result = eval(command, local_vars)
This restricts the generated code to the DataFrame and pandas. To be honest, though, eval is never completely safe, so it's best to only analyze files you trust and to glance at the command the AI suggests before relying on the result.
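If I wanted to harden this further, one idea (not something the tool does today) is to hide Python's builtins from eval and reject commands that contain obviously suspicious tokens. Even this isn't a real sandbox, but it raises the bar:

import pandas as pd

# Tokens that shouldn't appear in a simple pandas one-liner
BLOCKED_TOKENS = ("__", "import", "open(", "exec", "eval")

def run_command(command, df):
    if any(token in command for token in BLOCKED_TOKENS):
        raise ValueError(f"Refusing to run suspicious command: {command!r}")
    # Hide the builtins; only the DataFrame and pandas are visible to the command
    return eval(command, {"__builtins__": {}, "df": df, "pd": pd})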
Error Handling
I tried to handle common problems that might come up:
- If Ollama isn't running
- If your CSV file is corrupted
- If the AI gives a weird response
- If the generated code has an error
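As a sketch of how those cases could be handled around the two risky calls (the Ollama request and the eval), something like the function below works. The messages and structure are my own wording, not the tool's exact code, and query_ollama is the helper from ollama_client.py:

import pandas as pd
import requests

def analyze_data_safely(df, query, selected_model):
    prompt = (
        f"System: You are a data analysis assistant. "
        f"The dataset has columns: {list(df.columns)}. "
        f'The user asked: "{query}" '
        f"Suggest a pandas operation to answer the query. Return only the pandas command."
    )
    try:
        # query_ollama comes from ollama_client.py (see the sketch earlier)
        command = query_ollama(prompt, selected_model).strip()
    except requests.exceptions.ConnectionError:
        return "Could not reach Ollama. Is the Ollama server running?"
    if not command:
        return "The model returned an empty response. Try rephrasing the question."
    try:
        return eval(command, {"df": df, "pd": pd})
    except Exception as exc:
        return f"The generated command failed: {command!r} ({exc})"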
Performance Optimizations
I made a few small improvements:
- The tool remembers which AI models you have so it doesn't have to ask every time
- It loads data efficiently using pandas
- It doesn't use much memory
Things to Keep in Mind
1. AI Model Quality
- Different AI models give different results
- Some models are better at generating code than others
- You might need to try different models to see which works best for you
2. Computer Requirements
- AI models need quite a bit of RAM to run
- Having a good graphics card helps with larger models
- Downloading models can take a while the first time
3. Question Complexity
- Simple questions work best
- Very complex analysis might confuse the AI
- You might need to break down complicated questions into simpler ones
Want to Try It Yourself?
The project is open-source and available on GitHub.
To get started:
- Install Ollama and download at least one AI model
- Put your CSV files in the data/ folder
- Start with simple questions and work your way up to more complex ones
I'm still learning and improving this tool, so feel free to try it out and let me know what you think!
This project shows how local AI models can make data analysis more accessible to everyone. By keeping everything on your own computer, you get privacy and control while still getting the benefits of AI assistance.