vaasav kumar

πŸš€ From Agents to Data Intelligence: Load Files, Scrape Web & Analyze with LangChain

In my previous blog, I covered:
πŸ‘‰ From LLMs to Agents: Build Smart AI Systems with Tools in LangChain

We learned how to:

  • build custom tools
  • create AI agents
  • fetch real-world data

πŸ”₯ What’s Next?

Now let’s take it further.

πŸ‘‰ Instead of just querying tools, we will make AI work with real data sources:

In this blog, we will learn:

  1. πŸ“„ Load and analyze text files
  2. πŸ“Š Process CSV data
  3. 🌐 Fetch and analyze web URLs (web scraping)
  4. ⚑ Optimize using semantic search (vector DB)

πŸ“„ 1. Load Text File Using TextLoader

We can directly load a *.txt file into LangChain:

from langchain_community.document_loaders import TextLoader

loader = TextLoader("tata_motors.txt")
docs = loader.load()
docs

Output

πŸ‘‰ This converts your text file into structured documents.
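For intuition, each loaded item is a Document carrying the file's text in page_content plus metadata such as the source path. A plain-dict stand-in to show the shape (the real class is langchain_core.documents.Document, and the text here is made up):

```python
# Plain-dict stand-in for what TextLoader.load() returns: a list of
# Document-like records holding the file's text and its source path.
docs = [
    {
        "page_content": "Tata Motors reported strong quarterly results...",
        "metadata": {"source": "tata_motors.txt"},
    }
]

print(len(docs))                      # one document for the whole file
print(docs[0]["metadata"]["source"])  # the source path travels with the text
```

Because the source path travels in metadata, you can later tell which file (or URL) an answer came from.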

Add a query to fetch results from the .txt file

from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini")

loader = TextLoader("tata_motors.txt", encoding="utf-8")
docs = loader.load()

# Combine all texts into one single string
context = "\n\n".join(doc.page_content for doc in docs)

# Ask Questions
query = """
How much has Tata Motors provided on behalf of its Singapore holding company?
"""

prompt = ChatPromptTemplate.from_template("""
You are a stock research assistant.

Use only the context below.
Do not invent missing values.

User query:
{query}

Context:
{context}
""")

chain = prompt | llm
response = chain.invoke({
    "query": query,
    "context": context
})

print(response.content)

Output

πŸ“Š 2. Load CSV Data Using CSVLoader

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("cars.csv")
data = loader.load()
data

Output

πŸ‘‰ cars.csv file contents
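Unlike TextLoader, CSVLoader emits one Document per row, with the row rendered as `column: value` lines. A rough pure-Python sketch of that conversion (the column names and values here are made up, not the real cars.csv):

```python
import csv
import io

def rows_to_page_content(csv_text: str) -> list[str]:
    """Roughly mimic how CSVLoader turns each CSV row into a
    'column: value' text block -- one block (Document) per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

sample = "model,price_lakh\nTata Tiago,6.3\nTata Nexon,8.1\n"
for block in rows_to_page_content(sample):
    print(block)
    print("---")
# First block prints:
# model: Tata Tiago
# price_lakh: 6.3
```

This row-per-Document layout is why CSVLoader works well with retrieval: each row can be matched and returned independently.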

You can also use Pandas for better control:

pip install -U pandas

πŸ‘‰ This allows LLMs to behave like a data analyst on your CSV.

import pandas as pd
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

df = pd.read_csv("cars.csv")
question = "List the cars within a 10 lakh budget."

csv_text = df.to_string(index=False)

prompt = f"""
You are answering questions from this CSV data.

CSV data:
{csv_text}

Question:
{question}

Answer clearly using only the CSV data.
"""

response = llm.invoke(prompt)

print(response.content)

Output

🌐 3. Load URLs & Perform Web Scraping

Now comes the powerful part.

pip install -U unstructured

πŸ‘‰ LLM will read web content and generate structured analysis.

from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

urls = [
    "https://www.tickertape.in/stocks/tata-motors-TMC",
    "https://groww.in/stocks/tata-motors-ltd",
]

loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

# Combine the scraped pages into one context string for the prompt
context = "\n\n".join(doc.page_content for doc in documents)

query = """
Analyze valuation, profitability, entry point, red flags,
and overall whether Tata Motors stock looks attractive.
"""

prompt = f"""
You are a stock research assistant.

Use only the context below. Do not invent missing values.

Context:
{context}

User query:
{query}

Return the answer in this exact format:

# Tata Motors Stock Analysis

## 1. Quick View
- Overall view:
- Reason:

## 2. Key Metrics Found
| Metric | Value | Interpretation |
|---|---:|---|
| Market Cap | | |
| PE Ratio | | |
| PB Ratio | | |
| Dividend Yield | | |
| Risk / Volatility | | |
| Red Flags | | |

## 3. Valuation

## 4. Profitability / Quality

## 5. Entry Point

## 6. Red Flags / Risks

## 7. Final Tentative View
"""

response = llm.invoke(prompt)

print(response.content)

Output

⚠️ Problem: Slow Performance

If you load many URLs:

  • ⏳ Processing becomes slow
  • πŸ“‰ Context becomes too large
  • πŸ’Έ Cost increases

⚑ Solution: Semantic Search (Vector DB)

Instead of passing all data, we:

  • Split content into chunks
  • Convert into embeddings
  • Store in vector DB
  • Retrieve only relevant data
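At its core, the retrieval step is nearest-neighbour search over embedding vectors. A toy sketch with hand-made 3-dimensional vectors (real embeddings come from a model like text-embedding-3-small and have ~1536 dimensions; the chunk texts and numbers are invented):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three chunks (hand-made; real ones come from an embedding model)
chunk_vectors = {
    "PE ratio is 9.8":           [0.9, 0.1, 0.0],
    "Factory tour schedule":     [0.0, 0.2, 0.9],
    "Market cap is 3.1 lakh Cr": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "valuation metrics"

# Keep only the k most similar chunks -- what as_retriever(search_kwargs={"k": ...}) does
k = 2
top = sorted(chunk_vectors,
             key=lambda c: cosine(chunk_vectors[c], query_vec),
             reverse=True)[:k]
print(top)
```

Only the top-k chunks go into the prompt, which is why the vector DB approach keeps context small and cost low no matter how many URLs you load.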

⚑ 4. Optimize using semantic search (vector DB)

from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_chroma import Chroma

llm = ChatOpenAI(model="gpt-4o-mini")

urls = [
    "https://www.tickertape.in/stocks/tata-motors-TMC",
    "https://groww.in/stocks/tata-motors-ltd",
]

loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

# Step 1: Split into Chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

# Step 2: Create Embeddings + Store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_url_db"
)

# Step 3: Retrieve Relevant Data
query = """
Analyze valuation, profitability, entry point, red flags,
and overall whether Tata Motors stock looks attractive.
"""

retriever = vector_db.as_retriever(search_kwargs={"k": 4})
retrieved_docs = retriever.invoke(query)

context = "\n\n".join(
    doc.page_content for doc in retrieved_docs
)

prompt = ChatPromptTemplate.from_template("""
You are a stock research assistant.

Use only the context below. Do not invent missing values.

User query:
{query}

Context:
{context}

Return the answer in this exact format:

# Tata Motors Stock Analysis

## 1. Quick View
- Overall view:
- Reason:

## 2. Key Metrics Found
| Metric | Value | Interpretation |
|---|---:|---|
| Market Cap | | |
| PE Ratio | | |
| PB Ratio | | |
| Dividend Yield | | |
| Risk / Volatility | | |
| Red Flags | | |

## 3. Valuation

## 4. Profitability / Quality

## 5. Entry Point

## 6. Red Flags / Risks

## 7. Final Tentative View
""")

# Step 4: Final Analysis
chain = prompt | llm
response = chain.invoke({
    "query": query,
    "context": context,
})

print(response.content)

Output

πŸš€ What You Learned

In this blog, we moved from:
πŸ‘‰ AI Agents β†’ AI + Data Intelligence

You learned how to:

  • Load text and CSV data
  • Scrape and analyze web content
  • Handle large data efficiently
  • Use vector databases for semantic search
