Running Ollama Locally for Enterprise PDF Classification from SharePoint
Overview
In this comprehensive guide, we'll explore how to leverage Ollama running locally to classify enterprise PDFs stored in SharePoint. This solution provides a cost-effective, privacy-focused approach to document intelligence without relying on cloud-based AI services.
Why This Matters
Key Benefits:
- Privacy First: Keep sensitive enterprise data on-premises
- Cost-Effective: No per-API-call charges
- Performance: Low-latency processing for local documents
- Control: Full control over model selection and fine-tuning
Prerequisites
Before we begin, ensure you have:
- Ollama installed locally
- Python 3.8+ with pip
- SharePoint access with appropriate permissions
- Microsoft 365 credentials
Required Libraries
Install the necessary Python packages:
pip install ollama
pip install office365-rest-python-client
pip install PyPDF2
pip install python-dotenv
Architecture Overview
Our solution follows this workflow:
1. Connect to SharePoint with app-only Azure AD credentials (via the Office365 REST Python client)
2. Download PDFs from the specified document library
3. Extract text content from the PDFs
4. Classify each document with a local Ollama LLM
5. Write the classification back to SharePoint metadata/tags
Implementation
Step 1: Setting Up Ollama
First, pull a suitable model for classification:
# Pull a lightweight model
ollama pull llama2:7b
# Or for better accuracy
ollama pull llama2:13b
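Before wiring up SharePoint, it is worth a quick sanity check that the local Ollama server answers from Python. A minimal sketch, assuming llama2:7b has been pulled and the Ollama service is running:

    import ollama

    # Ask the local model for a trivial reply to confirm the server is reachable
    response = ollama.chat(
        model='llama2:7b',
        messages=[{'role': 'user', 'content': 'Reply with the single word: ready'}]
    )
    print(response['message']['content'])  # should print something close to "ready"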
Step 2: SharePoint Connection
Create a config.py file for credentials:
import os
from dotenv import load_dotenv
load_dotenv()
SHAREPOINT_SITE = os.getenv('SHAREPOINT_SITE')
CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
TENANT_ID = os.getenv('TENANT_ID')
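For reference, the matching .env file might look like this. The values are placeholders only; the variable names simply mirror the ones read in config.py:

    SHAREPOINT_SITE=https://yourtenant.sharepoint.com/sites/yoursite
    CLIENT_ID=your-azure-ad-app-client-id
    CLIENT_SECRET=your-azure-ad-app-client-secret
    TENANT_ID=your-azure-ad-tenant-id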
Step 3: PDF Classification Script
Here's the main classification implementation:
import ollama
from office365.sharepoint.client_context import ClientContext
from office365.runtime.auth.client_credential import ClientCredential
import PyPDF2
import io

class PDFClassifier:
    def __init__(self, sharepoint_site, client_id, client_secret):
        self.site = sharepoint_site
        credentials = ClientCredential(client_id, client_secret)
        self.ctx = ClientContext(sharepoint_site).with_credentials(credentials)

    def extract_pdf_text(self, pdf_bytes):
        """Extract text from PDF bytes"""
        pdf_file = io.BytesIO(pdf_bytes)
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            # extract_text() may return None for image-only pages
            text += page.extract_text() or ""
        return text[:2000]  # Limit excerpt length for the prompt

    def classify_document(self, text_content):
        """Classify document using Ollama"""
        prompt = f"""
Analyze the following document excerpt and classify it into ONE of these categories:
- Financial Report
- Legal Contract
- Technical Documentation
- Marketing Material
- HR Document
- General Correspondence

Document:
{text_content}

Respond with ONLY the category name, nothing else.
"""
        response = ollama.chat(
            model='llama2:7b',
            messages=[{'role': 'user', 'content': prompt}]
        )
        return response['message']['content'].strip()

    def process_library(self, library_name):
        """Process all PDFs in a SharePoint library"""
        library = self.ctx.web.lists.get_by_title(library_name)
        items = library.items.get().execute_query()
        results = []
        for item in items:
            file = item.file.get().execute_query()
            if file.name.endswith('.pdf'):
                # Download PDF
                pdf_content = file.read()
                # Extract text
                text = self.extract_pdf_text(pdf_content)
                # Classify
                category = self.classify_document(text)
                results.append({
                    'file': file.name,
                    'category': category
                })
                print(f"Classified: {file.name} -> {category}")
        return results
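The architecture calls for writing the result back to SharePoint (step 5), which the class above does not yet do. A minimal sketch of an extra method, assuming the library has a single-line-of-text column named Category (the column name and update pattern are assumptions to adapt to your site):

    def tag_document(self, item, category):
        """Write the predicted category back to the list item (assumed 'Category' column)."""
        item.set_property('Category', category)
        item.update().execute_query()

Calling self.tag_document(item, category) right after classify_document inside process_library would complete the end-to-end workflow.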
Step 4: Running the Classifier
from config import *

# Initialize classifier
classifier = PDFClassifier(
    sharepoint_site=SHAREPOINT_SITE,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET
)

# Process documents
results = classifier.process_library('Documents')

# Display results
for result in results:
    print(f"{result['file']}: {result['category']}")
Advanced Features
Custom Classification Categories
You can easily adapt the categories to your needs:
CATEGORIES = [
"Invoice",
"Purchase Order",
"Compliance Report",
"Employee Handbook",
"Project Proposal"
]
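To actually use such a list, the prompt can be built from it instead of hard-coding the category names. A minimal standalone sketch under that assumption (classify_with_categories is a hypothetical helper, not part of the class above):

    import ollama

    def classify_with_categories(text_content, categories, model='llama2:7b'):
        # Build the category options into the prompt dynamically
        category_lines = "\n".join(f"- {c}" for c in categories)
        prompt = (
            "Classify the following document excerpt into ONE of these categories:\n"
            f"{category_lines}\n\nDocument:\n{text_content}\n\n"
            "Respond with ONLY the category name, nothing else."
        )
        response = ollama.chat(model=model, messages=[{'role': 'user', 'content': prompt}])
        return response['message']['content'].strip()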
Batch Processing with Progress Bar
from tqdm import tqdm
def process_library_with_progress(self, library_name):
    """Same logic as process_library, but with a tqdm progress bar."""
    library = self.ctx.web.lists.get_by_title(library_name)
    items = library.items.get().execute_query()
    for item in tqdm(items, desc="Processing PDFs"):
        file = item.file.get().execute_query()
        if file.name.endswith('.pdf'):
            text = self.extract_pdf_text(file.read())
            yield {'file': file.name, 'category': self.classify_document(text)}
Best Practices
Performance Optimization
- Batch Processing: Process documents in batches during off-hours
- Caching: Cache classifications to avoid re-processing unchanged files (see the sketch after this list)
- Model Selection: Balance between accuracy and speed
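A minimal sketch of such a cache, using a local JSON file keyed by file name. The cache path and keying scheme are assumptions; keying by content hash or last-modified timestamp would be more robust:

    import json, os

    CACHE_PATH = 'classification_cache.json'  # assumed local cache file

    def load_cache():
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH) as f:
                return json.load(f)
        return {}

    def save_cache(cache):
        with open(CACHE_PATH, 'w') as f:
            json.dump(cache, f, indent=2)

    # Inside the processing loop: only classify files not seen before
    # cache = load_cache()
    # if file.name not in cache:
    #     cache[file.name] = classifier.classify_document(text)
    # save_cache(cache)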
Security Considerations
- Use environment variables for credentials
- Implement role-based access control
- Enable audit logging for classification activities
- Consider encryption for sensitive document content
Monitoring & Logging
Implement comprehensive logging:
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('classification.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)
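With the logger configured, the run from Step 4 can emit structured log records instead of plain prints. A small usage sketch, reusing the classifier from Step 4:

    try:
        results = classifier.process_library('Documents')
        for result in results:
            logger.info("Classified %s as %s", result['file'], result['category'])
    except Exception:
        logger.exception("Classification run failed")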
Common Issues & Solutions
| Issue | Solution |
| --- | --- |
| Ollama not responding | Check that the service is running: ollama serve |
| SharePoint auth fails | Verify app permissions in Azure AD |
| PDF extraction errors | Use pdfplumber as an alternative to PyPDF2 |
| Slow classification | Use a smaller model or implement caching |
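If PyPDF2 struggles with a document, a replacement for extract_pdf_text based on pdfplumber might look like the sketch below (requires pip install pdfplumber; same contract as the original method, but offered as an untested alternative):

    import io
    import pdfplumber

    def extract_pdf_text_plumber(pdf_bytes, limit=2000):
        # Same contract as extract_pdf_text: PDF bytes in, truncated text out
        text = ""
        with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
            for page in pdf.pages:
                text += page.extract_text() or ""
        return text[:limit]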
Conclusion
This solution demonstrates how to:
- Leverage local LLMs for enterprise document classification
- Integrate with SharePoint seamlessly
- Maintain data privacy and control
- Build cost-effective AI solutions
Discussion
Have you implemented similar solutions? What challenges did you face? Share your experiences in the comments below!
Tags: #AI #MachineLearning #SharePoint #Python #Ollama #EnterpriseAI #DocumentClassification