Running Ollama Locally for Enterprise PDF Classification from SharePoint
Overview
In this comprehensive guide, we'll explore how to leverage Ollama running locally to classify enterprise PDFs stored in SharePoint. This solution provides a cost-effective, privacy-focused approach to document intelligence without relying on cloud-based AI services.
Why This Matters
Key Benefits:
- Privacy First: Keep sensitive enterprise data on-premises
- Cost-Effective: No per-API-call charges
- Performance: Low-latency processing for local documents
- Control: Full control over model selection and fine-tuning
Prerequisites
Before we begin, ensure you have:
- Ollama installed locally
- Python 3.8+ with pip
- SharePoint access with appropriate permissions
- Microsoft 365 credentials
Required Libraries
Install the necessary Python packages:
pip install ollama
pip install office365-rest-python-client
pip install PyPDF2
pip install python-dotenv
Architecture Overview
Our solution follows this workflow:
1. Connect to SharePoint with app-only Azure AD credentials (via the Office365 REST Python client)
2. Download PDFs from the specified document library
3. Extract text content from the PDFs
4. Classify each document with a local Ollama LLM
5. Write the classification back to SharePoint metadata/tags
Implementation
Step 1: Setting Up Ollama
First, pull a suitable model for classification:
# Pull a lightweight model
ollama pull llama2:7b
# Or for better accuracy
ollama pull llama2:13b
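Before wiring up SharePoint, it is worth a quick sanity check that the local Ollama server answers from Python. A minimal sketch, assuming llama2:7b has been pulled and the Ollama service is running:

    import ollama

    # Ask the local model for a trivial reply to confirm the server is reachable
    response = ollama.chat(
        model='llama2:7b',
        messages=[{'role': 'user', 'content': 'Reply with the single word: ready'}]
    )
    print(response['message']['content'])  # should print something close to "ready"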
Step 2: SharePoint Connection
Create a config.py file for credentials:
import os
from dotenv import load_dotenv
load_dotenv()
SHAREPOINT_SITE = os.getenv('SHAREPOINT_SITE')
CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
TENANT_ID = os.getenv('TENANT_ID')
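For reference, the matching .env file might look like this. The values are placeholders only; the variable names simply mirror the ones read in config.py:

    SHAREPOINT_SITE=https://yourtenant.sharepoint.com/sites/yoursite
    CLIENT_ID=your-azure-ad-app-client-id
    CLIENT_SECRET=your-azure-ad-app-client-secret
    TENANT_ID=your-azure-ad-tenant-id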
Step 3: PDF Classification Script
Here's the main classification implementation:
import ollama
from office365.sharepoint.client_context import ClientContext
from office365.runtime.auth.client_credential import ClientCredential
import PyPDF2
import io

class PDFClassifier:
    def __init__(self, sharepoint_site, client_id, client_secret):
        self.site = sharepoint_site
        credentials = ClientCredential(client_id, client_secret)
        self.ctx = ClientContext(sharepoint_site).with_credentials(credentials)

    def extract_pdf_text(self, pdf_bytes):
        """Extract text from PDF bytes"""
        pdf_file = io.BytesIO(pdf_bytes)
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            # extract_text() may return None for image-only pages
            text += page.extract_text() or ""
        return text[:2000]  # Limit excerpt length for the prompt

    def classify_document(self, text_content):
        """Classify document using Ollama"""
        prompt = f"""
Analyze the following document excerpt and classify it into ONE of these categories:
- Financial Report
- Legal Contract
- Technical Documentation
- Marketing Material
- HR Document
- General Correspondence

Document:
{text_content}

Respond with ONLY the category name, nothing else.
"""
        response = ollama.chat(
            model='llama2:7b',
            messages=[{'role': 'user', 'content': prompt}]
        )
        return response['message']['content'].strip()

    def process_library(self, library_name):
        """Process all PDFs in a SharePoint library"""
        library = self.ctx.web.lists.get_by_title(library_name)
        items = library.items.get().execute_query()
        results = []
        for item in items:
            file = item.file.get().execute_query()
            if file.name.endswith('.pdf'):
                # Download PDF
                pdf_content = file.read()
                # Extract text
                text = self.extract_pdf_text(pdf_content)
                # Classify
                category = self.classify_document(text)
                results.append({
                    'file': file.name,
                    'category': category
                })
                print(f"Classified: {file.name} -> {category}")
        return results
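The architecture calls for writing the result back to SharePoint (step 5), which the class above does not yet do. A minimal sketch of an extra method, assuming the library has a single-line-of-text column named Category (the column name and update pattern are assumptions to adapt to your site):

    def tag_document(self, item, category):
        """Write the predicted category back to the list item (assumed 'Category' column)."""
        item.set_property('Category', category)
        item.update().execute_query()

Calling self.tag_document(item, category) right after classify_document inside process_library would complete the end-to-end workflow.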
Step 4: Running the Classifier
from config import *

# Initialize classifier
classifier = PDFClassifier(
    sharepoint_site=SHAREPOINT_SITE,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET
)

# Process documents
results = classifier.process_library('Documents')

# Display results
for result in results:
    print(f"{result['file']}: {result['category']}")
Advanced Features
Custom Classification Categories
You can easily adapt the categories to your needs:
CATEGORIES = [
"Invoice",
"Purchase Order",
"Compliance Report",
"Employee Handbook",
"Project Proposal"
]
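To actually use such a list, the prompt can be built from it instead of hard-coding the category names. A minimal standalone sketch under that assumption (classify_with_categories is a hypothetical helper, not part of the class above):

    import ollama

    def classify_with_categories(text_content, categories, model='llama2:7b'):
        # Build the category options into the prompt dynamically
        category_lines = "\n".join(f"- {c}" for c in categories)
        prompt = (
            "Classify the following document excerpt into ONE of these categories:\n"
            f"{category_lines}\n\nDocument:\n{text_content}\n\n"
            "Respond with ONLY the category name, nothing else."
        )
        response = ollama.chat(model=model, messages=[{'role': 'user', 'content': prompt}])
        return response['message']['content'].strip()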
Batch Processing with Progress Bar
from tqdm import tqdm
def process_library_with_progress(self, library_name):
    """Same logic as process_library, but with a tqdm progress bar."""
    library = self.ctx.web.lists.get_by_title(library_name)
    items = library.items.get().execute_query()
    for item in tqdm(items, desc="Processing PDFs"):
        file = item.file.get().execute_query()
        if file.name.endswith('.pdf'):
            text = self.extract_pdf_text(file.read())
            yield {'file': file.name, 'category': self.classify_document(text)}
Best Practices
Performance Optimization
- Batch Processing: Process documents in batches during off-hours
- Caching: Cache classifications to avoid re-processing unchanged files (see the sketch after this list)
- Model Selection: Balance between accuracy and speed
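A minimal sketch of such a cache, using a local JSON file keyed by file name. The cache path and keying scheme are assumptions; keying by content hash or last-modified timestamp would be more robust:

    import json, os

    CACHE_PATH = 'classification_cache.json'  # assumed local cache file

    def load_cache():
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH) as f:
                return json.load(f)
        return {}

    def save_cache(cache):
        with open(CACHE_PATH, 'w') as f:
            json.dump(cache, f, indent=2)

    # Inside the processing loop: only classify files not seen before
    # cache = load_cache()
    # if file.name not in cache:
    #     cache[file.name] = classifier.classify_document(text)
    # save_cache(cache)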
Security Considerations
- Use environment variables for credentials
- Implement role-based access control
- Enable audit logging for classification activities
- Consider encryption for sensitive document content
Monitoring & Logging
Implement comprehensive logging:
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('classification.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)
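With the logger configured, the run from Step 4 can emit structured log records instead of plain prints. A small usage sketch, reusing the classifier from Step 4:

    try:
        results = classifier.process_library('Documents')
        for result in results:
            logger.info("Classified %s as %s", result['file'], result['category'])
    except Exception:
        logger.exception("Classification run failed")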
Common Issues & Solutions
| Issue | Solution |
| --- | --- |
| Ollama not responding | Check that the service is running: ollama serve |
| SharePoint auth fails | Verify app permissions in Azure AD |
| PDF extraction errors | Use pdfplumber as an alternative to PyPDF2 |
| Slow classification | Use a smaller model or implement caching |
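If PyPDF2 struggles with a document, a replacement for extract_pdf_text based on pdfplumber might look like the sketch below (requires pip install pdfplumber; same contract as the original method, but offered as an untested alternative):

    import io
    import pdfplumber

    def extract_pdf_text_plumber(pdf_bytes, limit=2000):
        # Same contract as extract_pdf_text: PDF bytes in, truncated text out
        text = ""
        with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
            for page in pdf.pages:
                text += page.extract_text() or ""
        return text[:limit]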
Conclusion
This solution demonstrates how to:
- Leverage local LLMs for enterprise document classification
- Integrate with SharePoint seamlessly
- Maintain data privacy and control
- Build cost-effective AI solutions
Discussion
Have you implemented similar solutions? What challenges did you face? Share your experiences in the comments below!
Tags: #AI #MachineLearning #SharePoint #Python #Ollama #EnterpriseAI #DocumentClassification