DEV Community

Nanaji G

Running Ollama Locally for Enterprise PDF Classification from SharePoint

Overview

This guide shows how to use a locally running Ollama instance to classify enterprise PDFs stored in SharePoint. Because inference runs on your own hardware, document text is not sent to a cloud-based AI service and there are no per-call API fees.


Why This Matters

Key Benefits:

  • 🔒 Privacy First: Keep sensitive enterprise data on-premises
  • 💰 Cost-Effective: No per-API-call charges
  • ⚡ Performance: Low-latency processing for local documents
  • 🎛️ Control: Full control over model selection and fine-tuning

🛠️ Prerequisites

Before we begin, ensure you have:

  • ✅ Ollama installed locally (download from ollama.com)
  • ✅ Python 3.8+ with pip
  • ✅ SharePoint access with appropriate permissions
  • ✅ Microsoft 365 credentials

📦 Required Libraries

Install the necessary Python packages:

pip install ollama
pip install office365-rest-python-client
pip install PyPDF2
pip install python-dotenv
pip install tqdm  # only needed for the optional progress bar shown later

Architecture Overview

Our solution follows this workflow:

  1. Connect to SharePoint through its REST API (via the Office365-REST-Python-Client library)
  2. Download PDFs from specified document library
  3. Extract text content from PDFs
  4. Classify documents using Ollama's LLM
  5. Update metadata/tags in SharePoint

💻 Implementation

Step 1: Setting Up Ollama

First, pull a suitable model for classification:

# Pull a lightweight model
ollama pull llama2:7b

# Or for better accuracy
ollama pull llama2:13b

Step 2: SharePoint Connection

Create a config.py file that loads credentials from a .env file (keep .env out of version control):

import os
from dotenv import load_dotenv

load_dotenv()

SHAREPOINT_SITE = os.getenv('SHAREPOINT_SITE')
CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
TENANT_ID = os.getenv('TENANT_ID')
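
A variable missing from .env would otherwise only show up later as an opaque authentication failure. One way to fail fast at startup is sketched below. The helper name `require_settings` is hypothetical and not part of the config.py above; it accepts any mapping so it can be exercised without touching the real environment.

```python
import os

REQUIRED_KEYS = ("SHAREPOINT_SITE", "CLIENT_ID", "CLIENT_SECRET", "TENANT_ID")

def require_settings(env=os.environ, keys=REQUIRED_KEYS):
    """Return the required settings as a dict, or raise if any are missing or empty.

    Hypothetical helper, not part of the original config.py.
    """
    missing = [key for key in keys if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in keys}
```

Calling `require_settings()` right after `load_dotenv()` would stop the script before any SharePoint request is attempted.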

Step 3: PDF Classification Script

Here's the main classification implementation:

import ollama
from office365.sharepoint.client_context import ClientContext
from office365.runtime.auth.client_credential import ClientCredential
import PyPDF2
import io

class PDFClassifier:
    def __init__(self, sharepoint_site, client_id, client_secret):
        self.site = sharepoint_site
        credentials = ClientCredential(client_id, client_secret)
        self.ctx = ClientContext(sharepoint_site).with_credentials(credentials)

    def extract_pdf_text(self, pdf_bytes):
        """Extract text from PDF bytes"""
        pdf_file = io.BytesIO(pdf_bytes)
        pdf_reader = PyPDF2.PdfReader(pdf_file)

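        text = ""
        for page in pdf_reader.pages:
            # Image-only pages may yield no text, so guard against None
            text += page.extract_text() or ""

        return text[:2000]  # Keep only the first 2,000 characters for the prompt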

    def classify_document(self, text_content):
        """Classify document using Ollama"""
        prompt = f"""
        Analyze the following document excerpt and classify it into ONE of these categories:
        - Financial Report
        - Legal Contract
        - Technical Documentation
        - Marketing Material
        - HR Document
        - General Correspondence

        Document:
        {text_content}

        Respond with ONLY the category name, nothing else.
        """

        response = ollama.chat(
            model='llama2:7b',
            messages=[{'role': 'user', 'content': prompt}]
        )

        return response['message']['content'].strip()

    def process_library(self, library_name):
        """Process all PDFs in a SharePoint library"""
        library = self.ctx.web.lists.get_by_title(library_name)
        items = library.items.get().execute_query()

        results = []

        for item in items:
            file = item.file.get().execute_query()

            if file.name.lower().endswith('.pdf'):  # also matches .PDF
                # Download PDF
                pdf_content = file.read()

                # Extract text
                text = self.extract_pdf_text(pdf_content)

                # Classify
                category = self.classify_document(text)

                results.append({
                    'file': file.name,
                    'category': category
                })

                print(f"✅ Classified: {file.name} → {category}")

        return results
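
Even when asked to respond with only the category name, a model may add punctuation or a leading sentence. A minimal sketch of mapping the raw reply back onto the known categories follows. `normalize_category` and the "Unclassified" fallback label are assumptions for illustration, not part of the class above.

```python
CATEGORIES = [
    "Financial Report",
    "Legal Contract",
    "Technical Documentation",
    "Marketing Material",
    "HR Document",
    "General Correspondence",
]

def normalize_category(raw_response, categories=CATEGORIES, fallback="Unclassified"):
    """Map free-form model output onto exactly one known category.

    Hypothetical helper; "Unclassified" is an assumed fallback label.
    """
    cleaned = raw_response.strip().strip('."\'').lower()
    # Prefer an exact, case-insensitive match
    for category in categories:
        if cleaned == category.lower():
            return category
    # Otherwise accept the reply only if exactly one category is mentioned
    mentioned = [c for c in categories if c.lower() in cleaned]
    return mentioned[0] if len(mentioned) == 1 else fallback
```

In `classify_document`, the return value could be passed through this function so that downstream code only ever sees a known label or the fallback.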

Step 4: Running the Classifier

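# Assumes the PDFClassifier class from Step 3 is defined in, or imported into, this script
from config import SHAREPOINT_SITE, CLIENT_ID, CLIENT_SECRET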

# Initialize classifier
classifier = PDFClassifier(
    sharepoint_site=SHAREPOINT_SITE,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET
)

# Process documents
results = classifier.process_library('Documents')

# Display results
for result in results:
    print(f"📄 {result['file']}: {result['category']}")
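
The workflow lists updating metadata in SharePoint as its last step, while the script above only prints the results. A sketch of the write-back is shown below under two assumptions: the library has a text column whose internal name is `DocumentCategory` (a made-up name; create the column first), and the list item supports the chained `set_property(...).update().execute_query()` calls that Office365-REST-Python-Client list items expose. Check both against your tenant and installed library version.

```python
def write_category(item, category, column="DocumentCategory"):
    """Store the predicted category on a SharePoint list item.

    Assumptions: `column` is the internal name of an existing text column
    ("DocumentCategory" is a made-up example), and `item` supports the
    set_property / update / execute_query call chain.
    """
    item.set_property(column, category).update().execute_query()
    return item
```

Inside `process_library`, this could be called as `write_category(item, category)` right after classification. The app registration would then need write permission on the library, not just read.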

Advanced Features

Custom Classification Categories

To adapt the categories to your needs, define them in one list and generate the prompt in classify_document from it instead of hard-coding them:

CATEGORIES = [
    "Invoice",
    "Purchase Order",
    "Compliance Report",
    "Employee Handbook",
    "Project Proposal"
]
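
Because `classify_document` embeds its categories directly in the prompt string, a list like this only takes effect once the prompt is built from it. A minimal sketch follows; `build_prompt` is a hypothetical helper that reproduces the wording of the original prompt.

```python
CATEGORIES = [
    "Invoice",
    "Purchase Order",
    "Compliance Report",
    "Employee Handbook",
    "Project Proposal",
]

def build_prompt(text_content, categories=CATEGORIES):
    """Build the classification prompt from a list of category names."""
    bullet_list = "\n".join(f"- {name}" for name in categories)
    return (
        "Analyze the following document excerpt and classify it into ONE of these categories:\n"
        f"{bullet_list}\n\n"
        "Document:\n"
        f"{text_content}\n\n"
        "Respond with ONLY the category name, nothing else."
    )
```

The result can be passed as the `content` of the user message in the `ollama.chat` call, replacing the inline f-string.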

Batch Processing with Progress Bar

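from tqdm import tqdm

# Intended as an additional method on PDFClassifier (note the self parameter)
def process_library_with_progress(self, library_name):
    # Fetch the items the same way process_library does
    library = self.ctx.web.lists.get_by_title(library_name)
    items = library.items.get().execute_query()

    for item in tqdm(items, desc="Processing PDFs"):
        # Classification logic here (same steps as in process_library)
        pass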

Best Practices

Performance Optimization

  • Batch Processing: Process documents in batches during off-hours
  • Caching: Cache classifications to avoid re-processing
  • Model Selection: Balance between accuracy and speed
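
The caching point can be sketched as a small JSON file keyed by a hash of the PDF bytes, so an unchanged file is never sent to the model twice. `ClassificationCache` and the default file name are hypothetical. A production setup would likely also want to invalidate entries when the model or category list changes.

```python
import hashlib
import json
from pathlib import Path

class ClassificationCache:
    """Tiny JSON-file cache keyed by the SHA-256 of the PDF bytes (hypothetical helper)."""

    def __init__(self, path="classification_cache.json"):
        self.path = Path(path)
        # Load earlier results if the cache file already exists
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def key_for(pdf_bytes):
        return hashlib.sha256(pdf_bytes).hexdigest()

    def get_or_classify(self, pdf_bytes, classify):
        """Return the cached category, calling classify(pdf_bytes) only on a miss."""
        key = self.key_for(pdf_bytes)
        if key not in self.entries:
            self.entries[key] = classify(pdf_bytes)
            self.path.write_text(json.dumps(self.entries, indent=2))
        return self.entries[key]
```

Here `classify` is any callable that takes the PDF bytes and returns a category, for example a small function that chains `extract_pdf_text` and `classify_document`.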

Security Considerations

  • Use environment variables for credentials
  • Implement role-based access control
  • Enable audit logging for classification activities
  • Consider encryption for sensitive document content

Monitoring & Logging

Implement comprehensive logging:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('classification.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
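
With a logger configured, each classification can leave a record of its outcome and duration, which also supports the audit-logging recommendation. The sketch below is an assumption about how this might be wired in; `classify_with_logging` is a hypothetical wrapper, and `classify` stands for any callable that returns a category string.

```python
import logging
import time

logger = logging.getLogger(__name__)

def classify_with_logging(file_name, text, classify):
    """Run classify(text) and log the outcome and how long it took.

    `classify` is any callable returning a category string, for example
    the classify_document method of a PDFClassifier instance.
    """
    started = time.perf_counter()
    try:
        category = classify(text)
    except Exception:
        # logger.exception records the traceback alongside the message
        logger.exception("Classification failed for %s", file_name)
        raise
    elapsed = time.perf_counter() - started
    logger.info("Classified %s as %s in %.2fs", file_name, category, elapsed)
    return category
```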

Common Issues & Solutions

Issue                 | Solution
Ollama not responding | Check if service is running: ollama serve
SharePoint auth fails | Verify app permissions in Azure AD
PDF extraction errors | Use pdfplumber as alternative to PyPDF2
Slow classification   | Use smaller model or implement caching

Conclusion

This solution demonstrates how to:

✅ Leverage local LLMs for enterprise document classification

✅ Integrate with SharePoint seamlessly

✅ Maintain data privacy and control

✅ Build cost-effective AI solutions



Discussion

Have you implemented similar solutions? What challenges did you face? Share your experiences in the comments below!

Tags: #AI #MachineLearning #SharePoint #Python #Ollama #EnterpriseAI #DocumentClassification
