Alain Airom

“OpenRAG”: From Documents to Agentic Search in Minutes (from IBM Research, open source)

Using IBM’s open-source RAG distribution, powered by OpenSearch, Langflow, and Docling.

TL;DR: OpenRAG again… or not!
So, a while back, I mentioned putting my hands on ‘OpenRag’. Well, I’m back at it again, except this time I’ve got my hands on a different “OpenRAG” (see the case change 😉)— this one’s from IBM. Honestly, don’t even ask me why they have the exact same name; it’s like tech companies are sharing one giant bowl of alphabet soup and keep pulling out the same letters. Despite the identity crisis, this version absolutely rocks. The installation is so fast it’ll give you whiplash, and the “readiness for use” is basically at light speed. So, without further ado (and before a third OpenRAG appears out of thin air), let’s jump right into it! 🪂


The Setup: Zero to RAG in 5 Minutes

If you’re ready to get your hands dirty, head straight to openr.ag. You can either hit “Get Started” for the quick tour or dive directly into the GitHub repository.

The core features are presented really clearly on the web page:

“OpenRAG is a comprehensive Retrieval-Augmented Generation platform that enables intelligent document search and AI-powered conversations. Users can upload, process, and query documents through a chat interface backed by large language models and semantic search capabilities. The system utilizes Langflow for document ingestion, retrieval workflows, and intelligent nudges, providing a seamless RAG experience. Built with Starlette and Next.js. Powered by OpenSearch, Langflow, and Docling.”

One thing is for sure: the setup instructions are remarkably “neat.” Usually, “open source” is code for “spend three hours debugging your environment,” but here, you can actually have your system up and running in less than five minutes.

What’s Under the Hood?

The architecture is laid out with impressive clarity in the documentation. The real “chef’s kiss” feature, however, is the Docling integration for document processing.

  • Zero Configuration: It works right out of the box.
  • Hands-off Processing: You literally have nothing to do to get it ready — it just handles the heavy lifting of document ingestion for you.
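
To get a feel for what Docling does behind the scenes, here is a tiny standalone sketch of converting a document with the Docling library. It is purely illustrative of the library itself, not a copy of OpenRAG’s internal ingestion code, and the file path is a placeholder.

# Minimal Docling usage sketch (illustrative only, not OpenRAG's ingestion code)
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("sample.pdf")  # placeholder path
# The parsed document can be exported to Markdown, ready for chunking and embedding
print(result.document.export_to_markdown())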

The User Experience: From Setup to the Interface

For a tool this powerful, the UI is surprisingly intuitive. It’s clean, user-friendly, and doesn’t require a PhD in UI navigation to find what you need. Let’s go through the steps, but before that I should mention that, as usual, I use Podman/Podman Desktop locally alongside a local Ollama instance.

mkdir openrag-workspace
cd openrag-workspace
####
# I didn't use this script!
####
curl -fsSL https://docs.openr.ag/files/run_openrag_with_prereqs.sh | bash
  • The steps I took to set up OpenRAG locally: 👣 (Podman/Podman Desktop already launched)
mkdir openragAAM
cd openragAAM
####
uvx openrag
uv run openrag
####
uv run openrag
2026-01-08 13:03:01 [debug    ] Ensured directory exists: /Users/alainairom/.openrag/documents
2026-01-08 13:03:01 [debug    ] Ensured directory exists: /Users/alainairom/.openrag/flows
2026-01-08 13:03:01 [debug    ] Ensured directory exists: /Users/alainairom/.openrag/keys
2026-01-08 13:03:01 [debug    ] Ensured directory exists: /Users/alainairom/.openrag/config
2026-01-08 13:03:01 [debug    ] Ensured directory exists: /Users/alainairom/.openrag/data
2026-01-08 13:03:01 [debug    ] Ensured directory exists: /Users/alainairom/.openrag/data/opensearch-data
2026-01-08 13:03:01 [debug    ] Loaded .env file from /Users/alainairom/.openrag/tui/.env
  • Images pulled to run locally with Podman/Docker

  • The Setup/Configuration Screen

  • Basic Setup

  • Advanced Setup

  • These setup screens update the “.env” file for the user through very neat forms.
# Ingestion Configuration
# Set to true to disable Langflow ingestion and use traditional OpenRAG processor
# If unset or false, Langflow pipeline will be used (default: upload -> ingest -> delete)
DISABLE_INGEST_WITH_LANGFLOW=false

# Langflow HTTP timeout configuration (in seconds)
# For large documents (300+ pages), ingestion can take 30+ minutes
# Increase these values if you experience timeouts with very large PDFs
# Default: 2400 seconds (40 minutes) total timeout, 30 seconds connection timeout
# LANGFLOW_TIMEOUT=2400
# LANGFLOW_CONNECT_TIMEOUT=30

# make one like so https://docs.langflow.org/api-keys-and-authentication#langflow-secret-key
LANGFLOW_SECRET_KEY=

# flow ids for chat and ingestion flows
LANGFLOW_CHAT_FLOW_ID=1098eea1-6649-4e1d-aed1-b77249fb8dd0
LANGFLOW_INGEST_FLOW_ID=5488df7c-b93f-4f87-a446-b67028bc0813
LANGFLOW_URL_INGEST_FLOW_ID=72c3d17c-2dac-4a73-b48a-6518473d7830
# Ingest flow using docling
# LANGFLOW_INGEST_FLOW_ID=1402618b-e6d1-4ff2-9a11-d6ce71186915
NUDGES_FLOW_ID=ebc01d31-1976-46ce-a385-b0240327226c

# Set a strong admin password for OpenSearch; a bcrypt hash is generated at
# container startup from this value. Do not commit real secrets.
# must match the hashed password in secureconfig, must change for secure deployment!!!
# NOTE: if you set this by hand, it must be a complex password: 
# The password must contain at least 8 characters, and must contain at least one uppercase letter, one lowercase letter, one digit, and one special character.
OPENSEARCH_PASSWORD=

# Path to persist OpenSearch data (indices, documents, cluster state)
# Default: ./opensearch-data
OPENSEARCH_DATA_PATH=./opensearch-data

# make here https://console.cloud.google.com/apis/credentials
GOOGLE_OAUTH_CLIENT_ID=
GOOGLE_OAUTH_CLIENT_SECRET=

# Azure app registration credentials for SharePoint/OneDrive
MICROSOFT_GRAPH_OAUTH_CLIENT_ID=
MICROSOFT_GRAPH_OAUTH_CLIENT_SECRET=

# AWS Access Key ID and Secret Access Key with access to your S3 instance
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

# OPTIONAL: dns routable from google (etc.) to handle continous ingest (something like ngrok works). This enables continous ingestion
WEBHOOK_BASE_URL=

# Model Provider API Keys
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
OLLAMA_ENDPOINT=
WATSONX_API_KEY=
WATSONX_ENDPOINT=
WATSONX_PROJECT_ID=

# LLM Provider configuration. Providers can be "anthropic", "watsonx", "ibm" or "ollama".
LLM_PROVIDER=
LLM_MODEL=

# Embedding provider configuration. Providers can be "watsonx", "ibm" or "ollama".
EMBEDDING_PROVIDER=
EMBEDDING_MODEL=

# OPTIONAL url for openrag link to langflow in the UI
LANGFLOW_PUBLIC_URL=

# OPTIONAL: Override host for docling service (for special networking setups)
# HOST_DOCKER_INTERNAL=host.containers.internal

# Langflow auth
LANGFLOW_AUTO_LOGIN=False
LANGFLOW_SUPERUSER=
LANGFLOW_SUPERUSER_PASSWORD=
LANGFLOW_NEW_USER_IS_ACTIVE=False
LANGFLOW_ENABLE_SUPERUSER_CLI=False

# Langfuse tracing (optional)
# Get keys from https://cloud.langfuse.com or your self-hosted instance
LANGFUSE_SECRET_KEY=
LANGFUSE_PUBLIC_KEY=
# Leave empty for Langfuse Cloud, or set for self-hosted (e.g., http://localhost:3002)
LANGFUSE_HOST=
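
Since I run everything locally with Ollama, the part of the file I actually touch is the provider section. A minimal excerpt of what I mean is shown below; the model names and endpoint are illustrative values of mine, not defaults shipped by OpenRAG.

# Minimal local-Ollama oriented excerpt (model names and endpoint are illustrative)
LLM_PROVIDER=ollama
LLM_MODEL=granite3.3:8b
OLLAMA_ENDPOINT=http://host.containers.internal:11434
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text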

  • In the folder created as a workspace, a minimal Python entry point is also created; 😉
def main():
    print("Hello from openragaam!")


if __name__ == "__main__":
    main()
  • There is also the possibility to toggle between CPU and GPU (if you have one) or to troubleshoot the services!


Main Application Code Overview: Where (a Part of) the Magic is Implemented
On the GitHub repository, we can navigate to “/src/main.py”, which is where the main logic is implemented.

# Configure structured logging early
from connectors.langflow_connector_service import LangflowConnectorService
from connectors.service import ConnectorService
from services.flows_service import FlowsService
from utils.container_utils import detect_container_environment
from utils.embeddings import create_dynamic_index_body
from utils.logging_config import configure_from_env, get_logger
from utils.telemetry import TelemetryClient, Category, MessageId

configure_from_env()
logger = get_logger(__name__)

import asyncio
import atexit
import mimetypes
import multiprocessing
import os
import shutil
import subprocess
from functools import partial

from starlette.applications import Starlette
from starlette.routing import Route

# Set multiprocessing start method to 'spawn' for CUDA compatibility
multiprocessing.set_start_method("spawn", force=True)

# Create process pool FIRST, before any torch/CUDA imports
from utils.process_pool import process_pool  # isort: skip
import torch

# API endpoints
from api import (
    auth,
    chat,
    connectors,
    docling,
    documents,
    flows,
    knowledge_filter,
    langflow_files,
    models,
    nudges,
    oidc,
    provider_health,
    router,
    search,
    settings,
    tasks,
    upload,
)

# Existing services
from api.connector_router import ConnectorRouter
from auth_middleware import optional_auth, require_auth

# API Key authentication
from api_key_middleware import require_api_key
from services.api_key_service import APIKeyService
from api import keys as api_keys
from api.v1 import chat as v1_chat, search as v1_search, documents as v1_documents, settings as v1_settings, knowledge_filters as v1_knowledge_filters

# Configuration and setup
from config.settings import (
    API_KEYS_INDEX_BODY,
    API_KEYS_INDEX_NAME,
    DISABLE_INGEST_WITH_LANGFLOW,
    INDEX_BODY,
    INDEX_NAME,
    SESSION_SECRET,
    clients,
    get_embedding_model,
    is_no_auth_mode,
    get_openrag_config,
)
from services.auth_service import AuthService
from services.langflow_mcp_service import LangflowMCPService
from services.chat_service import ChatService

# Services
from services.document_service import DocumentService
from services.knowledge_filter_service import KnowledgeFilterService

# Configuration and setup
# Services
from services.langflow_file_service import LangflowFileService
from services.models_service import ModelsService
from services.monitor_service import MonitorService
from services.search_service import SearchService
from services.task_service import TaskService
from session_manager import SessionManager
...
#######
#### 1000 Lines later :D
...

if __name__ == "__main__":
    import uvicorn

    # TUI check already handled at top of file
    # Register cleanup function
    atexit.register(cleanup)

    # Create app asynchronously
    app = asyncio.run(create_app())

    # Run the server (startup tasks now handled by Starlette startup event)
    uvicorn.run(
        app,
        workers=1,
        host="0.0.0.0",
        port=8000,
        reload=False,  # Disable reload since we're running from main
    )

Here, a high-performance Retrieval-Augmented Generation (RAG) backend built with the Starlette framework is implemented. It is designed to be a “plug-and-play” system that bridges document processing (Docling), vector storage (OpenSearch), and orchestration (Langflow).

Core Components

  • Vector Engine: Uses OpenSearch for storing and searching document embeddings.
  • Orchestration: Integrates with Langflow for complex AI workflows and MCP (Model Context Protocol) for server management.
  • Document Processing: Features Docling for “out-of-the-box” document ingestion and parsing.
  • Authentication: Supports both OIDC (OpenID Connect) for users and API Key-based authentication for external integrations.

How the Logic Works

1. Initialization & Startup

When the app starts, it follows a strict sequence to ensure hardware and database readiness:

  • Hardware Check: It forces the spawn method for multiprocessing to ensure CUDA (GPU) compatibility for local embedding models.
  • Service Boot: It initializes a SessionManager, DocumentService, and a TaskService (which manages a process pool for heavy lifting).
  • OpenSearch Readiness: The app waits for OpenSearch to respond before creating dynamic indexes based on the dimensions of the specific embedding model being used (sketched right after this list).
  • Auto-Ingestion: It scans a local directory (/app/openrag-documents) and automatically ingests any default files found, making them immediately searchable.
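
To illustrate what “dynamic indexes based on the dimensions of the embedding model” can look like, here is a small sketch using the opensearch-py client. The index body shape, field names, and credentials are my own assumptions for the example, not OpenRAG’s actual create_dynamic_index_body implementation.

# Sketch: create a k-NN index sized to the embedding model's vector dimension
# (assumed shapes and names, not OpenRAG's actual implementation)
from opensearchpy import OpenSearch

def dynamic_index_body(dimension: int) -> dict:
    # knn_vector fields must declare the dimension produced by the embedding model
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "change-me"),  # placeholder credentials
    use_ssl=True,
    verify_certs=False,
)
client.indices.create(index="documents", body=dynamic_index_body(768))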

2. The Connector Router

A standout feature in the logic is the ConnectorRouter. The system can switch between two ways of handling data:

  • Langflow Connector: Uses Langflow’s visual pipelines for ingestion.
  • OpenRAG Connector: Uses a traditional, high-speed internal processing engine.
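
A rough sketch of what this switch can look like, driven by the DISABLE_INGEST_WITH_LANGFLOW flag from the .env shown earlier; the class and method names below are placeholders of mine, not OpenRAG’s actual ConnectorRouter API.

# Illustrative connector selection based on DISABLE_INGEST_WITH_LANGFLOW
# (class and method names are placeholders, not OpenRAG's actual API)
import os

class LangflowConnector:
    def ingest(self, path: str) -> None:
        print(f"Ingesting {path} through the Langflow visual pipeline")

class InternalConnector:
    def ingest(self, path: str) -> None:
        print(f"Ingesting {path} with the traditional internal processor")

def pick_connector():
    # Default (unset or "false") is the Langflow pipeline; the flag opts into the internal path
    if os.getenv("DISABLE_INGEST_WITH_LANGFLOW", "false").lower() == "true":
        return InternalConnector()
    return LangflowConnector()

pick_connector().ingest("sample.pdf")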

3. Request Lifecycle

Most endpoints follow this internal flow:

  • Auth Middleware: Checks for a valid JWT or API Key.
  • Task Delegation: For heavy tasks (like uploading a 100-page PDF), the API doesn’t make the user wait. It creates a Task ID via TaskService, processes the file in a background process pool, and allows the user to poll for the status.
  • Retrieval & Chat: Requests to /v1/chat or /v1/search query the OpenSearch index using the configured embedding model and return context-aware responses.

A word on CUDA and GPU — The Multi-Process Logic & CUDA Safety

As we saw in the main.py file (above), OpenRAG utilizes a custom Process Pool managed with the spawn start method. This is a critical technical choice for several reasons:

  • CUDA Compatibility: Libraries like PyTorch are not “fork-safe.” If a process forks after the GPU has been initialized, the child process often inherits a corrupted state, leading to immediate crashes. By using multiprocessing.set_start_method("spawn", force=True), OpenRAG ensures that every worker process starts with a clean, independent slate, which is essential for stable GPU-accelerated embedding and document parsing.
  • The Task Service: When you upload a document, the API doesn’t process it in the main thread. Instead, it delegates the heavy lifting to the TaskService. This service pushes the job to a dedicated process pool (initialized before any heavy AI imports) to ensure that document ingestion never blocks the chat interface or API responsiveness.
  • Docling Acceleration: OpenRAG leverages Docling with optional CUDA acceleration. By offloading these CPU/GPU-intensive layout analysis tasks to background processes, the system can handle large batches of PDFs without the “wobble” or memory spikes typical of less-optimized RAG implementations.
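
The pattern itself is easy to reproduce in a few lines; the snippet below is a generic illustration of “create the spawn-based pool before importing torch”, not OpenRAG’s process_pool module.

# Generic illustration of the "spawn the pool before importing torch/CUDA" pattern
# (not OpenRAG's process_pool module)
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

multiprocessing.set_start_method("spawn", force=True)
pool = ProcessPoolExecutor(max_workers=2, mp_context=multiprocessing.get_context("spawn"))

import torch  # imported only after the pool exists; spawned workers start with a clean state

def heavy_job(texts):
    # placeholder for GPU-heavy work such as embedding or Docling layout analysis
    return [len(t) for t in texts]

if __name__ == "__main__":
    future = pool.submit(heavy_job, ["hello", "world"])
    print(future.result())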

Why it Matters for the User

This “under-the-hood” engineering is what enables the “rocket fast” readiness I mentioned earlier. Because the system proactively manages its hardware resources and process lifecycle, it can spin up indexes and start ingesting documents almost instantly upon deployment. All of this — the security, the vector search, and the complex process orchestration — is handled entirely within the OpenRAG framework, allowing you to focus on your data rather than your infrastructure.


Conclusion

In summary, OpenRAG serves as the central engine orchestrating every phase of the AI lifecycle, from the initial environment setup to complex document retrieval. By leveraging its modular architecture, the system seamlessly integrates high-speed Docling ingestion, dynamic OpenSearch indexing, and flexible Langflow workflows into a single, cohesive unit. Whether it is managing secure OIDC authentication, running background task processing via a dedicated pool, or providing a clean UI for user interaction, OpenRAG ensures that the entire pipeline is fast, open-source, and ready for production right out of the box.
