DEV Community: Negitama

I wish I knew this before: Python's ORM vs Raw SQL in SQLAlchemy Explained!

Negitama — Sat, 14 Jun 2025 14:38:00 +0000

## Introduction

If you’ve ever built a Python backend, you’ve likely run into the choice between using an ORM (Object Relational Mapper) or writing raw SQL queries when working with a database. Let’s dive into a hands-on example using SQLAlchemy and unravel the strengths and weaknesses of each approach, so you can pick what’s best for your next project!

---

## 1. Setup: Define Models with SQLAlchemy ORM

First, let’s set up our data model using SQLAlchemy’s declarative base.

```python
from sqlalchemy import Column, Integer, String, Text, create_engine
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class JobDescription(Base):
    __tablename__ = 'job_description'
    id = Column(Integer, primary_key=True)
    company_id = Column(String, nullable=False)
    company_name = Column(Text, nullable=False)
    job_description = Column(Text, nullable=False)


class CandidateResume(Base):
    __tablename__ = 'candidate_resume'
    id = Column(Integer, primary_key=True)
    candidate_id = Column(String, nullable=False)
    candidate_name = Column(Text, nullable=False)
    resume = Column(Text, nullable=False)


engine = create_engine('postgresql://user:password@localhost/dbname')
Session = sessionmaker(bind=engine)

if __name__ == '__main__':
    Base.metadata.create_all(engine)
```

---

## 2. Bit-ORM Layer: Repository Using the ORM

Using the ORM, you’ll interact with Python objects instead of raw SQL.

```python
from orm_models import JobDescription, Session


def get_job_descriptions_by_company_orm(company_id: str):
    """
    Retrieve JobDescriptions for a given company using ORM.
    Returns a list of JobDescription objects.
    """
    session = Session()
    try:
        # Parentheses let the chained query span several lines
        results = (
            session.query(JobDescription)
            .filter(JobDescription.company_id == company_id)
            .all()
        )
        return results
    finally:
        session.close()


if __name__ == '__main__':
    jobs = get_job_descriptions_by_company_orm('COMPANY_123')
    for job in jobs:
        print(f"{job.company_name}: {job.job_description}")
```

### Advantages of the ORM Approach
- Abstraction & Simplicity: Work directly with Python objects; SQL generation is handled for you.
- Maintainability: Updates in models propagate everywhere queries are used.
- Safety & Consistency: Automatic parameter binding reduces SQL injection risks.
- Relationship Handling: Easier navigation between related models.

### Disadvantages
- Abstraction Overhead: Less efficient for highly optimized or complex queries.
- Difficult Advanced Queries: May require raw SQL for complex performance customizations.

---

## 3. Bit-Out-ORM Layer: Repository Using Raw SQL

For cases where fine-grained control is needed, writing raw SQL lets you directly manipulate queries.

```python
from sqlalchemy import text

from orm_models import engine


def get_job_descriptions_by_company_raw(company_id: str):
    """
    Retrieve job descriptions for a given company using raw SQL.
    Returns a list of Row objects.
    """
    with engine.connect() as connection:
        # SQLAlchemy 2.0 requires textual SQL to be wrapped in text();
        # :company_id is a bound parameter, not string interpolation
        result = connection.execute(
            text(
                "SELECT id, company_id, company_name, job_description "
                "FROM job_description WHERE company_id = :company_id"
            ),
            {"company_id": company_id},
        )
        return result.fetchall()


if __name__ == '__main__':
    rows = get_job_descriptions_by_company_raw('COMPANY_123')
    for row in rows:
        # Row objects expose columns as attributes
        print(f"{row.company_name}: {row.job_description}")
```

### Advantages of the Raw SQL Approach
- Fine-Grained Control: Ideal for advanced optimizations and database-specific features.
- Transparency: Directly see and understand the executed SQL.
- Flexibility: Access features not surfaced by the ORM.

### Disadvantages
- Manual Mapping: Responsible for converting rows into Python objects.
- Error-Prone: Handwriting SQL increases the chance of errors and maintenance headaches.
- Repetition: CRUD operations may become verbose and repetitive.

---

## 4. Summary & Takeaways

- Use ORM for productivity, maintainability, and safety in most cases.
- Use Raw SQL for advanced queries or performance-critical areas.
- Combining both approaches in a hybrid system gives the best of both worlds!

*Did you find this breakdown useful? Drop your favorite approach in the comments below!*

Happy coding! 🚀

I wish I knew this before mixing SQLAlchemy ORM with raw SQL: Python Database Access, Two Ways

Negitama — Sat, 14 Jun 2025 14:31:30 +0000

Working with databases in Python? You’ve probably used SQLAlchemy, but did you know there are two powerful approaches: the ORM layer and the raw SQL layer? Here’s a deep dive into both, including code, pros & cons, and when to use each.

## 1. Setup: Define Models with SQLAlchemy ORM
First, we create our ORM models. This file (e.g., `orm_models.py`) defines the schema using SQLAlchemy’s declarative base.

I wish I knew this before approaching Database Access in Python with SQLAlchemy

Negitama — Sat, 14 Jun 2025 14:29:46 +0000

Below is an in-depth example that shows two approaches in Python using SQLAlchemy: one that leverages the ORM ("bit-ORM layer") and one that uses raw SQL queries ("bit-out-ORM layer"). In the example, we define a simple model for job descriptions and then provide two repository functions that return job descriptions for a given company. At the end, I discuss the advantages and disadvantages of each approach.

## 1. Setup: Define Models with SQLAlchemy ORM

First, we create our ORM models. This file (e.g., orm_models.py) defines the schema in code using SQLAlchemy’s declarative base.

```python
from sqlalchemy import Column, Integer, String, Text, create_engine
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker

# Base class for our ORM models
Base = declarative_base()


class JobDescription(Base):
    __tablename__ = 'job_description'
    id = Column(Integer, primary_key=True)
    company_id = Column(String, nullable=False)
    company_name = Column(Text, nullable=False)
    job_description = Column(Text, nullable=False)


class CandidateResume(Base):
    __tablename__ = 'candidate_resume'
    id = Column(Integer, primary_key=True)
    candidate_id = Column(String, nullable=False)
    candidate_name = Column(Text, nullable=False)
    resume = Column(Text, nullable=False)


# Create the engine (adjust connection string as needed)
engine = create_engine('postgresql://user:password@localhost/dbname')

# Create a configured "Session" class
Session = sessionmaker(bind=engine)

# Optionally, create tables in the database (for development)
if __name__ == '__main__':
    Base.metadata.create_all(engine)
```

## 2. Bit-ORM Layer: Repository Using the ORM

This repository function uses the ORM, meaning we work with Python objects. The ORM layer automatically converts our model objects into the appropriate SQL queries. Save this in (e.g.) repository_orm.py.

```python
from orm_models import JobDescription, Session


def get_job_descriptions_by_company_orm(company_id: str):
    """
    Retrieve JobDescriptions for a given company using ORM.
    Returns a list of JobDescription objects.
    """
    session = Session()
    try:
        # Parentheses let the chained query span several lines
        results = (
            session.query(JobDescription)
            .filter(JobDescription.company_id == company_id)
            .all()
        )
        return results
    finally:
        session.close()


# Example usage:
if __name__ == '__main__':
    jobs = get_job_descriptions_by_company_orm('COMPANY_123')
    for job in jobs:
        print(f"{job.company_name}: {job.job_description}")
```

### Advantages of the ORM Approach

- Abstraction & Simplicity: You deal with Python objects; the ORM hides the SQL details.
- Maintainability: Changes to models update all the underlying queries.
- Safety & Consistency: Automatic parameter binding helps prevent SQL injection.
- Relationship Handling: ORMs make it easier to navigate relationships among models.

### Disadvantages of the ORM Approach

- Abstraction Overhead: Sometimes it can be less efficient if you need very fine-tuned queries.
- Complexity for Advanced Queries: Very complex queries or performance optimizations may require custom SQL.

## 3. Bit-Out-ORM Layer: Repository Using Raw SQL

In contrast, this repository function uses raw SQL queries via SQLAlchemy’s connection API. Save this as repository_raw.py.

```python
from sqlalchemy import text

from orm_models import engine


def get_job_descriptions_by_company_raw(company_id: str):
    """
    Retrieve job descriptions for a given company using raw SQL.
    Returns a list of Row objects.
    """
    with engine.connect() as connection:
        # Use parameterized queries to avoid SQL injection;
        # SQLAlchemy 2.0 requires textual SQL to be wrapped in text()
        result = connection.execute(
            text(
                "SELECT id, company_id, company_name, job_description "
                "FROM job_description WHERE company_id = :company_id"
            ),
            {"company_id": company_id},
        )
        return result.fetchall()


# Example usage:
if __name__ == '__main__':
    rows = get_job_descriptions_by_company_raw('COMPANY_123')
    for row in rows:
        # Row objects expose columns as attributes
        print(f"{row.company_name}: {row.job_description}")
```

### Advantages of the Raw SQL Approach

- Fine-Grained Control: Directly write and optimize SQL queries for complex or performance-critical tasks.
- Transparency: You see exactly what SQL is sent to the database.
- Flexibility: Use database-specific features or functions not directly exposed by the ORM.

### Disadvantages of the Raw SQL Approach

- Manual Mapping: You must manually convert results to Python objects if needed.
- Error-Prone: Writing raw SQL can lead to mistakes, and maintaining it over time may be more challenging.
- Repetition: Common CRUD operations might require writing repetitive SQL queries when an ORM could generate them automatically.

## 4. Summary and Comparison

In a full-scale system, it’s common to use an ORM for most database operations while falling back on raw SQL for advanced use cases that require fine-tuning. This hybrid approach allows you to benefit from high productivity when possible while retaining control when needed.

Did you know the two powerful ways in Python to interact with databases?

Negitama — Sat, 14 Jun 2025 14:24:47 +0000

Did you know… In Python, there are two powerful ways to interact with databases using SQLAlchemy: the ORM layer and raw SQL queries. Here’s a deep dive into both approaches and when to use each.

# Setup: Define Models with SQLAlchemy ORM
First, we create our ORM models. This file (e.g., orm_models.py) defines the schema in code using SQLAlchemy’s declarative base.

```python
from sqlalchemy import Column, Integer, String, Text, create_engine
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker

# Base class for our ORM models
Base = declarative_base()


class JobDescription(Base):
    __tablename__ = 'job_description'
    id = Column(Integer, primary_key=True)
    company_id = Column(String, nullable=False)
    company_name = Column(Text, nullable=False)
    job_description = Column(Text, nullable=False)


class CandidateResume(Base):
    __tablename__ = 'candidate_resume'
    id = Column(Integer, primary_key=True)
    candidate_id = Column(String, nullable=False)
    candidate_name = Column(Text, nullable=False)
    resume = Column(Text, nullable=False)


# Create the engine (adjust connection string as needed)
engine = create_engine('postgresql://user:password@localhost/dbname')

# Create a configured "Session" class
Session = sessionmaker(bind=engine)

# Optionally, create tables in the database (for development)
if __name__ == '__main__':
    Base.metadata.create_all(engine)
```

# Bit-ORM Layer: Repository Using the ORM
This repository function uses the ORM, meaning we work with Python objects.

```python
from orm_models import JobDescription, Session


def get_job_descriptions_by_company_orm(company_id: str):
    """
    Retrieve JobDescriptions for a given company using ORM.
    Returns a list of JobDescription objects.
    """
    session = Session()
    try:
        # Parentheses let the chained query span several lines
        results = (
            session.query(JobDescription)
            .filter(JobDescription.company_id == company_id)
            .all()
        )
        return results
    finally:
        session.close()


# Example usage:
if __name__ == '__main__':
    jobs = get_job_descriptions_by_company_orm('COMPANY_123')
    for job in jobs:
        print(f"{job.company_name}: {job.job_description}")
```

## Advantages of the ORM Approach
- Abstraction & Simplicity: You deal with Python objects; the ORM hides the SQL details.
- Maintainability: Changes to models update all the underlying queries.
- Safety & Consistency: Automatic parameter binding helps prevent SQL injection.
- Relationship Handling: ORMs make it easier to navigate relationships among models.

## Disadvantages of the ORM Approach
- Abstraction Overhead: Sometimes it can be less efficient if you need very fine-tuned queries.
- Complexity for Advanced Queries: Very complex queries or performance optimizations may require custom SQL.

# Bit-Out-ORM Layer: Repository Using Raw SQL
In contrast, this repository function uses raw SQL queries via SQLAlchemy’s connection API.

```python
from sqlalchemy import text

from orm_models import engine


def get_job_descriptions_by_company_raw(company_id: str):
    """
    Retrieve job descriptions for a given company using raw SQL.
    Returns a list of Row objects.
    """
    with engine.connect() as connection:
        # Use parameterized queries to avoid SQL injection;
        # SQLAlchemy 2.0 requires textual SQL to be wrapped in text()
        result = connection.execute(
            text(
                "SELECT id, company_id, company_name, job_description "
                "FROM job_description WHERE company_id = :company_id"
            ),
            {"company_id": company_id},
        )
        return result.fetchall()


# Example usage:
if __name__ == '__main__':
    rows = get_job_descriptions_by_company_raw('COMPANY_123')
    for row in rows:
        # Row objects expose columns as attributes
        print(f"{row.company_name}: {row.job_description}")
```

## Advantages of the Raw SQL Approach
- Fine-Grained Control: Directly write and optimize SQL queries for complex or performance-critical tasks.
- Transparency: You see exactly what SQL is sent to the database.
- Flexibility: Use database-specific features or functions not directly exposed by the ORM.

## Disadvantages of the Raw SQL Approach
- Manual Mapping: You must manually convert results to Python objects if needed.
- Error-Prone: Writing raw SQL can lead to mistakes, and maintaining it over time may be more challenging.
- Repetition: Common CRUD operations might require writing repetitive SQL queries when an ORM could generate them automatically.

# Summary and Comparison
In a full-scale system, it’s common to use an ORM for most database operations while falling back on raw SQL for advanced use cases that require fine-tuning. This hybrid approach allows you to benefit from high productivity when possible while retaining control when needed.

Did you know you can optimize database access with ORM and raw SQL?

Negitama — Sat, 14 Jun 2025 14:21:45 +0000

Did you know that you can choose between using SQLAlchemy ORM and raw SQL queries for database interactions in Python? Let's explore how each approach could benefit your development.

Setup: Define Models with SQLAlchemy ORM

First, we create our ORM models. This file (e.g., orm_models.py) defines the schema in code using SQLAlchemy’s declarative base.

from sqlalchemy import Column, Integer, String, Text, create_engine
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class JobDescription(Base):
    __tablename__ = 'job_description'
    id = Column(Integer, primary_key=True)
    company_id = Column(String, nullable=False)
    company_name = Column(Text, nullable=False)
    job_description = Column(Text, nullable=False)

class CandidateResume(Base):
    __tablename__ = 'candidate_resume'
    id = Column(Integer, primary_key=True)
    candidate_id = Column(String, nullable=False)
    candidate_name = Column(Text, nullable=False)
    resume = Column(Text, nullable=False)

engine = create_engine('postgresql://user:password@localhost/dbname')

Session = sessionmaker(bind=engine)

if __name__ == '__main__':
    Base.metadata.create_all(engine)

Bit-ORM Layer: Repository Using the ORM

This repository function uses the ORM—meaning we work with Python objects. The ORM layer automatically converts our model objects into the appropriate SQL queries. Save this in (e.g.) repository_orm.py.

from orm_models import JobDescription, Session

def get_job_descriptions_by_company_orm(company_id: str):
    session = Session()
    try:
        # Parentheses instead of backslashes for the multi-line chain
        results = (
            session.query(JobDescription)
            .filter(JobDescription.company_id == company_id)
            .all()
        )
        return results
    finally:
        session.close()

if __name__ == '__main__':
    jobs = get_job_descriptions_by_company_orm('COMPANY_123')
    for job in jobs:
        print(f"{job.company_name}: {job.job_description}")

Advantages of the ORM Approach

Abstraction & Simplicity: You deal with Python objects; the ORM hides the SQL details.
Maintainability: Changes to models update all the underlying queries.
Safety & Consistency: Automatic parameter binding helps prevent SQL injection.
Relationship Handling: ORMs make it easier to navigate relationships among models.
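
To make the parameter-binding point concrete, here is a minimal sketch using the standard library's sqlite3 instead of SQLAlchemy, so it runs without a database server. The in-memory table and the sample rows are assumptions for illustration only; the idea (user input travels as a bound value, never as SQL text) is the same one the ORM applies for you.

```python
import sqlite3

# Hypothetical in-memory table loosely mirroring the job_description model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_description ("
    "id INTEGER PRIMARY KEY, company_id TEXT, company_name TEXT)"
)
conn.executemany(
    "INSERT INTO job_description (company_id, company_name) VALUES (?, ?)",
    [("COMPANY_123", "Acme"), ("COMPANY_456", "Globex")],
)

malicious = "nope' OR '1'='1"

# Unsafe: the input is spliced into the SQL text and rewrites the WHERE clause.
unsafe_sql = (
    "SELECT company_name FROM job_description "
    f"WHERE company_id = '{malicious}'"
)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is sent separately as a bound parameter and stays a plain value.
safe_rows = conn.execute(
    "SELECT company_name FROM job_description WHERE company_id = :company_id",
    {"company_id": malicious},
).fetchall()

print(len(unsafe_rows))  # every row comes back
print(len(safe_rows))    # no company has that literal id
```

The string-formatted query returns every row because the injected `OR '1'='1'` is always true, while the bound version simply finds no match.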

Disadvantages of the ORM Approach

Abstraction Overhead: Sometimes it can be less efficient if you need very fine-tuned queries.
Complexity for Advanced Queries: Very complex queries or performance optimizations may require custom SQL.

Bit-Out-ORM Layer: Repository Using Raw SQL

In contrast, this repository function uses raw SQL queries via SQLAlchemy’s connection API. Save this as repository_raw.py.

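from sqlalchemy import text

from orm_models import engine

def get_job_descriptions_by_company_raw(company_id: str):
    with engine.connect() as connection:
        # SQLAlchemy 2.0 requires textual SQL to be wrapped in text();
        # :company_id is a bound parameter, not string interpolation
        result = connection.execute(
            text("SELECT id, company_id, company_name, job_description FROM job_description WHERE company_id = :company_id"),
            {"company_id": company_id}
        )
        return result.fetchall()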

if __name__ == '__main__':
    rows = get_job_descriptions_by_company_raw('COMPANY_123')
    for row in rows:
        # Row objects expose columns as attributes
        print(f"{row.company_name}: {row.job_description}")

Advantages of the Raw SQL Approach

Fine-Grained Control: Directly write and optimize SQL queries for complex or performance-critical tasks.
Transparency: You see exactly what SQL is sent to the database.
Flexibility: Use database-specific features or functions not directly exposed by the ORM.

Disadvantages of the Raw SQL Approach

Manual Mapping: You must manually convert results to Python objects if needed.
Error-Prone: Writing raw SQL can lead to mistakes, and maintaining it over time may be more challenging.
Repetition: Common CRUD operations might require writing repetitive SQL queries when an ORM could generate them automatically.
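
The "manual mapping" cost is easy to underestimate, so here is a small sketch of what it looks like in practice. It uses the standard library's sqlite3 and a dataclass; `JobDescriptionDTO` and `fetch_jobs` are hypothetical names introduced for this example, not part of the repository code above.

```python
import sqlite3
from dataclasses import dataclass
from typing import List


@dataclass
class JobDescriptionDTO:
    """Hypothetical plain object standing in for the ORM model."""
    id: int
    company_id: str
    company_name: str
    job_description: str


def fetch_jobs(conn: sqlite3.Connection, company_id: str) -> List[JobDescriptionDTO]:
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    rows = conn.execute(
        "SELECT id, company_id, company_name, job_description "
        "FROM job_description WHERE company_id = ?",
        (company_id,),
    ).fetchall()
    # This is the mapping step an ORM would otherwise do for us.
    return [JobDescriptionDTO(**dict(row)) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_description (id INTEGER PRIMARY KEY, company_id TEXT, "
    "company_name TEXT, job_description TEXT)"
)
conn.execute(
    "INSERT INTO job_description (company_id, company_name, job_description) "
    "VALUES ('COMPANY_123', 'Acme', 'Backend engineer')"
)
jobs = fetch_jobs(conn, "COMPANY_123")
print(jobs[0].company_name)
```

Every new column now has to be added in three places (the table, the SELECT list, and the dataclass), which is the repetition the list above warns about.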

Summary and Comparison

In a full-scale system, it’s common to use an ORM for most database operations while falling back on raw SQL for advanced use cases that require fine-tuning. This hybrid approach allows you to benefit from high productivity when possible while retaining control when needed.

Two Approaches to Database Interaction with SQLAlchemy in Python

Negitama — Sat, 14 Jun 2025 14:13:48 +0000

Ever wondered about the best way to work with databases in Python? In this post, I've explored two different approaches using SQLAlchemy: the ORM layer and using raw SQL.

1. Setup: Define Models with SQLAlchemy ORM
First, we define our ORM models in a file (e.g., orm_models.py). These models define the database schema in code using SQLAlchemy’s declarative base. We create a JobDescription model and a CandidateResume model.

2. Bit-ORM Layer: Repository Using the ORM
This approach uses the ORM to automatically convert model objects into SQL queries. Check out the code in repository_orm.py, where we retrieve job descriptions using the ORM.

Advantages of the ORM Approach

Abstraction & Simplicity: Deal with Python objects and let the ORM handle SQL.
Maintainability and Safety: Model changes update underlying queries and prevent SQL injection.

3. Bit-Out-ORM Layer: Repository Using Raw SQL
Contrast this with using raw SQL in repository_raw.py. This approach allows for direct query writing with potential optimizations.

Advantages of the Raw SQL Approach

Fine-Grained Control: Write optimized SQL for complex tasks.
Transparency & Flexibility: See the exact SQL sent and use database-specific features.

Summary and Comparison
Consider using an ORM for regular tasks and raw SQL for fine-tuned, performance-critical queries. This hybrid approach provides productivity with control when needed.

SpannerDB Reading Capabilities and Transaction Specifications

Negitama — Sat, 14 Jun 2025 14:11:50 +0000

SpannerDB (Google Cloud Spanner) provides sophisticated reading capabilities designed to balance consistency, performance, and global scale. This report explores the comprehensive read transaction specifications of SpannerDB, examining its various transaction types, consistency guarantees, and implementation details.

## Core Read Transaction Types

Google Spanner offers several distinct reading capabilities that serve different application needs while maintaining strong consistency guarantees across a globally distributed database.

Integrating BM25 in Hybrid Search and Reranking Pipelines: Strategies and Applications

Negitama — Sat, 14 Jun 2025 14:07:00 +0000

BM25 (Best Matching 25) is a foundational algorithm in information retrieval, renowned for its efficiency in keyword-based relevance scoring. While modern neural rerankers and vector search dominate advanced retrieval systems, BM25 remains a critical component in hybrid architectures and reranking workflows. This report examines BM25’s dual role in hybrid search systems and reranking pipelines, analyzing implementation patterns, use cases, and technical considerations.

## 1. BM25 as a Hybrid Search Component
Hybrid search combines keyword-based retrieval (BM25) with semantic vector search to balance precision and recall. BM25’s role here is to ensure exact keyword matches and term rarity are prioritized, while vector search captures contextual relationships.

### 1.1 Parallel Retrieval Fusion
In systems like Elasticsearch and Weaviate, BM25 and vector search run independently, with results merged using fusion algorithms:

- Reciprocal Rank Fusion (RRF): Combines rankings from both methods using the formula: $$\text{RRF\_score} = \sum_{i} \frac{1}{k + \text{rank}_i}$$
- Weighted Score Combination: Assigns a tunable weight $\alpha$ to the BM25 and vector similarity scores: $$\text{Final\_score} = \alpha \cdot \text{BM25\_score} + (1-\alpha) \cdot \text{Vector\_score}$$

### 1.2 BM25 as a Pre-Filter
In latency-sensitive applications, BM25 narrows the candidate pool before vector search:

```sql
SELECT * FROM documents
WHERE bm25_match(query)
ORDER BY vector_similarity DESC LIMIT 100
```

This two-stage retrieval reduces computational overhead by excluding irrelevant documents early.

### 1.3 BM25F for Field-Aware Hybrid Search
BM25F extends BM25 to weight fields differently (e.g., title vs. body). Weaviate implements this for structured data:
$$\text{BM25F\_score} = \sum_{f \in \text{fields}} w_f \cdot \frac{TF_f}{k_1 \left(1-b + b \cdot \frac{DL_f}{avgDL_f}\right) + TF_f} \cdot IDF$$
where $w_f$ is the field weight, $DL_f$ is the field length, and $b$ controls length normalization.

## 2. BM25 in Reranking Pipelines
While BM25 is not a standalone neural reranker, it enhances reranking through score fusion, feature engineering, and fallback mechanisms.

### 2.1 Hybrid Pre-Reranking
BM25 and vector search retrieve 100–200 candidates, which are then processed by cross-encoders (e.g., bge-reranker-v2-m3) or LLMs:
- BM25 retrieves 50 documents.
- Vector search retrieves 50 documents.
- A cross-encoder reranks the combined 100 documents.

### 2.2 Score Augmentation for Neural Rerankers
BM25 scores are injected as features into reranking models:

```json
{"document": "text", "bm25_score": 0.85, "vector_score": 0.92}
```

The TREC Deep Learning Track shows that appending BM25 scores as text tokens (e.g., "BM25=0.85") improves BERT-based reranker accuracy by 7.3% MRR@10.

### 2.3 Fallback Tiebreaking
When neural rerankers produce tied scores, BM25 breaks ties:

```python
# Highest rerank score first; BM25 decides the order among tied rerank scores
sorted_results = sorted(
    tied_results,
    key=lambda x: (x['rerank_score'], x['bm25_score']),
    reverse=True,
)
```

This is critical in legal or regulatory contexts where explainability matters.

## 3. Use Cases and Implementation Guidance

### 3.1 When to Use BM25 in Hybrid/Reranking

### 3.2 Optimization Strategies
- Parameter Tuning: Adjust $k_1$ (term frequency saturation) and $b$ (length normalization) based on document length variance. For technical documents, $k_1=1.2$, $b=0.75$ often works best.
- Dynamic Weighting: Use query classification to set $\alpha$ in hybrid scores. For navigational queries (e.g., "Facebook login"), $\alpha=0.8$; for exploratory queries (e.g., "AI ethics"), $\alpha=0.3$.
- BM25-Driven Pruning: Exclude documents with BM25 scores below a threshold (e.g., $BM25 < 1.5$) before vector search to reduce latency.

## 4. Limitations and Alternatives

### 4.1 BM25 Shortcomings
- Fails to capture semantic relationships (e.g., synonymy: "car" vs. "automobile").
- Struggles with long-tail queries in low-resource languages.
- Scores are not directly comparable across indexes, complicating federated search.

### 4.2 When to Use Neural Rerankers Instead
- High semantic complexity: Queries like "impact of inflation on renewable energy adoption" benefit from cross-encoders.
- Multilingual settings: Models like Cohere Rerank or Vectara Multilingual outperform BM25 in 40+ languages.
- Personalization: User-specific reranking requires learning-to-rank (LTR) models.

## 5. Emerging Trends

- BM25 as a Reranker Feature: The TREC 2023 Deep Learning Track found that concatenating BM25 scores to document text (e.g., "Document: ... [BM25=0.72]") improves reranker robustness.
- Sparse-Dense Hybrids: SPLADE (Sparse Lexical and Expansion) models unify BM25-like term weights with neural representations, achieving 94% of BM25’s speed with 98% of BERT’s accuracy.
- BM25 in LLM Pipelines: LangChain and LlamaIndex use BM25 to filter context for LLMs, reducing hallucination risks by 22–37%.

## Conclusion
BM25 remains indispensable in hybrid and reranking systems despite the rise of neural methods. Its strengths (computational efficiency, explainability, and exact-match precision) complement vector search’s semantic understanding. Implementations range from simple score fusion to complex feature engineering in cross-encoders. For optimal results:

- Use BM25 as a first-stage retriever in hybrid pipelines.
- Integrate its scores into neural rerankers via feature injection.
- Reserve pure neural reranking for high-resource, semantically complex scenarios.

This dual role ensures BM25’s continued relevance in an era dominated by large language models and semantic search technologies.

Integrating BM25 in Hybrid Search and Reranking Pipelines: Strategies and Applications

Negitama — Sat, 14 Jun 2025 10:19:34 +0000


BM25 (Best Matching 25) is a foundational algorithm in information retrieval, renowned for its efficiency in keyword-based relevance scoring. While modern neural rerankers and vector search dominate advanced retrieval systems, BM25 remains a critical component in hybrid architectures and reranking workflows. This report examines BM25’s dual role in hybrid search systems and reranking pipelines, analyzing implementation patterns, use cases, and technical considerations.

BM25 as a Hybrid Search Component

Hybrid search combines keyword-based retrieval (BM25) with semantic vector search to balance precision and recall. BM25’s role here is to ensure exact keyword matches and term rarity are prioritized, while vector search captures contextual relationships.

Parallel Retrieval Fusion

In systems like Elasticsearch and Weaviate, BM25 and vector search run independently, with results merged using fusion algorithms:

Reciprocal Rank Fusion (RRF): Combines rankings from both methods using the formula: $$\text{RRF\_score} = \sum_{i} \frac{1}{k + \text{rank}_i}$$
Weighted Score Combination: Assigns a tunable weight $\alpha$ to the BM25 and vector similarity scores: $$\text{Final\_score} = \alpha \cdot \text{BM25\_score} + (1-\alpha) \cdot \text{Vector\_score}$$
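
Both fusion rules fit in a few lines of Python. This is a minimal sketch, not how Elasticsearch or Weaviate implement it: the function names are made up for illustration, `k=60` is a commonly used default rather than something the text specifies, and the weighted variant assumes both scores were already normalized to a comparable range.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over every ranking a document appears in (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def weighted_score(bm25_score: float, vector_score: float, alpha: float) -> float:
    """alpha * BM25 + (1 - alpha) * vector; assumes both inputs share a scale."""
    return alpha * bm25_score + (1.0 - alpha) * vector_score


bm25_ranking = ["d1", "d2", "d3"]    # best match first
vector_ranking = ["d3", "d1", "d4"]  # best match first
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
fused_order = sorted(fused, key=fused.get, reverse=True)
print(fused_order)
```

Documents that appear near the top of both lists (d1, d3) end up ahead of documents that only one retriever found, and RRF never needs the raw scores to be comparable, which is why it is a popular default.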

BM25 as a Pre-Filter

In latency-sensitive applications, BM25 narrows the candidate pool before vector search:

SELECT * FROM documents
WHERE bm25_match(query)
ORDER BY vector_similarity DESC LIMIT 100

This two-stage retrieval reduces computational overhead by excluding irrelevant documents early.
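
The same two-stage idea can be sketched in plain Python. Everything here is a toy assumption: the three-document corpus, the two-dimensional embeddings, and a simple shared-term check standing in for a real BM25 match.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy corpus: doc id -> (text, embedding).
CORPUS: Dict[str, Tuple[str, List[float]]] = {
    "d1": ("postgres index tuning guide", [0.9, 0.1]),
    "d2": ("index funds for beginners", [0.1, 0.9]),
    "d3": ("sourdough bread recipe", [0.8, 0.2]),
}


def keyword_prefilter(query: str) -> List[str]:
    """Stage 1, standing in for BM25 matching: keep docs sharing a query term."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, (text, _) in CORPUS.items() if terms & set(text.split())]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def two_stage_search(query: str, query_vec: List[float], limit: int = 100) -> List[str]:
    """Stage 2: rank only the surviving candidates by vector similarity."""
    candidates = keyword_prefilter(query)
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, CORPUS[d][1]), reverse=True)
    return ranked[:limit]


results = two_stage_search("index tuning", [1.0, 0.0])
print(results)
```

Note that d3 never reaches the vector stage even though its embedding is close to the query vector. That is the trade-off of pre-filtering: it saves work, but anything the keyword stage misses is gone for good.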

BM25F for Field-Aware Hybrid Search

BM25F extends BM25 to weight fields differently (e.g., title vs. body). Weaviate implements this for structured data:
$$\text{BM25F\_score} = \sum_{f \in \text{fields}} w_f \cdot \frac{TF_f}{k_1 \left(1-b + b \cdot \frac{DL_f}{avgDL_f}\right) + TF_f} \cdot IDF$$
where $w_f$ is the field weight, $DL_f$ is the field length, and $b$ controls length normalization.
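
Here is a small sketch that evaluates the formula exactly as written above for a single query term. The function name, the field names, and the sample numbers are assumptions; other BM25F formulations combine the weighted term frequencies before applying saturation, so treat this as an illustration of this post's version only.

```python
from typing import Dict


def bm25f_term_score(
    tf: Dict[str, int],              # term frequency per field
    field_len: Dict[str, int],       # DL_f: length of each field in this document
    avg_field_len: Dict[str, float], # avgDL_f: average length of each field
    weights: Dict[str, float],       # w_f: per-field weight
    idf: float,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    score = 0.0
    for field, w in weights.items():
        tf_f = tf.get(field, 0)
        # Length normalization: longer-than-average fields are penalized when b > 0.
        norm = k1 * (1.0 - b + b * field_len[field] / avg_field_len[field])
        score += w * tf_f / (norm + tf_f) * idf
    return score


score = bm25f_term_score(
    tf={"title": 1, "body": 2},
    field_len={"title": 5, "body": 100},
    avg_field_len={"title": 5.0, "body": 100.0},
    weights={"title": 2.0, "body": 1.0},
    idf=1.0,
)
print(round(score, 4))
```

With these numbers a single title hit contributes more than two body hits, which is the whole point of field weighting.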

BM25 in Reranking Pipelines

While BM25 is not a standalone neural reranker, it enhances reranking through score fusion, feature engineering, and fallback mechanisms.

Hybrid Pre-Reranking

BM25 and vector search retrieve 100–200 candidates, which are then processed by cross-encoders or LLMs:

BM25 retrieves 50 documents.
Vector search retrieves 50 documents.
A cross-encoder reranks the combined 100 documents.
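
The three steps above can be sketched as follows. The `overlap_score` function is a hypothetical stand-in for a cross-encoder (a real one would be a model call), and the candidate lists are made-up strings.

```python
from typing import Callable, List


def merge_candidates(bm25_hits: List[str], vector_hits: List[str]) -> List[str]:
    """Union of both candidate lists; duplicates removed, first occurrence kept."""
    return list(dict.fromkeys(bm25_hits + vector_hits))


def rerank(query: str, candidates: List[str], score_fn: Callable[[str, str], float]) -> List[str]:
    """Order candidates by the reranker's score, best first."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)


def overlap_score(query: str, doc: str) -> float:
    """Hypothetical cross-encoder stand-in: fraction of query terms found in the doc."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)


bm25_hits = ["spanner read transactions", "bm25 ranking basics"]
vector_hits = ["bm25 ranking basics", "hybrid search with bm25 and vectors"]
candidates = merge_candidates(bm25_hits, vector_hits)
top = rerank("hybrid bm25 search", candidates, overlap_score)
print(top[0])
```

One practical detail this makes visible: when both retrievers return the same document, the combined pool is smaller than the sum of the two lists, so "50 + 50" is an upper bound on what the reranker sees.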

Score Augmentation for Neural Rerankers

BM25 scores are injected as features into reranking models:

{
  "document": "text", 
  "bm25_score": 0.85, 
  "vector_score": 0.92
}

The TREC Deep Learning Track shows that appending BM25 scores as text tokens (e.g., "BM25=0.85") improves BERT-based reranker accuracy by 7.3% MRR@10.
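
As a rough sketch of what "appending the score as a text token" could look like, the helper below formats the BM25 score and attaches it to the document text before it is handed to a reranker. The function name and the exact token format are assumptions for illustration; the model would need to be trained or fine-tuned with the same format to make use of it.

```python
def augment_with_bm25(document: str, bm25_score: float) -> str:
    """Append the BM25 score to the document text as a plain-text token."""
    return f"{document} [BM25={bm25_score:.2f}]"


record = {"document": "text", "bm25_score": 0.85, "vector_score": 0.92}
reranker_input = augment_with_bm25(record["document"], record["bm25_score"])
print(reranker_input)
```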

Fallback Tiebreaking

When neural rerankers produce tied scores, BM25 breaks ties:

# Highest rerank score first; BM25 decides the order among tied rerank scores
sorted_results = sorted(
    tied_results,
    key=lambda x: (x['rerank_score'], x['bm25_score']),
    reverse=True,
)

This is critical in legal or regulatory contexts where explainability matters.

Use Cases and Implementation Guidance

When to Use BM25 in Hybrid/Reranking

Optimization Strategies

Parameter Tuning: Adjust k1 (term frequency saturation) and b (length normalization) based on document length variance. For technical documents, k1=1.2 and b=0.75 (the defaults in Lucene and Elasticsearch) are often a good starting point.
Dynamic Weighting: Use query classification to set α in hybrid scores. For navigational queries (e.g., "Facebook login"), α=0.8; for exploratory queries (e.g., "AI ethics"), α=0.3.
BM25-Driven Pruning: Exclude documents with BM25 scores below a threshold (e.g., BM25 < 1.5) before vector search to reduce latency.
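
To see what k1, b, and a pruning threshold actually do, here is a compact from-scratch BM25 scorer. It is a sketch under assumptions: it uses the Lucene-style IDF (with the `1 +` inside the logarithm so scores stay non-negative), the documents are pre-tokenized toy lists, and the pruning threshold of 0.5 is arbitrary. The alpha values are the ones quoted above; the query classifier that would pick between them is assumed to exist elsewhere.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # exact-match only: unseen terms contribute nothing
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avgdl)  # b scales the length penalty
            s += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(s)
    return scores


# Alpha per query type, as quoted in the list above.
ALPHA_BY_QUERY_TYPE = {"navigational": 0.8, "exploratory": 0.3}

docs = [
    ["car", "engine", "repair"],
    ["automobile", "engine", "repair", "manual", "for", "beginners"],
    ["bread", "recipe"],
]
scores = bm25_scores(["car"], docs)

# BM25-driven pruning: only documents at or above the threshold move on.
THRESHOLD = 0.5  # hypothetical value; tune against your own score distribution
kept = [i for i, s in enumerate(scores) if s >= THRESHOLD]
print(kept)
```

Setting `b=0.0` switches length normalization off entirely, which is a quick way to check whether document length is what is driving your rankings.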

Limitations and Alternatives

BM25 Shortcomings

Fails to capture semantic relationships (e.g., synonymy: "car" vs. "automobile").
Struggles with long-tail queries in low-resource languages.
Scores are not directly comparable across indexes, complicating federated search.

When to Use Neural Rerankers Instead

High semantic complexity: Queries like "impact of inflation on renewable energy adoption" benefit from cross-encoders.
Multilingual settings: Models like Cohere Rerank or Vectara Multilingual outperform BM25 in 40+ languages.
Personalization: User-specific reranking requires learning-to-rank models.

Emerging Trends

BM25 as a Reranker Feature: The TREC 2023 Deep Learning Track found that concatenating BM25 scores to document text (e.g., "Document: ... [BM25=0.72]") improves reranker robustness.
Sparse-Dense Hybrids: SPLADE models unify BM25-like term weights with neural representations, achieving 94% of BM25’s speed with 98% of BERT’s accuracy.
BM25 in LLM Pipelines: LangChain and LlamaIndex use BM25 to filter context for LLMs, reducing hallucination risks by 22–37%.
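To make the context-filtering idea concrete, here is a small self-contained BM25 scorer used to pick passages for a prompt. It is a sketch, not the LangChain or LlamaIndex implementation: the whitespace tokenizer, the Lucene-style IDF variant, and the `select_context` helper are assumptions chosen to keep the example short. The k1 and b defaults are the values mentioned earlier.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real systems use analyzers)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    if not tokenized:
        return []
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def select_context(query, docs, top_k=2):
    """Keep the top_k passages with a non-zero BM25 score."""
    ranked = sorted(zip(bm25_scores(query, docs), docs), key=lambda p: p[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]


docs = [
    "bm25 ranks documents by term overlap",
    "cats sleep most of the day",
    "hybrid search combines bm25 with vectors",
]
context = select_context("bm25 hybrid search", docs)
print(context)
```

Passages with no lexical overlap never reach the prompt, which is the filtering effect described above.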

Conclusion

BM25 remains indispensable in hybrid and reranking systems despite the rise of neural methods. Its strengths—computational efficiency, explainability, and exact-match precision—complement vector search’s semantic understanding. Implementations range from simple score fusion to complex feature engineering in cross-encoders. For optimal results:

Use BM25 as a first-stage retriever in hybrid pipelines.
Integrate its scores into neural rerankers via feature injection.
Reserve pure neural reranking for high-resource, semantically complex scenarios.

This dual role ensures BM25’s continued relevance in an era dominated by large language models and semantic search technologies.

Namespace vs Regular Packages in Python — And Why mypy Might Be Failing You

Negitama — Fri, 13 Jun 2025 16:41:30 +0000

If you're building AI systems, data pipelines, or backend services in Python, you’ve probably run into weird bugs with mypy not picking up types or imports mysteriously failing—especially when you’re working across microservices or large codebases. Chances are… you’re using a namespace package (maybe without even knowing it). Let’s break it down.

Regular Packages vs Namespace Packages

Regular Packages

Require an __init__.py file
Define a single, self-contained folder
Easy for mypy, IDEs, linters, and tests to process

Example of a regular package:
project/
└── analytics/
    ├── __init__.py
    ├── metrics.py
    └── models.py

Namespace Packages

No __init__.py needed
Split across multiple folders/repos
Used in plugin systems or modular AI/ML tooling

Example of a namespace package:
src/coretools/featurestore/
    encoder.py
libs/featurestore/
    scaler.py

With namespace packages, featurestore.encoder and featurestore.scaler can coexist under the same featurestore import path, as long as both src/coretools and libs are on the import path.

Great for scalability. Nightmare for static analysis—unless configured right.
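If you want to see this behavior for yourself, the sketch below rebuilds the layout from the example in a temporary directory and imports from both halves. The directory names mirror the example; the module contents (`NAME = ...`) are placeholders invented for the demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_portion(root: Path, module_name: str, body: str) -> None:
    """Create root/featurestore/<module_name>.py with no __init__.py."""
    pkg_dir = root / "featurestore"
    pkg_dir.mkdir(parents=True)
    (pkg_dir / f"{module_name}.py").write_text(body)


with tempfile.TemporaryDirectory() as tmp:
    core_root = Path(tmp) / "src" / "coretools"
    libs_root = Path(tmp) / "libs"
    make_portion(core_root, "encoder", "NAME = 'encoder'\n")
    make_portion(libs_root, "scaler", "NAME = 'scaler'\n")

    # Both roots must be on sys.path for the two portions to merge.
    roots = [str(core_root), str(libs_root)]
    sys.path[:0] = roots
    try:
        importlib.invalidate_caches()
        encoder = importlib.import_module("featurestore.encoder")
        scaler = importlib.import_module("featurestore.scaler")
        featurestore = importlib.import_module("featurestore")
        # A namespace package has no single file behind it...
        namespace_file = getattr(featurestore, "__file__", None)
        # ...and its __path__ lists every directory that contributes to it.
        portion_count = len(list(featurestore.__path__))
    finally:
        for root in roots:
            sys.path.remove(root)

print(encoder.NAME, scaler.NAME, namespace_file, portion_count)
```

The two directories never reference each other. Python merges them at import time, and that merge is the step a static analyzer has to reproduce from configuration alone.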

Why AI Devs & Data Teams Should Care

Modular Pipelines: When your training logic and feature store live in different repos
Plugin Systems: For experiment tracking, custom metrics, preprocessing layers
Shared AI Tooling: Across internal libraries, you may already be “namespacing” without realizing

But here's the catch…

Why mypy Can Break on Namespace Packages

Your Developer Checklist to Fix mypy with Namespace Packages

Enable namespace support (newer mypy releases turn this on by default):
mypy --namespace-packages

Or in mypy.ini:
[mypy]
namespace_packages = true

Use -p your.package.name instead of just the folder
mypy -p featurestore.encoder
Set MYPYPATH + --explicit-package-bases if your source layout is non-standard
export MYPYPATH=src
mypy --explicit-package-bases -p yourpkg.module
Still struggling? Add a dummy __init__.pyi or __init__.py
This helps tools infer structure even in namespace packages.
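Before reaching for flags, it can help to confirm that a package really is a namespace package. The helper below is a small diagnostic sketch; `is_namespace_package` and the `demo_ns_pkg` directory are names made up for this example. It relies on the import system reporting no origin file for namespace packages, which holds on Python 3.7 and later.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def is_namespace_package(name: str) -> bool:
    """Return True if `name` resolves to a package without an __init__.py."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    # Namespace packages have search locations but no origin file.
    return spec.origin is None and spec.submodule_search_locations is not None


# Demo: an empty directory on sys.path behaves as a namespace package.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_ns_pkg").mkdir()
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        namespace_result = is_namespace_package("demo_ns_pkg")
        regular_result = is_namespace_package("json")
    finally:
        sys.path.remove(tmp)

print(namespace_result, regular_result)
```

Running this against your own package names in CI is a cheap way to catch a missing __init__.py that silently turned a regular package into a namespace package.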

Summary Table
“Namespaces are one honking great idea — let's do more of those.”

Takeaway
If you're building modular AI pipelines, ML services, or shared tooling across teams—you need to understand how namespace packages and tools like mypy interact. It's the difference between silent bugs and confident code.

Have you hit these issues in production or CI? Let’s compare notes.

Namespace vs Regular Packages in Python — And Why mypy Might Be Failing You

Negitama — Fri, 13 Jun 2025 16:36:04 +0000

Regular Packages

Require an __init__.py file
Define a single, self-contained folder
Easy for mypy, IDEs, linters, and tests to process

Namespace Packages

No __init__.py needed
Split across multiple folders/repos
Used in plugin systems or modular AI/ML tooling

With namespace packages, featurestore.encoder and featurestore.scaler can coexist under the same featurestore import path, even when they live in different folders such as src/coretools and libs.
👉 Great for scalability. Nightmare for static analysis—unless configured right.

🧪 Why AI Devs & Data Teams Should Care
🔌 Modular Pipelines
🧩 Plugin Systems
🧠 Shared AI Tooling

But here's the catch…

🚨 Why mypy Can Break on Namespace Packages

✅ Your Developer Checklist to Fix mypy with Namespace Packages

Enable namespace support
Use -p your.package.name
Set MYPYPATH + --explicit-package-bases

Still struggling? Add a dummy __init__.pyi or __init__.py

🧾 Summary Table

🔍 TakeAway
If you're building modular AI pipelines, ML services, or shared tooling across teams—you need to understand how namespace packages and tools like mypy interact. It's the difference between silent bugs and confident code.

Have you hit these issues in production or CI? Let’s compare notes.

Namespace vs Regular Packages in Python — And Why mypy Might Be Failing You

Negitama — Fri, 13 Jun 2025 16:28:42 +0000

If you're building AI systems, data pipelines, or backend services in Python, you’ve probably run into weird bugs with mypy not picking up types or imports mysteriously failing—especially when you’re working across microservices or large codebases.
Chances are… you’re using a namespace package (maybe without even knowing it). Let’s break it down.

Regular Packages vs Namespace Packages
Regular Packages
Require an __init__.py file
Define a single, self-contained folder
Easy for mypy, IDEs, linters, and tests to process

Regular package

project/
└── analytics/
    ├── __init__.py
    ├── metrics.py
    └── models.py

Namespace Packages
No __init__.py needed
Split across multiple folders/repos
Used in plugin systems or modular AI/ML tooling

Namespace package

src/coretools/featurestore/
    encoder.py
libs/featurestore/
    scaler.py

With namespace packages, featurestore.encoder and featurestore.scaler can coexist under the same featurestore import path, as long as both src/coretools and libs are on the import path.
Great for scalability. Nightmare for static analysis—unless configured right.

Why AI Devs & Data Teams Should Care
Modular Pipelines: When your training logic and feature store live in different repos
Plugin Systems: For experiment tracking, custom metrics, preprocessing layers
Shared AI Tooling: Across internal libraries, you may already be “namespacing” without realizing
But here's the catch…

Why mypy Can Break on Namespace Packages
Your Developer Checklist to Fix mypy with Namespace Packages
Enable namespace support
mypy --namespace-packages
Or in mypy.ini:
[mypy]
namespace_packages = true

Use -p your.package.name instead of just the folder
mypy -p featurestore.encoder

Set MYPYPATH + --explicit-package-bases if your source layout is non-standard
export MYPYPATH=src
mypy --explicit-package-bases -p yourpkg.module

Still struggling? Add a dummy __init__.pyi or __init__.py
This helps tools infer structure even in namespace packages.
Summary Table
“Namespaces are one honking great idea — let's do more of those.”