
Dev Team for Timescale

Originally published at timescale.com

General-Purpose vs. Domain-Specific Embedding Models: How to Choose?


When building a search or RAG application, you face a crucial decision—which embedding model should you use? The choice is no longer just between proprietary models like OpenAI and open-source alternatives. Now, you also need to consider domain-specific models trained for particular fields like finance, healthcare, or legal text. Once you've identified potential candidates, the testing process begins. You need to check the following items off your list:

  • Set up infrastructure for each model
  • Create representative test data
  • Build evaluation pipelines
  • Run reliable benchmarks
  • Compare results fairly

Who's got time for all of this when you're trying to ship 🚀 your application?

Oh, and there's another challenge!

If you already have a working search system, testing new models often means either disrupting your production environment or building a separate testing setup. This frequently leads to postponing model evaluation, potentially missing out on significant accuracy improvements and cost savings—especially critical when dealing with specialized domain knowledge.

*We wanted to find a straightforward way to evaluate different embedding models and understand their real-world performance on domain-specific text.*

In this guide, we'll demonstrate this process using financial data as our example. We'll show you how to use pgai Vectorizer, an open-source tool for embedding creation and sync, to test different embedding models on your own data. We'll walk through our evaluation comparing a general-purpose model (OpenAI's text-embedding-3-small) against a finance-specialized model (Voyage AI's finance-2) using real financial statements (SEC filings) as test data and share a reusable process you can adapt for your domain.

Here's what we learned from our tests—and how you can run similar comparisons yourself.

Embedding Models Compared: General vs. Domain-Specialized

In this embedding model evaluation, we will compare two models with different specializations:

  1. OpenAI's text-embedding-3-small (1,536 dimensions natively, shortened to 768 for this test)
  2. Voyage AI's voyage-finance-2 (1,024 dimensions)

We chose these models because they represent an interesting comparison: OpenAI's model is widely used and considered an industry standard for general-purpose embeddings, while Voyage AI's finance-2 model is specifically trained on financial text and documentation.

Do Models Understand Financial Context?

Given this text chunk from an SEC filing: "The Company's adjusted EBITDA increased by 15% year-over-year, primarily driven by operational efficiencies and market expansion, while maintaining a healthy debt-to-equity ratio of 0.8."

We tested questions like the following:

  • "What was the company's EBITDA growth?" (Short question)
  • "How do operational improvements and market growth impact the company's financial performance?" (Detailed question)
  • "What does the debt-to-equity ratio suggest about the company's financial health?" (Context-based question)

Essentially, we are testing how different embedding models handle financial terminology and relationships. By taking SEC filing chunks and generating various types of questions—from direct metrics to implied financial health—we can see if the models understand not just explicit financial data but also financial context and implications, which is how analysts typically analyze companies.
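
To make this concrete, here's a rough way to see what "understanding" means in vector terms: embed the filing chunk and a question with the same model and check how close they land. This is only a sketch (it assumes your pgai installation exposes the ai.voyageai_embed helper and pgvector's <=> cosine-distance operator); the evaluation later in this post automates the same idea across many chunks and question types.

-- Rough illustration: cosine similarity between a question and the chunk it was generated from
SELECT 1 - (
    ai.voyageai_embed('voyage-finance-2', 'What was the company''s EBITDA growth?')
    <=>
    ai.voyageai_embed('voyage-finance-2',
        'The Company''s adjusted EBITDA increased by 15% year-over-year, primarily driven by operational efficiencies and market expansion, while maintaining a healthy debt-to-equity ratio of 0.8.')
) AS cosine_similarity;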

Setting Up the Test Environment

Instead of building the test infrastructure from scratch, we'll use pgai Vectorizer to massively simplify the setup. You can follow the pgai Vectorizer quick start guide to get pgai running in just a few minutes.

Using pgai Vectorizer saves you significant development time by handling embedding operations directly in PostgreSQL. Here's what it can do:

  • Creates and updates embeddings automatically when source data changes
  • Supports models from OpenAI and specialized providers like Voyage AI
  • Handles text chunking with configurable settings
  • Manages the embedding queue and retries
  • Creates a view combining source data with embeddings

Since pgai Vectorizer is built on PostgreSQL, you can use familiar SQL commands and integrate them with your existing database. Who needs bespoke vector databases?

First, let's create our SEC filings table and load the data into the database. We have a convenient function that loads datasets from Hugging Face directly into your PostgreSQL database:

CREATE TABLE sec_filings (
    id SERIAL PRIMARY KEY,
    text text
);

SELECT ai.load_dataset(
    name => 'MemGPT/example-sec-filings',
    table_name => 'sec_filings',
    batch_size => 1000,
    max_batches => 10,
    if_table_exists => 'append'
);

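Once the load completes, a quick sanity check confirms the rows landed (the exact count depends on the batch_size and max_batches you used):

-- Simple sanity check on the loaded dataset
SELECT count(*) FROM sec_filings;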

Testing different embedding models is super simple with pgai Vectorizer. You simply create multiple vectorizers, each using a different model you want to evaluate. The vectorizer handles all the complexity of creating and managing embeddings for each model. Here's how we set up our two models to compare:

-- Set up OpenAI's general-purpose model
SELECT ai.create_vectorizer(
    'sec_filings'::regclass,
    destination => 'sec_filings_openai_embeddings',
    embedding => ai.embedding_openai(
        'text-embedding-3-small',
        768
    ),
    chunking => ai.chunking_recursive_character_text_splitter(
        'text',
        chunk_size => 512,
        chunk_overlap => 50
    )
);

-- Set up Voyage's finance-specialized model
SELECT ai.create_vectorizer(
    'sec_filings'::regclass,
    destination => 'sec_filings_voyage_embeddings',
    embedding => ai.embedding_voyageai(
        'voyage-finance-2',
        1024
    ),
    chunking => ai.chunking_recursive_character_text_splitter(
        'text',
        chunk_size => 512,
        chunk_overlap => 50
    )
);

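Both vectorizers pick up new and changed rows in the source table automatically, so you can keep comparing models as your data grows. While the initial embedding run is in progress, you can check on it via pgai's status view (assuming your pgai version ships ai.vectorizer_status):

-- See how many items are still queued for embedding per vectorizer
SELECT * FROM ai.vectorizer_status;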

You can query the generated embeddings directly in the embedding view for each model:

SELECT * FROM sec_filings_voyage_embeddings LIMIT 5;


If you want to try it out, here’s the full API reference for pgai Vectorizer.
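
Before running the evaluation, it's also worth confirming that both models embedded the same number of chunks so the comparison stays apples to apples (a rough check, assuming the view names created above):

-- Compare chunk counts across the two embedding views
SELECT 'openai' AS model, count(*) AS chunks FROM sec_filings_openai_embeddings
UNION ALL
SELECT 'voyage', count(*) FROM sec_filings_voyage_embeddings;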

The Evaluation Logic

This evaluation will focus on how well each model can find relevant text when given different types of questions. The methodology is as follows:

  1. Randomly select 20 text chunks from the dataset.
  2. Generate 20 questions per chunk, evenly distributed across five distinct types:
    1. Short questions (under 10 words) for testing basic comprehension.
    2. Long questions for detailed analysis.
    3. Direct questions about explicit content.
    4. Implied questions that require contextual understanding.
    5. Unclear questions to test handling of ambiguous queries.
  3. For each embedding model and each stored question, run a vector search against that model's embedding table and retrieve the top 10 (TOP_K) most similar chunks (see the SQL sketch after this list):
    1. Check if the question's source_chunk_id appears in the TOP_K results.
    2. Score binary: 1 if found, 0 if not found.
  4. Calculate the score per model:
    1. Sum the successful retrievals.
    2. Divide by the total number of questions (NUM_CHUNKS * NUM_QUESTIONS_PER_CHUNK).
    3. Tally up the results overall and by question type.

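For reference, the per-question lookup in step 3 boils down to ordering the embedding view by vector distance. Here's a minimal sketch, assuming the view's generated columns (id, chunk_seq, chunk, embedding), pgai's ai.voyageai_embed helper, and pgvector's <=> cosine-distance operator; the Python evaluation script wraps the equivalent query in its db.vector_search helper.

-- Top 10 most similar chunks for one question, using the Voyage embeddings view
SELECT id, chunk_seq, chunk
FROM sec_filings_voyage_embeddings
ORDER BY embedding <=> ai.voyageai_embed('voyage-finance-2', 'What was the company''s EBITDA growth?')
LIMIT 10;
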
The advantage of this evaluation method is its simplicity and the fact that you don't have to curate ground truth manually. In practice, we've seen this method work well, but it does have limitations (as all methods do): if the content in the dataset is too semantically similar, the questions generated by the large language model may not be specific enough to retrieve the chunk they were generated from, and the evaluation only checks whether the source chunk appears in the top-k results, not how highly it ranks. You should always spot-check any eval method on your particular dataset.

The Evaluation Code

The full evaluation code is available on GitHub if you want to run your own tests on different embedding models. Here are the key highlights from our financial evaluation code.

Create financially-focused test questions:

def generate_questions(self, chunk: str, question_type: str, count: int) -> List[str]:
    prompts = {
        'short': "Generate {count} short but challenging finance-specific questions about this SEC filing text. Questions should be under 10 words but test deep understanding:",
        'long': "Generate {count} detailed questions that require analyzing financial metrics, trends, and implications from this SEC filing text:",
        'direct': "Generate {count} questions about specific financial data, numbers, or statements explicitly mentioned in this SEC filing:",
        'implied': "Generate {count} questions about potential business risks, market implications, or strategic insights that can be inferred from this SEC filing:",
        'unclear': "Generate {count} intentionally ambiguous questions about financial concepts or business implications that require careful analysis of this SEC filing:"
    }

    system_prompt = """You are an expert in financial analysis and SEC filings.
Generate challenging, finance-specific questions that test deep understanding of financial concepts, 
business implications, and regulatory compliance. Questions should be difficult enough to 
challenge both general-purpose and finance-specialized language models."""

    prompt = prompts[question_type].format(count=count) + f"\n\nSEC Filing Text: {chunk}"

    questions = []
    max_retries = 3  # pulled from the script's config in the full version
    for attempt in range(max_retries):
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7 + (attempt * 0.1)  # nudge temperature up on each retry
        )
        # Parse the response into individual questions (simplified here --
        # see the full script on GitHub for cleanup and validation)
        content = response.choices[0].message.content
        questions = [line.strip() for line in content.split("\n") if line.strip()]
        if len(questions) >= count:
            return questions[:count]
    return questions


Evaluate how well each model understands financial context:

def step3_evaluate_models(self):
    """Test how well each model understands financial content"""
    print("Step 3: Evaluating models...")
    self.results = {}
    detailed_results = []

    for table in Config.EMBEDDING_TABLES:
        print(f"Testing {table}...")
        scores = []
        for q in self.questions_data:
            # Both embedding views are queried the same way; the table name
            # determines which model's embeddings are searched
            search_results = self.db.vector_search(
                table,
                q['question'],
                Config.TOP_K
            )

            found = any(
                r[0] == q['source_chunk_id'] and 
                r[1] == q['source_chunk_seq'] 
                for r in search_results
            )
            scores.append(1 if found else 0)

            detailed_results.append({
                'model': table,
                'question': q['question'],
                'question_type': q['question_type'],
                'found_correct_chunk': found,
                'num_results': len(search_results)
            })

        # Calculate accuracy by question type
        self.results[table] = {
            'overall_accuracy': sum(scores) / len(scores),
            'by_type': {
                q_type: sum(scores[i] for i, q in enumerate(self.questions_data) 
                           if q['question_type'] == q_type) / 
                           Config.QUESTION_DISTRIBUTION[q_type] / 
                           Config.NUM_CHUNKS
                for q_type in Config.QUESTION_DISTRIBUTION.keys()
            }
        }


The Results

Our evaluation comparing the finance-specialized Voyage model against OpenAI's general-purpose model revealed clear differences in their ability to handle financial text. Testing with about 10,000 rows of SEC filings data, the Voyage finance-2 model achieved 54% overall accuracy, significantly outperforming OpenAI's text-embedding-3-small at 38.5%.

The gap was most dramatic in direct financial queries, where Voyage reached 63.75% accuracy compared to OpenAI's 40%. Even with ambiguous financial questions, the specialized model maintained its edge at 62.5% versus 48.75%. This suggests that domain-specific training substantially improves the handling of financial terminology and concepts.

Cost and processing times showed interesting trade-offs. While Voyage took a few minutes to process our test data, OpenAI completed the task in under a minute at a cost below $0.01 for roughly 190,000 tokens. However, for applications heavily focused on financial data, the specialized model's 15.5-percentage-point overall accuracy advantage justifies the modest additional resource investment.

These results indicate that your choice between general and finance-specialized models should consider your needs: document volume, search patterns, accuracy requirements, and cost constraints. While specialized models require more resources, they could provide significant value through improved financial search accuracy and a better understanding of complex financial relationships.

How to Choose an Embedding Model

After reviewing the results, consider these factors when choosing between general and finance-specialized embedding models:

  1. Financial comprehension:
  • Direct financial queries: The specialized model showed 23.75 percentage points higher accuracy.
  • Context-based financial questions: The specialized model outperformed by 13.75 percentage points.
  • Ambiguous queries: Both models struggled, but the specialized model still led by 13.75 percentage points.
  2. Model characteristics:
  • Voyage finance-2 (1,024 dimensions): 54% overall accuracy, showing better financial understanding.
  • OpenAI text-embedding-3-small (768 dimensions): 38.5% overall accuracy, with faster processing.
  • Higher dimensionality helps capture complex financial relationships.
  3. Cost considerations:
  • Voyage: slightly higher resource usage but better financial accuracy. Voyage AI’s latest models (voyage-3-large and voyage-code-3) support Matryoshka embeddings and quantization to reduce cost.
  • OpenAI: faster and cheaper, but with less financial context.
  • The trade-off depends on your accuracy needs.

Before choosing your model, consider these practical factors:

  1. Document characteristics: Volume and type of financial documents affect total processing costs. Use pgai Vectorizer's chunking options to optimize for your content.
  2. Search requirements: Consider whether users need specific financial metrics or deeper financial relationship understanding.
  3. Performance needs: Balance accuracy against latency requirements. Cloud vs. self-hosted affects both costs and performance.
  4. Budget constraints: Factor in API costs, computing resources, and potential savings from better search accuracy.

Conclusion

Choosing between general and finance-specialized embedding models significantly impacts your application's effectiveness and costs. We used pgai Vectorizer to evaluate both approaches and provided a framework for making this decision.

Ready to test these models on your financial data? Install pgai Vectorizer and try it yourself. Handle all your embedding operations in PostgreSQL—no specialized databases needed.

Further reading

  1. Domain-Specific Embeddings: Finance Edition
  2. Pgai Vectorizer quick start for Voyage AI
  3. Power your AI applications with PostgreSQL and pgai
  4. MemGPT Example SEC Filings
