CDL Head of Architecture, Matt Eisengruber, and Data & AI Architect, Matt Houghton, assess the implications of DeepSeek as it emerges from China as a major player in the LLM space.
Part 1: Industry Coverage and Benchmarks
Introduction
In the rapidly evolving world of artificial intelligence, DeepSeek has emerged as a formidable contender in the landscape of Large Language Models (LLMs). Developed with a focus on high-performance reasoning and multilingual capabilities, DeepSeek gained traction for its open-source transparency and competitive benchmark results. As organisations increasingly rely on LLMs for automation, analytics, and customer engagement, DeepSeek’s rise signals a shift toward more accessible and customisable AI solutions.
“Open-source LLMs like DeepSeek are democratising access to cutting-edge AI, enabling innovation beyond the walls of Big Tech.” — Dr. Andrew Ng, AI Pioneer
Industry Coverage
The LLM space is currently dominated by models from OpenAI (GPT-4), Anthropic (Claude), Google DeepMind (Gemini), and Meta (LLaMA). However, the emergence of open-weight and open-source alternatives like Mistral and DeepSeek is reshaping the competitive landscape.
DeepSeek, developed by a Chinese AI research group, has positioned itself as a high-performing, multilingual model with strong reasoning capabilities. It supports both instruction-following and code generation tasks, making it versatile for enterprise use.
Key differentiators:
- Excellent price-for-performance ratio
- Open-source availability (Apache 2.0 licence)
- Multilingual support, including Chinese and English
- Strong performance on reasoning benchmarks like MATH and GSM8K
Benchmarks
DeepSeek has demonstrated impressive results across several industry-standard benchmarks:
Source: DeepSeek GitHub, llm-stats.com
Expert Opinions
At the 2025 AI Frontiers Conference, DeepSeek was highlighted as a “breakthrough in open source reasoning”, with several researchers praising its balance of performance and accessibility.
“DeepSeek’s performance on multilingual and mathematical reasoning tasks is a game-changer for global enterprises.” — Dr. Fei-Fei Li, Stanford University
Part 2: CDL’s approach to testing LLMs
Generative AI (GenAI) systems are inherently non-deterministic: the same input can produce different outputs on repeated runs, which makes it difficult to test and verify a system's behaviour reliably.
At CDL, we have introduced new testing approaches and tools to address these non-deterministic challenges in GenAI:
- Run multiple tests with the same input and analyse the distribution of outputs to understand the range of possible results.
- Develop specific metrics to measure different aspects of the output, such as coherence, accuracy, relevance, and bias, so quality can be assessed even when responses vary.
- Use a wide variety of input prompts to test the model's ability to handle different contexts and situations.
- Continuously monitor the performance of the model in real-world scenarios and use feedback to refine the training data and improve its accuracy.
- Move to intent-based testing where we focus on evaluating whether the output aligns with the intended meaning or purpose of the prompt rather than just checking for exact matches.
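The first and last points above can be sketched together: repeat the same prompt, look at the distribution of outputs, and score each one against the intended answer rather than demanding an exact match. The snippet below is a minimal illustration, using token overlap as a crude stand-in for real semantic scoring; the prompts and answers are invented:

```python
from collections import Counter

def _tokens(text):
    """Lowercase, split on whitespace, and strip trailing punctuation."""
    return {t.strip(".,!?") for t in text.lower().split()}

def output_distribution(outputs):
    """Count how often each distinct answer appears across repeated runs."""
    return Counter(outputs)

def intent_score(candidate, ground_truth):
    """Crude stand-in for semantic matching: the fraction of ground-truth
    tokens that also appear in the candidate answer (0.0 to 1.0)."""
    truth = _tokens(ground_truth)
    if not truth:
        return 0.0
    return len(truth & _tokens(candidate)) / len(truth)

# Simulated outputs from five runs of the same (invented) prompt.
runs = [
    "Your policy excess is 250 pounds.",
    "The excess on your policy is 250 pounds.",
    "Your policy excess is 250 pounds.",
    "Your excess is 250 pounds.",
    "Your policy excess is 250 pounds.",
]
dist = output_distribution(runs)
scores = [intent_score(r, "The policy excess is 250 pounds") for r in runs]
```

An exact-match check would fail most of these runs even though every answer carries the intended meaning; a distribution plus an intent score surfaces that distinction.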
Testing Methodology
Our process for testing a model starts with defining a set of questions, shown in the diagram as the prompts. These live in a JSON Lines file containing the question, the expected answer (also known as the ground truth data) and a category.
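As an illustration, a prompt file of this shape can be written and read back as JSON Lines, one record per line. The field names and example records below are assumptions for the sketch, not our exact schema:

```python
import json
import tempfile

# Illustrative prompt records: question, ground-truth answer, and category.
records = [
    {"question": "What is an insurance excess?",
     "ground_truth": "The amount the policyholder pays towards a claim.",
     "category": "definitions"},
    {"question": "Does a standard policy cover flood damage?",
     "ground_truth": "Only if flood cover is included in the policy schedule.",
     "category": "cover"},
]

# Write the prompt set as JSON Lines: one JSON object per line.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
    path = f.name

# Read it back the same way an evaluation harness would.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```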
The tests, known as evaluations, are run in two modes:
First, we run an automated LLM-as-a-judge evaluation. We ask the LLM under test to answer the question, then pass its answer along with the ground truth data to a second LLM and ask it to grade the first model's response.
The second mode is a human evaluation. We take the same prompts and ground truth data, but this time we ask a team of people to evaluate the model.
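A minimal sketch of the judge step in the first mode: build a grading prompt from the question, ground truth and candidate answer, then parse the second model's verdict. The rubric text and helper names here are illustrative, not our production prompt:

```python
# Illustrative judge prompt; the real rubric would be tuned to the domain.
JUDGE_TEMPLATE = """You are grading another model's answer.

Question: {question}
Ground truth: {ground_truth}
Candidate answer: {candidate}

Reply with a single word, CORRECT or INCORRECT, judged on meaning,
not exact wording."""

def build_judge_prompt(question, ground_truth, candidate):
    """Assemble the prompt sent to the judge model."""
    return JUDGE_TEMPLATE.format(
        question=question, ground_truth=ground_truth, candidate=candidate)

def parse_verdict(judge_reply):
    """Map the judge model's free-text reply to a boolean.
    Checks INCORRECT first, since it contains CORRECT as a substring."""
    reply = judge_reply.upper()
    return "CORRECT" in reply and "INCORRECT" not in reply
```

In practice `build_judge_prompt` would be sent to the judge model via the inference API, and `parse_verdict` applied to its reply; aggregating the booleans over the prompt set gives the automated accuracy figure.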
Evaluation output is shown in the Bedrock console, and all output and results are stored in S3 for further analysis. We have also enabled these tests as part of our CI/CD pipelines; you can read Matt Houghton's blog on how this was done.
We utilise the Amazon Bedrock evaluations feature to assess the performance and effectiveness of the model.
Amazon Bedrock computes our required performance metrics, such as the semantic robustness of a model and the correctness of a knowledge base in retrieving information and generating responses.
For model evaluations, we use both automatic evaluations and a team of human workers who rate the responses and provide their input. This approach gives us flexibility, such as drawing on both company employees and subject-matter experts from our industry, in this case insurance. We can also include retrieval-augmented generation (RAG) workloads in the assessment, validating that knowledge bases retrieve highly relevant information and generate useful, appropriate responses.
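As a sketch, the request for a Bedrock automatic evaluation job can be assembled as a plain dictionary before submission. Every ARN, bucket, dataset name, metric name and model identifier below is a placeholder, and the exact request shape should be checked against the boto3 `create_evaluation_job` reference before use:

```python
# Illustrative request body for a Bedrock automatic evaluation job.
# All identifiers are placeholders, not real resources.
job_request = {
    "jobName": "deepseek-rag-eval",                    # hypothetical job name
    "roleArn": "arn:aws:iam::123456789012:role/eval",  # placeholder ARN
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",
                "dataset": {
                    "name": "insurance-prompts",
                    "datasetLocation": {
                        "s3Uri": "s3://example-bucket/prompts.jsonl"},
                },
                "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
            }]
        }
    },
    "inferenceConfig": {
        "models": [{"bedrockModel": {
            "modelIdentifier": "example.model-id-v1:0"  # placeholder model ID
        }}]
    },
    "outputDataConfig": {"s3Uri": "s3://example-bucket/results/"},
}

# With AWS credentials and real resources in place, this would be submitted as:
#   import boto3
#   bedrock = boto3.client("bedrock")
#   bedrock.create_evaluation_job(**job_request)
```

Results land in the configured S3 output location, which is how the console view and the downstream analysis described above are fed.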
Test Results
As you can see from the table above, DeepSeek holds its own against other models, living up to its hype as a chatbot when fielding insurance-specific queries in a RAG-based architecture.
Part 3: Closing Thoughts on the Results and Possible Implications on the Insurance Industry
Summary of Findings
- DeepSeek R1 is a top-tier open-source LLM that could be used in the insurance industry as a chatbot.
- It performs competitively with proprietary models in most benchmarks.
- Its low hallucination rate and high accuracy make it suitable for enterprise applications.
Implications for the Insurance Industry
DeepSeek’s capabilities open new possibilities for insurers:
- Risk Assessment: Automating underwriting with accurate, explainable reasoning.
- Fraud Detection: Analysing patterns in claims with multilingual support.
- Customer Service: Deploying chatbots that understand complex queries.
“LLMs like DeepSeek can transform how insurers interact with customers and assess risk, especially in multilingual markets.” — Insurance AI Journal, May 2025
Future Outlook
As DeepSeek continues to evolve, we anticipate:
- Larger context windows for document-heavy industries
- Integration with retrieval-augmented generation (RAG) for real-time data access
- Domain-specific fine-tuning for insurance, legal, and healthcare sectors
Conclusion
DeepSeek is not just another LLM—it’s a signal of the growing power and potential of open-source AI. For the insurance industry, it represents a cost-effective, high-performance alternative to proprietary models. As we continue to explore its applications, stay tuned for future posts where we dive into other models as they emerge.