<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anna Morelia</title>
    <description>The latest articles on DEV Community by Anna Morelia (@anna_morelia).</description>
    <link>https://dev.to/anna_morelia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2918312%2Fdd4b1780-d468-4193-aad1-cb1629f305ca.jpg</url>
      <title>DEV Community: Anna Morelia</title>
      <link>https://dev.to/anna_morelia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anna_morelia"/>
    <language>en</language>
    <item>
      <title>Let's Make It Easy M-C-Peasy</title>
      <dc:creator>Anna Morelia</dc:creator>
      <pubDate>Tue, 09 Sep 2025 17:44:14 +0000</pubDate>
      <link>https://dev.to/stacklok/lets-make-it-easy-m-c-peasy-3p0m</link>
      <guid>https://dev.to/stacklok/lets-make-it-easy-m-c-peasy-3p0m</guid>
      <description>&lt;h2&gt;🎥 Kicking off &lt;em&gt;Easy M-C-Peasy&lt;/em&gt;: A new video series on MCP&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is popping up everywhere if you’re building with agents or LLMs. A lot of people are writing about it, but if you’re new to MCP, piecing together how it works, who to trust, and why you should care can feel a little overwhelming.&lt;/p&gt;

&lt;p&gt;That’s why we're starting a new video series called &lt;strong&gt;Easy M-C-Peasy&lt;/strong&gt; with short, practical explainers of MCP basics, architecture, and more. &lt;/p&gt;

&lt;p&gt;In this first episode, we take on the foundational question:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“What is MCP and why does it matter?”&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://www.youtube.com/embed/9HAH-y_HbKk"&gt;&lt;/iframe&gt;&lt;/p&gt;

&lt;h2&gt;Why Stacklok cares&lt;/h2&gt;

&lt;p&gt;At Stacklok, we’re working on &lt;a href="https://toolhive.dev" rel="noopener noreferrer"&gt;ToolHive&lt;/a&gt; — an open source project to make it easier (and safer) to connect AI agents with external tools. MCP is a big piece of that story, because it’s becoming the common language for how LLMs and agents plug into the world.  &lt;/p&gt;

&lt;p&gt;We don’t want MCP to feel like an insider-only protocol. The more people who understand it, the stronger the ecosystem gets — and the better we can all build. That’s why we’re investing time in making MCP approachable with this series.  &lt;/p&gt;

&lt;h2&gt;Full Playlist Available&lt;/h2&gt;

&lt;p&gt;Dive into the entire &lt;a href="https://youtube.com/playlist?list=PLYBL38zBWVIjvdL5eWRIjjVRR9VdcViE1&amp;amp;feature=shared" rel="noopener noreferrer"&gt;YouTube playlist&lt;/a&gt; to explore every video back-to-back.&lt;/p&gt;

&lt;h2&gt;Share what you're curious about&lt;/h2&gt;

&lt;p&gt;💡 If you’ve got questions about registries, security, auth, or just “how do I even get started?”, drop them in the comments. We’ll pull from those for future episodes.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>modelcontextprotocol</category>
      <category>toolhive</category>
    </item>
    <item>
      <title>Which LLMs Are (and Aren't) Ready for Secure Code?</title>
      <dc:creator>Anna Morelia</dc:creator>
      <pubDate>Tue, 22 Apr 2025 18:52:48 +0000</pubDate>
      <link>https://dev.to/stacklok/which-llms-are-and-arent-ready-for-secure-code-38ac</link>
      <guid>https://dev.to/stacklok/which-llms-are-and-arent-ready-for-secure-code-38ac</guid>
      <description>&lt;h2&gt;Using the LLM Security Leaderboard to Select Models for Safe and Sustainable Code&lt;/h2&gt;

&lt;p&gt;Most language model benchmarking and comparison is focused on speed and accuracy. But with AI code generation, language model choice affects the safety and sustainability of resulting code. While many popular AI code-generation approaches rely on frontier models from providers like OpenAI and Anthropic, small- and mid-sized open-source models have advanced significantly and address specific needs for speed, efficiency, privacy, security, and compliance. To ensure developers and enterprises make informed choices, we’ve launched the &lt;a href="https://huggingface.co/spaces/stacklok/llm-security-leaderboard#/" rel="noopener noreferrer"&gt;LLM Security Leaderboard on Hugging Face&lt;/a&gt; to evaluate open-source models across four (initial) security dimensions. We’re taking an open, community-driven approach to this evaluation, and encourage you to join us in refining this benchmark. &lt;/p&gt;

&lt;p&gt;You can read more about &lt;a href="https://dev.to/stacklok/announcing-the-llm-security-leaderboard-evaluating-ai-models-through-a-security-lens-379n"&gt;our criteria and methodology here&lt;/a&gt;. Below are our takeaways from this first wave of analysis:&lt;/p&gt;

&lt;h2&gt;Key Findings&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;All models struggle with Bad Package Detection&lt;/strong&gt;: Llama 3.2-3B led, but correctly flagged only ~29% of bad NPM and PyPI packages. Nearly all of the models we evaluated detected fewer than 5% of bad packages, and several popular models detected 0%: they simply provided instructions on how to install the package, whether or not it existed or contained a typo. These models put the responsibility for bad package detection squarely on the user. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CVE Knowledge is Alarmingly Low&lt;/strong&gt;: Awareness of Common Vulnerabilities and Exposures (CVEs) in dependencies is a basic requirement for secure code. Yet most models scored between 8% and 18% accuracy in this category. Qwen2.5-Coder-3B-Instruct led, but still scored a low 18.25%. These results suggest that the depth and consistency of CVE knowledge need to be significantly improved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insecure Code Recognition is a Mixed Bag&lt;/strong&gt;: Top models like Qwen2.5-Coder-32B-Instruct and microsoft/phi-4 successfully identified vulnerabilities in roughly half of the code snippets presented. Lower-performing models recognized vulnerabilities in fewer than a quarter of cases; the inconsistency underscores the need for more targeted training on secure coding practices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Size != Security&lt;/strong&gt;: While larger models often perform better on general benchmarks, security-specific performance varied significantly. Smaller models like Llama-3.2-3B-Instruct and IBM's Granite 3.3-2B-Instruct punched above their weight, reinforcing that sheer model size is not decisive and that architecture, training methodologies, and datasets play crucial roles in security capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Newer != Better&lt;/strong&gt;: Newer models like Qwen2.5-Coder-32B-Instruct (knowledge cutoff June 2024) and Granite-3.3-2B-Instruct (knowledge cutoff April 2024) have about the same or lower bad package and CVE detection capabilities as older models like Llama-3.2-3B-Instruct (knowledge cutoff March 2023), suggesting that these newer models were not trained on the latest bad package and CVE knowledge.&lt;/li&gt;
&lt;/ol&gt;
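
&lt;p&gt;Given finding #1, one practical mitigation is to never run a model-suggested install command verbatim. Here is a minimal sketch of the idea (the function names and the allowlist are ours for illustration, not part of the leaderboard tooling): extract package names from &lt;code&gt;pip install&lt;/code&gt; lines in model output, then flag anything that isn't in a known-good registry index such as a local mirror of the PyPI simple index.&lt;/p&gt;

```python
import re

def extract_pip_installs(llm_output: str) -> list[str]:
    """Pull package names out of `pip install ...` lines in model output."""
    names = []
    for line in llm_output.splitlines():
        match = re.search(r"\bpip install\s+(.+)", line)
        if match:
            # Skip flags like --upgrade; keep bare package names.
            names.extend(tok for tok in match.group(1).split()
                         if not tok.startswith("-"))
    return names

def flag_unknown(names: list[str], registry_index: set[str]) -> list[str]:
    """registry_index is a set of known-good package names.
    Anything missing from it is a hallucination or typo candidate."""
    return [n for n in names if n.lower() not in registry_index]
```

&lt;p&gt;For example, &lt;code&gt;flag_unknown(extract_pip_installs("pip install requests reqeusts"), {"requests"})&lt;/code&gt; returns &lt;code&gt;["reqeusts"]&lt;/code&gt;, surfacing the typo-squat candidate before it ever reaches &lt;code&gt;pip&lt;/code&gt;.&lt;/p&gt;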

&lt;h2&gt;What This Means for Developers and Researchers&lt;/h2&gt;

&lt;p&gt;These findings should guide how teams approach secure AI adoption for software development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select models thoughtfully, especially when using LLMs in security-sensitive codegen workflows.&lt;/li&gt;
&lt;li&gt;Prioritize secure prompting techniques: careless prompting can exacerbate vulnerabilities.&lt;/li&gt;
&lt;li&gt;Complement LLMs with security-aware tools, like Stacklok's open-source project &lt;a href="https://github.com/stacklok/codegate" rel="noopener noreferrer"&gt;CodeGate&lt;/a&gt;, to reinforce defenses.&lt;/li&gt;
&lt;li&gt;Augment LLMs with Retrieval-Augmented Generation (RAG), using knowledge from leading vulnerability datasets such as &lt;a href="https://nvd.nist.gov/" rel="noopener noreferrer"&gt;NVD&lt;/a&gt;, &lt;a href="https://osv.dev/" rel="noopener noreferrer"&gt;OSV&lt;/a&gt;, &lt;a href="https://www.insight.stacklok.com/" rel="noopener noreferrer"&gt;Stacklok Insight&lt;/a&gt;, etc.&lt;/li&gt;
&lt;li&gt;Push for better fine-tuning and training on security datasets across the community.&lt;/li&gt;
&lt;/ul&gt;
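
&lt;p&gt;To make the RAG suggestion concrete, here is a small sketch against OSV's public query endpoint (&lt;code&gt;POST https://api.osv.dev/v1/query&lt;/code&gt;). The helper names are hypothetical; the point is building the request body and flattening the response into short strings that can be injected into the model's context.&lt;/p&gt;

```python
import json

def build_osv_query(ecosystem: str, name: str, version: str) -> dict:
    """Request body for OSV's POST /v1/query endpoint (api.osv.dev)."""
    return {"package": {"ecosystem": ecosystem, "name": name},
            "version": version}

def summarize_vulns(osv_response: dict) -> list[str]:
    """Flatten an OSV response into 'ID: summary' lines suitable
    for pasting into an LLM prompt as retrieved context."""
    return [f"{v['id']}: {v.get('summary', '(no summary)')}"
            for v in osv_response.get("vulns", [])]

# Build the payload; send it with any HTTP client you already use.
payload = json.dumps(build_osv_query("PyPI", "jinja2", "2.4.1"))
```

&lt;p&gt;Because the model's own CVE recall is so low, retrieved summaries like these carry the vulnerability knowledge instead of relying on training data.&lt;/p&gt;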

&lt;h2&gt;Get Involved&lt;/h2&gt;

&lt;p&gt;This is just the beginning. The &lt;a href="https://huggingface.co/spaces/stacklok/llm-security-leaderboard#/" rel="noopener noreferrer"&gt;LLM Security Leaderboard is live at Hugging Face&lt;/a&gt;, and we're inviting the community to submit models, suggest new evaluation methods, and contribute to a stronger, safer AI ecosystem.&lt;/p&gt;

&lt;p&gt;Explore the leaderboard. Submit your models. Join the conversation.&lt;/p&gt;

&lt;p&gt;Let's build a future where AI coding is safe and secure.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>huggingface</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
