<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Scott Griffiths</title>
    <description>The latest articles on DEV Community by Scott Griffiths (@sgriffiths).</description>
    <link>https://dev.to/sgriffiths</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F594643%2Fdf17efc3-8ea7-4d92-b1b9-d007a8638dee.png</url>
      <title>DEV Community: Scott Griffiths</title>
      <link>https://dev.to/sgriffiths</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sgriffiths"/>
    <language>en</language>
    <item>
      <title>Interactive AI Safety Playgrounds: Enterprise AI in Action</title>
      <dc:creator>Scott Griffiths</dc:creator>
      <pubDate>Wed, 20 Aug 2025 22:51:44 +0000</pubDate>
      <link>https://dev.to/sgriffiths/interactive-ai-safety-playgrounds-see-the-future-of-enterprise-ai-in-action-424n</link>
      <guid>https://dev.to/sgriffiths/interactive-ai-safety-playgrounds-see-the-future-of-enterprise-ai-in-action-424n</guid>
      <description>&lt;p&gt;&lt;em&gt;Hands-on demonstration of OAS, DACP, BCE, and Cortex working together in real-time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Enterprise AI deployment has a dirty secret: most organizations spend more time fighting their AI systems than benefiting from them. Between runaway costs, unpredictable outputs, and security nightmares, it's no wonder that 73% of AI projects never make it to production.&lt;/p&gt;

&lt;p&gt;We've built something different. And now you can see it in action.&lt;/p&gt;

&lt;h2&gt;The Problem with AI Demos&lt;/h2&gt;

&lt;p&gt;Most AI safety demonstrations are either academic papers with toy examples or marketing slides with impossible promises. What you don't see are real systems handling real complexity at enterprise scale.&lt;/p&gt;

&lt;p&gt;That changes today.&lt;/p&gt;

&lt;p&gt;We've launched interactive playgrounds that let you experience the complete AI safety stack in your browser. No accounts, no setup, no bullshit. Just click and explore the technology that's solving enterprise AI's biggest problems.&lt;/p&gt;

&lt;h2&gt;Try the Technology Stack&lt;/h2&gt;

&lt;h3&gt;🛠️ &lt;a href="https://primevector.dev/engine-playground" rel="noopener noreferrer"&gt;OAS Engine Playground&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it demonstrates:&lt;/strong&gt; How to generate production-ready AI agents from simple YAML specifications&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Most organizations struggle to standardize AI agent development across teams. Our Open Agent Spec (OAS) lets you define agents declaratively and generate code for 6 different LLM engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll see:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real YAML-to-code generation in your browser&lt;/li&gt;
&lt;li&gt;Side-by-side comparison of OpenAI, Claude, Grok, Local, Custom, and Cortex engines&lt;/li&gt;
&lt;li&gt;Complete agent specifications with behavioral contracts&lt;/li&gt;
&lt;li&gt;Actual CLI commands you'd run in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The magic moment:&lt;/strong&gt; Watch identical agent specifications generate completely different integration code for each LLM provider, while maintaining consistent behavior through behavioral contracts.&lt;/p&gt;
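&lt;p&gt;As a rough sketch of that idea (the field names and templates here are illustrative, not the actual OAS schema), one declarative spec can drive per-engine code generation:&lt;/p&gt;

```python
# Illustrative sketch of declarative agent generation. Field names and
# engine templates are hypothetical, not the real OAS schema.
AGENT_SPEC = {
    "name": "security-analyst",
    "role": "Analyse incoming security events and score risk",
    "engine": "openai",  # openai, claude, grok, local, custom, or cortex
    "behavioural_contract": {
        "temperature_range": [0.1, 0.5],
        "required_fields": ["risk_assessment", "confidence_level"],
    },
}

# Per-engine client boilerplate; the spec itself never changes.
ENGINE_TEMPLATES = {
    "openai": "from openai import OpenAI\nclient = OpenAI()",
    "claude": "import anthropic\nclient = anthropic.Anthropic()",
}

def generate_agent(spec: dict) -> str:
    """Render engine-specific integration code from one declarative spec."""
    contract = spec["behavioural_contract"]
    return (
        f"{ENGINE_TEMPLATES[spec['engine']]}\n"
        f"# Agent: {spec['name']}\n"
        f"# Contract: temperature in {contract['temperature_range']}, "
        f"required fields {contract['required_fields']}\n"
    )

print(generate_agent(AGENT_SPEC))
```

&lt;p&gt;Swapping &lt;code&gt;engine&lt;/code&gt; to &lt;code&gt;claude&lt;/code&gt; changes only the generated integration code; the behavioural contract travels with the spec.&lt;/p&gt;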

&lt;h3&gt;🔄 &lt;a href="https://primevector.dev/workflow-playground" rel="noopener noreferrer"&gt;DACP Workflow Playground&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it demonstrates:&lt;/strong&gt; Multi-agent orchestration with declarative workflow definitions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Enterprise AI isn't about single-shot queries. It's about coordinated workflows where multiple AI agents collaborate to solve complex problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll see:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time multi-agent communication across different LLM providers&lt;/li&gt;
&lt;li&gt;3-stage security operations pipeline (threat analysis → risk assessment → incident response)&lt;/li&gt;
&lt;li&gt;Agent-to-agent message routing with conditional escalation&lt;/li&gt;
&lt;li&gt;Live workflow visualization with progress tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The magic moment:&lt;/strong&gt; Watch a security incident flow through three different AI agents (Claude for threat analysis, Claude for risk assessment, OpenAI for incident response) with automatic escalation based on risk scores.&lt;/p&gt;
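&lt;p&gt;The escalation flow behind that demo can be sketched in a few lines (agent names follow the demo pipeline; the 7.0 threshold is an illustrative assumption):&lt;/p&gt;

```python
# Illustrative sketch of conditional escalation in a multi-agent workflow.
# The 7.0 risk threshold is an assumed example value, not a product default.
def route_incident(risk_score: float) -> list[str]:
    """Return the ordered agent pipeline an incident flows through."""
    stages = ["threat-analyzer", "risk-assessor"]  # Claude, Claude
    if risk_score >= 7.0:                          # conditional escalation
        stages.append("incident-responder")        # OpenAI
    return stages

print(route_incident(8.2))  # high risk escalates to the responder
```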

&lt;h3&gt;🛡️ &lt;a href="https://primevector.dev/security-dashboard" rel="noopener noreferrer"&gt;Live Security Dashboard&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it demonstrates:&lt;/strong&gt; Real-time AI safety monitoring with behavioral contract enforcement&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; You can't manage what you can't measure. Enterprise AI needs comprehensive monitoring, not just "is the API responding?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll see:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live threat detection pipeline processing real security events&lt;/li&gt;
&lt;li&gt;Behavioral contract validation in action&lt;/li&gt;
&lt;li&gt;Multi-stage agent processing with performance metrics&lt;/li&gt;
&lt;li&gt;Cost optimization recommendations from Cortex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The magic moment:&lt;/strong&gt; Watch behavioral contracts catch and correct AI agent violations in real-time, maintaining system reliability even when individual agents misbehave.&lt;/p&gt;

&lt;h2&gt;The Cost Optimization Revolution&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Traditional enterprise AI consultants will quote you $200k-500k for a basic multi-agent deployment. Our Cortex cost optimization technology achieves 85-95% cost reduction through intelligent routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See it yourself:&lt;/strong&gt; Run the OAS playground and watch Cortex analyze your agent requirements in real-time, automatically selecting the optimal engine based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per token (40% weighting)&lt;/li&gt;
&lt;li&gt;Response speed (30% weighting)
&lt;/li&gt;
&lt;li&gt;Reliability score (30% weighting)&lt;/li&gt;
&lt;/ul&gt;
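&lt;p&gt;The weighting above amounts to a simple weighted score per engine. A minimal sketch, assuming normalised scores where higher is better (the per-engine numbers are made up for illustration):&lt;/p&gt;

```python
# Hypothetical sketch of weighted engine selection. The 40/30/30 split
# matches the article; the per-engine stats are invented example data.
WEIGHTS = {"cost": 0.40, "speed": 0.30, "reliability": 0.30}

# Normalised scores in [0, 1]; higher is better, so a cheap engine
# scores high on "cost".
ENGINE_STATS = {
    "openai": {"cost": 0.30, "speed": 0.80, "reliability": 0.95},
    "claude": {"cost": 0.40, "speed": 0.70, "reliability": 0.95},
    "local":  {"cost": 0.95, "speed": 0.60, "reliability": 0.80},
}

def select_engine(stats: dict) -> str:
    """Return the engine with the highest weighted score."""
    def score(engine: str) -> float:
        return sum(WEIGHTS[k] * stats[engine][k] for k in WEIGHTS)
    return max(stats, key=score)

print(select_engine(ENGINE_STATS))  # "local" wins on this example data
```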

&lt;p&gt;The result? Enterprise-grade AI capabilities at startup prices.&lt;/p&gt;

&lt;h2&gt;What Makes This Different&lt;/h2&gt;

&lt;h3&gt;Real Production Code&lt;/h3&gt;

&lt;p&gt;These aren't demos or mockups. Every YAML specification generates actual production code you could deploy today. The behavioral contracts are the same ones protecting our live systems.&lt;/p&gt;

&lt;h3&gt;Multi-Engine Reality&lt;/h3&gt;

&lt;p&gt;Most demos show you one LLM provider. We show you six, including cost optimization routing that saves 90%+ on API calls while maintaining quality.&lt;/p&gt;

&lt;h3&gt;Enterprise Complexity&lt;/h3&gt;

&lt;p&gt;Our workflows demonstrate real enterprise scenarios: security operations, compliance monitoring, incident response. Not toy examples, but the complex coordination enterprises actually need.&lt;/p&gt;

&lt;h3&gt;Transparent Technology&lt;/h3&gt;

&lt;p&gt;Every algorithm is explained. Every cost calculation is shown. Every routing decision is justified. No black boxes, no "trust us" moments.&lt;/p&gt;

&lt;h2&gt;The Technology Behind the Magic&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Open Agent Spec (OAS):&lt;/strong&gt; Command-line tool for generating AI agents from YAML specifications. Supports 6 engines with behavioral contract enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Declarative Agent Communication Protocol (DACP):&lt;/strong&gt; Workflow orchestration language for multi-agent systems. Think Kubernetes for AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavioral Contract Engineering (BCE):&lt;/strong&gt; 5-stage validation pipeline ensuring AI agents operate within defined safety boundaries. Includes context-aware validation that prevents hallucinations by ensuring outputs are grounded in actual input data rather than fabricated information.&lt;/p&gt;
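&lt;p&gt;A minimal sketch of the grounding idea (a toy check, not the actual BCE implementation): verify that entities the output asserts actually appear in the input it was given.&lt;/p&gt;

```python
# Toy sketch of context-aware grounding validation, not the real BCE
# pipeline: flag outputs that assert entities absent from the input.
import re

def extract_entities(text: str) -> set:
    """Naive entity extraction: capitalised words and numbers, lowercased."""
    return {m.lower() for m in re.findall(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)}

def is_grounded(input_text: str, output_text: str) -> bool:
    """Pass only if every entity the output asserts appears in the input."""
    ungrounded = extract_entities(output_text) - extract_entities(input_text)
    return not ungrounded

event = "Failed login burst from 10.0.0.5 targeting Admin accounts"
ok = "Failed logins against Admin accounts from 10.0.0.5 detected"
hallucinated = "Confirmed breach by Lazarus group from 10.0.0.5"

print(is_grounded(event, ok), is_grounded(event, hallucinated))
```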

&lt;p&gt;&lt;strong&gt;Cortex Cost Optimization:&lt;/strong&gt; Intelligent routing system achieving 85-95% cost reduction through real-time engine selection.&lt;/p&gt;

&lt;h2&gt;Try It Now&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with OAS:&lt;/strong&gt; Generate a security agent and see YAML-to-code magic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore DACP:&lt;/strong&gt; Watch multi-agent workflows coordinate across LLM providers
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor with BCE:&lt;/strong&gt; See real-time safety validation in action&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No registration required. No sales calls. Just technology that works.&lt;/p&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;We're actively deploying this stack with enterprise clients achieving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;94% reduction in AI-related security incidents&lt;/li&gt;
&lt;li&gt;90% cost savings through Cortex optimization&lt;/li&gt;
&lt;li&gt;10x faster agent development through OAS standardization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're tired of AI projects that promise the moon and deliver PowerPoints, these playgrounds show what's actually possible when you build AI safety into the foundation rather than bolting it on afterward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to see your AI systems actually work reliably?&lt;/strong&gt; &lt;a href="https://primevector.dev" rel="noopener noreferrer"&gt;Start with the playgrounds&lt;/a&gt;, then let's talk about bringing this technology to your organization.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built and deployed by &lt;a href="https://primevector.com.au" rel="noopener noreferrer"&gt;PrimeVector&lt;/a&gt;, the AI safety consultancy that shows its work.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Unified AI Safety Platform</title>
      <dc:creator>Scott Griffiths</dc:creator>
      <pubDate>Thu, 14 Aug 2025 00:58:48 +0000</pubDate>
      <link>https://dev.to/sgriffiths/building-a-unified-ai-safety-platform-53l1</link>
      <guid>https://dev.to/sgriffiths/building-a-unified-ai-safety-platform-53l1</guid>
      <description>&lt;h2&gt;The Challenge: Enterprise AI Safety at Scale&lt;/h2&gt;

&lt;p&gt;As organizations rush to deploy AI agents in production, they face a critical trilemma: &lt;strong&gt;security&lt;/strong&gt;, &lt;strong&gt;cost&lt;/strong&gt;, and &lt;strong&gt;performance&lt;/strong&gt;. Current solutions force you to trade one off against the others: you can have secure AI that's expensive, or cost-effective AI with security gaps.&lt;/p&gt;

&lt;p&gt;After working with enterprises struggling with AI deployment, we identified four key pain points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fragmented safety tools&lt;/strong&gt; that don't work together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No real-time monitoring&lt;/strong&gt; of AI agent behavior &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost explosion&lt;/strong&gt; when implementing proper safety measures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of multi-agent coordination&lt;/strong&gt; and communication standards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our response was to build an integrated platform that addresses all these challenges simultaneously.&lt;/p&gt;

&lt;h2&gt;The Architecture: Breaking the Trilemma with 4 Unified Technologies&lt;/h2&gt;

&lt;p&gt;Rather than accepting the traditional trade-offs, we designed a unified platform where each technology addresses a different aspect of the &lt;strong&gt;security-cost-performance trilemma&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;1. &lt;strong&gt;Behavioral Contract Engineering (BCE)&lt;/strong&gt; - &lt;em&gt;Solves Security&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;A 5-stage validation pipeline that ensures AI safety without sacrificing performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Validation → Contract Check → Security Analysis → Response Generation → Output Validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature controls&lt;/strong&gt; prevent erratic behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content filtering&lt;/strong&gt; blocks harmful outputs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII protection&lt;/strong&gt; ensures compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination detection&lt;/strong&gt; maintains accuracy&lt;/li&gt;
&lt;/ul&gt;
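&lt;p&gt;The five stages above can be sketched as a simple chain of validators where a failure at any stage halts the pipeline (a toy illustration, not the production code; the checks and field names are assumptions):&lt;/p&gt;

```python
# Toy illustration of a 5-stage validation pipeline. Each stage either
# passes the payload through or raises, so any failure halts the chain.
class ContractViolation(Exception):
    pass

def input_validation(payload: dict) -> dict:
    if not payload.get("prompt"):
        raise ContractViolation("empty prompt")
    return payload

def contract_check(payload: dict) -> dict:
    temp = payload.get("temperature", 0.3)
    if not 0.1 <= temp <= 0.5:  # temperature control
        raise ContractViolation(f"temperature {temp} out of range")
    return payload

def security_analysis(payload: dict) -> dict:
    if "ssn" in payload["prompt"].lower():  # crude PII screen
        raise ContractViolation("PII detected")
    return payload

def response_generation(payload: dict) -> dict:
    payload["response"] = f"analysed: {payload['prompt']}"  # stub LLM call
    return payload

def output_validation(payload: dict) -> dict:
    if payload["prompt"] not in payload["response"]:  # grounding check
        raise ContractViolation("ungrounded output")
    return payload

PIPELINE = [input_validation, contract_check, security_analysis,
            response_generation, output_validation]

def run(payload: dict) -> dict:
    for stage in PIPELINE:
        payload = stage(payload)
    return payload

print(run({"prompt": "failed login burst", "temperature": 0.2})["response"])
```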

&lt;h3&gt;2. &lt;strong&gt;Open Agent Stack (OAS)&lt;/strong&gt; - &lt;em&gt;Solves Performance&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;Multi-engine AI framework that optimizes performance across 5 LLM providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; (GPT-4, GPT-3.5)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt; (Claude 3.5 Sonnet)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;xAI&lt;/strong&gt; (Grok)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local&lt;/strong&gt; (Ollama, privacy-focused)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom&lt;/strong&gt; (your own LLM implementations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engine-agnostic design means behavioral contracts work identically across all providers.&lt;/p&gt;

&lt;h3&gt;3. &lt;strong&gt;Distributed Agent Communication Protocol (DACP)&lt;/strong&gt; - &lt;em&gt;Enhances Performance&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;Enables sophisticated multi-agent workflows for complex tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: 3-stage security workflow
&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threat-analyzer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;analyze_threat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk-assessor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;assess_risk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident-responder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;coordinate_response&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Conditional routing based on risk scores
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;risk_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;7.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;escalate_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident-responder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;4. &lt;strong&gt;Cortex Cost Optimization&lt;/strong&gt; - &lt;em&gt;Solves Cost&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;3-layer intelligent routing system that dramatically reduces AI costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 (Sensory)&lt;/strong&gt;: Simple pattern matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 (ONNX)&lt;/strong&gt;: Local ML models for common tasks
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3 (LLM)&lt;/strong&gt;: Full reasoning for complex scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results: &lt;strong&gt;85-95% cost reduction&lt;/strong&gt; while maintaining safety standards.&lt;/p&gt;

&lt;h2&gt;Implementation Deep-Dive&lt;/h2&gt;

&lt;h3&gt;Database Schema for Unified Monitoring&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Agent task tracking across all technologies&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;agent_tasks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intelligence_engine&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- OpenAI, Claude, etc.&lt;/span&gt;
    &lt;span class="n"&gt;current_stage&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- BCE validation stage&lt;/span&gt;
    &lt;span class="n"&gt;progress&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;-- 0-100%&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;-- AI confidence level&lt;/span&gt;
    &lt;span class="n"&gt;total_duration_ms&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;DATETIME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="nb"&gt;DATETIME&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Real-time metrics for dashboard&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;agent_metrics&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_type&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- cost, success_rate, response_time&lt;/span&gt;
    &lt;span class="n"&gt;metric_value&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="nb"&gt;DATETIME&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;API Design for Cross-Technology Integration&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Unified API endpoints
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v1/unified/agents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_agents&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns agents from OAS, DACP, BCE, and Cortex&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v1/unified/tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tasks&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Real-time task monitoring across all systems&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v1/unified/metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_metrics&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Cost optimization and security metrics&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Production Deployment Architecture&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI + SQLAlchemy + Alembic migrations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Streamlit with real-time updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: SQLite (dev) / PostgreSQL (prod)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Ubuntu + Nginx + SSL via Let's Encrypt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Custom metrics collection + systemd services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Automated Ubuntu deployment script&lt;/span&gt;
./scripts/deploy_ubuntu.sh

&lt;span class="c"&gt;# Sets up:&lt;/span&gt;
&lt;span class="c"&gt;# - Python virtual environment&lt;/span&gt;
&lt;span class="c"&gt;# - Database with migrations  &lt;/span&gt;
&lt;span class="c"&gt;# - Nginx reverse proxy&lt;/span&gt;
&lt;span class="c"&gt;# - SSL certificates&lt;/span&gt;
&lt;span class="c"&gt;# - Systemd services&lt;/span&gt;
&lt;span class="c"&gt;# - UFW firewall configuration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Real-World Results&lt;/h2&gt;

&lt;h3&gt;Live Demo Platform&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🔗 &lt;a href="https://bce.primevector.dev" rel="noopener noreferrer"&gt;https://bce.primevector.dev&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7onu974fnf9xzxhdzqyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7onu974fnf9xzxhdzqyo.png" alt="Unified AI Safety Platform - System Overview" width="800" height="365"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;System Overview dashboard showing real-time integration of OAS, DACP, BCE, and Cortex technologies&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The platform showcases real-time integration of all 4 technologies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Overview Tab:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-engine task processing across AI providers&lt;/li&gt;
&lt;li&gt;Real-time system health monitoring&lt;/li&gt;
&lt;li&gt;Cost savings visualization from Cortex optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Live Agent Processing Tab:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time agent activity across OAS engines&lt;/li&gt;
&lt;li&gt;BCE security validation in progress&lt;/li&gt;
&lt;li&gt;DACP workflow coordination&lt;/li&gt;
&lt;li&gt;Per-agent cost tracking and optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;BCE Security Pipeline Tab:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5-stage validation process visualization&lt;/li&gt;
&lt;li&gt;Contract success rate monitoring with safety maintained&lt;/li&gt;
&lt;li&gt;Active threat blocking and violation management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technology Stack Tab:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete architecture explanation&lt;/li&gt;
&lt;li&gt;Integration points between systems&lt;/li&gt;
&lt;li&gt;Performance metrics and capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblhcuuf4fm3l0x55y607.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblhcuuf4fm3l0x55y607.png" alt="Live Agent Processing Dashboard" width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Live Agent Processing showing real-time agent activity, BCE validation stages, and cost optimization across multiple AI engines&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Example Performance Metrics&lt;/h3&gt;

&lt;p&gt;From a typical production deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Security Performance:
- 88% contract success rate (industry target: &amp;gt;85%)
- 6.7ms average validation time (target: &amp;lt;10ms)
- 408 security threats blocked automatically
- 30.2% threat blocking rate

💰 Cost Optimization:
- 90% of tasks routed to Layer 2 (ONNX)
- 27% total cost reduction achieved
- $1.35 average savings per 1,000 tasks
- Real-time cost tracking per agent

🔄 Agent Coordination:
- 15+ active behavioral contracts
- Multi-engine workflow support
- Conditional escalation workflows
- 99.4% agent communication success rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Technical Challenges Solved&lt;/h2&gt;

&lt;h3&gt;1. &lt;strong&gt;Cross-Engine Behavioral Contracts&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Making safety rules work identically across OpenAI, Claude, and Grok required careful abstraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@behavioural_contract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;temperature_control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;response_contract&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required_fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_assessment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence_level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_threat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threat_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SecurityOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Same contract works for any engine
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;engine_router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threat_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;2. &lt;strong&gt;Real-Time Cost Tracking&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Cortex optimization required transparent cost calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_routing_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_complexity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LAYER_1_SENSORY&lt;/span&gt;  &lt;span class="c1"&gt;# $0.0001 per task
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;task_complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LAYER_2_ONNX&lt;/span&gt;     &lt;span class="c1"&gt;# $0.001 per task  
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LAYER_3_LLM&lt;/span&gt;      &lt;span class="c1"&gt;# $0.01 per task
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Agent State Synchronization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;DACP workflows needed reliable agent communication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WorkflowRuntime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Update unified task tracking
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_agent_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute with BCE validation
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_with_contracts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Route to next agent if needed
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_escalation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route_to_next_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Open Source Contributions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repositories:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OAS Framework&lt;/strong&gt;: &lt;a href="https://github.com/prime-vector/open-agent-spec" rel="noopener noreferrer"&gt;https://github.com/prime-vector/open-agent-spec&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DACP Protocol&lt;/strong&gt;: Integrated workflow orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Community Contributions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added 5-engine support to OAS&lt;/li&gt;
&lt;li&gt;Integrated behavioral testing framework&lt;/li&gt;
&lt;li&gt;Created security agent templates for rapid deployment&lt;/li&gt;
&lt;li&gt;Published PyPI package for easy installation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://bce.primevector.dev" rel="noopener noreferrer"&gt;https://bce.primevector.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;open-agent-spec
oas init &lt;span class="nt"&gt;--spec&lt;/span&gt; security-threat-analyzer.yaml &lt;span class="nt"&gt;--output&lt;/span&gt; my_agent/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Breaking the AI Trilemma: A New Paradigm
&lt;/h2&gt;

&lt;p&gt;Traditional AI deployment forces an impossible choice between security, cost, and performance. Our unified platform proves this trilemma is a false constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Old Paradigm:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔒 &lt;strong&gt;High Security&lt;/strong&gt; = High cost, slower performance&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Low Cost&lt;/strong&gt; = Security risks, limited functionality
&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;High Performance&lt;/strong&gt; = Expensive, potential safety gaps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The New Reality:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🛡️ &lt;strong&gt;Advanced Security&lt;/strong&gt; via BCE behavioral contracts&lt;/li&gt;
&lt;li&gt;💸 &lt;strong&gt;85-95% Cost Reduction&lt;/strong&gt; through Cortex optimization&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Enhanced Performance&lt;/strong&gt; with OAS multi-engine + DACP coordination&lt;/li&gt;
&lt;/ul&gt;
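
&lt;p&gt;As a rough illustration of where that cost reduction comes from, here is the blended per-task cost implied by the per-layer prices in the routing snippet above. The 60/30/10 traffic split is an assumption for the sketch, not a measured figure:&lt;/p&gt;

```python
# Sketch: blended per-task cost under tiered routing, using the
# illustrative per-layer prices from the routing snippet above.
# The traffic split is a hypothetical assumption, not measured data.

LAYER_COST = {"sensory": 0.0001, "onnx": 0.001, "llm": 0.01}   # USD per task
TRAFFIC_SPLIT = {"sensory": 0.60, "onnx": 0.30, "llm": 0.10}   # assumed mix

blended = sum(LAYER_COST[k] * TRAFFIC_SPLIT[k] for k in LAYER_COST)
savings = 1 - blended / LAYER_COST["llm"]  # versus routing everything to the LLM

print(f"blended cost per task: ${blended:.5f}")        # $0.00136
print(f"savings vs all-LLM baseline: {savings:.1%}")   # 86.4%
```

&lt;p&gt;With that assumed mix, the blended cost lands roughly 86% below an all-LLM baseline, consistent with the 85-95% range quoted above.&lt;/p&gt;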

&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Integration over isolation&lt;/strong&gt; - Unified platforms outperform point solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The trilemma is solvable&lt;/strong&gt; - Smart architecture achieves all three goals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time monitoring is essential&lt;/strong&gt; - You can't manage what you can't see&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-engine support future-proofs&lt;/strong&gt; your AI investments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The future of enterprise AI isn't about choosing between safety, cost, and performance. It's about architecting systems that deliver all three simultaneously.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What challenges are you facing with AI safety in production? Let's discuss in the comments below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔗 GitHub:&lt;/strong&gt; &lt;a href="https://github.com/prime-vector" rel="noopener noreferrer"&gt;https://github.com/prime-vector&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;🌐 Live Demo:&lt;/strong&gt; &lt;a href="https://bce.primevector.dev" rel="noopener noreferrer"&gt;https://bce.primevector.dev&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;💼 Enterprise AI Consulting:&lt;/strong&gt; &lt;a href="https://primevector.com.au/" rel="noopener noreferrer"&gt;https://primevector.com.au/&lt;/a&gt;  &lt;/p&gt;







</description>
      <category>ai</category>
      <category>security</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>Mastering Reliability in High-Velocity Software Development</title>
      <dc:creator>Scott Griffiths</dc:creator>
      <pubDate>Wed, 15 Nov 2023 03:42:52 +0000</pubDate>
      <link>https://dev.to/sgriffiths/better-management-through-measurement-mastering-reliability-in-high-velocity-software-development-34n5</link>
      <guid>https://dev.to/sgriffiths/better-management-through-measurement-mastering-reliability-in-high-velocity-software-development-34n5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the high-speed world of modern software development, where the DevOps culture pushes for ever-increasing velocity in delivering new features and updates. However, in this race towards faster deployment, a critical question often emerges: Are we sacrificing reliability for speed? This is where Site Reliability Engineering (SRE) plays a pivotal role.&lt;/p&gt;

&lt;p&gt;In this blog, we're zooming in on SRE and how it answers the call for balancing the DevOps-driven pursuit of speed with the uncompromising need for reliable systems. SRE isn't just about firefighting operational issues; it’s about strategically managing service reliability using tools like Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets. Join us as we explore how SRE navigates the velocity/reliability trade-off, ensuring that rapid development complements, rather than compromises, system stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding SLOs and SLIs in an SRE Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9013clkolgf1luwuk5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9013clkolgf1luwuk5l.png" alt="Measurement" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the fast-paced world of DevOps, where the goal is to deploy features rapidly, the need for a framework to ensure these deployments are reliably executed becomes paramount. This is where Service Level Objectives (SLOs) and Service Level Indicators (SLIs) come into play, serving as the cornerstone of SRE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Level Objectives (SLOs)&lt;/strong&gt; are essentially goals set for the reliability of a service. They are the benchmarks against which a service's performance is measured, ensuring that the drive for speed doesn't compromise quality. For example, an SLO might specify that "99.95% of all requests should be successful," setting a clear expectation for service reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Level Indicators (SLIs)&lt;/strong&gt;, on the other hand, are the actual metrics used to gauge the performance of the service against these objectives. In our example, the SLI would measure the real percentage of successful requests over a period. If the SLI shows that 99.97% of requests were successful, the service is exceeding its SLO; if it falls to 99.90%, it’s a signal that the service might not meet the set objective.&lt;/p&gt;
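
&lt;p&gt;The SLO check from that example can be sketched in a few lines (the request counts are illustrative):&lt;/p&gt;

```python
# Sketch of the availability SLI from the example above: the measured
# share of successful requests in a window, compared against the SLO.

SLO_TARGET = 0.9995  # "99.95% of all requests should be successful"

def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    return successful / total

sli = availability_sli(successful=999_700, total=1_000_000)  # 99.97%
meets_slo = sli >= SLO_TARGET

print(f"SLI = {sli:.2%}, SLO met: {meets_slo}")  # SLI = 99.97%, SLO met: True
```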

&lt;p&gt;In the context of SRE, SLOs and SLIs are not just numbers; they are tools that bridge the gap between the rapid deployment ethos of DevOps and the essential need for system reliability. &lt;br&gt;
By continuously monitoring SLIs in relation to SLOs, SRE teams can identify and address reliability issues before they escalate. This proactive approach allows for fast-paced development and deployment while maintaining the high standards of service quality that users expect and depend on.&lt;/p&gt;

&lt;p&gt;SLOs and SLIs also foster a culture of transparency and accountability. They provide clear, objective data that teams can rally around, reducing subjective debates and focusing efforts on measurable outcomes. This clarity is crucial in environments where the speed of DevOps can often lead to ambiguity about service performance and user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role of Error Budgets in Balancing Innovation and Reliability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxyg2fribsuy0z5fh4b9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxyg2fribsuy0z5fh4b9.png" alt="Error Budgets" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Error budgets serve as a critical tool in Site Reliability Engineering, quantifying the acceptable level of risk or unreliability in a system. These budgets are directly derived from Service Level Objectives (SLOs). For instance, if an SLO dictates that a service must maintain 99.95% uptime, this implies an error budget of 0.05% downtime. This allowance provides a quantifiable metric to balance the need for system stability with the desire for continuous innovation.&lt;/p&gt;
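
&lt;p&gt;That 0.05% allowance becomes concrete once you multiply it out over a measurement window; a quick sketch, assuming a 30-day window:&lt;/p&gt;

```python
# Sketch: turning a 99.95% uptime SLO into a concrete error budget of
# downtime minutes, over an assumed 30-day window.

SLO = 0.9995
WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day window

error_budget_fraction = 1 - SLO                      # the 0.05% allowance
budget_minutes = WINDOW_MINUTES * error_budget_fraction

print(f"error budget: {budget_minutes:.1f} minutes per 30 days")  # 21.6
```

&lt;p&gt;At 99.95%, that leaves roughly 21.6 minutes of allowable downtime per 30 days.&lt;/p&gt;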

&lt;h3&gt;
  
  
  Guiding Development and Operational Decisions
&lt;/h3&gt;

&lt;p&gt;Error budgets influence key decisions regarding software development and operations. When there is remaining error budget, teams might be more inclined to push new features, updates, or experiments, knowing that there's a cushion to absorb potential reliability impacts. Conversely, if the error budget is close to being exhausted, it signals the need to focus on stabilising and improving the current system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Budgets as a Communication Tool
&lt;/h3&gt;

&lt;p&gt;One of the most significant aspects of error budgets is their role in enhancing communication within and across teams. By having a clear, quantifiable measure of system reliability, teams can align on priorities and risks. It helps avoid the subjective debate about whether the system is 'reliable enough' and instead provides a data-driven approach to assess system performance and make informed decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring and Responding to Error Budget Consumption
&lt;/h3&gt;

&lt;p&gt;Monitoring the consumption of the error budget is crucial. Teams should set up alerts to notify when the budget is being consumed at a rate that might warrant attention. This proactive approach enables teams to address issues before they escalate and exhaust the budget.&lt;/p&gt;
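
&lt;p&gt;One common way to implement such alerts is a burn-rate check: compare the observed error ratio to the rate that would exactly exhaust the budget over the window. A minimal sketch, with assumed thresholds:&lt;/p&gt;

```python
# Sketch of burn-rate alerting: a burn rate of 1.0 consumes the error
# budget exactly by the end of the window; higher values exhaust it
# sooner. The thresholds below are assumptions for illustration.

def burn_rate(error_ratio: float, slo: float) -> float:
    """Multiples of the 'sustainable' burn: 1.0 exactly exhausts the budget."""
    return error_ratio / (1 - slo)

SLO = 0.9995
recent_error_ratio = 0.0070   # 0.7% of requests failing in the last hour

rate = burn_rate(recent_error_ratio, SLO)
if rate > 14.4:       # fast-burn page threshold (an assumption here)
    print(f"page: budget burning at {rate:.1f}x")
elif rate > 1.0:      # slower burn: raise a ticket for follow-up
    print(f"ticket: budget burning at {rate:.1f}x")
```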

&lt;h3&gt;
  
  
  Learning from Error Budget Expenditures
&lt;/h3&gt;

&lt;p&gt;Finally, how an error budget is expended can provide valuable insights into the system’s reliability and the effectiveness of current practices. Analysing instances where the error budget was consumed can reveal patterns, systemic weaknesses, and opportunities for improvement. This analysis can drive a continuous improvement cycle, where learnings are integrated back into development and operational processes, enhancing the system's overall reliability and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  DORA Metrics and SRE
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvf8mqjx2aipcket561g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvf8mqjx2aipcket561g.png" alt="DORA" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Deployment Frequency&lt;br&gt;
This metric measures how often an organisation successfully releases to production. A high deployment frequency is often a sign of a robust and agile development process. In the context of SLOs and SLIs, frequent deployments should not compromise the reliability and performance of the service. If the service consistently meets its SLOs, it indicates that the organisation can maintain reliability even with frequent updates and changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lead Time for Changes&lt;br&gt;
Lead time for changes is the duration from code commit to code successfully running in production. Shorter lead times can indicate a more efficient development and deployment process. However, it's crucial that these rapid changes do not adversely affect service reliability, which is where SLOs come into play. Ensuring that changes adhere to predefined SLOs helps maintain service stability despite the speed of deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change Failure Rate&lt;br&gt;
This metric tracks the percentage of changes that result in a failure in the production environment. A high change failure rate might suggest issues in the testing or deployment processes. The relationship between change failure rate and error budgets is significant. If the error budget is consistently exhausted due to high failure rates, it's a clear indicator that the focus needs to shift towards improving reliability and perhaps re-evaluating the SLOs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time to Restore Service&lt;br&gt;
This measures the time it takes to restore a service after a failure or incident. An essential aspect of SRE, a shorter time to restore service directly contributes to the efficient use of the error budget. It reflects the team’s ability to quickly respond to and resolve issues, ensuring that the service adheres to its SLOs. In the context of DevSecOps, this metric underscores the importance of having robust incident management and rapid response systems in place.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
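
&lt;p&gt;Two of these metrics can be computed directly from deployment records. A minimal sketch with hypothetical data (the field names are illustrative, not a real API):&lt;/p&gt;

```python
# Sketch: computing two of the four DORA metrics from a hypothetical
# deployment log. Field names and data are illustrative only.
from datetime import datetime, timedelta

deployments = [
    {"committed": datetime(2023, 11, 1, 9, 0),  "deployed": datetime(2023, 11, 1, 13, 0), "failed": False},
    {"committed": datetime(2023, 11, 2, 10, 0), "deployed": datetime(2023, 11, 2, 12, 0), "failed": True},
    {"committed": datetime(2023, 11, 3, 9, 30), "deployed": datetime(2023, 11, 3, 10, 30), "failed": False},
    {"committed": datetime(2023, 11, 4, 14, 0), "deployed": datetime(2023, 11, 4, 15, 0), "failed": False},
]

# Lead time for changes: commit-to-production duration, averaged
lead_times = [d["deployed"] - d["committed"] for d in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change failure rate: share of deployments that failed in production
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print(f"lead time for changes: {mean_lead_time}")           # 2:00:00
print(f"change failure rate:   {change_failure_rate:.0%}")  # 25%
```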

&lt;h2&gt;
  
  
  Integrating DORA Metrics with SLO/SLI
&lt;/h2&gt;

&lt;p&gt;The DORA metrics complement SLOs and SLIs by providing a broader view of the software delivery and operational stability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment Frequency&lt;/strong&gt;: Aligns with SLIs by measuring how often a team successfully releases to production, reflecting the velocity and reliability of new features or updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lead Time for Changes&lt;/strong&gt;: Can be influenced by SLOs to ensure that rapid changes do not compromise service reliability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Change Failure Rate&lt;/strong&gt;: Directly relates to the error budget. Exceeding the budget due to high failure rates would necessitate a shift in focus towards reliability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time to Restore Service&lt;/strong&gt;: Is an SLI that is critical to maintaining the error budget. A shorter time to restore service means less consumed budget and more room for innovation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Examples and Case Studies
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ej2cnc2eqc9zf0evkm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ej2cnc2eqc9zf0evkm3.png" alt="Group" width="800" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study 1: The Importance of Defining SLOs and SLIs
&lt;/h2&gt;

&lt;p&gt;In a recent engagement, we observed that there were no clear Service Level Objectives (SLOs) and no defined Service Level Indicators (SLIs). This absence led to a lack of awareness around response times and system performance, and as a result the team was often reactive, rather than proactive, in managing system reliability.&lt;/p&gt;

&lt;p&gt;The introduction of SLOs and SLIs would enable the company to set measurable targets for system performance and reliability. &lt;br&gt;
By doing so, they could shift from a reactive to a proactive stance, ensuring that performance issues are identified and addressed before impacting the end users. This change would not only improve system reliability but also enhance customer satisfaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study 2: The Gap in Alerting and Accountability
&lt;/h2&gt;

&lt;p&gt;Another observation was the lack of effective alerting, especially in lower environments. Many alerts were turned off due to excessive email notifications, leading to a 'cry wolf' scenario where important alerts were lost amidst the noise.&lt;/p&gt;

&lt;p&gt;This situation was compounded by a lack of accountability around errors and no clear error budget strategy. &lt;br&gt;
Errors were often overlooked unless they had a high impact, leading to a culture where only major issues received attention. &lt;/p&gt;

&lt;p&gt;The introduction of a well-thought-out error budget and a more refined alerting system could encourage a more balanced approach to error management. It would help the team to track and respond to both major and minor issues effectively, thereby improving overall system health and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study 3: The Need for Unified Dashboards for Efficient Troubleshooting
&lt;/h2&gt;

&lt;p&gt;The absence of unified dashboards in a recent engagement presented a significant challenge in monitoring and troubleshooting. Engineers often faced difficulties in determining whether issues were environment-related or application-specific. This uncertainty led to increased resolution times and often unnecessary debugging efforts.&lt;/p&gt;

&lt;p&gt;By implementing unified dashboards, the company could dramatically streamline its troubleshooting process. These dashboards would provide a comprehensive view of the system’s health across different environments, making it easier to pinpoint the root cause of issues. For instance, if a problem occurs only in the production environment but not in development or testing, it's more likely to be environment-specific rather than a flaw in the application itself.&lt;/p&gt;

&lt;p&gt;This clarity is invaluable. It not only speeds up the resolution of issues but also helps in efficiently allocating resources. Engineers can focus their efforts on the actual problem area—be it environmental configurations or application code—rather than getting bogged down in unnecessary investigations. Moreover, this approach can lead to a more structured and effective debugging process, reducing downtime and enhancing overall system reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embracing a Culture of Reliability in SRE
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyq6y6fawhabrq8kh09x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyq6y6fawhabrq8kh09x.png" alt="Culture" width="800" height="625"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the heart of SRE lies a commitment to building and nurturing a culture of reliability. This isn't about a set-and-forget approach to system stability; it's about creating an environment where reliability is continuously pursued, measured, and improved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous Learning from Incidents&lt;/strong&gt;: In SRE, incidents are not just challenges to be overcome but opportunities for learning. Each incident, be it minor or major, is a chance to delve deeper into the workings of the system, understand its weaknesses, and fortify its strengths. This approach ensures that the team doesn’t just fix issues but learns from them, enhancing the overall resilience of the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embracing Feedback&lt;/strong&gt;: Feedback, both from within the team and from users, is a cornerstone of SRE. It's not just about identifying what went wrong but also understanding what can be done better. By actively seeking and valuing feedback, SRE teams can adapt their practices, tools, and approaches to meet the evolving needs of the system and its users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous Process Improvement&lt;/strong&gt;: SRE is an iterative process. Tools and strategies like SLOs, SLIs, and error budgets are not static. They evolve as the team gains new insights, as the software changes, and as user expectations grow. &lt;br&gt;
This continuous improvement is crucial for ensuring that the organisation not only meets its current reliability targets but is also well-prepared to handle future challenges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling with Confidence&lt;/strong&gt;: The culture of reliability fostered by SRE empowers organisations to scale their operations and systems with confidence. Knowing that reliability is ingrained in the process, and not an afterthought, gives teams the confidence to innovate and expand, secure in the knowledge that the system’s stability is being continuously monitored and enhanced.&lt;/p&gt;

&lt;p&gt;In essence, embracing a culture of reliability in SRE is about creating a dynamic, responsive, and resilient approach to software development and system operations. It's about ensuring that reliability is at the forefront of every decision, every strategy, and every action. &lt;br&gt;
This culture is the bedrock upon which organisations can build systems that are not only technologically advanced but also dependable and robust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnksphrbzz7p72jry0tvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnksphrbzz7p72jry0tvj.png" alt="Conclusion" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the interplay between the DevOps drive for high velocity and the SRE focus on reliability, we find a harmonious balance that defines the future of software development and system operations. SRE, with its robust framework of SLOs, SLIs, and error budgets, empowers organisations to embrace the speed of DevOps without losing sight of system stability and user experience. It’s about building and maintaining resilient, user-centric systems that not only move fast but also stand strong. In this evolving landscape, SRE emerges not just as a methodology, but as a necessary paradigm to ensure that our pursuit of speed fortifies, rather than undermines, the reliability of our systems.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>observability</category>
      <category>devsecops</category>
    </item>
    <item>
      <title>GitOps - CD for cloud native apps</title>
      <dc:creator>Scott Griffiths</dc:creator>
      <pubDate>Tue, 08 Nov 2022 11:05:18 +0000</pubDate>
      <link>https://dev.to/sgriffiths/gitops-cd-for-cloud-native-apps-2fdo</link>
      <guid>https://dev.to/sgriffiths/gitops-cd-for-cloud-native-apps-2fdo</guid>
      <description>&lt;p&gt;&lt;strong&gt;Tldr&lt;/strong&gt;;&lt;br&gt;
GitOps is a pull based model that uses Git as the source of truth for application and Infra code. State (Actual vs Desired) is managed via an operator that runs in your Kubernetes cluster&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is It
&lt;/h2&gt;

&lt;p&gt;GitOps is a paradigm for Kubernetes cluster management that uses Git as the source of truth for declarative applications and infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Is It Different
&lt;/h2&gt;

&lt;p&gt;GitOps is a pull-based model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The majority of CI/CD tools available today use a push-based model: code starts in the CI system and then moves through a series of scripted steps that push changes to the Kubernetes cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;'Pull' refers to the operator installed in the cluster that watches the image repository for new updates&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Use This Approach
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitOps takes full advantage of the move towards immutable infrastructure and declarative container orchestration&lt;/li&gt;
&lt;li&gt;The approach helps to prevent configuration drift&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Does This Look Like
&lt;/h2&gt;

&lt;p&gt;In a pull pipeline, a Kubernetes Operator reads new images from the image repository from inside of the cluster.&lt;/p&gt;

&lt;p&gt;At the centre of the GitOps pattern is the Operator/Agent. It monitors both the single source of truth (a config repo containing the deployment manifests) and the actual state in the cluster&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F326wjtkcxakksjz1qh0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F326wjtkcxakksjz1qh0j.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;br&gt;
The Operator constantly monitors the Actual State in the cluster, and the Desired State defined in the Repo&lt;/p&gt;

&lt;h2&gt;
  
  
  Separation of Concerns
&lt;/h2&gt;

&lt;p&gt;The pipelines communicate only through Git updates:&lt;/p&gt;

&lt;p&gt;Whenever Git is updated, the Operator is notified.&lt;br&gt;
Whenever the Operator detects drift, monitoring and alerting tooling are notified&lt;/p&gt;
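&lt;p&gt;&lt;em&gt;Conceptually, the Operator's reconcile loop is a continuous comparison of the two states. A toy shell sketch (the file contents below are placeholders standing in for real manifests):&lt;/em&gt;&lt;/p&gt;

```shell
# Toy sketch of the Operator's reconcile check: compare desired state
# (from the config repo) with actual state (from the cluster).
# The files and values below are placeholders, not real manifests.
echo "replicas: 3" > desired.txt   # what Git says we want
echo "replicas: 2" > actual.txt    # what the cluster is running

if diff -q desired.txt actual.txt >/dev/null; then
  echo "in sync"
else
  echo "drift detected - converge the cluster towards the desired state"
fi
```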

&lt;h2&gt;
  
  
  Benefits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Prod state matches your test env’s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: With Git’s capability to revert/rollback and fork, you gain stable and reproducible rollbacks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer experience&lt;/strong&gt;: Focus on dev code rather than Kubernetes experience (faster onboarding)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Standards and consistency&lt;/strong&gt;: One model for apps, infrastructure and Kubernetes changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced security&lt;/strong&gt;: Reduced potential to expose credentials outside of your cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GitOps/SRE - 3 Initialisms
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2khg353urhl0u639f9eo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2khg353urhl0u639f9eo.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Argocd in 5 Mins (Example)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites (To be installed and running)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.docker.com/products/docker-desktop" rel="noopener noreferrer"&gt;Docker / Kubernetes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.atlassian.com/git/tutorials/install-git" rel="noopener noreferrer"&gt;Git&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/" rel="noopener noreferrer"&gt;Kubectl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set Alias&lt;br&gt;
&lt;code&gt;alias k=kubectl&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Create Namespace and Install Argocd in Your Local Cluster
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k create namespace argocd

git clone https://github.com/marcel-dempers/docker-development-youtube-series.git

cd docker-development-youtube-series/argo/

k -n argocd apply -f argo-cd/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  View Running Pods
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k -n argocd get pods&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Port Forwarding and Retrieve Login Credentials
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k port-forward svc/argocd-server -n argocd 8080:443

k get pods -n argocd -l app.kubernetes.io/name=argocd-server -o name | cut -d'/' -f 2

username: admin
password: (result of the above query - the initial admin password is the argocd-server pod name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The UI should then be reachable at &lt;code&gt;https://localhost:8080&lt;/code&gt;&lt;/p&gt;



&lt;h4&gt;
  
  
  Deploy Sample App and View in the UI
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k apply -n argocd -f argo-cd/app.yaml&lt;/code&gt;&lt;/p&gt;
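&lt;p&gt;&lt;em&gt;For context, the app.yaml applied above is an Argo CD Application resource. A minimal sketch of what such a manifest might contain (the repoURL, path and names below are placeholders, not the tutorial repo's actual values):&lt;/em&gt;&lt;/p&gt;

```shell
# Write a minimal, hypothetical Argo CD Application manifest.
# repoURL, path and names are placeholders - see argo-cd/app.yaml in the
# tutorial repo for the real values.
cat <<'EOF' > example-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/config-repo.git
    targetRevision: HEAD
    path: deploy
  destination:
    server: https://kubernetes.default.svc
    namespace: default
EOF
echo "wrote example-app.yaml"
```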

&lt;h4&gt;
  
  
  Delete / Cleanup
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k delete -n argocd -f argo-cd/app.yaml
k -n argocd delete -f argo-cd/install.yaml
k delete namespace argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Useful Tools
&lt;/h4&gt;

</description>
      <category>gitop</category>
      <category>sre</category>
      <category>devops</category>
      <category>cd</category>
    </item>
    <item>
      <title>Software Test Automation - The Functional checks</title>
      <dc:creator>Scott Griffiths</dc:creator>
      <pubDate>Sat, 23 Oct 2021 04:40:38 +0000</pubDate>
      <link>https://dev.to/sgriffiths/software-test-automation-the-functional-checks-3fa0</link>
      <guid>https://dev.to/sgriffiths/software-test-automation-the-functional-checks-3fa0</guid>
      <description>&lt;p&gt;&lt;em&gt;Can we increase our understanding and expectations of a system by combining various functional automation tests at different steps within the development lifecycle?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We look at some of the fundamental disciplines of unit, integration, API, UI and infrastructure automation, and how a distributed (through the SDLC) yet centralised (for dashboards, reporting, alerts) approach can lower the barrier to entry and provide faster feedback&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What might an ideal automation distribution look like if we split a percentage of functional checks across each part of the &lt;a href="https://en.wikipedia.org/wiki/Systems_development_life_cycle" rel="noopener noreferrer"&gt;SDLC&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;And if we had the opportunity to run different automated tests across the development lifecycle, it might look something like this&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/DQUHiRt5bJjkDUht5" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fipp3ki5kt6reh0p8qk.png" alt="The Ideal State" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; &lt;em&gt;Performance (which runs across four cycles: Development, Test, Deploy, Operate) and pen testing are not included, given they are more non-functionally focused. For more on performance engineering, check out the blog &lt;a href="https://scottgriffiths.me/blog/reliability_the_performance_edition" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's take a look at each of the automated functional checks that we would usually implement to test the known state of our application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unit&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API (Application programming interface)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI (User interface)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Tests are generally categorized as &lt;strong&gt;low, medium&lt;/strong&gt; or &lt;strong&gt;high&lt;/strong&gt; level. The higher the level, the more complicated and expensive the test, and the longer it takes to execute, implement, troubleshoot and maintain&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The unit test
&lt;/h2&gt;

&lt;p&gt;A fast-running test against a method or function that validates its behavior.&lt;br&gt;
We give an input and expect a certain output&lt;/p&gt;

&lt;p&gt;Due to their quick feedback they are ideal for running locally, in the CI pipeline and as a 1st line of defense in the CD pipeline&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/8YvZrDUBMWrSzmHQA" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bovc66q5ryvm2jcnzwj.png" alt="Unit Test" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Integration test
&lt;/h2&gt;

&lt;p&gt;Used to confirm integration with other dependencies (APIs, databases, message hubs). They provide fast feedback and are useful to determine that you are interacting correctly with the required dependencies&lt;/p&gt;

&lt;p&gt;A lot of the time these dependencies are &lt;a href="https://devopedia.org/mock-testing" rel="noopener noreferrer"&gt;mocked&lt;/a&gt; for behaviour verification and/or &lt;a href="https://en.wikipedia.org/wiki/Test_stub" rel="noopener noreferrer"&gt;stubbed&lt;/a&gt; for state verification, so you have less dependence on 3rd party services&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/Jvavw2a3t7fpQhEk6" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazy0ah972avbui4wmqr2.png" alt="The Integration Test" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure test
&lt;/h2&gt;

&lt;p&gt;Used to verify infrastructure behavior and can include checks on directory permissions, running processes and services, open ports, node counts, storage accounts etc.&lt;/p&gt;

&lt;p&gt;Handy to run these upon application deployment (&lt;a href="https://en.wikipedia.org/wiki/Virtual_machine" rel="noopener noreferrer"&gt;VM's&lt;/a&gt; &amp;amp; &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;) or when releasing a new &lt;a href="https://whatis.techtarget.com/definition/standard-operating-environment-SOE" rel="noopener noreferrer"&gt;SOE&lt;/a&gt; (standard operating environment)&lt;/p&gt;

&lt;p&gt;These are often underutilized, and can help round off a well orchestrated automation approach&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/kUhHU1rqMswdfXpQA" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8pnxegylo633ivjcu7j.png" alt="The Infra Test" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The API test
&lt;/h2&gt;

&lt;p&gt;The API test often triggers a sequence of actions. You send a request and expect a particular response code with the right payload&lt;/p&gt;

&lt;p&gt;It can usually give you good feedback that a number of parts of the system are working as expected (APIs, DBs, hubs, caches, load balancers)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/bpmH3eupjh1zBBn47" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexgppoc6t6fm9qluxsxe.png" alt="The API Test" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The UI test
&lt;/h2&gt;

&lt;p&gt;The UI test drives the application through its interface the way a user would: performing actions on screen and verifying what is rendered&lt;/p&gt;

&lt;p&gt;As the highest-level functional check it exercises much of the stack end to end, which makes it valuable but also the slowest and most expensive to run and maintain&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/vEHYrjsuTg9eVLZZ7" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtckjbe4ublv12wdtc30.png" alt="The UI Test" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Security test
&lt;/h2&gt;

&lt;p&gt;A complex topic, however at a high level we want to know whether we have exposed ourselves to vulnerabilities in our code, containers and infrastructure&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/t3waw2EcC2jY8ZuT7" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b6134qvxl0zwrdvkg6c.png" alt="The Security Test" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automation Observability
&lt;/h2&gt;

&lt;p&gt;We want to understand how all of our different suites of automation are performing across all environments at any one time&lt;/p&gt;

&lt;p&gt;To do this we need to collate the data from each source and present it back as something useful, such as a dashboard&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/efXztniwNjUNc56m7" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexjg4cr5h5ekmlvi6va8.png" alt="Observability" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We introduce &lt;strong&gt;TLO's&lt;/strong&gt; (test level objectives), &lt;strong&gt;TLA's&lt;/strong&gt; (test level agreements) and &lt;strong&gt;TLI's&lt;/strong&gt; (test level indicators), which are defined at design time to align with the team and business objectives.&lt;/p&gt;

&lt;p&gt;They look to bring more clarity, accountability and transparency to the automation being executed. They also open communication channels and help to frame objectives&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Summary&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The goal is distributed automation, where tests execute at each stage of the development lifecycle&lt;br&gt;
and where their data is collated in a centralised manner and exposed through a series of dashboards&lt;/p&gt;

&lt;p&gt;This leads to a more sustainable, resilient automation solution that detects problems early, so they can be fixed more easily&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Performance Engineering - The Reliability Edition</title>
      <dc:creator>Scott Griffiths</dc:creator>
      <pubDate>Mon, 22 Mar 2021 19:59:05 +0000</pubDate>
      <link>https://dev.to/sgriffiths/performance-engineering-the-reliability-edition-m9k</link>
      <guid>https://dev.to/sgriffiths/performance-engineering-the-reliability-edition-m9k</guid>
      <description>&lt;h4&gt;
  
  
  &lt;strong&gt;Question&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Can we improve the reliability of a system by employing various performance engineering techniques to different stages of the development process?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is a look at how a solid &lt;strong&gt;Performance Engineering&lt;/strong&gt; strategy can use &lt;strong&gt;Reliability&lt;/strong&gt; principles and DevOps idealisms to complement and strengthen current or proposed performance initiatives&lt;/p&gt;

&lt;p&gt;These approaches attempt to achieve better business cohesion, reliability and velocity. To do this we can apply various methodologies from Performance Engineering, using Shift Left and Move Right approaches that extend traditional performance testing techniques&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;At its core, to understand an application's performance we need&lt;/strong&gt;
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;A mechanism to run load against an application or system&lt;/li&gt;
&lt;li&gt;A way of measuring how they performed&lt;/li&gt;
&lt;li&gt;A way of comparing the results against what we believe is the ideal state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each area of performance within the DevOps model has its part to play. That is, they all relate in some shape or form to the principles around building, defining and maintaining a reliable system&lt;/p&gt;

&lt;h2&gt;
  
  
  In a nutshell
&lt;/h2&gt;

&lt;p&gt;Each Performance execution and analysis piece should look to be guided by the Engineering Efficiency, DevOps and Reliability principles that apply to software development&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/beByCbJbjipV3cZUA" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztl2qcj3glhwok0dse7p.png" alt="The Breakdown" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability Engineering(RE)&lt;/strong&gt; attempts to predict and prevent the risk of there being a failure whether that be a component or an entire system of services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance Engineering(PE)&lt;/strong&gt; states we should start earlier in the &lt;a href="https://en.wikipedia.org/wiki/Systems_development_life_cycle" rel="noopener noreferrer"&gt;SDLC&lt;/a&gt; to get faster feedback, but also extends into Operations and Support to use real world data to build/update of the performance models (scripts and analysis)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance Testing (PT&lt;/strong&gt;) is all about determining what the performance of an application is (&lt;strong&gt;baselining&lt;/strong&gt;) or comparing to how you believe it should be(&lt;strong&gt;delta analysis&lt;/strong&gt;) under various conditions and situations in the 'test' environment&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A look at Performance Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;PE looks to incorporate the methodologies of &lt;strong&gt;'Agile'&lt;/strong&gt; and use these in conjunction with &lt;strong&gt;'DevOps'&lt;/strong&gt; idealisms in order to provide an improved approach that adds value, rather than one that tends to hinder delivery velocity&lt;/p&gt;

&lt;p&gt;We can do this by adopting a shift-left / move-right approach that incorporates cloud-first performance automation. This can then lead to reduced feedback cycles (&lt;strong&gt;velocity increase&lt;/strong&gt;) and bottlenecks/bugs being caught early on (&lt;strong&gt;reliability increase&lt;/strong&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Performance Engineering Model&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PE&lt;/strong&gt; is all about applying process and strategies at each step of the SDLC, the following are example actions/options that can be applied within each vertical&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/86LeiNxjVgrGFtci7" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsue8bc6h6rpxs4umsvou.png" alt="PE Model" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The idea being that performance is a consideration at each step in the software lifecycle, The captured metrics are gathered from Dev, Test, Deploy and Operations and used to refine the next cycle of performance&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Traditional performance testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Quite often done within the &lt;a href="https://en.wikipedia.org/wiki/Systems_development_life_cycle#Testing" rel="noopener noreferrer"&gt;test phase&lt;/a&gt;, this entails a big-bang approach that uses many pods/VMs to generate load against an application/system&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/8tJ2BFpGWGiPV5EBA" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gi969uac77757b4sgqb.png" alt="PE Traditional" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Pro's&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Con's&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Simulates real world conditions as closely as possible&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Often an integrated (shared) environment, which can affect results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integrated tests execute against multiple components at once&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is often 'test' data which could affect behaviour/results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tools can replicate thousands (if not more) of users&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Replicating 'Prod' environments can be expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extensive metrics/reports from tool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Finding the root cause when diagnosing issues can be complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Commercial tooling can be expensive to operate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Performance/Reliability options to improve efficiencies, engagement and observability
&lt;/h2&gt;




&lt;p&gt;We can attempt to achieve this using a combination of PE, RE and DevOps principles and methodologies&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Shift Left Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Reducing the SDLC feedback loop to uncover and rectify potential system and environment issues early&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/megEUntMiiqo2jFX9" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbz390w39ku55odbzl7ji.png" alt="Shift Left" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Shift Left Benefits&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team cohesion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Foster developer engagement and contribution.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fewer bugs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduced development costs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Improved performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detect and eliminate bottlenecks early.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reduced risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Find bugs and performance issues earlier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed up time-to-market&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Having more trust in your applications and infrastructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Move Right Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A "Move Right" approach extends testing out to include user feedback and metrics from your production environment. This can then be used to update the performance model that's developed as a consequence&lt;/p&gt;

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/fZR7igu1XReeZioW7" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1356707is62op9ul0xlg.png" alt="Move Right" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Move Right Benefits&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Increased User experience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tests closer match the actions expected by your users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Responding faster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams have more involvement and ownership over how the performance information is presented back&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design hypothesis evaluated&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Assumptions are reflected upon and appropriate action can be taken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Various performance management options&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Many different tools for being able to change traffic flows that can alter performance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Measurements and Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Performance metrics from each environment (&lt;strong&gt;Dev/Test/Prod&lt;/strong&gt;) are used to determine whether we are within &lt;a href="https://sre.google/resources/practices-and-processes/art-of-slos/" rel="noopener noreferrer"&gt;SLO&lt;/a&gt; limits.&lt;/p&gt;

&lt;p&gt;The idea is that we can understand and easily record local (&lt;strong&gt;component&lt;/strong&gt;) and integrated (&lt;strong&gt;end-to-end&lt;/strong&gt;) metrics to provide better performance transparency. These are then compared to the ideal state&lt;/p&gt;

&lt;p&gt;These SLO's can be measured through &lt;a href="https://sre.google/workbook/implementing-slos/" rel="noopener noreferrer"&gt;SLI's&lt;/a&gt; (SLI specifications and SLI implementations) and compared to our error budget to measure tolerance&lt;/p&gt;
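&lt;p&gt;&lt;em&gt;As a worked example (assuming a 30-day month and a hypothetical 99.9% availability SLO):&lt;/em&gt;&lt;/p&gt;

```shell
# Error budget for a hypothetical 99.9% monthly availability SLO.
# budget = total minutes in the window * (1 - SLO target)
awk 'BEGIN {
  total  = 30 * 24 * 60           # 43200 minutes in a 30-day month
  budget = total * (1 - 0.999)    # allowed minutes of unreliability
  printf "error budget: %.1f minutes/month\n", budget
}'
```

&lt;p&gt;&lt;em&gt;Roughly 43 minutes of unreliability per month before the budget is spent&lt;/em&gt;&lt;/p&gt;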

&lt;p&gt;&lt;a href="https://photos.app.goo.gl/zwah66Awbw1zfqZ9A" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98fetiakmfrpg0fghkm8.png" alt="Observability" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The aim is to obtain a current-state view of our application's performance in each environment and at each stage of the SDLC. This is then compared against our business performance expectations, defined in the SLO's and measured by the SLI's&lt;/em&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;strong&gt;Performance SLI implementations could include:&lt;/strong&gt;
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;API / UI response times&lt;/li&gt;
&lt;li&gt;DB transaction times&lt;/li&gt;
&lt;li&gt;Pod / VM scaling events&lt;/li&gt;
&lt;li&gt;CPU use / Network activity / Memory usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these can be defined and compared using SLI's&lt;/p&gt;

&lt;p&gt;A subset of the performance suite can be used to &lt;strong&gt;poke&lt;/strong&gt; test (performance smoke test) the application after deployment. A degraded Performance run could then trigger a rollback&lt;/p&gt;
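&lt;p&gt;&lt;em&gt;A sketch of such a poke check in shell; the probe here is a stand-in (a sleep) for whatever real request you would time, and the 500&amp;nbsp;ms budget is an arbitrary example threshold:&lt;/em&gt;&lt;/p&gt;

```shell
# Performance "poke" check: fail the deploy step if a probe
# exceeds a latency budget. Probe and budget are placeholders.
budget_ms=500

start=$(date +%s%3N)   # current time in milliseconds (GNU date)
sleep 0.1              # stand-in for the real probe, e.g. a request to a health endpoint
end=$(date +%s%3N)

elapsed=$(( end - start ))
if [ "$elapsed" -le "$budget_ms" ]; then
  echo "PASS: probe took ${elapsed}ms (budget ${budget_ms}ms)"
else
  echo "FAIL: probe took ${elapsed}ms - candidate for rollback"
fi
```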

&lt;h3&gt;
  
  
  &lt;em&gt;Summary&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;A balanced performance strategy, applied at each stage of the SDLC and guided by RE principles, provides a more well-rounded verification process. In turn it can lead to a culture of empathy, encourage collaboration, reduce delivery cycle duration and mitigate the chance of deploying underperforming software&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>performance</category>
      <category>testing</category>
      <category>devops</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
