DEV Community: MUHAMMAD ABIODUN SULAIMAN

System Design for Agentic AI Projects: What to Do, What Not to Do, and When

MUHAMMAD ABIODUN SULAIMAN — Sat, 04 Oct 2025 17:43:58 +0000

Agentic AI, where independent AI agents make complex decisions or orchestrate workflows, is revolutionizing business processes. Yet architecting such systems brings fresh design challenges and high stakes. Here are actionable principles, informed by real-world builds and strategic lessons, for designing robust, scalable Agentic AI solutions, complete with Python pseudocode examples to illustrate implementation patterns.

What to Do: Core System Design Principles

Anchor to Business Value

Start by defining explicit, measurable business outcomes for the AI system. Every agent’s function should map directly to a high-impact organizational goal. Avoid speculative builds; only add autonomy where it demonstrably amplifies value and innovation.

class BusinessAlignedAgent:
    def __init__(self, name, business_kpi, success_threshold):
        self.name = name
        self.business_kpi = business_kpi  # e.g., "customer_satisfaction_score"
        self.success_threshold = success_threshold  # e.g., 0.85
        self.metrics_tracker = MetricsTracker()

    def execute_task(self, task_input):
        result = self.process_task(task_input)
        # Always measure business impact
        kpi_value = self.metrics_tracker.measure_kpi(self.business_kpi, result)

        if kpi_value < self.success_threshold:
            self.escalate_to_human(result, kpi_value)

        return result

Use Clear Orchestration and Task Decomposition

Architect systems with a two-tier model: a primary (“supervisor”) agent that owns orchestration and specialized subagents that execute well-bounded, stateless tasks. Decompose complex operations either vertically (sequential dependency) or horizontally (parallel independence), based on the workflow.

class AgentOrchestrator:
    def __init__(self):
        self.subagents = {
            'data_processor': DataProcessorAgent(),
            'validator': ValidatorAgent(),
            'reporter': ReporterAgent()
        }

    def execute_workflow(self, task):
        # Sequential decomposition example
        if task.type == "data_analysis":
            raw_data = self.subagents['data_processor'].process(task.input)
            validated_data = self.subagents['validator'].validate(raw_data)
            report = self.subagents['reporter'].generate_report(validated_data)
            return report

        # Parallel decomposition example
        elif task.type == "multi_source_research":
            parallel_tasks = task.split_into_parallel()
            results = []

            for subtask in parallel_tasks:
                agent = self.select_agent_for_task(subtask)
                results.append(agent.execute_async(subtask))

            return self.merge_results(results)
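
To make the two decomposition modes concrete, here is a minimal runnable sketch. It assumes subagents can be modeled as plain functions and coroutines; the names process, validate, report, research, run_sequential, and run_parallel are hypothetical stand-ins, not part of the orchestrator above.

```python
import asyncio

# Hypothetical stateless subagents, modeled as plain functions for illustration.
def process(raw: str) -> str:
    return raw.strip().lower()

def validate(data: str) -> str:
    if not data:
        raise ValueError("empty input")
    return data

def report(data: str) -> str:
    return f"report: {data}"

def run_sequential(task_input: str) -> str:
    """Vertical decomposition: each step depends on the previous one."""
    return report(validate(process(task_input)))

async def research(source: str) -> str:
    """Stand-in for an I/O-bound subagent call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"findings from {source}"

async def run_parallel(sources: list[str]) -> list[str]:
    """Horizontal decomposition: independent subtasks run concurrently."""
    # gather preserves input order, which keeps the merge step deterministic
    return await asyncio.gather(*(research(s) for s in sources))

if __name__ == "__main__":
    print(run_sequential("  Quarterly SALES  "))
    print(asyncio.run(run_parallel(["crm", "billing"])))
```

Because gather returns results in the order the subtasks were submitted, the supervisor can merge them without tracking which agent finished first.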

Favor Statelessness and Modularity

Design subagents as pure functions, with no internal state and no reliance on conversation history. This ensures predictable, parallel execution, easy troubleshooting, reproducible outputs, and streamlined caching. Isolate and test each module rigorously before integration.

class StatelessAgent:
    def __init__(self, tools, config):
        self.tools = tools
        self.config = config
        # No conversation state stored here

    def process(self, task_input, context=None):
        """Pure function - same input always produces same output"""
        # Extract only necessary context
        relevant_context = self.extract_relevant_context(context, task_input)

        # Process without side effects
        result = self.execute_logic(task_input, relevant_context)

        # Return structured output with metadata
        return {
            'result': result,
            'confidence': self.calculate_confidence(result),
            'reasoning': self.explain_reasoning(),
            'metadata': {
                'processing_time': self.get_processing_time(),
                'tools_used': self.get_tools_used()
            }
        }
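
One practical payoff of purity is caching. The sketch below assumes a hypothetical classify_ticket step whose output depends only on its arguments; the function and its rules are invented for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def classify_ticket(text: str, rules: tuple[str, ...]) -> str:
    """Hypothetical pure subagent step: output depends only on its arguments."""
    lowered = text.lower()
    for keyword in rules:
        if keyword in lowered:
            return keyword
    return "general"

if __name__ == "__main__":
    rules = ("refund", "outage")  # a tuple, because cache keys must be hashable
    print(classify_ticket("Refund request #12", rules))
    print(classify_ticket("Refund request #12", rules))  # served from the cache
    print(classify_ticket.cache_info())
```

This only works because the function reads no hidden state. An agent that consulted conversation history would return stale answers from the cache.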

Explicit Protocols and Structured Data

All agent-to-agent communication should use strict, structured formats: specify objective, input data, expected output, constraints, and success criteria. Responses must report status, reason codes, results, and relevant metadata.

from dataclasses import dataclass
from typing import Dict, Any, List, Optional
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    ESCALATED = "escalated"

@dataclass
class AgentTask:
    objective: str
    input_data: Dict[str, Any]
    constraints: List[str]
    expected_output_format: str
    priority: int = 1
    timeout_seconds: int = 300

@dataclass
class AgentResponse:
    status: TaskStatus
    result: Dict[str, Any]
    confidence_score: float
    reasoning: str
    metadata: Dict[str, Any]
    # Optional fields default to None when there is nothing to report
    error_details: Optional[str] = None
    next_actions: Optional[List[str]] = None

class CommunicationProtocol:
    def send_task(self, agent_id: str, task: AgentTask) -> str:
        """Send task to agent, return task_id"""
        pass

    def get_response(self, task_id: str) -> AgentResponse:
        """Get agent response for task"""
        pass
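
As a sketch of what a strict wire format might look like, the example below serializes a task to JSON and rejects payloads that omit required fields. TaskMessage is a simplified, hypothetical stand-in for AgentTask, and encode and decode are illustrative helpers rather than part of any particular framework.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Any, Dict, List

@dataclass
class TaskMessage:
    """Simplified stand-in for AgentTask; field names mirror the ones above."""
    objective: str
    input_data: Dict[str, Any]
    constraints: List[str] = field(default_factory=list)
    expected_output_format: str = "json"
    timeout_seconds: int = 300

REQUIRED_KEYS = {"objective", "input_data"}

def encode(task: TaskMessage) -> str:
    """Serialize a task to a JSON string with stable key order."""
    return json.dumps(asdict(task), sort_keys=True)

def decode(payload: str) -> TaskMessage:
    """Parse a JSON payload, rejecting messages that omit required fields."""
    data = json.loads(payload)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return TaskMessage(**data)  # unknown keys raise TypeError, keeping the format strict

if __name__ == "__main__":
    task = TaskMessage("summarize", {"doc_id": 7}, ["max 100 words"])
    print(encode(task))
    print(decode(encode(task)) == task)
```

Validating at the boundary means a malformed message fails loudly at the receiving agent instead of producing a subtly wrong result downstream.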

Integrate Monitoring, Observability, and Human Oversight Early

From the outset, design dashboards and trace logs to capture every action, error, and decision path. Incorporate humans-in-the-loop to review high-stakes outcomes, intervene on edge cases, and absorb critical feedback for continual refinement.

import logging
from opentelemetry import trace
from datetime import datetime

class ObservableAgent:
    def __init__(self, name):
        self.name = name
        self.tracer = trace.get_tracer(__name__)
        self.logger = logging.getLogger(f"agent.{name}")
        self.metrics_collector = MetricsCollector()

    def execute_with_observability(self, task):
        # Create distributed trace
        with self.tracer.start_as_current_span(f"agent_{self.name}_execute") as span:
            span.set_attribute("agent.name", self.name)
            span.set_attribute("task.type", task.type)

            start_time = datetime.now()

            try:
                # Log task start
                self.logger.info(f"Starting task: {task.objective}")

                # Execute task with monitoring
                result = self.monitored_execute(task)

                # Record success metrics
                self.metrics_collector.record_success(
                    agent_name=self.name,
                    task_type=task.type,
                    duration=(datetime.now() - start_time).total_seconds(),
                    confidence=result.confidence_score
                )

                # Check if human oversight needed
                if self.requires_human_review(result):
                    return self.escalate_to_human(task, result)

                return result

            except Exception as e:
                # Record failure metrics
                self.metrics_collector.record_failure(
                    agent_name=self.name,
                    task_type=task.type,
                    error=str(e)
                )

                span.set_attribute("error", True)
                span.set_attribute("error.message", str(e))

                self.logger.error(f"Task failed: {e}")
                raise

    def requires_human_review(self, result):
        """Determine if result needs human oversight"""
        return (result.confidence_score < 0.7 or 
                "high_stakes" in result.metadata or
                result.status == TaskStatus.ESCALATED)
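
If a tracing backend is not available yet, the same idea can be approximated with the standard library alone. The decorator below is a sketch under that assumption; observed and summarize are hypothetical names, and the 0.7 review threshold simply mirrors the one used above.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("agent.demo")

def observed(review_threshold: float = 0.7):
    """Log duration and outcome of an agent call; flag low-confidence results."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
            except Exception:
                logger.exception("%s failed", fn.__name__)
                raise
            elapsed = time.perf_counter() - start
            needs_review = result["confidence"] < review_threshold
            logger.info("%s finished in %.4fs (needs_review=%s)",
                        fn.__name__, elapsed, needs_review)
            # Return a new dict so the wrapped function's output is not mutated
            return {**result, "needs_review": needs_review}
        return wrapper
    return decorator

@observed()
def summarize(text: str) -> dict:
    # Hypothetical agent step; the confidence heuristic is made up for the demo.
    return {"result": text[:20], "confidence": 0.9 if len(text) > 10 else 0.4}

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(summarize("short"))
    print(summarize("a much longer input text"))
```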

What Not to Do: Preventable Pitfalls

Do Not Over-Engineer Hierarchies

More than two levels of agent delegation often result in convoluted debugging, unclear accountability, and diminished returns. Resist deep trees of subagents; complexity compounds risk.

class OverEngineeredSystem:  # DON'T DO THIS
    def __init__(self):
        # Too many levels - hard to debug and maintain
        self.level1_orchestrator = MainOrchestrator([
            SubOrchestrator([
                SpecializedAgent([
                    MicroAgent(), MicroAgent()
                ]),
                SpecializedAgent([
                    MicroAgent(), MicroAgent()
                ])
            ])
        ])

# PREFER - Simple two-level hierarchy
class SimpleSystem:  # DO THIS INSTEAD
    def __init__(self):
        self.orchestrator = MainOrchestrator()
        self.agents = {
            'processor': ProcessorAgent(),
            'validator': ValidatorAgent(),
            'reporter': ReporterAgent()
        }

Avoid Context Creep and Unbounded State

Do not simply pass all conversation history or business context to every agent. Instead, use the minimal, strictly relevant context needed for the immediate task.

class ContextManager:
    def __init__(self):
        self.context_store = {}
        self.context_filters = {
            'data_processor': ['data_schema', 'processing_rules'],
            'validator': ['validation_rules', 'error_thresholds'],
            'reporter': ['report_template', 'audience_type']
        }

    def get_filtered_context(self, agent_type, task):
        """Return only relevant context for specific agent"""
        full_context = self.context_store.get(task.session_id, {})

        # Filter to only relevant keys for this agent
        relevant_keys = self.context_filters.get(agent_type, [])
        filtered_context = {k: v for k, v in full_context.items() 
                          if k in relevant_keys}

        # Add task-specific context
        filtered_context.update(task.get_immediate_context())

        return filtered_context
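
A small runnable version of the same filtering idea shows what each agent actually receives. The allow-list and the session contents below are invented for illustration.

```python
from typing import Any, Dict

# Hypothetical allow-list; keys mirror the filters in the class above.
CONTEXT_FILTERS = {
    "validator": {"validation_rules", "error_thresholds"},
    "reporter": {"report_template", "audience_type"},
}

def filter_context(agent_type: str, full_context: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the keys the given agent is allowed to see."""
    allowed = CONTEXT_FILTERS.get(agent_type, set())
    return {k: v for k, v in full_context.items() if k in allowed}

session = {
    "validation_rules": ["non_null"],
    "error_thresholds": {"max_errors": 5},
    "report_template": "weekly",
    "chat_history": ["...hundreds of turns..."],
}

if __name__ == "__main__":
    print(filter_context("validator", session))
    print(filter_context("unknown_agent", session))  # unknown agents get nothing
```

Defaulting to an empty context for unregistered agents keeps the failure mode safe: a new agent sees nothing until someone decides what it needs.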

When to Do What: Phased Execution

Discovery & Design

Interview stakeholders, clarify business KPIs, and identify the smallest, highest-value opportunity for autonomy. Map out explicit agent roles and approval boundaries.

class ProjectDiscovery:
    def __init__(self):
        self.stakeholders = []
        self.business_requirements = {}
        self.agent_specifications = {}

    def conduct_stakeholder_interviews(self):
        for stakeholder in self.stakeholders:
            requirements = self.interview(stakeholder)
            self.business_requirements[stakeholder.role] = requirements

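    # Ordinal risk scale: comparing the label strings directly would be alphabetical
    RISK_ORDER = {'low': 0, 'medium': 1, 'high': 2}

    def identify_automation_opportunities(self):
        opportunities = []

        for process in self.business_requirements['processes']:
            if (process.is_repetitive() and
                process.has_clear_success_criteria() and
                self.RISK_ORDER[process.risk_level] < self.RISK_ORDER['high']):
                opportunities.append(process)

        # Prioritize by value and feasibility, best ratio first
        return sorted(opportunities,
                     key=lambda x: x.business_value / x.complexity_score,
                     reverse=True)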

Prototype & Evaluate

Build a minimal orchestration layer and 1–3 specialized subagents. Rigorously test each in isolation and as an integrated chain. Validate statelessness and boundary clarity with sample workflows.

class PrototypeValidator:
    def __init__(self, agents, test_cases):
        self.agents = agents
        self.test_cases = test_cases
        self.test_results = {}

    def run_isolation_tests(self):
        """Test each agent individually"""
        for agent_name, agent in self.agents.items():
            # Collect every result; a single key would keep only the last test case
            results = [self.test_agent_isolation(agent, test_case)
                       for test_case in self.test_cases[agent_name]]
            self.test_results[f"{agent_name}_isolation"] = results

    def run_integration_tests(self):
        """Test agent interactions and workflows"""
        for workflow in self.test_cases['workflows']:
            result = self.test_workflow_integration(workflow)
            self.test_results[f"workflow_{workflow.name}"] = result

    def validate_statelessness(self, agent):
        """Ensure same input produces same output"""
        test_input = self.generate_test_input()

        # Run same input multiple times
        results = [agent.process(test_input) for _ in range(3)]

        # All results should be identical (stateless)
        return all(r == results[0] for r in results)

Incremental Extension

Expand agent count, memory, and privileges only after robust monitoring and human-in-the-loop review demonstrate sustained, safe performance. Each new extension must undergo regression and error-handling tests.

class SafeExpansionManager:
    def __init__(self, current_system):
        self.current_system = current_system
        self.safety_metrics = SafetyMetrics()
        self.approval_gate = ApprovalGate()

    def propose_expansion(self, new_capability):
        # Check current system health
        if not self.safety_metrics.is_system_healthy():
            return False, "Current system showing issues"

        # Run safety analysis
        risk_assessment = self.analyze_expansion_risk(new_capability)

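        # Compare on an ordinal scale; string comparison would be alphabetical
        risk_order = {'low': 0, 'medium': 1, 'high': 2}
        if risk_order[risk_assessment.risk_level] > risk_order['medium']:
            return False, f"Risk too high: {risk_assessment.details}"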

        # Require human approval for expansion
        approval = self.approval_gate.request_approval(
            expansion=new_capability,
            risk_assessment=risk_assessment,
            current_performance=self.safety_metrics.get_performance_summary()
        )

        if not approval:
            return False, "Expansion not approved"
        return True, "Ready for controlled rollout"

    def controlled_rollout(self, new_capability):
        """Gradual deployment with monitoring"""
        # Start with 5% of traffic
        self.deploy_to_percentage(new_capability, 5)

        # Monitor for 24 hours
        if self.monitor_for_duration(hours=24):
            # Increase to 25%
            self.deploy_to_percentage(new_capability, 25)
            # Continue gradual expansion...
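
One common way to implement "5% of traffic" is deterministic hash bucketing, sketched below. This is an assumption about how deploy_to_percentage might route requests, not a description of it; in_rollout and the feature name are hypothetical.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percentage: int) -> bool:
    """Deterministically place a user in one of 100 buckets per feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    # Buckets 0..percentage-1 are enabled, so raising the percentage only adds users
    return bucket < percentage

if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(in_rollout(u, "new-agent", 5) for u in users)
    print(f"{enabled} of {len(users)} users routed to the new capability")
```

Because a user's bucket never changes, everyone included at 5% stays included at 25%, which keeps the experience consistent as the rollout widens.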

What to Do When: Handling Change and Failure

On System Drift or Error

Trigger a rollback immediately, escalate to human review, and retrain agents only after a post-mortem has captured what went wrong. Never ignore recurring anomalies: investigate, audit, and patch.

import time

class ErrorRecoverySystem:
    def __init__(self):
        self.error_detector = ErrorDetector()
        self.rollback_manager = RollbackManager()
        self.human_escalation = HumanEscalationSystem()
        self.retry_policy = RetryPolicy()

    def handle_system_error(self, error, context):
        # Immediate containment
        if error.severity == 'critical':
            self.rollback_manager.immediate_rollback()
            self.human_escalation.emergency_alert(error, context)

        # Retry with exponential backoff for transient errors
        elif error.is_transient():
            return self.retry_with_backoff(error.failed_operation, context)

        # Pattern analysis for recurring errors
        elif self.error_detector.is_recurring_pattern(error):
            self.escalate_for_investigation(error)

        # Standard error recovery
        else:
            return self.standard_recovery(error, context)

    def retry_with_backoff(self, operation, context, max_retries=3):
        """Implement exponential backoff retry pattern"""
        for attempt in range(max_retries):
            try:
                return operation.execute(context)
            except Exception as e:
                if attempt == max_retries - 1:
                    # Final attempt failed
                    self.human_escalation.notify(e, context)
                    raise

                # Exponential backoff: 2^attempt seconds
                wait_time = 2 ** attempt
                time.sleep(wait_time)

    def post_mortem_analysis(self, error_incident):
        """Analyze failures and improve system"""
        analysis = {
            'root_cause': self.analyze_root_cause(error_incident),
            'prevention_measures': self.identify_prevention_measures(error_incident),
            'system_improvements': self.suggest_improvements(error_incident)
        }

        # Update agent training with lessons learned
        self.update_agent_training(analysis)
        return analysis
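
For reference, here is a self-contained version of the backoff pattern that can be run and tested on its own. It adds random jitter, which the class above does not use, and takes the sleep function as a parameter so tests need not actually wait; both choices are assumptions made for this sketch.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(operation: Callable[[], T], max_retries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation`, doubling the maximum wait after each failure."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Full jitter: wait a random time up to base_delay * 2^attempt
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("max_retries must be at least 1")

if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky() -> str:
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry_with_backoff(flaky, sleep=lambda s: None), attempts["n"])
```

Jitter spreads retries out so that many agents failing at the same moment do not all retry in lockstep.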

Amidst New Constraints

When regulations or requirements shift, update agent privileges, audit trails, and test compliance before resuming autonomous actions.

import time

class ComplianceManager:
    def __init__(self):
        self.regulation_tracker = RegulationTracker()
        self.audit_system = AuditSystem()
        self.constraint_engine = ConstraintEngine()

    def handle_regulatory_change(self, new_regulation):
        # Immediate system pause for critical changes
        if new_regulation.impact_level == 'critical':
            self.pause_autonomous_operations()

        # Update constraint engine
        new_constraints = self.regulation_tracker.translate_to_constraints(new_regulation)
        self.constraint_engine.update_constraints(new_constraints)

        # Re-validate all agents against new constraints
        validation_results = self.validate_agents_compliance()

        if all(result.compliant for result in validation_results):
            self.resume_operations_with_new_constraints()
        else:
            self.remediate_non_compliant_agents(validation_results)

    def continuous_compliance_monitoring(self):
        """Ongoing compliance checking"""
        while self.system_is_running():
            for agent in self.active_agents():
                compliance_check = self.audit_system.check_compliance(
                    agent, self.constraint_engine.current_constraints
                )

                if not compliance_check.passed:
                    self.handle_compliance_violation(agent, compliance_check)

            time.sleep(self.compliance_check_interval)

For Coverage or Performance Gaps

Analyze dashboard metrics and user feedback, then upgrade agent skills or add specialized nodes incrementally, not en masse. Each new addition should prove its value through KPIs before full deployment.

class PerformanceOptimizer:
    def __init__(self):
        self.metrics_analyzer = MetricsAnalyzer()
        self.feedback_processor = FeedbackProcessor()
        self.improvement_engine = ImprovementEngine()

    def identify_performance_gaps(self):
        # Analyze quantitative metrics
        performance_data = self.metrics_analyzer.get_performance_summary()
        gaps = []

        for metric, value in performance_data.items():
            if value < self.get_threshold(metric):
                gaps.append(PerformanceGap(metric, value, self.get_threshold(metric)))

        # Analyze qualitative feedback
        user_feedback = self.feedback_processor.analyze_recent_feedback()
        gaps.extend(self.feedback_processor.identify_capability_gaps(user_feedback))

        return sorted(gaps, key=lambda x: x.business_impact, reverse=True)

    def implement_targeted_improvements(self, performance_gaps):
        for gap in performance_gaps[:3]:  # Focus on top 3 gaps
            improvement = self.improvement_engine.design_improvement(gap)

            # A/B test the improvement
            test_result = self.run_ab_test(improvement)

            if test_result.shows_improvement():
                self.deploy_improvement(improvement)
                self.monitor_improvement_impact(improvement)
            else:
                self.log_failed_improvement_attempt(improvement, test_result)

Conclusion

Well-designed Agentic AI solutions are anchored by clear intent, strong modular boundaries, real-time observability, and phased expansion under active human guidance. The code examples above illustrate how these principles translate into practical implementation patterns that you can adapt for your specific use cases. Adhering to these principles avoids costly missteps and speeds the path from breakthrough to sustainable business impact.

You can follow me on LinkedIn and Twitter for more updates.

🎯 ML Done Right: Versioning Datasets and Models with DVC & MLflow

MUHAMMAD ABIODUN SULAIMAN — Sat, 01 Feb 2025 20:32:06 +0000

Introduction

Data versioning is a crucial aspect of Machine Learning (ML) workflows. It ensures that datasets are reproducible, traceable, and manageable throughout the ML lifecycle. Unlike traditional software development, where Git efficiently tracks code changes, ML workflows require specialized tools to version datasets, models, and metadata.

Two powerful tools for data versioning in ML pipelines are:
DVC (Data Version Control): Designed to manage large datasets efficiently and integrates seamlessly with Git.
MLflow: Focused on experiment tracking, model versioning, and lifecycle management.

In this article, we will explore how to set up DVC and MLflow for data versioning in a machine learning workflow using Python.

Why Do We Need Data Versioning?

Before going into implementation, let’s first understand why data versioning is essential:

Reproducibility → Ensures that ML models can be recreated using the exact dataset used during training.
Collaboration → Enables teams to share and work on different versions of datasets.
Traceability → Keeps track of dataset changes, helping to identify issues in models.
Rollback & Experimentation → Facilitates easy rollback to previous dataset versions for comparison.
Storage Efficiency → Avoids redundancy by storing each unique file once, keyed by its content hash, so unchanged files are shared across dataset versions instead of being duplicated.
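
To illustrate the storage-efficiency point, here is a toy content-addressed store in Python. It is a sketch of the general idea only, not DVC's actual implementation; ContentStore and snapshot are hypothetical names.

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # no-op if this content is already stored
        return key

def snapshot(store: ContentStore, files: dict) -> dict:
    """Record a dataset version as a mapping of file name -> content hash."""
    return {name: store.put(content) for name, content in files.items()}

if __name__ == "__main__":
    store = ContentStore()
    v1 = snapshot(store, {"a.csv": b"1,2\n", "b.csv": b"3,4\n"})
    v2 = snapshot(store, {"a.csv": b"1,2\n", "b.csv": b"3,5\n"})
    # Two versions of two files each, but a.csv is shared between them
    print(len(store.blobs))
```

Each version is just a small mapping of names to hashes, which is why the metadata can live in Git while the bulky content lives elsewhere.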

1️⃣ Setting Up DVC for Data Versioning

DVC is an open-source tool designed to handle large datasets efficiently while integrating seamlessly with Git.

Installation

To install DVC, run:

pip install dvc

If you’re using cloud storage (e.g., AWS S3, Google Drive, or Azure), install the appropriate extension:

pip install 'dvc[s3]'
pip install 'dvc[gdrive]'

Initialize DVC in a Project

Inside a Git-tracked ML project, initialize DVC:

git init
dvc init
git commit -m "Initialize DVC"

Adding and Versioning Data

Assume we have a dataset stored in data/:

dvc add data/

This creates:

data.dvc → A metadata file that tracks the dataset version.
Updates .gitignore to prevent large files from being committed to Git.

Now, commit these changes to Git:

git add data.dvc .gitignore
git commit -m "Track dataset with DVC"

Remote Storage Configuration

To store the dataset in cloud storage:

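dvc remote add -d myremote s3://mybucket/path/
git add .dvc/config
git commit -m "Configure DVC remote storage"
dvc push

The -d flag marks myremote as the default remote, so dvc push uploads the dataset to S3 without needing an explicit -r myremote. Other backends such as Google Drive, Azure, and SSH are also supported.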

Restoring Previous Versions

To retrieve a previous dataset version:

git checkout <commit_id>
dvc pull

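git checkout restores the data.dvc file from that commit, and dvc pull then downloads the matching data from the remote, so data and models stay synchronized with the desired version.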

2️⃣ Using MLflow for Data Tracking

MLflow helps track datasets, experiments, and models during development.

Installation

pip install mlflow

Initializing MLflow Tracking

Start the MLflow tracking server:

mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns

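This creates a local SQLite database (mlflow.db) to store experiment metadata and, by default, serves the tracking UI at http://127.0.0.1:5000. Scripts log to this server only if they point at it, for example through mlflow.set_tracking_uri or the MLFLOW_TRACKING_URI environment variable; otherwise runs go to MLflow's local default store.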

Logging Dataset Versions with MLflow

Modify your Python script to track dataset versions:

import mlflow

# Log dataset version
def log_dataset_version(dataset_path):
    with mlflow.start_run():
        mlflow.log_artifact(dataset_path, artifact_path="datasets")
        print("Dataset version logged in MLflow")

log_dataset_version("data/")
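
Uploading the whole directory on every run can get expensive for large datasets. A lighter complement, sketched here as an assumption rather than an MLflow feature, is to compute a fingerprint of the directory and record that instead; dataset_fingerprint is a hypothetical helper.

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(root: str) -> str:
    """Hash relative file paths and contents under `root` in sorted order."""
    digest = hashlib.sha256()
    base = Path(root)
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        digest.update(path.relative_to(base).as_posix().encode())
        digest.update(b"\0")  # separator so a name cannot blend into the content
        digest.update(path.read_bytes())
    return digest.hexdigest()

if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "train.csv").write_text("x,y\n1,2\n")
        print(dataset_fingerprint(tmp))
```

The resulting string could then be recorded on the run, for example with mlflow.log_param("dataset_fingerprint", ...), so that two runs can be compared by dataset version without downloading any artifacts.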

Tracking Experiments

When training an ML model, log key parameters and metrics:

import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Log model and metrics
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Model logged with accuracy: {accuracy}")

Retrieving Dataset Versions

To list the artifacts (including logged datasets) stored for a given run:

mlflow artifacts list --run-id <run_id>

3️⃣ Combining DVC and MLflow for a Complete Workflow

A best practice is to integrate DVC for dataset versioning and MLflow for experiment tracking.

🔄 Workflow Summary

Store dataset using DVC → dvc add data/
Push dataset to remote storage → dvc push
Track dataset version in MLflow → mlflow.log_artifact("data/")
Train ML models and log parameters in MLflow
Retrieve dataset versions via DVC and experiments via MLflow.

🛠️ Example: End-to-End Pipeline

import dvc.api
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Resolve where DVC stores this dataset version (requires a configured remote)
dataset_path = "data/"
dataset_url = dvc.api.get_url(dataset_path)

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Log dataset version and results in a single MLflow run.
# Logging before start_run() would open an implicit run and make start_run() fail.
with mlflow.start_run():
    mlflow.log_param("dataset_url", dataset_url)
    mlflow.log_artifact(dataset_path, artifact_path="datasets")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Model logged with accuracy: {accuracy}")

Data versioning is critical for:
✅ Reproducibility → Ensure consistent ML results.
✅ Collaboration → Manage dataset changes efficiently.
✅ Traceability → Keep track of dataset & model versions.

By integrating DVC and MLflow, you can create a scalable, reproducible, and traceable ML pipeline.

You can connect with me via email, LinkedIn, or Medium.

STREAMLINE YOUR CI/CD PIPELINE WITH GITHUB ACTIONS

MUHAMMAD ABIODUN SULAIMAN — Fri, 12 Apr 2024 14:23:18 +0000

In today's fast-moving software development landscape, security and efficiency are critical. Companies and developers work hard to roll out apps quickly while keeping them free of security flaws. This is where Docker, the industry-leading containerization technology, comes into play. By packaging applications in containers, Docker simplifies deployment. However, building these containers, scanning them for vulnerabilities, and pushing them securely is difficult and time-consuming when done by hand. For this reason, it's critical to automate these procedures via a Continuous Integration/Continuous Deployment (CI/CD) pipeline.

The Role of CI/CD in Modern Software Development

Continuous Integration (CI) and Continuous Deployment (CD) aim to automate the software release process, so that software is reliable, secure, and quick to deploy.

Continuous Integration:

CI involves merging all developers' working copies to a shared mainline several times a day. The main objectives of CI include:

Reducing bugs: Automated tests run on every integration, so bugs are detected and fixed quickly and less time is needed to validate and release each update.
Improving software quality: Because integration issues are detected and fixed continuously, fewer untested assumptions accumulate between merges.

Continuous Deployment:

CD extends CI by automatically deploying all code changes to a testing and/or production environment after the build stage. This means that on top of the automated testing, automated release processes further streamline the development lifecycle. Benefits of CD include:

Faster time to market: Accelerated release cycles ensure features reach production faster.
Higher release rates: Frequent releases promote smaller, more manageable changes and less deployment risk.
Improved customer satisfaction: Continuous delivery of features addresses user feedback more promptly and enhances the user experience.

Implementing GitHub Actions for Docker Management

GitHub Actions is a CI/CD tool that automates the software workflow from the first code change to the final production deployment. Using it for Docker management means defining workflows that automatically build, test, and publish Docker images.

GitHub Actions Workflow: The "Complete Docker Workflow"

This section walks through the "Complete Docker Workflow", a GitHub Actions workflow for managing Docker deployments. It consists of several stages, each designed to improve security and streamline the container management lifecycle.

Workflow Activation Triggers

Scheduled Runs: Set to trigger daily at 16:21 UTC, so the image is rebuilt and rescanned regularly and picks up fixes for newly disclosed base image vulnerabilities.

schedule:
    - cron: '21 16 * * *'
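
The cron fields read minute, hour, day of month, month, day of week, so '21 16 * * *' means minute 21 of hour 16 every day. As a quick sanity check outside the workflow, the Python sketch below computes the next trigger time for such a daily schedule; next_daily_run is a hypothetical helper, and actual GitHub-hosted runs may start later than the scheduled time.

```python
from datetime import datetime, timedelta, timezone

def next_daily_run(now: datetime, hour: int = 16, minute: int = 21) -> datetime:
    """Next time a daily cron like '21 16 * * *' would fire, in UTC."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow
        candidate += timedelta(days=1)
    return candidate

if __name__ == "__main__":
    print(next_daily_run(datetime.now(timezone.utc)).isoformat())
```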

Push Events: Activates on pushes to specific branches or tags (specified branch and semantic versioned tags). This ensures that all changes undergo rigorous testing and security checks before deployment.

push:
    branches:
      - branch-name
    tags:
      - 'v*.*.*'

Pull Requests: Targets pull requests to a specific branch (dev-2 in this project; the snippet uses the placeholder branch-name), allowing for automated reviews and tests and ensuring that new code meets quality standards before merging.

pull_request:
    branches:
      - branch-name

Environment Configuration

Using environment variables and secrets for configurations like Docker registry credentials secures sensitive information and streamlines the setup process across multiple environments or projects. docker.io is specified as our REGISTRY in this instance.

env:
  REGISTRY: ${{ secrets.REGISTRY }}
  IMAGE_NAME: ${{ secrets.DOCKER_USERNAME }}/${{ secrets.IMAGE_NAME }}

Initial Setup and Configurations

Checkout Repository

Uses GitHub's actions/checkout@v3.5.2 to clone the repository, with a minimal fetch depth of 1 to speed up the checkout process.

      - name: Checkout repository
        uses: actions/checkout@v3.5.2
        with:
          fetch-depth: 1

Install Cosign

Implements sigstore/cosign-installer@v3.4.0 to install Cosign, which is used later to sign Docker images.

      - name: Install Cosign
        uses: sigstore/cosign-installer@v3.4.0

Set up QEMU

Employs docker/setup-qemu-action@v2.1.0 to configure QEMU, facilitating the emulation of different architectures which is essential for cross-platform Docker builds.

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2.1.0

Set up Docker Buildx

Engages docker/setup-buildx-action@v2.5.0 to set up Docker Buildx, enhancing the ability to perform multi-platform builds directly from a single command.

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2.5.0

Log in to Docker Registry

Utilizes docker/login-action@v2.1.0 for logging into the Docker registry using credentials stored in GitHub secrets.

      - name: Log in to Docker Registry
        uses: docker/login-action@v2.1.0
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

Extract Docker Metadata

Activates docker/metadata-action@v4.4.0 to generate and format Docker image metadata, such as tags, from the environment variables.

      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v4.4.0
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

Building and Pushing Images

The build step uses Docker Buildx and QEMU to support building images for multiple architectures, which keeps the image portable across different hardware. Automating the push to the registry frees developers to concentrate on features rather than operational setup. Here, linux/amd64 is the chosen platform (an OS and CPU architecture pair, not just an OS); multiple platforms can be given as a comma-separated list, for example linux/amd64,linux/arm64.

- name: Build and Push container images
        uses: docker/build-push-action@v4.0.0
        with:
          platforms: linux/amd64
          push: true
          tags: ${{ steps.meta.outputs.tags }}

Security Scanning with Trivy

Integrating Trivy scans catches known vulnerabilities in the image before it reaches production. The scan below checks both OS packages and application libraries and reports only CRITICAL and HIGH findings. As written, the step only reports what it finds; set the action's exit-code input to 1 if the job should fail when vulnerabilities are detected. If there are multiple tags for the image, you can duplicate the step below and specify each of the tags.

      - name: Scan Docker image with Trivy (specifying a tag)
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:tag
          format: 'table'
          severity: 'CRITICAL,HIGH'
          vuln-type: 'os,library'

Digital Signing of Images with Cosign

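Signing each pushed image with Cosign lets consumers verify where the image came from, adding an extra degree of security so that only validated images are released. The step uses GitHub's OIDC token for keyless signing, so no long-lived signing key has to be stored in the repository. This is why the job needs the id-token: write permission.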

      - name: Sign the images with GitHub OIDC Token (Non-interactive)
        run: |
          # docker/metadata-action emits one tag per line by default,
          # so read the list line by line rather than splitting on commas.
          while IFS= read -r tag; do
            [ -n "$tag" ] || continue
            echo "Signing $tag"
            cosign sign --oidc-issuer=https://token.actions.githubusercontent.com --yes "$tag"
          done <<< "${{ steps.meta.outputs.tags }}"
        env:
          COSIGN_EXPERIMENTAL: "true"
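
Since docker/metadata-action separates its tags output with newlines by default, a loop that splits only on commas may process just the first tag. Below is a minimal sketch of a splitter that tolerates both separators. The helper name split_tags and the image references are assumptions made for illustration, and the loop only prints what it would sign.

```shell
# split_tags: print one image reference per line from a list that may be
# separated by commas, newlines, or both. Blank entries are dropped.
# (Hypothetical helper, not part of cosign or docker/metadata-action.)
split_tags() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Example input shaped like a multi-line tags output (placeholder values).
TAGS='registry.example.com/acme/app:v1.2.3
registry.example.com/acme/app:latest'

split_tags "$TAGS" | while IFS= read -r tag; do
  # In the workflow, this is where `cosign sign --yes "$tag"` would run.
  echo "Would sign $tag"
done
```

Keeping the splitting in one small function makes it easy to try locally with sample input before wiring it into the workflow step.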

Complete CI/CD Pipeline

name: Complete Docker Workflow

on:
  schedule:
    - cron: '21 16 * * *'
  push:
    branches:
      - branch-name
    tags:
      - 'v*.*.*'
  pull_request:
    branches:
      - branch-name

env:
  REGISTRY: ${{ secrets.REGISTRY }}
  IMAGE_NAME: ${{ secrets.DOCKER_USERNAME }}/${{ secrets.IMAGE_NAME }}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write # needed for signing the images with GitHub OIDC Token
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3.5.2
        with:
          fetch-depth: 1

      - name: Install Cosign
        uses: sigstore/cosign-installer@v3.4.0

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2.1.0

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2.5.0

      - name: Log in to Docker Registry
        uses: docker/login-action@v2.1.0
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v4.4.0
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and Push container images
        uses: docker/build-push-action@v4.0.0
        with:
          platforms: linux/amd64
          push: true
          tags: ${{ steps.meta.outputs.tags }}

      - name: Scan Docker image with Trivy (tag1)
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:tag1
          format: 'table'
          severity: 'CRITICAL,HIGH'
          vuln-type: 'os,library'

      - name: Scan Docker image with Trivy (tag2)
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:tag2
          format: 'table'
          severity: 'CRITICAL,HIGH'
          vuln-type: 'os,library'

      - name: Sign the images with GitHub OIDC Token (Non-interactive)
        run: |
          # docker/metadata-action emits one tag per line by default,
          # so read the list line by line rather than splitting on commas.
          while IFS= read -r tag; do
            [ -n "$tag" ] || continue
            echo "Signing $tag"
            cosign sign --oidc-issuer=https://token.actions.githubusercontent.com --yes "$tag"
          done <<< "${{ steps.meta.outputs.tags }}"
        env:
          COSIGN_EXPERIMENTAL: "true"

Advantages over Traditional Methods

There are several benefits to automating these procedures over more conventional manual ones.

Reduced Human Error: Automating repetitive tasks lowers the possibility of human error, including deployment script misconfiguration or omission of stages.
Consistency and Reliability: Automation guarantees consistency and repeatability by ensuring that each step is carried out in the same way. This makes the process of developing, testing, and deploying software dependable and predictable.
Security: By methodically identifying and addressing security concerns, automated vulnerability scans considerably lower the risk of releasing vulnerable code.

Conclusion

This pipeline covers the Docker image lifecycle from build and push to vulnerability scanning and signing, all automated with GitHub Actions. Beyond automating the build and push, it incorporates security practices such as scanning and signing images, which are pivotal for maintaining the integrity and trustworthiness of software in a CI/CD environment.

Guide to Resizing EC2 Instance Volume without Deleting the Instance

MUHAMMAD ABIODUN SULAIMAN — Tue, 05 Dec 2023 13:51:24 +0000

As engineers or developers who use AWS Cloud infrastructure, we often provision an EC2 instance only to find its storage exhausted while a deployment is midway.

When I started my career with AWS Cloud a couple of years ago, I handled this by deleting the instance and provisioning a new one with a larger volume.

Over the years, I learned that this was not best practice. All I needed was a shell script that grows the volume to a size of my choice and then extends the partition and file system so that they take up all the new space, with no need to delete the instance.

The script, explained step by step:

A. Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.

SIZE=${1:-20}
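
EBS volumes can only be grown, not shrunk, and the size has to be a whole number of GiB, so it can help to reject a bad argument before calling AWS at all. This is a minimal sketch. The validate_size helper is an assumption added for illustration and is not part of the original script.

```shell
# validate_size: succeed only when the argument is a positive whole number.
# (Hypothetical helper; the original script does not validate its input.)
validate_size() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    0*)          return 1 ;;   # zero, or a leading zero
    *)           return 0 ;;
  esac
}

# Usage at the top of the resize script:
#   SIZE=${1:-20}
#   validate_size "$SIZE" || { echo "size must be a positive integer (GiB)" >&2; exit 1; }
```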

B. Get the ID and region of the EC2 instance the script is running on.

INSTANCEID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
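
The plain curl calls rely on IMDSv1. On instances configured to require IMDSv2, each metadata request needs a session token first. The sketch below shows that variant. The function names imds_get and region_from_az are assumptions for illustration, and the metadata call only works when run on an EC2 instance.

```shell
IMDS=http://169.254.169.254/latest

# imds_get PATH: fetch one metadata path using an IMDSv2 session token.
imds_get() {
  token=$(curl -s --max-time 5 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  curl -s --max-time 5 -H "X-aws-ec2-metadata-token: $token" "$IMDS/meta-data/$1"
}

# region_from_az AZ: strip the trailing zone letter,
# for example eu-west-1b becomes eu-west-1.
region_from_az() {
  printf '%s\n' "${1%[a-z]}"
}

# Usage on an instance:
#   INSTANCEID=$(imds_get instance-id)
#   REGION=$(region_from_az "$(imds_get placement/availability-zone)")
```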

C. Get the ID of the Amazon EBS volume associated with the instance. The query takes the first block device mapping, which is normally the root volume.

VOLUMEID=$(aws ec2 describe-instances \
  --instance-ids "$INSTANCEID" \
  --query "Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId" \
  --output text \
  --region "$REGION")

D. Resize the EBS volume.

aws ec2 modify-volume --volume-id "$VOLUMEID" --size "$SIZE" --region "$REGION"

E. Wait for the resize to finish. The new size can be used once the modification state is optimizing or completed.

while [ \
  "$(aws ec2 describe-volumes-modifications \
    --volume-ids "$VOLUMEID" \
    --region "$REGION" \
    --filters Name=modification-state,Values="optimizing","completed" \
    --query "length(VolumesModifications)" \
    --output text)" != "1" ]; do
  sleep 1
done
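
The loop above polls forever if the modification fails or never reaches the expected state. One way to bound it is a small retry helper with a maximum number of attempts. This is a sketch under assumptions: wait_until and modification_settled are hypothetical names, and VOLUMEID and REGION are expected to be set as in the earlier steps.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds; give up after MAX_ATTEMPTS tries.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# modification_settled: true once the volume reports optimizing or completed.
modification_settled() {
  state=$(aws ec2 describe-volumes-modifications \
    --volume-ids "$VOLUMEID" \
    --region "$REGION" \
    --query "VolumesModifications[0].ModificationState" \
    --output text) || return 1
  [ "$state" = "optimizing" ] || [ "$state" = "completed" ]
}

# Usage: poll every 5 seconds for up to 10 minutes.
#   wait_until 120 5 modification_settled || { echo "resize timed out" >&2; exit 1; }
```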

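F. Check whether the root disk is exposed as /dev/xvda or as an NVMe device (/dev/nvme0n1), then grow the partition and the file system to fill the volume.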

if [[ -e "/dev/xvda" && $(readlink -f /dev/xvda) = "/dev/xvda" ]]
then
  # Rewrite the partition table to take up all the space it can.
  sudo growpart /dev/xvda 1

  # Expand the size of the file system.
  # Check if we're on AL2
  STR=$(cat /etc/os-release)
  SUB="VERSION_ID=\"2\""
  if [[ "$STR" == *"$SUB"* ]]
  then
    sudo xfs_growfs -d /
  else
    sudo resize2fs /dev/xvda1
  fi

else
  # Rewrite the partition table to take up all the space it can.
  sudo growpart /dev/nvme0n1 1

  # Expand the size of the file system.
  # Check if we're on AL2
  STR=$(cat /etc/os-release)
  SUB="VERSION_ID=\"2\""
  if [[ "$STR" == *"$SUB"* ]]
  then
    sudo xfs_growfs -d /
  else
    sudo resize2fs /dev/nvme0n1p1
  fi
fi
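
The VERSION_ID check above assumes that only Amazon Linux 2 uses XFS and that everything else can be handled by resize2fs. A sketch of an alternative is to choose the command from the actual file system type of the root mount. The helper names partition_path and grow_fs_command are assumptions for illustration, and the functions only print paths and commands so they can be tried safely.

```shell
# partition_path DISK N: disks whose name ends in a digit (NVMe) put a "p"
# before the partition number (/dev/nvme0n1p1); xvd/sd disks do not (/dev/xvda1).
partition_path() {
  case "$1" in
    *[0-9]) printf '%sp%s\n' "$1" "$2" ;;
    *)      printf '%s%s\n' "$1" "$2" ;;
  esac
}

# grow_fs_command FSTYPE PARTITION: print the command that grows the root
# file system, chosen by file system type rather than by OS release.
grow_fs_command() {
  case "$1" in
    xfs)            echo "xfs_growfs -d /" ;;
    ext2|ext3|ext4) echo "resize2fs $2" ;;
    *)              echo "unsupported filesystem: $1" >&2; return 1 ;;
  esac
}

# Usage on the instance (findmnt ships with util-linux):
#   FSTYPE=$(findmnt -n -o FSTYPE /)
#   sudo $(grow_fs_command "$FSTYPE" "$(partition_path /dev/nvme0n1 1)")
```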

The complete script is below:

#!/bin/bash

# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
SIZE=${1:-20}

# Get the ID and region of the EC2 instance the script is running on.
INSTANCEID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')

# Get the ID of the Amazon EBS volume associated with the instance
# (the first block device mapping, normally the root volume).
VOLUMEID=$(aws ec2 describe-instances \
  --instance-ids "$INSTANCEID" \
  --query "Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId" \
  --output text \
  --region "$REGION")

# Resize the EBS volume.
aws ec2 modify-volume --volume-id "$VOLUMEID" --size "$SIZE" --region "$REGION"

# Wait for the resize to finish.
while [ \
  "$(aws ec2 describe-volumes-modifications \
    --volume-ids "$VOLUMEID" \
    --region "$REGION" \
    --filters Name=modification-state,Values="optimizing","completed" \
    --query "length(VolumesModifications)" \
    --output text)" != "1" ]; do
  sleep 1
done

# Check whether the root disk is /dev/xvda; otherwise assume an NVMe device.
if [[ -e "/dev/xvda" && $(readlink -f /dev/xvda) = "/dev/xvda" ]]
then
  # Rewrite the partition table to take up all the space it can.
  sudo growpart /dev/xvda 1

  # Expand the size of the file system.
  # Check if we're on AL2
  STR=$(cat /etc/os-release)
  SUB="VERSION_ID=\"2\""
  if [[ "$STR" == *"$SUB"* ]]
  then
    sudo xfs_growfs -d /
  else
    sudo resize2fs /dev/xvda1
  fi

else
  # Rewrite the partition table to take up all the space it can.
  sudo growpart /dev/nvme0n1 1

  # Expand the size of the file system.
  # Check if we're on AL2
  STR=$(cat /etc/os-release)
  SUB="VERSION_ID=\"2\""
  if [[ "$STR" == *"$SUB"* ]]
  then
    sudo xfs_growfs -d /
  else
    sudo resize2fs /dev/nvme0n1p1
  fi
fi

You can connect with me via email, LinkedIn, or Medium.

Passing Credentials to Deployed Streamlit Apps using Streamlit Secrets

MUHAMMAD ABIODUN SULAIMAN — Sat, 05 Nov 2022 16:32:52 +0000

Often as programmers, we get to deal with credentials, for instance, when we need to connect to a database or ingest data from sources besides our local computers. Hence, it becomes imperative to find a way to securely pass the credentials such that they are not exposed in our code scripts.

In this short article, I explain how I used Streamlit Secrets to fix a problem with passing credentials to a machine learning system I deployed to the web using Streamlit. The app worked fine on my local computer because the credentials reached the ingestion pipeline through an environment variable file (.env). Once deployed to the web, however, the system failed because the credentials were not available.

My first attempts were to remove the environment variable file from .gitignore and to move the credentials into a Python (.py) file, since Python was the language I was working with. Neither worked: a pre-commit hook detected that credentials had been exposed, so I could not push the local repository to the remote repository. I finally used the Streamlit secrets file (secrets.toml) to pass the credentials, and entered the same credentials in the settings section of the deployed Streamlit app. The steps taken to achieve this are listed below:

Create the directory .streamlit/ in the root of the project.
Create the secret file secrets.toml inside that directory.
In the created secret file, type in the credentials in the format below:

[db_credentials]
user = 'someuser'
password = 'somepassword'
host = 'somehost'
database = 'somedatabase'

PS: You can include other credentials as needed.

In the Python file, read the credentials from st.secrets (after import streamlit as st) and pass them as keyword arguments to your database connection call:

host=st.secrets.db_credentials.host,
user=st.secrets.db_credentials.user,
password=st.secrets.db_credentials.password,
db=st.secrets.db_credentials.database

Navigate to the URL of the deployed app, open its settings, and paste the contents of .streamlit/secrets.toml into the Secrets section.

The above steps ensure that your streamlit app works seamlessly on your local computer and the Streamlit deployed version.
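
The local part of these steps can be scripted. The sketch below creates the secrets file with placeholder values and adds it to .gitignore so the credentials never reach the remote repository. The helper name init_streamlit_secrets is an assumption for illustration, and the placeholder values must be replaced with real ones.

```shell
# init_streamlit_secrets PROJECT_DIR
# Create .streamlit/secrets.toml with placeholder values (if missing) and
# make sure Git ignores it. Hypothetical helper; values are placeholders.
init_streamlit_secrets() {
  project_dir=$1
  secrets_file="$project_dir/.streamlit/secrets.toml"
  gitignore="$project_dir/.gitignore"

  mkdir -p "$project_dir/.streamlit"
  if [ ! -f "$secrets_file" ]; then
    cat > "$secrets_file" <<'EOF'
[db_credentials]
user = 'someuser'
password = 'somepassword'
host = 'somehost'
database = 'somedatabase'
EOF
    chmod 600 "$secrets_file"   # readable by the owner only
  fi

  # Append the ignore rule only once.
  touch "$gitignore"
  grep -qxF '.streamlit/secrets.toml' "$gitignore" ||
    echo '.streamlit/secrets.toml' >> "$gitignore"
}

# Usage from the project root:
#   init_streamlit_secrets .
```

Running the helper twice is harmless: an existing secrets file is left untouched and the ignore rule is not duplicated.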

If you find the article helpful, kindly click the like button and comment.

You can follow me on LinkedIn and Twitter for more updates.

Automating AWS Infrastructure with Terraform

MUHAMMAD ABIODUN SULAIMAN — Tue, 30 Aug 2022 16:30:41 +0000

What is Cloud Automation and Why is it Important in IT?

Cloud automation broadly refers to the processes and tools used to provision and manage cloud computing workloads and infrastructure. These processes and tools aim to reduce or eliminate manual processes, saving costs and resources.

With growing demand to serve a wider group of customers and expand the customer base, organizations need to consider adopting cloud automation, as it lets them scale efficiently. Several IT companies have adopted cloud automation to optimize their business resources while staying on top of their game.
Some of the reasons companies may want to adopt cloud automation are listed below:

It saves an IT team time and money;
It is faster, more secure and more scalable than manually performing tasks;
It leads to fewer errors, as organizations can construct more predictable and reliable workflows; and
It contributes directly to better IT and corporate governance.

Next, we shall explore cloud automation tools, which fall into the following categories.
Cloud Providers (private and public): These include:

i. AWS: AWS Config, AWS CloudFormation, AWS EC2 Systems Manager;
ii. Azure: Microsoft Azure Resource Manager, Azure Automation;
iii. Google: Google Cloud Composer, Cloud Deployment Manager; and
iv. IBM: IBM Cloud Orchestrator.

Configuration Management tools: Most of these tools support provisioning and configuring Infrastructure-as-a-Service (IaaS) resources. Examples include:

i. Red Hat Ansible,
ii. Puppet Enterprise,
iii. Chef Automate,
iv. Salt/SaltStack, and
v. HashiCorp Terraform

Many multi-cloud management vendors incorporate automation capabilities into their tools. Some prominent ones are:

i. VMware,
ii. CloudBolt,
iii. CloudSphere (Hypergrid),
iv. Snow (Embotics),
v. Morpheus Data,
vi. Scalr, and
vii. Flexera (RightScale).

To read more about cloud automation, click here.

Having covered the basics of cloud automation, its importance, and the tools used for it, we shall now use some of those tools to build a project. We will use AWS, HashiCorp Terraform, and Microsoft Visual Studio Code (VSC) as our IDE.

This project introduces beginners to Cloud Infrastructure Automation, with AWS as the cloud provider. A detailed explanation of the processes and resources created in this small project can be found here. Simply follow the guides here, and you should be fine. Feel free to fork this repo, raise a pull request to contribute to this project, and raise an issue if you encounter any challenges.

This project creates nine resources/processes in our AWS account directly from Terraform. These resources are:

1. Virtual Private Cloud (VPC): A virtual private cloud (VPC) is a secure, isolated private cloud hosted within a public cloud. VPCs combine the scalability and convenience of public cloud computing with the data isolation of private cloud computing. Click here to read more about VPC.
2. Internet Gateway: An Internet gateway is a network "node" connecting two networks that use different protocols (rules) to communicate. In the most basic terms, an Internet gateway is where data stops on its way to or from other networks. In this project it is attached to the VPC so that resources inside the VPC can reach the internet. Click here to read more about internet gateway.
3. Custom Route Table: A route table contains a set of rules, called routes, that determine where network traffic from your subnet or gateway is directed. Put more plainly, a route table tells network packets which way to go to reach their destination. Click here to read more about custom route table.
4. Subnet: A subnet, or subnetwork, is a segmented piece of a more extensive network. More specifically, subnets are a logical partition of an IP network into multiple, smaller network segments. Click here to read more about subnets.
5. Route Table Association: We associated the subnet created in step 4 with the route table created in step 3.
6. Security Group: We created a security group that allows traffic on ports 22 (SSH), 80 (HTTP), and 443 (HTTPS).
7. Network Interface: We created a network interface with an IP in the subnet created in step 4.
8. Elastic IP: We assigned an Elastic IP to the network interface created in step 7.
9. Ubuntu Server: We created an Ubuntu server and installed and enabled apache2.
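
Once the Terraform configuration for these nine resources is in place, creating them usually follows the init, validate, plan, apply sequence. The sketch below wraps that sequence with a dry-run switch so the commands can be previewed first. The run and terraform_deploy helpers, the DRY_RUN variable, and the tfplan file name are assumptions chosen for this illustration and are not taken from the project repository.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# terraform_deploy: the usual init -> validate -> plan -> apply sequence.
# "tfplan" is an arbitrary file name chosen for this sketch.
terraform_deploy() {
  run terraform init -input=false &&
  run terraform validate &&
  run terraform plan -input=false -out=tfplan &&
  run terraform apply -input=false tfplan
}

# Usage from the directory that holds the .tf files:
#   DRY_RUN=1 terraform_deploy   # preview the commands
#   terraform_deploy             # create the resources
#   terraform destroy            # tear everything down when finished
```

Saving the plan to a file and applying that file means the changes that were reviewed are the changes that get applied.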

Remember to drop a comment and drop a like if you have benefitted from this article.

You can connect with me via:
Email: abiodun.msulaiman@gmail.com or
LinkedIn: Muhammad Abiodun Sulaiman