Warren Jitsing

Posted on Dec 23, 2025

Part 08: Building a Sovereign Software Factory: Observability with the ELK Stack

#elasticsearch #monitoring #tutorial #devops

GitHub: https://github.com/InfiniteConsult/0012_cicd_part08_elk

TL;DR: In this installment, we solve the "Noise Problem" where fragmented logs make debugging impossible. We deploy the ELK Stack (Elasticsearch, Logstash, Kibana) as our "Radar System," but first, we must navigate the "Abstraction Leak" of Elasticsearch by tuning the host kernel's vm.max_map_count via an architect script. We implement a sophisticated Ingest Pipeline to parse logs from GitLab, SonarQube, and Artifactory into structured data, and deploy Filebeat as a host-level collector to harvest logs without modifying our existing containers.

The Sovereign Software Factory Series:

Part 01: Building a Sovereign Software Factory: Docker Networking & Persistence
Part 02: Building a Sovereign Software Factory: The Local Root CA & Trust Chains
Part 03: Building a Sovereign Software Factory: Self-Hosted GitLab & Secrets Management
Part 04: Building a Sovereign Software Factory: Jenkins Configuration as Code (JCasC)
Part 05: Building a Sovereign Software Factory: Artifactory & The "Strict TLS" Trap
Part 06: Building a Sovereign Software Factory: SonarQube Quality Gates
Part 07: Building a Sovereign Software Factory: ChatOps with Mattermost
Part 08: Building a Sovereign Software Factory: Observability with the ELK Stack (You are here)
Part 09: Building a Sovereign Software Factory: Monitoring with Prometheus & Grafana
Part 10: Building a Sovereign Software Factory: The Python API Package (Capstone)

Chapter 1: The "Black Box" Problem

1.1 The 3 AM Scenario

We have spent the last seven articles building a robust "Software Factory." We have a Git server, a Build server, a Quality Gate, an Artifact Warehouse, and a Chat bot. The lights are on, the machines are humming, and the code is flowing.

But we have a problem.

Imagine it is 3:00 AM. You receive a frantic message from a developer: "The build is failing, and I can't merge the hotfix."

You log into Jenkins. The build status is red, but the console output just says Build failed: Connection refused. Connection to what? Is Artifactory down? Did the SonarQube token expire? Is the GitLab webhook failing?

To find out, you have to perform the "Systems Administrator Dance":

SSH into the host server.
Run docker ps to find the container IDs.
Run docker logs -f jenkins-controller and scroll through thousands of lines of Java stack traces.
Open a second terminal.
Run docker logs -f gitlab and grep for Nginx errors.
Open a third terminal for SonarQube.

You are manually correlating timestamps across three different windows, trying to guess if a log entry at 15:01:05 in Jenkins matches an error at 15:01:06 in GitLab.

This is the "Black Box" Problem. We have built a distributed system, but we are managing it like a monolith. We have excellent components, but zero visibility into how they interact.

In this article, we are going to fix this. We are going to stop treating logs as text files scattered across disk drives and start treating them as Data. We will build a Centralized Observability Pipeline that allows us to answer complex questions ("Show me all errors that happened in the last 5 minutes across all services") without ever touching the SSH command.

1.2 Architecture from First Principles

To solve this, we need a "Central Investigation Office." In the industry, this is typically handled by the ELK Stack—an acronym for three open-source tools maintained by Elastic:

Elasticsearch (The Brain): A search engine and database. It stores our logs not as text strings, but as structured JSON documents. It allows us to search terabytes of data in milliseconds.
Logstash (The Parser): Historically, this was the middleman. It received messy text streams, used Regex to parse them into fields, and then sent the clean data to the database.
Kibana (The Interface): The visualization layer. This is the dashboard where we will build graphs, charts, and maps to visualize the data stored in "The Brain."

However, for our platform, we are going to make a crucial architectural deviation to keep our stack efficient.

We are removing Logstash.

In a massive enterprise with thousands of servers, Logstash is useful as a heavy-duty buffer and router. But for a single-node "Factory in a Box" like ours, Logstash is unnecessary overhead. It requires a heavy JVM (Java Virtual Machine) and adds latency and complexity to our deployment.

Instead, we will use Filebeat as our shipping agent. Filebeat is lightweight, written in Go, and sits directly on the host. It will ship logs directly to Elasticsearch.

But this creates a problem: if we remove the Parser (Logstash), who parses the logs? If Jenkins sends a raw text line, who turns it into a JSON object?

The answer lies in Elasticsearch Ingest Pipelines. Modern Elasticsearch has the ability to run parsing logic (Grok patterns, Date parsing, GeoIP lookups) on the incoming data itself, just before it is indexed. This simplifies our stack significantly: fewer containers to manage, less RAM consumed, and a more direct data path. We are effectively moving the "intelligence" from the middleman to the destination.

1.3 The "Zero-Privilege" Mandate

When deploying a log collector, you face a critical security choice: How do you access the data?

Option A is Dynamic Autodiscovery. This is a feature often used in large clusters where Filebeat automatically detects when a new container starts, identifies it via the Docker API, and begins harvesting its logs. To do this, Filebeat typically requires access to the Docker Socket (/var/run/docker.sock).

We must treat the Docker Socket as a "Root Key." If a malicious actor compromises a container that has access to this socket, they can issue API commands to spin up new privileged containers, mount the host's root filesystem, and take over the entire server.

Option B is Static File Ingestion. This is the "dumb," manual approach. We explicitly tell Filebeat: "There are log files in this specific directory. Read them."

We choose Option B.

We will not mount the Docker socket. Instead, we will mount the host's volume directory (/var/lib/docker/volumes) into Filebeat as Read-Only. Filebeat will look at the disk, see the log files generated by Jenkins and Nginx, and ship them. It cannot query the Docker API, it cannot inspect container metadata, and most importantly, it cannot control the Docker daemon.

This makes our configuration slightly more verbose (we have to map paths manually), but it ensures that our logging infrastructure cannot be weaponized against us.

Chapter 2: The Architect Script

2.1 The Kernel Barrier: `vm.max_map_count` (Again)

Before we can launch a single container, we must address the Operating System.

If you recall from Article 6 (SonarQube), we encountered a critical kernel tunable called vm.max_map_count. SonarQube failed to start because its embedded Elasticsearch engine required more memory map areas than the default Linux kernel allows.

Now that we are running a dedicated, standalone Elasticsearch cluster, this requirement is even more strict.

Elasticsearch relies heavily on mmap (memory-mapped files) to store its indices. This allows the database to map logical file contents directly into the process's virtual memory space, letting the OS handle the paging. It is a massive performance optimization.

However, the default limit on most Linux distros (Debian included) is 65,530 maps. Elasticsearch requires at least 262,144.

If we ignore this, the Elasticsearch container will start, print a cryptic bootstrap check failure, and immediately die. Because this is a Kernel setting, we cannot fix it inside the Dockerfile; it must be applied to the Host.

This necessitates a dedicated setup script. We cannot simply rely on docker run commands; we need an "Architect" script (01-setup-elk.sh) to prepare the physical ground before we pour the digital concrete.

Here is the complete script. Save this as 01-setup-elk.sh.

#!/usr/bin/env bash

#
# -----------------------------------------------------------
#               01-setup-elk.sh
#
#  The "Architect" script for the ELK Stack.
#
#  1. Kernel: Enforces vm.max_map_count=262144.
#  2. Secrets: Generates Passwords & Encryption Keys.
#  3. Preparation: Temporarily owns config dirs for writing.
#  4. Configs: Generates configs for ES, Kibana, Filebeat.
#  5. Integration: Configures Jenkins sidecar logging & redeploys.
#  6. Permissions: Enforces strict ownership (Root/UID 1000).
#
# -----------------------------------------------------------

set -e

# --- 1. Define Paths ---
HOST_CICD_ROOT="$HOME/cicd_stack"
ELK_BASE="$HOST_CICD_ROOT/elk"
MASTER_ENV_FILE="$HOST_CICD_ROOT/cicd.env"

# Jenkins Integration Paths
JENKINS_MODULE_DIR="$HOME/Documents/FromFirstPrinciples/articles/0008_cicd_part04_jenkins"
JENKINS_ENV_FILE="$JENKINS_MODULE_DIR/jenkins.env"
JENKINS_VOL_DATA="/var/lib/docker/volumes/jenkins-home/_data"

# Source Certificate Paths
CA_DIR="$HOST_CICD_ROOT/ca"
SRC_CA_CRT="$CA_DIR/pki/certs/ca.pem"

SRC_ES_CRT="$CA_DIR/pki/services/elk/elasticsearch.cicd.local/elasticsearch.cicd.local.crt.pem"
SRC_ES_KEY="$CA_DIR/pki/services/elk/elasticsearch.cicd.local/elasticsearch.cicd.local.key.pem"

SRC_KIB_CRT="$CA_DIR/pki/services/elk/kibana.cicd.local/kibana.cicd.local.crt.pem"
SRC_KIB_KEY="$CA_DIR/pki/services/elk/kibana.cicd.local/kibana.cicd.local.key.pem"

# Config Destinations
ES_CONFIG_DIR="$ELK_BASE/elasticsearch/config"
KIBANA_CONFIG_DIR="$ELK_BASE/kibana/config"
FILEBEAT_CONFIG_DIR="$ELK_BASE/filebeat/config"

echo "🚀 Starting ELK 'Architect' Setup..."

# --- 2. Host Kernel Tuning ---
echo "--- Phase 1: Kernel Tuning ---"
REQUIRED_MAP_COUNT=262144
SYSCTL_CONF="/etc/sysctl.conf"

CURRENT_MAP_COUNT=$(sudo sysctl -n vm.max_map_count)
if [ "$CURRENT_MAP_COUNT" -lt "$REQUIRED_MAP_COUNT" ]; then
    echo "Updating runtime limit..."
    sudo sysctl -w vm.max_map_count=$REQUIRED_MAP_COUNT
else
    echo "Runtime limit sufficient."
fi

if ! grep -q "vm.max_map_count=$REQUIRED_MAP_COUNT" "$SYSCTL_CONF"; then
    echo "Persisting limit in $SYSCTL_CONF..."
    sudo sed -i '/vm.max_map_count/d' "$SYSCTL_CONF"
    echo "vm.max_map_count=$REQUIRED_MAP_COUNT" | sudo tee -a "$SYSCTL_CONF" > /dev/null
fi

# --- 3. Directory & Secrets Setup ---
echo "--- Phase 2: Secrets & Directories ---"

sudo mkdir -p "$ES_CONFIG_DIR/certs"
sudo mkdir -p "$KIBANA_CONFIG_DIR/certs"
sudo mkdir -p "$FILEBEAT_CONFIG_DIR/certs"

if [ ! -f "$MASTER_ENV_FILE" ]; then touch "$MASTER_ENV_FILE"; fi

key_exists() { grep -q "^$1=" "$MASTER_ENV_FILE"; }
generate_secret() { openssl rand -hex 32; }
generate_password() { openssl rand -hex 16; }

update_env=false

if ! key_exists "ELASTIC_PASSWORD"; then
    echo "ELASTIC_PASSWORD=\"$(generate_password)\"" >> "$MASTER_ENV_FILE"
    update_env=true
fi
if ! key_exists "KIBANA_PASSWORD"; then
    echo "KIBANA_PASSWORD=\"$(generate_password)\"" >> "$MASTER_ENV_FILE"
    update_env=true
fi
if ! key_exists "XPACK_SECURITY_ENCRYPTIONKEY"; then
    echo "XPACK_SECURITY_ENCRYPTIONKEY=\"$(generate_secret)\"" >> "$MASTER_ENV_FILE"
    echo "XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY=\"$(generate_secret)\"" >> "$MASTER_ENV_FILE"
    echo "XPACK_REPORTING_ENCRYPTIONKEY=\"$(generate_secret)\"" >> "$MASTER_ENV_FILE"
    update_env=true
fi

if [ "$update_env" = true ]; then echo "Secrets generated."; else echo "Secrets loaded."; fi

set -a; source "$MASTER_ENV_FILE"; set +a

# --- 4. Ownership Handoff ---
echo "--- Phase 3: Ownership Handoff ---"
CURRENT_USER=$(id -u)
CURRENT_GROUP=$(id -g)
sudo chown -R "$CURRENT_USER:$CURRENT_GROUP" "$ELK_BASE"

# --- 5. Staging Certificates ---
echo "--- Phase 4: Staging Certificates ---"
cp "$SRC_ES_CRT" "$ES_CONFIG_DIR/certs/elasticsearch.crt"
cp "$SRC_ES_KEY" "$ES_CONFIG_DIR/certs/elasticsearch.key"
cp "$SRC_CA_CRT" "$ES_CONFIG_DIR/certs/ca.pem"

cp "$SRC_KIB_CRT" "$KIBANA_CONFIG_DIR/certs/kibana.crt"
cp "$SRC_KIB_KEY" "$KIBANA_CONFIG_DIR/certs/kibana.key"
cp "$SRC_CA_CRT"  "$KIBANA_CONFIG_DIR/certs/ca.pem"

cp "$SRC_CA_CRT"  "$FILEBEAT_CONFIG_DIR/certs/ca.pem"

# --- 6. Configuration Generation ---
echo "--- Phase 5: Generating Config Files ---"

# A. Elasticsearch
cat << EOF > "$ES_CONFIG_DIR/elasticsearch.yml"
cluster.name: "cicd-elk"
node.name: "elasticsearch.cicd.local"
network.host: 0.0.0.0
discovery.type: single-node

path.data: /usr/share/elasticsearch/data
path.logs: /usr/share/elasticsearch/logs

xpack.security.enabled: true

xpack.security.http.ssl:
  enabled: true
  key: /usr/share/elasticsearch/config/certs/elasticsearch.key
  certificate: /usr/share/elasticsearch/config/certs/elasticsearch.crt
  certificate_authorities: [ "/usr/share/elasticsearch/config/certs/ca.pem" ]

xpack.security.transport.ssl:
  enabled: true
  key: /usr/share/elasticsearch/config/certs/elasticsearch.key
  certificate: /usr/share/elasticsearch/config/certs/elasticsearch.crt
  certificate_authorities: [ "/usr/share/elasticsearch/config/certs/ca.pem" ]

ingest.geoip.downloader.enabled: false
EOF

# B. Kibana
cat << EOF > "$KIBANA_CONFIG_DIR/kibana.yml"
server.host: "0.0.0.0"
server.name: "kibana.cicd.local"
server.publicBaseUrl: "https://kibana.cicd.local:5601"

server.ssl.enabled: true
server.ssl.certificate: "/usr/share/kibana/config/certs/kibana.crt"
server.ssl.key: "/usr/share/kibana/config/certs/kibana.key"
server.ssl.certificateAuthorities: ["/usr/share/kibana/config/certs/ca.pem"]

elasticsearch.hosts: [ "https://elasticsearch.cicd.local:9200" ]
elasticsearch.ssl.certificateAuthorities: [ "/usr/share/kibana/config/certs/ca.pem" ]
elasticsearch.username: "kibana_system"
elasticsearch.password: "\${ELASTICSEARCH_PASSWORD}"

xpack.security.encryptionKey: "$XPACK_SECURITY_ENCRYPTIONKEY"
xpack.encryptedSavedObjects.encryptionKey: "$XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY"
xpack.reporting.encryptionKey: "$XPACK_REPORTING_ENCRYPTIONKEY"

telemetry.enabled: false
telemetry.optIn: false
newsfeed.enabled: false
map.includeElasticMapsService: false
xpack.fleet.enabled: false
xpack.apm.enabled: false

xpack.actions.preconfigured:
  mattermost-webhook:
    name: "Mattermost CI/CD Channel"
    actionTypeId: .webhook
    config:
      url: "\${MATTERMOST_WEBHOOK_URL}"
      method: post
      hasAuth: false
EOF

# C. Filebeat
cat << EOF > "$FILEBEAT_CONFIG_DIR/filebeat.yml"
filebeat.inputs:
  # 1. Jenkins (Sidecar File)
  # Reads the file generated by standard Java Logging
  - type: filestream
    id: jenkins-log
    paths:
      - /host_volumes/jenkins-home/_data/logs/jenkins.log*
    prospector.scanner.exclude_files: ['\.lck$']
    fields: { service_name: "jenkins" }
    fields_under_root: true
    multiline.type: pattern
    multiline.pattern: '^\[\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

  # 2. Host System Logs (Journald)
  # Reads binary logs directly from host journal directories
  - type: journald
    id: host-system
    paths:
      - /var/log/journal
    fields: { service_name: "system" }
    fields_under_root: true

  # 3. GitLab Nginx
  - type: filestream
    id: gitlab-nginx
    paths:
      - /host_volumes/gitlab-logs/_data/nginx/*access.log
      - /host_volumes/gitlab-logs/_data/nginx/*error.log
    fields: { service_name: "gitlab-nginx" }
    fields_under_root: true

  # 4. SonarQube CE
  - type: filestream
    id: sonarqube-ce
    paths:
      - /host_volumes/sonarqube-logs/_data/ce.log
    fields: { service_name: "sonarqube" }
    fields_under_root: true
    multiline.type: pattern
    multiline.pattern: '^\d{4}.\d{2}.\d{2}'
    multiline.negate: true
    multiline.match: after

  # 5. Mattermost
  - type: filestream
    id: mattermost
    paths:
      - /host_volumes/mattermost-logs/_data/mattermost.log
    fields: { service_name: "mattermost" }
    fields_under_root: true

  # 6. Artifactory
  - type: filestream
    id: artifactory
    paths:
      - /host_volumes/artifactory-data/_data/log/artifactory-service.log
      - /host_volumes/artifactory-data/_data/log/artifactory-request.log
      - /host_volumes/artifactory-data/_data/log/access-service.log
    fields: { service_name: "artifactory" }
    fields_under_root: true
    multiline.type: pattern
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

output.elasticsearch:
  hosts: ["https://elasticsearch.cicd.local:9200"]
  pipeline: "cicd-logs"
  protocol: "https"
  ssl.certificate_authorities: ["/usr/share/filebeat/certs/ca.pem"]
  username: "elastic"
  password: "$ELASTIC_PASSWORD"

setup.ilm.enabled: false
setup.template.enabled: false
EOF

# D. Jenkins Logging Configuration (Sidecar Strategy)
# 1. Create the Java logging properties file in the Jenkins volume
sudo mkdir -p "$JENKINS_VOL_DATA/logs"
echo "--- Generating Jenkins logging.properties ---"
cat << EOF | sudo tee "$JENKINS_VOL_DATA/logging.properties" > /dev/null
handlers = java.util.logging.FileHandler
.level = INFO

# File Handler Configuration
# Pattern: %h = user.home (jenkins_home), %g = generation number
java.util.logging.FileHandler.pattern = %h/logs/jenkins.log
java.util.logging.FileHandler.limit = 10485760
java.util.logging.FileHandler.count = 3
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.FileHandler.append = true

# Format: [YYYY-MM-DD HH:MM:SS] [LEVEL] Logger - Message
java.util.logging.SimpleFormatter.format = [%1\$tF %1\$tT] [%4\$s] %3\$s - %5\$s %6\$s%n
EOF

# Ensure Jenkins User (UID 1000) owns the config
sudo chown -R 1000:1000 "$JENKINS_VOL_DATA/logs"
sudo chown 1000:1000 "$JENKINS_VOL_DATA/logging.properties"

# 2. Update Jenkins Environment to use this config
if [ -f "$JENKINS_ENV_FILE" ]; then
    if ! grep -q "java.util.logging.config.file" "$JENKINS_ENV_FILE"; then
        echo "--- Injecting JAVA_OPTS into jenkins.env ---"
        echo "" >> "$JENKINS_ENV_FILE"
        echo "# ELK Integration: Sidecar Logging" >> "$JENKINS_ENV_FILE"
        echo 'JAVA_OPTS=-Djava.util.logging.config.file=/var/jenkins_home/logging.properties' >> "$JENKINS_ENV_FILE"
    else
        echo "Jenkins env already configured for logging."
    fi

    # 3. Redeploy Jenkins to apply changes
    echo "--- Redeploying Jenkins Controller ---"
    (cd "$JENKINS_MODULE_DIR" && ./03-deploy-controller.sh)
else
    echo "⚠️ WARNING: Jenkins module not found at $JENKINS_MODULE_DIR."
    echo "   Jenkins logging will not be active until deployed."
fi

# --- 7. Env Files ---
echo "--- Phase 6: Scoped Env Files ---"
cat << EOF > "$ELK_BASE/elasticsearch/elasticsearch.env"
ELASTIC_PASSWORD=$ELASTIC_PASSWORD
ES_JAVA_OPTS=-Xms1g -Xmx1g
EOF

cat << EOF > "$ELK_BASE/kibana/kibana.env"
ELASTICSEARCH_PASSWORD=$KIBANA_PASSWORD
XPACK_SECURITY_ENCRYPTIONKEY=$XPACK_SECURITY_ENCRYPTIONKEY
XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY=$XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY
XPACK_REPORTING_ENCRYPTIONKEY=$XPACK_REPORTING_ENCRYPTIONKEY
MATTERMOST_WEBHOOK_URL=$SONAR_MATTERMOST_WEBHOOK
EOF

cat << EOF > "$ELK_BASE/filebeat/filebeat.env"
ELASTIC_PASSWORD=$ELASTIC_PASSWORD
EOF

# --- 8. Final Permissions Lockdown ---
echo "--- Phase 7: Locking Down Permissions ---"
chmod 600 "$ES_CONFIG_DIR/certs/"*.key
chmod 600 "$KIBANA_CONFIG_DIR/certs/"*.key
chmod 644 "$ES_CONFIG_DIR/certs/"*.crt
chmod 644 "$KIBANA_CONFIG_DIR/certs/"*.crt
chmod 644 "$FILEBEAT_CONFIG_DIR/certs/"*.pem

chmod 600 "$ELK_BASE"/*/*.env

sudo chown -R 1000:0 "$ES_CONFIG_DIR"
sudo chown -R 1000:0 "$KIBANA_CONFIG_DIR"
sudo chown 1000:0 "$ELK_BASE/elasticsearch/elasticsearch.env"
sudo chown 1000:0 "$ELK_BASE/kibana/kibana.env"

sudo chown -R root:root "$FILEBEAT_CONFIG_DIR"
sudo chown root:root "$ELK_BASE/filebeat/filebeat.env"

sudo chmod 644 "$FILEBEAT_CONFIG_DIR/filebeat.yml"
sudo chmod 644 "$ELK_BASE/elasticsearch/elasticsearch.env"
sudo chmod 644 "$ELK_BASE/kibana/kibana.env"
sudo chmod 644 "$ELK_BASE/filebeat/filebeat.env"

echo "✅ Architect Setup Complete."

2.2 Secrets and Security Hygiene

In Phase 2 of the script, we implement a "Generate Once, Keep Forever" logic to handle credentials securely.

We do not hardcode passwords in our configuration files. Instead, the script performs a check:

It looks for the cicd.env file.
If the ELASTIC_PASSWORD key is missing, it uses openssl rand -hex 16 to generate a high-entropy 32-character password.
It appends this new secret to the environment file.

This ensures that every deployment gets a unique, strong password that persists across restarts, without requiring manual intervention.

The XPACK Encryption Keys
Beyond simple passwords, the script also generates three distinct 32-character strings for XPACK Security:

xpack.security.encryptionKey
xpack.encryptedSavedObjects.encryptionKey
xpack.reporting.encryptionKey

Kibana uses these keys to encrypt sensitive data at rest within its internal indices—specifically "Saved Objects" like dashboards, visualization settings, and alerting rules. It is critical that these keys remain constant. If you were to regenerate them (e.g., by deleting cicd.env), Kibana would lose the ability to decrypt your existing dashboards, rendering them inaccessible.

The Ownership Handoff
Docker has a known friction point regarding file permissions. When you mount a host directory into a container, the container process sees the directory with the numeric UID of the host.

Elasticsearch runs strictly as UID 1000.
If we (the host user) create the ./elk/elasticsearch/data folder, it is owned by our user (likely 1000), which allows the container to write.
However, if we let Docker create the folder automatically during the first docker run, it is often created as Root. The Elasticsearch process will then crash immediately because it lacks permission to write to its own data directory.

To prevent this race condition, Phase 3 of our script preemptively creates the directory structure and explicitly executes chown -R $CURRENT_USER:$CURRENT_GROUP. This ensures the storage volume is correctly owned before the database attempts to initialize.

2.3 Staging Certificates

In Article 2, we established a centralized Certificate Authority (CA) and generated certificates for all our services. These files currently reside in ~/cicd_stack/ca/pki.

However, we do not simply mount the entire ca directory into every container. That would violate the Principle of Least Privilege; Elasticsearch does not need to see GitLab's private key.

In Phase 4 of the script, we perform a "Staging" operation. We copy only the specific certificate (crt) and private key (key) required for each service into its respective configuration directory:

./elk/elasticsearch/config/certs/ gets elasticsearch.crt and .key.
./elk/kibana/config/certs/ gets kibana.crt and .key.
Everyone gets ca.pem (the Trust Anchor).

This physical separation ensures that if the Elasticsearch container were compromised, the attacker would only retrieve the Elasticsearch identity, not the keys to the entire kingdom.

2.4 Dynamic Configuration Generation

Most tutorials ask you to manually edit elasticsearch.yml and kibana.yml. This is error-prone and tedious, especially when dealing with dynamic secrets like our encryption keys.

In Phase 5, we use Bash "Heredocs" (cat << EOF) to generate these configuration files programmatically. This allows us to inject the environment variables ($ELASTIC_PASSWORD, $XPACK_SECURITY_ENCRYPTIONKEY) directly into the configuration files at creation time.

Key Configuration Details:

Elasticsearch (elasticsearch.yml):
- xpack.security.enabled: true: This turns on the security features (Authentication and TLS). Without this, anyone on the network could query the database.
- network.host: 0.0.0.0: Essential for Docker networking. If left at default (localhost), the container would be unreachable from Kibana.
Kibana (kibana.yml):
- elasticsearch.ssl.certificateAuthorities: We explicitly tell Kibana to trust our self-signed CA. Without this, Kibana would refuse to connect to Elasticsearch over HTTPS.
- xpack.actions.preconfigured: We inject the Mattermost Webhook URL here. This pre-configures the alerting connector so we don't have to set it up manually in the UI later.
Filebeat (filebeat.yml):
- This is where we define our "Inputs." Notice we have separate sections for filestream (reading log files) and journald (reading system logs).
- We also define the setup.ilm.enabled: false. Since we are a single node, we disable Index Lifecycle Management (ILM) setup to keep the configuration simple and prevent Filebeat from trying to manage cluster-wide policies.

2.5 The Jenkins "Sidecar" Injection

One of the most complex parts of this script is Phase 5-D, where we fix the Jenkins logging problem.

Standard Jenkins logs are unstructured text printed to STDOUT. To make them parseable, we need Jenkins to write structured logs to a file. We achieve this without rebuilding the Jenkins image by using a "Configuration Injection" strategy:

Generate Config: The script writes a logging.properties file to the Jenkins data volume. This file defines a java.util.logging.FileHandler that writes logs to /var/jenkins_home/logs/jenkins.log.
Inject Env Var: The script appends JAVA_OPTS=-Djava.util.logging.config.file=... to jenkins.env.
Redeploy: It triggers the Jenkins deploy script (03-deploy-controller.sh) to restart the container.

When Jenkins restarts, it reads the new environment variable, loads the logging config, and begins writing clean, rotatable log files to the volume—which Filebeat is already configured to watch.

2.6 Final Permissions Lockdown

In Phase 7, we enforce strict file permissions. This is critical because some components (like Elasticsearch) check permissions on startup and will refuse to run if their configuration files are too open (e.g., world-readable).

Private Keys (*.key): set to 600 (Read/Write by owner only).
Public Certs (*.crt): set to 644 (Readable by everyone).
Elasticsearch/Kibana Configs: Owned by 1000:0 (User 1000, Root Group).
Filebeat Configs: Owned by root:root. This is mandatory. Filebeat requires its configuration file to be owned by root and not writable by others (chmod 644). If you miss this, Filebeat will fail to start with a "config file permission error."

With the Architect script complete, the foundation is laid. The kernel is tuned, secrets are generated, configs are written, and permissions are locked. We are ready to deploy the database.

Chapter 3: The Bedrock – Elasticsearch

3.1 The Deployment Script

With the "Architect" script complete, the kernel is tuned and the secrets are safely stored. We are now ready to lay the foundation of our observability stack: the Elasticsearch database.

This is not a generic "Hello World" deployment. This script is designed to handle the specific startup requirements of a production-grade search engine, including memory locking, file descriptor management, and an automated security bootstrap process.

Create 02-deploy-elasticsearch.sh with the following content:

#!/usr/bin/env bash

#
# -----------------------------------------------------------
#               02-deploy-elasticsearch.sh
#
#  The "Bedrock" script.
#  Deploys Elasticsearch (v9.2.2) and bootstraps security.
#
#  1. Deploy: Runs ES with strict memory/ulimit guards.
#  2. Healthcheck: Waits for Green status via HTTPS (using CA).
#  3. Bootstrap: Sets 'kibana_system' password via API.
#
# -----------------------------------------------------------

set -e
echo "🚀 Deploying Elasticsearch (The Bedrock)..."

# --- 1. Load Secrets ---
HOST_CICD_ROOT="$HOME/cicd_stack"
ELK_BASE="$HOST_CICD_ROOT/elk"
SCOPED_ENV_FILE="$ELK_BASE/elasticsearch/elasticsearch.env"
MASTER_ENV_FILE="$HOST_CICD_ROOT/cicd.env"

if [ ! -f "$SCOPED_ENV_FILE" ]; then
    echo "ERROR: Scoped env file not found at $SCOPED_ENV_FILE"
    echo "Please run 01-setup-elk.sh first."
    exit 1
fi

# We need KIBANA_PASSWORD from master env for the bootstrap step
if [ ! -f "$MASTER_ENV_FILE" ]; then
    echo "ERROR: Master env file not found."
    exit 1
fi
set -a; source "$MASTER_ENV_FILE"; set +a

if [ -z "$KIBANA_PASSWORD" ]; then
    echo "ERROR: KIBANA_PASSWORD not found in cicd.env"
    exit 1
fi

# --- 2. Clean Slate ---
if [ "$(docker ps -q -f name=elasticsearch)" ]; then
    echo "Stopping existing 'elasticsearch'..."
    docker stop elasticsearch
fi
if [ "$(docker ps -aq -f name=elasticsearch)" ]; then
    echo "Removing existing 'elasticsearch'..."
    docker rm elasticsearch
fi

# --- 3. Volume Management ---
echo "Verifying elasticsearch-data volume..."
docker volume create elasticsearch-data > /dev/null

# --- 4. Deploy ---
echo "--- Launching Container ---"

# NOTES:
# - ulimit: Essential for ES performance (avoids bootstrap checks failure)
# - cap-add IPC_LOCK: Allows memory locking to prevent swapping
# - publish: 9200 bound to 127.0.0.1 (Host access only)

docker run -d \
  --name elasticsearch \
  --restart always \
  --network cicd-net \
  --hostname elasticsearch.cicd.local \
  --publish 127.0.0.1:9200:9200 \
  --ulimit nofile=65535:65535 \
  --ulimit memlock=-1:-1 \
  --cap-add=IPC_LOCK \
  --env-file "$SCOPED_ENV_FILE" \
  --volume elasticsearch-data:/usr/share/elasticsearch/data \
  --volume "$ELK_BASE/elasticsearch/config/elasticsearch.yml":/usr/share/elasticsearch/config/elasticsearch.yml \
  --volume "$ELK_BASE/elasticsearch/config/certs":/usr/share/elasticsearch/config/certs \
  docker.elastic.co/elasticsearch/elasticsearch:9.2.2

echo "Container started. Waiting for healthcheck..."

# --- 5. Secure Bootstrap (The "Zero Touch" Logic) ---

MAX_RETRIES=60
COUNT=0
ES_URL="https://127.0.0.1:9200"

# A. Extract ELASTIC_PASSWORD securely (Avoiding 'source' errors)
ELASTIC_PASSWORD=$(grep "^ELASTIC_PASSWORD=" "$SCOPED_ENV_FILE" | cut -d'=' -f2 | tr -d '"')

# Wait for "status" to be green OR yellow
# We removed --cacert because the host trusts the CA
until curl -s -u "elastic:$ELASTIC_PASSWORD" "$ES_URL/_cluster/health" | grep -qE '"status":"(green|yellow)"'; do
    if [ $COUNT -ge $MAX_RETRIES ]; then
        echo "❌ Timeout waiting for Elasticsearch."
        echo "Check logs: docker logs elasticsearch"
        exit 1
    fi
    echo "   [$COUNT/$MAX_RETRIES] Waiting for Green/Yellow status..."
    sleep 5
    COUNT=$((COUNT+1))
done

echo "✅ Elasticsearch is Online."

# B. Set kibana_system Password
echo "--- Bootstrapping Service Accounts ---"

# 1. Run the command and let it print directly to stdout
curl -i \
    -X POST "$ES_URL/_security/user/kibana_system/_password" \
    -u "elastic:$ELASTIC_PASSWORD" \
    -H "Content-Type: application/json" \
    -d "{\"password\":\"$KIBANA_PASSWORD\"}"

# 2. Check the exit status of the curl command itself (simplified check)
if [ $? -eq 0 ]; then
    echo "" # Newline for formatting
    echo "✅ 'kibana_system' password request sent."
else
    echo ""
    echo "❌ Failed to send password request."
    exit 1
fi

echo "--- Bedrock Deployed Successfully ---"

3.2 Performance Tuning & System Limits

In the deployment script above, you will notice a set of aggressive flags passed to the docker run command. These are not optional; they are the difference between a database that runs for months and one that crashes under load.

1. Memory Locking (bootstrap.memory_lock)
We pass --cap-add=IPC_LOCK and --ulimit memlock=-1:-1 to the container.

Elasticsearch is a Java application. If the host operating system decides to "swap" part of the Java Heap to the hard disk to save RAM, performance falls off a cliff. Even worse, the garbage collector can pause for seconds (or minutes) trying to read memory back from the disk. By granting IPC_LOCK capability, we allow Elasticsearch to "lock" its allocated memory in RAM, preventing the OS from ever swapping it out.

2. File Descriptors (ulimit nofile)
We set --ulimit nofile=65535:65535.

Under the hood, Elasticsearch uses the Lucene search library. Lucene creates thousands of tiny files to manage its indices. A standard Linux process is often limited to 1,024 open files. If we don't raise this limit, Elasticsearch will hit a "wall" during heavy indexing and crash with a Too many open files exception. We preemptively raise this to 65k.

3.3 Network Security: The Localhost Bind

You might notice the port mapping looks different from previous articles:
--publish 127.0.0.1:9200:9200

In previous scripts, we mapped ports to 0.0.0.0 (all interfaces) or relied on the Docker bridge network. Here, we are explicitly binding port 9200 to 127.0.0.1 (localhost) only.

This is a Defense in Depth strategy.

Elasticsearch is the "Brain" of our stack. It holds all our logs, which may contain sensitive data. By binding strictly to localhost, we ensure that no external device on the network can talk directly to the database, even if our firewall rules fail.

Who can talk to it?
- Kibana: Yes, because it joins the same Docker network (cicd-net).
- Filebeat: Yes, via the Docker network.
- You (The Admin): Yes, via curl on the host machine.
Who cannot talk to it?
- Another developer's laptop.
- A rogue device on the Wi-Fi.

This forces all human access to go through the visualized, authenticated layer (Kibana) rather than the raw data API.

3.4 The "Bootstrap" Dance (Automating Service Accounts)

The most complex part of this script is Section 5: Secure Bootstrap.

We have a "Chicken and Egg" problem.

Kibana needs a password to log in to Elasticsearch (kibana_system user).
Elasticsearch does not allow us to set this password via an environment variable at startup.
We can only set this password via the REST API after Elasticsearch is running.

Most tutorials solve this by telling you to "run the container, wait a minute, and then manually run this curl command." That is not automation; that is manual labor.

Our script solves this with a Healthcheck Loop.
It uses curl to poll the cluster status (_cluster/health). It loops every 5 seconds until it gets a HTTP 200 OK and a status of "Green" or "Yellow."

Only when the brain is fully awake does the script execute the API Injection:

curl -X POST "$ES_URL/_security/user/kibana_system/_password" ...

This automatically sets the password we generated in the Architect script. The result is a "Zero Touch" deployment: you run the script, walk away, and come back to a fully secured, interconnected cluster.

Chapter 4: The Interface – Kibana

4.1 The Deployment Script

With Elasticsearch running as our data store, we need a way to visualize the information. Kibana is the native interface for the Elastic Stack, providing the dashboards, search bars, and management tools we will use to operate the platform.

While Elasticsearch is the "Backend," Kibana is the "Frontend." It is a Node.js application that queries the database and renders the results. Because it holds the encryption keys for our alerts and saved objects, its deployment requires careful handling of secrets and configuration files.

Create 03-deploy-kibana.sh with the following content:

#!/usr/bin/env bash

#
# -----------------------------------------------------------
#               03-deploy-kibana.sh
# -----------------------------------------------------------

set -e
echo "🚀 Deploying Kibana (The Interface)..."

# --- 1. Load Paths ---
HOST_CICD_ROOT="$HOME/cicd_stack"
ELK_BASE="$HOST_CICD_ROOT/elk"
SCOPED_ENV_FILE="$ELK_BASE/kibana/kibana.env"

# --- 2. Validation ---
if [ ! -f "$SCOPED_ENV_FILE" ]; then
    echo "ERROR: Scoped env file not found at $SCOPED_ENV_FILE"
    echo "Please run 01-setup-elk.sh first."
    exit 1
fi

# --- 3. Clean Slate ---
if [ "$(docker ps -q -f name=kibana)" ]; then
    echo "Stopping existing 'kibana'..."
    docker stop kibana
fi
if [ "$(docker ps -aq -f name=kibana)" ]; then
    echo "Removing existing 'kibana'..."
    docker rm kibana
fi

# --- 4. Deploy ---
echo "--- Launching Container ---"

# FIX: We rely fully on the env file now.
# It contains: ELASTICSEARCH_PASSWORD, XPACK Keys, and MATTERMOST_WEBHOOK_URL

docker run -d \
  --name kibana \
  --restart always \
  --network cicd-net \
  --hostname kibana.cicd.local \
  --publish 127.0.0.1:5601:5601 \
  --env-file "$SCOPED_ENV_FILE" \
  --volume "$ELK_BASE/kibana/config/kibana.yml":/usr/share/kibana/config/kibana.yml:ro \
  --volume "$ELK_BASE/kibana/config/certs":/usr/share/kibana/config/certs:ro \
  docker.elastic.co/kibana/kibana:9.2.2

echo "Container started. Waiting for API healthcheck..."

# --- 5. Healthcheck Loop ---
KIBANA_URL="https://127.0.0.1:5601"
MAX_RETRIES=60
COUNT=0

while true; do
    if [ $COUNT -ge $MAX_RETRIES ]; then
        echo "❌ Timeout waiting for Kibana."
        echo "Check logs: docker logs kibana"
        exit 1
    fi

    STATUS_OUTPUT=$(curl -sS "$KIBANA_URL/api/status" || true)

    if echo "$STATUS_OUTPUT" | grep -q '"level":"available"'; then
        echo "   [$COUNT/$MAX_RETRIES] Status: AVAILABLE"
        break
    else
        CURRENT_LEVEL=$(echo "$STATUS_OUTPUT" | grep -o '"level":"[^"]*"' | head -n1 || echo "unreachable")
        echo "   [$COUNT/$MAX_RETRIES] Status: ${CURRENT_LEVEL:-unreachable}"
    fi

    sleep 5
    COUNT=$((COUNT+1))
done

echo "✅ Kibana is Green and Available."
echo "   Access at: https://kibana.cicd.local:5601"

4.2 Configuration Strategy (The "Glue")

A critical aspect of this deployment is how we manage the connection between Kibana and Elasticsearch without exposing credentials in plain text.

In standard Docker tutorials, you might see environment variables passed directly in the docker run command (e.g., -e ELASTICSEARCH_PASSWORD=changeme). This is insecure because anyone running docker inspect can see the password.

We rely instead on Environment Injection.

In our deployment script, we use the --env-file flag to load elk/kibana/kibana.env. This file was not written by hand; it was programmatically generated by our Architect script in Chapter 2. By doing this, we achieve three things:

Credential Isolation: The ELASTICSEARCH_PASSWORD is injected at runtime. It exists only in the secure file on the host and the environment of the running process.
Persistent Encryption: The three XPACK encryption keys are critical for Kibana's internal security. They encrypt "Saved Objects" (dashboards, visualizations, and alerting rules) at rest. By injecting them from a persistent file, we ensure that if we destroy and recreate the container, Kibana can still decrypt our saved dashboards.
Pre-configured Integrations: We inject the MATTERMOST_WEBHOOK_URL here. This allows Kibana to register the connector immediately upon startup, saving us from having to manually paste the webhook URL into the UI later.

We also treat our configuration files as Immutable. Notice that kibana.yml and the certs directory are mounted with the :ro (Read-Only) flag.

--volume "$ELK_BASE/kibana/config/kibana.yml":/usr/share/kibana/config/kibana.yml:ro

This is a security best practice. Even if an attacker were to gain remote code execution within the Kibana Node.js process, they would be unable to overwrite the configuration to disable security or replace the trusted certificates. The container is locked down by the host.

4.3 The "Startup Race" (Health Checks)

One of the most common friction points when automating Kibana deployments is the Initialization Gap.

When you execute docker run, the Docker daemon returns a success code almost immediately. However, inside the container, Kibana is just waking up. As a heavy Node.js application, it has a lengthy startup sequence:

It loads the configuration.
It initializes plugins (APM, Maps, Graph, Security).
It optimizes client-side assets (though this is faster in newer versions, it still takes time).
It connects to Elasticsearch to verify the license and index templates.

This process can take anywhere from 30 to 60 seconds on a standard server.

If our script were to exit immediately after the docker run command, you would click the link, see a 502 Bad Gateway (from Nginx) or Connection Refused error, and assume the deployment failed. You might start digging into logs or restarting containers unnecessarily, interrupting a process that was actually working fine.

To solve this, we implement a State-Aware Healthcheck.

We do not simply use sleep 60. Hardcoded sleeps are "brittle"—on a fast machine, you waste 30 seconds; on a slow machine, it's not enough. Instead, we poll the application itself.

STATUS_OUTPUT=$(curl -sS "$KIBANA_URL/api/status" || true)

The Kibana Status API returns a rich JSON object detailing the state of every plugin. Crucially, it provides an overall status field. We grep specifically for:

"level":"available"

This is the only state that matters.

If the status is Red, Kibana is initializing or cannot reach Elasticsearch.
If the status is Yellow, Kibana is migrating saved objects or indices.
If the status is Green (Available), the UI is ready to accept user traffic.

By blocking the script until this specific string appears, we guarantee that when the script says "Green and Available," the link we provide will actually load the login page. This transforms the deployment experience from "Run and Pray" to "Run and Verify."

4.4 Security & Access

Finally, we must address the network exposure of our interface. You will notice the port mapping in our script is explicit:

--publish 127.0.0.1:5601:5601

We are binding strictly to the Loopback Interface (127.0.0.1), not the Wildcard Interface (0.0.0.0).

This is a deliberate architectural constraint. In a "Zero Trust" model, we never expose management interfaces directly to the network.

The Risk: If we bound to 0.0.0.0, anyone on the same Wi-Fi or subnet could navigate to http://<your-ip>:5601. While they would still face the login screen (thanks to our password setup), we prefer to hide the door entirely.
The Access Path: By locking it to localhost, the only way to access Kibana is from the machine itself.
For Setup: We can use curl on the host (as our healthcheck does).
For Browsing: In a real-world production setup, we would place an Nginx Reverse Proxy in front of this container (listening on port 443 with SSL) and route traffic internally to 127.0.0.1:5601. Alternatively, as administrators, we can use an SSH Tunnel (ssh -L 5601:localhost:5601 user@server) to securely bridge our laptop to the server.

This restriction ensures that during the vulnerable setup phase—before we have configured complex firewall rules or VPNs—our visualization tool is effectively invisible to the outside world. We are building a "Dark" control center: fully functional, but only accessible to those with the keys to the host.

Chapter 5: The Parsing Engine – Ingest Pipelines

5.1 The Pipeline Script

We have a database, and we have a UI. But currently, our logs are just messy strings of text like [INFO] 2025-12-15 Build Started. If we send these directly to Elasticsearch, they will be stored as a single blob of text. You won't be able to filter by "Error Level" or "Client IP" because the database doesn't know those fields exist yet.

In the traditional ELK stack, a tool called Logstash would sit in the middle to parse this text. However, Logstash is resource-heavy. Since we are optimizing for a lean "Factory in a Box," we will use Elasticsearch Ingest Pipelines.

This feature allows us to define a chain of "Processors" inside the database itself. When a log document arrives, Elasticsearch runs it through this chain—extracting fields, fixing timestamps, and removing noise—milliseconds before writing it to disk.

Create 04-setup-pipelines.sh with the following content. This script defines a single pipeline called cicd-logs that acts as a "Universal Adapter" for every service in our stack.

#!/usr/bin/env bash

#
# -----------------------------------------------------------
#           04-setup-pipelines.sh
#
#  Configures Elasticsearch Ingest Pipelines.
#  Defines parsing logic for all CICD stack components.
# -----------------------------------------------------------

set -e
echo "🚀 Configuring Elasticsearch Ingest Pipelines..."

# --- 1. Load Secrets ---
HOST_CICD_ROOT="$HOME/cicd_stack"
ELK_BASE="$HOST_CICD_ROOT/elk"
source "$ELK_BASE/filebeat/filebeat.env"

# --- 2. Define the Pipeline JSON ---
# CRITICAL: We use 'EOF' (quoted) to prevent Bash from stripping regex backslashes.

PIPELINE_JSON=$(cat <<'EOF'
{
  "description": "CICD Stack Log Parsing Pipeline",
  "processors": [
    {
      "set": {
        "field": "event.original",
        "value": "{{message}}",
        "ignore_empty_value": true
      }
    },
    {
      "grok": {
        "field": "message",
        "patterns": [
          "\\[%{TIMESTAMP_ISO8601:timestamp}\\] \\[%{LOGLEVEL:log.level}\\] %{DATA:logger_name} - %{GREEDYDATA:message}"
        ],
        "if": "ctx.service_name == 'jenkins'",
        "ignore_missing": true,
        "ignore_failure": true
      }
    },
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{IPORHOST:client.ip} - %{DATA:user.name} \\[%{HTTPDATE:timestamp}\\] \"%{WORD:http.request.method} %{DATA:url.path} HTTP/%{NUMBER:http.version}\" %{NUMBER:http.response.status_code:long} %{NUMBER:http.response.body.bytes:long} \"%{DATA:http.request.referrer}\" \"%{DATA:user_agent.original}\""
        ],
        "if": "ctx.service_name == 'gitlab-nginx'",
        "ignore_missing": true,
        "ignore_failure": true
      }
    },
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{YEAR:year}\\.%{MONTHNUM:month}\\.%{MONTHDAY:day} %{TIME:time} %{LOGLEVEL:log.level}\\s+%{GREEDYDATA:message}"
        ],
        "if": "ctx.service_name == 'sonarqube'",
        "ignore_missing": true,
        "ignore_failure": true
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.timestamp = ctx.year + '-' + ctx.month + '-' + ctx.day + 'T' + ctx.time + 'Z'",
        "if": "ctx.service_name == 'sonarqube' && ctx.year != null"
      }
    },
    {
      "json": {
        "field": "message",
        "target_field": "mattermost",
        "if": "ctx.service_name == 'mattermost'",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "field": "timestamp",
        "value": "{{mattermost.timestamp}}",
        "if": "ctx.service_name == 'mattermost' && ctx.mattermost?.timestamp != null"
      }
    },
    {
      "set": {
        "field": "log.level",
        "value": "{{mattermost.level}}",
        "if": "ctx.service_name == 'mattermost' && ctx.mattermost?.level != null"
      }
    },
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{TIMESTAMP_ISO8601:timestamp} \\[%{DATA:service_type}\\s*\\] \\[%{LOGLEVEL:log.level}\\s*\\] \\[%{DATA:trace_id}\\] \\[%{DATA:class}:%{NUMBER:line}\\] (?>\\[%{DATA:thread_info}\\] )?- %{GREEDYDATA:message}",
          "%{TIMESTAMP_ISO8601:timestamp}\\|%{DATA:trace_id}\\|%{IP:client.ip}\\|%{DATA:user.name}\\|%{WORD:http.request.method}\\|%{DATA:url.path}\\|%{NUMBER:http.response.status_code:long}\\|.*"
        ],
        "if": "ctx.service_name == 'artifactory'",
        "ignore_missing": true,
        "ignore_failure": true
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [
          "ISO8601",
          "dd/MMM/yyyy:HH:mm:ss Z",
          "yyyy-MM-dd HH:mm:ss.SSS",
          "yyyy-MM-dd HH:mm:ss.SSS Z",
          "yyyy-MM-dd HH:mm:ss Z",
          "MMM  d HH:mm:ss",
          "MMM dd HH:mm:ss"
        ],
        "target_field": "@timestamp",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": ["timestamp", "year", "month", "day", "time", "mattermost"],
        "ignore_missing": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "error.message",
        "value": "Pipeline failed: {{ _ingest.on_failure_message }}"
      }
    }
  ]
}
EOF
)

# --- 3. Upload to Elasticsearch ---
echo "--- Uploading 'cicd-logs' Pipeline ---"

RESPONSE=$(curl -s -k -w "\n%{http_code}" -X PUT "https://127.0.0.1:9200/_ingest/pipeline/cicd-logs" \
  -u "elastic:$ELASTIC_PASSWORD" \
  -H "Content-Type: application/json" \
  -d "$PIPELINE_JSON")

HTTP_BODY=$(echo "$RESPONSE" | sed '$d')
HTTP_STATUS=$(echo "$RESPONSE" | tail -n1)

if [ "$HTTP_STATUS" -eq 200 ]; then
  echo "✅ Pipeline updated successfully (HTTP 200)."
  echo "Response: $HTTP_BODY"
else
  echo "❌ Error uploading pipeline (HTTP $HTTP_STATUS)."
  echo "Elasticsearch Response:"
  echo "$HTTP_BODY"
  exit 1
fi

5.2 The Pipeline Blueprint (JSON Reference)

This JSON object is the "Source Code" for our logging logic. It defines a sequential list of steps that every log entry must pass through. Let's break down the critical components of this file so you can understand exactly how we transform chaos into order.

1. The Safety Net (event.original)
The first processor is a set operation.

"set": { "field": "event.original", "value": "{{message}}" }

Before we touch anything, we copy the raw log line into a new field called event.original. If our complex parsing logic fails or corrupts the data later in the pipeline, we still have the original, untouched text safe and sound.

2. The Conditional Switch (if statements)
Notice that almost every processor has an if condition attached to it:

"if": "ctx.service_name == 'jenkins'"

This is our routing logic. Filebeat tags every log with a service_name (e.g., "jenkins", "gitlab-nginx") before shipping it. The pipeline checks this tag. If the log is from Jenkins, it runs the Jenkins Grok pattern; if it's from Nginx, it skips to the Nginx pattern. This allows us to use a Single Pipeline for the entire stack, rather than managing ten small ones.

3. The Parser (grok)
Grok is the heavy lifter. It matches the raw text against predefined patterns and extracts structured variables.

Pattern: %{IPORHOST:client.ip}
Translation: "Find a string that looks like an IP address or Hostname. Extract it and save it into the field client.ip."
Result: We can now build a map in Kibana showing where users are logging in from, because client.ip is a real data field, not just text.

4. The Script (painless)
Sometimes, standard parsers aren't enough. SonarQube logs date and time in two separate fields (2025.12.15 and 10:00:00), but Elasticsearch needs a single ISO8601 string.
We use a tiny script written in Painless (Elastic's secure scripting language) to stitch them together:

ctx.timestamp = ctx.year + '-' + ctx.month + '-' + ctx.day + 'T' + ctx.time + 'Z'

This demonstrates the power of Ingest Pipelines: we can execute real programming logic inside the database to clean our data.

5. The Cleanup (remove)
Once we have extracted client.ip, http.status, and @timestamp, the intermediate fields (like year, month, mattermost JSON objects) are just wasted disk space. The final step remove deletes them, keeping our index lean.

6. Error Handling (on_failure)
At the very bottom, we define an on_failure block. If a log line is malformed and causes a processor to crash, the pipeline does not discard the log. Instead, it catches the exception and adds an error.message field. This ensures we never lose data, even if our code has bugs.

5.3 Decoding the Services

Now we will walk through the specific parsing logic for each service. This is where we bridge the gap between the application's unique "accent" and the database's strict requirements.

1. Jenkins (The Custom Formatter)

In Chapter 2, we forced Jenkins to use a custom Java logging format:

[%1$tF %1$tT] [%4$s] %3$s - %5$s %6$s%n

This results in log lines that look like this:
[2025-12-15 14:30:00] [INFO] hudson.plugins.git.GitSCM - Fetching changes...

Our Grok pattern is a direct mirror of this structure:

"\\[%{TIMESTAMP_ISO8601:timestamp}\\] \\[%{LOGLEVEL:log.level}\\] %{DATA:logger_name} - %{GREEDYDATA:message}"

\\[ ... \\]: We escape the brackets because they are special characters in Regex.
%{TIMESTAMP_ISO8601:timestamp}: Captures the 2025-12-15 14:30:00 part.
%{LOGLEVEL:log.level}: Captures INFO, WARNING, or SEVERE. By mapping this to log.level, Kibana will automatically color-code these rows (Red for Error, Yellow for Warning).

2. GitLab Nginx (Standard Web Traffic)

GitLab's internal Nginx uses the industry-standard "Combined Log Format."

192.168.1.50 - - [15/Dec/2025:14:30:00 +0000] "GET /api/v4/projects HTTP/1.1" 200 450 ...

Our Grok pattern extracts the critical metrics for our security dashboard:

"%{IPORHOST:client.ip} - %{DATA:user.name} \\[%{HTTPDATE:timestamp}\\] ..."

%{IPORHOST:client.ip}: This is the most valuable field. It allows us to build the "Intruder Alert" map.
%{NUMBER:http.response.status_code:long}: We explicitly cast this to a long (integer). If we left it as a string, we wouldn't be able to run math aggregations like "Average Response Code" or "Count of 5xx Errors."

3. SonarQube (The Format Outlier)

SonarQube is difficult. It logs date and time separated by a space, using dots for the date:

2025.12.15 14:30:00 INFO web[][o.s.s.p.Platform] ...

A standard Grok pattern can extract 2025.12.15 as year and 14:30:00 as time, but Elasticsearch requires a single ISO string for time-based indexing.

We solve this with a two-step combo:

Grok: Extract the parts (year, month, day, time).
Painless Script:

ctx.timestamp = ctx.year + '-' + ctx.month + '-' + ctx.day + 'T' + ctx.time + 'Z'

This script manually constructs a valid ISO string (2025-12-15T14:30:00Z) which the Date Processor can then understand.

4. Mattermost (Structure-Native)

Mattermost is modern; it logs in JSON format by default.

{"timestamp": "2025-12-15 14:30:00", "level": "error", "msg": "Database timeout"}

We don't need Grok (Regex) here. We use the JSON Processor:

"json": { "field": "message", "target_field": "mattermost" }

This tells Elasticsearch: "The message field contains a JSON string. Parse it and put the resulting object into a new field called mattermost."
We then use set processors to promote mattermost.timestamp and mattermost.level to the top-level @timestamp and log.level fields, ensuring consistency with Jenkins and Nginx.

5. Artifactory (The Polyglot)

Artifactory is complex because it generates two distinct types of logs in the same stream:

Service Logs: Java application errors (standard stack traces).
Request Logs: Pipe-separated values (|) tracking file downloads.

We handle this using a Grok Array. We provide multiple patterns to the processor:

"patterns": [
  "...Service Log Pattern...",
  "...Request Log Pattern..."
]

Elasticsearch tries the first pattern. If it fails, it tries the second. This allows a single pipeline to handle both "Application Crashed" (Pattern A) and "User downloaded generic-local/app.jar" (Pattern B) seamlessly.

6. The Rosetta Stone: Date Normalization

Finally, the pipeline ends with the Date Processor.

"formats": [ "ISO8601", "dd/MMM/yyyy:HH:mm:ss Z", "yyyy-MM-dd HH:mm:ss.SSS", ... ]

This is the most critical step. Jenkins says 2025-12-15, Nginx says 15/Dec/2025. If we simply indexed these as strings, sorting by time would be impossible.
This processor takes the timestamp field we extracted from any of the services above, matches it against any of the allowed formats, and converts it to the sacred @timestamp field in UTC. This ensures that when you zoom in on "14:30" in Kibana, you see the events from Jenkins, Nginx, and SonarQube perfectly aligned.

7. A Note on System Logs (The Pass-Through)

You might notice that our System Logs (Journald) are missing from the pipeline's logic. We have defined rules for Jenkins, Nginx, and SonarQube, but there is no if "ctx.service_name == 'system'" block.

This is intentional.

Unlike the flat text files generated by Jenkins or Nginx, the Linux Journal (journald) is a structured binary format. When Filebeat reads from the Journal (via the journald input we configured in filebeat.yml), it doesn't just read a line of text. It reads a rich object containing the timestamp, the process name, the PID, and the priority level.

Filebeat automatically maps these binary fields to the Elastic Common Schema (ECS) before the data even leaves the host. For example:

Journal _COMM becomes process.name.
Journal PRIORITY becomes syslog.priority.
Journal __REALTIME_TIMESTAMP becomes @timestamp.

Because the data arrives at Elasticsearch already parsed and structured, it skips every conditional if block in our pipeline. It effectively "falls through" the logic untouched, landing in the index ready to be searched. We don't need to fix what isn't broken.

5.4 The Deployment Mechanism

The final part of our script handles the delivery of this logic to the database.

It is important to understand that Ingest Pipelines are not configuration files. You cannot simply copy a .json file into a folder on the server and expect Elasticsearch to pick it up. Pipelines are part of the Cluster State—they live in the memory and disk of the Master Nodes, synchronized across the cluster.

To install a pipeline, we must interact with the Elasticsearch REST API.

RESPONSE=$(curl -s -k -w "\n%{http_code}" -X PUT "https://127.0.0.1:9200/_ingest/pipeline/cicd-logs" \
  -u "elastic:$ELASTIC_PASSWORD" \
  -H "Content-Type: application/json" \
  -d "$PIPELINE_JSON")

1. The Method (PUT)
We use the HTTP PUT verb. This is idempotent; if the pipeline already exists, this command overwrites it with the new version. This is perfect for CI/CD: if we update our parsing logic later, we simply re-run this script to apply the changes.

2. The Endpoint (_ingest/pipeline/cicd-logs)
We are defining a resource named cicd-logs. This name is critical. Later, when we configure Filebeat (Chapter 6), we will tell it specifically to use pipeline: "cicd-logs". If these names do not match, the data will bypass our parser and arrive as raw text.

3. The Payload (-d "$PIPELINE_JSON")
We send the JSON object we defined earlier as the body of the request. Note that we used a Bash Heredoc (cat <<'EOF') to capture that JSON. This ensures that the complex regex backslashes inside the Grok patterns are preserved and not interpreted by the shell.

4. The Verification
The script does not blindly assume success.

HTTP_STATUS=$(echo "$RESPONSE" | tail -n1)

if [ "$HTTP_STATUS" -eq 200 ]; then ...

If our JSON syntax is invalid—for example, a missing comma or an unescaped quote—Elasticsearch will reject the request with a 400 Bad Request error and a detailed message explaining where the syntax failed. Our script captures this error code and prints the server's response, allowing you to debug the JSON immediately without digging through server logs.

With the pipeline registered, the "Brain" now knows how to read. It is time to deploy the "Collector" to start feeding it data.

Chapter 6: The Collector – Filebeat (`05-deploy-filebeat.sh`)

This chapter focuses on the "Shipper" that connects our log files to the pipeline we just built.

6.1 The Collector Script

We have the database (Elasticsearch), the interface (Kibana), and the parsing logic (Ingest Pipelines). Now we need the agent to physically transport the logs from the hard drive to the cluster.

Create 05-deploy-filebeat.sh with the following content:

#!/usr/bin/env bash

#
# -----------------------------------------------------------
#               05-deploy-filebeat.sh
#
#  The "Collector" Script.
#  Deploys Filebeat (v9.2.2) to ship logs to Elasticsearch.
#
#  1. Mounts Host Docker Volumes (read app logs).
#  2. Mounts Host Journal (read system logs).
#  3. Mounts Data Volume (PERSIST REGISTRY).
#  4. Runs as ROOT to bypass host directory permissions.
#
# -----------------------------------------------------------

set -e
echo "🚀 Deploying Filebeat (The Collector)..."

# --- 1. Load Paths ---
HOST_CICD_ROOT="$HOME/cicd_stack"
ELK_BASE="$HOST_CICD_ROOT/elk"
SCOPED_ENV_FILE="$ELK_BASE/filebeat/filebeat.env"

if [ ! -f "$SCOPED_ENV_FILE" ]; then
    echo "ERROR: Scoped env file not found at $SCOPED_ENV_FILE"
    echo "Please run 01-setup-elk.sh first."
    exit 1
fi

# --- 2. Clean Slate ---
if [ "$(docker ps -q -f name=filebeat)" ]; then
    echo "Stopping existing 'filebeat'..."
    docker stop filebeat
fi
if [ "$(docker ps -aq -f name=filebeat)" ]; then
    echo "Removing existing 'filebeat'..."
    docker rm filebeat
fi

# --- 3. Volume Management (CRITICAL FIX) ---
# We must persist the registry so we don't re-ingest old logs on restart.
echo "Verifying filebeat-data volume..."
docker volume create filebeat-data > /dev/null

# --- 4. Deploy ---
echo "--- Launching Container ---"

# NOTES:
# - user root: Required to traverse /var/lib/docker and read Journal.
# - /var/log/journal: Required for native journald input.
# - /etc/machine-id: Required for journald reader to track host identity.

docker run -d \
  --name filebeat \
  --restart always \
  --network cicd-net \
  --user root \
  --env-file "$SCOPED_ENV_FILE" \
  --volume "$ELK_BASE/filebeat/config/filebeat.yml":/usr/share/filebeat/filebeat.yml:ro \
  --volume "$ELK_BASE/filebeat/config/certs":/usr/share/filebeat/certs:ro \
  --volume filebeat-data:/usr/share/filebeat/data \
  --volume /var/lib/docker/volumes:/host_volumes:ro \
  --volume /var/log/journal:/var/log/journal:ro \
  --volume /etc/machine-id:/etc/machine-id:ro \
  docker.elastic.co/beats/filebeat:9.2.2

echo "Container started. Verifying connection..."

# --- 5. Verification ---
MAX_RETRIES=15
COUNT=0
echo "Waiting for Filebeat to establish connection..."

while [ $COUNT -lt $MAX_RETRIES ]; do
    sleep 2
    # Check for successful connection message
    if docker logs filebeat 2>&1 | grep -q "Connection to backoff.*established"; then
        echo "✅ Filebeat successfully connected to Elasticsearch!"
        exit 0
    fi

    # Fail fast on Pipeline errors (common misconfiguration)
    if docker logs filebeat 2>&1 | grep -q "pipeline/cicd-logs.*missing"; then
        echo "❌ ERROR: Filebeat says the 'cicd-logs' pipeline is missing!"
        echo "   Did you run 04-setup-pipelines.sh?"
        exit 1
    fi

    # Fail fast on Certificate errors
    if docker logs filebeat 2>&1 | grep -q "x509: certificate signed by unknown authority"; then
        echo "❌ ERROR: SSL Certificate trust issue."
        echo "   Check that ca.pem is correctly mounted and generated."
        exit 1
    fi

    echo "   [$COUNT/$MAX_RETRIES] Connecting..."
    COUNT=$((COUNT+1))
done

echo "⚠️  Connection check timed out. Check logs manually:"
echo "   docker logs -f filebeat"

6.2 The "Agent" Pattern: Host vs. Sidecar

In the world of container observability, there are two primary ways to collect logs: the Sidecar Pattern and the Host Agent Pattern. It is important to understand why we have chosen the latter for this architecture.

The Sidecar Pattern (The Expensive Way)
In many Kubernetes tutorials, you will see a "Sidecar" approach. This involves defining a Pod that contains two containers: your application (e.g., Jenkins) and a logging agent (Filebeat). The agent shares the storage volume with the app, reads the logs, and ships them.

Pros: Complete isolation.
Cons: Massive resource overhead. If you have 20 services, you run 20 instances of Filebeat. If each takes 50MB of RAM, you waste 1GB of memory just on shippers.

The Host Agent Pattern (The Efficient Way)
We are using the Host Agent strategy. We run exactly one instance of Filebeat for the entire server. This agent sits on the "metal" (conceptually) and watches the Docker storage directory from the outside.

The Mechanics: Docker stores named volumes at /var/lib/docker/volumes/ on the host Linux filesystem.
The Trick: By mounting this host directory into Filebeat (--volume /var/lib/docker/volumes:/host_volumes:ro), our single agent can peek inside the data directories of Jenkins, GitLab, SonarQube, and Mattermost simultaneously.

This approach reduces our memory footprint significantly (1 agent vs. 10) and simplifies management. We have one configuration file to rule them all, rather than scattering logging configs across ten different repositories.

6.3 Access Control: The Root CompromiseYou will notice a controversial flag in our deployment script:

--user root

In almost every security guide, running containers as root is considered a vulnerability. However, for a Host Agent, it is a functional requirement.

The Permission Barrier: The directory /var/lib/docker/volumes is owned by root:root with strict permissions (typically 700 or 750). If we ran Filebeat as the default filebeat user (UID 1000), it would be denied access immediately. It physically cannot enter the directory to read the log files.
The Mitigation (Read-Only): To balance the risk of running as root, we use the Docker Read-Only flag (:ro) on the volume mount:

--volume /var/lib/docker/volumes:/host_volumes:ro

This imposes a hard limit at the filesystem level. Even if the Filebeat process were hijacked by an attacker, they could read your logs (confidentiality risk), but they could not delete your data or inject malicious files into your application volumes (integrity risk).

This is the standard privilege model for infrastructure monitoring: the observer must have higher privileges than the observed, but we strip its ability to modify the world it watches.

6.4 The Memory of the Elephant: Persistent Registry

One of the most dangerous pitfalls in deploying Filebeat is failing to persist its "Registry."

Filebeat maintains a small internal database called the Registry. This file records exactly how far it has read into every log file it tracks. It stores the unique inode of the file and the byte offset (e.g., "I have read 10,240 bytes of jenkins.log").

In our script, we explicitly handle this state:

docker volume create filebeat-data
...
--volume filebeat-data:/usr/share/filebeat/data

The Scenario Without Persistence:
Imagine we did not map this volume.

Filebeat starts, reads 5,000 lines of Jenkins logs, and ships them to Elasticsearch.
You restart the Filebeat container to change a config.
The new container starts with a fresh, empty Registry.
It looks at jenkins.log, sees 5,000 lines, and thinks, "A new file! I must read this from the beginning."
It re-ships all 5,000 lines. You now have duplicates of every single event in your database.

By mounting the data directory to a named Docker volume, we ensure that Filebeat's brain survives a restart. It wakes up, checks the disk, sees it has already processed those 5,000 lines, and waits patiently for line 5,001.

6.5 System Visibility (Journald)

Finally, we address the "Blind Spot" of standard Docker logging.

docker logs only captures what an application writes to STDOUT (Standard Output). It does not capture what happens to the container itself from the Operating System's perspective.

Scenario: Your Jenkins server is under heavy load. It consumes all available RAM. The Linux Kernel's "Out of Memory" (OOM) Killer steps in and terminates the process to save the system.
The Result: Jenkins is dead. If you look at the Jenkins logs, they just stop. There is no error message because the process was killed before it could write one. You are left guessing.

To solve this, we mount the Host System Journal:

--volume /var/log/journal:/var/log/journal:ro \
--volume /etc/machine-id:/etc/machine-id:ro

/var/log/journal: This is where modern Linux systems (Ubuntu, CentOS, Debian) store system-level logs in a binary format. Filebeat reads this directly.
/etc/machine-id: This is required for the reader to correctly associate the journal entries with the specific host machine.

By ingesting this stream, we can see the "Meta-Events." In our dashboard, we will be able to correlate a sudden stop in Jenkins logs with a simultaneous kernel: Out of memory: Kill process 1234 (java) event from the system log. This context is often the difference between a 5-minute fix and a 5-hour debugging session.

Chapter 7: The Discovery – Verification

7.1 First Contact (The Setup)

The scripts have finished running. The containers are green. Now comes the moment of truth: seeing your data.

Before we open the browser, we must ensure your computer knows how to find the services. Our scripts used specific hostnames (kibana.cicd.local) which makes handling SSL certificates and networking much cleaner than using raw IP addresses.

1. Update Your Hosts File
On your host machine (the one running the browser), open your /etc/hosts file (or C:\Windows\System32\drivers\etc\hosts on Windows) and add the following line:

127.0.0.1  elasticsearch.cicd.local kibana.cicd.local

2. The Login
Navigate to https://kibana.cicd.local:5601.

You will be greeted by the Elastic login screen.

Username: elastic
Password: The value you generated in cicd.env (look for ELASTIC_PASSWORD).

3. Creating the Data View
When you first log in, Kibana is "blind." It knows it is connected to a database, but it doesn't know which tables (indices) you care about. We need to define a Data View (formerly known as an Index Pattern).

Open the main menu (hamburger icon) and go to Stack Management > Data Views.
Click Create data view.
Name: Give it a friendly name like CICD Logs.
Index pattern: Enter filebeat-*.
- Why? Filebeat creates a new index every day (e.g., filebeat-9.2.2-2025.12.17). By using the wildcard *, we tell Kibana to look at all existing and future Filebeat indices.
- Confirmation: You should see a success message on the right listing the matching indices (e.g., filebeat-9.2.2-2025...).
Timestamp field: Select @timestamp.
- Critical: This tells Kibana which field represents "Time." If you pick the wrong field, your time-series charts will be broken.
Click Save data view to Kibana.

Kibana now has a lens through which it can see your data.

7.2 Verifying the Pipeline (The "Discover" Tab)

Now that Kibana can see the data, let's verify that our Ingest Pipeline (Chapter 5) is actually working. If the pipeline failed, we would see a giant block of text in the message field and nothing else. If it succeeded, that text should be exploded into useful fields.

Click on the Discover icon (compass) in the left sidebar.
Ensure the date picker (top right) is set to "Today" or "Last 15 minutes".
In the search bar, type service_name: "gitlab-nginx" and hit Enter.

You should see a list of log events. Click on the small arrow > next to any document to expand it into Table/JSON view.

The "Anatomy of a Log"
Look at the JSON structure (like the example below). This is the proof of success:

event.original: This contains the raw, ugly text: 172.30.0.5 - - [17/Dec/2025...]. This is our safety net.
client.ip: The pipeline successfully extracted 172.30.0.5. This is now a searchable IP address, not just text.
http.response.status_code: It extracted 404. Because we cast this to a long in our pipeline, we can now build charts aggregating "Top Error Codes."
@timestamp: Notice the time is 2025-12-17T09:51:49.000Z. The pipeline took the Nginx format (17/Dec/2025...) and standardized it to UTC ISO8601.

If you see these fields, your "Brain" is correctly parsing the "Voice" of your infrastructure.

7.3 Troubleshooting Common Issues

If you navigate to Discover and see... nothing, don't panic. This is the "No Data" state, and it usually has three simple causes.

1. The Time Trap
Kibana defaults to "Last 15 minutes." If your logs are from an hour ago (or if your server time is drifting), you won't see them.

Fix: Set the time picker to "Last 24 hours".

2. The Connection Gap
If Filebeat can't reach Elasticsearch, no data arrives.

Fix: Check the Filebeat logs: docker logs filebeat. Look for Connection refused. If you see this, check that Elasticsearch is healthy (docker ps) and that you are using the correct password.

3. The Pipeline Reject
If your pipeline has a syntax error, Elasticsearch might reject the logs entirely.

Fix: Look at the Filebeat logs again. If you see 400 Bad Request or pipeline [cicd-logs] missing, it means the data made it to the door but was turned away. Re-run 04-setup-pipelines.sh to ensure the logic is loaded.

Chapter 8: The Dashboard – Visualization

8.1 System Health (The Blame Wheel)

Raw logs are useful for debugging, but they are terrible for monitoring. You cannot scroll through thousands of lines of text to guess if the server is healthy. We need a visual signal—a "Traffic Light" that turns Red when things are wrong.

We will build a Pie Chart to visualize the distribution of errors across our stack.

Open the main menu and go to Dashboard.
Click Create dashboard.
Click Create visualization (This opens the "Lens" editor).
Configure the Chart:
- View: Select Pie from the chart type dropdown.
- Metric (Slice Size): Count of records.
- Slice by: Drag the field service_name.keyword onto the "Slices" area.
The "Unified" Query:
- We have a challenge: Application logs use text levels (ERROR, WARN), but System logs (Journald) use numbers (priority 3, priority 4).
- In the Search bar at the top, paste this KQL (Kibana Query Language) string:
```
log.level.keyword: ("WARN" or "error" or "WARNING") or log.syslog.priority <= 4
```

This query captures both types of failures in a single view.
1. Save: Click "Save and return" and name it "System Health Status".

The Value:
If this chart is empty or shows only small slices, you are fine. If you see a massive slice labeled jenkins, you know exactly which service is screaming, without reading a single log line.

8.2 Intruder Alert (The Security Radar)

Next, we need to secure our perimeter. We want to see if anyone is scanning our Nginx proxy or trying to brute-force URLs. We will use a Stacked Bar Chart to show HTTP response codes over time.

On your dashboard, click Create visualization again.
Configure the Chart:
- View: Select Bar (Stacked).
- Horizontal Axis: Drag @timestamp here. Kibana will automatically group data into buckets (e.g., "per minute").
- Vertical Axis: Count of records.
- Break down by: Drag http.response.status_code here.
Add Filters:
- In the search bar, add: service_name: "gitlab-nginx".
- We want to highlight specific anomalies. Click the Plus (+) next to the Query bar to add explicit filters for interesting codes: 403 (Forbidden), 502 (Bad Gateway), and 302 (Redirects/Logins).
Save: Click "Save and return" and name it "GitLab Access Patterns".

The Value:

A spike in 302: Someone is hammering the login page (Brute Force).
A spike in 403: Someone is scanning for secrets or unauthorized paths.
A spike in 502: Your GitLab container has likely crashed behind the proxy.

8.3 Factory Pulse (Build Frequency)

Finally, we want to see the "Heartbeat" of our factory. We don't want to count logs; we want to count work. We will track how often Jenkins provisions a new agent to run a build.

Click Create visualization.
Configure the Chart:
- View: Select Area (Stacked).
- Horizontal Axis: @timestamp.
- Vertical Axis: Count of records.

The Filter Logic:

In the search bar, use this query:

logger_name.keyword: "hudson.slaves.NodeProvisioner" and message: "*provisioning successfully completed*"

A Note on Accuracy (The Double Log):
- Observation: You might notice that for every one build, the count goes up by 2.
- Cause: The current version of Jenkins logs this specific success message twice (once for the request, once for the completion).
- Impact: While the absolute number is inflated, the trend line is accurate. A spike on this chart still represents a spike in workload. We accept this imperfection rather than over-engineering a fix.
Save: Name it "Build Frequency".

8.4 Assembling the View

You now have a dashboard with three powerful widgets.

Resize: Grab the bottom-right corner of the Security Radar and Build Frequency charts and stretch them across the full width of the screen. Time-series data needs width to be readable.
Position: Place the Health Pie Chart in the top-left corner.
Save the Dashboard:
- Click the Save button in the top right.
- Title: CICD Stack.
- Store time with dashboard: Toggle this On. This ensures that whenever you open this dashboard, it defaults to the correct time window.

You now have a professional-grade observability dashboard. But if you restart the containers now, this manual work might be lost. In the next chapter, we will ensure this dashboard is saved as code so it can be automatically restored.

Chapter 9: Disaster Recovery

9.1 The "Ephemeral" Problem

You have just spent valuable time building a dashboard. You tweaked the colors, filtered out the noise, and aligned the charts perfectly.

But in a DevOps environment, infrastructure is ephemeral. If you tear down your stack to save costs or spin it up on a new server for a demo, that dashboard is gone. Kibana stores these configurations in its internal index (.kibana), and unless you have a robust snapshot strategy, your manual clicks evaporate when the volume is deleted.

We need to treat our dashboard like our application code: Version Controlled and Automated.

9.2 Generating the Artifact (`export.ndjson`)

First, we need to extract the "Source Code" of the dashboard we built in Chapter 8.

In Kibana, go to Stack Management > Saved Objects.
Find your dashboard titled "CICD Stack".
Select the checkbox next to it.
Crucial Step: Click the Export button.
- A modal will appear asking if you want to include related objects. Toggle this ON. This ensures that the Data View (filebeat-*) and the individual visualizations (Pie Chart, Bar Chart) are bundled into the file.
Save the file as export.ndjson in your project root.

You now have a portable snapshot of your monitoring logic. You can delete your entire Docker environment, run a script, and have this exact view back in seconds.

9.3 The Restoration Script

Now we write the automation to load this file. Create 06-import-dashboard.sh.

This script uses the Kibana API to upload our saved objects. Notice that it does not require the -k (insecure) flag for curl. Because our host machine trusts the Private Certificate Authority we generated during the setup phase, we can interact with our local HTTPS services securely and correctly.

#!/bin/bash

# -----------------------------------------------------------
#  06-import-dashboard.sh
#  Imports the Kibana Dashboard from 'export.ndjson'.
# -----------------------------------------------------------

set -e

# 1. Source the Environment Variables (Absolute Path)
#    We look for the file generated by 01-setup-elk.sh in the runtime directory.
ENV_FILE="$HOME/cicd_stack/elk/filebeat/filebeat.env"

if [ -f "$ENV_FILE" ]; then
    source "$ENV_FILE"
else
    echo "❌ Error: Could not find credentials at $ENV_FILE"
    echo "   Did you run 01-setup-elk.sh?"
    exit 1
fi

# 2. Configuration
KIBANA_URL="https://kibana.cicd.local:5601"
DASHBOARD_FILE="export.ndjson"

# 3. Check for the Export File
if [ ! -f "$DASHBOARD_FILE" ]; then
    echo "❌ Error: '$DASHBOARD_FILE' not found in current directory."
    exit 1
fi

echo "⏳ Waiting for Kibana to be ready..."
until curl -s "$KIBANA_URL/api/status" | grep -q "available"; do
  echo -n "."
  sleep 5
done
echo " Kibana is up!"

echo "🚀 Importing Dashboard from $DASHBOARD_FILE..."
echo "---------------------------------------------"

# 4. Import the Dashboard
#    -X POST: The API method
#    -u: Basic Auth using the password from the env file
#    -H "kbn-xsrf: true": Required header for Kibana API
#    --form file=@...: Uploads the file

curl -X POST "$KIBANA_URL/api/saved_objects/_import?overwrite=true" \
  -u "elastic:$ELASTIC_PASSWORD" \
  -H "kbn-xsrf: true" \
  --form file=@$DASHBOARD_FILE

echo ""
echo "---------------------------------------------"
echo "✅ Import request finished."

9.4 The Mechanics of the Restore

1. The "Green" Gate
Just like in our deployment scripts, we cannot fire requests at Kibana before it is ready. The until loop hits the /api/status endpoint. It will block execution until Kibana explicitly returns "level":"available". This prevents the script from failing if you run it immediately after starting the container.

2. The Overwrite Flag
.../saved_objects/_import?overwrite=true
We explicitly set overwrite=true. This makes the script Idempotent. You can run it ten times in a row; if the dashboard already exists, it updates it. If we didn't do this, the second run would crash with a "Conflict" error.

3. The CSRF Token
-H "kbn-xsrf: true"
Kibana has strict security protections against Cross-Site Request Forgery. Even though we are authenticating with a valid password, the API will reject any write operation that lacks this specific header. It is a mandatory handshake that proves the request is intentional.

With this script in your toolkit, your observability stack is now as reproducible as the applications it monitors.

Chapter 10: Validation – The Stress Test

10.1 The "Fire Drill" Philosophy

We have built a complex machine. We have containers, pipelines, secure certificate authorities, and dashboards. But right now, it is sitting idle. A monitoring system that looks good when nothing is happening is useless; you need to know what it looks like when the world is burning.

In this final chapter, we are going to attack our own infrastructure. We will execute a "Fire Drill" to prove that:

Filebeat can handle a sudden flood of events (Backpressure).
The Pipeline correctly parses logs even under load.
The Dashboard translates this chaos into a clear, readable signal.

10.2 The Nuclear Option (The Script)

Create 99-stress-test.sh. This script is designed to be noisy. It uses standard Linux tools to generate two distinct types of failures: System Stability failures and Network Security events.

#!/usr/bin/env bash

#
# -----------------------------------------------------------
#           99-stress-test.sh
#
#  Generates 1000 System Errors and 1000 Access Events
#  to stress-test ELK dashboards.
# -----------------------------------------------------------

set -e

echo "🔥 Starting ELK 'Nuclear' Stress Test..."
echo "---------------------------------"

# 1. Generate 1000 System Events (Blame Wheel -> System Slice)
echo "1. Injecting 1000 System Errors & Warnings..."
for i in {1..1000}; do
    logger -p syslog.err "CICD-STRESS-TEST: Critical Database Failure #$i"
    logger -p syslog.warning "CICD-STRESS-TEST: Memory Threshold Exceeded #$i"

    # Progress bar effect (print a dot every 50 events)
    ((i % 50 == 0)) && echo -n "."
done
echo " Done."

echo "---------------------------------"

# 2. Generate 1000 Access Events (Security Radar -> 302 Spike)
# We target Port 10300 (GitLab HTTPS).
# Expect 302 (Redirect to Login) or 401/403 depending on the endpoint.
echo "2. Simulating 1000 Access Attempts on GitLab (Port 10300)..."
for i in {1..1000}; do
    # Hit the protected admin endpoint on the correct mapped port
    curl -s -o /dev/null "https://gitlab.cicd.local:10300/admin"

    ((i % 50 == 0)) && echo -n "."
done
echo " Done."

echo "---------------------------------"
echo "🎉 Stress Test Complete."
echo "   Go to Kibana -> Refresh Dashboard (Last 15 Minutes)"
echo "   You should see a massive spike in '302' events."

10.3 Vector 1: The System Flood (Testing Journald)

The first loop uses the logger command. This tool writes directly to the Linux System Journal, bypassing Docker completely.

The Test:
We are injecting 1,000 messages with specific severity levels (syslog.err and syslog.warning).

The Visual Result:
Look at your "System Health Status" Pie Chart.

Before: It was likely empty or had thin slices, as a healthy system generates very few errors.
After: You will see a massive new slice appear, dominating the chart. This slice represents the Host System.
Why this matters: We filtered this chart to only show "Bad Things" (WARN or ERROR). The fact that this slice appeared instantly proves that our Unified Query works: it successfully caught the syslog.priority signal from the host, even though it didn't come from a container.

10.4 Vector 2: The Network Flood (Testing Nginx)

The second loop uses curl to hit the GitLab Admin interface (/admin). Since our script does not provide a session cookie, GitLab rejects us.

The Test:
We fire 1,000 requests at the Nginx Reverse Proxy.

The Visual Result:
Look at your "GitLab Access Patterns" Bar Chart.

The Wall: You will see a massive vertical spike.
The Code: The bars will be colored for status code 302 (Redirect).
Why this matters: If we had failed to configure the Ingest Pipeline (Chapter 5) correctly, these logs would just be text. Kibana wouldn't know they were "302" errors. The fact that you can see a "302" bar proves the regex parser extracted http.response.status_code correctly.

10.5 Conclusion: The Single Pane of Glass

Congratulations. You have successfully deployed a production-grade Observability Stack for a CI/CD environment.

Let's review what we achieved:

Architecture: We built a secure, single-node architecture using custom Docker networks and TLS encryption.
Ingestion: We used Filebeat as a lightweight "Host Agent" to collect data from both containers (Docker volumes) and the OS (Journald).
Parsing: We replaced the heavy Logstash with lean Ingest Pipelines, moving the logic into the database to save RAM.
Persistence: We automated the dashboard recovery with export.ndjson, ensuring our work is never lost.
Validation: We proved it works by attacking it.

You have moved beyond "SSH and Grep." You now have a Single Pane of Glass—a central nervous system that tells you the health, security, and workload of your factory in real-time. This is the foundation of modern DevOps.