<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: soul-o mutwiri</title>
    <description>The latest articles on DEV Community by soul-o mutwiri (@o_mutwiri).</description>
    <link>https://dev.to/o_mutwiri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1028668%2F9d5a8311-5d5f-44fd-ae5b-3dce7d4a81f1.png</url>
      <title>DEV Community: soul-o mutwiri</title>
      <link>https://dev.to/o_mutwiri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/o_mutwiri"/>
    <language>en</language>
    <item>
      <title>The Role of Data Engineers - AWS</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Wed, 02 Jul 2025 05:50:31 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/the-role-of-data-engineers-aws-3g6j</link>
      <guid>https://dev.to/o_mutwiri/the-role-of-data-engineers-aws-3g6j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wi33l1x9r8jp9xt4xxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wi33l1x9r8jp9xt4xxj.png" alt="Image description" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Build and manage data infrastructure and platforms: databases and cloud data warehouses - Amazon S3, AWS Glue, Amazon Redshift etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ingest data from various sources: use tools like AWS Glue jobs or AWS Lambda functions to ingest data from databases, applications, files and streaming devices into a centralized data platform.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prepare ingested data for analytics: use AWS Glue, Apache Spark or Amazon EMR to clean, transform and enrich the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Catalog and document curated datasets: use AWS Glue crawlers to determine the format and schema and group data into tables, then write metadata to the AWS Glue Data Catalog. Use metadata tagging in the Data Catalog for data governance, compliance and discoverability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automate regular data workflows and pipelines: simplify and accelerate data processing using services like AWS Glue Workflows, AWS Lambda or AWS Step Functions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
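&lt;p&gt;The ingest-centralize-catalog flow above can be sketched locally with the Python standard library. This is a minimal stand-in, not the real AWS APIs: sqlite3 plays the role of the warehouse, and the hard-coded rows, table and column names are invented for illustration.&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source application (in practice this might sit in S3).
raw = "order_id,amount\n1,19.99\n2,5.00\n"

# Ingest: parse the source file (an AWS Glue job or Lambda function would do this at scale).
rows = list(csv.DictReader(io.StringIO(raw)))

# Centralize: load into a queryable store (standing in for Redshift / a data lake).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(r["order_id"], r["amount"]) for r in rows])

# Catalog: record schema and table metadata, as a Glue crawler would.
catalog = {"orders": {"columns": ["order_id", "amount"], "rows": len(rows)}}

total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))
```

&lt;p&gt;The same shape holds in practice: a job parses the source, the warehouse stores it, and the Data Catalog records the schema so analysts can discover and query the table.&lt;/p&gt;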

&lt;p&gt;The data engineer builds the system that delivers usable data to the data analyst, who queries and analyzes the data to gain business insights, reports and visualizations.&lt;/p&gt;

&lt;p&gt;Before a data engineer begins, these questions must be answered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which data should be analyzed? What is its value to the business or organization?&lt;/li&gt;
&lt;li&gt;Who owns the data? Where is it located?&lt;/li&gt;
&lt;li&gt;Is the data usable in its current state? What transformations are required?&lt;/li&gt;
&lt;li&gt;Who needs to see the data?&lt;/li&gt;
&lt;li&gt;After the data is curated and ready for consumption, how should it be presented?&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>MODEL TRAINING AND EVALUATION</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Tue, 01 Jul 2025 03:08:44 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/model-training-and-evaluation-55kn</link>
      <guid>https://dev.to/o_mutwiri/model-training-and-evaluation-55kn</guid>
      <description>&lt;h2&gt;
  
  
  Model Training
&lt;/h2&gt;

&lt;p&gt;Model training is a big part of machine learning. It is important to ensure a proper division between training and evaluation efforts.&lt;/p&gt;

&lt;p&gt;It is important to evaluate the model to estimate the quality of its predictions on data the model has not been trained on.&lt;br&gt;
But as a starting point, you cannot check the accuracy of predictions for future instances, as this is supervised learning. So you need to use some of the data you already know the answers for as a proxy for future data.&lt;br&gt;
Instead of evaluating with the same data that was used for training, a common strategy is to split all available labeled data into a training set, a validation set and a test set, often in an 80:10:10 or 70:15:15 ratio.&lt;/p&gt;
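&lt;p&gt;The 80:10:10 split can be sketched with scikit-learn (which a later post in this series also uses); the 100-example toy dataset below is invented purely to show the mechanics:&lt;/p&gt;

```python
from sklearn.model_selection import train_test_split

# Toy labeled dataset: 100 examples with binary labels (illustrative only).
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First carve out 20% of the data, then split that half-and-half
# into validation and test sets, giving the 80:10:10 ratio.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```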

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5tnzt7wc1z5inbjyqr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5tnzt7wc1z5inbjyqr8.png" alt="Image description" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Evaluation
&lt;/h2&gt;

&lt;p&gt;After the model has performed well on unseen test data, we can deploy it to production and monitor it to ensure the business problem is indeed being addressed.&lt;/p&gt;

&lt;p&gt;Its ability to predict skills more accurately would reduce the number of transfers a customer experiences, resulting in a better customer experience. Model evaluation is used to verify that the model is performing accurately.&lt;/p&gt;
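&lt;p&gt;As a sketch of that verification step, accuracy on a held-out test set can be computed with scikit-learn; the skill labels and predictions below are made up for illustration:&lt;/p&gt;

```python
from sklearn.metrics import accuracy_score

# Hypothetical true skills vs. model predictions on a held-out test set.
y_test = ["billing", "tech", "tech", "billing", "sales"]
y_pred = ["billing", "tech", "sales", "billing", "sales"]

# 4 of 5 calls would have been routed to the right skill the first time.
print(accuracy_score(y_test, y_pred))  # 0.8
```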

&lt;h2&gt;
  
  
  MODEL TUNING AND FEATURE ENGINEERING
&lt;/h2&gt;

&lt;p&gt;Once we have evaluated our model, we begin the process of iterative tweaks to the model and our data. We can adjust how fast or slow the model learns to reach an optimal value.&lt;br&gt;
Then we move to feature engineering.&lt;br&gt;
Feature engineering tries to answer questions like: what was the time of a customer's most recent order? What was that most recent order? We feed these features into the model training algorithm; it can only learn from exactly what we show it.&lt;/p&gt;
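&lt;p&gt;Those two example features can be derived with pandas. The order history and column names below are hypothetical, invented to show the recency-feature idea:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical order history (all values are made up).
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_time": pd.to_datetime(
        ["2025-06-01 09:00", "2025-06-20 17:30", "2025-06-15 12:00"]),
    "product": ["router", "modem", "laptop"],
})

# Per-customer features: when the most recent order happened and what it was.
features = (orders.sort_values("order_time")
                  .groupby("customer_id")
                  .last()
                  .rename(columns={"order_time": "most_recent_order_time",
                                   "product": "most_recent_product"}))

print(features.loc[1, "most_recent_product"])  # modem
```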

&lt;h2&gt;
  
  
  MODEL DEPLOYMENT
&lt;/h2&gt;

&lt;p&gt;Deploying the model means solving the business need and meeting expectations, such as directing a customer to the correct agent the first time. Imagine a company with endless types of products: a customer can be sent to a generalist or even the wrong specialist, who must then figure out what the customer needs before sending them to an agent with the right skills. For a company handling millions of customer calls, this is inefficient and costs money and time:&lt;br&gt;
customer calls get connected to the wrong department, then non-technical support, then finally the correct agent.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DATA PROCESSING PHASES</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Tue, 01 Jul 2025 02:45:42 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/data-processing-phases-f83</link>
      <guid>https://dev.to/o_mutwiri/data-processing-phases-f83</guid>
      <description>&lt;p&gt;Once a business plan and formulation of the problem to be solved in data analysis or data engineering or machine learning solutions, the next phase is about data collection and preparation.&lt;br&gt;
Data processing steps include data collection and intergration, data preprocesssing and data visualization, and feature engineering.&lt;/p&gt;

&lt;p&gt;Example: how to route customers to agents with the right skills, thus reducing call transfers.&lt;br&gt;
How can we predict which skill would solve a customer call?&lt;/p&gt;

&lt;p&gt;Data collection and integration ensures the raw data is in one centrally accessible place.&lt;br&gt;
Data preprocessing and data visualization involve transforming raw data into an understandable format; this includes data cleaning and exploratory data analysis.&lt;br&gt;
At this stage we exclude unnecessary labels and entirely inaccurate labels, and even combine similar labels, so as to simplify our model.&lt;br&gt;
Data visualization helps give a quick sense of feature and label summaries, which helps us better understand the data.&lt;/p&gt;
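&lt;p&gt;The combine-similar-labels and exclude-unusable-labels step can be sketched with pandas; the skill labels below are invented for illustration:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical skill labels logged for past calls (values are made up).
calls = pd.Series(["billing", "Billing", "bill", "tech", "unknown", "tech"])

# Combine similar labels (case variants, abbreviations) and drop unusable ones.
merged = calls.str.lower().replace({"bill": "billing"})
cleaned = merged[merged != "unknown"]

print(cleaned.value_counts().to_dict())  # {'billing': 3, 'tech': 2}
```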

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcltwtlt6mck507ip7em.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcltwtlt6mck507ip7em.png" alt="Image description" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feature engineering is the process of creating and extracting variables from data.&lt;/p&gt;

&lt;p&gt;Example: we want to base predictions on past data from customer service calls, thus supervised learning.&lt;br&gt;
We train our model on historical data that includes the correct labels (agent skills); then the model can make its own predictions on similar data going forward. The data we need comes from asking questions that help establish our features.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>kenyan debt</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Sat, 21 Jun 2025 12:10:49 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/kenyan-debt-26li</link>
      <guid>https://dev.to/o_mutwiri/kenyan-debt-26li</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyt5sncgsaotwsa51cbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyt5sncgsaotwsa51cbi.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>opendata</category>
      <category>discuss</category>
    </item>
    <item>
      <title>DATA CLEANING: Common data issues and their solutions</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Wed, 21 May 2025 15:50:00 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/data-cleaning-common-data-issues-and-their-solutions-5656</link>
      <guid>https://dev.to/o_mutwiri/data-cleaning-common-data-issues-and-their-solutions-5656</guid>
      <description>&lt;p&gt;Data cleaning is a useful process to articulate the desired state of data before being ingested and used for insights and visualization. Data driven decisions often depend on accuracy of the data being presented. &lt;/p&gt;

&lt;h2&gt;
  
  
  Some of the common data issues are
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Missing data &lt;/li&gt;
&lt;li&gt;Incorrect data &lt;/li&gt;
&lt;li&gt;Outliers&lt;/li&gt;
&lt;li&gt;Duplication &lt;/li&gt;
&lt;li&gt;Irrelevant data &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Potential techniques for fixing them
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Missing data - imputation (mean, median, mode); dropping rows/columns with excessive missing values&lt;/li&gt;
&lt;li&gt;Incorrect data - validate against an external reference; standardize formats; manual correction by a domain expert/review&lt;/li&gt;
&lt;li&gt;Outliers - delete or retain based on domain review&lt;/li&gt;
&lt;li&gt;Duplication - fuzzy matching; use unique identifiers&lt;/li&gt;
&lt;li&gt;Irrelevant data - use feature importance scores and correlation analysis to find and remove features with low variance or no contribution to the target variable.&lt;/li&gt;
&lt;/ol&gt;
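&lt;p&gt;As a sketch of the fuzzy-matching idea for duplicates, Python's standard-library difflib can score string similarity; the two records and the 0.9 threshold below are illustrative, not a recommendation:&lt;/p&gt;

```python
from difflib import SequenceMatcher

# Two hypothetical customer records that a unique-identifier check would miss.
a = "jon smith, nairobi"
b = "john smith, nairobi"

# Fuzzy matching: treat records as duplicates above a similarity threshold.
ratio = SequenceMatcher(None, a, b).ratio()
print(ratio > 0.9)  # True
```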

&lt;h2&gt;
  
  
  Pandas based solutions
&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Step 1: Load your dataset
df = pd.read_csv('your_data.csv')

# Step 2: Handle missing data
imputer = SimpleImputer(strategy='mean')
df['A'] = imputer.fit_transform(df[['A']]).ravel()
df = df.dropna(axis=0, thresh=df.shape[1] // 2)  # Drop rows with more than 50% missing

# Step 3: Handle incorrect data (standardize formats)
df['A'] = pd.to_numeric(df['A'], errors='coerce')  # keep 'A' numeric for the outlier step below
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Step 4: Handle outliers (IQR method)
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['A'] &amp;gt;= (Q1 - 1.5 * IQR)) &amp;amp; (df['A'] &amp;lt;= (Q3 + 1.5 * IQR))]

# Step 5: Remove duplicates
df = df.drop_duplicates()

# Step 6: Remove irrelevant features (based on feature importance)
X = df.drop(columns=['target'])
y = df['target']
model = RandomForestClassifier()
model.fit(X, y)
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
important_features = feature_importances[feature_importances &amp;gt; 0.01].index
df = df[important_features.tolist() + ['target']]

# Now df is cleaned and ready for analysis or modeling.
&lt;/code&gt;&lt;/pre&gt;

</description>
    </item>
    <item>
      <title>Features that orchestration offers</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Sat, 17 May 2025 17:35:44 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/features-that-orchestration-offers-3lp5</link>
      <guid>https://dev.to/o_mutwiri/features-that-orchestration-offers-3lp5</guid>
      <description>&lt;ul&gt;
&lt;li&gt;High availability - no downtime&lt;/li&gt;
&lt;li&gt;Scalability - high performance&lt;/li&gt;
&lt;li&gt;Disaster recovery - backup and restore &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;K8s architecture&lt;br&gt;
A cluster is made of at least one master node (control plane) and worker nodes, each of which has kubelet running on it. kubelet makes it possible for the nodes to communicate.&lt;/p&gt;

&lt;p&gt;On the master node: api server (entry point to the k8s cluster), controller manager (keeps track of what's happening in the cluster, DESIRED STATE vs ACTUAL STATE), scheduler (decides pod placement), etcd (status data, snapshots and data recovery), virtual network (turns all the nodes into one machine).&lt;/p&gt;

&lt;p&gt;K8s components&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;node - a virtual machine&lt;/li&gt;
&lt;li&gt;deployment - a blueprint of my-app pods, an abstraction of pods. A DB cannot be replicated using a deployment; use a statefulset for stateful apps, to reduce read/write duplication and improve consistency.&lt;/li&gt;
&lt;li&gt;pod - creates a running-environment abstraction around a container. Each pod gets an IP address and is ephemeral (data is lost on restart).&lt;/li&gt;
&lt;li&gt;service - its IP address stays even when a pod dies; it is also a load balancer.&lt;/li&gt;
&lt;li&gt;ingress - does the forwarding to allow external communication.&lt;/li&gt;
&lt;li&gt;configmap - configuration of the application; no need to build a new image, just change the configmap.&lt;/li&gt;
&lt;li&gt;secret - confidential information is stored here, e.g. credentials, passwords, certificates. Reference the secret in a deployment/pod.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;volume - storage plugged into kubernetes to aid data persistence.&lt;/p&gt;

&lt;p&gt;Every object has METADATA, a SPECIFICATION and a STATUS: k8s continuously compares the status against the specification (e.g. the number of replicas), and etcd holds the current status of the cluster.&lt;/p&gt;

&lt;p&gt;KUBECTL&lt;br&gt;
is the CLI for interacting with the cluster's api-server to submit commands to create or delete components; worker processes make it happen, that is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;create pods, create services.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Make it an external service&lt;br&gt;
type - the service type&lt;br&gt;
default = ClusterIP, an internal service&lt;br&gt;
NodePort = exposes the service on each node's IP at a static port for external communication&lt;/p&gt;
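&lt;p&gt;A minimal Service manifest of type NodePort might look like the following sketch; the name, label and port numbers are illustrative, not from a real cluster:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  type: NodePort          # default would be ClusterIP (internal only)
  selector:
    app: my-app           # forwards traffic to pods with this label
  ports:
    - port: 80            # service port inside the cluster
      targetPort: 8080    # container port on the pod
      nodePort: 30080     # static port opened on every node
```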

&lt;p&gt;configmap and secret must exist before deployments.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Kubernetes foundation</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Thu, 15 May 2025 08:10:22 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/kubernetes-foundation-48jf</link>
      <guid>https://dev.to/o_mutwiri/kubernetes-foundation-48jf</guid>
      <description>&lt;p&gt;Gitops Tools like Flux and ArgoCD are pull based approached... continously monitor the Git repository for changes and pull those changes to update the kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Ingress objects define how external traffic should be routed to different services within the cluster. They expose services to external networks without each service needing its own external IP address.&lt;/p&gt;

&lt;p&gt;Pipelines help automate the build, test and deployment of an application.&lt;/p&gt;

&lt;p&gt;Cloud Native Orchestration &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;High level of automation
From development to deployment: CI/CD pipelines with minimal human involvement, backed by a version control system like git. Easier disaster recovery.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Self healing&lt;br&gt;
Failure is expected; health checks monitor applications and restart them, so some parts can stop working while others continue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalable&lt;br&gt;
Handles more load, scaling based on metrics like memory to ensure the performance of services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost efficient&lt;br&gt;
Usage-based pricing, optimized infrastructure usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easy to maintain&lt;br&gt;
Microservices keep applications small and portable, easier to test &amp;amp; distribute across teams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Secure by default&lt;br&gt;
Zero trust computing: users and processes are authenticated.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;running containers&lt;br&gt;
to start containers  -  docker run nginx&lt;/p&gt;

&lt;p&gt;building container images&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;containers are a metaphor for shipping containers.&lt;/li&gt;
&lt;li&gt;there is a standard format for a shipping container to make it easy to stack on a container ship, unload, and load onto a truck, no matter what is inside.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;container images - are what make containers portable and easy to reuse.&lt;br&gt;
Docker describes a container image as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Dockerfile&lt;br&gt;
Images can be built by reading instructions from a build file called a Dockerfile.&lt;/p&gt;

&lt;p&gt;The instructions are the same as the ones one would use to install an application on a server.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Every container image starts with a base image.
# This could be your favorite linux distribution
FROM ubuntu:20.04

# Run commands to add software and libraries to your image
# Here we install python3 and the pip package manager
RUN apt-get update &amp;amp;&amp;amp; \
    apt-get -y install python3 python3-pip

# The COPY command can be used to copy your code to the image
# Here we copy a script called "my-app.py" to the container's filesystem
COPY my-app.py /app/

# Defines the workdir in which the application runs
# From this point on everything will be executed in /app
WORKDIR /app

# The process that should be started when the container runs
# In this case we start our python app "my-app.py"
CMD ["python3", "my-app.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;
  
  
  To build the image
&lt;/h2&gt;

&lt;p&gt;docker build -t my-python-image -f Dockerfile .&lt;/p&gt;

&lt;p&gt;-t my-python-image - specifies the name tag of your image.&lt;br&gt;
-f Dockerfile - specifies the Dockerfile to be used.&lt;/p&gt;

&lt;p&gt;To distribute these images we use a container registry, where you can upload and download images with the push and pull commands:&lt;/p&gt;

&lt;p&gt;docker push my-registry.com/my-python-image&lt;br&gt;
docker pull my-registry.com/my-python-image&lt;/p&gt;

&lt;p&gt;Some of the public registries are Docker Hub and Quay.&lt;/p&gt;

&lt;p&gt;4Cs OF CLOUD NATIVE SECURITY&lt;br&gt;
CLOUD&lt;br&gt;
CLUSTER&lt;br&gt;
CONTAINER&lt;br&gt;
CODE&lt;/p&gt;

&lt;h1&gt;
  
  
  These are the layers that need to be protected when using containers
&lt;/h1&gt;

&lt;p&gt;CONTAINER ORCHESTRATION FUNDAMENTALS&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is harder to manage a lot of containers, so you need a system for managing them.
Container orchestration provides a way to build a cluster of multiple servers and host containers on top. Most systems have a control plane for management of the containers, and worker nodes that host the containers. One of the most common systems for orchestrating containers is kubernetes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Problems to be solved through container orchestration systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;providing computing resources, like VMs, for containers to run on&lt;/li&gt;
&lt;li&gt;schedule containers to servers efficiently&lt;/li&gt;
&lt;li&gt;allocate resources like CPU and memory&lt;/li&gt;
&lt;li&gt;scale containers based on the load&lt;/li&gt;
&lt;li&gt;provide networking to connect containers together&lt;/li&gt;
&lt;li&gt;provide storage if containers need to persist data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Networking&lt;br&gt;
to make an application accessible to the outside - containers have the ability to map a port from the container to a port on the host system.&lt;br&gt;
to communicate between containers across hosts - an overlay network is spanned across the host systems. The overlay network manages IP addresses: which container gets which IP address, and how traffic has to flow to reach individual containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffinc37iqyw7dz92l6n78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffinc37iqyw7dz92l6n78.png" alt="CNI" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CNI standard - guides writing and configuring network plugins and makes it easy to swap plugins in various orchestration platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  SERVICE DISCOVERY and DNS
&lt;/h2&gt;

&lt;p&gt;no need to remember the IP addresses of important systems&lt;br&gt;
1000s of containers have individual IP addresses&lt;br&gt;
different hosts in different data centers and locations&lt;br&gt;
information about containers is removed when they are deleted&lt;br&gt;
SERVICE REGISTRY - all this information is put in a service registry&lt;br&gt;
SERVICE DISCOVERY - the process of finding other services in the network and requesting information from them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approaches to Service Discovery
&lt;/h2&gt;

&lt;p&gt;DNS servers - have a service API that registers new services as they are created.&lt;br&gt;
KEY-VALUE store - a datastore to store information about services. Popular choices for clustering are &lt;strong&gt;etcd, Consul and Apache ZooKeeper&lt;/strong&gt;.&lt;br&gt;
They are highly available systems with strong failover mechanisms.&lt;/p&gt;

&lt;p&gt;SERVICE MESH&lt;br&gt;
Helps with monitoring, access control and encryption of network traffic when containers communicate with each other. A service mesh adds a proxy server to every container in your architecture. A proxy is software used to manage network traffic: it sits between client and server and modifies/filters traffic before it reaches the server. Popular proxies are &lt;strong&gt;nginx, HAProxy and Envoy&lt;/strong&gt;.&lt;br&gt;
In a mesh, the proxies handle communication between services; traffic is routed through the proxies instead. Popular service meshes are &lt;strong&gt;Istio&lt;/strong&gt; and &lt;strong&gt;Linkerd&lt;/strong&gt;. The proxies in a service mesh form the data plane, shaping traffic flow; the rules and configs applied to the proxies are managed centrally in the control plane of the service mesh.&lt;br&gt;
The Service Mesh Interface (SMI) is the standard.&lt;/p&gt;

&lt;p&gt;STORAGE&lt;br&gt;
containers are ephemeral&lt;br&gt;
container images are read-only, so a read-write layer is added to allow writing files.&lt;br&gt;
if data needs to persist on the host, a volume can be used to achieve that:&lt;br&gt;
directories on the host are passed through into the container filesystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  CONTROL PLANE NODES
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;kube-apiserver - the centerpiece; all components interact with the api-server, and users access the cluster through it.&lt;/li&gt;
&lt;li&gt;etcd - a db that holds the state of the cluster.&lt;/li&gt;
&lt;li&gt;kube-scheduler - when a new workload is scheduled, it chooses a worker node that fits.&lt;/li&gt;
&lt;li&gt;kube-controller-manager - manages the state of the cluster, e.g. keeping the desired number of application replicas available at all times.&lt;/li&gt;
&lt;li&gt;cloud-controller-manager - interacts with cloud provider APIs for load balancers, storage or security groups.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  WORKER NODES
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;container runtime - responsible for running the containers on the worker node; historically Docker, now containerd.&lt;/li&gt;
&lt;li&gt;kubelet - the agent that runs on every worker node in the cluster; it talks to the api-server and the container runtime to handle the stages of starting containers.&lt;/li&gt;
&lt;li&gt;kube-proxy - a network proxy that handles inside and outside communication, using the networking capabilities of the underlying OS.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;kubernetes namespaces are used to divide a cluster into virtual clusters, organize objects and manage which users have access to which resources.&lt;/p&gt;

&lt;p&gt;setting up a test cluster: minikube, kind, MicroK8s&lt;br&gt;
setting up a prod-grade cluster: kubeadm, kops, kubespray&lt;br&gt;
cloud providers: EKS, GKE, AKS&lt;/p&gt;

&lt;p&gt;Kubernetes API: authentication (X.509 digitally signed certificates), authorization (RBAC), admission control (Open Policy Agent can manage admission control externally).&lt;br&gt;
Through the API, a user or service can create, delete and retrieve resources in k8s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpm2sfibwb1b89sm7npxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpm2sfibwb1b89sm7npxg.png" alt="k8s api" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RUNNING CONTAINERS IN KUBERNETES&lt;br&gt;
Unlike on a local machine, where you start containers directly, in K8s they run inside PODS.&lt;/p&gt;

&lt;p&gt;kubelet &amp;lt;--&amp;gt; CRI (containerd plugin) &amp;lt;--&amp;gt; containers&lt;/p&gt;

&lt;p&gt;NETWORKING&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;container-to-container communication {pod}&lt;/li&gt;
&lt;li&gt;pod-to-pod communication {overlay network}&lt;/li&gt;
&lt;li&gt;pod-to-service communication {kube-proxy and packet filter on the node}&lt;/li&gt;
&lt;li&gt;external-to-service communication {kube-proxy and packet filter on the node}&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When implementing networking, choose a network vendor: Calico, Cilium, Weave.&lt;/p&gt;

&lt;p&gt;Every pod gets its own IP address dynamically. CoreDNS, a DNS server, is used for service discovery and name resolution in the cluster.&lt;/p&gt;

&lt;p&gt;Network policies: cluster-internal firewalls; with the help of a selector they specify the traffic allowed to and from the pods that match the selector.&lt;/p&gt;
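&lt;p&gt;As a sketch, a NetworkPolicy that only allows traffic into matching pods from one other set of pods could look like this; all names and labels are illustrative:&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend          # the policy applies to pods with this label
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend # only frontend pods may connect
```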

&lt;h2&gt;
  
  
  SCHEDULING
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The process of choosing the right worker node to run a containerized workload on.
The scheduling process starts when a pod is created:
the scheduler selects a node, and the pod actually gets started there by kubelet and the container runtime.
The scheduler uses information about the application's requirements to filter for nodes that fit them; if multiple nodes fit, the node with the least amount of pods is chosen. If scheduling fails, the scheduler keeps trying.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  KUBERNETES objects vs WORKLOAD OBJECTS
&lt;/h2&gt;

&lt;p&gt;Objects are described in the data serialization language YAML and sent to the api-server, where they get validated before being created.&lt;br&gt;
Every object declares the version (apiVersion), the kind of object to be created (kind), metadata (unique data that can be used to identify it) and spec (the specification of the object).&lt;/p&gt;
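&lt;p&gt;A minimal object description showing those four fields, here for a Pod (the name and image tag are illustrative):&lt;/p&gt;

```yaml
apiVersion: v1            # the version
kind: Pod                 # the kind of object to be created
metadata:
  name: my-app            # unique data used to identify the object
spec:                     # the specification of the object
  containers:
    - name: my-app
      image: nginx:1.25
```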

&lt;ul&gt;
&lt;li&gt;K8s cluster like a factory has many components...security, storage, management of pods. &lt;/li&gt;
&lt;li&gt;K8s objects control how pods are deployed, scaled and managed....they are like different parts of the factory.&lt;/li&gt;
&lt;li&gt;Configuration management, cross-node networking, routing of external traffic, load balancing or scaling of the pods.&lt;/li&gt;
&lt;li&gt;Workload Objects - they actually build and manage the applications. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Breakdown of the analogy using a factory:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Kubernetes Concept&lt;/th&gt;&lt;th&gt;Factory Analogy&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Pod&lt;/td&gt;&lt;td&gt;A single worker assembling a product&lt;/td&gt;&lt;td&gt;Runs one or more containers (the "product").&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ReplicaSet&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;A controller object that ensures the desired number of Pods is running at any given time.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Deployment&lt;/td&gt;&lt;td&gt;An automated assembly line for mass production&lt;/td&gt;&lt;td&gt;Ensures many copies of a Pod run smoothly by defining the lifecycle and managing ReplicaSets; for stateless applications in k8s.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;StatefulSet&lt;/td&gt;&lt;td&gt;A specialized assembly line for custom orders&lt;/td&gt;&lt;td&gt;Manages Pods that need unique identities: a stable name, persistent storage and graceful handling of updates and scaling (e.g., stateful applications like databases).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;DaemonSet&lt;/td&gt;&lt;td&gt;Maintenance crew in every section of the factory&lt;/td&gt;&lt;td&gt;Runs a Pod on every node (e.g., log collectors, monitoring and other infrastructure-related workloads).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Job&lt;/td&gt;&lt;td&gt;A temporary worker hired for a one-time task&lt;/td&gt;&lt;td&gt;Runs a Pod to completion (e.g., a backup job).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CronJob&lt;/td&gt;&lt;td&gt;A scheduled task (like a nightly cleanup crew)&lt;/td&gt;&lt;td&gt;Runs Jobs at specific times.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Service&lt;/td&gt;&lt;td&gt;The shipping department (delivers products)&lt;/td&gt;&lt;td&gt;Exposes Pods to the network.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ConfigMap/Secret&lt;/td&gt;&lt;td&gt;Blueprints &amp;amp; security documents&lt;/td&gt;&lt;td&gt;Stores configuration &amp;amp; sensitive data.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;PersistentVolume&lt;/td&gt;&lt;td&gt;Warehouse storage&lt;/td&gt;&lt;td&gt;Provides long-term storage for Pods.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Namespace&lt;/td&gt;&lt;td&gt;Different factory departments (e.g., "Production" vs. "R&amp;amp;D")&lt;/td&gt;&lt;td&gt;Isolates resources in the cluster.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;




&lt;p&gt;Key Takeaway&lt;br&gt;
Workload objects = Assembly lines (they handle the actual work of running apps).&lt;/p&gt;

&lt;p&gt;Other Kubernetes objects = Supporting roles (security, storage, networking, etc.).&lt;/p&gt;
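&lt;p&gt;As a sketch, a minimal Deployment manifest tying the workload objects together (all names here are illustrative):&lt;/p&gt;

```yaml
apiVersion: apps/v1
kind: Deployment            # the automated "assembly line"
metadata:
  name: web-deploy          # hypothetical name
spec:
  replicas: 3               # desired number of Pod copies, kept by a managed ReplicaSet
  selector:
    matchLabels:
      app: web
  template:                 # the Pod "worker" to mass-produce
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
```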

&lt;p&gt;&lt;strong&gt;KUBECTL&lt;/strong&gt; is the official command-line client for Kubernetes.&lt;/p&gt;

&lt;p&gt;kubectl api-resources -- lists the available objects&lt;/p&gt;

&lt;p&gt;kubectl explain pod -- gets more info about an object&lt;/p&gt;

&lt;p&gt;kubectl explain pod.spec -- more about the Pod spec &lt;/p&gt;

&lt;p&gt;kubectl create -f .yaml -- creates an object in K8s from a YAML file &lt;/p&gt;

&lt;h2&gt;
  
  
  HELM
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Helm is the package manager for Kubernetes: it is used to create templates and package K8s objects, allowing easier updates and interaction with objects. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  HELM Charts
&lt;/h2&gt;

&lt;p&gt;Helm packages K8s objects in charts, which can be shared with others via a registry (ArtifactHub).&lt;/p&gt;

&lt;h2&gt;
  
  
  POD CONCEPT
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A collection of one or more containers that share namespaces and cgroups. &lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is the smallest deployable unit in K8s. &lt;br&gt;
A Pod allows the combination of multiple processes that are interdependent. All containers in a Pod share an IP address and even a filesystem, even if they are built from different images. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sidecar container - supports the main application with things like logging, monitoring, security or proxying within the same Pod.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;initContainers - run containers before the main application starts (containers: ... initContainers: ...).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
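&lt;p&gt;A sketch of a Pod combining an initContainer with a logging-style sidecar (illustrative names and images):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-demo        # hypothetical name
spec:
  initContainers:           # run to completion before the main containers start
    - name: init-setup
      image: busybox:1.36
      command: ["sh", "-c", "echo preparing && sleep 2"]
  containers:
    - name: app             # the main application container
      image: nginx:1.25
    - name: log-sidecar     # sidecar supporting the main app (e.g. logging)
      image: busybox:1.36
      command: ["sh", "-c", "tail -f /dev/null"]
```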

&lt;ol&gt;
&lt;li&gt;Resources - set a resource request and a maximum limit for CPU and memory.&lt;/li&gt;
&lt;li&gt;LivenessProbe - configures a health check that periodically checks if the container is alive; if the check fails, the container is restarted. &lt;/li&gt;
&lt;li&gt;SecurityContext - sets user and group settings and kernel capabilities. &lt;/li&gt;
&lt;/ol&gt;
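&lt;p&gt;The three container settings above, sketched together in one hypothetical Pod spec:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo          # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25
      resources:
        requests:           # resource request
          cpu: 100m
          memory: 128Mi
        limits:             # maximum limit
          cpu: 250m
          memory: 256Mi
      livenessProbe:        # periodic health check; the container restarts on failure
        httpGet:
          path: /
          port: 80
        periodSeconds: 10
      securityContext:      # user/group settings and kernel capabilities
        runAsUser: 1000
        runAsGroup: 3000
        capabilities:
          drop: ["ALL"]
```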

&lt;h2&gt;
  
  
  Pod lifecycle phases
&lt;/h2&gt;

&lt;p&gt;Pending -- the Pod is in the cluster but no container is set up and ready to run; it is waiting to be scheduled or downloading images over the network.&lt;br&gt;
Running -- the Pod is bound to a node and all containers are created, with at least one container running, starting or restarting.&lt;br&gt;
Succeeded -- all containers have terminated successfully and are not restarted.&lt;br&gt;
Failed -- all containers in the Pod have terminated, and at least one container exited with a non-zero status. &lt;br&gt;
Unknown -- the state of the Pod could not be obtained.&lt;/p&gt;

&lt;h2&gt;
  
  
  NETWORKING OBJECTS
&lt;/h2&gt;

&lt;p&gt;These objects define and abstract networking. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Service objects - used to expose a set of Pods as a network service.&lt;/li&gt;
&lt;li&gt;Ingress objects&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  service types
&lt;/h2&gt;

&lt;p&gt;ClusterIP - a round-robin load balancer providing a single endpoint for a set of Pods; the most common service type.&lt;br&gt;
NodePort - extends ClusterIP by adding simple routing rules: opens a port on every node in the cluster and maps it to the ClusterIP, which allows external traffic into the cluster.&lt;br&gt;
LoadBalancer - extends NodePort by deploying an external load-balancer instance; needs a cloud API such as GCP or AWS.&lt;br&gt;
ExternalName - uses the K8s internal DNS server to create a DNS alias; resolves hostnames.&lt;br&gt;
Headless Services - behavior depends on whether the service has selectors defined (e.g., for StatefulSet containers).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eqsnrt64go5932pgya9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eqsnrt64go5932pgya9.png" alt="Service Types" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
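&lt;p&gt;A sketch of a ClusterIP Service, the most common type (illustrative names and ports):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc             # hypothetical name
spec:
  type: ClusterIP           # single endpoint, round-robin to the selected Pods
  selector:
    app: web                # forwards to Pods labeled app: web
  ports:
    - port: 80              # port exposed by the service
      targetPort: 8080      # port the container listens on
```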

&lt;p&gt;Ingress Objects&lt;br&gt;
Provide a means to expose HTTP and HTTPS routes from outside the cluster to a service within the cluster. Its routing rules are implemented by an Ingress controller. &lt;/p&gt;
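&lt;p&gt;A sketch of an Ingress routing an HTTP path to a service inside the cluster (hostname and service name are illustrative):&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress         # hypothetical name
spec:
  rules:
    - host: example.local   # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-svc   # in-cluster service the rule routes to
                port:
                  number: 80
```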

&lt;ul&gt;
&lt;li&gt;Ingress and egress rules are defined in NetworkPolicies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;kubectl expose --help -- exposes a resource as a new Kubernetes service&lt;br&gt;
kubectl get service -- lists the exposed services&lt;br&gt;
kubectl scale --help -- scales to the number of replicas indicated&lt;/p&gt;

&lt;p&gt;Configuration Objects&lt;br&gt;
- storing configuration in the environment&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is bad practice to include configuration in the container build.&lt;/li&gt;
&lt;li&gt;ConfigMap - stores configuration files as key-value pairs.&lt;/li&gt;
&lt;li&gt;Mount a ConfigMap as a volume in a Pod.&lt;/li&gt;
&lt;li&gt;Map variables from a ConfigMap to environment variables in a Pod.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example of mounting a ConfigMap as a volume:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;volumes:
  - name: nginx-conf
    configMap:
      name: nginx-conf
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;To store passwords, keys and other sensitive information, Secret objects are used; secrets are base64 encoded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HashiCorp Vault is a secret management tool for cloud environments.&lt;/p&gt;

&lt;p&gt;AUTOSCALING Objects&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Horizontal Pod Autoscaler (HPA) - the most used; a second or third Pod gets scheduled when a capacity threshold is met (relies on the metrics-server). &lt;/li&gt;
&lt;li&gt;Cluster Autoscaler - if cluster capacity is fixed, new worker nodes are added to the cluster on demand.&lt;/li&gt;
&lt;li&gt;Vertical Pod Autoscaler - Pod resource requests and limits are increased dynamically; a newer concept.&lt;/li&gt;
&lt;/ol&gt;
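&lt;p&gt;A sketch of a Horizontal Pod Autoscaler scaling a Deployment on CPU (illustrative names and thresholds; relies on the metrics-server):&lt;/p&gt;

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa             # hypothetical name
spec:
  scaleTargetRef:           # workload whose replica count is adjusted
    apiVersion: apps/v1
    kind: Deployment
    name: web-deploy        # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # capacity threshold that triggers scheduling another Pod
```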

&lt;p&gt;CLOUD NATIVE APPLICATION DELIVERY&lt;br&gt;
Continuous Integration - building and testing of code; version control and collaboration of teams.&lt;br&gt;
Continuous Delivery - automates deployment of pre-built software; deployed to development or staging environments or systems before production.&lt;br&gt;
CI/CD tools: ArgoCD, Jenkins, GitLab&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Principles of GitOps and how it integrates with Kubernetes:
GitOps integrates the provisioning and change process of infrastructure with version-control operations and manages infrastructure changes.
Pull-based: an agent watches the Git repository for changes; if changes are detected, they are applied to the running infrastructure state (e.g., Flux and ArgoCD). &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CLOUD NATIVE OBSERVABILITY&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows analysis of the collected data to understand the system and react to error states.&lt;/li&gt;
&lt;li&gt;Metrics - quantitative measurements taken over time, e.g. error rate.&lt;/li&gt;
&lt;li&gt;Logs - messages presenting error, warning or debug information.&lt;/li&gt;
&lt;li&gt;Traces - the progress of a request as it passes through a service; in a distributed system, how long a request took when being processed by each service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;docker logs nginx -- views the logs of the nginx container&lt;br&gt;
kubectl logs nginx&lt;br&gt;
kubectl logs -p -c ruby web-1 -- views the logs of the previously terminated ruby container from pod web-1&lt;/p&gt;

&lt;p&gt;kubectl logs -f -c ruby web-1 -- streams the logs &lt;/p&gt;

&lt;p&gt;Fluentd or Filebeat can ship and store logs (e.g., to be visualized in Grafana).&lt;br&gt;
Prometheus - an open-source monitoring system to collect metrics.&lt;/p&gt;

&lt;p&gt;Prometheus collects the data, and Grafana helps build dashboards for the collected metrics, with Prometheus as one of the data sources.&lt;/p&gt;

&lt;p&gt;Counter - a value that only increases (e.g., error count).&lt;br&gt;
Gauge - a value that can increase or decrease (e.g., memory usage).&lt;br&gt;
Logging approaches: node-level logging, logging via a sidecar container, and application-level logging, which pushes logs directly from the application running in the cluster.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>docker and dockerfiles</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Wed, 30 Apr 2025 03:30:41 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/docker-and-dockerfiles-4dk</link>
      <guid>https://dev.to/o_mutwiri/docker-and-dockerfiles-4dk</guid>
      <description>&lt;p&gt;Containers help with:&lt;br&gt;
 Dependency management of applications&lt;br&gt;
 Writing secure application code&lt;br&gt;
 Efficient use of hardware resources &lt;br&gt;
Open container initiative runtime specifications (OCI)&lt;br&gt;
Run basic containers&lt;br&gt;
Docker desktop&lt;br&gt;
Docker version&lt;br&gt;
Docker hub&lt;br&gt;
 Building container images is based on iso 668 standardization&lt;br&gt;
 Dockerfile contains instructions on how to build container images with docker.&lt;br&gt;
 Processes are isolated using namespaces and cgroups. &lt;br&gt;
 Container images make containers portable and easy to reuse. They contain everything needed to run an application: the code, runtime, system tools, system libraries and settings. &lt;br&gt;
 DOCKERFILE:&lt;/p&gt;

&lt;p&gt;Container image example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FROM ubuntu:20.04                      # this is the base image
RUN apt-get update &amp;amp;&amp;amp; apt-get -y install python3 python3-pip   # RUN commands add software and libraries
COPY my-app.py /app/                   # COPY copies code into the image filesystem
WORKDIR /app                           # defines the workdir in which the app runs
CMD ["python3", "my-app.py"]           # the process started when the container runs:
                                       # here we run our python app "my-app.py"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To build this image: &lt;br&gt;
docker build -t my-python-image -f Dockerfile .&lt;br&gt;
-t my-python-image = specifies a name tag for the image&lt;br&gt;
-f Dockerfile = specifies where your Dockerfile can be found&lt;br&gt;
(the trailing dot is the build context)&lt;br&gt;
To distribute the image, use a container registry:&lt;br&gt;
docker push my-registry.com/my-python-image &lt;br&gt;
docker pull my-registry.com/my-python-image &lt;/p&gt;

&lt;p&gt;CONTAINER ORCHESTRATION &lt;br&gt;
With large numbers of containers, one needs a system that helps with the management of these containers.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Providing compute resources like virtual machines where containers can run on&lt;/li&gt;
&lt;li&gt; Schedule containers to servers in an efficient way&lt;/li&gt;
&lt;li&gt; Allocate resources like CPU and Memory to containers&lt;/li&gt;
&lt;li&gt; Manage the availability of containers and replace them if they fail&lt;/li&gt;
&lt;li&gt; Scale if load increases&lt;/li&gt;
&lt;li&gt; Provide networking to connect containers together&lt;/li&gt;
&lt;li&gt; Provision storage if containers need to persist data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In most cases, a container orchestration system consists of a control plane and worker nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Control plane – responsible for the management of the containers&lt;/li&gt;
&lt;li&gt; Worker nodes – host the containers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Kubernetes is the standard system to orchestrate containers.&lt;/p&gt;

&lt;p&gt;Networking&lt;br&gt;
Container networking implementation is based on the Container Network Interface (CNI). It guides network plugins and how they can be swapped between different orchestration platforms. &lt;br&gt;
Network namespaces allow each container to have its own IP address. &lt;br&gt;
You need to map a port from the container to a port on the host system to open access from outside the host system.&lt;br&gt;
Overlay network – puts containers across hosts in a virtual network that is spanned across the host systems.&lt;br&gt;
The host network may be 172.16.4.x while the container network is 192.168.8.x: servers in this network may have 172.16.4.11, 172.16.4.12, 172.16.4.13, and the containers inside each of these servers may derive IPs from a wider range, e.g. 192.168.1.1, 192.168.1.4.&lt;/p&gt;

&lt;p&gt;Service Discovery and DNS&lt;br&gt;
In container orchestration platforms there are thousands of containers with individual IP addresses,&lt;br&gt;
and containers are deployed on different hosts, in different data centers and even geolocations.&lt;br&gt;
Using IPs to communicate is nearly impossible, so DNS is used instead.&lt;br&gt;
All this information is automated through the use of a service registry. &lt;br&gt;
Finding other services in the network and requesting information about them is service discovery.&lt;br&gt;
Approaches to service discovery:&lt;br&gt;
DNS - register new services as they are created&lt;br&gt;
Key-value store - data stores like etcd, Consul or Apache ZooKeeper&lt;/p&gt;

&lt;p&gt;Service Mesh&lt;br&gt;
A service mesh describes how traffic in container platforms is handled by proxies (SMI - Service Mesh Interface).&lt;br&gt;
A proxy is a server application that sits between a client and a server, used to manage network traffic.&lt;br&gt;
Popular proxies: Nginx, HAProxy or Envoy.&lt;br&gt;
A service mesh adds a proxy server to every container that you have in the architecture.&lt;br&gt;
Therefore, it helps manage complex and opaque networking, and implement monitoring, access control and encryption of network traffic as containers communicate with each other. &lt;br&gt;
When service meshes are used, traffic is routed through the proxies instead of applications talking to each other directly.&lt;br&gt;
Istio and Linkerd are popular service meshes.&lt;br&gt;
The proxies in a service mesh form the data plane, where rules centrally managed in the control plane of the service mesh are implemented and shape traffic flow.&lt;br&gt;
Config files are written and uploaded to the control plane to enforce new rules, e.g. service A and service B should always communicate encrypted. &lt;/p&gt;

&lt;p&gt;Storage &lt;br&gt;
Containers are ephemeral:&lt;br&gt;
they are read-only, and the read-write layer is lost when the container is stopped or deleted. &lt;br&gt;
To persist container data, a volume is used.&lt;br&gt;
Often multiple containers are started on different host systems, or a container restarted on a different host still needs to access its volume. &lt;br&gt;
This requires a robust storage system that is attached to the host servers; storage is provisioned via a storage system, so server A and server B can share a volume to read and write data. &lt;br&gt;
The Container Storage Interface (CSI) offers a uniform interface which allows attaching different storage systems, no matter whether they are on-premise or in the cloud.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>KUBERNETES CONTAINERIZATION ORCHESTRATION</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Tue, 29 Apr 2025 16:26:41 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/kubernetes-containerization-orchestration-39hf</link>
      <guid>https://dev.to/o_mutwiri/kubernetes-containerization-orchestration-39hf</guid>
      <description>&lt;p&gt;Kubernetes and Cloud Native Essentials LFS250&lt;br&gt;
Containers are a standardized way to package and ship modern software. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  IMAGE: defines how to build and package container images&lt;/li&gt;
&lt;li&gt;  Runtime: defines configs, execution environment and lifecycle of containers.
Open source standards worth considering:&lt;/li&gt;
&lt;li&gt;  OCI specifies image, runtime and distribution&lt;/li&gt;
&lt;li&gt;  CNI – Container Network Interface&lt;/li&gt;
&lt;li&gt;  CRI – Container Runtime Interface &lt;/li&gt;
&lt;li&gt;  CSI – Container Storage Interface&lt;/li&gt;
&lt;li&gt;  SMI – Service Mesh Interface
KUBERNETES = container orchestration system, an open-source platform for managing containerized workloads in the IT sector.
Developers have a singular view of the application, unlike the operations team.
An application may have the following items in a Node.js application...&lt;/li&gt;
&lt;li&gt;  Frontend service- end user accesses the frontend&lt;/li&gt;
&lt;li&gt;  Db access service -  where the frontend stores the data,&lt;/li&gt;
&lt;li&gt;  Backend service – accesses the database &lt;/li&gt;
&lt;li&gt;  The services are working together and exposed to the end user&lt;/li&gt;
&lt;li&gt;  Master node manages the applications running in the compute resources &lt;/li&gt;
&lt;li&gt;  Within the containers there is:
o   The application itself 
o   Requirements
o   Dependencies for the underlying operating system
o   Application runtime
o   Etc.&lt;/li&gt;
&lt;li&gt;  From an operations team perspective, there are several issues to consider with regard to the compute resources in the orchestration platform when running the application in production.
o   Deploying on the master 
 COMPUTE worker node – 4 vCPUs 
 Scaling – where each node hosts the containers, maybe having one or two frontends, 3 backends, 3 databases
 Network – exposing the services to each other and maybe an end user, load balancing 
 Insights – Prometheus and the ability to see the entire service mesh; self-healing and configuration management. 
What is the difference between a VM and a container?
Virtual machine: host OS, hypervisor, then (OS, libs, runtime, application).
Container: host OS, runtime engine (Docker Engine), then the actual container with the libraries, which will be scaled.
Containers usually start with a manifest, and if a third-party service is introduced it is easily scalable, as the services are not running on the same host. Cloud native design is modular and portable.
Cloud native architecture &lt;/li&gt;
&lt;li&gt;  Optimize cost, reliability and faster time to market through a high level of automation (CI/CD pipelines help rebuild the system in case of disaster, and accommodate incremental changes, testing and deployment of applications)&lt;/li&gt;
&lt;li&gt;  These are design patterns that help build and run scalable applications in modern, dynamic environments such as hybrid, private and public clouds, when under a lot of load.&lt;/li&gt;
&lt;li&gt;  Instead of a monolithic approach, cloud native architectural design means we are looking at:
o   Containers and microservices 
o   Service mesh
o   Immutable infrastructure - self-healing, health checks 
o   Declarative APIs 
Scaling services that have a lot of load, like the shopping cart and checkout. Despite its advantages, a microservices architecture is complex to integrate. 
Traditionally, once you are inside a zone you can access every system inside it. Patterns like zero-trust computing mitigate that by requiring authentication from every user and process.
Autoscaling 
= configure the minimum and maximum limit of instances and the metric that triggers scaling. On-demand pricing models are especially desirable for autoscaling; it improves resilience and service availability.
It means the resources are dynamically adjusted based on current demand; metrics like CPU and memory can decide when to scale based on an increase or decrease in load.
Horizontal scaling - spawning new compute resources, e.g. new racks and hardware A, B, C. 
Vertical scaling - changing the size of the hardware, like adding more CPU or RAM slots in server A.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Serverless&lt;br&gt;
= abstracts the underlying infrastructure &lt;br&gt;
= based on the idea of scaling and provisioning driven by events, like incoming requests or event data across services, platforms etc.&lt;br&gt;
No need to prepare and configure resources like load balancers, EC2 instances, the OS and the network to run an application. &lt;br&gt;
Let the cloud provider choose the right environment; just provide the application code. &lt;br&gt;
Ideal for: event or data streams, scheduled tasks, business logic and batch processing.&lt;/p&gt;

&lt;p&gt;FUNDAMENTALS OF CONTAINER ORCHESTRATION&lt;br&gt;
Container Orchestration&lt;br&gt;
Introduction&lt;br&gt;
Container Orchestration&lt;br&gt;
Use of Containers&lt;br&gt;
Container Basics&lt;br&gt;
Running Containers&lt;br&gt;
Demo: Running Containers&lt;br&gt;
Building Container Images&lt;br&gt;
Demo: Building Container Images&lt;br&gt;
Security&lt;br&gt;
Container Orchestration Fundamentals&lt;br&gt;
Networking&lt;br&gt;
Service Discovery &amp;amp; DNS&lt;br&gt;
Service Mesh&lt;br&gt;
Storage&lt;/p&gt;

</description>
    </item>
    <item>
      <title>INTRODUCTION TO POSTGRES AND DBEAVER</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Wed, 09 Apr 2025 21:46:02 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/introduction-to-postgres-and-dbeaver-1lma</link>
      <guid>https://dev.to/o_mutwiri/introduction-to-postgres-and-dbeaver-1lma</guid>
<description>&lt;p&gt;localhost may keep losing the connection, so you need to go to DBeaver and invalidate/reconnect.&lt;br&gt;
You can also replace localhost with 127.0.0.1 or with the assigned IP address. &lt;/p&gt;

&lt;p&gt;This will help when there is no localhost and you need to use the database within the local network or host. &lt;/p&gt;

&lt;p&gt;The port is where the DB is running;&lt;br&gt;
by default Postgres uses 5432.&lt;/p&gt;

&lt;p&gt;The assignment connection details are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Host: 172.178.131.221&lt;/li&gt;
&lt;li&gt;Port: 5432&lt;/li&gt;
&lt;li&gt;Database: warehouse&lt;/li&gt;
&lt;li&gt;User: luxds&lt;/li&gt;
&lt;li&gt;Password: 1234&lt;/li&gt;
&lt;li&gt;Schema: dataanalytics&lt;/li&gt;
&lt;li&gt;Table: international_debt
&lt;a href="https://neon.tech/postgresql/postgresql-administration/psql-commands" rel="noopener noreferrer"&gt;https://neon.tech/postgresql/postgresql-administration/psql-commands&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1tkkge7a6so66miciz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1tkkge7a6so66miciz9.png" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;select * from dataanalytics.international_debt;&lt;br&gt;
select count(*) from dataanalytics.international_debt;&lt;/p&gt;

&lt;p&gt;There is a GitHub repository containing the SQL scripts plus an article for the intro to SQL for data analytics...&lt;/p&gt;

&lt;p&gt;harun mbaabu&lt;br&gt;
22:27&lt;br&gt;
SQL STATEMENT TO CHECK NUMBER OF ROWS?&lt;br&gt;
Elijah Mwangi&lt;br&gt;
22:28&lt;br&gt;
Select count(*);&lt;br&gt;
dancan Kombo&lt;br&gt;
22:28&lt;br&gt;
select count(*) from dataanalytics.international_debt&lt;br&gt;
You&lt;br&gt;
22:29&lt;br&gt;
select count(*) as row_counts from analytics.internationa_debt&lt;br&gt;
harun mbaabu&lt;br&gt;
22:29&lt;br&gt;
1). What is the total amount of debt owed by all countries in the dataset?&lt;br&gt;
harun mbaabu&lt;br&gt;
22:31&lt;br&gt;
2). How many distinct countries are recorded in the dataset?&lt;/p&gt;

&lt;p&gt;3). What are the distinct types of debt indicators, and what do they represent?&lt;/p&gt;

&lt;p&gt;4). Which country has the highest total debt, and how much does it owe?&lt;/p&gt;

&lt;p&gt;5). What is the average debt across different debt indicators?&lt;/p&gt;

&lt;p&gt;6). Which country has made the highest amount of principal repayments?&lt;/p&gt;

&lt;p&gt;7). What is the most common debt indicator across all countries?&lt;/p&gt;

&lt;p&gt;8). Identify any other key debt trends and summarize your findings&lt;/p&gt;

&lt;p&gt;markdown, push the changes to github, &lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>INTRODUCTION TO PYTHON</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Wed, 09 Apr 2025 21:45:37 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/introduction-to-python-3fah</link>
      <guid>https://dev.to/o_mutwiri/introduction-to-python-3fah</guid>
      <description></description>
      <category>python</category>
      <category>tutorial</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>INTRODUCTION TO SQL</title>
      <dc:creator>soul-o mutwiri</dc:creator>
      <pubDate>Wed, 09 Apr 2025 21:45:18 +0000</pubDate>
      <link>https://dev.to/o_mutwiri/introduction-to-sql-i71</link>
      <guid>https://dev.to/o_mutwiri/introduction-to-sql-i71</guid>
<description>&lt;p&gt;SQL databases are used in digital spaces, for instance in commerce websites, social media and other online spaces. &lt;br&gt;
They help store and manipulate data; using SQL queries, analysts can derive insights from the data stored in the database. &lt;/p&gt;

&lt;p&gt;Let's dive in and create a table, where the data is stored.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;customers table showcasing table fields.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;user_ID|firstname|lastname|age|city&lt;/p&gt;

&lt;p&gt;The SQL query to select all the data from this table looks like this:&lt;br&gt;
select * from customers&lt;/p&gt;

&lt;p&gt;It is important to note that SQL is case insensitive, so &lt;br&gt;
select = SELECT&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A SELECT statement is used to select data from a table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL query to find the name and salary columns from the Employees table:&lt;/p&gt;

&lt;p&gt;select name, salary&lt;br&gt;
from Employees;&lt;/p&gt;

&lt;p&gt;SQL query to sort customers by their age and limit by the 2 oldest customers. in this query, where is used to filter out the rows which have no age value. Desc will be used to rank the oldest to the youngest.&lt;/p&gt;

&lt;p&gt;select * from customers&lt;br&gt;
where age is not null&lt;br&gt;
order by age desc&lt;br&gt;
limit 2&lt;/p&gt;

&lt;p&gt;SQL query that selects the 3rd to 5th rows from Employees, ordered by the salary column in descending order:&lt;/p&gt;

&lt;p&gt;select * from employees&lt;br&gt;
order by salary desc&lt;br&gt;
limit 3 offset 2&lt;/p&gt;

&lt;p&gt;STRING FUNCTIONS&lt;br&gt;
there are some useful functions in sql to work with on text data &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
we can use concat function to combine text from multiple columns. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;select concat(firstname, ' ', lastname) as name&lt;br&gt;
from customers&lt;br&gt;
order by lastname desc;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Lower and upper functions&lt;br&gt;
These convert the text in the column to lowercase or uppercase respectively.&lt;br&gt;
select lower(firstname) from customers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Substring function&lt;br&gt;
It allows you to extract part of the text in a column; it takes the starting position and the number of characters we want to extract.&lt;br&gt;
To take the first 3 characters of the firstname:&lt;br&gt;
select substring(firstname, 1, 3)&lt;br&gt;
from customers;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>sql</category>
      <category>tutorial</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
