As a data engineer, you know the joy of wrangling massive datasets and navigating complex data pipelines. Argo Workflows, a popular workflow engine for Kubernetes, is your trusty companion in this data-driven journey, allowing you to define, run, and manage data pipelines as code.
However, like any adventure, there are challenges along the way. One is that Argo Workflows requires a Kubernetes environment to run workflows, which may not always be readily available for local development. Another is managing artifacts, such as data files and intermediate results, within these workflows; this is particularly tricky in local environments where external artifact storage services may not be accessible.
But why bother with local development, you might ask? As data engineers, we know that cloud costs add up quickly, and every penny counts. Developing and testing data pipelines directly in the cloud can rack up expenses and leave us scratching our heads over the bill. Local development also lets you simulate edge cases, failure scenarios, and performance tests in a controlled environment, which can be difficult or impossible with cloud-based services, where you have limited control over the environment.
But fear not, for MinIO and Minikube are here to save the day! MinIO, the powerful object storage server that offers S3-compatible capabilities, is the superhero you need to effortlessly manage and store your artifacts in your local development environment. Meanwhile, Minikube provides a lightweight and easy-to-use Kubernetes environment right on your local computer. Together, with Argo Workflows on Minikube, MinIO empowers you to unleash your pipeline creativity and conquer the data engineering realm without relying on external services.
In this blog post, we will embark on a thrilling journey of setting up Argo Workflows and MinIO on Minikube for local development. We will guide you through installing and configuring Argo Workflows and MinIO as a local S3 artifact repository, and through running workflows that leverage the full potential of your data pipelines. Get ready to optimize your data workflows and accelerate your data engineering projects with this step-by-step guide!
Prerequisites
To set up MinIO and Argo Workflows on Minikube for local development on macOS, you’ll need a few tools installed on your machine:
Homebrew: Homebrew is a popular package manager for macOS that makes it easy to install and manage software packages. If you don’t have Homebrew installed already, you can install it by following the instructions on the Homebrew website or by running the command below.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Helm: Helm is a package manager for Kubernetes that simplifies the deployment and management of applications on a Kubernetes cluster. You can install Helm using Homebrew by running the following command in your terminal:
brew install helm
Minikube: Minikube is a tool that allows you to run a single-node Kubernetes cluster on your local machine for local development and testing. You can install Minikube using Homebrew by running the following command in your terminal:
brew install minikube
Once you have Homebrew, Helm, and Minikube installed, you’re ready to proceed.
Installing and Configuring MinIO on Minikube
First, start your cluster on Minikube with the hyperkit driver: open your terminal and run the following command:
minikube start
After entering your root password, Minikube will start a single-node Kubernetes cluster on your local machine.
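If hyperkit isn’t already configured as your default driver, you can pass it explicitly (the --driver flag is a standard Minikube option):

minikube start --driver=hyperkit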
Next, we will need kubectl which is a command-line tool used to interact with Kubernetes clusters.
minikube kubectl -- get pods -A
To use kubectl with Minikube, you can set up an alias in your terminal to point to the kubectl binary provided by Minikube. Run the following command:
alias kubectl="minikube kubectl --"
Create a namespace for our tutorial.
kubectl create namespace argo
And set it as the current namespace for kubectl:
kubectl config set-context --current --namespace=argo
Next, you can deploy MinIO to your Minikube cluster using a YAML file that you have locally. Use the kubectl apply command to apply the YAML file and create the MinIO pod and service in the "argo" namespace. In the localvolume definition, specify a path to a directory on your local machine.
kubectl apply -f minio-dev-clusterip.yaml
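In case you don’t have such a manifest at hand, here is a minimal sketch of what minio-dev-clusterip.yaml might contain. It is loosely based on MinIO’s development manifest; the image tag, labels, and the hostPath directory are assumptions you should adapt to your machine:

apiVersion: v1
kind: Pod
metadata:
  name: minio
  namespace: argo
  labels:
    app: minio
spec:
  containers:
    - name: minio
      image: quay.io/minio/minio:latest
      # serve the S3 API on 9000 and the web console on 9001
      command: ["minio", "server", "/data", "--console-address", ":9001"]
      ports:
        - containerPort: 9000
        - containerPort: 9001
      volumeMounts:
        - name: localvolume
          mountPath: /data
  volumes:
    - name: localvolume
      hostPath:
        path: /mnt/data        # replace with a path to your local directory
        type: DirectoryOrCreate
---
apiVersion: v1
kind: Service
metadata:
  name: minio-service
  namespace: argo
spec:
  type: ClusterIP
  selector:
    app: minio
  ports:
    - name: api
      port: 9000
      targetPort: 9000
    - name: console
      port: 9001
      targetPort: 9001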
We created a minio-service of type ClusterIP, which makes it reachable only from within the Minikube cluster; ClusterIP services are by design not exposed outside the cluster.
If you check the service with kubectl get svc -n argo, you will see that its EXTERNAL-IP column shows none.
To make minio-service accessible from outside the Minikube cluster, we are going to use an Ingress controller. Ingress is a Kubernetes resource that allows you to configure how external traffic should be routed to services running within the cluster. By using an Ingress controller, we can expose minio-service to the outside world and enable access from external sources, such as our web browser.
Alternatively, we could use kubectl port-forward to forward the ports of minio-service to our local machine, but this can be slow and not as convenient as using an Ingress controller.
To enable Ingress functionality in our Minikube cluster, we’ll be using the popular Ingress controller provided by nginx. First, we’ll add the nginx ingress-nginx repository to our Helm repository list using the following command:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
Then, we’ll update our Helm repositories to ensure we have the latest information with:
helm repo update
Next, we’ll install the ingress-nginx chart using Helm, and we’ll specify the namespace ‘argo’ for the installation:
helm install -n argo nginx-ingress ingress-nginx/ingress-nginx
Once the installation is complete, we can enable the Ingress addon in our Minikube cluster using:
minikube addons enable ingress
Additionally, we can enable the ingress-dns addon, which allows us to automatically resolve DNS names for Ingress resources within the Minikube cluster, using:
minikube addons enable ingress-dns
Now we need to deploy an Ingress for our minio-service.
kubectl apply -f ingress-minio.yaml
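For reference, a minimal sketch of what ingress-minio.yaml could look like is below. It routes the /minio path to the console port of minio-service; the rewrite annotations and the exact path pattern are assumptions, so adjust them to match your own manifest:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-minio
  namespace: argo
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    # strip the /minio prefix before forwarding to the MinIO console
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /minio(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: minio-service
                port:
                  number: 9001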
Let’s check what we got after the deployment. Running this command will display a list of Ingress resources that are deployed in the “argo” namespace, along with their current status and other relevant information.
kubectl get ingresses -n argo
It might take some time for the Ingress to get an address, but eventually one appears; in our case it is 192.168.64.16.
Now you can access the MinIO web interface by opening your favorite web browser and navigating to http://192.168.64.16/minio. This IP address belongs to the Minikube cluster and may differ depending on your local setup.
Once you enter the URL in your browser, it will direct you to the MinIO web login interface. By default, the MinIO instance will have the login credentials set to “minioadmin” for both the username and password.
Alternatively, you can achieve the same result by exposing the MinIO service to the outside world with a LoadBalancer service type instead of an Ingress. To do this, simply change the service type in the YAML file and deploy it.
kubectl apply -f minio-dev-loadbalancer.yaml
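Only the Service definition needs to change compared to the ClusterIP version; a sketch of the relevant part (the MinIO pod stays the same as before):

apiVersion: v1
kind: Service
metadata:
  name: minio-service
  namespace: argo
spec:
  type: LoadBalancer        # was ClusterIP before
  selector:
    app: minio
  ports:
    - name: api
      port: 9000
      targetPort: 9000
    - name: console
      port: 9001
      targetPort: 9001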
Running kubectl get svc -n argo now shows that minio-service has the type LoadBalancer.
By running minikube service minio-service -n argo, Minikube will automatically open a browser window with the MinIO web interface, allowing you to interact with the MinIO object storage server directly from your local machine. If that doesn’t happen automatically, use the first URL from the CLI output.
Just in case you need to clean up your Minikube environment and delete all resources, you can use the following command:
minikube delete --all --purge
This will remove all Minikube clusters and associated resources from your local machine.
Installing and Configuring Argo Workflows on Minikube
To get started, you can install Argo Workflows by applying the installation YAML file from the official GitHub repository with the following command:
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.6/namespace-install.yaml
This command will apply the YAML file to the argo namespace in your Minikube cluster, setting up Argo Workflows for use. This YAML file contains the necessary configurations and resources to set up Argo Workflows, such as deployments, services, and RBAC rules in a dedicated namespace.
Let’s see what we have now in our argo namespace by listing the pods with kubectl get pods -n argo:
argo-server provides the API interface for interacting with Argo Workflows, while workflow-controller manages the execution of workflow instances within the Kubernetes cluster, ensuring workflows are executed according to their definitions and handling any issues that may arise during execution.
Let’s set the authentication mode for Argo Workflows to “server”, so that we can access the UI locally without having to provide a client token.
kubectl patch deployment \
argo-server \
--namespace argo \
--type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": [
"server",
"--auth-mode=server"
]}]'
Create a role binding called default-admin that grants the admin cluster role to the default service account in the argo namespace. Because this is a RoleBinding rather than a ClusterRoleBinding, the admin permissions apply only within the argo namespace; it allows workflow pods running under the default service account to create, update, and delete the resources Argo Workflows needs there.
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=argo:default --namespace=argo
As you may have guessed, we will expose the argo-server service using the Ingress controller in the same way as we did for minio-service.
Let’s configure the argo-server deployment by adding an environment variable called BASE_HREF, which specifies the base URL path under which the Argo Workflows UI is served. We need it because argo-server will be exposed through the Ingress controller under the /argo/ path.
kubectl set env deployment/argo-server BASE_HREF=/argo/
Now we need to deploy an Ingress for our argo-server.
kubectl apply -f ingress-argo.yaml
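A minimal sketch of ingress-argo.yaml could look like the following. Since BASE_HREF is set to /argo/, no path rewriting is needed; the backend-protocol annotation is there because argo-server serves TLS by default, and you can drop it if you run the server with --secure=false:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-argo
  namespace: argo
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /argo
            pathType: Prefix
            backend:
              service:
                name: argo-server
                port:
                  number: 2746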
Check the external address:
kubectl get ingresses -n argo
The address is 192.168.64.16, the same one used for minio-service. Now you can take this address, append the /argo base path and the argo namespace, and access the Argo Workflows UI like this: http://192.168.64.16/argo/workflows/argo.
Click on the Continue button, and now you can work with Argo Workflows.
Configuring MinIO as S3 Artifact Repository
In this chapter, we will explore how to set up MinIO as an S3-compatible artifact repository for use with Argo Workflows.
An S3 Artifact Repository in the context of Argo Workflows is a storage location that serves as a centralized hub for storing artifacts, such as input and output data, that are used in workflows. Think of it like a digital warehouse where Argo Workflows can securely store and retrieve data required for running workflows.
The first step is to create an access key. Access keys are used to securely authenticate and authorize access to MinIO objects and resources. Think of access keys as a pair of credentials that include an access key ID and a secret access key. The access key ID serves as the username, while the secret access key acts as the password for authentication.
Open http://192.168.64.16/minio from our previous setup and log in with the minioadmin/minioadmin credentials. Then switch to the ‘Access Keys’ tab and click on ‘Create Access Key’.
Save the generated access key ID and secret access key values, as we will need them shortly. Finally, click the ‘Create’ button.
We will also need a bucket. Switch to the ‘Object Browser’ tab and click on ‘Create a bucket’. Give it the name ‘argo-bucket’ and click the ‘Create Bucket’ button.
Let’s create a secret on the cluster with our MinIO access key. Use your saved values here.
kubectl create secret generic argo-artifacts \
--from-literal=accesskey=oZ2FhcthT4YfakFj \
--from-literal=secretkey=alac4FBYegQTIUljJPRFRC3Lfj0iaRej \
-n argo
In the context of Kubernetes, a secret is a way to store sensitive information, such as passwords, access keys, and API tokens. Secret values are stored base64-encoded (and can additionally be encrypted at rest if the cluster is configured for it) and can be consumed by pods running in the cluster without the data being hard-coded in plain text in manifests. In this case, the secret argo-artifacts will be used by Argo Workflows to authenticate and authorize access to MinIO objects and resources.
If you inspect the secret with kubectl get secret argo-artifacts -n argo -o yaml, you can see that the values are stored base64-encoded rather than in plain text.
There are three ways to configure the artifact repository in Argo Workflows; in this tutorial we’re going to use the artifact-repositories ConfigMap.
In Argo Workflows, the artifact-repositories ConfigMap is a Kubernetes resource that is used to define and configure different artifact repositories that can be used to store and retrieve artifacts during workflow execution.
The artifact-repositories ConfigMap is typically created in the same namespace as the Argo Workflows installation and contains configuration settings for different artifact repositories. These settings include the repository type, repository URL, and any authentication or access credentials required for accessing the repository.
kubectl apply -f artifact-repositories.yaml
This YAML file essentially tells Argo Workflows that we have a default artifact repository called ‘default-v1-s3-artifact-repository’ that is based on the MinIO service and uses the ‘argo-bucket’ bucket. It also specifies that the ‘argo-artifacts’ secret should be used as the source of credentials for connecting to the artifact repository.
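For reference, a sketch of what artifact-repositories.yaml might contain is below. The endpoint assumes the minio-service name from earlier and MinIO’s default API port 9000; adjust it if your service differs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  namespace: argo
  annotations:
    # marks which key is used when a workflow does not set artifactRepositoryRef
    workflows.argoproj.io/default-artifact-repository: default-v1-s3-artifact-repository
data:
  default-v1-s3-artifact-repository: |
    s3:
      bucket: argo-bucket
      endpoint: minio-service:9000
      insecure: true
      accessKeySecret:
        name: argo-artifacts
        key: accesskey
      secretKeySecret:
        name: argo-artifacts
        key: secretkey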
If you want to use a different bucket or endpoint, you can add an additional key in this configmap or even create a separate configmap, and then override the artifact repository for a specific workflow using the ‘artifactRepositoryRef’ field:
artifactRepositoryRef:
  configMap: artifact-repositories # or any other ConfigMap
  key: your-custom-artifact-repository
However, if you wish to use the default ‘default-v1-s3-artifact-repository’, you don’t need to provide ‘artifactRepositoryRef’ in your workflows.
Now that we are ready, we can test our S3 artifact repository configuration in the next chapter.
Running Workflows
In this chapter we’re going to run a very simple workflow to demonstrate the usage of our S3 artifact repository.
Open Argo Workflows UI — http://192.168.64.16/argo/workflows/argo
Click on “+ SUBMIT NEW WORKFLOW” button and then choose “Edit using full workflow options”.
Copy and paste the workflow from the code below and click the ‘Create’ button.
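The original snippet isn’t reproduced here, but a minimal workflow matching the description below might look like this (the generateName and template name are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-example-
  namespace: argo
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      container:
        image: docker/whalesay:latest
        command: [sh, -c]
        args: ["cowsay hello world | tee /tmp/hello_world.txt"]
      outputs:
        artifacts:
          - name: message
            path: /tmp/hello_world.txt
            s3:
              key: hello_world.txt   # stored under this key in the artifact repository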
This is a very simple workflow that runs a container with docker/whalesay image with the latest tag, and executes the cowsay hello world | tee /tmp/hello_world.txt command inside the container. The outputs section defines an artifact named message, which is a file located at /tmp/hello_world.txt inside the container. The artifact (file hello_world.txt) is then configured to be uploaded to an S3 artifact repository with a key named hello_world.txt.
Our workflow is green, and we can view the logs directly from the UI in the Containers tab by clicking on the Logs button.
These logs are from our workflow pod. We should expect to see the same content in the file located in the S3 bucket.
Let’s verify this by navigating to the MinIO UI at http://192.168.64.16/minio and going to the Object Browser tab.
There, we can find our file and download it to inspect its contents. If the content matches, then our workflow has successfully uploaded the artifact to the S3 bucket.
So the content is the same. Yay!
Now that we have successfully configured an S3 artifact repository in Argo Workflows, you can experiment with it and use artifacts as inputs and outputs for your workflows. You can streamline your workflows, share data across different steps, and enable more advanced data processing and analysis within your workflows. You can read more about Argo Workflows artifacts in the official documentation.
Remember to stop your local cluster when it’s no longer needed.
minikube stop
Conclusion
Don’t be shy: take charge with Argo Workflows and MinIO, and let your data pipelines know who’s boss!