<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JayReddy</title>
    <description>The latest articles on DEV Community by JayReddy (@jayreddy).</description>
    <link>https://dev.to/jayreddy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F575096%2F86b158b8-1e63-4fc2-8bda-c067a32ef317.png</url>
      <title>DEV Community: JayReddy</title>
      <link>https://dev.to/jayreddy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jayreddy"/>
    <language>en</language>
    <item>
      <title>Building a Monitoring dashboard with Kubernetes and Grafana.</title>
      <dc:creator>JayReddy</dc:creator>
      <pubDate>Fri, 23 Dec 2022 18:10:02 +0000</pubDate>
      <link>https://dev.to/jayreddy/building-a-monitoring-dashboard-with-kubernetes-and-grafana-503a</link>
      <guid>https://dev.to/jayreddy/building-a-monitoring-dashboard-with-kubernetes-and-grafana-503a</guid>
      <description>&lt;p&gt;Grafana and Prometheus are popular tools for visualizing and monitoring metrics in a Kubernetes cluster. By integrating Grafana with Prometheus, you can create dashboards and panels that display various metrics, including CPU and memory usage, network throughput, and more. This can help you identify performance issues, troubleshoot problems, and optimize the performance of your cluster.&lt;/p&gt;

&lt;p&gt;Here are some things you can do with Grafana and Prometheus to tune the performance of the Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor resource usage:&lt;/strong&gt; By creating panels that display CPU, memory, and other resource usage metrics, you can identify which components of your cluster are consuming the most resources, and take steps to optimize their performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze request latencies:&lt;/strong&gt; By creating panels that display request latencies and response times, you can identify bottlenecks in your cluster and take steps to improve performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify spikes and anomalies:&lt;/strong&gt; By using Grafana’s anomaly detection features, you can identify unusual spikes or dips in your metrics, and investigate their root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set alerts and notifications:&lt;/strong&gt; By setting up alerts and notifications in Grafana, you can be notified when certain thresholds are crossed, or when certain conditions are met, so you can take timely action to address performance issues.&lt;/p&gt;
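
&lt;p&gt;As a sketch, the resource-usage panels described above are typically backed by PromQL queries against Prometheus. The metric names below assume the standard cAdvisor/kubelet metrics are being scraped in your cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CPU usage per pod over the last 5 minutes
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory working set per pod
sum(container_memory_working_set_bytes) by (pod)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;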

&lt;p&gt;Integrating Kubernetes with Grafana requires configuring RBAC roles and installing Grafana on the Kubernetes cluster. Let’s tackle one problem at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Configure Grafana RBAC roles and permissions&lt;/strong&gt;&lt;br&gt;
Grafana provides Role-Based Access Control (RBAC) to control access to the various features and functions of the platform. In Grafana, users can be organized into organizations, and each user is assigned a role within an organization that defines their level of access and permissions.&lt;/p&gt;

&lt;p&gt;To configure RBAC roles and permissions in Grafana, you will need to access the Grafana configuration file (usually located at /etc/grafana/grafana.ini) and the Grafana database.&lt;/p&gt;

&lt;p&gt;Here are the steps to configure RBAC roles and permissions in Grafana:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Locate the [auth.anonymous] section in the Grafana configuration file and set the enabled option to true to enable anonymous access to Grafana. This will allow users to access Grafana without logging in.&lt;/li&gt;
&lt;li&gt;In the [auth.anonymous] section, set the org_role option to the role you want anonymous users to have. For example, to give anonymous users the Viewer role, set org_role = Viewer.&lt;/li&gt;
&lt;li&gt;In the [auth.ldap] section, set the enabled option to true to enable LDAP authentication. This will allow users to log in to Grafana using their LDAP credentials.&lt;/li&gt;
&lt;li&gt;In the [auth.ldap] section, set the default_role option to the role you want to assign to LDAP users by default. For example, to give LDAP users the Editor role by default, set default_role = Editor.&lt;/li&gt;
&lt;li&gt;In the [auth.ldap] section, set the allow_sign_up option to true to allow users to sign up for Grafana using their LDAP credentials.&lt;/li&gt;
&lt;li&gt;In the [auth.ldap] section, configure the LDAP server connection settings, including the server, bind_dn, and bind_password options.&lt;/li&gt;
&lt;li&gt;In the [auth.ldap] section, configure the LDAP user search settings, including the search_filter, search_base_dns, and search_bind_dn options.&lt;/li&gt;
&lt;li&gt;In the Grafana database, create a new organization and assign the desired roles to the organization.&lt;/li&gt;
&lt;li&gt;In the Grafana database, create new users and assign them to the appropriate organization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By configuring RBAC roles and permissions in this way, you can control access to the various features and functions of Grafana based on the role and organization of each user.&lt;/p&gt;

&lt;p&gt;Here is an example of the code you might use to configure RBAC roles and permissions in Grafana:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[auth.anonymous]
# Enable anonymous access
enabled = true
# Set the default role for anonymous users
org_role = Viewer
[auth.ldap]
# Enable LDAP authentication
enabled = true
# Set the default role for LDAP users
default_role = Editor
# Allow users to sign up for Grafana using their LDAP credentials
allow_sign_up = true
# Configure LDAP server connection settings
server = ldap://ldap.example.com
bind_dn = cn=admin,dc=example,dc=com
bind_password = password
# Configure LDAP user search settings
search_filter = (sAMAccountName=%s)
search_base_dns = dc=example,dc=com
search_bind_dn = cn=admin,dc=example,dc=com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create a new organization and assign roles to it in the Grafana database, you can use SQL commands like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create a new organization
INSERT INTO org (name) VALUES ('My Organization');
-- Get the ID of the new organization
SELECT id FROM org WHERE name = 'My Organization';
-- Assign the Viewer role to the organization (assuming the new organization's id is 1)
INSERT INTO org_role (org_id, role) VALUES (1, 'Viewer');
-- Assign the Editor role to the organization
INSERT INTO org_role (org_id, role) VALUES (1, 'Editor');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create a new user and assign them to an organization in the Grafana database, you can use SQL commands like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create a new user
INSERT INTO user (login, email, name) VALUES ('user1', 'user1@example.com', 'User 1');
-- Get the ID of the new user
SELECT id FROM user WHERE login = 'user1';
-- Assign the user to the organization
INSERT INTO user_org (org_id, user_id, role) VALUES (1, 1, 'Viewer');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Kubernetes setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To install and set up Grafana on a Kubernetes cluster, you can follow these steps:&lt;/p&gt;

&lt;p&gt;Deploy the Grafana Helm chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that the Grafana pod is running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods -n default&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Forward the Grafana service to your local machine:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward service/grafana -n default 3000:3000&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Open your web browser and go to &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;. You should see the Grafana login page.&lt;/p&gt;

&lt;p&gt;Login with the default username and password (admin/admin).&lt;/p&gt;
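
&lt;p&gt;Note: depending on the chart version, the Grafana Helm chart may generate a random admin password and store it in a Kubernetes secret instead of using admin/admin. As a sketch (assuming a release named grafana in the default namespace), you can retrieve it with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get secret grafana -n default -o jsonpath="{.data.admin-password}" | base64 --decode&lt;/code&gt;&lt;/p&gt;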

&lt;p&gt;Click on the “Add data source” button and add your data source. Grafana supports a wide range of data sources, including Prometheus, InfluxDB, and more.&lt;/p&gt;

&lt;p&gt;Create a dashboard and add panels to display your metrics. You can use the query editor to customize the metrics that are displayed in each panel.&lt;/p&gt;
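
&lt;p&gt;Data sources can also be added declaratively. As a sketch, a provisioning file like the following (the Prometheus URL is an assumption; adjust it to your cluster) can be placed under /etc/grafana/provisioning/datasources/ and Grafana will pick it up on startup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.default.svc.cluster.local
    isDefault: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;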

&lt;p&gt;&lt;strong&gt;3. Grafana setup&lt;/strong&gt;&lt;br&gt;
To set up Grafana on Kubernetes, you will need to create a Kubernetes deployment and service to run Grafana in a container. You will also need to set up persistent storage for Grafana to ensure that your data is preserved across restarts and failures.&lt;/p&gt;

&lt;p&gt;Here are the steps to set up Grafana on Kubernetes:&lt;/p&gt;

&lt;p&gt;Install the Kubernetes command-line tool kubectl and set up a connection to your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Create a configuration file for the Grafana deployment. This file should specify the container image for Grafana, the number of replicas to run, and any environment variables or volume mounts you need. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-test
  labels:
    app: grafana-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-test
  template:
    metadata:
      labels:
        app: grafana-test
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:7.4.5
        env:
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: secret
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-storage-claim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a configuration file for the Grafana service. This file should specify the type of service you want to create (e.g. ClusterIP, NodePort, LoadBalancer) and the port mapping for the Grafana container. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: grafana-test
  labels:
    app: grafana-test
spec:
  type: LoadBalancer
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    app: grafana-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a persistent volume claim to provide persistent storage for Grafana. This will allow Grafana to store its data across restarts and failures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-storage-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use kubectl to apply the deployment, service, and persistent volume claim configuration files to your Kubernetes cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f grafana-deployment.yaml
kubectl apply -f grafana-service.yaml
kubectl apply -f grafana-storage.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the Grafana pod to be up and running. You can use the following command to check the status of the pod:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods -l app=grafana-test&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once the Grafana pod is running, you can access the Grafana web interface by visiting the service URL.&lt;/p&gt;
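
&lt;p&gt;Since the service above is of type LoadBalancer, you can look up its external address (the exact output columns vary by cloud provider) and then browse to it on port 3000:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get service grafana-test&lt;/code&gt;&lt;/p&gt;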

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Grafana is a popular open-source data visualization and monitoring platform that can be used to monitor and visualize data from a variety of sources, including Kubernetes. Kubernetes is a container orchestration platform that can be used to deploy, manage and scale containerized applications.&lt;/p&gt;

&lt;p&gt;Building production-grade monitoring tools is critical, and building them efficiently even more so. This demo illustrates how to integrate performant, advanced open-source tools and build monitoring services to track, isolate, remediate, and mitigate enterprise-grade issues.&lt;/p&gt;

&lt;p&gt;Follow me for more….&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/jayachandra-sekhar-reddy/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/jayachandra-sekhar-reddy/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to handle nested JSON with Apache Spark</title>
      <dc:creator>JayReddy</dc:creator>
      <pubDate>Thu, 03 Feb 2022 08:28:00 +0000</pubDate>
      <link>https://dev.to/jayreddy/how-to-handle-nested-json-with-apache-spark-3okg</link>
      <guid>https://dev.to/jayreddy/how-to-handle-nested-json-with-apache-spark-3okg</guid>
      <description>&lt;p&gt;Learn how to convert a nested JSON file into a DataFrame/table&lt;/p&gt;

&lt;p&gt;Handling semi-structured data like JSON can be challenging, especially when dealing with web responses that arrive in JSON format, or when a client decides to transfer data as JSON to achieve optimal performance when marshaling it over the wire.&lt;/p&gt;

&lt;p&gt;The business requirement might demand the incoming JSON data to be stored in tabular format for efficient querying.&lt;/p&gt;

&lt;p&gt;This blog post demonstrates how to flatten JSON into tabular data and save it in the desired file format.&lt;/p&gt;

&lt;p&gt;This use case can also be solved with JOLT, a tool with advanced features for transforming JSON.&lt;/p&gt;

&lt;p&gt;Let's start digging by importing the required packages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required imports:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StructType}
import scala.io.Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample nested JSON file,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val nestedJSON ="""{
                   "Total_value": 3,
                   "Topic": "Example",
                   "values": [
                              {
                                "value1": "#example1",
                                "points": [
                                           [
                                           "123",
                                           "156"
                                          ]
                                    ],
                                "properties": {
                                 "date": "12-04-19",
                                 "model": "Model example 1"
                                    }
                                 },
                               {"value2": "#example2",
                                "points": [
                                           [
                                           "124",
                                           "157"
                                          ]
                                    ],
                                "properties": {
                                 "date": "12-05-19",
                                 "model": "Model example 2"
                                    }
                                 }
                              ]
                       }"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;step 1:&lt;/strong&gt; Read the inline JSON as a &lt;code&gt;DataFrame&lt;/code&gt; to perform transformations on the input data.&lt;/p&gt;

&lt;p&gt;We use Spark’s &lt;code&gt;createDataset&lt;/code&gt; method to read the data with a tight dependency on the schema.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;Dataset&lt;/code&gt; is a strongly typed collection of domain-specific objects; Datasets offer the flexibility to transform those objects in parallel using functional operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val flattenDF = spark.read.json(spark.createDataset(nestedJSON :: Nil))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;step 2:&lt;/strong&gt; Read the DataFrame’s fields through its schema and extract the field names by mapping over the fields (here the DataFrame is referred to as &lt;code&gt;df&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val fields = df.schema.fields
val fieldNames = fields.map(x =&amp;gt; x.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;step 3:&lt;/strong&gt; Iterate over the field indices to get each field’s name and type, and explode the nested JSON columns. Pattern matching on the data type drives the output.&lt;/p&gt;

&lt;p&gt;We explode columns based on their data types, such as ArrayType or StructType.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (i &amp;lt;- fields.indices) {
        val field = fields(i)
        val fieldName = field.name       
        val fieldtype = field.dataType
        fieldtype match {
          case aType: ArrayType =&amp;gt;
            val firstFieldName = fieldName
            val fieldNamesExcludingArrayType = fieldNames.filter(_ != firstFieldName)
            val explodeFieldNames = fieldNamesExcludingArrayType ++ Array(s"explode_outer($firstFieldName) as $firstFieldName")
            val explodedDf = df.selectExpr(explodeFieldNames: _*)
            return flattenDataframe(explodedDf)

          case sType: StructType =&amp;gt;
            val childFieldnames = sType.fieldNames.map(childname =&amp;gt; fieldName + "." + childname)
            val newfieldNames = fieldNames.filter(_ != fieldName) ++ childFieldnames
            val renamedcols = newfieldNames.map(x =&amp;gt; (col(x.toString()).as(x.toString().replace(".", "_").replace("$", "_").replace("__", "_").replace(" ", "").replace("-", ""))))
            val explodedf = df.select(renamedcols: _*)
            return flattenDataframe(explodedf)
          case _ =&amp;gt;
        }
      }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complete Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;object json_to_scala_faltten {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-to-parquet").master("local[4]").getOrCreate()

    import spark.implicits._

    val flattenDF = spark.read.json(spark.createDataset(nestedJSON :: Nil))
    def flattenDF(df: DataFrame): DataFrame = {
      val fields = df.schema.fields
      val fieldNames = fields.map(x =&amp;gt; x.name)
      for (i &amp;lt;- fields.indices) {
        val field = fields(i)
        val fieldtype = field.dataType
        val fieldName = field.name
        fieldtype match {
          case aType: ArrayType =&amp;gt;
            val firstFieldName = fieldName
            val fieldNamesExcludingArrayType = fieldNames.filter(_ != firstFieldName)
            val explodeFieldNames = fieldNamesExcludingArrayType ++ Array(s"explode_outer($firstFieldName) as $firstFieldName")
            val explodedDf = df.selectExpr(explodeFieldNames: _*)
            return flattenDF(explodedDf)

          case sType: StructType =&amp;gt;
            val childFieldnames = sType.fieldNames.map(childname =&amp;gt; fieldName + "." + childname)
            val newfieldNames = fieldNames.filter(_ != fieldName) ++ childFieldnames
            val renamedcols = newfieldNames.map(x =&amp;gt; (col(x.toString()).as(x.toString().replace(".", "_").replace("$", "_").replace("__", "_").replace(" ", "").replace("-", ""))))
            val explodedf = df.select(renamedcols: _*)
            return flattenDF(explodedf)
          case _ =&amp;gt;
        }
      }
      df
    }
val FDF = flattenDataframe(flattenDF)
FDF.show()
FDF.write.format("formatType").save("/path/filename")
  }

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D9uEnyX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/as8i8mz51rco0nps3yrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D9uEnyX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/as8i8mz51rco0nps3yrh.png" alt="Output Data" width="880" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
Semi-Structured Data is challenging to work with when you are getting the data in nested form. Hopefully, this post gives you an overview of how to perform a simple ETL on JSON data. You can make modifications to the logic and find out more about how to get the desired results.&lt;br&gt;
References:&lt;br&gt;
&lt;a href="https://github.com/bazaarvoice/jolt/releases"&gt;https://github.com/bazaarvoice/jolt/releases&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/sql-data-sources-json.html"&gt;https://spark.apache.org/docs/latest/sql-data-sources-json.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>bigdata</category>
      <category>spark</category>
      <category>scala</category>
    </item>
    <item>
      <title>A curated list on Data Engineering</title>
      <dc:creator>JayReddy</dc:creator>
      <pubDate>Tue, 01 Feb 2022 08:09:24 +0000</pubDate>
      <link>https://dev.to/jayreddy/a-curated-list-on-data-engineering-2i7</link>
      <guid>https://dev.to/jayreddy/a-curated-list-on-data-engineering-2i7</guid>
      <description>&lt;p&gt;Catch up on the trending articles from Data Engineering Space.&lt;/p&gt;

&lt;p&gt;A curated list of the most engaging blog posts will be shared here and a newsletter will be sent out to keep the readers up-to-date with the world of Data Engineering.&lt;/p&gt;

&lt;p&gt;Data is the epicenter of the Digital world. Every byte of data has a story to tell.&lt;/p&gt;

&lt;p&gt;The true business value lies in a well-narrated story. To achieve this, data engineers should pre-plan, design, develop, and deploy data pipelines carefully.&lt;/p&gt;

&lt;p&gt;Companies collect and analyze vast amounts of data as their success and growth depend on it.&lt;/p&gt;

&lt;p&gt;Only about 30% of that data ends up driving meaningful insights; the rest sits unproductively on remote storage. We can only leverage the true value of these datasets when they are well arranged and streamlined for accessibility and ease of use.&lt;/p&gt;

&lt;p&gt;Handling and managing data is not as easy as it sounds. With an efficient design, a system can derive valuable insights.&lt;/p&gt;

&lt;p&gt;Data Engineering to the rescue.&lt;/p&gt;

&lt;p&gt;Data Engineering is a brilliant and rewarding approach to get maximum value out of your data by carefully organizing, curating, and streamlining the data end-to-end.&lt;/p&gt;

&lt;p&gt;Data Engineering has a lot to offer towards achieving data-centric business requirements. Companies are adopting Data Engineering extensively and focusing on implementing it in every business use case to make the data speak.&lt;/p&gt;

&lt;p&gt;A good way to see how rewarding Data Engineering can be is to look at what experts in the data field predict about data trends and the future. This post might change your overall perspective of the field:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/coriers/5-big-data-experts-predictions-for-2022-a333111299ed"&gt;5 Big Data Experts Predictions For 2022&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The thoughts and opinions in that post come from data-startup founders and Data Engineers working at fast-paced, data-centric companies.&lt;/p&gt;

&lt;p&gt;It’s hard to predict which technologies will leave their mark in the Data sector and which ones will become a part of history.&lt;/p&gt;

&lt;p&gt;Adding new features to cope with challenging and changing businesses will make the technologies adapt and grow. Experts highlight which technologies are achieving this and how they will be contributing to the greater good.&lt;/p&gt;

&lt;p&gt;It must be well established by now about how important Data Engineering will be for companies.&lt;/p&gt;

&lt;p&gt;Our next post is about how teams can leverage cloud technologies for collaboration and productivity.&lt;/p&gt;

&lt;p&gt;SQL- the de facto standard data querying language. Most business operations that are data-centric heavily rely on general SQL querying.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/how-to-share-sql-queries-in-amazon-redshift-with-your-team-6lh"&gt;How to share SQL queries in Amazon Redshift with your team&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post explains how remote teams can share work over the cloud with team members and how the tasks can be fulfilled by delegating the work.&lt;/p&gt;

&lt;p&gt;The content is about Amazon Redshift, a Cloud Data Warehouse, and SQL, with hands-on illustration that is well structured.&lt;/p&gt;

&lt;p&gt;For a long time, reading and writing data has been achieved with simple SQL queries. Data querying was a crucial and integral part of the business when the main focus was descriptive analytics (generating reports).&lt;/p&gt;

&lt;p&gt;Over time, as business requirements grew more challenging, data needed to be curated and aggregated into a final, agreed-upon version that Business Intelligence teams can use for analytics.&lt;/p&gt;

&lt;p&gt;We can derive meaningful insights by applying advanced transformations to the datasets.&lt;/p&gt;

&lt;p&gt;Data transformation is a must and heavily applied operation in any ETL, ELT job, and is implemented in almost all business use cases ranging from very simple to high-level projects.&lt;/p&gt;

&lt;p&gt;When considering cost and performance, business requirements demand different strategies.&lt;/p&gt;

&lt;p&gt;Data transformation happens at two stages in a data pipeline: before or after loading to reliable storage. In the former, we extract the data from the source, transform it, and load it to the destination (ETL); in the latter, we extract data from the source, load it to the destination, and then perform transformations on the destination datasets (ELT).&lt;/p&gt;

&lt;p&gt;ETL strategy can be optimal to transform small datasets in memory.&lt;/p&gt;

&lt;p&gt;When the dataset is large, applying transformations in memory is no longer a viable option, as it requires spinning up many master and worker nodes in the cloud. This approach can be time-consuming and drives up both latency and compute costs.&lt;/p&gt;

&lt;p&gt;ELT can be very rewarding when the requirement is to apply transformations on large datasets to reduce operational costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/zompro/extract-csv-data-and-load-it-to-postgresql-using-meltano-elt-4ipf"&gt;Extract csv data and load it to PostgreSQL using Meltano ELT&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog post, you will learn how to perform ELT with an exciting DataOps framework, Meltano, loading data into PostgreSQL, a widely used relational database, using Python.&lt;/p&gt;

&lt;p&gt;It just doesn’t stop there. Data Engineering is not just extracting data, transforming for meaningful insights, and loading it to a reliable storage unit.&lt;/p&gt;

&lt;p&gt;We have to bring together multiple pieces of the puzzle to make the data journey possible. One piece is designing data pipelines for the data movement from source to destination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tkeyo/data-engineering-pipeline-with-aws-step-functions-codebuild-and-dagster-5290"&gt;Data Engineering Pipeline with AWS Step Functions, CodeBuild and Dagster&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This blog post explains how an end-to-end data pipeline is built to collect, process, and visualize data on the cloud.&lt;/p&gt;

&lt;p&gt;A workflow is a unit of work that has a sequence of actions.&lt;/p&gt;

&lt;p&gt;A workflow is designed to function in a repeatable fashion, triggered by a pre-defined schedule or events.&lt;/p&gt;

&lt;p&gt;After a workflow is triggered, each action in the workflow needs monitoring. Monitoring should be set up and configured to store the state of every single workflow action in the form of logs for each pipeline run.&lt;/p&gt;

&lt;p&gt;The operations team will be alerted if any action fails, so they can implement corrective actions. Dagster is a workflow management platform, similar to Airflow, that orchestrates your Data Engineering tasks for machine learning, analytics, and ETL. It comes with an event scheduler that handles failures during unlikely events, and it helps the team monitor state by sending out notifications and logs.&lt;/p&gt;

&lt;p&gt;AWS Step Functions is a low-code, visual workflow service designed to build distributed applications by automating IT and business processes and building data + machine learning pipelines.&lt;/p&gt;

&lt;p&gt;Distributed applications are suitable for delivering performance boosts and resilience to the overall system and are in high demand.&lt;/p&gt;

&lt;p&gt;Here, you get to learn how to write distributed applications for parallel processing and high performance on the cloud using AWS Step Functions.&lt;/p&gt;

&lt;p&gt;Handling data variety often seems a challenging endeavor. Data today comes in different types and formats that can be stored and used according to business requirements. When I say storage, that doesn’t just mean a traditional database. Data storage comes in different shapes and sizes, from on-premise enterprise storage to cloud storage.&lt;/p&gt;

&lt;p&gt;Depending on the type, data lakes are viable for storing unstructured data and are suitable for Data science-related tasks.&lt;/p&gt;

&lt;p&gt;Data warehouses are for structured and semi-structured data. They store data from ETL jobs and are used by Business Intelligence teams to derive meaningful business insights.&lt;/p&gt;

&lt;p&gt;Cloud data lakes are in high demand. Delta Sharing, an open protocol for secure data sharing, can be used with Azure Synapse to manage and share your data lake assets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@wahidatoui/azure-synapse-how-to-use-delta-sharing-f9f76e5083b7"&gt;Azure Synapse — How to use Delta Sharing?&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog post, you will learn about Delta sharing and how to use it for your business needs.&lt;/p&gt;

&lt;p&gt;Azure Synapse is a limitless analytics service designed to bring together enterprise data warehousing and Big Data analytics; Delta Sharing adds an open, cross-platform way to share the data it manages.&lt;/p&gt;

&lt;p&gt;Automation is one of the must-implement features if the workflow has repetitive tasks to minimize resource utilization.&lt;/p&gt;

&lt;p&gt;In data pipelines, one of the most common tasks is moving data from source to destination, and the same operations are applied across different use cases.&lt;/p&gt;

&lt;p&gt;Snowflake is a popular cloud data warehouse that offers a simple interface for storing and processing data. It integrates with dozens of libraries and languages, letting you expand your business use cases on top of the underlying technologies. Amazon Simple Storage Service (S3) is the most widely adopted cloud object storage on the market.&lt;/p&gt;

&lt;p&gt;Knowing how to use and implement Snowflake to automate the data movement can be an immense advantage to your business.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://selectfrom.dev/automating-data-movement-from-snowflake-to-s3-b280b1ca7f28"&gt;Automating Data Movement from Snowflake to S3&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post might be a good start for you to learn how to automate your data flows from the cloud data warehouse to reliable cloud storage.&lt;/p&gt;

&lt;p&gt;Opinions are my own. Please leave a comment.&lt;/p&gt;

&lt;p&gt;Until next time.&lt;/p&gt;

&lt;p&gt;Subscribe to my newsletter to stay up to date on the Data Engineering content. &lt;a href="https://lambdaverse.substack.com/"&gt;Lambdaverse&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Quill- Most efficient Scala driver for Apache Cassandra and Spark</title>
      <dc:creator>JayReddy</dc:creator>
      <pubDate>Mon, 31 Jan 2022 18:31:43 +0000</pubDate>
      <link>https://dev.to/jayreddy/quill-most-efficient-scala-driver-for-apache-cassandra-and-spark-1gfj</link>
      <guid>https://dev.to/jayreddy/quill-most-efficient-scala-driver-for-apache-cassandra-and-spark-1gfj</guid>
      <description>&lt;p&gt;Apache Cassandra is an open-source, distributed data storage system that is durable, scalable, consistently tuneable, and is highly efficient for OLAP.&lt;/p&gt;

&lt;p&gt;Cassandra entered the Apache Incubator in 2009, and shortly thereafter it gained a lot of traction and grew into what we see today. It has an active community of enthusiastic developers and is used in production by many big companies on the web, to name a few: Facebook, Twitter, and Netflix.&lt;/p&gt;

&lt;p&gt;Apache Cassandra is blazingly fast compared to an RDBMS for database writes and can store hundreds of terabytes of data in a decentralized, symmetrical way with no single point of failure.&lt;/p&gt;

&lt;p&gt;Quill is a Scala library that provides a Quoted Domain Specific Language to express queries in Scala and execute them in a target language.&lt;/p&gt;

&lt;p&gt;Quill and Cassandra are a perfect match for querying and optimizing unstructured data in Apache Spark, offering exceptional capabilities for working with distributed NoSQL databases.&lt;/p&gt;

&lt;p&gt;Cassandra can be integrated with Apache Spark using a Docker image or a jar file. Let's go with the Docker image approach, since we might want to experiment with the Cassandra shell (cqlsh) and the Cassandra Query Language (CQL).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
Let's spin up an Apache Cassandra node with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3"
services
Cassandra:
    image: cassandra:latest
    volumes: ["./cassandra/cassandra.yml:/etc/cassandra/cassandra.yml"]
    ports: ["9042:9042"]
    networks: ["sandbox"]
networks:
sandbox:
    driver: bridge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will use Apache Spark's recent release, 3.2.0, as a shell, together with the Spark Cassandra connector and Cassandra's 3.11 client library, passing the Scala assembly jar as a parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./spark-shell — packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0-beta,com.datastax.cassandra:cassandra-driver-core:3.11 spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache Cassandra has a dedicated shell, much like Scala's REPL, called cqlsh, for running CQL scripts.&lt;/p&gt;

&lt;p&gt;To start cqlsh, we execute it inside the running container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo docker exec -it container_name cqlsh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;First, we have to create a keyspace, which is like a container where our tables will reside.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE spark_playground WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once cqlsh and the keyspace are in place, we can run any Cassandra query and write the results to a file using Spark's write method.&lt;/p&gt;
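&lt;p&gt;For instance, still inside cqlsh, we could create a small table under the new keyspace and query it (the table name and columns here are purely illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USE spark_playground;

-- A toy table matching the keyspace created above
CREATE TABLE readings (id uuid PRIMARY KEY, fact text, dim int);

INSERT INTO readings (id, fact, dim) VALUES (uuid(), 'temperature', 42);

-- Filtering on a non-key column requires ALLOW FILTERING
SELECT * FROM readings WHERE fact = 'temperature' ALLOW FILTERING;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;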

&lt;p&gt;Let's get to the main focus of the article now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark Integration with Quill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quill shines in offering a fully type-safe library to use Spark’s highly-optimized SQL engine.&lt;/p&gt;

&lt;p&gt;Let's see how Quill works,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Simple case classes are used for mapping the database schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Quoted DSL” — Quill leverages a quote-block mechanism in which queries are defined. Quill parses each quoted block of code at compile time with the help of Scala's &lt;br&gt;
powerful compiler and translates the queries into an internal Abstract Syntax Tree (AST), similar to Spark's DAGs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compile-time query generation: ctx.run reads the AST generated in step 2 and translates it into the target language at compile time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compile-time query validation: when configured, the query is verified against the database schema at compile time, and compilation fails if the query is invalid. The validation step does not alter the database state.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
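
&lt;p&gt;The four steps above can be sketched in a few lines (a minimal illustration, not a complete program: &lt;code&gt;peopleDs&lt;/code&gt; stands for a hypothetical &lt;code&gt;Dataset[Person]&lt;/code&gt;, and the quill-spark imports shown in the usage section are assumed to be in scope):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Step 1: a simple case class maps the schema.
case class Person(name: String, age: Int)

// Step 2: the query is defined inside a quoted block, which Quill
// parses into its internal AST at compile time.
val adults = quote {
  liftQuery(peopleDs).filter(p =&amp;gt; p.age &amp;gt;= 18)
}

// Steps 3 and 4: run generates (and, if configured, validates)
// the query at compile time and executes it on Spark.
val result: Dataset[Person] = run(adults)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;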

&lt;p&gt;Let's import the quill-spark dependency into sbt and build the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;libraryDependencies ++= Seq(
  "io.getquill" %% "quill-spark" % "3.16.1-SNAPSHOT"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's create a Spark session and import the required packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.spark.sql.{SparkSession, sqlContext}
val spark =
  SparkSession.builder()
    .master("local")
    .appName("spark-quill-test")
    .getOrCreate()
// The Spark SQL Context must be provided by the user through an implicit value:
implicit val sqlContext = spark.sqlContext
import spark.implicits._
// Import the Quill Spark Context
import io.getquill.QuillSparkContext._
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: unlike the other Quill modules, the Spark context is a companion object, and it does not depend on a Spark session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using Quill-Spark:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The run method returns a Dataset transformed by the Quill query using the SQL engine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Typically you start with some type dataset.
val data: Dataset[Data] = spark.read.format("csv").option("header", "true").load("/home/lambdaverse/spark_and_cassandra/test_data.csv")
// The liftQuery method converts Datasets to Quill queries:
val data: Query[Data] = quote { liftQuery(data) }
val data: Query[(Data] = quote {
  data.join(addresses).on((p, a) =&amp;gt; p.fact == a.dim)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is an example of a Dataset being converted into Quill, filtered, and then written back out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.spark.sql.Dataset
def filter(myDataset: Dataset[Data], fact: String): Dataset[Int] =
  run {
    liftQuery(myDataset).filter(_.fact == lift(fact)).map(_.dim)
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache Spark is a popular big data analytics engine used by many Fortune 500 companies.&lt;/p&gt;

&lt;p&gt;Apache Spark is known for its ability to process both structured and unstructured data, and almost 70% of the data generated today is unstructured.&lt;/p&gt;

&lt;p&gt;Cassandra is written in Java, offers exceptional features, and is a staple of the modern data stack; Quill is a Scala library that supports Cassandra integration.&lt;/p&gt;

&lt;p&gt;This post is meant purely for educational purposes; the opinions are my own, and some content is adapted from the official Quill documentation for better elaboration.&lt;/p&gt;

&lt;p&gt;Please clap if you like the post.&lt;br&gt;
Subscribe to my newsletter to stay up to date on the Data Engineering content: &lt;a href="https://lambdaverse.substack.com/"&gt;https://lambdaverse.substack.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>spark</category>
      <category>sql</category>
      <category>database</category>
    </item>
  </channel>
</rss>
