<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alessio Marinelli</title>
    <description>The latest articles on DEV Community by Alessio Marinelli (@mobs75).</description>
    <link>https://dev.to/mobs75</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1093574%2F0951e688-f940-46bc-a952-0c8ac5f73c71.jpeg</url>
      <title>DEV Community: Alessio Marinelli</title>
      <link>https://dev.to/mobs75</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mobs75"/>
    <language>en</language>
    <item>
      <title>AI-Powered Cybersecurity Platform That Detects, Analyzes, and Responds to Attacks Automatically on a Kubernetes Cluster</title>
      <dc:creator>Alessio Marinelli</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:19:22 +0000</pubDate>
      <link>https://dev.to/mobs75/ai-powered-cybersecurity-platform-that-detects-analyzes-and-responds-to-attacks-automatically-on-34o</link>
      <guid>https://dev.to/mobs75/ai-powered-cybersecurity-platform-that-detects-analyzes-and-responds-to-attacks-automatically-on-34o</guid>
      <description>&lt;p&gt;From a Snort alert to a blocked IP in under 60 seconds. No cloud. No vendor lock-in. Full human control Validated on NVIDIA DGX Spark.&lt;/p&gt;

&lt;p&gt;There are plenty of tools that help you run a pentest. You launch nmap, feed the output to an LLM, get some suggestions. Useful — but fundamentally reactive. You still need a human in front of a terminal to make anything happen.&lt;/p&gt;

&lt;p&gt;I wanted something different. I wanted a system that watches your infrastructure continuously, understands what it sees, decides what to do, and acts — while still keeping a human in the loop for every critical decision.&lt;/p&gt;

&lt;p&gt;After months of work, that system exists. I call it AI-Pentest Suite.&lt;/p&gt;

&lt;p&gt;The Problem with Existing Tools&lt;/p&gt;

&lt;p&gt;Most AI security tools today fall into one of two categories.&lt;/p&gt;

&lt;p&gt;The first is the AI assistant model — CLI tools where you give a target, recon tools run, the LLM analyzes the output, and you get a report. Genuinely useful for a security analyst doing manual assessments. But they are fundamentally CLI wrappers with an LLM on top. They don’t watch anything. They don’t respond to anything. They wait for you to ask.&lt;/p&gt;

&lt;p&gt;The second is the enterprise SIEM/XDR model — powerful platforms that require dedicated teams to operate, whose AI is a black box you cannot inspect, modify, or run offline.&lt;/p&gt;

&lt;p&gt;Neither category solved my problem: an automated, event-driven, AI-powered security pipeline that runs on your own infrastructure, uses a local LLM so your data never leaves your premises, and keeps humans in control of every irreversible action.&lt;/p&gt;

&lt;p&gt;What I Built&lt;/p&gt;

&lt;p&gt;AI-Pentest Suite is a cloud-native security platform that runs on Kubernetes — including virtual machines. It combines three layers:&lt;/p&gt;

&lt;p&gt;Detection — Snort3 IDS runs as a DaemonSet on every node of the cluster, monitoring network traffic in real time. A PyTorch autoencoder pre-filters anomalies before they even reach the AI layer, cutting noise and false positives.&lt;/p&gt;

&lt;p&gt;Analysis — When Snort generates an alert, it flows through Kafka into an AI pipeline running on Apache OpenServerless. A local Mistral LLM analyzes the alert in context, assigns a threat score from 0 to 100, categorizes the attack type, correlates it with the MITRE ATT&amp;amp;CK framework via a RAG knowledge base of 1,290 documents, and recommends an action. The platform has been tested and is fully operational on NVIDIA DGX Spark — enterprise-class GPU hardware that brings AI inference to millisecond latency even under heavy load. This is not a proof of concept running on a laptop: it is a pipeline validated on real GPU hardware.&lt;/p&gt;

&lt;p&gt;Response — A policy engine checks the IP’s history in Redis, determines severity and recidivism, and routes to a human approval step. The operator has 30 seconds to approve or modify the recommended action. If no response comes, the system auto-decides. A firewall agent running on each node executes the iptables block. Everything is logged to PostgreSQL for audit.&lt;/p&gt;
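&lt;p&gt;As an illustration only (the platform’s code is proprietary, so the names and thresholds below are invented), the routing step described above can be sketched in a few lines of Python, with a plain dict standing in for the Redis history:&lt;/p&gt;

```python
# Hypothetical sketch of the policy-engine routing step.
# BLOCK_THRESHOLD and the field names are assumptions, not the real values.
BLOCK_THRESHOLD = 80

def decide(alert, history, operator_choice=None):
    """Route an analyzed alert to a final action.

    history maps src_ip to the number of prior offenses (Redis in the
    real pipeline; a plain dict here). operator_choice is the human's
    answer, or None if the 30-second approval window expired, in which
    case the system auto-decides.
    """
    offenses = history.get(alert["src_ip"], 0)
    severe = alert["threat_score"] >= BLOCK_THRESHOLD or offenses > 0
    if operator_choice is not None:
        action = operator_choice                  # human decision wins
    elif severe:
        action = alert["recommended_action"]      # auto-decide on timeout
    else:
        action = "monitor"
    history[alert["src_ip"]] = offenses + 1       # record for recidivism
    return action
```

&lt;p&gt;A first offense with threat score 85 and no operator response would auto-route to the recommended block; an explicit operator answer always overrides.&lt;/p&gt;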

&lt;p&gt;The entire cycle — from alert to blocked IP — takes under 60 seconds.&lt;/p&gt;

&lt;p&gt;The Architecture That Makes It Different&lt;/p&gt;

&lt;p&gt;The platform runs on Kubernetes, which means it works on bare metal, VMs, or cloud IaaS. You don’t need dedicated hardware to get started.&lt;/p&gt;

&lt;p&gt;The AI pipeline is built on Apache OpenServerless — an open-source serverless platform based on Apache OpenWhisk. This means the analysis functions scale automatically with load. When your infrastructure is quiet, they consume zero resources. When you are under a port scan or brute force attack, they spin up in parallel.&lt;/p&gt;

&lt;p&gt;The scanning layer — Nuclei with 9,000+ templates and Metasploit integration — runs as Kubernetes workloads too, triggered on demand or scheduled. A full pentest pipeline from recon to exploit verification to PDF report can run end-to-end without a human touching a keyboard.&lt;/p&gt;

&lt;p&gt;The LLM runs entirely on local hardware. The platform has been tested and validated on the NVIDIA DGX Spark, NVIDIA’s personal AI supercomputer based on the Blackwell architecture. No data is sent to external services. Your network traffic, your alerts, your findings — they stay in your environment.&lt;/p&gt;

&lt;p&gt;Human-in-the-Loop, by Design&lt;/p&gt;

&lt;p&gt;The most important architectural decision I made was making human approval mandatory for every high-impact action.&lt;/p&gt;

&lt;p&gt;The system can recommend blocking an IP. It can recommend running an exploit. It will not do either without explicit operator approval. This is not a safety limitation — it is a feature. In security, a false positive that blocks legitimate traffic can be as damaging as the attack itself. The AI is fast and accurate. The human is accountable.&lt;/p&gt;

&lt;p&gt;This principle — the system recommends, the operator decides — runs through every layer of the architecture.&lt;/p&gt;

&lt;p&gt;What It Actually Looks Like&lt;/p&gt;

&lt;p&gt;When an attack hits, the operator sees something like this in the pipeline output:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "src_ip": "10.x.x.x",
  "attack_category": "reconnaissance",
  "threat_score": 85,
  "confidence": 0.93,
  "recommended_action": "block_ip",
  "reason": "Systematic port scan across 1000 ports, SYN flood pattern, repeat offender",
  "audit_id": "a3be821f"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That output is the result of a real scan hitting the cluster, Snort catching it, the autoencoder filtering it, Mistral analyzing it, the policy engine checking Redis history, and the firewall agent executing the block. No human typed a command. The analyst approved the block in the human-loop step and the rest was automatic.&lt;/p&gt;

&lt;p&gt;What Is Coming Next&lt;/p&gt;

&lt;p&gt;The platform is actively developed. The next phases include Nuclei scanning as a distributed Kubernetes workload, full CVE correlation integrated into the detection pipeline, Metasploit execution via a dedicated cluster deployment, and a unified pentest orchestration pipeline that goes from recon to exploitation to PDF report in a single command.&lt;/p&gt;

&lt;p&gt;The longer-term goal is to bring RAG-powered AI analysis to every component of the pipeline — not just anomaly detection, but CVE lookup, exploit selection, and remediation recommendations, all running on local models with no external dependencies.&lt;/p&gt;

&lt;p&gt;Closing Thought&lt;/p&gt;

&lt;p&gt;Security tooling should not require a dedicated team to operate. The building blocks — Kubernetes, Kafka, open-source LLMs, Snort, Metasploit — are all available. What was missing was an architecture that connected them into a coherent, automated, human-supervised pipeline.&lt;/p&gt;

&lt;p&gt;That is what I built.&lt;/p&gt;

&lt;p&gt;Get in Touch&lt;/p&gt;

&lt;p&gt;If you are a security team that wants to explore what this looks like in a real environment, or you are simply curious about the platform, feel free to reach out directly:&lt;/p&gt;

&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/alessio-marinelli-b302042a/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/alessio-marinelli-b302042a/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Email: &lt;a href="mailto:marinelli_alessio@yahoo.it"&gt;marinelli_alessio@yahoo.it&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Architecture diagrams and demo materials available on request. The codebase is proprietary.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>cybersecurity</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Openwhisk Pulsar Spark Integration — Part 1</title>
      <dc:creator>Alessio Marinelli</dc:creator>
      <pubDate>Fri, 02 Feb 2024 14:27:43 +0000</pubDate>
      <link>https://dev.to/mobs75/openwhisk-pulsar-spark-integration-part-1-ol8</link>
      <guid>https://dev.to/mobs75/openwhisk-pulsar-spark-integration-part-1-ol8</guid>
      <description>&lt;p&gt;There are several alternatives to Kafka that could be considered to increase the resilience and management of Spark’s jobs performed by OpenWhisk.&lt;/p&gt;

&lt;p&gt;Below is an overview to understand the differences in to use them and to see how each of the proposed products perform in this role.&lt;/p&gt;

&lt;p&gt;The products offered as intermediaries between OpenWhisk and Spark are the following:&lt;/p&gt;

&lt;p&gt;· Apache Pulsar;&lt;/p&gt;

&lt;p&gt;· Apache Flink;&lt;/p&gt;

&lt;p&gt;· RabbitMQ;&lt;/p&gt;

&lt;p&gt;Currently, there is a Kafka provider for Apache OpenWhisk; this feasibility study aims to find a viable alternative among the other products.&lt;/p&gt;

&lt;p&gt;Apache Kafka&lt;/p&gt;

&lt;p&gt;Strength: It excels at processing real-time data streams and handling large volumes of data.&lt;/p&gt;

&lt;p&gt;Usage: It acts as a buffer for data between OpenWhisk and Spark, allowing Spark to process data in a way that is fault-resilient and scalable.&lt;/p&gt;

&lt;p&gt;Ideal scenarios: Best for situations where efficient and scalable management of data flows is essential.&lt;/p&gt;

&lt;p&gt;Apache Pulsar&lt;/p&gt;

&lt;p&gt;It offers messaging and data streaming capabilities, but with some additional features such as native support for multi-tenancy and a sharper separation between storage and compute.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Strength: It offers similar functionality to Kafka but with a more modern architecture that separates storage from compute, making it easier to scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Usage: Like Kafka, Pulsar can be used to capture and store data from OpenWhisk before it is processed by Spark.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s also suitable for multi-tenant use cases.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Ideal scenarios: Great for environments that require efficient scalability and more granular data management.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Apache Flink&lt;/p&gt;

&lt;p&gt;Flink is more of a stream processing engine than a messaging system, but it can be used in conjunction with OpenWhisk and Spark to improve real-time data management and resiliency.&lt;/p&gt;

&lt;p&gt;It is particularly strong in processing complex data streams and can maintain a consistent state even in the event of failures.&lt;/p&gt;

&lt;p&gt;In summary, while Kafka and Pulsar are more focused on messaging and data buffering, Flink offers more advanced stream processing capabilities.&lt;/p&gt;

&lt;p&gt;The choice depends on the type of data processing you need to perform between OpenWhisk and Spark and the level of complexity your system needs to handle.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Strength: It is not just a messaging system but a complete stream processing engine that can handle complex data processing in real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Usage: Flink can be used to directly process streaming data from OpenWhisk before passing it to Spark for further batch processing or analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It gives you more accurate control over processing status and logic.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Ideal scenarios: Ideal for applications that require complex data stream processing and real-time state management.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RabbitMQ&lt;/p&gt;

&lt;p&gt;Another messaging system, oriented more toward traditional message queuing.&lt;/p&gt;

&lt;p&gt;It’s less suitable for processing real-time data streams than Kafka, but it can be a good choice for scenarios where reliable message delivery is the priority.&lt;/p&gt;

&lt;p&gt;Architectural Approaches for Intermediation Between OpenWhisk and Spark&lt;/p&gt;

&lt;p&gt;To use Kafka, Pulsar, or Flink as intermediaries in an architecture that includes OpenWhisk and Spark, there are two main approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;create a custom package (similar to the one that exists for Kafka in OpenWhisk)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Direct use&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how both approaches work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Custom Package:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this approach, you create an OpenWhisk-specific package that integrates the intermediary service (Kafka, Pulsar or Flink) as a native part of the OpenWhisk ecosystem.&lt;/p&gt;

&lt;p&gt;This requires more upfront work, as you will have to write code to handle the interaction between OpenWhisk and the intermediary, but it offers greater integration and ease of use in the long run.&lt;/p&gt;

&lt;p&gt;A custom package is useful if you plan to reuse this integration across many projects, or if you have specific needs that require closer interaction between OpenWhisk and the intermediary.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Direct use:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here, you use Kafka, Pulsar or Flink independently, without creating a specific package for OpenWhisk.&lt;/p&gt;

&lt;p&gt;In this scenario, OpenWhisk handles its actions and triggers as usual, and interaction with the intermediary takes place through standard network connections, APIs, or messaging protocols.&lt;/p&gt;

&lt;p&gt;This approach is faster to set up and doesn’t require additional development, but it may be less integrated and may require more manual management.&lt;/p&gt;

&lt;p&gt;The choice between these two approaches depends on how often you plan to use the intermediary with OpenWhisk, the complexity of your needs, and the resources available for development.&lt;/p&gt;

&lt;p&gt;If you plan to use it frequently and have specific needs, developing a custom package may be the best choice.&lt;/p&gt;

&lt;p&gt;If, on the other hand, you need an easier and faster solution, direct use may be adequate.&lt;/p&gt;

&lt;p&gt;How to write a package with Pulsar that can act as an intermediary between OpenWhisk and Spark&lt;/p&gt;

&lt;p&gt;Writing a package to integrate Apache Pulsar with OpenWhisk is a project that requires an understanding of both OpenWhisk and Pulsar, as well as programming skills.&lt;/p&gt;

&lt;p&gt;Here are the general steps you should follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;API knowledge: Familiarize yourself with the OpenWhisk and Apache Pulsar APIs. This will help you understand how the two platforms interact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting Up the Development Environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Set up a development environment with OpenWhisk and Apache Pulsar installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You might need to install additional software like Docker if you’re working locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure you have access to a Pulsar environment to test the integration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Package Definition:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the specific Pulsar features that you want to expose through the package.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This could include posting messages, subscribing to topics, etc.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Decide how to manage configuration, such as login credentials and connection parameters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Develop the code for package components, which can include actions, triggers, and rules in OpenWhisk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write the necessary code to interface OpenWhisk with Pulsar.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This could include creating Pulsar clients, managing message publishing and subscribing, and handling errors.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;Test the package in different scenarios to make sure it works as expected.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure that the package handles errors and abnormal situations correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write clear documentation for your package. This should include instructions on how to install and use it, along with examples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deployment and Maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;High-level overview.&lt;/p&gt;

&lt;p&gt;Writing a package is a very complex project. It requires a thorough understanding of all the technologies covered and the ability to write robust and secure code.&lt;/p&gt;

&lt;p&gt;Preparing the Environment&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Installing Apache OpenWhisk:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Make sure you have a working version of OpenWhisk. If you’re working locally, you may need to install Docker and run OpenWhisk via containers.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Installing Apache Pulsar:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Set up an instance of Apache Pulsar.&lt;/p&gt;

&lt;p&gt;You can install it locally or use a cloud service.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Installing Apache Spark:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Set up an instance of Apache Spark.&lt;/p&gt;

&lt;p&gt;You can install it locally or use a cloud service.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Setting Up a Development Environment:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Configure your development environment with the necessary languages and tools (e.g., Java or Python, IDEs, version control systems).&lt;/p&gt;

&lt;p&gt;Package Design&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Defining Features:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Decide which Pulsar features you want to expose. This could include creating topics, publishing messages, and subscribing to topics.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Structuring the Package:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Plan how to structure the package.&lt;/p&gt;

&lt;p&gt;Normally, an OpenWhisk package includes actions, triggers, and rules.&lt;/p&gt;
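&lt;p&gt;For orientation, the shape of such a package on the wsk CLI looks roughly like this (the names are placeholders, and a configured OpenWhisk instance is assumed):&lt;/p&gt;

```shell
# Group the Pulsar-facing actions under one package
wsk package create pulsarPkg

# Register the publish action inside the package
wsk action create pulsarPkg/publish writeOnPulsar.py --kind python:3.7

# A trigger plus a rule wires incoming events to the action
wsk trigger create newMessage
wsk rule create forwardToPulsar newMessage pulsarPkg/publish
```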

&lt;p&gt;Development&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Code for the Pulsar Interface:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Develop code to interact with the Pulsar API.&lt;/p&gt;

&lt;p&gt;For example, to post a message on a topic.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;OpenWhisk Integration:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Write code to integrate these features with OpenWhisk. This could include actions that trigger events in Pulsar.&lt;/p&gt;

&lt;p&gt;Error Management and Security: Make sure your package handles errors correctly and implements standard security practices.&lt;/p&gt;

&lt;p&gt;Testing and Debugging&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Test the Package:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Test the package in different scenarios to make sure it works as expected.&lt;/p&gt;

&lt;p&gt;Debugging: Fix any issues or bugs that emerge from testing.&lt;/p&gt;

&lt;p&gt;Documentation &amp;amp; Distribution&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document the Package:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Write clear documentation that includes instructions on how to install and use the package.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Deploy the Package:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Share the package with the community or deploy it in a production environment.&lt;/p&gt;

&lt;p&gt;Basic Implementation&lt;/p&gt;

&lt;p&gt;Project Structure&lt;/p&gt;

&lt;p&gt;OpenWhisk Action: This will be the entry point.&lt;/p&gt;

&lt;p&gt;The OpenWhisk action will take the input data and send it to Pulsar.&lt;/p&gt;

&lt;p&gt;Apache Pulsar: Will act as a messaging system to transfer data from the OpenWhisk action to the Spark job.&lt;/p&gt;

&lt;p&gt;Spark Job: Will process the data received from Pulsar.&lt;/p&gt;

&lt;p&gt;OpenWhisk Action development&lt;/p&gt;

&lt;p&gt;Language: Choose a language that OpenWhisk supports, such as Python or Java.&lt;/p&gt;

&lt;p&gt;Pulsar Interface: The action must include a Pulsar client to send messages to the Pulsar topic.&lt;/p&gt;

&lt;p&gt;Python Example: OpenWhisk Action Sending Data to Pulsar&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pulsar import Client

def main(args):
    pulsar_client = Client('pulsar://localhost:6650')
    producer = pulsar_client.create_producer('my-topic')
    # Assume that 'data' is passed as part of 'args'
    data = args.get("data", "default data")
    producer.send(data.encode('utf-8'))
    pulsar_client.close()
    return {"result": "Data sent to Pulsar"}
&lt;/code&gt;&lt;/pre&gt;
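&lt;p&gt;Before deploying, the action’s logic can be exercised without a broker by injecting a stand-in for the Pulsar client. This is a throwaway harness for illustration, not part of the action itself:&lt;/p&gt;

```python
# Minimal stand-ins for pulsar.Client and its producer, for local testing.
class FakeProducer:
    def __init__(self):
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)

class FakeClient:
    def __init__(self, url):
        self.producer = FakeProducer()

    def create_producer(self, topic):
        return self.producer

    def close(self):
        pass

def main(args, client_factory=FakeClient):
    # Same flow as the OpenWhisk action, with the client made injectable.
    client = client_factory('pulsar://localhost:6650')
    producer = client.create_producer('my-topic')
    data = args.get("data", "default data")
    producer.send(data.encode('utf-8'))
    client.close()
    return {"result": "Data sent to Pulsar"}
```

&lt;p&gt;Calling main({"data": "hello"}) returns {"result": "Data sent to Pulsar"} without touching a real broker.&lt;/p&gt;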

&lt;p&gt;Apache Pulsar Configuration&lt;/p&gt;

&lt;p&gt;Make sure Pulsar is configured to receive and store messages from the specified topic.&lt;/p&gt;

&lt;p&gt;Configure how long messages are retained to suit your needs.&lt;/p&gt;
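&lt;p&gt;With a default standalone install, retention can be set per namespace via pulsar-admin; the size and time values below are examples, not recommendations:&lt;/p&gt;

```shell
# Keep acknowledged messages for up to 7 days or 10 GB per topic
bin/pulsar-admin namespaces set-retention public/default --size 10G --time 7d

# Verify the namespace policy
bin/pulsar-admin namespaces get-retention public/default
```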

&lt;p&gt;Development of the Spark Job&lt;/p&gt;

&lt;p&gt;Write a Spark job that can read from the Pulsar topic.&lt;/p&gt;

&lt;p&gt;The Spark job should be able to process the data and, if necessary, produce an output.&lt;/p&gt;

&lt;p&gt;Python Example: Spark Job to Read from Pulsar&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PulsarSpark").getOrCreate()

# Configure the DataFrame to read data from Pulsar
df = spark.read.format("pulsar") \
    .option("service.url", "pulsar://localhost:6650") \
    .option("admin.url", "http://localhost:8080") \
    .option("topic", "persistent://public/default/my-topic") \
    .load()

# Data processing goes here

spark.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Testing &amp;amp; Integration&lt;/p&gt;

&lt;p&gt;OpenWhisk Action Test: Verifies that the OpenWhisk action can successfully send data to Pulsar.&lt;/p&gt;

&lt;p&gt;Spark Job Test: Make sure the Spark job is able to read and process data from Pulsar.&lt;/p&gt;

&lt;p&gt;Integration: Test the entire flow from OpenWhisk to Pulsar to Spark.&lt;/p&gt;

&lt;p&gt;Deploy &amp;amp; Monitor&lt;/p&gt;

&lt;p&gt;Deploy all components in production.&lt;/p&gt;

&lt;p&gt;Monitor the application to make sure it’s working as expected and to handle any issues.&lt;/p&gt;

&lt;p&gt;Final Thoughts&lt;/p&gt;

&lt;p&gt;Error Handling: Make sure your code handles errors correctly in each step.&lt;/p&gt;

&lt;p&gt;Security: Implement appropriate security measures, such as authentication for Pulsar and Spark.&lt;/p&gt;

&lt;p&gt;Detailed Implementation&lt;/p&gt;

&lt;p&gt;OpenWhisk Actions&lt;/p&gt;

&lt;p&gt;First, let’s create an OpenWhisk action that sends data to a Pulsar topic.&lt;/p&gt;

&lt;p&gt;Make sure you have the Pulsar client for Python installed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# writeOnPulsar.py
from pulsar import Client

def main(args):
    pulsar_client = Client('pulsar://localhost:6650')
    producer = pulsar_client.create_producer('my-topic')
    # 'data' can be passed as part of 'args'
    data = args.get("data", "default data")
    producer.send(data.encode('utf-8'))
    pulsar_client.close()
    return {"result": "Data sent to Pulsar"}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To perform this action in OpenWhisk, you’ll need to package and deploy it to your OpenWhisk instance.&lt;/p&gt;

&lt;p&gt;Spark Job&lt;/p&gt;

&lt;p&gt;Now, let’s write a simple Spark job that reads data from the Pulsar topic. Make sure you’ve set up Spark with Pulsar support.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# spark_pulsar_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PulsarSpark").getOrCreate()

# Configure the DataFrame to read data from Pulsar
df = spark.read.format("pulsar") \
    .option("service.url", "pulsar://localhost:6650") \
    .option("admin.url", "http://localhost:8080") \
    .option("topic", "persistent://public/default/my-topic") \
    .load()

# Here you can process the data as needed
df.show()

spark.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This code must be run in a Spark environment.&lt;/p&gt;

&lt;p&gt;Testing &amp;amp; Deployment&lt;/p&gt;

&lt;p&gt;Once you’ve written these scripts, you should test them locally or in a development environment to ensure that they work as expected.&lt;/p&gt;

&lt;p&gt;You can start the OpenWhisk action and then run the Spark job to see if the data is being transmitted correctly.&lt;/p&gt;

&lt;p&gt;To use the OpenWhisk action in a real-world environment, you’ll need to package it in a way that’s manageable and deployable on OpenWhisk.&lt;/p&gt;

&lt;p&gt;Create a Virtual Environment (Optional but Recommended)&lt;/p&gt;

&lt;p&gt;First of all, it’s a good practice to create a Python virtual environment to manage dependencies.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;python -m venv myenv
source myenv/bin/activate  # on Windows use myenv\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Installing Dependencies&lt;/p&gt;

&lt;p&gt;Install the necessary dependencies, in this case the Pulsar client for Python.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install 'pulsar-client==3.4.0'

# optional: avro serialization
pip install 'pulsar-client[avro]==3.4.0'

# optional: functions runtime
pip install 'pulsar-client[functions]==3.4.0'

# optional: all optional components
pip install 'pulsar-client[all]==3.4.0'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Create the Requirements File&lt;/p&gt;

&lt;p&gt;Create a requirements.txt file that lists all dependencies. This is important to ensure that the OpenWhisk environment has all the necessary libraries.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# requirements.txt (replace with the specific version you’re using)
pulsar-client==3.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Writing the Action Code&lt;/p&gt;

&lt;p&gt;Write the action code (like the openwhisk_pulsar_action.py example I provided earlier) and make sure it’s working and tested.&lt;/p&gt;

&lt;p&gt;Packaging the Action&lt;/p&gt;

&lt;p&gt;OpenWhisk allows you to package Python actions into a zip file that includes the code and dependencies.&lt;/p&gt;

&lt;p&gt;Here’s how:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;zip -r my_action.zip openwhisk_pulsar_action.py myenv
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This command creates a zip archive containing your script and virtual environment.&lt;/p&gt;

&lt;p&gt;Deploying Action on OpenWhisk&lt;/p&gt;

&lt;p&gt;You can now use the wsk command to deploy the action.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wsk action create myPulsarAction my_action.zip --kind python:3.7 --main main
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This command creates an action on OpenWhisk called myPulsarAction using your zip file.&lt;/p&gt;

&lt;p&gt;Test the action&lt;/p&gt;

&lt;p&gt;Once you’ve deployed your action, it’s a good practice to test it to make sure it works as expected in your OpenWhisk environment.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wsk action invoke myPulsarAction --result --param data "test data"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This invokes the action with a test parameter.&lt;/p&gt;

&lt;p&gt;Once you have zipped your OpenWhisk package, you will use the OpenWhisk Command Line Interface (CLI) wsk command for its distribution.&lt;/p&gt;

&lt;p&gt;Here are the detailed steps:&lt;/p&gt;

&lt;p&gt;· Install and configure the OpenWhisk CLI.&lt;/p&gt;

&lt;p&gt;· Log in to your OpenWhisk instance.&lt;/p&gt;

&lt;p&gt;· Creating the Action from a Zip Package&lt;/p&gt;

&lt;p&gt;· Navigate to the Zip File Directory:&lt;/p&gt;

&lt;p&gt;· Make sure you’re in the same directory as your zip file.&lt;/p&gt;

&lt;p&gt;· For example, if your file is called my_action.zip, navigate to the directory that contains it.&lt;/p&gt;

&lt;p&gt;Run the following command to create the action in OpenWhisk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wsk action create actionName my_action.zip --kind python:3.7 --main mainFunctionName
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;p&gt;actionName is the name you want to give to the action in OpenWhisk.&lt;/p&gt;

&lt;p&gt;my_action.zip is the name of your zip file.&lt;/p&gt;

&lt;p&gt;--kind python:3.7 specifies the Python runtime you are using. Replace it with the appropriate version if necessary.&lt;/p&gt;

&lt;p&gt;--main mainFunctionName specifies the name of the main function in your Python code.&lt;/p&gt;

&lt;p&gt;For example, if your function is called main, you use --main main.&lt;/p&gt;

&lt;p&gt;Practical Example&lt;/p&gt;

&lt;p&gt;Let’s say you have a my_action.zip file and inside it is a Python script with a main function.&lt;/p&gt;

&lt;p&gt;The command becomes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wsk action create myPulsarAction my_action.zip --kind python:3.7 --main main
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This command tells OpenWhisk to create an action called myPulsarAction using the my_action.zip file, with the Python 3.7 environment and the main function as the entry point.&lt;/p&gt;

&lt;p&gt;Once you’ve deployed your action, you can test it with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wsk action invoke myPulsarAction --result --param data "test data"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This invokes the myPulsarAction action with a test parameter and returns the result.&lt;/p&gt;

&lt;p&gt;Remember that these commands assume that you have configured your OpenWhisk CLI correctly and that you have the necessary permissions to create actions on your OpenWhisk instance. If you encounter any issues during the deployment process, it may be helpful to check the OpenWhisk documentation or seek assistance in the OpenWhisk community.&lt;/p&gt;

&lt;p&gt;Intercepting the OpenWhisk action’s output from Spark by reading from the tail of the Pulsar topic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenWhisk Action Publishes Data on Pulsar&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the OpenWhisk action is invoked, it publishes data to a specific topic in Pulsar. This step has already been configured in the OpenWhisk action code we discussed.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Configuring Spark to Read from Pulsar&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To have Spark read data from a Pulsar topic, you need to set up a Spark job that uses the Pulsar-Spark Connector library.&lt;/p&gt;

&lt;p&gt;Here’s a basic example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkPulsarJob").getOrCreate()

# Configure reading from a Pulsar topic
df = spark.read.format("pulsar") \
    .option("service.url", "pulsar://localhost:6650") \
    .option("admin.url", "http://localhost:8080") \
    .option("topic", "persistent://public/default/my-topic") \
    .load()

# Data processing; for example, print the data for testing
df.show()

spark.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This Spark script connects to the specified Pulsar topic and reads the data that is published to it.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Running the Spark Job&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you’ve configured your Spark job to read from Pulsar, you can run it in a Spark environment. When the OpenWhisk action is invoked and publishes the data to Pulsar, the Spark job will read and process it as scheduled.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Synchronization and Orchestration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Make sure there is proper synchronization between invoking the OpenWhisk action and running the Spark job.&lt;/p&gt;

&lt;p&gt;You may need to orchestrate your workflow to ensure that the Spark job is running and ready to read data when the OpenWhisk action is invoked.&lt;/p&gt;

&lt;p&gt;Final Thoughts&lt;/p&gt;

&lt;p&gt;Error Handling: Make sure your Spark job correctly handles any errors when connecting to or reading from Pulsar.&lt;/p&gt;

&lt;p&gt;Performance and Scalability: Monitor performance and scale resources as needed, depending on the volume of data and frequency of operations.&lt;/p&gt;

&lt;p&gt;Synchronizing Between Writing to Pulsar and Reading from Spark&lt;/p&gt;

&lt;p&gt;Synchronizing between writing to Apache Pulsar by an OpenWhisk action and reading by a Spark job can be a challenge, especially in a distributed, scalable environment. Here are some strategies you can use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using Persistent Topic in Pulsar&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Make sure Pulsar uses persistent topics. This ensures that messages are stored until they are read, reducing the chance of the Spark job losing messages.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Windowing and Buffering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Configure your Spark job to use a time window or buffering. This approach allows Spark to read data from Pulsar in batches, reducing the risk of missing messages that could be published while the Spark job is not running.&lt;/p&gt;
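
&lt;p&gt;Independently of Spark, the batching idea can be sketched in a few lines: accumulate incoming messages into fixed-size buffers and hand each full buffer to the consumer, instead of processing one message at a time. This is a hypothetical illustration, not the connector’s own API:&lt;/p&gt;

```python
def batch(messages, max_size=3):
    # Group an iterable of messages into fixed-size batches so a
    # consumer can process them in chunks rather than one at a time.
    buf = []
    for msg in messages:
        buf.append(msg)
        if len(buf) >= max_size:
            yield buf
            buf = []
    if buf:
        # Flush the final, possibly partial, batch.
        yield buf
```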

&lt;ol start="3"&gt;
&lt;li&gt;Checking the Status of the Spark Job&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Implement logic that checks to see if the Spark job is running before invoking the OpenWhisk action. This can be accomplished through scripts or by using workflow orchestration systems.&lt;/p&gt;
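
&lt;p&gt;One way to sketch such a check is against Spark’s monitoring REST API, which the driver exposes on its UI port. The URL, port, and application name below are assumptions for illustration:&lt;/p&gt;

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def spark_job_running(applications, app_name="SparkPulsarJob"):
    # True if an application with the given name appears in the list
    # returned by Spark's /api/v1/applications endpoint.
    return any(app.get("name") == app_name for app in applications)

def fetch_applications(ui_url="http://localhost:4040"):
    # The driver UI exposes a monitoring REST API; if the driver is
    # down, the request fails and we report no running applications.
    try:
        with urlopen(f"{ui_url}/api/v1/applications", timeout=5) as resp:
            return json.load(resp)
    except (URLError, OSError):
        return []
```

&lt;p&gt;You would call these two functions before triggering the OpenWhisk action, and skip or delay the invocation when the job is not up.&lt;/p&gt;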

&lt;ol start="4"&gt;
&lt;li&gt;Workflow Orchestration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use workflow orchestration tools such as Apache Airflow to coordinate when the OpenWhisk action is invoked and when the Spark job is executed. This allows you to have more precise control over when and how data is processed.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Continuous Polling in Spark&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Configure the Spark job to continuously poll the Pulsar topic.&lt;/p&gt;

&lt;p&gt;This means Spark keeps checking the topic for new data, reducing the latency between writing and reading.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;Exponential Backoff and Reconnections&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In case of connection failures or network issues, implement an exponential backoff strategy in your Spark job.&lt;/p&gt;

&lt;p&gt;This helps to manage times when Pulsar is unavailable or there are network issues.&lt;/p&gt;
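
&lt;p&gt;A minimal sketch of exponential backoff with jitter, assuming the connection attempt (for example, creating a Pulsar client or starting the Spark stream) is wrapped in a callable that raises ConnectionError on failure:&lt;/p&gt;

```python
import random
import time

def with_backoff(connect, max_retries=5, base_delay=1.0, max_delay=30.0):
    # Retry `connect`, doubling the wait after each failure and adding
    # random jitter so many clients do not retry in lockstep.
    for attempt in range(max_retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```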

&lt;ol start="7"&gt;
&lt;li&gt;Monitoring &amp;amp; Alerting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Put in place a monitoring and alerting system so you are notified when there are problems in the flow of data between OpenWhisk, Pulsar, and Spark. This allows you to intervene quickly in case of misalignments or errors.&lt;/p&gt;

&lt;ol start="8"&gt;
&lt;li&gt;State Documentation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Consider keeping a record of the state of your messages (for example, in a database or queue system) to keep track of which messages have already been processed by Spark.&lt;/p&gt;
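
&lt;p&gt;Such a record can be as simple as a table of processed message IDs. A sketch using SQLite; the schema and function names are assumptions:&lt;/p&gt;

```python
import sqlite3

def open_store(path=":memory:"):
    # Tiny state store: remembers which Pulsar message IDs have
    # already been processed so redeliveries can be skipped.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed (msg_id TEXT PRIMARY KEY)"
    )
    return conn

def mark_processed(conn, msg_id):
    # INSERT OR IGNORE makes marking idempotent on redelivery.
    conn.execute("INSERT OR IGNORE INTO processed VALUES (?)", (msg_id,))
    conn.commit()

def already_processed(conn, msg_id):
    row = conn.execute(
        "SELECT 1 FROM processed WHERE msg_id = ?", (msg_id,)
    ).fetchone()
    return row is not None
```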

&lt;p&gt;A very effective technique for synchronizing writing to Apache Pulsar from OpenWhisk and reading from Apache Spark is using persistent topics in Pulsar with continuous polling in Spark. This approach offers a good combination of reliability and simplicity. Here’s an example of how you could implement it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configuring the Persistent Topic in Pulsar&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Make sure your topic in Pulsar is set up as persistent. Usually, topics in Pulsar are persistent by default. A persistent topic might have a name similar to:&lt;/p&gt;

&lt;p&gt;persistent://public/default/my-topic.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;OpenWhisk Script to Publish on Pulsar&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The OpenWhisk action publishes the data on the Pulsar topic.&lt;/p&gt;

&lt;p&gt;Here there are no changes from the script provided earlier.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Spark Job with Continuous Polling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Spark job will continuously poll the Pulsar topic.&lt;/p&gt;

&lt;p&gt;Configure the Spark job to read data continuously from Pulsar.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("SparkPulsarJob").getOrCreate()

    # Configuration to read continuously from the Pulsar topic
    df = spark.readStream.format("pulsar") \
        .option("service.url", "pulsar://localhost:6650") \
        .option("admin.url", "http://localhost:8080") \
        .option("topic", "persistent://public/default/my-topic") \
        .load()

    # Define your data processing here
    query = df.writeStream \
        .outputMode("append") \
        .format("console") \
        .start()

    query.awaitTermination()

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This script configures Spark to stream from the Pulsar topic.&lt;/p&gt;

&lt;p&gt;It uses readStream for continuous polling and writeStream to process the received data.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Execution &amp;amp; Monitoring&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Run the Spark job and monitor it to make sure it continuously reads data from Pulsar. Whenever the OpenWhisk action is invoked and publishes data to Pulsar, the Spark job will detect it in near real-time.&lt;/p&gt;

&lt;p&gt;Final Thoughts&lt;/p&gt;

&lt;p&gt;Reliability: This approach is reliable since data is not lost due to topic persistence.&lt;/p&gt;

&lt;p&gt;Simplicity: The logic is quite simple and straightforward.&lt;/p&gt;

&lt;p&gt;Performance: Monitor performance and scale resources as needed.&lt;/p&gt;

&lt;p&gt;Please note that this is only a hypothetical implementation; you will have to adapt it to your specific Pulsar and Spark configuration and to your data-processing needs.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
