
Serverless Apache Zeppelin on AWS

What is Apache Zeppelin?

First of all, it is worth asking: what is a notebook interface? A notebook is an interface for interactively running code; it lets you explore and visualize data, and mix narrative, rich media, and data in a single place.

Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

It's easy to mix languages in the same notebook: you can write some code and then use Markdown to document it all in one place. You can also easily convert a notebook into a presentation - perhaps for presenting to management or for use as a dashboard.

What Does Serverless Mean?

The idea behind serverless is that you, as a developer, shouldn't need to care about server infrastructure. You pay to run your code without worrying about the physical infrastructure running underneath.

There are quite a few advantages to serverless. Scalability essentially comes for free: because you're just paying to run logic, the cloud provider can easily dedicate more hardware to your code. You also pay per execution rather than a fixed rate. In addition, the cloud provider manages the server software and hardware, so you don't need to worry about it. Finally, serverless frees up developers to focus on what they're good at: coding.


Solution Requirements

Build a serverless infrastructure to run Apache Zeppelin and persist notebook files. The solution must be publicly available and provide login and logout capability. Also, the compute platform must automatically shut down after 30 minutes of inactivity.

High-level Architecture

The diagram below shows the high-level architecture. As you can see, it is a serverless infrastructure: you operate Apache Zeppelin through a public endpoint while Amazon Elastic File System stores the notebook files. An Amazon CloudWatch custom metric counts log lines and shuts down the AWS Fargate container after 30 minutes of inactivity.

The only missing feature in this architecture is the login and logout capability. In this case, Apache Zeppelin provides Shiro for notebook authentication. Apache Shiro is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management. Here, you can find a step-by-step guide about how Shiro works. This example uses the default configuration.

[Diagram: high-level architecture]

Infrastructure as Code Description

The solution uses AWS SAM, with a Globals configuration for the Lambda functions and a public API used to access Apache Zeppelin. The stack deployment provides the URL as an output value.

Amazon API Gateway

Amazon API Gateway is used as the front door to interact with the application; it exposes the URL the user can use to trigger operations and use Serverless Apache Zeppelin.

AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]

Resources:
  ZeppelinApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: !Ref ServiceName

Outputs:
  ZeppelinApi:
    Description: "API Gateway endpoint URL for the Serverless Apache Zeppelin stage"
    Value: !Sub "https://${ZeppelinApi}.execute-api.${AWS::Region}.amazonaws.com/${ServiceName}/"

Elastic File System

When provisioned, each Amazon ECS task hosted on AWS Fargate receives ephemeral storage for bind mounts; everything on the disk is lost after container termination. To persist notebook files, the solution uses Amazon Elastic File System; all notebooks on EFS are preserved after container termination. The Access Point configuration allows Apache Zeppelin to have write permissions on Amazon Elastic File System.


AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]

Resources:
  ZeppelinApi:
    [...]

  AccessPoint:
    Type: 'AWS::EFS::AccessPoint'
    Properties:
      FileSystemId: !Ref FileSystem
      PosixUser:
        Uid: "500"
        Gid: "500"
        SecondaryGids:
          - "2000"
      RootDirectory:
        CreationInfo:
          OwnerGid: "500"
          OwnerUid: "500"
          Permissions: "0777"
        Path: !Sub "/${ServiceName}"
  FileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      PerformanceMode: generalPurpose
      FileSystemTags:
      - Key: ServiceName
        Value: !Ref ServiceName
  MountTarget1:
    [Availability Zone A Configuration]
  MountTarget2:
    [Availability Zone B Configuration]
  MountTarget3:
    [Availability Zone C Configuration]

Amazon CloudWatch Custom Metric

To provide an auto-shutdown feature, the Serverless Apache Zeppelin solution uses a custom metric. AWS Fargate saves logs into an Amazon CloudWatch Log Group, and an Amazon CloudWatch Metric Filter counts the log lines. If the custom metric stays at zero for about 30 minutes, the alarm publishes a message to Amazon Simple Notification Service to terminate the task.

AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]

Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ShutdownSnsTopic:
    [description later in this post]

  ZeppelinLogGroup:
    Type: AWS::Logs::LogGroup
    Properties: 
      LogGroupName: !Sub "/ecs/fargate-${ServiceName}"
      RetentionInDays: 1
  ActivityMetricFilter: 
    Type: AWS::Logs::MetricFilter
    Properties: 
      LogGroupName: !Ref ZeppelinLogGroup
      FilterPattern: "INFO"
      MetricTransformations: 
        - 
          MetricValue: "1"
          MetricNamespace: !Sub "${ServiceName}/Actions"
          MetricName: "ActionsCount"
  ZeppelinActionsCountAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ZeppelinActionsCountAlarm
      MetricName: ActionsCount
      Namespace: !Sub "${ServiceName}/Actions"
      Statistic: SampleCount
      Period: '300'
      EvaluationPeriods: '6'
      TreatMissingData: breaching
      Threshold: '1'
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
      - !Ref ShutdownSnsTopic 

AWS Fargate

Here are the AWS Fargate cluster and task definition. The solution uses Shiro to enable the login and logout capability. As stated here, you can create a shiro.ini file by executing the cp command; you can see this in the EntryPoint property of the container definition.

AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]

Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ZeppelinLogGroup:
    [...]

  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Join ['', [!Ref ServiceName, Cluster]]
  ZeppelinTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - "FARGATE"
      Cpu: !Ref ContainerCPU
      Memory: !Ref MemoryHardLimit
      NetworkMode: "awsvpc"
      TaskRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ExecutionRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Image: "apache/zeppelin:0.10.0"
          EntryPoint: 
            - /bin/bash
            - -c
            - |
              cp conf/shiro.ini.template conf/shiro.ini 
              /usr/bin/tini -- bin/zeppelin.sh
          Command: ["echo", "done!"]
          MemoryReservation: !Ref MemorySoftLimit
          Memory: !Ref MemoryHardLimit
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              Protocol: tcp
            - ContainerPort: 4040
              Protocol: tcp
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref ZeppelinLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Sub 'ecs-${ServiceName}-awsvpc'
          MountPoints:
            - ContainerPath: !Ref ZeppelinPersistNotebookPath
              SourceVolume: !Sub "${ServiceName}"
              ReadOnly: false
      Volumes:
        - Name: !Sub "${ServiceName}"
          EFSVolumeConfiguration:
            AuthorizationConfig: 
              IAM: ENABLED
              AccessPointId: !Ref AccessPoint
            FilesystemId: !Ref FileSystem
            TransitEncryption: ENABLED   

AWS Lambda | Workflow

Below is the high-level workflow showing how the implementation works and how the task is created and shut down.

[Diagram: AWS Lambda workflow]

Start Serverless Apache Zeppelin

When a request arrives, the AWS Lambda function first checks whether the Apache Zeppelin container is running.

If it is, AWS Lambda returns a 302 redirect to the Apache Zeppelin public IP. If it is not, AWS Lambda moves to the next step and checks whether the Apache Zeppelin container exists.

If it does, AWS Lambda returns static web content: a loading page that auto-refreshes every 20 seconds. If it does not, AWS Lambda starts a new Apache Zeppelin container and returns the loading page. Every 20 seconds, the client checks the Apache Zeppelin provisioning state and gets the notebook interface once the container is running; otherwise, it gets the loading page again. Once the notebook interface is available, you must provide your user credentials to use Apache Zeppelin.
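
Below is a minimal sketch of how this front-door function could be declared in the SAM template; the resource name, handler, runtime, and code path are illustrative and not taken from the original stack. The function only needs an Api event bound to ZeppelinApi so that every request on the public stage reaches it.

Resources:
  ZeppelinApi:
    [...]

  StartZeppelinFunction:
    Type: AWS::Serverless::Function
    Properties:
      # Illustrative code location and handler; this function holds the
      # start/redirect logic described above
      CodeUri: functions/start_zeppelin/
      Handler: app.lambda_handler
      Runtime: python3.9
      Events:
        ProxyRequest:
          Type: Api
          Properties:
            # Route every request on the public stage to the function
            RestApiId: !Ref ZeppelinApi
            Path: /{proxy+}
            Method: ANY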

Shut Down Serverless Apache Zeppelin

If the custom metric stays at zero for about 30 minutes, the alarm publishes a message to Amazon Simple Notification Service, and an AWS Lambda function, triggered by the Amazon Simple Notification Service topic, shuts down the running task.
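
Below is a minimal sketch of how this shutdown function could subscribe to the topic in the SAM template, assuming the topic is the ShutdownSnsTopic referenced by the alarm above; the resource name, handler, runtime, and code path are illustrative.

Resources:
  ShutdownSnsTopic:
    [...]

  ShutdownZeppelinFunction:
    Type: AWS::Serverless::Function
    Properties:
      # Illustrative code location and handler; this function stops the
      # running Fargate task when the inactivity alarm fires
      CodeUri: functions/shutdown_zeppelin/
      Handler: app.lambda_handler
      Runtime: python3.9
      Events:
        ShutdownMessage:
          Type: SNS
          Properties:
            Topic: !Ref ShutdownSnsTopic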

Usage Suggestions & Improvements

Apache Zeppelin supports Amazon S3 for persisting notebook files. As stated here, you can use ZEPPELIN_NOTEBOOK_STORAGE, ZEPPELIN_NOTEBOOK_S3_BUCKET, and ZEPPELIN_NOTEBOOK_S3_USER as environment variables.
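
As a minimal sketch, these variables could be set directly in the container definition of the Fargate task; the storage class below is Zeppelin's S3 notebook repository, while the bucket name and user value are placeholders.

          # Inside the ContainerDefinitions entry of ZeppelinTaskDefinition
          Environment:
            - Name: ZEPPELIN_NOTEBOOK_STORAGE
              Value: org.apache.zeppelin.notebook.repo.S3NotebookRepo
            - Name: ZEPPELIN_NOTEBOOK_S3_BUCKET
              Value: my-zeppelin-notebook-bucket
            - Name: ZEPPELIN_NOTEBOOK_S3_USER
              Value: zeppelin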

On the other hand, Amazon Elastic File System offers a very generic solution that can be used for many purposes; the only limit is your imagination. Since Amazon EFS is a file system, you don't have to deal with Amazon S3 object storage. You can simply package your application in a Docker container and run it on AWS Fargate, just by replacing Apache Zeppelin.

For example, you can run Serverless Visual Studio Code; check the container here.

Another improvement related to Serverless Apache Zeppelin on AWS is configuring Amazon DynamoDB as an external database for Shiro users.

What will be your next application to deploy as Serverless?
