<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Omer Farooq Ahmed </title>
    <description>The latest articles on DEV Community by Omer Farooq Ahmed  (@omer95).</description>
    <link>https://dev.to/omer95</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F160695%2F28b8e8d6-c1f3-4957-b94e-0a3922cc9b72.jpeg</url>
      <title>DEV Community: Omer Farooq Ahmed </title>
      <link>https://dev.to/omer95</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/omer95"/>
    <language>en</language>
    <item>
      <title>An Introduction to Hive UDFs with Scala</title>
      <dc:creator>Omer Farooq Ahmed </dc:creator>
      <pubDate>Thu, 14 Dec 2023 10:04:25 +0000</pubDate>
      <link>https://dev.to/omer95/an-introduction-to-hive-udfs-with-scala-2cg6</link>
      <guid>https://dev.to/omer95/an-introduction-to-hive-udfs-with-scala-2cg6</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;For anyone looking to do big data analytics at scale, Apache Spark is your best bet. Spark's rich DataFrame API allows intuitive transformations on structured data and helps engineers build fast and optimized data pipelines.&lt;/p&gt;

&lt;p&gt;However, developing large data applications involves numerous teams and personas with a diverse set of skills, and not everyone is comfortable writing Spark. SQL tends to be the common denominator in most data teams, and this is where Apache Hive shines: petabyte-scale analytics using HiveQL (a flavor of SQL). Most transformations that can be expressed with Spark's DataFrame API can also be written in SQL, and for the remaining, more complex queries, there are always User Defined Functions (UDFs).&lt;/p&gt;

&lt;p&gt;User Defined Functions allow end users to write custom business logic that is applied to each record of a column. UDFs are useful when the equivalent functionality would require multiple complex SQL queries; a simple Scala UDF can often do the same thing in a few lines of code. This article demonstrates how to use the Hive UDF and GenericUDF abstract classes to build user defined functions in Scala (there are already plenty of articles on building Java UDFs).&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple UDF
&lt;/h3&gt;

&lt;p&gt;Hive defines two approaches to writing UDFs: Simple and Generic. Simple UDFs can be built by extending the UDF abstract class and implementing the evaluate() function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;HiveUDFs&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hadoop.hive.ql.exec.UDF&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Increment&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;UDF&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, simply create a JAR using sbt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sbt package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
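The package step assumes a build definition with the Hive UDF API on the compile classpath. A minimal, hypothetical build.sbt might look like this (the project name and versions are illustrative; match hive-exec to your cluster's Hive version):

```scala
// build.sbt -- minimal, hypothetical build for the UDF above.
// Versions are illustrative; match hive-exec to your cluster.
name := "hive-udfs"
scalaVersion := "2.12.18"

// "provided" keeps hive-exec out of the packaged JAR, since Hive
// already has these classes on its classpath at runtime.
libraryDependencies += "org.apache.hive" % "hive-exec" % "3.1.3" % "provided"
```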



&lt;p&gt;Add the JAR to the Hive classpath, create the temporary function increment, and use it in a SELECT statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;ADD JAR hdfs:///{path}/{to}/{jar}.jar
&amp;gt;CREATE TEMPORARY FUNCTION increment AS 'HiveUDFs.Increment';
&amp;gt;SELECT increment(1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Obviously, the above UDF is an extremely simple example that could just as easily be written as plain SQL. A practical UDF would be something more complex, such as a proprietary algorithm (it is better for maintainability and readability to write the algorithm in your org's preferred programming language). That said, simple UDFs can be bad for performance, because every call to the evaluate function goes through &lt;a href="https://stackoverflow.com/questions/37628/what-is-reflection-and-why-is-it-useful"&gt;Reflection&lt;/a&gt;, which carries an overhead. Furthermore, while you can overload the evaluate function to accept and return a number of different primitive types, it gets complicated if your Hive table column is an Array, Map or Struct type. Here are the Hive column types and their equivalent Java types for UDFs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hive&lt;/th&gt;
&lt;th&gt;Java&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;java.lang.String, org.apache.hadoop.io.Text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;int, java.lang.Integer, org.apache.hadoop.io.IntWritable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;boolean, java.lang.Boolean, org.apache.hadoop.io.BooleanWritable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;array&lt;/td&gt;
&lt;td&gt;java.util.List&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;map&lt;/td&gt;
&lt;td&gt;java.util.Map&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
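To illustrate the overloading described above, here is a sketch of the Increment class with several evaluate signatures. It is plain Scala: the extends UDF clause and the Hadoop writable types are omitted so the snippet stands alone, and in a real UDF each overload would map to a Hive column type per the table above.

```scala
// Sketch: overloaded evaluate() methods, mirroring the simple-UDF pattern.
// A real Hive UDF would extend org.apache.hadoop.hive.ql.exec.UDF and could
// also accept writable types (Text, IntWritable); both are omitted here.
class Increment {
  // int column
  def evaluate(num: Int): Int = num + 1

  // bigint column
  def evaluate(num: Long): Long = num + 1L

  // array-of-int column: Hive hands simple UDFs a java.util.List
  def evaluate(nums: java.util.List[Integer]): java.util.List[Integer] = {
    if (nums == null) return null
    val out = new java.util.ArrayList[Integer]()
    nums.forEach(n => out.add(n + 1)) // Integer unboxes via Predef implicits
    out
  }
}
```

Hive picks the overload whose signature matches the column type, which is exactly where nested arrays and structs get awkward and the generic approach becomes attractive.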

&lt;p&gt;While simple UDFs do support arrays with List/ArrayList in Java, writing a UDF to work with arrays in Scala does not always yield expected results. This is where Generic UDFs are more useful. Generic UDFs are the only approach when dealing with a nested array or struct type in a Hive column, or when you want to work with a dynamic number of columns in a UDF.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generic UDFs
&lt;/h3&gt;

&lt;p&gt;The second way to write a UDF is with the GenericUDF abstract class. Generic UDFs are faster than simple UDFs because there is no reflective call, the arguments are parsed lazily, and Hive passes arguments as generic Object types, so there is no need to instantiate and deserialize an Object when it is not needed. Generic UDFs can also deal with complex types such as structs and nested arrays, and can accept a variable number of parameters. The downside is that writing a GenericUDF is a bit more complicated: you must implement three methods rather than one, and there is very little documentation on the purpose of these methods, especially in Scala. Let's write a Generic UDF that returns the length of an array of integers. The three methods we need to implement are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;initialize:&lt;br&gt;
Since Hive passes all parameters as Object types, we need an ObjectInspector to interact with each Object (get its value and type, and write an output). When Hive analyzes a query with a UDF, it computes the parameter types and calls initialize, passing in the appropriate ObjectInspector for each parameter (in our case, a single ListObjectInspector). We can use this method for type checking and validation of our input. Hive expects initialize to return an ObjectInspector for the UDF's return type (in our case, a javaIntObjectInspector). We also want to store the ObjectInspector from the arguments as a class property, because we will need it in the evaluate method to interact with the DeferredObjects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;evaluate:&lt;br&gt;
This method contains the core business logic that is applied to every record of the input column(s). Unlike a simple UDF's evaluate, it accepts an array of DeferredObjects (the input parameters to the UDF), and we need to call get() on each DeferredObject to obtain the underlying Object. We then use the stored ObjectInspector to retrieve the value(s) of the actual UDF parameter(s). Finally, we can run our algorithm and return the result.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;getDisplayString:&lt;br&gt;
Hive calls this method whenever there is an error running the UDF, so it is used to display troubleshooting information.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's how we write our ListLength GenericUDF that takes one input parameter of type array and returns an int:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;HiveUDFs&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hadoop.hive.ql.exec.UDFArgumentException&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hadoop.hive.ql.udf.generic.GenericUDF&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hadoop.hive.serde2.objectinspector.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;ListObjectInspector&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ObjectInspector&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;scala.collection.JavaConverters._&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ListLength&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;GenericUDF&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;listInputObjectInspector&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ListObjectInspector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nd"&gt;@throws&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;UDFArgumentException&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;ObjectInspector&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ObjectInspector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;length&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;getCategory&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nv"&gt;ObjectInspector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;LIST&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;listInputObjectInspector&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;asInstanceOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;ListObjectInspector&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
    &lt;span class="nv"&gt;PrimitiveObjectInspectorFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;javaIntObjectInspector&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;GenericUDF.DeferredObject&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Integer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;length&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;list&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;java.util.List&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;listInputObjectInspector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;sList&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;list&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asScala&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toList&lt;/span&gt;
    &lt;span class="nv"&gt;sList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;length&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;getDisplayString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Getting size of array"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
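The GenericUDF is packaged and registered exactly like the simple UDF (the JAR path is a placeholder and the function name list_length is arbitrary):

```sql
ADD JAR hdfs:///{path}/{to}/{jar}.jar;
CREATE TEMPORARY FUNCTION list_length AS 'HiveUDFs.ListLength';
SELECT list_length(array(1, 2, 3));  -- returns 3
```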



&lt;p&gt;While GenericUDFs can feel like overkill for trivial functions such as this one, they are powerful tools for applying custom logic to millions of rows of a complex column in Hive, enabling a more diverse group of data users to reuse abstracted business logic on all sorts of structured data.&lt;/p&gt;

</description>
      <category>hive</category>
      <category>spark</category>
      <category>scala</category>
    </item>
    <item>
      <title>Top 30 Microsoft Azure Services</title>
      <dc:creator>Omer Farooq Ahmed </dc:creator>
      <pubDate>Sun, 11 Jul 2021 17:03:29 +0000</pubDate>
      <link>https://dev.to/omer95/top-30-microsoft-azure-services-27lb</link>
      <guid>https://dev.to/omer95/top-30-microsoft-azure-services-27lb</guid>
      <description>&lt;p&gt;Jeff Delaney from &lt;a href="https://www.youtube.com/channel/UCsBjURrPoezykLs9EqgamOA"&gt;Fireship.io&lt;/a&gt; recently made a video describing the 50+ most popular cloud services offered by Amazon Web Services with common use cases. Inspired by that video, here are the top 30 Microsoft Azure cloud services.&lt;/p&gt;

&lt;p&gt;Microsoft Azure started off in 2010 as Windows Azure with just three services. Today, Azure offers over 200 products across compute, storage, databases, networking, artificial intelligence and more, enabling developers to create robust applications without worrying about infrastructure management. Let's explore some of these products and services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compute
&lt;/h2&gt;

&lt;p&gt;Cloud computing is essentially leasing computational resources that may be too expensive or cumbersome to manage on premises from a widely distributed network of managed data centers. &lt;a href="https://azure.microsoft.com/en-us/services/virtual-machines/"&gt;Virtual Machines&lt;/a&gt; spin up Windows and Linux virtual operating systems that share physical resources but are completely self-contained environments, which users can remotely access to deploy large scale workloads.&lt;/p&gt;

&lt;p&gt;If you don't want to worry about the operating system and want to completely hand over management of the hosting environment to Azure, simply deploy an application on &lt;a href="https://azure.microsoft.com/en-us/services/app-service/"&gt;App Service&lt;/a&gt;. This service allows you to focus on your code; deploy a .NET, Python, Java, Ruby, PHP or NodeJS application in a few clicks and have it running on the cloud in no time (a higher level of abstraction, albeit less control, than VMs).&lt;/p&gt;

&lt;p&gt;Serverless computing is all the rage these days, and along with AWS and Google, Azure has thrown its hat in the ring with &lt;a href="https://azure.microsoft.com/en-us/services/functions/"&gt;Azure Functions&lt;/a&gt;. While technically not serverless, Functions abstracts away interaction with the actual web app server and lets users deploy a piece of programming logic, wrapped in a function and triggered by an event such as a database entry or a specific time of day. Users write only the core logic rather than an entire application, and their function runs only when invoked by the chosen trigger.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage
&lt;/h2&gt;

&lt;p&gt;If you're deploying applications to the cloud, you'll need persistent data storage. &lt;a href="https://azure.microsoft.com/en-us/services/storage/blobs/"&gt;Azure Blob Storage&lt;/a&gt; allows scalable storage for objects and files and provides an SDK to easily access them. Blob storage is a great trigger for Azure Functions, where uploading a file can automatically run your custom logic in the cloud (for example, if you want to run OCR on a file as soon as it's uploaded to a storage container). However, if you want to mount a filesystem as a native share on a virtual machine, use &lt;a href="https://azure.microsoft.com/en-us/services/storage/files/"&gt;Azure Files&lt;/a&gt; instead. &lt;a href="https://azure.microsoft.com/en-us/services/storage/archive/"&gt;Azure Archive Storage&lt;/a&gt; allows businesses to store terabytes of archived data that will rarely be accessed, at a fraction of the cost of regular file storage. This is great for backups, medical history, audit and compliance data, or for migrating decades of magnetic tape storage to the cloud. Finally, use &lt;a href="https://azure.microsoft.com/en-us/services/storage/data-lake-storage/"&gt;Azure Data Lake Storage&lt;/a&gt; to store both structured and unstructured data as-is and run high performance analytics workloads at scale, with support for the most common analytics frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;Developers need databases to store structured data, like application state and user data. Azure offers a range of managed SQL and NoSQL database solutions. &lt;a href="https://azure.microsoft.com/en-us/services/virtual-machines/sql-server/"&gt;Azure SQL&lt;/a&gt; is a suite of services that offer a cloud based relational database experience for different use cases. &lt;a href="https://azure.microsoft.com/en-us/services/virtual-machines/sql-server/"&gt;SQL Server on Azure Virtual Machines&lt;/a&gt; creates an entire operating system environment running the popular Microsoft Relational Database Management System, allowing seamless migration of on premises SQL workloads to the cloud. &lt;a href="https://azure.microsoft.com/en-us/products/azure-sql/database/"&gt;Azure SQL Database&lt;/a&gt; abstracts away resource management and scalability to let developers focus on building applications. Finally, &lt;a href="https://azure.microsoft.com/en-us/products/azure-sql/edge/"&gt;Azure SQL Edge&lt;/a&gt; is an architecture agnostic, containerized database that runs on IoT and edge devices with native support for streaming and time series data, so developers can perform real-time analytics on a myriad of data captured by sensors.&lt;/p&gt;

&lt;p&gt;Moving away from relational databases, &lt;a href="https://azure.microsoft.com/en-us/services/cosmos-db/"&gt;Azure Cosmos DB&lt;/a&gt; is a schema agnostic, globally distributed NoSQL database. Cosmos abstracts away its internal data model and provides APIs for interacting with data as if it were a MongoDB or Cassandra database, making it truly multi-model.&lt;/p&gt;

&lt;p&gt;If you're looking for a low-latency, in-memory database, &lt;a href="https://azure.microsoft.com/en-us/services/cache/"&gt;Azure Cache for Redis&lt;/a&gt; is a blazing fast database for applications that have millions of users generating extremely high traffic, such as a social media platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containers
&lt;/h2&gt;

&lt;p&gt;When you develop an application, you want to be able to run it in any environment. Containers offer the ability to ship the app within an image of its environment configuration, allowing faster and more flexible deployment than on VMs. &lt;a href="https://azure.microsoft.com/en-us/services/container-instances/"&gt;Container Instances&lt;/a&gt; runs your application in a virtualized operating system, managing environment variables, configuration and networking itself. The aforementioned configuration, networking, operating system and other features of the container are described in a blueprint called an image, and these images are usually stored in online repositories, like Docker Hub. &lt;a href="https://azure.microsoft.com/en-us/services/container-registry/"&gt;Container Registry&lt;/a&gt; is Azure’s private repository for Docker and Open Container Initiative images. Managing containers can be tricky, especially if you’ve deployed a number of microservices, each running as a standalone container. Kubernetes is an open source container orchestration tool that manages container scaling, state management, health and deployment. &lt;a href="https://azure.microsoft.com/en-us/services/kubernetes-service/"&gt;Azure Kubernetes Service&lt;/a&gt; is a fully managed, serverless Kubernetes experience on the cloud that leverages Azure’s enterprise grade security and governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Artificial Intelligence
&lt;/h2&gt;

&lt;p&gt;Developing AI applications is challenging. Training sophisticated machine learning models requires incredibly large, mostly labelled datasets that are difficult to acquire, and high performance computing that uses powerful and expensive graphics processing units. Researchers at Azure have developed ML models for common AI use cases using public and proprietary datasets and exposed APIs for developers to transform their applications and processes with AI, without training models from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/"&gt;Azure Cognitive Services&lt;/a&gt; is a family of AI services across vision, speech, language and decision support. &lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/"&gt;Computer Vision&lt;/a&gt; labels everyday objects and links them to 10,000 ontology concepts, reads signs and text using optical character recognition and analyzes spatial movement in images and videos. If your image analysis tasks are specific to an industry or domain, such as medical image analysis or using vision in optimizing manufacturing, use &lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/"&gt;Custom Vision&lt;/a&gt;. This service provides a user-friendly interface to upload your own images, label them and train custom models with high performance computing in the cloud. Finally, &lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/face/"&gt;Face API&lt;/a&gt; provides facial verification based on two images, facial recognition of your own organization’s employees by integrating your own private repository and detection of various features on faces, such as emotions, expressions, facial hair and even masks.&lt;/p&gt;

&lt;p&gt;Cognitive services also offer a range of products for natural language processing. &lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/language-understanding-intelligent-service/"&gt;Language Understanding&lt;/a&gt; or LUIS can extract key user goals and intentions, as well as entities, from natural language input. This can be combined with &lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/qna-maker/"&gt;QnA Maker&lt;/a&gt;, a conversational agent that can answer questions based on FAQs and similar knowledge bases, to build simple chatbots. &lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/translator/"&gt;Translator&lt;/a&gt; uses neural machine translation to translate text and documents (with the ability to preserve the original document format) in 90 supported languages. &lt;a href="https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/#features"&gt;Text Analytics&lt;/a&gt; can extract sentiments, key phrases, entities, entity links and input language from unstructured text. This also includes the &lt;a href="https://techcommunity.microsoft.com/t5/azure-ai/introducing-text-analytics-for-health/ba-p/1505152"&gt;Text Analytics for Health&lt;/a&gt; preview, a medical domain specific Named Entity Recognition service that can recognize biomedical entities, link them to medical ontology concepts such as UMLS, extract entity relations such as the dosage of a medication entity, and recognize negation relating to an entity.&lt;/p&gt;

&lt;p&gt;Besides building informational chatbots using QnA Maker, Azure also provides a larger &lt;a href="https://azure.microsoft.com/en-us/services/bot-services/"&gt;Bot Service&lt;/a&gt; for developing more sophisticated chatbots. Transactional chatbots perform operations such as accessing and modifying internal IT documents and databases and dynamic and context aware chatbots can be used as virtual assistants. &lt;a href="https://dev.botframework.com/"&gt;Bot Framework&lt;/a&gt; is an SDK that lets developers create these kinds of chatbots using their programming language of choice. &lt;a href="https://docs.microsoft.com/en-us/composer/introduction?tabs=v2x"&gt;Bot Framework Composer&lt;/a&gt; improves this experience by providing a visual tool to build conversational flows using pre-built templates and a number of triggers and actions that you can drag and drop onto a visual canvas.&lt;/p&gt;

&lt;p&gt;In the big data space, Azure offers &lt;a href="https://azure.microsoft.com/en-us/services/databricks/"&gt;Azure Databricks&lt;/a&gt;. This is an Apache Spark big data analytics and machine learning service over a distributed file system. The distributed cluster of nodes running analytics and AI operations in parallel allows for fast processing of large volumes of data, and integration with popular machine learning libraries such as PyTorch unleashes endless possibilities for custom ML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dev Tools, Management, Security and Networking
&lt;/h2&gt;

&lt;p&gt;Here are a few services that make the overall Azure experience so amazing. Every Azure resource automatically generates metrics and logs. &lt;a href="https://azure.microsoft.com/en-us/services/monitor/"&gt;Azure Monitor&lt;/a&gt; collects, analyzes and acts on this data to ensure availability, maximize performance and proactively detect problems. &lt;a href="https://azure.microsoft.com/en-us/services/active-directory/"&gt;Azure Active Directory&lt;/a&gt; is Azure’s cloud Identity Access Management system. Azure AD provides single sign on for your apps and governs access by ensuring the right people have access to the right resources. Developing applications involves calls to numerous APIs, so secret and key management can get cumbersome. &lt;a href="https://azure.microsoft.com/en-us/services/key-vault/"&gt;Key Vault&lt;/a&gt; simplifies this process by providing a single location to store all secrets and an SDK to easily access them, ensuring data protection and compliance. The advantage of deploying applications on the cloud as opposed to on premises is the ability to autoscale on demand and have your data always available. &lt;a href="https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview"&gt;Azure Load Balancer&lt;/a&gt; evenly distributes incoming network traffic to multiple instances of your application, while &lt;a href="https://azure.microsoft.com/en-gb/services/cdn/"&gt;Content Delivery Network&lt;/a&gt; intelligently caches your resources in multiple geographies to reduce load times and save bandwidth.&lt;/p&gt;

&lt;p&gt;Last but not least, you can create, deploy and manage all your Azure resources using &lt;a href="https://azure.microsoft.com/en-gb/features/cloud-shell/"&gt;Azure Cloud Shell&lt;/a&gt;. Cloud Shell provides a secure Bash or PowerShell session to administer Azure resources using Azure command line tools and even provides support for common programming languages and persistent storage with an attached storage account.&lt;/p&gt;

&lt;p&gt;With a host of user friendly, secure and highly available services, the Azure cloud is a great place to focus on your application and leave the rest to Microsoft.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>azure</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Docker Container on Azure Functions with Python</title>
      <dc:creator>Omer Farooq Ahmed </dc:creator>
      <pubDate>Tue, 13 Oct 2020 00:37:22 +0000</pubDate>
      <link>https://dev.to/omer95/docker-container-on-azure-functions-with-python-1lgd</link>
      <guid>https://dev.to/omer95/docker-container-on-azure-functions-with-python-1lgd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Serverless computing, also known as serverless architecture, Functions as a Service (FaaS) or just serverless, is all the rage these days. For any developer looking to quickly deploy their code to the cloud without having to manage server resources or get charged insane amounts for running a hello-world application, services like AWS Lambda, Google Cloud Functions and Azure Functions are the solution. With a host of event triggers, and rich CLI tooling to create boilerplate code and deploy straight to the cloud, all three cloud powerhouses provide a great service. However, serverless can be a double-edged sword: while convenient, it restricts setting up a custom environment on the machine the functions run on. This is where Docker shines. Azure Functions can run custom containers pulled from a registry like Docker Hub. In this tutorial, I will show how to create a custom container with an Azure Function that performs optical character recognition (OCR) in Python, and deploy it to an Azure Functions app in the cloud. By the end of the tutorial, we will be able to send an HTTP GET request with an image attached, triggering the Azure Function to perform OCR on the image and return the extracted text.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you need
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A Microsoft Azure Account. Get a free account with 1,000,000 free requests per month for Azure Functions: &lt;a href="https://azure.microsoft.com/en-gb/free/" rel="noopener noreferrer"&gt;https://azure.microsoft.com/en-gb/free/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure Functions core tools. Installation steps here: &lt;a href="https://github.com/Azure/azure-functions-core-tools" rel="noopener noreferrer"&gt;https://github.com/Azure/azure-functions-core-tools&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure CLI. Installation steps here: &lt;a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-apt" rel="noopener noreferrer"&gt;https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-apt&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker. Get Docker here: &lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;https://docs.docker.com/get-docker/&lt;/a&gt;&lt;br&gt;
If you're new to Docker, check out this excellent introduction from my favorite dev, Jeff Delaney:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/gAkwW2tuIqE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An account on Docker Hub: &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;https://hub.docker.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python 3.*&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tesseract OCR: Install both the python module and core libraries:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install tesseract-ocr
pip install pytesseract
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pillow for image processing:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pillow
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
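Before moving on, it can save time to confirm that the Python modules and the Tesseract binary are actually visible to your interpreter. This check is my own addition, not part of the original setup; the module and binary names are the ones installed above:

```python
import shutil
from importlib import util

def check_modules(modules):
    """Map each module name to whether it can be imported."""
    return {name: util.find_spec(name) is not None for name in modules}

def check_binaries(binaries):
    """Map each executable name to whether it is found on PATH."""
    return {name: shutil.which(name) is not None for name in binaries}

if __name__ == "__main__":
    # For this tutorial, everything here should report True:
    print(check_modules(["pytesseract", "PIL"]))
    print(check_binaries(["tesseract"]))
```

If `tesseract` reports False, `pytesseract.image_to_string` will fail at runtime even though the Python module imports fine, since the module is only a wrapper around the binary.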

&lt;h2&gt;
  
  
  Creating an Azure Function
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Run the following command to create a local function app project called &lt;code&gt;OcrFunctionsProject&lt;/code&gt;. The &lt;code&gt;--docker&lt;/code&gt; option will generate a &lt;code&gt;Dockerfile&lt;/code&gt; that we can edit to install custom libraries and dependencies in the Azure Functions app where the container will be deployed:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func init OcrFunctionsProject --worker-runtime python --docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Navigate into the &lt;code&gt;OcrFunctionsProject&lt;/code&gt; folder and edit the &lt;code&gt;Dockerfile&lt;/code&gt; to look like this:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; mcr.microsoft.com/azure-functions/python:3.0-python3.7&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; AzureWebJobsScriptRoot=/home/site/wwwroot \&lt;/span&gt;
AzureFunctionsJobHost__Logging__Console__IsEnabled=true
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt /&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /requirements.txt
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;tesseract-ocr
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /home/site/wwwroot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;
This allows us to install Tesseract OCR in the base Debian image on which the Azure Function will run. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a function to the project using the following command. The &lt;code&gt;--name&lt;/code&gt; option specifies a unique name for the function and &lt;code&gt;--template&lt;/code&gt; specifies the trigger. In our case we want our function to run in response to an HTTP trigger.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func new --name HttpOcrFunc --template "HTTP trigger"
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add &lt;code&gt;pytesseract&lt;/code&gt; and &lt;code&gt;pillow&lt;/code&gt; on new lines in the &lt;code&gt;requirements.txt&lt;/code&gt; file so that the modules are automatically installed once our container is deployed to the Azure Functions app in the cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Test the new function locally by running the following command in the project root folder:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func start
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;
Navigate to the &lt;code&gt;HttpOcrFunc&lt;/code&gt; endpoint URL shown in the terminal output; if the response contains 'This HTTP triggered function executed successfully...', then we're good to go.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Edit the &lt;code&gt;OcrFunctionsProject/HttpOcrFunc/__init__.py&lt;/code&gt; file and add the following code:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytesseract&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;azure.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HttpRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Python HTTP trigger function processed a request.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# test code for OCR
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/1.jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pytesseract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/1.jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text Extracted from Image: {}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;
Since the instance on which our function will be deployed has a read-only filesystem, we persist the uploaded image to the writable &lt;code&gt;/tmp/&lt;/code&gt; directory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
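A side note on that `/tmp` write: hard-coding `/tmp/1.jpg` works, but concurrent invocations of the function could overwrite each other's files. As a sketch (my own suggestion, not the article's code), Python's `tempfile` module hands each request a unique writable path:

```python
import tempfile

def save_upload(data: bytes, suffix: str = ".jpg") -> str:
    """Write uploaded bytes to a unique temporary file and return its path."""
    # delete=False so the file survives after close and can be reopened by PIL
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

Inside the function body you could then call `path = save_upload(file.read())` and pass `path` to `Image.open` instead of the fixed filename.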

&lt;h2&gt;
  
  
  Build and push Docker container
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Build the Docker image for the container described by our &lt;code&gt;Dockerfile&lt;/code&gt;. Remember to replace &lt;code&gt;&amp;lt;YOUR_DOCKER_HUB_ID&amp;gt;&lt;/code&gt; with your own Docker ID:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build --tag &amp;lt;YOUR_DOCKER_HUBB_ID&amp;gt;/ocrfunctionsimage:v1.0.0 .
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Test the build by running the following command:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 8080:80 -it &amp;lt;YOUR_DOCKER_HUB_ID&amp;gt;/ocrfunctionsimage:v1.0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;
Navigate to &lt;code&gt;http://localhost:8080&lt;/code&gt; and you should see a placeholder page that says: "Your Functions 3.0 app is up and running". We can't test the function running in this container yet, because it requires an access key that hasn't been generated: we haven't associated this local project with an Azure Functions app in the cloud.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Push this image to Docker Hub by logging in:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker login
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br&gt;
and pushing:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker push &amp;lt;YOUR_DOCKER_HUB_ID&amp;gt;/ocrfunctionsimage:v1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Create Azure Functions App
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Go to Azure portal: &lt;a href="https://portal.azure.com/" rel="noopener noreferrer"&gt;https://portal.azure.com/&lt;/a&gt; and create a new Resource Group by clicking on Create a resource and searching for Resource group. Give it a unique name, select your preferred location and click Review + Create. A resource group is a container of related resources for a specific Azure cloud solution. In our case, we will group an Azure Functions app and a Storage account in our resource group.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the portal home, click Create a resource, then search for and select Function App. In the Basics tab, select your subscription, the resource group you just created and a unique Function App name; for the Publish field, select Docker Container, then choose a region. Select Next: Hosting. In the Hosting tab, create a new storage account, select a plan and click Review + Create. Finally, select Create and, on the new page, select Go to resource.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F1.png" alt="alt text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F2.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the Function App page, go to Container settings in the sidebar, select Docker Hub as the image source, enter &lt;code&gt;YOUR_DOCKER_HUB_ID/ocrfunctionsimage:v1.0.0&lt;/code&gt; in the Full Image Name and Tag field, and click Save.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F3.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to Functions on the sidebar, select the HttpOcrFunc function and click on Get Function URL.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F4.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download Postman from &lt;a href="https://www.postman.com/" rel="noopener noreferrer"&gt;https://www.postman.com/&lt;/a&gt; and create a new GET request. Paste the Function URL in the request endpoint field; in the Body tab, select form-data, add a key called &lt;code&gt;file&lt;/code&gt;, change its type from Text to File, and in the value field upload any image containing text. Click Send!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
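If you would rather script the request than click through Postman, the multipart/form-data body that Postman builds can be assembled with the standard library alone. This is a sketch of my own, not part of the tutorial; `send_image` expects the function URL (including the `?code=...` key) copied from the portal:

```python
import io
import uuid
import urllib.request

def build_multipart(field: str, filename: str, payload: bytes):
    """Assemble a multipart/form-data body with a single file part,
    mirroring what Postman's form-data tab sends."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'.encode()
    )
    buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
    buf.write(payload)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

def send_image(function_url: str, image_path: str) -> str:
    """Send an image to the function and return the response text."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart("file", image_path, fh.read())
    req = urllib.request.Request(
        function_url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",  # the default HTTP trigger template accepts GET and POST
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```

Note the `method="POST"`: urllib does not send a body with GET, but the generated HTTP trigger binding accepts both methods by default, so the function fires either way.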

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F5.PNG" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you followed all steps in this tutorial, you should get the extracted text as a response on Postman.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logs!
&lt;/h2&gt;

&lt;p&gt;Logs are an important part of serverless functions: they help troubleshoot errors in the code that might prevent the expected output. You can view logs in the Azure Functions app portal by clicking Functions in the sidebar, selecting the relevant function and then clicking Monitoring; however, these logs are often very slow to appear after the initial request is made.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fomer-public.s3.eu-west-2.amazonaws.com%2F6.PNG" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A better way to view logs is to test your function locally. This can be done as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, navigate to &lt;code&gt;OcrFunctionsProject/HttpOcrFunc/function.json&lt;/code&gt; and change the value of the &lt;code&gt;authLevel&lt;/code&gt; key in httpTrigger bindings to "anonymous".&lt;/li&gt;
&lt;li&gt;Build the image using &lt;code&gt;docker build&lt;/code&gt; as described in a previous step.&lt;/li&gt;
&lt;li&gt;Run the container image using &lt;code&gt;docker run&lt;/code&gt; as described in a previous step.&lt;/li&gt;
&lt;li&gt;Use Postman to send a request to &lt;a href="http://localhost:8080/api/HttpOcrFunc" rel="noopener noreferrer"&gt;http://localhost:8080/api/HttpOcrFunc&lt;/a&gt; with an image file attached as form-data, as described in a previous step. You will get immediate results from the local container, as well as logs directly in your terminal. Once you're done testing locally, change &lt;code&gt;authLevel&lt;/code&gt; back to "function", rebuild the image, push it to Docker Hub and save the new image in the Azure Functions app portal container settings.&lt;/li&gt;
&lt;/ol&gt;
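Flipping `authLevel` back and forth by hand is easy to forget. As a small convenience (my own sketch, assuming the standard layout of the generated `function.json` with its `bindings` list), a few lines of Python can toggle it:

```python
import json
from pathlib import Path

def set_auth_level(function_json: Path, level: str) -> None:
    """Set authLevel on the httpTrigger binding in a function.json file."""
    config = json.loads(function_json.read_text())
    for binding in config.get("bindings", []):
        if binding.get("type") == "httpTrigger":
            # "anonymous" for local testing, "function" for the cloud deployment
            binding["authLevel"] = level
    function_json.write_text(json.dumps(config, indent=2))
```

For example, `set_auth_level(Path("HttpOcrFunc/function.json"), "anonymous")` before local testing, and back to `"function"` before rebuilding and pushing.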

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We now have a fully functioning Python OCR Docker container deployed to an Azure Function. We can trigger the function with an HTTP GET request to its public endpoint URL, attaching an image file that Tesseract OCR parses in the cloud function to extract text and return it in the response. We can create more functions and play around with triggers, for example running a function in response to an entry in a relational database: &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-triggers-bindings?tabs=csharp" rel="noopener noreferrer"&gt;https://docs.microsoft.com/en-us/azure/azure-functions/functions-triggers-bindings?tabs=csharp&lt;/a&gt;&lt;br&gt;
We can also persist an output file to Blob Storage using the Azure Blob Storage Python SDK: &lt;a href="https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python" rel="noopener noreferrer"&gt;https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python&lt;/a&gt;&lt;br&gt;
The sky is the limit: once the infrastructure is set up, you can easily edit the function's Python code, build the container, test it locally, push it to Docker Hub and deploy it to Azure Functions. I hope this tutorial was useful for anyone looking to spin up a custom container on Azure Functions.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>python</category>
      <category>docker</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
