
Importing metadata from the AWS Glue data catalog into Apache Atlas with EMR

What is going to be implemented

We will deploy Apache Atlas on Amazon EMR, connecting the Hive catalog directly to the AWS Glue Data Catalog. This lets you dynamically classify your data and see its lineage as it moves through different processes.

Reference architecture

Overview of the services used

Amazon EMR is a managed service that simplifies running big data frameworks such as Apache Hadoop and Apache Spark. When you launch an EMR cluster, you can choose from a predefined set of applications or pick your own from the list.

The Apache Atlas project is a set of core governance services that enables companies to meet their compliance requirements effectively and efficiently across the entire enterprise data ecosystem. Apache Atlas provides metadata management and governance capabilities so organizations can catalog their data assets, classify and govern those assets, and collaborate on them internally. In other words, it helps teams make the lifecycle of their own data transparent. That is why, in some business or architectural solutions, it is important to have these mechanisms of transparency and data governance: they let you extract the most value from what you know about your data, through predictions of different kinds. Examples include market predictions, customer security analysis, and impact campaigns, among many other ways to take advantage of the behavior of your data.

Of the many features Apache Atlas offers, the one this article focuses on is Apache Hive data lineage and metadata management. After Atlas is set up, we will use its native tools to import tables from Hive, analyze the data, and present its lineage intuitively to end users.


Implementation Steps

1.- Create the glue_settings.json configuration file

The first thing we need to do is create a JSON file named glue_settings.json, with the following structure, on our local computer:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
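Before launching the cluster, you can sanity-check the file locally. Any JSON validator works; for example, Python's built-in json.tool (a quick check, assuming the file was saved as glue_settings.json, matching the path used in the create-cluster command later):

```shell
# Validate glue_settings.json before passing it to the EMR CLI.
# json.tool exits non-zero and prints the parse error if the JSON is malformed.
python3 -m json.tool glue_settings.json > /dev/null && echo "glue_settings.json is valid JSON"
```

A malformed configurations file is otherwise only reported when the create-cluster call fails, so this saves a round trip.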

2.- Prepare the AWS environment (review of EMR_DefaultRole and network infrastructure)

This step matters when we launch an EMR cluster for the first time through the AWS CLI, especially for the command we will execute in the following steps. The reason is that an EMR cluster needs service roles assigned to it; the first time you create a cluster from the console, those roles are created automatically with the names EMR_DefaultRole and EMR_EC2_DefaultRole, but the CLI's --use-default-roles flag expects them to already exist.

This is easily solved by launching a test cluster through the AWS Management Console. When you launch a cluster for the first time from the console, it automatically creates EMR_DefaultRole for you, which you can then reuse when bringing up our real cluster.

You can then go directly to the IAM service and check whether the default roles already exist.
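Alternatively, the AWS CLI can create and inspect the default roles directly, without launching a throwaway cluster:

```shell
# Create EMR_DefaultRole and EMR_EC2_DefaultRole in IAM if they do not exist yet.
aws emr create-default-roles

# Confirm the service role is now present in IAM.
aws iam get-role --role-name EMR_DefaultRole --query 'Role.RoleName'
```

If you use this route, there is no test cluster to clean up afterwards.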

IAM Roles

Advanced tip: if you want to run Apache Atlas in a locked-down production scenario, create a dedicated role for EMR with the least privilege possible, and use that role to execute the following steps instead of the defaults.

Don't forget to delete the test cluster you used to create the role.

3.- Prepare parameters to create and run EMR cluster

This step prepares the parameters for the command we will run in the next step. The parameters to define are:

CLUSTER_NAME: the name to give the cluster
INSTANCE_TYPE: the EC2 instance type for each node
INSTANCE_VOL_SIZE: the size in GiB of the EBS root volume configured for the EMR nodes
KEY_NAME: the name of the EC2 key pair used to connect to this EMR cluster
SUBNET_ID: the ID of the subnet this EMR cluster will use
S3_EMR_LOGDIR: the S3 location for the cluster's logs

In my case, the parameters that I will choose are the following:

CLUSTER_NAME=EMR-Atlas
INSTANCE_TYPE=m4.large
INSTANCE_VOL_SIZE=80
KEY_NAME=key-0a97d3c96668decaf
SUBNET_ID=subnet-09de17cf9eb1c56d3
S3_EMR_LOGDIR=s3://aws-logs-39483989-us-east-1/elasticmapreduce/

To obtain the subnet ID, go to the Amazon VPC service and copy the ID of the subnet you are going to use. You can also do it with the AWS CLI. For more information, see: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html
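As a sketch of the CLI route, describe-subnets can list candidate subnets (the VPC ID below is a placeholder; substitute your own, or drop the filter to list everything):

```shell
# List subnet IDs, availability zones, and CIDR ranges in one VPC,
# so you can pick the SUBNET_ID for the cluster.
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
  --query 'Subnets[].{Id:SubnetId,Az:AvailabilityZone,Cidr:CidrBlock}' \
  --output table
```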

4.- Create EMR Cluster with AWS CLI

With everything configured, we create the EMR cluster through the AWS CLI. Note that these steps could also be carried out through the AWS Management Console, mapping each command option to the corresponding setting in the interface. In my case, I find the AWS CLI easier.

The command with all our previously defined configurations would be the following:

aws emr create-cluster \
  --name "$CLUSTER_NAME" \
  --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
  --release-label emr-5.33.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType="$INSTANCE_TYPE" InstanceGroupType=CORE,InstanceCount=1,InstanceType="$INSTANCE_TYPE" \
  --use-default-roles \
  --ebs-root-volume-size "$INSTANCE_VOL_SIZE" \
  --ec2-attributes KeyName="$KEY_NAME",SubnetId="$SUBNET_ID" \
  --log-uri "$S3_EMR_LOGDIR" \
  --configurations file://./glue_settings.json \
  --tags Name="$CLUSTER_NAME" \
  --steps Type=CUSTOM_JAR,Jar=command-runner.jar,ActionOnFailure=TERMINATE_CLUSTER,Args=bash,-c,'curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh'

This will create an EMR cluster that you can monitor if you want from the AWS management console.
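You can also monitor it from the CLI. A sketch, where j-XXXXXXXXXXXXX is a placeholder for the ClusterId returned by create-cluster:

```shell
# Block until the cluster reaches the RUNNING state...
aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXX

# ...then print its current state (it moves to WAITING once the
# bootstrap step that installs Atlas has finished).
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'
```

Keep in mind the Atlas install step runs after the cluster is up, so wait for the step to complete before connecting.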

5.- Modify import-hive.sh script in EMR cluster

Once the cluster is up and running, we must connect to it; there are several ways to do so, and in my case I use an SSH connection. For more information about the steps, see: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node.html
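The SSH connection looks like this (a sketch; the hostname is a placeholder for your cluster's master public DNS, shown on the cluster's Summary tab, and the key file is whichever key pair you assigned in step 3):

```shell
# Connect to the EMR master node as the hadoop user.
ssh -i apache-atlas.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```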

When you are already inside the cluster, you must execute the following commands in order:

sudo cp -ai /apache/atlas/bin/import-hive.sh{,.org}

sudo vim /apache/atlas/bin/import-hive.sh

The first command backs up the original file, and the second opens import-hive.sh for editing. You can of course use any editor other than vim.

When you are inside the import-hive.sh file, you must make the following changes:

You will have to change this line of the file:

CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}"

For this:

CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar"

This adds the AWS Glue Data Catalog Hive client JAR to the script's classpath, so the import can read the Glue catalog directly into Atlas.
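If you prefer not to edit the file interactively, the same change can be made with a sed one-liner (a sketch; it assumes the CP line in your copy of the script matches exactly the line shown above):

```shell
# Append the Glue client JAR to the classpath line in import-hive.sh,
# keeping a .org backup of the original file (like the manual cp above).
sudo sed -i.org \
  's|^CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}"$|CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar"|' \
  /apache/atlas/bin/import-hive.sh
```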

6.- Importing the Glue data catalog to Atlas

Run the modified script to import the Glue metadata into Atlas.

The user is admin and the password is admin:

/apache/atlas/bin/import-hive.sh


Enter username for atlas :- admin
Enter password for atlas :-

2021-08-25T13:58:23,443 INFO [main] org.apache.atlas.hive.bridge.HiveMetaStoreBridge - Successfully imported 5 tables from database aws_db
Hive Meta Data imported successfully!!!

With this, you have imported the Glue catalog into Atlas.

Advanced tip: the import is a one-time snapshot, not continuous. To keep Atlas in sync, you have to re-run the import-hive.sh script whenever the Glue catalog changes.
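One way to automate that re-run is a cron job on the master node. This is only a sketch: it assumes the default admin/admin credentials and that the script's interactive prompts read from stdin, so the two lines piped in answer them.

```shell
# Hypothetical crontab entry (crontab -e on the master node):
# re-import the Glue catalog into Atlas every day at 02:00,
# feeding the Atlas username and password to the script's prompts.
0 2 * * * printf 'admin\nadmin\n' | /apache/atlas/bin/import-hive.sh >> /tmp/atlas-import.log 2>&1
```

For anything beyond a demo, store the credentials somewhere safer than the crontab itself.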

7.- Connection to Atlas

Finally, you must open an SSH tunnel from your local machine to the EMR master node, giving you a local endpoint for the Atlas web interface. To do so, run the following command:

ssh -i apache-atlas.pem -vnNT -L 21000:localhost:21000 hadoop@{ip_of_your_cluster}

You can then open the interface at:

http://localhost:21000

The login screen will be displayed as shown below. Log in with user admin and password admin.

Apache Atlas login

Once inside the interface, search for hive_table and you will find the information from your Glue catalog:
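The same data is reachable over Atlas's REST API through the tunnel, which is handy for scripting. A sketch using the v2 basic search endpoint:

```shell
# Search Atlas for all entities of type hive_table, as JSON.
curl -s -u admin:admin \
  'http://localhost:21000/api/atlas/v2/search/basic?typeName=hive_table'
```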

Apache Atlas interfaces

