
Importing metadata from the AWS Glue data catalog into Apache Atlas with EMR

What is going to be implemented

We will deploy Apache Atlas on Amazon EMR, connecting the Hive catalog directly to the AWS Glue Data Catalog. This lets you dynamically classify your data and see its lineage as it moves through different processes.

Reference architecture

Overview of the services used

Amazon EMR is a managed service that simplifies running big data frameworks such as Apache Hadoop and Apache Spark. When you launch an EMR cluster, you can choose from a predefined set of applications or pick your own from the list.

The Apache Atlas project is a set of core governance services that enables companies to meet their compliance requirements effectively and efficiently across the entire enterprise data ecosystem. Apache Atlas provides metadata management and governance capabilities so organizations can catalog their data assets, classify and govern those assets, and collaborate on them internally. In other words, it helps teams make the lifecycle of their own data transparent. That is why, in some business or architectural solutions, it is important to have these mechanisms of transparency and data governance: they let you extract the most value from what you know about your data, through predictions of different kinds. Examples include market predictions, customer security analysis, and impact campaigns, among many other ways to take advantage of the behavior of your data.

Of the many features Apache Atlas offers, the one this article focuses on is Apache Hive data lineage and metadata management. After Atlas is set up, we will use its native tools to import tables from Hive, analyze the data, and present its lineage intuitively to end users.


Implementation Steps

1.- Create the glue_settings.json configuration file

The first thing we need to do is create a JSON file named glue_settings.json, with the following structure, on our local computer:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
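Before launching the cluster, you can sanity-check the file locally. Any JSON validator works; for example, Python's built-in json.tool (a quick check, assuming the file was saved as glue_settings.json, matching the path used in the create-cluster command later):

```shell
# Validate glue_settings.json before passing it to the EMR CLI.
# json.tool exits non-zero and prints the parse error if the JSON is malformed.
python3 -m json.tool glue_settings.json > /dev/null && echo "glue_settings.json is valid JSON"
```

A malformed configurations file is otherwise only reported when the create-cluster call fails, so this saves a round trip.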

2.- Prepare the AWS environment (review of EMR_DefaultRole and network infrastructure)

This step matters when we launch an EMR cluster for the first time through the AWS CLI, especially for the command we will execute in the following steps. The reason is that an EMR cluster needs service roles assigned to it; the first time you create a cluster from the console, those roles are created automatically with the names EMR_DefaultRole and EMR_EC2_DefaultRole, but the CLI's --use-default-roles flag expects them to already exist.

This is easily solved by launching a test cluster through the AWS Management Console. When you launch a cluster for the first time from the console, it automatically creates EMR_DefaultRole for you, which you can then reuse when bringing up our real cluster.

You can then go directly to the IAM service and check whether the default roles already exist.
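Alternatively, the AWS CLI can create and inspect the default roles directly, without launching a throwaway cluster:

```shell
# Create EMR_DefaultRole and EMR_EC2_DefaultRole in IAM if they do not exist yet.
aws emr create-default-roles

# Confirm the service role is now present in IAM.
aws iam get-role --role-name EMR_DefaultRole --query 'Role.RoleName'
```

If you use this route, there is no test cluster to clean up afterwards.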

IAM Roles

Advanced tip: if you want to run Apache Atlas in a locked-down production scenario, create a dedicated role for EMR with the least privilege possible, and use that role to execute the following steps instead of the defaults.

Don't forget to delete the test cluster you used to create the role.

3.- Prepare parameters to create and run EMR cluster

This step prepares the parameters for the command we will run in the next step. The parameters to define are:

CLUSTER_NAME: the name to give the cluster
INSTANCE_TYPE: the EC2 instance type for each node
INSTANCE_VOL_SIZE: the size in GiB of the EBS root volume configured for the EMR nodes
KEY_NAME: the name of the EC2 key pair used to connect to this EMR cluster
SUBNET_ID: the ID of the subnet this EMR cluster will use
S3_EMR_LOGDIR: the S3 location for the cluster's logs

In my case, the parameters that I will choose are the following:

CLUSTER_NAME=EMR-Atlas
INSTANCE_TYPE=m4.large
INSTANCE_VOL_SIZE=80
KEY_NAME=key-0a97d3c96668decaf
SUBNET_ID=subnet-09de17cf9eb1c56d3
S3_EMR_LOGDIR=s3://aws-logs-39483989-us-east-1/elasticmapreduce/

To obtain the subnet ID, go to the Amazon VPC service and copy the ID of the subnet you are going to use. You can also do it with the AWS CLI. For more information, see: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html
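As a sketch of the CLI route, describe-subnets can list candidate subnets (the VPC ID below is a placeholder; substitute your own, or drop the filter to list everything):

```shell
# List subnet IDs, availability zones, and CIDR ranges in one VPC,
# so you can pick the SUBNET_ID for the cluster.
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
  --query 'Subnets[].{Id:SubnetId,Az:AvailabilityZone,Cidr:CidrBlock}' \
  --output table
```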

4.- Create EMR Cluster with AWS CLI

With everything configured, we create the EMR cluster through the AWS CLI. Note that these steps could also be carried out through the AWS Management Console, mapping each command option to the corresponding setting in the interface. In my case, I find the AWS CLI easier.

The command with all our previously defined configurations would be the following:

aws emr create-cluster \
  --name "$CLUSTER_NAME" \
  --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
  --release-label emr-5.33.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType="$INSTANCE_TYPE" InstanceGroupType=CORE,InstanceCount=1,InstanceType="$INSTANCE_TYPE" \
  --use-default-roles \
  --ebs-root-volume-size "$INSTANCE_VOL_SIZE" \
  --ec2-attributes KeyName="$KEY_NAME",SubnetId="$SUBNET_ID" \
  --log-uri "$S3_EMR_LOGDIR" \
  --configurations file://./glue_settings.json \
  --tags Name="$CLUSTER_NAME" \
  --steps Type=CUSTOM_JAR,Jar=command-runner.jar,ActionOnFailure=TERMINATE_CLUSTER,Args=bash,-c,'curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh'

This will create an EMR cluster that you can monitor if you want from the AWS management console.
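You can also monitor it from the CLI. A sketch, where j-XXXXXXXXXXXXX is a placeholder for the ClusterId returned by create-cluster:

```shell
# Block until the cluster reaches the RUNNING state...
aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXX

# ...then print its current state (it moves to WAITING once the
# bootstrap step that installs Atlas has finished).
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'
```

Keep in mind the Atlas install step runs after the cluster is up, so wait for the step to complete before connecting.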

5.- Modify import-hive.sh script in EMR cluster

Once the cluster is up and running, we must connect to it; there are several ways to do so, and in my case I use an SSH connection. For more information about the steps, see: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node.html
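The SSH connection looks like this (a sketch; the hostname is a placeholder for your cluster's master public DNS, shown on the cluster's Summary tab, and the key file is whichever key pair you assigned in step 3):

```shell
# Connect to the EMR master node as the hadoop user.
ssh -i apache-atlas.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```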

When you are already inside the cluster, you must execute the following commands in order:

sudo cp -ai /apache/atlas/bin/import-hive.sh{,.org}

sudo vim /apache/atlas/bin/import-hive.sh

The first command backs up the original file, and the second opens import-hive.sh for editing. You can of course use any editor other than vim.

When you are inside the import-hive.sh file, you must make the following changes:

You will have to change this line of the file:

CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}"

For this:

CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar"

This adds the AWS Glue Data Catalog Hive client JAR to the script's classpath, so the import can read the Glue catalog directly into Atlas.
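If you prefer not to edit the file interactively, the same change can be made with a sed one-liner (a sketch; it assumes the CP line in your copy of the script matches exactly the line shown above):

```shell
# Append the Glue client JAR to the classpath line in import-hive.sh,
# keeping a .org backup of the original file (like the manual cp above).
sudo sed -i.org \
  's|^CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}"$|CP="${ATLASCPPATH}:${HIVE_CP}:${HADOOP_CP}:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar"|' \
  /apache/atlas/bin/import-hive.sh
```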

6.- Importing the Glue data catalog to Atlas

Run the modified script to import the Glue metadata into Atlas.

The user is admin and the password is admin:

/apache/atlas/bin/import-hive.sh


Enter username for atlas :- admin
Enter password for atlas :-

2021-08-25T13:58:23,443 INFO [main] org.apache.atlas.hive.bridge.HiveMetaStoreBridge - Successfully imported 5 tables from database aws_db
Hive Meta Data imported successfully!!!

With this, you have imported the Glue catalog into Atlas.

Advanced tip: the import is a one-time snapshot, not continuous. To keep Atlas in sync, you have to re-run the import-hive.sh script whenever the Glue catalog changes.
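One way to automate that re-run is a cron job on the master node. This is only a sketch: it assumes the default admin/admin credentials and that the script's interactive prompts read from stdin, so the two lines piped in answer them.

```shell
# Hypothetical crontab entry (crontab -e on the master node):
# re-import the Glue catalog into Atlas every day at 02:00,
# feeding the Atlas username and password to the script's prompts.
0 2 * * * printf 'admin\nadmin\n' | /apache/atlas/bin/import-hive.sh >> /tmp/atlas-import.log 2>&1
```

For anything beyond a demo, store the credentials somewhere safer than the crontab itself.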

7.- Connection to Atlas

Finally, you must open an SSH tunnel from your local machine to the EMR master node, giving you a local endpoint for the Atlas web interface. To do so, run the following command:

ssh -i apache-atlas.pem -vnNT -L 21000:localhost:21000 hadoop@{ip_of_your_cluster}

You can then open the interface at:

http://localhost:21000

The login screen will be displayed as shown below. Log in with user admin and password admin.

Apache Atlas login

Once inside the interface, search for hive_table and you will find the information from your Glue catalog:
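The same data is reachable over Atlas's REST API through the tunnel, which is handy for scripting. A sketch using the v2 basic search endpoint:

```shell
# Search Atlas for all entities of type hive_table, as JSON.
curl -s -u admin:admin \
  'http://localhost:21000/api/atlas/v2/search/basic?typeName=hive_table'
```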

Apache Atlas interfaces

