Ivica Kolenkaš
Unable to emit metadata to DataHub GMS with Airflow - a solution

DataHub is a popular open-source data catalog and its Lineage feature is one of its highlights.

Doing ingestion or data processing with Airflow, a very popular open-source platform for developing and running workflows, is a fairly common setup. DataHub's automatic lineage extraction works great with Airflow - provided you configure the Airflow connection to DataHub correctly.

This article shows how to resolve the infamous Unable to emit metadata to DataHub GMS when using the DataHub Airflow plugin.

datahub.configuration.common.OperationalError: (
'Unable to emit metadata to DataHub GMS',
{
  'message': '404 Client Error: Not Found for url: https://my-datahub-host.net/aspects?action=ingestProposal'
})

TL;DR

URL-encode the host portion of your Airflow connection string:

  • correct: datahub-rest://my-datahub-host.net%2Fapi%2Fgms
  • incorrect: datahub-rest://my-datahub-host.net/api/gms
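If you build the connection string yourself, the Python standard library can do the encoding for you. A minimal sketch using `urllib.parse.quote` (the host value is the example used throughout this article):

```python
from urllib.parse import quote

# Percent-encode the host (including its path) so Airflow treats
# "my-datahub-host.net/api/gms" as one host value instead of
# splitting it at the first "/".
host = "my-datahub-host.net/api/gms"
encoded = quote(host, safe="")  # safe="" encodes "/" as well

print(f"datahub-rest://{encoded}")
# -> datahub-rest://my-datahub-host.net%2Fapi%2Fgms
```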

The problem - 404, wrong URL

The problem seems obvious from the error message: 404 Not Found, indicating that the URL is wrong or does not exist.

A quick glance at the DataHub API docs shows that the REST API is available at https://my-datahub-host.net/api/gms.

Compare that to the URL reported in the error message above and it is obvious that /api/gms is missing from our Airflow connection string - woohooo!
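To see why the request ends up at the wrong URL, here is a minimal sketch (a hypothetical helper for illustration, not DataHub's actual emitter code) of how a REST emitter appends its endpoint path to whatever host it is given:

```python
# Hypothetical illustration: the emitter appends the ingest endpoint
# to the configured host. If the host is missing /api/gms, the
# resulting URL points at a route that does not exist -> 404.
def ingest_url(host: str) -> str:
    return f"{host}/aspects?action=ingestProposal"

# Host parsed from a misconfigured connection (no /api/gms):
print(ingest_url("https://my-datahub-host.net"))
# -> https://my-datahub-host.net/aspects?action=ingestProposal  (404)

# Host with the API path included:
print(ingest_url("https://my-datahub-host.net/api/gms"))
# -> https://my-datahub-host.net/api/gms/aspects?action=ingestProposal
```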

A developer celebrating a new error message

So I take a quick look at HashiCorp Vault, which we use as the external connection store in our Airflow deployments, and the connection string looks just fine to me - /api/gms is there.

datahub-rest://https://:TOKEN@my-datahub-host.net/api/gms

Let's check how Airflow "understands" the connection because that is what it will use:

airflow connections get datahub_rest_default --output yaml

which outputs

# shortened for brevity
- conn_id: datahub_rest_default
  conn_type: datahub_rest
  host: https://my-datahub-host.net
  schema: 'api/gms'

Two issues pop out immediately:

  • schema should only have http or https in it, not /api/gms (source)
  • host value is missing the /api/gms path

To understand why the connection is not parsed properly, let's look at what a Connection looks like under the hood.

Anatomy of an Airflow connection

An Airflow Connection object (source) is very well documented so I won't repeat that; use the source, Luke:

class Connection:
    """
    A connection to an external data source.

    :param conn_id: The connection ID.
    :param conn_type: The connection type.
    :param description: The connection description.
    :param host: The host.
    :param login: The login.
    :param password: The password.
    :param schema: The schema.
    :param port: The port number.
    :param extra: Extra metadata. Non-standard data such as private/SSH keys can be saved here. JSON
        encoded object.
    """

It is good to know that a Connection can also be represented as a connection string (also called a URI).

Because we are dealing with an HTTP connection, the HOST consists of the full URL, including the path; for example, a connection URI for the Google Images page would look like http://https://google.com/imghp.
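The encoding matters because a URI parser splits the authority from the path at the first unescaped /. A quick illustration with the standard library's `urlsplit` (not Airflow's exact parsing code, but the same principle applies):

```python
from urllib.parse import unquote, urlsplit

# Unencoded "/" in the host: the parser splits at the first slash,
# so "/api/gms" falls out of the host (netloc) and into the path.
bad = urlsplit("datahub-rest://my-datahub-host.net/api/gms")
print(bad.netloc, bad.path)  # my-datahub-host.net /api/gms

# Percent-encoded "/": the whole value stays in the host, and
# decoding it yields the full URL including the API path.
good = urlsplit("datahub-rest://my-datahub-host.net%2Fapi%2Fgms")
print(unquote(good.netloc))  # my-datahub-host.net/api/gms
```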

Airflow connection parsing

DataHub's DatahubRestHook (source) is based on Airflow's BaseHook (source) so it inherits this method:

@classmethod
def get_connection(cls, conn_id: str) -> Connection:
    """
    Get connection, given connection id.

    :param conn_id: connection id
    :return: connection
    """
    # shortened for brevity
    conn = ConnectionModel.get_connection_from_secrets(conn_id)
    return conn

From it, several levels down the code path, we find a function that parses the URI and turns it into a Connection object. To simplify the demo below, I've extracted a couple of the URI-parsing functions into this Gist.

Using that code standalone (without Airflow) shows exactly what's wrong with the connection:

Connection string without URL-encoding it first:

from airflow_connection_parse import parse_from_uri

parse_from_uri("datahub-rest://https://:TOKEN@my-datahub-host.net/api/gms")
# Host: https://my-datahub-host.net
# Schema: api/gms

Connection string with the URL (host) being URL-encoded:

from airflow_connection_parse import parse_from_uri

parse_from_uri("datahub-rest://https://:TOKEN@my-datahub-host.net%2Fapi%2Fgms")
# Host: https://my-datahub-host.net/api/gms
# Schema:

schema is empty and the host contains /api/gms, as it should according to DataHub's Airflow integration docs.


URL-encode your connection strings if you're creating them outside of the Airflow ecosystem - using the Airflow UI (not recommended for production) or the airflow CLI will take care of that for you.

In our case, all the connections are managed with Terraform and the code for it missed a simple urlencode(MY_HOST_HERE) function call.
