DataHub is a popular open-source data catalog and its Lineage feature is one of its highlights.
Doing ingestion or data processing with Airflow, a very popular open-source platform for developing and running workflows, is a fairly common setup. DataHub's automatic lineage extraction works great with Airflow - provided you configure the Airflow connection to DataHub correctly.
This article shows how to resolve the infamous "Unable to emit metadata to DataHub GMS" error when using the DataHub Airflow plugin:
datahub.configuration.common.OperationalError: (
'Unable to emit metadata to DataHub GMS',
{
'message': '404 Client Error: Not Found for url: https://my-datahub-host.net/aspects?action=ingestProposal'
})
TL;DR
URL-encode the host portion of your Airflow connection string:
- correct: datahub-rest://my-datahub-host.net%2Fapi%2Fgms
- incorrect: datahub-rest://my-datahub-host.net/api/gms
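For connection strings built by hand, Python's standard library can do the encoding. A minimal sketch, assuming the host and path used in this article; build_conn_uri is a hypothetical helper name, not part of Airflow or DataHub:

```python
from urllib.parse import quote


def build_conn_uri(host_with_path: str, conn_type: str = "datahub-rest") -> str:
    """Return a connection URI whose host part is fully percent-encoded."""
    # safe="" makes quote() encode "/" as well; by default "/" is left as-is.
    return f"{conn_type}://{quote(host_with_path, safe='')}"


print(build_conn_uri("my-datahub-host.net/api/gms"))
# datahub-rest://my-datahub-host.net%2Fapi%2Fgms
```

The important detail is safe="": without it, quote() leaves the slashes untouched and the result is the "incorrect" variant again.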
The problem - 404, wrong URL
The problem is obvious from the error message: 404 Not Found means that the requested URL does not exist or is wrong.
A quick glance at the DataHub API docs shows that the REST API is available at https://my-datahub-host.net/api/gms.
Compare that to the URL reported in the error message above and it is obvious that /api/gms is missing from our Airflow connection string - woohooo!
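Judging purely by the error message, the emitter appears to append /aspects?action=ingestProposal to whatever server URL it is given. The sketch below reproduces that with plain string handling; ingest_url is a hypothetical helper for illustration, not DataHub's actual code:

```python
def ingest_url(server: str) -> str:
    """Append the endpoint seen in the error message to a server URL."""
    return f"{server.rstrip('/')}/aspects?action=ingestProposal"


# Server URL without the API path: this is the failing URL from the error.
print(ingest_url("https://my-datahub-host.net"))
# https://my-datahub-host.net/aspects?action=ingestProposal

# Server URL from the API docs: the request would go under /api/gms.
print(ingest_url("https://my-datahub-host.net/api/gms"))
# https://my-datahub-host.net/api/gms/aspects?action=ingestProposal
```

Comparing the two outputs with the error message makes it clear that the plugin was handed the server URL without its /api/gms path.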
So I quickly look at HashiCorp Vault, which we use as the external connections store in our Airflow deployments, and the connection string looks just fine to me - /api/gms is there.
datahub-rest://https://:TOKEN@https%3A%2F%2Fmy-datahub-host.net/api/gms
Let's check how Airflow "understands" the connection because that is what it will use:
airflow connections get datahub_rest_default --output yaml
which outputs
# shortened for brevity
- conn_id: datahub_rest_default
  conn_type: datahub_rest
  host: https://my-datahub-host.net
  schema: 'api/gms'
Two issues pop out immediately:
- schema should only have http or https in it, not /api/gms (source)
- host value is missing the /api/gms path
To understand why the connection is not parsed properly, let's look at what Connections look like under the hood.
Anatomy of an Airflow connection
An Airflow Connection object (source) is very well documented so I won't repeat that; use the source, Luke:
class Connection:
    """
    A connection to an external data source.

    :param conn_id: The connection ID.
    :param conn_type: The connection type.
    :param description: "The connection description."
    :param host: The host.
    :param login: The login.
    :param password: The password.
    :param schema: The schema.
    :param port: The port number.
    :param extra: Extra metadata. Non-standard data such as private/SSH keys can be saved here. JSON
        encoded object.
    """
It is good to know that a Connection can also be represented as a connection string (also called a URI).
Because we are dealing with an HTTP connection, the HOST consists of the full URL, including the path; for example, a connection URI for the Google Images page would look like http://https://google.com/imghp.
Airflow connection parsing
DataHub's DatahubRestHook (source) is based on Airflow's BaseHook (source) so it inherits this method:
@classmethod
def get_connection(cls, conn_id: str) -> Connection:
    """
    Get connection, given connection id.

    :param conn_id: connection id
    :return: connection
    """
    # shortened for brevity
    conn = ConnectionModel.get_connection_from_secrets(conn_id)
    return conn
From it, and several levels down the code path, we find a function that parses the URI to turn it into a Connection object. To simplify the demo below, I've extracted the couple of functions that parse the URI into this Gist.
Using that code standalone (without Airflow) shows exactly what's wrong with the connection:
Connection string without URL-encoding it first:
from airflow_connection_parse import parse_from_uri
parse_from_uri("datahub-rest://https://:TOKEN@my-datahub-host.net/api/gms")
Host: https://my-datahub-host.net
Schema: api/gms
Connection string with the URL (host) being URL-encoded:
from airflow_connection_parse import parse_from_uri
parse_from_uri("datahub-rest://https://:TOKEN@my-datahub-host.net%2Fapi%2Fgms")
Host: https://my-datahub-host.net/api/gms
Schema:
Now schema is empty and the host contains /api/gms, as it should according to DataHub's Airflow integration docs.
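The same behaviour can be approximated with nothing but the standard library. This is a simplified sketch, not Airflow's actual parser and not the code from the Gist: split_conn_uri is a hypothetical name, and ports, credentials and extras are ignored. It assumes the URI has the conn-type://protocol://... shape used in this article.

```python
from urllib.parse import unquote, urlsplit


def split_conn_uri(uri: str) -> dict:
    """Split a connection URI into conn_type, host and schema (simplified)."""
    # Drop the leading connection type, e.g. "datahub-rest://".
    conn_type, rest = uri.split("://", 1)
    parts = urlsplit(rest)
    # The network location ends at the first unencoded "/"; strip credentials.
    raw_host = parts.netloc.rsplit("@", 1)[-1]
    host = f"{parts.scheme}://{unquote(raw_host)}"
    # Whatever follows the first unencoded "/" ends up as the schema.
    schema = unquote(parts.path[1:])
    return {"conn_type": conn_type, "host": host, "schema": schema}


plain = split_conn_uri("datahub-rest://https://:TOKEN@my-datahub-host.net/api/gms")
print(plain["host"], "|", plain["schema"])
# https://my-datahub-host.net | api/gms

encoded = split_conn_uri("datahub-rest://https://:TOKEN@my-datahub-host.net%2Fapi%2Fgms")
print(encoded["host"], "|", encoded["schema"])
# https://my-datahub-host.net/api/gms |
```

The first unencoded slash is where the host ends, so an unencoded /api/gms is treated as the schema; encoding the slashes as %2F keeps the path inside the host until it is decoded.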
URL-encode your connection strings if you're creating them outside of the Airflow ecosystem; the Airflow UI (not recommended for production) and the airflow CLI take care of that for you.
In our case, all the connections are managed with Terraform and the code for it missed a simple urlencode(MY_HOST_HERE) function call.