Retiago Drago

Quick Start Guide to Amundsen Demo 🚀

Introduction 🌟

Amundsen is an advanced data discovery and metadata engine designed to boost the productivity of data analysts, data scientists, and engineers when interacting with data.

[Image: data discovery fact]

It achieves this by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g., highly queried tables show up earlier than less queried tables).

[Image: Amundsen landing page in the demo]

In simple terms, it's like Google search for data. The project is named after Norwegian explorer Roald Amundsen, who led the first expedition to reach the South Pole.

[Image: Amundsen architecture]

Amundsen is hosted by the LF AI & Data Foundation. Its core consists of three microservices, one data ingestion library, and one common library; the repository also hosts two supporting libraries (amundsengremlin and amundsenrds):

| Amundsen Libraries | Description |
| --- | --- |
| amundsenfrontendlibrary | A Flask application with a React frontend. |
| amundsensearchlibrary | A search service that leverages Elasticsearch for search capabilities. |
| amundsenmetadatalibrary | A metadata service that leverages Neo4j or Apache Atlas as the persistent layer to provide various metadata. |
| amundsendatabuilder | A data ingestion library for building the metadata graph and search index. |
| amundsencommon | A common library that holds code shared among Amundsen's microservices. |
| amundsengremlin | A library that holds code for converting model objects into vertices and edges in Gremlin, used for loading data into an AWS Neptune backend. |
| amundsenrds | Contains ORM models to support a relational database as the metadata backend store in Amundsen. |

Check out their GitHub for more information.

How does it work? 🛠️

Discover trusted data 📊

[Image: search result]

Search for data within your organization with a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard.
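
To make the idea of usage-weighted ranking concrete, here is a minimal sketch. It is not Amundsen's actual search implementation (that lives in the Elasticsearch-backed search service); the `TableDoc` class, the weights, and the sample tables are all hypothetical and only illustrate how a text match can be boosted by usage.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TableDoc:
    """A hypothetical, simplified search document for one table."""
    name: str
    description: str = ""
    tags: list = field(default_factory=list)
    total_usage: int = 0  # e.g. how many times the table was queried


def score(doc: TableDoc, term: str) -> float:
    """Combine a crude text match with a log-damped usage boost."""
    term = term.lower()
    text_score = 0.0
    if term in doc.name.lower():
        text_score += 3.0  # illustrative weight: name matches count most
    if term in doc.description.lower():
        text_score += 1.0
    if any(term in tag.lower() for tag in doc.tags):
        text_score += 2.0
    if text_score == 0.0:
        return 0.0  # no textual match, so not a result at all
    return text_score * (1.0 + math.log1p(doc.total_usage))


def search(docs, term):
    """Return the names of matching tables, highest score first."""
    scored = [(score(doc, term), doc.name) for doc in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [name for value, name in ranked if value > 0]


# Hypothetical sample data.
docs = [
    TableDoc("orders_raw", "raw order events", ["orders"], total_usage=3),
    TableDoc("orders_clean", "deduplicated orders", ["orders", "gold"], total_usage=950),
    TableDoc("users", "user dimension", ["core"], total_usage=5000),
]
print(search(docs, "orders"))  # → ['orders_clean', 'orders_raw']
```

The heavily used `users` table does not appear because it never matches the search term; usage only reorders results that already match.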

See automated and curated metadata 📚

[Image: metadata example]

Build trust in data using automated and curated metadata: descriptions of tables and columns, other frequent users, when the table was last updated, statistics, a preview of the data if permitted, and more. Easily triage issues by linking to the ETL job and code that generated the data.

Share context with coworkers 👥

[Image: data resources]

Update tables and columns with descriptions to reduce unnecessary back-and-forth about which table to use and what a column contains.

Learn from others 👩‍🎓

[Image: user details]

See what data your co-workers frequently use, own, or have bookmarked. Learn what the most common queries for a table look like by seeing dashboards built on a given table.

Check out Amundsen's website and their documentation for more information.

Installation ⚙️

Ensure you have at least 3GB of disk space available to Docker. You'll need to install docker and docker-compose.

Follow these guides to install Docker based on your operating system:

And here's a guide on how to install docker-compose for all systems.

You can check your current Docker version with this command:



docker -v



And to check your docker-compose version, use this command:



docker-compose -v
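
If you'd rather script this check, the following is a minimal sketch using only the Python standard library. The helper name `tool_version` is my own, and it simply runs the same `-v` commands shown above; it returns `None` when a tool is not on your PATH.

```python
import shutil
import subprocess
from typing import Optional


def tool_version(command: str, flag: str = "-v") -> Optional[str]:
    """Return the first line printed by `<command> <flag>`, or None if unavailable."""
    if shutil.which(command) is None:
        return None  # the executable is not on PATH
    try:
        result = subprocess.run(
            [command, flag], capture_output=True, text=True, timeout=15, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


for tool in ("docker", "docker-compose"):
    version = tool_version(tool)
    print(f"{tool}: {version if version else 'not found'}")
```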



We'll be using WSL2 for this guide. Start by cloning the Amundsen repo and its submodules:



git clone --recursive https://github.com/amundsen-io/amundsen.git



Next, enter the cloned directory:



cd amundsen



If this is your first time, make sure you've allocated the necessary memory. The minimum needed for all the containers to run with the loaded sample data is 3GB.

[Image: Docker configuration setting for Amundsen]

If you're using WSL2, you can check your allocation through .wslconfig. Follow this guide to set your .wslconfig for WSL2.

As an example, here's the .wslconfig I use:



[wsl2]
memory=6GB
processors=2
swap=3GB
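
As a quick sanity check, a file like this can be read with Python's `configparser` to confirm the memory setting meets the 3GB minimum mentioned above. This is only a sketch: the helper names are hypothetical, it assumes sizes are written as whole numbers followed by `MB` or `GB`, and it parses an inline copy of the example rather than your real file.

```python
import configparser
import re
from typing import Optional

# Inline copy of the example above; point this at your own file's text instead.
SAMPLE = """\
[wsl2]
memory=6GB
processors=2
swap=3GB
"""

_UNITS_IN_GB = {"MB": 1 / 1024, "GB": 1.0}


def size_in_gb(value: str) -> float:
    """Convert a size such as '6GB' or '512MB' to gigabytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(MB|GB)\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(match.group(1)) * _UNITS_IN_GB[match.group(2).upper()]


def wsl_memory_gb(config_text: str) -> Optional[float]:
    """Return the [wsl2] memory limit in GB, or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_option("wsl2", "memory"):
        return None
    return size_in_gb(parser.get("wsl2", "memory"))


memory = wsl_memory_gb(SAMPLE)
print(f"memory limit: {memory} GB, enough for the demo: {memory is not None and memory >= 3}")
```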



If you've changed the configuration, restart WSL so the changes take effect (running wsl --shutdown from PowerShell or restarting your PC both work). If no changes were necessary, proceed to the next step.

For this demo, we'll use the Neo4j backend. Run the following command:



docker-compose -f docker-amundsen.yml up
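
The containers take a little while to come up. One way to tell when they are reachable is to poll their ports, sketched below with the standard library. The function names are my own, and the ports checked are the two used later in this guide (5000 for the UI, 7474 for the Neo4j browser). Note that an open port only means something is listening, not that the service has finished initializing.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(ports, host="localhost", attempts=30, delay=2.0):
    """Poll each named port until all are open or the attempts run out."""
    status = {name: False for name in ports}
    for attempt in range(attempts):
        for name, port in ports.items():
            status[name] = status[name] or port_is_open(host, port)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # give the containers time to start
    return status


# Short run for illustration; raise attempts/delay when waiting on a cold start.
print(wait_for_ports({"frontend": 5000, "neo4j": 7474}, attempts=3, delay=1.0))
```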



In a separate terminal window, change your directory to databuilder to ingest the provided sample data into Neo4j:



cd databuilder



Install the dependencies in a virtual environment. For this, we'll use pyenv, a tool for managing multiple Python versions, and its plugin pyenv-virtualenv for managing multiple virtual environments. If you don't have these installed, check out these guides on how to install pyenv and how to create a virtual environment with pyenv.

Check your pyenv versions:



pyenv versions



Activate the environment in the current directory. In my case, my virtual environment is called amundsen_demo:



pyenv local amundsen_demo
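
To double-check that the environment is actually active before installing anything, you can ask the interpreter itself. This is a small sketch (the function name is hypothetical) that relies on the way venv and virtualenv environments change `sys.prefix`:

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs inside a venv or virtualenv environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(f"interpreter: {sys.executable}")
print(f"virtual environment active: {in_virtual_environment()}")
```

Run it with python3 from the databuilder directory; the interpreter path should point into your pyenv environment.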



First, upgrade pip, the package installer for Python:



pip3 install --upgrade pip



Next, install the Python packages listed in the requirements.txt file. This file contains a list of dependencies required by a Python project:



pip3 install -r requirements.txt
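
If you're curious what a requirements file will pull in before installing, the sketch below lists the package names it mentions. The parser is deliberately simplified (it ignores pip options, extras, and URLs), the function name is my own, and the sample content is hypothetical rather than Amundsen's actual requirements.

```python
import re

# Matches the leading project name, e.g. "requests" in "requests>=2.25".
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_names(text: str) -> list:
    """Return the package names listed in requirements-style text."""
    names = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as "-r other.txt"
        match = _NAME.match(line)
        if match:
            names.append(match.group(0))
    return names


# Hypothetical example content, not Amundsen's actual requirements.
sample = """\
# core dependencies
requests>=2.25
pyhocon==0.3.57  # pinned
-r extra.txt

Jinja2
"""
print(requirement_names(sample))  # → ['requests', 'pyhocon', 'Jinja2']
```

To inspect the real file, pass its contents instead, for example `requirement_names(open("requirements.txt").read())`.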



Then, install the Amundsen Databuilder package using its setup script:



python3 setup.py install



Then load the sample data into Neo4j and Elasticsearch, without using an Airflow DAG (Directed Acyclic Graph), by running this script:



python3 example/scripts/sample_data_loader.py



The script consists of several jobs:

| Amundsen Jobs | Description |
| --- | --- |
| run_csv_job | Reads table data from a CSV file, writes the data to another local directory as a CSV file, and then publishes the data to Neo4j, a graph database management system. |
| run_table_column_job | Similar to run_csv_job, but processes a CSV file containing column data instead. |
| create_last_updated_job | Creates a job that gets the current time, converts it into a predefined data model, and publishes it to Neo4j. |
| create_es_publisher_sample_job | Creates a job that extracts data from Neo4j and publishes it to Elasticsearch, a search and analytics engine. |

The script imports necessary modules, sets up configuration, and uses various extractors, loaders, and publishers from the Amundsen Databuilder library to perform the tasks mentioned above.
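
The extract, load, publish flow that these jobs share can be illustrated with a toy pipeline. The classes below are hypothetical stand-ins written for this post, not the real Amundsen Databuilder classes: the "staging area" is a Python list instead of intermediate CSV files, and the "database" is a dict instead of Neo4j. The sample descriptions are made up as well.

```python
import csv
import io
from typing import Optional


class CsvExtractor:
    """Yields one dict per CSV row (stand-in for an extractor)."""

    def __init__(self, csv_text: str) -> None:
        self._rows = csv.DictReader(io.StringIO(csv_text))

    def extract(self) -> Optional[dict]:
        return next(self._rows, None)  # None signals "no more records"


class StagingLoader:
    """Collects records in a staging list (stand-in for intermediate files)."""

    def __init__(self) -> None:
        self.staged = []

    def load(self, record: dict) -> None:
        self.staged.append(record)


class InMemoryPublisher:
    """Copies staged records into a dict keyed by table name (stand-in for Neo4j)."""

    def __init__(self) -> None:
        self.store = {}

    def publish(self, staged: list) -> None:
        for record in staged:
            self.store[record["name"]] = record


def run_job(extractor, loader, publisher) -> None:
    """Extract every record, stage it, then publish the staged batch in one step."""
    record = extractor.extract()
    while record is not None:
        loader.load(record)
        record = extractor.extract()
    publisher.publish(loader.staged)


SAMPLE_CSV = (
    "name,description\n"
    "test_table1,first sample table\n"
    "test_table2,second sample table\n"
)
publisher = InMemoryPublisher()
run_job(CsvExtractor(SAMPLE_CSV), StagingLoader(), publisher)
print(sorted(publisher.store))  # → ['test_table1', 'test_table2']
```

Separating the staging step from the publishing step means nothing reaches the target store until extraction has finished.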

[Image: sample data loader output]

Now you can view the UI at http://localhost:5000 and try searching for test. You should get some results.

[Image: search results for test in the UI]

You can also perform an exact-match search for a table entity. For instance, search for test_table1 in the table field of the filter, and it'll return the matching records.

To verify the dummy data has been ingested into Neo4j, visit http://localhost:7474/browser/

[Image: Neo4j browser authentication]

and run this in the query box:



MATCH (n:Table) RETURN n LIMIT 25



[Image: query results]

You can verify the data has been loaded into the metadataservice by visiting:

test table1

test table2

Finally, don't forget to stop your running multi-container app once you've finished using it:



docker-compose -f docker-amundsen.yml down



The End 🏁

That concludes our quick start guide to setting up and running a demo of Amundsen. I hope you found this post helpful and informative. If you have any questions or if there's something you'd like to know more about, feel free to drop a comment below or reach out to me directly.

Remember, there's no limit to what you can achieve with the right tools and a little bit of know-how. Keep exploring, keep learning, and as always, keep pushing the boundaries of what's possible with data.

If you want to stay updated with my latest posts and activities, or if you just want to connect, follow me on Beacons:

Take a look 👀

Happy coding, everyone! 🚀