Retiago Drago

Posted on May 9, 2023 • Edited on Nov 18, 2023

Quick Start Guide to Amundsen Demo 🚀

#beginners #tutorial #datascience #opensource

Outlines

Introduction 🌟

How does it work? 🛠️

Discover trusted data 📊

See automated and curated metadata 📚

Share context with coworkers 👥

Learn from others 👩‍🎓

Installation ⚙️

Miscellaneous 🧩

The End 🏁

Introduction 🌟

Amundsen is an advanced data discovery and metadata engine designed to boost the productivity of data analysts, data scientists, and engineers when interacting with data.

It achieves this by indexing data resources (tables, dashboards, streams, etc.) and powering a PageRank-style search based on usage patterns (e.g., highly queried tables show up earlier than less queried tables).

In simple terms, it's like Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.

Amundsen is hosted by the LF AI & Data Foundation. It includes three microservices, one data ingestion library, and one common library:

Amundsen Libraries	Description
amundsenfrontendlibrary	A Flask application with a React frontend.
amundsensearchlibrary	A search service that leverages Elasticsearch for search capabilities.
amundsenmetadatalibrary	A metadata service that leverages Neo4j or Apache Atlas as the persistent layer to provide various metadata.
amundsendatabuilder	A data ingestion library for building the metadata graph and search index.
amundsencommon	A common library that holds code shared among Amundsen's microservices.
amundsengremlin	A library that holds code used for converting model objects into vertices and edges in gremlin, used for loading data into an AWS Neptune backend.
amundsenrds	Contains ORM models to support relational database as metadata backend store in Amundsen.

Check out their GitHub for more information.

How does it work? 🛠️

Discover trusted data 📊

Search for data within your organization with a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard.
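To make the ranking idea concrete, here is a minimal Python sketch of usage-weighted text search. It is not Amundsen's actual algorithm, which runs inside Elasticsearch. The TableDoc class, the field weights, and the logarithmic usage boost are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TableDoc:
    """Hypothetical stand-in for an indexed table."""
    name: str
    description: str
    tags: list = field(default_factory=list)
    query_count: int = 0  # how often the table was queried


def score(doc, term):
    """Combine a simple text match with a usage boost (weights are assumptions)."""
    term = term.lower()
    text_hits = 0
    if term in doc.name.lower():
        text_hits += 3
    if term in doc.description.lower():
        text_hits += 1
    if any(term == tag.lower() for tag in doc.tags):
        text_hits += 2
    if text_hits == 0:
        return 0.0
    # log1p dampens the boost so very popular tables do not drown out everything else
    return text_hits * (1.0 + math.log1p(doc.query_count))


def search(docs, term):
    """Return matching table names, highest score first."""
    scored = [(score(doc, term), doc) for doc in docs]
    hits = [(s, doc) for s, doc in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc.name for _, doc in hits]


if __name__ == "__main__":
    docs = [
        TableDoc("orders_raw", "raw orders dump", [], 2),
        TableDoc("orders", "orders fact table", ["core"], 5000),
        TableDoc("users", "user dimension", [], 900),
    ]
    print(search(docs, "orders"))
```

With these weights, a heavily queried table outranks a rarely queried one when both match the term equally well, which is the behavior described above.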

See automated and curated metadata 📚

Build trust in data using automated and curated metadata — descriptions of tables and columns, other frequent users, when the table was last updated, statistics, a preview of the data if permitted, etc. Easily triage by linking the ETL job and code that generated the data.

Share context with coworkers 👥

Update tables and columns with descriptions to reduce unnecessary back-and-forth about which table to use and what a column contains.

Learn from others 👩‍🎓

See what data fellow co-workers frequently use, own, or have bookmarked. Learn what the most common queries for a table look like by seeing dashboards built on a given table.

Check out Amundsen's website and their documentation for more information.

Installation ⚙️

Ensure you have at least 3GB of disk space available to Docker. You'll need to install docker and docker-compose.

Follow the Docker installation guide for your operating system, then the docker-compose installation guide, which covers all systems.

You can check your current Docker version with this command:



docker -v

And to check your docker-compose version, use this command:



docker-compose -v
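If you'd rather check both tools in one go, the two commands above can be wrapped in a small Python script. This is only a convenience sketch: the tool_versions helper is a hypothetical name, and it assumes both tools accept the -v flag exactly as shown above.

```python
import shutil
import subprocess


def tool_versions(tools=("docker", "docker-compose")):
    """Map each tool name to its version string, or None if it is not on PATH."""
    versions = {}
    for tool in tools:
        path = shutil.which(tool)
        if path is None:
            versions[tool] = None
            continue
        # Same check as running `<tool> -v` by hand
        result = subprocess.run([path, "-v"], capture_output=True, text=True, check=False)
        versions[tool] = (result.stdout or result.stderr).strip()
    return versions


if __name__ == "__main__":
    for tool, version in tool_versions().items():
        print(f"{tool}: {version or 'not found on PATH'}")
```

A None value means the tool isn't installed or isn't on your PATH, so go back to the installation guides before continuing.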

We'll be using WSL2 for this guide. Start by cloning the Amundsen repo and its submodules:



git clone --recursive https://github.com/amundsen-io/amundsen.git

Next, enter the cloned directory:



cd amundsen

If this is your first time, make sure you've allocated the necessary memory. The minimum needed for all the containers to run with the loaded sample data is 3GB.

If you're using WSL2, you can check your allocation through .wslconfig. Follow this guide to set your .wslconfig for WSL2.

As an example, here's the .wslconfig I use:



[wsl2]
memory=6GB
processors=2
swap=3GB

If you've made changes to the configuration, restart WSL (for example, by running wsl --shutdown) or your PC so they can take effect. If no changes were necessary, proceed to the next step.
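As a quick sanity check, the memory setting can be read programmatically, since .wslconfig uses INI syntax. The sketch below is an illustration only: wsl_memory_gb and has_enough_memory are hypothetical helpers, and it assumes the memory value is written with a GB or MB suffix as in the example above.

```python
import configparser
import re

MIN_MEMORY_GB = 3.0  # minimum stated in this guide for the sample-data demo


def wsl_memory_gb(config_text):
    """Return the [wsl2] memory setting in GB, or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("wsl2", "memory", fallback=None)
    if raw is None:
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(GB|MB)", raw.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized memory value: {raw!r}")
    amount = float(match.group(1))
    return amount if match.group(2).upper() == "GB" else amount / 1024


def has_enough_memory(config_text, minimum_gb=MIN_MEMORY_GB):
    """True only when an explicit memory limit meets the minimum."""
    memory = wsl_memory_gb(config_text)
    # No explicit setting means WSL2 falls back to its own default,
    # which this sketch cannot inspect, so it is reported as False.
    return memory is not None and memory >= minimum_gb


if __name__ == "__main__":
    sample = "[wsl2]\nmemory=6GB\nprocessors=2\nswap=3GB\n"
    print(wsl_memory_gb(sample), has_enough_memory(sample))
```

To check your real file, pass in the contents of the .wslconfig in your Windows user profile folder instead of the sample string.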

For this demo, we'll be using the Neo4j backend. Run the following command:



docker-compose -f docker-amundsen.yml up

In a separate terminal window, change your directory to databuilder to ingest the provided sample data into Neo4j:



cd databuilder

Install the dependencies in a virtual environment. For this, we'll be using pyenv, a tool for managing multiple Python versions, and its plugin pyenv-virtualenv for managing multiple virtual environments. If you don't have these installed, check out these guides on how to install pyenv and create a virtual environment with pyenv.

Check your pyenv versions:



pyenv versions

Activate the environment in the current directory. In my case, my virtual environment is called amundsen_demo:



pyenv local amundsen_demo
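Before installing anything, it can help to confirm that python3 now resolves to the virtual environment rather than the system interpreter. The snippet below is a small sketch for that check. The helper names are hypothetical, and it assumes the environment was created with venv or a recent virtualenv, both of which set sys.base_prefix.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside an environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def describe_interpreter():
    """Summarize which Python is running and where it lives."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"Python {version} at {sys.prefix} (virtualenv: {in_virtualenv()})"


if __name__ == "__main__":
    print(describe_interpreter())
```

Save it to a file and run it with python3 from inside the databuilder directory. If it reports virtualenv: False, the pip installs in the following steps would land in your base Python installation.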

Then upgrade pip, the package installer for Python:



pip3 install --upgrade pip

Next, install the Python packages listed in the requirements.txt file, which lists the project's dependencies:



pip3 install -r requirements.txt

Then install the Amundsen Databuilder package from its setup.py:



python3 setup.py install

We'll then load data into Neo4j and Elasticsearch databases without using an Airflow DAG (Directed Acyclic Graph) with this script:



python3 example/scripts/sample_data_loader.py

The script consists of several jobs:

Amundsen Jobs	Description
run_csv_job	Reads table data from a CSV file, writes the data to another local directory as a CSV file, and then publishes the data to Neo4j, a graph database management system.
run_table_column_job	Similar to run_csv_job, but processes a CSV file containing column data instead.
create_last_updated_job	Creates a job that gets the current time, converts it into a predefined data model, and publishes it to Neo4j.
create_es_publisher_sample_job	Creates a job that extracts data from Neo4j and publishes it to Elasticsearch, a search and analytics engine.

The script imports necessary modules, sets up configuration, and uses various extractors, loaders, and publishers from the Amundsen Databuilder library to perform the tasks mentioned above.
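The extract, load, and publish flow behind these jobs can be sketched in a few lines of plain Python. The classes below are hypothetical stand-ins written for this post, not the Databuilder API: the "CSV file" is an in-memory string, the staging area is a list, and the "database" is a dict.

```python
import csv
import io


class CsvExtractor:
    """Yields one record at a time from CSV text, then None when exhausted."""

    def __init__(self, text):
        self._rows = iter(csv.DictReader(io.StringIO(text)))

    def extract(self):
        return next(self._rows, None)


class ListLoader:
    """Stages records locally before they are published."""

    def __init__(self):
        self.staged = []

    def load(self, record):
        self.staged.append(record)


class DictPublisher:
    """Pushes staged records into the target store (a dict standing in for Neo4j)."""

    def __init__(self):
        self.store = {}

    def publish(self, staged):
        for record in staged:
            self.store[record["name"]] = record


def run_job(extractor, loader, publisher):
    """Extract until the source is empty, stage each record, then publish once."""
    record = extractor.extract()
    while record is not None:
        loader.load(record)
        record = extractor.extract()
    publisher.publish(loader.staged)
    return publisher.store


if __name__ == "__main__":
    sample = "name,description\ntest_table1,first table\ntest_table2,second table\n"
    store = run_job(CsvExtractor(sample), ListLoader(), DictPublisher())
    print(sorted(store))
```

Publishing is kept as a separate final step so that nothing reaches the target store until every record has been staged.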

Now you can view the UI at http://localhost:5000 and try searching for test. You should get some results.

You can also perform an exact-match search for a table entity. For instance, search for test_table1 in the table field of the filter, and it'll return the matching records.
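If the page doesn't load, a quick script can tell you whether anything is answering on that port at all. This is a minimal sketch using only the standard library. The http_status helper is a hypothetical name, and the URL is the one given above.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if nothing answered."""
    # An empty ProxyHandler bypasses any system proxy, which is what we want for localhost
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success code
        return err.code
    except OSError:
        # Connection refused, timeout, DNS failure, and similar
        return None


if __name__ == "__main__":
    status = http_status("http://localhost:5000")
    print("Amundsen UI:", status if status is not None else "not reachable")
```

A result of None usually means the containers are still starting or have stopped, so check the terminal running docker-compose.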

To verify the dummy data has been ingested into Neo4j, visit http://localhost:7474/browser/ and run this in the query box:



MATCH (n:Table) RETURN n LIMIT 25

You can verify the data has been loaded into the metadataservice by visiting:

test_table1

test_table2

Finally, don't forget to stop your running multi-container app after you've finished using it:



docker-compose -f docker-amundsen.yml down

Miscellaneous 🧩

The End 🏁

That concludes our quick start guide to setting up and running a demo of Amundsen. I hope you found this post helpful and informative. If you have any questions or if there's something you'd like to know more about, feel free to drop a comment below or reach out to me directly.

Remember, there's no limit to what you can achieve with the right tools and a little bit of know-how. Keep exploring, keep learning, and as always, keep pushing the boundaries of what's possible with data.

If you want to stay updated with my latest posts and activities, or if you just want to connect, follow me on Beacons.

Happy coding, everyone! 🚀

Top comments (1)

Joshua Ryder • Apr 29

Hi all Windows users from 2025. This guide won't work out of the box. I will share here what I did to make it work on my machine. If you try getting this working on the newest version of Python, you will run into a world of pain. So what I did:

First off, symlinked files don't work on Windows. So copy your requirements-dev.txt into the databuilder folder.

Install pyenv on Windows, because it will make your life so much easier.

Run this in PowerShell as Admin:

Invoke-WebRequest -UseBasicParsing -Uri "raw.githubusercontent.com/pyenv-wi..." -OutFile "./install-pyenv-win.ps1"
.\install-pyenv-win.ps1

Review the script before running it (don't trust random internet dude :P)

pyenv install 3.8.0

pyenv install 3.8.0-win32 <-- maybe not needed. I just did it anyways.

pyenv local 3.8.0 <-- use it.

This pretty much made it work for me. Hope this helps someone else :)
