<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aloïs Micard</title>
    <description>The latest articles on DEV Community by Aloïs Micard (@creekorful).</description>
    <link>https://dev.to/creekorful</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F9371%2Fa88beefd-8905-4618-b909-1ef18e028eea.jpg</url>
      <title>DEV Community: Aloïs Micard</title>
      <link>https://dev.to/creekorful</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/creekorful"/>
    <language>en</language>
    <item>
      <title>Debian maintainer from zero to hero</title>
      <dc:creator>Aloïs Micard</dc:creator>
      <pubDate>Mon, 25 Jan 2021 07:45:23 +0000</pubDate>
      <link>https://dev.to/creekorful/debian-maintainer-from-zero-to-hero-53l6</link>
      <guid>https://dev.to/creekorful/debian-maintainer-from-zero-to-hero-53l6</guid>
      <description>&lt;p&gt;I have been using &lt;a href="https://www.debian.org/"&gt;Debian&lt;/a&gt; intensively for more than 4years, mainly on the servers I administrate. It's a really powerful &amp;amp; stable (one of the most) OS, with a &lt;em&gt;TONS&lt;/em&gt; of package available trough &lt;a href="https://en.wikipedia.org/wiki/APT_(software)"&gt;APT&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Debian has helped me in a lot of ways, and I wanted to give something back, even if only a little help. So, one year ago, I emailed a friend who is also a &lt;a href="https://wiki.debian.org/DebianDeveloper"&gt;DD&lt;/a&gt; and asked whether he would be willing to become my mentor and help me contribute to Debian.&lt;/p&gt;

&lt;p&gt;With his help, I started the &lt;a href="https://www.debian.org/doc/manuals/maint-guide/"&gt;new maintainer process&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please note that this isn't a technical guide at ALL; it's rather my opinions &amp;amp; impressions of the Debian maintainer experience.&lt;/p&gt;

&lt;h1&gt;1. What kind of contributions?&lt;/h1&gt;

&lt;p&gt;You can contribute to Debian in a lot of ways: translating, submitting bug reports / patches (bugfixes), helping newcomers on forums, helping package &amp;amp; maintain software, etc. (&lt;a href="https://www.debian.org/intro/help"&gt;full list of wanted contributions&lt;/a&gt;).&lt;br&gt;
It only depends on your skills, your free time &amp;amp; what you like to do.&lt;/p&gt;

&lt;p&gt;I personally wanted to get involved in packaging (TL;DR: helping get software into the Debian archive, so that end users can install it with apt).&lt;/p&gt;

&lt;h1&gt;2. Packaging&lt;/h1&gt;

&lt;h2&gt;2.1. Learn how to make a package from scratch&lt;/h2&gt;

&lt;p&gt;The first step is to learn how to make a package from scratch, so that you understand exactly how it works and can investigate future errors. I used the &lt;a href="https://wiki.debian.org/Packaging/Intro?action=show&amp;amp;redirect=IntroDebianPackaging"&gt;following documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I personally spent 2 weeks reading the documentation, setting up a VM, and learning by trial and error. I learned a lot of things that would help me later on.&lt;/p&gt;

&lt;p&gt;For me, learning to make a package from scratch is a &lt;em&gt;really&lt;/em&gt; important step, since it will be the core of your work as a maintainer. Nowadays, much of the packaging process is automated, especially with the &lt;em&gt;dh-make&lt;/em&gt; tools. A lot therefore happens under the hood, and not knowing how packaging works will make you struggle when investigating errors or fixing packaging-related bugs.&lt;/p&gt;

&lt;h2&gt;2.2. Learn what to package&lt;/h2&gt;

&lt;p&gt;Once you know how to package stuff, you have to choose what you're gonna do with this new skill.&lt;/p&gt;

&lt;p&gt;Once again, several options are available:&lt;/p&gt;

&lt;h3&gt;2.2.1. Create a new package&lt;/h3&gt;

&lt;p&gt;This option may be the most interesting, especially if you already have a&lt;br&gt;
piece of software in mind, but you should ask yourself some questions first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will this package fit in the archive? (copyright, usefulness, etc.)&lt;/li&gt;
&lt;li&gt;Do I have enough skills to package this software?&lt;/li&gt;
&lt;li&gt;Do I have enough free time?&lt;/li&gt;
&lt;li&gt;Does this package already exist?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In case of doubt, always ask for help. You can use &lt;a href="https://wiki.debian.org/IRC"&gt;IRC&lt;/a&gt;, the &lt;a href="https://www.debian.org/MailingLists/"&gt;mailing lists&lt;/a&gt;, etc.&lt;/p&gt;

&lt;h3&gt;2.2.2. Help maintain an existing package&lt;/h3&gt;

&lt;p&gt;There are a lot of packages marked &lt;a href="https://www.debian.org/devel/wnpp/rfh"&gt;RFH&lt;/a&gt; (request for help).&lt;/p&gt;

&lt;p&gt;These packages already have a maintainer, but one who is seeking help, meaning they may be willing to help you update the package / fix its bugs, as well as upload it to the archive for you (sponsoring).&lt;/p&gt;

&lt;p&gt;RFH packages are really great since you'll help an existing maintainer, learn from them, be able to ask questions, and the maintainer may be willing to advocate for you to become a Debian maintainer in the future (see below for more details).&lt;/p&gt;

&lt;h3&gt;2.2.3. Take over an abandoned package&lt;/h3&gt;

&lt;p&gt;Sometimes, maintainers don't have enough time for their packages anymore and will mark a package &lt;a href="https://www.debian.org/devel/wnpp/rfa"&gt;RFA&lt;/a&gt; (request for adoption).&lt;br&gt;
These packages are basically abandoned and looking for adoption. The maintainer may help you take over the maintenance.&lt;/p&gt;

&lt;h1&gt;3. Becoming a Sponsored maintainer&lt;/h1&gt;

&lt;p&gt;When your package is finished (successfully built + tested), you can't just upload it to the Debian archive. Only DDs &amp;amp; &lt;a href="https://wiki.debian.org/DebianMaintainer"&gt;DMs&lt;/a&gt; can do that.&lt;/p&gt;

&lt;p&gt;You'll have to ask someone with upload rights to review it &amp;amp; upload it for you (this is called sponsoring).&lt;/p&gt;

&lt;p&gt;To ease package reviewing, you can use the &lt;a href="https://mentors.debian.net/"&gt;mentors&lt;/a&gt; website and/or email the &lt;a href="https://lists.debian.org/debian-mentors/"&gt;debian-mentors&lt;/a&gt; mailing list.&lt;/p&gt;

&lt;p&gt;Once your first package is uploaded to &lt;a href="https://wiki.debian.org/DebianUnstable"&gt;unstable&lt;/a&gt;, you will be a &lt;a href="https://wiki.debian.org/SponsoredMaintainer"&gt;Sponsored maintainer&lt;/a&gt;!&lt;/p&gt;

&lt;h1&gt;4. Becoming a Debian maintainer&lt;/h1&gt;

&lt;p&gt;Once you start mastering the art of packaging, you may apply to become a Debian Maintainer (DM).&lt;/p&gt;

&lt;p&gt;DMs are people with restricted upload rights on the archive. DDs can grant them upload rights for specific packages, so that they can upload without sponsoring.&lt;/p&gt;

&lt;p&gt;To become a DM you must go through the &lt;a href="https://nm.debian.org/"&gt;new-member process&lt;/a&gt;. It requires several steps, such as getting your PGP uid signed by a DD (to establish trust), being advocated by a DD, and agreeing to the &lt;a href="https://www.debian.org/social_contract"&gt;SC&lt;/a&gt;/&lt;a href="https://www.debian.org/social_contract#guidelines"&gt;DFSG&lt;/a&gt;/&lt;a href="https://www.debian.org/devel/dmup"&gt;DMUP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After completing all the steps, your application stays pending for at least 4 days (to allow any objections to be raised), and finally the keyring maintainers add your key to the Debian keyring, officially making you a Debian maintainer.&lt;/p&gt;

&lt;h1&gt;5. Conclusion&lt;/h1&gt;

&lt;p&gt;Contributing to such a big &amp;amp; distributed open source project taught me a lot, from technical skills such as packaging, reviewing / submitting patches, arguing about them, and reporting bugs...&lt;/p&gt;

&lt;p&gt;...to social skills such as asynchronous communication (email &amp;amp; IRC only) with people all around the world, working with many different mindsets &amp;amp; cultures, &lt;strong&gt;being patient&lt;/strong&gt;, staying humble, etc.&lt;/p&gt;

&lt;p&gt;This experience helped me to be a better software developer &amp;amp; human being.&lt;/p&gt;

&lt;p&gt;I'd like to thank the whole Debian community, which is always really helpful, &amp;amp; especially Alexandre Viau, my mentor, who greatly helped me &amp;amp; pushed me to become a Debian maintainer.&lt;/p&gt;

&lt;p&gt;Happy hacking!&lt;/p&gt;

</description>
      <category>debian</category>
      <category>packaging</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building a fast modern web crawler for the dark web</title>
      <dc:creator>Aloïs Micard</dc:creator>
      <pubDate>Mon, 23 Sep 2019 08:37:39 +0000</pubDate>
      <link>https://dev.to/creekorful/building-a-fast-modern-web-crawler-for-the-dark-web-l1</link>
      <guid>https://dev.to/creekorful/building-a-fast-modern-web-crawler-for-the-dark-web-l1</guid>
<description>&lt;p&gt;I have been passionate about web crawlers for a long time. I have written several of them in many languages such as C++, JavaScript (Node.js), Python... and I love the theory behind them.&lt;br&gt;
But first of all, what is a web crawler?&lt;/p&gt;
&lt;h1&gt;What is a web crawler?&lt;/h1&gt;

&lt;p&gt;A web crawler is a computer program that browses the internet to index existing pages, images, PDFs, etc., and allows users to search them using a search engine. It's basically the technology behind the famous Google search engine.&lt;/p&gt;
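&lt;p&gt;To make this concrete, the core crawling step (fetch a page, then extract the outgoing links) can be sketched in a few lines of Go. This is a naive illustration using only the standard library; the regular expression merely stands in for the proper HTML parser a real crawler should use:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
)

// hrefRe matches absolute http(s) links. A regexp is good enough for a
// sketch; a real crawler should use an actual HTML parser.
var hrefRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

// extractLinks returns every absolute link found in an HTML body.
func extractLinks(body string) []string {
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
		links = append(links, m[1])
	}
	return links
}

// crawl fetches a page and returns the links it contains.
func crawl(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return extractLinks(string(body)), nil
}

func main() {
	// Extract links from an inline snippet so the example runs offline.
	html := `<a href="https://example.org/about">About</a>`
	fmt.Println(extractLinks(html))
}
```

&lt;p&gt;An indexer then repeats this step on every discovered link, which is exactly what makes the distributed design below interesting.&lt;/p&gt;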

&lt;p&gt;Typically, an efficient web crawler is designed to be distributed: instead of a single program running on a dedicated server, multiple instances of several programs run on several servers (e.g. in the cloud), which allows better task distribution, increased performance and increased bandwidth.&lt;/p&gt;

&lt;p&gt;But distributed software does not come without drawbacks: several factors may add extra latency and decrease performance, such as network latency, synchronization problems, a poorly designed communication protocol, etc.&lt;/p&gt;

&lt;p&gt;To be efficient, a distributed web crawler has to be well designed: it is important to eliminate as many bottlenecks as possible. As the French admiral Olivier Lajous said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The weakest link determines the strength of the whole chain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Trandoshan: a dark web crawler&lt;/h1&gt;

&lt;p&gt;You may know that several successful web crawlers already run on the web, such as Googlebot. So I didn't want to make yet another one; what I wanted to build this time was a web crawler for the dark web.&lt;/p&gt;
&lt;h2&gt;What's the dark web?&lt;/h2&gt;

&lt;p&gt;I won't get too technical about what the dark web is, since it would deserve its own article.&lt;/p&gt;

&lt;p&gt;The web is composed of 3 layers, and we can think of it like an iceberg:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Surface Web, or Clear Web, is the part that we browse every day. It's indexed by popular search engines such as Google, Qwant, DuckDuckGo, etc.&lt;/li&gt;
&lt;li&gt;The Deep Web is the part of the web that is not indexed. This means you cannot find these websites using a search engine; you need to know the associated URL / IP address to access them.&lt;/li&gt;
&lt;li&gt;The Dark Web is the part of the web that you cannot access using a regular browser: you need a particular application or a special proxy. The most famous dark web is made of the hidden services built on the Tor network, which can be accessed using special URLs ending in .onion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0xzsbclhutcchddse7fx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0xzsbclhutcchddse7fx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;How is Trandoshan designed?&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe1p8c16bypi4od2ip50a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe1p8c16bypi4od2ip50a.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before talking about the responsibility of each process, it is important to understand how they talk to each other.&lt;/p&gt;

&lt;p&gt;Inter-process communication (IPC) is mainly done through a messaging server called NATS (the yellow lines in the diagram), based on the producer / consumer pattern. Each message in NATS has a subject (like an email) that allows processes to identify it and therefore read only the messages they care about. NATS enables scaling: for example, there can be 10 crawler processes reading URLs from the messaging server, each receiving a unique URL to crawl. This allows process concurrency (many instances can run at the same time without conflicts) and therefore increases performance.&lt;/p&gt;
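&lt;p&gt;The pattern can be sketched with plain Go channels standing in for NATS (an assumption made only to keep the example self-contained: the real thing needs a running NATS server and its Go client). The point is that each URL published on the "todoUrls" subject is consumed by exactly one crawler instance, no matter how many are running:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// runCrawlers starts `workers` crawler instances that all consume from
// the same todoUrls queue, like NATS consumers on the "todoUrls"
// subject: each URL is delivered to exactly one worker.
func runCrawlers(todoUrls chan string, workers int) []string {
	var (
		mu      sync.Mutex
		crawled []string
		wg      sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range todoUrls {
				// A real crawler would fetch the page here and publish
				// its body on the "content" subject.
				mu.Lock()
				crawled = append(crawled, url)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return crawled
}

func main() {
	todo := make(chan string, 3)
	todo <- "http://a.onion"
	todo <- "http://b.onion"
	todo <- "http://c.onion"
	close(todo)
	// 10 workers, 3 URLs: each URL is still crawled exactly once.
	fmt.Println(len(runCrawlers(todo, 10)))
}
```

&lt;p&gt;Adding more workers never duplicates work; it only spreads the same queue across more consumers, which is why this design scales.&lt;/p&gt;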

&lt;p&gt;Trandoshan is divided into 4 main processes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Crawler&lt;/strong&gt;: the process responsible for crawling pages: it reads URLs to crawl from NATS (messages identified by the subject "todoUrls"), crawls the page, and extracts all URLs present in it. The extracted URLs are sent to NATS with the subject "crawledUrls", and the page body (the whole content) is sent to NATS with the subject "content".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt;: the process responsible for URL approval: it reads the "crawledUrls" messages, checks whether the URL should be crawled (i.e. whether it has not already been crawled) and, if so, sends the URL to NATS with the subject "todoUrls".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persister&lt;/strong&gt;: the process responsible for content archiving: it reads page contents (messages identified by the subject "content") and stores them in a NoSQL database (MongoDB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API&lt;/strong&gt;: the process used by the other processes to gather information. For example, the Scheduler uses it to determine whether a page has already been crawled. Instead of calling the database directly to check whether a URL exists (which would couple the scheduler to the database technology), the scheduler goes through the &lt;strong&gt;API&lt;/strong&gt;: this provides an abstraction layer between the database and the processes.&lt;/li&gt;
&lt;/ul&gt;
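&lt;p&gt;The Scheduler's approval logic boils down to a de-duplication check. In the sketch below, an in-memory set stands in for the query to the API process, and a slice stands in for publishing to the "todoUrls" subject (both are simplifications made to keep the example self-contained):&lt;/p&gt;

```go
package main

import "fmt"

// Scheduler decides whether a freshly discovered URL should be queued
// for crawling. In Trandoshan the "already crawled?" check goes through
// the API process; an in-memory set stands in for it here.
type Scheduler struct {
	seen map[string]bool
	todo []string // stands in for the "todoUrls" NATS subject
}

func NewScheduler() *Scheduler {
	return &Scheduler{seen: make(map[string]bool)}
}

// HandleCrawledURL processes one message from the "crawledUrls" subject.
func (s *Scheduler) HandleCrawledURL(url string) {
	if s.seen[url] {
		return // already crawled: drop the URL
	}
	s.seen[url] = true
	s.todo = append(s.todo, url)
}

func main() {
	s := NewScheduler()
	for _, u := range []string{"http://x.onion", "http://y.onion", "http://x.onion"} {
		s.HandleCrawledURL(u)
	}
	fmt.Println(s.todo) // the duplicate URL is filtered out
}
```

&lt;p&gt;Routing this check through the API rather than the database keeps the Scheduler unaware of MongoDB: swapping the storage layer would not touch the scheduling code.&lt;/p&gt;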

&lt;p&gt;The different processes are written in Go: it offers great performance (since it compiles to a native binary) and has a lot of library support. Go is well suited to building high-performance distributed systems.&lt;/p&gt;

&lt;p&gt;The source code of Trandoshan is available on GitHub here: &lt;a href="https://github.com/trandoshan-io" rel="noopener noreferrer"&gt;https://github.com/trandoshan-io&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;How to run Trandoshan?&lt;/h2&gt;

&lt;p&gt;As said before, Trandoshan is designed to run on distributed systems and is available as Docker images, which makes it a great candidate for the cloud. In fact, there is a repository that holds all the configuration files needed to deploy a production instance of Trandoshan on a Kubernetes cluster. The files are available here: &lt;a href="https://github.com/trandoshan-io/k8s" rel="noopener noreferrer"&gt;https://github.com/trandoshan-io/k8s&lt;/a&gt;, and the container images are available on Docker Hub.&lt;/p&gt;

&lt;p&gt;If you have kubectl configured correctly, you can deploy Trandoshan with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./bootstrap.sh&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Otherwise, you can run Trandoshan locally using Docker and docker-compose. In the trandoshan-parent repository there is a compose file and a shell script that let you run the application with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./deploy.sh&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;How to use Trandoshan?&lt;/h2&gt;

&lt;p&gt;At the moment, there is a small Angular application to search the indexed content. The page uses the API process to perform searches against the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fltrk8fdmto5ncdfmzvgp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fltrk8fdmto5ncdfmzvgp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;That's all for the moment. Trandoshan is production ready, but there is a lot of optimization to be done and features to be merged. Since it's an open source project, everyone can contribute by opening a pull request on the corresponding repository.&lt;/p&gt;

&lt;p&gt;Happy hacking!&lt;/p&gt;

</description>
      <category>go</category>
      <category>webcrawler</category>
      <category>kubernetes</category>
      <category>darkweb</category>
    </item>
  </channel>
</rss>
