DEV Community: Leonardo Holanda

What do 2,312 dev job postings on Gupy say about the job market?

Leonardo Holanda — Sat, 16 Mar 2024 12:37:44 +0000

After almost 3 months, vagômetro, a tracker for IT job postings in Brazil, has collected 2,312 dev job postings from Gupy.

On reaching that mark, I decided to write a post sharing a summary of the data.

Since vagômetro provides 13 different pieces of information about each posting and there are many ways to combine them, I chose to present only three main points and, for each point, three observations I find interesting.

If you want to validate the data cited here or see it in full, visit vagômetro at https://vagometro.vercel.app/.

The data used in this post goes up to March 16, 2024.

Junior-level postings are only 7% of the total

That's 160 postings. Of the 2,312 total, most are split between:

867 senior postings (38%)
579 mid-level postings (25%)
424 postings that don't state a level but ask for experience (18%)

Most requested technologies in junior postings

Across these 160 postings, the most requested languages are:

JavaScript
SQL
Java

Other terms mentioned frequently were:

Tests
API
Git
Agile

Most frequent work modes in junior postings

The split between work modes is fairly balanced if we treat on-site and hybrid as a single category.

81 remote postings (51%)
47 hybrid postings (29%)
32 on-site postings (20%)

For on-site and hybrid postings, the three cities with the most postings were:

São Paulo
Rio de Janeiro
Barueri/SP

Reposts of junior postings

99% of the postings were never reposted. A plausible interpretation is that junior positions are not hard to fill, so there is no need to repost them.

JavaScript, Java and SQL at the top

Considering all 2,312 postings, the languages that appeared most often were:

JavaScript
- 860 postings (37%)
- 3rd in the ranking
Java
- 777 postings (34%)
- 6th in the ranking
SQL
- 667 postings (29%)
- 8th in the ranking

Most requested database-related terms

Oracle
- 328 postings (14%)
- 23rd in the ranking
MongoDB
- 273 postings (12%)
- 28th in the ranking
MySQL
- 242 postings (11%)
- 30th in the ranking

Backend overview

For the technologies that can be used on the backend, the split looks like this:

Java
- 777 postings (34%)
- 6th in the ranking

Spring Boot is the most cited Java framework, with 418 postings (18%), ranking 20th.

Node.js (JavaScript or TypeScript)
- 518 postings (22%)
- 14th in the ranking

NestJS is the most cited Node.js framework, with 58 postings (3%), ranking 90th.

C#
- 302 postings (13%)
- 25th in the ranking

.NET is the most cited C# framework, with 112 postings (5%), ranking 57th.

General terms

Other general terms that were mentioned frequently:

Tests
- 1,159 postings (50%)
- 1st in the ranking
API
- 907 postings (39%)
- 2nd in the ranking
Git
- 729 postings (34%)
- 4th in the ranking
Agile
- 791 postings (34%)
- 5th in the ranking
REST
- 687 postings (30%)
- 7th in the ranking
Unit tests
- 656 postings (28%)
- 9th in the ranking
Scrum
- 581 postings (25%)
- 10th in the ranking

Are these the skills the market values most? Share your opinion in the comments!

Remote postings are still more than half

The split of postings by work mode looks like this:

Remote
- 1,293 postings (56%)
Hybrid
- 702 postings (30%)
On-site
- 317 postings (14%)
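
As a quick check on the arithmetic, the shares above can be recomputed from the raw counts. This is only a minimal sketch: `modalityCounts` and `toShares` are illustrative names and are not part of the vagômetro code.

```typescript
// Counts taken from the list above; the names are illustrative only.
const modalityCounts: Record<string, number> = {
  remote: 1293,
  hybrid: 702,
  onSite: 317,
};

// Converts raw counts into whole-number percentages of their total.
function toShares(counts: Record<string, number>): Record<string, number> {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const shares: Record<string, number> = {};
  for (const [key, n] of Object.entries(counts)) {
    shares[key] = total === 0 ? 0 : Math.round((n / total) * 100);
  }
  return shares;
}

const modalityShares = toShares(modalityCounts);
// modalityShares is { remote: 56, hybrid: 30, onSite: 14 }
```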

Experience is valued in remote postings

Senior and mid-level postings add up to 898 (69%), while there are 81 junior postings (6%).

Balance in on-site postings

The gap between experience levels shrinks in on-site postings.

In this work mode, there were 50 senior postings (16%), 53 mid-level postings (17%) and 32 junior postings (10%). The top category, postings that don't state a level but ask for experience, had 87 postings (27%).

Could this be a way for companies to compensate for the on-site requirement with a more flexible experience requirement?

Hybrid postings keep the imbalance

Senior and mid-level postings, which take first and second place in the ranking, total 445 (63%).

Junior postings come fourth with 47 postings (7%).

More data

vagômetro collects postings from Gupy and LinkedIn every day, and also includes postings from GitHub repositories dating back to 2016.

Every posting is mapped to find information such as:

Requested technologies (JavaScript, Docker, Figma, TensorFlow, etc.)
Work mode (remote, hybrid or on-site)
Contract type (CLT, PJ, internship, etc.)
Experience level (senior, mid-level, junior, etc.)
Inclusion (affirmative postings for Black people, women, people with disabilities, etc.)
Education level (undergraduate degree, master's degree, high school, etc.)
Requested languages (English, Spanish, etc.)
Requested certifications
Which cities offer the most postings
Which companies post the most
How many times the posting was reposted
How much time passed between reposts

On top of that, you can create a search profile that is compared against every posting, generating a match percentage for each one and making it easier to find postings that fit your profile.

All postings used in the analyses are listed near the footer of each page, along with the link to the original posting.

vagômetro is an open-source project. To check out the repository, visit https://github.com/leo-holanda/vagometro.

To check out vagômetro, visit https://vagometro.vercel.app/.

How LogCharts Was Developed

Leonardo Holanda — Thu, 02 Nov 2023 19:16:00 +0000

Using HWInfo logs can be really useful to diagnose problems in your hardware. But a .csv file isn't the most human-readable thing and neither is loading it in Excel or LibreOffice Calc.

In this post, I'm going to talk more about LogCharts, a tool I made to make HWInfo logs visualization more human-friendly.

Here's what you will see:

The Problem
The Solution
The Implementation
Hosting and Analytics
The Result
What I learned from all this

The Problem

You just downloaded HWInfo to find out how your computer hardware behaves and why it behaves the way it does.

You noticed that opening HWInfo just collects the min, max, current and average data. But you need something that shows the data over time.

You discover that HWInfo has a log feature that exports a .csv file containing all the sensor data that were collected between when you activated the feature and deactivated it.

You open the .csv file and this is what it looks like:

It has what you need but you can't exactly interpret it, right? You decide to load it into a table and it looks like this:

It's getting better. Now you can sort, filter and things like that. But what about the relationship between your data and time? It's kinda hard to analyse it through a table, isn't it?

The Solution

What if it was possible to load the .csv and generate line charts showing how your data spreads over time?

That's what LogCharts does. Now, you can see logs the way they are meant to be seen. And with some nice additions like the tooltip and the brush.

Since you may want to compare data from different sensors, you can create as many lines as you want and select which data they will show while assigning colours to them.
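
The core idea can be sketched in a few lines: split the log into rows, then turn every sensor column into a series of time/value points that a chart library can draw. This is a rough sketch and not LogCharts' actual code. It assumes the first two columns hold the date and the time and that no field contains a quoted comma; `parseSensorLog` and `SensorPoint` are made-up names.

```typescript
interface SensorPoint {
  time: string; // "date time" label built from the first two columns
  value: number;
}

// Turns CSV text into one series per sensor column.
// Assumptions: header row first, columns 0 and 1 are date and time,
// and no field contains a quoted comma.
function parseSensorLog(csvText: string): Map<string, SensorPoint[]> {
  const rows = csvText
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => line.split(","));

  const seriesBySensor = new Map<string, SensorPoint[]>();
  if (rows.length === 0) return seriesBySensor;
  const header = rows[0];

  for (let col = 2; col < header.length; col++) {
    seriesBySensor.set(header[col].trim(), []);
  }

  for (const row of rows.slice(1)) {
    const time = `${row[0]} ${row[1]}`;
    for (let col = 2; col < header.length; col++) {
      const cell = row[col];
      // Skip blank and non-numeric cells instead of plotting NaN
      if (cell === undefined || cell.trim() === "" || Number.isNaN(Number(cell))) continue;
      seriesBySensor.get(header[col].trim())!.push({ time, value: Number(cell) });
    }
  }
  return seriesBySensor;
}

// Hypothetical sample data, only to show the shape of the result
const sampleLog = [
  "Date,Time,CPU Temp,GPU Temp",
  "1.11.2023,10:00:00,45,50",
  "1.11.2023,10:00:02,47,",
].join("\n");

const sampleSeries = parseSensorLog(sampleLog);
```

Each entry in `sampleSeries` can then be handed to whatever draws the lines, with one line per selected sensor.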

The Implementation

I made this project when I was learning the web development basics: HTML, CSS and JavaScript. It feels like ages since I developed it, so I don't remember all the details, unfortunately.

I do remember that I made some poor decisions that made the code harder to maintain. At the time, I thought: "I have an HTML page. I'm gonna create lots of JS files to manipulate this HTML page and lots of style files to style it. And that's it".

Some time ago I needed to get back to the code to fix a bug and it took me some time to understand where the things were and what I should change to fix the bug.

This helped me learn how componentization helps isolate things and how valuable this is.

Hosting and Analytics

After years of hosting LogCharts on GitHub Pages, I got curious to see how many people were using it. For this, I needed an analytics solution, which GitHub Pages doesn't provide.

After searching for options, I opted for Cloudflare Web Analytics. It doesn't have the cookie banner thing that I find pretty annoying while also working nicely and being easy to deploy.

It helped me find that LogCharts has nearly 280 monthly visits! It ain't much but it's honest work.

Cloudflare Web Analytics report for LogCharts

I also used the project to learn Docker. I created a Dockerfile that runs an NGINX server that serves LogCharts' static files. Someone even created an issue asking me for a docker-compose file, which was nice.

The Result

You can check LogCharts here. And its source code here.

What I learned from all this

How to use D3.js to create interactive line charts
How to work with .csv files
How to deploy a project to GitHub Pages and Cloudflare
How to create a GitHub Action
How to use ESBuild to create bundles for production
How to use Docker in a project
How componentization can increase maintainability in a project

The VS Code Extension That Helps You Control Your Pull Requests Size

Leonardo Holanda — Wed, 01 Nov 2023 20:11:53 +0000

When I was working as an intern developer in a startup in my hometown, I noticed one particular thing that was quite annoying: large pull requests.

In this post, I'm gonna talk more about Changes Counter, a VS Code extension I made to tackle this problem. Here's what you will see:

The Problem
The Solution
The Implementation
The Result

The Problem

If you have reviewed a PR with thousands of lines, you know that it ain't fun. It takes time, and the more time you spend on it, the more tired you get and the harder it becomes to find bugs and mistakes in the code.

This tweet is a classic when posts are made about pull requests

In the context of my internship, I remember seeing some large PRs taking so long to be merged that it would affect the progress of other tasks resulting in bottlenecks.

And there's actually a study on this made by a company called SmartBear which recommends reviewing "no more than 200 to 400 lines of code at a time".

In summary: large pull requests suck.

Why would they happen in my team?

During the Scrum ceremonies, our team would decide which stories and tasks to create on the Jira board. Most of the time, we would agree that the tasks' scope was fine and reasonable. However, when working on these tasks, we noticed that they required more changes than predicted.

Sometimes, we would push the changes to the remote branch and create pull requests with lots of changes. Sometimes, we would notice that it would be too much for our reviewer friend and split the task into two separate ones.

But this whole process would rely on intuition rather than the actual number of changes that we would make in the PR. And since we were a team of interns, this intuition hadn't yet had enough time to develop and give good results. The same thing applies to task creation, I think.

So I thought: What if the dev would know exactly how many changes will go in the PR while coding the task instead of relying on intuition?

This way, the dev will always know if the PR is getting larger than desired and then decide to take some action.

The Solution

One of the extensions that we were encouraged to use in the internship was GitLens. It provides lots of features to make the whole git experience better and it actually would come in handy frequently during the work we did.

While using it, I noticed that the Search & Compare feature contains data about changed lines of code. But unfortunately, not in a way that could solve our problem.

GitLens Search & Compare feature screenshot

To solve our problem, the data needed to be presented in a way that was easier to look at and more useful to the developer while working on a task. Something like a status bar item, similar to the Errors & Warnings one.

Also, a notification warning the developer that a given change quantity threshold was exceeded could be nice in case the dev was unaware of it.

I ended up with these requirements:

A status bar item that shows how many lines of code were changed
The item is updated every time a file is saved
The user can set a threshold to determine an acceptable quantity of changed lines
A notification is sent when the threshold is exceeded
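
These requirements boil down to one small decision that runs after every save. Here's a minimal sketch of that decision as a pure function. It isn't the extension's actual code: `ChangeSummary`, `StatusUpdate` and `evaluateChanges` are hypothetical names, and notifying only when the threshold is first crossed is my own assumption.

```typescript
interface ChangeSummary {
  insertions: number;
  deletions: number;
}

interface StatusUpdate {
  label: string; // text for the status bar item
  shouldNotify: boolean; // true only when the threshold is first crossed
}

// Decides what to display after a save, given the previous total and the current counts.
function evaluateChanges(
  current: ChangeSummary,
  previousTotal: number,
  threshold: number
): StatusUpdate {
  const total = current.insertions + current.deletions;
  const exceeded = total > threshold;
  const wasExceeded = previousTotal > threshold;
  return {
    label: `Changes: ${total}/${threshold}`,
    // Assumption: notify once when crossing the threshold, not on every later save
    shouldNotify: exceeded && !wasExceeded,
  };
}

const afterSave = evaluateChanges({ insertions: 300, deletions: 150 }, 380, 400);
// afterSave.label is "Changes: 450/400" and afterSave.shouldNotify is true
```

In an extension, a function like this would be called from a save listener, with the label going to the status bar item and the flag deciding whether to show a warning message.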

The Implementation

With the requirements in mind, I decided to develop a VS Code extension myself and turn these requirements into features.

I knew nothing about coding VS Code extensions. However, this documentation was really valuable in giving the base extension code and instructions on running it in a dev environment.

Then, it was a matter of adding the features I wanted. Lots of things were actually simple and I don't think they're worth mentioning here. But this one problem was interesting to me:

How to run Git commands in TypeScript?

The first question I had was: "How to count the number of lines changed?". Since I already knew about git diff, it was just a matter of running this command and working on its output.

But since VS Code extensions are developed with TypeScript in a Node.js environment, how would it be possible to run git diff and hold its output inside the extension code?

Searching about it, I discovered the Child Process module from Node.js. You can use it to spawn a subprocess that can run git commands. This is the code provided in the documentation as an example:

const { spawn } = require('node:child_process');
const ls = spawn('ls', ['-lh', '/usr']);

ls.stdout.on('data', (data) => {
  console.log(`stdout: ${data}`);
});

ls.stderr.on('data', (data) => {
  console.error(`stderr: ${data}`);
});

ls.on('close', (code) => {
  console.log(`child process exited with code ${code}`);
});

Bringing this to the extension code, we get something like this:

  async getDiffData(): Promise<DiffData> {
    return new Promise((resolve, reject) => {
      const comparisonBranch = this.context.workspaceState.get<string>("comparisonBranch");
      if (comparisonBranch === undefined) {
        reject("A comparison branch wasn't defined. Please, define a comparison branch.");
        return;
      }

      const gitChildProcess = spawn(
        "git",
        ["diff", comparisonBranch, "--shortstat", ...this.diffExclusionParameters],
        {
          cwd: vscode.workspace.workspaceFolders![0].uri.fsPath,
          shell: true, // Diff exclusion parameters don't work without this
        }
      );

      gitChildProcess.on("error", (err) => reject(err));

      let chunks: Buffer[] = [];
      gitChildProcess.stdout.on("data", (chunk: Buffer) => {
        chunks.push(chunk);
      });

      gitChildProcess.stderr.on("data", (data: Buffer) => {
        reject(data.toString());
      });

      gitChildProcess.on("close", () => {
        const processOutput = Buffer.concat(chunks);
        resolve(this.extractDiffData(processOutput));
      });
    });
  }

By calling this function, we get all we need through the resolved promise: an object containing the changes data or the error we must handle.
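
The `extractDiffData` step isn't shown above, so here's a rough sketch of how the `--shortstat` output could be turned into numbers. It is an illustration, not the extension's real implementation: `parseShortstat` and `ShortstatData` are hypothetical names, and the function takes a string, so a Buffer would need `toString()` first.

```typescript
interface ShortstatData {
  filesChanged: number;
  insertions: number;
  deletions: number;
}

// Parses a line such as " 3 files changed, 10 insertions(+), 2 deletions(-)".
// Git omits the insertions or deletions part when its count is zero
// and prints nothing at all when there are no changes.
function parseShortstat(output: string): ShortstatData {
  const pick = (pattern: RegExp): number => {
    const match = output.match(pattern);
    return match ? Number(match[1]) : 0;
  };
  return {
    filesChanged: pick(/(\d+) files? changed/),
    insertions: pick(/(\d+) insertions?\(\+\)/),
    deletions: pick(/(\d+) deletions?\(-\)/),
  };
}

const parsedDiff = parseShortstat(" 3 files changed, 10 insertions(+), 2 deletions(-)\n");
// parsedDiff is { filesChanged: 3, insertions: 10, deletions: 2 }
```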

Logging

A thing I learned while developing this extension is how important logging is. When some users reported problems, the lack of information was something that made debugging difficult.

When I started logging some lifecycle events and errors that could eventually appear, I noticed that it would be far easier to debug once the user sent me the log. Knowing where to search for the bug is obviously crucial.

If you are developing VS Code extensions, try to introduce logging as early as possible to avoid debugging in the dark.
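
Here's a minimal sketch of what such logging could look like: one consistent line format with a timestamp and a level. The names `formatLogLine` and `MemoryLogger` are made up for this example. The lines are kept in memory here just to keep the sketch self-contained; a real extension would more likely write them to an output channel.

```typescript
type LogLevel = "INFO" | "WARN" | "ERROR";

// Builds one log line: ISO timestamp, level, message.
function formatLogLine(level: LogLevel, message: string, now: Date = new Date()): string {
  return `${now.toISOString()} [${level}] ${message}`;
}

// Collects lines in memory so they can be shown or exported later.
class MemoryLogger {
  readonly lines: string[] = [];

  log(level: LogLevel, message: string, now?: Date): void {
    this.lines.push(formatLogLine(level, message, now));
  }
}

const demoLogger = new MemoryLogger();
demoLogger.log("INFO", "Extension activated");
demoLogger.log("ERROR", "git diff failed");
```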

And don't forget to search for the best practices before doing it. They are simple and can be helpful to make your logging more consistent with other applications. Here's an example.

The Result

You can see the extension page at the Visual Studio Marketplace here.

Changes Counter extension

And you can see the code here.

Any feedback and suggestions are more than welcome!

What I learned from all this

How to create and deploy a VS Code extension
How to spawn subprocesses and run system commands with the Child Process module from Node.js
The importance of logging

How To Find An Artist's Country of Origin?

Leonardo Holanda — Mon, 30 Oct 2023 13:32:46 +0000

In this post, I'm gonna talk more about the approach I used to find an artist's country of origin.

A quick note before we start

For this post, I thought that showing the problems I solved, the solutions I found and the mistakes I made would be more interesting than just showing the code and explaining it. If you want to replicate it, at least you know what not to do. What do you think?

Also, my goal with this series of posts is to share what I learned while developing Cartogrify so I thought it would make more sense.

Anyway, if you just want to see the code, it's near the end.

In this post, we will see:

The Problem
Where to fetch the data?
How to fetch the data?
The First Solution
The Second Solution
The Final Solution

The Problem

Cartogrify fetches the user's 50 top artists from Spotify or Last.fm APIs. In both of them, you will have an array of objects containing the artists' data which includes their names.

To generate the data visualization, you need to know where the artists come from or know that you couldn't find their countries. The aim is to end up with an array of objects containing the artists' names and the country where they come from or undefined.

Since the country detection algorithm spends the hosting free tier resources, it shouldn't run for an artist that it already encountered before. Because of this, every searched artist needs to be saved in a database for future queries.
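
To picture the "don't search twice" rule, here's a small sketch that splits the artists into already-known and still-unknown before any detection runs. A `Map` stands in for the database, and `CountryCache` and `splitByCache` are hypothetical names, not Cartogrify's real code.

```typescript
// A key with an undefined value means "searched before, country not found".
// A missing key means "never searched".
type CountryCache = Map<string, string | undefined>;

function splitByCache(artistNames: string[], cache: CountryCache) {
  const known: { artist: string; country: string | undefined }[] = [];
  const unknown: string[] = [];
  for (const artist of artistNames) {
    const key = artist.toLowerCase();
    if (cache.has(key)) {
      known.push({ artist, country: cache.get(key) });
    } else {
      unknown.push(artist);
    }
  }
  return { known, unknown };
}

// Hypothetical cache content, only for illustration
const demoCache: CountryCache = new Map<string, string | undefined>([
  ["djavan", "Brazil"],
  ["some obscure band", undefined],
]);

const cacheSplit = splitByCache(["Djavan", "Unknown Act", "Some Obscure Band"], demoCache);
// Only cacheSplit.unknown needs to go through country detection
```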

Also, it's kinda boring to look at a spinner and wait for 20+ artists to have their countries discovered. Since it can take a while, I made this loading screen:

Cartogrify country detection loading screen

This means each artist's country must be reported as soon as it is detected, one artist at a time, rather than waiting for all of them to be detected before proceeding.

Where to fetch the data?

I already knew by looking at explr.fm source code that using Last.fm API was an option.

However, I didn't want to follow the devs' approach of extracting country data from the artist's tags, since it's an unreliable source. Sometimes there are country tags and sometimes there aren't. So I went searching for alternatives.

While I was searching, I stumbled upon Dr. Markus Schedl's paper "Three web-based heuristics to determine a person's or institution's country of origin". Dr. Schedl's approach relies on using a search engine with a specific query to retrieve top-ranked pages and extract the person's country data from their textual content.

This approach might work well in a research environment but I'm not quite sure about a web environment. The Google Custom Search API limit of 100 search queries for free per day seems heavily restrictive.

However, the article also mentions that other authors use a different approach by fetching data directly from specific websites. This approach is more suitable for web environments because it has fewer usage restrictions, which is the reason I chose it for Cartogrify.

You could also just ask ChatGPT. I did some tests and the answers were correct most of the time. But since it isn't free, it isn't an option for me, unfortunately.

Besides Last.fm, I searched for more websites that would contain artists' country data. These were the ones I found:

Rate Your Music
Discogs
MediaWiki API

They ended up being the initial "source pool".

How to fetch the data?

A great thing about some of these websites is that they have public APIs where you can send requests and get data about artists.

There are only two options, then: Send a request to the music website API or go to the artist profile page and do web scraping.

Which source to choose?

Rate Your Music ❌

Rate Your Music doesn't have a public API and it blocked my IP when I sent a request to an artist profile page. So neither of the options is available.

Discogs ❌

Discogs does have a public API but the artist search endpoint response doesn't have country data.

Since they are heavily focused on albums, the only country data available is related to albums. But I suppose that's the country where the album was produced, so it isn't reliable.

Web scraping, according to some forum posts, can also result in an IP ban.

MediaWiki API ❌

While taking a deep look at MediaWiki, the software that powers Wikipedia, I noticed that it contains data for the most famous artists but the underground ones are missing.

Because of the equivalence between the data from Last.fm and MediaWiki for famous artists and Last.fm giving better results for underground ones, I decided to stick with Last.fm.

Last.fm ✅

Last.fm has a public API but the country data may only be available indirectly through tags or wiki text.

They do allow web scraping and the artist's page can be reached using the artist's name. Most famous artists have their country available on their pages which makes web scraping a reliable option.

The First Solution

Given this context, I have chosen the web scraping approach using Last.fm as a source. It seemed like a step up from the explr.fm approach so I dived into it.

Caveats

CORS
Since sending a request from the browser to a Last.fm page triggers a CORS error, the request must be made from the backend, which means using an Edge Function from Supabase.
Readable Stream
In the beginning, I only fetched 20 artists. For me, it wouldn't make sense to invoke the Edge Function 20 times, once per artist, since that would just burn through the free tier resources faster. So I tried to make a single request that returns the data for all 20 artists. This is achievable using a Readable Stream.

How does it work?

Here's the idea:

Invoke the Edge Function sending the artists' names array as the request's body
In the Edge Function, fetch the Last.fm profile page for each artist. Return each HTML page in the response stream
In the frontend, read the response stream and concatenate its chunks to an accumulator string
When a chunk is concatenated, check if the accumulator string contains a full artist HTML page. If yes, extract the page from the accumulator string and apply web scraping.
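
The accumulate-and-extract steps can be sketched as a pure function that is called every time a new chunk arrives. This is an illustration only: it assumes that each streamed page ends with `</html>`, and `PAGE_END` and `drainCompletePages` are made-up names.

```typescript
const PAGE_END = "</html>"; // assumption: every streamed page ends with this tag

// Pulls every complete page out of the accumulator and returns what is left over.
function drainCompletePages(accumulator: string): { pages: string[]; rest: string } {
  const pages: string[] = [];
  let rest = accumulator;
  let endIndex = rest.indexOf(PAGE_END);
  while (endIndex !== -1) {
    const cut = endIndex + PAGE_END.length;
    pages.push(rest.slice(0, cut));
    rest = rest.slice(cut);
    endIndex = rest.indexOf(PAGE_END);
  }
  return { pages, rest };
}

// Two chunks that split a page in the middle of its closing tag
const firstPass = drainCompletePages("<html>A</ht");
const secondPass = drainCompletePages(firstPass.rest + "ml><html>B</html><html>C");
// firstPass.pages is empty; secondPass.pages holds two pages and secondPass.rest is "<html>C"
```

Keeping the leftover in `rest` is what makes it safe for a chunk boundary to fall anywhere, even in the middle of a tag.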

How the web scraping works:

Search for the tags whose content is known to contain country data
Extract their content as strings
Search for country names in each string
The country with the most matches is associated with the artist

Where do you get the countries' names?

I was already using an amazing map dataset called Natural Earth to generate Cartogrify's world map. Since it already contains the countries' names, it was an easy choice to use it.

Problems

String comparison
Tags can be misleading
Edge Functions CPU time limit

String comparison

Lots of users were complaining that some artists were being associated with strange countries. The ones that attracted more attention were:

Michael Jackson was from India.
Every folk artist was from Norfolk Island.
Lots of artists were from Saint Barthélemy, Caribbean. Lil Peep, for example.
Artists from Georgia, USA were from Georgia, a country from Europe/Asia.
Artists from New Jersey, USA were from Jersey, Channel Islands.
An American artist named Neon Indian was from... India.
Gilberto Gil, a fantastic Brazilian artist born in the city of Salvador, was from El Salvador.

It's kinda funny, though. Unacceptable but really funny.

Why?

There are two approaches I used to compare strings. The exact match and the substring match.

I started with the exact match because it's the standard logic, right? If it says "Djavan is an artist from Brazil" you split the string by the whitespaces, match "Brazil" with "Brazil" and that's it.

But it turns out that I was associating lots of artists with plenty of country data with an undefined country. This would happen because "Brazil," or "Brazilian" doesn't match with "Brazil", for example. (Examples are in sentence case but were converted to lowercase before comparison)

I thought that using the substring match would loosen the match criteria and therefore give better results. Oh, boy. What would happen is that a lot of undesired matches would occur.

For example, "India" is a substring of "Indiana" in "Michael Jackson is an artist from Indiana". It's also a substring of "Neon Indian". India ended up being the country with the most matches and these artists would be associated with it.
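
Both behaviours are easy to reproduce. The sketch below uses a tiny illustrative country list and hypothetical function names (`exactMatches`, `substringMatches`); it's a simplified version of the idea, not the original code.

```typescript
const sampleCountries = ["india", "brazil"]; // illustrative subset, already lowercased

// Exact match: compare whole whitespace-separated words.
// Note that multi-word country names would need extra handling.
function exactMatches(text: string, countries: string[]): string[] {
  const words = text.toLowerCase().split(/\s+/);
  return countries.filter((country) => words.includes(country));
}

// Substring match: the country may appear anywhere, even inside another word.
function substringMatches(text: string, countries: string[]): string[] {
  const lowered = text.toLowerCase();
  return countries.filter((country) => lowered.includes(country));
}

// Too strict: "Brazilian" is not the word "brazil"
const strictMiss = exactMatches("Djavan is a Brazilian artist", sampleCountries);
// Too loose: "india" hides inside "indiana"
const looseHit = substringMatches("Michael Jackson is an artist from Indiana", sampleCountries);
// strictMiss is [] and looseHit is ["india"]
```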

Solution

Use the exact match combined with a demonym's exact match. A demonym, according to Google, is "a noun used to denote the natives or inhabitants of a particular country, state, city, etc". Like Brazilian, American, Italian, etc.

Lots of times the wiki text would contain something like "Luiz Gonzaga do Nascimento (Exu, Pernambuco, December 13, 1912 — Recife, Pernambuco, August 2, 1989) was a prominent Brazilian folk singer, songwriter, musician and poet."

There ain't no "Brazil" string but a "Brazilian" one. Without demonyms, Luiz Gonzaga would be associated with an undefined country. With demonyms, he is associated with Brazil. And no substring match problems.

I got the demonyms list from this Wikipedia page.

The idea of using something like Levenshtein Distance instead of an exact match also crossed my mind but I just tried to keep it simple.
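
Here's a rough sketch of the combined lookup. The two maps are tiny illustrative subsets and `findCountryMentions` is a hypothetical name. Stripping the punctuation around each word is my own addition to handle cases like "Brazil," and it only keeps the letters a to z at the word edges, which is a simplification.

```typescript
// Illustrative subsets; the real lists come from the map dataset and the demonyms list.
const countryNames = new Map<string, string>([
  ["brazil", "Brazil"],
  ["italy", "Italy"],
  ["india", "India"],
]);
const demonymToCountry = new Map<string, string>([
  ["brazilian", "Brazil"],
  ["american", "United States of America"],
  ["italian", "Italy"],
]);

// Lowercases the text, trims punctuation around each word, then looks up whole words only.
function findCountryMentions(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/\s+/)
    .map((word) => word.replace(/^[^a-z]+|[^a-z]+$/g, ""));
  const found: string[] = [];
  for (const word of words) {
    const country = countryNames.get(word) ?? demonymToCountry.get(word);
    if (country !== undefined) found.push(country);
  }
  return found;
}

const gonzagaMentions = findCountryMentions(
  "Luiz Gonzaga do Nascimento was a prominent Brazilian folk singer, songwriter, musician and poet."
);
// gonzagaMentions is ["Brazil"]
```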

Tags can be misleading

There are only three tags whose content you need to search on a Last.fm artist page:

Metadata tag
Wiki tag
Tags tag

Gilberto Gil profile page in Last.fm

The metadata tag is the most valuable because it may contain the exact country data.

The tags tag may not contain country data at all. But when it does, it can be a demonym, the country name in its own language or things like that.

The wiki tag is less valuable because it can contain country data that isn't associated with the country where the artist was born. For example, "In the 1970s, Gil added new elements of African and North American music to his already broad palette [...]". As "African" and "American" are respectively demonyms for Africa and the USA, it would count as a match for Africa and the USA.

Solution

Instead of counting country matches, use a point approach. Each tag will have a point weight associated with its value. Metadata tags have 5 points, tags tags have 3 points and wiki tags have 1 point.

When matches occur, the country receives the points according to the tag where the match occurred. The country with the most points wins.
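
A minimal sketch of the point approach, using the weights described above. `TagKind`, `CountryMatch` and `pickCountry` are hypothetical names, and resolving ties in favour of the country seen first is an assumption of this sketch.

```typescript
type TagKind = "metadata" | "tags" | "wiki";

// Weights as described above
const TAG_WEIGHTS: Record<TagKind, number> = { metadata: 5, tags: 3, wiki: 1 };

interface CountryMatch {
  country: string;
  tag: TagKind; // the tag where the match was found
}

// Sums the points per country and returns the top scorer,
// or undefined when there are no matches at all.
function pickCountry(matches: CountryMatch[]): string | undefined {
  const points = new Map<string, number>();
  for (const { country, tag } of matches) {
    points.set(country, (points.get(country) ?? 0) + TAG_WEIGHTS[tag]);
  }
  let best: string | undefined;
  let bestPoints = 0;
  points.forEach((total, country) => {
    // Strictly greater: on a tie, the country seen first stays (assumption)
    if (total > bestPoints) {
      best = country;
      bestPoints = total;
    }
  });
  return best;
}

// One metadata match outweighs several stray wiki mentions
const gilCountry = pickCountry([
  { country: "Brazil", tag: "metadata" },
  { country: "United States of America", tag: "wiki" },
  { country: "United States of America", tag: "wiki" },
  { country: "Africa", tag: "wiki" },
]);
// gilCountry is "Brazil"
```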

Edge Function CPU time limit

The Edge Functions were working fine when I was fetching only 20 artists from the Spotify and Last.fm APIs. But when I increased the number, this error started to appear.

CPU time limit reached. isolate: 16597602940236451129
CPU time used: 560ms
hyper::Error(User(Body), hyper::Error(Body, Custom { kind: UnexpectedEof, error: "unexpected EOF during chunk size line" }))

What was annoying me was that this error would sometimes appear and sometimes not. I was aware the number of artists was causing it but I couldn't find why this intermittent behaviour was happening.

I thought that the root of the problem was the unexpected EOF message, but after I posted a question on Stack Overflow, some answers made me realize that it was just the time limit.

Since I was using a timeout of 1s between each request to avoid being blocked by Last.fm, 30 artists means at least 30s, 35 means 35s and so on.

Solution

Migrate the country detection code to AWS Lambda, which can run for up to 15 minutes. Since a user with 50 unknown artists takes roughly 1 minute, it's ok.

The Second Solution

Do you remember the "source pool"? Well, there was an option that didn't appear there: the MusicBrainz API.

When I found MusicBrainz API for the first time, I had already decided to follow the Last.fm approach. For that reason, I didn't further explore their API and wasn't aware that there was a resource named Area which holds the artist's location data. That would have solved my problems.

Fortunately, the same lovely person in the Last.fm Discord gave me this hint about the Area resource. Then, I decided to shift the approach and use the MusicBrainz API as the main source.

That one goes to the "What I learned" section. That was my biggest mistake in the process of finding the solution to this problem. It cost me so much time.

How does it work?

Invoke the AWS Lambda function passing the artists' names array as the request body
In the Lambda function, request the MusicBrainz API for the data of each artist.
Return the data in the response stream.
Extract the country from the data.

Here's the Lambda function code:

const https = require('https');

// Requests the MusicBrainz search endpoint for one artist.
// Waits 1s before each request to avoid flooding the API.
function getArtistData(artistName) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      const req = https.get({
        hostname: 'musicbrainz.org',
        path: `/ws/2/artist/?query=artist:${encodeURIComponent(artistName)}&fmt=json&limit=100`,
        headers: {
          'User-Agent': ###########
        }
      }, (res) => {
        let body = '';
        res.on('data', (chunk) => body += chunk);
        res.on('end', () => resolve(body));
      });

      req.on('error', (e) => reject(e));
      req.end();
    }, 1000);
  });
}

exports.handler = awslambda.streamifyResponse(async (event, responseStream, _context) => {
  responseStream.setContentType("text/event-stream");

  // Artist names arrive in the request body separated by "###"
  const artistsName = (event.body || "").split("###").filter(Boolean);
  for (const artistName of artistsName) {
    try {
      // Fetch first so the markers are only written around a complete payload
      const data = await getArtistData(artistName);
      responseStream.write("START_OF_JSON");
      responseStream.write(JSON.stringify({ name: artistName, data }));
      responseStream.write("END_OF_JSON");
    } catch (e) {
      console.log(artistName);
      console.error(e);
      // The stream accepts strings or buffers, not Error objects
      responseStream.write(String(e));
    }
  }

  responseStream.end();
});

And here's the code that runs on the Angular app.

 findArtistsCountryOfOrigin(artists: Artist[]): Observable<ScrapedArtist> {
    const artists$ = new Subject<ScrapedArtist>();

    const artistsNames = artists.map((artist) => artist.name);
    fetch(environment.PAGE_FINDER_URL, {
      method: "POST",
      body: artistsNames.join("###"),
    })
      .then(async (response) => {
        const streamReader = response.body?.getReader();
        if (!streamReader) return;

        const textDecoder = new TextDecoder();
        let streamAccumulatedContent = "";

        while (true) {
          const { value, done } = await streamReader.read();

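          // stream: true keeps multi-byte characters intact across chunk boundaries
          streamAccumulatedContent += textDecoder.decode(value, { stream: true });
          // A single chunk may carry more than one complete payload, so drain them all
          while (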
            streamAccumulatedContent.includes("START_OF_JSON") &&
            streamAccumulatedContent.includes("END_OF_JSON")
          ) {
            const startIndex =
              streamAccumulatedContent.indexOf("START_OF_JSON") + this.START_INDICATOR_OFFSET;
            const endIndex = streamAccumulatedContent.indexOf("END_OF_JSON");

            const rawArtistData: RawMusicBrainzArtistData = JSON.parse(
              streamAccumulatedContent.slice(startIndex, endIndex)
            );

            streamAccumulatedContent = streamAccumulatedContent.slice(
              endIndex + this.END_INDICATOR_OFFSET
            );

            const artistData = {
              name: rawArtistData.name,
              artistDataFromMusicBrainz: this.musicBrainzService.getArtistData(rawArtistData),
            };

            const { country, secondaryLocation } =
              this.musicBrainzService.getArtistLocation(artistData);

            if (country == undefined && secondaryLocation != undefined) {
              this.countryService
                .findCountryBySecondaryLocation(secondaryLocation)
                .pipe(
                  switchMap((countryFromSecondaryLocation) => {
                    if (countryFromSecondaryLocation.NE_ID == -1)
                      return this.lastFmService.getLastFmArtistCountry(artistData.name);
                    return of(countryFromSecondaryLocation);
                  })
                )
                .subscribe({
                  next: (country) => {
                    artists$.next({
                      name: artistData.name,
                      country: country,
                      secondaryLocation,
                    });
                  },
                  error: () => {
                    artists$.next({
                      name: artistData.name,
                      country: undefined,
                      secondaryLocation: undefined,
                    });
                  },
                });
            } else if (country == undefined && secondaryLocation == undefined) {
              this.lastFmService.getLastFmArtistCountry(artistData.name).subscribe({
                next: (country) => {
                  artists$.next({
                    name: artistData.name,
                    country: country,
                    secondaryLocation,
                  });
                },
                error: () => {
                  artists$.next({
                    name: artistData.name,
                    country: undefined,
                    secondaryLocation: undefined,
                  });
                },
              });
            } else {
              artists$.next({
                name: artistData.name,
                country,
                secondaryLocation,
              });
            }
          }
          if (done) break;
        }
      })
      .catch((err) => {
        console.log(err);
      });

    return artists$.asObservable();
  }

I will refactor it to make it more presentable. I'm in the "make it work" phase.
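
As a possible starting point for that refactor, the marker handling could move into a pure function. This is only a sketch, not the code Cartogrify runs: `extractFramedJson` and `FramedResult` are hypothetical names, and only the START_OF_JSON and END_OF_JSON markers come from the code above. It loops, so several payloads that arrive in the same chunk are all parsed.

```typescript
// Hypothetical helper: splits accumulated stream text into complete JSON payloads.
const START_MARKER = "START_OF_JSON";
const END_MARKER = "END_OF_JSON";

interface FramedResult<T> {
  messages: T[]; // every complete payload found in the buffer
  rest: string; // unconsumed tail, to be kept until the next chunk arrives
}

function extractFramedJson<T>(buffer: string): FramedResult<T> {
  const messages: T[] = [];
  let rest = buffer;

  while (true) {
    const start = rest.indexOf(START_MARKER);
    if (start === -1) break;

    const end = rest.indexOf(END_MARKER, start + START_MARKER.length);
    if (end === -1) break; // payload is still incomplete, wait for more data

    const rawJson = rest.slice(start + START_MARKER.length, end);
    messages.push(JSON.parse(rawJson) as T);
    rest = rest.slice(end + END_MARKER.length);
  }

  return { messages, rest };
}
```

Inside the read loop, the idea would be to call it with the accumulated content, handle each entry of `messages`, and keep `rest` as the new accumulated content.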

Problems

Missing country data
Matching artists' names
Missing Last.fm underground artists

Missing country data

Sometimes, MusicBrainz only knows the city or state that an artist comes from. Since we need the country, the full location is necessary.

Thee Sacred Souls data from MusicBrainz API

The solution that I found is to use the Free Geocoding API. You send a request with an address and it returns the full location. Then, it's a matter of finding the country name through string comparison.

Free Geocoding API response to San Diego query
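
To give an idea of the string comparison step, here is a minimal sketch. It assumes the full location comes back as one comma-separated string (something like "San Diego, San Diego County, California, United States") and that a list of known country names is available. `findCountryInLocation` is a hypothetical name, not the function used in Cartogrify.

```typescript
// Hypothetical helper: finds a known country name inside a comma-separated location.
function findCountryInLocation(
  fullLocation: string,
  countryNames: string[]
): string | undefined {
  // Map lower-cased names back to their original spelling.
  const known = new Map(
    countryNames.map((name): [string, string] => [name.toLowerCase(), name])
  );

  const parts = fullLocation.split(",").map((part) => part.trim().toLowerCase());

  // Walk from the end, since the country is usually the last part of an address.
  for (let i = parts.length - 1; i >= 0; i--) {
    const match = known.get(parts[i]);
    if (match !== undefined) return match;
  }
  return undefined;
}
```

Scanning from the end also avoids picking a state that shares its name with a country, such as Georgia in "Atlanta, Georgia, United States".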

Matching artists' names

One example that I ran into involved Racionais MC's.

When you search for an artist in MusicBrainz, it returns a list of the artists most likely to match the name you provided, sorted by a "likelihood" score.

At first, I always picked the first one because it had the highest chance of being the artist I wanted. But I noticed that's not always the case.

When searching for Tyler, The Creator, for example, he's not the first on the list. Some artists named only "The Creator" appear first.

MusicBrainz API response to "Tyler, The Creator" query

To fix this, I tried to exactly match the artist name for the first 100 artists and it worked. But then I got the Racionais MC's problem.


The "Racionais MC's" string I got from Last.fm is actually different from the "Racionais MC's" string I got from MusicBrainz. That's because the ' character is encoded differently in each source, so the strings never match.

For this scenario, I added a fallback that returns the first artist if there's no match in the first 100 artists. That seems to be working for now.
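
The matching logic can be sketched like this. The exact match and the fallback to the first result follow what I described above. The apostrophe normalization is an extra idea that is not part of the current solution, and the set of characters it replaces is an assumption about which variants show up. `normalizeApostrophes` and `pickArtist` are hypothetical names.

```typescript
// Replaces common apostrophe look-alikes with the plain ASCII apostrophe.
// The character list is an assumption, not taken from either API.
function normalizeApostrophes(name: string): string {
  return name.replace(/[\u2018\u2019\u02BC\u0060\u00B4]/g, "'").trim();
}

// Picks the candidate whose name matches exactly after normalization.
// Falls back to the first candidate (the highest score) when nothing matches.
function pickArtist<T extends { name: string }>(
  query: string,
  candidates: T[]
): T | undefined {
  const wanted = normalizeApostrophes(query);
  const exact = candidates.find(
    (candidate) => normalizeApostrophes(candidate.name) === wanted
  );
  return exact ?? candidates[0];
}
```

With this, a Last.fm name written with a typographic apostrophe would still match a MusicBrainz name written with a plain one, and the fallback stays in place for everything else.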

Missing Last.fm underground artists

MusicBrainz has data from a lot of artists but the Last.fm underground artists are a special kind of artist whose data only seems to exist in Last.fm.

When I shifted the approach to use MusicBrainz, I stopped using Last.fm. But due to this missing artist problem, I decided to use Last.fm as a fallback in case MusicBrainz doesn't have the data.

The Final Solution

Finally, we reached the final solution. As I said, it's the MusicBrainz API solution + Last.fm API as a fallback to fix the missing artists problem.

The only difference is that I'm no longer doing web scraping on the Last.fm artist profile page. I received some advice from the same lovely person in the Last.fm Discord to move away from that approach. It's kinda like a "good neighbor" policy.

What changes is that the data from Last.fm now comes from their API and only the artists' wiki text and tags are available. The techniques to extract content and match countries' names stay the same. This is the approach that I've been using for more than a month now.
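
The "primary source plus fallback" idea can be sketched as a chain of lookups. This is only an illustration: `CountryResolver` and `resolveCountry` are hypothetical names, and the real lookups (MusicBrainz first, then the Last.fm API) would be passed in as functions.

```typescript
// A resolver receives an artist name and returns a country, or undefined if it can't tell.
type CountryResolver = (artistName: string) => Promise<string | undefined>;

// Tries each resolver in order and returns the first country found.
async function resolveCountry(
  artistName: string,
  resolvers: CountryResolver[]
): Promise<string | undefined> {
  for (const resolve of resolvers) {
    try {
      const country = await resolve(artistName);
      if (country !== undefined) return country;
    } catch {
      // A failing source should not stop the chain, so move on to the next one.
    }
  }
  return undefined; // no source had enough data
}
```

The order of the array is the priority: the first resolver would be the MusicBrainz lookup and the second one the Last.fm lookup. An artist that no source knows ends up as undefined.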

I haven't seen any significant complaints from users since then. Actually, I got some very nice compliments from users who had seen artists being assigned to the wrong countries before.

Here and there artists are still being assigned to undefined countries and sometimes wrong countries. However, I found that the ones that got undefined usually don't have enough data available to determine their countries.

The already implemented suggestions feature is doing a great job of assigning the correct countries to the artists that this solution got wrong and providing countries for the ones that don't have enough data.

What I learned from all of this

Share early versions of your project with your users and other developers! Valuable advice may come from some lovely people out there.
Do not dismiss a development path without exploring it properly.
Even if you put a lot of effort into improving something, it may fail in cases you couldn't even imagine. Always have a fallback when this happens.
The user's point of view is always the most important one. It doesn't matter that the artist's country of origin was found if the user closes the tab because it took too long.

That's it! I hope you learned something or that this will help you with any problem that you are facing. Tell me in the comments what you think about the solutions. Suggestions and feedback are very welcome!

How Cartogrify Was Developed - The Idea

Leonardo Holanda — Thu, 19 Oct 2023 13:06:00 +0000

Being unemployed sucks. But the only thing that doesn't suck about being unemployed is that you have more time to spend on developing fun projects. The ones whose ideas come to you in the shower, you know?

In this post, I'm going to talk more about one of them. More specifically, about:

How did it all start?
What is Cartogrify?
Who developed Cartogrify?
Stack and hosting providers
What I learned from all of this

How did it all start?

In the past 5 years, I have noticed, like most music junkies, the fantastic tools that analyse your music taste through Spotify. Obscurify, How Bad Is Your Streaming Music and Icebergify are some great examples.

How Bad Is Your Streaming Music landing page

They do amazing work with the analysis. But something that I appreciated even more was seeing people sharing their results and just talking about the music they love. I have discovered great songs from great people this way, actually.

One day in 2021, I was thinking: why not develop something like this? It seemed like a fun thing to do. Since it should be something new and different from the ones that already exist, I tried to find a gap that the other tools hadn't covered yet. Then, this idea came to my mind:

Google Keep note I made in August 2021 to not forget about the Cartogrify idea

It basically says: get a playlist, find where the artists and the playlist author come from, highlight the artists' countries on a world map and show interesting details about it.

And that's how Cartogrify was born. 2 years later, after I graduated, concluded my internship and my research, I finally found time to bring it to life.

Ok, so... what is Cartogrify?

It's a web app that fetches your top artists from Last.fm and Spotify and finds where they come from.

Cartogrify telling you how many different countries your artists come from

Then, it counts how many different countries, regions and continents are associated with your artists. A world map highlighting their countries and a Sankey diagram showing their regions and continents are both generated.

Your world map

Your sankey diagram

Finally, it compares your geographic data with that of other users to find how internationally diverse your music taste is.

Cartogrify telling you how internationally diverse your music taste is

It also has a ranking showing which countries have the most diversity and popularity.

Cartogrify diversity ranking

You can see it here.

Who developed Cartogrify?

Me! I'm Leonardo, a software developer from Brazil. I recently graduated in Information Systems from the Federal University of Alagoas while also being an intern for almost 2 years and working in research for 1 year.

Here are my LinkedIn and GitHub profiles, in case you want to see them.

Which stack and hosting providers did I use?

Frontend

In the frontend, I used Angular 16 with PrimeNG components and D3.js for data visualization. It is deployed with Google Firebase Hosting.

Why Angular?

According to some career advice I received, choosing one technology and focusing on it to increase proficiency can increase the chances of getting a job that uses this technology.

Since I had already used Angular in my internship for 1 year and 6 months, it seemed reasonable to keep using it, because getting a job is a priority for me at the moment.

Otherwise, I would have probably chosen Svelte to learn something new since I have previously worked with React and Vue.js. Any of them should work just fine, I think.

Why PrimeNG?

In my internship, Nebular is used as the component library. Since I have also used Material UI in other projects, why not try PrimeNG and see something new?

Why D3.js?

I've already used D3.js in my other project called LogCharts. Because of this, I'm familiar with it and know how capable this tool is for data visualization. So it wasn't a difficult choice.

But the thing they say about D3's learning curve is true. Sometimes, it can be frustrating. I considered switching to Chart.js or Google Charts to make only the bar charts in a more straightforward way but decided to be stubborn. Turns out it was a good decision.

Why Firebase?

Because I have seen lots of people talking about it and using it in their projects. Eventually, I got curious and wanted to try it. The free tier is good for small projects and it turns out to be simple to deploy and manage.

If things get bigger, I might go to Cloudflare, though.

Database

PostgreSQL.

Why PostgreSQL?

One requirement that was on my mind since the beginning was zero cost. I don't have funds to spend on the project at the moment, so I'm looking for generous free tiers.

When I was researching database hosting providers, I noticed that Supabase not only offers free PostgreSQL database hosting but also a RESTful API that handles CRUD requests.

Supabase pricing page on October 18, 2023

It kills two birds with one stone. Seems like a great deal to me! The unlimited API requests and 500MB database space are nice too.

I think I have to say: I'm not sponsored by Supabase. Oh, how I wish...

Questions

When I made my choice, I must say that some questions were on my mind. They might seem silly to some people but I decided not to ignore them. Here they are:

"Is a relational database like PostgreSQL suitable for this project?"
"Would it be better if I used a non-relational database like MongoDB? What difference does it make?"
"DynamoDB has 25GB of free storage and 200 million free requests per month. No way my app gets big enough to get out of this free tier! It's future-proof!!"
"If I choose a hosting provider that doesn't offer a CRUD API, I'm gonna have to build it myself and host it. Is it gonna be worth it? What about cost?"

As you can see, they are divided into three categories: paradigm, scalability and utilities. Let's tackle them.

Paradigm

Since I used MongoDB in my internship for 1 year and 6 months, I knew that this relational vs non-relational thing wouldn't make a big difference. My database is not rocket science, after all. Should I care so much about this question?

From what I read about this topic, the main advice seems to be: "If you don't know the answer, use a relational database. It will probably work." So I did this and, so far, it's working.

Scalability

DynamoDB seems to be the winner here. It's tempting to just use it and forget about some fears like: "What if Supabase's 500MB is not enough?". But at the same time, Supabase is more straightforward and simple to use.

Since I didn't know if DynamoDB's resources would even be necessary to handle Cartogrify's presumably small scale, I just opted for Supabase. Like they say: keep it simple, stupid.

As of today (October 18, 2023), with 900 users and 15k artists, I'm using less than 100MB of database space. But there's a catch: only Last.fm users can use Cartogrify.

Project resources usage summary

I have submitted a request to Spotify to use their API's Extended Quota Mode which allows the app to be used by every Spotify user. If they accept my request, more users and artists will come and more space will be used. Will it be enough to justify a migration to DynamoDB then? Let's wait for the next chapters!

Utilities

Not every database hosting provider offers you a RESTful API like Supabase does. When it doesn't, you need to develop the API yourself, which isn't a problem since it's just CRUD operations.

But to use this API, you must send a request to one of its endpoints, which implies that you must set up and deploy a server that listens and responds to these requests. So far, I haven't found a reason to follow this path instead of using the Supabase API.

The only reason I would end up on DynamoDB or another database without this utility is if Supabase no longer fit the project. But I don't know when, or even if, that day will come, so more points to Supabase.
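
To show what "no server of my own" means in practice, here is a sketch of how a read request to the REST API that Supabase generates could be built. The project URL, the key and the `artists` table with a `country` column are placeholders, not Cartogrify's real schema, and `buildSelectRequest` is a hypothetical helper. The `/rest/v1/` path, the `apikey` header and the `eq.` filter follow the Supabase REST conventions as I understand them, so check the current docs before relying on them.

```typescript
interface RestRequest {
  url: string;
  headers: Record<string, string>;
}

// Builds a "select all rows where column = value" request for a Supabase table.
function buildSelectRequest(
  projectUrl: string,
  anonKey: string,
  table: string,
  equals: Record<string, string> = {}
): RestRequest {
  const query = ["select=*"];
  for (const [column, value] of Object.entries(equals)) {
    // Filters use the column=eq.value form
    query.push(`${encodeURIComponent(column)}=eq.${encodeURIComponent(value)}`);
  }

  return {
    url: `${projectUrl}/rest/v1/${table}?${query.join("&")}`,
    headers: {
      apikey: anonKey,
      Authorization: `Bearer ${anonKey}`,
    },
  };
}
```

The result can be passed straight to `fetch(request.url, { headers: request.headers })` from the frontend, so there is no backend to write or host for plain CRUD operations.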

Backend

In the backend, I used AWS Lambda running JavaScript code in a Node.js environment. Besides that, I also used Supabase Edge Functions running JavaScript code in a Deno environment.

Serverless or not?

As you already know, zero cost is a must and a generous free tier is what I'm looking for. Using the Free For Developers website to search for hosting possibilities, I noticed some serverless options that were quite interesting regarding their free tier.

I didn't know much about this serverless thing so I started reading about it. When I noticed that I didn't need a server running 24/7 and wouldn't have to deal with cold starts like the ones from Heroku or Render, serverless seemed like a great idea.

Since Supabase already offers free Edge Functions, I just decided to try them and it worked nicely!

Why AWS Lambda?

In the beginning, I only used Edge Functions. However, some limitations that I will address in a later post made me migrate the country detection algorithm to another place.

I searched for alternatives and found that AWS Lambda can run for up to 15 minutes and supports Server-Sent Events. That was more than enough to solve my problems. Not only that, but it also has a generous free tier. So I decided to try it and it works great too.

Why Edge Functions?

As I said earlier, it's free, it's from Supabase and it works nicely for everything other than the country detection code. It just works.

What I learned from all of this

Sometimes, it's great to try something new. Sometimes, it's better to seek improvement in what you already know.
You shouldn't worry so much about getting out of free tiers while developing hobby projects.
If your app does get out of the free tier due to being viral or something like that, it's actually a good thing. People are using it.
Sometimes, all you want is just a simple Serverless Function. No machine running 24/7 in the backend and no cold starts.
Having an experienced mentor to discuss things with can make a world of difference.
There's no way you can be good at so many things, like UI/UX, frontend, backend and cloud at the same time. You will just be "enough" in all of them. And that, in my context, is not enough to get a job. So it makes sense to focus on one thing and be "enough" on the others. Contributors are key for big projects!

In the next posts...

In the next posts, I'm gonna talk more about the challenges I faced to make Cartogrify work and also share with you what I learned from them.

I think I'm gonna start with "How to find an artist's country of origin knowing only his name?". That was an awesome one! But if there are other things you would like to know about the Cartogrify development journey, please let me know!

See you there! Thanks!