History
Last year (2022) I made a New Year's resolution (the tradition is to do nothing) to create something that generates income. It sounds like the default "I want to create my startup" story that never succeeds, but there has to be a starting point, right? So I decided to create my brand new first service, which would start as free and progressively add paid plans.
So this super effective brand new product was never released, because I had some personal issues with my free time, so it's on hold, or that's what I want to believe xD. Anyway, I coded a lot of stuff, created a lot, and learned a lot, because this service was focused on crawling. Yes, crawling.
My favorite language is Elixir, so all my personal projects are in Elixir, and I'm also a believer in open source, so I decided my code wouldn't stay isolated in that private repo. I had created something good, and I thought that maybe, somewhere in the world, some dude, alone and without hope for humanity, must need help crawling a site using Elixir. Everyone, at some moment in their career, needs to defeat the crawling monster, so my library should make that task easy.
The solution
Enough history, let's move on to the solution.
Crawling already exists in Elixir: the classic HTTP request and the libraries that make it "easier".
The main point is that all the libraries I found have the same trait: they just make the call and return the HTML as text, so the parsing and all the hard stuff is basically left to you.
What is the difference between making my own call using some HTTP request library and using your library? NOTHING!!!
So I created a library that calls the site's endpoint and returns the content parsed as a Map. Sounds great, and it is great.
The library is ex_crawlzy and it gives a direct solution to crawling based on CSS selectors, which is what almost everyone in the world uses when crawling is needed, so you just provide your selectors and the library does the hard work.
The code
As with every library, add the dependency:
def deps do
  [
    {:ex_crawlzy, "~> 0.1.1"}
  ]
end
And it includes the classic call that all crawling libraries give, just returning the HTML as text:
site = "https://example.site"
{:ok, html_content} = ExCrawlzy.crawl(site)
But here comes the interesting stuff: the library includes a function to parse the html_content if you give it the CSS selectors in a map using a key: selector format:
fields = %{
  # shortcut for using a function from ExCrawlzy.Utils
  body: {"div#the_body", :text}
  # module/function way
  # body: {"div#the_body", {ExCrawlzy.Utils, :text}}
  # anonymous function way
  # body: {"div#the_body", fn content ->
  #   ExCrawlzy.Utils.text(content)
  # end}
}
{:ok, %{body: body}} = ExCrawlzy.parse(fields, html_content)
You can parse using a direct shortcut to one of the parsers from the ExCrawlzy.Utils module, a {Module, :function} tuple, or directly a function that will be called when the field is parsed.
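For example, a custom function can post-process the extracted value. A minimal sketch, borrowing the div#the_number selector from the module example further below and assuming it wraps a numeric string:

fields = %{
  number: {"div#the_number", fn sub_doc ->
    # extract the text, then convert it to an integer
    sub_doc
    |> ExCrawlzy.Utils.text()
    |> String.to_integer()
  end}
}
{:ok, %{number: number}} = ExCrawlzy.parse(fields, html_content)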
Want more?
The solution is there, but I know you will need more, maybe more organization in case you have a big map of data to fill. So, for more organization, I added the module ExCrawlzy.Client.Json, which helps you define your crawler directly in a module: you just call YourModule.crawl/1 and it will call, parse, and return your data. The implementation is easy.
First let's define the module
defmodule ExampleCrawler do
  use ExCrawlzy.Client.Json

  add_field(:title, "head title", :text)
  add_field(:body, "div#the_body", :text)
  add_field(:inner_field, "div#the_body div#inner_field", :text)
  add_field(:inner_second_field, "div#inner_second_field", :text_alt)
  add_field(:number, "div#the_number", :text)
  add_field(:exist, "div#the_body div#exist", :exist)
  add_field(:not_exist, "div#the_body div#not_exist", :exist)
  add_field(:link, "a.link_class", :link)
  add_field(:img, "img.img_class", :img)

  def text_alt(sub_doc) do
    ExCrawlzy.Utils.text(sub_doc)
  end
end
If you check the code, in these cases you can use as a parser either a function from ExCrawlzy.Utils or a function defined directly in your module; here text_alt/1 is defined in the module, and the crawler automatically checks whether the atom refers to a function defined in your module or a parser from the utils module.
And then just use it
site = "https://example.site"
{:ok, data} = ExampleCrawler.crawl(site)
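The returned data should be a map keyed by the fields you declared. A sketch of the shape I'd expect from the definitions above (the exact shape is my assumption):

# assumed shape: one key per add_field/3 call in ExampleCrawler
%{
  title: title,
  body: body,
  number: number,
  exist: exist,
  not_exist: not_exist,
  link: link,
  img: img
} = data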
List of elements
Let's suppose you're crawling a store and you want to list the products in your crawl. You can define this crawler using the module ExCrawlzy.Client.JsonList:
defmodule ExampleCrawlerList do
  use ExCrawlzy.Client.JsonList

  list_size(2)
  list_selector("div.possible_value")

  add_field(:field_1, "div.field_1", :text)
  add_field(:field_2, "div.field_2", :text)
end
This module defines 2 new definition macros: list_selector/1, where you define the parent selector, the one that holds the list of elements, and list_size/1, which defines how many elements of the list are taken when parsing.
Example of the HTML pattern for the list:
<div class="parent_class">
  <div class="child_class"> ...content </div>
  <div class="child_class"> ...content </div>
  <div class="child_class"> ...content </div>
</div>
And it follows the same rules as the first crawler module:
site = "https://example_list.site"
{:ok, data} = ExampleCrawlerList.crawl(site)
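Given list_size(2) in the module above, I'd expect data to come back as a list with one map per matched element; this destructuring is a sketch of my assumption about the shape:

# assumed shape: one map per matched div.possible_value element,
# capped at the list_size(2) defined in the module
[
  %{field_1: first_field_1, field_2: first_field_2},
  %{field_1: second_field_1, field_2: second_field_2}
] = data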
Adding HTTP clients
Talking about security: a lot of sites detect robots (crawlers) on incoming calls and return a forbidden response, or other kinds of responses, to ensure only real traffic gets in. For this, the library has pre-coded some simulated clients that help to avoid some robot detectors, but in case your site needs other specific browser request headers, you can define them:
site = "https://example.site"
clients = [
  [
    {"referer", "https://your_site.com"},
    {"user-agent", "Custom User Agent"}
  ]
]
{:ok, content} = ExCrawlzy.crawl(site, clients)
Or use the macro add_browser_client/1 in your crawler module, passing a list of tuples:
defmodule ExampleCrawler do
  use ExCrawlzy.Client.Json

  add_browser_client([
    {"referer", "https://your_site.com"},
    {"user-agent", "Custom User Agent"}
  ])

  add_field(:field_1, "div.field_1", :text)
end
Testing
Testing is really easy: the library is an HTTP client, so testing it means testing an HTTP client. You can use tesla for the testing part.
First, add this line to your test.exs file; you must specifically add your module:
config :tesla, ExampleCrawler, adapter: Tesla.Mock
This is a test example. The HTML is saved in the priv folder for better organization; I strongly recommend this step:
defmodule ExampleCrawlerTest do
  use ExUnit.Case

  import Tesla.Mock

  setup do
    {:ok, content} =
      :your_app
      |> :code.priv_dir()
      |> then(&"#{&1}/test.html")
      |> File.read()

    mock(fn
      %{method: :get, url: "https://example_list.site"} ->
        %Tesla.Env{status: 200, body: content}
    end)

    :ok
  end

  test "list things" do
    site = "https://example_list.site"
    assert {:ok, data} = ExampleCrawlerList.crawl(site)
  end
end
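If the fixture in priv/test.html has known content, you can also assert on the parsed fields themselves by adding a second test inside the same module. A sketch, assuming the list shape discussed above and that your fixture fills field_1 and field_2:

test "parses the expected fields" do
  site = "https://example_list.site"
  assert {:ok, [first | _rest]} = ExampleCrawlerList.crawl(site)
  # match against whatever your fixture actually contains
  assert %{field_1: _, field_2: _} = first
end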
Conclusions
The library is useful and easy to implement, and it solves a lot of stuff; the traditional way, you would need to add the library floki, which is a CSS selector parser, so you can skip that.
There is still stuff to get done. This library is based on declarative development, so there is a lot left to develop to make the crawling more flexible and complete, like nested crawlers; it's all just a concept right now, and I'm trying to find the time to add more and more stuff to this library and the other things I want to work on.