
Database is not always the answer

Jorge Castro ・2 min read

It's about big data versus a conventional database.

I recently finished a hackathon (www.spaceapps.cl) and my project was simple: crawl information about the weather and process it visually. The system worked. The hackathon lasted 2 days, so everything was rushed.

I didn't win, but I'm not mad about it.

However, one of the judges lost her arms :-)

My project was simple.

Collect information from a website.

Technically, it is illegal to crawl a site, so I can't share the library that I used, and that is a shame: it works really well.

Parse this information.

->enterLevel('<A HREF="http://www.nws.noaa.gov/dm-cgi-bin/nsd_lookup.pl?station=','"',false,false)
    ->if()
        ->set('myid','@_value@')
        ->showmessage('@myid@')
        ->object('myrow','stationid','@_value@','add')
    ->else()
        ->showmessage('exit')
        ->break()
    ->endif()
->exitLevel()

That is just a part of the code.

Store it into a MySQL database.

Shrink and process it (ETL).

And finally, display it.

I love using databases. Databases are ideal for data analysis.
However, when I tried to insert a large amount of information into the database, it was impossible to do it in a timely manner. It took around 3 hours to store 5% of the whole information (and the hackathon lasted 48 hours).
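As a sketch of what that slow path looks like (using plain PDO against MySQL; the table and column names here are made up, and my real code used a different library), it is essentially one autocommitted INSERT per record:

<?php
// A minimal sketch of the slow path, assuming PDO/MySQL; the table and
// column names are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare(
    'INSERT INTO observation (station_id, observed_at, temperature) VALUES (?, ?, ?)'
);

// One autocommitted INSERT per record: every call pays the full cost of
// index maintenance, redo logging and a commit/fsync.
foreach ($records as $r) {
    $stmt->execute([$r['station_id'], $r['observed_at'], $r['temperature']]);
}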

So, I decided to change strategy: the FILE SYSTEM. And, surprise:

It did the job in 5 minutes.
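A sketch of the file-system strategy (the path and field names are hypothetical): each record is simply appended as one line to a flat file, nothing more.

<?php
// A minimal sketch of the file-system strategy: append each record as a CSV
// line to a flat file. The path and field names are hypothetical.
$fp = fopen('/tmp/observations.csv', 'ab');

foreach ($records as $r) {
    // The row is written as-is: no indexes, no redo log, no per-row commit.
    fputcsv($fp, [$r['station_id'], $r['observed_at'], $r['temperature']]);
}

fclose($fp);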

Why? It's simple. Every time we insert a row into a database, the database does a lot of work: updating the indexes, writing to the redo log, reserving space in the tablespace, and finally inserting the value. Rinse and repeat a million times. Even if we don't use an index, the overhead is huge. The file system, on the other hand, is simple: it stores the information as-is (and only once). The only bottleneck is the hard disk.

I could have done it with MongoDB, but (for this job) the file system is faster than even MongoDB. Also, MongoDB adds a new level of complexity.

Finally, I compressed all the information, stored a consolidated version into the database, and the system worked decently.
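Roughly, that last step looks like this (again only a sketch with made-up names): compress the raw dump and keep only one consolidated row per station in MySQL.

<?php
// A rough sketch of the last step; file names, tables and columns are
// hypothetical. Compress the raw dump...
file_put_contents(
    '/tmp/observations.csv.gz',
    gzencode(file_get_contents('/tmp/observations.csv'), 9)
);

// ...and store only an aggregated row per station in the database.
$pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');
$stmt = $pdo->prepare(
    'INSERT INTO station_summary (station_id, avg_temperature, readings) VALUES (?, ?, ?)'
);
foreach ($summaries as $s) {  // $summaries is a hypothetical pre-aggregated array
    $stmt->execute([$s['station_id'], $s['avg_temperature'], $s['readings']]);
}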


Discussion

 

Wrapping it in a transaction might have helped with that by deferring the additional work until the transaction gets committed.

I once had some unit tests that had to insert a load of data for each test run, and wrapping this in a transaction reduced the run time to less than half.
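Roughly what that looks like with PDO and MySQL (just a sketch; the table and columns are hypothetical):

<?php
// A sketch of wrapping the inserts in one transaction, so the commit (and
// its fsync) happens once instead of once per row. Names are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare(
    'INSERT INTO observation (station_id, observed_at, temperature) VALUES (?, ?, ?)'
);

$pdo->beginTransaction();
try {
    foreach ($records as $r) {
        $stmt->execute([$r['station_id'], $r['observed_at'], $r['temperature']]);
    }
    $pdo->commit();   // the deferred work is flushed here, once
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}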

 

I do not believe web crawling OR web scraping is illegal.

 
 

Believe it or not. It's illegal. 😅😅

Still, there are so many jobs related to web scraping on freelancing platforms.

 

Could you expand on that? I have seen this opinion and the opinion that it's legal for public information.

It's not illegal at all. I don't understand why people think it's illegal when Google (and all the other search engines) crawls the web all the time for indexing purposes. And those bots/spiders do scrape the metadata of a web page to display on SERPs.

Sometimes website admins place a Disallow directive in robots.txt, but even then it's not illegal.

In fact, this ("it's illegal") is the biggest myth in the web scraping domain.

Technically, scraping is just fetching data from a web page, which is similar to manually copying and pasting.

On the other hand, web scraping at a large scale can cause a denial of service on a website, and that obviously is not legal.

At the same time, scraping data and using it without crediting the author falls under plagiarism.

Thanks to @clairmont32 for clearly specifying it unlike my bad sarcasm 😌😛

Agreed. As long as you're doing HTTP GETs to pages at a reasonable rate, so as not to cause issues for other visitors, it is merely downloading the page via a non-browser application.

 

Hi Jorge,

Thank you for sharing your experience. I completely agree that one should not always go for a (traditional) database without evaluating other technologies.

How many "records" did you write from your tool?

Actually, a good DBMS should be superior in I/O performance to just using the OS to access the filesystem, because it will short-circuit a lot of the filesystem's functionality and either remove unused parts or replace them with less general but optimized versions. The filesystem builds indexes and all of that as well, so there is a similar overhead unless a really huge number of complex indexes is defined.

Of course, evaluating performance optimizations takes time you certainly did not have during the hackathon. Without knowing your code, of course, I would be very surprised if grouping the write operations in batches and/or transactions did not result in the same speed (and maybe replacing the ORM calls with hand-crafted SQL if they aren't well optimized); see the sketch below.

But just to reiterate: I totally agree that a database is not always required, especially for this kind of project with tight time constraints and few scope changes during the iteration (e.g. "now let's crawl and analyze from different machines in parallel in two datacenters").
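As a sketch of the batching idea (with the same hypothetical table and columns), multi-row INSERT statements can cut the per-row overhead considerably:

<?php
// A sketch of "group the writes in batches": build multi-row INSERT
// statements of, say, 1000 rows each. Table and column names are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');
$batchSize = 1000;

foreach (array_chunk($records, $batchSize) as $chunk) {
    // One placeholder group per row, e.g. (?, ?, ?),(?, ?, ?),...
    $placeholders = rtrim(str_repeat('(?, ?, ?),', count($chunk)), ',');
    $values = [];
    foreach ($chunk as $r) {
        array_push($values, $r['station_id'], $r['observed_at'], $r['temperature']);
    }
    $pdo->prepare(
        "INSERT INTO observation (station_id, observed_at, temperature) VALUES $placeholders"
    )->execute($values);
}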

 

Agreed, JSON is superior for small to medium-sized apps. It's easily stored as a file, it's self-describing, and it takes little time to save. It's perfect for web apps.
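For example, a minimal sketch of the JSON-as-a-file approach (the data here is made up):

<?php
// A tiny sketch of storing data as a JSON file; the structure is hypothetical.
$stations = [
    ['station_id' => 'SCEL', 'temperature' => 17.2],
    ['station_id' => 'SCIE', 'temperature' => 14.8],
];

// Save: one call, no schema migrations, no server.
file_put_contents('stations.json', json_encode($stations, JSON_PRETTY_PRINT));

// Load it back just as easily.
$loaded = json_decode(file_get_contents('stations.json'), true);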

 

Yeah. Now I know why my college professor told me that sometimes it is better to use a file-based system (FBS) rather than a DBMS. It's so relatable. 😃😃

 

As an addition, for cases that require database-like features without the overhead of an RDBMS, Redis, RocksDB, Badger, etc. shine in these areas.
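For example, a minimal sketch with the phpredis extension (the key and value are made up):

<?php
// A small sketch using the phpredis extension as a lightweight key-value
// store; the key and value are hypothetical.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Store and read back a value without the overhead of a full RDBMS.
$redis->set('station:SCEL:temperature', 17.2);
$temperature = $redis->get('station:SCEL:temperature');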

 

What is illegal? The crawling, or the library used? 🤔