<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gyula Lakatos</title>
    <description>The latest articles on DEV Community by Gyula Lakatos (@laxika).</description>
    <link>https://dev.to/laxika</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F990164%2Fa0f1d61a-51a2-40ef-923f-c7dc50cd8578.png</url>
      <title>DEV Community: Gyula Lakatos</title>
      <link>https://dev.to/laxika</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/laxika"/>
    <language>en</language>
    <item>
      <title>Tanais Online, Week 3 - 4.</title>
      <dc:creator>Gyula Lakatos</dc:creator>
      <pubDate>Mon, 21 Oct 2024 19:48:55 +0000</pubDate>
      <link>https://dev.to/laxika/tanais-online-week-3-4-34a8</link>
      <guid>https://dev.to/laxika/tanais-online-week-3-4-34a8</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi dear readers!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the second entry in my blog about the game I'm working on. Let's look into what was added/changed in the past two weeks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first thing I added was more cities. This was quite trivial after I did the basic historical research.&lt;/p&gt;

&lt;p&gt;The second one was adding city &lt;strong&gt;populations&lt;/strong&gt;. I wanted to control how many units players can recruit, and the best way to do that was to cap it by population.&lt;/p&gt;

&lt;p&gt;These were the plans for population originally:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Represents the population of the city. It is controlled by the birth rate and death rate mechanics. &lt;del&gt;Initially,&lt;/del&gt; each settlement has a higher birth rate than the death rate, &lt;del&gt;but this can be influenced by squalor and diseases&lt;/del&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;del&gt;If squalor is low then the birth rate increases, if it is high, then the birth rate decreases. When there is a disease outbreak, the death rate grows significantly.&lt;/del&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;del&gt;Population is required to raise armies/units and provides&lt;/del&gt; one gold income for each population once each real-life day."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I added both the death rate and the birth rate mechanisms. Based on historical sources, the birth rate was 40 people per 1000 people per year. The death rate was 38, but I set it to 30 instead to leave room for wars (which should ideally contribute the remaining 8 deaths 🤔).&lt;/p&gt;
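&lt;p&gt;As a rough sketch (hypothetical names, not the actual game code), the per-day population math works out like this:&lt;/p&gt;

```java
// Hypothetical sketch of the daily population update. Rates are per
// 1000 people per year, as described above; the tick converts the
// yearly net growth into a per-real-life-day delta.
class PopulationGrowth {

    static final double BIRTH_RATE_PER_1000_PER_YEAR = 40.0;
    static final double DEATH_RATE_PER_1000_PER_YEAR = 30.0;

    // Net population change for one real-life day.
    static double dailyChange(double population) {
        double netPerYear = population
                * (BIRTH_RATE_PER_1000_PER_YEAR - DEATH_RATE_PER_1000_PER_YEAR) / 1000.0;
        return netPerYear / 365.0;
    }
}
```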

&lt;p&gt;Also, each person generates 1 gold every real-life day, so players will think twice about raising armies (which might provide plenty of income after a victory, but not much if they stand around doing nothing or, even worse, keep losing).&lt;/p&gt;

&lt;p&gt;Gold was the first resource I had to add, so I needed to create a "resource system" as well, which at the moment just links the resources to the nations in the game. Each nation has a field for each resource, and when the game refreshes, the nation's gold is recalculated from the nation's total population (all settlements added together).&lt;/p&gt;
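&lt;p&gt;In other words (names are illustrative, not the real code), the recalculation is just a sum:&lt;/p&gt;

```java
// Illustrative sketch: a nation's daily gold income is one gold per
// person, so refreshing the money field boils down to summing the
// populations of the nation's settlements.
class GoldIncome {

    static long dailyGold(long[] settlementPopulations) {
        long total = 0;
        for (long population : settlementPopulations) {
            total += population;
        }
        return total;
    }
}
```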

&lt;p&gt;I added districts as well with the base of buildings:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Building a district costs 500 gold and takes 8 real-life hours.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;There are five districts in the game: Industrial, Military, Cultural (incl. Religious), Entertainment, and Agricultural.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Villages can have two districts, small cities can have three, large cities can have four and metropolises can have five (city level is based on population).&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I also had to add an event system to support the building times. The application updates every game once every 30 seconds.&lt;/p&gt;
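&lt;p&gt;A minimal sketch of such an event system (illustrative names, not the actual implementation): events carry a completion timestamp and are fired by whichever 30-second tick first passes that time.&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;

// Hypothetical event queue for timed actions such as district
// construction. The earliest-finishing event sits at the head of the
// priority queue, so each tick only needs to poll until it finds an
// event that is not yet due.
class GameEventQueue {

    record GameEvent(long completesAtMillis, String action) {
    }

    private final PriorityQueue<GameEvent> pending =
            new PriorityQueue<>(Comparator.comparingLong(GameEvent::completesAtMillis));

    void schedule(GameEvent event) {
        pending.add(event);
    }

    // Called from the periodic refresh; returns every event whose
    // completion time has already passed.
    Queue<GameEvent> drainDue(long nowMillis) {
        Queue<GameEvent> due = new ArrayDeque<>();
        while (!pending.isEmpty() && pending.peek().completesAtMillis() <= nowMillis) {
            due.add(pending.poll());
        }
        return due;
    }
}
```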

&lt;p&gt;This was the time I decided that &lt;strong&gt;enough is enough&lt;/strong&gt;! Saving everything to the database as &lt;del&gt;it should be&lt;/del&gt; expected to be in a relational DB was very cumbersome. I decided that it would be better to save the whole game state (except a few variables) as JSON. This makes saving/loading a lot easier, and I wouldn't search for games based on the game state anyway.&lt;/p&gt;

&lt;p&gt;Now, the game table looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;changeSet&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt; &lt;span class="na"&gt;author=&lt;/span&gt;&lt;span class="s"&gt;"laxika"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;createTable&lt;/span&gt; &lt;span class="na"&gt;tableName=&lt;/span&gt;&lt;span class="s"&gt;"game"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt; &lt;span class="na"&gt;autoIncrement=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;constraints&lt;/span&gt; &lt;span class="na"&gt;primaryKey=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/column&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"varchar(16)"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"gameState"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"json"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/createTable&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;createIndex&lt;/span&gt; &lt;span class="na"&gt;indexName=&lt;/span&gt;&lt;span class="s"&gt;"ix_status"&lt;/span&gt; &lt;span class="na"&gt;tableName=&lt;/span&gt;&lt;span class="s"&gt;"game"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/createIndex&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/changeSet&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The settlement, nation, and district tables are no longer there. I synchronize on the game instance every time I change something (to make changes atomic) and save the game data every 10 turns (5 minutes).&lt;/p&gt;

&lt;p&gt;Also, there were some UI changes as well, but the design is still very wireframe-like 🙂.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaw081o0do2f0o9gr0xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaw081o0do2f0o9gr0xf.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gamedev</category>
      <category>webdev</category>
      <category>devjournal</category>
    </item>
    <item>
      <title>Tanais Online, Week 1 - 2.</title>
      <dc:creator>Gyula Lakatos</dc:creator>
      <pubDate>Mon, 07 Oct 2024 12:24:09 +0000</pubDate>
      <link>https://dev.to/laxika/tanais-online-week-1-2-3ebm</link>
      <guid>https://dev.to/laxika/tanais-online-week-1-2-3ebm</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi dear readers!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This entry is the first in a playful bi-weekly series about developing a game called Tanais Online. Don't expect it to be very detailed. I'm a lone dev who has minimal time and super high ambitions so the focus is on devving. If you have any questions feel free to ask in the comments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First of all, what is Tanais Online? It's a browser game that aims to be a cross between &lt;a href="https://www.youtube.com/watch?v=Lnf45OeeHQo" rel="noopener noreferrer"&gt;Total War: Attila&lt;/a&gt; — the last good historical TW game (yes, feel free to go at me in the comments, idc 😝) — and &lt;a href="https://www.supremacy1914.com/" rel="noopener noreferrer"&gt;Supremacy 1914&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I played the former an awful lot (1,303.6 hours to be exact) and found it awesome. Classical Antiquity rocks in general. Idk why no more games are set in that age. If nobody creates them, then it should be a good market for new releases, right? RIGHT? I hope so hah. We will see. In the worst case, I will create a game that I can play for the rest of my life, even if CA keeps failing with all its historical releases forever.&lt;/p&gt;

&lt;p&gt;I started two weeks ago by sitting down to plan my game in &lt;a href="https://milanote.com/" rel="noopener noreferrer"&gt;Milanote&lt;/a&gt;. They had an ad on YT and it looked like the best tool for the job. I quickly added a gazillion small notes and hit the free limit. Then I realized the unlimited plan costs $10 a month. Whhaaaat? So I just did some Sulla-like purges among those ideas, hopefully keeping the most successful ones.&lt;/p&gt;

&lt;p&gt;I exported &amp;amp; uploaded the full plan &lt;a href="https://drive.usercontent.google.com/download?id=1ZA9YTkJembgw8WKv0U6-gCLzHx3VQae2" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You will see excerpts from it all around in the next episodes.&lt;/p&gt;

&lt;p&gt;After I had a plan, I started working on the hardest part: the graphical representation. I have the advantage that I'm working on a web game, so if I get lucky, I can just render everything in the browser. What a brilliant idea! 😅 After some thinking and research, I decided that my target should be something like what's shown &lt;a href="https://www.totalwar.com/blog/thrones-campaign-map-reveal/" rel="noopener noreferrer"&gt;here&lt;/a&gt; (check the last image). If I can get something like that, and show some armies marching on it, I will be 98% done (aka almost). After some Gimping, headaches, and a few laughs I ended up with this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8wh7uxu2mf28m8c2veq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8wh7uxu2mf28m8c2veq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I know it looks terrible, but honestly, it still looks better than I expected. If I buy David Baumgart's &lt;a href="https://cubebrush.co/dgbaumgart?product_id=aueq1a" rel="noopener noreferrer"&gt;terrain pack&lt;/a&gt;, it will look okay!&lt;/p&gt;

&lt;p&gt;Other than the map UI, I started working on the backend code as well. Login &amp;amp; registration are done, along with the home screen that shows the running and soon-starting games.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezh6uen5b14wlxafuxkz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezh6uen5b14wlxafuxkz.PNG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was harder than I initially thought. Adding websocket support &amp;amp; MySQL-based saving was easy, but coming up with the game's initialization logic was not.&lt;/p&gt;

&lt;p&gt;I ended up creating 4 tables. The player table stores the users, the game table stores the games, the nations table stores the nations in those games, and the settlement table stores the settlements owned by those nations. Maybe using document storage (Mongo?) would have been easier, but I wanted to have "proper", easy-to-use (doesn't exist anywhere but ok) transaction support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;changeSet&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt; &lt;span class="na"&gt;author=&lt;/span&gt;&lt;span class="s"&gt;"laxika"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;createTable&lt;/span&gt; &lt;span class="na"&gt;tableName=&lt;/span&gt;&lt;span class="s"&gt;"game"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt; &lt;span class="na"&gt;autoIncrement=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;constraints&lt;/span&gt; &lt;span class="na"&gt;primaryKey=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/column&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"scenarioId"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"int"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"varchar(16)"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"startTime"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"lastUpdated"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/createTable&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;createTable&lt;/span&gt; &lt;span class="na"&gt;tableName=&lt;/span&gt;&lt;span class="s"&gt;"nation"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt; &lt;span class="na"&gt;autoIncrement=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;constraints&lt;/span&gt; &lt;span class="na"&gt;primaryKey=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/column&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"gameId"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"ownedBy"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"varchar(16)"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"ownerId"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"nation"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"varchar(32)"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/createTable&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;createTable&lt;/span&gt; &lt;span class="na"&gt;tableName=&lt;/span&gt;&lt;span class="s"&gt;"settlement"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt; &lt;span class="na"&gt;autoIncrement=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;constraints&lt;/span&gt; &lt;span class="na"&gt;primaryKey=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/column&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"gameId"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"settlementInScenarioId"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"int"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;column&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"ownerId"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"bigint"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/createTable&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/changeSet&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now in the code, I have scenarios (actually just one) that are blueprints for games. There is update logic that runs every 30 seconds, updates every game, and if there are fewer than five games under preparation, creates as many as needed to reach five. The nations and settlements are spawned at game initialization as they are in the scenario. Then, once a player joins, they take over an AI's place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fixedRate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;scheduleGameRefresh&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;//TODO: Create as many as needed to match 5 :)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;activeGameContainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPreparingGames&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Less than 5 active games. Spawning a new one."&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Game&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gameFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newGame&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="n"&gt;activeGameContainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;registerGame&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;game&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;activeGameContainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getActiveGames&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forEach&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Updating game {}."&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is all for now. In the next two weeks, I plan on adding the logic for actually starting a game, showing cities and nations to the in-game players, and &lt;em&gt;maybe&lt;/em&gt; resource generation.&lt;/p&gt;

</description>
      <category>gamedev</category>
      <category>webdev</category>
      <category>devjournal</category>
    </item>
    <item>
      <title>How I archived 100 million PDF documents... - Part 3: Deduplication &amp; Compression</title>
      <dc:creator>Gyula Lakatos</dc:creator>
      <pubDate>Mon, 06 Feb 2023 15:52:13 +0000</pubDate>
      <link>https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-3-1cih</link>
      <guid>https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-3-1cih</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How I archived 100 million PDF documents..." is a series about my experiences with data collecting and archival while working on the &lt;a href="https://github.com/bottomless-archive-project/library-of-alexandria" rel="noopener noreferrer"&gt;Library of Alexandria&lt;/a&gt; project. My local instance just hit 100 million documents. It's a good time to pop a 🍾 and remember how I got here.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The first part of the series (Reasons &amp;amp; Beginning) is available &lt;a href="https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-1-1ha7"&gt;here&lt;/a&gt;.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;The second part of the series (Indexing, Search &amp;amp; UI) is available &lt;a href="https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-2-ffg"&gt;here&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The previous article discussed why I started using SQL to help with the communication between the newly split applications. Indexing and a basic search UI were also added, so users can easily browse the downloaded documents.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Saving space
&lt;/h2&gt;

&lt;p&gt;Soon after I got the first couple million documents, I realized that I'd need some space to store them. A LOT of space, actually. Because of this realization, I started looking for ideas that could save me as much space as possible, so I could store as many documents as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deduplication
&lt;/h2&gt;

&lt;p&gt;While searching using the web UI, I found a couple of duplicates in the dataset. This was not too problematic in the beginning, but the more documents the program downloaded, the more duplicates I saw on the frontend. I had to do something.&lt;/p&gt;

&lt;p&gt;My first idea was to use a &lt;a href="https://en.wikipedia.org/wiki/Bloom_filter" rel="noopener noreferrer"&gt;Bloom filter&lt;/a&gt;. Initially, it felt like a good idea. I initialized a filter with an expectation of 50 million items and a 1% false positive probability. Like, what fool would collect more than 50 million documents? Guess what, I ended up throwing the whole thing into the garbage bin after hitting 5 million files in a couple of weeks. Who would want to re-size the Bloom filter every time a new maximum is hit? Also, a 1% false positive rate felt way too high.&lt;/p&gt;
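&lt;p&gt;To show why resizing is so painful, here is a toy Bloom filter (a demonstration, not the implementation I used): the bit-array size and hash count are derived from the expected item count at construction time, so outgrowing that expectation means rebuilding the whole filter from scratch.&lt;/p&gt;

```java
import java.util.BitSet;

// Toy Bloom filter for illustration only. m = -n*ln(p)/ln(2)^2 bits and
// k = (m/n)*ln(2) hash functions are the standard sizing formulas; both
// are fixed once the filter is constructed.
class SimpleBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int expectedItems, double falsePositiveRate) {
        this.size = (int) Math.ceil(-expectedItems * Math.log(falsePositiveRate)
                / (Math.log(2) * Math.log(2)));
        this.hashCount = Math.max(1, (int) Math.round((double) size / expectedItems * Math.log(2)));
        this.bits = new BitSet(size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(indexFor(item, i));
        }
    }

    // May return a false positive, but never a false negative.
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(indexFor(item, i))) {
                return false;
            }
        }
        return true;
    }

    private int indexFor(String item, int seed) {
        // Cheap seeded hash, good enough for a demonstration.
        int hash = item.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(hash, size);
    }
}
```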

&lt;p&gt;The next try was to calculate file &lt;a href="https://en.wikipedia.org/wiki/Checksum" rel="noopener noreferrer"&gt;checksums&lt;/a&gt;. A checksum is useful to verify file integrity after a long time in storage, but it can also be used to detect duplicates. I started with &lt;a href="https://en.wikipedia.org/wiki/MD5" rel="noopener noreferrer"&gt;MD5&lt;/a&gt; as the hash function to generate the checksums. It is well known that although MD5 is super quick, it is broken for password hashing. Still, I thought it could work for files nevertheless. Unfortunately, there is a thing called a &lt;a href="https://en.wikipedia.org/wiki/Hash_collision" rel="noopener noreferrer"&gt;hash collision&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After learning that MD5 can have collisions, especially if we take the &lt;a href="https://en.wikipedia.org/wiki/Birthday_problem" rel="noopener noreferrer"&gt;birthday problem&lt;/a&gt; into consideration, I wanted something better. This is when I realized that by using &lt;a href="https://en.wikipedia.org/wiki/SHA-2" rel="noopener noreferrer"&gt;SHA-256&lt;/a&gt;, the chance of a collision would be significantly lower. Luckily, my code was quite well abstracted, so it was easy to replace the MD5 generation with SHA-256. In the final duplicate detection algorithm, a document is considered a duplicate if its &lt;b&gt;file type (extension)&lt;/b&gt;, &lt;b&gt;file size&lt;/b&gt;, and &lt;b&gt;checksum&lt;/b&gt; are all the same. After implementing the change, I had to re-crawl all the documents, but finally without duplicates.&lt;/p&gt;
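&lt;p&gt;Sketched out (with hypothetical names, not the project's actual API), the duplicate key looks something like this:&lt;/p&gt;

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative duplicate-detection key: two documents are considered
// duplicates when their extension, size, and SHA-256 checksum all match.
class DuplicateKey {

    static String keyFor(String extension, byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");

            // Hex-encode the 32-byte digest.
            StringBuilder checksum = new StringBuilder();
            for (byte b : digest.digest(content)) {
                checksum.append(String.format("%02x", b));
            }

            return extension + ":" + content.length + ":" + checksum;
        } catch (NoSuchAlgorithmException e) {
            // Every JVM is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```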

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uhikvvsd7qj7vckagug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uhikvvsd7qj7vckagug.png" alt="Hash collision probabilities" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;small&gt;
Hash collision probabilities.&lt;/small&gt;
&lt;br&gt;
&lt;small&gt;The lighter fields in this table show the number of hashes needed to achieve the given probability of collision (column) given a hash space of a certain size in bits (row). Using the birthday analogy: the "hash space size" resembles the "available days", the "probability of collision" resembles the "probability of shared birthday", and the "required number of hashed elements" resembles the "required number of people in a group".&lt;/small&gt;
&lt;br&gt;
&lt;small&gt;© &lt;a href="https://en.wikipedia.org/wiki/Birthday_problem" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt; - &lt;a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en_US" rel="noopener noreferrer"&gt;CC BY-SA 3.0&lt;/a&gt;&lt;/small&gt;



&lt;h2&gt;
  
  
  Compression
&lt;/h2&gt;

&lt;p&gt;Removing duplicates saved a lot of space, but still, I kept acquiring more documents than I had space for. This made me really desperate to lower the average document size, so I came up with an easy way to cram more documents into the same space: compressing them.&lt;/p&gt;

&lt;p&gt;There are a couple of ways to compress a PDF document. Or rather, a couple of &lt;a href="https://en.wikipedia.org/wiki/Lossless_compression" rel="noopener noreferrer"&gt;lossless&lt;/a&gt; compression algorithms, to be precise.&lt;/p&gt;

&lt;p&gt;The first thing I looked into was the good old &lt;a href="https://en.wikipedia.org/wiki/Deflate" rel="noopener noreferrer"&gt;Deflate&lt;/a&gt; algorithm with &lt;a href="https://en.wikipedia.org/wiki/Gzip" rel="noopener noreferrer"&gt;GZIP&lt;/a&gt; as the file format. It had certain advantages. First of all, it was very &lt;b&gt;mature&lt;/b&gt; and supported by almost everything, including native Java (albeit later on, I switched to the &lt;a href="https://commons.apache.org/proper/commons-compress/" rel="noopener noreferrer"&gt;Apache Compress&lt;/a&gt; library for usability reasons). Secondly, it was very fast and had an "okay" compression ratio.&lt;/p&gt;
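&lt;p&gt;The JDK-native route is only a few lines; a round trip looks something like this (a sketch, not the project's actual code):&lt;/p&gt;

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Deflate compression via the JDK's built-in GZIP streams, entirely
// in memory for the sake of the example.
class GzipRoundTrip {

    static byte[] compress(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```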

&lt;p&gt;GZIP was good enough most of the time, but when I had spare CPU cycles to use, I wanted to re-compress the documents with something that had a better compression ratio. This was when I found out about the &lt;a href="https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm" rel="noopener noreferrer"&gt;LZMA&lt;/a&gt; encoding. Unlike GZIP (which uses a combination of &lt;a href="https://en.wikipedia.org/wiki/LZ77_and_LZ78#LZ77" rel="noopener noreferrer"&gt;LZ77&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Huffman_coding" rel="noopener noreferrer"&gt;Huffman coding&lt;/a&gt;), it does &lt;a href="https://en.wikipedia.org/wiki/Dictionary_coder" rel="noopener noreferrer"&gt;dictionary-based&lt;/a&gt; compression with a much larger dictionary. It is a &lt;b&gt;mostly mature&lt;/b&gt; algorithm too that has an excellent compression ratio and good decompression speed, but abysmal compression speed. Ideal for long-term archival, especially when paired with an extensive amount of free CPU resources.&lt;/p&gt;

&lt;p&gt;The final candidate for compression was &lt;a href="https://en.wikipedia.org/wiki/Brotli" rel="noopener noreferrer"&gt;Brotli&lt;/a&gt;, a relatively new algorithm that was originally intended to replace Deflate on the world wide web. It is mainly used by &lt;a href="https://en.wikipedia.org/wiki/Web_server" rel="noopener noreferrer"&gt;web servers&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Content_delivery_network" rel="noopener noreferrer"&gt;content delivery networks&lt;/a&gt; to serve web data. Unfortunately, I found just one library that supported it in Java (&lt;a href="https://github.com/hyperxpro/Brotli4j" rel="noopener noreferrer"&gt;Brotli4J&lt;/a&gt;), and even that one was not a real Java rewrite but a wrapper around the &lt;a href="https://github.com/google/brotli" rel="noopener noreferrer"&gt;native library&lt;/a&gt; provided by Google. It feels &lt;b&gt;very immature&lt;/b&gt;, mostly because it was only released in 2015 (unlike Deflate, which dates back to 1993, and LZMA to 1998). But it provides the best compression ratio of the three by far. Unfortunately, its resource usage is very high as well; it is the slowest one on the list. Also, to function, it requires a native port for each and every operating system. A hassle to deal with.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Part four will describe how further challenges around database scalability (replacing MySQL) and application decoupling were solved.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devjournal</category>
      <category>showdev</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I archived 100 million PDF documents... - Part 2: Indexing, Search &amp; UI</title>
      <dc:creator>Gyula Lakatos</dc:creator>
      <pubDate>Wed, 18 Jan 2023 14:14:25 +0000</pubDate>
      <link>https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-2-ffg</link>
      <guid>https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-2-ffg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How I archived 100 million PDF documents..." is a series about my experiences with data collecting and archival while working on the &lt;a href="https://github.com/bottomless-archive-project/library-of-alexandria" rel="noopener noreferrer"&gt;Library of Alexandria&lt;/a&gt; project. My local instance just hit 100 million documents. It's a good time to pop a 🍾 and remember how I got here.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The first part of the series (Reasons &amp;amp; Beginning) is available &lt;a href="https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-1-1ha7"&gt;here&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The previous article discussed why I started the project, how the URL collection was done using Common Crawl, and how the downloaded documents were verified for correctness. In the end, we got an application that was able to collect 10.000 valid PDF documents that can be opened with a PDF viewer.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Okay, now what?
&lt;/h2&gt;

&lt;p&gt;While manually &lt;a href="https://en.wikipedia.org/wiki/Smoke_testing_(software)" rel="noopener noreferrer"&gt;smoke-testing&lt;/a&gt; the application, I quickly realized that I was getting suboptimal download speed because the documents were processed one by one on a single thread. It was time to &lt;a href="https://en.wikipedia.org/wiki/Parallel_computing" rel="noopener noreferrer"&gt;parallelize&lt;/a&gt; the document handling logic, but to do that, I first needed to synchronize the URL generation with the downloading of documents. It makes no sense to generate 10.000 URLs per second in memory when we can only visit 10 locations per second: we would just fill the memory with a bunch of URLs and get an OutOfMemoryError pretty quickly. It was time to split up the application and introduce a datastore that could act as an intermediate meeting place between the two applications. Let me introduce &lt;a href="https://www.mysql.com/" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt; to you all.&lt;/p&gt;
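
&lt;p&gt;The synchronization idea can be sketched with a simplified producer-consumer pipeline (this is not the project's actual code, and all names are made up for illustration): a bounded &lt;code&gt;BlockingQueue&lt;/code&gt; sits between the URL generator and the downloader threads, so the generator blocks as soon as the queue is full instead of flooding the heap:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class ThrottledPipeline {

    // The bounded queue caps how many URLs can wait in memory at once; put()
    // blocks the generator when the queue is full, throttling it to the
    // downloaders' pace.
    private static final int QUEUE_CAPACITY = 100;
    private static final String POISON_PILL = "STOP";

    public static List<String> run(int urlCount, int downloaderThreads) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
        List<String> downloaded = new CopyOnWriteArrayList<>();

        // Generator: produces URLs far faster than they can be "downloaded".
        Thread generator = new Thread(() -> {
            try {
                for (int i = 0; i < urlCount; i++) {
                    queue.put("https://example.com/document-" + i + ".pdf");
                }
                for (int i = 0; i < downloaderThreads; i++) {
                    queue.put(POISON_PILL); // one shutdown signal per consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Downloaders: each take() blocks until a URL is available.
        List<Thread> downloaders = new ArrayList<>();
        for (int i = 0; i < downloaderThreads; i++) {
            Thread downloader = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();
                        if (POISON_PILL.equals(url)) {
                            return;
                        }
                        downloaded.add(url); // a real downloader would fetch the URL here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            downloaders.add(downloader);
        }

        generator.start();
        downloaders.forEach(Thread::start);
        generator.join();
        for (Thread downloader : downloaders) {
            downloader.join();
        }
        return downloaded;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("downloaded " + run(1000, 4).size() + " documents"); // downloaded 1000 documents
    }
}
```

In the real project the "meeting place" ended up being a MySQL table rather than an in-memory queue, but the back-pressure principle is the same.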

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshk0hhdu6fyb7vsjtxuc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshk0hhdu6fyb7vsjtxuc.jpg" alt="Image description" width="600" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Was splitting up the application a good idea? Absolutely! What about introducing MySQL? You can make a guess right now. How do you think MySQL handles a couple of hundred million strings in one table? Let me help you: "super badly" is an understatement compared to how awful the performance ended up long term. But I didn't know that at the time, so let's proceed with the integration of said database. After the app was split into two, the newly created "document location generator" application saved the URLs into a table (with a flag marking whether the location had been visited), and the downloader application was able to visit them. Guess what? When I ran the whole app overnight, I had hundreds of thousands of documents saved by the next morning (my 500 Mbps connection was super awesome back then).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yj97602uphpdar6sgc3.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yj97602uphpdar6sgc3.PNG" alt="Image description" width="800" height="625"&gt;&lt;/a&gt;&lt;/p&gt;
If you do the splitting up part over and over long enough, you will run out of space on your drawing board.



&lt;h2&gt;
  
  
  Going elastic
&lt;/h2&gt;

&lt;p&gt;Now I had a bunch of documents. It was an awesome and inspiring feeling! This was the point when I realized that the original archiving idea could be done on a grand scale. It was good to see a couple of hundred gigabytes of documents on my hard disk, but you know what would be better? Indexing them into a search engine, then having a way to search and view them.&lt;/p&gt;

&lt;p&gt;Initially, I had little experience with indexing big datasets. I had used &lt;a href="https://solr.apache.org/" rel="noopener noreferrer"&gt;Solr&lt;/a&gt; a while ago (like 7 years ago, lol), so my initial idea was to use that for the indexing. However, just by looking around a bit longer before starting to work on the implementation, I found &lt;a href="https://www.elastic.co/what-is/elasticsearch" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;. It seemed superior to Solr in almost every way possible (except that it was managed by a company, but whatever). The major selling point was that it was easier to integrate with. As far as I know, both of them are wrappers around Lucene, so the performance should be fairly similar. Maybe one day it will be worthwhile to rewrite the application suite to use pure Lucene, without it turning into &lt;a href="http://wiki.c2.com/?PrematureOptimization=" rel="noopener noreferrer"&gt;premature optimization&lt;/a&gt;. However, until then, Elasticsearch is the name of the game.&lt;/p&gt;

&lt;p&gt;After figuring out how the indexing could be done, I immediately started to work. I extended the downloader application with code that indexed the downloaded and verified documents, then deleted the existing dataset to free up space and started the whole downloading run again the next night.&lt;/p&gt;

&lt;p&gt;The indexing worked remarkably well, so I started to work on a web frontend that could be used to search and view documents. This was (confusingly) called the &lt;em&gt;Backend Application&lt;/em&gt; in the beginning; I quickly renamed it to the more meaningful &lt;em&gt;Web Application&lt;/em&gt;. I'll use that name in this document to minimize confusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going angular
&lt;/h2&gt;

&lt;p&gt;Initially, the frontend code was written in &lt;a href="https://github.com/angular/angular.js?" rel="noopener noreferrer"&gt;AngularJS&lt;/a&gt;. Why? Why did I choose an obsolete technology to create the frontend of my next dream project? Because it was something I already understood quite well and had a lot of experience with. At this stage, I just wanted to progress with my proof of concept. Optimizations and cleanups could be done later. Also, I'm a backend guy, so the frontend code should be minimal, right? Right?&lt;/p&gt;

&lt;p&gt;It started out as minimal, that's for sure. Also, because it only used dependencies that could be served by &lt;a href="https://cdnjs.com/" rel="noopener noreferrer"&gt;cdnjs&lt;/a&gt;, it was easy to build and integrate into a Java application.&lt;/p&gt;

&lt;p&gt;Soon the frontend was finished, and I had some time to actually search and read the documents I had collected. I remember that I wanted to search for something obscure. I was studying gardening back then, so my first search was for the lichen &lt;strong&gt;"Xanthoria parietina"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc11pdag1bys5ari66ue0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc11pdag1bys5ari66ue0.jpg" alt="Image description" width="300" height="211"&gt;&lt;/a&gt;&lt;/p&gt;
Xanthoria parietina&lt;br&gt;&lt;small&gt;&lt;a href="https://commons.wikimedia.org/wiki/User:Holleday" rel="noopener noreferrer"&gt;© Holger Krisp&lt;/a&gt; - &lt;a href="https://creativecommons.org/licenses/by/3.0" rel="noopener noreferrer"&gt;CC BY 3.0&lt;/a&gt; &lt;/small&gt;



&lt;p&gt;To my surprise, I got back around a hundred documents from a sample of 2.3 million. Some of them were quite interesting. Like, who wouldn't want to read &lt;em&gt;"Detection of polysaccharides and ultrastructural modification of the photobiont cell wall produced by two arginase isolectins from Xanthoria parietina"&lt;/em&gt;?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-3-1cih"&gt;Part three&lt;/a&gt; will describe how more challenges around the storage of documents were solved like deduplication and compression.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>learning</category>
      <category>productivity</category>
      <category>career</category>
    </item>
    <item>
      <title>How I archived 100 million PDF documents... - Part 1: Reasons &amp; Beginning</title>
      <dc:creator>Gyula Lakatos</dc:creator>
      <pubDate>Wed, 11 Jan 2023 15:39:08 +0000</pubDate>
      <link>https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-1-1ha7</link>
      <guid>https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-1-1ha7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How I archived 100 million PDF documents..." is a series about my experiences with data collecting and archival while working on the &lt;a href="https://github.com/bottomless-archive-project/library-of-alexandria" rel="noopener noreferrer"&gt;Library of Alexandria&lt;/a&gt; project. My local instance just hit 100 million documents. It's a good time to pop a 🍾 and remember how I got here.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The beginning
&lt;/h2&gt;

&lt;p&gt;On a Friday night, after work, most people watch football, go to the gym, or do something useful with their lives. Not everyone, though. I was an exception to this rule. As an introvert, I spent the last part of my day sitting in my room, reading an utterly boring-sounding book called &lt;a href="https://en.wikipedia.org/wiki/Epistulae_Morales_ad_Lucilium" rel="noopener noreferrer"&gt;"Moral letters to Lucilius"&lt;/a&gt;. It was written by some &lt;a href="https://en.wikipedia.org/wiki/Seneca_the_Younger" rel="noopener noreferrer"&gt;old dude&lt;/a&gt; thousands of years ago. Definitely not the most fun-sounding book for a Friday night. However, after reading it for about an hour, I realized that the title might be boring, but the contents are almost literally gold. Too bad that only a handful of these books withstood the test of time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0w1u18isxnb8rfejov6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0w1u18isxnb8rfejov6.jpg" alt="Image description" width="300" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
Good ole' Seneca&lt;br&gt;&lt;small&gt;(image thanks to &lt;a href="https://www.flickr.com/photos/18637958@N08/3314905852/in/album-72157614433871807/" rel="noopener noreferrer"&gt;Matas Petrikas&lt;/a&gt;)&lt;/small&gt;



&lt;p&gt;After a quick Google search, I figured out that &lt;a href="https://en.wikipedia.org/wiki/Lost_literary_work" rel="noopener noreferrer"&gt;less than 1% of ancient texts&lt;/a&gt; survived to the modern day. This unfortunate fact was my inspiration to start working on an ambitious web crawling and archival project, called the &lt;a href="https://github.com/bottomless-archive-project/library-of-alexandria" rel="noopener noreferrer"&gt;Library of Alexandria&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  But how?
&lt;/h2&gt;

&lt;p&gt;At this point, I had a couple (more like a dozen) of failed projects under my belt, so I was not too keen on starting a new one. I had to motivate myself. After I set the target of saving as many documents as possible, I wanted a more tangible but quite hard-to-achieve goal. I set 100 million documents as my initial goal and a billion documents as my ultimate target. Ohh, how naive I was.&lt;/p&gt;

&lt;p&gt;The next day, after waking up, I immediately started typing on my old and trustworthy PC. Because my programming knowledge is very T-shaped and centered around &lt;strong&gt;Java&lt;/strong&gt;, the language of choice for this project was immediately determined. Also, because I like to create small prototypes to understand the problem I need to solve, I immediately started with one.&lt;/p&gt;

&lt;p&gt;The goal of the prototype was simple. I "just" wanted to download 10.000 documents to understand how hard it is to collect and archive them. The immediate problem was that I didn't know where I could get links to this many files. &lt;a href="https://en.wikipedia.org/wiki/Site_map" rel="noopener noreferrer"&gt;Sitemaps&lt;/a&gt; can be useful in similar scenarios. However, there are a couple of reasons why they were not really a viable solution in this case. Most of the time a sitemap doesn't contain links to the documents, or at least not to all of them. Also, I would need a domain list to download the sitemaps for, and so on. It was clearly a lot of hassle, and there had to be an easier way. This is when the &lt;a href="https://commoncrawl.org/" rel="noopener noreferrer"&gt;Common Crawl&lt;/a&gt; project came into view.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common crawl
&lt;/h2&gt;

&lt;p&gt;Common Crawl is a project that contains hundreds of terabytes of HTML source code from websites that were &lt;a href="https://en.wikipedia.org/wiki/Web_crawler" rel="noopener noreferrer"&gt;crawled&lt;/a&gt; by the project. They publish a new set of crawl data at the beginning of each month.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The crawl archive for July/August 2021 is now available! The data was crawled July 23 – August 6 and contains &lt;strong&gt;3.15 billion web pages&lt;/strong&gt; or &lt;strong&gt;360 TiB of uncompressed content&lt;/strong&gt;. It includes page captures of 1 billion new URLs, not visited in any of our prior crawls.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36ldkuagubse1kpcw421.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36ldkuagubse1kpcw421.PNG" alt="Image description" width="800" height="284"&gt;&lt;/a&gt;&lt;/p&gt;
Tiny little datasets...



&lt;p&gt;It sounded exactly like the data that I needed. There was just one thing left to do. Grab the files and parse them with an HTML parser. This was the time when I realized that no matter what I do, it's not going to be an easy ride. When I downloaded the first entry provided by the Common Crawl project, I noticed that it was saved in a strange file format called &lt;a href="https://en.wikipedia.org/wiki/Web_ARChive" rel="noopener noreferrer"&gt;WARC&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I found one Java library on &lt;a href="https://github.com/Mixnode/mixnode-warcreader-java" rel="noopener noreferrer"&gt;Github&lt;/a&gt; (thanks Mixnode) that was able to read these files. Unfortunately, it had not been maintained for the past couple of years. I picked it up and &lt;a href="https://github.com/laxika/java-warc" rel="noopener noreferrer"&gt;forked it&lt;/a&gt; to make it a little easier to use. (A couple of years later this repo was &lt;a href="https://github.com/bottomless-archive-project/java-warc" rel="noopener noreferrer"&gt;moved under&lt;/a&gt; the Bottomless Archive project as well.)&lt;/p&gt;

&lt;p&gt;Finally, at this point, I was able to go through a bunch of webpages (parsing them in the process with &lt;a href="https://jsoup.org/" rel="noopener noreferrer"&gt;JSoup&lt;/a&gt;), grab all the links that pointed to PDF files based on the file extension, then download them. Unsurprisingly, most of the pages (~60-80%) ended up being unavailable (&lt;a href="https://en.wikipedia.org/wiki/HTTP_404#Soft_404" rel="noopener noreferrer"&gt;404 Not Found&lt;/a&gt; and friends). After a quick cup of coffee, I had the 10.000 documents on my hard drive. This is when I realized that I had one more problem to solve.&lt;/p&gt;
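
&lt;p&gt;The extension-based filtering step can be illustrated with a small, self-contained sketch. Note that the real crawler used JSoup for proper HTML parsing; this stand-in uses a plain regex over the raw HTML just to keep the example dependency-free, and the class name is made up:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfLinkExtractor {

    // Matches the href attribute of anchor tags. A real crawler should use a
    // proper HTML parser (e.g. JSoup's doc.select("a[href]")) instead of a regex.
    private static final Pattern HREF_PATTERN =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static List<String> extractPdfLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF_PATTERN.matcher(html);
        while (matcher.find()) {
            String url = matcher.group(1);
            // Filter by file extension, just like the crawler described above.
            if (url.toLowerCase().endsWith(".pdf")) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/paper.pdf\">Paper</a> <a href='https://example.com/index.html'>Home</a>";
        System.out.println(extractPdfLinks(html)); // prints [/docs/paper.pdf]
    }
}
```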

&lt;h2&gt;
  
  
  Unboxing &amp;amp; validation
&lt;/h2&gt;

&lt;p&gt;So, when I started to view the documents, a lot of them simply failed to open. I had to look around for a library that could verify PDF documents. I had some experience with &lt;a href="https://pdfbox.apache.org/" rel="noopener noreferrer"&gt;PDFBox&lt;/a&gt; in the past, so it seemed to be a good go-to solution. It had no way to verify documents out of the box, but it could open and parse them, and that was enough to filter out the broken ones. It felt a little strange to read the whole PDF into memory just to verify whether it was correct, but hey, I needed a simple fix and it worked really well.&lt;/p&gt;
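
&lt;p&gt;As a side note, a much cheaper sanity check can run before the full PDFBox parse described above. This is not what the project did, just an illustrative sketch with made-up names: every valid PDF starts with the &lt;code&gt;%PDF-&lt;/code&gt; magic bytes, so obvious junk (like an HTML error page served from a .pdf URL) can be rejected without parsing anything:&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;

public class PdfPreCheck {

    // A cheap pre-filter: every valid PDF starts with the "%PDF-" magic bytes.
    // This does NOT replace a full parse (a file can have the header and still
    // be truncated or corrupt), but it rejects obvious junk early.
    public static boolean looksLikePdf(byte[] content) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (content.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (content[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] realPdf = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        byte[] errorPage = "<html>404 Not Found</html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(realPdf));   // true
        System.out.println(looksLikePdf(errorPage)); // false
    }
}
```

Files passing this check would still go through the full open-and-parse verification; the pre-check just saves memory and CPU on the obvious failures.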

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpog69leh9t2kslgafj6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpog69leh9t2kslgafj6.PNG" alt="Image description" width="696" height="261"&gt;&lt;/a&gt;&lt;/p&gt;
Literally, half of the internet.



&lt;p&gt;After doing a re-run, I concluded that 10.000 perfectly valid documents fit in around 1.5 GB of space. That's not too bad, I thought. Let's crawl more, because it sounds like a lot of fun. I left my PC running for about half an hour, just to test the app a bit more.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/laxika/how-i-archived-100-million-pdf-documents-part-2-ffg"&gt;Part two&lt;/a&gt; will describe how more challenges were solved like parallelizing the download requests, splitting up the application, making the documents searchable, and adding a web user interface.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>tailwindcss</category>
      <category>react</category>
      <category>nextjs</category>
      <category>frontend</category>
    </item>
  </channel>
</rss>
