
Listen to the god damn warnings.

Danny Aziz ・2 min read

(Most) developers are not silly people; they know what they are doing. So when you (I) are using an open source project and a warning appears, listen to it.

A couple of weeks back I was trying to integrate Elasticsearch into a Django project of mine. Setting up on my local machine was fine, as I was using django-elasticsearch-dsl, which does all the heavy lifting for you.

Deploying onto my server was a different story...

I had 800,000 objects that I wanted to keep in sync between Django and Elasticsearch. Locally this was fine, but on my AWS t3.medium server it wouldn't work.

Firstly, the server would become unresponsive when trying to sync because it was using too much RAM, so I used the queryset_pagination argument to paginate my objects for syncing (instead of syncing all the objects in one go, it would sync them in bunches).

🎉 It's syncing!

30 minutes pass and it's time to check, and guess what? There are only around 400,000 objects. Only half of the objects synced! This didn't happen on my local machine, which had a copy of the same database. How could this be happening?

Next, I thought Elasticsearch was using up too much RAM and so I separated Django and Elasticsearch onto different servers.

⏳ Time to sync again.

500,000 - An extra 100,000 objects but not the 800,000 that I needed. This time I spent more time than I care to admit rummaging through documentation, GitHub issues and source code trying to understand how everything worked.

I tried out different people's forks of django-elasticsearch-dsl but even that didn't help.

And then I saw it...

Pagination may yield inconsistent results with an unordered object_list

This was a warning message that was constantly being displayed as I was syncing, but for some reason I ignored it.

Put simply, the queryset_pagination argument which was helping me save RAM (and was required to sync on the server) was the issue. django-elasticsearch-dsl was splitting the objects into bunches to be synced, but the queryset had no ordering. Without an explicit order, the database is free to return rows in a different order for each bunch, so some objects were being synced more than once and some were being ignored.
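
To make that failure mode concrete, here is a small simulation in plain Python. It is only a sketch and not django-elasticsearch-dsl's actual code: the names fetch_unordered_page, sync_unordered and report are made up for illustration, and the per-query shuffle deliberately exaggerates what a real database does (a real one usually just doesn't promise a stable order).

```python
import random


def fetch_unordered_page(rows, offset, limit, rng):
    """Simulate one 'LIMIT limit OFFSET offset' query with no ORDER BY.

    Assumption for illustration: every query sees the rows in a fresh,
    arbitrary order, so the shuffle stands in for "no ordering guaranteed".
    """
    arbitrary_order = list(rows)
    rng.shuffle(arbitrary_order)
    return arbitrary_order[offset:offset + limit]


def sync_unordered(rows, batch_size, rng):
    """Walk through the rows in bunches, one simulated query per bunch."""
    synced = []
    for offset in range(0, len(rows), batch_size):
        synced.extend(fetch_unordered_page(rows, offset, batch_size, rng))
    return synced


def report(rows, synced):
    """Count how many ids were fetched, fetched twice or more, or never fetched."""
    unique = set(synced)
    return {
        "fetched": len(synced),
        "unique": len(unique),
        "duplicates": len(synced) - len(unique),
        "missing": len(set(rows) - unique),
    }


if __name__ == "__main__":
    ids = list(range(1, 1001))
    print(report(ids, sync_unordered(ids, 100, random.Random(0))))
```

The number of rows fetched always matches the table size, which is what makes this easy to miss: every duplicate quietly takes the place of an object that never got synced.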

All I had to do was fork the repo and add one line of code: .order_by("id")
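
In SQL terms, ordering the queryset by id means every paginated query gets an ORDER BY on the same column. The sketch below shows the idea with Python's built-in sqlite3 module rather than Django, so it runs on its own. The item table and the function names are hypothetical stand-ins, not anything from my project or from django-elasticsearch-dsl.

```python
import sqlite3


def make_demo_db(row_count):
    """Create an in-memory table with ids 1..row_count (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO item (id, name) VALUES (?, ?)",
        [(i, f"item {i}") for i in range(1, row_count + 1)],
    )
    return conn


def sync_in_ordered_batches(conn, batch_size):
    """Read every id in bunches, with a stable ORDER BY on each query."""
    synced = []
    offset = 0
    while True:
        batch = conn.execute(
            "SELECT id FROM item ORDER BY id LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not batch:
            break
        synced.extend(row[0] for row in batch)
        offset += batch_size
    return synced


if __name__ == "__main__":
    conn = make_demo_db(1000)
    synced = sync_in_ordered_batches(conn, 100)
    # Both numbers should match the row count: nothing skipped, nothing repeated.
    print(len(synced), len(set(synced)))
```

Because each bunch is cut from the same sorted sequence, the bunches cannot overlap or leave gaps as long as the table isn't changing underneath the sync.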

Finally, all was good in the world again 🎈
