<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mateus Guimarães</title>
    <description>The latest articles on DEV Community by Mateus Guimarães (@mateusjatenee).</description>
    <link>https://dev.to/mateusjatenee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F37367%2Fec2d5d4e-3380-45e7-8492-5588381ad166.jpg</url>
      <title>DEV Community: Mateus Guimarães</title>
      <link>https://dev.to/mateusjatenee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mateusjatenee"/>
    <language>en</language>
    <item>
      <title>Scaling Laravel to 100M+ jobs and 30,000 requests/min</title>
      <dc:creator>Mateus Guimarães</dc:creator>
      <pubDate>Mon, 25 Jul 2022 14:07:25 +0000</pubDate>
      <link>https://dev.to/mateusjatenee/scaling-laravel-to-100m-jobs-and-30000-requestssec-21ai</link>
      <guid>https://dev.to/mateusjatenee/scaling-laravel-to-100m-jobs-and-30000-requestssec-21ai</guid>
      <description>&lt;p&gt;I was reading some tweets about scaling and how Laravel was a bit behind other frameworks and I remembered I have a cool scaling story to tell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/jackellis" rel="noopener noreferrer"&gt;Jack Ellis&lt;/a&gt; already wrote a &lt;a href="https://usefathom.com/blog/does-laravel-scale" rel="noopener noreferrer"&gt;very interesting blog&lt;/a&gt; on how Fathom Analytics scaled Laravel, but my story is a bit different.&lt;br&gt;
You're about to read how we scaled to over a hundred million jobs and peaks of 30,000 requests/minute in a timespan of only twelve hours, using nothing but Laravel, MySQL and Redis.&lt;/p&gt;

&lt;p&gt;First, I must give some context: in 2019 I joined a pretty cool product: a SaaS that allowed companies to send SMS marketing campaigns.&lt;br&gt;
The app would only handle the actual sending and charge a markup for each SMS --- the companies would be responsible for updating lists and signing with an SMS provider we supported.&lt;br&gt;
The stack was Laravel, Vue and MongoDB.&lt;/p&gt;

&lt;p&gt;At the time I joined, the app was handling maybe 1 to 2 million messages a day.&lt;br&gt;
It is important to note something: sending a message wasn't trivial. For each SMS sent, we expected at least one webhook --- the "delivery report". Any replies to an SMS (any message sent to a number handled by us, in fact) would also come in as a webhook, and we'd have to internally tie it to a sent message.&lt;/p&gt;

&lt;p&gt;The platform had two places with a lot of traffic/processing: list uploading and message sending.&lt;br&gt;
This is what the DB schema looked like:&lt;br&gt;
​&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2F99z8ZhXuBzktSDdBrPXjxN%2Femail" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2F99z8ZhXuBzktSDdBrPXjxN%2Femail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don't ask me why there were three message collections. This is going to become a problem real soon.&lt;/p&gt;

&lt;p&gt;Uploading a list wasn't trivial either: the CSVs were really big (think 50M+ contacts) and tough to handle. The file would be chunked and a job dispatched to upload records --- and uploading those records was also tricky.&lt;br&gt;
We needed to fetch geographical data based on the number --- to do that, we'd split the number into 3 pieces and use them to fetch geodata from some tables (or rather, collections) in our database. This was a relatively expensive process, especially because it happened for each uploaded contact, so for a 50M list you could expect at least 50M queries.&lt;/p&gt;
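&lt;p&gt;The lookup above can be sketched as follows. This is an illustrative Python sketch (the original app was PHP/Laravel), and the exact split boundaries, names and table shape are assumptions:&lt;/p&gt;

```python
def split_number(number: str) -> tuple:
    """Split a phone number into three assumed prefixes (country, area, prefix)."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return digits[:1], digits[1:4], digits[4:7]  # boundaries are assumptions

def fetch_geodata(number: str, geo_table: dict) -> dict:
    """Each piece of the number narrows the geodata lookup.

    In the real app these were queries against reference collections;
    here a dict keyed by the three prefixes stands in for them.
    """
    country, area, prefix = split_number(number)
    return geo_table.get((country, area, prefix), {})
```

&lt;p&gt;Because every uploaded contact triggers this lookup, a 50M-row list means 50M lookups --- which is why caching the shared records mattered so much.&lt;/p&gt;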

&lt;p&gt;Handling campaigns was also tricky: when a campaign was paused, records would be moved from pending messages to paused messages.&lt;br&gt;
When a message was sent, it'd be deleted from pending messages and moved to sent messages.&lt;/p&gt;

&lt;p&gt;It's also important to note that when someone replied to a message, that might trigger a new set of messages to that number (and webhooks). So a single message could generate 4 or more records.&lt;/p&gt;

&lt;p&gt;You don't have to think a lot to figure out this wouldn't scale very well. We had lots of problems very quickly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Uploading lists would never work correctly. They were huge, and more often than not the jobs would time out consecutively until the queue dropped them.&lt;/li&gt;
&lt;li&gt; Creating a contact was complex and intensive: fetching geographical data was fairly expensive and there were other queries involved. Creating a single contact meant hitting multiple parts of the app.&lt;/li&gt;
&lt;li&gt; When lots of campaigns started running, the system would go down because we'd get too many requests. Since everything was synchronous, our server would take a while to respond, requests would pile up and then everything would explode.&lt;/li&gt;
&lt;li&gt; Mongo worked really well until we had a couple million records in each collection. Copying data from one collection to the other was incredibly expensive --- each one of them had unique properties, and refactoring wasn't viable.&lt;/li&gt;
&lt;li&gt; Pushing features, fixes and improvements was very hard. There were no tests until I joined, and even then we didn't have a robust suite. Getting the queue down was the number 1 worry.&lt;/li&gt;
&lt;li&gt; The sending queue was actually processed by a script written in Go. It basically kept reading pending outbounds and sending messages, but it was fairly basic --- there was no UI we could check, and adding new sending providers was very problematic since that script had to be changed as well.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The app was, clearly, very poorly designed. I'm not sure why they chose to use 3 collections for sending messages, but that was a huge problem.&lt;br&gt;
The company tried hiring a Mongo specialist to do some magic --- I'm no DB specialist, but I remember there was a lot of sharding and the monthly bill was almost hitting 5 digits.&lt;br&gt;
That allowed us to hit ~4M sends a day, but it was still very problematic and data had to be cleaned up frequently.&lt;/p&gt;

&lt;p&gt;At around that time it was decided I'd go on a &lt;em&gt;black ops mission&lt;/em&gt; and rebuild this thing from scratch as an MVP. We didn't need many features for that (I haven't mentioned a third of them --- there were a lot) --- just to validate that we'd be able to send those messages comfortably.&lt;br&gt;
I didn't have much experience with microservices and DevOps, so I decided to use what I knew and ignore the new, shiny things.&lt;br&gt;
I decided to use Laravel, MySQL and Redis. That's it. No Kubernetes, no Terraform, no microservices, no autoscaling, nada.&lt;/p&gt;

&lt;p&gt;The new DB schema looked kinda like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2FimnA7GyPyzrTRPBuWFeRvi%2Femail" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2FimnA7GyPyzrTRPBuWFeRvi%2Femail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some other business rules I didn't mention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; During every send, we needed to verify whether that number had received a message from that same company within 24 hours. That meant an extra query.&lt;/li&gt;
&lt;li&gt; We needed to check if the contact should receive the message at a given time --- SMS marketing laws only allowed contacts to receive messages within a certain timeframe, so checking the timezone was extra important.&lt;/li&gt;
&lt;li&gt; For every inbound reply, we needed to check if it had any stop words --- those were customized, so that also meant an extra query. If it did, we needed to block that number for that company --- again, one more query.&lt;/li&gt;
&lt;li&gt; We also needed to check for reply keywords --- words that would trigger a new outbound message. Again, an extra query and maybe an entire sending process.&lt;/li&gt;
&lt;li&gt; Every campaign was tied to an Account that had many Numbers. There was a calculation, at runtime, to determine how many messages that account could send in a single minute without burning the numbers or being throttled.&lt;/li&gt;
&lt;/ol&gt;
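&lt;p&gt;Rule 2, the legal send window, is the kind of check that ran on every single send. Here's a minimal Python sketch of the idea (the original was PHP/Laravel; the 8:00 to 21:00 window and the helper name are assumptions for illustration):&lt;/p&gt;

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def within_send_window(now_utc: datetime, contact_tz: str,
                       start_hour: int = 8, end_hour: int = 21) -> bool:
    """True if the contact's local hour falls inside [start_hour, end_hour)."""
    local = now_utc.astimezone(ZoneInfo(contact_tz))
    return local.hour in range(start_hour, end_hour)
```

&lt;p&gt;This is also why the geodata step had to resolve a timezone for every contact: without it, this rule can't be enforced at send time.&lt;/p&gt;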

&lt;p&gt;To solve those problems, I relied on two of my best friends: queues and Redis. To handle the queues, I went with the battle-tested, easy-to-use Laravel Horizon.&lt;/p&gt;

&lt;p&gt;The new application looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Every message was stored in the messages table, with a foreign key pointing to the campaign it belonged to and a nullable sent_at timestamp. That made it easy (and fast, with indexes) to tell pending messages from sent ones.&lt;/li&gt;
&lt;li&gt; Each campaign had a status column that determined what should happen: pending, canceled, paused, running and completed. Pending meant the messages were still being added into the table.&lt;/li&gt;
&lt;li&gt; Nothing was processed synchronously --- everything went into a queue. From webhooks to list imports to contact creation to message sending.&lt;/li&gt;
&lt;li&gt; When a list was imported, it was processed in batches of 10,000 --- that allowed the jobs to be processed rather quickly without us having to worry about timeouts.&lt;/li&gt;
&lt;li&gt; When a campaign was created, the messages were generated in batches of 10,000 --- when the last batch was generated, the campaign status would change to paused.&lt;/li&gt;
&lt;li&gt; Remember the geographical data stuff? That was super intensive. Imagine hundreds of millions of contacts being imported by different companies on a daily basis.&lt;br&gt;
That was deferred to Redis --- more often than not, numbers would share some of the records we used to fetch geographical data, and having those cached made things much faster.&lt;/li&gt;
&lt;li&gt; Message processing remained complex, but easier to handle: the entire process was based on accounts instead of campaigns since we needed to respect the max throughput of each account, and there could be several campaigns using the same one.
There was a job scheduled to run every minute that calculated how many messages an account could send, fetched the correct number of pending messages in a random order, and then dispatched a single job for each pending message.&lt;/li&gt;
&lt;/ol&gt;
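&lt;p&gt;Step 7's per-minute dispatcher can be sketched like this. A hedged Python illustration: the capacity formula, the names and the dispatch callback are assumptions, and the real app dispatched Laravel queue jobs:&lt;/p&gt;

```python
import random

def messages_per_minute(num_numbers: int, per_number_limit: int) -> int:
    # Assumed formula: throughput is capped per sending number.
    return num_numbers * per_number_limit

def dispatch_for_account(pending_ids: list, num_numbers: int,
                         per_number_limit: int, dispatch) -> int:
    """Pick up to one minute's capacity of pending messages, in random
    order, and dispatch one small job per message."""
    capacity = messages_per_minute(num_numbers, per_number_limit)
    batch = random.sample(pending_ids, min(capacity, len(pending_ids)))
    for message_id in batch:
        dispatch(message_id)
    return len(batch)
```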

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2Fc2JwvqoseS6gbKa8cJDKck%2Femail" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2Fc2JwvqoseS6gbKa8cJDKck%2Femail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember stop and reply keywords? That went into cache as well.&lt;br&gt;
Determining whether an outbound was sent recently? Also cache.&lt;/p&gt;
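&lt;p&gt;All of those lookups followed the same cache-aside shape: check the cache first, hit the database only on a miss. A dict-backed Python sketch (the stub cache, names and loader are illustrative assumptions; the real app used Redis through Laravel's cache):&lt;/p&gt;

```python
def get_stop_words(company_id: int, cache: dict, load_from_db) -> set:
    """Cache-aside: the database is only hit on a cache miss."""
    key = f"stop_words:{company_id}"
    if key not in cache:
        cache[key] = load_from_db(company_id)
    return cache[key]
```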

&lt;p&gt;Horizon orchestrated a couple of queues --- one to import CSVs, another to import each contact, one to dispatch account jobs, one to send the messages, one to handle webhooks, etc.&lt;br&gt;
The infrastructure piece looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2FiLpB1Yx5Defkmhu2PMDkNs%2Femail" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fembed.filekitcdn.com%2Fe%2FjDyBFLGyRZ2yrzNPKsjru%2FiLpB1Yx5Defkmhu2PMDkNs%2Femail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I can't remember the size of each server off the top of my head, but besides MySQL and Redis, they were all pretty weak.&lt;br&gt;
With that stack, the app managed to send over 10 million messages and process over 100 million queued jobs in a 12-hour timespan with ease.&lt;br&gt;
It went past 1B records pretty quickly, and it was still smooth as butter. I don't remember the exact financials, but the monthly bill went from 5 digits (plus the DB consultant) to under a thousand dollars.&lt;/p&gt;

&lt;p&gt;No autoscaling, no K8s, nothing --- just the basics and what I knew relatively well.&lt;/p&gt;

&lt;p&gt;A couple of thoughts on the overall infrastructure:&lt;br&gt;
Handling indexes properly on MySQL paid off greatly --- we didn't add any unless we needed them.&lt;br&gt;
Redis was used extensively and with generous TTLs --- it's cheaper to throw more RAM at the stack than to have the system go down. Overall it worked great, but handling cache invalidation was tricky at times.&lt;/p&gt;

&lt;p&gt;Rewriting how messages were sent made things so much easier, since I could encapsulate each driver's unique behavior into its own class and have them all follow a contract.&lt;br&gt;
That meant adding a new sending driver was just a matter of creating a new class, implementing an interface and adding a couple of methods --- that would make it show up in the UI, handle webhooks and send messages.&lt;/p&gt;
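&lt;p&gt;The contract idea looks roughly like this. A Python sketch of the shape (the original was a PHP interface; the method names are assumptions):&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class SmsDriver(ABC):
    """Shared contract: each provider encapsulates its own behavior."""

    @abstractmethod
    def send(self, to: str, body: str) -> str:
        """Send a message and return the provider's message id."""

    @abstractmethod
    def parse_webhook(self, payload: dict) -> dict:
        """Normalize a provider webhook into a common shape."""

class FakeProvider(SmsDriver):
    def send(self, to: str, body: str) -> str:
        return f"fake-{to}"

    def parse_webhook(self, payload: dict) -> dict:
        return {"message_id": payload["id"], "status": payload["status"]}
```

&lt;p&gt;Adding a provider is then just one more subclass; nothing else in the pipeline has to change.&lt;/p&gt;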

&lt;p&gt;Regarding Laravel Horizon, one important thing: the jobs needed to be dumb. Real dumb.&lt;br&gt;
I'm used to passing models as arguments to jobs --- Laravel handles that extraordinarily well by serializing the model down to its identifier when the job is queued and unserializing it on the queue worker. When that happens, the record is re-fetched from the database: a query is executed.&lt;/p&gt;

&lt;p&gt;That was definitely something I did not want, so the jobs had to be as dumb as possible where the database was concerned --- all the necessary arguments were passed directly from the account handler job, so by the time a job was executed it already knew everything it needed to.&lt;br&gt;
No need to pass a List instance to a job if all it needs is the list_id --- just pass "int $listId" instead. 😉&lt;/p&gt;
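&lt;p&gt;The "dumb job" idea in miniature, as a hedged Python sketch (the names are hypothetical; in the real app this was a Laravel queued job):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class SendMessageJob:
    """Carries only scalars, so unserializing it triggers no query."""
    message_id: int  # scalar id, not a model instance
    to: str
    body: str

    def handle(self, send) -> None:
        # Everything needed was passed in by the account handler job.
        send(self.to, self.body)
```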

&lt;p&gt;To wrap it up, tests made a huge difference. The old application didn't have many besides the ones I wrote, and it was fairly unstable. Knowing that everything worked as I intended gave me some peace of mind.&lt;/p&gt;

&lt;p&gt;If I were to do this today, I'd probably pick some other tools: Swoole and Laravel Octane, for sure. Maybe SingleStore for the database. But overall, I'm happy with what I picked back then --- it worked super well while still leaving room for improvement and for swapping a couple of things.&lt;/p&gt;

&lt;p&gt;I'd also, definitely, ask &lt;a href="https://twitter.com/aarondfrancis" rel="noopener noreferrer"&gt;Aaron Francis&lt;/a&gt; and &lt;a href="https://twitter.com/jackellis" rel="noopener noreferrer"&gt;Jack Ellis&lt;/a&gt; for help.&lt;/p&gt;

&lt;p&gt;But yeah, that's it --- end of story. Happily ever after. 😁&lt;/p&gt;

</description>
      <category>laravel</category>
      <category>mysql</category>
      <category>scaling</category>
      <category>redis</category>
    </item>
    <item>
      <title>Scaling an application to 100M+ jobs and tens of thousands of requests per minute with Laravel</title>
      <dc:creator>Mateus Guimarães</dc:creator>
      <pubDate>Sun, 24 Jul 2022 18:32:10 +0000</pubDate>
      <link>https://dev.to/mateusjatenee/escalando-uma-aplicacao-para-100m-jobs-e-dezenas-de-milhares-de-requisicoes-por-minuto-com-laravel-2153</link>
      <guid>https://dev.to/mateusjatenee/escalando-uma-aplicacao-para-100m-jobs-e-dezenas-de-milhares-de-requisicoes-por-minuto-com-laravel-2153</guid>
      <description>&lt;p&gt;In this post I'll show how we scaled an application using a stack plenty of people would roll their eyes at: Laravel, Redis and MySQL. That's it.&lt;/p&gt;

&lt;p&gt;In 2019 I joined a project that was an SMS campaign manager.&lt;br&gt;
Basically, big brands paid a monthly fee plus a markup per message sent.&lt;br&gt;
They uploaded their own lists and then segmented them using campaigns.&lt;/p&gt;

&lt;p&gt;At the time, the platform was sending between 1 and 2M messages a day. One important detail: sending a message involved a few other things. For each one, we received a webhook from the provider with that message's "delivery status". Every reply to an SMS also came back as a webhook, and internally we had to tie that reply to a message sent by the platform.&lt;/p&gt;

&lt;p&gt;There was a heavy flow of data mainly in two places: the table that stored each list's contacts (some companies uploaded lists with 50M+ records) and the sent messages.&lt;br&gt;
The stack at the time was Laravel, Vue and MongoDB. The architecture was a real mess, and the collections looked roughly like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lSwJc2iF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f19rmeq2ogq1tlci8phy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lSwJc2iF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f19rmeq2ogq1tlci8phy.png" alt="Original architecture" width="880" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have no idea why there were 3 collections to represent messages, and I'm sure it would give &lt;a class="mentioned-user" href="https://dev.to/zanfranceschi"&gt;@zanfranceschi&lt;/a&gt; a fit, but basically, when someone paused a campaign, all pending records were copied to the paused-records collection, and vice versa.&lt;br&gt;
When a message was sent, it was deleted from the pending-messages collection.&lt;/p&gt;

&lt;p&gt;Keep in mind that when someone replied to a message, that sometimes made the system send &lt;em&gt;another&lt;/em&gt; message, so a single message could generate up to 4 records (sent message, received webhook, reply, and another sent message).&lt;/p&gt;

&lt;p&gt;You don't have to think hard to see this wasn't going to scale. We quickly ran into some pretty big problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uploading lists almost never worked properly. They were huge, and the records were pushed to a queue that used the database as its driver and processed them in chunks --- consecutive timeouts until the job was discarded were very common.&lt;/li&gt;
&lt;li&gt;Creating a contact was somewhat complex and intensive --- we had to generate location data (city, state, timezone, etc.) from the number, and there were constantly updated tables that gave us that information. For each contact, 3 queries were executed (based on parts of the phone number).&lt;/li&gt;
&lt;li&gt;When campaigns started running, the site commonly went down under the flood of requests. Since everything was synchronous, each request took a while to answer and php-fpm started to cry.&lt;/li&gt;
&lt;li&gt;Mongo worked very well until there were a few million records in each collection. The scheme of copying data from one collection to another obviously didn't help --- sometimes we simply couldn't pause a campaign.&lt;/li&gt;
&lt;li&gt;Pending messages were processed by a Go program. No UI, nothing --- it just kept reading from the database and sending. That made adding new sending drivers quite problematic, since the "sender" code had to be changed as well.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There was, clearly, a big architecture problem here. I don't know why they picked Mongo, and I don't know why they designed the collections that way, but the point is it wasn't working. A specialist was hired to try to scale the database, but even so we couldn't get far past ~5-6M a day, and data had to be cleaned up every single day.&lt;/p&gt;

&lt;p&gt;At that point it was decided that I'd build an MVP, basically of a v2, that could actually scale. The only requirement was that it scaled --- it didn't need many features; those would come later.&lt;br&gt;
Well, my experience with microservices was very limited, and I didn't want to risk using anything I didn't know well.&lt;br&gt;
So I decided to use Laravel, Redis and MySQL. That's it.&lt;/p&gt;

&lt;p&gt;The database ended up looking roughly like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7E00SoXt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hnkmia18vibadr5oowj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7E00SoXt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hnkmia18vibadr5oowj8.png" alt="New database schema" width="880" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I left some important details out so this doesn't get even longer than it already is --- but there were a few other requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;During every send, we had to check whether the number had already received a message from that company in the last 24 hours that wasn't a reply.&lt;/li&gt;
&lt;li&gt;On every reply, we had to check whether the message contained any "stop word" --- these were customized, so they involved a database hit --- and, if so, block the number from receiving any further messages from that company.&lt;/li&gt;
&lt;li&gt;On every reply we also had to check for "reply keywords" that would trigger follow-up messages. That also involved a database hit.&lt;/li&gt;
&lt;li&gt;Every campaign was tied to an "Account" that held several numbers to be used. There was a runtime calculation to determine how many messages that account could send per MINUTE without burning the numbers or being rate limited by the provider.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve all of this I used Redis and queues extensively.&lt;br&gt;
To manage the queues, I used Laravel Horizon.&lt;/p&gt;

&lt;p&gt;The new application looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All messages lived in an "outbounds" table, with an FK to the campaign and a nullable "sent_at" datetime column. That's how we determined what had been sent and what hadn't.&lt;/li&gt;
&lt;li&gt;Campaigns had a status column (pending, cancelled, paused, running, completed). Pending meant the messages were still being generated.&lt;/li&gt;
&lt;li&gt;Nothing was processed synchronously --- everything went to the queue. From incoming webhooks to sending our own webhooks, everything was queued.&lt;/li&gt;
&lt;li&gt;When a list was imported, it was processed on the queue in batches of 10,000 records. That allowed the jobs to run quickly.&lt;/li&gt;
&lt;li&gt;When a campaign was created, the messages were also generated in batches of 10,000 --- when the last batch was generated, the campaign status changed to "paused".&lt;/li&gt;
&lt;li&gt;Remember the intensive process of fetching a contact's geographical data? Since the number was split into 3 parts, the same record was often used many times. It all went to Redis --- we only fetched it once and then kept it in memory to avoid database queries.&lt;/li&gt;
&lt;li&gt;Message processing remained complex, but it was easier to manage: it was done per "Account" instead of per campaign, since we had to respect the maximum sends per minute. A job ran every minute, fetched all the Accounts currently in use by campaigns and handed each one to a job that processed it.
That job calculated how many messages could be sent that minute, fetched pending outbounds across all campaigns using that Account, and then dispatched them so they'd be sent uniformly over the one-minute window (instead of the queue processing everything as fast as possible).
Each message send was a single job.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6jpdnaDB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/adr81ujjhce080nr2gai.png" alt="Message sending" width="880" height="586"&gt;
&lt;/li&gt;
&lt;/ol&gt;
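&lt;p&gt;The uniform spreading in item 7 can be sketched as follows: instead of letting the queue burst through everything, each job gets an increasing delay inside the one-minute window. An illustrative Python sketch (in Laravel this would map to per-job dispatch delays; the names are assumptions):&lt;/p&gt;

```python
def delays_for(count: int, window_seconds: int = 60) -> list:
    """Evenly spaced delays (in seconds) for `count` sends in one window."""
    if count == 0:
        return []
    step = window_seconds / count
    return [round(i * step, 3) for i in range(count)]
```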

&lt;p&gt;Remember the stop and reply keywords? Those went into cache as well.&lt;br&gt;
Laravel Horizon orchestrated a few queues --- one to import list CSVs, another to generate contacts, another to dispatch the account jobs, others to send outbounds, another to process webhooks, etc.&lt;br&gt;
The infrastructure looked roughly like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k7SPnHrq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l4847hhezk7e5hrt0x8o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k7SPnHrq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l4847hhezk7e5hrt0x8o.png" alt="Infrastructure" width="880" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I don't remember the size of each server now, only that Redis had a ton of RAM, but we always left PLENTY of headroom in case someone decided to send a lot more messages out of nowhere.&lt;br&gt;
With that stack the application comfortably scaled to more than 100 million jobs and 15M messages sent within a 12-hour window.&lt;br&gt;
It passed 1 billion contacts and messages pretty quickly without a single headache. The monthly bill went from over USD 11,000 to under USD 900.&lt;/p&gt;

&lt;p&gt;There was no autoscaling, no clusters, no k8s, nothing, for the simple reason that I didn't really know those tools --- and if I had gone down that path, it might have taken much longer to write this application. I used the basics I knew and it worked very well.&lt;/p&gt;

&lt;p&gt;As for MySQL, I was very careful with indexes and avoided any unnecessary queries. I used Redis wherever I could, always with generous TTLs.&lt;br&gt;
The API part would be much easier today with Swoole and Laravel Octane, but they didn't exist back then. I just had to tweak php-fpm a bit until I found values that left a comfortable margin.&lt;/p&gt;

&lt;p&gt;Another thing that helped a lot was how I wrote the sending drivers in the new application --- it became very easy to add new drivers and to write tests. I didn't mention it, but the previous application only had a few critical tests, which I wrote. The new one being well tested made everything much easier 😁.&lt;/p&gt;

&lt;p&gt;To read more of what I write, you can follow me on twitter: &lt;a href="https://twitter.com/mateusjatenee"&gt;@mateusjatenee&lt;/a&gt; and on &lt;a href="https://www.youtube.com/channel/UC8Ej2aqcItiMWhyZMS4R9yA"&gt;YouTube&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>laravel</category>
      <category>scaling</category>
      <category>mysql</category>
      <category>redis</category>
    </item>
    <item>
      <title>Handling side projects as a web developer</title>
      <dc:creator>Mateus Guimarães</dc:creator>
      <pubDate>Fri, 03 Jul 2020 17:32:21 +0000</pubDate>
      <link>https://dev.to/mateusjatenee/handling-side-projects-as-a-web-developer-4adl</link>
      <guid>https://dev.to/mateusjatenee/handling-side-projects-as-a-web-developer-4adl</guid>
      <description>&lt;h1&gt;
  
  
  Hey there!
&lt;/h1&gt;

&lt;p&gt;Today I want to talk about handling side projects as a developer.&lt;br&gt;&lt;br&gt;
We developers have the benefit of being able to literally build stuff, and while that’s awesome, it doesn’t always help us.  &lt;/p&gt;

&lt;p&gt;I think it’s pretty common for a developer to start working on a bunch of ideas and then throw them in the trash. I’ve done that countless times myself.&lt;br&gt;&lt;br&gt;
I want to talk about the “strategies” I’ve been using to try to actually get something done. There are a few points I want to cover:  &lt;/p&gt;

&lt;h2&gt;
  
  
  Don’t start coding like crazy
&lt;/h2&gt;

&lt;p&gt;That’s what we usually do.&lt;br&gt;&lt;br&gt;
Now, before I start coding something (unless I’m ridiculously excited), I give it 12 hours. If in 12 hours I still want to code it, that’s a positive sign.&lt;br&gt;&lt;br&gt;
What used to happen - at least to me - is that I’d think of something, code like crazy and give up. Giving it a few hours gives me confidence it’s something I’m likely to see through to the end.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Try to think of everything the MVP is going to need.
&lt;/h2&gt;

&lt;p&gt;Instead of just writing code with no clear intentions, I now write down all the features I can think of for an MVP on a piece of paper.&lt;br&gt;&lt;br&gt;
Sure, I might change a thing or two while developing, but that usually gives me a rough idea of how much time I’m going to take and helps me not deviate from the basic features needed.&lt;br&gt;&lt;br&gt;
If I think of a new feature while coding, I'll also weigh whether it's something that would be awesome in an MVP or whether it can sit on the backlog for a while.&lt;br&gt;&lt;br&gt;
If you keep adding features endlessly, you never launch.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Do not reinvent the wheel
&lt;/h2&gt;

&lt;p&gt;Unless you’re trying to learn, or what you’re building requires something totally new, try not to reinvent the wheel.&lt;br&gt;&lt;br&gt;
If there’s a well-maintained package for something you want to develop, use it instead. You can always contribute to that project, or, if you need something much more robust and specific, develop your own later down the road.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Be consistent.
&lt;/h2&gt;

&lt;p&gt;One of the major things that made me drop projects in the past was the ups and downs. I'd get excited, code for several hours, then not code for a few days, then code for 12 hours straight, and in the end I'd drop it.&lt;br&gt;
Reserving some time for it daily helped me tremendously.&lt;br&gt;&lt;br&gt;
These days I define something like "dedicate 3 hours a day to X", and by dedicate I mean actually sit down and focus entirely on that.&lt;br&gt;&lt;br&gt;
Some days I get to work a little more, some days less, but I'm always consistent about working on them.&lt;br&gt;&lt;br&gt;
Another thing I do: if working on a project every day is too hard (or if you have more than one), I'll set specific days to work on them. I'd rather consistently work 3 days a week than work on them occasionally. &lt;/p&gt;

&lt;h2&gt;
  
  
  Invest in copywriting and a landing page.
&lt;/h2&gt;

&lt;p&gt;I’d say the copywriting is even more important than the design.&lt;br&gt;&lt;br&gt;
That’s something I didn’t know anything about and there are some really good resources out there.&lt;br&gt;&lt;br&gt;
I recommend Marketing Examples on twitter and there are a few additional resources I’m going to leave at the end.   &lt;/p&gt;

&lt;h2&gt;
  
  
  Share.
&lt;/h2&gt;

&lt;p&gt;Share what you’ve been learning on communities like Indie Hackers, Reddit and Twitter. More often than not someone who’s done something similar to what you’re working on will chime in and give you some tips. It’s also great to get any type of feedback.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/GoodMarketingHQ"&gt;Good Marketing Examples&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/LisaDziuba/Marketing-for-Engineers/blob/master/README.md"&gt;Marketing for Engineers&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Thanks for reading, and if possible give &lt;a href="https://twitter.com/mateusjatenee/status/1266470301866643466"&gt;the Twitter thread&lt;/a&gt; or &lt;a href="https://mateusguimaraes.com/posts/handling-side-projects-as-a-developer"&gt;the original blog post&lt;/a&gt; some love.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>entrepreneurship</category>
      <category>startup</category>
    </item>
    <item>
      <title>Which tech stack should you pick as a web developer?</title>
      <dc:creator>Mateus Guimarães</dc:creator>
      <pubDate>Fri, 03 Jul 2020 02:26:30 +0000</pubDate>
      <link>https://dev.to/mateusjatenee/which-tech-stack-should-you-pick-as-a-web-developer-3pn4</link>
      <guid>https://dev.to/mateusjatenee/which-tech-stack-should-you-pick-as-a-web-developer-3pn4</guid>
      <description>&lt;p&gt;Hey guys -- I made a quick little video on which tech stack to pick as a web developer, as I'm sure many of us have faced this dilemma.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wxxr97VyI6c"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Tips for testing web applications</title>
      <dc:creator>Mateus Guimarães</dc:creator>
      <pubDate>Thu, 14 May 2020 18:51:54 +0000</pubDate>
      <link>https://dev.to/mateusjatenee/tips-for-testing-web-applications-45ic</link>
      <guid>https://dev.to/mateusjatenee/tips-for-testing-web-applications-45ic</guid>
      <description>&lt;p&gt;Hey everyone!&lt;br&gt;&lt;br&gt;
Here are some tips that might help you test your applications. I previously posted them as a &lt;a href="https://twitter.com/mateusjatenee/status/1260970923999592453"&gt;Twitter thread&lt;/a&gt;. Sadly, I couldn't inline the images here, so it may be worth looking at the thread as well.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Don't be afraid to hit the database. Database calls are very fast these days, and if you mock the database you're usually not testing the feature completely. It does make sense to write "pure" unit tests for lower-level things like libraries, though.&lt;/li&gt;
&lt;li&gt;Reverse-test when testing code that already exists. An easy way to make sure your new tests cover an existing feature is to comment out blocks of code (especially if blocks) and see whether the tests break. If they don't, you're missing something. &lt;/li&gt;
&lt;li&gt;It's better to write some "unnecessary" tests than not to write them at all, so if you're wondering whether a test will be useful, just write it. Those few minutes might save you a lot of time later.&lt;/li&gt;
&lt;li&gt;Use fakes instead of mocks for external services when possible. Laravel ships with many fakes, and when you need a custom one it's easy to swap the real implementation for a fake in the container. That's possible in anything that has a container. The tweet contains an example using Laravel's Facades.&lt;/li&gt;
&lt;li&gt;Tests are important, but don't let them hold you back. If thinking about how the API should look before writing the code is holding you back, just forget the tests for a moment. You can come back and write them later. Writing them first is not exactly a rule.&lt;/li&gt;
&lt;/ol&gt;
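&lt;p&gt;Tip 4 can be sketched with a small, hypothetical example (the PaymentGateway interface, FakePaymentGateway class and /checkout route are illustrative names, not from the thread): a fake records calls instead of hitting a real API, and the test binds it into Laravel's container so the application resolves the fake.&lt;/p&gt;

```php
<?php

// Hypothetical example: swapping a real implementation for a fake
// in Laravel's service container during a test.

interface PaymentGateway
{
    public function charge(int $amountInCents): bool;
}

class FakePaymentGateway implements PaymentGateway
{
    /** @var int[] Every charge made during the test. */
    public array $charges = [];

    public function charge(int $amountInCents): bool
    {
        // Record the charge instead of calling the real API.
        $this->charges[] = $amountInCents;

        return true;
    }
}

class CheckoutTest extends TestCase
{
    public function test_it_charges_the_customer(): void
    {
        $fake = new FakePaymentGateway();

        // Bind the fake so anything resolving PaymentGateway gets it.
        $this->app->instance(PaymentGateway::class, $fake);

        $this->post('/checkout', ['amount' => 2500])->assertOk();

        // Assert against the fake's recorded state, not a mock expectation.
        $this->assertSame([2500], $fake->charges);
    }
}
```

&lt;p&gt;Because the assertions run against the fake's recorded state rather than mock expectations set up in advance, the test stays readable and doesn't break when internal call order changes.&lt;/p&gt;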

</description>
      <category>webdev</category>
      <category>testing</category>
      <category>php</category>
      <category>laravel</category>
    </item>
    <item>
      <title>Learn how to write tests using Laravel (5-7)</title>
      <dc:creator>Mateus Guimarães</dc:creator>
      <pubDate>Sat, 11 Apr 2020 19:18:17 +0000</pubDate>
      <link>https://dev.to/mateusjatenee/learn-how-to-write-tests-using-laravel-5-7-3jja</link>
      <guid>https://dev.to/mateusjatenee/learn-how-to-write-tests-using-laravel-5-7-3jja</guid>
      <description>&lt;p&gt;Hey guys! This is the first video of a series on how and why to write tests, in this case using Laravel. We'll be building a course platform where users can subscribe to courses, watch them, receive notifications about them, interact in the comments, and so on.&lt;br&gt;&lt;br&gt;
I'm very sorry for the audio quality and interruptions -- I'm still getting the hang of this screencast thing :-)   &lt;/p&gt;

&lt;p&gt;To support me, you can subscribe to the newsletter &lt;a href="https://subscribe.mateusguimaraes.com"&gt;here&lt;/a&gt; to receive these videos earlier, plus a few extras; you can also like the video and subscribe to the channel :-).&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/U7ev-sHZRp4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>laravel</category>
      <category>php</category>
      <category>tdd</category>
      <category>testing</category>
    </item>
    <item>
      <title>Testing Laravel API Resources</title>
      <dc:creator>Mateus Guimarães</dc:creator>
      <pubDate>Mon, 21 May 2018 04:41:48 +0000</pubDate>
      <link>https://dev.to/mateusjatenee/testing-laravel-api-resources-2cg3</link>
      <guid>https://dev.to/mateusjatenee/testing-laravel-api-resources-2cg3</guid>
      <description>&lt;p&gt;Video: &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/sFrU_89UQbM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>laravel</category>
      <category>php</category>
      <category>tdd</category>
    </item>
  </channel>
</rss>
