DEV Community: Mostafa Biomee

Monitoring Microservices Techniques

Mostafa Biomee — Tue, 20 Oct 2020 05:26:43 +0000

Monitoring is a critical part of any software development life cycle (SDLC), and with the rise of microservices architecture and DevOps practices it has become both more important and more complex. To understand how to monitor microservices, we must take a step back to the legacy monolith app and how we used to monitor it.
Three Pyramids Monitoring Philosophy

In a monolith environment we used to collect metrics that tell us the status of our application. We usually start with the infrastructure, the physical hardware that hosts the application, for example:
Is my server up?
Is my database up?
Can the web server talk to the database?

Then we move up a level to the application itself and ask a different question:
Is my application process running?

Then we move up another level and monitor the functionality and business capability, which leads to questions like: can a user place an order?

These three levels (infrastructure, application, and business capability) are called Monitoring Areas.

Let’s move to a different perspective and change the questions a little.
We check application health by asking:
Is my server up?
We check application performance by asking:
Is CPU usage high?
And we check capacity by asking:
Do I have enough disk space?
By answering these three questions I get metrics about the health, performance, and capacity of the system. These are called Monitoring Concerns.

There is a many-to-many relationship between Monitoring Areas and Monitoring Concerns, and where a metric falls depends on the combination of questions we ask. For example:
Is my server up? Is CPU usage high? Do I have enough disk space?
Here I am targeting the health, performance, and capacity of my infrastructure.
If I ask:
Is my application generating exceptions? How quickly does the system process messages? Can I handle the month-end batch job?
Here I am targeting the health, performance, and capacity of the application layer. If I change the questions again and ask:
Can users access the checkout cart? Are we meeting SLAs? What is the impact of adding another customer?
Here we are targeting the health, performance, and capacity of the business capability layer.

There is also a third pyramid I want to introduce: Interaction Types, which describe how I monitor the system.

Passive monitoring: you open the system dashboard and look at current and past values.
Reactive monitoring: the monitoring system alerts you when something happens, for example by sending an email when the queue length reaches 50.
Proactive monitoring: the monitoring system takes action automatically to repair the system, for example by scaling out another instance when the queue length reaches 50.
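
The three interaction types can be sketched around the queue-length example. This is only a minimal Python illustration, not a real monitoring API: `THRESHOLD`, `send_alert`, and `scale_out` are hypothetical names, and the "actions" are simply recorded in a list.

```python
from typing import Callable, List

THRESHOLD = 50  # queue length used in the example above


def passive(history: List[int]) -> str:
    """Passive: only render current and past values, e.g. for a dashboard."""
    return f"current={history[-1]} past={history[:-1]}"


def reactive(queue_length: int, send_alert: Callable[[str], None]) -> None:
    """Reactive: notify a human when the threshold is reached."""
    if queue_length >= THRESHOLD:
        send_alert(f"queue length reached {queue_length}")


def proactive(queue_length: int, scale_out: Callable[[], None]) -> None:
    """Proactive: take the repairing action automatically."""
    if queue_length >= THRESHOLD:
        scale_out()


# Hypothetical wiring: both handlers just record what they would do.
actions: List[str] = []
reactive(50, actions.append)
proactive(50, lambda: actions.append("scale out one instance"))
```

The check is the same in the reactive and proactive cases; what changes is who acts on it.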

So again, there is a many-to-many relationship between the three monitoring pyramids. If I ask the first three questions from the beginning of the article:
Is my server up?
Is my database up?
Can the web server talk to the database?
then I am monitoring infrastructure health at a point in time, and it is of course passive monitoring.

So whenever you decide on a metric to monitor, keep in mind which Area you want to monitor, which Concern you want information about, and which Interaction Type you will use.

These three pyramids are a way of thinking about and interrogating what you are monitoring, and they are useful whether it is a monolith or a distributed system.

What happens when we deal with a distributed system?
The problem with a distributed system is that we start with a single application and carve off pieces of functionality that communicate through messaging protocols. As we spin up more services, we get more than one server to watch, each service has its own database, and that is a lot more infrastructure to worry about.

On top of that comes the dynamic nature of microservices. What if I scale out one of my services and now have 4 instances of it, all consuming one input queue, or maybe a distributed queue? Does it still make sense to monitor queue length? It is a little tricky: maybe yes, maybe no. It gets more complicated as the systems you run become more dynamic. There is a lot of information we can collect, and it does not make sense to look at everything.

Let’s take a look at the components of a distributed system and see how we can monitor them.

Queue length is the simplest metric. Every broker or queue technology has some method to report queue length. So what does this metric tell us?

Queue length is an indicator of outstanding work.
A high queue length doesn’t necessarily mean there is a problem. If it is high but stable, decreasing, or only spiking occasionally, that can be fine; if it keeps increasing over time, that is a problem.
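
One way to tell "high but stable" apart from "increasing over time" is to look at the trend of the samples rather than their absolute value. The sketch below fits a least-squares slope to equally spaced queue-length samples; the function names and the `tolerance` value are assumptions for illustration, not part of any broker's API.

```python
from typing import Sequence


def slope(samples: Sequence[float]) -> float:
    """Least-squares slope of samples taken at equal intervals."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def is_growing(samples: Sequence[float], tolerance: float = 0.5) -> bool:
    """True when the queue length trends upward by more than `tolerance` per sample."""
    return slope(samples) > tolerance
```

With this approach a steadily climbing series such as 10, 20, 30, 40, 50 is flagged, while a flat series with a single spike in the middle is not.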

In terms of our pyramids, we are monitoring infrastructure performance, but queue length alone does not give us clear insight, so let’s look at another important metric.
Message Processing Time: the time from when a message reaches the front of the queue until its handler finishes the task, whether that is uploading a file via FTP or running a database query, and the message is removed from the queue.

Processing time is the time taken to successfully process the message.
Processing time doesn’t include error handling time.
It also does not include time spent waiting in the queue. Finishing successfully matters here because if an error is thrown during processing, the message should not be removed from the queue; it can be delivered to another pod to be handled again. As with queue length, a processing time that is stable, decreasing, or only spiking occasionally can be fine, but one that keeps increasing over time is a problem.

This leads us to a new concept: Critical Time. It is a timer that starts when the message is raised, keeps running while the message waits to reach the front of the queue and while it is processed, and stops only when processing completes. What if there is network latency, or the instance processing the message crashes and restarts and the message is retried many times? Does the critical time stop? No, it keeps counting. From that we get a formula that describes Critical Time.

Critical Time = Time In Queue + Processing Time + Retries Time + Network Latency Time

As with the other metrics, a critical time that is stable, decreasing, or only spiking occasionally can be fine, but one that keeps increasing over time is a problem.
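
In practice you rarely measure the four terms of the formula separately. A sketch of one possible approach, assuming each message carries three timestamps (the field names here are hypothetical): critical time is measured end to end, processing time covers only the successful attempt, and the difference is everything else in the formula (time in queue, retries, and network latency).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageTimings:
    sent_at: float      # when the sender raised the message (seconds)
    started_at: float   # start of the attempt that finally succeeded
    finished_at: float  # when that successful attempt completed

    @property
    def processing_time(self) -> float:
        return self.finished_at - self.started_at

    @property
    def critical_time(self) -> float:
        return self.finished_at - self.sent_at

    @property
    def waiting_time(self) -> float:
        # Time in queue + retries + network latency, lumped together.
        return self.critical_time - self.processing_time


timings = MessageTimings(sent_at=0.0, started_at=4.5, finished_at=5.0)
```

Here the message took 5 seconds end to end but only half a second of actual processing, which already hints that the delay lies outside the handler.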

Let’s Put It All Together

Each of these metrics represents a part of the puzzle.
Look at them from the endpoint’s perspective, not per message.
Looking at them together gives great insight into your system.

Let’s look at some cases and analyze them.

Case 1:

What do we have here? A stable critical time, a spiky processing time, and a stable queue length over time. What does this tell us about the system?
The system is keeping up with the incoming messages, because the queue length isn’t increasing. But why is the processing time not stable? A number of things could cause it to jump around. Maybe there is contention for resources, or a locking mechanism where the handler holds a lock while it updates some resource. It could also be that some message types handled by that endpoint go quickly and others don’t. You can use that information to isolate the slow ones into their own endpoint and scale that new endpoint out independently.

Case 2:

Here we have a high critical time, a high processing time, and a medium queue length, but all are stable. What does this tell us?
The system is keeping up, but we are at the limit of its capacity. As soon as there is a traffic spike, the queue length will skyrocket and the critical time will follow. This may be a good indication that it is time to scale out those resources.

Case 3:

Here we have a high critical time, a low processing time, and a low queue length. What does this mean?
Maybe there is a problem in the network, because, if you remember, the critical time equation includes network latency. There may also be a lot of retries in processing the message: we measure processing time only for successfully processed messages, so failed attempts do not show up there. The problem is therefore connectivity or retries.
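
The three cases can be condensed into a small rule table. This is a deliberately coarse sketch: the string labels ("low", "high", "stable", "spiky") and the `diagnose` function are assumptions for illustration, and a real system would derive them from thresholds and trends on the actual metrics.

```python
def diagnose(critical_time: str, processing_time: str, queue_length: str) -> str:
    """Map coarse metric readings to the three cases discussed above."""
    if critical_time == "high" and processing_time == "low" and queue_length == "low":
        return "case 3: suspect network latency or retries"
    if critical_time == "high" and processing_time == "high":
        return "case 2: at capacity, consider scaling out"
    if processing_time == "spiky" and queue_length == "stable":
        return "case 1: keeping up; look for contention, locking, or slow message types"
    return "no matching case"
```

The point is that no single metric decides the outcome; each rule needs at least two of them.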
So when monitoring a distributed system, how do you detect communication breakdowns?
The easy way to monitor a distributed system is a health check: if your service replies with a 200 status, it is up. But communication in a distributed system is usually done through brokers, and when an instance sends a message to the broker it doesn’t know whether the message reaches its destination. The easiest option is to send a read receipt back when the message arrives. Is this a good idea? It is not. Why? We have turned our decoupled system into a request/response system, and we have doubled the number of messages sent over the system.
The solution here is peer-to-peer connectivity monitoring, which tells us whether an endpoint is actually processing the messages from another.

What tools should we use?
There are many tools that can help us collect and visualize metrics: Splunk, Kibana, D3, and Grafana are all suitable for monitoring.

How do we collect all this information?
Critical time and processing time are per-message metrics: every message we send has its own processing time and critical time associated with it.
Queue length and connectivity can be checked periodically, for example every minute or every 5 minutes.

How do we store this?
A good schema to store this is: metric type, message type, timestamp, and value. But this is a very expensive way to store your metrics. There are different techniques to reduce the cost, but they are out of scope here.
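
As a rough sketch, the four-column schema could look like the record below. The class name, field names, and the example values (including the `OrderPlaced` message type) are made up for illustration.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricRecord:
    metric_type: str    # e.g. "critical_time", "processing_time", "queue_length"
    message_type: str   # hypothetical message name, e.g. "OrderPlaced"
    timestamp: datetime
    value: float


record = MetricRecord(
    metric_type="critical_time",
    message_type="OrderPlaced",
    timestamp=datetime(2020, 10, 20, 5, 26, 43, tzinfo=timezone.utc),
    value=5.0,
)
row = asdict(record)  # plain dict, ready to hand to whatever store you use
```

One row per message per metric is what makes this expensive: the row count grows with traffic, which is why aggregation is usually applied before storage.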

How do we display metrics?
We can use the ELK stack for this; it is a suitable use case.

Conclusion:

Monitoring distributed systems is not an easy process, and the difficulty is directly proportional to how dynamic the system is. But by understanding the philosophy of monitoring and choosing the right metrics, we can analyze the system and keep it healthy :)

DDD — ubiquitous language is the key

Mostafa Biomee — Tue, 20 Oct 2020 05:17:18 +0000

As developers, we have our minds full of classes, methods, algorithms, and patterns, and we tend to always make a match between a real-life concept and a programming artifact. We want to see what object classes to create and what relationships to model between them. We think in terms of inheritance, polymorphism, OOP, etc. And we talk like that all the time. And it is normal for us to do so. Developers will always be developers. But the domain experts usually know nothing about any of that. They have no idea about software libraries, frameworks, persistence, in many cases not even databases. They know about their specific area of expertise.

Two years ago I worked for an ad-tech company on a sales funnel project. The domain experts knew about leads, campaigns, lead generators, lead forms, email templates, and sales funnels. And they talked about those things in their own jargon, which is sometimes not so straightforward for an outsider to follow.

When we work on separate islands, each side ends up speaking its own language and the two talk past each other.

This style of communication will not lead to a successful project.

To overcome this difference in communication style, when we build the model, we must communicate to exchange ideas about the model, about the elements involved in the model, how we connect them, what is relevant and what is not. Communication at this level is paramount for the success of the project. If one says something, and the other does not understand or, even worse, understands something else, what are the chances for the project to succeed?

The Ubiquitous Language connects all the parts of the design, and creates the premise for the design team to function well. It takes weeks and even months for large scale project designs to take shape. The team members discover that some of the initial concepts were incorrect or inappropriately used, or they discover new elements of the design which need to be considered and fit into the overall design. All this is not possible without a common language.

Domain experts should object to terms or structures that are awkward or inadequate to convey domain understanding. If domain experts cannot understand something in the model or the language, then it is most likely that there is something wrong with it. On the other hand, developers should watch for ambiguity or inconsistency that will tend to appear in design.

Creating the Ubiquitous Language

How can we start building a language? Here is a hypothetical dialog between a software developer and a domain expert in the Sales Funnel project. Watch out for the words appearing in bold face.

Developer: We want to build a sales funnel. Where do we start?
Expert: Let’s start with the basics. The sales funnel is made up of Campaigns. Each campaign contains a collection of Ad Forms. Every form has fields like name, email, and phone. We publish the form to Generators and start collecting the data.
Developer: What are Generators?
Expert: Generators are ad publishers like Facebook, Instagram, Google AdWords, our affiliates’ web sites, and so on.
Developer: Got it. So when a customer fills in the ad form, we save this data and start to process it.
Expert: We don’t call it a customer, we call it a Lead. After collecting leads we add tags to every lead and start to apply a workflow.
Developer: OK, I got that. Let me sketch this in a graphical way.

Notice how this team, talking about the sales funnel domain and around their incipient model, is slowly creating a language made up by the words in boldface. Also note how that language changes the model! However, in real life such a dialog is much more verbose, and people very often talk about things indirectly, or enter into too much detail, or choose the wrong concepts; this can make coming up with the language very difficult. To begin to address this, all team members should be aware of the need to create a common language and should be reminded to stay focused on essentials, and use the language whenever necessary. We should use our own jargon during such sessions as little as possible, and we should use the Ubiquitous Language because this helps us communicate clearly and precisely.
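
The vocabulary from the dialog (Campaign, Ad Form, Generator, Lead, tags) can also be carried straight into code, so that class and method names read the way the expert speaks. The following is only a sketch of that idea: the attributes and the `publish_to`, `collect`, and `tag` methods are my own guesses at a model, not the project's actual design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Generator:
    """An ad publisher, such as Facebook or an affiliate web site."""
    name: str


@dataclass
class Lead:
    """A person who filled in an ad form (the experts never say 'customer')."""
    data: Dict[str, str]
    tags: Set[str] = field(default_factory=set)

    def tag(self, label: str) -> None:
        self.tags.add(label)


@dataclass
class AdForm:
    fields: List[str]
    published_to: List[Generator] = field(default_factory=list)
    leads: List[Lead] = field(default_factory=list)

    def publish_to(self, generator: Generator) -> None:
        self.published_to.append(generator)

    def collect(self, data: Dict[str, str]) -> Lead:
        """Turn one form submission into a Lead."""
        missing = [name for name in self.fields if name not in data]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        lead = Lead(dict(data))
        self.leads.append(lead)
        return lead


@dataclass
class Campaign:
    name: str
    forms: List[AdForm] = field(default_factory=list)


campaign = Campaign("Autumn sale")
form = AdForm(fields=["name", "email", "phone"])
campaign.forms.append(form)
form.publish_to(Generator("Facebook"))
lead = form.collect({"name": "Jane", "email": "jane@example.com", "phone": "555-0100"})
lead.tag("newsletter")
```

A domain expert should be able to read the last six lines and recognize their own process in them.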

We have seen how the language is shared by the entire team, and also how it helps build knowledge and create the model. What should we use for the language? Just speech? We’ve used diagrams. What else? Writing?

Some may say that UML is good enough to build a model upon. And indeed it is a great tool to write down key concepts as classes, and to express relationships between them. You can draw four or five classes on a sketchpad, write down their names, and show the relationships between them. It’s very easy for everyone to follow what you are thinking, and a graphical expression of an idea is easy to understand. Everyone instantly shares the same vision about a certain topic, and it becomes simpler to communicate based on that. When new ideas come up, the diagram is modified to reflect the conceptual change.

UML diagrams are very helpful when the number of elements involved is small. But UML can grow like mushrooms after a nice summer rain. What do you do when you have hundreds of classes filling up a sheet of paper as long as the Mississippi? It’s hard to read even for software specialists, not to mention domain experts. They won’t understand much of it when it gets big, and it does so even for medium-size projects. Also, UML is good at expressing classes, their attributes, and the relationships between them. But the classes’ behavior and the constraints are not so easily expressed. For that, UML resorts to text placed as notes in the diagram. So UML cannot convey two important aspects of a model: the meaning of the concepts it represents and what the objects are supposed to do. But that is OK, since we can add other communication tools to do it.

We can use documents. One advisable way of communicating the model is to make some small diagrams, each containing a subset of the model. These diagrams would contain several classes and the relationships between them. That already includes a good portion of the concepts involved. Then we can add text to the diagram. The text will explain behavior and constraints which the diagram cannot. Each such subsection attempts to explain one important aspect of the domain; it points a “spotlight” at one part of the domain. Those documents can even be hand-drawn, because that transmits the feeling that they are temporary and might be changed in the near future, which is true, because the model is changed many times in the beginning before it reaches a more stable status.

It might be tempting to try to create one large diagram over the entire model. However, most of the time such diagrams are almost impossible to put together. Furthermore, even if you do succeed in making that unified diagram, it will be so cluttered that it will not convey the understanding better than the collection of small diagrams did.

Be wary of long documents. It takes a lot of time to write them, and they may become obsolete before they are finished. The documents must be in sync with the model. Old documents that use the wrong language and do not reflect the model are not very helpful. Try to avoid them when possible.

It is also possible to communicate using code. This approach is widely advocated by the XP community. Well-written code can be very communicative. Although the behavior expressed by a method is clear, is the method name as clear as its body? Assertions of a test speak for themselves, but how about the variable names and overall code structure? Are they telling the whole story, loud and clear? Code that functionally does the right thing does not necessarily express the right thing. Writing a model in code is very difficult.

There are other ways to communicate during design. It’s not the purpose of this post to present all of them. One thing is nonetheless clear: the design team, made up of software architects, developers, and domain experts, needs a language that unifies their actions and helps them create a model and express that model with code.

AWS SNS & SQS with practical example

Mostafa Biomee — Tue, 20 Oct 2020 05:11:40 +0000

Today I’m going to talk about when to use SNS and when to use SQS. There are many articles out there that attempt to answer this question but don’t really give you a concrete example of when you should use one or the other. So first I’m going to go over some technical details, and then give you a practical example. Let’s quickly go over the technical comparison between SNS and SQS.

· SNS stands for Simple Notification Service, while SQS stands for Simple Queue Service.

· SNS uses a publisher/subscriber system: you own a topic, you publish to that topic, and subscribers get notified of events delivered to that topic. That is starkly different from SQS, which is a queuing service for message processing. For example, an SQS queue could be a subscriber to an SNS topic, so whenever someone publishes a message to the topic, your SQS queue gets a message that can be processed at a later time.

· With SNS, publishing a message to a topic can deliver to many subscribers, so it is a fan-out approach, and those subscribers can be of different types: you can have SQS as a subscriber, a Lambda function, or even an email address.

· With SQS, a consumer must poll the queue to discover new events. When an event is delivered to the queue, nothing gets invoked and no system automatically becomes aware of it. You need a separate thread that polls the queue to discover new events, a mechanism to actually process them, and then you delete the message from the queue.

· Messages in a queue are typically processed by a single consumer, or a single service with a very narrow responsibility. If different services care about the same events, they can each have their own separate queue.

So you can see there is a pretty stark difference, from a technical perspective, in what these services do. If you are still confused, there are two simple questions you can ask yourself to figure out whether you want SNS or SQS. The first is: do other systems care about an event? If something happens in the world and other people care about it, you should use SNS, because you want to publish a message to your topic and tell others that the thing happened. For SQS the question is: does your system care about an event? If you are the receiver of the data and care specifically that something has happened, then SQS is the right choice. That is a quick summary of what these two services do and the key differences between them.

Let’s talk about a practical example of when you would use one or the other. In this example we’re going to talk about credit card transactions; it’s a very common theme in many of my videos. Assume we have a user making a purchase, maybe on some website or through a POS terminal, and entering their credentials to buy a product. They make a purchase request to a REST API, which we’ll call the transaction processing web service. The payload passed into this web service might contain a transaction ID, a customer ID, maybe their contact details, and the amount charged to the customer.

From a processing perspective, when you get a request containing this kind of event, the first thing you need to do is communicate with a credit card authority service; practically speaking, probably Visa, Mastercard, or Amex. So when someone attempts to purchase something, you call that service with the customer ID and perhaps some credentials such as their PIN, and it validates them. In this example, let’s assume it returns 200 OK: the credentials are good and the transaction went through.
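
To make the payload concrete, here is one shape it might take. The field names and values are assumptions based on the description (transaction ID, customer ID, contact details, amount); the real service could structure it differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TransactionEvent:
    transaction_id: int
    customer_id: int
    contact_email: str  # hypothetical stand-in for "contact details"
    amount: float


event = TransactionEvent(
    transaction_id=12345,
    customer_id=42,
    contact_email="jane@example.com",
    amount=59.99,
)

# Serialized form, as it might travel in a request or message body.
message_body = json.dumps(asdict(event))
```

The same serialized body is what would later be published as the event once the authority service has approved the transaction.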

In this e-commerce example, this is an event that matters: someone attempted to purchase something and it was successful. That is potentially relevant to many different subsystems in your e-commerce ecosystem, so we want to publish this event.

By publishing an event to this topic you are basically telling the world what happened, and the topic can have multiple subscribers, each with a narrow use case. In this example I have three: a Lambda function and two SQS queues. Let’s fill this out so it makes more sense why we have three different ones.

Let’s go through these one by one. In the first case, maybe we have some kind of customer mailing service. If you’ve ever ordered anything online, on Amazon or any other web site, you know that shortly after you order you get an email that says something like: thank you for ordering, we really appreciate your business, here’s a summary and the address it’s going to. This component is responsible for that very narrow concern. After the event occurs, a message with this payload gets delivered to your Lambda function. The function may query some database for additional details, but in the end it sends an email to your customer that says thank you for ordering. That’s one use case.

Completely separately, there are two other subsystems that care about this same event, but they do very different things. We don’t want to mix or conflate the responsibilities of these different things; we want separate systems that process these events in parallel and completely independently of one another. The second system, the one with a queue, is some kind of analytics engine. When an event occurs, in addition to delivering it to the Lambda function, we deliver that same event to an analytics queue. One of this system’s responsibilities is to track how many orders are generated in a single day, so you can show that on a dashboard. When an event occurs, the payload gets delivered to this queue, and on its own the queue is not going to do anything: you need some other system to poll the queue, become aware that something is in there, and respond. In this example, maybe you have a transaction analytics service hosted on EC2, and as part of spinning up the service you start a series of pollers that know to pull messages off this queue. When they find one, maybe they write to a database and increment some values, but at the end of the day their job is to output something like the number of orders today and the total revenue. This is a very different concern from the mailing service: that one is a customer-facing use case, while this one takes an analytics perspective.

The other queue has a separate use case: a fraud detection service. You’re a bustling e-commerce company, and maybe you’re concerned with fraud because certain people have ordered something and the transaction eventually got reversed, so you want a proactive approach that attempts to detect and mitigate fraud. This is again a very separate use case from your analytics engine and your mailing service. It does something completely different, but it also cares about this event and uses it in its own way. Similar to the above, we have a fraud detection service that is also hosted on EC2, and again we poll the queue to find out when events occur. Once we have processed a message we delete it from the queue so that no other poller can receive it. At the end of the day, what this system may try to do is determine, based on events and its own internal processes, which transactions are suspicious: which orders should not be delivered immediately, and which ones someone should look into to make sure they are not fraudulent.

So this is an example of a use case where many different consumers care about a single event, and this is where SNS really shines: something is published, and you distribute it to multiple different consumers. We also saw that there are different consumer types: you can deliver to a Lambda function or to a queue.
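
The fan-out can be simulated in a few lines of plain Python. To be clear, this is an in-memory stand-in and not the AWS SDK: the `Topic` and `Queue` classes are hypothetical, and real SQS adds things this sketch ignores, such as visibility timeouts and receipt handles.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Optional

Message = Dict[str, Any]


class Queue:
    """Stand-in for a queue: nothing happens until someone polls it."""

    def __init__(self) -> None:
        self._messages: Deque[Message] = deque()

    def send(self, message: Message) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[Message]:
        return self._messages[0] if self._messages else None

    def delete(self) -> None:
        self._messages.popleft()


class Topic:
    """Stand-in for a topic: publishing pushes a copy to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Message], None]] = []

    def subscribe(self, deliver: Callable[[Message], None]) -> None:
        self._subscribers.append(deliver)

    def publish(self, message: Message) -> None:
        for deliver in self._subscribers:
            deliver(dict(message))


emails: List[str] = []
analytics_queue = Queue()
fraud_queue = Queue()

topic = Topic()
topic.subscribe(lambda m: emails.append(f"Thank you for order {m['transaction_id']}"))
topic.subscribe(analytics_queue.send)
topic.subscribe(fraud_queue.send)
topic.publish({"transaction_id": 12345, "amount": 59.99})

# The analytics poller: receive, process, then delete.
orders_today = 0
revenue_today = 0.0
while (message := analytics_queue.receive()) is not None:
    orders_today += 1
    revenue_today += message["amount"]
    analytics_queue.delete()
```

After the publish, the email handler has already run (push), while the fraud queue still holds its copy untouched until its own poller comes around (pull).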

Some of you may be asking why you would choose one or the other. When SNS delivers to a Lambda function, it is best-effort delivery, so if there is a problem in your logic you can potentially lose the message. When SNS delivers to SQS, once a message is in the queue it is guaranteed to be present, and it stays there until an independent service polls it, processes it, and deletes it. That is an example of why you would want to use SQS over Lambda. So far I’ve shown you four steps: first the credit card authority check, and then the three subscribers.

Let’s rewind a little and try to answer why we are going through all this effort. Why are we decoupling here? Think about what this world would look like if you did all four of these things in sequence, maybe inside your transaction processing web service. As I said, the first thing you do is communicate with the credit card authority service and verify that the transaction is good. Then, with this naive approach, maybe we write a function that uses some email service to send the customer an email. Then we continue to the next step and do some analytics on this event. Finally we go to fraud detection: maybe we query a database, see if the customer is on some kind of blacklist, and decide what to do. So why is this a bad thing?

It’s a bad thing because you have a partial failure scenario. If you do these in sequence (the first one, then the second, then the third, then the fourth), what happens if one of the later steps fails? You need to repeat the whole process, and obviously you don’t want to charge the customer again. By decoupling these applications, you have one event that gets distributed to many different consumers, and each of them can independently, in its own way, ensure that the message gets processed. This is why we’re leveraging SNS with these different subscriber endpoints.
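
The partial failure problem is easy to reproduce in a toy example. Everything below is hypothetical (`run_in_sequence`, `retry_alone`, and the step functions are invented for illustration), but it shows how retrying a whole chain repeats the charge, while a decoupled consumer retries only its own work.

```python
from typing import Callable, Dict, List


def run_in_sequence(steps: List[Callable[[], None]], max_attempts: int = 2) -> None:
    """Naive approach: any failure restarts the whole chain from the top."""
    for _ in range(max_attempts):
        try:
            for step in steps:
                step()
            return
        except RuntimeError:
            continue


def retry_alone(step: Callable[[], None], max_attempts: int = 2) -> None:
    """Decoupled approach: a consumer retries only its own step."""
    for _ in range(max_attempts):
        try:
            step()
            return
        except RuntimeError:
            continue


charges: List[str] = []
analytics_calls: Dict[str, int] = {"count": 0}


def charge_card() -> None:
    charges.append("tx-12345")


def record_analytics() -> None:
    analytics_calls["count"] += 1
    if analytics_calls["count"] == 1:  # fails on the first attempt only
        raise RuntimeError("analytics store unavailable")


# Sequential: the analytics failure forces the charge to run again.
run_in_sequence([charge_card, record_analytics])

# Decoupled: charge once, publish the event, let the consumer retry itself.
decoupled_charges: List[str] = []
analytics_calls["count"] = 0
decoupled_charges.append("tx-12345")
retry_alone(record_analytics)
```

In the sequential run the customer ends up charged twice; in the decoupled run the charge happens once and only the analytics step is repeated.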