DEV Community: Siddarth Jain

How to Effectively Monitor Cron Jobs Using Dr Droid

Siddarth Jain — Thu, 27 Apr 2023 19:08:00 +0000

Cron jobs are an essential tool for developers who want to automate recurring tasks such as sending emails, running backups, and cleaning up log files. However, managing cron jobs can be challenging as they run in the background, making it difficult to troubleshoot and monitor their activities.

In this article, we'll look at some examples of cron jobs and how Dr Droid simplifies the process of monitoring, making it easier to manage cron jobs.

Some Examples of Cron Jobs

Cron scheduling is one of the oldest and most popular scheduling techniques, dating back to the early days of UNIX. Many sysadmins, DevOps engineers, and operations teams use it daily to schedule their tasks. Some examples:

Databases: Automating regular backups, killing long-running queries, and generating reports using cron jobs.

App management: Automating app backups and administrative tasks using cron jobs can help save time and reduce human error in managing the application. This ensures that tasks are executed regularly and efficiently, resulting in better performance of applications.

Rotating log files: To prevent server failures and downtime, it's essential to automate log file rotation. This involves taking regular backups of log files and creating new ones to minimize the risk of running out of disk space.

Scheduling Business Processes: Automating business processes and batch processing with cron jobs can save time and reduce the risk of human error. Examples include payment disbursement and report generation.

Typical Problems with Cron Jobs

Delays

If a cron job takes longer than expected to execute, it can cause a pile-up in the queue, which can result in subsequent jobs not running on time. This can lead to further delays.

Erroneous Execution

There can be scenarios where a job never ends and waits forever for a response, until the server restarts or someone manually kills the job. Examples include an infinite loop, a condition that prevents proper termination, or a hung command within the job. Such jobs can consume a growing share of the server's resources and thus affect other core business applications running on the same server.

Operational Maintenance

Cron jobs that are no longer needed can accumulate, causing performance issues. To avoid this, it’s important to review and clean up those jobs regularly.

How to Effectively Monitor and Manage Cron Jobs using Dr Droid

Monitoring cron jobs is a crucial task that ensures the smooth functioning of systems and processes. Failing to monitor them can result in potential downtime, data loss, or suboptimal resource usage. For example, cron jobs can execute important tasks that require timely and accurate completion, and regular monitoring can help detect failures quickly, optimize their scheduling, and trigger alerts.

One effective tool for monitoring cron jobs is Dr Droid. It offers a stateful approach to monitoring the health of your cron jobs, enabling you to set up monitors that alert you to failures or delays. By using Dr Droid, you can address the problems mentioned above and ensure the uninterrupted and optimal performance of your systems and processes.

How does it work?

In the cron job script, you can send events from the job to Dr Droid via its REST API. When a cron job runs, it sends an event to Dr Droid at each status change, such as scheduled, started, and completed.

Fig 1: Creating events `cronjob_initiated` and `cronjob_completed`
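To make the idea concrete, here is a minimal sketch of a wrapper that a cron script could use. Only the event names `cronjob_initiated` and `cronjob_completed` come from this article; the endpoint URL, the payload field names, and the helper functions are assumptions for illustration and should be replaced with whatever your Dr Droid API specs define.

```python
import json
import time
import urllib.request

# Hypothetical endpoint: replace with the URL from your Dr Droid API specs.
EVENTS_URL = "https://example.invalid/events"


def build_event(name, job_id, timestamp=None):
    """Build the body for one event (the field names are assumed)."""
    return {
        "name": name,
        "timestamp": int(timestamp if timestamp is not None else time.time()),
        "payload": {"job_id": job_id},
    }


def post_event(event, url=EVENTS_URL):
    """POST one event as JSON; raises on network or HTTP errors."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def run_monitored(job, job_id, send=post_event):
    """Emit cronjob_initiated, run the job, then emit cronjob_completed."""
    send(build_event("cronjob_initiated", job_id))
    result = job()
    # Not reached if the job raises or hangs, which is what the monitor detects.
    send(build_event("cronjob_completed", job_id))
    return result
```

Passing `send` as a parameter keeps the wrapper easy to test: in a dry run you can collect events in a list instead of making HTTP calls.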

The scheduled event for the cron job can be taken from the crontab configuration, which specifies the exact time and date when the job should run. Dr Droid can use this information to monitor the status of cron jobs.

However, if the cron service itself is broken, it can be challenging to detect issues using this approach. In such cases, Dr Droid may not receive the expected notifications, and it may be necessary to manually check the configuration.

Dr Droid can be used to monitor the average runtime of cron jobs and to send alerts if a job takes longer than expected.

In the dashboard, we can set up triggers for specific events, such as when a primary event (or any cron task) takes longer than expected to complete, causing a delay in a secondary event (another cron task). At that point, we can take action to resolve the delay.

Fig 2: Setting up Triggers
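Conceptually, a trigger of this kind behaves like a small state machine: each primary event opens a pending entry, the matching secondary event closes it, and anything left pending for too long is flagged. The sketch below illustrates that idea only. The class, its method names, and the threshold are hypothetical and say nothing about how Dr Droid is implemented internally.

```python
class CompletionMonitor:
    """Tracks primary events that are still waiting for their secondary event."""

    def __init__(self, max_wait_seconds):
        self.max_wait_seconds = max_wait_seconds
        self.pending = {}  # job_id -> timestamp of the primary event

    def record(self, name, job_id, timestamp):
        """Feed one event into the monitor."""
        if name == "cronjob_initiated":
            self.pending[job_id] = timestamp
        elif name == "cronjob_completed":
            # Close the entry; ignore completions whose start was never seen.
            self.pending.pop(job_id, None)

    def overdue(self, now):
        """Return the job ids that have waited longer than the allowed time."""
        return sorted(
            job_id
            for job_id, started in self.pending.items()
            if now - started > self.max_wait_seconds
        )
```

Calling `overdue` periodically and alerting on a non-empty result is the stateful check that a plain log search cannot express easily.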

We can set up actions such as sending alerts via email, or we can configure webhook calls.

Fig 3: Setting up Actions

Product walkthrough

Summary

I hope this article has shown why monitoring cron jobs matters and how tools like Dr Droid help developers keep their applications running effectively.

Do let us know in the comments how you monitor your cron jobs. We will be happy to learn about your use case.

How to monitor asynchronous 3rd party integrations proactively?

Siddarth Jain — Fri, 21 Apr 2023 04:15:20 +0000

If you'd like to try Dr Droid, click here.

In today's landscape, third-party integrations have become a crucial part of software development. These integrations offer helpful features, new capabilities, and a mashup of services to create useful products. For instance, using SaaS products like Twilio or Stripe saves significant time and resources, because building the same functionality in-house from scratch is non-trivial. (If you know any tech startup that built these in-house, let me know 😬)

Leading startups today have upwards of 50 business-critical integrations and an average of more than 350 integrations overall.

Microservices rely on external API calls to connect to these services - most of which are asynchronous or moving towards being async. If an external API call is in the flow of a critical business journey or customer experience, it needs proactive monitoring.

To address these issues, businesses need to implement effective third-party integration monitoring strategies. In this article, we will focus on how third-party integration monitoring can help businesses ensure the reliability and availability of services. Furthermore, we will discuss how Dr Droid enables you to monitor the functioning of these services and take necessary steps in case of any potential problems.

To illustrate with an example, we will discuss a use case from a food delivery application, the kind of app you probably use every day.

Use case: Food Delivery Application

Food delivery services often rely on timely and efficient communication between drivers, dispatchers, and customers. To ensure that deliveries are made on time, it's important to have a reliable and efficient system in place. This is where third-party integrations like IVR can be useful. IVR (interactive voice response) is an automated phone system that you often see deployed in call centres.

For instance, imagine a customer who orders a pizza for delivery and waits for it to be delivered piping hot. Timely delivery is important for a good customer experience. Everything is going fine but let’s say the delivery agent is stuck and unresponsive for any reason. This event would risk the timely pizza delivery and the customer will have a poor experience.

A Sample Third-Party Integration

For the above scenario, a simple flow for this integration could include steps such as detecting the anomaly (i.e., driver not moving from their location), triggering an automated IVR call, prompting the driver to input their reason for the delay, suggesting appropriate actions based on the driver's input, and logging the outcome of the call for future reference.

This is the ideal scenario: the call is initiated, the response is received, and the required actions are taken. But what if no response is received? Action is delayed, and the customer has a bad experience.

For this, we need monitoring that lets us take action even when no response is received, so that the rest of the process is not affected.

Monitoring using a Cron Task

Let’s understand how we can monitor such integrations effectively to reduce business risk.

The setup described involves a delivery service using an Interactive Voice Response (IVR) system and a Redis database to manage requests. A cron job is set up to trigger a specific action after 20 seconds.

When a customer initiates a delivery request through the IVR system, the request ID and state are stored in the Redis database. This allows the delivery service to keep track of all active delivery requests. Once the request is initiated, a cron job is set up to trigger after 20 seconds.

After 20 seconds, a signal is sent to the delivery service. The delivery service checks the Redis database to see if the request ID is still in the initiated state. If the request ID is still in the initiated state, the delivery service takes appropriate actions, such as updating the request status or sending a notification to the customer.
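The check that runs after 20 seconds can be sketched as follows. A plain dictionary stands in for Redis so that the example runs on its own, and the state names, the `on_timeout` callback, and the function names are assumptions made for illustration.

```python
TIMEOUT_SECONDS = 20


def initiate_request(store, request_id, now):
    """Record a new request in the 'initiated' state."""
    store[request_id] = {"state": "initiated", "initiated_at": now}


def mark_responded(store, request_id):
    """Called when the IVR response arrives."""
    if request_id in store:
        store[request_id]["state"] = "responded"


def check_request(store, request_id, now, on_timeout):
    """Run by the scheduled task: act only if the request is still waiting."""
    entry = store.get(request_id)
    if entry is None or entry["state"] != "initiated":
        return False
    if now - entry["initiated_at"] < TIMEOUT_SECONDS:
        return False
    # Change the state first so a repeated check does not fire twice.
    entry["state"] = "timed_out"
    on_timeout(request_id)
    return True
```

With a real Redis instance, the same logic would read and write keys instead of dictionary entries, and the scheduled task would call `check_request` for each request it was asked to follow up on.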

This setup does have some potential drawbacks. While cron jobs can be useful for automating tasks and reducing the need for manual intervention, they add another component to the architecture and increase the overall complexity of the system. This can increase the risk of failure and make it more difficult to identify and troubleshoot issues when they occur.

How does Dr Droid make the process easier?

The above flow describes a scenario in which a delivery service is using an IVR provider and Dr Droid to monitor the external integration we discussed above using Redis.

When it is found that there is some delay in delivery, the Delivery Service makes an external API call to the IVR provider to take inputs from the Delivery Executive about the reason for the delay and sends a "call initiated" event to Dr Droid. This event notifies Dr Droid that a new call has been initiated.

When the response is received from the Delivery Executive, the IVR provider sends the response back to the Delivery Service and sends a "callback received" event to Dr Droid, which notifies Dr Droid that a callback has been received for the call.

If a callback is not received within 20 seconds, you can set up alerts and webhooks on Dr Droid and it will take care of tracking and executing the recovery flow for delivery service.
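On the Delivery Service side, the recovery flow could be driven by the webhook call. The sketch below assumes a hypothetical JSON body with `missing_event` and `call_id` fields. The actual webhook payload is not described in this article, so treat the field names and status codes as placeholders.

```python
import json


def handle_alert_webhook(raw_body, recover):
    """Parse a (hypothetical) alert webhook and start recovery for the call."""
    try:
        alert = json.loads(raw_body)
    except ValueError:
        return 400  # body is not valid JSON
    if not isinstance(alert, dict):
        return 400
    call_id = alert.get("call_id")
    if alert.get("missing_event") != "callback_received" or call_id is None:
        return 422  # valid JSON, but not an alert this handler understands
    recover(call_id)  # e.g. reassign the order or notify the support team
    return 200
```

Keeping the handler free of any web framework makes it easy to plug into whichever HTTP server the service already uses.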

Overall, this flow allows the Delivery Service to efficiently manage requests by leveraging the capabilities of the IVR provider and Dr Droid. With Dr Droid, businesses can monitor the performance and anomalies of their business process, receive alerts in case of any disruptions, and set up necessary actions to resolve the issues. The use of event notifications and decision-making processes enables the Delivery Service to handle calls promptly and effectively, while also providing visibility into the call process for monitoring and optimization purposes.

Setting up the monitoring using Dr Droid

Here we use the API specs to show how to interact with Dr Droid over its REST API.

An event called call_initiated is created.

Figure 3: call_initiated event created

Following that, we created another event callback_received:

Figure 4: callback_received event created

After creating the events, you can see them in the dashboard:

Figure 5: Events on Dr Droid’s dashboard

We can then set up monitoring in the console to track if a callback is received in x seconds after the call is initiated.

Figure 6: Monitor Dashboard

Figure 7: Selecting primary and secondary events

Next, we configure a notification that is triggered if the secondary event is delayed after the primary event has occurred.

Figure 8: Setting up an alert

Figure 9: Setting up email notification

Figure 10: Monitoring setup is done

We are now ready to perform a test.

We will proceed by creating a new event called call_initiated.

Figure 11: A new event created call_initiated

The primary event was initiated, but as we have not yet created the secondary event, callback_received, an alert notification has been sent.

Figure 12: Monitoring dashboard

An email notification was received stating that the secondary event, "callback_received", has not been created within a timeframe of 20 seconds.

Figure 13: Email Notification

Figure 14: Notification on the dashboard

This makes monitoring the calls easier. If the event is delayed, Dr Droid sends notifications via email or Slack, or triggers the configured webhook to take appropriate action.

Summary

I hope this use case has shown why third-party integrations need monitoring if the overall system is to work reliably and efficiently.

Do let us know in the comments how you monitor your external APIs. We will be happy to learn about your use case.

Why API integrations break and how to avoid them?

Siddarth Jain — Thu, 20 Apr 2023 08:44:00 +0000

Today, our products are deeply dependent on third-party integrations to run successfully. Common integrations that we have come across in building products include payment gateways, communication APIs, CRMs, client integrations and banking APIs.

Most common reasons for issues with integrations

Poor provisioning by the service provider
While multi-tenant architectures with distributed load are the recommended way to build API products, we often have trouble reaching service providers because of inefficient provisioning. Peak traffic aside, we have had vendor services go down even during lean periods because of a minor increase in request volume.
Latency breaches → Timeouts
To protect existing workflows and thread pools from choking, we agree on pre-defined latency thresholds with service providers for their APIs. When these thresholds are breached, requests time out at our end.
Unhandled error codes
A new error code, generated by an edge case or otherwise, that is not covered by exception handling can impact our workflows.
Incorrect response
Changes in a third party’s APIs can lead to unexpected response formats and bodies, causing downstream APIs to reject the response, throw errors, or respond unexpectedly.
Webhook deactivations
Modern integrations depend on webhooks to receive data from third parties. We have seen instances where webhooks were deactivated at the service provider’s end without appropriate notification.

Remediation Strategies

Monitor API latencies, throughput, and error rates.
Use multiple integrations for critical services. As noted above, a service provider’s APIs can falter for multiple reasons, so critical services should not depend on a single provider.
Set up fallback options in case of failures. Set up circuit breakers with a default fallback option that takes over when a call fails.
Drop requests that consistently exceed latency thresholds to avoid queuing.
Enable webhook monitoring and set up alerts for sudden dips in transaction frequency.
Store API response bodies in logs or databases so that they can be retrieved to identify data issues during debugging.
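The fallback strategy is commonly implemented with a circuit breaker. Below is a minimal sketch of that pattern, not a production implementation: the class name, the default thresholds, and the single-trial-call recovery rule are all choices made for illustration.

```python
import time


class CircuitBreaker:
    """Stops calling a failing provider for a while and serves a fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: skip the provider entirely
            # Cool-down is over: allow one trial call; one failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A typical use would be `breaker.call(lambda: charge_with_provider_a(order), lambda: charge_with_provider_b(order))`, where both function names are hypothetical. While the circuit is open, requests are not queued behind a provider that is known to be failing.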

Introducing Dr Droid: Monitor third-party integrations

Outsiders help you monitor 3rd party integrations - both performance and context monitoring.

Context Monitoring

Change in response body format or variables
Data variation / sudden change for a specific parameter
A mismatch between the status code and API response body
Mapping between API call and webhook response
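Two of these context checks are simple enough to sketch in a few lines. The expected field names and the `status` values below are hypothetical examples and are not drawn from any particular provider's API.

```python
def context_issues(status_code, body, required_fields):
    """Return human-readable problems found in one API response."""
    if not isinstance(body, dict):
        return ["response body is not a JSON object"]
    issues = []
    # Check 1: a change in the response format shows up as missing fields.
    missing = sorted(set(required_fields) - set(body))
    if missing:
        issues.append("missing fields: " + ", ".join(missing))
    # Check 2: a 2xx status that carries an error marker in the body.
    body_failed = body.get("status") in {"error", "failed"} or "error" in body
    if 200 <= status_code < 300 and body_failed:
        issues.append("status code reports success but body reports failure")
    return issues
```

Running a check like this on every response, or on a sample, turns silent format changes into explicit signals that can be alerted on.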

Performance Monitoring:

Error rates and error codes
API latency
Ack to webhook delay

Chat with us here or sign up by clicking here!

How to Track Events Across Multiple Services

Siddarth Jain — Thu, 20 Apr 2023 08:33:00 +0000

In this article, we will see the various challenges in microservice architecture and we will then explore one major challenge that many developers and businesses face: how to track the events flowing across these services.

Event tracking is required for various business use cases, such as audit trails, fraud detection, spotting missing API callbacks, raising alarms on unusual patterns, and detecting delays and failures in services.

Microservices Architecture:

If you're well-versed in microservices, you can directly jump to the next section from the table of contents above.

The term microservices has been popular since 2011, when it was introduced at a software architecture workshop near Venice. Since then, many developers have adopted the microservices architectural style to develop their products.

Microservices is an approach that enables teams to work independently on different services, deploy changes more frequently, and scale services independently. The impact of microservices has been significant, enabling organisations to build and scale complex applications faster and with more agility.

Microservices offer several benefits for software development teams. The architecture provides greater agility by allowing faster development cycles and independent deployment of individual services, resulting in faster time-to-market.

Other benefits include better fault isolation, improved reliability, and uptime.

Challenges in Microservices

Microservices bring significant challenges such as distributed ownership of entities. As the number of microservices grows, it becomes difficult to track and understand the impact of decisions, leading to a lack of visibility and isolation in the decision-making process.
Identifying failures or issues across services can become challenging due to the distributed nature of transactions and data flow. Also, failures in one service can have cascading effects on other services.
Managing data in a microservices architecture can become complex due to each microservice having its own data storage and retrieval requirements.
Integration testing becomes more complex.
Ensuring data privacy, having an audit trail, and implementing consistent security across multiple services can be challenging.

How to Implement Event Tracking Across Services?

Event tracking refers to the process of capturing and monitoring user interactions or events within an application, website, or system. Events are user actions or system-generated events such as clicks, taps, searches, logs, and more. For example, say we have an e-commerce platform where customers can place orders for products. When a customer places an order, the order is created in the Checkout Service, and the order details are stored in a database. The order is then processed by various other microservices, such as Inventory Service, Payment Service, and Delivery Service.

One important metric for the platform is the time taken for an order to be delivered. However, tracking this metric is difficult because the order passes through multiple services, and none of the services has context outside their own.

Below are the various ways to implement event tracking:

1. Adding Logs to ELK/Logging Tools

To facilitate issue detection across microservices, logs can be added in different services and sent to ELK or the logging system with a unique key to join events. However, this approach has cons such as complex querying in the logging system and non-trivial searching of events going into different indexes/streams.
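A minimal sketch of the "unique key" idea, using Python's standard `logging` and `json` modules. The field names and the `log_event` helper are hypothetical, and shipping the lines to ELK is outside the scope of the sketch.

```python
import json
import logging


def log_event(logger, service, event, order_id, **fields):
    """Emit one JSON log line; order_id is the key used to join events."""
    record = {"service": service, "event": event, "order_id": order_id, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

If every service logs with the same `order_id` field, a single query on that field can pull together the events for one order, although the events may still sit in different indexes or streams.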

2. Building an In-House System

Implementing event tracking in-house involves building a custom system to track and store events generated by the application. To achieve this, we can embed event-capturing code in the business logic and push the events to a database. This can also be combined with log files to create a custom report or visualization to identify anomalies or insights.

However, it can be complex and time-consuming to build and maintain such systems, so it is only recommended for applications with specific requirements that cannot be met by existing solutions. It may be more efficient to use a third-party event tracking tool such as Dr Droid instead.

3. Building ELT Pipelines:

We can also use Extract-Load-Transform (ELT) techniques. This involves capturing data from multiple sources such as web analytics, mobile apps, and CRM systems, loading it into a centralized data warehouse, and transforming it there for analysis. ELT pipelines do not inherently provide any intelligence on top of the data; to extract value from it, we need a BI tool and complex queries.

Let’s consider an example:

Figure 1: A simplified ETL pipeline

Companies like Uber have created entire teams to build this at scale internally.

To get insights into a metric like the order_delivered time, an ELT pipeline can be created with any OLAP database (in this case, Snowflake) as the sink. Kafka queues stream the data from the Checkout and Delivery services to the Snowflake database.

The Checkout and Delivery services both send Order fact data (created_at and delivered_at timestamps) to Snowflake tables, which can be monitored via a BI tool that generates hourly reports of the trending average service level agreements (SLAs) at the platform level.
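In a warehouse, the hourly report would normally be a SQL query joining the two tables on the order ID. The Python sketch below shows the same join and aggregation on in-memory data so that the logic is easy to follow. The function name and the input shape (dictionaries of ISO 8601 timestamps keyed by order ID) are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hourly_average_delivery_minutes(created, delivered):
    """created/delivered map order_id -> ISO 8601 timestamp string."""
    buckets = defaultdict(list)
    for order_id, created_at in created.items():
        if order_id not in delivered:
            continue  # still in flight; not part of the delivered average
        start = datetime.fromisoformat(created_at)
        end = datetime.fromisoformat(delivered[order_id])
        # Group each order by the hour in which it was created.
        hour = start.replace(minute=0, second=0, microsecond=0)
        buckets[hour.isoformat()].append((end - start).total_seconds() / 60)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}
```

Note that orders without a delivered_at timestamp are skipped, so a report like this shows how long delivered orders took but does not flag the orders that are currently stuck.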

However, setting up and maintaining a mix of tools or ELT can be complex and require significant effort. It may also require specialized skills and resources to build intelligence from the captured data.

How Dr Droid Simplifies Event Tracking.

To simplify the process of tracking events in microservices and correlating them to metrics, tools like Dr Droid can be used. This eliminates the need for managing a shared infrastructure and cron jobs. Dr Droid allows us to send events from microservices and set up alerts for individual orders or aggregated delayed orders. By sending events such as order_created and order_delivered to Dr Droid, it becomes easier to track metrics and set up alerts.

Dr Droid Set up Walkthrough

Here we use the API specs to show how to interact with Dr Droid over its REST API. First, an event called order_created is created.

Figure 2: Generating order_created event

After creating the order_created event, you can see it in the dashboard:

Figure 3: order_created event on the dashboard

Following that, we created another event order_delivered:

Figure 4: order_delivered event generation from API specs

Figure 5: order_delivered event at the dashboard

We can then set up monitoring in the console to track the time taken for each order to be delivered.

Figure 6: Monitor dashboard

Figure 7: Selecting primary and secondary events

Next, we configure a notification that is triggered if the secondary event is delayed after the primary event has occurred.

Figure 8: Setting up an alert

There are two options available for setting up notifications, and I am choosing to enable email notifications.

Figure 9: Notification options

Figure 10: Setting up email notification

Figure 11: Monitoring set up is done

We are now ready to perform a test. We will proceed by creating a new event called order_created.

Figure 12: A new event created order_created

The primary event, which is currently active, can be viewed in the console.

Figure 13: Monitoring dashboard

As we have not yet created the secondary event, order_delivered, an alert notification has been received.

Figure 14: Monitoring dashboard alerts

An email notification was received stating that the secondary event, order_delivered, has not been created within a timeframe of 10 seconds.

Figure 15: Email notification

Figure 16: Notification on the dashboard

In this way, we can simply post the events to Dr Droid over the REST API to capture them in a single place, making it easy to audit, monitor, and trigger alerts as per the business requirements.

Summary

I hope you have learned various use cases of event tracking and why it is vital in the microservices architecture. We also quickly learned various ways to implement event tracking including using Dr Droid.

Do let us know in the comments how you use event tracking, and give Dr Droid a try. Our team will be happy to learn about your use cases.

Building a Data-Driven Engineering culture

Siddarth Jain — Mon, 19 Sep 2022 18:30:00 +0000

TABLE OF CONTENTS

Observability practices to drive data-first culture:

Enable democratic access to observability data and monitoring dashboards:
Creating a team accountability culture:
Make code performance and monitoring a part of Developers’ KPI:
Avoid data fatigue:
Avoiding data scattered across multiple tools:
Look at the Total Cost of Ownership (TCO) and not just the tool pricing while evaluating options:

We're Just Getting Warmed Up

Software is eating the world. This is a cliché that most of you know!

But when you build software and something breaks, what’s the first reaction of your team?

A data-driven engineering culture enables teams to mitigate and resolve conflicts. But where does one start? Setting up observability, and setting it up the right way, can accelerate your team’s journey to becoming data-first.

Observability practices to drive data-first culture:

From our personal experience and interactions with senior engineering leaders, here are 6 ways to set your team up for success.

1. Enable democratic access to observability data and monitoring dashboards:

While an engineer might not typically bother to check the health of a service that is not directly related to their domain, it’s common to have indirect dependencies that need to be checked. Avoiding data silos ensures that teams can get to the root cause without making your DevOps engineers or admins the bottleneck.

2. Creating a team accountability culture:

Empower your engineering team, and hold it accountable, for follow-up communication (post-outage) to all stakeholders, both external and internal:

What was the root cause of the issue?
Why wasn’t it mitigated previously?
What remediation actions have been taken to avoid it in the future?

The same document should be circulated among your engineering team as well as to the relevant/impacted stakeholders in the company.

For some inspiration, here’s how Heroku’s engineering team publishes follow-up reports.

3. Make code performance and monitoring a part of Developers’ KPI:

Modern engineering teams deploy code in tandem - expecting every DevOps/SRE team to monitor every alert is inefficient. Instead, hold developers accountable for the performance & monitoring of their work pre-deployment. Some top-performing teams have the following within the charter of the developer’s responsibility:

Check for instrumentation of observability within the CI/CD pipeline
Measure the performance of the code after deployment/integration
If it’s a new feature/service, create a monitoring dashboard that can track the health of the service appropriately. If it’s a code change on the existing one, the existing dashboard should be re-jigged if needed.

4. Avoid data fatigue:

Have you noticed that your team members frequently leave alerting Slack channels because of irrelevant or overly frequent notifications? This builds a habit of ignoring data while investigating. Here are some strategies that you can consider:

Only have “Actionable” alerts - pair an alert with the “impact” of the alert to make it actionable.
Contextual alerting - Map notifications to relevant stakeholders (both horizontally across teams & vertically within the team).
Continuous improvement - Have the on-call engineer create a report at the end of their rotation stating what % of alerts were relevant.

Here’s a good article by the Atlassian team on some best practices to reduce alert fatigue.

5. Avoiding data scattered across multiple tools:

There are too many developer tools in the market. Period.

For your engineers, adopting a new tool is like building a muscle - it needs conscious effort over a prolonged period of time. In this case, it could be getting used to the user interface or the query language. Create guidelines for setting up monitoring dashboards so that they are easy to access in times of crisis or urgency. As teams build habits around their respective tools, it only gets harder to migrate. (The sooner, the better.)

6. Look at the Total Cost of Ownership (TCO) and not just the tool pricing while evaluating options:

Just because a tool is open-source or free doesn’t make it the go-to option. Sometimes, orchestrating and managing open-source tools can be demanding - if your team is very lean or overworked, avoid tooling that will require constant maintenance and development.

Choose tools that save your team’s time. A quick time-to-value also improves adoption. Once you reach a scale where the tool cost pinches too much, the TCO will automatically start weighing toward the open-source option.

We're Just Getting Warmed Up

At Dr Droid, our team is building tools to simplify the lives of engineering teams. And we are listening to what you have to say!

If you have any engineering practices to share that help drive data-first culture, tell us in the comments below.
