How to Create a Telegram Bot to Monitor Your Service Uptime in Python (Part 1: Instant Metrics)

#python #web3 #linux #telegram

Hello everyone! For several years now, I have been writing "assistant" Telegram bots in Python that handle small routine tasks for me: notifying me about events, checking service uptime, forwarding interesting content from Telegram channels and chats, and so forth.

This is convenient because the phone is always at hand, and being able to fix something on the server without even opening my laptop brings me particular pleasure.

In general, I have accumulated a lot of different small project templates that I want to share with dev.to readers.

I'll say right away that the examples may be niche if used "as is", but I will mark the places where, by changing a few lines of code, you can reuse most of the work in your own projects.

I completed this specific project a few days ago, and it has already brought me a lot of benefits. I work at chainstack.com, a Web3 infrastructure provider, on a service for indexing data from smart contracts on EVM blockchains.

The quality of this service critically depends on how well the nodes from which it retrieves its data are functioning.

I spent many hours trying to use the ready-made tools that our infrastructure division uses, such as Grafana, BetterUptime, and others. But I have little interest in the system's internals; what matters to me are the metrics at the entrance and the exit. So I decided to write my own bot, which would do the following:

At my request, it would go to the service, check the metrics, and send me a brief report on the current situation.
At my other request, it would send me graphs of what has been happening over the last X hours.
In case of a special situation, it would send me a notification that something is happening at that moment.

In this article, I will focus on the first part, that is, receiving metrics on request.

We will need a new virtual environment for work.

cd ~ 
virtualenv -p python3.8 up_env  # create a virtualenv
source ~/up_env/bin/activate  # activate the virtualenv

Install dependencies:

pip install python-telegram-bot
pip install "python-telegram-bot[job-queue]" --pre
pip install --upgrade python-telegram-bot==13.6.0  # the code was written before version 20, so here the version is explicitly specified

pip install numpy # needed for the median value function
pip install web3 # needed for requests to nodes (replace with what you need)

The helper functions live in functions.py. (You could implement this with classes; the example is short and I did not plan to split it into modules, but the multiprocessing library needs the worker functions to be importable, so they go in a separate file.) Import the dependencies:

import numpy as np
import multiprocessing

from web3 import Web3 #  add those libraries needed for your task

Next comes the function that checks the state. In my case, it loops through pre-selected public nodes, retrieves their latest block number, takes the median to filter out any deviations, and then checks our own node against this median.
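To make the role of the median concrete, here is a small worked example. All block heights below are made up for illustration, and it uses the standard library's `statistics.median` instead of numpy so that it runs without extra dependencies.

```python
from statistics import median

# Hypothetical block heights reported by five public nodes:
# one node is stuck far behind, one is slightly ahead.
public_heights = [50_000_120, 50_000_121, 50_000_121, 49_990_000, 50_000_123]

reference = int(median(public_heights))               # middle value of the sorted list
average = sum(public_heights) / len(public_heights)   # dragged down by the stuck node

own_height = 50_000_115        # hypothetical height of the monitored node
lag = own_height - reference   # negative means the node is behind

print(reference, lag)  # 50000121 -6
```

The stuck node pulls the average down by a couple of thousand blocks, while the median stays at the value most healthy nodes agree on, which is why it works as a reference point here.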

Service state checking function (you can replace it with your own):

# Helper function that checks a single node
def get_last_block_once(rpc):
    try:
        w3 = Web3(Web3.HTTPProvider(rpc))
        block_number = w3.eth.block_number
        if isinstance(block_number, int):
            return block_number
        else:
            return None
    except Exception as e:
        print(f'{rpc} - {repr(e)}')
        return None


# Main function to check the status of the service that will be called
def check_service():
    # pre-prepared list of reference nodes
    # for any network, it can be found on the website https://chainlist.org/
    list_of_public_nodes = [
        'https://polygon.llamarpc.com',
        'https://polygon.rpc.blxrbdn.com',
        'https://polygon.blockpi.network/v1/rpc/public',
        'https://polygon-mainnet.public.blastapi.io',
        'https://rpc-mainnet.matic.quiknode.pro',
        'https://polygon-bor.publicnode.com',
        'https://poly-rpc.gateway.pokt.network',
        'https://rpc.ankr.com/polygon',
        'https://polygon-rpc.com'
    ]

    # parallel processing of requests to all nodes
    with multiprocessing.Pool(processes=len(list_of_public_nodes)) as pool:
        results = pool.map(get_last_block_once, list_of_public_nodes)
        last_blocks = [b for b in results if b is not None and isinstance(b, int)]

    # define the maximum and median value of the current block
    med_val = int(np.median(last_blocks))
    max_val = int(np.max(last_blocks))
    # determine the number of nodes with the maximum and median value
    med_support = np.sum([1 for x in last_blocks if x == med_val])
    max_support = np.sum([1 for x in last_blocks if x == max_val])

    return max_val, max_support, med_val, med_support
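
Since each request mostly waits on the network, a thread pool is a possible lighter alternative to a process pool. The sketch below is not the code I run in production; `collect_last_blocks` and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for `get_last_block_once` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_last_blocks(fetch, rpc_urls, max_workers=None):
    """Call fetch(url) for every URL in parallel threads and keep integer results."""
    if not rpc_urls:
        return []
    workers = max_workers or len(rpc_urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as rpc_urls
        results = list(pool.map(fetch, rpc_urls))
    return [b for b in results if isinstance(b, int)]


# Stand-in for get_last_block_once (assumption: returns an int or None)
def fake_fetch(url):
    heights = {"node-a": 100, "node-b": 101, "node-c": None}
    return heights[url]


blocks = collect_last_blocks(fake_fetch, ["node-a", "node-b", "node-c"])
print(blocks)  # [100, 101]
```

If every node fails, the list comes back empty, so it is worth checking for that case before computing the median and maximum.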

The next important file of the bot is uptime_bot.py. We import libraries and functions from the file above and set the necessary constants:

import telegram
from telegram.ext import Updater, CommandHandler, Filters

from functions import get_last_block_once, check_service

# Here you can restrict the bot to a limited circle of users
# by listing their Telegram usernames

ALLOWED_USERS = ['your_telegram_account', 'someone_else']
# The address of the node that I am monitoring (also a public node in this case)
OBJECT_OF_CHECKING = 'https://polygon-mainnet.chainstacklabs.com'
# Threshold for highlighting critical lag
THRESHOLD = 5
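
The allow-list check can also be pulled out into a small helper. This is only a sketch: `is_allowed` is a hypothetical name, and ignoring case and a leading `@` is my assumption about how the list might be filled in, not something the bot requires.

```python
def is_allowed(username, allowed_users):
    """Return True if username is in the allow-list, ignoring case and a leading '@'."""
    if not username:  # accounts without a public username report None
        return False
    normalized = username.lstrip("@").lower()
    return normalized in {name.lstrip("@").lower() for name in allowed_users}


print(is_allowed("Your_Telegram_Account", ["your_telegram_account", "someone_else"]))
print(is_allowed(None, ["your_telegram_account"]))
```

Keep in mind that a username can be changed by its owner, so for anything sensitive it may be safer to compare numeric user IDs instead.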

Next, let's describe a function that will be called when the command is issued from the bot's UI.

def start(update, context):
    """Send a message when the command /start is issued."""

    try:
        # Get the user
        user = update.effective_user

        # Filter out bots
        if user.is_bot:
            return

        # Check if the user is allowed
        username = str(user.username)
        if username not in ALLOWED_USERS:
            return
    except Exception as e:
        print(f'{repr(e)}')
        return

    # Call the main function to check the network status
    max_val, max_support, med_val, med_support = check_service()
    # Call the function to check the status of the specified node
    last_block = get_last_block_once(OBJECT_OF_CHECKING)

    # Create the message to send to Telegram
    message = ""

    # Information about the state of the nodes in the public network (median, maximum, and number of nodes)
    message += f"Public median block number {med_val} (on {med_support} RPCs)\n"
    message += f"Public maximum block number +{max_val - med_val} (on {max_support} RPCs)\n"

    # Compare with the threshold
    if last_block is not None:
        out_text = str(last_block - med_val) if last_block - med_val < 0 else '+' + str(last_block - med_val)

        if abs(last_block - med_val) > THRESHOLD:
            message += f"The node block number shift ⚠️<b>{out_text}</b>⚠️"
        else:
            message += f"The node block number shift {out_text}"
    else: # Exception processing if a node has not responded
        message += f"The node has ⚠️<b>not responded</b>⚠️"

    # Send the message to the user
    context.bot.send_message(chat_id=user.id, text=message, parse_mode="HTML")
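
The threshold logic is easier to test when it is separated from the Telegram call. Below is a sketch of the same comparison as a pure function; `format_shift` is a hypothetical helper name, and the message texts are copied from the handler above.

```python
def format_shift(last_block, med_val, threshold):
    """Return the HTML line describing how far the node is from the median."""
    if last_block is None:
        return "The node has ⚠️<b>not responded</b>⚠️"
    shift = last_block - med_val
    out_text = f"{shift:+d}"  # always show the sign: -7, +0, +3
    if abs(shift) > threshold:
        return f"The node block number shift ⚠️<b>{out_text}</b>⚠️"
    return f"The node block number shift {out_text}"


print(format_shift(95, 100, 5))    # exactly at the threshold, no warning
print(format_shift(93, 100, 5))    # beyond the threshold, highlighted
print(format_shift(None, 100, 5))  # node did not answer
```

Note that a shift equal to the threshold is not highlighted, because the comparison is strictly "greater than".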

Now, all that's left is to add the part where the bot is initialized, and the handler function is connected:

token = "xxx"  # Bot token obtained from BotFather

# set up the bot
bot = telegram.Bot(token=token)
updater = Updater(token=token, use_context=True)
dispatcher = updater.dispatcher

# bind the handler function
dispatcher.add_handler(CommandHandler("start", start, filters=Filters.chat_type.private))

# run the bot
updater.start_polling()
# block the main thread until the process receives a stop signal (Ctrl+C, SIGTERM)
updater.idle()
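
Hard-coding the token is fine for a quick test, but it is easy to commit it to a repository by accident. One possible alternative is to read it from an environment variable. In this sketch, `load_token` and the variable name `TELEGRAM_BOT_TOKEN` are my own assumptions; use whatever name fits your setup.

```python
import os


def load_token(env=os.environ, name="TELEGRAM_BOT_TOKEN"):
    """Read the bot token from an environment variable (the name is an assumption)."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable to the BotFather token")
    return token


# In uptime_bot.py this would replace the hard-coded value:
# token = load_token()
```

Failing early with a clear error is a deliberate choice here: a missing token otherwise only shows up later as a less obvious error from the Telegram API.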

Finally, you can run the code on a cheap VPS server using:

source ~/up_env/bin/activate
python uptime_bot.py

For a permanent setup, run these commands from a systemd unit file.

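As a result, the bot's reply looks as follows.

If everything is fine, it lists the public median block number, the public maximum, and the node's shift as plain text.

If the lag exceeds the threshold, the shift is shown in bold between ⚠️ signs.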

In the following articles, I will describe how to implement the two remaining tasks:

Retrieve graphs on request showing the events that occurred over the last X hours.
Receive an alert indicating that something is currently happening and requires action.

The project's source code is available in the GitHub repository. If you found this tutorial helpful, feel free to give it a star on GitHub; I would appreciate it 🙂