This article covers the pitfalls of scheduling Python scripts, possible solutions, and the approach I've settled on.
Imagine the following case: periodically, you have to run updating SQL queries through ORM models from your codebase, with some intermediate processing. For example, once per day you want to detect inactive users by indirect factors, like actions they have or haven't taken within a certain time frame, and mark them as inactive. This scenario often occurs in data-driven applications where you need to maintain the integrity and accuracy of your data.
Or a different one:
Every Monday at 10 AM your script must send internal reports on Slack to every manager about their employees' non-compliant Key Performance Indicators. This scenario is a common requirement in organizations striving to monitor and improve performance across various teams and departments.
What could go wrong?
Async code
If your codebase contains asynchronous parts that you'd like to reuse, you need to consider async support. I wouldn't recommend relying solely on the standard asyncio.run approach, because you'll forfeit the advantages of asynchrony and your code will operate like regular synchronous code.
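To illustrate, here is a minimal sketch with hypothetical job names (not taken from any library): if a synchronous scheduler wraps every job in its own asyncio.run call, the jobs still run one after another; keeping a single long-lived event loop lets them actually overlap.
import asyncio

async def mark_inactive_users():
    await asyncio.sleep(2)  # stands in for async ORM queries

async def send_slack_reports():
    await asyncio.sleep(2)  # stands in for async HTTP calls

# Anti-pattern: each call spins up and tears down its own event loop,
# so the two "async" jobs still take ~4 seconds in total.
def run_jobs_naively():
    asyncio.run(mark_inactive_users())
    asyncio.run(send_slack_reports())

# One long-lived loop: the jobs overlap and finish in ~2 seconds.
async def run_jobs_concurrently():
    await asyncio.gather(mark_inactive_users(), send_slack_reports())

if __name__ == "__main__":
    asyncio.run(run_jobs_concurrently())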
Concurrent execution
If you have more than one job, you'll have to deal with parallel job execution. However, a common issue that arises is time shifting, as the scheduler may need to wait until one job finishes before it can move on to the next. This can potentially lead to delays in executions, especially when jobs have varying runtimes or when you have many jobs.
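A rough sketch of the problem with hypothetical jobs (not tied to any particular scheduler): run one by one, a slow job pushes back the start time of everything scheduled after it, while giving each job its own thread keeps the start times independent.
import threading
import time

def long_report():
    time.sleep(600)  # pretend this report takes 10 minutes

def quick_cleanup():
    print("cleanup started at", time.strftime("%H:%M:%S"))

# Sequential execution: quick_cleanup drifts by the whole runtime of long_report.
# long_report(); quick_cleanup()

# One thread per job keeps start times independent of each other's runtime.
threads = [threading.Thread(target=job) for job in (long_report, quick_cleanup)]
for t in threads:
    t.start()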
Resource leakage
There are different ways to start jobs, like using separate Python threads, individual processes, or event loops. The choice of initiation method can have a significant impact on CPU and memory utilization when you have a whole lot of jobs.
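For example, here is a sketch (not specific to any scheduler) of one way to keep the footprint bounded: a small reusable worker pool inside one process costs far less memory and CPU than spawning a fresh OS process per job the way cron does.
from concurrent.futures import ThreadPoolExecutor

def small_job(n: int) -> None:
    print(f"job {n} done")

# Reusing at most 4 worker threads instead of starting 100 separate
# processes keeps the resource footprint bounded and predictable.
with ThreadPoolExecutor(max_workers=4) as pool:
    for n in range(100):
        pool.submit(small_job, n)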
Restart intolerance
This is the inability to survive reboots without affecting job intervals. In other words, after a reboot you get a time shift because the interval countdown is interrupted. For example, you have a job with a 3-hour interval. After 2 hours of waiting, the CI/CD pipeline rebuilds the image and recreates the container because you pushed new commits. If the scheduler starts counting from zero again, you get 2 hours of waiting before the reboot and 3 after, so the actual interval between executions is 5 hours, which is a significant time shift. Sometimes that can be critical.
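One way to keep the interval intact across restarts is to persist the next planned run time on disk, so a rebuilt container picks up the countdown where it left off instead of starting from zero. A minimal sketch; the state file name and job are hypothetical:
import json
import time
from pathlib import Path

STATE_FILE = Path("next_run.json")  # hypothetical persistent state file
INTERVAL = 3 * 60 * 60              # the 3-hour interval from the example

def run_job() -> None:
    print("running the 3-hour job")

def load_next_run() -> float:
    # After a restart, resume the old countdown instead of resetting it.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["next_run"]
    return time.time() + INTERVAL

next_run = load_next_run()
while True:
    if time.time() >= next_run:
        run_job()
        next_run = time.time() + INTERVAL
        STATE_FILE.write_text(json.dumps({"next_run": next_run}))
    time.sleep(1)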
Duplicate parallel executions
If the execution time of a job is longer than the interval between runs, or it varies, you can easily end up with the scheduler starting a new execution before the previous one has finished. This leads to job multiplication, resource leakage, and even a critical system shutdown due to overload. This case must be considered if you have short intervals or long-running jobs.
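A common safeguard, sketched below with a hypothetical job body, is a non-blocking lock: if the previous execution is still running, the new tick is simply skipped instead of piling up.
import threading
import time

_job_lock = threading.Lock()

def do_the_actual_work() -> None:
    time.sleep(120)  # stands in for a long-running job body

def guarded_job() -> None:
    # Skip this tick entirely if the previous execution hasn't finished yet.
    if not _job_lock.acquire(blocking=False):
        print("previous run still in progress, skipping this one")
        return
    try:
        do_the_actual_work()
    finally:
        _job_lock.release()

if __name__ == "__main__":
    # Two overlapping triggers: the second one is skipped, not queued.
    threading.Thread(target=guarded_job).start()
    threading.Thread(target=guarded_job).start()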
Localization
If you need to execute jobs at different times in different time zones on different days, it might be tricky to configure. For example, one job must run at 11 AM New York time on Mondays only, another at 5 PM Berlin time on weekends, and the last one every 3 hours in Tokyo time on Fridays only.
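With the standard library alone, this usually means checking the wall clock of the target time zone yourself. A minimal sketch (Python 3.9+, hypothetical job) for the "11 AM in New York, Mondays only" case:
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

def should_send_ny_report() -> bool:
    # Monday is weekday() == 0; the check uses New York's wall clock.
    now = datetime.now(ZoneInfo("America/New_York"))
    return now.weekday() == 0 and now.hour == 11

if should_send_ny_report():
    print("sending the New York report")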
So, a lot could go wrong :)
Approaches
Let's look at possible ways to overcome pitfalls in order of their effectiveness.
Schedule Python lib
Schedule is the most basic scheduler for Python and the first one you'll find if you start googling.
It looks interesting, but the main disadvantage of this lib is that it isn't designed for serious, real-world use. According to the official documentation:
You should probably look somewhere else if you need:
- Job persistence (remember schedule between restarts)
- Exact timing (sub-second precision execution)
- Concurrent execution (multiple threads)
- Localization (workdays or holidays)
Schedule does not account for the time it takes for the job function to execute.
Anyway, let's look at an example, pros and cons for further comparison.
import schedule
import time

def job():
    print("I'm working...")

schedule.every().day.at("10:30").do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
Pros
- Simple
Cons
- No async code support
- No concurrent execution
- Restart intolerance
- No localization
Cron Unix utility
Cron is the most popular general-purpose scheduler in the world. This scheduler is universal and starts jobs as a shell command. Take note that Cron starts a separate process for every job and this may take a lot of system resources.
Here you need a crontab file:
# m h dom mon dow command
*/30 * * * * /usr/local/bin/python /path/to/the/script.py >> /var/log/cron.log 2>&1
In turn, the script can contain whatever your heart desires:
if __name__ == "__main__":
    print("Whatever your heart desires")
Pros
- Concurrent execution
- Versatility
Cons
- No async code support
- Resource leakage
- Restart intolerance
- Duplicate parallel executions
- No localization
Regta Python utility
Regta is a scheduling tool designed specifically for Python with these pitfalls in mind. The key advantage is that it supports async, multithreading, and multiprocessing, as well as restart tolerance.
from regta import async_job, Period

@async_job(Period().on.sunday.at("18:35").by("Asia/Almaty"))
async def my_async_job():
    pass  # Do some stuff here
To run it, use the regta run command.
Pros
- Async code support
- Concurrent execution
- Restart tolerance
- Localization
Cons
- No strong community
Summary
When it comes to scheduling Python programs, the first option, Schedule, may not be the most suitable choice for solving real-world problems. Instead, consider the following points:
For internal automation tasks that involve various programming languages and CLI tools, Cron stands as a robust choice.
If your automation needs are focused on Python exclusively, Regta emerges as a compelling option, offering a wealth of Python-specific optimizations. Give it a try, and feel free to share your thoughts in the comments 🙌