Deep night. The phone rings. The critical situation in the system. CPU is reached one hundred percent. There is an urgent need to find a solution.
I think this scenario is familiar to many engineers.
On mission-critical systems, it is important to be proactive in preventing the situation that the key parameters of the system reach maximum values that directly impact performance, stability, and reliability.
Like in medicine, we want to find a cure as soon as the first symptoms appear.
Databases are an essential key element of modern systems.
This blog is about monitoring PostgreSQL database(s).
It describes an approach that makes it easy to identify queries that use the system inefficiently, help to find a root cause for the performance issues, and assist to understand typical workload patterns and performance bottlenecks. The found patterns and queries can be improved and the system will work efficiently and resiliently.
The Proactive PostgreSQL Database(s) Performance Scanner is a script that connects to a database and runs a set of probes that can be extended if desired. All the probes are queries to a database, that are unified by structure.
It includes:
- the threshold value,
- description of the check,
- to which issue this check is associated,
- recommendation on how to troubleshoot the issue
- SQL query to perform the check
- an additional optional SQL query in case there is a need for more evidence
If some check exceeds the threshold value, then the corresponding report will be generated in the following standard form:
- description of the check
- datetime
- environment details
- issue
- details about the issue
- additional evidence
- recommendation
The script has the following structure:
- the function that executes the probes (mainProcessor)
- the function that checks input parameters (helpFunction)
- set the number of characters the queries will be cut. It's useful to make output readable in case queries are too long.
- populate the environment details.
- check the PostgreSQL version. It is useful in case different queries/checks should be performed depending on the version of the DB engine.
- check the pg_stat_statements extension is enabled. The script is using it.
- expandable set of probes.
The monitoring script has the option to run different types of queries depending on the version of the PostgreSQL database being checked. It's useful when the database metadata structure depends on the version.
If there are several PostgreSQL databases that need to be monitored, the Proactive PostgreSQL Database(s) Performance Scanner script can be run in a loop.
The basic version of the script contains sample checks that can be used for monitoring. It includes probes related to connection utilization, long non-optimal queries, high CPU utilization by queries, etc. It can be extended to any other metrics and indicators that it is important to monitor and check.
Example of how to run the Proactive PostgreSQL DB Performance Scanner:
proactive_pg_db_performance_scanner.sh -h db_host -p 5432 -U postgres -d postgres
Examples of output:
Check in the pg_stat_statements DB queries that take more than 5000 ms
DateTime: 20230105_112233
Environment: Host:db_host; Port:5432; DB_Username:postgres; DB_Name: postgres
Issue: Long-running queries
Details:
userid | dbid | db_name | total_time | calls | mean | query | chk
--------+-------+----------------------+------------+-------+-------+----------------------------------------------------------------------
11111 | 11112 | my_database_1 | 55555.00 | 1 | 55555 | select * from my_table where a='12345' | vwv
11111 | 11112 | my_database_1 | 33333.00 | 1 | 33333 | update my_table set a='12345' | vwv
11111 | 11112 | my_database_1 | 11111.00 | 1 | 11111 | delete from my_table where a='12345' | vwv
(3 rows)
Recommendation: Check why the query/queries take so much time. It may be a heavy non-optimized query. Maybe it's an unusual application pattern.
Check the queries that occupy more than 15 % of a CPU
DateTime: 20230106_115523
Environment: Host:db_host; Port:5432; DB_Username:postgres; DB_Name: postgres
Issue: Query/queries that utilize significant portion of CPU
Details:
userid | dbid | db_name | total_time | calls | mean | cpu_portion_pctg | query | chk
--------+-------+---------------------+--------------+---------+----------+------------------+----------------------------------------+-----
11111 | 11112 | my_database_1 | 888799911.12 | 9999999 | 88.88 | 80.00 | select * from my_table where a='12345' | wvw
11111 | 11112 | my_database_1 | 99999.99 | 1 | 99999.99 | 20.00 | update my_table set a='12345' | wvw
(2 rows)
Recommendation: Check why the query/queries take a significant portion of the CPU. Maybe it takes significant time. Maybe it's running too frequently. Try to analyze why this DB query takes a significant part of the CPU.
The query/queries that allocates/allocate a significant number of connection slots (Threshold=300)
DateTime: 20230106_120551
Environment: Host:db_host; Port:5432; DB_Username:postgres; DB_Name: postgres
Issue: The most of connection slots are occupied by single query
Details:
pctg | query | num_of_allocated_connection_slots_by_the_query | tot_allocated_slots | chk
-------+-------------------------------------------------------+------------------------------------------------+---------------------+-----
55.50 | select * from my_table where a='12345' | 555 | 1000 | wvw
33.30 | update my_table set a='12345' | 333 | 1000 | wvw
(2 rows)
Recommendation: Check why a single pattern of queries allocates so many connection slots. It may be application logic, or an unusual application pattern issue.
Below is a source code of the Proactive PostgreSQL DB Performance Scanner.
#!/bin/bash
##########################################################
#
# Proactive PostgreSQL DB Performance Scanner
#
# Purpose: Connect to the PostgreSQL DB instance
# and run a set of queries
# to find problematic performance patterns
#
# Provide the output in the format:
#
# DateTime:
# Environment:
# Issue:
# Evidence:
# Recommendation:
#
# Date: 04-Jan-2023
#
# Author: Dmitry
#
###########################################################
mainProcessor()
{
inpHost="${1}"
inpPort="${2}"
inpDBusername="${3}"
inpDBname="${4}"
sql_query="${5}"
sql_query_extra="${6}"
probe="${7}"
the_environment="${8}"
issue="${9}"
recommendation="${10}"
the_line="${11}"
if [ -z "$inpHost" ] || [ -z "$inpPort" ] || [ -z "$inpDBusername" ] || [ -z "$inpDBname" ]
then
echo "Error: not populated parameters!"
exit 3
fi
answer=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -c "$sql_query")
nRowsReturned=`echo $answer | grep wvw | wc -l`
if [ -z "$sql_query_extra" ]
then
evidence=""
else
evidence=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -c "$sql_query_extra")
fi
if [ "$nRowsReturned" -gt "0" ]; then
current_datetime=`date +"%Y%m%d_%H%M%S"`
echo " "
echo "$probe"
echo "DateTime: $current_datetime"
echo "Environment: $the_environment"
echo "Issue: $issue"
echo "Details:"
echo "$answer"
if [ ! -z "$evidence" ]
then
echo "Evidence:"
echo "$evidence"
fi
echo "Recommendation: $recommendation"
echo " "
echo "$the_line"
fi
}
helpFunction()
{
echo ""
echo "Usage: $0 -h hostname -p port -U db_username -d db_name"
echo -e "\t-h Postgres hostname"
echo -e "\t-p Postgers port"
echo -e "\t-U Postgres DB username"
echo -e "\t-d Postgres DB name"
echo -e " "
exit 1 # Exit script after printing help
}
while getopts "h:p:U:d:" opt
do
case "$opt" in
h ) inpHost="$OPTARG" ;;
p ) inpPort="$OPTARG" ;;
U ) inpDBusername="$OPTARG" ;;
d ) inpDBname="$OPTARG" ;;
? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
esac
done
# Print helpFunction in case parameters are empty
if [ -z "$inpHost" ] || [ -z "$inpPort" ] || [ -z "$inpDBusername" ] || [ -z "$inpDBname" ]
then
echo "Some or all of the parameters are empty";
helpFunction
fi
echo " "
echo "Proactive PG DB Performance Scanner"
echo " "
the_line=" === === === === === === === === === === === === === === === "
echo "$the_line"
query_lenght_to_print=2048
the_environment="Host:$inpHost; Port:$inpPort; DB_Username:$inpDBusername; DB_Name: $inpDBname"
DBVersion=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -t -c "select version(); ")
DBVersion_Num=`echo $DBVersion | awk ' { print $2 } '`
# Check that pg_stat_statements is enabled and populated
n_check=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -qtX << EOF
select count(1) from information_schema.tables where table_name = 'pg_stat_statements';
EOF
)
n_check=`echo $n_check | xargs`
if [ "$n_check" -eq "0" ]; then
echo "The pg_stat_statements table does not exist. Please enable pg_stat_statements to be populated with recs."
echo " "
exit 1
fi
n_rows_pg_stat_statements=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -qtX << EOF
select count(1) n_rows_pg_stat_statements
from pg_stat_statements
EOF
)
n_rows_pg_stat_statements=`echo $n_rows_pg_stat_statements | xargs`
if [ "$n_rows_pg_stat_statements" -eq "0" ]; then
echo "The pg_stat_statements table is empty. Please enable pg_stat_statements. It should be populated with recs."
echo " "
exit 1
fi
#################################################################################################
############### probe 1
n_threshold=10
probe="Check the databases having more than $n_threshold active connections"
issue="It were found databases with the high number of active connections"
recommendation="Check why Customers open so many active connections. It may be wrong configuration or unusual application pattern."
sql_query="
select datname, count(1) num_of_active_connections, 'wvw' chk
from pg_stat_activity
where datname!='' and state!='idle'
group by datname
having count(1)>$n_threshold
order by 2 desc
"
sql_query_extra="
select datname, state, client_addr, client_hostname, substr(query, 1, $query_lenght_to_print) query
from pg_stat_activity
where state!='idle' and datname in (
select datname
from
(
select datname, count(1) num_of_active_sessions
from pg_stat_activity
where state!='idle' and datname!=''
group by 1
having count(1)>0
) M
)
order by 1, 5
"
mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
"$the_environment" "$issue" "$recommendation" "$the_line"
############### probe 2
n_threshold=30
probe="Check DB queries that take more than $n_threshold seconds"
issue="Long-running queries"
recommendation="Check why the query/queries take so much time. It maybe it's heavy non-optimized query. Maybe it's unusual application pattern. "
sql_query="
select
now()-query_start as runtime,
pid as process_id,
datname as db_name,
client_addr,
client_hostname,
substr(query, 1, $query_lenght_to_print) query,
'wvw' chk
from pg_stat_activity
where state!='idle' and datname!=''
and now() - query_start > '$n_threshold seconds'::interval
order by 1 desc;
"
sql_query_extra=""
mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
"$the_environment" "$issue" "$recommendation" "$the_line"
############### probe 3
n_threshold=2000
probe="Check in the pg_stat_statements DB queries that take more than $n_threshold ms"
issue="Long-running queries"
recommendation="Check why the query/queries take so much time. It may be it is a heavy non-optimized query. Maybe it's an unusual application pattern."
if [[ $DBVersion_Num == "11"* || $DBVersion_Num == "12"* ]] ; then
sql_query="
SELECT
pss.userid,
pss.dbid,
pd.datname as db_name,
round(pss.total_time::numeric, 2) as total_time,
pss.calls,
round(pss.mean_time::numeric, 0) as mean,
substr(pss.query, 1, $query_lenght_to_print) query,
'wvw' chk
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
and round(pss.mean_time::numeric, 0) > $n_threshold
ORDER BY round(pss.mean_time::numeric, 0) desc
LIMIT 30;
"
else
sql_query="
SELECT
pss.userid,
pss.dbid,
pd.datname as db_name,
round((pss.total_exec_time + pss.total_plan_time)::numeric, 2) as total_time,
pss.calls,
round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 0) as mean,
substr(pss.query, 1, $query_lenght_to_print) query,
'wvw' chk
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
and round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 0) > $n_threshold
ORDER BY round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 0) desc
LIMIT 30;
"
fi
sql_query_extra=""
mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
"$the_environment" "$issue" "$recommendation" "$the_line"
############### probe 4
n_threshold=10
probe="Check the queries that occupy more than $n_threshold % of a CPU"
issue="Query/queries that utilize significant portion of CPU"
recommendation="Check why the query/queries take a significant portion of the CPU. Maybe it takes significant time. Maybe it's running too frequently. Try to analyze why this DB query takes a significant part of the CPU."
if [[ $DBVersion_Num == "11"* || $DBVersion_Num == "12"* ]] ; then
sql_query="
select M.*, 'wvw' chk
from
(SELECT
pss.userid,
pss.dbid,
pd.datname as db_name,
round(pss.total_time::numeric, 2) as total_time,
pss.calls,
round(pss.mean_time::numeric, 2) as mean,
round((100 * pss.total_time / sum(pss.total_time::numeric) OVER ())::numeric, 2) as cpu_portion_pctg,
substr(pss.query, 1, $query_lenght_to_print) query
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
ORDER BY pss.total_time
DESC LIMIT 30) M
where cpu_portion_pctg > $n_threshold;
"
else
sql_query="
select M.*, 'wvw' chk
from
(SELECT
pss.userid,
pss.dbid,
pd.datname as db_name,
round((pss.total_exec_time + pss.total_plan_time)::numeric, 2) as total_time,
pss.calls,
round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 2) as mean,
round((100 * (pss.total_exec_time + pss.total_plan_time) / sum((pss.total_exec_time + pss.total_plan_time)::numeric) OVER ())::numeric, 2) as cpu_portion_pctg,
substr(pss.query, 1, $query_lenght_to_print) query
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
ORDER BY (pss.total_exec_time + pss.total_plan_time)
DESC LIMIT 30) M
where cpu_portion_pctg > $n_threshold;
"
fi
sql_query_extra=""
mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
"$the_environment" "$issue" "$recommendation" "$the_line"
############### probe 5
n_threshold=1000
probe="Check DB queries that run more than $n_threshold times per second"
issue="Too frequent DB queries"
recommendation="Check why the query/queries run so frequent. Maybe it's pointing to some abnormal pattern. "
sql_query="
select M.*, 'wvw' chk
from
(with
a as (select dbid, queryid, query, calls s from pg_stat_statements),
b as (select dbid, queryid, query, calls s from pg_stat_statements, pg_sleep(1))
select
pd.datname as db_name,
substr(a.query, 1, $query_lenght_to_print) as the_query,
sum(b.s-a.s) as runs_per_second
from a, b, pg_database pd
where
a.dbid= b.dbid
and
a.queryid = b.queryid
and
pd.oid=a.dbid
and
pd.datname not in ('postgres')
group by 1, 2
having sum(b.s-a.s) > $n_threshold
order by 3 desc) M;
"
sql_query_extra=""
mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
"$the_environment" "$issue" "$recommendation" "$the_line"
############### probe 6
n_threshold=5
probe="Actual connections to Max connections ratio (Threshold=$n_threshold)"
issue="Too high ratio of actual connections to max connections"
recommendation="Check that there is enough connection slots."
sql_query="
select a connection_slots_occupied,
b max_connections,
round((a.actual_connections::float/nullif(b.max_connections::float,0))::numeric*100, 2) the_ratio,
'wvw' chk
from
(select count(1) as actual_connections from pg_stat_activity) a,
(select setting as max_connections from pg_settings where name='max_connections') b
where round((a.actual_connections::float/nullif(b.max_connections::float,0))::numeric*100, 2) > $n_threshold;
"
sql_query_extra="
select datname, substr(query, 1, $query_lenght_to_print) query, count(1) num_of_allocated_connection_slots
from pg_stat_activity
group by 1, 2
having count(1) > 5
order by 3 desc;
"
mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
"$the_environment" "$issue" "$recommendation" "$the_line"
############### probe 7
n_threshold=5
probe="The query/queries that allocates/allocate the most connection slots (Threshold=$n_threshold)"
issue="The most of connection slots are occupied by single query"
recommendation="It maybe configuration issue. It looks suspicious. because single query occupies the most connection slots of the DB instance"
sql_query="
select
round((M.num_of_allocated_connection_slots_by_the_query::float/nullif(M.tot_allocated_slots::float,0))::numeric*100, 2) pctg,
M.*
from
(select
substr(query, 1, $query_lenght_to_print) query,
count(1) num_of_allocated_connection_slots_by_the_query,
(select count(1) as n from pg_stat_activity) tot_allocated_slots,
'wvw' chk
from
pg_stat_activity
group by 1, 3
having count(1) > $n_threshold
order by 2 desc) M;
"
sql_query_extra=""
mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
"$the_environment" "$issue" "$recommendation" "$the_line"
echo " "
######################################
#
# End
#
######################################
Top comments (0)