Frequent Replication Lag Issue in Patroni Cluster

#patroni #docker #etcd #timescaledb

Hello,

In Production Setup I have faced Replication Lag in either both or one of the Replica Nodes in the gap of 10 days.

Current Version:
Installation Method: Docker(timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest)
OS: Ubuntu 22.04.1 (inside docker container)
PostgreSQL Version: 13.8
TimescaleDB Version: 2.7.1
Setup: 3 Node of Patroni-HA(PGSQL+TSDB) - ETCD and PgPool using Docker Swarm.
Patroni Version: 2.1.4
ETCD Version: 3.5.0

One Solution which I tried: When I saw the log related to Max Retry exceeded. I thought that the heartbeat-interval and election-timeout value in ETCD are causing problem because we are using default values. So I tried by changing the value of heartbeat-interval to 500ms and election-timeout value to 3000ms. But no luck with that.

I request to please help me identify the root cause behind this frequent replication lag.

I am sharing the logs for each services used related to Database HA Setup in the order of time of occurrence of unusual logs in different services being used.

Node2(Leader): ETCD Logs::

{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"634888d6d9dce83c","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.187007ms"}
{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"1ae8e1aa55b8dfeb","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.348601ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.565Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.271107301s","rtt":"1.50669ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.614Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.270118824s","rtt":"4.186608ms"}
{"level":"warn","ts":"2023-09-18T20:38:40.566Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"5.424953406s","rtt":"1.706893ms"}

Node1(Replica): TimescaleDB Logs::

2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=127.0.0.1 port=53422
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-3] postgres@[unknown],app=[unknown] [00000] LOG: disconnection: session time: 0:00:00.003 user=postgres database= host=127.0.0.1 port=53422
server signaled
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-9] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-10] @,app= [00000] LOG: parameter "primary_conninfo" changed to "user=postgres passfile=/home/postgres/.pgpass.patroni host=timescale-2 port=5432 sslmode=prefer application_name=timescaledb-1 gssencmode=prefer channel_binding=prefer"
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-11] @,app= [00000] LOG: parameter "primary_slot_name" changed to "timescaledb_1"
2023-09-18 20:41:53 UTC [3147952]: [6508b611.3008b0-1] @,app= [XX000] FATAL: could not start WAL streaming: ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:41:55 UTC [3145823]: [6508af31.30005f-3] tsdb@tsdb,app=[unknown] [00000] LOG: disconnection: session time: 0:29:21.936 user=tsdb database=tsdb host=...* port=37952
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...* port=56024
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-2] tsdb@tsdb,app=[unknown] [00000] LOG: connection authorized: user=tsdb database=tsdb

Node1(Replica): PATRONI Logs::

2023-09-18 20:42:00,449 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:10,465 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:20,469 WARNING: Master (timescaledb-2) is still alive

Node2(Leader): TimescaleDB Logs::

2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...* port=33918
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-3] postgres@[unknown],app=timescaledb-1 [42704] ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-4] postgres@[unknown],app=timescaledb-1 [42704] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.006 user=postgres database= host=...* port=33918

Node1(Replica): ETCD Logs::

{"level":"warn","ts":"2023-09-18T21:35:32.488Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"4.388946076s","expected-duration":"1s"}
{"level":"warn","ts":"2023-09-18T22:35:31.521Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"3.444079168s","expected-duration":"1s"}
{"level":"info","ts":"2023-09-19T00:09:15.101Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"634888d6d9dce83c","local-member-applied-index":43900439,"local-member-snapshot-index":43800438,"local-member-snapshot-count":100000}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":43900439}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":43895439}
{"level":"info","ts":"2023-09-19T00:09:28.269Z","caller":"fileutil/purge.go:77","msg":"purged","path":"data.etcd/member/snap/0000000000000044-0000000002963cf2.snap"}

Node2(Leader): PATRONI Logs::

2023-09-18 21:39:48,328 ERROR: Request to server http://etcd1:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'etcd1\', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'etcd1\', port=2379): Read timed out. (read timeout=3.3331781973441443)"))')
2023-09-18 21:39:48,328 ERROR:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 211, in _do_http_request
response = request_executor(method, base_uri + path, **kwargs)
File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 566, in wrapper
retval = func(self, *args, **kwargs) is not None
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 659, in touch_member
return self._client.set(self.member_path, data, None if permanent else self._ttl)
File "/usr/lib/python3/dist-packages/etcd/client.py", line 721, in set
return self.write(key, value, ttl=ttl)
File "/usr/lib/python3/dist-packages/etcd/client.py", line 500, in write
response = self.api_execute(path, method, params=params)
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 256, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 230, in _do_http_request
raise etcd.EtcdException('{0} {1} request failed'.format(method, path))
etcd.EtcdException: PUT /v2/keys/service/timescaledb/members/timescaledb-2 request failed
2023-09-18 21:40:05,153 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:15,004 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:25,015 ERROR: Failed to drop replication slot 'timescaledb_3'

Node1(Replica): TimescaleDB Logs::

2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-1] @,app= [00000] LOG: started streaming WAL from primary at 529/9D000000 on timeline 50
2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000032000005290000009D has already been removed

Node2(Leader): TimescaleDB Logs::

2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...* port=35148
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-3] postgres@[unknown],app=timescaledb-1 [58P01] ERROR: requested WAL segment 00000032000005290000009D has already been removed
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-4] postgres@[unknown],app=timescaledb-1 [58P01] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.009 user=postgres database= host=... port=35148

--
Thanks

Top comments (1)

Aviral Singh • Sep 27 '23

Please, if anyone can help me on this issue. It is urgent. It is an issue in production and I am not able identify root cause behind this.

--
Thanks