DEV Community

Aviral Singh
Aviral Singh

Posted on

1

Frequent Replication Lag Issue in Patroni Cluster

Hello,

In Production Setup I have faced Replication Lag in either both or one of the Replica Nodes in the gap of 10 days.

Current Version:
Installation Method: Docker(timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest)
OS: Ubuntu 22.04.1 (inside docker container)
PostgreSQL Version: 13.8
TimescaleDB Version: 2.7.1
Setup: 3 Node of Patroni-HA(PGSQL+TSDB) - ETCD and PgPool using Docker Swarm.
Patroni Version: 2.1.4
ETCD Version: 3.5.0

One Solution which I tried: When I saw the log related to Max Retry exceeded. I thought that the heartbeat-interval and election-timeout value in ETCD are causing problem because we are using default values. So I tried by changing the value of heartbeat-interval to 500ms and election-timeout value to 3000ms. But no luck with that.

I request to please help me identify the root cause behind this frequent replication lag.

I am sharing the logs for each services used related to Database HA Setup in the order of time of occurrence of unusual logs in different services being used.

Node2(Leader): ETCD Logs::

{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"634888d6d9dce83c","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.187007ms"}
{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"1ae8e1aa55b8dfeb","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.348601ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.565Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.271107301s","rtt":"1.50669ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.614Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.270118824s","rtt":"4.186608ms"}
{"level":"warn","ts":"2023-09-18T20:38:40.566Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"5.424953406s","rtt":"1.706893ms"}

Node1(Replica): TimescaleDB Logs::

2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=127.0.0.1 port=53422
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-3] postgres@[unknown],app=[unknown] [00000] LOG: disconnection: session time: 0:00:00.003 user=postgres database= host=127.0.0.1 port=53422
server signaled
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-9] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-10] @,app= [00000] LOG: parameter "primary_conninfo" changed to "user=postgres passfile=/home/postgres/.pgpass.patroni host=timescale-2 port=5432 sslmode=prefer application_name=timescaledb-1 gssencmode=prefer channel_binding=prefer"
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-11] @,app= [00000] LOG: parameter "primary_slot_name" changed to "timescaledb_1"
2023-09-18 20:41:53 UTC [3147952]: [6508b611.3008b0-1] @,app= [XX000] FATAL: could not start WAL streaming: ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:41:55 UTC [3145823]: [6508af31.30005f-3] tsdb@tsdb,app=[unknown] [00000] LOG: disconnection: session time: 0:29:21.936 user=tsdb database=tsdb host=...* port=37952
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...* port=56024
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-2] tsdb@tsdb,app=[unknown] [00000] LOG: connection authorized: user=tsdb database=tsdb

Node1(Replica): PATRONI Logs::

2023-09-18 20:42:00,449 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:10,465 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:20,469 WARNING: Master (timescaledb-2) is still alive

Node2(Leader): TimescaleDB Logs::

2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...* port=33918
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-3] postgres@[unknown],app=timescaledb-1 [42704] ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-4] postgres@[unknown],app=timescaledb-1 [42704] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.006 user=postgres database= host=...* port=33918

Node1(Replica): ETCD Logs::

{"level":"warn","ts":"2023-09-18T21:35:32.488Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"4.388946076s","expected-duration":"1s"}
{"level":"warn","ts":"2023-09-18T22:35:31.521Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"3.444079168s","expected-duration":"1s"}
{"level":"info","ts":"2023-09-19T00:09:15.101Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"634888d6d9dce83c","local-member-applied-index":43900439,"local-member-snapshot-index":43800438,"local-member-snapshot-count":100000}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":43900439}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":43895439}
{"level":"info","ts":"2023-09-19T00:09:28.269Z","caller":"fileutil/purge.go:77","msg":"purged","path":"data.etcd/member/snap/0000000000000044-0000000002963cf2.snap"}

Node2(Leader): PATRONI Logs::

2023-09-18 21:39:48,328 ERROR: Request to server http://etcd1:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'etcd1\', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'etcd1\', port=2379): Read timed out. (read timeout=3.3331781973441443)"))')
2023-09-18 21:39:48,328 ERROR:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 211, in _do_http_request
response = request_executor(method, base_uri + path, **kwargs)
File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 566, in wrapper
retval = func(self, *args, **kwargs) is not None
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 659, in touch_member
return self._client.set(self.member_path, data, None if permanent else self._ttl)
File "/usr/lib/python3/dist-packages/etcd/client.py", line 721, in set
return self.write(key, value, ttl=ttl)
File "/usr/lib/python3/dist-packages/etcd/client.py", line 500, in write
response = self.api_execute(path, method, params=params)
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 256, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 230, in _do_http_request
raise etcd.EtcdException('{0} {1} request failed'.format(method, path))
etcd.EtcdException: PUT /v2/keys/service/timescaledb/members/timescaledb-2 request failed
2023-09-18 21:40:05,153 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:15,004 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:25,015 ERROR: Failed to drop replication slot 'timescaledb_3'

Node1(Replica): TimescaleDB Logs::

2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-1] @,app= [00000] LOG: started streaming WAL from primary at 529/9D000000 on timeline 50
2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000032000005290000009D has already been removed

Node2(Leader): TimescaleDB Logs::

2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...* port=35148
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-3] postgres@[unknown],app=timescaledb-1 [58P01] ERROR: requested WAL segment 00000032000005290000009D has already been removed
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-4] postgres@[unknown],app=timescaledb-1 [58P01] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.009 user=postgres database= host=... port=35148

--
Thanks

AWS Security LIVE!

Tune in for AWS Security LIVE!

Join AWS Security LIVE! for expert insights and actionable tips to protect your organization and keep security teams prepared.

Learn More

Top comments (1)

Collapse
 
aviral_tlpl profile image
Aviral Singh

Please, if anyone can help me on this issue. It is urgent. It is an issue in production and I am not able identify root cause behind this.

--
Thanks

A Workflow Copilot. Tailored to You.

Pieces.app image

Our desktop app, with its intelligent copilot, streamlines coding by generating snippets, extracting code from screenshots, and accelerating problem-solving.

Read the docs

👋 Kindness is contagious

Engage with a sea of insights in this enlightening article, highly esteemed within the encouraging DEV Community. Programmers of every skill level are invited to participate and enrich our shared knowledge.

A simple "thank you" can uplift someone's spirits. Express your appreciation in the comments section!

On DEV, sharing knowledge smooths our journey and strengthens our community bonds. Found this useful? A brief thank you to the author can mean a lot.

Okay