<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aviral Singh</title>
    <description>The latest articles on DEV Community by Aviral Singh (@aviral_tlpl).</description>
    <link>https://dev.to/aviral_tlpl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1170124%2F9bb96028-c7b4-4c9b-baf0-1646c871ab11.png</url>
      <title>DEV Community: Aviral Singh</title>
      <link>https://dev.to/aviral_tlpl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aviral_tlpl"/>
    <language>en</language>
    <item>
      <title>Frequent Replication Lag Issue in Patroni Cluster</title>
      <dc:creator>Aviral Singh</dc:creator>
      <pubDate>Tue, 26 Sep 2023 04:05:53 +0000</pubDate>
      <link>https://dev.to/aviral_tlpl/frequent-replication-lag-issue-in-patroni-cluster-18c6</link>
      <guid>https://dev.to/aviral_tlpl/frequent-replication-lag-issue-in-patroni-cluster-18c6</guid>
      <description>&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;In our production setup, replication lag appears on one or both replica nodes roughly every 10 days.&lt;/p&gt;

&lt;p&gt;Current versions:&lt;br&gt;
Installation method: Docker (timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest)&lt;br&gt;
OS: Ubuntu 22.04.1 (inside the Docker container)&lt;br&gt;
PostgreSQL version: 13.8&lt;br&gt;
TimescaleDB version: 2.7.1&lt;br&gt;
Setup: 3-node Patroni HA cluster (PostgreSQL + TimescaleDB) with ETCD and PgPool, deployed via Docker Swarm&lt;br&gt;
Patroni version: 2.1.4&lt;br&gt;
ETCD version: 3.5.0&lt;/p&gt;

&lt;p&gt;One solution I tried: when I saw the log about max retries being exceeded, I suspected the default heartbeat-interval and election-timeout values in ETCD were causing the problem, so I raised heartbeat-interval to 500ms and election-timeout to 3000ms. That did not help.&lt;/p&gt;
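
&lt;p&gt;To rule out the DCS itself responding slowly, here is a rough stdlib-only probe (a sketch, not a fix) that times ETCD's HTTP /health endpoint from each node. The etcd1 hostname comes from my Patroni logs; etcd2 and etcd3 are assumed peer names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# etcd_probe.py -- rough sketch: time etcd's /health endpoint from each node.
# "etcd1" appears in my Patroni logs; etcd2/etcd3 are assumed peer names.
import time
import urllib.request

ENDPOINTS = ["http://etcd1:2379", "http://etcd2:2379", "http://etcd3:2379"]

for ep in ENDPOINTS:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(ep + "/health", timeout=3) as resp:
            body = resp.read().decode()
        elapsed_ms = (time.monotonic() - start) * 1000
        print("{}: {} in {:.1f}ms".format(ep, body.strip(), elapsed_ms))
    except Exception as exc:
        print("{}: FAILED ({})".format(ep, exc))
&lt;/code&gt;&lt;/pre&gt;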

&lt;p&gt;Please help me identify the root cause of this frequent replication lag.&lt;/p&gt;

&lt;p&gt;Below are the unusual log entries from each service in the database HA setup, in chronological order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node2 (Leader): ETCD Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"634888d6d9dce83c","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.187007ms"}&lt;br&gt;
{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"1ae8e1aa55b8dfeb","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.348601ms"}&lt;br&gt;
{"level":"warn","ts":"2023-09-18T20:38:10.565Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.271107301s","rtt":"1.50669ms"}&lt;br&gt;
{"level":"warn","ts":"2023-09-18T20:38:10.614Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.270118824s","rtt":"4.186608ms"}&lt;br&gt;
{"level":"warn","ts":"2023-09-18T20:38:40.566Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"5.424953406s","rtt":"1.706893ms"}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node1 (Replica): TimescaleDB Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=127.0.0.1 port=53422&lt;br&gt;
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-2] postgres@[unknown],app=[unknown] [00000] LOG:  replication connection authorized: user=postgres&lt;br&gt;
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-3] postgres@[unknown],app=[unknown] [00000] LOG:  disconnection: session time: 0:00:00.003 user=postgres database= host=127.0.0.1 port=53422&lt;br&gt;
server signaled&lt;br&gt;
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-9] @,app= [00000] LOG:  received SIGHUP, reloading configuration files&lt;br&gt;
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-10] @,app= [00000] LOG:  parameter "primary_conninfo" changed to "user=postgres passfile=/home/postgres/.pgpass.patroni host=timescale-2 port=5432 sslmode=prefer application_name=timescaledb-1 gssencmode=prefer channel_binding=prefer"&lt;br&gt;
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-11] @,app= [00000] LOG:  parameter "primary_slot_name" changed to "timescaledb_1"&lt;br&gt;
2023-09-18 20:41:53 UTC [3147952]: [6508b611.3008b0-1] @,app= [XX000] FATAL:  could not start WAL streaming: ERROR:  replication slot "timescaledb_1" does not exist&lt;br&gt;
2023-09-18 20:41:55 UTC [3145823]: [6508af31.30005f-3] tsdb@tsdb,app=[unknown] [00000] LOG:  disconnection: session time: 0:29:21.936 user=tsdb database=tsdb host=*.*.*.* port=37952&lt;br&gt;
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=*.*.*.* port=56024&lt;br&gt;
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-2] tsdb@tsdb,app=[unknown] [00000] LOG:  connection authorized: user=tsdb database=tsdb&lt;/p&gt;
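
&lt;p&gt;The replica is pointed at slot "timescaledb_1" before that slot exists on the new leader. To watch whether Patroni recreates it, a small sketch that lists the physical slots on the leader via psycopg2 (third-party; the host and passfile below are taken from the primary_conninfo in my logs, the rest are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# slot_check.py -- list physical replication slots on the current leader.
# Host and passfile match my primary_conninfo; other details are placeholders.
import psycopg2

conn = psycopg2.connect(host="timescale-2", port=5432,
                        dbname="postgres", user="postgres",
                        passfile="/home/postgres/.pgpass.patroni")
with conn, conn.cursor() as cur:
    cur.execute("SELECT slot_name, active, restart_lsn "
                "FROM pg_replication_slots WHERE slot_type = 'physical'")
    for slot_name, active, restart_lsn in cur.fetchall():
        print("{}: active={} restart_lsn={}".format(slot_name, active, restart_lsn))
conn.close()
&lt;/code&gt;&lt;/pre&gt;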

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node1 (Replica): Patroni Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2023-09-18 20:42:00,449 WARNING: Master (timescaledb-2) is still alive&lt;br&gt;
2023-09-18 20:42:10,465 WARNING: Master (timescaledb-2) is still alive&lt;br&gt;
2023-09-18 20:42:20,469 WARNING: Master (timescaledb-2) is still alive&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node2 (Leader): TimescaleDB Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=*.*.*.* port=33918&lt;br&gt;
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-2] postgres@[unknown],app=[unknown] [00000] LOG:  replication connection authorized: user=postgres application_name=timescaledb-1&lt;br&gt;
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-3] postgres@[unknown],app=timescaledb-1 [42704] ERROR:  replication slot "timescaledb_1" does not exist&lt;br&gt;
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-4] postgres@[unknown],app=timescaledb-1 [42704] STATEMENT:  START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50&lt;br&gt;
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-5] postgres@[unknown],app=timescaledb-1 [00000] LOG:  disconnection: session time: 0:00:00.006 user=postgres database= host=*.*.*.* port=33918&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node1 (Replica): ETCD Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;{"level":"warn","ts":"2023-09-18T21:35:32.488Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"4.388946076s","expected-duration":"1s"}&lt;br&gt;
{"level":"warn","ts":"2023-09-18T22:35:31.521Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"3.444079168s","expected-duration":"1s"}&lt;br&gt;
{"level":"info","ts":"2023-09-19T00:09:15.101Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"634888d6d9dce83c","local-member-applied-index":43900439,"local-member-snapshot-index":43800438,"local-member-snapshot-count":100000}&lt;br&gt;
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":43900439}&lt;br&gt;
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":43895439}&lt;br&gt;
{"level":"info","ts":"2023-09-19T00:09:28.269Z","caller":"fileutil/purge.go:77","msg":"purged","path":"data.etcd/member/snap/0000000000000044-0000000002963cf2.snap"}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node2 (Leader): Patroni Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2023-09-18 21:39:48,328 ERROR: Request to server &lt;a href="http://etcd1:2379"&gt;http://etcd1:2379&lt;/a&gt; failed: MaxRetryError('HTTPConnectionPool(host=\'etcd1\', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'etcd1\', port=2379): Read timed out. (read timeout=3.3331781973441443)"))')&lt;br&gt;
2023-09-18 21:39:48,328 ERROR:&lt;br&gt;
Traceback (most recent call last):&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request&lt;br&gt;
    six.raise_from(e, None)&lt;br&gt;
  File "", line 3, in raise_from&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request&lt;br&gt;
    httplib_response = conn.getresponse()&lt;br&gt;
  File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse&lt;br&gt;
    response.begin()&lt;br&gt;
  File "/usr/lib/python3.10/http/client.py", line 318, in begin&lt;br&gt;
    version, status, reason = self._read_status()&lt;br&gt;
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status&lt;br&gt;
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")&lt;br&gt;
  File "/usr/lib/python3.10/socket.py", line 705, in readinto&lt;br&gt;
    return self._sock.recv_into(b)&lt;br&gt;
TimeoutError: timed out&lt;/p&gt;

&lt;p&gt;During handling of the above exception, another exception occurred:&lt;/p&gt;

&lt;p&gt;Traceback (most recent call last):&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen&lt;br&gt;
    httplib_response = self._make_request(&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request&lt;br&gt;
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout&lt;br&gt;
    raise ReadTimeoutError(&lt;br&gt;
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)&lt;/p&gt;

&lt;p&gt;During handling of the above exception, another exception occurred:&lt;/p&gt;

&lt;p&gt;Traceback (most recent call last):&lt;br&gt;
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 211, in _do_http_request&lt;br&gt;
    response = request_executor(method, base_uri + path, **kwargs)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request&lt;br&gt;
    return self.request_encode_body(&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body&lt;br&gt;
    return self.urlopen(method, url, **extra_kw)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen&lt;br&gt;
    response = conn.urlopen(method, u.request_uri, **kw)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen&lt;br&gt;
    retries = retries.increment(&lt;br&gt;
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment&lt;br&gt;
    raise MaxRetryError(_pool, url, error or ResponseError(cause))&lt;br&gt;
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))&lt;/p&gt;

&lt;p&gt;During handling of the above exception, another exception occurred:&lt;/p&gt;

&lt;p&gt;Traceback (most recent call last):&lt;br&gt;
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 566, in wrapper&lt;br&gt;
    retval = func(self, *args, **kwargs) is not None&lt;br&gt;
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 659, in touch_member&lt;br&gt;
    return self._client.set(self.member_path, data, None if permanent else self._ttl)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 721, in set&lt;br&gt;
    return self.write(key, value, ttl=ttl)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 500, in write&lt;br&gt;
    response = self.api_execute(path, method, params=params)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 256, in api_execute&lt;br&gt;
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)&lt;br&gt;
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 230, in _do_http_request&lt;br&gt;
    raise etcd.EtcdException('{0} {1} request failed'.format(method, path))&lt;br&gt;
etcd.EtcdException: PUT /v2/keys/service/timescaledb/members/timescaledb-2 request failed&lt;br&gt;
2023-09-18 21:40:05,153 ERROR: Failed to drop replication slot 'timescaledb_3'&lt;br&gt;
2023-09-18 21:40:15,004 ERROR: Failed to drop replication slot 'timescaledb_3'&lt;br&gt;
2023-09-18 21:40:25,015 ERROR: Failed to drop replication slot 'timescaledb_3'&lt;/p&gt;
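
&lt;p&gt;Alongside the ETCD timeouts, the leader also keeps failing to drop slot 'timescaledb_3'. To catch the lag as it builds rather than after the fact, a rough poller over pg_stat_replication on the leader (same placeholder connection details as the slot check above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# lag_check.py -- poll replication lag on the leader via pg_stat_replication.
# Connection details are placeholders, as in slot_check.py above.
import time
import psycopg2

conn = psycopg2.connect(host="timescale-2", port=5432,
                        dbname="postgres", user="postgres",
                        passfile="/home/postgres/.pgpass.patroni")
conn.autocommit = True  # each poll runs in its own transaction
while True:
    with conn.cursor() as cur:
        cur.execute("SELECT application_name, state, "
                    "pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) "
                    "FROM pg_stat_replication")
        for name, state, lag_bytes in cur.fetchall():
            print("{}: state={} replay_lag={} bytes".format(name, state, lag_bytes))
    time.sleep(10)
&lt;/code&gt;&lt;/pre&gt;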

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node1 (Replica): TimescaleDB Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-1] @,app= [00000] LOG:  started streaming WAL from primary at 529/9D000000 on timeline 50&lt;br&gt;
2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-2] @,app= [XX000] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000032000005290000009D has already been removed&lt;/p&gt;
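
&lt;p&gt;This is the point of no return: once the slot is gone, the leader recycles WAL that the replica still needs. To see how much WAL the leader retains without a slot, a quick sketch that prints the relevant settings (on PostgreSQL 13 the knob is wal_keep_size; wal_keep_segments was removed in 13):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# wal_retention.py -- show how much WAL the leader keeps once a slot is gone.
# Connection details are placeholders, as in the earlier sketches.
import psycopg2

conn = psycopg2.connect(host="timescale-2", port=5432,
                        dbname="postgres", user="postgres",
                        passfile="/home/postgres/.pgpass.patroni")
with conn, conn.cursor() as cur:
    for guc in ("wal_keep_size", "max_slot_wal_keep_size", "max_wal_size"):
        cur.execute("SHOW " + guc)
        print(guc, "=", cur.fetchone()[0])
conn.close()
&lt;/code&gt;&lt;/pre&gt;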

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Node2 (Leader): TimescaleDB Logs:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=*.*.*.* port=35148&lt;br&gt;
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-2] postgres@[unknown],app=[unknown] [00000] LOG:  replication connection authorized: user=postgres application_name=timescaledb-1&lt;br&gt;
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-3] postgres@[unknown],app=timescaledb-1 [58P01] ERROR:  requested WAL segment 00000032000005290000009D has already been removed&lt;br&gt;
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-4] postgres@[unknown],app=timescaledb-1 [58P01] STATEMENT:  START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50&lt;br&gt;
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-5] postgres@[unknown],app=timescaledb-1 [00000] LOG:  disconnection: session time: 0:00:00.009 user=postgres database= host=*.*.*.* port=35148&lt;/p&gt;

&lt;p&gt;--&lt;br&gt;
Thanks&lt;/p&gt;

</description>
      <category>patroni</category>
      <category>docker</category>
      <category>etcd</category>
      <category>timescaledb</category>
    </item>
  </channel>
</rss>
