Crash report
At certain times, memory usage on the nodes hosting the Redis masters spikes to almost double maxmemory: the dataset size stays the same, but the redis-server process takes up nearly twice as much memory.
We narrowed it down: it happens when we load large amounts of data into Redis while a BGSAVE is in progress.
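One way to watch this live during the repro is to poll the fork's copy-on-write counters while a save runs (a sketch, not our exact tooling; it assumes redis-cli is on the PATH and can reach the affected master on localhost:6379 — current_cow_size and current_fork_perc are the INFO persistence fields shown further down in this report):

```shell
# Kick off a background save, then sample copy-on-write growth
# every 5 seconds until the BGSAVE child exits.
redis-cli -p 6379 BGSAVE
while [ "$(redis-cli -p 6379 INFO persistence | tr -d '\r' \
        | awk -F: '/^rdb_bgsave_in_progress/{print $2}')" = "1" ]; do
    redis-cli -p 6379 INFO persistence \
        | tr -d '\r' | grep -E '^(current_cow_size|current_fork_perc):'
    sleep 5
done
```

Running this while the bulk load is active shows current_cow_size climbing roughly in step with the write volume.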
- We are running Redis Cluster in Kubernetes with 4 shards and 1 replica per shard.
- Each master and replica runs on a separate node.
- maxmemory is limited to 75% of the node's total RAM (on both masters and replicas).
- Each node has 1 TB of RAM.
- We ran the redis-server --test-memory check on the nodes successfully.
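Given the 75% maxmemory setting, the headroom that the BGSAVE fork's copy-on-write pages must fit into can be sketched with back-of-the-envelope arithmetic (the dirty-page fractions below are illustrative assumptions, not measurements; worst case, every page rewritten while the child is alive gets copied):

```python
# Rough headroom arithmetic for fork-based BGSAVE under heavy writes.
GIB = 2**30

node_ram = 1024 * GIB             # 1 TB node, as in the setup above
maxmemory = int(0.75 * node_ram)  # 75% of RAM, as configured
headroom = node_ram - maxmemory   # RAM left for COW copies, backlog, buffers

# Fraction of the dataset rewritten while the BGSAVE child is alive.
# Purely illustrative; a heavy bulk load can push this toward 1.0.
for dirty_fraction in (0.05, 0.25, 0.35, 1.0):
    cow = int(dirty_fraction * maxmemory)
    print(f"dirty={dirty_fraction:>4.0%}  cow={cow / GIB:7.1f} GiB  "
          f"fits in headroom: {cow <= headroom}")
```

With these numbers, rewriting a bit more than a third of a full dataset during one save is already enough to exhaust the 256 GiB of headroom on the node.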
LOGS:
2022-05-25T13:11:25+03:00 {"log":"1:M 25 May 2022 10:11:25.548 # Redis is now ready to exit, bye bye...\n","stream":"stdout","time":"2022-05-25T10:11:25.548920982Z"}
2022-05-25T13:11:25+03:00 {"log":"1:M 25 May 2022 10:11:25.541 * Removing the pid file.\n","stream":"stdout","time":"2022-05-25T10:11:25.541729751Z"}
2022-05-25T13:11:25+03:00 {"log":"1:M 25 May 2022 10:11:25.541 # User requested shutdown...\n","stream":"stdout","time":"2022-05-25T10:11:25.541713801Z"}
2022-05-25T13:11:25+03:00 {"log":"1:signal-handler (1653473485) Received SIGTERM scheduling shutdown...\n","stream":"stdout","time":"2022-05-25T10:11:25.508708909Z"}
2022-05-25T11:37:54+03:00 {"log":"1:M 25 May 2022 08:37:54.283 # Connection with replica client id #678898 lost.\n","stream":"stdout","time":"2022-05-25T08:37:54.283826456Z"}
2022-05-25T11:36:40+03:00 {"log":"1:M 25 May 2022 08:36:40.153 * FAIL message received from 7b3a434c09dbc3d7fb4c1e5ab044e1f7639168fe about 98770dc870778438859b2bdaaf83d8fb7048511a\n","stream":"stdout","time":"2022-05-25T08:36:40.154110709Z"}
2022-05-25T11:21:14+03:00 {"log":"1:M 25 May 2022 08:21:14.560 * FAIL message received from 98f18038773816ef885e53c7572b9ad187092eb7 about 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae\n","stream":"stdout","time":"2022-05-25T08:21:14.560960279Z"}
2022-05-25T11:20:54+03:00 {"log":"1:M 25 May 2022 08:20:54.012 * Synchronization with replica 10.42.62.0:6379 succeeded\n","stream":"stdout","time":"2022-05-25T08:20:54.012517266Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 23 May 2022 10:23:14.757 * Background saving terminated with success\n","stream":"stdout","time":"2022-05-23T10:23:14.757804506Z"}
2022-05-25T15:09:26+03:00 {"log":"214:C 23 May 2022 10:23:12.278 * RDB: 18569 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-05-23T10:23:12.278644435Z"}
2022-05-25T15:09:26+03:00 {"log":"214:C 23 May 2022 10:23:10.869 * DB saved on disk\n","stream":"stdout","time":"2022-05-23T10:23:10.869628351Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 23 May 2022 10:02:48.249 * Background saving started by pid 214\n","stream":"stdout","time":"2022-05-23T10:02:48.249913928Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 23 May 2022 10:02:47.086 * 1 changes in 86400 seconds. Saving...\n","stream":"stdout","time":"2022-05-23T10:02:47.086498619Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 22 May 2022 10:02:46.485 * Background saving terminated with success\n","stream":"stdout","time":"2022-05-22T10:02:46.485203112Z"}
2022-05-25T15:09:26+03:00 {"log":"206:C 22 May 2022 10:02:46.391 * RDB: 2 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-05-22T10:02:46.39131451Z"}
2022-05-25T15:09:26+03:00 {"log":"206:C 22 May 2022 10:02:46.390 * DB saved on disk\n","stream":"stdout","time":"2022-05-22T10:02:46.391045179Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 22 May 2022 10:02:46.384 * Background saving started by pid 206\n","stream":"stdout","time":"2022-05-22T10:02:46.384914716Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 22 May 2022 10:02:46.384 * 1 changes in 86400 seconds. Saving...\n","stream":"stdout","time":"2022-05-22T10:02:46.38468353Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:19.786 * Synchronization with replica 10.33.204.23:6379 succeeded\n","stream":"stdout","time":"2022-05-19T10:17:19.786045267Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:19.785 * Background saving terminated with success\n","stream":"stdout","time":"2022-05-19T10:17:19.78598374Z"}
2022-05-25T15:09:26+03:00 {"log":"199:C 19 May 2022 10:17:19.775 * RDB: 2 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-05-19T10:17:19.775604291Z"}
2022-05-25T15:09:26+03:00 {"log":"199:C 19 May 2022 10:17:19.775 * DB saved on disk\n","stream":"stdout","time":"2022-05-19T10:17:19.775279346Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:19.772 * Background saving started by pid 199\n","stream":"stdout","time":"2022-05-19T10:17:19.772972213Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:19.772 * Starting BGSAVE for SYNC with target: disk\n","stream":"stdout","time":"2022-05-19T10:17:19.772404721Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:19.772 * Replication backlog created, my new replication IDs are '55f4ebd56a964c61ee7d9472b702b10788667099' and '0000000000000000000000000000000000000000'\n","stream":"stdout","time":"2022-05-19T10:17:19.772401835Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:19.772 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '6837b7c1b2b8f3347b4b57a0be9599a4f8c44a44', my replication IDs are '729d9ba528f6a72a997adf017bef474f2587e228' and '0000000000000000000000000000000000000000')\n","stream":"stdout","time":"2022-05-19T10:17:19.772395932Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:19.772 * Replica 10.33.204.23:6379 asks for synchronization\n","stream":"stdout","time":"2022-05-19T10:17:19.772371818Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:15.671 # Cluster state changed: ok\n","stream":"stdout","time":"2022-05-19T10:17:15.671434388Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:04.741 # configEpoch set to 5 via CLUSTER SET-CONFIG-EPOCH\n","stream":"stdout","time":"2022-05-19T10:17:04.74158825Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:03.628 * Ready to accept connections\n","stream":"stdout","time":"2022-05-19T10:17:03.628390028Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:03.627 # Server initialized\n","stream":"stdout","time":"2022-05-19T10:17:03.627935419Z"}
2022-05-25T15:09:26+03:00 {"log":"\n","stream":"stdout","time":"2022-05-19T10:17:03.627933475Z"}
2022-05-25T15:09:26+03:00 {"log":" `-.__.-' \n","stream":"stdout","time":"2022-05-19T10:17:03.627931432Z"}
2022-05-25T15:09:26+03:00 {"log":" `-._ _.-' \n","stream":"stdout","time":"2022-05-19T10:17:03.627929488Z"}
2022-05-25T15:09:26+03:00 {"log":" `-._ `-.__.-' _.-' \n","stream":"stdout","time":"2022-05-19T10:17:03.627927474Z"}
2022-05-25T15:09:26+03:00 {"log":" `-._ `-._`-.__.-'_.-' _.-' \n","stream":"stdout","time":"2022-05-19T10:17:03.62792547Z"}
2022-05-25T15:09:26+03:00 {"log":" | `-._`-._ _.-'_.-' | \n","stream":"stdout","time":"2022-05-19T10:17:03.627921482Z"}
2022-05-25T15:09:26+03:00 {"log":" |`-._`-._ `-.__.-' _.-'_.-'| \n","stream":"stdout","time":"2022-05-19T10:17:03.627919499Z"}
2022-05-25T15:09:26+03:00 {"log":" `-._ `-._`-.__.-'_.-' _.-' \n","stream":"stdout","time":"2022-05-19T10:17:03.627917526Z"}
2022-05-25T15:09:26+03:00 {"log":" | `-._`-._ _.-'_.-' | https://redis.io \n","stream":"stdout","time":"2022-05-19T10:17:03.627915081Z"}
2022-05-25T15:09:26+03:00 {"log":" |`-._`-._ `-.__.-' _.-'_.-'| \n","stream":"stdout","time":"2022-05-19T10:17:03.627904601Z"}
2022-05-25T15:09:26+03:00 {"log":" `-._ `-._ `-./ _.-' _.-' \n","stream":"stdout","time":"2022-05-19T10:17:03.627902677Z"}
2022-05-25T15:09:26+03:00 {"log":" | `-._ `._ / _.-' | PID: 1\n","stream":"stdout","time":"2022-05-19T10:17:03.627900754Z"}
2022-05-25T15:09:26+03:00 {"log":" |`-._`-...-` __...-.``-._|'` _.-'| Port: 6379\n","stream":"stdout","time":"2022-05-19T10:17:03.62789878Z"}
2022-05-25T15:09:26+03:00 {"log":" ( ' , .-` | `, ) Running in cluster mode\n","stream":"stdout","time":"2022-05-19T10:17:03.627896786Z"}
2022-05-25T15:09:26+03:00 {"log":" .-`` .-```. ```\\/ _.,_ ''-._ \n","stream":"stdout","time":"2022-05-19T10:17:03.627894292Z"}
2022-05-25T15:09:26+03:00 {"log":" _.-`` `. `_. ''-._ Redis 6.2.7 (00000000/0) 64 bit\n","stream":"stdout","time":"2022-05-19T10:17:03.627892227Z"}
2022-05-25T15:09:26+03:00 {"log":" _.-``__ ''-._ \n","stream":"stdout","time":"2022-05-19T10:17:03.627890054Z"}
2022-05-25T15:09:26+03:00 {"log":" _._ \n","stream":"stdout","time":"2022-05-19T10:17:03.627885725Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:03.627 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.\n","stream":"stdout","time":"2022-05-19T10:17:03.627781017Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:03.625 * No cluster configuration found, I'm 643d1fc58709cbcfed8564afa8dd9b9db3f40b6d\n","stream":"stdout","time":"2022-05-19T10:17:03.625489173Z"}
2022-05-25T15:09:26+03:00 {"log":"1:M 19 May 2022 10:17:03.625 * monotonic clock: POSIX clock_gettime\n","stream":"stdout","time":"2022-05-19T10:17:03.625084427Z"}
2022-05-25T15:09:26+03:00 {"log":"1:C 19 May 2022 10:17:03.624 # Configuration loaded\n","stream":"stdout","time":"2022-05-19T10:17:03.624645868Z"}
2022-05-25T15:09:26+03:00 {"log":"1:C 19 May 2022 10:17:03.624 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started\n","stream":"stdout","time":"2022-05-19T10:17:03.624642392Z"}
2022-05-25T15:09:26+03:00 {"log":"1:C 19 May 2022 10:17:03.624 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo\n","stream":"stdout","time":"2022-05-19T10:17:03.62463153Z"}
2022-05-25T15:09:26+03:00 {"log":"\n","stream":"stdout","time":"2022-05-19T10:17:03.598021235Z"}
2022-05-25T15:09:26+03:00 {"log":"Storing map with hostnames and IPs\n","stream":"stdout","time":"2022-05-19T10:17:03.595514594Z"}
2022-05-25T15:09:26+03:00 {"log":"COPYING FILE\n","stream":"stdout","time":"2022-05-19T10:16:53.386456162Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:17:03.59 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==\u003e ** Redis setup finished! **\n","stream":"stderr","time":"2022-05-19T10:17:03.597992581Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.44 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==\u003e Setting Redis config file\n","stream":"stderr","time":"2022-05-19T10:16:53.448776533Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.43 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==\u003e Initializing Redis\n","stream":"stderr","time":"2022-05-19T10:16:53.438670922Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.41 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==\u003e ** Starting Redis setup **\n","stream":"stderr","time":"2022-05-19T10:16:53.415803284Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.41 \u001b[0m\n","stream":"stderr","time":"2022-05-19T10:16:53.41487117Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.41 \u001b[0mSubmit issues and feature requests at \u001b[1mhttps://github.com/bitnami/bitnami-docker-redis-cluster/issues\u001b[0m\n","stream":"stderr","time":"2022-05-19T10:16:53.413790066Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.41 \u001b[0mSubscribe to project updates by watching \u001b[1mhttps://github.com/bitnami/bitnami-docker-redis-cluster\u001b[0m\n","stream":"stderr","time":"2022-05-19T10:16:53.412763905Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.41 \u001b[0m\u001b[1mWelcome to the Bitnami redis-cluster container\u001b[0m\n","stream":"stderr","time":"2022-05-19T10:16:53.41178218Z"}
2022-05-25T15:09:26+03:00 {"log":"\u001b[38;5;6mredis-cluster \u001b[38;5;5m10:16:53.41 \u001b[0m\n","stream":"stderr","time":"2022-05-19T10:16:53.410673101Z"}
2022-05-26T10:33:14+03:00 {"log":"1:M 26 May 2022 07:33:14.603 * Background saving terminated with success\n","stream":"stdout","time":"2022-05-26T07:33:14.603239925Z"}
2022-05-26T10:33:02+03:00 {"log":"41338:C 26 May 2022 07:33:02.432 * RDB: 4556 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-05-26T07:33:02.433072595Z"}
2022-05-26T10:32:51+03:00 {"log":"41338:C 26 May 2022 07:32:51.864 * DB saved on disk\n","stream":"stdout","time":"2022-05-26T07:32:51.865100633Z"}
2022-05-26T10:11:12+03:00 {"log":"1:M 26 May 2022 07:11:12.729 # Cluster state changed: fail\n","stream":"stdout","time":"2022-05-26T07:11:12.729252732Z"}
2022-05-26T10:11:12+03:00 {"log":"1:M 26 May 2022 07:11:12.729 * FAIL message received from d7fe568e07f3e1ec821fc4dbaeafca11293b1e28 about a685db2fc949b1bf57ce2ed498627ef3923aca1d\n","stream":"stdout","time":"2022-05-26T07:11:12.729227935Z"}
2022-05-26T10:07:19+03:00 {"log":"1:M 26 May 2022 07:07:19.776 * Clear FAIL state for node d7fe568e07f3e1ec821fc4dbaeafca11293b1e28: replica is reachable again.\n","stream":"stdout","time":"2022-05-26T07:07:19.777017545Z"}
2022-05-26T10:04:32+03:00 {"log":"1:M 26 May 2022 07:04:32.800 * Clear FAIL state for node 98f18038773816ef885e53c7572b9ad187092eb7: replica is reachable again.\n","stream":"stdout","time":"2022-05-26T07:04:32.800895023Z"}
2022-05-26T09:54:39+03:00 {"log":"1:M 26 May 2022 06:54:39.392 * Clear FAIL state for node c304426ea04b1e2cd7723a3858d1fcd3aac260ab: replica is reachable again.\n","stream":"stdout","time":"2022-05-26T06:54:39.392848356Z"}
2022-05-26T08:47:36+03:00 {"log":"1:M 26 May 2022 05:47:35.957 * Background saving started by pid 41338\n","stream":"stdout","time":"2022-05-26T05:47:35.957315341Z"}
2022-05-26T08:47:28+03:00 {"log":"1:M 26 May 2022 05:47:27.989 * Starting BGSAVE for SYNC with target: disk\n","stream":"stdout","time":"2022-05-26T05:47:27.989817384Z"}
2022-05-26T08:47:28+03:00 {"log":"1:M 26 May 2022 05:47:27.989 * Replication backlog created, my new replication IDs are 'bd10232f1f5fa05ce756bdf701b160754bc10220' and '0000000000000000000000000000000000000000'\n","stream":"stdout","time":"2022-05-26T05:47:27.98981523Z"}
2022-05-26T08:47:28+03:00 {"log":"1:M 26 May 2022 05:47:27.989 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'a8d915e175dcbf78b2e7d17bff8e66782edb7268', my replication IDs are '42c7e09a6363cb9a070be2fb5ff985db11c7b287' and '0000000000000000000000000000000000000000')\n","stream":"stdout","time":"2022-05-26T05:47:27.989811904Z"}
2022-05-26T08:47:28+03:00 {"log":"1:M 26 May 2022 05:47:27.989 * Replica 10.42.192.0:6379 asks for synchronization\n","stream":"stdout","time":"2022-05-26T05:47:27.989787707Z"}
2022-05-26T08:21:54+03:00 {"log":"1:M 26 May 2022 05:21:54.697 * FAIL message received from a685db2fc949b1bf57ce2ed498627ef3923aca1d about 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae\n","stream":"stdout","time":"2022-05-26T05:21:54.697144005Z"}
2022-05-26T07:04:46+03:00 {"log":"1:M 26 May 2022 04:04:46.276 * Replication backlog freed after 3600 seconds without connected replicas.\n","stream":"stdout","time":"2022-05-26T04:04:46.277080131Z"}
2022-05-26T06:04:45+03:00 {"log":"1:M 26 May 2022 03:04:45.713 # Connection with replica client id #19805 lost.\n","stream":"stdout","time":"2022-05-26T03:04:45.713702978Z"}
2022-05-26T06:03:14+03:00 {"log":"1:M 26 May 2022 03:03:14.725 * FAIL message received from 7b3a434c09dbc3d7fb4c1e5ab044e1f7639168fe about d7fe568e07f3e1ec821fc4dbaeafca11293b1e28\n","stream":"stdout","time":"2022-05-26T03:03:14.725892791Z"}
2022-05-26T06:02:54+03:00 {"log":"1:M 26 May 2022 03:02:54.431 * Synchronization with replica 10.42.192.0:6379 succeeded\n","stream":"stdout","time":"2022-05-26T03:02:54.432021918Z"}
2022-05-26T06:00:58+03:00 {"log":"1:M 26 May 2022 03:00:58.249 * Marking node 98f18038773816ef885e53c7572b9ad187092eb7 as failing (quorum reached).\n","stream":"stdout","time":"2022-05-26T03:00:58.249719071Z"}
2022-05-26T05:52:49+03:00 {"log":"1:M 26 May 2022 02:52:49.677 # Connection with replica 10.42.208.0:6379 lost.\n","stream":"stdout","time":"2022-05-26T02:52:49.677885224Z"}
2022-05-26T05:52:44+03:00 {"log":"1:M 26 May 2022 02:52:44.603 * Marking node c304426ea04b1e2cd7723a3858d1fcd3aac260ab as failing (quorum reached).\n","stream":"stdout","time":"2022-05-26T02:52:44.60328255Z"}
2022-05-26T05:20:05+03:00 {"log":"1:M 26 May 2022 02:20:05.288 * Background saving terminated with success\n","stream":"stdout","time":"2022-05-26T02:20:05.288675082Z"}
2022-05-26T05:19:53+03:00 {"log":"27190:C 26 May 2022 02:19:53.227 * RDB: 4734 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-05-26T02:19:53.227874975Z"}
2022-05-26T05:19:42+03:00 {"log":"27190:C 26 May 2022 02:19:42.670 * DB saved on disk\n","stream":"stdout","time":"2022-05-26T02:19:42.670822476Z"}
2022-05-26T04:57:20+03:00 {"log":"1:M 26 May 2022 01:57:20.416 * Clear FAIL state for node d7fe568e07f3e1ec821fc4dbaeafca11293b1e28: replica is reachable again.\n","stream":"stdout","time":"2022-05-26T01:57:20.416786313Z"}
2022-05-26T04:54:42+03:00 {"log":"1:M 26 May 2022 01:54:42.165 * Waiting for end of BGSAVE for SYNC\n","stream":"stdout","time":"2022-05-26T01:54:42.165598435Z"}
2022-05-26T04:54:42+03:00 {"log":"1:M 26 May 2022 01:54:42.151 * Full resync requested by replica 10.42.208.0:6379\n","stream":"stdout","time":"2022-05-26T01:54:42.15128062Z"}
2022-05-26T04:54:42+03:00 {"log":"1:M 26 May 2022 01:54:42.151 * Replica 10.42.208.0:6379 asks for synchronization\n","stream":"stdout","time":"2022-05-26T01:54:42.151258528Z"}
2022-05-26T04:54:33+03:00 {"log":"1:M 26 May 2022 01:54:32.864 * Clear FAIL state for node c304426ea04b1e2cd7723a3858d1fcd3aac260ab: replica is reachable again.\n","stream":"stdout","time":"2022-05-26T01:54:32.864899404Z"}
2022-05-26T04:54:30+03:00 {"log":"1:M 26 May 2022 01:54:29.856 * Clear FAIL state for node 98f18038773816ef885e53c7572b9ad187092eb7: replica is reachable again.\n","stream":"stdout","time":"2022-05-26T01:54:29.856691448Z"}
2022-05-26T04:51:47+03:00 {"log":"1:M 26 May 2022 01:51:46.976 * Clear FAIL state for node 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae: replica is reachable again.\n","stream":"stdout","time":"2022-05-26T01:51:46.97671794Z"}
2022-05-26T03:35:25+03:00 {"log":"1:M 26 May 2022 00:35:25.216 * Background saving started by pid 27190\n","stream":"stdout","time":"2022-05-26T00:35:25.216447029Z"}
2022-05-26T03:35:17+03:00 {"log":"1:M 26 May 2022 00:35:17.056 * Starting BGSAVE for SYNC with target: disk\n","stream":"stdout","time":"2022-05-26T00:35:17.05619435Z"}
2022-05-26T03:35:17+03:00 {"log":"1:M 26 May 2022 00:35:17.056 * Replication backlog created, my new replication IDs are '083209865b81fda2758036dce5ef38fc948e9d0e' and '0000000000000000000000000000000000000000'\n","stream":"stdout","time":"2022-05-26T00:35:17.056191694Z"}
2022-05-26T03:35:17+03:00 {"log":"1:M 26 May 2022 00:35:17.056 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '9b91702dcb47d6cecb4ea56e16a729d425f30bd4', my replication IDs are 'ab9d937c6f21407eb1c6bbd96753bc78f3146009' and '0000000000000000000000000000000000000000')\n","stream":"stdout","time":"2022-05-26T00:35:17.056187186Z"}
2022-05-26T03:35:17+03:00 {"log":"1:M 26 May 2022 00:35:17.056 * Replica 10.42.192.0:6379 asks for synchronization\n","stream":"stdout","time":"2022-05-26T00:35:17.056160816Z"}
2022-05-26T01:54:20+03:00 {"log":"1:M 25 May 2022 22:54:19.948 * Replication backlog freed after 3600 seconds without connected replicas.\n","stream":"stdout","time":"2022-05-25T22:54:19.949039402Z"}
2022-05-26T00:54:18+03:00 {"log":"1:M 25 May 2022 21:54:18.354 # Connection with replica client id #10223 lost.\n","stream":"stdout","time":"2022-05-25T21:54:18.354509044Z"}
2022-05-26T00:52:50+03:00 {"log":"1:M 25 May 2022 21:52:50.006 * FAIL message received from b851af83fe03cd50b4fad8633a22596a864a8bba about d7fe568e07f3e1ec821fc4dbaeafca11293b1e28\n","stream":"stdout","time":"2022-05-25T21:52:50.006416538Z"}
2022-05-26T00:52:32+03:00 {"log":"1:M 25 May 2022 21:52:32.334 * Synchronization with replica 10.42.192.0:6379 succeeded\n","stream":"stdout","time":"2022-05-25T21:52:32.334627132Z"}
2022-05-26T00:49:57+03:00 {"log":"1:M 25 May 2022 21:49:57.792 * FAIL message received from 7b3a434c09dbc3d7fb4c1e5ab044e1f7639168fe about c304426ea04b1e2cd7723a3858d1fcd3aac260ab\n","stream":"stdout","time":"2022-05-25T21:49:57.792148068Z"}
2022-05-26T00:48:51+03:00 {"log":"1:M 25 May 2022 21:48:51.189 * Marking node 98f18038773816ef885e53c7572b9ad187092eb7 as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T21:48:51.189433773Z"}
2022-05-26T00:47:55+03:00 {"log":"1:M 25 May 2022 21:47:54.962 * Marking node 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T21:47:54.96229959Z"}
2022-05-26T00:08:04+03:00 {"log":"1:M 25 May 2022 21:08:04.148 * Background saving terminated with success\n","stream":"stdout","time":"2022-05-25T21:08:04.14819246Z"}
2022-05-26T00:07:52+03:00 {"log":"13621:C 25 May 2022 21:07:52.242 * RDB: 4316 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-05-25T21:07:52.242248934Z"}
2022-05-26T00:07:41+03:00 {"log":"13621:C 25 May 2022 21:07:41.770 * DB saved on disk\n","stream":"stdout","time":"2022-05-25T21:07:41.771117751Z"}
2022-05-25T23:51:49+03:00 {"log":"1:M 25 May 2022 20:51:49.600 * Clear FAIL state for node 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T20:51:49.600945799Z"}
2022-05-25T23:49:28+03:00 {"log":"1:M 25 May 2022 20:49:28.416 * Clear FAIL state for node c304426ea04b1e2cd7723a3858d1fcd3aac260ab: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T20:49:28.416880421Z"}
2022-05-25T23:47:19+03:00 {"log":"1:M 25 May 2022 20:47:19.840 * Clear FAIL state for node d7fe568e07f3e1ec821fc4dbaeafca11293b1e28: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T20:47:19.840781625Z"}
2022-05-25T23:44:28+03:00 {"log":"1:M 25 May 2022 20:44:28.832 * Clear FAIL state for node 98f18038773816ef885e53c7572b9ad187092eb7: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T20:44:28.832628781Z"}
2022-05-25T22:25:31+03:00 {"log":"1:M 25 May 2022 19:25:31.485 * Background saving started by pid 13621\n","stream":"stdout","time":"2022-05-25T19:25:31.485561425Z"}
2022-05-25T22:25:23+03:00 {"log":"1:M 25 May 2022 19:25:23.077 * Starting BGSAVE for SYNC with target: disk\n","stream":"stdout","time":"2022-05-25T19:25:23.077316107Z"}
2022-05-25T22:25:23+03:00 {"log":"1:M 25 May 2022 19:25:23.077 * Replication backlog created, my new replication IDs are '84c4231e2b9ffd58d31585054d512450788af495' and '0000000000000000000000000000000000000000'\n","stream":"stdout","time":"2022-05-25T19:25:23.077313361Z"}
2022-05-25T22:25:23+03:00 {"log":"1:M 25 May 2022 19:25:23.077 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'bcaed0a7846be3970ffe45f509cbc55bd7dfbee0', my replication IDs are 'ec7dfeae2a6b102defd480d75a1388933dc1403f' and '0000000000000000000000000000000000000000')\n","stream":"stdout","time":"2022-05-25T19:25:23.077309214Z"}
2022-05-25T22:25:23+03:00 {"log":"1:M 25 May 2022 19:25:23.077 * Replica 10.42.192.0:6379 asks for synchronization\n","stream":"stdout","time":"2022-05-25T19:25:23.077283325Z"}
2022-05-25T20:44:55+03:00 {"log":"1:M 25 May 2022 17:44:55.076 * Replication backlog freed after 3600 seconds without connected replicas.\n","stream":"stdout","time":"2022-05-25T17:44:55.076967629Z"}
2022-05-25T19:46:27+03:00 {"log":"1:M 25 May 2022 16:46:27.235 * Marking node 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T16:46:27.235826888Z"}
2022-05-25T19:44:54+03:00 {"log":"1:M 25 May 2022 16:44:54.137 # Connection with replica client id #503 lost.\n","stream":"stdout","time":"2022-05-25T16:44:54.137678222Z"}
2022-05-25T19:44:18+03:00 {"log":"1:M 25 May 2022 16:44:18.199 * FAIL message received from 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae about c304426ea04b1e2cd7723a3858d1fcd3aac260ab\n","stream":"stdout","time":"2022-05-25T16:44:18.19928015Z"}
2022-05-25T19:43:27+03:00 {"log":"1:M 25 May 2022 16:43:27.600 * Marking node d7fe568e07f3e1ec821fc4dbaeafca11293b1e28 as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T16:43:27.600332102Z"}
2022-05-25T19:43:08+03:00 {"log":"1:M 25 May 2022 16:43:08.897 * Synchronization with replica 10.42.192.0:6379 succeeded\n","stream":"stdout","time":"2022-05-25T16:43:08.89780068Z"}
2022-05-25T19:42:50+03:00 {"log":"1:M 25 May 2022 16:42:50.496 * FAIL message received from 7b3a434c09dbc3d7fb4c1e5ab044e1f7639168fe about 98f18038773816ef885e53c7572b9ad187092eb7\n","stream":"stdout","time":"2022-05-25T16:42:50.496172964Z"}
2022-05-25T18:59:11+03:00 {"log":"1:M 25 May 2022 15:59:11.753 * Background saving terminated with success\n","stream":"stdout","time":"2022-05-25T15:59:11.753936003Z"}
2022-05-25T18:58:55+03:00 {"log":"159:C 25 May 2022 15:58:55.311 * RDB: 156942 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-05-25T15:58:55.311874624Z"}
2022-05-25T18:58:46+03:00 {"log":"159:C 25 May 2022 15:58:46.463 * DB saved on disk\n","stream":"stdout","time":"2022-05-25T15:58:46.463968249Z"}
2022-05-25T17:19:36+03:00 {"log":"1:M 25 May 2022 14:19:36.096 * Clear FAIL state for node 98f18038773816ef885e53c7572b9ad187092eb7: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T14:19:36.096766446Z"}
2022-05-25T17:19:28+03:00 {"log":"1:M 25 May 2022 14:19:28.608 * Clear FAIL state for node c304426ea04b1e2cd7723a3858d1fcd3aac260ab: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T14:19:28.608716298Z"}
2022-05-25T17:17:47+03:00 {"log":"1:M 25 May 2022 14:17:47.040 * Clear FAIL state for node b851af83fe03cd50b4fad8633a22596a864a8bba: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T14:17:47.040827733Z"}
2022-05-25T17:17:28+03:00 {"log":"1:M 25 May 2022 14:17:28.192 # Connection with replica 10.42.208.0:6379 lost.\n","stream":"stdout","time":"2022-05-25T14:17:28.192746911Z"}
2022-05-25T17:17:24+03:00 {"log":"1:M 25 May 2022 14:17:24.512 * Clear FAIL state for node d7fe568e07f3e1ec821fc4dbaeafca11293b1e28: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T14:17:24.512681668Z"}
2022-05-25T17:17:02+03:00 {"log":"1:M 25 May 2022 14:17:02.496 # Cluster state changed: ok\n","stream":"stdout","time":"2022-05-25T14:17:02.496969879Z"}
2022-05-25T17:17:02+03:00 {"log":"1:M 25 May 2022 14:17:02.496 * Clear FAIL state for node a685db2fc949b1bf57ce2ed498627ef3923aca1d: is reachable again and nobody is serving its slots after some time.\n","stream":"stdout","time":"2022-05-25T14:17:02.496940143Z"}
2022-05-25T17:16:52+03:00 {"log":"1:M 25 May 2022 14:16:52.256 * Clear FAIL state for node 5cb1ae6187c82602cae0d82bd4b74e1aad0fd0ae: replica is reachable again.\n","stream":"stdout","time":"2022-05-25T14:16:52.256778187Z"}
2022-05-25T17:16:16+03:00 {"log":"1:M 25 May 2022 14:16:16.575 * Waiting for end of BGSAVE for SYNC\n","stream":"stdout","time":"2022-05-25T14:16:16.575280371Z"}
2022-05-25T17:16:16+03:00 {"log":"1:M 25 May 2022 14:16:16.575 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '22e27211ecd7828358504a6af95553b8ca303fbd', my replication IDs are '49b09eea88b0e9dd60e7b1db587aa53723382e93' and '0000000000000000000000000000000000000000')\n","stream":"stdout","time":"2022-05-25T14:16:16.575276974Z"}
2022-05-25T17:16:16+03:00 {"log":"1:M 25 May 2022 14:16:16.575 * Replica 10.42.208.0:6379 asks for synchronization\n","stream":"stdout","time":"2022-05-25T14:16:16.575252788Z"}
2022-05-25T17:16:16+03:00 {"log":"1:M 25 May 2022 14:16:16.033 * Background saving started by pid 159\n","stream":"stdout","time":"2022-05-25T14:16:16.033311548Z"}
2022-05-25T17:16:08+03:00 {"log":"1:M 25 May 2022 14:16:08.762 * Starting BGSAVE for SYNC with target: disk\n","stream":"stdout","time":"2022-05-25T14:16:08.76265156Z"}
2022-05-25T17:16:08+03:00 {"log":"1:M 25 May 2022 14:16:08.762 * Replication backlog created, my new replication IDs are '49b09eea88b0e9dd60e7b1db587aa53723382e93' and '0000000000000000000000000000000000000000'\n","stream":"stdout","time":"2022-05-25T14:16:08.762648364Z"}
2022-05-25T17:16:08+03:00 {"log":"1:M 25 May 2022 14:16:08.762 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '22e27211ecd7828358504a6af95553b8ca303fbd', my replication IDs are '52ba412ad2636d9232dcdc93eb9534a64e7673c4' and '0000000000000000000000000000000000000000')\n","stream":"stdout","time":"2022-05-25T14:16:08.76264134Z"}
2022-05-25T17:16:08+03:00 {"log":"1:M 25 May 2022 14:16:08.762 * Replica 10.42.192.0:6379 asks for synchronization\n","stream":"stdout","time":"2022-05-25T14:16:08.762607316Z"}
2022-05-25T17:15:31+03:00 {"log":"1:M 25 May 2022 14:15:31.773 * Marking node c304426ea04b1e2cd7723a3858d1fcd3aac260ab as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T14:15:31.773727258Z"}
2022-05-25T17:15:24+03:00 {"log":"1:M 25 May 2022 14:15:24.754 * Marking node b851af83fe03cd50b4fad8633a22596a864a8bba as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T14:15:24.754423942Z"}
2022-05-25T17:15:22+03:00 {"log":"1:M 25 May 2022 14:15:22.748 * Marking node d7fe568e07f3e1ec821fc4dbaeafca11293b1e28 as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T14:15:22.748425835Z"}
2022-05-25T17:15:21+03:00 {"log":"1:M 25 May 2022 14:15:21.240 * Marking node 98f18038773816ef885e53c7572b9ad187092eb7 as failing (quorum reached).\n","stream":"stdout","time":"2022-05-25T14:15:21.241107592Z"}
INFO:
# Server
redis_version:6.2.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:5b326c4b337eb561
redis_mode:cluster
os:Linux 5.4.17-2102.203.6.el7uek.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:8.3.0
process_id:1
process_supervised:no
run_id:3bb1fa6e0011f73401bbd91ea84535cdb79a7774
tcp_port:6379
server_time_usec:1653552189192525
uptime_in_seconds:78506
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:9385021
executable:/redis-server
config_file:
io_threads_active:0
# Clients
connected_clients:13
cluster_connections:16
maxclients:20000
client_recent_max_input_buffer:32
client_recent_max_output_buffer:6458760
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:811659497528
used_memory_human:755.92G
used_memory_rss:804876890112
used_memory_rss_human:749.60G
used_memory_peak:816899663624
used_memory_peak_human:760.80G
used_memory_peak_perc:99.36%
used_memory_overhead:143196956208
used_memory_startup:2031248
used_memory_dataset:668462541320
used_memory_dataset_perc:82.36%
allocator_allocated:811660038392
allocator_active:812254740480
allocator_resident:826484912128
total_system_memory:1081267273728
total_system_memory_human:1007.01G
used_memory_lua:46080
used_memory_lua_human:45.00K
used_memory_scripts:872
used_memory_scripts_human:872B
number_of_cached_scripts:1
maxmemory:824633720832
maxmemory_human:768.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.00
allocator_frag_bytes:594702088
allocator_rss_ratio:1.02
allocator_rss_bytes:14230171648
rss_overhead_ratio:0.97
rss_overhead_bytes:-21608022016
mem_fragmentation_ratio:0.99
mem_fragmentation_bytes:-6782564872
mem_not_counted_for_evict:6458760
mem_replication_backlog:21474836480
mem_clients_slaves:6479272
mem_clients_normal:266528
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
# Persistence
loading:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1653550394
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:6339
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:4777336832
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
# Stats
total_connections_received:34938
total_commands_processed:75030969
instantaneous_ops_per_sec:2
total_net_input_bytes:14251128244
total_net_output_bytes:1273261620716
instantaneous_input_kbps:0.45
instantaneous_output_kbps:106196.16
rejected_connections:0
sync_full:6
sync_partial_ok:0
sync_partial_err:5
expired_keys:1261227
expired_stale_perc:6.89
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:33389
evicted_keys:0
keyspace_hits:51866448
keyspace_misses:2268302
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:7967454
total_forks:4
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:41439255
dump_payload_sanitizations:0
total_reads_processed:55155507
total_writes_processed:55120428
io_threaded_reads_processed:360235
io_threaded_writes_processed:372475
# Replication
role:master
connected_slaves:1
slave0:ip=10.42.192.0,port=6379,state=send_bulk,offset=0,lag=0
master_failover_state:no-failover
master_replid:bd10232f1f5fa05ce756bdf701b160754bc10220
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:4817144457
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:21474836480
repl_backlog_first_byte_offset:4810699705
repl_backlog_histlen:6444753
# CPU
used_cpu_sys:3042.115564
used_cpu_user:13006.532263
used_cpu_sys_children:1452.103448
used_cpu_user_children:21771.579799
used_cpu_sys_main_thread:3043.464854
used_cpu_user_main_thread:10272.059951
# Modules
# Errorstats
errorstat_CLUSTERDOWN:count=57359
errorstat_ERR:count=6546
errorstat_LOADING:count=156
errorstat_MOVED:count=41373880
errorstat_NOAUTH:count=1314
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=1916400684,expires=803646278,avg_ttl=2022941123
OS: Oracle Linux 7
Comment From: oranagra
What you mention could be either fragmentation (RSS of the main process) or copy-on-write during fork (BGSAVE).
We see that used_memory_rss is similar to used_memory, so it's not fragmentation.
You also said it happens while you load data (via commands, I assume, not an RDB file) during BGSAVE, so it makes perfect sense that it's COW.
However, in your INFO the COW peak of the last bgsave looks low compared to your used_memory:
used_memory: 811,659,497,528
used_memory_rss: 804,876,890,112
rdb_last_cow_size: 4,777,336,832
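The comparison above can be made concrete with a small sketch (mine, not anything from Redis itself) that applies the same reasoning to raw INFO numbers; the 1.5 and 25% cutoffs are illustrative assumptions, not documented thresholds.

```python
# Sketch: classify a memory spike as fragmentation vs. fork copy-on-write
# from three INFO fields. The cutoffs below are illustrative assumptions.

def diagnose_spike(used_memory: int, used_memory_rss: int,
                   rdb_last_cow_size: int) -> str:
    """RSS much larger than used_memory suggests fragmentation; a large
    last-fork COW size points at BGSAVE copy-on-write; otherwise the
    captured snapshot simply doesn't show the spike."""
    if used_memory_rss / used_memory > 1.5:        # assumed cutoff
        return "fragmentation"
    if rdb_last_cow_size > 0.25 * used_memory:     # assumed cutoff
        return "copy-on-write"
    return "inconclusive"

# The numbers quoted in this issue:
print(diagnose_spike(811_659_497_528, 804_876_890_112, 4_777_336_832))
# -> inconclusive: rss/used is ~0.99 and rdb_last_cow_size is ~0.6% of
# used_memory, consistent with the INFO being taken after the spike.
```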
Comment From: oranagra
p.s. either way, this redis instance looks quite big and heavy. You should consider sharding it into smaller pieces, in which case COW can be improved if they don't all BGSAVE at the same time.
Comment From: aryehlev
@oranagra thank you so much for the response. A few follow-ups:
1. If we reshard, should we keep maxmemory at 75% of the node's total memory, or do we need to reduce it to about 50%?
2. Do we need to make sure the BGSAVEs happen at different times even though each shard is on a separate node?
3. The INFO above might not be up to date, because the instance crashed multiple times.
4. Do you know how loading data affects this (the amount loaded is no more than 20 GB)? It happened when we loaded while a BGSAVE was running.
Comment From: aryehlev
Also, we are using Redis 6.2, so I think active defragmentation is disabled.
Comment From: oranagra
I don't know much about k8s deployments, how to deploy multiple shards on one instance, or how much memory to reserve. In theory, if you split it into smaller pieces, the ratios of reserved memory can remain the same, and COW may be reduced since they won't all do BGSAVE at the same time, but I'm not sure how you can control that on your end.
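One way to serialize the snapshots is a small coordinator. This is a sketch under assumptions: `shards` are redis-py-style clients exposing `bgsave()` and `info(section)`; the names and polling interval are placeholders, not part of any Redis API.

```python
# Sketch: trigger BGSAVE on one shard at a time, waiting for each fork's
# child to finish before moving on, so only one COW copy exists at once.
import time

def staggered_bgsave(shards, poll_secs=1.0, wait=time.sleep):
    for shard in shards:
        shard.bgsave()
        # rdb_bgsave_in_progress is the INFO persistence field Redis
        # sets to 1 while an RDB child is running.
        while shard.info("persistence")["rdb_bgsave_in_progress"] == 1:
            wait(poll_secs)
```

`wait` is injectable so the loop can be tested (or rate-limited) without real sleeping.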
If you load 20gb of data (overwriting or evicting a pre-existing 20gb), you'll get about the same COW overhead, since the child process still holds a copy of the modified / deleted data. It could obviously be a little more, or a lot more, depending on how your data is arranged (i.e. it's enough to modify one byte in order to cause COW of a full 4kb page).
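The 4kb-page point can be put in numbers with a back-of-the-envelope sketch (illustrative assumptions only, no Redis API involved):

```python
# COW is charged per OS page (4 KiB assumed here), so many small
# scattered writes can dirty far more memory than the bytes written.

PAGE = 4096  # assumed page size

def cow_bytes(keys_modified: int, bytes_per_key: int, scattered: bool) -> int:
    """Worst case: every modified key sits on its own page, so each write
    dirties a full page. Best case: the writes are contiguous."""
    if scattered:
        return keys_modified * PAGE
    written = keys_modified * bytes_per_key
    return -(-written // PAGE) * PAGE  # round up to whole pages

# Writing 100-byte values to 5 million scattered keys dirties ~20 GB of
# COW pages even though only ~0.5 GB of payload was written:
print(cow_bytes(5_000_000, 100, scattered=True))   # 20_480_000_000
print(cow_bytes(5_000_000, 100, scattered=False))  # 500_002_816
```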
Comment From: aryehlev
Ok, thank you for your answer. We will try resharding into many more instances, and I think we will only do our loading when no BGSAVE is happening. I hope that solves our issues; if not, we will add memory.
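The "load only when no BGSAVE is happening" part can be gated on INFO. A minimal sketch, assuming `get_info` is whatever callable returns the persistence section (e.g. redis-py's `client.info("persistence")`); the function name and polling interval are hypothetical:

```python
# Sketch: block until no RDB child (or AOF rewrite) is running, then
# return the INFO snapshot that cleared the check so loading can start.
import time

def wait_until_no_bgsave(get_info, poll_secs=5.0, wait=time.sleep):
    while True:
        info = get_info()
        if (info.get("rdb_bgsave_in_progress", 0) == 0
                and info.get("aof_rewrite_in_progress", 0) == 0):
            return info
        wait(poll_secs)
```

Note this only narrows the window: a BGSAVE (or a replica full sync, which also forks) can still start mid-load, so staggering saves across shards remains worthwhile.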