@antirez this test still periodically fails, even after all the recent improvements.
```
[err]: PSYNC2: total sum of full synchronizations is exactly 4 in tests/integration/psync2.tcl
Expected 5 == 4 (context: type eval line 18 cmd {assert {$sum == 4}} proc ::test)
```
i tracked down the scenario, but i'm not entirely sure what's the right solution.
the flow is this: the master disconnects the replica, the replica doesn't cache its master, and when the replica attempts to connect to the next master it asks for a full sync.
on the replica it looks like this:
```
readQueryFromClient
    serverLog(LL_VERBOSE, "Client closed connection");
    freeClientAsync(c);
        c->flags |= CLIENT_CLOSE_ASAP;
freeClient
    serverLog(LL_WARNING,"Connection with master lost.");
    not doing replicationCacheMaster since CLIENT_CLOSE_ASAP is set
no cached master, requesting full sync from new master
```
could it be that the fix is just to remove that condition from freeClient? i wasn't able to track down its origin.
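For reference, the condition in question is the check in freeClient() that decides whether to cache the disconnected master. Schematically (a simplified fragment from memory, not necessarily the exact source) it looks something like this:

```c
/* Simplified sketch of the relevant part of freeClient(): when the client
 * being freed is our master, we normally cache it for a future partial
 * resync, unless one of these flags is set. */
if (server.master && c->flags & CLIENT_MASTER) {
    serverLog(LL_WARNING,"Connection with master lost.");
    if (!(c->flags & (CLIENT_CLOSE_AFTER_REPLY|
                      CLIENT_CLOSE_ASAP|
                      CLIENT_BLOCKED)))
    {
        replicationCacheMaster(c); /* keep it so the next PSYNC can be partial */
        return;
    }
    /* Otherwise the master client is destroyed, no cached master remains,
     * and the next connection is forced into a full sync. */
}
```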
**Comment From: oranagra**
regarding the other failure you saw (http://ci.redis.io/run_170246.html)
```
[exception]: Executing test client: assertion:Slaves are not in sync with the master after too long time..
```
i managed to reproduce this by running the test 16 times in parallel with itself, in a loop. the issue is this: when the machine is overloaded, the test checks that the replication offset of all replicas is equal, at 100ms intervals for 50 seconds, and can't find a match. when that happens, i added another delay of 1 second and then printed the offsets again. what i can see is that there's a delta of either 14 or 28 bytes in the offset (one or two pings), and after 1 second some or all replicas have advanced, but still don't match (still a delta of 14 bytes).
the explanation is that due to overload of the machine, either the test code sampling interval is not quick enough, or the replicas themselves are unable to read their replication feed fast enough.
the possible solutions:
1. run this test in solo mode.
2. increase the sampling interval of the test (not sure if that would help).
3. relax the ping interval to 2 seconds instead of 1.
4. have the test look at the meaningful offset rather than the replication offset.

(3) will make this test less effective (reduce the chance of rare issues surfacing and challenging the code). (4) might in some way contradict the test, i.e. we wanna test that partial sync succeeds?
actually, come to think of it, maybe (4) actually increases the effectiveness of the test (verifying that it can psync even if the offset doesn't match)? although we now have another test for that, one that reproduces it consistently and not by chance.
@antirez please let me know how you like to solve both issues, i'll make a fix, test it and push a PR.
**Comment From: oranagra**
@antirez a reminder to take a look at this one. there are two issues:
- a real bug in redis (probably another regression from the change in readQueryFromClient calling freeClientAsync) which results in an excessive full sync. the solution is probably to ignore CLIENT_CLOSE_ASAP when freeClient calls replicationCacheMaster, but i'm not sure why that check is there so i'm afraid to remove it.
- an issue in the test with several simple solutions, not sure which one is the right one (which will not hide real issues if they come)
**Comment From: antirez**
@oranagra about replicationCacheMaster, indeed that check is there because the master was never expected to be closed in this way, only in case of protocol errors. But now this is no longer the case.
**Comment From: antirez**
@oranagra that's the fix: 1a7cd2c0e
**Comment From: antirez**
@oranagra for the second issue, isn't a possible solution to adjust server.repl_ping_slave_period just before we enter such a check? So that pings are normally sent often, but just for the offset comparison we relax the period to, like, 5 seconds.
**Comment From: oranagra**
@antirez what about a test for 1a7cd2c? I guess this is a common scenario which we expect to happen for real users, but it is apparently completely uncovered by the test suite (the psync2 tests only caught it once in a lot of runs).
Regarding the other race, I'm afraid that changing the ping interval will reduce the effectiveness of the test. I thought about waiting until the meaningful offset matches (assuming the other concerns and edge cases of the meaningful offset code are well covered). Bottom line: there are many ways to fix this, but I want to find one that will not hide other future issues.
**Comment From: antirez**
@oranagra something is not adding up here:
```c
int freeClientsInAsyncFreeQueue(void) {
    int freed = listLength(server.clients_to_close);
    while (listLength(server.clients_to_close)) {
        listNode *ln = listFirst(server.clients_to_close);
        client *c = listNodeValue(ln);

        c->flags &= ~CLIENT_CLOSE_ASAP;
        freeClient(c);
        listDelNode(server.clients_to_close,ln);
    }
    return freed;
}
```
The CLOSE_ASAP flag is removed before freeClient() is called, so that check was totally ineffective there, and in turn this can't be the cause. The other option is CLIENT_CLOSE_AFTER_REPLY, but that is rarely used at all, if not on protocol errors and other instances that I don't believe are related to masters.
If you are able to reproduce this issue, could you please add some debugging info to check what flags are set, so that we may have an idea why the master client is not cached? Or actually this should just be part of the code: Redis should always tell us why it is not caching the master.
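For instance (just a sketch of the kind of instrumentation meant here, not code that exists in the tree), the non-caching path could log the client flags:

```c
/* Sketch only: in the freeClient() branch that skips replicationCacheMaster(),
 * log the client flags so the replica log itself explains why the master
 * was not cached. */
serverLog(LL_WARNING,
    "Not caching the disconnected master: client flags = %llu",
    (unsigned long long) c->flags);
```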
**Comment From: oranagra**
@antirez indeed that looks odd, but I'm quite certain that the parts I mentioned in the post are right. To reproduce it I ran that test with --loop --stop for a few hours (each time learning more and adding more prints to investigate). At the end I'm certain that the reason freeClient didn't cache the master is that flag. Maybe there's another way to get to freeClient with that flag on?
**Comment From: antirez**
@oranagra I think I understood what happens (or what happened: I think I fixed it in 1a7cd2c even without fully understanding the problem):
- The client is scheduled for asynchronous freeing in readQueryFromClient.
- The client then gets freed for some reason with a direct freeClient() call, before the code that runs the list of clients to free asynchronously is ever called.
- At this point we were in freeClient() with the CLOSE_ASAP flag set, and the master client was not cached.
However, 1a7cd2c should now solve the problem, since we no longer check for this flag, but check for the new protocol error flag and for the blocked client condition.
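In other words, the caching decision after the fix should look roughly like this (a sketch based on the description above; the exact flag names are an assumption):

```c
/* Sketch of the post-1a7cd2c logic as described above (flag names assumed):
 * CLOSE_ASAP no longer prevents caching the master; only a protocol error
 * or a blocked client does. */
if (server.master && c->flags & CLIENT_MASTER) {
    serverLog(LL_WARNING,"Connection with master lost.");
    if (!(c->flags & (CLIENT_PROTOCOL_ERROR|CLIENT_BLOCKED))) {
        c->flags &= ~(CLIENT_CLOSE_ASAP|CLIENT_CLOSE_AFTER_REPLY);
        replicationCacheMaster(c);
        return;
    }
}
```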
I'm not sure how to simulate the problem, since I don't know exactly what is calling freeClient() against a client that has the CLOSE_ASAP flag set: in the test what we do is CLIENT KILL TYPE master, which just uses synchronous freeing. On the other hand, once it is possible to reproduce the problem, tracking it is as simple as adding an assertion for !(c->flags & CLIENT_CLOSE_ASAP) in freeClient(), in order to get the stack trace of whoever called freeClient() while the client was still in the asynchronous free state.
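Something along these lines at the top of freeClient() would do (just a sketch of the suggested assertion):

```c
void freeClient(client *c) {
    /* Debugging aid described above: abort with a stack trace whenever
     * freeClient() is reached while the client is still scheduled for
     * asynchronous freeing, so we can see who the synchronous caller is. */
    serverAssert(!(c->flags & CLIENT_CLOSE_ASAP));
    /* ... rest of the function unchanged ... */
}
```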
However I'm not sure it is worth the effort: if we no longer see the test failure, we are, as a side effect, testing it, because apparently it happens spontaneously from time to time.
**Comment From: antirez**
@oranagra about the offsets, I'm working on it, since I'm observing that even when stopping the pings it happens that the offsets can't converge.
**Comment From: antirez**
Oh ok, never mind: making the ping time longer was never a solution AFAIK, because the problem is that we put the master down with this before doing our checks:

```tcl
$R($master_id) slaveof 127.0.0.1 0 ;# We use port zero to make it fail.
```

So it will trim the replication backlog if the last thing is a ping, and hence certain replicas will show a different offset, since they received the last ping. I think we should drop this slaveof command: it was only needed because we lacked the meaningful offset thing, and otherwise the test had avoidable full syncs. Here is the comment:
```tcl
# Put down the old master so that it cannot generate more
# replication stream, this way in the next master switch, the time at
# which we move slaves away is not important, each will have full
# history (otherwise PINGs will make certain slaves have more history),
# and sometimes a full resync will be needed.
$R($master_id) slaveof 127.0.0.1 0 ;# We use port zero to make it fail.
```
But if we remove it, we should no longer have problems and, as a side effect, we also test that the meaningful replication offset is working; we were effectively avoiding testing it.
So we could do something like this maybe:
```diff
diff --git a/tests/integration/psync2.tcl b/tests/integration/psync2.tcl
index 5fe29caba..d6b2d19d0 100644
--- a/tests/integration/psync2.tcl
+++ b/tests/integration/psync2.tcl
@@ -12,7 +12,7 @@ start_server {} {
     set no_exit 0 ; # Do not exit at end of the test
-    set duration 20 ; # Total test seconds
+    set duration 40 ; # Total test seconds
     set genload 1 ; # Load master with writes at every cycle
@@ -125,14 +125,16 @@ start_server {} {
        }
    }

-   # wait for all the slaves to be in sync with the master, due to pings, we have to re-sample the master constantly too
+   # wait for all the slaves to be in sync.
+   set masteroff [status $R($master_id) master_repl_offset]
    wait_for_condition 500 100 {
-       [status $R($master_id) master_repl_offset] == [status $R(0) master_repl_offset] &&
-       [status $R($master_id) master_repl_offset] == [status $R(1) master_repl_offset] &&
-       [status $R($master_id) master_repl_offset] == [status $R(2) master_repl_offset] &&
-       [status $R($master_id) master_repl_offset] == [status $R(3) master_repl_offset] &&
-       [status $R($master_id) master_repl_offset] == [status $R(4) master_repl_offset]
+       [status $R(0) master_repl_offset] >= $masteroff &&
+       [status $R(1) master_repl_offset] >= $masteroff &&
+       [status $R(2) master_repl_offset] >= $masteroff &&
+       [status $R(3) master_repl_offset] >= $masteroff &&
+       [status $R(4) master_repl_offset] >= $masteroff
    } else {
        puts "Master ID is $master_id"
        for {set j 0} {$j < 5} {incr j} {
            puts "$j: sync_full: [status $R($j) sync_full]"
            puts "$j: id1      : [status $R($j) master_replid]:[status $R($j) master_repl_offset]"
@@ -140,17 +142,11 @@ start_server {} {
            puts "$j: backlog  : firstbyte=[status $R($j) repl_backlog_first_byte_offset] len=[status $R($j) repl_backlog_histlen]"
            puts "---"
        }
-       fail "Slaves are not in sync with the master after too long time."
+       fail "Replicas offsets didn't catch up with the master after too long time."
    }

-   # Put down the old master so that it cannot generate more
-   # replication stream, this way in the next master switch, the time at
-   # which we move slaves away is not important, each will have full
-   # history (otherwise PINGs will make certain slaves have more history),
-   # and sometimes a full resync will be needed.
-   $R($master_id) slaveof 127.0.0.1 0 ;# We use port zero to make it fail.
-
    if {$debug_msg} {
        puts "Master ID is $master_id"
        for {set j 0} {$j < 5} {incr j} {
            puts "$j: sync_full: [status $R($j) sync_full]"
            puts "$j: id1      : [status $R($j) master_replid]:[status $R($j) master_repl_offset]"
@@ -168,6 +164,28 @@ start_server {} {
        assert {$sum == 4}
    }

+   # In absence of pings, are the instances really able to have
+   # the exact same offset?
+   $R($master_id) config set repl-ping-replica-period 3600
+   wait_for_condition 500 100 {
+       [status $R($master_id) master_repl_offset] == [status $R(0) master_repl_offset] &&
+       [status $R($master_id) master_repl_offset] == [status $R(1) master_repl_offset] &&
+       [status $R($master_id) master_repl_offset] == [status $R(2) master_repl_offset] &&
+       [status $R($master_id) master_repl_offset] == [status $R(3) master_repl_offset] &&
+       [status $R($master_id) master_repl_offset] == [status $R(4) master_repl_offset]
+   } else {
+       puts "Master ID is $master_id"
+       for {set j 0} {$j < 5} {incr j} {
+           puts "$j: sync_full: [status $R($j) sync_full]"
+           puts "$j: id1      : [status $R($j) master_replid]:[status $R($j) master_repl_offset]"
+           puts "$j: id2      : [status $R($j) master_replid2]:[status $R($j) second_repl_offset]"
+           puts "$j: backlog  : firstbyte=[status $R($j) repl_backlog_first_byte_offset] len=[status $R($j) repl_backlog_histlen]"
+           puts "---"
+       }
+       fail "Replicas and master offsets were unable to match exactly."
+   }
+   $R($master_id) config set repl-ping-replica-period 10

    # Limit anyway the maximum number of cycles. This is useful when the
    # test is skipped via --only option of the test suite. In that case
    # we don't want to see many seconds of this test being just skipped.
```
**Comment From: oranagra**
@antirez sounds good.
So instead of waiting for a perfect match of either the offset or the meaningful offset, we sample the master offset once and check that each replica offset is >= that sample. I was afraid of making the test less effective or tightly coupled with the implementation, but indeed this looks like a good fix.
Why did you have to add the wait for perfect match at the end though?
**Comment From: oranagra**
Regarding the excessive full sync, we'll surely know sooner or later if it solves it, but I still feel bad that we have a flow that is not consistently covered by the tests, and that we don't know exactly what happened there. I'll add that assertion you suggested on Sunday, get more details, and try to write a consistent test.
Btw, I find that I'm spending more time writing tests and investigating the problems they expose than doing actual coding. Still, I feel it's worth the effort.
**Comment From: antirez**
@oranagra the reason I'm doing the "correct invariant" check first (replica offsets must be >= the master offset) and then checking for an exact match is basically the concern you raised, and that I fully agree with: the risk of making the test too weak. This test is maybe one of the most important replication tests we have in Redis. So while the correct invariant is replica-offset >= master-offset, if there is a bug where the replica over-counts the bytes, or a similar bug, it would not be detected. So I also wanted the test, in the absence of pings, to be able to prove that all the instances reach the same exact number.
About the effort you are putting into the tests, I think this is very solid work; after all we kinda accumulated technical debt on the testing side, so now it looks overwhelming. However it is really great that we are starting to have again a test suite that we trust, so that when there is a problem, even a random failure, we will stop every time to check what is happening. So I also believe it's worth it.
**Comment From: antirez**
@oranagra oops... I think we got problems: http://ci.redis.io/run_174093.html
```
=== REDIS BUG REPORT START: Cut & paste starting from here ===
1092:S 16 May 2020 11:44:09.731 # Redis 999.999.999 crashed by signal: 11
1092:S 16 May 2020 11:44:09.731 # Crashed running the instruction at: 0x447e18
1092:S 16 May 2020 11:44:09.731 # Accessing address: 0xffffffffffffffff
1092:S 16 May 2020 11:44:09.731 # Failed assertion: (:0)

------ STACK TRACE ------
EIP:
src/redis-server 127.0.0.1:21300(readQueryFromClient+0x48)[0x447e18]

Backtrace:
src/redis-server 127.0.0.1:21300(logStackTrace+0x45)[0x478e95]
src/redis-server 127.0.0.1:21300(sigsegvHandler+0xb9)[0x479659]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f3d9a6fc390]
src/redis-server 127.0.0.1:21300(readQueryFromClient+0x48)[0x447e18]
src/redis-server 127.0.0.1:21300[0x4c5423]
src/redis-server 127.0.0.1:21300(aeProcessEvents+0x2e6)[0x42e6d6]
src/redis-server 127.0.0.1:21300(aeMain+0x1d)[0x42ea5d]
src/redis-server 127.0.0.1:21300(main+0x4c5)[0x42b095]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f3d9a341830]
src/redis-server 127.0.0.1:21300(_start+0x29)[0x42b359]

------ INFO OUTPUT ------
# Server
redis_version:999.999.999 redis_git_sha1:3cd92e87 redis_git_dirty:0 redis_build_id:2fb78e32d14b92f2 redis_mode:standalone os:Linux 4.4.0-1104-aws x86_64 arch_bits:64 multiplexing_api:epoll atomicvar_api:atomic-builtin gcc_version:5.4.0 process_id:1092 run_id:6f90a2320461ad096513ff3f61f3fbbd9b70cbce tcp_port:21300 uptime_in_seconds:17 uptime_in_days:0 hz:10 configured_hz:10 lru_clock:12571145 executable:/home/ubuntu/ci/redis/src/redis-server config_file:/home/ubuntu/ci/redis/./tests/tmp/redis.conf.130736.15

# Clients
connected_clients:2 client_recent_max_input_buffer:2 client_recent_max_output_buffer:0 blocked_clients:0 tracking_clients:0 clients_in_timeout_table:0

# Memory
used_memory:1874408 used_memory_human:1.79M used_memory_rss:7057408 used_memory_rss_human:6.73M used_memory_peak:2039128 used_memory_peak_human:1.94M used_memory_peak_perc:91.92% used_memory_overhead:1868722 used_memory_startup:803056 used_memory_dataset:5686 used_memory_dataset_perc:0.53% allocator_allocated:1965096 allocator_active:2297856 allocator_resident:7438336 total_system_memory:63315619840 total_system_memory_human:58.97G used_memory_lua:37888 used_memory_lua_human:37.00K used_memory_scripts:0 used_memory_scripts_human:0B number_of_cached_scripts:0 maxmemory:0 maxmemory_human:0B maxmemory_policy:noeviction allocator_frag_ratio:1.17 allocator_frag_bytes:332760 allocator_rss_ratio:3.24 allocator_rss_bytes:5140480 rss_overhead_ratio:0.95 rss_overhead_bytes:-380928 mem_fragmentation_ratio:3.77 mem_fragmentation_bytes:5184664 mem_not_counted_for_evict:0 mem_replication_backlog:1048576 mem_clients_slaves:0 mem_clients_normal:16986 mem_aof_buffer:0 mem_allocator:jemalloc-5.1.0 active_defrag_running:0 lazyfree_pending_objects:0

# Persistence
loading:0 rdb_changes_since_last_save:182109 rdb_bgsave_in_progress:0 rdb_last_save_time:1589629435 rdb_last_bgsave_status:ok rdb_last_bgsave_time_sec:0 rdb_current_bgsave_time_sec:-1 rdb_last_cow_size:266240 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok aof_last_cow_size:0 module_fork_in_progress:0 module_fork_last_cow_size:0

# Stats
total_connections_received:7 total_commands_processed:182170 instantaneous_ops_per_sec:1946 total_net_input_bytes:3826299 total_net_output_bytes:2065961 instantaneous_input_kbps:39.91 instantaneous_output_kbps:7.14 rejected_connections:0 sync_full:1 sync_partial_ok:2 sync_partial_err:1 expired_keys:0 expired_stale_perc:0.00 expired_time_cap_reached_count:0 expire_cycle_cpu_milliseconds:0 evicted_keys:0 keyspace_hits:4 keyspace_misses:0 pubsub_channels:0 pubsub_patterns:0 latest_fork_usec:224 migrate_cached_sockets:0 slave_expires_tracked_keys:0 active_defrag_hits:0 active_defrag_misses:0 active_defrag_key_hits:0 active_defrag_key_misses:0 tracking_total_keys:0 tracking_total_items:0 tracking_total_prefixes:0 unexpected_error_replies:0

# Replication
role:slave master_host:127.0.0.1 master_port:21301 master_link_status:up master_last_io_seconds_ago:0 master_sync_in_progress:0 slave_repl_offset:3824545 slave_priority:100 slave_read_only:1 connected_slaves:0 master_replid:8a8af91d8e2590517617b31c1086df61b78ac882 master_replid2:e58f1e96ddc70ce52f0a0bff710fe7d53470e178 master_repl_offset:3824545 master_repl_meaningful_offset:3824545 second_repl_offset:3824546 repl_backlog_active:1 repl_backlog_size:1048576 repl_backlog_first_byte_offset:2775970 repl_backlog_histlen:1048576

# CPU
used_cpu_sys:2.080000 used_cpu_user:0.872000 used_cpu_sys_children:0.000000 used_cpu_user_children:0.000000

# Modules

# Commandstats
cmdstat_incr:calls=182109,usec=365214,usec_per_call=2.01 cmdstat_info:calls=11,usec=555,usec_per_call=50.45 cmdstat_select:calls=3,usec=2,usec_per_call=0.67 cmdstat_slaveof:calls=3,usec=136,usec_per_call=45.33 cmdstat_client:calls=2,usec=87,usec_per_call=43.50 cmdstat_get:calls=4,usec=5,usec_per_call=1.25 cmdstat_ping:calls=20,usec=19,usec_per_call=0.95 cmdstat_psync:calls=5,usec=618,usec_per_call=123.60 cmdstat_replconf:calls=13,usec=18,usec_per_call=1.38

# Cluster
cluster_enabled:0

# Keyspace
db9:keys=1,expires=0,avg_ttl=0

------ CLIENT LIST OUTPUT ------
```
I think it crashed again immediately after trying to provide the client list. I never saw a crash in this test before 1a7cd2c, so I strongly suspect the two things are related.
**Comment From: antirez**
It is new indeed:
```
$ grep 'readQueryFromClient+0x48' -R .
./run_174093.html:src/redis-server 127.0.0.1:21300(readQueryFromClient+0x48)[0x447e18]
./run_174093.html:src/redis-server 127.0.0.1:21300(readQueryFromClient+0x48)[0x447e18]
```
**Comment From: antirez**
@oranagra ok, obvious first problem: I didn't unlink the client from the linked list, so it very likely gets freed later.
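The fix presumably amounts to something along these lines (a sketch of the idea only, not the literal content of 624742d9b):

```c
/* Sketch of the fix idea: if a client scheduled for asynchronous freeing is
 * freed synchronously, unlink it from server.clients_to_close first, so that
 * freeClientsInAsyncFreeQueue() cannot later operate on a dangling pointer. */
void freeClient(client *c) {
    if (c->flags & CLIENT_CLOSE_ASAP) {
        listNode *ln = listSearchKey(server.clients_to_close,c);
        serverAssert(ln != NULL);
        listDelNode(server.clients_to_close,ln);
    }
    /* ... the rest of freeClient() as before ... */
}
```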
**Comment From: antirez**
I think I fixed it in 624742d9b. Now running the test in a loop waiting for a crash.
**Comment From: antirez**
Current status -> `./runtest --single integration/psync2 --loop --clients 100 --stop` :-D
**Comment From: antirez**
I was able to verify that I can trigger the crash easily, but no longer once I apply 624742d. I'm testing it a bit more, then releasing 6.0.3 immediately.
**Comment From: antirez**
@oranagra I collected some first-hand information during this experience that I believe will be useful for the future. Once I was able to reproduce the crash, I looked at different crash reports, and there was everything inside, including stuff totally unrelated to the actual bug, because of the effects of the memory corruption. For instance, one of the runs detected an error in the RDB that was impossible to load because of a bad object type. We see such impossible-to-explain RDB errors from time to time, and we always conjectured that they could be some unrelated memory corruption, but now I was able to observe this in practice.
**Comment From: antirez**
And finally, I know why a master client can be freed first in async and then in sync mode:
1. Async when it disconnects, in aeProcessEvents -> readQueryFromClient().
2. Sync in replicationSetMaster(), once the replica is promoted and freeClient(server.master) gets called.
I'm still not exactly sure how it is possible that this race is so simple to catch: I would expect the function that frees the clients to be called before we are able to execute any other command, but apparently that's not the case.
**Comment From: antirez**
Final info for today:
* The `psync2-test-improvement` branch contains the PSYNC2 test improvements above, which enhance reliability. I think I'll merge it ASAP, now that I saw that the modified test is still able to trigger this crash, hence still able to free the master client asynchronously and then synchronously.
* I'll do more tests on Monday because I want to understand *exactly* why readQueryFromClient() async free is apparently not processed fast enough before replicationSetMaster() has a chance to get called.
* Even after the PSYNC2 condition improvements, the test still fails in other parts, much more rarely, but with 100 clients in a loop it is possible to see in minutes what would otherwise be sporadic failures over weeks or months. Sometimes 5 full syncs are detected instead of 4, for instance. Other times the test about the final PINGs fails.
* Today, while grepping for crashes in the CI archive of all the runs (and they are a lot at this point), I was able to find other crashes that apparently no longer happen and are about bugs already solved, but we probably need to look at that more carefully. And for this reason we probably need a ci.redis.io-like system to keep running in the next months even if we have GitHub Actions: it allows things that are not otherwise possible, like grepping one year of CI runs easily.
Enough for today... but better more info than less :-) Sorry to the dear reader of these notes.
**Comment From: oranagra**
both problems resolved.
thanks for the joint effort.
closing the issue.
**Comment From: oranagra**
@antirez i still see the psync2 test failing sporadically.
i think i saw several types of failures (including unexpected full syncs), but right now i can only reproduce one, so let's concentrate on it:
```
Master ID is 3
0: sync_full: 1
0: id1      : 7bcd20430b98d082aed611a8cedc05f62bc9dda9:2927571
0: id2      : 69e64779d7a27860bcb55390b71934c757543195:1855205
0: backlog  : firstbyte=1878996 len=1048576
1: sync_full: 1
1: id1      : 7bcd20430b98d082aed611a8cedc05f62bc9dda9:2927557
1: id2      : 69e64779d7a27860bcb55390b71934c757543195:1855205
1: backlog  : firstbyte=1878982 len=1048576
2: sync_full: 1
2: id1      : 7bcd20430b98d082aed611a8cedc05f62bc9dda9:2927557
2: id2      : 69e64779d7a27860bcb55390b71934c757543195:1855205
2: backlog  : firstbyte=1878982 len=1048576
3: sync_full: 1
3: id1      : 7bcd20430b98d082aed611a8cedc05f62bc9dda9:2927557
3: id2      : 69e64779d7a27860bcb55390b71934c757543195:1855205
3: backlog  : firstbyte=1878982 len=1048576
4: sync_full: 0
4: id1      : 7bcd20430b98d082aed611a8cedc05f62bc9dda9:2927557
4: id2      : 69e64779d7a27860bcb55390b71934c757543195:1855205
4: backlog  : firstbyte=1878982 len=1048576

[exception]: Executing test client: assertion:Replicas and master offsets were unable to match exactly..
assertion:Replicas and master offsets were unable to match exactly.
    while executing
"error "assertion:$msg""
    (procedure "fail" line 2)
    invoked from within
"fail "Replicas and master offsets were unable to match exactly.""
    ("uplevel" body line 10)
    invoked from within
"uplevel 1 $elsescript"
    (procedure "wait_for_condition" line 12)
    invoked from within
"wait_for_condition 500 100 {
    [status $R($master_id) master_repl_offset] == [status $R(0) master_repl_offset] &&
    [status $R($ma..."
    ("uplevel" body line 165)
    invoked from within
```