Describe the bug

redis-server --version

Redis server v=7.0.2 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64 build=15c2c3973ed24647

free -h

               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       231Mi       7.0Gi       1.0Mi       547Mi       7.3Gi
Swap:          1.8Gi          0B       1.8Gi

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:    22.04
Codename:   jammy

make test

Cleanup: may take some time... OK
Starting test server at port 21079
...
...
...
!!! WARNING The following tests failed:

*** [err]: Active defrag in tests/unit/memefficiency.tcl
Expected 72 <= 30 (context: type eval line 113 cmd {assert {$max_latency <= 30}} proc ::test)
Cleanup: may take some time... OK
make[1]: *** [Makefile:427: test] Error 1
make[1]: Leaving directory '/opt/redis-stable/src'
make: *** [Makefile:6: test] Error 2

To reproduce

make test

Expected behavior

\o/ All tests passed without errors!

Additional information

I have noticed that in a similar issue, #10053, the reporter was asked to execute ./runtest --single unit/memefficiency --verbose; the output is included below.

Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 24144
Testing unit/memefficiency
[ready]: 24143
[ready]: 24145
[ready]: 24146
[ready]: 24147
[ready]: 24148
[ready]: 24149
[ready]: 24150
[ready]: 24151
[ready]: 24152
[ready]: 24153
[ready]: 24155
[ready]: 24156
[ready]: 24157
[ready]: 24154
[ready]: 24158
=== (memefficiency external:skip) Starting server 127.0.0.1:21611 ok
[ok]: Memory efficiency with values in range 32 (154 ms)
[ok]: Memory efficiency with values in range 64 (153 ms)
[ok]: Memory efficiency with values in range 128 (157 ms)
[ok]: Memory efficiency with values in range 1024 (190 ms)
[ok]: Memory efficiency with values in range 16384 (557 ms)
[1/1 done]: unit/memefficiency (1 seconds)
Testing solo test
=== (defrag external:skip) Starting server 127.0.0.1:21612 ok
frag 1.52
frag 1.02
hits: 393676
misses: 2141633
max latency 7
{active-defrag-cycle 1657221160 7 7} {command 1657221157 889 889}
{1657221158 7} {1657221159 7} {1657221160 7}
AOF loading:
frag 1.05
hits: 1360602
misses: 8892575
max latency 176
{while-blocked-cron 1657221183 85 176} {active-defrag-cycle 1657221183 7 7} {command 1657221183 21993 21993}
{1657221171 39} {1657221172 39} {1657221173 143} {1657221174 52} {1657221175 124} {1657221176 53} {1657221177 91} {1657221178 33} {1657221179 176} {1657221181 20} {1657221182 104} {1657221183 85}
[err]: Active defrag in tests/unit/memefficiency.tcl
Expected 176 <= 30 (context: type eval line 113 cmd {assert {$max_latency <= 30}} proc ::test)
used 54570536
rss 54878208
frag 1.01
frag_bytes 307672
used 29315448
rss 54050816
frag 1.84
frag_bytes 24735368
used 29356232
rss 30638080
frag 1.04
frag_bytes 1281848
[ok]: Active defrag eval scripts (3830 ms)
frag 1.86
frag 1.03
hits: 499087
misses: 1509516
max latency 8
{active-defrag-cycle 1657221200 7 8} {command 1657221198 729 729}
{1657221198 8} {1657221199 8} {1657221200 7}
[ok]: Active defrag big keys (13727 ms)
frag 2.00
frag 1.02
misses: 101247
hits: 98756
max latency 8
{active-defrag-cycle 1657221214 8 8} {command 1657221213 335 335}
{1657221214 8}
[ok]: Active defrag big list (13411 ms)
=== (defrag) Starting server 127.0.0.1:21613 ok
frag 1.96
frag 1.05
hits: 626861
misses: 1716405
[ok]: Active defrag edge case (17771 ms)
[1/1 done]: defrag (87 seconds)

                   The End

Execution time of different units:
  1 seconds - unit/memefficiency
  87 seconds - defrag

!!! WARNING The following tests failed:

*** [err]: Active defrag in tests/unit/memefficiency.tcl
Expected 176 <= 30 (context: type eval line 113 cmd {assert {$max_latency <= 30}} proc ::test)
Cleanup: may take some time... OK

I have tried building version 6.2.7 on the same machine; make test succeeds there.

Comment From: oranagra

Looks like a timing / threshold issue (although the gap seems high). Is it consistent? Maybe the machine was busy at that time.

Comment From: judeng

Are you using WSL?

Comment From: 0xD4n

@oranagra it fails every single time I run the test; I'll attach output from another run below.

The output included in my original post was generated in a VirtualBox VM, but I am also able to reproduce it on the host.

I'm running this on a bare-metal machine with an AMD Ryzen 9 5900X 12-core processor (load average: 0.14, 0.29, 0.22).

               total        used        free      shared  buff/cache   available
Mem:           125Gi       2.9Gi       119Gi        60Mi       3.1Gi       121Gi

Output from another run of ./runtest --single unit/memefficiency --verbose:

Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 5256
Testing unit/memefficiency
[ready]: 5258
[ready]: 5257
[ready]: 5259
[ready]: 5260
[ready]: 5261
[ready]: 5262
[ready]: 5263
[ready]: 5264
[ready]: 5265
[ready]: 5266
[ready]: 5267
[ready]: 5269
[ready]: 5268
[ready]: 5271
[ready]: 5270
=== (memefficiency external:skip) Starting server 127.0.0.1:21111 ok
[ok]: Memory efficiency with values in range 32 (147 ms)
[ok]: Memory efficiency with values in range 64 (132 ms)
[ok]: Memory efficiency with values in range 128 (133 ms)
[ok]: Memory efficiency with values in range 1024 (151 ms)
[ok]: Memory efficiency with values in range 16384 (449 ms)
[1/1 done]: unit/memefficiency (1 seconds)
Testing solo test
=== (defrag external:skip) Starting server 127.0.0.1:21112 ok
frag 1.52
frag 1.02
hits: 377763
misses: 2165981
max latency 7
{active-defrag-cycle 1657835794 7 7} {command 1657835791 762 762}
{1657835792 7} {1657835793 7} {1657835794 7}
AOF loading:
frag 1.52
hits: 0
misses: 0
max latency 0
{active-defrag-cycle 1657835794 7 7} {command 1657835798 2745 2745}

[err]: Active defrag in tests/unit/memefficiency.tcl
Expected 0 > 100000 (context: type eval line 106 cmd {assert {$hits > 100000}} proc ::test)
used 54563040
rss 54910976
frag 1.01
frag_bytes 347936
used 29336704
rss 54042624
frag 1.84
frag_bytes 24705920
used 29372000
rss 30605312
frag 1.04
frag_bytes 1233312
[ok]: Active defrag eval scripts (2601 ms)
frag 1.86
frag 1.03
hits: 498676
misses: 1510522
max latency 8
{active-defrag-cycle 1657835812 8 8} {command 1657835811 646 646}
{1657835811 8} {1657835812 8}
[ok]: Active defrag big keys (12693 ms)
frag 1.99
frag 1.01
misses: 101263
hits: 98740
max latency 7
{active-defrag-cycle 1657835825 7 7} {command 1657835825 297 297}
{1657835825 7}
[ok]: Active defrag big list (12314 ms)
=== (defrag) Starting server 127.0.0.1:21113 ok
frag 1.96
frag 1.05
hits: 625759
misses: 1718340
[ok]: Active defrag edge case (16879 ms)
[1/1 done]: defrag (62 seconds)

                   The End

Execution time of different units:
  1 seconds - unit/memefficiency
  62 seconds - defrag

!!! WARNING The following tests failed:

*** [err]: Active defrag in tests/unit/memefficiency.tcl
Expected 0 > 100000 (context: type eval line 106 cmd {assert {$hits > 100000}} proc ::test)
Cleanup: may take some time... OK

Comment From: reybog90

I don't know if it's related, but I am also seeing intermittent Active defrag test failures on 6.2.7, running on a somewhat occupied bare-metal i7-7700K CPU @ 4.20GHz. So far that test fails about 10% of the time. Are these tests expected to fail on busy machines?

!!! WARNING The following tests failed:

*** [err]: Active defrag in tests/unit/memefficiency.tcl
Expected 1.52 < 1.4 (context: type eval line 109 cmd {assert {$frag < 1.4}} proc ::test)
$ ./src/redis-server --version
Redis server v=6.2.7 sha=e6f67092:0 malloc=jemalloc-5.1.0 bits=64 build=d8ac2d77249547e3

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.6 LTS
Release:    18.04
Codename:   bionic

$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0


$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         14G         13G        204M        3,3G         16G
Swap:          2,0G        326M        1,7G

$ uptime
 15:34:19 up 32 days,  3:08,  1 user,  load average: 1,39, 1,83, 3,00

Comment From: oranagra

These tests are a bit sensitive; they're expected to fail on a very busy or slow machine. We should probably improve them, but it's complicated.

Note that the first failure mentioned here is due to a latency threshold, the second one is not seeing enough defragmentation, and the third is about defrag not reaching its target threshold.

Comment From: bsdelf

I have the same issue with both 7.0.5 and unstable. The weird thing is that the unit test fails in tmux but passes in konsole.

CPU: 6-core 12th Gen Intel Core i5-12500 (-MT MCP-) speed/min/max: 3343/800/5900 MHz 
Kernel: 6.0.10_1 x86_64
Mem: 5330.5/64093.2 MiB (8.3%)
% ./src/redis-server --version
Redis server v=7.0.5 sha=1571907e:0 malloc=jemalloc-5.2.1 bits=64 build=bed131ec13dc8ff6

% gcc --version
gcc (GCC) 10.2.1 20201203
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

% lsb_release -a
LSB Version:    1.0
Distributor ID: VoidLinux
Description:    Void Linux
Release:        rolling
Codename:       void
redis git:(7.0.5) % ./runtest --single unit/memefficiency --verbose
Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 16464
Testing unit/memefficiency
[ready]: 16465
[ready]: 16463
[ready]: 16466
[ready]: 16467
[ready]: 16468
[ready]: 16469
[ready]: 16470
[ready]: 16471
[ready]: 16472
[ready]: 16473
[ready]: 16474
[ready]: 16475
[ready]: 16478
[ready]: 16477
[ready]: 16476
=== (memefficiency external:skip) Starting server 127.0.0.1:21611 ok
[ok]: Memory efficiency with values in range 32 (109 ms)
[ok]: Memory efficiency with values in range 64 (109 ms)
[ok]: Memory efficiency with values in range 128 (110 ms)
[ok]: Memory efficiency with values in range 1024 (118 ms)
[ok]: Memory efficiency with values in range 16384 (257 ms)
[1/1 done]: unit/memefficiency (1 seconds)
Testing solo test
=== (defrag external:skip) Starting server 127.0.0.1:21612 ok
frag 1.52
frag 1.04
hits: 372063
misses: 2181656
max latency 7
{command 1670386874 465 465} {active-defrag-cycle 1670386878 7 7}
{1670386875 7} {1670386876 7} {1670386877 7} {1670386878 7}
AOF loading:
frag 1.52
hits: 0
misses: 0
max latency 0
{command 1670386882 2723 2723} {active-defrag-cycle 1670386878 7 7}

[err]: Active defrag in tests/unit/memefficiency.tcl
Expected 0 > 100000 (context: type eval line 106 cmd {assert {$hits > 100000}} proc ::test)
used 54309504
rss 54632448
frag 1.01
frag_bytes 322944
used 29064688
rss 53747712
frag 1.85
frag_bytes 24683024
used 29100352
rss 30318592
frag 1.04
frag_bytes 1218240
[ok]: Active defrag eval scripts (2601 ms)
frag 1.87
frag 1.03
hits: 499297
misses: 1509532
max latency 8
{command 1670386893 386 386} {active-defrag-cycle 1670386896 6 8}
{1670386893 8} {1670386894 8} {1670386895 8} {1670386896 6}
[ok]: Active defrag big keys (11916 ms)
frag 2.00
frag 1.01
misses: 101248
hits: 98755
max latency 8
{command 1670386906 183 183} {active-defrag-cycle 1670386907 8 8}
{1670386907 8}
[ok]: Active defrag big list (11490 ms)
=== (defrag) Starting server 127.0.0.1:21613 ok
frag 1.96
frag 1.05
hits: 627696
misses: 1715533
[ok]: Active defrag edge case (15351 ms)
[1/1 done]: defrag (58 seconds)

                   The End

Execution time of different units:
  1 seconds - unit/memefficiency
  58 seconds - defrag

!!! WARNING The following tests failed:

*** [err]: Active defrag in tests/unit/memefficiency.tcl
Expected 0 > 100000 (context: type eval line 106 cmd {assert {$hits > 100000}} proc ::test)
Cleanup: may take some time... OK

Comment From: oranagra

@bsdelf please try playing with the value in the following line and see if it helps (e.g. try -200):

r config set key-load-delay -50 ;# sleep on average 1/50 usec
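For reference, this kind of experiment can be scripted rather than edited by hand. A minimal sketch (the file path is taken from the failing test's output, and the -50 to -25 substitution is only an example value):

```shell
# Hypothetical helper: patch the key-load-delay used by the defrag test.
# TEST_FILE and the replacement value are assumptions; adjust as needed.
TEST_FILE=${TEST_FILE:-tests/unit/memefficiency.tcl}
if [ -f "$TEST_FILE" ]; then
  sed -i 's/key-load-delay -50/key-load-delay -25/' "$TEST_FILE"
fi
```

After patching, rerun ./runtest --single unit/memefficiency to see whether the failure persists.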

Comment From: bsdelf

Changing it to -200 or even -2000 does not help; I've tried several times. Runs in konsole can also fail, so the terminal emulator is irrelevant. But I do observe that if I add some constant load to the system, instead of leaving it idle before running that test case, the success rate increases.

Comment From: oranagra

I'm sorry, I mixed it up; please try making it lower, e.g. -25 or -5 (effectively making AOF loading slower and giving the defragger a chance to get triggered and run).

Comment From: bsdelf

It works, thanks.

Comment From: oranagra

So I suppose your machine is too fast for that test. I'll consider changing the threshold if we see it again.

Comment From: oranagra

Actually, I just noticed that the person who opened this issue had this failure too (among others), so this is the second time we've seen it. @bsdelf please tell me which value you used that seemed to consistently work.

Comment From: bsdelf

-40 is good.

For anyone who wants to find the optimal parameter automatically:

a.js

const fs = require('fs');
const { spawn } = require('child_process');

async function delay(ms) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}

function tune() {
  const step = 5;
  const testFilePath = 'tests/unit/memefficiency.tcl';
  const pattern =
    /r config set key-load-delay -(?<value>\d+) ;# sleep on average/;
  const origData = fs.readFileSync(testFilePath, 'utf-8');
  const matched = origData.match(pattern);
  if (!matched) {
    throw new Error('pattern not found');
  }
  const line = matched[0];
  const value = Number.parseInt(matched.groups.value, 10);
  if (!Number.isInteger(value)) {
    throw new Error('value not found');
  }
  console.log('original:', { line, value });
  // Lower the magnitude by one step, e.g. -50 -> -45 (slower AOF loading).
  const newValue = value - step;
  const newLine = line.replace(`${value}`, `${newValue}`);
  console.log('patched:', { value: newValue, line: newLine });
  const newData = origData.replaceAll(line, newLine);
  fs.writeFileSync(testFilePath, newData, { encoding: 'utf-8' });
}

async function loop() {
  let stop = true;
  // Spawn detached so the whole test process group can be killed at once.
  const cp = spawn('./runtest', ['--single', 'unit/memefficiency'], {
    detached: true,
  });
  for (const event of ['SIGINT', 'uncaughtException', 'unhandledRejection']) {
    process.on(event, () => {
      process.kill(-cp.pid);
    });
  }
  cp.stdout.on('data', (data) => {
    const text = data.toString();
    console.log(text);
    // A failed assertion is reported as "[err]:"; abort this run and retune.
    if (/\[.*err.*\]:/.test(text)) {
      stop = false;
      process.kill(-cp.pid);
    }
  });
  cp.stderr.on('data', (data) => {
    console.log(data.toString());
    stop = true;
    process.kill(-cp.pid);
  });
  cp.on('close', async (code) => {
    console.log(`child process exited with code: ${code}`);
    // On failure, lower the delay by one step and run the test again.
    if (!stop) {
      tune();
      await delay(6000);
      loop();
    }
  });
}

(async () => {
  try {
    await loop();
  } catch (err) {
    console.log(err);
  }
})();

Start the iteration and wait:

for run in {1..10}; do node a.js; done

Check the result:

git diff

Comment From: oranagra

OK, I'll aim for -25 to keep a safe distance.