Redis error: LOADING Redis is loading the dataset in memory

How do I "FLUSHALL" in redis in this situation? Running redis via docker on PopOs 21.0.4 as shown in the following docker-compose.yml version: "2.4" services: redis: image:

How do I "FLUSHALL" in redis in this situation?

Running redis via docker on Pop!_OS 21.04, as shown in the following docker-compose.yml

version: "2.4"
services:
  redis:
    image: redis:5-alpine
    command: redis-server --save "" --appendonly yes
    restart: always
    volumes:
      - "${PWD}//redis/data:/data"
    ports:
      - "6379:6379"

Connecting with redis-cli and issuing a FLUSHALL (or FLUSHDB) command, I get the error:

127.0.0.1:6379[1]> FLUSHALL
(error) LOADING Redis is loading the dataset in memory

Here is docker version:

Client: Docker Engine - Community
 Version:           20.10.10
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        b485636
 Built:             Mon Oct 25 07:43:13 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.10
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.9
  Git commit:       e2f740d
  Built:            Mon Oct 25 07:41:20 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

asked Nov 11, 2021 at 1:58 by Nona

The error message means that Redis is still loading data, i.e. in your case, the AOF file. You cannot run FLUSHALL until the loading finishes.

If you don’t need the data to be loaded, you can delete the AOF file before starting Redis.
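For example, with the compose file from the question (the container's /data is bind-mounted from ./redis/data on the host, and the redis:5-alpine image writes a single appendonly.aof there), discarding the dataset could look roughly like this (a sketch; adjust the path if your data directory differs):

docker-compose stop redis
rm ./redis/data/appendonly.aof    # discard the persisted dataset
docker-compose up -d redis        # Redis starts empty, with no AOF to load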

answered Nov 11, 2021 at 2:20 by for_stack

ERROR: LOADING Redis is loading the dataset in memory.

This Redis error is shown when the server is not yet ready to accept requests. It usually goes away once Redis finishes loading data into memory, but sometimes it persists.

As a part of our Server Management Services, we help our customers fix Redis-related errors like this.

Today we’ll take a look at what causes this error to persist, and how to fix it.

What causes “ERROR: LOADING Redis is loading the dataset in memory”?

Redis keeps the whole data set in memory and answers all queries from memory. This often helps to reduce the application load time.

The Redis replication system allows replica Redis instances to be exact copies of master instances. The replica will automatically reconnect to the master every time the link breaks and will attempt to be an exact copy of it regardless of what happens to the master.

As noted earlier, the “LOADING Redis is loading the dataset in memory” error occurs if connection requests arrive before the dataset has been completely loaded into memory and Redis is ready for connections. This generally happens in two different scenarios:

  1. At master startup.
  2. When a slave reconnects and performs a full resynchronization with a master.

Let us now look at the possible fixes for this error.

How to fix the error “LOADING Redis is loading the dataset in memory”?

In most cases, frequent display of the error message can be traced to recent changes made on the site that involve Redis. Such changes may increase the amount of data in Redis considerably, so it gets saturated easily. As a result, Redis replicas may disconnect frequently, and when a replica tries to reconnect, the message “LOADING Redis is loading the dataset in memory” may be displayed.

The quick solution here would be to flush the Redis cache. Let us discuss how to flush the Redis cache:

Flush Redis Cache

To flush the Redis cache, either the FLUSHDB or the FLUSHALL command can be used. FLUSHDB deletes all the keys of the currently selected database, while FLUSHALL deletes all the keys of all the existing databases, not just the selected one.

The syntax for the commands is:

redis-cli FLUSHDB
redis-cli -n DB_NUMBER FLUSHDB
redis-cli -n DB_NUMBER FLUSHDB ASYNC
redis-cli FLUSHALL
redis-cli FLUSHALL ASYNC

For instance, to delete all the keys of database #4 from the Redis cache, the command is:

$ redis-cli -n 4 FLUSHDB

This will help to fix the issue. To prevent it from happening frequently, we need to revert the changes that were made earlier. It is always preferable to keep the data stored in Redis minimal.
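Note that FLUSHDB and FLUSHALL are themselves rejected with -LOADING until the dataset has finished loading, so sometimes the only option is to wait. A quick way to watch the progress, assuming redis-cli can reach the instance, is the INFO persistence section, which Redis answers even while loading:

$ redis-cli INFO persistence | grep -E '^loading'

While loading is in progress, loading is 1 and the loading_loaded_perc and loading_eta_seconds fields show how far along it is; once loading reports 0, the flush commands will be accepted.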

Conclusion

In short, the error “LOADING Redis is loading the dataset in memory” occurs at Redis master startup or when a slave reconnects and performs a full resynchronization with the master. When connection requests arrive before the dataset is completely loaded into memory, Redis returns this error. Today, we discussed how our Support Engineers fix this error.

Comments

@Hronom

This is a reference issue from Spring Jira https://jira.spring.io/browse/DATAREDIS-757

This problem does not allow Redis to integrate gracefully into a microservices environment.

Several servers have the issue that they open a TCP port before they are actually ready. In contrast, Spring Boot opens its port only once all initialization is done, and I think that would be a better option for Redis too: open the port only once all data is loaded.

This would help other applications not crash with the exception:

LOADING Redis is loading the dataset in memory

Please open the TCP port only when Redis is completely ready to serve.

@Hronom

Also there is a discussion on the client lib Lettuce: lettuce-io/lettuce-core#625
This is not the client's responsibility to handle.

@scenbuffalo

Same problem as yours.
If you use Sentinel, you can give it a try:
#4561

@Hronom

I have created a small workaround sh script that helps to wait until Redis has fully started, here is the repo: https://github.com/Hronom/wait-for-redis

This will help until the developers fix that.
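A minimal sketch of the same idea (this is not the actual wait-for-redis script, just an illustration): keep polling with redis-cli ping until it reports PONG, the same readiness check the docker-compose healthcheck later in this thread relies on:

# Block until the server answers PING with PONG
until [ "$(redis-cli -h "${REDIS_HOST:-localhost}" -p "${REDIS_PORT:-6379}" ping 2>/dev/null)" = "PONG" ]; do
  echo "Waiting for Redis to finish loading..."
  sleep 1
done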

@ghik

Another solution to this could be some blocking command which waits until Redis is ready to serve data.

@filipecosta90

@Hronom I believe we can close this issue given this is a by-design feature and not really an issue.
Requesting the @redis/core-team's opinion about it.

Redis enables certain commands while loading the database (ok-loading), like INFO, SELECT, MONITOR, DEBUG, etc.

Clients should be responsible for being able to handle LOADING replies (like they are able to handle MOVED, ASK, etc.).
WDYT?
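For illustration, a client-side retry along those lines might look like the following hypothetical TypeScript sketch (not tied to any particular client library; it simply re-issues the command while the server replies with a LOADING error):

// Hypothetical helper: retry a Redis command while the server is still loading its dataset.
async function retryWhileLoading<T>(run: () => Promise<T>, maxWaitMs = 30_000): Promise<T> {
  const start = Date.now();
  for (;;) {
    try {
      return await run();
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      // Redis prefixes the error reply with "LOADING" while the dataset is being read.
      if (!msg.startsWith("LOADING") || Date.now() - start > maxWaitMs) throw err;
      await new Promise((resolve) => setTimeout(resolve, 500)); // back off and try again
    }
  }
}

// Usage, assuming `client.get` returns a Promise:
// const value = await retryWhileLoading(() => client.get("some-key"));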

@Hronom

@filipecosta90 this is definitely a feature request; the GitHub issues here are used as a place where you not only file issues, but where users can also file feature requests.

So please consider this a feature request.

In the world of Docker this is a very needed feature, and it is 3 years old? 3 years without being able to implement a solution that opens the port after load, or a command that waits?

@oranagra

This is indeed by design and a feature request, let’s see how painful it is and what we can do to improve.

Maybe we can add a config flag that will tell Redis not to listen on the socket at startup until after it has loaded the persistence file.
This could take a very long time and will mean that no one can monitor it or get any progress checks, so many users would not want to use such a feature, but I suppose for some it can make life simpler.
Note however that it won’t solve the -LOADING response returned by a replica when loading the data it gets from the master.

Another approach could be to postpone most commands (excluding INFO, PING, etc.) while loading, instead of responding with a -LOADING error, similar to what CLIENT PAUSE WRITE does today.
This would be a breaking change, and maybe it also needs to be controlled by a config flag.
@redis/core-team WDYT?

@yossigo

@oranagra I think this should be handled by the client, it’s fairly easy to catch this and retry if that’s what the app wishes to do.

@collimarco

We use the Ruby Redis client and this error is not handled properly, i.e. it simply raises the exception:

LOADING Redis is loading the dataset in memory

It is a pain.

@eduardobr

@yossigo It could be that the retry will take minutes to succeed depending on dataset size. Even 10 seconds can be unacceptable in some applications.

As I commented in a similar thread: #4561 (comment)

"I’ve been thinking: why can’t we have a mode where the replica keeps serving stale data while the full sync is happening in the background.
That could cost twice the memory and maybe disk space for a short period, but I think it is definitely worth it and would get rid of the LOADING status in most situations.
In my use case, where there’s no need for sharding (so no Redis Cluster) and where it’s OK to have the master down for about a minute and replicas serving stale data, it would be very useful. Sometimes a Redis standalone master with a few replicas can be a solid setup, if we solve this kind of issue. That of course makes things more Kubernetes-friendly without needing to pay for Redis Operator."

@yossigo

@eduardobr You’re referring to a specific case here:

  • Replica is already up and has data
  • App is willing to get stale data

In that case, using repl-diskless-load swapdb already yields a very similar behavior, although AFAIR (need to check) it still returns -LOADING and refuses commands. But that’s probably a relatively small change.
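For reference, the directive being discussed is an existing replica-side setting in redis.conf (available since Redis 6.0); whether the replica should also keep serving reads while the new dataset loads is the open question here:

# redis.conf (replica side)
# swapdb parses the RDB straight from the socket while keeping the old dataset in RAM as a backup
repl-diskless-load swapdb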

@eduardobr

@yossigo I was coincidentally testing the behavior of repl-diskless-load swapdb today in an attempt to solve the LOADING status.
It returns LOADING for about the same amount of time as a disk load in our setup, but taking a look at the code it seems doable to change it so we keep serving read commands. That would be an amazing improvement and would make our setup solid here.

What does it take to move on with this change, and how could I help?

Thanks

@oranagra

In theory, if someone is already willing to get stale reads and is using repl-diskless-load swapdb, I don’t see any reason not to allow reads from that backup database, at least as long as modules aren’t involved.
However, supporting that may add some extra complications to the code which I’m not sure are worth it.

Also, truth be told, I don’t really like the swapdb feature.
I added it following Salvatore’s request, in order to have safer diskless loading, in case the master dies after the replica has started loading new data that it cannot complete.
But I think in many cases this mode is more dangerous, since the system may not have enough memory to host both databases side by side, and the kernel’s OOM killer may intervene.

@eduardobr

@oranagra @yossigo
I started a draft of a change to test the concept; it seems to be OK and solves the problem:
unstable…eduardobr:feature/draft-use-tempdb-swapdb

I started by swapping the pointers of the whole db array, but later noticed I could use the already tested mechanism of SWAPDB and run it for each db index (it will take care of things like maintaining the selected db reference).

But there are small things I still don’t understand and would need help with if we want to move on with this. For example, why calling "touchAllWatchedKeysInDb" works for the SWAPDB command but crashes some tests if I use it after the sync is finished, and whether I actually need to call it at all (we don’t do it when restoring the backup in the current implementation).

Also, the failing test may not be relevant anymore, but I would need help with that.

@oranagra

Up until now, touchAllWatchedKeysInDb was only needed once, when we begin the loading, since no commands could be executed during loading.
But now:

  1. we no longer want to invalidate watched keys when loading begins (logically, we didn’t yet change anything in the db from the user’s perspective)
  2. we do need to invalidate watched keys when loading was successful (but not when it failed and we recover from the backup).

Regarding the crash and the test, I don’t know what you’re facing; you’ll have to debug it.
If you need assistance, maybe you can submit a PR and ask other expert contributors for help (I’m a bit too busy these days to dive in myself).

@eduardobr

Revisiting this original issue, I think we can solve it with little code and no breaking changes for scenarios with more than one replica:
Simply offer an option to start full syncs with no more than n replicas at the same time. Then you ensure you won’t put all replicas into the LOADING state simultaneously.

repl-diskless-sync-delay 0 doesn’t seem to guarantee a single replica at a time.

We could have something like
repl-max-parallel-sync [n]

@yossigo @oranagra
Makes sense?

@madolson

@eduardobr I’m not sure how that solves the original request. Instead of getting "-LOADING" errors, the clients will get stale data for a bit, then get the loading error later.

@eduardobr

@madolson it solves it in the sense that proxies, container orchestrators (and maybe client libraries) can send the client to a replica that is not "-LOADING".
In the case of Kubernetes, probes can query for this status and quickly take a replica out of service by setting it unready, without killing it.
Because it’s optional, the consumer will know the consequences (having stale data vs no data at all), kinda the same as the compromise of replica-serve-stale-data, but in a different case.

When configured with 50% of the total replicas, in the worst case the master will finish the syncs in 2 rounds.

@madolson

@eduardobr I see, I think that is a minor improvement but I’m not sure it solves the original requester’s issue, which is that they weren’t able to handle the -Loading error. Presumably they could handle the timeout.

@eduardobr

@madolson right, that is not precisely what the author wants, just an alternative to mitigate the problems that the -LOADING error brings. The original solution of keeping the port closed while doing a full sync could confuse other systems, because they won’t be able to know if Redis is even alive.

@fiedl

As a workaround, I’m using a docker-compose healthcheck:

# docker-compose.yml

services:
  app:
    depends_on:
      redis:
        condition: service_healthy
  redis:
    healthcheck:
      test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
      interval: 1s
      timeout: 3s
      retries: 5

@bradjones1

^ This is the way. TIL that the compose spec now allows you to actually wait on other services with healthchecks. dockerize is still great if services open ports only when ready, but this is awesome, and the "Docker way to do it" IMO.

https://docs.docker.com/compose/compose-file/#depends_on

(Also, apparently version is now unnecessary, if you’re wondering "what version" this was released in.)


Handle "LOADING Redis is loading the dataset in memory" #358

When a slave is first connected to a master it needs to load the entire DB, which takes time.
Any command that is sent to that slave during this time will receive a LOADING Redis is loading the dataset in memory response.

I think we should handle this and retry the command (Maybe even to a different node within the same slot).

It’s possible that during a failover to a slave, the old master will sync from the new master and cause this error to be returned, which makes the whole failover mechanism not so failsafe.

ioredis already supports detecting loading in standalone mode: https://github.com/luin/ioredis/blob/master/lib/redis.js#L420-L428. Seems we just need to wait for the "ready" event of the new redis node here: https://github.com/luin/ioredis/blob/master/lib/cluster/connection_pool.js#L58-L63

@luin something like this?

Also, how should we handle an error in the _readyCheck function?

Hmm. I just checked the code, and it seems that when a node has not finished loading data from the disk, the commands sent to it will be added to its offline queue instead of being sent to Redis immediately.

So that means this should already be fixed? I’ve seen this happen in production, so it’s definitely an issue.

Could it be that it happens only to slaves or something? Or when using scaleReads?

It’s also possible that it happens if the slave was once connected, but then got restarted for some reason.

That’s strange. Whether the node is a slave or a master doesn’t affect the support of the offline queue. Are you able to reproduce the issue? Or enable the debug log maybe?

I found this issue when I did the following.

  1. I accidentally ran FLUSHALL in redis-cli; I tried to do Ctrl-D.
  2. Without stopping redis-server, I copied the backed-up RDB to dump.rdb and restarted redis-server. I found that the copy did not actually take effect.
  3. I stopped redis-server, then copied the backed-up RDB to dump.rdb and started redis-server. The copy worked.
  4. Started redis-cli.
  5. Ran the command KEYS * and got the error (error) LOADING Redis is loading the dataset in memory.

@shaharmor So how did you deal with it in the end?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

Hey @luin, I just encountered this issue again, and I think we should see how we can fix it.

Any news on this? I got the same error on ioredis v4.0.10

@Eywek @shaharmor Do you have any more details on how you reproduce this issue?

Is it possible you’re connected to a slave that has begun a resync? E.g. if the master it was pointing to performed a failover? A Redis slave would return -LOADING errors during a resync, which might explain how you encounter them without a connection reset.

What happens if you implement a reconnectOnError that returns 2 when a LOADING error is encountered?

^ I have a hypothesis that an error handler like this might solve this problem and, if so, should perhaps be made a default ioredis behavior. But I haven’t built a repeatable way to reproduce this issue.
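A hedged reconstruction of what such a handler could look like (in ioredis, returning 2 from reconnectOnError tells the client to reconnect and then resend the command that triggered the error):

import Redis from "ioredis";

const redis = new Redis({
  reconnectOnError(err) {
    // Treat -LOADING as transient: reconnect and resend the failed command.
    if (err.message.startsWith("LOADING")) return 2;
    return false; // every other error is surfaced to the caller as usual
  },
});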

We were able to reproduce this issue by setting up an AWS ElastiCache cluster with the following config:

  • 3 shards, 1 replica per shard
  • Engine: Clustered Redis
  • Engine Version Compatibility: 3.2.10
  • Auto-failover: enabled

We filled this cluster with about 700 MB of data.
Then we set up an ioredis application which continuously sent redis.get calls, all with keys that belonged to hash slots of one of our shards.

We deleted the replica node in the chosen shard, no gets failed.

But when we added back a node in this shard, we got multiple LOADING errors.

We used the following config for ioredis:

Using @alavers’s snippet did indeed solve the issue: we see the log message and not a single error.

Note that we were only able to reproduce it if we used the option scaleReads: 'slave'.

We also tried this exact same scenario with a Redis Cluster on our dev PC and we were unable to reproduce it that way.
ioredis kept sending requests to the master while the new replica node was LOADING the Redis dataset in memory.
No idea why the behaviour is different between ElastiCache and a non-ElastiCache Redis Cluster.

We have set up 2 KeyDB servers with Active Replica, as described here: https://docs.keydb.dev/docs/active-rep/

We use HAProxy to redirect the traffic to the correct server. So we have the current situation:

keydb 001 - 10.0.0.7
keydb 002 - 10.0.0.8

We want to update and reboot keydb 01. We have put it in maintenance in HAProxy and all connections are drained. So the server is not used anymore, and all live connections are going to keydb 02.

Now when keydb 01 comes back up again, it asks keydb 02 for a full db sync. After this is done, we see that keydb 02 also asks for a full db sync from keydb 01! This causes keydb 02 to go into a LOADING state, while it is the only live server in HAProxy.

So the result is that, for a short period of time, NO keydb server is live. This results in errors like: LOADING Redis is loading the dataset in memory

The whole idea of this active replica setup is that it is robust, failsafe and creates a high-availability setup. However, in this situation it means that every time a server goes down, it creates a short moment of complete downtime. This is unacceptable for our setup.

Are we doing something wrong? Is this by design? Or do we have to reconfigure something?

We have tested with disk-based and diskless syncs; it makes no difference. Our configuration (from an Ansible playbook, so the formatting can look a bit weird):

        bind: "127.0.0.1 {{ my_private_ips[inventory_hostname] }}"
        requirepass: "{{ keydb_auth }}"
        masterauth: "{{ keydb_auth }}"
        replicaof: "xxx 6379"
        client-output-buffer-limit:
          - normal 0 0 0
          - replica 1024mb 256mb 60
          - pubsub 32mb 8mb 60
        repl-diskless-sync: yes
        port: 6379
        maxmemory: 3000m
        active-replica: yes

We are running Ubuntu 20.04.2 LTS with keydb version 6.0.16.

Here is the KeyDB log file from keydb 01, which goes down for maintenance and comes back up again:

3812319:1439:C 18 Jun 2021 09:08:36.048 * DB saved on disk
3812319:1439:C 18 Jun 2021 09:08:36.054 * RDB: 3 MB of memory used by copy-on-write
958:1439:S 18 Jun 2021 09:08:36.094 * Background saving terminated with success
958:signal-handler (1624000133) Received SIGTERM scheduling shutdown...
958:signal-handler (1624000133) Received SIGTERM scheduling shutdown...
958:1439:S 18 Jun 2021 09:08:53.194 # User requested shutdown...
958:1439:S 18 Jun 2021 09:08:53.194 # systemd supervision requested, but NOTIFY_SOCKET not found
958:1439:S 18 Jun 2021 09:08:53.194 * Saving the final RDB snapshot before exiting.
958:1439:S 18 Jun 2021 09:08:53.194 # systemd supervision requested, but NOTIFY_SOCKET not found
958:1439:S 18 Jun 2021 09:08:54.159 * DB saved on disk
958:1439:S 18 Jun 2021 09:08:54.159 * Removing the pid file.
958:1439:S 18 Jun 2021 09:08:54.159 # KeyDB is now ready to exit, bye bye...
874:874:C 18 Jun 2021 09:09:04.528 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
874:874:C 18 Jun 2021 09:09:04.531 * Notice: "active-replica yes" implies "replica-read-only no"
874:874:C 18 Jun 2021 09:09:04.531 # oO0OoO0OoO0Oo KeyDB is starting oO0OoO0OoO0Oo
874:874:C 18 Jun 2021 09:09:04.531 # KeyDB version=6.0.16, bits=64, commit=00000000, modified=0, pid=874, just started
874:874:C 18 Jun 2021 09:09:04.531 # Configuration loaded
874:874:C 18 Jun 2021 09:09:04.531 # WARNING supervised by systemd - you MUST set appropriate values for TimeoutStartSec and TimeoutStopSec in your service unit.
874:874:C 18 Jun 2021 09:09:04.531 # systemd supervision requested, but NOTIFY_SOCKET not found


                                        KeyDB 6.0.16 (00000000/0) 64 bit

                                        Running in standalone mode
                                        Port: 6379
                                        PID: 957

                     Join the KeyDB community! https://community.keydb.dev/



957:874:S 18 Jun 2021 09:09:04.781 # Server initialized
957:874:S 18 Jun 2021 09:09:04.781 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
957:874:S 18 Jun 2021 09:09:04.781 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with KeyDB. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. KeyDB must be restarted after THP is disabled.
957:874:S 18 Jun 2021 09:09:04.789 * Loading RDB produced by version 6.0.16
957:874:S 18 Jun 2021 09:09:04.789 * RDB age 11 seconds
957:874:S 18 Jun 2021 09:09:04.789 * RDB memory usage when created 100.16 Mb
957:874:S 18 Jun 2021 09:09:05.563 * DB loaded from disk: 0.778 seconds
957:874:S 18 Jun 2021 09:09:05.563 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
957:874:S 18 Jun 2021 09:09:05.563 # systemd supervision requested, but NOTIFY_SOCKET not found
957:1345:S 18 Jun 2021 09:09:05.563   Thread 1 alive.
957:1344:S 18 Jun 2021 09:09:05.564   Thread 0 alive.
957:1344:S 18 Jun 2021 09:09:05.564 * Connecting to MASTER 10.0.0.8:6379
957:1344:S 18 Jun 2021 09:09:05.564 * MASTER <-> REPLICA sync started
957:1344:S 18 Jun 2021 09:09:05.566 * Non blocking connect for SYNC fired the event.
957:1344:S 18 Jun 2021 09:09:05.567 * Master replied to PING, replication can continue...
957:1344:S 18 Jun 2021 09:09:05.571 * Replica 10.0.0.8:6379 asks for synchronization
957:1344:S 18 Jun 2021 09:09:05.571 * Full resync requested by replica 10.0.0.8:6379
957:1344:S 18 Jun 2021 09:09:05.571 * Replication backlog created, my new replication IDs are '60759691a4adca42cced2fd1895a470ab7b9eebb' and '0000000000000000000000000000000000000000'
957:1344:S 18 Jun 2021 09:09:05.571 * Delay next BGSAVE for diskless SYNC
957:1344:S 18 Jun 2021 09:09:05.573 * Partial resynchronization not possible (no cached master)
957:1344:S 18 Jun 2021 09:09:11.521 * Full resync from master: 3eb1532df27d6815eea8a420c7c72b04e03618dc:516117186158
957:1344:S 18 Jun 2021 09:09:11.521 * Discarding previously cached master state.
957:1344:S 18 Jun 2021 09:09:11.530 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
957:1344:S 18 Jun 2021 09:09:11.603 * Starting BGSAVE for SYNC with target: replicas sockets
957:1344:S 18 Jun 2021 09:09:11.606 * Background RDB transfer started by pid 2412
2412:1344:C 18 Jun 2021 09:09:12.488 * RDB: 1 MB of memory used by copy-on-write
957:1344:S 18 Jun 2021 09:09:12.488 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
957:1344:S 18 Jun 2021 09:09:12.507 * Background RDB transfer terminated with success
957:1344:S 18 Jun 2021 09:09:12.507 * Streamed RDB transfer with replica 10.0.0.8:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
957:1344:S 18 Jun 2021 09:09:12.872 * Synchronization with replica 10.0.0.8:6379 succeeded
957:1344:S 18 Jun 2021 09:10:12.799 # I/O error trying to sync with MASTER: connection lost
957:1344:S 18 Jun 2021 09:10:13.429 * Connecting to MASTER 10.0.0.8:6379
957:1344:S 18 Jun 2021 09:10:13.429 * MASTER <-> REPLICA sync started
957:1344:S 18 Jun 2021 09:10:13.429 * Non blocking connect for SYNC fired the event.
957:1344:S 18 Jun 2021 09:10:13.429 * Master replied to PING, replication can continue...
957:1344:S 18 Jun 2021 09:10:13.430 * Partial resynchronization not possible (no cached master)
957:1344:S 18 Jun 2021 09:10:19.943 * Full resync from master: 349f88c33a6dc9d67a3cb4d623727d9f1047033d:516121643991
957:1344:S 18 Jun 2021 09:10:19.952 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
957:1344:S 18 Jun 2021 09:10:20.856 * MASTER <-> REPLICA sync: Loading DB in memory
957:1344:S 18 Jun 2021 09:10:20.856 * Loading RDB produced by version 6.0.16
957:1344:S 18 Jun 2021 09:10:20.856 * RDB age 1 seconds
957:1344:S 18 Jun 2021 09:10:20.856 * RDB memory usage when created 102.08 Mb
957:1344:S 18 Jun 2021 09:10:21.431 * MASTER <-> REPLICA sync: Finished with success
957:1344:S 18 Jun 2021 09:10:21.431 # systemd supervision requested, but NOTIFY_SOCKET not found
957:1344:S 18 Jun 2021 09:10:21.431 # systemd supervision requested, but NOTIFY_SOCKET not found
957:1344:S 18 Jun 2021 09:11:03.197 * 10000 changes in 60 seconds. Saving...
957:1344:S 18 Jun 2021 09:11:03.200 * Background saving started by pid 4845
4845:1344:C 18 Jun 2021 09:11:04.015 * DB saved on disk
4845:1344:C 18 Jun 2021 09:11:04.018 * RDB: 2 MB of memory used by copy-on-write
957:1344:S 18 Jun 2021 09:11:04.104 * Background saving terminated with success
957:1344:S 18 Jun 2021 09:12:05.071 * 10000 changes in 60 seconds. Saving...
957:1344:S 18 Jun 2021 09:12:05.075 * Background saving started by pid 4870
4870:1344:C 18 Jun 2021 09:12:05.963 * DB saved on disk

Here is the keydb 02 log. This server stayed “up”, but was briefly unavailable:

960:1388:S 18 Jun 2021 09:08:29.066 * Background saving started by pid 3814844
3814844:1388:C 18 Jun 2021 09:08:29.896 * DB saved on disk
3814844:1388:C 18 Jun 2021 09:08:29.901 * RDB: 3 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:08:29.974 * Background saving terminated with success
960:1388:S 18 Jun 2021 09:08:54.176 # Connection with master lost.
960:1388:S 18 Jun 2021 09:08:54.176 * Caching the disconnected master state.
960:1388:S 18 Jun 2021 09:08:54.177 # Connection with replica client id #11 lost.
960:1388:S 18 Jun 2021 09:08:54.957 * Connecting to MASTER 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:08:54.957 * MASTER <-> REPLICA sync started
960:1388:S 18 Jun 2021 09:09:02.073 # Error condition on socket for SYNC: Resource temporarily unavailable
960:1388:S 18 Jun 2021 09:09:02.988 * Connecting to MASTER 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:09:02.988 * MASTER <-> REPLICA sync started
960:1388:S 18 Jun 2021 09:09:02.988 # Error condition on socket for SYNC: Operation now in progress
960:1388:S 18 Jun 2021 09:09:03.998 * Connecting to MASTER 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:09:03.998 * MASTER <-> REPLICA sync started
960:1388:S 18 Jun 2021 09:09:03.999 * Non blocking connect for SYNC fired the event.
960:1388:S 18 Jun 2021 09:09:04.068 * Master replied to PING, replication can continue...
960:1388:S 18 Jun 2021 09:09:04.084 * Partial resynchronization not possible (no cached master)
960:1388:S 18 Jun 2021 09:09:04.087 * Replica 10.0.0.7:6379 asks for synchronization
960:1388:S 18 Jun 2021 09:09:04.087 * Full resync requested by replica 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:09:04.087 * Delay next BGSAVE for diskless SYNC
960:1388:S 18 Jun 2021 09:09:10.035 * Starting BGSAVE for SYNC with target: replicas sockets
960:1388:S 18 Jun 2021 09:09:10.042 * Background RDB transfer started by pid 3814859
960:1388:S 18 Jun 2021 09:09:10.117 * Full resync from master: 640f9e48636076058722301cc52ea8a21bc8e450:453896462998
960:1388:S 18 Jun 2021 09:09:10.118 * Discarding previously cached master state.
960:1388:S 18 Jun 2021 09:09:10.122 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
960:1388:S 18 Jun 2021 09:09:10.999 * MASTER <-> REPLICA sync: Loading DB in memory
960:1388:S 18 Jun 2021 09:09:11.000 * Replica is about to load the RDB file received from the master, but there is a pending RDB child running. Killing process 3814859 and removing its temp file to avoid any race
3814859:signal-handler (1624000150) Received SIGUSR1 in child, exiting now.
960:1388:S 18 Jun 2021 09:09:11.000 * Loading RDB produced by version 6.0.16
960:1388:S 18 Jun 2021 09:09:11.000 * RDB age 0 seconds
960:1388:S 18 Jun 2021 09:09:11.000 * RDB memory usage when created 95.02 Mb
960:1388:S 18 Jun 2021 09:09:11.018 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
960:1388:S 18 Jun 2021 09:09:11.019 # Background transfer terminated by signal 10
960:1388:S 18 Jun 2021 09:09:11.019 * Streamed RDB transfer with replica 10.0.0.7:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
960:1388:S 18 Jun 2021 09:09:11.386 * MASTER <-> REPLICA sync: Finished with success
960:1388:S 18 Jun 2021 09:09:11.386 # systemd supervision requested, but NOTIFY_SOCKET not found
960:1388:S 18 Jun 2021 09:09:11.386 # systemd supervision requested, but NOTIFY_SOCKET not found
960:1388:S 18 Jun 2021 09:09:30.096 * 10000 changes in 60 seconds. Saving...
960:1388:S 18 Jun 2021 09:09:30.101 * Background saving started by pid 3814875
3814875:1388:C 18 Jun 2021 09:09:30.970 * DB saved on disk
3814875:1388:C 18 Jun 2021 09:09:30.975 * RDB: 5 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:09:31.022 * Background saving terminated with success
960:1388:S 18 Jun 2021 09:10:12.795 # Disconnecting timedout replica: 10.0.0.7:6379
960:1388:S 18 Jun 2021 09:10:12.795 # Connection with replica 10.0.0.7:6379 lost.
960:1389:S 18 Jun 2021 09:10:13.429 * Replica 10.0.0.7:6379 asks for synchronization
960:1389:S 18 Jun 2021 09:10:13.429 * Full resync requested by replica 10.0.0.7:6379
960:1389:S 18 Jun 2021 09:10:13.429 * Delay next BGSAVE for diskless SYNC
960:1388:S 18 Jun 2021 09:10:19.941 * Starting BGSAVE for SYNC with target: replicas sockets
960:1388:S 18 Jun 2021 09:10:19.948 * Background RDB transfer started by pid 3815442
3815442:1388:C 18 Jun 2021 09:10:20.860 * RDB: 5 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:10:20.860 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
960:1388:S 18 Jun 2021 09:10:20.958 * Background RDB transfer terminated with success
960:1388:S 18 Jun 2021 09:10:20.958 * Streamed RDB transfer with replica 10.0.0.7:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
960:1389:S 18 Jun 2021 09:10:22.034 * Synchronization with replica 10.0.0.7:6379 succeeded
960:1388:S 18 Jun 2021 09:10:32.013 * 10000 changes in 60 seconds. Saving...
960:1388:S 18 Jun 2021 09:10:32.018 * Background saving started by pid 3816278
3816278:1388:C 18 Jun 2021 09:10:32.845 * DB saved on disk
3816278:1388:C 18 Jun 2021 09:10:32.850 * RDB: 4 MB of memory used by copy-on-write
960:1388:S 18 Jun 2021 09:10:32.921 * Background saving terminated with success
960:1388:S 18 Jun 2021 09:11:33.101 * 10000 changes in 60 seconds. Saving...
