"completely offline" also doesn't sound like a problem with a software project. At best it's a particular managed service experiencing downtime. Would Linux be to blame if my power supply goes up in smoke?
It’s still a bit unclear to me exactly what went wrong. I think that when you have a Redis/Valkey cluster with multiple nodes and you use the cluster URI, there must be some kind of load balancer or custom routing in front of it. When we connected to Valkey, the connection looked healthy, but any commands we submitted never executed. We had written our application so that it would keep operating with no issue (just slower) if the cache goes down. In this case, though, the cache didn’t look down: connections succeeded, but no work was actually being done. AWS support suggested we restart the nodes, but because the nodes were not responding they never shut down, or at least it took a really long time. Support was never able to tell us what actually happened. My guess is that Valkey command execution got stuck somehow while the server was still able to accept new connections.
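For anyone who wants to guard against this failure mode, here is a minimal sketch of the idea, not what we actually ran. It bounds each cache command with a deadline and treats a timeout the same as "cache is down". Everything in it is hypothetical: `HungCache` and `HealthyCache` are stand-ins for a real client, and `get_with_fallback`, `load_from_db`, and the 0.1 s timeout are made-up names and values for illustration.

```python
import concurrent.futures
import time

# Shared worker pool: a stuck cache call blocks a worker thread, not the caller.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class HungCache:
    """Hypothetical stand-in: connecting 'works', but commands do not return in time."""

    def get(self, key):
        time.sleep(0.5)  # simulate a command that is stuck on the server
        return None


class HealthyCache:
    """Hypothetical stand-in for a cache that answers immediately."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def load_from_db(key):
    """Hypothetical slow path: the source of truth behind the cache."""
    return f"db:{key}"


def get_with_fallback(cache, key, load_from_source, timeout_s=0.1):
    """Read from the cache, but give up after timeout_s and use the source."""
    future = _pool.submit(cache.get, key)
    try:
        value = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Command did not finish in time: treat it like a cache outage.
        return load_from_source(key)
    except Exception:
        # Connection errors and the like: same degraded path.
        return load_from_source(key)
    # A cache miss also falls through to the source.
    return value if value is not None else load_from_source(key)


healthy = HealthyCache()
healthy.set("user:1", "cached")
print(get_with_fallback(healthy, "user:1", load_from_db))      # prints "cached"
print(get_with_fallback(HungCache(), "user:1", load_from_db))  # prints "db:user:1"
```

The catch is that a call that is truly hung keeps occupying a worker thread, so the pool can fill up during a long incident. In practice a timeout set on the client itself is probably the better tool; redis-py, for example, accepts a `socket_timeout` argument, and I would assume other clients have something similar.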
Could it be unreachable from outside the network that the instance and health check are running on? Maybe it’s available in one AZ, but not in the one that’s trying to connect.