Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SELECT after every EXEC / non-standard command #1988

Closed
slorello89 opened this issue Feb 10, 2022 · 25 comments · Fixed by #2023
Closed

SELECT after every EXEC / non-standard command #1988

slorello89 opened this issue Feb 10, 2022 · 25 comments · Fixed by #2023
Assignees

Comments

@slorello89
Copy link
Collaborator

👋 I was debugging an issue today, a customer saw an abnormal number of SELECTS being sent to their Redis instance, which was pretty odd given they weren't passing any arguments into GetDatabase. After a bit of digging, I found this bit of code from the libraries earliest days, which pointed me to the fact that whenever they execute a transaction, a SELECT is called immediately afterward.

I then noticed that above that bit, the UNKNOWN command also falls through to the same condition. This means that if you execute any Redis Command via the Execute/ExecuteAsync API (which is the pattern in NRediSearch, Redis.OM, NReJSON, etc. . .) that the multiplexer will immediately call SELECT afterward.

I suppose my question is in 2 parts

1: What was the origination of this behavior for EXEC? My guess would be that it's because if you call a SELECT in the middle of a transaction that it will change the client's selected database. But I don't think there's a way in which you can call an arbitrary command on a transaction is there? My other guess would be that since you are loaning a copy of the database to the transaction you need to be double sure that you set the selected database is correct afterward, but for some reason that doesn't ring correctly to me.

2: Would it be possible to provide either a way to extend the RedisCommand enum to allow expanded command sets, or alternatively perhaps provide a configuration item to prevent UNKNOWN commands from encountering this?

I'd of course be happy to open any PRs required myself.

Thanks!

@mgravell
Copy link
Collaborator

The motivation here is about correctness, and our confidence/knowledge of what a particular command has done, and in particular: could they have changed database?

I thought the transaction logic only applied in the case of failure, but I'm happy to be corrected. When using Lua, the caller can certainly change the connection database, so we need to treat that defensively. I "presume* the same is true of an ad-hoc command, whether inbuilt redis or a custom module.

Things we could consider here:

  • if we know (from info, or config) that there is only one DB, we could ignore this (like we do for cluster)
  • we could consider a new CommandFlags option which means "trust me, I don't change database"

Thoughts?

Note: the impact of this command is usually negligible, since we don't await it - it just becomes part of the same packet as the command, and the server processes it trivially. If you are seeing mearusable impact, I'm not against looking more, though.

@slorello89
Copy link
Collaborator Author

Hi @mgravell thanks for getting back so quick

I thought the transaction logic only applied in the case of failure

I can confirm it's definitely happening after a successful transaction:

var db = muxer.GetDatabase();
for (var i = 0; i < 3; i++)
{
    var transaction = db.CreateTransaction();
    var t1 = transaction.StringSetAsync("foo", "bar");
    var t2 = transaction.StringGetAsync("foo");
    await Task.WhenAll(transaction.ExecuteAsync(), t1, t2);
    Console.WriteLine(t2.Result);
}

The above trivial example corresponds to the following in Redis:

1644588492.645289 [0 172.17.0.1:36414] "MULTI"
1644588492.645645 [0 172.17.0.1:36414] "SET" "foo" "bar"
1644588492.645673 [0 172.17.0.1:36414] "GET" "foo"
1644588492.645674 [0 172.17.0.1:36414] "EXEC"
1644588492.650322 [0 172.17.0.1:36414] "SELECT" "0"
1644588492.650345 [0 172.17.0.1:36414] "MULTI"
1644588492.650351 [0 172.17.0.1:36414] "SET" "foo" "bar"
1644588492.650353 [0 172.17.0.1:36414] "GET" "foo"
1644588492.650354 [0 172.17.0.1:36414] "EXEC"
1644588492.651078 [0 172.17.0.1:36414] "SELECT" "0"
1644588492.651082 [0 172.17.0.1:36414] "MULTI"
1644588492.651086 [0 172.17.0.1:36414] "SET" "foo" "bar"
1644588492.651087 [0 172.17.0.1:36414] "GET" "foo"
1644588492.651092 [0 172.17.0.1:36414] "EXEC"

You'll notice the SELECTs called after each EXEC (I think it's probably more accurate to say it's called BEFORE each successive MULTI). Again, I'm not certain that there's a valid reason to set the database to unknown after an EXEC, since as far as I can tell there's no API to call SELECT from within a transaction? Please correct me if I'm missing something?

When using Lua, the caller can certainly change the connection database, so we need to treat that defensively

This was definitely true at one point, but as you can see in that switch statement, it appears that there was a change in Redis that prompted you (or one of the library's other contributors - I can't readily tell from the blame) to circle around this for scripts, as apparently SELECT is no longer applied to the calling client within a script here's the pertinent section of the release notes.

To your point, I can definitely see the need to be defensive with arbitrary commands.

I'm not sure about trying to divine the presence of other databases from the info, since this can change on the fly. I suppose if something were added to the configuration of the multiplexer that told it there was only a single database that could work. But then I guess my next question is: Does it toss an exception if you try to select the non-default database?

Having a command flag "hey, pinky-swear I'm not changing the database" seems the most correct way to handle this with minimal disruption.

If that's acceptable to you, I'll open a PR to add it.

Circling back to the EXEC, given what I mentioned above, can I correct that behavior in the same shot?

thanks again!

@mgravell
Copy link
Collaborator

mgravell commented Feb 11, 2022 via email

@mgravell
Copy link
Collaborator

mgravell commented Feb 11, 2022 via email

@slorello89
Copy link
Collaborator Author

Cool, so given all this how does the following sound:

  • Preserve current feature-checking logic for EVAL/EVALSHA
  • Remove DISCARD/EXEC from the switch statement (since as we've discussed it's not possible to call SELECT in a case where DISCARD/EXEC are the cmd's value)
  • Add a flag in Command flags (presumably ordinal value 2048) to tell the PhysicalBridge to not reset the database number in case of an ad-hoc command, call it something like: NonDatabaseChanging or something to that effect.
  • Add a short blurb in the summary docs above Execute/ExecuteAsync explaining the reasoning?

LMK what you think, if this all sounds good I'll open a PR for it.

@mgravell
Copy link
Collaborator

mgravell commented Feb 11, 2022

The discard/exec is a bit more complex than that; we need to know whether any command in the transaction might have changed db - essentially just an accumulator of the same logic, on a command-by-command basis, but: yes

@NickCraver
Copy link
Collaborator

Question: what's stopping someone from doing an .ExecuteAsync(<customCommand>) in the transaction itself? Or are we okay saying "yeah sure, shoot yourself in the foot" there?

Overall question I still have: is the defensiveness an issue in terms of performance or anything else today? If we're getting more complicated or clever, it would be helpful to know what we're improving in the exchange.

@mgravell
Copy link
Collaborator

Question: what's stopping someone from doing an .ExecuteAsync() in the transaction itself?

That would be the "we need to know whether any command in the transaction might have changed db" part

@NickCraver
Copy link
Collaborator

@mgravell k, same page there - I'm wondering if that's worth doing and adding complexity for, back to...what problem are we solving? @slorello89 Could you elaborate on if this is causing an issue today or just looks odd?

@mgravell
Copy link
Collaborator

True. From a performance standpoint I'd be surprised if there was any significant impact here (reasons summarised below), but performance analysis based on data is often surprising. Do we have any such data?

Reasons:

  1. we're not paying latency for these commands - they're pipelined
  2. we're not paying packets for these commands - they're usually slipped into the packet of the next outbound message
  3. at the server, in any event where we didn't genuinely need to send them, they are effectively a no-op at the server

@slorello89
Copy link
Collaborator Author

@mgravell @NickCraver , so, this is actually surprisingly a bit of an issue on Redis Enterprise. The proxy routes all the incoming SELECTS to slot-0 by default (I'm not entirely sure why this is the case, I'm guessing this is because SELECT isn't supported), so the impact on the first shard can be significant if you are sending lots of transactions/ad-hoc commands (as it's getting hit by everything in the cluster). We can work around this by disabling SELECTs at the proxy level, but since these commands aren't doing anything to alter the selected database the SELECT probably shouldn't be sent in the first place.

@NickCraver
Copy link
Collaborator

Is Redis Enterprise cluster in this case? (Sorry I'm not familiar with all the hosting models going on)

If the issue is sending the command at all in a cluster, that's a much simpler change and totally agree we should tweak that (which also has no consequences to reason about). I even have a new helper method for this as a result of the Envoy work :)

@NickCraver
Copy link
Collaborator

Looked at where this happens - ultimately if SELECT isn't supported, then we should not be generating the SELECT commands to issue at all, that happens here: https://github.com/StackExchange/StackExchange.Redis/blob/main/src/StackExchange.Redis/PhysicalConnection.cs#L576 - so the higher level question I have is: what's the setup here, and why do we think it supports SELECT in the first place? I'm not sure what the connection setup is to Redis Enterprise so some more info there would help!

@mgravell
Copy link
Collaborator

mgravell commented Feb 13, 2022

if we can detect this scenario: no problem automatically disabling it; as a temporary workaround, I suspect if the user adds ,$SELECT= to their connection string, it should end up disabled anyway

(Edit: or the client will explode in a shower of sparks and smoke; it'll be interesting to find out!)

@NickCraver
Copy link
Collaborator

+1 - wondering if we can detect it and help out, with more details I can poke there (been in the area) - it looks like we think it's Standalone atm (.HasDatabases is true), so I'm super curious what the combination of hosting is here, is it cluster with a single endpoint like Envoy? I'm curious!

@slorello89
Copy link
Collaborator Author

Aye, Clusters in Redis Enterprise have a proxy that accepts the incoming commands, so it looks sort of like a stand alone to the client. That's probably why the multiplexer thinks it can have multiple databases.

@mgravell
Copy link
Collaborator

I don't have redis enterprise to test against; any chance info gives us anything useful we could detect here?

@slorello89
Copy link
Collaborator Author

@mgravell @NickCraver Sorry for the delay, I needed to do a bit of digging and needed to speak to some people on our end. I've not been able to find a way to differentiate a Redis Enterprise instance with a single proxy (basically the transparent cluster I was describing earlier) from a stand-alone redis instance. The redis_mode in the INFO comes back as standalone.

Also, with the single-proxy the CLUSTER NODES command that the library appears to use as its detection mechanism for a cluster, is disabled, so basically, the library comes in "Hey what nodes are there?" and Redis Enterprise responds "We don't support the cluster API". So the library is blinded to the fact that there's actually a cluster back there, and that the instance doesn't support the SELECT command.

Slight tangent for some extra background (this isn't really pertinent to the issue), that's not actually the end of the story. There are two modes for Cluster in Redis Enterprise. In addition to a single proxy cluster (the similar to stand-alone that I've been referring to up to this point) there's also an OSS Cluster API flavor. In which rather than having a single proxy, all Nodes (machines, not Redis instances) have a proxy, this is much less common, but in this case, the cluster API is supported, and therefore CLUSTER NODES responds, and the library does not send erroneous SELECT commands.

Back on track. To Marc's point about adding ,$SELECT= to the connection string. It looks like this breaks things pretty horribly, it looks like in its attempt to set the database back causes an error when trying to send the select resulting in this error:

StackExchange.Redis.RedisConnectionException: InternalFailure on SET bar
 ---> StackExchange.Redis.RedisCommandException: This operation has been disabled in the command-map and cannot be used: SELECT
   at StackExchange.Redis.PhysicalConnection.WriteHeader(RedisCommand command, Int32 arguments, CommandBytes commandBytes) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 734
   at StackExchange.Redis.Message.SelectMessage.WriteImpl(PhysicalConnection physical) in /_/src/StackExchange.Redis/Message.cs:line 1336
   at StackExchange.Redis.Message.WriteTo(PhysicalConnection physical) in /_/src/StackExchange.Redis/Message.cs:line 768
   at StackExchange.Redis.PhysicalBridge.SelectDatabaseInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1238
   at StackExchange.Redis.PhysicalBridge.WriteMessageToServerInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1270
   --- End of inner exception stack trace ---
   at Program.<Main>$(String[] args) in C:\Users\slore\projects\redis\TestSEREdisSelect\Program.cs:line 10

And any further attempts to call any command will result in a timeout (something is probably deadlocked in the command queue?) Interestingly enough it looks like the third command in the catch fragment is making it through, just its result isn't being read back cleanly.

Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5000ms elapsed, timeout is 5000ms), command=SET, next: SET bar, inst: 0, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: localhost:6379, mc:
 1/1/0, mgr: 10 of 10 available, clientName: DESKTOP-VR6JNGK, IOCP: (Busy=0,Free=1000,Min=24,Max=1000), WORKER: (Busy=1,Free=32766,Min=24,Max=32767), v: 2.2.88.56325 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Re
dis/Timeouts)
****

Code for that example (this is run in a console app so perhaps a context issue), you can replicate with the rejson image docker run -p 6379:6379 redislabs/rejson:

var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379,$SELECT=", Console.Out);
var db = muxer.GetDatabase();
try
{
    await db.ExecuteAsync("JSON.SET", "foo", ".", "{\"bar\":\"baz\"}"); // Goes through, no problem
    await db.StringSetAsync("bar", "baz"); // dies on the Select
}
catch (Exception ex)
{
    Console.WriteLine(ex);
    await db.StringSetAsync("bar", "baz"); // Makes it through to Redis, but fails with a timeout
}

In the PhysicalConnection's GetSelectDatabaseCommand there's no logic to interrogate the CommandMap in the multiplexer for the presence of SELECT, perhaps this would be a decent avenue to look at?

thanks!

@NickCraver
Copy link
Collaborator

I was looking at this but in a long session this week and didn't writeup: indeed there's no check here, but if we add one it needs to be as efficient as possible since it's inside the write lock. Most of the throughput performance that we're looking at is "how long is the lock taken?" and any extension of that is a net drain on throughput. I'll look at making $SELECT= work as part of making the lock slimmer in general - I think we can memoize some of the checks that don't change here to slim it a bit and I want to benchmark there. I'm only recently setup to do so, likely an end-of-week or next item.

@slorello89
Copy link
Collaborator Author

@NickCraver, if a config-level change ends up being the answer. Perhaps making ServerEndpoint.HasDatabases honor something in the configuration would be an appropriate route? Since every clock cycle is precious inside the write lock, you could even replace the computed property.

public bool HasDatabases => serverType == ServerType.Standalone;

With a simple property:

public bool HasDatabases { get; }

Then when the ServerEndpoint is initialized, interrogate the config for either a disabled SELECT command or perhaps a config setting HasDatabases which defaults to true, and set HasDatabases as such:

HasDatabases = config.CommandMap.IsAvailable(RedisCommand.SELECT) && config.HasDatabases;

Then in the ServerType property's setter, (this is what gets called after a successful call to CLUSTER NODES), set HasDatabses

HasDatabase = HasDatabases && value == ServerType.Standalone

Unless there are some other externalities I'm missing?

@NickCraver
Copy link
Collaborator

@slorello89 This week is crazier than most so very slow here, likely won't dig in until next week and recharged but yes we can probably memoize it in some way, though I'm not sure if a) that's the correct location (it's not per-server that it's disabled, for example), and b) how that'll interact with config changes at runtime. Something in play here is we intend to add a method soon to allow config changes at runtime for example to change authentication passwords for next reconnect in rolling environments and such. How that plans with memoized things based on config downstream is a factor.

Is there any chance you could post a full INFO (sans anything sensitive of course) that we could peruse to see if anything else stands out on the "auto-detect it" front?

@slorello89
Copy link
Collaborator Author

Hi @NickCraver, figured I'd wait a few days to get back to you as you seemed a bit swamped. What follows is the info from a RedisEnterprise instance with a similar configuration to the one we are talking about. Unfortunately, there's not much to go off of from the info.

# Server
redis_version:6.0.13
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:0000000000000000000000000000000000000000
redis_mode:standalone
os:Linux 4.15.0-1098-gcp x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:5.4.0 20160609
process_id:3
run_id:0640e63f679f0ec0c2c3bb7d694de9eef3ed92df
tcp_port:12000
server_time_usec:1645108605000000
uptime_in_seconds:91
uptime_in_days:0
hz:10
lru_clock:0
config_file:

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
maxclients:0
cluster_connections:0

# Memory
used_memory:56383632
used_memory_human:53.77M
used_memory_rss:56383632
used_memory_peak:101404864
used_memory_peak_human:96.70M
used_memory_lua:227328
mem_fragmentation_ratio:1
mem_allocator:jemalloc-5.1.0

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1645108513
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:1
total_commands_processed:1
instantaneous_ops_per_sec:26
total_net_input_bytes:235674
total_net_output_bytes:479458
instantaneous_input_kbps:1.18
instantaneous_output_kbps:3.09
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
total_forks:0
total_error_replies:0

# Replication
role:master
connected_slaves:1
slave0:ip=0.0.0.0,port=0,state=online,offset=0,lag=0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.00
used_cpu_user:0.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
used_cpu_sys_main_thread:0.00
used_cpu_user_main_thread:0.00

# Cluster
cluster_enabled:0

# Keyspace

@NickCraver
Copy link
Collaborator

@slorello89 Thanks for that - agree we can't do much here. I have some bigger fish in progress about 2.5 but haven't forgotten about this - we'll make the $SELECT work here.

@NickCraver NickCraver self-assigned this Feb 24, 2022
NickCraver added a commit that referenced this issue Mar 6, 2022
This resolves #1988 by respecting when the `SELECT` command has been disabled. It's memoized which seems  bit extreme here but that's because we're inside the lock and every op matters. Measured locally to ensure no regressions.
NickCraver added a commit that referenced this issue Mar 7, 2022
This resolves #1988 by respecting when the `SELECT` command has been disabled. It's memoized which seems  bit extreme here but that's because we're inside the lock and every op matters. Measured locally to ensure no regressions.
@slorello89
Copy link
Collaborator Author

Fix works, thanks @NickCraver & @mgravell :)

@NickCraver
Copy link
Collaborator

@slorello89 awesome, thanks for the update!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants