Introduce fatlog to record big response packet. #336

CharlesChen888 · 2024-04-19T09:48:32Z

Why fat log?

We already have slowlog, which keeps a record of recent queries that take relatively long time to execute. However sometimes it is not enough for debugging.

For example, we often receive questions from users like: "Why is my QPS not high, but the network flow is substantial?" or "Why did the network flow suddenly increase?" The situation is network flow suddenly increases without significant higher QPS.

The most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users.

Here a new log is introduced to keep a record when a query triggers a large response (and we may call this log "fatlog"), in which the command, response size, time stamp, client name and client network address are logged. This can help to monitor traffic flow spikes and big keys. And it is a quick way to pinpoint the problem.

Alternative solution?

Slowlog: the command with big response packet may not be slow enough to trigger slowlog.
Big key: we can detect big keys, but we are not sure whether they are accessed at a certain time, and accessing a big key does not necessarily generate large responses.
Client traffic statistics: we may need to locate the exact commands.

Detail introduction

Just like slowlog, fatlog is not actually logged in a file, and is accessible through a series of commands:

FATLOG GET [count]
FATLOG LEN
FATLOG RESET

127.0.0.1:6379> config set fatlog-log-bigger-than 10
OK
127.0.0.1:6379> set a 1234567890
OK
127.0.0.1:6379> get a
"1234567890"
127.0.0.1:6379> fatlog get
1) 1) (integer) 3
   2) (integer) 1713516667
   3) (integer) 17
   4) 1) "get"
      2) "a"
   5) "127.0.0.1:62969"
   6) ""

also a help command is provided:

FATLOG HELP

And just like slowlog, two config items are provided to set how many log items are preserved and the response size threshold of being logged.

fatlog-max-len (default value: 128)
fatlog-log-bigger-than (default value: 1MB)

About implementation

You will find that a lot changes have been appllied to slowlog.c. That is because the implementation of fatlog is basically reusing the code of slowlog, since they have similar format and are controled by similar commands. So what used to be slowlog.c now contains code of both slowlog and fatlog.

madolson · 2024-04-21T18:31:14Z

I'm not sure about this specific feature. At least from what I've seen, it's less common for a single command to produce an unexpectedly large result without it also showing up in the slowlog.

CharlesChen888 · 2024-04-22T05:52:10Z

At least from what I've seen, it's less common for a single command to produce an unexpectedly large result without it also showing up in the slowlog.

This does happen. Some of our users are very sensitive to network flow spikes, and they couldn't find any clue in slowlog, so thay asked us about it.

Slowlog only records time of command executions ss. But a hash lookup and a memory copy (simple get command, even with a big response packet of 1MB) could be really quick, (compared to the default slowlog threshold, 10ms).

hwware · 2024-04-24T14:44:10Z

@CharlesChen888 Before I review your PR, could you please provide some background information about why you want to update the slowlogCommand to heavyLoadLogCommand? Because in the top comment, you mention you want to add some new commands for fatlog, but in your pr, you make some changes for slowlog files. I am confused it.

CharlesChen888 · 2024-04-25T00:49:43Z

@hwware This is because fatlog shares a lot code of slowlog. So the function used to be called slowlogCommand now contains code of both slowlog command and fatlog command. I tried to think of a name meaning "both slow and fat", "heavy load" seems to cover both situation.

madolson · 2024-05-06T23:04:43Z

This does happen. Some of our users are very sensitive to network flow spikes, and they couldn't find any clue in slowlog, so they asked us about it.

Can you be more precise than this? People ask us at AWS too about issues, but we've always been able to debug this type of problem without what you proposed. What do you mean they are sensitive to "network flow spikes". We've seen a lot more issues with a sudden burst of smaller commands than this type of "single large" packet issue.

CharlesChen888 · 2024-05-10T06:12:32Z

@madolson

We've seen a lot more issues with a sudden burst of smaller commands than this type of "single large" packet issue.

Yes, this is a common case and we've also seen a lot.

But there is another situation in which network flow suddenly increases without significant higher QPS, the most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users. And fatlog is mainly helpful in this situation, it is a quick way to pinpoint the cause.

codecov · 2024-05-10T06:35:54Z

Codecov Report

Attention: Patch coverage is 97.67442% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 70.23%. Comparing base (168da8b) to head (1aa1d56).

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #336      +/-   ##
============================================
+ Coverage     70.22%   70.23%   +0.01%     
============================================
  Files           109      109              
  Lines         59956    59985      +29     
============================================
+ Hits          42104    42133      +29     
  Misses        17852    17852

Files	Coverage Δ
src/commands.def	`100.00% <ø> (ø)`
src/config.c	`78.31% <ø> (ø)`
src/networking.c	`85.42% <100.00%> (+0.06%)`	⬆️
src/server.c	`88.96% <100.00%> (+0.03%)`	⬆️
src/slowlog.c	`97.89% <96.55%> (+0.42%)`	⬆️

... and 13 files with indirect coverage changes

soloestoy · 2024-05-11T03:44:35Z

I think the scenario in the top comment is very clear: "slow" and "fat" are not same. I'd love to explain it further in my understanding.

We have different types of resources like CPU, memory, network, etc. The slowlog can only record commands eat lots of CPU during the processing phase (doesn't include read/write network time), but can not record commands eat too many memory and network. For example:

run "SET key value(10 megabytes)" command would not be recored in slowlog, since when processing it the SET command only insert the value's pointer into db dict. But that command eats huge memory in query buffer and bandwidth from network. In this case, just 1000 tps can cause 10GB/s network flow.
run "GET key" command and the key's value length is 10 megabytes. The get command can eat huge memory in output buffer and bandwidth to network.

So, the fatlog can help users to locate these special cases.

BTW, the PR only record the fat output buffer, we should also record the fat input buffer (by argv_len_sum).

madolson · 2024-05-13T22:51:27Z

BTW, the PR only record the fat output buffer, we should also record the fat input buffer (by argv_len_sum).

+1, although I would imagine unexpected large outputs are a lot more common than unexpected large inputs. There are many surprising ways to see large output, like running KEYS *.

I guess I'm convinced about the value of this feature, however I don't believe that this should be solving the case where there is uniform data with large keys, such as only being able to hit 1k tps with 1mb items. That is trivial to debug by just doing a scan + memory usage + command stat review. That's an average throughput concern. I don't think we need a command for identifying obvious issues that could be root caused with a bit of effort or a little bit of tooling. If there are a few, commands generating most of the traffic.

Another thought, should we also accumulate this information on cmd_stats? That would maybe more useful for seeing changes in command output trends compared to a fat log?

madolson · 2024-04-19T22:40:17Z

src/Makefile

@@ -383,7 +383,7 @@ endif
 ENGINE_NAME=valkey
 SERVER_NAME=$(ENGINE_NAME)-server$(PROG_SUFFIX)
 ENGINE_SENTINEL_NAME=$(ENGINE_NAME)-sentinel$(PROG_SUFFIX)
-ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o crc16.o endianconv.o slowlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o
+ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o crc16.o endianconv.o heavyloadlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o


Even assuming we accept this, I don't really like the idea of renaming the slowlog. I think the current naming of "slow" is probably sufficient.

madolson · 2024-05-13T22:33:26Z

src/commands/slowlog-get.json

@@ -6,7 +6,7 @@
 "since": "2.2.12",
 "arity": -2,
 "container": "SLOWLOG",
- "function": "slowlogCommand",
+ "function": "heavyLoadLogCommand",


My preference was to avoid large container commands. I would rather have slowlogCommand call some shared function then have this call a container function.

madolson · 2024-05-13T22:34:14Z

src/heavyloadlog.c

+ * Similarly, fatlog remembers the latest N queries that has a response
+ * larger than K bytes.
+ *
+ * The size of the response to reach to be logged in the fat lof is set


Suggested change

* The size of the response to reach to be logged in the fat lof is set

* The size of the response to reach to be logged in the fat log is set

madolson · 2024-05-13T22:37:17Z

src/module.c

@@ -1156,6 +1156,7 @@ int64_t commandFlagsFromString(char *s) {
 else if (!strcasecmp(t,"allow-stale")) flags |= CMD_STALE;
 else if (!strcasecmp(t,"no-monitor")) flags |= CMD_SKIP_MONITOR;
 else if (!strcasecmp(t,"no-slowlog")) flags |= CMD_SKIP_SLOWLOG;
+ else if (!strcasecmp(t,"no-fatlog")) flags |= CMD_SKIP_FATLOG;


What would the use case be for logging to the slowlog but not the fatlog (or vice versa)? Today it's only set in exec, since we want the underlying container command to show up instead of the exec, which I imagine is also true here. I almost think we have one flag, no-slowlog that implies skipping both.

madolson · 2024-05-13T22:43:56Z

src/config.c

@@ -3230,6 +3231,7 @@ standardConfig static_configs[] = {
 createLongLongConfig("proto-max-bulk-len", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, 1024*1024, LONG_MAX, server.proto_max_bulk_len, 512ll*1024*1024, MEMORY_CONFIG, NULL, NULL), /* Bulk request max size */
 createLongLongConfig("stream-node-max-entries", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.stream_node_max_entries, 100, INTEGER_CONFIG, NULL, NULL),
 createLongLongConfig("repl-backlog-size", NULL, MODIFIABLE_CONFIG, 1, LLONG_MAX, server.repl_backlog_size, 1024*1024, MEMORY_CONFIG, NULL, updateReplBacklogSize), /* Default: 1mb */
+ createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL),


Suggested change

createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL),

createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, MEMORY_CONFIG, NULL, NULL),

Any reason we wouldn't let this be a memory config, so you could set it to 1mb or something?

16kb also seems way too small, some historical data was have shows that isn't even in the P99 of data size. I would say at least 1mb.

madolson · 2024-05-13T22:57:25Z

src/commands/fatlog-get.json

@@ -0,0 +1,69 @@
+{


Although I understand the rational behind a new command, a fat command is logically different from a slow command, I would also like us to more seriously consider just adding one new field to the slowlog, the output/input bytes, and then having both get logged to the same place. It's not a perfect fit, but then users can just look in "one" place for bad behavior. It also allows the two logs to perhaps explain each other, since a large output can cause a slow command. We could extend SLOWLOG GET [count] [TYPE FAT|SLOW] to only show entries of the specific type.

I'm thinking about it more from a simpler way to explain it to end users.

This does make things easier than adding a new series of commands, if we can ignore that changes in format of command reply may cause backwards compatibility problem.
But a big packet in query or response does not necessarily mean it will be slow. I'm not sure if we should do this.

madolson · 2024-05-27T14:51:24Z

Discussed this in a separate meeting. Everyone is generally aligned about the idea, but we need to refine the name and details.

CharlesChen888 · 2024-05-28T08:02:19Z

but we need to refine the name and details.

I will try to make some time for it in this week.

Signed-off-by: Chen Tianjie <[email protected]>

hwware · 2024-05-30T18:02:30Z

I propose one command name: BigValueLog.

From above comments, this command usually is used to record the big value for one key. It is different from the slowlog, which goal is to record the entry exceeded a specified execution time. Sometimes due to cpu or disk higher usage, one entry could be recorded in slowlog, but not in fatlog (BigValueLog).

Thus, I do not think adding one more argument in slowlog as SLOWLOG GET [count] [TYPE FAT|SLOW] is a good idea. I still prefer to add a new command in Valkey.

Or if we want to extend current slowlog function, I prefer the command format as: SLOWLOG GET [count] [FAT|SLOW|ALL]
Then it will cover the case where a key response with slow and fat condition.

CharlesChen888 force-pushed the add-fat-log branch from 4fd206a to c3727a7 Compare April 19, 2024 09:54

madolson added the major-decision-pending Major decision pending by TSC team label Apr 21, 2024

CharlesChen888 force-pushed the add-fat-log branch from 428f06b to f894a52 Compare May 10, 2024 09:48

madolson reviewed May 13, 2024

View reviewed changes

CharlesChen888 added 3 commits May 29, 2024 17:55

Implement fat log.

9216c7e

Signed-off-by: Chen Tianjie <[email protected]>

Rename file and correct typo.

f3483ea

Signed-off-by: Chen Tianjie <[email protected]>

Rename. Fix details.

9e7f529

Signed-off-by: Chen Tianjie <[email protected]>

CharlesChen888 force-pushed the add-fat-log branch from 2a80b69 to 9e7f529 Compare May 29, 2024 10:01

CharlesChen888 added 2 commits May 29, 2024 18:42

Run clang format.

53a235f

Signed-off-by: Chen Tianjie <[email protected]>

Fatlog count input big packet as well.

1aa1d56

Signed-off-by: Chen Tianjie <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce fatlog to record big response packet. #336

Introduce fatlog to record big response packet. #336

CharlesChen888 commented Apr 19, 2024 •

edited

madolson commented Apr 21, 2024

CharlesChen888 commented Apr 22, 2024 •

edited

hwware commented Apr 24, 2024 •

edited

CharlesChen888 commented Apr 25, 2024

madolson commented May 6, 2024

CharlesChen888 commented May 10, 2024

codecov bot commented May 10, 2024 •

edited

soloestoy commented May 11, 2024 •

edited

madolson commented May 13, 2024

madolson Apr 19, 2024

madolson May 13, 2024

madolson May 13, 2024

madolson May 13, 2024

madolson May 13, 2024

madolson May 13, 2024

CharlesChen888 May 29, 2024 •

edited

madolson commented May 27, 2024 •

edited

CharlesChen888 commented May 28, 2024

hwware commented May 30, 2024 •

edited

	* The size of the response to reach to be logged in the fat lof is set
	* The size of the response to reach to be logged in the fat log is set

	createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL),
	createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, MEMORY_CONFIG, NULL, NULL),

Introduce fatlog to record big response packet. #336

Are you sure you want to change the base?

Introduce fatlog to record big response packet. #336

Conversation

CharlesChen888 commented Apr 19, 2024 • edited

Why fat log?

Alternative solution?

Detail introduction

About implementation

madolson commented Apr 21, 2024

CharlesChen888 commented Apr 22, 2024 • edited

hwware commented Apr 24, 2024 • edited

CharlesChen888 commented Apr 25, 2024

madolson commented May 6, 2024

CharlesChen888 commented May 10, 2024

codecov bot commented May 10, 2024 • edited

Codecov Report

soloestoy commented May 11, 2024 • edited

madolson commented May 13, 2024

madolson Apr 19, 2024

Choose a reason for hiding this comment

madolson May 13, 2024

Choose a reason for hiding this comment

madolson May 13, 2024

Choose a reason for hiding this comment

madolson May 13, 2024

Choose a reason for hiding this comment

madolson May 13, 2024

Choose a reason for hiding this comment

madolson May 13, 2024

Choose a reason for hiding this comment

CharlesChen888 May 29, 2024 • edited

Choose a reason for hiding this comment

madolson commented May 27, 2024 • edited

CharlesChen888 commented May 28, 2024

hwware commented May 30, 2024 • edited

CharlesChen888 commented Apr 19, 2024 •

edited

CharlesChen888 commented Apr 22, 2024 •

edited

hwware commented Apr 24, 2024 •

edited

codecov bot commented May 10, 2024 •

edited

soloestoy commented May 11, 2024 •

edited

CharlesChen888 May 29, 2024 •

edited

madolson commented May 27, 2024 •

edited

hwware commented May 30, 2024 •

edited