Bug: >> Order of magnitude difference between RAW KV benchmarks and SurrealDB TiKV driver read operations #3413
Describe the bug

Hello, I have benchmarked the following TiKV deployment configuration:

```yaml
global:
  user: "mina"
  ssh_port: 22
  deploy_dir: "/data1/tidb/db"
  data_dir: "/data1/tidb/db-data"
  # Supported values: "amd64", "arm64" (default: "amd64")
  arch: "amd64"
  resource_control:
    memory_limit: "24G"
    cpu_quota: "3200%"
    io_read_bandwidth_max: "/dev/nvme1n1 /dev/nvme0n1p6 /dev/nvme0n1p7 /dev/nvme0n1 700%"
    io_write_bandwidth_max: "/dev/nvme1n1 /dev/nvme0n1p6 /dev/nvme0n1p7 /dev/nvme0n1 700%"

server_configs:
  pd:
    replication.location-labels:
      - host

pd_servers:
  - host: 127.0.0.1
    client_port: 1279
    peer_port: 2380

# Note that 1 server should simulate not being on the same drive for testing
tikv_servers:
  - host: 127.0.0.1
    port: 20160
    status_port: 10080
    config:
      server.labels:
        host: host1
    deploy_dir: "/data1/tidb-deploy/tikv-20160"
    data_dir: "/data1/tidb-data/tikv-20160"
```
Note that the single-PD, single-TiKV-node configuration was purposefully chosen to ensure that the issue wasn't related to the integration between the SurrealDB driver and TiKV's distribution mechanisms across multiple PDs and storage servers. The configuration is deployed successfully via the following command, where cluster.yaml is the configuration listed above, and the cluster starts successfully:

```shell
tiup cluster deploy development v7.5.0 cluster.yaml -p "password"
```

Both load and read testing are conducted per the official TiKV recommendations in their docs: https://tikv.org/docs/7.1/deploy/performance/instructions/

Using go-ycsb, the following is run for benchmarking:

```shell
# This loads the workload
./go-ycsb load tikv -P workloads/workloada -p tikv.pd="127.0.0.1:1279" -p tikv.type="raw" -p recordcount=10000000 -p operationcount=30000000 -p threadcount=32

# This runs the workload
./go-ycsb run tikv -P workloads/workloada -p tikv.pd="127.0.0.1:1279" -p tikv.type="raw" -p recordcount=10000000 -p operationcount=30000000 -p threadcount=32
```

The resulting average output from the cluster is approximately as follows:

```
INSERT - Takes(s): 249.9, Count: 3742650, OPS: 14974.1, Avg(us): 8526, Min(us): 480, Max(us): 810495, 50th(us): 7403, 90th(us): 11903, 95th(us): 16095, 99th(us): 29615, 99.9th(us): 83263, 99.99th(us): 499967
TOTAL  - Takes(s): 249.9, Count: 3742650, OPS: 14974.1, Avg(us): 8526, Min(us): 480, Max(us): 810495, 50th(us): 7403, 90th(us): 11903, 95th(us): 16095, 99th(us): 29615, 99.9th(us): 83263, 99.99th(us): 499967
UPDATE - Takes(s): 189.9, Count: 1349791, OPS: 7107.1, Avg(us): 17813, Min(us): 4444, Max(us): 999423, 50th(us): 11287, 90th(us): 40351, 95th(us): 51647, 99th(us): 87359, 99.9th(us): 162047, 99.99th(us): 587263
READ   - Takes(s): 200.0, Count: 1451358, OPS: 7258.3, Avg(us): 192, Min(us): 44, Max(us): 75455, 50th(us): 162, 90th(us): 275, 95th(us): 334, 99th(us): 806, 99.9th(us): 2041, 99.99th(us): 3115
```

Insertions occur at roughly 15K operations per second on average; reads and updates occur at approximately 7K operations per second. Note that P99.99 latency is also sub-1s.

The question is then: given the aforementioned results, is it appropriate to expect the following results when benchmarking the SurrealDB database (launched with the TiKV driver) via Apache JMeter? The benchmark was targeted at an axum endpoint returning the following:

```rust
pub async fn get_customers() -> Json<Vec<Customer>> {
    Json(DB.select("customer").await.unwrap())
}
```

The customer table had 0 entries, and the setup is managed via the following:

```rust
pub static DB: LazyLock<Surreal<surrealdb::engine::any::Any>> = LazyLock::new(Surreal::init);
```
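As a side note, per-call overhead of an endpoint like the one above can also be sanity-checked without JMeter. The following is a minimal timing sketch (hypothetical helper names, not part of the original project) that times a closure repeatedly and extracts a latency percentile:

```rust
use std::time::{Duration, Instant};

// Hypothetical micro-harness: time a closure `n` times so per-call latency
// percentiles can be inspected independently of an external load generator.
fn bench<F: FnMut()>(mut f: F, n: usize) -> Vec<Duration> {
    (0..n)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed() // wall-clock duration of one call
        })
        .collect()
}

fn percentile(samples: &mut [Duration], p: f64) -> Duration {
    samples.sort(); // ascending, so index i is the i-th smallest sample
    let idx = ((samples.len() as f64 - 1.0) * p / 100.0).round() as usize;
    samples[idx]
}

fn main() {
    // Example: time a trivial computation 1000 times and print its p99.
    let mut samples = bench(|| { std::hint::black_box(1 + 1); }, 1000);
    println!("p99: {:?}", percentile(&mut samples, 99.0));
}
```

Wrapping the actual `DB.select("customer")` call in such a loop would isolate driver latency from any HTTP-layer effects.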
```rust
#[instrument]
async fn deploy_tikv() -> Result<(), Box<dyn Error>> {
    Command::new("tiup")
        .arg("cluster")
        .arg("deploy")
        .arg(ENV.clone())
        .arg("v7.5.0")
        .arg(std::fs::canonicalize("./src/database/cluster.yaml")?)
        .arg("-p")
        .arg(PASS.clone())
        .spawn()?
        .wait_with_output()
        .await?;
    Ok(())
}
```
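One caveat with the deploy function above: `wait_with_output` returns even when the child exits non-zero, so a failed `tiup` invocation is silently ignored. A minimal synchronous sketch (using `std::process::Command`; the helper name is hypothetical) that surfaces non-zero exit codes:

```rust
use std::process::Command;

// Hypothetical helper: run a command and convert a non-zero exit code into
// an Err, instead of discarding the exit status entirely.
fn run_checked(program: &str, args: &[&str]) -> Result<(), String> {
    let status = Command::new(program)
        .args(args)
        .status() // spawns the child and waits for it to exit
        .map_err(|e| e.to_string())?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("{program} exited with {status}"))
    }
}

fn main() {
    // Example usage with a shell no-op standing in for `tiup`:
    run_checked("sh", &["-c", "exit 0"]).expect("command should succeed");
}
```

The same status check can be applied to the async `tokio::process::Command` variant by inspecting `output.status` after `wait_with_output`.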
```rust
#[instrument]
async fn init_tikv() -> Result<(), Box<dyn Error>> {
    tokio::spawn(async {
        Command::new("tiup")
            .arg("cluster")
            .arg("start")
            .arg(ENV.clone())
            .spawn()
            .unwrap()
            .wait_with_output()
            .await
    })
    .await??;
    Ok(())
}
```

The connection is made via the following:

```rust
DB.connect("tikv://127.0.0.1:1279").await?;
```

To ensure that this issue was related to the kv-tikv driver and not axum itself, I ran the same benchmark using the kv-speedb driver:

```rust
#[async_recursion]
async fn init_db() -> Result<(), Box<dyn Error>> {
    match std::fs::canonicalize("./data.db") {
        Ok(dir) => {
            DB.connect(&(String::from("speedb:/") + dir.to_str().unwrap()))
                .await?;
        }
        Err(_) => {
            tracing::info!("Could not find database directory");
            std::fs::create_dir_all("./data.db").unwrap();
            init_db().await?;
        }
    }
    DB.use_ns("namespace").use_db("database").await?;
    Ok(())
}
```

The target SSD is the same SSD used within the TiKV test. The result is the following:

The question is whether this is expected behavior, given that the database is empty and the initial testing results on reads for raw KV were over an order of magnitude faster.

Steps to reproduce

Explained above.

Expected behaviour

Significantly lower latency and faster throughput of the database.

SurrealDB version

1.1.1 on 64-bit Arch Linux (32-thread 7950X)

Contact Details

Is there an existing issue for this?
Code of Conduct
Moved to a discussion because it's not really an issue (yet?). A few comments:
I hope this clarifies things! We internally ran some benchmarks using go-ycsb against TiKV and SurrealDB, and saw that, when comparing real throughput, we are roughly on par, so the client is not a bottleneck.
- Your go-ycsb benchmark uses `tikv.type="raw"`, which means no transactions. SurrealDB always opens a transaction against TiKV to do even the simplest of operations. In order to fix the go-ycsb benchmark, you need to replace that parameter with `-p tikv.type=txn -p tikv.async_commit=false -p tikv.one_pc=false`.
- Every statement (e.g. `CREATE`) does more than 5 additional reads i…
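The impact of those two points can be sketched with back-of-the-envelope arithmetic. Assumptions: the ~192 µs average raw GET latency from the go-ycsb output above, at least 5 extra reads per statement as stated in the reply, and an *assumed* figure of 2 extra round trips for transaction begin/commit:

```rust
// Rough lower-bound model. Assumptions: 192 µs avg raw GET (from the go-ycsb
// run above); >5 extra reads per statement and one transaction per operation
// (per the reply); the 2 begin/commit round trips are an assumed figure.
fn surreal_read_lower_bound_us(raw_get_us: f64, extra_reads: f64, txn_round_trips: f64) -> f64 {
    (1.0 + extra_reads + txn_round_trips) * raw_get_us
}

fn main() {
    let per_statement_us = surreal_read_lower_bound_us(192.0, 5.0, 2.0);
    let ops_ceiling = 1_000_000.0 / per_statement_us; // one sequential connection
    println!("lower-bound latency per read: {per_statement_us} µs");
    println!("single-connection ceiling: ~{ops_ceiling:.0} ops/s");
}
```

Even under these optimistic assumptions, each SurrealDB read costs several raw GETs, so a sizable gap versus raw-mode go-ycsb numbers is expected rather than anomalous.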