qemu-img create via libgfapi is slow #4320

Open
resposit opened this issue Mar 20, 2024 · 8 comments · May be fixed by #4329

Comments

@resposit

Description of problem:

Creating a raw/qcow2 image on a gluster volume via libgfapi appears to be very slow compared to fuse access.

# time qemu-img create gluster://10.10.81.16/vstor/10G-gfapi 10G
Formatting 'gluster://10.10.81.16/vstor/10G-gfapi', fmt=raw size=10737418240
[2024-03-20 09:50:35.293625 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-vstor: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
[2024-03-20 09:50:47.299242 +0000] I [io-stats.c:4190:fini] 0-vstor: io-stats translator unloaded

real    0m13.046s
user    0m0.018s
sys     0m0.028s

If I do the same operation directly on the fuse mount point, the operation is almost instantaneous:

# mount | grep glust
cloud15-gl.na.infn.it:vstor on /mnt/pve/vstor type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

root@cloud15:/mnt/pve/vstor# time qemu-img create ./10G 10G
Formatting './10G', fmt=raw size=10737418240

real    0m0.033s
user    0m0.004s
sys     0m0.022s

Showing both files:

root@cloud15:/mnt/pve/vstor# ls -l 10G*
-rw-r--r-- 1 root root 10737418240 Mar 20 10:55 10G
-rw------- 1 root root 10737418240 Mar 20 10:30 10G-gfapi

Mandatory info:
- The output of the gluster volume info command:

Volume Name: vstor
Type: Disperse
Volume ID: c3d389ed-3d51-4484-bdfd-596d2629a0a1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: cloud15-gl:/pool6/vstor/brick
Brick2: cloud16-gl:/pool6/vstor/brick
Brick3: cloud17-gl:/pool6/vstor/brick
Options Reconfigured:
features.shard-block-size: 256MB
transport.address-family: inet
storage.fips-mode-rchecksum: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
features.shard: on
user.cifs: off
client.event-threads: 4
server.event-threads: 4
performance.client-io-threads: on
cluster.lookup-optimize: off

- The output of the gluster volume status command:

Status of volume: vstor
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick cloud15-gl:/pool6/vstor/brick         59554     0          Y       3111
Brick cloud16-gl:/pool6/vstor/brick         52485     0          Y       3104
Brick cloud17-gl:/pool6/vstor/brick         52170     0          Y       3116
Self-heal Daemon on localhost               N/A       N/A        Y       3165
Self-heal Daemon on cloud17-gl.na.infn.it   N/A       N/A        Y       3151
Self-heal Daemon on cloud16-gl.na.infn.it   N/A       N/A        Y       3142

Task Status of Volume vstor
------------------------------------------------------------------------------
There are no active volume tasks

Additional info:

- The operating system / glusterfs version:
Debian 12.5
Glusterfs 11.1

Bricks are on ZFS file systems:

root@cloud15:/mnt/pve/vstor# zpool status
  pool: pool6
 state: ONLINE
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Mon Mar 18 11:42:02 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        pool6                       ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            scsi-35000c500f64eeaf7  ONLINE       0     0     0
            scsi-35000c500f64f369b  ONLINE       0     0     0
            scsi-35000c500f65032fb  ONLINE       0     0     0
            scsi-35000c500c2254b77  ONLINE       0     0     0
            scsi-35000c500cf057b2b  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            scsi-358ce38ee22c8e51d  ONLINE       0     0     0
            scsi-358ce38ee22c8e519  ONLINE       0     0     0

errors: No known data errors
@aravindavk
Member

The GFAPI-based command (qemu-img create gluster://...) mounts the volume and unmounts it in addition to creating the file. With the fuse mount those two steps are skipped and only the file creation is done.

Please share the gfapi logs so that we can confirm the timings for each step.
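
To make the mount/unmount overhead concrete, here is a minimal, hypothetical sketch of the libgfapi call sequence a one-shot client like qemu-img create goes through. This is not qemu's actual code; the volume name, server address and file name are simply taken from the report above, and glfs_init()/glfs_fini() are the steps that correspond to mounting and unmounting:

/* Hypothetical sketch, not qemu's code: create one file through libgfapi.
 * Every invocation pays for glfs_init() (fetch the volfile, build the client
 * graph, connect to the bricks) and glfs_fini() (tear it all down again).
 * Typically built with: gcc create.c $(pkg-config --cflags --libs glusterfs-api) */
#include <fcntl.h>
#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("vstor");
    if (!fs)
        return 1;
    glfs_set_volfile_server(fs, "tcp", "10.10.81.16", 24007); /* 24007 = default glusterd port */

    if (glfs_init(fs) != 0) {              /* the "mount" step */
        fprintf(stderr, "glfs_init failed\n");
        return 1;
    }

    glfs_fd_t *fd = glfs_creat(fs, "10G-gfapi", O_RDWR, 0600);
    if (fd)
        glfs_close(fd);                    /* qemu-img would also resize/format the image here */

    glfs_fini(fs);                         /* the "unmount" step */
    return 0;
}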

@resposit
Author

Thank you, I increased diagnostics.client-log-level to DEBUG.
Here is the log:
gluster-log.txt

@mohit84
Contributor

mohit84 commented Mar 24, 2024

Can you share the profile after executing a command like the one below?

gluster v profile <vol-name> start; gluster v profile <vol-name> info clear; time qemu-img create /mnt/pve/vstor/10G 10G;  gluster v profile <vol-name> info

@resposit
Author

Hello,
yes, see the attached logs:
gfapi.txt
fuse.txt

@aravindavk
Member

GFAPI doesn't help for stateless operations like create. qemu-img create has to do init (mount) and fini (unmount) along with the actual file creation, and the mount state is not maintained across commands: if you run qemu-img create 10 times, it has to mount and unmount the volume 10 times.

When the image is booted with qemu, that is a long-running process and the mount persists for as long as the image is in use. During this time all the IO operations use GFAPI, which is why it can be faster than fuse.

The fuse equivalent of the GFAPI command above is:

# qemu_fuse_create.sh
mount -t glusterfs cloud15-gl.na.infn.it:vstor /mnt/pve/vstor
qemu-img create /mnt/pve/vstor/10G 10G
umount /mnt/pve/vstor

time bash qemu_fuse_create.sh

@mohit84
Contributor

mohit84 commented Mar 27, 2024

The actual image creation via gfapi is fast: the connection came up around 12:47:24.708074 and the shutdown started around 12:47:24.713059, so the total time taken is about 4985 microseconds (0.713059 - 0.708074 s), which is faster than what fuse took. The slowness shows up at the application level because the shutdown process is slow; we have to check why it is taking so long. I will update here if I find something.

[2024-03-20 12:47:24.708074 +0000] I [MSGID: 122062] [ec.c:335:ec_up] 0-vstor-disperse-0: Going UP : Child UP = 111 Child Notify = 111


[2024-03-20 12:47:24.713059 +0000] I [socket.c:835:__socket_shutdown] 0-gfapi: intentional socket shutdown(11)

@resposit
Author

@aravindavk I tried what you suggested:

root@cloud15:~# mkdir vstor-test
root@cloud15:~# cat qemu_fuse_create.sh
mount -t glusterfs cloud15-gl.na.infn.it:vstor ./vstor-test
qemu-img create ./vstor-test/10G 10G
umount ./vstor-test

Execution is almost instantaneous:

root@cloud15:~# time ./qemu_fuse_create.sh
Formatting './vstor-test/10G', fmt=raw size=10737418240

real    0m0.105s
user    0m0.058s
sys     0m0.037s

@mohit84
Contributor

mohit84 commented Mar 29, 2024

I think I have found the root cause of why qemu-img is taking time during connection shutdown.
During the first fop initiation we set a 10s call_bail timer and take a reference on the rpc object. The call_bail function is called by the timer thread every 10s. During disconnect the client's job is to cancel the timer event and unref the rpc object, but after this patch (https://review.gluster.org/#/c/glusterfs/+/22087/) gf_timer_call_cancel returns -1 if cleanup has started. Because the function returns -1 to its caller rpc_clnt_connection_cleanup, the caller does not unref the rpc object, and the connection has to wait until call_bail is triggered by the timer thread even though the operation has already finished completely. I think we have to change the return code to avoid the issue. The patch was implemented as part of the shd multiplex feature; the shd mux feature was later reverted, but the associated patch was not.

@rafikc30
Do you see any issue if we change the return code to 0 while ctx is valid in gf_timer_call_cancel, so that the parent function can unref the rpc object without waiting for call_bail?
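
For readers not familiar with the rpc-clnt code, the interaction described above can be pictured with a heavily simplified sketch (hypothetical stand-in types and function names, not the actual GlusterFS source):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real rpc/timer structures. */
struct rpc_clnt_sketch    { int refcount; };
struct timer_event_sketch { bool cleanup_started; };

/* Mirrors the behaviour introduced by the patch referenced above:
 * cancelling the timer reports failure once cleanup has started. */
static int timer_cancel_sketch(struct timer_event_sketch *ev)
{
    if (ev->cleanup_started)
        return -1;   /* caller reads this as "the timer may still fire" */
    return 0;
}

static void connection_cleanup_sketch(struct rpc_clnt_sketch *rpc,
                                      struct timer_event_sketch *bail_timer)
{
    if (timer_cancel_sketch(bail_timer) == 0) {
        rpc->refcount--;   /* fast path: drop the reference immediately */
    }
    /* On -1 the unref is left to call_bail, which the timer thread only runs
     * every 10s, so the shutdown (glfs_fini) can block for up to one period. */
}

int main(void)
{
    struct rpc_clnt_sketch rpc = { .refcount = 1 };
    struct timer_event_sketch bail = { .cleanup_started = true };

    connection_cleanup_sketch(&rpc, &bail);
    printf("refcount after cleanup: %d (still held)\n", rpc.refcount);
    return 0;
}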

mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024
During the first rpc clnt submission we take an rpc reference and register the call_bail function with the timer thread. The timer thread calls call_bail every 10s. When a client triggers a shutdown request it calls rpc_clnt_connection_cleanup to clean up the rpc connection, but rpc_clnt_connection_cleanup cannot clean up the connection successfully because the cleanup_started flag has been set by the upper xlator. The rpc reference is released only after call_bail is triggered, so if call_bail happened to run just before the shutdown started, the application has to wait up to 10s for the rpc connection to be cleaned up and the process becomes slow.

Solution: Unref the rpc object based on the conn->timer/conn->reconnect pointer values, as is already done for ping_timer. These pointers are always modified under the critical section, so if a pointer is valid we can assume the rpc reference is also valid.

Fixes: gluster#4320
Change-Id: Ib947b8bfcbe1b49e1ed05a50a84de6f92afbca13
Signed-off-by: Mohit Agrawal <[email protected]>
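
The solution described in the commit message can be sketched roughly as follows (again simplified and hypothetical, not the actual patch; the point is only that the decision to unref is taken from the timer pointers, which are modified under the connection lock, rather than from the return code of the cancel call):

#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for the connection object. */
struct conn_sketch {
    pthread_mutex_t lock;
    void *timer;      /* call_bail timer: non-NULL means an rpc ref is held */
    void *reconnect;  /* reconnect timer: likewise */
};

/* Decide how many rpc references to drop purely from the pointer values,
 * so the shutdown never waits for the next call_bail run. */
static int cleanup_unref_count(struct conn_sketch *c)
{
    int unrefs = 0;
    pthread_mutex_lock(&c->lock);
    if (c->timer)     { c->timer = NULL;     unrefs++; }
    if (c->reconnect) { c->reconnect = NULL; unrefs++; }
    pthread_mutex_unlock(&c->lock);
    return unrefs;    /* caller unrefs immediately, regardless of cancel rc */
}

int main(void)
{
    struct conn_sketch c = { .lock = PTHREAD_MUTEX_INITIALIZER,
                             .timer = (void *)1 /* dummy non-NULL timer */ };
    printf("references to drop at cleanup: %d\n", cleanup_unref_count(&c));
    return 0;
}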
mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024
@mohit84 mohit84 linked a pull request Apr 3, 2024 that will close this issue
mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024
mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024