qemu-img create via libgfapi is slow #4320

Open
resposit opened this issue Mar 20, 2024 · 8 comments · May be fixed by #4329

Comments

@resposit

Description of problem:

Creating a raw/qcow2 image on a gluster volume via libgfapi appears to be very slow compared to fuse access.

# time qemu-img create gluster://10.10.81.16/vstor/10G-gfapi 10G
Formatting 'gluster://10.10.81.16/vstor/10G-gfapi', fmt=raw size=10737418240
[2024-03-20 09:50:35.293625 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-vstor: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
[2024-03-20 09:50:47.299242 +0000] I [io-stats.c:4190:fini] 0-vstor: io-stats translator unloaded

real    0m13.046s
user    0m0.018s
sys     0m0.028s

If I do the same operation directly on the fuse mount point, the operation is almost instantaneous:

# mount | grep glust
cloud15-gl.na.infn.it:vstor on /mnt/pve/vstor type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

root@cloud15:/mnt/pve/vstor# time qemu-img create ./10G 10G
Formatting './10G', fmt=raw size=10737418240

real    0m0.033s
user    0m0.004s
sys     0m0.022s

Showing both files:

root@cloud15:/mnt/pve/vstor# ls -l 10G*
-rw-r--r-- 1 root root 10737418240 Mar 20 10:55 10G
-rw------- 1 root root 10737418240 Mar 20 10:30 10G-gfapi

Mandatory info:
- The output of the gluster volume info command:

Volume Name: vstor
Type: Disperse
Volume ID: c3d389ed-3d51-4484-bdfd-596d2629a0a1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: cloud15-gl:/pool6/vstor/brick
Brick2: cloud16-gl:/pool6/vstor/brick
Brick3: cloud17-gl:/pool6/vstor/brick
Options Reconfigured:
features.shard-block-size: 256MB
transport.address-family: inet
storage.fips-mode-rchecksum: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
features.shard: on
user.cifs: off
client.event-threads: 4
server.event-threads: 4
performance.client-io-threads: on
cluster.lookup-optimize: off

- The output of the gluster volume status command:

Status of volume: vstor
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick cloud15-gl:/pool6/vstor/brick         59554     0          Y       3111
Brick cloud16-gl:/pool6/vstor/brick         52485     0          Y       3104
Brick cloud17-gl:/pool6/vstor/brick         52170     0          Y       3116
Self-heal Daemon on localhost               N/A       N/A        Y       3165
Self-heal Daemon on cloud17-gl.na.infn.it   N/A       N/A        Y       3151
Self-heal Daemon on cloud16-gl.na.infn.it   N/A       N/A        Y       3142

Task Status of Volume vstor
------------------------------------------------------------------------------
There are no active volume tasks

Additional info:

- The operating system / glusterfs version:
Debian 12.5
Glusterfs 11.1

Bricks are on ZFS file systems:

root@cloud15:/mnt/pve/vstor# zpool status
  pool: pool6
 state: ONLINE
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Mon Mar 18 11:42:02 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        pool6                       ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            scsi-35000c500f64eeaf7  ONLINE       0     0     0
            scsi-35000c500f64f369b  ONLINE       0     0     0
            scsi-35000c500f65032fb  ONLINE       0     0     0
            scsi-35000c500c2254b77  ONLINE       0     0     0
            scsi-35000c500cf057b2b  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            scsi-358ce38ee22c8e51d  ONLINE       0     0     0
            scsi-358ce38ee22c8e519  ONLINE       0     0     0

errors: No known data errors
@aravindavk
Member

The GFAPI-based command (qemu-img create gluster://...) mounts the volume and unmounts it in addition to creating the file. With the fuse mount those two steps are skipped and only the file creation is done.

Please share the gfapi logs so that we can confirm the timings for each step.
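
To make the mount/unmount overhead concrete, here is a minimal, hypothetical sketch of the libgfapi call sequence a one-shot client like qemu-img create goes through. This is not qemu's actual code; the volume name, server address and file name are simply taken from the report above, and glfs_init()/glfs_fini() are the steps that correspond to mounting and unmounting:

/* Hypothetical sketch, not qemu's code: create one file through libgfapi.
 * Every invocation pays for glfs_init() (fetch the volfile, build the client
 * graph, connect to the bricks) and glfs_fini() (tear it all down again).
 * Typically built with: gcc create.c $(pkg-config --cflags --libs glusterfs-api) */
#include <fcntl.h>
#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("vstor");
    if (!fs)
        return 1;
    glfs_set_volfile_server(fs, "tcp", "10.10.81.16", 24007); /* 24007 = default glusterd port */

    if (glfs_init(fs) != 0) {              /* the "mount" step */
        fprintf(stderr, "glfs_init failed\n");
        return 1;
    }

    glfs_fd_t *fd = glfs_creat(fs, "10G-gfapi", O_RDWR, 0600);
    if (fd)
        glfs_close(fd);                    /* qemu-img would also resize/format the image here */

    glfs_fini(fs);                         /* the "unmount" step */
    return 0;
}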

@resposit
Author

Thank you, I increased diagnostics.client-log-level to DEBUG.
Here is the log:
gluster-log.txt

@mohit84
Contributor

mohit84 commented Mar 24, 2024

Can you share the profile after executing a command like the one below?

gluster v profile <vol-name> start; gluster v profile <vol-name> info clear; time qemu-img create /mnt/pve/vstor/10G 10G;  gluster v profile <vol-name> info

@resposit
Author

Hello,
yes, see the attached logs:
gfapi.txt
fuse.txt

@aravindavk
Member

GFAPI doesn't help for stateless operations like create. qemu-img create has to do init (mount) and fini (unmount) along with the actual file creation, and the mount state is not maintained across commands: if you run qemu-img create 10 times, it has to mount and unmount the volume 10 times.

When the image is booted with qemu, that is a long-running process and the mount persists for as long as the image is in use. During this time all the IO operations use GFAPI, which is why it can be faster than fuse.

The fuse equivalent of the GFAPI command above is:

# qemu_fuse_create.sh
mount -t glusterfs cloud15-gl.na.infn.it:vstor /mnt/pve/vstor
qemu-img create /mnt/pve/vstor/10G 10G
umount /mnt/pve/vstor

time bash qemu_fuse_create.sh

@mohit84
Contributor

mohit84 commented Mar 27, 2024

The actual image creation via gfapi is fast: the connection came up around 12:47:24.708074 and the shutdown started around 12:47:24.713059, so the total time taken is about 4985 microseconds (0.713059 - 0.708074 s), which is faster than what fuse took. The slowness shows up at the application level because the shutdown process is slow; we have to check why it is taking so long. I will update here if I find something.

[2024-03-20 12:47:24.708074 +0000] I [MSGID: 122062] [ec.c:335:ec_up] 0-vstor-disperse-0: Going UP : Child UP = 111 Child Notify = 111


[2024-03-20 12:47:24.713059 +0000] I [socket.c:835:__socket_shutdown] 0-gfapi: intentional socket shutdown(11)

@resposit
Author

@aravindavk I tried what you suggested:

root@cloud15:~# mkdir vstor-test
root@cloud15:~# cat qemu_fuse_create.sh
mount -t glusterfs cloud15-gl.na.infn.it:vstor ./vstor-test
qemu-img create ./vstor-test/10G 10G
umount ./vstor-test

Execution is almost instantaneous:

root@cloud15:~# time ./qemu_fuse_create.sh
Formatting './vstor-test/10G', fmt=raw size=10737418240

real    0m0.105s
user    0m0.058s
sys     0m0.037s

@mohit84
Contributor

mohit84 commented Mar 29, 2024

I think I have found the root cause of why qemu-img is taking time during connection shutdown.
During the first fop initiation we set a 10s call_bail timer and take a reference on the rpc object. The call_bail function is called by the timer thread every 10s. During disconnect the client's job is to cancel the timer event and unref the rpc object, but after this patch (https://review.gluster.org/#/c/glusterfs/+/22087/) gf_timer_call_cancel returns -1 if cleanup has started. Because the function returns -1 to its caller rpc_clnt_connection_cleanup, the caller does not unref the rpc object, and the connection has to wait until call_bail is triggered by the timer thread even though the operation has already finished completely. I think we have to change the return code to avoid the issue. The patch was implemented as part of the shd multiplex feature; the shd mux feature was later reverted, but the associated patch was not.

@rafikc30
Do you see any issue if we change the return code to 0 while ctx is valid in gf_timer_call_cancel, so that the parent function can unref the rpc object without waiting for call_bail?
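
For readers not familiar with the rpc-clnt code, the interaction described above can be pictured with a heavily simplified sketch (hypothetical stand-in types and function names, not the actual GlusterFS source):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real rpc/timer structures. */
struct rpc_clnt_sketch    { int refcount; };
struct timer_event_sketch { bool cleanup_started; };

/* Mirrors the behaviour introduced by the patch referenced above:
 * cancelling the timer reports failure once cleanup has started. */
static int timer_cancel_sketch(struct timer_event_sketch *ev)
{
    if (ev->cleanup_started)
        return -1;   /* caller reads this as "the timer may still fire" */
    return 0;
}

static void connection_cleanup_sketch(struct rpc_clnt_sketch *rpc,
                                      struct timer_event_sketch *bail_timer)
{
    if (timer_cancel_sketch(bail_timer) == 0) {
        rpc->refcount--;   /* fast path: drop the reference immediately */
    }
    /* On -1 the unref is left to call_bail, which the timer thread only runs
     * every 10s, so the shutdown (glfs_fini) can block for up to one period. */
}

int main(void)
{
    struct rpc_clnt_sketch rpc = { .refcount = 1 };
    struct timer_event_sketch bail = { .cleanup_started = true };

    connection_cleanup_sketch(&rpc, &bail);
    printf("refcount after cleanup: %d (still held)\n", rpc.refcount);
    return 0;
}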

mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024
During the first rpc clnt submission we take an rpc reference and register the call_bail function with the timer thread. The timer thread calls call_bail every 10s. When a client triggers a shutdown request it calls rpc_clnt_connection_cleanup to clean up the rpc connection, but rpc_clnt_connection_cleanup cannot clean up the connection successfully because the cleanup_started flag has been set by the upper xlator. The rpc reference is released only after call_bail is triggered, so if call_bail happened to run just before the shutdown started, the application has to wait up to 10s for the rpc connection to be cleaned up and the process becomes slow.

Solution: Unref the rpc object based on the conn->timer/conn->reconnect pointer values, as is already done for ping_timer. These pointers are always modified under the critical section, so if a pointer is valid we can assume the rpc reference is also valid.

Fixes: gluster#4320
Change-Id: Ib947b8bfcbe1b49e1ed05a50a84de6f92afbca13
Signed-off-by: Mohit Agrawal <[email protected]>
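
The solution described in the commit message can be sketched roughly as follows (again simplified and hypothetical, not the actual patch; the point is only that the decision to unref is taken from the timer pointers, which are modified under the connection lock, rather than from the return code of the cancel call):

#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for the connection object. */
struct conn_sketch {
    pthread_mutex_t lock;
    void *timer;      /* call_bail timer: non-NULL means an rpc ref is held */
    void *reconnect;  /* reconnect timer: likewise */
};

/* Decide how many rpc references to drop purely from the pointer values,
 * so the shutdown never waits for the next call_bail run. */
static int cleanup_unref_count(struct conn_sketch *c)
{
    int unrefs = 0;
    pthread_mutex_lock(&c->lock);
    if (c->timer)     { c->timer = NULL;     unrefs++; }
    if (c->reconnect) { c->reconnect = NULL; unrefs++; }
    pthread_mutex_unlock(&c->lock);
    return unrefs;    /* caller unrefs immediately, regardless of cancel rc */
}

int main(void)
{
    struct conn_sketch c = { .lock = PTHREAD_MUTEX_INITIALIZER,
                             .timer = (void *)1 /* dummy non-NULL timer */ };
    printf("references to drop at cleanup: %d\n", cleanup_unref_count(&c));
    return 0;
}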
mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024
@mohit84 mohit84 linked a pull request Apr 3, 2024 that will close this issue
mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024
mohit84 added a commit to mohit84/glusterfs that referenced this issue Apr 3, 2024