zfsbackup seems to hang on Solaris with B2 #17

Open
lhoward opened this issue Jan 23, 2019 · 13 comments

lhoward commented Jan 23, 2019

I'm running zfsbackup on Solaris 11. It seems to work fine for a small pool of a few MB, but appears to hang when I try it on a real pool (still small, around 1GB).

I'm using the B2 backend. The file backend works fine.

I'm wondering if it may have something to do with the way the tasks are parallelized (although setting --maxFileBuffer=0 doesn't make any difference).

Here's where it hangs (pool names etc changed):

% zfsbackup send --maxFileBuffer=0 --encryptTo [email protected] --signFrom [email protected] --publicKeyRingPath /var/backups/.gnupg/pubring.gpg.asc --secretKeyRingPath /var/backups/.gnupg/secring.gpg.asc --logLevel debug --full tank/Foobar b2://example-com-tank-git
2019/01/23 10:48:20 Setting number of cores to: 2
2019/01/23 10:48:20 Loaded private key ring /var/backups/.gnupg/secring.gpg.asc
2019/01/23 10:48:20 Loaded public key ring /var/backups/.gnupg/pubring.gpg.asc
2019/01/23 10:48:20 Setting working directory to /var/backups/.zfsbackup
2019/01/23 10:48:20 PGP Debug Info:
...
2019/01/23 10:48:20 The number of parallel uploads (4) is greater than the number of active files allowed (0), this may result in an unachievable max parallel upload target.
2019/01/23 10:48:20 Getting ZFS Snapshots with command "zfs list -H -d 1 -t snapshot -r -o name,creation -S creation tank/Foobar"
2019/01/23 10:48:20 Found ZFS snapshot "tank/Foobar@20181010" from 2018-10-10 18:20:00 +0000 UTC
2019/01/23 10:48:20 Found ZFS snapshot "tank/Foobar@20151003" from 2015-10-03 02:09:00 +0000 UTC
2019/01/23 10:48:20 Utilizing smart option.
2019/01/23 10:48:20 Limiting the number of active files to 0
2019/01/23 10:48:20 Limiting the number of parallel uploads to 4
2019/01/23 10:48:20 Max Backoff Time will be 30m0s
2019/01/23 10:48:20 Max Upload Retry Time will be 12h0m0s
2019/01/23 10:48:20 Upload Chunk Size will be 10MiB
2019/01/23 10:48:20 Will be using encryption key for [email protected]
2019/01/23 10:48:20 Will be signed from [email protected]
2019/01/23 10:48:20 Getting ZFS Snapshots with command "zfs list -H -d 1 -t snapshot -r -o name,creation -S creation tank/Foobar"
2019/01/23 10:48:21 Found ZFS snapshot "tank/Foobar@20181010" from 2018-10-10 18:20:00 +0000 UTC
2019/01/23 10:48:21 Found ZFS snapshot "tank/Foobar@20151003" from 2015-10-03 02:09:00 +0000 UTC
2019/01/23 10:48:21 Initializing Backend b2://example-com-tank-git
2019/01/23 10:48:21 Starting zfs send command: zfs send tank/Foobar@20181010
2019/01/23 10:48:21 Will be using internal gzip compressor with compression level 6.
2019/01/23 10:48:21 Starting volume tank/Foobar|20181010.zstream.gz.pgp.vol1
2019/01/23 10:48:23 b2 backend: Processing volume tank/Foobar|20181010.zstream.gz.pgp.vol1

Here are a few pertinent stack traces from gdb:

(gdb) thr 4
[Switching to thread 4 (LWP    3        )]
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
(gdb) bt
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
#1  0xffff80ffbf5107a4 in port_getn () from /lib/64/libc.so.1
#2  0x000000000045855a in runtime.asmsysvicall6 () at /opt/go/src/runtime/sys_solaris_amd64.s:104
#3  0x00000000004561ff in runtime.asmcgocall () at /opt/go/src/runtime/asm_amd64.s:718
#4  0x0000000000000000 in ?? ()
(gdb) thr 14
[Switching to thread 14 (Thread 3 (LWP 3))]
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
(gdb) bt
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
#1  0xffff80ffbf5107a4 in port_getn () from /lib/64/libc.so.1
#2  0x000000000045855a in runtime.asmsysvicall6 () at /opt/go/src/runtime/sys_solaris_amd64.s:104
#3  0x00000000004561ff in runtime.asmcgocall () at /opt/go/src/runtime/asm_amd64.s:718
#4  0x0000000000000000 in ?? ()
(gdb) thr 1
[Switching to thread 1 (LWP    1        )]
#0  0xffff80ffbf587e97 in __lwp_park () from /lib/64/libc.so.1
(gdb) bt
#0  0xffff80ffbf587e97 in __lwp_park () from /lib/64/libc.so.1
#1  0xffff80ffbf57a6fe in sema_wait_impl () from /lib/64/libc.so.1
#2  0xffff80ffbf57a8b3 in sema_reltimedwait () from /lib/64/libc.so.1
#3  0xffff80ffbf56dda7 in sem_reltimedwait_np () from /lib/64/libc.so.1
#4  0x000000000045855a in runtime.asmsysvicall6 () at /opt/go/src/runtime/sys_solaris_amd64.s:104
#5  0x00000000004561b9 in runtime.asmcgocall () at /opt/go/src/runtime/asm_amd64.s:688
#6  0x000000c42003e000 in ?? ()
#7  0x00000000012d2600 in github.com/someone1/zfsbackup-go/cmd.jobInfo ()
#8  0xffff80ffbffff868 in ?? ()
#9  0x0000000000434432 in runtime.entersyscallblock_handoff () at /opt/go/src/runtime/proc.go:2920
#10 0x000000c42008a180 in ?? ()
#11 0x00000000004300b0 in ?? () at /opt/go/src/runtime/proc.go:1092
#12 0xffff80ff00000000 in ?? ()
#13 0x0000000000454827 in runtime.rt0_go () at /opt/go/src/runtime/asm_amd64.s:258
#14 0x000000000000000f in ?? ()
#15 0xffff80ffbffff8d8 in ?? ()
#16 0x000000000000000f in ?? ()
#17 0xffff80ffbffff8d8 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb) 

lhoward commented Jan 23, 2019

(As to whether I simply wasn't patient enough: I did leave it running overnight, which should be plenty of time to compress/encrypt 1GB even on a lowly Celeron.)

Edit: this is irrelevant; the file backend was quick.

lhoward changed the title from "zfsbackup seems to hang on Solaris" to "zfsbackup seems to hang on Solaris with B2" on Jan 23, 2019

lhoward commented Jan 23, 2019

Setting B2_LOG_LEVEL=10, I can see an API call to b2_list_file_versions for the first chunk (vol1), with the expected empty response on a new bucket, but nothing after that.

Compared to a successful upload (with a smaller pool), the difference is that b2_get_upload_url is never called (nor b2_upload_file).

@someone1
Owner

Thanks for reporting this - I assume you compiled your own version from tip? Can you try compiling from the last release version and let me know if the problem persists? I'm curious whether the issue is with the latest change I made to the b2 backend, with the b2 backend library I'm using, or with Go on Solaris.

Can you also provide more details on your system (version, architecture, etc)?


lhoward commented Jan 24, 2019

I can try, although the only material changes since v0.3-rc1 seem to be related to the Azure backend (which I'm not using, but I did need to make some changes to the SDK to get it to compile on Solaris, hence my reluctance to rebase) and to SHA1 support.

I'm just running a vanilla Oracle Solaris 11.3 system. Small pools back up and restore to B2 successfully, and large pools back up fine with the file backend.


lhoward commented Jan 25, 2019

go version go1.10.1 solaris/amd64


lhoward commented Jan 25, 2019

Just recompiled with go1.11.5, no change.


lhoward commented Jan 26, 2019

I added some debugging log statements to github.com/kurin/blazer/b2/writer.go; it looks like ListObjects() in w.getLargeFile() is returning EOF, which bubbles up to Writer.sendChunk().

If I change getLargeFile() so that it sets w.Resume = false when ListObjects() returns EOF and calls back into getLargeFile() (i.e. the same logic as when no objects are returned), then things look more promising (I'm testing now).


lhoward commented Jan 26, 2019

Perhaps I need to refile this against blazer, unless it's a usage issue.


lhoward commented Jan 26, 2019

PS: zfsbackup prints the command-line usage when there is an error, which sometimes obscures the error message itself. It would be good to fix this.


lhoward commented Jan 26, 2019

Here's the patch I'm using. I think it works, but I don't know enough about B2 or Go to say so confidently (and obviously, for something as important as backups, I need to do some verification).

diff --git a/b2/writer.go b/b2/writer.go
index 209cfce..bedf626 100644
--- a/b2/writer.go
+++ b/b2/writer.go
@@ -311,7 +311,12 @@ func (w *Writer) getLargeFile() (beLargeFileInterface, error) {
                cur := &Cursor{name: w.name}
                objs, _, err := w.o.b.ListObjects(w.ctx, 1, cur)
                if err != nil {
-                       return nil, err
+                       if err == io.EOF {
+                               w.Resume = false
+                               return w.getLargeFile()
+                       } else {
+                               return nil, err
+                       }
                }
                if len(objs) < 1 || objs[0].name != w.name {
                        w.Resume = false

@someone1
Owner

Thanks for all the debugging! Looks like a fix may have been made to blazer.

Can you please provide details on the SHA1 change you mentioned making?


lhoward commented Jan 28, 2019

I didn't make a change to SHA1; I was just listing the things that appeared to have changed in master since v0.3-rc1 (sorry, I worded that very confusingly).

Anyway, with the upgraded blazer, zfsbackup appears to work as long as I set w.Resume = false in the B2 backend. (I can confirm successfully restoring from a backup.)

@someone1
Owner

The change to set Resume=true was made to try to stop duplicate files from appearing on B2, as reported in #7.

The intended goal was:

  • If the file already exists, try to resume it
  • Otherwise, start a new file

I will admit I did not release a new version after this change since I hadn't gotten around to testing it yet (funny how "I'll get to this next week" turns into 6+ months...). From the code, it reads as if, when we didn't list any files, it would start a normal upload (i.e. set Resume=false internally and retry the same call). Obviously this wasn't happening until the most recent fix was made, since you stumbled across this bug (thank you!). It's still unclear why you need to set Resume=false to get it working, though (I'll do some testing!).

I think the path forward here will be to make the following adjustments:

  • Never set Resume=true so this operates like the other backends (every call to Upload starts a fresh object to upload to); see the sketch just below this list
  • Try to cancel the large file upload if we failed (I think I'll need to open a feature request with blazer for this)
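
To make that first adjustment concrete, here is a rough sketch of a backend driving blazer with the writer's Resume field left at its default of false, so every upload starts a fresh large file. This is not the actual zfsbackup backend code; the bucket name, object key, credential environment variables, local file path, and tuning values below are placeholders (the tuning values simply mirror the log output above).

package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/kurin/blazer/b2"
)

func main() {
	ctx := context.Background()

	// Credentials via environment variables (placeholder names).
	client, err := b2.NewClient(ctx, os.Getenv("B2_ACCOUNT_ID"), os.Getenv("B2_ACCOUNT_KEY"))
	if err != nil {
		log.Fatal(err)
	}

	bucket, err := client.Bucket(ctx, "example-com-tank-git")
	if err != nil {
		log.Fatal(err)
	}

	// A locally staged volume file (hypothetical path).
	src, err := os.Open("/var/backups/.zfsbackup/tank-Foobar-20181010.vol1")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	w := bucket.Object("tank/Foobar|20181010.zstream.gz.pgp.vol1").NewWriter(ctx)
	w.ConcurrentUploads = 4        // mirrors the "parallel uploads (4)" in the log
	w.ChunkSize = 10 * 1024 * 1024 // mirrors the 10MiB chunk size in the log
	// w.Resume is deliberately left false: never try to resume an existing
	// large file, so B2 behaves like the other backends.

	if _, err := io.Copy(w, src); err != nil {
		w.Close()
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}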

The other backends work differently in ways:

  • S3: you can set a lifecycle policy to auto-clear multipart uploads that never completed - but the Go SDK provided by AWS should clean this up for you (see the sketch after this list)
  • GCS: resumable uploads expire automatically after 1 week
  • B2: no lifecycle rules; versioning is enabled by default on buckets (which may be confusing to users), and cleanup after a failed upload is left to the user
  • Azure: same as B2 (minus the versioning)
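
As an aside on the S3 point, here is a minimal sketch of that kind of lifecycle rule using aws-sdk-go. This is not part of zfsbackup; the bucket name and the 7-day window are arbitrary examples.

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	// Abort any multipart upload that has not completed within 7 days,
	// so failed backup volumes do not linger as hidden parts.
	_, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
		Bucket: aws.String("example-zfsbackup-bucket"),
		LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
			Rules: []*s3.LifecycleRule{{
				ID:     aws.String("abort-stale-multipart-uploads"),
				Status: aws.String("Enabled"),
				Filter: &s3.LifecycleRuleFilter{Prefix: aws.String("")},
				AbortIncompleteMultipartUpload: &s3.AbortIncompleteMultipartUpload{
					DaysAfterInitiation: aws.Int64(7),
				},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}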

I'll hopefully have some fixes in for you to try soon!
