Encrypted send to untrusted target not quite working ... pretty sure it's a simple config issue #742

Open
Halfwalker opened this issue Oct 8, 2023 · 21 comments · May be fixed by #744

@Halfwalker

Encrypted send to untrusted target not quite working ...

I have a backup dataset bondi3/zrepl on a server bondi that receives laptop backups. The dataset is encrypted, and laptop sends are plain, so they're re-encrypted on bondi. For offsite backups, I want to replicate this to a backup dataset wider/zrepl on remote server widey. Should be fairly simple ...

The two laptop sub-datasets under bondi3/zrepl both look like this:

bondi3/zrepl/ryzen2/ryzen22/home/alice  encryption             aes-256-gcm            -
bondi3/zrepl/ryzen2/ryzen22/home/alice  keylocation            none                   default
bondi3/zrepl/ryzen2/ryzen22/home/alice  keyformat              passphrase             -
bondi3/zrepl/ryzen2/ryzen22/home/alice  pbkdf2iters            350000                 -
bondi3/zrepl/ryzen2/ryzen22/home/alice  encryptionroot         bondi3/zrepl 
bondi3/zrepl/ryzen2/ryzen22/home/alice  keystatus              available

bondi zrepl.yml source job

  - type: source
    name: laptop_source
    filesystems:
      "bondi3/zrepl<": true
    snapshotting:
      type: manual
    send:
      encrypted: true
    serve:
      type: tls
      listen: :8541
      ca: /etc/zrepl/ca.crt
      cert: /etc/zrepl/bondi.crt
      key: /etc/zrepl/bondi.key
      client_cns:
        - widey

widey zrepl.yml pull job

  - name: pull_bondi
    type: pull
    connect:
      type: tls
      address: 192.168.2.104:8541
      ca: /etc/zrepl/ca.crt
      cert: /etc/zrepl/widey.crt
      key: /etc/zrepl/widey.key
      server_cn: bondi
    root_fs: wider/zrepl
    interval: 1h
    replication:
      protection:
        initial: guarantee_resumability
        incremental: guarantee_resumability
    recv:
      properties:
        override: {
          "readonly": "on",
          "mountpoint": "none",
          "canmount": "noauto"
        }
      # encryption off as per encrypted-send-to-untrusted-receiver
      # https://zrepl.github.io/master/configuration/sendrecvoptions.html#placeholders
      placeholder:
        encryption: off
    pruning:
      keep_sender:
        - type: regex
          regex: '.*'
      keep_receiver:
        - type: regex
          negate: true
          regex: '^zrepl_'
        - type: grid
          grid: 1x1h(keep=all) | 24x1h | 30x1d | 12x30d
          regex: '^zrepl_'

Except the two laptop backups (halflapp and ryzen2) fail ... this is the status on the remote widey

bondi3/zrepl                              PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions
bondi3/zrepl/halflapp                     PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions
bondi3/zrepl/halflapp/halflapp            PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions
bondi3/zrepl/halflapp/halflapp/home       PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions
bondi3/zrepl/halflapp/halflapp/home/alice STEP-ERROR (step 1/1, 0 B/6.8 GiB) parent(s) failed during initial replication: [bondi3/zrepl bondi3/zrepl/halflapp/halflapp]
     :
bondi3/zrepl/ryzen2                       PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions
bondi3/zrepl/ryzen2/ryzen22               PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions
bondi3/zrepl/ryzen2/ryzen22/home          PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions 
bondi3/zrepl/ryzen2/ryzen22/home/alice    STEP-ERROR (step 1/1, 0 B/44.1 GiB) parent(s) failed during initial replication: [bondi3/zrepl bondi3/zrepl/ryzen2/ryzen22]
@problame
Member

problame commented Oct 14, 2023

Sorry for the delayed reply.

I think I understand the problem.

The bondi-side filesystems

bondi3/zrepl/halflapp  
bondi3/zrepl/halflapp/halflapp
bondi3/zrepl/halflapp/halflapp/home

don't have snapshots because they're placeholder filesystems.

And so, the replication planner code bails out early here:

if len(sfsvs) < 1 {
    err := errors.New("sender does not have any versions")
    log(ctx).Error(err.Error())
    return nil, err
}
var rfsvs []*pdu.FilesystemVersion
if fs.receiverFS != nil && !fs.receiverFS.GetIsPlaceholder() {
    rfsvsres, err := fs.receiver.ListFilesystemVersions(ctx, &pdu.ListFilesystemVersionsReq{Filesystem: fs.Path})
    if err != nil {
        log(ctx).WithError(err).Error("receiver error")
        return nil, err
    }
    rfsvs = rfsvsres.GetVersions()
} else {
    rfsvs = []*pdu.FilesystemVersion{}
}

Would you be willing to test a fix for this? You'd download a zrepl binary from the GitHub CI and replace your distro's zrepl binaries with it.

Also, can you try this workaround: what if you change the filesystems on the bondi-side source job to not include the chain of placeholder filesystems?

   - type: source
     name: laptop_source
     filesystems:
-      "bondi3/zrepl<": true
+      "bondi3/zrepl/halflapp/halflapp/home/alice": true

@problame problame added this to the 0.8: Replication milestone Oct 14, 2023
problame added a commit that referenced this issue Oct 14, 2023
Before this PR, when chaining replication from
A => B => C, if B had placeholders and the `filesystems`
included these placeholders, we'd incorrectly
fail the planning phase with error
`sender does not have any versions`.

The non-placeholder child filesystems of these placeholders
would then fail to replicate because of the
initial-replication-dependency-tracking that we do, i.e.,
their parent failed to replicate initially, hence
they fail to replicate as well
(`parent(s) failed during initial replication`).

We can do better than that because we have the information
whether a sender-side filesystem is a placeholder.
This PR makes the planner act on that information.
The outcome is that placeholders are replicated as
placeholders (albeit the receiver remains in control
of how these placeholders are created, i.e., `recv.placeholders`)
The mechanism to do it is:
1. Don't plan any replication steps for filesystems that
   are placeholders on the sender.
2. Ensure that, if a receiving-side filesystem exists, it
   is indeed a placeholder.

Check (2) may seem overly restrictive, but, the goal here
is not just to mirror all non-placeholder filesystems, but
also to mirror the hierarchy.

TODO:
- test with user
- regression test

fixes #742
@problame problame linked a pull request Oct 14, 2023 that will close this issue
@problame
Member

Untested fix in: #744
It's super low risk to try this out on your setup.

Please try it for both initial replication and also a few incremental replication runs after the initial replication.

Binaries available from CircleCI (click the appropriate quickcheck-* job, then navigate to tab "Artifacts") https://app.circleci.com/pipelines/github/zrepl/zrepl/7429/workflows/50139be5-b0fd-4fcc-a60f-998432b24bb7

@Halfwalker
Author

Just tried the new binary on widey - no luck :( You did mean for it to be replaced on the pull/target system, right?

I picked it from
https://app.circleci.com/pipelines/github/zrepl/zrepl/7429/workflows/50139be5-b0fd-4fcc-a60f-998432b24bb7/jobs/51075/artifacts

To be sure, here are the md5sums of the original 0.6.1 (zrepl.orig) and the new test one (zrepl)

$ md5sum /usr/bin/zrepl*
c77f84879cf7aa8c67af2ed2c971af51  /usr/bin/zrepl
0387692626701b1e0975fc163db49563  /usr/bin/zrepl.orig

It gives the exact same error

@problame
Member

The planner runs on the active side of the replication setup.
In your case, the active side is the pull job on widey, yeah.

Just to make sure you deployed the binary correctly: you need to

  1. systemctl stop zrepl
  2. copy the binary to the location (/usr/bin/zrepl)
  3. systemctl start zrepl

My PR adds debug log messages.
Do you see any of them in the logs?
Please enable debug logging, then check widey's logs for

sender filesystem is placeholder
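For reference, a minimal sketch of what enabling debug logging could look like in zrepl.yml (assuming the syslog outlet, since your logs above go to syslog; double-check the outlet fields against the zrepl logging docs):

global:
  logging:
    - type: syslog
      level: debug
      format: human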

@Halfwalker
Author

Yup - stopped the service, then in /usr/bin moved old zrepl to zrepl.orig, then moved zrepl-linux-amd64 to zrepl

-rwxr-xr-x  1 alice alice   24424328 Oct 14 12:18  zrepl
-rwxr-xr-x  1 root  root    24535344 Oct  7 16:55  zrepl.orig

Here's the syslog looking for placeholder (didn't find "sender filesystem")

alice@widey:~$ sudo fgrep placeholder /var/log/syslog

Oct 14 12:44:42 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.9vO3.qQMj.97sx.vcbb]: starting command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 12:44:42 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.9vO3.qQMj.97sx.vcbb]: started command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 12:44:42 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.9vO3.qQMj.97sx.vcbb]: start waiting cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 12:44:42 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.9vO3.qQMj.97sx.vcbb]: command exited without error cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl" total_time_s="0.026661596" systemtime_s="0.021611" usertime_s="0.004322"
Oct 14 12:44:42 widey zrepl[9260]: [pull_bondi][trace.data][ZJuM$3z3q$3z3q.Bz2B.RK1Q.9vO3.qQMj.97sx.vcbb]: finished span cli.(*Subcommand).run#0$active-side-job#0$active-side-job#0.active-side-job pull_bondi.invocation-1.replication.plan.endpoint.(*Receiver).ListFilesystems.zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl duration_s="0.027127839"
Oct 14 12:46:07 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.3HC6.dCfZ.aQHn]: starting command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 12:46:07 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.3HC6.dCfZ.aQHn]: started command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 12:46:07 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.3HC6.dCfZ.aQHn]: start waiting cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 12:46:07 widey zrepl[9260]: [pull_bondi][zfs.cmd][ZJuM$3z3q$3z3q.Bz2B.RK1Q.3HC6.dCfZ.aQHn]: command exited without error systemtime_s="0.02059" usertime_s="0.008236" cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl" total_time_s="0.0298248"
Oct 14 12:46:07 widey zrepl[9260]: [pull_bondi][trace.data][ZJuM$3z3q$3z3q.Bz2B.RK1Q.3HC6.dCfZ.aQHn]: finished span cli.(*Subcommand).run#0$active-side-job#0$active-side-job#0.active-side-job pull_bondi.invocation-1.prune_recever.endpoint.(*Receiver).ListFilesystems.zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl duration_s="0.030376682"

@problame
Member

problame commented Oct 14, 2023

Hm, that's not helpful.

Please run

zrepl test placeholder --all

on the middle node of your replication chain, i.e., the bondi host.

And just to make sure, please confirm your setup looks like this

A => B => C
where => is a pull replication
and
A := ryzen22
B := bondi
C := widey 

@Halfwalker
Author

Placeholders on bondi - only looking at the zrepl dataset, which is the one to be replicated to widey

$ sudo zrepl test placeholder --all | rg zrepl
IS_PLACEHOLDER  DATASET zrepl:placeholder
no      bondi3/zrepl
yes     bondi3/zrepl/halflapp   on
yes     bondi3/zrepl/halflapp/halflapp  on
yes     bondi3/zrepl/halflapp/halflapp/home     on
no      bondi3/zrepl/halflapp/halflapp/home/alice
yes     bondi3/zrepl/ryzen2     on
yes     bondi3/zrepl/ryzen2/ryzen22     on
yes     bondi3/zrepl/ryzen2/ryzen22/home        on
no      bondi3/zrepl/ryzen2/ryzen22/home/alice

The setup above is almost right ...

ryzen2 pushes to bondi (since it's almost always on - main workstation)
widey pulls from bondi (since it only turns on late at night for backups)

@problame
Member

Ok, so, setup is

ryzen2 =push=> bondi =pull=> widey

I pushed a new commit to the PR; please wait for CI to finish, download the new artifacts, then try the replication again.

@Halfwalker
Author

Nice :) Fast work, Mikey likes it.

Same CircleCI URL as above? I don't know CircleCI - GitHub Actions/GitLab CI/Drone yes, but not CircleCI.

@problame
Member

No, different URL.
You can either navigate the CircleCI UI or go via the commit status on the PR.

[screenshot]

(Note the screenshot selects the Go 1.20 build, but feel free to use the Go 1.21 build.)

@Halfwalker
Author

Got it. I didn't see a 1.21 version for amd64, just the 1.20, unless I'm misreading ... The others are FreeBSD 1.20 and 1.21.
[screenshot]
New zrepl

$ md5sum /usr/bin/zrepl*
86fc822e277231c71fd05b0c84ec0d05  /usr/bin/zrepl
0387692626701b1e0975fc163db49563  /usr/bin/zrepl.orig

Same error :( Searching for placeholder in syslog

Oct 14 13:27:44 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.jxBb.udbZ]: starting command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy"
Oct 14 13:27:44 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.jxBb.udbZ]: started command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy"
Oct 14 13:27:44 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.jxBb.udbZ]: start waiting cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy"
Oct 14 13:27:44 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.jxBb.udbZ]: command exited without error usertime_s="0" cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy" total_time_s="0.013439911" systemtime_s="0.013004"
Oct 14 13:27:44 widey zrepl[11869]: [snaproot][trace.data][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.jxBb.udbZ]: finished span cli.(*Subcommand).run#0$snap-job#0$snap-job#0.snap-job snaproot.invocation-1.snap-job-do-prune.endpoint.(*Sender).ListFilesystems.zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy duration_s="0.01382875"
Oct 14 13:27:44 widey zrepl[11869]: [snaproot][endpoint][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.jxBb]: placeholder state fs="&{[widey ROOT jammy]}" placeholder_state="&zfs.FilesystemPlaceholderState{FS:\"widey/ROOT/jammy\", FSExists:true, IsPlaceholder:false, RawLocalPropertyValue:\"\"}"
Oct 14 13:27:45 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.S1Jn.XoW4]: starting command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy"
Oct 14 13:27:45 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.S1Jn.XoW4]: started command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy"
Oct 14 13:27:45 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.S1Jn.XoW4]: start waiting cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy"
Oct 14 13:27:45 widey zrepl[11869]: [snaproot][zfs.cmd][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.S1Jn.XoW4]: command exited without error cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy" systemtime_s="0.013" total_time_s="0.01350822" usertime_s="0"
Oct 14 13:27:45 widey zrepl[11869]: [snaproot][trace.data][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.S1Jn.XoW4]: finished span cli.(*Subcommand).run#0$snap-job#0$snap-job#0.snap-job snaproot.invocation-1.snap-job-do-prune.endpoint.(*Sender).ListFilesystems.zfs get -Hp -o name,property,value,source zrepl:placeholder widey/ROOT/jammy duration_s="0.0139077"
Oct 14 13:27:45 widey zrepl[11869]: [snaproot][endpoint][f+2x$Uy+Q$Uy+Q./cUP.Fcb+.kb6o.S1Jn]: placeholder state fs="&{[widey ROOT jammy]}" placeholder_state="&zfs.FilesystemPlaceholderState{FS:\"widey/ROOT/jammy\", FSExists:true, IsPlaceholder:false, RawLocalPropertyValue:\"\"}"
Oct 14 13:28:20 widey zrepl[11869]: [pull_bondi][zfs.cmd][f+2x$QnwL$QnwL.0Hvq.6aJQ.B4zD.lyWw.leDQ./VSt]: starting command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 13:28:20 widey zrepl[11869]: [pull_bondi][zfs.cmd][f+2x$QnwL$QnwL.0Hvq.6aJQ.B4zD.lyWw.leDQ./VSt]: started command cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 13:28:20 widey zrepl[11869]: [pull_bondi][zfs.cmd][f+2x$QnwL$QnwL.0Hvq.6aJQ.B4zD.lyWw.leDQ./VSt]: start waiting cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl"
Oct 14 13:28:20 widey zrepl[11869]: [pull_bondi][zfs.cmd][f+2x$QnwL$QnwL.0Hvq.6aJQ.B4zD.lyWw.leDQ./VSt]: command exited without error systemtime_s="0.023939" usertime_s="0.003989" cmd="zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl" total_time_s="0.028897699"
Oct 14 13:28:20 widey zrepl[11869]: [pull_bondi][trace.data][f+2x$QnwL$QnwL.0Hvq.6aJQ.B4zD.lyWw.leDQ./VSt]: finished span cli.(*Subcommand).run#0$active-side-job#0$active-side-job#0.active-side-job pull_bondi.invocation-1.replication.plan.endpoint.(*Receiver).ListFilesystems.zfs get -Hp -o name,property,value,source zrepl:placeholder wider/zrepl duration_s="0.029462667"

@problame
Member

> Got it. I didn't see a 1.21 version for amd64,

The 1.21 one is part of the first stage, i.e., in the group of 3 boxes, it's the one in the middle.

> Same error :( Searching for placeholder in syslog

Ah, I forgot to mention: with the commit that I added, it's necessary to update the sending side as well, i.e., you need to deploy the updated binary to both widey and bondi

@Halfwalker
Author

Ah, OK. Done. It errors out, but differently ...

Replication:
   Attempt #1
   Status: filesystem-error
   Last Run: 2023-10-14 15:53:25 -0400 EDT (lasted 29s)
   Problem: one or more of the filesystems encountered errors
     bondi3/zrepl                              PLANNING-ERROR (step 0/0, 0 B/0 B) sender does not have any versions
     bondi3/zrepl/halflapp                     STEP-ERROR (step 0/0, 0 B/0 B) parent(s) failed during initial replication: [bondi3/zrepl]
     bondi3/zrepl/halflapp/halflapp            STEP-ERROR (step 0/0, 0 B/0 B) parent(s) failed during initial replication: [bondi3/zrepl]
     bondi3/zrepl/halflapp/halflapp/home       STEP-ERROR (step 0/0, 0 B/0 B) parent(s) failed during initial replication: [bondi3/zrepl]
     bondi3/zrepl/halflapp/halflapp/home/alice STEP-ERROR (step 1/1, 0 B/6.9 GiB) parent(s) failed during initial replication: [bondi3/zrepl]
     bondi3/zrepl/ryzen2                       STEP-ERROR (step 0/0, 0 B/0 B) parent(s) failed during initial replication: [bondi3/zrepl]
     bondi3/zrepl/ryzen2/ryzen22               STEP-ERROR (step 0/0, 0 B/0 B) parent(s) failed during initial replication: [bondi3/zrepl]
     bondi3/zrepl/ryzen2/ryzen22/home          STEP-ERROR (step 0/0, 0 B/0 B) parent(s) failed during initial replication: [bondi3/zrepl]
     bondi3/zrepl/ryzen2/ryzen22/home/alice    STEP-ERROR (step 1/1, 0 B/44.1 GiB) parent(s) failed during initial replication: [bondi3/zrepl]

Pruning Sender:
   Status: ExecErr
   bondi3/zrepl                              ERROR: replication cursor bookmark does not exist (one successful replication is required before pruning works)

   bondi3/zrepl/halflapp                     skipped: filesystem is placeholder
   bondi3/zrepl/halflapp/halflapp            skipped: filesystem is placeholder
   bondi3/zrepl/halflapp/halflapp/home       skipped: filesystem is placeholder
   bondi3/zrepl/halflapp/halflapp/home/alice ERROR: replication cursor bookmark does not exist (one successful replication is required before pruning works)

   bondi3/zrepl/ryzen2                       skipped: filesystem is placeholder
   bondi3/zrepl/ryzen2/ryzen22               skipped: filesystem is placeholder
   bondi3/zrepl/ryzen2/ryzen22/home          skipped: filesystem is placeholder
   bondi3/zrepl/ryzen2/ryzen22/home/alice    ERROR: replication cursor bookmark does not exist (one successful replication is required before pruning works)

@problame
Member

problame commented Oct 14, 2023

Hm, yeah, the root_fs dataset bondi3/zrepl is not a zrepl-managed placeholder, so zrepl expects there to be snapshots.
That is how it should be, I won't change that behavior.

The following should do the trick:

   - type: source
     name: laptop_source
     filesystems:
-      "bondi3/zrepl": true
+      "bondi3/zrepl": false
+      "bondi3/zrepl<": true

If that doesn't work, try

   - type: source
     name: laptop_source
     filesystems:
-      "bondi3/zrepl": true
+      "bondi3/zrepl/halflapp<": true
+      "bondi3/zrepl/ryzen2<": true

Doc context: https://zrepl.github.io/configuration/filter_syntax.html#pattern-filter

(I haven't thought about filesystems filter syntax in a while, hence my uncertainty).

Please report back which one worked.
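For completeness, here's the first variant written out as a full source job - just a sketch assembled from the diff above plus your original config, with only the filesystems filter changed (if I read the filter syntax docs right, the more specific rule for the root itself wins over the subtree rule):

  - type: source
    name: laptop_source
    filesystems:
      # exclude the non-placeholder root (it has no snapshots), keep everything below it
      "bondi3/zrepl": false
      "bondi3/zrepl<": true
    snapshotting:
      type: manual
    send:
      encrypted: true
    serve:
      type: tls
      listen: :8541
      ca: /etc/zrepl/ca.crt
      cert: /etc/zrepl/bondi.crt
      key: /etc/zrepl/bondi.key
      client_cns:
        - widey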

@Halfwalker
Author

Bingo! That was it - replicating the halflapp dataset now. I'll post here if the ryzen2 one fails.

Replication:
   Attempt #1
   Status: fan-out-filesystems
   Started: 2023-10-14 16:08:12 -0400 EDT (lasting 53s)
   Progress: [==\------------------------------------------------] 2.1 GiB / 51.0 GiB @ 113.9 MiB/s (7m 18s remaining)
     bondi3/zrepl/halflapp                     DONE (step 0/0, 0 B/0 B)
     bondi3/zrepl/halflapp/halflapp            DONE (step 0/0, 0 B/0 B)
     bondi3/zrepl/halflapp/halflapp/home       DONE (step 0/0, 0 B/0 B)
   * bondi3/zrepl/halflapp/halflapp/home/alice STEPPING (step 1/1, 2.1 GiB/6.9 GiB) next: full send @zrepl_2023-10-09_16:00:30

@Halfwalker
Author

OK, the ryzen2 one seemed to back up fine, so that's good. I set up a timer to trigger backups, and came back after a while to see that zrepl had been stuck in the planning stage for over an hour.

bondi seemed to have trouble - zpool status -v was hanging on the bondi3 pool, zfs get all was just hanging. Did a full power-cycle (I did try turning it off and back on again), and everything seemed OK. But starting a replication from widey got stuck again, and I see spl panics in the dmesg -T output on bondi

[Sat Oct 14 18:16:48 2023] Hardware name: Supermicro X10DRH LN4/X10DRH-CLN4, BIOS 3.4 08/20/2021
[Sat Oct 14 18:16:48 2023] Call Trace:
[Sat Oct 14 18:16:48 2023]  dump_stack+0x6d/0x8b
[Sat Oct 14 18:16:48 2023]  spl_dumpstack+0x29/0x2b [spl]
[Sat Oct 14 18:16:48 2023]  spl_panic+0xd4/0xfc [spl]
[Sat Oct 14 18:16:48 2023]  ? zap_lockdir+0x8c/0xb0 [zfs]
[Sat Oct 14 18:16:48 2023]  ? zap_add+0x7b/0xa0 [zfs]
[Sat Oct 14 18:16:48 2023]  dsl_dir_create_sync+0x20e/0x290 [zfs]
[Sat Oct 14 18:16:48 2023]  dsl_dataset_create_sync+0x52/0x380 [zfs]
[Sat Oct 14 18:16:48 2023]  dmu_recv_begin_sync+0x3ae/0xa70 [zfs]
[Sat Oct 14 18:16:48 2023]  ? spa_get_slop_space+0x4e/0x90 [zfs]
[Sat Oct 14 18:16:48 2023]  dsl_sync_task_sync+0xb6/0x100 [zfs]
[Sat Oct 14 18:16:48 2023]  dsl_pool_sync+0x3d6/0x4f0 [zfs]
[Sat Oct 14 18:16:48 2023]  spa_sync+0x562/0xff0 [zfs]
[Sat Oct 14 18:16:48 2023]  ? mutex_lock+0x12/0x40
[Sat Oct 14 18:16:48 2023]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[Sat Oct 14 18:16:48 2023]  txg_sync_thread+0x26d/0x3f0 [zfs]
[Sat Oct 14 18:16:48 2023]  ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[Sat Oct 14 18:16:48 2023]  thread_generic_wrapper+0x79/0x90 [spl]
[Sat Oct 14 18:16:48 2023]  kthread+0x121/0x140
[Sat Oct 14 18:16:48 2023]  ? __thread_exit+0x20/0x20 [spl]
[Sat Oct 14 18:16:48 2023]  ? kthread_park+0x90/0x90
[Sat Oct 14 18:16:48 2023]  ret_from_fork+0x35/0x40

I don't know if it's related.

@Halfwalker
Author

Point of interest - it's definitely the replication from ryzen2 to bondi that kicks off that spl panic - as soon as it tries to replicate, it kills zfs on bondi. I watched dmesg -Tw on bondi, and the moment zrepl on ryzen2 started replication - boom, panic.

@problame
Member

problame commented Oct 15, 2023

Well, nice to hear that it works. But this is definitely a ZFS bug, zrepl just uses the CLI and that should never cause panics.

We may still need to work around the issue. What's the OS and ZFS version?

@Halfwalker
Author

Doh - just remembered, I had NOT replaced the zrepl binary on ryzen2 ... Did that, but it still causes the panic when replication starts.

bondi is Ubuntu 18.04, using ZFS from the jonathan repo

~  dpkg -l zfs\* | fgrep ii
ii  zfs-dkms       2.1.6-0york1~18.04 all          OpenZFS filesystem kernel modules for Linux
ii  zfs-initramfs  2.1.6-0york1~18.04 all          OpenZFS root filesystem capabilities for Linux - initramfs
ii  zfsutils-linux 2.1.6-0york1~18.04 amd64        command-line tools to manage OpenZFS filesystems

~  zfs --version
zfs-2.1.6-0york1~18.04
zfs-kmod-2.1.6-0york1~18.04

ryzen2 is Ubuntu 22.04, also with ZFS from jonathan

❯ dpkg -l zfs\* | fgrep ii
ii  zfs-dracut     2.1.6-0york1~22.04 all          OpenZFS root filesystem capabilities for Linux - dracut
ii  zfs-zed        2.1.6-0york1~22.04 amd64        OpenZFS Event Daemon
ii  zfsutils-linux 2.1.6-0york1~22.04 amd64        command-line tools to manage OpenZFS filesystems

❯ zfs --version
zfs-2.1.6-0york1~22.04
zfs-kmod-2.1.9-2ubuntu1.1

Panic on bondi looks like this

[Sun Oct 15 11:09:51 2023] VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
[Sun Oct 15 11:09:51 2023] PANIC at dsl_dir.c:951:dsl_dir_create_sync()
[Sun Oct 15 11:09:51 2023] Showing stack for process 13807
[Sun Oct 15 11:09:51 2023] CPU: 21 PID: 13807 Comm: txg_sync Tainted: P           OE     5.4.0-164-generic #181~18.04.1-Ubuntu
[Sun Oct 15 11:09:51 2023] Hardware name: Supermicro X10DRH LN4/X10DRH-CLN4, BIOS 3.4 08/20/2021
[Sun Oct 15 11:09:51 2023] Call Trace:
[Sun Oct 15 11:09:51 2023]  dump_stack+0x6d/0x8b
[Sun Oct 15 11:09:51 2023]  spl_dumpstack+0x29/0x2b [spl]
[Sun Oct 15 11:09:51 2023]  spl_panic+0xd4/0xfc [spl]
[Sun Oct 15 11:09:51 2023]  ? zap_lockdir+0x8c/0xb0 [zfs]
[Sun Oct 15 11:09:51 2023]  ? zap_add+0x7b/0xa0 [zfs]
[Sun Oct 15 11:09:51 2023]  dsl_dir_create_sync+0x20e/0x290 [zfs]
[Sun Oct 15 11:09:51 2023]  dsl_dataset_create_sync+0x52/0x380 [zfs]
[Sun Oct 15 11:09:51 2023]  dmu_recv_begin_sync+0x3ae/0xa70 [zfs]
[Sun Oct 15 11:09:51 2023]  ? spa_get_slop_space+0x4e/0x90 [zfs]
[Sun Oct 15 11:09:51 2023]  dsl_sync_task_sync+0xb6/0x100 [zfs]
[Sun Oct 15 11:09:51 2023]  dsl_pool_sync+0x3d6/0x4f0 [zfs]
[Sun Oct 15 11:09:51 2023]  spa_sync+0x562/0xff0 [zfs]
[Sun Oct 15 11:09:51 2023]  ? mutex_lock+0x12/0x40
[Sun Oct 15 11:09:51 2023]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[Sun Oct 15 11:09:51 2023]  txg_sync_thread+0x26d/0x3f0 [zfs]
[Sun Oct 15 11:09:51 2023]  ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[Sun Oct 15 11:09:51 2023]  thread_generic_wrapper+0x79/0x90 [spl]
[Sun Oct 15 11:09:51 2023]  kthread+0x121/0x140
[Sun Oct 15 11:09:51 2023]  ? __thread_exit+0x20/0x20 [spl]
[Sun Oct 15 11:09:51 2023]  ? kthread_park+0x90/0x90
[Sun Oct 15 11:09:51 2023]  ret_from_fork+0x35/0x40

I think it's something in the dataset itself doing it ... I moved back to the original zrepl binary, and the same thing is happening. So I'm going to blow away those datasets (they're backups) and start clean with the test zrepl binary.

@Halfwalker
Author

That seems to have done the trick ... Changed the bondi config to point to a new dataset. Replication from both ryzen2 and halflapp to bondi worked fine with the new binary. Replication from bondi to widey is ongoing, and seems to be working fine.

So something got mucked up in the original bondi3/zrepl dataset hierarchy, enough to cause spl to freak out and panic when trying to push an incremental stream into it.

Question - can multiple pull jobs run against the same source job at the same time?

@Halfwalker
Author

Sort of answering my own question above ... From the docs https://zrepl.github.io/quickstart/fan_out_replication.html it appears that each pull client wants its own source job on the server. But the config seems to indicate that you can have multiple client_cns in the job.

I noticed that with 2 clients (widey and another one, wideload) pulling at different times from bondi, one or the other would start failing with missing snapshots. Which was odd, since it was only ryzen2 that was managing the pruning on bondi.

Refresher ...

  • ryzen2 pushes to bondi into bondi3/zrepl, managing local snapshots in a snap job and bondi snapshots in the push job via pruning.keep_receiver
  • bondi has a source job for bondi3/zrepl
  • widey has a pull job to bondi - pruning.keep_sender is a no-op, while pruning.keep_receiver does regular grid pruning on the target dataset on widey

I'm experimenting with a 2nd box, wideload, to ALSO pull those backups.
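As a sketch of what I think the fan-out quickstart is after on bondi - one source job per pulling client, with snapshots still managed by the ryzen2 push/snap jobs - it would look something like this (job names and the second listen port are just made up for illustration):

  - type: source
    name: source_for_widey
    filesystems:
      "bondi3/zrepl": false
      "bondi3/zrepl<": true
    snapshotting:
      type: manual        # snapshots are created elsewhere (ryzen2 push/snap jobs)
    send:
      encrypted: true
    serve:
      type: tls
      listen: :8541
      ca: /etc/zrepl/ca.crt
      cert: /etc/zrepl/bondi.crt
      key: /etc/zrepl/bondi.key
      client_cns:
        - widey

  - type: source
    name: source_for_wideload
    filesystems:
      "bondi3/zrepl": false
      "bondi3/zrepl<": true
    snapshotting:
      type: manual
    send:
      encrypted: true
    serve:
      type: tls
      listen: :8542       # assumed second port so each client gets its own listener
      ca: /etc/zrepl/ca.crt
      cert: /etc/zrepl/bondi.crt
      key: /etc/zrepl/bondi.key
      client_cns:
        - wideload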

Kind of another question ... The pull jobs on widey and wideload had guarantee_incremental set, but I switched it to guarantee_resumability. Now there seems to be a persistent error (yet replication works):

bondi3/zrepl/halflapp/halflapp/home/alice ERROR: replication cursor bookmark does not exist (one successful replication is required before pruning works)

What's the right way to clean this up?
