Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colexecerror: do not catch runtime panics in crdb_test builds #124098

Merged
merged 1 commit into from
May 21, 2024

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented May 14, 2024

CatchVectorizedRuntimeError performs its stack-walking in order to catch all panics originating from within the vectorized execution engine, including runtime panics. The purpose of this is to limit the impact of a bug in vectorized execution to failure of a single execution rather than crashing the node.

During tests, however, we'd like to make these runtime panics louder, since we believe they represent bugs in vectorized execution. So let's not catch panics under crdb_test builds unless they're wrapped.

Epic: None

Release note: None

Copy link

blathers-crl bot commented May 14, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2 michae2 changed the title wip colexecerror: do not catch runtime panics in crdb_test builds May 14, 2024
@michae2 michae2 marked this pull request as ready for review May 14, 2024 17:28
@michae2 michae2 requested a review from a team as a code owner May 14, 2024 17:28
@michae2 michae2 requested review from mgartner, yuzefovich and a team and removed request for a team May 14, 2024 17:28
@michae2
Copy link
Collaborator Author

michae2 commented May 14, 2024

This came up during a postmortem. I'm not sure yet how much noise it will create in test failures, but curious to hear people's thoughts.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it's an interesting idea, and I'm in favor. I wouldn't expect this to have much noise - looks like only a single test needed an adjustment since it relies on the runtime panic.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/colexecerror/error_test.go line 43 at r1 (raw file):

	// Use the release-build panic-catching behavior instead of the
	// crdb_test-build behavior.
	defer func(prev bool) {

nit: it might be cleaner to have this logic in colexecerror/error.go and only have a single exported method. I'm thinking about something like this:

var testingKnobShouldCatchPanic bool

// ProductionBehaviorForTests makes the panic-catcher behave in tests
// in the same way as it does in production builds.
func ProductionBehaviorForTests() func() {
  old := testingKnobShouldCatchPanic
  testingKnobShouldCatchPanic = true
  return func() {
    testingKnobShouldCatchPanic = old
  }
}
...
// in tests
defer colexecerror.ProductionBehaviorForTests()()

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)


pkg/sql/colexecerror/error_test.go line 43 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it might be cleaner to have this logic in colexecerror/error.go and only have a single exported method. I'm thinking about something like this:

var testingKnobShouldCatchPanic bool

// ProductionBehaviorForTests makes the panic-catcher behave in tests
// in the same way as it does in production builds.
func ProductionBehaviorForTests() func() {
  old := testingKnobShouldCatchPanic
  testingKnobShouldCatchPanic = true
  return func() {
    testingKnobShouldCatchPanic = old
  }
}
...
// in tests
defer colexecerror.ProductionBehaviorForTests()()

Nice, this is much cleaner! Done.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/colexecerror/error_test.go line 43 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Nice, this is much cleaner! Done.

nit: I think one set of parenthesis is missing when using it.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will panics injected for testing cause problems with this?

Reviewed 1 of 4 files at r1, 1 of 3 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will panics injected for testing cause problems with this?

  • Panics injected by calling crdb_internal.force_panic('foo') from SQL will continue to crash the node, as they did before.
  • Panics injected outside the vectorized execution engine will continue to behave as they did before (either crashing the node if they are not caught somewhere else, or becoming an error if they are caught and wrapped with StorageError or InternalError or ExpectedError).

This should only affect panics injected within the vectorized execution engine that are not wrapped, which should be rare (and I believe represent bugs in the vectorized engine).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and I believe represent bugs in the vectorized engine).

Actually, what I meant to say, since we're talking about injected panics is not that they represent bugs, but that they should be handled by adding the call to colexecerror.ProductionBehaviorForTests(). And these should only be tests within the colexec* packages, which I think I've found.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have TestVectorizedPanics linter that requires that all code within the vectorized engine (which has ./pkg/sql/col* prefix in the path) to use either InternalError or ExpectedError. If we have an unwrapped panic, I think it's fair for us to crash in test builds.

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

@michae2
Copy link
Collaborator Author

michae2 commented May 17, 2024

@mgartner ok if I bors this?

@michae2
Copy link
Collaborator Author

michae2 commented May 20, 2024

I'll go ahead, and we can revert this if it breaks something.

bors r=yuzefovich

craig bot pushed a commit that referenced this pull request May 20, 2024
124098: colexecerror: do not catch runtime panics in crdb_test builds r=yuzefovich a=michae2

CatchVectorizedRuntimeError performs its stack-walking in order to catch *all* panics originating from within the vectorized execution engine, including runtime panics. The purpose of this is to limit the impact of a bug in vectorized execution to failure of a single execution rather than crashing the node.

During tests, however, we'd like to make these runtime panics louder, since we believe they represent bugs in vectorized execution. So let's not catch panics under crdb_test builds unless they're wrapped.

Epic: None

Release note: None

Co-authored-by: Michael Erickson <[email protected]>
@craig
Copy link
Contributor

craig bot commented May 20, 2024

Build failed:

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only affect panics injected within the vectorized execution engine that are not wrapped, which should be rare (and I believe represent bugs in the vectorized engine).

Thanks for explaining! SGTM.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

CatchVectorizedRuntimeError performs its stack-walking in order to catch
*all* panics originating from within the vectorized execution engine,
including runtime panics. The purpose of this is to limit the impact of
a bug in vectorized execution to failure of a single execution rather
than crashing the node.

During tests, however, we'd like to make these runtime panics louder,
since we believe they represent bugs in vectorized execution. So let's
not catch panics under crdb_test builds unless they're wrapped.

Epic: None

Release note: None
@michae2
Copy link
Collaborator Author

michae2 commented May 21, 2024

bors r=yuzefovich,mgartner

@craig craig bot merged commit 1749621 into cockroachdb:master May 21, 2024
20 of 22 checks passed
@michae2 michae2 deleted the testbuildpanic branch May 21, 2024 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants