Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(bug) flaky test - Test_websocket_flag_change #1790

Open
bearrito opened this issue Apr 23, 2024 · 2 comments
Open

(bug) flaky test - Test_websocket_flag_change #1790

bearrito opened this issue Apr 23, 2024 · 2 comments
Labels
bug Something isn't working p1 High priority

Comments

@bearrito
Copy link
Contributor

bearrito commented Apr 23, 2024

Observed behaviour

I started looking at this because my pr code coverage failed. The failure was unrelated.

https://github.com/thomaspoignant/go-feature-flag/actions/runs/8790234882/job/24121807507?pr=1770

panic: test timed out after 10m0s
running tests:
	Test_websocket_flag_change (10m0s)
	Test_websocket_flag_change/Update_1_flag (10m0s)

goroutine 360 [running]:
testing.(*M).startAlarm.func1()
	/opt/hostedtoolcache/go/1.21.9/x64/src/testing/testing.go:2259 +0x3b9
created by time.goFunc
	/opt/hostedtoolcache/go/1.21.9/x64/src/time/sleep.go:176 +0x2d

goroutine 1 [chan receive, 9 minutes]:
testing.(*T).Run(0xc0005321a0, {0x1ba2c2b?, 0x587d9c?}, 0x1c90f68)
	/opt/hostedtoolcache/go/1.21.9/x64/src/testing/testing.go:1649 +0x3c8
testing.runTests.func1(0x2cdde60?)
	/opt/hostedtoolcache/go/1.21.9/x64/src/testing/testing.go:2054 +0x3e
testing.tRunner(0xc0005321a0, 0xc0005bfc38)
	/opt/hostedtoolcache/go/1.21.9/x64/src/testing/testing.go:1595 +0xff
testing.runTests(0xc0004bc960?, {0x2cbdf20, 0x8, 0x8}, {0x41989f?, 0xc0005bfcf8?, 0x2cdc880?})
	/opt/hostedtoolcache/go/1.21.9/x64/src/testing/testing.go:2052 +0x445
testing.(*M).Run(0xc0004bc960)
	/opt/hostedtoolcache/go/1.21.9/x64/src/testing/testing.go:1925 +0x636
main.main()
	_testmain.go:97 +0x1bf

goroutine 19 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000247c80)
	/home/runner/go/pkg/mod/[email protected]/stats/view/worker.go:292 +0x9f
created by go.opencensus.io/stats/view.init.0 in goroutine 1
	/home/runner/go/pkg/mod/[email protected]/stats/view/worker.go:34 +0x8d

I believe there is a data race in wsFlagChange I can repo this locally but it's very difficult to trigger.

Expected Behavior

Tests to pass.

Steps to reproduce

No response

@bearrito bearrito added bug Something isn't working needs-triage A priority should be added to the issue labels Apr 23, 2024
@thomaspoignant
Copy link
Owner

Hey @bearrito thanks for sharing this.
I would be interested to understand how you reproduce it because this is not something occurring on my side.

@thomaspoignant thomaspoignant added p1 High priority and removed needs-triage A priority should be added to the issue labels Apr 23, 2024
@bearrito
Copy link
Contributor Author

It looks like what is happening is a concurrent write to the webscocket connection

https://github.com/thomaspoignant/go-feature-flag/blob/main/cmd/relayproxy/controller/ws_flag_change.go#L87
and

https://github.com/thomaspoignant/go-feature-flag/blob/main/cmd/relayproxy/service/websocket.go#L45

==================
WARNING: DATA RACE
Read at 0x00c0001f0338 by goroutine 32:
  github.com/gorilla/websocket.(*Conn).beginMessage()
      /home/bstrausser/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:492 +0x6d
  github.com/gorilla/websocket.(*Conn).WriteMessage()
      /home/bstrausser/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:779 +0x17b
  github.com/thomaspoignant/go-feature-flag/cmd/relayproxy/controller.(*wsFlagChange).pingPongLoop()
      /home/bstrausser/Git/go-feature-flag/cmd/relayproxy/controller/ws_flag_change.go:87 +0x35e
  github.com/thomaspoignant/go-feature-flag/cmd/relayproxy/controller.(*wsFlagChange).Handler.func3()
      /home/bstrausser/Git/go-feature-flag/cmd/relayproxy/controller/ws_flag_change.go:58 +0x71

Previous write at 0x00c0001f0338 by goroutine 27:
  github.com/gorilla/websocket.(*Conn).NextWriter()
      /home/bstrausser/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:538 +0x177
  github.com/gorilla/websocket.(*Conn).WriteJSON()
      /home/bstrausser/go/pkg/mod/github.com/gorilla/[email protected]/json.go:24 +0x7e
  github.com/thomaspoignant/go-feature-flag/cmd/relayproxy/service.(*websocketServiceImpl).BroadcastFlagChanges()

I can force this by doing one of two things

Number 1
Add a tight loop at the https://github.com/thomaspoignant/go-feature-flag/blob/main/cmd/relayproxy/controller/ws_flag_change_test.go#L120

for i := 0; i < 10000; i++ {
				websocketService.BroadcastFlagChanges(tt.flagChange)
				_, _, err := ws.ReadMessage()
				assert.NoError(t, err)
			}

Set the ping pong frequency to something high https://github.com/thomaspoignant/go-feature-flag/blob/main/cmd/relayproxy/controller/ws_flag_change.go#L77 I did 100ms

Eventually the Broadcast and ping-pong will race. I takes a lot of loops and fast ping-pong rate so if this is real I bet this is rare in production.

Number 2
A sleep statement or debugger breakpoint before https://github.com/thomaspoignant/go-feature-flag/blob/main/cmd/relayproxy/controller/ws_flag_change_test.go#L121 will also cause it. I assume this stalls the broadcast write and gives time to the ping-ponger.

@thomaspoignant thomaspoignant removed their assignment May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working p1 High priority
Projects
None yet
Development

No branches or pull requests

2 participants