
[Research] Tracing for OpenFaaS Core services #1354

Open
alexellis opened this issue Oct 13, 2019 · 2 comments
Comments

@alexellis
Member

There has been interest in tracing expressed by the community through Slack and issues over the last year and a half. So far there has been no hard requirement or urgency, so the feature hasn't been prioritised. This issue looks into the cost and scope of implementing tracing across the core services in OpenFaaS.

There are also some pros/cons to emitting spans; one of the cons is tight coupling to an SDK from OpenTracing, OpenCensus or, when it's ready, OpenTelemetry, in every piece of code we write in the project.

Expected Behaviour

  1. Traces to be propagated, even if not emitted.
  2. Tracing to be available as an opt-in feature, with configuration for the "all-in-one" Jaeger image etc.

Current Behaviour

It is unclear whether traces are currently propagated.

Possible Solution

I'd like someone who has worked with OpenTracing or similar to provide a summary of the changes and the pros/cons of implementing the above. For those requesting this feature, please feel free to +1 or leave your use-case in a comment.

Prometheus metrics are currently emitted for core services and functions.

@LucasRoesler
Member

We have been using OpenTracing at Contiamo for most of the last two years. Here are a few things we learned as we implemented it:

  1. Every function that takes a ctx should immediately start a Span.
  2. Any function that does I/O should be passed a context as a parameter so that it can start a Span; this also allows for cancellation propagation.
    1. Even if a function isn't doing I/O, if it is public and represents a major logic component of the application, we will often pass a context anyway.
  3. Feeling like you need to create more than one span in a single function usually indicates that you need a new function.
    1. If you follow that rule, implementation is easy: every function with I/O or a ctx starts a span and then immediately defers span.Finish().
    2. If you can't defer the span.Finish(), it suggests that you really have two different functions merged into one, and the logic to get span.Finish() right can easily be buggy or broken by a future change. I have found that the end result is actually cleaner code, because it forces a cleaner separation of concerns and a much more straightforward architecture.
    3. We don't usually defer span.Finish() directly; instead we have a helper, FinishSpan(span, err), which closes the span and also tags it with the error content if err is not nil. This helper has been very important for consistency and for reducing boilerplate.
  4. Good helpers make a big difference: use middleware to parse the incoming headers and start spans from them. We have a wrapper for db drivers to wrap our SQL statements. Similarly, add the context propagation to the http Client constructor so that callers do not need to set the headers manually. Basically, most developers should not need to think about tracing other than making sure they start and finish the span.
    1. We actually still start new spans within the actual handler implementations, because the middleware span will capture other middlewares as well as the handler implementation, and we want to distinguish between the method we wrote and the code that wraps it.

When adding tracing to an existing project we started by:

  1. adding middleware for the http handlers (db driver etc.), to get as many automatic/framework-level traces as we can;
  2. finding the places with I/O and passing a context, using context.TODO() as a placeholder until a real one can be passed down;
  3. visualizing what it looks like, then asking what you wish you could see in the trace;
  4. adding context to those areas;
  5. repeating steps 3 and 4, refactoring and adding context, until you are happy that the span visualization makes sense and can communicate bugs.

Most of the above was summarized from a conversation that Alex and I had on Slack.

@mingfang

Any update on this?
