Open census Span to View converter #1239

Open
wants to merge 3 commits into base: master

Conversation

@tcoupland (Contributor) commented Sep 25, 2018

Span to View converter

We have added open census span calls in interesting parts of the code,
but without creating a jaeger or zipkin cluster they are not going
anywhere.

Here we add a SpanConverter that converts our Spans into OpenCensus
distribution Views.

It contains an in memory map of the Measures for these Views,
recording values for the length of time reported to be spent in a
Span.

This list has a hard limit of 100, as there is a risk of a metric
explosion should an element of the system include some ID value in
its Span names. Gin does this and those spans are discarded, but there
might be others that we'll have to deal with in the future as they are discovered.
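For orientation, a minimal sketch of what such a converter can look like as an OpenCensus trace exporter (package name, view names, bucket bounds, and the 100 cap here are illustrative assumptions, not the exact code in this PR):

```go
package spanconv

import (
	"context"
	"sync"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

// spanConverter receives finished spans and records their duration into a
// per-span-name distribution view. Sketch only.
type spanConverter struct {
	mu       sync.Mutex
	measures map[string]*stats.Float64Measure
}

func (c *spanConverter) ExportSpan(sd *trace.SpanData) {
	c.mu.Lock()
	m, ok := c.measures[sd.Name]
	if !ok {
		if len(c.measures) >= 100 { // hard cap against metric explosion
			c.mu.Unlock()
			return
		}
		m = stats.Float64(sd.Name+"_span_time", "The span length in milliseconds", "ms")
		view.Register(&view.View{
			Name:        sd.Name + "_span_time",
			Description: "Distribution of time spent in this span",
			Measure:     m,
			Aggregation: view.Distribution(1, 10, 100, 1000, 10000), // illustrative bounds
		})
		c.measures[sd.Name] = m
	}
	c.mu.Unlock()

	millis := float64(sd.EndTime.Sub(sd.StartTime)) / float64(time.Millisecond)
	stats.Record(context.Background(), m.M(millis))
}
```

Wiring such a converter in would then just be a trace.RegisterExporter call at server start-up.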

@rdallman (Contributor)

nice, thanks. will take a look, need something like this. slightly relevant maybe - am generally slightly concerned about 'metrics overload' from this (a lot of these won't be all that useful), curious if you have any thoughts about being able to turn certain 'views' (as opencensus calls them) on/off on a running server.

@tcoupland (Contributor, Author)

Metric overload is a worry for sure. It's easy to make them, but sometimes hard to track down where they come from.

I would say that if this creates a metric that is useless, then the span was going to be useless anyway. The idea here is to answer the "where is all the time spent" question, in particular when we start load testing and want to know the bottleneck.

The other thing we could think about is using span annotations to create a sort of debug/info levelling system. Then we could selectively enable/disable per environment.

If you fire this branch up you can see how many are going to get made after running a few commands.

@tcoupland changed the title from "Prometheus exporter for open census spans" to "Open census Span to View converter" on Sep 26, 2018
@skinowski (Member)

Distribution buckets should be configurable, e.g.:
https://github.com/fnproject/fn/blob/master/cmd/fnserver/main.go#L30
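For reference, OpenCensus distribution buckets are just a slice of upper bounds handed to the aggregation; a hedged sketch of making them caller-configurable (the helper name and fallback bounds here are made up, not the values in main.go):

```go
import "go.opencensus.io/stats/view"

// spanTimeAggregation builds the span-time aggregation from whatever
// latency buckets the server is already configured with. Hypothetical helper.
func spanTimeAggregation(latencyBuckets []float64) *view.Aggregation {
	if len(latencyBuckets) == 0 {
		// fall back to some default bounds (milliseconds); illustrative only
		latencyBuckets = []float64{1, 10, 50, 100, 250, 500, 1000, 10000, 60000}
	}
	return view.Distribution(latencyBuckets...)
}
```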

m, ok := c.measures[sig]
c.viewsMu.Unlock()

if !ok {
Member
Concurrent access without a lock: outside the lock this could register/insert more than once.

Contributor Author
Yeah, it looked wrong to me too when I lifted it from the main prom exporter.

In actual fact there is no risk here, only that a view gets created twice, but OC will take care of that with its deduplication logic.

@skinowski (Member) commented Sep 26, 2018

I think this should be optional, similar to other views (API views, agent views, etc.). Maybe add a new registration function with distribution buckets? If we are concerned about metrics overload, perhaps the registration function could take an optional list of span names to be excluded/included?
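A hedged sketch of the include/exclude idea as a filtering wrapper around the exporter (names here are hypothetical; the bucket half of the suggestion is just the view.Distribution piece shown earlier):

```go
import "go.opencensus.io/trace"

// spanFilterExporter wraps another trace exporter and drops spans whose
// names fail the caller-supplied filter, so only wanted views ever appear.
type spanFilterExporter struct {
	inner   trace.Exporter
	include func(spanName string) bool // nil means "export everything"
}

func (e *spanFilterExporter) ExportSpan(sd *trace.SpanData) {
	if e.include != nil && !e.include(sd.Name) {
		return // excluded by the registration-time filter
	}
	e.inner.ExportSpan(sd)
}

// RegisterFilteredSpanViews is a hypothetical registration hook along the
// lines suggested above: the caller decides which span names get views.
func RegisterFilteredSpanViews(conv trace.Exporter, include func(string) bool) {
	trace.RegisterExporter(&spanFilterExporter{inner: conv, include: include})
}
```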

@tcoupland (Contributor, Author)

Thanks for the pointer to our dist config, will hook that up.

I thought about a whitelist, but that would get painful quickly imo. The size limit should stop it going nuts.

Will take a look at optionality. It could have its own top-level flag, like the prom one.

@skinowski (Member)

@mantree Sounds good. The registration function could even take an optional span filter function. Then we can remove the url handling, sanitize, the 100-view limit, and the 100-character limit, and leave those up to main.go/the components.

spanTimeNanos := sd.EndTime.Sub(sd.StartTime)
spanTimeMillis := float64(int64(spanTimeNanos / time.Millisecond))

stats.Record(context.Background(), m.M(spanTimeMillis))
Contributor
we should rip some of the attributes off the span data as well and put them onto the context. this may be slightly painful, but tags are likely important so that we have context for eg function_id, app_id... these seem to map to the fields SpanData.Attributes and tag.Map, however it's not immediately clear from digging around the docs.
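If span attributes were populated, the mapping described here might look roughly like this (a hedged sketch; it also assumes the corresponding views list these tag keys in TagKeys, and the names are illustrative):

```go
import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
	"go.opencensus.io/trace"
)

// recordWithSpanTags copies string attributes off the SpanData into
// OpenCensus tags before recording, so views could break the measure down
// by e.g. app_id / fn_id. Sketch only; key handling is illustrative.
func recordWithSpanTags(m *stats.Float64Measure, sd *trace.SpanData, millis float64) {
	mutators := make([]tag.Mutator, 0, len(sd.Attributes))
	for k, v := range sd.Attributes {
		s, isString := v.(string)
		if !isString {
			continue // only string attributes map cleanly onto tag values
		}
		key, err := tag.NewKey(k)
		if err != nil {
			continue
		}
		mutators = append(mutators, tag.Upsert(key, s))
	}
	ctx, err := tag.New(context.Background(), mutators...)
	if err != nil {
		ctx = context.Background()
	}
	stats.Record(ctx, m.M(millis))
}
```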

Contributor Author
Attributes, MessageEvents and Annotations are all empty for our spans.


if !ok {
	m = stats.Float64(sig+"_span_time", "The span length in milliseconds", "ms")
	v := &view.View{
Contributor
This seems like the wrong place for view registration. We should maybe/probably stats.Record everything and not limit the stats.Record bit (it's gated by rejectSpan as well as getMeasure and an arbitrary 100 limit). At the least, stats.Record and view registration are separate processes and should be kept separate. Tolga linked to some pointers for how the other stuff is handled at the moment: initially we're mostly registering views during initialization. We can maybe make that friendlier (i.e. no reboot required) later, but it's relatively nice to work with for the ability to toggle views on/off for things that extend fn, for example. That does mean we end up whitelisting things in, which I kind of agree is painful, but at the same time it makes us organize things (like stat/trace names), so it's a net good I think. With that in place I think we can remove the limits (100).

Contributor
(also won't require a lock - we shouldn't be locking for a stats library, that is on them!)

Contributor Author
Everything has already been recorded; that already happened to produce the SpanData. This is copying the recorded values up into the Measure.

}

//Gin creates spans for all paths, containing ID values.
//We can safely discard these, as other histograms are being created for them.
Contributor
space between first char always (sorry, am that guy)... // We

const labelKeySizeLimit = 100

// sanitize returns a string that is truncated to 100 characters if it's too
// long, and replaces non-alphanumeric characters with underscores.
Contributor
this is all stuff we control right? we're the ones making the names up? I don't think we need this (and if we do, we should probably be using tags for some of the stuff in the name?)

Contributor Author
trouble is we don't control them all. See the problem with the Gin generated spans.
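For illustration, a sanitize along the lines described in the doc comment might look like this (a sketch under those stated assumptions, not necessarily the PR's implementation):

```go
import "unicode"

const labelKeySizeLimit = 100

// sanitize truncates a name to labelKeySizeLimit characters and replaces
// anything that is not a letter or digit with an underscore.
func sanitize(name string) string {
	runes := []rune(name)
	if len(runes) > labelKeySizeLimit {
		runes = runes[:labelKeySizeLimit]
	}
	for i, r := range runes {
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			runes[i] = '_'
		}
	}
	return string(runes)
}
```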

@rdallman (Contributor)

agree with some of Tolga's points, think this is the right direction though - getting things into stats.Record looks pretty easy, which is sweet (dare I show you our opentracing->prometheus exporter from last October?)

@tcoupland (Contributor, Author)

OK, I've looked at pulling the Latency config from main, but it creates a cyclic dependency, so it's not going to happen. I've had a look at passing the value down through the various layers to try and bring it to the Converter, but that ends up being a total mess; it just doesn't feel worth it.

I think getting this closer to the suggestions would really involve replacing the Spans with normal metrics. The idea here is to get some value from what we have, if they prove useful then that would strengthen the case to get Jaeger/Zipkin setup and make proper use of them.

@rdallman (Contributor)

> The idea here is to get some value from what we have, if they prove useful then that would strengthen the case to get Jaeger/Zipkin setup and make proper use of them.

Hmm, not sure about the motivation here if this is the case; I think they're two different goals. We lose all causal context by way of this method, and I think Jaeger would be useful in its own right without this (actually, I know it will be useful, because I've used it to fix bugs and performance tweaks already...). We just get a different set of data with Jaeger; especially since this is averaging down into distribution buckets, tail latency samples are lost -- Jaeger is immensely useful in this respect, even when sampling 0.01% of traffic. As for this PR, it covers up holes in our metrics for things we're already gathering for traces, which is really useful too; docker latencies, for example, are a great example of when both are useful, and we have doubled logic to get these things right now (i.e. before this). Maybe this is ok and what we should be doing instead; I'm not sure I have the answer, we can discuss (it sounds like you've run into some challenges, trust your judgment on this...).

@@ -743,6 +744,14 @@ func WithPrometheus() Option {
}
}

func WithSpanConverter() Option {
Member
Can this take an argument for latency buckets? Not the end of the world, but it would be good to keep buckets consistent with the service side.
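A hedged sketch of the shape being asked for (the Option signature and the newSpanConverter helper here are assumptions about the surrounding code, not the actual fn API):

```go
// WithSpanConverter, sketched with a latency-bucket parameter so the span
// views share bucket bounds with the rest of the server. Hypothetical.
func WithSpanConverter(latencyBuckets []float64) Option {
	return func(ctx context.Context, s *Server) error {
		// newSpanConverter is an assumed constructor for the converter,
		// taking the same bucket bounds used by the other latency views.
		trace.RegisterExporter(newSpanConverter(latencyBuckets))
		return nil
	}
}
```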
