Allow metric metadata to be propagated via Remote Write. #6815

Merged
merged 12 commits on Nov 19, 2020

Conversation

@gotjosh (Member) commented Feb 13, 2020

Following up on #6395, we'd like remote-write implementations to be able to take advantage of this API. To do so, we need to propagate the scraped metrics' metadata via remote write.

This PR takes a stab at that using an approach similar to the WAL watcher: we observe the scrape cache at a configurable interval and pull the available metadata to send to remote storage.

A high-level diagram of the process looks like:

[Mermaid diagram: mermaid-diagram-20200213152256]

Documentation explaining the thinking behind this design lives in this design doc. The first version of the design doc contained details about other parts not relevant to Prometheus, so I'd highly advise focusing only on the parts that are relevant here.
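
To make the mechanism above concrete, here is a minimal Go sketch of such a watcher loop, assuming a simple polling interface over the scrape cache; the package, type, and function names (`MetricMetadata`, `metadataSource`, `metadataAppender`, `watchMetadata`) are illustrative, not the PR's actual code:

```go
package remotemeta

import (
	"context"
	"time"
)

// MetricMetadata mirrors the kind of per-metric-family metadata kept in the
// scrape cache: one entry per metric family name, not per series.
type MetricMetadata struct {
	MetricFamilyName string
	Type             string
	Help             string
	Unit             string
}

// metadataSource stands in for whatever exposes the scrape cache's metadata
// (in the PR, the scrape manager plays this role).
type metadataSource interface {
	ListMetadata() []MetricMetadata
}

// metadataAppender stands in for the remote-write side that batches and
// ships the metadata.
type metadataAppender interface {
	AppendMetadata(ctx context.Context, meta []MetricMetadata)
}

// watchMetadata polls the scrape cache at a fixed interval and forwards
// whatever metadata is available, until the context is cancelled.
func watchMetadata(ctx context.Context, interval time.Duration, src metadataSource, app metadataAppender) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			app.AppendMetadata(ctx, src.ListMetadata())
		case <-ctx.Done():
			return
		}
	}
}
```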

(Resolved review thread on storage/remote/shards.go)
@tomwilkie (Member)

Thanks @gotjosh! PR looks very promising. One question in the comments above, one suggestion (don't do a code move) and one question here:

How much metadata do we have on average per series in our Prometheus servers at Grafana Labs? And how much bandwidth will this use to send? We need to be able to give guidelines here.

@tomwilkie (Member)

Also to discuss:

  • config to enable / disable this, and whether it should be enabled by default (I think yes, but only if the bandwidth is low enough)
  • config to specify the rate limit / how often the data is sent (see the sketch below).
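
As a rough sketch only (field names and YAML keys are assumptions, not a final design), the kind of per-remote-write configuration being suggested could look like:

```go
package remotemeta

import "time"

// MetadataConfig sketches the two knobs discussed above, hung off each
// remote-write queue's configuration. Field names and YAML keys are
// illustrative, not a final design.
type MetadataConfig struct {
	// Send toggles whether metric metadata is propagated at all.
	Send bool `yaml:"send"`
	// SendInterval controls how often metadata is collected and shipped,
	// acting as the rate limit mentioned above.
	SendInterval time.Duration `yaml:"send_interval"`
}
```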

@brian-brazil (Contributor)

Link to design doc for those that missed this:

I believe this is the first time this document has ever been shared. We should probably open it up for review (and enable comments) before proceeding with implementation.

@gotjosh (Member, Author) commented Feb 13, 2020

@tomwilkie re the questions of:

How much metadata do we have on average per series in our Prometheus servers at Grafana Labs?

Apologies in advance, as I'm not sure I follow part of that question. Within Prometheus, metadata is not kept per series but per metric name in the scrape cache (and we keep a scrape cache per target).

That being said, at Grafana Labs we currently have about ~7.67 MB of metadata (a total of ~139k entries) [1] across all targets. However, that number is somewhat deceiving, as we only have ~1,800 unique entries across those caches [2].

And how much bandwidth will this use to send?

With the current implementation, we keep an in-memory buffer that is a set of metadata entries, appending to the set only if we've not seen that metadata (a combination of help/type/name/unit) before. From there, we only ever send data based on two criteria: a) we've reached the maximum number of entries, or b) we've hit the ticker deadline.

Currently, the maximum number of unique metadata entries to flush (similar to remote-write samples) is 2,500. At an average of 55 bytes per unique entry [1] (more realistically between 60 and 90 bytes), we're sending roughly 150 KB to 225 KB without compression every few minutes, as long as we don't tip over that maximum threshold.

In scenarios where the number of unique metadata entries exceeds that maximum, this isn't perfect, as metadata is sent on a FIFO basis. There's a chance some metadata will never get sent, but this can be overcome by removing the limitation or by allowing the user to tune the maximum value.
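
As an illustration of the buffering behaviour described above, here is a minimal Go sketch (reusing the illustrative `MetricMetadata` type from the earlier sketch; all names are assumptions, not the PR's actual code) that deduplicates on the full name/type/help/unit combination and flushes either when the buffer is full or when the ticker fires:

```go
package remotemeta

import (
	"sync"
	"time"
)

// metadataBuffer holds unique metadata entries (keyed on the full
// name/type/help/unit combination) and flushes them either when the maximum
// number of entries is reached or when the ticker deadline fires.
type metadataBuffer struct {
	mtx        sync.Mutex
	maxEntries int
	seen       map[MetricMetadata]struct{}
	pending    []MetricMetadata // FIFO order of first appearance
	flush      func([]MetricMetadata)
}

func newMetadataBuffer(maxEntries int, flush func([]MetricMetadata)) *metadataBuffer {
	return &metadataBuffer{
		maxEntries: maxEntries,
		seen:       make(map[MetricMetadata]struct{}),
		flush:      flush,
	}
}

// append records the entry only if this exact metadata combination has not
// been seen before, and flushes once the buffer is full.
func (b *metadataBuffer) append(m MetricMetadata) {
	b.mtx.Lock()
	defer b.mtx.Unlock()
	if _, ok := b.seen[m]; ok {
		return
	}
	b.seen[m] = struct{}{}
	b.pending = append(b.pending, m)
	if len(b.pending) >= b.maxEntries {
		b.flushLocked()
	}
}

// run flushes whatever is pending on every tick until stop is closed.
func (b *metadataBuffer) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			b.mtx.Lock()
			b.flushLocked()
			b.mtx.Unlock()
		case <-stop:
			return
		}
	}
}

// flushLocked must be called with b.mtx held.
func (b *metadataBuffer) flushLocked() {
	if len(b.pending) == 0 {
		return
	}
	b.flush(b.pending)
	b.pending = nil
}
```

At 2,500 entries and 60-90 bytes each, a full flush works out to roughly 150-225 KB before compression, which is where the numbers above come from.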

Does that make sense? Have I missed something?

[1]
[Screenshot 2020-02-13 at 16:54:47]

[2]
[Screenshot 2020-02-13 at 16:59:10]

@gotjosh (Member, Author) commented Feb 13, 2020

I believe this is the first time this document has ever been shared. We should probably open it up for review (and enable comments) before proceeding with implementation.

@brian-brazil I don't think the document is highly relevant at this point, as it was started a while ago. It is not exclusively about this work; it also covers the work already done as part of #6395.

Could we have the discussion as part of this PR?

@brian-brazil (Contributor)

Could we have the discussion as part of this PR?

This is a hard problem, and per previous discussions I was led to believe we'd have a design doc first. Let's see...

The PR description and doc don't provide any detail on how this works. From a quick look at the code this is re-adding the coupling that we went to some effort to remove between scraping and remote write, plus removing the property that remote write is resilient to restarts, and adding the requirement that a remote write endpoint be stateful.

Given the breadth of these changes I think that a design doc is in order, as these would be non-trivial architectural and semantic changes to both Prometheus in general and the remote write protocol more specifically.

I'm also seeing nothing here that couldn't be done in a sidecar, this smells a lot like the StackDriver use case.

@gotjosh (Member, Author) commented Feb 13, 2020

plus removing the property that remote write is resilient to restarts, and adding the requirement that a remote write endpoint be stateful.

Isn't it slightly different? Even within Prometheus, metadata is not resilient to restarts. I don't see why we should make it so within remote write. The goal here is to take a best-effort approach, similar to what we already do within Prometheus.

@brian-brazil (Contributor)

Isn't it slightly different? Even within Prometheus, metadata is not resilient to restarts.

We're talking about remote write here. If we add a feature to remote write, it should maintain the properties of remote write.

I look forward to your design doc, I've been hoping to see the novel ideas for this for a while now.

@gotjosh force-pushed the metadata-remote-write branch 3 times, most recently from d781cbb to 92446e7 on February 21, 2020 at 12:20.
@cstyan (Member) left a comment

Looks good as far as I can tell (Friday eyes, though). Made a few comments, but let's make sure we talk on Monday if there's anything you want to go over.

(Resolved review threads on scrape/manager.go, storage/remote/write_test.go, config/config.go, and scrape/metadata_watcher.go)
@gotjosh force-pushed the metadata-remote-write branch 4 times, most recently from defc16a to e01b3b4 on February 24, 2020 at 11:55.
@gotjosh (Member, Author) commented Feb 24, 2020

For those following the conversation along, I've made a separate design doc that includes the decisions and thinking behind only this part of the feature. I can't make it editable by default, so I'm assuming those who wish to comment can request access to do so.

@brian-brazil (Contributor)

Can you make it world commentable? That's how we usually do things.

@gotjosh (Member, Author) commented Feb 24, 2020

I had to create a different copy for that. All the previous links should be up to date.

@gotjosh (Member, Author) commented Feb 25, 2020

Just to make sure those following along didn't miss my last message: the design doc is now commentable and open to the world.

@brian-brazil (Contributor)

I was just getting to reviewing that. The developers list is a better place to share such things, rather than buried in a PR.

@gotjosh (Member, Author) commented Feb 25, 2020

Thanks for the tip @brian-brazil, noted!

@csmarchbanks (Member) left a comment

First pass: generally, I think it would be nice to have a smaller footprint in packages outside of remote, so that when metadata is written to disk all the relevant code will be in one place and easy to change or rip out. Curious what you think of that, though.

Related to the above, what would you think of using the HTTP API instead of the internal scrape manager for getting all of the metadata? At first glance that would remove the dependency between remote storage and scrape entirely.
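
For context, the metadata HTTP API added in #6395 exposes metadata grouped by metric family via GET /api/v1/metadata. A hypothetical sketch of polling it from the remote-write side (the helper and type names here are assumptions, not existing code) could look like:

```go
package remotemeta

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// apiMetadata is one metadata entry as returned by the metadata endpoint.
type apiMetadata struct {
	Type string `json:"type"`
	Help string `json:"help"`
	Unit string `json:"unit"`
}

// apiResponse is the standard envelope used by the Prometheus HTTP API.
type apiResponse struct {
	Status string                   `json:"status"`
	Data   map[string][]apiMetadata `json:"data"`
}

// fetchMetadataViaAPI polls GET /api/v1/metadata and returns the entries
// grouped by metric family name, as an alternative to reading the scrape
// manager's caches directly.
func fetchMetadataViaAPI(baseURL string) (map[string][]apiMetadata, error) {
	resp, err := http.Get(baseURL + "/api/v1/metadata")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var r apiResponse
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		return nil, err
	}
	if r.Status != "success" {
		return nil, fmt.Errorf("metadata API returned status %q", r.Status)
	}
	return r.Data, nil
}
```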

(Resolved review threads on cmd/prometheus/main.go, prompb/types.proto, scrape/metadata_watcher.go, and storage/remote/queue_manager_test.go)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@csmarchbanks (Member) left a comment

👍 for me, nice work getting this together again!

(Review threads on storage/remote/queue_manager.go, config/config.go, prompb/types.proto, docs/configuration/configuration.md, and storage/remote/metadata_watcher.go)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@codesome (Member)

I would like to include this in v2.23. I guess only the metric names are yet to be finalised. Looking at it now.

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
@codesome (Member)

I have fixed the metrics and the doc comments. I will merge it in some time before starting the release process.

@brian-brazil (Contributor)

👍

I will merge it in some time before starting the release process.

I think we need to wait for @cstyan here, as we can't reasonably infer that the changes you just made are good with him.

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
@codesome mentioned this pull request on Nov 19, 2020
@cstyan (Member) left a comment

LGTM assuming build passes for the latest commit

@codesome (Member)

Merging it as tests are passing. Build will take forever.
