
[loki.source.kubernetes] restart of an alloy pod doubles the amount of logs in the log volume #876

Open
ToonTijtgat2 opened this issue May 16, 2024 · 3 comments
Labels: bug (Something isn't working)


@ToonTijtgat2

What's wrong?

I'm running loki.source.kubernetes in cluster mode with autoscaling enabled. Everything works as expected until one of the pods restarts or an additional pod is added; from that moment, the log volume shown in Grafana Explore is doubled, even though the number of log lines listed in the logs section stays the same. So according to the logs section the logs are not ingested twice, but the log volume can no longer be trusted because its values have doubled.

Log volume before an Alloy pod restarts:
[screenshot]
Log volume after an Alloy pod has restarted:
[screenshot]

It seems that each pod only knows about its own tailed targets, and that when it restarts it forgets where it left off and starts sending all the logs again.

Could it be that I missed something? Or is this maybe a bug in the component?

Thanks for checking.

Steps to reproduce

Deploy Alloy in cluster mode (StatefulSet).
Send logs using the loki.source.kubernetes component, then restart the pods (see the trimmed-down sketch below).
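
For reference, a trimmed-down sketch of the setup (the relabeling rules are omitted here; the full version is in the Configuration section below, and the push URL is redacted):

      discovery.kubernetes "podstolog" {
        role = "pod"
      }

      loki.source.kubernetes "podlogging" {
        targets = discovery.kubernetes.podstolog.targets
        forward_to = [loki.write.lokimandev.receiver]

        clustering {
          enabled = true
        }
      }

      loki.write "lokimandev" {
        endpoint {
          url = "https://xxx/loki/api/v1/push"
        }
      }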

System information

Kubernetes

Software version

Grafana Alloy v1.1.0

Configuration

// Discover all pods in the cluster through the Kubernetes API.
discovery.kubernetes "podstolog" {
  role = "pod"
}

// Drop the Istio sidecar containers and map Kubernetes metadata to log labels.
discovery.relabel "podstologdropistio" {
  targets = discovery.kubernetes.podstolog.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    regex = "istio-proxy|istio-init"
    action = "drop"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label = "container_name"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label = "host"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label = "namespace_name"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label = "pod_name"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label = "app"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_version"]
    target_label = "version"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_configuration_version"]
    target_label = "config_version"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_database_version"]
    target_label = "database_version"
  }
}

// Tail the pod logs through the Kubernetes API; clustering spreads the
// discovered targets across the Alloy instances.
loki.source.kubernetes "podlogging" {
  targets = discovery.relabel.podstologdropistio.output
  forward_to = [loki.relabel.dropnotneededlabel.receiver]

  clustering {
    enabled = true
  }
}

// Drop labels that are not needed downstream.
loki.relabel "dropnotneededlabel" {
  forward_to = [loki.write.lokimandev.receiver]

  rule {
    action = "labeldrop"
    regex = "job|instance"
  }
}

// Push the logs to Loki.
loki.write "lokimandev" {
  endpoint {
    url       = "https://xxx/loki/api/v1/push"
    tenant_id = "logging-testalloy-dev"
  }
}

Logs

No response

@ToonTijtgat2 (Author)

FYI, I also tried using a persistent volume in the hope that the tailing state would be saved there and that the component would pick up where it left off, but the effect on the log volume is the same with persistent volumes.

@ToonTijtgat2 (Author)

The same effect happens when you run the components in cluster mode: if you restart one of the pods, I assume its load is passed to another pod, but the position file is not known to that pod, so it seems to just start from the beginning again. This causes a lot of noise in Loki and memory pressure in the Alloy pods. (See the annotated block below.)
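
If I understand the clustering docs correctly, each instance owns a hash-determined subset of the discovered targets, and ownership is reshuffled when a peer joins or leaves; nothing in that handover appears to carry a tail position. The relevant block from my configuration, annotated with that assumption:

      loki.source.kubernetes "podlogging" {
        targets = discovery.relabel.podstologdropistio.output
        forward_to = [loki.relabel.dropnotneededlabel.receiver]

        clustering {
          // Assumption: each Alloy instance tails only the targets that hash
          // to it; when a peer restarts, its targets move to other instances,
          // which seem to start tailing without the previous instance's
          // position.
          enabled = true
        }
      }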

@ToonTijtgat2 (Author)

Could it be that the /data storage should be shared across all the pods, so that every pod can write to the same position file? That way, if one pod stops, another pod in the cluster could take over its load and start where the previous one left off.

Can you please check/advise?
