Random failure in Rocky Linux-based custom container #380

GregWhiteyBialas opened this issue Apr 16, 2024 · 0 comments

Bug description

I have built a container with an RPM-based Scaphandre installation. I am starting it on bare metal with the `prometheus --qemu` option. In the Docker logs I see:

scaphandre::sensors: Sysinfo sees 256
Scaphandre stdout exporter
Sending ⚡ metrics
Measurement step is: 2s

When I try to curl http://localhost:8080/metrics I don't get any output on the console, and in the logs I see:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/sensors/utils.rs:177:18
scaphandre::exporters::prometheus: Error in show_metrics : PoisonError { .. }
scaphandre::exporters::prometheus: Error details : poisoned lock: another task failed inside

Each subsequent curl run produces:

scaphandre::exporters::prometheus: Error in show_metrics : PoisonError { .. }
scaphandre::exporters::prometheus: Error details : poisoned lock: another task failed inside
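For what it's worth, the repeated error looks like standard Rust mutex poisoning: the initial `unwrap()` panic happens while a lock used by the exporter is held, so every later attempt to take that lock returns `PoisonError`. A minimal sketch (not Scaphandre's actual code) that reproduces the same "poisoned lock: another task failed inside" message:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let shared = Arc::new(Mutex::new(0u32));

    let worker = Arc::clone(&shared);
    let handle = thread::spawn(move || {
        let _guard = worker.lock().unwrap();
        // Stand-in for the unwrap() on None at src/sensors/utils.rs:177.
        let value: Option<u32> = None;
        value.unwrap(); // panics while _guard is still held, poisoning the Mutex
    });
    let _ = handle.join(); // the panic stays in the worker thread

    // Every later lock() now fails, just like each curl to /metrics above.
    if let Err(e) = shared.lock() {
        println!("Error details : {e}"); // "poisoned lock: another task failed inside"
    }
}
```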

Once in a few runs Scaphandre starts properly and I am able to scrape metrics. I have done a lot of tests to determine when it happens (without changing ownership and access rights to /sys/class/powercap (so without running init.sh), after a reboot (to reset the ownership of /sys), restarting the Docker container, purging Docker, running Scaphandre with the stdout option, etc.) and I didn't find anything conclusive.

Below is the console output where I ran Scaphandre a few times before it worked, after a few unsuccessful attempts to start it.

(kolla-ansible) [stack@hpc30 ~]$ docker run -v /sys/class/powercap:/sys/class/powercap -v /proc:/proc -ti --network host -e RUST_BACKTRACE=full kolla/scaphandre:17.1.0  scaphandre stdout -t 5
scaphandre::sensors: Sysinfo sees 256
Scaphandre stdout exporter
Sending ⚡ metrics
Measurement step is: 2s
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/sensors/utils.rs:177:18
stack backtrace:
   0:     0x5576c0d21f41 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hf66164b97344d0a2
   1:     0x5576c0d4a4af - core::fmt::write::hbb74f2248ccd4395
   2:     0x5576c0d1edb1 - std::io::Write::write_fmt::hed9c5edae1eac7b4
   3:     0x5576c0d21d55 - std::sys_common::backtrace::print::hc9a6bb05c1f66b1d
   4:     0x5576c0d231f7 - std::panicking::default_hook::{{closure}}::h617bee45ce760ff9
   5:     0x5576c0d22fe4 - std::panicking::default_hook::hfb5619c23c95dafb
   6:     0x5576c0d236ac - std::panicking::rust_panic_with_hook::h07253f826b957552
   7:     0x5576c0d23561 - std::panicking::begin_panic_handler::{{closure}}::hfde4141a9de96c92
   8:     0x5576c0d22376 - std::sys_common::backtrace::__rust_end_short_backtrace::he15cde744ac23f89
   9:     0x5576c0d232f2 - rust_begin_unwind
  10:     0x5576c06be443 - core::panicking::panic_fmt::h2494779393265ba8
  11:     0x5576c06be4d3 - core::panicking::panic::hfcc79b23445abeb8
  12:     0x5576c0799450 - scaphandre::exporters::MetricGenerator::gen_self_metrics::h280d657f7d304306
  13:     0x5576c07a208b - scaphandre::exporters::MetricGenerator::gen_all_metrics::h63813309d030eccd
  14:     0x5576c07b54a3 - scaphandre::exporters::stdout::StdoutExporter::iterate::h06a8bbbbab974fa2
  15:     0x5576c07b52c8 - <scaphandre::exporters::stdout::StdoutExporter as scaphandre::exporters::Exporter>::run::hd0394d843640f8d2
  16:     0x5576c06d6203 - scaphandre::main::h75d3d0458ba1b902
  17:     0x5576c06cdfd3 - std::sys_common::backtrace::__rust_begin_short_backtrace::hd18dc57ef0d20d7c
  18:     0x5576c06c9ad9 - std::rt::lang_start::{{closure}}::he293a497447ace7d
  19:     0x5576c0d18ef5 - std::rt::lang_start_internal::he62005167fe2938d
  20:     0x5576c06d9c95 - main
  21:     0x7f1a5f2b4eb0 - __libc_start_call_main
  22:     0x7f1a5f2b4f60 - __libc_start_main_alias_1
  23:     0x5576c06bebf5 - _start
  24:                0x0 - <unknown>
(kolla-ansible) [stack@hpc30 ~]$ docker run -v /sys/class/powercap:/sys/class/powercap -v /proc:/proc -ti --network host -e RUST_BACKTRACE=full kolla/scaphandre:17.1.0  scaphandre stdout -t 5
scaphandre::sensors: Sysinfo sees 256
Scaphandre stdout exporter
Sending ⚡ metrics
Measurement step is: 2s
scaphandre::sensors: Not enough records for socket
scaphandre::sensors: Not enough records for socket
Host:   0 W from
        package         core
Top 5 consumers:
Power           PID     Exe
No processes found yet or filter returns no value.
------------------------------------------------------------

Host:   167.52704 W from
        package         core
Socket1 83.300095 W |   0.123677 W

Socket0 85.29291 W |    0.137677 W

Top 5 consumers:
Power           PID     Exe
2.625001 W      295896  "/usr/bin/scaphandre"
0.0029199123 W  10613   ""
0.0029199123 W  10718   ""
0.0029199123 W  9711    ""
0.0029199123 W  4934    ""
------------------------------------------------------------

What is strange is that whenever I start Scaphandre using the official image, it works just fine.

To Reproduce

Build an image based on Rocky Linux 9.3, install the Scaphandre RPM in it, and run it.

Expected behavior

`scaphandre prometheus --qemu` will start properly each time.

Screenshots

n/a

Environment:

Rocky Linux 9.3

 uname -a
Linux hpc30 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional context

Why am I building Docker images instead of using the official one?
I want to add Scaphandre support to the OpenStack deployment project kolla-ansible.
This effort can be tracked here: https://review.opendev.org/c/openstack/kolla/+/914646/10

GregWhiteyBialas added the bug label on Apr 16, 2024
bpetit added this to Triage in General on Apr 18, 2024