Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector doesn't start after upgrading from v0.33.1-1 to v.0.34.1-1 #19333

Open
LinoVieira13 opened this issue Dec 7, 2023 · 10 comments
Open

Vector doesn't start after upgrading from v0.33.1-1 to v.0.34.1-1 #19333

LinoVieira13 opened this issue Dec 7, 2023 · 10 comments
Labels
domain: config Anything related to configuring Vector type: bug A code related bug.

Comments

@LinoVieira13
Copy link

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello,
I'm opening this bug at the request of jszwedko via discord.
So, I'm using an Ubuntu 22.04 machine, and I have vector v0.33.1-1 already installed and working, however, when I try to upgrade the package via Ubuntu APT to version v.0.34.1-1 it fails to start after 1m30s (timeout).
I don't have a /etc/vector/vector.yaml file, because I'm using environment variables to point the VECTOR_CONFIG_DIR to this folder: /etc/vector/conf.d. This is the error when it tries to start the service after the upgrade:

$ sudo systemctl status vector

● vector.service - Vector
     Loaded: loaded (/lib/systemd/system/vector.service; enabled; vendor preset: enabled)
    Drop-In: /run/systemd/system/service.d
             └─zzz-lxc-service.conf
     Active: deactivating (stop-sigterm) (Result: timeout) since Thu 2023-12-07 12:31:58 UTC; 3min 27s ago
       Docs: https://vector.dev
Cntrl PID: 610422 (vector)
      Tasks: 58 (limit: 154100)
     Memory: 17.9M
     CGroup: /system.slice/vector.service
             └─610422 /usr/bin/vector validate

Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.453286Z DEBUG vector::topology::builder: Building new source. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.456647Z DEBUG vector::topology::builder: Building new transform. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.463062Z DEBUG vector::topology::builder: Building new transform. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.474103Z DEBUG vector::topology::builder: Building new sink. component=XXXXXX
Dec 07 12:32:46 vector-01 vector[610422]: √ Component configuration
Dec 07 12:32:46 vector-01 vector[610422]: 2023-12-07T12:32:46.479222Z  INFO vector::topology::builder: Healthcheck passed.
Dec 07 12:32:46 vector-01 vector[610422]: √ Health check "XXXXXX"
Dec 07 12:34:16 vector-01 systemd[1]: vector.service: start-pre operation timed out. Terminating.
Dec 07 12:34:16 vector-01 vector[610422]: 2023-12-07T12:34:16.484450Z  INFO vector::signal: Signal received. signal="SIGTERM"
Dec 07 12:35:00 vector-01 systemd[1]: vector.service: Unit cannot be reloaded because it is inactive.

Configuration

/lib/systemd/system/vector.service

[Unit]
Description=Vector
Documentation=https://vector.dev
After=network-online.target
Requires=network-online.target

[Service]
User=vector
Group=vector
ExecStartPre=/usr/bin/vector validate
ExecStart=/usr/bin/vector
ExecReload=/usr/bin/vector validate
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
AmbientCapabilities=CAP_NET_BIND_SERVICE
EnvironmentFile=-/etc/default/vector
# Since systemd 229, should be in [Unit] but in order to support systemd <229,
# it is also supported to have it here.
StartLimitInterval=10
StartLimitBurst=5
[Install]
WantedBy=multi-user.target

-------------------------------
/etc/default/vector

VECTOR_CONFIG_DIR=/etc/vector/conf.d
VECTOR_LOG=debug

-------------------------------
ll /etc/vector/
total 20
drwxr-xr-x  5 root root 4096 Dec  7 12:35 ./
drwxr-xr-x 99 root root 4096 Dec  7 12:35 ../
drwxr-xr-x  2 root root 4096 Nov 17 09:59 conf.d/
drwxr-xr-x  2 root root 4096 Dec  7 12:35 examples/
drwxr-xr-x  2 root root 4096 Nov 17 10:01 vrl/

Version

vector 0.33.1 (x86_64-unknown-linux-gnu 3cc27b9 2023-10-30 16:50:49.747931844)

Debug Output

2023-12-07T12:43:26.802362Z DEBUG vector::app: Internal log rate limit configured. internal_log_rate_secs=10
2023-12-07T12:43:26.802410Z  INFO vector::app: Log level is enabled. level="vector=trace,codec=trace,vrl=trace,file_source=trace,tower_limit=trace,rdkafka=trace,buffers=trace,lapin=trace,kube=trace"
2023-12-07T12:43:26.802515Z DEBUG vector::app: messaged="Building runtime." worker_threads=48
2023-1207T12:43:26.806507Z  WARN vector::config::loading: The "/etc/vector/vector.toml" config path does not exist. Vector will attempt to use new default "/etc/vector/vector.yaml" instead.
2023-12-07T12:43:26.806579Z ERROR vector::config::loading: Config file not found in path. path="/etc/vector/vector.yaml"

Example Data

No response

Additional Context

Vector machine is an LXC host.

References

No response

@LinoVieira13 LinoVieira13 added the type: bug A code related bug. label Dec 7, 2023
@pront
Copy link
Contributor

pront commented Dec 8, 2023

Hi @LinoVieira13,

For vector 0.34.1 I cannot reproduce what you describe:

hoot:~$ which vector
/usr/bin/vector

host:~$ vector --version
vector 0.34.1 (x86_64-unknown-linux-gnu 86f1c22 2023-11-16 14:59:10.486846964)

host:~$ VECTOR_CONFIG_DIR=~/vector-config-dir/ vector
2023-12-08T10:21:15.402659Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-12-08T10:21:15.404428Z  INFO vector::app: Loading configs. paths=["/home/bits/vector-config-dir"]
2023-12-08T10:21:15.407402Z  INFO vector::topology::running: Running healthchecks.
2023-12-08T10:21:15.407471Z  INFO vector: Vector has started. debug="false" version="0.34.1" arch="x86_64" revision="86f1c22 2023-11-16 14:59:10.486846964"
2023-12-08T10:21:15.407487Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2023-12-08T10:21:15.407646Z  INFO vector::topology::builder: Healthcheck passed.
{"appname":"AnthraX","facility":"local2","hostname":"up.cn","message":"Take a breath, let it go, walk away","msgid":"ID52","procid":1840,"severity":"crit","timestamp":"2023-12-08T10:21:15.408Z","version":1}
{"appname":"b0rnc0nfused","facility":"alert","hostname":"some.xn--fzys8d69uvgm","message":"Great Scott! We're never gonna reach 88 mph with the flux capacitor in its current state!","msgid":"ID717","procid":5058,"severity":"crit","timestamp":"2023-12-08T10:21:16.408Z","version":2}
{"appname":"KarimMove","facility":"user","hostname":"names.ls","message":"You're not gonna believe what just happened","msgid":"ID94","procid":5582,"severity":"debug","timestamp":"2023-12-08T10:21:17.408Z","version":1}
^C2023-12-08T10:21:17.796252Z  INFO vector::signal: Signal received. signal="SIGINT"
2023-12-08T10:21:17.796317Z  INFO vector: Vector has stopped.
2023-12-08T10:21:17.797442Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="print, parse_logs, dummy_logs" time_remaining="59 seconds left"
{"appname":"jesseddy","facility":"local0","hostname":"we.associates","message":"Maybe we just shouldn't use computers","msgid":"ID559","procid":8294,"severity":"err","timestamp":"2023-12-08T10:21:18.408Z","version":2}


host:~$ VECTOR_CONFIG_DIR=~/vector-config-dir/ vector validate
√ Loaded ["/home/bits/vector-config-dir"]
x Could not create subdirectory "validate_tmp" inside of data dir "/var/lib/vector/": Permission denied (os error 13)


host:~$ sudo VECTOR_CONFIG_DIR=~/vector-config-dir/ vector validate
√ Loaded ["/home/bits/vector-config-dir"]
√ Component configuration
√ Health check "print"
-----------------------------------------
                                Validated

Is the Debug Output you shared above coming from vector 0.33.1 or vector 0.34.1?

@pront pront added the domain: config Anything related to configuring Vector label Dec 8, 2023
@LinoVieira13
Copy link
Author

Hello @pront,
The Debug output I shared is coming from vector 0.34.1. However, if I try the same in vector 0.33.1 the only difference in the debug output is the line 2023-1207T12:43:26.806507Z WARN vector::config::loading: The "/etc/vector/vector.toml" config path does not exist. Vector will attempt to use new default "/etc/vector/vector.yaml" instead. the rest stays the same.

@fisheax
Copy link

fisheax commented Dec 29, 2023

I had met the similar situation. The same point is that vector validate ... does not quit if without --no-environment in my case. But it doesn't appear all the time. When it occurs, I can see that the health checks are all passed but stuck without output the last line --- \n Validated. Also, I noticed that the kafka connections still being kept (I'm using the kafka as both source and sink, maybe using kafka-sink would cause the problem more often). Btw, I cannot use ctrl+c to quit the process. The same configurations are worked in the v0.32.1

@jszwedko jszwedko added this to the Vector v0.35.0 milestone Jan 2, 2024
@pront
Copy link
Contributor

pront commented Jan 3, 2024

@LinoVieira13 in order to help us debug this:

  • in the original bug report the version is vector 0.33.1 (x86_64-unknown-linux-gnu ....), is that wrong?
  • did you change anything else or this is purely a result of version upgrade?
  • what is the output of ls /etc/vector/conf.d?

@jszwedko jszwedko removed this from the Vector v0.35.0 milestone Jan 8, 2024
@r3code
Copy link

r3code commented Apr 9, 2024

I met the same bug. I tried to upgrade from 0.24.1 to 0.35.0 (aslo for latest 0.37) and it failes to start at PreStart stage.
I tried with one vector.toml and one source + one sink + one file with 5-7 trfansforms and it succeeds. When I add more files and more sinks (about 50) validate stucks after
√ Component configuration
~ Health checks are disabled

Started it with VECTOR_LOG=DEBUG with strace and it also stucks with my huge config, Ctrl+C can't stop it but it shows log messages INFO vector::signal: Signal received. signal="SIGINT". I only can exit with Ctrl+Z and then kill the proc by kill -9 vector_validate_pid.

For same set of transforms run 'vector validate -C config_dir'

| Tried version | Result |
| 0.32.0-1        | OK      |
| 0.33.1-1        | OK      |
| 0.34.0-1        | FAIL    |
| 0.34.1-1        | FAIL    |
| 0.34.2-1        | FAIL    |
| 0.35.0-1        | FAIL    |
| 0.35.1-1        | FAIL    |
| 0.36.0-1        | FAIL    |
| 0.36.1-1        | FAIL    |
| 0.37.0-1        | FAIL    |

@r3code
Copy link

r3code commented Apr 9, 2024

What I see with strace of vector validate without --no-environment:

openat(AT_FDCWD, "/usr/lib/ssl/certs/ca-certificates.crt", O_RDONLY) = 33
fstat(33, {st_mode=S_IFREG|0644, st_size=208567, ...}) = 0
read(33, "-----BEGIN CERTIFICATE-----\nMIIH"..., 4096) = 4096
...<excluded>
read(33, "", 4096)                      = 0
close(33)                               = 0
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
futex(0x7f05649fbd98, FUTEX_WAKE_PRIVATE, 1) = 1
write(1, "\33[32m\342\210\232\33[0m Component configura"..., 37) = 37
write(1, "\33[33m~\33[0m Health checks are dis"..., 38) = 38
futex(0x7f05695ebc50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f05695ebc00, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 
// ^^^^ it stucks here
//
// -- NOW 2 attempts to stop with Ctrl+C, then terminate by Ctrl-Z
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(7, "\1", 1)                       = 1
rt_sigreturn({mask=[]})                 = 202
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(7, "\1", 1)                       = 1
rt_sigreturn({mask=[]})                 = 202
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=200774, si_uid=1027} ---
futex(0x7f0569460028, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = ?
+++ killed by SIGKILL +++

@jszwedko
Copy link
Member

jszwedko commented Apr 9, 2024

Can someone experiencing this failure share the Vector configuration they are trying to load and exact command they are running?

@r3code
Copy link

r3code commented Apr 10, 2024

Forgot to mention - I run vector under Ubuntu 20.04.04 LTS x64.
Vector installed by apt, versions are x86_64-unknown-linux-gnu

@r3code
Copy link

r3code commented Apr 13, 2024

Can't share my config files right now. I'll try to split and test it, i think It might be an issue due to the files count, or a topolgy issue

@Skunnyk
Copy link

Skunnyk commented Apr 24, 2024

Anybody with the problem on stuck validate, can you try to add a delay on the network interface following #20367 (comment) and run validate again ? 🤯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: config Anything related to configuring Vector type: bug A code related bug.
Projects
None yet
Development

No branches or pull requests

6 participants