Random invalid session and inconsistent service accounts #19510
Comments
@pschichtel Do you only have one KES server for production?
@jiuker production has 3, backup has 1
Do the 3 KES servers have the same keys for production? @pschichtel
They are all connected to the same vault (with a dedicated V2 KV engine for minio), so I'd assume so. How can I check?
Could you check the key
Not sure what you mean
Check for overlapping value assignments between two clients |
Sorry for being confused!
By key, do you mean a KES key or an access key/secret access key? There is no "site-replicator-0" KES key, so I assume access key. What do you mean by "value" then?
What do you mean by "value assignments"? And what clients? I just checked with mcli again (
I only found a strange case: when I open two minio login pages in one browser at the same time, one of them says that
@jiuker I have the occasional case where the page after login stays blank. I think your case sounds like a race condition on the shared cookies/localStorage/sessionStorage between the browser tabs, which are not shared between separate browsers.
Yeah. It will return back to the login page for
@jiuker I don't think it is limited to a specific page, I've seen it happen on several different pages.
I'm not so sure anymore, because I get errors with mcli too, and that doesn't go through the console, right?
We can't reproduce any of the issues reported here.
How can I properly clear the replication settings from both sites? Then I could test the production cluster without site replication and see if that helps.
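For reference, tearing down site replication is done through the mc client. A sketch, using the `production` and `backup` aliases from this thread; the exact flags should be verified against `mc admin replicate rm --help` for your mc version:

```shell
# Inspect the current site-replication configuration on either site.
mc admin replicate info production

# Remove the peer site from the replication group
# (--force skips the confirmation prompt).
mc admin replicate rm production backup --force

# Confirm the configuration is gone on both sites.
mc admin replicate info production
mc admin replicate info backup
```

These commands require a live cluster and configured aliases, so they are shown as an operational fragment only.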
I just noticed that even the replication rules on buckets are completely inconsistent from refresh to refresh. @poornas thanks, I'll try that next week.
Bucket versioning is also affected. It seems like everything somehow related to site replication is completely inconsistent between the nodes of the production cluster. It also seems to have gotten worse since I last checked a week ago.
@poornas I removed the backup site from replication and it's all fine now. Should the
Remove
Are you saying it should automatically disappear after removing site-replication? Because it hasn't so far, neither in the production site nor in the backup site. Both sites don't have any other replication rules. So I'll delete the service accounts to have a clean state. |
Yeah, it should disappear. If not, you can try deleting it manually, since I couldn't reproduce your case.
I removed the accounts. I'll upgrade both instances to the latest release now and then set up replication again in the evening.
I remember @harshavardhana saying something about this in a past issue: the /v1/service-accounts endpoint is rather slow (400-900 ms "wait" time in the browser). Given that this is a small cluster (5 nodes), only 3 service accounts exist, and my connection is basically local, this feels noticeably slow in the UI. This is still the case even after disabling replication. Is the timing within a normal range, or would this be worth investigating? I originally thought this was caused by the replication problem, but apparently it isn't.
Can you share mc admin trace -a output while browsing this element in the UI?
It's currently pretty noisy, I can do that in the evening. Or would there be a way to filter its output?
Tracing only the call itself will not show what is going on, so filtering is not too feasible here. You can try
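Since `mc admin trace -a` streams one line per API call, the noise can also be cut after the fact by piping the stream through grep. A sketch; the `s3.` call-name prefix is an assumption based on typical trace lines, so check your actual output format first:

```shell
# Drop S3 data-path calls from a trace stream so that the admin/IAM calls
# stand out. The "s3." prefix is an assumption about the trace line format.
filter_admin_calls() {
  grep -v 's3\.'
}

# Usage against a live cluster (not run here):
#   mc admin trace -a production | filter_admin_calls
```

This keeps the full trace line for everything that is not an S3 object operation, which is roughly what was wanted when filtering out the file-access noise.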
@harshavardhana https://gist.github.com/pschichtel/c62d0eb9e46adb5472bf103a4b9cac85 I filtered out a bunch of log lines obviously related to S3 file accesses.
OK, so I wiped my backup site and enabled replication again. Replication completed overnight. /service-accounts is still slow (possibly even a bit slower, but I'm not sure), but so far I haven't observed any invalid session responses and the list of service accounts in production seems to be consistent.
Looks like a wrong implementation by the Console UI. It fetches key info serially, one key at a time.
That's what I thought when I saw the trace lines. Should I create an issue over at minio/console, or will you handle that "internally"?
Will handle it internally.
I've just applied yesterday's minio release to the sites and that also hasn't reintroduced the issue. I'll monitor this for the rest of the week and if it stays without issues I'll close the issue.
Checking how things are going, @pschichtel?
Sent PRs for this; the newer console release will handle these changes without making double the calls.
Closing this since I haven't heard back; I'm assuming this has been resolved.
Yeah, I haven't had issues so far.
@harshavardhana now, after the upgrade to, I wonder if this is something caused by the upgrade process of the operator? Should I open a new issue for this?
At least it seems to be limited to the console API; running
@harshavardhana after upgrading to, I'm now convinced that the upgrade as performed by the operator causes, or at least worsens, this issue. Should I create a new issue, possibly over at minio/console?
Generally I don't expect this to occur unless someone actively wipes your credentials.
Can you collect the backend .minio.sys folders from both sites and share them with us?
@harshavardhana Seems there is quite a bit of information in there that I don't think I can just share. Is it possible to limit the requested files? Otherwise I'd first have to clear internally if it's ok to share this stuff. What I noticed while poking around:
Ha... I found the offender. I slowly went through the pods one by one (from last to first, similar to the sts controller), deleted them, and let the sts controller recreate them. Between each pod I repeatedly checked the /service-accounts endpoint. Pods 4, 3 and 2 did nothing; restarting pod 1 completely resolved the issue.
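For anyone landing in the same state, the pod-by-pod procedure described above can be sketched like this. The names are hypothetical (a tenant StatefulSet whose pods are `production-0` through `production-4` in namespace `minio`, and an mcli alias `production`); adjust them to your deployment:

```shell
# Restart tenant pods one at a time, from highest ordinal to lowest
# (mirroring StatefulSet controller order), and check service-account
# consistency after each restart.
for i in 4 3 2 1 0; do
  kubectl -n minio delete pod "production-$i"
  kubectl -n minio wait --for=condition=Ready "pod/production-$i" --timeout=300s
  # Run this several times; the returned list should be identical each time.
  mcli admin user svcacct list production admin
done
```

Requires a live Kubernetes cluster and configured alias, so this is an operational fragment rather than something runnable here.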
Do you have logs from this pod before deleting it?
I do, but I don't think there was anything of interest. I'll check...
Here you go: production-1-logs.txt. I noticed that the cluster once lost quorum. The log file btw includes both the update and my restart that fixed the problem.
This is a follow up to #19217 and #19201
After my vacation I just verified the state of the minio installation again after the previous issues.
Expected Behavior
Once logged in I'd expect not to randomly receive "invalid session" warnings or to get randomly logged out when navigating to certain pages (e.g. the Site Replication config page).
I would also expect to see the same service accounts on my root user every time I refresh the Access Keys page (or when directly accessing
/api/v1/service-accounts
).
Current Behavior
I randomly get invalid session responses ("The Access Key Id you provided does not exist in our records.") from the backend, and on some pages that leads to a redirect to the login page.
I also get a different list of service accounts every time I refresh; sometimes it doesn't even include the site-replicator-0 account, which would explain why I'm still seeing #19217. Actually, in my tests now, refreshing
/api/v1/service-accounts
a bunch of times, I rarely get all 4 service accounts.
The backup site still occasionally logs this as in #19217:
Steps to Reproduce (for bugs)
I'm still not sure how I arrived at this state, I assume by enabling site replication.
I've checked that KES is working on both the production and the backup site. At this point I'm not even able to disable site replication on the production site, because I get constantly logged out (redirected to login page) from the page.
The single-node backup instance does not exhibit this behavior. There, I never get invalid session responses, I always get the same 4 service accounts on the root user (including site-replicator-0), and I can also access the Site Replication page.
Context
It makes using the minio console difficult. I assume replication from backup to production would not work reliably (or would be a lot slower), but that's not currently something I need to do.
Interestingly,
mcli admin user svcacct list production admin
always returns the complete list of service accounts for my root user, although not always in the same order, but that doesn't matter. S3 clients in general don't seem to be affected, at least not functionally.
To elaborate on the setup:
2 sites:
The keys between the KES deployments are identical (replicated files from the production site can be decrypted on the backup site). The production KES setup is responsive and can successfully access the vault (I created and deleted a test key to confirm).
Your Environment
minio --version: RELEASE.2024-04-06T05-26-02Z
uname -a: Linux 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux