
High disk I/O prevents successful DNS resolution #5635

Open
nazar-pc opened this issue Apr 25, 2024 · 14 comments

@nazar-pc
Sponsor

nazar-pc commented Apr 25, 2024

Versions

  • Pi-hole: v5.18.2
  • AdminLTE: v5.21
  • FTL: v5.25.1

Platform

  • OS and version: Ubuntu 22.04
  • Platform: Docker

Expected behavior

DNS resolution should work all the time.

Actual behavior / bug

Under very high disk I/O, DNS resolution simply stops working. For example, this happens when running sync on an NVMe SSD after a large disk operation, which on modern Linux with the noop I/O scheduler effectively blocks many things for a significant amount of time.

When Pi-hole recovers, I see this in the logs:

Long-term load (15min avg) larger than number of processors: 80.3 > 32
This may slow down DNS resolution and can cause bottlenecks.

Steps to reproduce

Run Pi-hole on a disk with very high I/O load and observe DNS resolution failing.
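
For example, one way to generate this kind of load (the file path and size here are purely illustrative, not taken from the original report) is a large write followed by a manual flush:

dd if=/dev/zero of=./bigfile bs=1M count=50000
sync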

Debug Token

https://tricorder.pi-hole.net/sgxHNmzd/

Additional context

My educated guess is that on each DNS request Pi-hole tries to read something from disk, and under high I/O the read times out.

Considering how small the gravity database is (8.7 MB in my case), it would make a lot of sense to keep a copy of it in memory. Something under 50 MB will not cause issues for most users, while performance will improve massively and the issue described here will be avoided completely.

@rdwebdesign
Member

The Pi-hole database is already kept in memory, and new queries are written to disk from time to time.
If it's not possible to write immediately, Pi-hole will try to write the data later.
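
The pattern looks roughly like this (a minimal sketch against the SQLite C API with a hypothetical try_store_queries() helper, not FTL's actual code):

#include <sqlite3.h>
#include <stdbool.h>

/* Try to commit the buffered queries; on a locked database, keep them
 * in memory and signal the caller to retry on the next interval. */
static bool try_store_queries(sqlite3 *db)
{
    char *err = NULL;
    if (sqlite3_exec(db, "BEGIN", NULL, NULL, &err) != SQLITE_OK) {
        sqlite3_free(err);
        return false;   /* e.g. SQLITE_BUSY: retry later */
    }
    /* ... INSERT the buffered queries here ... */
    if (sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &err) != SQLITE_OK) {
        sqlite3_free(err);
        sqlite3_exec(db, "ROLLBACK", NULL, NULL, NULL);
        return false;   /* keep the queries in memory for a later attempt */
    }
    return true;
}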

@nazar-pc
Sponsor Author

Interesting. Then my guess is incorrect, but the issue is still present. How can I debug it to provide more information?

@rdwebdesign
Member

Please generate and upload a debug log:

Using the command line:

  • run the pihole -d command.
  • when asked to upload the log, answer Y.
  • copy only the token and paste here.

Using the web interface:

  • click on Tools > Generate Debug Log menu.
  • then mark the second checkbox to upload the log and click on the big button.
  • when finished, copy only the token and paste here.

@rdwebdesign
Member

I'm not 100% sure the issue is related to high I/O load causing a Pi-hole failure.

Usually, when Pi-hole fails to write to the database, you see messages like these:

ERROR: SQL query "END TRANSACTION" failed: database is locked (SQLITE_BUSY)
END TRANSACTION failed when trying to store queries to long-term database
Keeping queries in memory for later new attempt

And when the queries are finally written, you see this:

Notice: Queries stored in long-term database: 44 (took 142.7 ms, last SQLite ID 53921217)
        Queries from earlier attempt(s) stored successfully

But your log doesn't show evidence of this kind of issue (at least not in the available lines; the debug log shows only a few lines of FTL.log).

@nazar-pc
Sponsor Author

Is there a more detailed log level that I can enable or something else of this kind?

@DL6ER
Member

DL6ER commented Apr 25, 2024

This is an interesting case. Naively, I'd guess any added debug logging would worsen your situation somewhat, as all of Pi-hole's debug options are rather verbose.

FTL.log is barely ever written to, and logging to pihole.log happens completely asynchronously. To quote The Linux Programming Interface: A Linux and UNIX System Programming Handbook:

the kernel buffer cache ensures that I/O on regular files does not block, as described in Section 13.1

If this were not working, something in the kernel would be broken. We never sync the log and let the kernel handle things on its own.
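
To illustrate the point (a hedged sketch, not FTL's code; the log path and message are just examples): a plain write(2) to a regular file returns once the data has been copied into the kernel's page cache, and since no fsync()/fdatasync() is issued, flushing to disk is left entirely to the kernel.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* open the log for appending; path is illustrative */
    int fd = open("/var/log/pihole.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    const char msg[] = "query[A] example.com from 192.168.1.2\n";
    write(fd, msg, sizeof(msg) - 1);   /* returns after copying to the page cache */
    close(fd);                         /* no fsync(): the kernel writes back later */
    return 0;
}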

Storing to the database is, furthermore, done in a different thread altogether, so it should not slow down DNS resolution.

---snip---

Well, taking a break from typing and returning to the computer later made me realize there is one possible issue I could imagine here: when writing to the database, we temporarily lock the data structure holding DNS queries to ensure consistent writes. However, we hold that lock for longer than we need to. So far this has never been an issue, but maybe it is in your case.
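
In sketch form, the fix amounts to shrinking the critical section (hypothetical code, not the actual tweak/unlock_commit diff): copy the pending queries while holding the lock, then release it before the slow disk commit.

#include <pthread.h>

static pthread_mutex_t shm_lock = PTHREAD_MUTEX_INITIALIZER;

void store_queries_to_database(void)
{
    pthread_mutex_lock(&shm_lock);
    /* ... copy the pending queries into a private buffer ... */
    pthread_mutex_unlock(&shm_lock);   /* DNS threads can proceed again */

    /* ... the (potentially slow) SQLite commit happens here, outside
     * the lock, so heavy disk I/O no longer stalls DNS resolution ... */
}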

Fortunately, it's sufficiently easy to find out if my theory is right or not. Please run

pihole checkout ftl tweak/unlock_commit

and either find that the situation (greatly) improves or that nothing changes.

@nazar-pc
Sponsor Author

Already running it; I will report back after some time.

@nazar-pc
Sponsor Author

Observations so far: I occasionally see page load delays of a few seconds (which may or may not be related to DNS; I haven't confirmed yet), but there hasn't been a single time that a page failed to load entirely.

Before switching to that branch, DNS resolution was failing outright, so there is definitely a visible positive change.

@nazar-pc
Sponsor Author

One thing that broke, though, is deleting diagnostic messages. I'm getting:

Error, something went wrong!
While executing message statement: database is locked

@DL6ER
Member

DL6ER commented Apr 27, 2024

Observations so far: I occasionally see page load delays of a few seconds (which may or may not be related to DNS; I haven't confirmed yet), but there hasn't been a single time that a page failed to load entirely.

Before switching to that branch, DNS resolution was failing outright, so there is definitely a visible positive change.

This is great. I will open a PR to merge this into the currently running Pi-hole v6.0 beta. There won't be another v5 release (this may or may not change, but it is rather unlikely).


I'm getting:

Error, something went wrong!
While executing message statement: database is locked

This is expected and cannot be avoided unless messages were stored in a different database. That would increase complexity quite a bit; the more likely solution is running Pi-hole on a system with less load. It doesn't appear to be a good idea to run your Pi-hole on such a special system at all. Why are you running sync at all? Power-outage prevention? To my understanding, the noop scheduler will eventually write all files to disk on its own (over time); there just won't be any re-ordering (which is useful on SSDs). sync isn't strictly needed here.


Considering how small gravity database is (8.7M in my case), it would make a lot of sense to keep its copy in memory, something under 50M will not cause issues for most users, while performance will improve massively and will avoid issue described here completely.

Sorry, I missed this before. We are not talking about gravity.db but about pihole-FTL.db, which will typically be in the low GB range for most users. We cannot afford to put this into memory. The gravity database is indeed held in the kernel's page cache; it's always quickly available and shouldn't affect performance in your case. The optimization I made in the proposed branch you tried improves committing to pihole-FTL.db alone.

@PromoFaux PromoFaux changed the title High disk I/O prevents successful DSN resolution High disk I/O prevents successful DNS resolution Apr 27, 2024
@nazar-pc
Sponsor Author

This is expected and cannot be avoided unless messages were stored in a different database. That would increase complexity quite a bit; the more likely solution is running Pi-hole on a system with less load.

It doesn't appear to be a good idea to run your Pi-hole on such a special system at all. Why are you running sync at all? Power-outage prevention? To my understanding, the noop scheduler will eventually write all files to disk on its own (over time); there just won't be any re-ordering (which is useful on SSDs). sync isn't strictly needed here.

That was just one of the ways to reproduce the issue. I had other workloads running there that caused high disk I/O as well. I'm already planning to upgrade the drive today or tomorrow, but since I did see this odd behavior, I thought I'd better report it upstream so that maybe something can be improved as a result.

Sorry, I missed this before. We are not talking about gravity.db but about pihole-FTL.db, which will typically be in the low GB range for most users. We cannot afford to put this into memory. The gravity database is indeed held in the kernel's page cache; it's always quickly available and shouldn't affect performance in your case. The optimization I made in the proposed branch you tried improves committing to pihole-FTL.db alone.

I see. Is some in-memory buffer a viable option, though? I don't think data needs to be synchronously written to disk when a DNS request comes in. Writing records in bulk might also speed things up a bit.

@DL6ER
Member

DL6ER commented Apr 27, 2024

I'm already planning to upgrade the drive today or tomorrow, but since I did see this odd behavior, I thought I'd better report it upstream so that maybe something can be improved as a result.

👍 definitely, and thank you for this! I checked the code of the currently running Pi-hole v6.0 beta just now and see that we already made a similar optimization there, with the same effect as what you tried now.

Is some in-memory buffer a viable option, though? I don't think data needs to be synchronously written to disk when a DNS request comes in. Writing records in bulk might also speed things up a bit.

Yes, yes and yes.

  • in-memory buffer: already used and ....
  • it is already fully asynchronous. However, as we use low-level primitives to get the best performance, manually calling sync forces this buffer to the disk database immediately, effectively removing the asynchronous aspect until your sync call returns. It is asynchronous again thereafter.
  • in bulk: by default we collect DNS queries for one minute and then write them in one batch to the database. The interval can be adjusted using the config option DBINTERVAL=1 in v5 and database.DBinterval=60 in Pi-hole v6.0+ (see the example below).
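
For reference, the settings would look like this (file paths assume a standard install; as I understand it, the v5 value is in minutes and the v6 value in seconds, so both correspond to one-minute batches):

# /etc/pihole/pihole-FTL.conf (Pi-hole v5)
DBINTERVAL=1

# /etc/pihole/pihole.toml (Pi-hole v6.0+)
[database]
  DBinterval = 60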

@nazar-pc
Sponsor Author

it is already fully asynchronous. However, as we use low-level primitives to get the best performance, manually calling sync forces this buffer to the disk database immediately, effectively removing the asynchronous aspect until your sync call returns. It is asynchronous again thereafter.

Makes sense. I was just thinking about some in-app buffer, rather than OS caching/buffering, to have better control over it and make sure blocking never happens in Pi-hole, no matter what. I'm looking at it from the perspective that it is better to lose logs than to impact DNS resolution performance.

Either way, glad to see improvements, and looking forward to 6.0!
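
For illustration, the kind of lossy in-app buffer suggested above could look like this (a hypothetical single-producer/single-consumer sketch, not Pi-hole code): the DNS path never blocks, and records are dropped rather than waited on when the database thread cannot keep up.

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SLOTS 4096

struct query_record { char domain[256]; };

static struct query_record ring[RING_SLOTS];
static atomic_uint head;   /* advanced by the DNS thread (producer) */
static atomic_uint tail;   /* advanced by the database thread (consumer) */

/* Called on the DNS path: never blocks; if the disk is saturated and
 * the consumer lags behind, the record is simply dropped. */
bool buffer_query(const struct query_record *rec)
{
    unsigned h = atomic_load(&head);
    if (h - atomic_load(&tail) >= RING_SLOTS)
        return false;                  /* buffer full: lose the log entry */
    ring[h % RING_SLOTS] = *rec;
    atomic_store(&head, h + 1);
    return true;
}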
