
High disk I/O prevents successful DNS resolution #5635

Open
nazar-pc opened this issue Apr 25, 2024 · 14 comments

@nazar-pc
Sponsor

nazar-pc commented Apr 25, 2024

Versions

  • Pi-hole: v5.18.2
  • AdminLTE: v5.21
  • FTL: v5.25.1

Platform

  • OS and version: Ubuntu 22.04
  • Platform: Docker

Expected behavior

DNS resolution should work all the time.

Actual behavior / bug

Under very high disk I/O, DNS resolution simply stops working. For example, this happens when running sync on an NVMe SSD after a large disk operation, which on modern Linux with the noop I/O scheduler effectively blocks many things for a significant amount of time.

When Pi-hole recovers, I see this in the logs:

Long-term load (15min avg) larger than number of processors: 80.3 > 32
This may slow down DNS resolution and can cause bottlenecks.

Steps to reproduce

Run Pi-hole on a disk with very high I/O load and observe DNS resolution failing.
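
For example, one way to generate this kind of load (the file path and size here are purely illustrative, not taken from the original report) is a large write followed by a manual flush:

dd if=/dev/zero of=./bigfile bs=1M count=50000
sync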

Debug Token

https://tricorder.pi-hole.net/sgxHNmzd/

Additional context

My educated guess is that on each DNS request Pi-hole tries to read something from disk, and under high I/O the read times out.

Considering how small the gravity database is (8.7 MB in my case), it would make a lot of sense to keep a copy of it in memory. Something under 50 MB will not cause issues for most users, while performance will improve massively and the issue described here will be avoided completely.

@rdwebdesign
Member

The Pi-hole database is already kept in memory, and new queries are written to disk from time to time.
If it's not possible to write immediately, Pi-hole will try to write the data later.
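
The pattern looks roughly like this (a minimal sketch against the SQLite C API with a hypothetical try_store_queries() helper, not FTL's actual code):

#include <sqlite3.h>
#include <stdbool.h>

/* Try to commit the buffered queries; on a locked database, keep them
 * in memory and signal the caller to retry on the next interval. */
static bool try_store_queries(sqlite3 *db)
{
    char *err = NULL;
    if (sqlite3_exec(db, "BEGIN", NULL, NULL, &err) != SQLITE_OK) {
        sqlite3_free(err);
        return false;   /* e.g. SQLITE_BUSY: retry later */
    }
    /* ... INSERT the buffered queries here ... */
    if (sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &err) != SQLITE_OK) {
        sqlite3_free(err);
        sqlite3_exec(db, "ROLLBACK", NULL, NULL, NULL);
        return false;   /* keep the queries in memory for a later attempt */
    }
    return true;
}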

@nazar-pc
Sponsor Author

Interesting. Then my guess is incorrect, but the issue is still present. How can I debug it to provide more information?

@rdwebdesign
Member

Please generate and upload a debug log:

Using the command line:

  • run the pihole -d command.
  • when asked to upload the log, answer Y.
  • copy only the token and paste here.

Using the web interface:

  • click on Tools > Generate Debug Log menu.
  • then mark the second checkbox to upload the log and click on the big button.
  • when finished, copy only the token and paste here.

@rdwebdesign
Member

I'm not 100% sure the issue is related to high I/O load causing a Pi-hole failure.

Usually, when Pi-hole fails to write to the database, you see messages like these:

ERROR: SQL query "END TRANSACTION" failed: database is locked (SQLITE_BUSY)
END TRANSACTION failed when trying to store queries to long-term database
Keeping queries in memory for later new attempt

And when the queries are finally written, you see this:

Notice: Queries stored in long-term database: 44 (took 142.7 ms, last SQLite ID 53921217)
        Queries from earlier attempt(s) stored successfully

But your log doesn't show evidence of this kind of issue (at least not in the available lines; the debug log shows only a few lines of FTL.log).

@nazar-pc
Sponsor Author

Is there a more detailed log level that I can enable or something else of this kind?

@DL6ER
Member

DL6ER commented Apr 25, 2024

This is an interesting case. Naively, I'd guess any added debug logging would worsen your situation somewhat, as all of Pi-hole's debug options are rather verbose.

FTL.log is barely ever written to, and logging to pihole.log happens completely asynchronously. To quote The Linux Programming Interface: A Linux and UNIX System Programming Handbook:

the kernel buffer cache ensures that I/O on regular files does not block, as described in Section 13.1

If this were not working, something in the kernel would be broken. We never sync the log and let the kernel handle things on its own.
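
To illustrate the point (a hedged sketch, not FTL's code; the log path and message are just examples): a plain write(2) to a regular file returns once the data has been copied into the kernel's page cache, and since no fsync()/fdatasync() is issued, flushing to disk is left entirely to the kernel.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* open the log for appending; path is illustrative */
    int fd = open("/var/log/pihole.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    const char msg[] = "query[A] example.com from 192.168.1.2\n";
    write(fd, msg, sizeof(msg) - 1);   /* returns after copying to the page cache */
    close(fd);                         /* no fsync(): the kernel writes back later */
    return 0;
}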

Storing to the database is, furthermore, done in a different thread altogether, so it should not slow down DNS resolution.

---snip---

Well, taking a break from typing and returning to the computer later made me realize there is one possible issue I could imagine here: when writing to the database, we temporarily lock the data structure holding DNS queries to ensure consistent writes. However, we hold that lock for longer than we need to. So far this has never been an issue, but maybe it is in your case.
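
In sketch form, the fix amounts to shrinking the critical section (hypothetical code, not the actual tweak/unlock_commit diff): copy the pending queries while holding the lock, then release it before the slow disk commit.

#include <pthread.h>

static pthread_mutex_t shm_lock = PTHREAD_MUTEX_INITIALIZER;

void store_queries_to_database(void)
{
    pthread_mutex_lock(&shm_lock);
    /* ... copy the pending queries into a private buffer ... */
    pthread_mutex_unlock(&shm_lock);   /* DNS threads can proceed again */

    /* ... the (potentially slow) SQLite commit happens here, outside
     * the lock, so heavy disk I/O no longer stalls DNS resolution ... */
}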

Fortunately, it's sufficiently easy to find out if my theory is right or not. Please run

pihole checkout ftl tweak/unlock_commit

and either find that the situation (greatly) improves or that nothing changes.

@nazar-pc
Sponsor Author

Already running it; I will report back after some time.

@nazar-pc
Sponsor Author

Observations so far: I occasionally see page load delays of a few seconds (which may or may not be related to DNS; I haven't confirmed yet), but there hasn't been a single time that a page failed to load entirely.

Before switching to that branch, DNS resolution was failing outright, so there is definitely a visible positive change.

@nazar-pc
Sponsor Author

One thing that broke, though, is deleting diagnostic messages. I'm getting:

Error, something went wrong!
While executing message statement: database is locked

@DL6ER
Member

DL6ER commented Apr 27, 2024

Observations so far: I occasionally see page load delays of a few seconds (which may or may not be related to DNS; I haven't confirmed yet), but there hasn't been a single time that a page failed to load entirely.

Before switching to that branch, DNS resolution was failing outright, so there is definitely a visible positive change.

This is great. I will open a PR to merge this into the currently running Pi-hole v6.0 beta. There won't be another v5 release (this may or may not change, but it is rather unlikely).


I'm getting:

Error, something went wrong!
While executing message statement: database is locked

This is expected and cannot be avoided unless messages were stored in a different database. That would increase complexity quite a bit; the more likely solution is running Pi-hole on a system with less load. It doesn't appear to be a good idea to run your Pi-hole on such a special system at all. Why are you running sync at all? Power-outage prevention? To my understanding, the noop scheduler will eventually write all files to disk on its own (over time); there just won't be any re-ordering (which is useful on SSDs). sync isn't strictly needed here.


Considering how small gravity database is (8.7M in my case), it would make a lot of sense to keep its copy in memory, something under 50M will not cause issues for most users, while performance will improve massively and will avoid issue described here completely.

Sorry, I missed this before. We are not talking about gravity.db but about pihole-FTL.db, which will typically be in the low GB range for most users. We cannot afford to put this into memory. The gravity database is indeed held in the kernel's page cache; it's always quickly available and shouldn't affect performance in your case. The optimization I made in the proposed branch you tried improves committing to pihole-FTL.db alone.

@PromoFaux PromoFaux changed the title High disk I/O prevents successful DSN resolution High disk I/O prevents successful DNS resolution Apr 27, 2024
@nazar-pc
Sponsor Author

This is expected and cannot be avoided unless messages were stored in a different database. That would increase complexity quite a bit; the more likely solution is running Pi-hole on a system with less load.

It doesn't appear to be a good idea to run your Pi-hole on such a special system at all. Why are you running sync at all? Power-outage prevention? To my understanding, the noop scheduler will eventually write all files to disk on its own (over time); there just won't be any re-ordering (which is useful on SSDs). sync isn't strictly needed here.

That was just one of the ways to reproduce the issue. I had other workloads running there that caused high disk I/O as well. I'm already planning to upgrade the drive today or tomorrow, but since I did see this odd behavior, I thought I'd better report it upstream so that maybe something can be improved as a result.

Sorry, I missed this before. We are not talking about gravity.db but about pihole-FTL.db, which will typically be in the low GB range for most users. We cannot afford to put this into memory. The gravity database is indeed held in the kernel's page cache; it's always quickly available and shouldn't affect performance in your case. The optimization I made in the proposed branch you tried improves committing to pihole-FTL.db alone.

I see. Is some in-memory buffer a viable option, though? I don't think data needs to be synchronously written to disk when a DNS request comes in. Writing records in bulk might also speed things up a bit.

@DL6ER
Member

DL6ER commented Apr 27, 2024

I'm already planning to upgrade the drive today or tomorrow, but since I did see this odd behavior, I thought I'd better report it upstream so that maybe something can be improved as a result.

👍 definitely, and thank you for this! I checked the code of the currently running Pi-hole v6.0 beta just now and see that we already made a similar optimization there, with the same effect as what you tried now.

Is some in-memory buffer a viable option, though? I don't think data needs to be synchronously written to disk when a DNS request comes in. Writing records in bulk might also speed things up a bit.

Yes, yes and yes.

  • in-memory buffer: already used and ....
  • it is already fully asynchronous. However, as we use low-level primitives to get the best performance, manually calling sync forces this buffer to the disk database immediately, effectively removing the asynchronous aspect until your sync call returns. It is asynchronous again thereafter.
  • in bulk: by default we collect DNS queries for one minute and then write them in one batch to the database. The interval can be adjusted using the config option DBINTERVAL=1 in v5 and database.DBinterval=60 in Pi-hole v6.0+ (see the example below).
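
For reference, the settings would look like this (file paths assume a standard install; as I understand it, the v5 value is in minutes and the v6 value in seconds, so both correspond to one-minute batches):

# /etc/pihole/pihole-FTL.conf (Pi-hole v5)
DBINTERVAL=1

# /etc/pihole/pihole.toml (Pi-hole v6.0+)
[database]
  DBinterval = 60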

@nazar-pc
Sponsor Author

it is already fully asynchronous. However, as we use low-level primitives to get the best performance, manually calling sync forces this buffer to the disk database immediately, effectively removing the asynchronous aspect until your sync call returns. It is asynchronous again thereafter.

Makes sense. I was just thinking about some in-app buffer, rather than OS caching/buffering, to have better control over it and make sure blocking never happens in Pi-hole, no matter what. I'm looking at it from the perspective that it is better to lose logs than to impact DNS resolution performance.

Either way, glad to see improvements, and looking forward to 6.0!
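
For illustration, the kind of lossy in-app buffer suggested above could look like this (a hypothetical single-producer/single-consumer sketch, not Pi-hole code): the DNS path never blocks, and records are dropped rather than waited on when the database thread cannot keep up.

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SLOTS 4096

struct query_record { char domain[256]; };

static struct query_record ring[RING_SLOTS];
static atomic_uint head;   /* advanced by the DNS thread (producer) */
static atomic_uint tail;   /* advanced by the database thread (consumer) */

/* Called on the DNS path: never blocks; if the disk is saturated and
 * the consumer lags behind, the record is simply dropped. */
bool buffer_query(const struct query_record *rec)
{
    unsigned h = atomic_load(&head);
    if (h - atomic_load(&tail) >= RING_SLOTS)
        return false;                  /* buffer full: lose the log entry */
    ring[h % RING_SLOTS] = *rec;
    atomic_store(&head, h + 1);
    return true;
}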
