Non-blocking queries #299

Open
zmbc opened this issue Apr 5, 2021 · 24 comments
zmbc commented Apr 5, 2021

As someone trying to put an API in production that uses R to connect to a database, the ability to send a query without blocking the R process is sorely missed. (See https://github.com/r-dbi/DBI/issues/69.)

RPostgres is already very close to supporting this: it uses libpq's asynchronous API and already runs an asynchronous loop to check for interrupts. By hacking on a local source install, I was able to add async functionality very easily. Of course, I'd need to add tests, error handling, etc. before it could actually be merged.

@krlmlr Would you be open to a PR adding a dbSendQueryAsync method or similar?

krlmlr (Member) commented Apr 5, 2021

Thanks. Sure, that would be awesome. Would you like to contribute a short design document first, so that we can discuss before actually implementing?

zmbc (Author) commented Apr 5, 2021

Great! Where should I contribute the design document?

krlmlr (Member) commented Apr 5, 2021 via email

zmbc (Author) commented Apr 12, 2021

zmbc (Author) commented Apr 26, 2021

@krlmlr Did you get a chance to read the design document? Would love to know your feedback.

krlmlr (Member) commented Aug 24, 2021

Thanks, sorry it took me so long.

The document looks good, I have a few questions:

  • Should dbSendQueryAsync() work with query parameters? How do we ensure that the parameters are not garbage collected while the query is running (in the case of multi-row parameters)?
  • Can we use dbHasCompleted(), or perhaps provide dbHasCompletedAsync() to follow existing naming?
  • Could you elaborate on implementation details? How will this work internally?

zmbc (Author) commented Sep 2, 2021

Hi @krlmlr!

Should dbSendQueryAsync() work with query parameters? How do we ensure that the parameters are not garbage collected while the query is running (in the case of multi-row parameters)?

Yes, I think so. I don't quite understand your second question, though. I was able to get asynchronous queries working by moving a single line of code (step();) from after_bind to fetch in the PqResultImpl class. Based on that, I'd assume the query parameters are stored somewhere in the class and wouldn't be garbage-collected, right?

Can we use dbHasCompleted(), or perhaps provide dbHasCompletedAsync() to follow existing naming?

dbHasCompleted doesn't need an async version, I think. Or are you proposing that we rename dbHasExecutedQuery to dbHasCompletedAsync?

If the latter, I don't think it is a good idea for two reasons: dbHasCompleted is in the DBI spec and will be expected to behave that way. And dbHasExecutedQuery answers a different question entirely, and having them use similar names would be confusing.

Could you elaborate on implementation details? How will this work internally?

I mentioned my quick-and-dirty approach above (of course, that simple change does not implement the status-checking method, but that's not too hard).

I am not sure exactly the best way to avoid code duplication -- perhaps the synchronous result and the async result can be subclasses of the same class, or the synchronous result can use the async result under the hood? I favor the latter but would like to know what you think.

krlmlr (Member) commented Sep 6, 2021

Thanks. I think I missed an important detail: is this about supporting multiple open result sets, or about being able to do a nonblocking dbFetch(), or both, or (also) something else entirely?

zmbc (Author) commented Sep 6, 2021

I think I missed an important detail: is this about supporting multiple open result sets, or about being able to do a nonblocking dbFetch(), or both, or (also) something else entirely?

As stated in the design doc, the purpose of this feature is to be able to send a query in a non-blocking fashion, so that other R code can run while the query is being executed by the database.

There can still only be one query and one result set at a time per connection, unless I am seriously misunderstanding the libpq API. Maybe you thought I was saying this because of a typo in the design document, which I have just corrected.
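To make the intent concrete, here is a minimal sketch of the usage the design document proposes. The method names dbSendQueryAsync() and dbHasExecutedQuery() are the ones discussed in this thread; neither exists in RPostgres today, so this is an illustration of the proposal, not a working example.

```r
library(DBI)
library(RPostgres)

conn <- dbConnect(Postgres())

# Proposed API: returns immediately instead of blocking on the server.
res <- dbSendQueryAsync(conn, "SELECT pg_sleep(5), 42 AS answer")

# ... the R process is free to run other code here ...

# Proposed status check: poll until the server has finished executing.
while (!dbHasExecutedQuery(res)) {
  Sys.sleep(0.01)
}
dbFetch(res)

dbClearResult(res)
dbDisconnect(conn)
```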

@krlmlr krlmlr added this to the 1.4.0 milestone Sep 14, 2021
krlmlr (Member) commented Sep 19, 2021

I now have a better understanding of the problem after working on #272, your comments helped a lot.

library(RPostgres)

conn1 <- dbConnect(Postgres())

sql <- "SELECT pg_sleep(2)"

print(Sys.time())
#> [1] "2021-09-19 18:05:31 CEST"
res <- dbSendQuery(conn1, sql)
print(Sys.time())
#> [1] "2021-09-19 18:05:33 CEST"
dbFetch(res)
#>    pg_sleep
#> 1 blob[0 B]
print(Sys.time())
#> [1] "2021-09-19 18:05:33 CEST"
dbClearResult(res)
dbDisconnect(conn1)

Created on 2021-09-19 by the reprex package (v2.0.1)

I understand you're asking that dbSendQuery() should not wait until data is available, which can indeed be achieved by postponing the step() call in after_bind().

#272 and this issue are very similar, both affect the PqResultImpl class.

  • immediate queries: use PQsendQuery() instead of PQprepare() + PQdescribePrepared() + PQsendQueryPrepared() (also with the ability to send multiple queries in one command)
  • asynchronous queries: when to call PQgetResult(): when fetching the first time vs. right after sending

This class already combines sending queries (which return a result set) and statements (which return the number of rows affected). I'm afraid all these degrees of freedom will complicate the implementation.

I also have doubts about the usefulness for queries that return data: RPostgres operates in "single row mode" via PQsetSingleRowMode(); availability of the first row doesn't necessarily mean that the entire result is available, and the process will be blocked anyway. This may be a non-issue in practice though, if fetching occurs in "small enough" chunks.

Regarding your draft, I have a few questions:

  • What happens in the sequence dbSendQueryAsync() + dbClearResult()?
  • How to wait for a query/statement, to ensure it has completed? Should we implement an interruptible wait, perhaps with a timeout?
  • How do we test the implementation? Can you think of a query that "slowly" returns rows, with some clever use of pg_sleep()? In psql, the query select 1 AS a, pg_sleep(1) AS b UNION ALL select 2, pg_sleep(1); returned all rows at once after two seconds; is there a row-wise mode?
  • What practical problems does dbSendQueryAsync() solve? How would you use it if it was available today? Are you interested primarily in queries or DML/DDL statements?
  • How can we overcome libpq's restriction of one active result set per connection? Can we make use of the pool package to avoid manually maintaining multiple connections?

That said, I'll consider it when resuming work on #272. I have a proof of concept with two implementations of the PqResultImpl class, but ultimately all behaviors will have to live in the PqResultImpl class. Perhaps the "strategy" pattern can help.

@krlmlr krlmlr removed this from the 1.4.0 milestone Sep 25, 2021
zmbc (Author) commented Oct 18, 2021

Thanks @krlmlr.

availability of the first row doesn't necessarily mean that the entire result is available, and the process will be blocked anyway

This is a fair point. I'd actually be curious to see if a query could be constructed that would return rows with a large interval between them. But, I do not think this is the common case, see below.

What practical problems does dbSendQueryAsync() solve? How would you use it if it was available today? Are you interested primarily in queries or DML/DDL statements?

The much more common case in my opinion is that the delay is caused by network latency between the database and the computer where the R code is running. In modern web development (and perhaps other types of development besides), IO is almost always asynchronous for this reason -- it can be much more expensive than actual computation and having a process block for it is quite inefficient.

Primarily queries. But I don't see a good reason why this would be easier to implement for only queries as opposed to all kinds of statements.

How to wait for a query/statement, to ensure it has completed? Should we implement an interruptible wait, perhaps with a timeout?

This was addressed in my vignette. No need to complicate the API with a timeout, users can implement that themselves quite easily using dbHasExecutedQuery.
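A user-level timeout of that kind might look like the following sketch; it assumes the proposed dbHasExecutedQuery() from this thread, which is not a released API.

```r
# Hypothetical wrapper: wait for an async query with a caller-chosen
# timeout, built entirely from the proposed status-checking method.
await_query <- function(res, timeout = 30) {
  deadline <- Sys.time() + timeout
  while (!dbHasExecutedQuery(res)) {
    if (Sys.time() > deadline) {
      stop("query did not complete within ", timeout, " seconds")
    }
    Sys.sleep(0.01)  # yields between checks, so it stays interruptible
  }
  res
}
```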

What happens in the sequence dbSendQueryAsync() + dbClearResult() ?

I didn't address this in the vignette nor my proof-of-concept, but this should just cancel the query without sending a new one. dbSendQueryAsync() + dbSendQueryAsync() cancels the query and sends a new one.

How can we overcome libpq's restriction of one active result set per connection? Can we make use of the pool package to avoid manually maintaining multiple connections?

I don't think RPostgres needs to worry about this. Users of RPostgres already need to be aware that they can only do one thing on a connection at a time (otherwise their results are overwritten).

Users can use the pool package to manage making multiple async requests at once, yes. I didn't do this in my vignette to keep things simple, but in my actual testing of my proof-of-concept I was using pool.

but ultimately all behaviors will have to live in the PqResultImpl class

I don't think this is the only way to do it. Why not this: make the PqResultImpl class perform all queries asynchronously. Then the synchronous API uses the asynchronous API but blocks until the result is present: dbSendQuery() = dbSendQueryAsync() + loop until dbHasExecutedQuery() is true. (This could be written literally in R, or as a C++ class that wraps PqResultImpl.)
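In R pseudocode, that construction would be roughly the following sketch (dbSendQueryAsync() and dbHasExecutedQuery() are the names proposed in this thread, not an existing API):

```r
# Hypothetical: the synchronous verb expressed on top of the async one.
dbSendQuerySync <- function(conn, sql) {
  res <- dbSendQueryAsync(conn, sql)
  while (!dbHasExecutedQuery(res)) {
    Sys.sleep(0.001)  # the real implementation would also check for
                      # user interrupts here, as the current loop does
  }
  res
}
```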

Even if that doesn't work, it definitely seems possible to combine these concerns in PqResultImpl. Or have two versions of PqResultImpl using inheritance.

This may be unrelated to asynchronicity, but I do not understand #272. It's unclear to me what the word "immediate" has to do with prepared statements, and why the user should have to specify it. If the user tries to run a query/statement with params, it should use prepared statements, and if there are no params, it shouldn't. Am I missing something?

krlmlr (Member) commented Dec 5, 2021

Thanks for your detailed response.

Can you point me to a Shiny app that would benefit from asynchronous I/O? From my experience, I/O is often the result of a user interaction, and little computation can happen while we're waiting for the I/O. I'm curious to see such an example.

I'm a bit wary to handle everything asynchronously. I'm open to a pull request that adds asynchronous processing as an (internal) option, exposing it as you describe in the vignette. We should use the postgres prefix for the new API, as it is unclear if and how it makes it into DBI.

Perhaps "immediate" is not the best wording. Doing everything with prepared statements would have been much simpler, but it turns out that we need to use the other "immediate" API occasionally. A future version should distinguish explicitly between these two modes of operation.

zmbc (Author) commented Dec 6, 2021

Can you point me to a Shiny app that would benefit from asynchronous I/O? From my experience, I/O is often the result of a user interaction, and little computation can happen while we're waiting for the I/O. I'm curious to see such an example.

If a user requests something that requires a database query, the issue is not that they must wait for the database query, it is that everyone else using the Shiny app must also wait. Because database connections cannot be passed to another process (e.g. with future), the server process must be blocked waiting for the query results.

Of course, there are workarounds, such as running multiple Shiny processes. But that complicates the routing of requests and still sets an arbitrary limit on the number of concurrent queries that can be running before even basic requests, such as CSS, take far longer than they should. A much more elegant solution is to not block the process in the first place.

I'm a bit wary to handle everything asynchronously.

As I noted, RPostgres already handles everything asynchronously. All I'm proposing is that this busy-loop logic be extracted into a separate class, and that an R interface be exposed to the version without the busy-loop.

We should use the postgres prefix for the new API, as it is unclear if and how it makes it into DBI.

Makes sense. I did not know about this convention.

Doing everything with prepared statements would have been much simpler, but it turns out that we need to use the other "immediate" API occasionally.

No, I am suggesting exactly the opposite. Use the non-prepared API (the "immediate" API) all the time, except when there are query parameters.

A future version should distinguish explicitly between these two modes of operation.

Again, my suggestion was the opposite. I may be missing something, but I see no reason why I as a user of RPostgres should have to know about different libpq APIs. From my reading of the libpq documentation, the difference between them is that the prepared API allows you to parse and plan a query once and then run it multiple times. Even if we want to use the prepared API for all queries with parameters (I don't know why that is a desirable default), there is no reason to use it for queries without parameters, as they by definition can't be run multiple times with different parameters.

Put another way, to justify the immediate flag being visible to me as a user of RPostgres, there must be a reason RPostgres needs me to specify this information. As far as I know, there is no possible query that I would want to sometimes run with immediate=T and sometimes with immediate=F; and, it isn't difficult at all to tell which queries should use which API. So, specifying immediate=T is just unnecessary work for the user.
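The dispatch rule argued for above can be sketched as follows. simple_query() and prepared_query() are placeholder names for internal helpers, not real RPostgres functions; the point is only that the choice of libpq code path can be derived from the call itself.

```r
# Hypothetical sketch: infer the libpq API from the presence of
# parameters instead of exposing an `immediate` flag to the user.
send_query <- function(conn, sql, params = NULL) {
  if (is.null(params)) {
    # No parameters: the simple ("immediate") protocol suffices,
    # and it also permits multiple statements in one string.
    simple_query(conn, sql)
  } else {
    # Parameters present: prepare, bind, then execute.
    prepared_query(conn, sql, params)
  }
}
```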

krlmlr (Member) commented Dec 6, 2021

Thanks, the multi-session Shiny is a good example. To be most useful in Shiny, do we need/want to integrate with the {promises} package? How to achieve that?

I suspect we can already offload database queries to a different process, but the serialization overhead might be substantial.

Maybe we can start with a draft implementation that changes the PqResultImpl in place, and then think if/how to combine both implementations. All DBItest tests must still pass, I suspect this will be a bit challenging.

I wasn't very clear. Historically, RPostgres (and some other backend packages, including odbc) only had this one code path -- everything was routed via the "prepared" API. It turned out that this doesn't work for at least odbc and RMariaDB -- backends will need to use both the simple and the "prepared" APIs. If we accept this, there's no reason except compatibility to use the "prepared" API for simple statements. Changing to use "simple" by default feels like a breaking change, perhaps it's not.

zmbc (Author) commented Dec 6, 2021

To be most useful in Shiny, do we need/want to integrate with the {promises} package? How to achieve that?

We could. I wasn't sure about whether to directly integrate with {promises} or not. Whether it's done by RPostgres or by the user, it's very easy; with the R API I outlined in the vignette, it's about 15 lines of code.
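As a rough illustration of how small that integration could be, here is a sketch using the real {promises} and {later} packages. The postgresSendQueryAsync() and postgresHasExecutedQuery() calls are the postgres-prefixed names discussed in this thread and do not exist yet.

```r
library(promises)
library(later)

# Hypothetical: turn a proposed async query into a promise by polling
# the event loop until the server has finished.
query_promise <- function(conn, sql) {
  res <- postgresSendQueryAsync(conn, sql)  # proposed API
  promise(function(resolve, reject) {
    poll <- function() {
      tryCatch({
        if (postgresHasExecutedQuery(res)) {  # proposed API
          resolve(dbFetch(res))
        } else {
          later(poll, 0.001)  # re-check on a later event-loop pass
        }
      }, error = reject)
    }
    poll()
  })
}

# Usage in Shiny or any {later}-driven loop:
# query_promise(conn, "SELECT 1 AS x") %...>% print()
```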

I suspect we can already offload database queries to a different process, but the serialization overhead might be substantial.

Interesting. I'm just going off this documentation for the {future} package; I tried a few different ways and wasn't able to successfully transfer a DB connection between processes.

But in any case, it seems to me a very hacky solution to use a worker process (really more of a "slacker" process!) just to execute a wait loop.

Maybe we can start with a draft implementation that changes the PqResultImpl in place, and then think if/how to combine both implementations. All DBItest tests must still pass, I suspect this will be a bit challenging.

I'm happy to take a shot at this and make a PR.

Changing to use "simple" by default feels like a breaking change, perhaps it's not.

Thanks for this explanation. Based on the documentation I don't see any reason to think it would be a breaking change, except that it would allow multiple statements/queries to be sent in a single string (which doesn't break any old functionality, only adds new functionality).

But, I understand why you would be worried that it could break some existing behavior. I guess my question would be, what would make you feel good about going forward with that change? Would the DBItest tests passing be enough?

krlmlr (Member) commented Dec 6, 2021

Thanks for the pointer. Can we create a promise that doesn't need polling? Either way, looking forward to review your PR.

I'd say it's impossible to transfer a database connection to another process. The process that runs the future will have to instantiate its own connection.

Offloading to an external process brings the advantage that we can execute multiple queries truly simultaneously.

I have opened r-dbi/dbi3#47 for further discussion. Most users won't really care about simple vs. prepared, and it's easy to implement a wrapper that always sets immediate = TRUE. I need to wrap up a few things before thinking about new development or breaking changes here.

krlmlr (Member) commented Dec 6, 2021

To avoid polling: Can we create a background task from C++ with {later}, just for the execution of the query?

ismirsehregal commented

Here you can find a related issue.

zmbc (Author) commented Dec 22, 2021

Can we create a promise that doesn't need polling?

It will require some polling no matter what, I think. This is about concurrency, not parallelism; a promise cannot resolve until the R process is idle. So, we can't handle the database's response the moment it comes in, but rather as soon as the R process finishes what it's doing. I don't see a way to do that without doing some kind of polling from the event loop of the R process.

Now, it's possible that doing something with a background task is smart, for example if it's not good to call libpq's PQisBusy too many times. But otherwise, I think the polling approach is just fine. We could poll in C++ instead of R if that's preferable, but the effect would be the same.

I haven't tried this with R's {promises}, but it's possible that later::later(<function>, 0) could be used in a loop to simply jump to the back of the event processing queue without requiring that a specific amount of time elapse. I know that works in Javascript.
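That zero-delay requeueing pattern can be sketched with the real {later} package; done_yet() is a placeholder for a PQisBusy()-style status check, not an existing function.

```r
library(later)

# Self-rescheduling callback: each invocation does one cheap status
# check, then re-enqueues itself at the back of the event queue.
poll <- function() {
  if (!done_yet()) {      # placeholder status check
    later(poll, 0)        # no fixed sleep; other callbacks run between
  } else {
    message("result ready")
  }
}
poll()
# In Shiny the event loop runs automatically; at the console,
# later::run_now() can be called to drain pending callbacks.
```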

ismirsehregal commented

Maybe using promises::then's onFulfilled callback is an option?

krlmlr (Member) commented Dec 22, 2021

Yes, I think we'll have to work along these lines. I think something like the following might work:

  1. We prepare the SQL and the parameter data from the main thread (because we're accessing R's data structures)
  2. We prepare the callbacks for step 4
  3. We start a separate thread for executing the query, and return to the caller
  4. When this thread is done, it calls {later} through its C API to schedule the callback

For a draft, we can implement a new postgresSendQueryAsync() that takes two extra arguments, on_success and on_failure, both functions that are called with the result set. From this we should be able to construct a promise. What do you think?
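Constructing a promise from such a callback-based draft might look like this sketch; postgresSendQueryAsync() with on_success/on_failure is the proposed signature from the paragraph above, not an existing API, and conn/sql are assumed to be set up elsewhere.

```r
library(promises)

# Hypothetical: adapt the proposed callback API to a promise.
res_promise <- promise(function(resolve, reject) {
  postgresSendQueryAsync(
    conn, "SELECT 1 AS x",
    on_success = resolve,  # called with the result set when done
    on_failure = reject    # called with the error condition
  )
})

# Downstream consumers can then chain on it:
# res_promise %...>% dbFetch() %...>% print()
```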

zmbc (Author) commented Dec 22, 2021

@krlmlr I understand what you're proposing and I think it would work. However, I don't understand why we would do it. Is it a performance optimization? That would imply that polling PQisBusy from the R event loop is a major performance hit. I think it's unlikely to be, but no matter what, we should certainly see if it is before implementing an optimization.

Note that something will have to be waiting for the query no matter what. In your proposal, it's the background thread. In my original promise implementation, it's the R process. The question is whether waiting for the query is a time-consuming operation that shouldn't interfere with R process work. I think it won't be.

If your concern isn't the expense of polling but that the promise may not resolve immediately, later::later(<function>, 0) would already ensure the soonest possible resolution of the promise. Let's try that first.

krlmlr (Member) commented Dec 22, 2021

Thanks. To me, it seems that polling means that either the CPU is busy, or that we employ some sort of sleeping and might be seeing the update too late. We can start with polling and switch to a signal-based approach later.

zmbc (Author) commented Jan 2, 2022

@krlmlr Just to be clear about what "the CPU is busy" means: a polling loop as described above wouldn't significantly block other R code from running, since R code runs before event-loop callbacks and would therefore wait for at most one check of PQisBusy. There is a chance that zero-delay polling would lead to high CPU usage that competes with other processes, but I think even a tiny delay (a single millisecond, for example) would all but eliminate that problem.

I definitely think we can start with a polling approach. I see how there could be an advantage to the signal-based approach but my guess is that in practice it will be very small and not worth the complexity. But, we can test that out after doing the initial implementation.
