
Database connection error handling #268

Open · georg-schwarz opened this issue Dec 3, 2020 · 3 comments

@georg-schwarz (Member)

How will error handling work in the services if connections to the DB break?
Should they be able to use a subscribeToError function to perform custom error handling?
Or do we want to handle that synchronously when requesting queries? Should we catch that specific case and throw a 500 WebException ("Database currently not reachable") or something similar?

Originally posted by @georg-schwarz in jvalue/node-dry-pg#2 (comment)

@georg-schwarz (Member, Author) commented Dec 3, 2020

moved discussion here @sonallux

@sonallux (Contributor) commented Dec 3, 2020

Anything related to connection losses on idle connections in the connection pool should be resolved by jvalue/node-dry-pg#2. When performing a query, the connection pool will automatically try to establish a new connection if there is no active connection in the pool. Therefore those errors should not concern the services, as the node-dry-pg library handles them silently.

The query method returns a Promise which resolves with the query result or rejects with an error if the query failed (database error or connection problems). Therefore we must handle those errors at the call site, when performing the query.
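A minimal sketch of such a call site, using the plain pg Pool API that node-dry-pg builds on (the connection string variable, table, and function name here are made up for illustration):

```typescript
import { Pool } from 'pg'

// node-dry-pg wraps a pg Pool like this one; the pool lazily opens a new
// connection on the next query if no active connection is available.
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

export async function queryItems(): Promise<unknown[]> {
  // Rejects with a pg error (e.g. a schema violation) or with a Node
  // network error (e.g. ECONNREFUSED) if the database is unreachable.
  const result = await pool.query('SELECT * FROM items')
  return result.rows
}
```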

First, we must distinguish between database schema violation errors (i.e. client errors) and network-related errors. I am going to focus only on the network-related errors now.
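One possible way to tell the two classes apart (a sketch, assuming pg-style errors that carry a SQLSTATE `code` property and Node network errors that carry errno-style codes):

```typescript
const NETWORK_ERROR_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'ENOTFOUND'])

function isNetworkError(err: unknown): boolean {
  const code = (err as { code?: string }).code
  if (code === undefined) {
    return false
  }
  // SQLSTATE class 08 covers connection exceptions reported by Postgres itself
  return NETWORK_ERROR_CODES.has(code) || code.startsWith('08')
}
```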

Further, we should also distinguish between the origins of the database queries, because each of them needs different error handling. I have identified these three origins so far:

Service startup (table initialization)

Here we have two options:

  • wait and retry: We are currently doing this with a limited number of retries. After that, we are following the second option and exiting the microservice.
  • fail-fast: In this case, one would immediately exit the service and let the container orchestrator (e.g. Kubernetes) handle the error.

The wait-and-retry option is convenient in development. But if the ODS should ever run in production, I would move to the fail-fast option, because the database initialization is only needed on the very first start of the database. On all other service startups, executing the database initialization is unnecessary and can break things if it is not idempotent.
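A sketch of the wait-and-retry startup logic described above, with hypothetical names (initTables, delayMs); setting retries to 0 degenerates into the fail-fast option:

```typescript
declare function initTables(): Promise<void> // e.g. runs CREATE TABLE IF NOT EXISTS ...

async function initTablesWithRetry(retries: number, delayMs: number): Promise<void> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      await initTables()
      return
    } catch (err) {
      console.error(`Table initialization failed (attempt ${attempt + 1}/${retries + 1})`, err)
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs))
      }
    }
  }
  // After exhausting the retries, fail fast and let the container
  // orchestrator (e.g. Kubernetes) restart the service.
  process.exit(1)
}
```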

When running in production, database migrations are also a scenario that will arise at some point. For me, database initialization and database migration are actually very similar and should ideally be handled similarly. But as database migrations are a complex topic of their own, they should be handled separately when the need arises.

REST request

For me, the only option is to return a 5XX error. This is already done, as the error from the rejected query Promise just bubbles up until it is caught by the default express error handler, which returns a 500 response (see #247).
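Leaving the exact wiring of #247 aside, a minimal sketch of forwarding such a rejection to express (reusing the hypothetical queryItems from above; note that in express 4 a rejected Promise has to be passed on explicitly via next(err) to reach the default error handler):

```typescript
import express from 'express'

const app = express()

app.get('/items', async (req, res, next) => {
  try {
    const items = await queryItems() // rejects on connection problems
    res.json(items)
  } catch (err) {
    // Hand the error to express; its default error handler replies
    // with a 500 response.
    next(err)
  }
})
```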

Async message/event

I think we are currently just logging the error. This is definitely not an appropriate error handling mechanism. In those cases, I would use the feature of rejecting/nacking messages back to the message broker, so they do not get lost. Then the message can either be redelivered or put in a dead-letter queue.
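A sketch of the nack-based handling using amqplib (the queue name, AMQP_URL variable, and handleEvent are made up; the actual broker setup may differ):

```typescript
import amqplib from 'amqplib'

declare function handleEvent(event: unknown): Promise<void> // hypothetical business logic

async function startConsumer(): Promise<void> {
  const connection = await amqplib.connect(process.env.AMQP_URL ?? 'amqp://localhost')
  const channel = await connection.createChannel()

  await channel.consume('some.queue', async (msg) => {
    if (msg === null) {
      return // consumer was cancelled by the broker
    }
    try {
      await handleEvent(JSON.parse(msg.content.toString()))
      channel.ack(msg)
    } catch (err) {
      // requeue=false: the broker moves the message to the queue's
      // dead-letter exchange (if configured) instead of dropping it
      channel.nack(msg, false, false)
    }
  })
}
```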


Further points

Here are some further points that can influence the above decisions. Most of them do not affect us now, but I would like to mention them here, as they become important when running the ODS in production with live traffic.

  • Is the connection loss due to high load on the database?
  • Is the database replicated? (Retry with another replica)
  • Are there multiple instances of the microservice running? (always fail fast and let the client retry on another instance)
  • Are there circuit breakers?
  • How can the container orchestrator detect unhealthy or broken services and databases?
  • What does the container orchestrator do with unhealthy or broken services and databases (draining traffic and service restart)?
  • How does load balancing work (especially automatic traffic draining from unhealthy services)?

@georg-schwarz (Member, Author)

I agree with everything you say!

Since we have the startup retries configured via env variables, we can set the retries to 0 in future Kubernetes deployments and let k8s restart the containers on failure.
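For illustration, reading such a variable could look like this (the variable name is made up; 0 retries means a single attempt, i.e. fail fast):

```typescript
// With the retry sketch above, a value of 0 tries once and then exits,
// leaving the restart to Kubernetes.
const startupRetries = Number(process.env.CONNECTION_RETRIES ?? '0')
```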

The schema initialization + migrations should be handled differently than they are right now. But that can happen later, once we have a version deployed.
