
Intermittent data ingestion issue #222

Open
drnextgis opened this issue Oct 14, 2023 · 5 comments
Comments

@drnextgis
Collaborator

Are there any situations in which data might not be ingested into the database even though load_items was invoked successfully, for example concurrent ingestions or heavy database load during ingestion? We've encountered this issue a few times, but the underlying cause remains uncertain; rerunning the ingestion typically resolves the problem.

@drnextgis
Collaborator Author

I observed a correlation between the presence of Lock:relation in the database performance chart and missing STAC items. This occurs when multiple workers simultaneously use the load_items function to write data.
[image: database performance chart]

@bitner
Collaborator

bitner commented Oct 16, 2023

@drnextgis If you are running multiple workers, you should try to ensure that they are not accessing the same partitions, because the partition maintenance tasks take locks and will hinder concurrency on those partitions.

PgSTAC does a number of things to mitigate locking issues and allow concurrency, but because we need to actually modify the layout of the database (adding partitions, modifying table constraints), some locking is unavoidable and contention is possible. From that standpoint it will always be safest to run ingests sequentially or, failing that, to chunk the data so that you are not running ingests into the same partition simultaneously.
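
One way to picture the chunking advice above is the rough Python sketch below. It is not part of PgSTAC or pypgstac; `partition_key` and `chunk_by_partition` are hypothetical helper names. It assumes each item is a STAC item dict with a `collection` field and a non-null `properties.datetime`, and that the collection is partitioned by month in UTC. Items are grouped by (collection, month) so that each group can be handed to a single worker.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple

PartitionKey = Tuple[str, str]


def partition_key(item: dict) -> PartitionKey:
    """Return (collection, 'YYYY-MM') for a STAC item dict.

    Assumes monthly partitioning and a non-null properties.datetime.
    """
    raw = item["properties"]["datetime"]
    # datetime.fromisoformat on older Python versions rejects a trailing "Z".
    dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    dt = dt.astimezone(timezone.utc)
    return item["collection"], f"{dt.year:04d}-{dt.month:02d}"


def chunk_by_partition(items: Iterable[dict]) -> Dict[PartitionKey, List[dict]]:
    """Group items so that each group targets exactly one partition."""
    groups: Dict[PartitionKey, List[dict]] = defaultdict(list)
    for item in items:
        groups[partition_key(item)].append(item)
    return dict(groups)
```

Each resulting group could then be loaded by one worker at a time, so that no two workers write to the same partition concurrently. For yearly partitioning the key would drop the month component.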

@drnextgis
Collaborator Author

I understand that the general guideline is to write data into a given partition sequentially. However, because we ingest data into the Catalog daily, and each collection can be further partitioned by either year or month, we inevitably end up ingesting data into the same partition again and again 😥

Furthermore, I observed that invoking check_partition here is expensive and noticeably prolongs the ingestion process. Would it make sense to implement a mechanism for pre-initializing partitions, especially when the pattern of data distribution is well established? As an experiment, I added a dummy item dated the last day of the month to the collection, and ingestion now completes so quickly that there is no noticeable activity on the database performance chart.

@drnextgis
Collaborator Author

drnextgis commented Oct 17, 2023

Proposed solution: #223

@bitner
Collaborator

bitner commented Nov 7, 2023

The risk of pre-generating partitions too aggressively is that many empty partitions can slow down query planning for everything. Can you try this approach and see whether it solves the issues you are seeing?

psql -c "INSERT INTO pgstac_settings (name, value) VALUES ('use_queue', 'true') ON CONFLICT (name) DO UPDATE SET value=EXCLUDED.value;"
pypgstac load items .....
pypgstac load items .....
psql -c " CALL run_queued_queries();"
psql -c "INSERT INTO pgstac_settings (name, value) VALUES ('use_queue', 'false') ON CONFLICT (name) DO UPDATE SET value=EXCLUDED.value;"
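
If the ingestion is driven from Python rather than the shell, the same sequence could be wrapped so that the `use_queue` setting is always switched back off. The following is only a sketch under assumptions: `queued_ingest` is a hypothetical helper, not a pypgstac API, and `execute` stands for any callable you supply that runs one SQL statement against the PgSTAC database. The SQL strings are the ones from the commands above.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

# SQL taken from the psql commands above; only the value is templated.
SET_QUEUE = (
    "INSERT INTO pgstac_settings (name, value) VALUES ('use_queue', '{value}') "
    "ON CONFLICT (name) DO UPDATE SET value=EXCLUDED.value;"
)
RUN_QUEUE = "CALL run_queued_queries();"


@contextmanager
def queued_ingest(execute: Callable[[str], None]) -> Iterator[None]:
    """Enable use_queue, let the caller load items, then run the queue.

    `execute` is a caller-supplied function that runs a single SQL statement.
    """
    execute(SET_QUEUE.format(value="true"))
    try:
        yield
        # Reached only if the loads inside the `with` block did not raise.
        execute(RUN_QUEUE)
    finally:
        # Always restore the setting, even when a load failed.
        execute(SET_QUEUE.format(value="false"))
```

Usage would look like `with queued_ingest(run_sql): ...` with the item loads inside the block, where `run_sql` is your own function. Skipping `run_queued_queries` after a failed load is a choice made in this sketch, not something PgSTAC requires; you may prefer to run the queue in either case.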
