Intermittent data ingestion issue #222

drnextgis · 2023-10-14T22:25:46Z

Are there any situations in which data might not be ingested into the database even though load_items was invoked successfully, for example concurrent ingestions or heavy database load during the ingestion process? We've encountered this issue a few times, but the underlying cause remains uncertain; typically, rerunning the ingestion resolves the problem.

drnextgis · 2023-10-15T18:43:59Z

I observed a correlation between Lock:relation waits in the database performance chart and missing STAC items. This occurs when multiple workers use the load_items function to write data at the same time.

bitner · 2023-10-16T14:15:57Z

@drnextgis If you are running multiple workers, try to ensure that they are not writing to the same partitions: the partition maintenance tasks take locks, which hinders concurrent access to those partitions.

PgSTAC does a number of things to mitigate locking issues and allow concurrency, but because we need to actually modify the layout of the database (adding partitions, modifying table constraints), some locking does happen and contention is possible. From that standpoint it will always be safest to run ingests sequentially or, failing that, to chunk the data so that you are not running ingests into the same partition simultaneously.
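
One way to chunk data as described above is to group items by the partition they would land in before handing them to workers, so that each group is loaded by a single worker. The following is only a minimal sketch, assuming items are STAC item dicts with a `collection` field and an ISO 8601 `properties.datetime`, and that the partition boundary is the calendar year or month of that value. `partition_key` and `group_by_partition` are hypothetical helpers, not part of pypgstac.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(item, partition_by="month"):
    """Return a (collection, period) key approximating the target partition.

    Assumption: partitions follow properties.datetime truncated to a year
    or a month. This is an illustration, not PgSTAC's own logic.
    """
    raw = item["properties"]["datetime"]
    # Older Python versions of fromisoformat() reject a trailing "Z".
    dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if partition_by == "year":
        period = f"{dt.year:04d}"
    else:
        period = f"{dt.year:04d}-{dt.month:02d}"
    return (item["collection"], period)


def group_by_partition(items, partition_by="month"):
    """Group items so that each group touches one assumed partition."""
    groups = defaultdict(list)
    for item in items:
        groups[partition_key(item, partition_by)].append(item)
    return dict(groups)
```

Each resulting group can then be passed to one loader call at a time, which keeps two workers from writing into the same partition concurrently.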

drnextgis · 2023-10-16T20:08:34Z

I understand that writing data into the same partition sequentially is the general guideline. However, because we ingest data into the Catalog daily, and each collection can be further partitioned only by year or month, we inevitably end up ingesting data into the same partition all the time 😥

Furthermore, I observed that invoking check_partition here is expensive and noticeably prolongs the ingestion process. Would it make sense to implement a mechanism for pre-initializing partitions, especially when the data distribution follows a well-established pattern? As an experiment, I added a dummy item dated the last day of the month to the collection; as a result, ingestion now completes so quickly that there is no noticeable activity on the database performance chart.
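
For reference, the dummy item used in that experiment can be generated along these lines. This is just a sketch of the workaround, not a PgSTAC feature: `placeholder_item` and the `placeholder-` id prefix are made up for illustration, the item carries only a minimal set of STAC fields, and whether a null geometry is accepted by a given PgSTAC version is not verified here (pass a real geometry if it is not).

```python
import calendar
from datetime import datetime, timezone


def last_day_of_month(year, month):
    """Return midnight UTC on the last calendar day of the given month."""
    last = calendar.monthrange(year, month)[1]
    return datetime(year, month, last, tzinfo=timezone.utc)


def placeholder_item(collection_id, year, month, geometry=None):
    """Build a hypothetical dummy STAC item dated the last day of the month."""
    dt = last_day_of_month(year, month)
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": f"placeholder-{year:04d}-{month:02d}",
        "collection": collection_id,
        "geometry": geometry,
        "properties": {"datetime": dt.strftime("%Y-%m-%dT%H:%M:%SZ")},
        "links": [],
        "assets": {},
    }
```

Such placeholder items would need to be filtered out of search results or deleted once real data covers the month.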

drnextgis · 2023-10-17T20:40:34Z

Proposed solution: #223

bitner · 2023-11-07T22:32:28Z

The risk with pre-generating partitions too aggressively is that many empty partitions can slow down query planning for everything. Can you try this approach and see whether it helps solve the issues you are seeing?

psql -c "INSERT INTO pgstac_settings (name, value) VALUES ('use_queue', 'true') ON CONFLICT (name) DO UPDATE SET value=EXCLUDED.value;"
pypgstac load items .....
pypgstac load items .....
psql -c "CALL run_queued_queries();"
psql -c "INSERT INTO pgstac_settings (name, value) VALUES ('use_queue', 'false') ON CONFLICT (name) DO UPDATE SET value=EXCLUDED.value;"
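
If this sequence is scripted, it may be worth making sure that use_queue is switched back off even when a load fails. A rough Python sketch of that idea follows; `ingest_with_queue` and `run_cmd` are hypothetical names, the load commands are passed in by the caller rather than spelled out here, and psql is assumed to pick up its connection settings from the usual PG* environment variables.

```python
import subprocess

# SQL taken from the commands above; only the value is templated.
SET_QUEUE_SQL = (
    "INSERT INTO pgstac_settings (name, value) VALUES ('use_queue', '{value}') "
    "ON CONFLICT (name) DO UPDATE SET value=EXCLUDED.value;"
)


def run_cmd(args):
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(args, check=True)


def ingest_with_queue(load_commands, run=run_cmd):
    """Enable the queue, run the loads, flush the queue, then disable it.

    load_commands is a list of argument lists, one per load invocation.
    """
    run(["psql", "-c", SET_QUEUE_SQL.format(value="true")])
    try:
        for cmd in load_commands:
            run(cmd)
        # Only reached when every load succeeded.
        run(["psql", "-c", "CALL run_queued_queries();"])
    finally:
        # Always restore the setting, even after a failed load.
        run(["psql", "-c", SET_QUEUE_SQL.format(value="false")])
```

In this sketch a failed load skips run_queued_queries() but still resets the setting; whether the queue should be flushed after a partial failure is a choice left to the operator.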
