Add support to pg_bulkload for hight speed data insertion and automatic scan on insert #922

Open
w1ld3r wants to merge 18 commits into master

@w1ld3r w1ld3r commented Jul 21, 2023

Add support for pg_bulkload to provide high-speed data loading and to automatically start scans on the inserted domains.

Principal changes made:

  • Replace the default Postgres database image with a custom one that has pg_bulkload preinstalled
  • Attach a local volume (./db/imports) to the Postgres database container for pg_bulkload data import and logging
  • Define table names explicitly to resolve an error with the pg_bulkload INPUT parameter
  • Add a Bash script (./db/scripts/bulk_insert_domains.sh) that:
    • Bulk-inserts domains, optionally linking them to a new or existing organization
    • Starts a scan of the previously inserted domains

Security changes made:

  • Stop exposing the database (port 5432) and the web server (port 8000)
  • Replace .env with .env.template to improve confidentiality

TO-DO:

  • Document script usage
  • Improve script

@w1ld3r marked this pull request as ready for review July 21, 2023 13:10
@w1ld3r changed the title from "Add support for pg_bulkload" to "Add support to pg_bulkload for hight speed data insertion and automatic scan on insert" Jul 23, 2023
@psyray
Collaborator

psyray commented Nov 19, 2023

Hi @w1ld3r, your PR is a valuable one 👍

As v2 has just been released, there are a lot of DB model modifications.
So could you rebase your PR onto master (which is now v2) and update your shell script to add the newly created tables?

Once done, I will test your PR and we will integrate it ASAP if everything is OK.
The DB is a bottleneck for reNgine when working with large amounts of data, so your proposal is really important.

@fopina
Contributor

fopina commented Feb 13, 2024

Shameless self-promotion, but we use https://github.com/fopina/django-bulk-update-or-create/ in https://github.com/surface-security/surface/
Works quite well without requiring database extensions

@w1ld3r
Author

w1ld3r commented Feb 13, 2024

Hello @fopina, thanks for sharing. I'll still be using pg_bulkload alongside Postgres for better performance.
As @psyray pointed out, the problem lies in processing that amount of data. From what I have tested of reNgine, it cannot handle huge amounts of data.

@psyray
Collaborator

psyray commented Feb 18, 2024

> Hello @fopina, thanks for sharing. I'll still be using pg_bulkload alongside Postgres for better performance.
> As @psyray pointed out, the problem lies in processing that amount of data. From what I have tested of reNgine, it cannot handle huge amounts of data.

After extensive use and the integration of the Django debug toolbar, it turns out the real problem comes from fetching data from related tables.
It mainly affects the datatables views: the fields defined in the datatables configuration are fetched when the datatable loads its data before rendering the table.
Django's lazy loading multiplies the queries for each row.
For example, the vulnerabilities table, which is the most query-heavy one, runs more than 2,000 queries to render 50 elements.
I'm currently trying to optimize this.
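
The query multiplication described above is commonly called the N+1 pattern. Here is a minimal sketch of it, using Python's standard-library sqlite3 instead of Django's ORM so that it runs standalone. The `vulnerability` and `subdomain` tables, the `CountingDB` helper, and the function names are hypothetical and are not reNgine's actual schema or code.

```python
import sqlite3


def build_demo_db(n_rows):
    """Create an in-memory database with n_rows vulnerabilities, each linked to one subdomain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE subdomain (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE vulnerability (
            id INTEGER PRIMARY KEY,
            title TEXT,
            subdomain_id INTEGER REFERENCES subdomain(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO subdomain VALUES (?, ?)",
        [(i, f"sub{i}.example.com") for i in range(n_rows)],
    )
    conn.executemany(
        "INSERT INTO vulnerability VALUES (?, ?, ?)",
        [(i, f"vuln-{i}", i) for i in range(n_rows)],
    )
    return conn


class CountingDB:
    """Thin wrapper that counts every query sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def render_lazy(db, limit):
    """Fetch the page of rows, then look up each related row one at a time."""
    rows = db.query(
        "SELECT title, subdomain_id FROM vulnerability ORDER BY id LIMIT ?",
        (limit,),
    )
    rendered = []
    for title, subdomain_id in rows:
        # One extra query per row: this is the lazy-loading cost.
        (name,) = db.query(
            "SELECT name FROM subdomain WHERE id = ?", (subdomain_id,)
        )[0]
        rendered.append((title, name))
    return rendered


if __name__ == "__main__":
    db = CountingDB(build_demo_db(50))
    render_lazy(db, 50)
    print(f"queries issued: {db.count}")
```

With a single lazily loaded relation, 50 rows cost 51 queries (one for the page plus one per row), and every additional relation touched per row adds another 50.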

@psyray
Collaborator

psyray commented Feb 18, 2024

> Shameless self-promotion, but we use https://github.com/fopina/django-bulk-update-or-create/ in https://github.com/surface-security/surface/
> Works quite well without requiring database extensions

Thanks for this, I will have a look.

@psyray
Collaborator

psyray commented Feb 18, 2024

@w1ld3r

Could you rebase your PR onto master?

@fopina
Contributor

fopina commented Feb 18, 2024

@psyray look into select_related and prefetch_related to optimize those
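
For reference, Django's `select_related` follows foreign keys with a SQL JOIN in the same query, while `prefetch_related` runs one extra query per relation and joins the results in Python. Below is a rough sketch of both strategies in plain standard-library sqlite3 rather than the Django ORM, so it runs without a Django project. The `vulnerability` and `subdomain` tables, the `CountingDB` helper, and the function names are hypothetical illustrations, not reNgine's models.

```python
import sqlite3


def build_demo_db(n_rows):
    """Create an in-memory database with n_rows vulnerabilities, each linked to one subdomain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE subdomain (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE vulnerability (
            id INTEGER PRIMARY KEY,
            title TEXT,
            subdomain_id INTEGER REFERENCES subdomain(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO subdomain VALUES (?, ?)",
        [(i, f"sub{i}.example.com") for i in range(n_rows)],
    )
    conn.executemany(
        "INSERT INTO vulnerability VALUES (?, ?, ?)",
        [(i, f"vuln-{i}", i) for i in range(n_rows)],
    )
    return conn


class CountingDB:
    """Thin wrapper that counts every query sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def render_joined(db, limit):
    """JOIN strategy (what select_related does conceptually): one query total."""
    return db.query(
        "SELECT v.title, s.name FROM vulnerability AS v "
        "JOIN subdomain AS s ON s.id = v.subdomain_id "
        "ORDER BY v.id LIMIT ?",
        (limit,),
    )


def render_batched(db, limit):
    """Batch strategy (what prefetch_related does conceptually): two queries total."""
    vulns = db.query(
        "SELECT title, subdomain_id FROM vulnerability ORDER BY id LIMIT ?",
        (limit,),
    )
    ids = sorted({subdomain_id for _, subdomain_id in vulns})
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    # Fetch all related rows at once, then stitch them together in Python.
    names = dict(
        db.query(f"SELECT id, name FROM subdomain WHERE id IN ({placeholders})", ids)
    )
    return [(title, names[subdomain_id]) for title, subdomain_id in vulns]


if __name__ == "__main__":
    db = CountingDB(build_demo_db(50))
    render_joined(db, 50)
    print(f"JOIN strategy queries: {db.count}")
    db.count = 0
    render_batched(db, 50)
    print(f"batch strategy queries: {db.count}")
```

In the ORM itself, the equivalent would look something like `Vulnerability.objects.select_related("subdomain")`, where the model and field names are again assumptions. Either way, the query count stays constant instead of growing with the number of rendered rows.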

@psyray
Collaborator

psyray commented Feb 18, 2024

> @psyray look into select_related and prefetch_related to optimize those

Thanks 👍

@AnonymousWP added the "Waiting for Merge" (Already Worked, waiting to merge) label Feb 21, 2024
Labels
release/2.1.0, Waiting for Merge (Already Worked, waiting to merge)
4 participants