Unable to query tables in DuckDB file created by DBT models #372

Open
JRocki opened this issue Mar 29, 2024 · 2 comments

Comments


JRocki commented Mar 29, 2024

I am not sure if this is intentional or if there's something on my end, but I am unable to query tables in the DuckDB file that are created during the dbt run. I am using AWS S3/Parquet as my 'data lake', with DuckDB v0.10.0 and dbt-duckdb v1.7.2. Here's my profiles.yml:

db:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ./db.duckdb
      extensions:
        - httpfs
        - parquet
      use_credential_provider: aws
      settings:
        s3_region: us-west-1

The pipeline runs successfully. I get all the expected parquet files in S3. The curious issue is with the duckdb file.

For example, I have a model that creates a table in the db.duckdb file: db.main.my_model. I also run a dbt test on this model (it checks that values in a column are unique), which creates a table db.main_dbt_test__audit.unique_column_id. I then like to use Python/DuckDB to query the tables directly when I'm debugging or exploring, but it errors when querying db.main.my_model:

con = duckdb.connect('db.duckdb')
df = con.query(
    """
    select *
    from db.main.my_model
    """
)

yields:
duckdb.duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://my-bucket.s3.amazonaws.com/pipeline/my_model.parquet' (HTTP 403)

However, if I run:

con = duckdb.connect('db.duckdb')
df = con.query(
    """
    select *
    from db.main_dbt_test__audit.unique_column_id
    """
)

I am able to see the data. Is there something inherently different about tables created via dbt models? How does the DuckDB file know to point to S3? I think the HTTP 403 is a red herring, because the dbt pipeline runs successfully, which means it is able to connect to AWS. I'd really appreciate any thoughts on this, thanks!

jwills (Collaborator) commented Mar 29, 2024

The rub here is that the created DuckDB file won't have the S3 access key/secret present to be able to read the tables that are backed by files in S3 (the results of tests are just regular DuckDB tables, which is why you can query them). You should be able to read the tables by loading the aws extension and running CALL load_aws_credentials(); to load your credentials.
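
For example, a minimal sketch of that workaround from Python (it assumes your AWS credentials are available through the standard config/environment chain on the machine; the database and model names are the ones from the issue):

import duckdb

con = duckdb.connect('db.duckdb')

# Load the extensions needed to resolve the S3-backed tables, then pull
# credentials from the local AWS config/environment into this connection.
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("INSTALL aws;")
con.execute("LOAD aws;")
con.execute("CALL load_aws_credentials();")

# Now the Parquet-on-S3 model table should be readable.
df = con.query("select * from db.main.my_model").df()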

To avoid needing to do this, you can use the new CREATE SECRET construct via a dbt macro, and DuckDB will then store the credentials in an encrypted format in a special dot-directory in your home directory: https://duckdb.org/docs/sql/statements/create_secret.html
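
Something like the following, shown here directly from Python rather than wrapped in a dbt macro (a sketch only; the secret name is made up, the region is taken from the profiles.yml above, CREDENTIAL_CHAIN is one of several providers covered in the linked docs, and depending on your DuckDB version persistent secrets may need to be explicitly enabled):

import duckdb

con = duckdb.connect('db.duckdb')

# TYPE S3 is provided by the httpfs extension; the credential_chain
# provider comes from the aws extension.
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("INSTALL aws;")
con.execute("LOAD aws;")

# Persist an S3 secret once; later connections can then read the
# S3-backed tables without calling load_aws_credentials() each time.
con.execute("""
    CREATE PERSISTENT SECRET my_s3_secret (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN,
        REGION 'us-west-1'
    );
""")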

JRocki (Author) commented Mar 29, 2024

@jwills That makes sense, thank you!
