Add the ability to lint and fix areas of templated code that are not actually executed #4061

Open · 2 of 3 tasks
barrywhart opened this issue Nov 11, 2022 · 13 comments · May be fixed by #4202
Labels: enhancement (New feature or request)

@barrywhart

Search before asking

  • I searched the issues and found no similar issues.

Description

Several users have asked if SQLFluff does (or could) support this feature. Another SQL formatting tool, sqlfmt, supports this (although it's less sophisticated and doesn't actually parse the SQL code).

Alan and Barry discussed how that might be done in SQLFluff:
This PR gives us fairly accurate links between opening and closing Jinja tags: #3936
And the rework of placeholder segments (currently bound up in the reindent PR) means that we know where code is and isn't being rendered.
...so we could detect if statements and find cases where code is never rendered.
The only thing we'd need to do would be to allow the templater to optionally return multiple versions of a query, parse them all and then lint them all, deduplicating the results.
If one of the versions fails, we'd probably hide the parsing errors, because it could be us incorrectly interpreting the loops; we wouldn't really understand the loops, we'd just be replacing code.
...Thinking about it, the other thing we could do would be to preprocess the templates, replacing any if statements with True or False to force clauses to be evaluated. 🤷
All of this feels quite hacky though. Maybe we just allow users to add a --sqlfluff directive which tells the templater what other options to insert for some context values to generate multiple versions of a query. Whatever we do, it all boils down to linting multiple versions of the same query, which has a performance overhead.
That being said, templating is fast. We could pretty quickly evaluate, for any query, whether any alternative variations exist. For most queries they won't, so performance is the same. We could also evaluate how many exist, and limit ourselves to only evaluating the most significant n (e.g. the variations with the most different code). Altering the templater to optionally return an iterable of TemplatedFile wouldn't be too hard, and deduplicating errors based on their source position already exists (#4041); the same method could easily be used here, so that the user doesn't know the file got linted more than once.
I'm worried about the potential for extreme cases. Just as we have people trying to lint 100,000 line files, someone will try to lint a file with 10,000 if statements, which means 2^10,000 possible paths.
Do you have thoughts on how to determine and limit to the most significant variations?
Limiting is easy once we know how many variations.
Determining the variations is the hard bit.
I'm tempted initially to just start generating them, and then just stop iteration after generating 5.
Then we filter from those files.
That would catch all the simple cases, and then we just warn for complicated cases.
That could be a good start. Now that I think about it, basically we just want to cover all the code. And JinjaTracer knows exactly what code is covered. So I think we could do this pretty easily. The tracer already does something slightly similar when building the trace -- it knows from the trace output what code got executed, but not necessarily how it got there. It has some very simple heuristics to determine that, e.g. if it executed code above the current point, then it needs to find a loop end whose beginning is above the target point. OTOH, if the next step is below, then find the shortest path to get there (nothing fancy, just skip over any "if"s).
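
To make the coverage idea above concrete, here is a minimal Python sketch (an editor's illustration only, not SQLFluff's actual implementation; variants, dedupe, and the violation-dict shape are all hypothetical names). It yields extra renderings of a template by forcing each {% if %} condition to False so that normally-skipped branches get covered, stops after a small cap, and deduplicates violations by source position so the user never sees the file as linted more than once:

import re
from jinja2 import Template

IF_TAG = re.compile(r"{%-?\s*if\s+.+?\s*-?%}")

def variants(template: str, limit: int = 5):
    """Yield the template unchanged, then versions with each {% if %}
    condition forced to False so the normally-skipped branch renders.
    Stop at `limit` to guard against files with huge numbers of
    conditionals (2^n possible paths); a real implementation would warn."""
    yield template
    for count, match in enumerate(IF_TAG.finditer(template), start=2):
        if count > limit:
            break
        yield template[:match.start()] + "{% if False %}" + template[match.end():]

def dedupe(violations):
    """Drop duplicate violations by (rule, source line, source column),
    reusing the dedup-by-source-position idea from #4041."""
    seen = set()
    for v in violations:
        key = (v["rule"], v["line"], v["col"])
        if key not in seen:
            seen.add(key)
            yield v

sql = "SELECT a {% if incremental %}, b {% else %}, c {% endif %} FROM t"
for variant in variants(sql):
    print(Template(variant).render(incremental=True))

A real implementation would presumably consult the trace instead (JinjaTracer knows exactly which template code executed) to decide whether to force True or False, rather than blindly forcing False as this sketch does.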

Use case

Lint or format the whole SQL file without requiring the user to process the file multiple times with different Jinja variables.

Dialect

n/a

Are you willing to work on and submit a PR to address the issue?

  • Yes I am willing to submit a PR!

Code of Conduct

@barrywhart added the enhancement (New feature or request) label on Nov 11, 2022
@barrywhart

In some cases, forcing execution of a branch not normally taken could cause runtime errors, parse errors, etc. The linter should ignore these, but in case of potential side effects (e.g. it's common in dbt to call out to the database, perhaps other external services), we should provide a way to suppress this behavior, e.g. a comment like:

{% if True %}
    SELECT 1
{% else %}  --ignore: coverage
    SELECT 2
{% endif %}
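
As a sketch of how such a directive might be detected (to be clear, the --ignore: coverage syntax is only the proposal above, not an existing SQLFluff feature), a variant generator could scan for marked tags and refuse to force those branches:

import re

# Matches an {% else %} (or {%- else -%}) tag followed on the same line
# by the proposed suppression comment.
SUPPRESSED_ELSE = re.compile(r"{%-?\s*else\s*-?%}[^\n]*--\s*ignore:\s*coverage")

def suppressed_offsets(template: str) -> list[int]:
    """Character offsets of branches the variant generator must not force,
    e.g. because rendering them could trigger side effects."""
    return [m.start() for m in SUPPRESSED_ELSE.finditer(template)]

template = (
    "{% if True %}\n"
    "    SELECT 1\n"
    "{% else %}  --ignore: coverage\n"
    "    SELECT 2\n"
    "{% endif %}"
)
print(suppressed_offsets(template))  # offsets of tags to leave alone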

@barrywhart

This is somewhat related to #4114.

This issue involves bookkeeping on whether each if statement is executed, then changing the trace template code to force it to be executed next time.
#4114 involves bookkeeping on how many times each for loop is executed, then discarding extra executions.
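
Illustratively (hypothetical names; neither structure exists in SQLFluff today), the two kinds of bookkeeping might look like this in Python:

from dataclasses import dataclass, field

@dataclass
class TraceBookkeeping:
    # This issue: {% if %} tags (keyed by source offset) that never had a
    # branch rendered, so a later pass can force them.
    unexecuted_if_branches: set[int] = field(default_factory=set)
    # #4114: how many times each {% for %} tag executed, so that extra
    # iterations can be discarded.
    for_loop_iterations: dict[int, int] = field(default_factory=dict)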

@barrywhart commented Dec 24, 2022

Hey all, is it possible to run sqlfluff for jinja (dbt) in a way that resolves all possible if-else branches? For example, in my dbt project I have this:

<base code>
{% if is_incremental() %}
<code A>
{% else %}
<code B>
{% endif %}

and I would like to validate <base code> + <code A> and <base code> + <code B>. Currently it's only validating <base code> + <code A>.

The output of the parse command for the else block is: [Type: 'compound', Raw: '{% else %}... [20 unused template characters] ...{% endif %}'].
I also tried to set ignore_templated_areas = False but it did not help.

@barrywhart commented Dec 24, 2022

Thanks for your reply.
I think I understand: within the dbt project you have a single dbt model .sql file, and this model contains a macro that iterates through a list.

  1. The 'tables' that are created: are these dbt models / .sql files contained within the dbt project? I don't quite understand how these tables are 'created' or where they reside.
  2. Do you lose lineage?

For info, I've created a function in R, as I'm more familiar with that than Python.
I provide a list of sink tables (in this case these are all {{ source }} objects), and the function outputs a model / .sql file for each item in the list.

I've set incremental load within the project.yml for all models within the models\raw folder that the function outputs to:

fn_create_dbt_raw_models_script <- function(alist, schema, folder_path) {
  
  for (table_name in alist) { 
    dbt_script <- glue('
                              with source as (
                              
                                  select *
                                  from {{{{ source(\'{schema}\', \'{table_name}\') }}}}
                              
                              )
                              
                              select * from source
                              
                              {{% if is_incremental() %}}
                              
                              -- this filter will only be applied on an incremental run
                              where SysStartTime > (select max(SysStartTime) from {{{{ this }}}})
                              
                              {{% endif %}}
                          ')
    
    print(table_name)
    print(dbt_script)    
    print('start write')
    
    # Writes .sql file/model
    write.table(dbt_script,
                file = paste0(folder_path, table_name,'.sql', sep = ""),
                row.names=FALSE,
                col.names=FALSE,
                quote=FALSE)
    
    print('finish write')
    
  }
}

@barrywhart commented Dec 24, 2022

{% if is_incremental() %}
AND date >= dateadd(DAY, -31, '{{ var('PREV_RUN_DATE_TIME') }}')
AND fori.delivered_date_pt < '{{ var('RUN_DATE_TIME') }}'
{% endif %}

@barrywhart commented Dec 24, 2022

Hi folks!
I am trying to implement the incremental model in dbt. Below is the code snippet. The incremental feature is failing (the data consumption is the same with and without the incremental feature added). Do let me know if I am missing anything here.

{{ config(
    materialized = 'incremental',
    unique_key = ['user_id', 'sessionId'],
    incremental_strategy = 'insert_overwrite',
    partition_by = {
        "field": "signup_dt",
        "data_type": "timestamp",
        "granularity": "day"
    }
) }}

select
current_timestamp() as row_update_dt,
signup_dt,
user_id,
sessionId
from {{ source('schema', 'table_source') }}
where 1=1
{% if is_incremental() %}
    and date(_PARTITIONTIME) > (select date(max(row_update_dt)) from {{ this }})
{% endif %}

@barrywhart commented Dec 24, 2022

Here is an example where we override a ref.

{%- macro incr_ref(model_name)-%}
  {%- set node = ref(model_name) -%}
  {%- set cols = adapter.get_columns_in_relation(this)|map(attribute='column')|list|lower -%}

  {%- if is_incremental() -%}
    (
      SELECT 
        * 
      FROM 
        {{node}} 
      WHERE 
        -- basic incremental logic goes here; I don't think this is the issue.
  )
  {%- else -%}
    (
      SELECT * FROM {{node}}
    )
  {%- endif -%}
{%- endmacro -%}

In the model it's basically a select some_columns from {{ incr_ref('some_model') }}.
I have a feeling it errors in any macro where adapter.function is invoked.

@barrywhart commented Dec 24, 2022

The first time you run the model, the table will be created and the pre_combine key will be added; on this run the is_incremental section is not evaluated.
The second time you run it, the is_incremental section will be evaluated, and should work.

Few notes:

  • timestamp needs to exist in your source table, as you fetch based on that
  • you are using the last timestamp added in the destination table to pick data from the source; most likely you will miss some data points

I think that you are missing something in your model to de-duplicate the input data.
Here's how I would do it:

{{ config(
    materialized='incremental',
    file_format='hudi',
    incremental_strategy='merge',
    unique_key='id',
    pre_combine_key='timestamp',
    partition_by=['date_partition'],
    on_schema_change='append_new_columns',
    custom_location='s3://bucket/analytics/table/table_name' --e.g
    )
}}

WITH SOURCE AS (
	SELECT
      id
      ,date
      ,date_partition
  FROM {{ source('db_raw_dev', 'raw_table_poc_dbt_hudi')}}
  {% if is_incremental() %}
    WHERE timestamp >= (SELECT max(timestamp) FROM {{ this }}) 
  {% endif %}
  ),
ordered as (
  select *,
    row_number() over (partition by id order by timestamp desc) as rn
  from SOURCE
)

SELECT * FROM ordered
where rn=1

This model will first deduplicate the data, then return only the latest record:
2e419c8c2813905034b539fc0ab9c321 2022-10-26
NOTE: I used timestamp in the source to build the deduplication; that is better than using a date, since if you have records with the same date you want to be sure to pick the latest.

dbt merge/incremental does not de-duplicate for you.

@alanmcruickshank

I've already started merging in some of the logic for this, so I've assigned this issue to me for now.

@gwenwindflower

@alanmcruickshank -- do you feel like you have a good path for fixing this in SQLFluff? If not, we've discussed potential ways to override/set state consistently for issues like linting conditional logic, so that conditionals always compile in a certain direction. The idea would be to give you some levers to pull to get more consistent results out of dbt compile. Wanted to check with you before I open a dbt-core issue, though; if you feel you've got a handle on this, we probably wouldn't need it. Let me know, thanks!

@alanmcruickshank

@gwenwindflower - I think we've got a good solution for the Jinja parsing bit already 👍

The part that is causing me a headache is managing multiple versions of a file and getting them to all play nice with each other.

That being said - if you've got good suggestions for making this easier then I'm all ears!

@gwenwindflower

@alanmcruickshank cool! i'm
a) really glad to hear yall found a way forward for the pesky jinja conditionals issues! that's amazing!
b) curious to hear more about the versioning problem (feel free to email me at winnie at dbtlabs if that's easier, and we could potentially set up some time between yall and our DX team, since we use SQLFluff in the product we have a solid mandate to make sure we support it 😸 )

Re our ideas, it was more or less this: basically allowing people to construct state in a very precise way, then figuring out what the best way for sqlfluff to use that in compile would be (for instance, letting people customize the args to the compile command for the dbt templater via the .sqlfluff config).

@alanmcruickshank

@gwenwindflower - check out #5822 which supports a very minimal version of this.

It might be best to get on the phone to talk more though and work through some of the details. In particular, this feature is still very experimental so it might not be ready for the big time yet - totally worth starting to test out to find bugs though!
