Add the ability to lint and fix areas of templated code that are not actually executed #4061

Open · 2 of 3 tasks
barrywhart opened this issue Nov 11, 2022 · 13 comments · May be fixed by #4202
Labels: enhancement (New feature or request)

@barrywhart

Search before asking

  • I searched the issues and found no similar issues.

Description

Several users have asked if SQLFluff does (or could) support this feature. Another SQL formatting tool, sqlfmt, supports this (although it's less sophisticated and doesn't actually parse the SQL code).

Alan and Barry discussed how that might be done in SQLFluff:
This PR gives us fairly accurate links between opening and closing Jinja tags: #3936
And the rework of placeholder segments (currently bound up in the reindent PR) means that we know where code is and isn't being rendered.
...so we could detect if statements and find cases where code is never rendered.
The only thing we'd need to do would be to allow the templater to optionally return multiple versions of a query, parse them all and then lint them all, deduplicating the results.
If one of the versions fails, we'd probably hide the parsing errors, because it could be us incorrectly interpreting the loops; we wouldn't really understand the loops, we'd just be replacing code.
...Thinking about it, the other thing we could do would be to preprocess the templates, replacing any if statements with True or False to force clauses to be evaluated. 🤷
All of this feels quite hacky though. Maybe we just allow users to add a --sqlfluff directive which tells the templater what other options to insert for some context values to generate multiple versions of a query. Whatever we do, it all boils down to linting multiple versions of the same query, which has a performance overhead.
That being said, templating is fast. We could pretty quickly evaluate, for any query, whether any alternative variations exist. For most queries they won't, so performance is the same. We could also evaluate how many exist, and limit ourselves to only evaluating the most significant n (e.g. the variations with the most different code). Altering the templater to optionally return an iterable of TemplatedFile wouldn't be too hard, and deduplicating errors based on their source position already exists (#4041); the same method could easily be used here, so that the user doesn't know the file got linted more than once.
I'm worried about the potential for extreme cases. Just as we have people trying to lint 100,000 line files, someone will try to lint a file with 10,000 if statements, which means 2^10,000 possible paths.
Do you have thoughts on how to determine and limit to the most significant variations?
Limiting is easy once we know how many variations.
Determining the variations is the hard bit.
I'm tempted initially to just start generating them, and then just stop iteration after generating 5.
Then we filter from those files.
That would catch all the simple cases, and then we just warn for complicated cases.
That could be a good start. Now that I think about it, basically we just want to cover all the code. And JinjaTracer knows exactly what code is covered. So I think we could do this pretty easily. The tracer already does something slightly similar when building the trace -- it knows from the trace output what code got executed, but not necessarily how it got there. It has some very simple heuristics to determine that, e.g. if it executed code above the current point, then it needs to find a loop end whose beginning is above the target point. OTOH, if the next step is below, then find the shortest path to get there (nothing fancy, just skip over any "if"s).
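
To make the coverage idea above concrete, here is a minimal Python sketch (an editor's illustration only, not SQLFluff's actual implementation; variants, dedupe, and the violation-dict shape are all hypothetical names). It yields extra renderings of a template by forcing each {% if %} condition to False so that normally-skipped branches get covered, stops after a small cap, and deduplicates violations by source position so the user never sees the file as linted more than once:

import re
from jinja2 import Template

IF_TAG = re.compile(r"{%-?\s*if\s+.+?\s*-?%}")

def variants(template: str, limit: int = 5):
    """Yield the template unchanged, then versions with each {% if %}
    condition forced to False so the normally-skipped branch renders.
    Stop at `limit` to guard against files with huge numbers of
    conditionals (2^n possible paths); a real implementation would warn."""
    yield template
    for count, match in enumerate(IF_TAG.finditer(template), start=2):
        if count > limit:
            break
        yield template[:match.start()] + "{% if False %}" + template[match.end():]

def dedupe(violations):
    """Drop duplicate violations by (rule, source line, source column),
    reusing the dedup-by-source-position idea from #4041."""
    seen = set()
    for v in violations:
        key = (v["rule"], v["line"], v["col"])
        if key not in seen:
            seen.add(key)
            yield v

sql = "SELECT a {% if incremental %}, b {% else %}, c {% endif %} FROM t"
for variant in variants(sql):
    print(Template(variant).render(incremental=True))

A real implementation would presumably consult the trace instead (JinjaTracer knows exactly which template code executed) to decide whether to force True or False, rather than blindly forcing False as this sketch does.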

Use case

Lint or format the whole SQL file without requiring the user to process the file multiple times with different Jinja variables.

Dialect

n/a

Are you willing to work on and submit a PR to address the issue?

  • Yes I am willing to submit a PR!

Code of Conduct

@barrywhart added the enhancement (New feature or request) label on Nov 11, 2022
@barrywhart

In some cases, forcing execution of a branch not normally taken could cause runtime errors, parse errors, etc. The linter should ignore these, but in case of potential side effects (e.g. it's common in dbt to call out to the database, perhaps other external services), we should provide a way to suppress this behavior, e.g. a comment like:

{% if True %}
    SELECT 1
{% else %}  --ignore: coverage
    SELECT 2
{% endif %}
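
As a sketch of how such a directive might be detected (to be clear, the --ignore: coverage syntax is only the proposal above, not an existing SQLFluff feature), a variant generator could scan for marked tags and refuse to force those branches:

import re

# Matches an {% else %} (or {%- else -%}) tag followed on the same line
# by the proposed suppression comment.
SUPPRESSED_ELSE = re.compile(r"{%-?\s*else\s*-?%}[^\n]*--\s*ignore:\s*coverage")

def suppressed_offsets(template: str) -> list[int]:
    """Character offsets of branches the variant generator must not force,
    e.g. because rendering them could trigger side effects."""
    return [m.start() for m in SUPPRESSED_ELSE.finditer(template)]

template = (
    "{% if True %}\n"
    "    SELECT 1\n"
    "{% else %}  --ignore: coverage\n"
    "    SELECT 2\n"
    "{% endif %}"
)
print(suppressed_offsets(template))  # offsets of tags to leave alone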

@barrywhart

This is somewhat related to #4114.

This issue involves bookkeeping on whether each if statement is executed, then changing the trace template code to force it to be executed next time.
#4114 involves bookkeeping on how many times each for loop is executed, then discarding extra executions.
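
Illustratively (hypothetical names; neither structure exists in SQLFluff today), the two kinds of bookkeeping might look like this in Python:

from dataclasses import dataclass, field

@dataclass
class TraceBookkeeping:
    # This issue: {% if %} tags (keyed by source offset) that never had a
    # branch rendered, so a later pass can force them.
    unexecuted_if_branches: set[int] = field(default_factory=set)
    # #4114: how many times each {% for %} tag executed, so that extra
    # iterations can be discarded.
    for_loop_iterations: dict[int, int] = field(default_factory=dict)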

@barrywhart commented Dec 24, 2022

Hey all, is it possible to run sqlfluff for jinja (dbt) in a way that resolves all possible if-else branches? For example, in my dbt project I have this:

<base code>
{% if is_incremental() %}
<code A>
{% else %}
<code B>
{% endif %}

and I would like to validate <base code> + <code A> and <base code> + <code B>. Currently it's only validating <base code> + <code A>.

The output of the parse command for the else block is: [Type: 'compound', Raw: '{% else %}... [20 unused template characters] ...{% endif %}'].
I also tried to set ignore_templated_areas = False but it did not help.

@barrywhart commented Dec 24, 2022

Thanks for your reply.
I think I understand: within the dbt project you have a single dbt model .sql file, and this model contains a macro that iterates through a list.

  1. The 'tables' that are created: are these dbt models / .sql files contained within the dbt project? I don't quite understand how these tables are 'created' or where they reside.
  2. Do you lose lineage?

For info, I've created a function in R, as I'm more familiar with that than Python.
I provide a list of sink tables (in this case these are all {{ source }} objects), and the function outputs a model / .sql file for each item in the list.

I've set incremental load within the project.yml for all models within the models\raw folder that the function outputs to:

fn_create_dbt_raw_models_script <- function(alist, schema, folder_path) {
  
  for (table_name in alist) { 
    dbt_script <- glue('
                              with source as (
                              
                                  select *
                                  from {{{{ source(\'{schema}\', \'{table_name}\') }}}}
                              
                              )
                              
                              select * from source
                              
                              {{% if is_incremental() %}}
                              
                              -- this filter will only be applied on an incremental run
                              where SysStartTime > (select max(SysStartTime) from {{{{ this }}}})
                              
                              {{% endif %}}
                          ')
    
    print(table_name)
    print(dbt_script)    
    print('start write')
    
    # Writes .sql file/model
    write.table(dbt_script,
                file = paste0(folder_path, table_name,'.sql', sep = ""),
                row.names=FALSE,
                col.names=FALSE,
                quote=FALSE)
    
    print('finish write')
    
  }
}

@barrywhart commented Dec 24, 2022

{% if is_incremental() %}
AND date >= dateadd(DAY, -31, '{{ var('PREV_RUN_DATE_TIME') }}')
AND fori.delivered_date_pt < '{{ var('RUN_DATE_TIME') }}'
{% endif %}

@barrywhart commented Dec 24, 2022

Hi folks!
I am trying to implement the incremental model in dbt. Below is the code snippet. The incremental feature is failing (the data consumption is the same with and without the incremental feature added). Do let me know if I am missing anything here.

{{ config(
    materialized = 'incremental',
    unique_key = ['user_id', 'sessionId'],
    incremental_strategy = 'insert_overwrite',
    partition_by = {
        "field": "signup_dt",
        "data_type": "timestamp",
        "granularity": "day"
    }
) }}

select
current_timestamp() as row_update_dt,
signup_dt,
user_id,
sessionId
from {{ source('schema', 'table_source') }}
where 1=1
{% if is_incremental() %}
    and date(_PARTITIONTIME) > (select date(max(row_update_dt)) from {{ this }})
{% endif %}

@barrywhart commented Dec 24, 2022

Here is an example where we override a ref.

{%- macro incr_ref(model_name)-%}
  {%- set node = ref(model_name) -%}
  {%- set cols = adapter.get_columns_in_relation(this)|map(attribute='column')|list|lower -%}

  {%- if is_incremental() -%}
    (
      SELECT 
        * 
      FROM 
        {{node}} 
      WHERE 
        -- basic incremental logic goes here; I don't think this is the issue.
  )
  {%- else -%}
    (
      SELECT * FROM {{node}}
    )
  {%- endif -%}
{%- endmacro -%}

In the model it's basically a select some_columns from {{ incr_ref('some_model') }}.
I have a feeling it errors in any macro where adapter.function is invoked.

@barrywhart commented Dec 24, 2022

The first time you run the model, the table will be created and the pre_combine key will be added; on this run the is_incremental section is not evaluated.
The second time you run it, the is_incremental section will be evaluated, and should work.

Few notes:

  • timestamp needs to exist in your source table, as you fetch based on that
  • you are using the last timestamp added in the destination table to pick data from the source; most likely you will miss some data points

I think that you are missing something in your model to de-duplicate the input data.
Here's how I would do it:

{{ config(
    materialized='incremental',
    file_format='hudi',
    incremental_strategy='merge',
    unique_key='id',
    pre_combine_key='timestamp',
    partition_by=['date_partition'],
    on_schema_change='append_new_columns',
    custom_location='s3://bucket/analytics/table/table_name' --e.g
    )
}}

WITH SOURCE AS (
	SELECT
      id
      ,date
      ,date_partition
  FROM {{ source('db_raw_dev', 'raw_table_poc_dbt_hudi')}}
  {% if is_incremental() %}
    WHERE timestamp >= (SELECT max(timestamp) FROM {{ this }}) 
  {% endif %}
  ),
ordered as (
  select *,
    row_number() over (partition by id order by timestamp desc) as rn
  from SOURCE
)

SELECT * FROM ordered
where rn=1

This model will first deduplicate the data, then return only the latest record:
2e419c8c2813905034b539fc0ab9c321 2022-10-26
NOTE: I used timestamp in the source to build the deduplication; that is better than using a date, since if you have records with the same date you want to be sure to pick the latest.

dbt merge/incremental does not de-duplicate for you.

@alanmcruickshank

I've already started merging in some of the logic for this, so I've assigned this issue to me for now.

@gwenwindflower

@alanmcruickshank -- do you feel like you have a good path for fixing this in SQLFluff? If not, we've discussed potential ways to override/set state consistently for issues like linting conditional logic, so that conditionals always compile in a certain direction. The idea would be to give you some levers to pull to get more consistent results out of dbt compile. Wanted to check with you before I open a dbt-core issue, though; if you feel you've got a handle on this, we probably wouldn't need it. Let me know, thanks!

@alanmcruickshank

@gwenwindflower - I think we've got a good solution for the Jinja parsing bit already 👍

The part that is causing me a headache is managing multiple versions of a file and getting them to all play nice with each other.

That being said - if you've got good suggestions for making this easier then I'm all ears!

@gwenwindflower

@alanmcruickshank cool! i'm
a) really glad to hear yall found a way forward for the pesky jinja conditionals issues! that's amazing!
b) curious to hear more about the versioning problem (feel free to email me at winnie at dbtlabs if that's easier, and we could potentially set up some time between yall and our DX team, since we use SQLFluff in the product we have a solid mandate to make sure we support it 😸 )

Re our ideas, it was more or less this: basically allowing people to construct state in a very precise way, then figuring out what the best way for sqlfluff to use that in compile would be (for instance, letting people customize the args to the compile command for the dbt templater via the .sqlfluff config).

@alanmcruickshank

@gwenwindflower - check out #5822 which supports a very minimal version of this.

It might be best to get on the phone to talk more though and work through some of the details. In particular, this feature is still very experimental so it might not be ready for the big time yet - totally worth starting to test out to find bugs though!
