fluent_datasources does not reflect runtime datasource addition #9690

Open
ramananayak opened this issue Apr 2, 2024 · 9 comments
Assignees
Labels
fluent-datasources query-asset Related to use of a FDS QueryAsset

Comments

@ramananayak

Describe the bug
I want to add a fluent datasource at runtime, after a FileDataContext has already been created.
context.fluent_datasources is a dictionary, but when I assign a new fluent datasource to it, the entry is not added.
The same assignment works with context.datasources.

To Reproduce

import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = gx.datasource.fluent.PostgresDatasource(name="ds_runtime", connection_string=connection_string,create_temp_table=True)

# The assignment below does not update the dictionary
context.fluent_datasources[runtime_datasource.name] = runtime_datasource

# Whereas the assignment below works, and it also updates fluent_datasources
context.datasources[runtime_datasource.name] = runtime_datasource

Expected behavior
context.fluent_datasources should contain the added runtime_datasource.

Environment (please complete the following information):

  • Operating System: MacOS
  • Great Expectations Version: 0.18.12
  • Data Source: Redshift
  • Cloud environment: AWS


@Kilo59 Kilo59 self-assigned this Apr 6, 2024
@Kilo59
Member

Kilo59 commented Apr 11, 2024

@ramananayak
Sorry for the confusion. This happens because the context.fluent_datasources property is just a dictionary comprehension over context.datasources with all non-fluent datasources filtered out, so every access returns a new dictionary.

We could change the return type annotation to an immutable Mapping[str, FluentDatasource] to help with this. But that wouldn't alter runtime behavior, and you'd have to rely on a type checker or IDE to warn you that it is immutable.

@property
def fluent_datasources(self) -> Dict[str, FluentDatasource]:
    return {
        name: ds
        for (name, ds) in self.datasources.items()
        if isinstance(ds, FluentDatasource)
    }
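
To make the effect of the property above concrete, here is a minimal, self-contained sketch. It does not use Great Expectations: `ToyContext` and the stand-in `FluentDatasource` class are hypothetical, and `datasources` is a plain dict here, which is an assumption made only for illustration. The only part taken from the thread is the shape of the property itself.

```python
from typing import Dict


class FluentDatasource:
    """Stand-in for a fluent datasource (hypothetical, not the GX class)."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToyContext:
    """Toy context that copies the shape of the property shown above."""

    def __init__(self) -> None:
        # Assumption: a plain dict stands in for the real datasource store.
        self.datasources: Dict[str, object] = {}

    @property
    def fluent_datasources(self) -> Dict[str, FluentDatasource]:
        # A brand-new dict is built on every access.
        return {
            name: ds
            for (name, ds) in self.datasources.items()
            if isinstance(ds, FluentDatasource)
        }


context = ToyContext()
ds = FluentDatasource("ds_runtime")

# This writes into a temporary dict that is discarded immediately.
context.fluent_datasources[ds.name] = ds
missing_after_view_write = ds.name not in context.fluent_datasources

# Writing to the underlying mapping is what the property reads from.
context.datasources[ds.name] = ds
present_after_source_write = ds.name in context.fluent_datasources

print(missing_after_view_write, present_after_source_write)
```

In this toy version, the first assignment is lost because it lands in a throwaway dictionary, while the second one shows up in `fluent_datasources` on the next access.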

The idiomatic way to add or update a datasource is by using one of the context.sources.add_or_update_<DATASOURCE_TYPE>() methods. This method also bootstraps the datasource with the components needed to do config substitution and connect certain datasources to things like s3/gcs/databricks etc.

import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

@ramananayak
Author

ramananayak commented Apr 17, 2024

Thanks for the clarification, @Kilo59.
I tried context.sources.add_postgres().
But with a FileDataContext, this ends up writing the connection string details, which I keep in a variable in my code, into the context file (great_expectations.yml).
That defeats the purpose of a runtime datasource. Also, because of the write lock on the context file, multiple checks running against the same config will fail. I want this datasource to be used only at runtime, without affecting great_expectations.yml.

I did some investigation and saw that for FileDataContext the context file is opened in w mode.
So is there any way to add configuration for true runtime use without changing the context file every time?

It is the same with data assets: I don't see any example showing how to create a runtime data asset. I am currently testing with a fluent datasource, and all the methods just keep adding data assets to the context file, so the config file keeps growing.
In 0.17.1, the config below would have created a runtime data asset without any update to the context file. For reference:

validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite

I don't really know how I can achieve this in the latest version.
Thanks for your help !

@Kilo59
Member

Kilo59 commented Apr 18, 2024

@ramananayak
I don't think this is exactly what you are looking for but you can use an EphemeralDataContext that doesn't persist anything.

import great_expectations as gx
context = gx.get_context(mode="ephemeral")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

The code above ☝️ should work, but you won't have access to your file-backed checkpoints, expectations, etc.
You would need to modify the code to pull in those items.

I will pass this along to our team working on the v1.0 release (and any other feedback you have).

@Kilo59
Member

Kilo59 commented Apr 18, 2024

There's a somewhat related issue where a user is creating an ephemeral context from a file context but is unable to load the fluent configs.
For you, this shouldn't be a problem, though.

Updated example that should allow your ephemeral context to pull in the project config from your file context.

import great_expectations as gx

# Create two different contexts using THE SAME config
file_ctx = gx.get_context(mode="file")
ephm_ctx = gx.get_context(mode="ephemeral", project_config=file_ctx.config)

connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = ephm_ctx.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

@ramananayak
Author

ramananayak commented Apr 19, 2024

Hi @Kilo59
Thanks for sharing this. Yes, as you mentioned, I tried the ephemeral context and it looks like it will work.

Here is my version

import yaml
import great_expectations as gx
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import EphemeralDataContext

context_root_dir="path to my initial great_expectation.yml file "
with open(context_root_dir+'/great_expectations.yml', 'r') as file:
    conf = yaml.safe_load(file)
    
context_config = DataContextConfig(**conf)
ephm_ctx  = EphemeralDataContext(project_config=context_config)

connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = ephm_ctx.sources.add_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)

print(repr(runtime_datasource))

This is working, although I have to specify the complete path for each of the GX directories (like the plugin directory), but that's understood.

But as I mentioned above, 0.17.11 supported RuntimeBatchRequest, where I could define the datasource, data asset, and runtime query as part of the checkpoint. I can see it is also available in the 0.18.9 documentation.
But I am not able to get it working, and I am struggling with this.
https://docs.greatexpectations.io/docs/reference/api/core/batch/RuntimeBatchRequest_class
For example, this works perfectly in 0.17.11:

validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite

Is this supported in the latest GX, or do I have to create the data asset for the input query separately, outside of the checkpoint, and then call the checkpoint for validation?
Is there any way to add the datasource and query as part of the checkpoint?

This is a really helpful feature for us, because we keep all the queries as part of the checkpoint. They stay separate, and it is easy to identify the data asset and its expectations together.

thanks !

@ramananayak
Author

Hi @Kilo59
Do you have any information on how we can set up this type of config (the one in the previous comment) for a data asset in the latest GX version?
In the new GX version, do we have to create a data asset first for every query and then add the required checkpoint? So there is no way to create a data asset at runtime?

If you have any ideas on this, some pointers would really help.

thanks !
Ram

@Kilo59
Member

Kilo59 commented May 1, 2024

@ramananayak any workflow from 0.17 should still work in 0.18.

I think the issue is that the new "Fluent Style" datasources (those created with the context.sources.add_<TYPE>() methods) do not support declaring queries as part of the batch request.

The documentation for the old "Block Style" datasources is no longer part of our latest version. You'll have to refer to the 0.15 docs.

You can continue to use the old ("Block Style" Datasources) or you can create a QueryAsset.

runtime_datasource = ephm_ctx.sources.add_postgres(
  name="ds_runtime", 
  connection_string=connection_string, 
  create_temp_table=True
)

my_query_asset = runtime_datasource.add_query_asset(name="my_query", query="select column 1 from table")

batch_request = my_query_asset.build_batch_request()

# pass batch_request to your checkpoint

Does the QueryAsset with an ephemeral context meet your needs, or are you still wanting something different?
We are actively working on 1.0 and this kind of feedback is invaluable.

@Kilo59 Kilo59 added query-asset Related to use of a FDS QueryAsset fluent-datasources labels May 1, 2024
@ramananayak
Author

Hi @Kilo59, thanks for your response.
As you mentioned, if I am correct, block-style datasource config is not supported in the latest version,
and I assume support for the older 0.18 and 0.17 versions will end once the next two versions are released.

Now I understand that QueryAsset is the only way to go. I think I may have to write custom code to support runtime queries from config.

But I think runtime query config is a nice feature to have. We have a lot of checks that a user (say, an analyst) sets up in the form of config, and all we do is wrap that config in an Airflow scheduler that runs the checks.
This lets us automate the whole flow through a config-driven framework.
Now that everything is becoming a first-class object, automating the whole flow with multiple checks in a single input will add much more friction and only let users add a single check at a time.

As far as I know, a lot of people use this method to add multiple checks at once. Also, moving everything into a config file (in the case of a file data context) makes the config file very bulky, with a lot of unnecessary config added to the context.
I hope this makes sense.
thank you so much again !

@jcampbell
Member

@ramananayak -- in your case, are you expecting to be able to use the validation results that come from these runtime assets at any time other than the immediate validation? We designed runtime assets to mean that the data would be available/provided at runtime, but the asset configuration itself was durable. The intent of that approach was to ensure that saved validation results could be identified by the asset's (durable) name. It sounds to me like that may be the gap, in that you're not looking to have the configuration of the asset persist at all.

I'd love to jump on a call with you and @Kilo59 if you'd like to make sure I understand the case fully, since we've recently been looking at the question of how to support runtime cases more clearly.
