Unable to compute correlation among columns of my datasets #1522

Open
3 tasks done
tboz38 opened this issue Dec 29, 2023 · 0 comments
tboz38 commented Dec 29, 2023

Current Behaviour

The report does not contain the correlations.
It used to work with a previous release of ydata-profiling.

/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/correlations.py:66: UserWarning: There was an attempt to calculate the auto correlation, but this failed.
To hide this warning, disable the calculation
(using df.profile_report(correlations={"auto": {"calculate": False}})
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: '('compute: 0 methods found', (<class 'ydata_profiling.config.Settings'>, <class 'pyspark.sql.dataframe.DataFrame'>, <class 'dict'>), [])')
warnings.warn(
master_stock_data_exploratory_report (1).json

compute correlation among columns of large datasets
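
The error text in the warning above ('compute: 0 methods found', followed by the argument types) reads like a type-based dispatch lookup that found no implementation registered for a pyspark DataFrame. The following is a minimal sketch of that failure mode only. The `Dispatcher`, `PandasFrame` and `SparkFrame` names are hypothetical stand-ins and this is not ydata-profiling's actual code.

```python
# Hypothetical, simplified type-dispatch registry (illustration only).
class Dispatcher:
    def __init__(self, name):
        self.name = name
        self._methods = {}  # maps a tuple of argument types to a function

    def register(self, *types):
        def decorator(func):
            self._methods[types] = func
            return func
        return decorator

    def __call__(self, *args):
        # Collect every registered function whose signature matches the arguments.
        matches = [
            func
            for types, func in self._methods.items()
            if len(types) == len(args)
            and all(isinstance(arg, typ) for arg, typ in zip(args, types))
        ]
        if not matches:
            arg_types = tuple(type(arg) for arg in args)
            raise TypeError(f"{self.name}: 0 methods found", arg_types, [])
        return matches[0](*args)


class PandasFrame:  # stand-in for pandas.DataFrame
    pass


class SparkFrame:  # stand-in for pyspark.sql.DataFrame
    pass


compute_auto = Dispatcher("compute")


@compute_auto.register(dict, PandasFrame, dict)
def _auto_pandas(config, df, summary):
    # Only the pandas-like type has an implementation in this sketch.
    return "auto correlation matrix (pandas)"


def try_compute(df):
    """Call the dispatcher and report a failed lookup instead of raising."""
    try:
        return compute_auto({}, df, {})
    except TypeError as exc:
        return f"failed: {exc.args[0]}"
```

In this sketch the call succeeds for the type that has a registered implementation and fails with the same kind of message for the one that does not, which would be consistent with the report working on a pandas DataFrame but not on a Spark one.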

Expected Behaviour

The report should contain the correlations.

Data Description

correlations_settings = {
"auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
...
}
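
A possible workaround sketch for the settings above when the input is a Spark DataFrame: switch off the "auto" entry that fails in the warning and request specific coefficients instead. Which coefficients the Spark backend actually supports in 4.6.3 is an assumption here (Pearson and Spearman are assumed, not verified), and the helper name `spark_correlation_settings` is hypothetical.

```python
def spark_correlation_settings(threshold: float = 0.9) -> dict:
    """Build a correlations dict with "auto" disabled.

    Assumption (not verified against ydata-profiling 4.6.3): only the
    coefficients listed in `explicit` can be computed on Spark input.
    """
    explicit = {"pearson", "spearman"}
    names = ["auto", "pearson", "spearman", "kendall", "phi_k", "cramers"]
    return {
        name: {
            "calculate": name in explicit,
            "warn_high_correlations": name in explicit,
            "threshold": threshold,
        }
        for name in names
    }


correlations_settings = spark_correlation_settings()
```

The resulting dict has the same keys as the one in this report and could be passed as `correlations=correlations_settings` to `ProfileReport`. This only avoids the failing "auto" computation; it does not restore it.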

My datasets are private, but I can provide an anonymized sample:
record_timestamp,plant,material_part_number,storage_location,unrestricted_use_stock,stock_in_transfer,stock_in_quality_inspection,all_restricted_stock,blocked_stock,block_stock_returns,stock_in_transit,stock_in_transfer_plant_to_plant,stock_at_vendor,valuated_stock_quantities,non_valuated_stock_quantities,stock_value,valuation_class,material_type,gl_account,account_description
2022-09-25,P006,79-2997197-11,,3.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,4.222222328186035,null,2442.855712890625,null,null,null,null
2022-09-25,P006,79-2997102-11,,0.1111111119389534,null,null,1.1111111640930176,null,null,null,null,null,1.2222222089767456,null,37961.89453125,null,null,null,null
2022-09-25,P006,72-2997190-11,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,21672.736328125,null,null,null,null
2022-09-25,P006,72-2997192-11,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,20033.513671875,null,null,null,null
2022-09-25,P006,72-2997197-11,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,5912.4423828125,null,null,null,null
2022-09-25,P006,72-2997102-11,,3.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,4.222222328186035,null,22345.1875,null,null,null,null
2022-09-25,P006,72-2997102-11252,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,325588.53125,null,null,null,null
2022-09-25,P006,72-2997132-11,,7.111111164093018,null,null,1.1111111640930176,null,null,null,null,null,8.222222328186035,null,76378.453125,null,null,null,null
2022-09-25,P006,72-2997138-11,,7.111111164093018,null,null,1.1111111640930176,null,null,null,null,null,8.222222328186035,null,78339.78125,null,null,null,null
2022-09-25,P006,82-2997112-19,,1.1111111640930176,null,null,3.1111111640930176,null,null,null,null,null,4.222222328186035,null,157067.140625,null,null,null,null
2022-09-25,P006,82-2997112-12,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,50453.15625,null,null,null,null
2022-09-25,P006,82-2997139-11,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,9203917,null,null,null,null
2022-09-25,P006,82-2997139-19,,0.1111111119389534,null,null,1.1111111640930176,null,null,null,null,null,1.2222222089767456,null,1100468.375,null,null,null,null
2022-09-25,P006,855112191,,87.11111450195312,null,null,1.1111111640930176,null,null,null,null,null,88.22222137451172,null,2399.089599609375,null,null,null,null
2022-09-25,P006,mj92119-9-r,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,219.49136352539062,null,null,null,null
2022-09-25,P006,mj92119-9h,,977.111083984375,null,null,1.1111111640930176,null,null,null,null,null,978.2222290039062,null,74072.5234375,null,null,null,null
2022-09-25,P006,mj92119-3-o,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,2.4691357612609863,null,null,null,null
2022-09-25,P006,mj92119-3-r,,927.111083984375,null,null,1.1111111640930176,null,null,null,null,null,928.2222290039062,null,27188.12109375,null,null,null,null
2022-09-25,P006,uj92119-3,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,5.713580131530762,null,null,null,null
2022-09-25,P006,uj92119-3r,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,6.3358025550842285,null,null,null,null
2022-09-25,P006,r002719,,302.1111145019531,null,null,1.1111111640930176,null,null,null,null,null,303.22222900390625,null,785.6824951171875,null,null,null,null
2022-09-25,P006,r002712,,72.11111450195312,null,null,1.1111111640930176,null,null,null,null,null,73.22222137451172,null,12.655077934265137,null,null,null,null
2022-09-25,P006,j932rm2111,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,2.4691357612609863,null,null,null,null
2022-09-25,P006,222l0112-9,,1.1111111640930176,null,null,93.11111450195312,null,null,null,null,null,94.22222137451172,null,733587,null,null,null,null
2022-09-25,P006,222l0112-9u,,2.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,3.222222328186035,null,71704.1796875,null,null,null,null
2022-09-25,P006,222l0112-9uom9,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,4913.580078125,null,null,null,null

Code that reproduces the bug

    def create_profile_report(self,
                              dataset_to_analyze: pd.DataFrame,
                              report_name: str,
                              dataset_description_url: str) -> ProfileReport:
        """
        Creates a profile report for a given dataset.
        Args:
            dataset_to_analyze (pd.DataFrame): The dataset to analyze and generate a profile report for.
            report_name (str): The name of the report.
            dataset_description_url (str): The URL of the dataset description.
        Returns:
            ProfileReport: The generated profile report.
        """
        # Perform data quality operations and generate a profile report
        # ...
        # variables preferred characterization settings
        variables_settings = {
            "num": {"low_categorical_threshold": 5, "chi_squared_threshold": 0.999},
            "cat": {"length": True, "characters": False, "words": False,
                    "cardinality_threshold": 50, "imbalance_threshold": 0.5,
                    "n_obs": 5, "chi_squared_threshold": 0.999},
            "bool": {"n_obs": 3, "imbalance_threshold": 0.5}
        }
        missing_diagrams_settings = {
            "heatmap": False,
            "matrix": False,
            "bar": False
        }
        # Plot rendering option, way how to pass arguments to the underlying matplotlib visualization engine
        plot_rendering_settings = {
            # "histogram": {"x_axis_labels": True, "bins": 5, "max_bins": 10},
            "dpi": 200,
            "image_format": "png",
            "missing": {"cmap": "RdBu_r", "force_labels": True},
            "pie": {"max_unique": 10, "colors": ["gold", "b", "#FF796C"]},
            "correlation": {"cmap": "RdBu_r", "bad": "#000000"}
        }
        # Correlation matrices through description_set
        correlations_settings = {
            "auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
            "pearson": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "spearman": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "kendall": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "phi_k": {"calculate": False, "warn_high_correlations": True, "threshold": 0.9},
            "cramers": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
        }

        interactions_settings = {
            "continuous": False,
            "targets": []
        }
        # Customizing the report's theme
        html_report_styling = {
            "style": {
                "theme": "flatly",
                "full_width": True,
                # a list (not a set) keeps the colour order deterministic
                "primary_colors": ["#66cc00", "#ff9933", "#ff0099"]
            }
        }
        current_datetime = datetime.now()
        current_date = current_datetime.date()
        current_year = current_date.strftime("%Y")
        # compute the percentage of the data (rows x columns) used for profiling
        samples_percent_size = 100 * (min(dataset_to_analyze.shape[1], 20) * min(dataset_to_analyze.shape[0], 100000)) / (dataset_to_analyze.shape[1] * dataset_to_analyze.shape[0])
        samples = {
            "head": 0,
            "tail": 0,
            "random": 0
        }
        dataset_description = {
            "description": f"This profiling report was generated using a sample of {samples_percent_size}% of the filtered original dataset.",
            "copyright_holder": "SLS Data platform",
            "copyright_year": current_year,
            "url": dataset_description_url
        }
        # Identify time series variables if any
        # Enable tsmode to True to automatically identify time-series variables
        # and provide the column name that provides the chronological order of your time-series
        # time_series_type_schema = {}
        time_series_mode = False
        # time_series_sortby = None
        # for column_name in dataset_to_analyze.columns.tolist():
        #     if any(keyword in column_name.lower() for keyword in ["date", "timestamp"]):
        #         self.logger.info("candidate column_name as timeseries %s", column_name)
        #         time_series_type_schema[column_name] = "timeseries"
        # if len(time_series_type_schema) > 0:
        #     time_series_mode = True
        #     time_series_sortby = "Date Local"

        # is_run_minimal_mode = self.determine_run_minimal_mode(dataset_to_analyze.columns.tolist(), dataset_to_analyze.shape[0])

        # Convert the Pandas DataFrame to a Spark DataFrame
        # Configure pandas-profiling to handle Spark DataFrames
        # while preserving the categorical encoding
        # Enable Arrow-based columnar data transfers
        self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
        pd.DataFrame.iteritems = pd.DataFrame.items
        # psdf = ps.from_pandas(dataset_to_analyze)
        # data_to_analyze = psdf.to_spark()
        data_to_analyze = self.spark.createDataFrame(dataset_to_analyze)
        ydata_profiling_instance_config = Settings()
        ydata_profiling_instance_config.infer_dtypes = True
        # ydata_profiling_instance_config.Config.set_option("profilers", {"Spark": {"verbose": True}})

        return ProfileReport(
            # dataset_to_analyze,
            data_to_analyze,
            title=report_name,
            dataset=dataset_description,
            sort=None,
            progress_bar=False,
            vars=variables_settings,
            explorative=True,
            plot=plot_rendering_settings,
            correlations=correlations_settings,
            missing_diagrams=missing_diagrams_settings,
            samples=samples,
            # correlations=None,
            interactions=interactions_settings,
            html=html_report_styling,
            # minimal=is_run_minimal_mode,
            minimal=True,
            tsmode=time_series_mode,
            # tsmode=False,
            # sortby=time_series_sortby,
            # type_schema=time_series_type_schema
        )

    def is_categorical_column(self, df, column_name, n_unique_threshold=20, ratio_unique_values=0.05, exclude_patterns=()):
        """
        Determines whether a column in a pandas DataFrame is categorical.
        Args:
            df (pandas.DataFrame): The DataFrame to check.
            column_name (str): The name of the column to check.
            n_unique_threshold (int): The threshold for the number of unique values.
            ratio_unique_values (float): The threshold for the ratio of unique values to total values.
            exclude_patterns (list): A list of patterns to exclude from consideration.
        Returns:
            bool: True if the column is categorical, False otherwise.
        """
        if df[column_name].dtype in [object, str]:

            # Check if the column name matches any of the exclusion patterns
            if any(pattern in column_name for pattern in exclude_patterns):
                return False

            # Check if the number of unique values is less than a threshold
            if df[column_name].nunique() < n_unique_threshold:
                return True

            # Check if the ratio of unique values to total values is less than a threshold
            if 1. * df[column_name].nunique() / df[column_name].count() < ratio_unique_values:
                return True

        # Non-string columns, and string columns matching none of the checks above, are not categorical
        return False

    def get_categorical_columns(self, df, n_unique_threshold=10, ratio_threshold=0.05, exclude_patterns=()):
        """
        Determines which columns in a pandas DataFrame are categorical.
        Args:
            df (pandas.DataFrame): The DataFrame to check.
            n_unique_threshold (int): The threshold for the number of unique values.
            ratio_threshold (float): The threshold for the ratio of unique values to total values.
            exclude_patterns (list): A list of patterns to exclude from consideration.
        Returns:
            list: A list of the names of the categorical columns.
        """
        categorical_cols = []
        for column_name in df.columns:
            if self.is_categorical_column(df, column_name, n_unique_threshold, ratio_threshold, exclude_patterns):
                categorical_cols.append(column_name)

        return categorical_cols


            profile = self.create_profile_report(dataset_to_analyze=data_to_analyze,
                                                 report_name=report_name,
                                                 dataset_description_url=description_url)
            return profile
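
The sampling figure computed in create_profile_report can be checked in isolation. This is a minimal sketch that assumes the same caps as the code above (20 columns, 100000 rows); the function name `sample_percent` is hypothetical. It returns a percentage between 0 and 100, which is the unit the report description string uses, and guards against an empty dataset.

```python
def sample_percent(n_rows: int, n_cols: int,
                   max_rows: int = 100_000, max_cols: int = 20) -> float:
    """Percentage of cells (rows x columns) kept when both axes are capped."""
    total_cells = n_rows * n_cols
    if total_cells == 0:
        # An empty dataset has nothing to sample; avoid dividing by zero.
        return 0.0
    used_cells = min(n_rows, max_rows) * min(n_cols, max_cols)
    return 100.0 * used_cells / total_cells


# The 26-row, 20-column sample in this report is below both caps.
print(sample_percent(26, 20))
# A dataset twice as large as both caps keeps a quarter of its cells.
print(sample_percent(200_000, 40))
```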

pandas-profiling version

v.4.6.3

Dependencies

Ipython-8.19.0
MarkupSafe-2.1.3
PyAthena-3.0.10
PyWavelets-1.5.0
SQLAlchemy-1.4.50
altair-4.2.2
annotated-types-0.6.0
anyio-4.2.0
argon2-cffi-23.1.0
argon2-cffi-bindings-21.2.0
arrow-1.3.0
asn1crypto-1.5.1
asttokens-2.4.1
async-lru-2.0.4
asyncio-3.4.3
awswrangler-3.4.2
babel-2.14.0
beautifulsoup4-4.12.2
bleach-6.1.0
boto-session-manager-1.7.1
boto3-1.34.9
boto3-helpers-1.4.0
botocore-1.34.9
cffi-1.16.0
colorama-0.4.6
comm-0.2.0
cryptography-41.0.7
dacite-1.8.1
debugpy-1.8.0
decorator-5.1.1
defusedxml-0.7.1 
delta-spark-2.3.0
deltalake-0.14.0
editorconfig-0.12.3
entrypoints-0.4
exceptiongroup-1.2.0
executing-2.0.1
fastjsonschema-2.19.1
flatten_dict-0.4.2
fqdn-1.5.1
fsspec-2023.12.2
func-args-0.1.1
great-expectations-0.18.7
greenlet-3.0.3
htmlmin-0.1.12
imagehash-4.3.1
ipykernel-6.28.0 
ipywidgets-8.1.1
isoduration-20.11.0 
iterproxy-0.3.1 
jedi-0.19.1 
jinja2-3.1.2 
jsbeautifier-1.14.11 
json2html-1.3.0 
json5-0.9.14
jsonpatch-1.33
jsonpath-ng-aerospike-1.5.3
jsonpointer-2.4
jsonschema-4.20.0
jsonschema-specifications-2023.12.1
jupyter-client-8.6.0 
jupyter-core-5.6.0 
jupyter-events-0.9.0
jupyter-lsp-2.2.1
jupyter-server-2.12.1 
jupyter-server-terminals-0.5.1
jupyterlab-4.0.9
jupyterlab-pygments-0.3.0
jupyterlab-server-2.25.2
jupyterlab-widgets-3.0.9
llvmlite-0.41.1 
lxml-4.9.4
makefun-1.15.2
markdown-it-py-3.0.0
marshmallow-3.20.1
matplotlib-inline-0.1.6
mdurl-0.1.2 
mistune-3.0.2 
mmhash3-3.0.1 
multimethod-1.10 
nbclient-0.9.0 
nbconvert-7.13.1 
nbformat-5.9.2 
nest-asyncio-1.5.8 
networkx-3.2.1 
notebook-7.0.6 
notebook-shim-0.2.3 
numba-0.58.1 
overrides-7.4.0 
pandas-2.0.3 
pandocfilters-1.5.0 
parso-0.8.3 
pathlib-mate-1.3.1 
pathlib2-2.3.7.post1 
patsy-0.5.5 
pexpect-4.9.0 
phik-0.12.3 
platformdirs-4.1.0
ply-3.11
prometheus-client-0.19.0
prompt-toolkit-3.0.43
psutil-5.9.7
ptyprocess-0.7.0
pure-eval-0.2.2 
py4j-0.10.9.5
pyarrow-12.0.1
pycparser-2.21 
pydantic-2.5.3 
pydantic-core-2.14.6 
pydeequ-1.2.0
pygments-2.17.2
pyiceberg-0.5.1 
pyparsing-3.1.1 
pyspark-3.3.4 
python-json-logger-2.0.7 
pytz-2023.3.post1 
pyzmq-25.1.2 
redshift_connector-2.0.918 
referencing-0.32.0 
requests-2.31.0 
rfc3339-validator-0.1.4 
rfc3986-validator-0.1.1
rich-13.7.0
rpds-py-0.16.2 
ruamel.yaml-0.17.17 
s3path-0.4.2 
s3pathlib-2.0.1 
s3transfer-0.10.0 
scramp-1.4.4 
send2trash-1.8.2 
smart-open-6.4.0 
sniffio-1.3.0 
sortedcontainers-2.4.0 
soupsieve-2.5 
sqlalchemy-redshift-0.8.14 
sqlalchemy_utils-0.41.1 
stack-data-0.6.3 
strictyaml-1.7.3 
tabulate-0.9.0 
tangled-up-in-unicode-0.2.0
terminado-0.18.0 
tinycss2-1.2.1
tomli-2.0.1 
toolz-0.12.0 
tornado-6.4 
traitlets-5.14.0
typeguard-4.1.5
types-python-dateutil-2.8.19.14 
typing-extensions-4.9.0
tzlocal-5.2
uri-template-1.3.0
urllib3-2.0.7
uuid7-0.1.0 
visions-0.7.5
wcwidth-0.2.12
webcolors-1.13
webencodings-0.5.1
websocket-client-1.7.0
widgetsnbextension-4.0.9
wordcloud-1.9.3

OS

linux

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.