Unable to compute correlation among columns of my datasets #1522

tboz38 · 2023-12-29T15:08:28Z

Current Behaviour

The report does not contain the correlations.
It used to work with a previous release of ydata-profiling.

/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/correlations.py:66: UserWarning: There was an attempt to calculate the auto correlation, but this failed.
To hide this warning, disable the calculation
(using df.profile_report(correlations={"auto": {"calculate": False}})
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: '('compute: 0 methods found', (<class 'ydata_profiling.config.Settings'>, <class 'pyspark.sql.dataframe.DataFrame'>, <class 'dict'>), [])')
warnings.warn(
master_stock_data_exploratory_report (1).json

compute correlation among columns of large datasets
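
The `compute: 0 methods found` message reads like a dispatch failure: no `auto` correlation implementation seems to be registered for a `pyspark.sql.DataFrame`. As a possible workaround, and assuming (not verified in this report) that the Spark backend does handle `pearson` and `spearman`, the settings could be rewritten so that `auto` is off and only those two methods are on. `spark_safe_correlations` is a hypothetical helper, not part of ydata-profiling; it only manipulates the settings dictionary.

```python
import copy


# Hypothetical helper (not a ydata-profiling API): returns a copy of the
# correlation settings with "auto" switched off and only `enabled` methods on.
def spark_safe_correlations(settings, enabled=("pearson", "spearman")):
    result = copy.deepcopy(settings)  # leave the caller's dict untouched
    for name, options in result.items():
        options["calculate"] = name in enabled
    # Force "auto" off even if it was missing from the input or listed in `enabled`.
    result.setdefault("auto", {})["calculate"] = False
    return result


# Same values as the correlations_settings used in the report below.
correlations_settings = {
    "auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
    "pearson": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
    "spearman": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
    "kendall": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
    "phi_k": {"calculate": False, "warn_high_correlations": True, "threshold": 0.9},
    "cramers": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
}

patched = spark_safe_correlations(correlations_settings)
```

The result would then be passed as `correlations=patched` to `ProfileReport`. Whether this actually restores the correlation section on Spark is an assumption that still needs to be confirmed.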

Expected Behaviour

The report should contain the correlations.

Data Description

correlations_settings = {
"auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
...
}

My datasets are private, but I can provide an anonymized sample:
record_timestamp,plant,material_part_number,storage_location,unrestricted_use_stock,stock_in_transfer,stock_in_quality_inspection,all_restricted_stock,blocked_stock,block_stock_returns,stock_in_transit,stock_in_transfer_plant_to_plant,stock_at_vendor,valuated_stock_quantities,non_valuated_stock_quantities,stock_value,valuation_class,material_type,gl_account,account_description
2022-09-25,P006,79-2997197-11,,3.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,4.222222328186035,null,2442.855712890625,null,null,null,null
2022-09-25,P006,79-2997102-11,,0.1111111119389534,null,null,1.1111111640930176,null,null,null,null,null,1.2222222089767456,null,37961.89453125,null,null,null,null
2022-09-25,P006,72-2997190-11,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,21672.736328125,null,null,null,null
2022-09-25,P006,72-2997192-11,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,20033.513671875,null,null,null,null
2022-09-25,P006,72-2997197-11,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,5912.4423828125,null,null,null,null
2022-09-25,P006,72-2997102-11,,3.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,4.222222328186035,null,22345.1875,null,null,null,null
2022-09-25,P006,72-2997102-11252,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,325588.53125,null,null,null,null
2022-09-25,P006,72-2997132-11,,7.111111164093018,null,null,1.1111111640930176,null,null,null,null,null,8.222222328186035,null,76378.453125,null,null,null,null
2022-09-25,P006,72-2997138-11,,7.111111164093018,null,null,1.1111111640930176,null,null,null,null,null,8.222222328186035,null,78339.78125,null,null,null,null
2022-09-25,P006,82-2997112-19,,1.1111111640930176,null,null,3.1111111640930176,null,null,null,null,null,4.222222328186035,null,157067.140625,null,null,null,null
2022-09-25,P006,82-2997112-12,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,50453.15625,null,null,null,null
2022-09-25,P006,82-2997139-11,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,9203917,null,null,null,null
2022-09-25,P006,82-2997139-19,,0.1111111119389534,null,null,1.1111111640930176,null,null,null,null,null,1.2222222089767456,null,1100468.375,null,null,null,null
2022-09-25,P006,855112191,,87.11111450195312,null,null,1.1111111640930176,null,null,null,null,null,88.22222137451172,null,2399.089599609375,null,null,null,null
2022-09-25,P006,mj92119-9-r,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,219.49136352539062,null,null,null,null
2022-09-25,P006,mj92119-9h,,977.111083984375,null,null,1.1111111640930176,null,null,null,null,null,978.2222290039062,null,74072.5234375,null,null,null,null
2022-09-25,P006,mj92119-3-o,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,2.4691357612609863,null,null,null,null
2022-09-25,P006,mj92119-3-r,,927.111083984375,null,null,1.1111111640930176,null,null,null,null,null,928.2222290039062,null,27188.12109375,null,null,null,null
2022-09-25,P006,uj92119-3,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,5.713580131530762,null,null,null,null
2022-09-25,P006,uj92119-3r,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,6.3358025550842285,null,null,null,null
2022-09-25,P006,r002719,,302.1111145019531,null,null,1.1111111640930176,null,null,null,null,null,303.22222900390625,null,785.6824951171875,null,null,null,null
2022-09-25,P006,r002712,,72.11111450195312,null,null,1.1111111640930176,null,null,null,null,null,73.22222137451172,null,12.655077934265137,null,null,null,null
2022-09-25,P006,j932rm2111,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,2.4691357612609863,null,null,null,null
2022-09-25,P006,222l0112-9,,1.1111111640930176,null,null,93.11111450195312,null,null,null,null,null,94.22222137451172,null,733587,null,null,null,null
2022-09-25,P006,222l0112-9u,,2.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,3.222222328186035,null,71704.1796875,null,null,null,null
2022-09-25,P006,222l0112-9uom9,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,4913.580078125,null,null,null,null
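
As a sanity check that the sample does contain correlated columns, here is a minimal standard-library sketch. It uses four rows and a subset of the columns copied from the sample above, treats `null` and empty cells as missing, and computes a plain Pearson coefficient. The helper names `to_float` and `pearson` are illustrative only; this is not how ydata-profiling computes correlations.

```python
import csv
import io
import math

# Four rows and a subset of columns, copied from the sample above.
SAMPLE = """record_timestamp,plant,material_part_number,unrestricted_use_stock,stock_in_transfer,valuated_stock_quantities
2022-09-25,P006,79-2997197-11,3.1111111640930176,null,4.222222328186035
2022-09-25,P006,79-2997102-11,0.1111111119389534,null,1.2222222089767456
2022-09-25,P006,72-2997190-11,1.1111111640930176,null,2.222222328186035
2022-09-25,P006,72-2997197-11,9.11111068725586,null,10.222222328186035
"""


def to_float(value):
    """Convert a CSV cell to float, mapping '' and 'null' to None."""
    return None if value in ("", "null") else float(value)


def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present.

    Returns None when fewer than two pairs remain or a column is constant.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
unrestricted = [to_float(r["unrestricted_use_stock"]) for r in rows]
valuated = [to_float(r["valuated_stock_quantities"]) for r in rows]
in_transfer = [to_float(r["stock_in_transfer"]) for r in rows]

r_stock = pearson(unrestricted, valuated)    # strongly correlated columns
r_empty = pearson(unrestricted, in_transfer)  # all-null column, no result
```

In these rows `valuated_stock_quantities` is `unrestricted_use_stock` plus a constant, so a high correlation between the two is what the report would be expected to flag.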

Code that reproduces the bug

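from datetime import datetime

import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

    # Method of the reporting class (class definition omitted)
    def create_profile_report(self,
                              dataset_to_analyze: pd.DataFrame,
                              report_name: str,
                              dataset_description_url: str) -> ProfileReport: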
        """
        Creates a profile report for a given dataset.
        Args:
            dataset_to_analyze (pd.DataFrame): The dataset to analyze and generate a profile report for.
            report_name (str): The name of the report.
            dataset_description_url (str): The URL of the dataset description.
        Returns:
            ProfileReport: The generated profile report.
        """
        # Perform data quality operations and generate a profile report
        # ...
        # variables preferred characterization settings
        variables_settings = {
            "num": {"low_categorical_threshold": 5, "chi_squared_threshold": 0.999},
            "cat": {"length": True, "characters": False, "words": False,
                    "cardinality_threshold": 50, "imbalance_threshold": 0.5,
                    "n_obs": 5, "chi_squared_threshold": 0.999},
            "bool": {"n_obs": 3, "imbalance_threshold": 0.5}
        }
        missing_diagrams_settings = {
            "heatmap": False,
            "matrix": False,
            "bar": False
        }
        # Plot rendering option, way how to pass arguments to the underlying matplotlib visualization engine
        plot_rendering_settings = {
            # "histogram": {"x_axis_labels": True, "bins": 5, "max_bins": 10},
            "dpi": 200,
            "image_format": "png",
            "missing": {"cmap": "RdBu_r", "force_labels": True},
            "pie": {"max_unique": 10, "colors": ["gold", "b", "#FF796C"]},
            "correlation": {"cmap": "RdBu_r", "bad": "#000000"}
        }
        # Correlation matrices through description_set
        correlations_settings = {
            "auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
            "pearson": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "spearman": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "kendall": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "phi_k": {"calculate": False, "warn_high_correlations": True, "threshold": 0.9},
            "cramers": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
        }

        interactions_settings = {
            "continuous": False,
            "targets": []
        }
        # Customizing the report's theme
        html_report_styling = {
            "style": {
                "theme": "flatly",
                "full_width": True,
                "primary_colors": ["#66cc00", "#ff9933", "#ff0099"]
            }
        }
        current_datetime = datetime.now()
        current_date = current_datetime.date()
        current_year = current_date.strftime("%Y")
        # compute amount of data used for profiling
        n_rows, n_cols = dataset_to_analyze.shape
        samples_percent_size = 100 * (min(n_cols, 20) * min(n_rows, 100000)) / (n_cols * n_rows)
        samples = {
            "head": 0,
            "tail": 0,
            "random": 0
        }
        dataset_description = {
            "description": f"This profiling report was generated using a sample of {samples_percent_size}% of the filtered original dataset.",
            "copyright_holder": "SLS Data platform",
            "copyright_year": current_year,
            "url": dataset_description_url
        }
        # Identify time series variables if any
        # Enable tsmode to True to automatically identify time-series variables
        # and provide the column name that provides the chronological order of your time-series
        # time_series_type_schema = {}
        time_series_mode = False
        # time_series_sortby = None
        # for column_name in dataset_to_analyze.columns.tolist():
        #     if any(keyword in column_name.lower() for keyword in ["date", "timestamp"]):
        #         self.logger.info("candidate column_name as timeseries %s", column_name)
        #         time_series_type_schema[column_name] = "timeseries"
        # if len(time_series_type_schema) > 0:
        #     time_series_mode = True
        #     time_series_sortby = "Date Local"

        # is_run_minimal_mode = self.determine_run_minimal_mode(dataset_to_analyze.columns.tolist(), dataset_to_analyze.shape[0])

        # Convert the Pandas DataFrame to a Spark DataFrame
        # Configure pandas-profiling to handle Spark DataFrames
        # while preserving the categorical encoding
        # Enable Arrow-based columnar data transfers
        self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
        pd.DataFrame.iteritems = pd.DataFrame.items
        # psdf = ps.from_pandas(dataset_to_analyze)
        # data_to_analyze = psdf.to_spark()
        data_to_analyze = self.spark.createDataFrame(dataset_to_analyze)
        ydata_profiling_instance_config = Settings()
        ydata_profiling_instance_config.infer_dtypes = True
        # ydata_profiling_instance_config.Config.set_option("profilers", {"Spark": {"verbose": True}})

        return ProfileReport(
            # dataset_to_analyze,
            data_to_analyze,
            title=report_name,
            dataset=dataset_description,
            sort=None,
            progress_bar=False,
            vars=variables_settings,
            explorative=True,
            plot=plot_rendering_settings,
            correlations=correlations_settings,
            missing_diagrams=missing_diagrams_settings,
            samples=samples,
            # correlations=None,
            interactions=interactions_settings,
            html=html_report_styling,
            # minimal=is_run_minimal_mode,
            minimal=True,
            tsmode=time_series_mode,
            # tsmode=False,
            # sortby=time_series_sortby,
            # type_schema=time_series_type_schema
        )

    def is_categorical_column(self, df, column_name, n_unique_threshold=20, ratio_unique_values=0.05, exclude_patterns=[]):
        """
        Determines whether a column in a pandas DataFrame is categorical.
        Args:
            df (pandas.DataFrame): The DataFrame to check.
            column_name (str): The name of the column to check.
            n_unique_threshold (int): The threshold for the number of unique values.
            ratio_unique_values (float): The threshold for the ratio of unique values to total values.
            exclude_patterns (list): A list of patterns to exclude from consideration.
        Returns:
            bool: True if the column is categorical, False otherwise.
        """
        if df[column_name].dtype in [object, str]:

            # Check if the column name matches any of the exclusion patterns
            if any(pattern in column_name for pattern in exclude_patterns):
                return False

            # Check if the number of unique values is less than a threshold
            if df[column_name].nunique() < n_unique_threshold:
                return True

            # Check if the ratio of unique values to total values is less than a threshold
            if 1. * df[column_name].nunique() / df[column_name].count() < ratio_unique_values:
                return True

            # None of the categorical conditions matched
            return False

        # Non-string columns are never treated as categorical
        return False

    def get_categorical_columns(self, df, n_unique_threshold=10, ratio_threshold=0.05, exclude_patterns=[]):
        """
        Determines which columns in a pandas DataFrame are categorical.
        Args:
            df (pandas.DataFrame): The DataFrame to check.
            n_unique_threshold (int): The threshold for the number of unique values.
            ratio_threshold (float): The threshold for the ratio of unique values to total values.
            exclude_patterns (list): A list of patterns to exclude from consideration.
        Returns:
            list: A list of the names of the categorical columns.
        """
        categorical_cols = []
        for column_name in df.columns:
            if self.is_categorical_column(df, column_name, n_unique_threshold, ratio_threshold, exclude_patterns):
                categorical_cols.append(column_name)

        return categorical_cols


            profile = self.create_profile_report(dataset_to_analyze=data_to_analyze,
                                                 report_name=report_name,
                                                 dataset_description_url=description_url)
            return profile
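
The share of data used for profiling in `create_profile_report` is the capped cell count (at most 20 columns by 100000 rows) divided by the total cell count. Below is a minimal standalone sketch of that arithmetic, expressed as a percentage and guarding against an empty frame. The function name `profiled_share_percent` and its parameters are illustrative, not part of the original code.

```python
def profiled_share_percent(n_rows, n_cols, max_rows=100_000, max_cols=20):
    """Percentage of cells covered when capping at max_rows x max_cols."""
    if n_rows == 0 or n_cols == 0:
        return 0.0  # avoid dividing by zero on an empty dataset
    used = min(n_rows, max_rows) * min(n_cols, max_cols)
    return 100.0 * used / (n_rows * n_cols)


small = profiled_share_percent(26, 20)       # fits entirely within the caps
large = profiled_share_percent(200_000, 40)  # half the rows, half the columns
```

With these inputs `small` is 100.0 and `large` is 25.0, since 100000 x 20 cells out of 200000 x 40 is one quarter.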

pandas-profiling version

v4.6.3

Dependencies

Ipython-8.19.0
MarkupSafe-2.1.3
PyAthena-3.0.10
PyWavelets-1.5.0
SQLAlchemy-1.4.50
altair-4.2.2
annotated-types-0.6.0
anyio-4.2.0
argon2-cffi-23.1.0
argon2-cffi-bindings-21.2.0
arrow-1.3.0
asn1crypto-1.5.1
asttokens-2.4.1
async-lru-2.0.4
asyncio-3.4.3
awswrangler-3.4.2
babel-2.14.0
beautifulsoup4-4.12.2
bleach-6.1.0
boto-session-manager-1.7.1
boto3-1.34.9
boto3-helpers-1.4.0
botocore-1.34.9
cffi-1.16.0
colorama-0.4.6
comm-0.2.0
cryptography-41.0.7
dacite-1.8.1
debugpy-1.8.0
decorator-5.1.1
defusedxml-0.7.1 
delta-spark-2.3.0
deltalake-0.14.0
editorconfig-0.12.3
entrypoints-0.4
exceptiongroup-1.2.0
executing-2.0.1
fastjsonschema-2.19.1
flatten_dict-0.4.2
fqdn-1.5.1
fsspec-2023.12.2
func-args-0.1.1
great-expectations-0.18.7
greenlet-3.0.3
htmlmin-0.1.12
imagehash-4.3.1
ipykernel-6.28.0 
ipywidgets-8.1.1
isoduration-20.11.0 
iterproxy-0.3.1 
jedi-0.19.1 
jinja2-3.1.2 
jsbeautifier-1.14.11 
json2html-1.3.0 
json5-0.9.14
jsonpatch-1.33
jsonpath-ng-aerospike-1.5.3
jsonpointer-2.4
jsonschema-4.20.0
jsonschema-specifications-2023.12.1
jupyter-client-8.6.0 
jupyter-core-5.6.0 
jupyter-events-0.9.0
jupyter-lsp-2.2.1
jupyter-server-2.12.1 
jupyter-server-terminals-0.5.1
jupyterlab-4.0.9
jupyterlab-pygments-0.3.0
jupyterlab-server-2.25.2
jupyterlab-widgets-3.0.9
llvmlite-0.41.1 
lxml-4.9.4
makefun-1.15.2
markdown-it-py-3.0.0
marshmallow-3.20.1
matplotlib-inline-0.1.6
mdurl-0.1.2 
mistune-3.0.2 
mmhash3-3.0.1 
multimethod-1.10 
nbclient-0.9.0 
nbconvert-7.13.1 
nbformat-5.9.2 
nest-asyncio-1.5.8 
networkx-3.2.1 
notebook-7.0.6 
notebook-shim-0.2.3 
numba-0.58.1 
overrides-7.4.0 
pandas-2.0.3 
pandocfilters-1.5.0 
parso-0.8.3 
pathlib-mate-1.3.1 
pathlib2-2.3.7.post1 
patsy-0.5.5 
pexpect-4.9.0 
phik-0.12.3 
platformdirs-4.1.0
ply-3.11
prometheus-client-0.19.0
prompt-toolkit-3.0.43
psutil-5.9.7
ptyprocess-0.7.0
pure-eval-0.2.2
py4j-0.10.9.5
pyarrow-12.0.1
pycparser-2.21 
pydantic-2.5.3 
pydantic-core-2.14.6 
pydeequ-1.2.0
pygments-2.17.2
pyiceberg-0.5.1 
pyparsing-3.1.1 
pyspark-3.3.4 
python-json-logger-2.0.7 
pytz-2023.3.post1 
pyzmq-25.1.2 
redshift_connector-2.0.918 
referencing-0.32.0 
requests-2.31.0 
rfc3339-validator-0.1.4 
rfc3986-validator-0.1.1
rich-13.7.0
rpds-py-0.16.2 
ruamel.yaml-0.17.17 
s3path-0.4.2 
s3pathlib-2.0.1 
s3transfer-0.10.0 
scramp-1.4.4 
send2trash-1.8.2 
smart-open-6.4.0 
sniffio-1.3.0 
sortedcontainers-2.4.0 
soupsieve-2.5 
sqlalchemy-redshift-0.8.14 
sqlalchemy_utils-0.41.1 
stack-data-0.6.3 
strictyaml-1.7.3 
tabulate-0.9.0 
tangled-up-in-unicode-0.2.0
terminado-0.18.0 
tinycss2-1.2.1
tomli-2.0.1 
toolz-0.12.0 
tornado-6.4 
traitlets-5.14.0
typeguard-4.1.5
types-python-dateutil-2.8.19.14 
typing-extensions-4.9.0
tzlocal-5.2
uri-template-1.3.0
urllib3-2.0.7
uuid7-0.1.0 
visions-0.7.5
wcwidth-0.2.12
webcolors-1.13
webencodings-0.5.1
websocket-client-1.7.0
widgetsnbextension-4.0.9
wordcloud-1.9.3

OS

linux

Checklist

There is not yet another bug report for this issue in the issue tracker
The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
The issue has not been resolved by the entries listed under Common Issues.


azory-ydata added the needs-triage label Dec 29, 2023
