Unexpected error of type DispatchError raised while running data exploratory profiler from function spark_get_series_descriptions #1523

tboz38 opened this issue Dec 29, 2023 · 1 comment

tboz38 commented Dec 29, 2023

Current Behaviour

        # Converts the data types of the columns in the DataFrame to more appropriate types,
        # useful for improving the performance of calculations.
        # Selects the columns in the DataFrame that are of type object or category,
        # which are the types that are typically considered to be categorical
        data_to_analyze = dataframe_to_analyze.toPandas()
ERROR:data_quality_job.scheduler.data_quality_glue_job:Run data exploratory analysis fails for datasource master_wip in data domain stock_wip: Unexpected error of type DispatchError was raised while data exploratory profiler: Function

Traceback (most recent call last):
  File "/home/spark/.local/lib/python3.10/site-packages/multimethod/__init__.py", line 328, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/spark/describe_date_spark.py", line 50, in describe_date_1d_spark
    bin_edges, hist = df.select(col_name).rdd.flatMap(lambda x: x).histogram(bins_arg)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1652, in histogram
    raise TypeError("buckets should be a list or tuple or number(int or long)")
TypeError: buckets should be a list or tuple or number(int or long)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/spark/.local/lib/python3.10/site-packages/multimethod/__init__.py", line 328, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 64, in spark_describe_1d
    return summarizer.summarize(config, series, dtype=vtype)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/summarizer.py", line 42, in summarize
    _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 62, in handle
    return op(*args)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 17, in func2
    res = g(*x)
  File "/home/spark/.local/lib/python3.10/site-packages/multimethod/__init__.py", line 330, in __call__
    raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/spark/.local/lib/python3.10/site-packages/multimethod/__init__.py", line 328, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 92, in spark_get_series_descriptions
    for i, (column, description) in enumerate(
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 870, in next
    raise value
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
    return column, describe_1d(config, df.select(column), summarizer, typeset)
  File "/home/spark/.local/lib/python3.10/site-packages/multimethod/__init__.py", line 330, in __call__
    raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/sls_data_quality_library-0.3.0-py3-none-any.whl/data_quality_job/scheduler/data_quality_glue_job.py", line 1074, in run_data_exploratory_analysis
    self.dq_file_system_metrics_repository_manager.persist_profile_json_report(
  File "/tmp/sls_data_quality_library-0.3.0-py3-none-any.whl/data_quality_job/services/data_quality_file_system_metrics_repository.py", line 974, in persist_profile_json_report
    generated_profile.to_file(output_file=f"{local_json_report}")
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 347, in to_file
    data = self.to_json()
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 479, in to_json
    return self.json
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 283, in json
    self._json = self._render_json()
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 449, in _render_json
    description = self.description_set
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 253, in description_set
    self._description_set = describe_df(
  File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/describe.py", line 74, in describe
    series_description = get_series_descriptions(
  File "/home/spark/.local/lib/python3.10/site-packages/multimethod/__init__.py", line 330, in __call__
    raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function

INFO:py4j.clientserver:Closing down clientserver connection
INFO:py4j.clientserver:Closing down clientserver connection
INFO:py4j.clientserver:Closing down clientserver connection
WARNING:data_quality_job.scheduler.data_quality_glue_job:Processing dataset fails to provide an exploratory data analysis report : Unexpected error of type DispatchError was raised while data exploratory profiler: Function
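For context, PySpark's RDD.histogram only accepts an int bucket count or an explicit list/tuple of bucket boundaries; anything else raises exactly the TypeError at the root of the chained DispatchError above, so whatever bins_arg evaluates to in describe_date_1d_spark is not one of those types here. A minimal, self-contained sketch of that behaviour (against a local SparkSession, independent of the Glue job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
rdd = spark.sparkContext.parallelize([1.0, 2.0, 3.0, 4.0])

print(rdd.histogram(2))                # OK: int bucket count -> (bucket edges, counts)
print(rdd.histogram([1.0, 2.5, 4.0]))  # OK: explicit bucket boundaries
rdd.histogram("auto")                  # raises TypeError: buckets should be a list or tuple or number(int or long)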

Expected Behaviour

When converting my Spark DataFrame to pandas, the report should be generated properly for the dataset.
The DataFrame should not be treated as a Spark DataFrame.
No error should be raised.
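For comparison, a minimal sketch (not part of the original job) of the expected path: profiling the pandas DataFrame directly, so the pandas backend is used rather than the Spark one. The column names and values below are made up to mirror the dtypes listed under Data Description.

import pandas as pd
from ydata_profiling import ProfileReport

# Hypothetical frame mirroring the dataset's dtypes (category, float32, datetime64[ns])
df = pd.DataFrame({
    "plant": pd.Series(["P01", "P02", "P01"], dtype="category"),
    "stock_value": pd.Series([10.5, 3.2, 7.8], dtype="float32"),
    "record_timestamp": pd.to_datetime(["2023-12-01", "2023-12-02", "2023-12-03"]),
})

# Passing the pandas frame directly (no spark.createDataFrame round-trip)
ProfileReport(df, title="expected behaviour sketch", minimal=True).to_file("report.html")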

Data Description

INFO:data_quality_job.services.data_quality_operations:Data profiler dataset data types to analyze:
storage_location                      category
stock_in_transit                       float32
unrestricted_use_stock                 float32
stock_at_vendor                        float32
stock_in_transfer                      float32
stock_in_quality_inspection            float32
valuation_class                        float32
block_stock_returns                    float32
material_part_number                    object
stock_in_transfer_plant_to_plant       float32
stock_value                            float32
material_type                         category
blocked_stock                          float32
account_description                   category
plant                                 category
all_restricted_stock                   float32
valuated_stock_quantities              float32
gl_account                             float32
record_timestamp                datetime64[ns]
non_valuated_stock_quantities          float32
dtype: object

Code that reproduces the bug

def determine_run_minimal_mode(self, nb_columns, nb_records):
        """
        Determine if the profiler should run in minimal mode.
        Args:
            nb_columns (int): The number of columns in the dataset.
            nb_records (int): The number of records in the dataset.
        Returns:
            bool: True if the profiler should run in minimal mode, False otherwise.
        """
        return (nb_columns >= EDA_PROFILING_MODE_NB_COLUMNS_LIMIT
                or nb_records >= EDA_PROFILING_MODE_NB_RECORDS_LIMIT)

    def create_profile_report(self,
                              dataset_to_analyze: pd.DataFrame,
                              report_name: str,
                              dataset_description_url: str) -> ProfileReport:
        """
        Creates a profile report for a given dataset.
        Args:
            dataset_to_analyze (pd.DataFrame): The dataset to analyze and generate a profile report for.
            report_name (str): The name of the report.
            dataset_description_url (str): The URL of the dataset description.
        Returns:
            ProfileReport: The generated profile report.
        """
        # Perform data quality operations and generate a profile report
        # ...
        # variables preferred characterization settings
        variables_settings = {
            "num": {"low_categorical_threshold": 5, "chi_squared_threshold": 0.999, "histogram_largest": 10},
            "cat": {"length": True, "characters": False, "words": False,
                    "cardinality_threshold": 20, "imbalance_threshold": 0.5,
                    "n_obs": 5, "chi_squared_threshold": 0.999},
            "bool": {"n_obs": 3, "imbalance_threshold": 0.5}
        }
        missing_diagrams_settings = {
            "heatmap": False,
            "matrix": True,
            "bar": False
        }
        # Plot rendering option, way how to pass arguments to the underlying matplotlib visualization engine
        plot_rendering_settings = {
            "histogram": {"x_axis_labels": True, "bins": 0, "max_bins": 10},
            "dpi": 200,
            "image_format": "png",
            "missing": {"cmap": "RdBu_r", "force_labels": True},
            "pie": {"max_unique": 10, "colors": ["gold", "b", "#FF796C"]},
            "correlation": {"cmap": "RdBu_r", "bad": "#000000"}
        }
        # Correlation matrices through description_set
        correlations_settings = {
            "auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
            "pearson": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "spearman": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "kendall": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
            "phi_k": {"calculate": False, "warn_high_correlations": True, "threshold": 0.9},
            "cramers": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
        }
        categorical_maximum_correlation_distinct = 20
        report_rendering_settings = {
            "precision": 10,
        }
        interactions_settings = {
            "continuous": False,
            "targets": []
        }
        # Customizing the report's theme
        html_report_styling = {
            "style": {
                "theme": "flatly",
                "full_width": True,
                "primary_colors": {"#66cc00", "#ff9933", "#ff0099"}
            }
        }
        current_datetime = datetime.now()
        current_date = current_datetime.date()
        current_year = current_date.strftime("%Y")
        # compute amount of data used for profiling
        samples_percent_size = round(
            100.0 * (min(len(dataset_to_analyze.columns), 20) * min(dataset_to_analyze.shape[0], 100000))
            / (len(dataset_to_analyze.columns) * dataset_to_analyze.shape[0]),
            2,
        )
        samples = {
            "head": 0,
            "tail": 0,
            "random": 0
        }
        dataset_description = {
            "description": f"This profiling report was generated using a sample of {samples_percent_size}% of the filtered original dataset.",
            "copyright_year": current_year,
            "url": dataset_description_url
        }
        # Identify time series variables if any
        # Enable tsmode to True to automatically identify time-series variables
        # and provide the column name that provides the chronological order of your time-series
        # time_series_type_schema = {}
        time_series_mode = False
        # time_series_sortby = None
        # for column_name in dataset_to_analyze.columns.tolist():
        #     if any(keyword in column_name.lower() for keyword in ["date", "timestamp"]):
        #         self.logger.info("candidate column_name as timeseries %s", column_name)
        #         time_series_type_schema[column_name] = "timeseries"
        # if len(time_series_type_schema) > 0:
        #     time_series_mode = True
        #     time_series_sortby = "Date Local"

        # is_run_minimal_mode = self.determine_run_minimal_mode(len(dataset_to_analyze.columns), dataset_to_analyze.shape[0])

        # Convert the Pandas DataFrame to a Spark DataFrame
        # Configure pandas-profiling to handle Spark DataFrames
        # while preserving the categorical encoding
        # Enable Arrow-based columnar data transfers
        self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
        # pandas 2.x removed DataFrame.iteritems, which older PySpark versions still call
        # inside createDataFrame, so alias it back to DataFrame.items
        pd.DataFrame.iteritems = pd.DataFrame.items
        # psdf = ps.from_pandas(dataset_to_analyze)
        # data_to_analyze = psdf.to_spark()
        data_to_analyze = self.spark.createDataFrame(dataset_to_analyze)
        ydata_profiling_instance_config = Settings()
        ydata_profiling_instance_config.infer_dtypes = True
        # ydata_profiling_instance_config.Config.set_option("profilers", {"Spark": {"verbose": True}})

        return ProfileReport(
            # dataset_to_analyze,
            data_to_analyze,
            title=report_name,
            dataset=dataset_description,
            sort=None,
            progress_bar=False,
            vars=variables_settings,
            explorative=True,
            plot=plot_rendering_settings,
            report=report_rendering_settings,
            correlations=correlations_settings,
            categorical_maximum_correlation_distinct=categorical_maximum_correlation_distinct,
            missing_diagrams=missing_diagrams_settings,
            samples=samples,
            # correlations=None,
            interactions=interactions_settings,
            html=html_report_styling,
            # minimal=is_run_minimal_mode,
            minimal=True,
            tsmode=time_series_mode,
            # tsmode=False,
            # sortby=time_series_sortby,
            # type_schema=time_series_type_schema
        )

    def is_categorical_column(self, df, column_name, n_unique_threshold=20, ratio_unique_values=0.05, exclude_patterns=[]):
        """
        Determines whether a column in a pandas DataFrame is categorical.
        Args:
            df (pandas.DataFrame): The DataFrame to check.
            column_name (str): The name of the column to check.
            n_unique_threshold (int): The threshold for the number of unique values.
            ratio_unique_values (float): The threshold for the ratio of unique values to total values.
            exclude_patterns (list): A list of patterns to exclude from consideration.
        Returns:
            bool: True if the column is categorical, False otherwise.
        """
        if df[column_name].dtype in [object, str]:

            # Check if the column name matches any of the exclusion patterns
            if any(pattern in column_name for pattern in exclude_patterns):
                return False

            # Check if the number of unique values is less than a threshold
            if df[column_name].nunique() < n_unique_threshold:
                return True

            # Check if the ratio of unique values to total values is less than a threshold
            if 1. * df[column_name].nunique() / df[column_name].count() < ratio_unique_values:
                print(df[column_name], "ratio is", 1. * df[column_name].nunique() / df[column_name].count())
                return True

            # None of the string-dtype heuristics matched
            return False

        # Columns that are not of object/str dtype are not considered categorical
        return False

    def get_categorical_columns(self, df, n_unique_threshold=10, ratio_threshold=0.05, exclude_patterns=[]):
        """
        Determines which columns in a pandas DataFrame are categorical.
        Args:
            df (pandas.DataFrame): The DataFrame to check.
            n_unique_threshold (int): The threshold for the number of unique values.
            ratio_threshold (float): The threshold for the ratio of unique values to total values.
            exclude_patterns (list): A list of patterns to exclude from consideration.
        Returns:
            list: A list of the names of the categorical columns.
        """
        categorical_cols = []
        for column_name in df.columns:
            if self.is_categorical_column(df, column_name, n_unique_threshold, ratio_threshold, exclude_patterns):
                categorical_cols.append(column_name)

        return categorical_cols

    def perform_exploratory_data_analysis(self, report_name: str,
                                          dataframe_to_analyze: SparkDataFrame,
                                          columns_list: list,
                                          description_url: str, json_file_path: str) -> None:
        """
        Performs exploratory data analysis on a given DataFrame.
        Args:
            dataframe_to_analyze (DataFrame): The DataFrame to perform exploratory data analysis on.
            columns_list (list): A list of dictionaries containing column information.
        """
        try:

            # Cast the columns in the data DataFrame to match the Glue table column types
            self.logger.info("Performs exploratory data analysis on a given DataFrame with columns list: %s",
                             columns_list)

            for analyze_column in columns_list:
                dataframe_to_analyze = dataframe_to_analyze.withColumn(
                    analyze_column["Name"],
                    dataframe_to_analyze[analyze_column["Name"]].cast(analyze_column["Type"]),
                )

            # Verify the updated column types (printSchema() writes to stdout and returns None)
            dataframe_to_analyze.printSchema()
            self.logger.info("Dataframe column types cast from the data catalog: %s",
                             dataframe_to_analyze.schema.simpleString())
            # converts the data types of the columns in the DataFrame to more appropriate types,
            # useful for improving the performance of calculations.
            # Selects the columns in the DataFrame that are of type object or category,
            # which are the types that are typically considered to be categorical
            data_to_analyze = dataframe_to_analyze.toPandas()
            data_to_analyze = data_to_analyze.infer_objects()
            self.logger.debug("Converted dtypes preview: %s", data_to_analyze.convert_dtypes().dtypes)

            categorical_cols = self.get_categorical_columns(data_to_analyze, n_unique_threshold=10, ratio_threshold=0.05, exclude_patterns=['date', 'timestamp', 'time', 'year', 'month', 'day', 'hour', 'minute', 'second', 'part_number'])
            # categorical_cols = data_to_analyze.select_dtypes(include=["object", "category"]).columns.tolist()

            self.logger.info("Data profiler dataset detected potential categorical columns %s and its type %s",
                             categorical_cols, data_to_analyze.dtypes)
            for column_name in data_to_analyze.columns.tolist():
                if column_name in categorical_cols:
                    data_to_analyze[column_name] = data_to_analyze[column_name].astype("category")
                else:
                    # search for undetected categorical columns
                    if any(term in str.lower(column_name) for term in ["plant", "program"]):
                        self.logger.info("Undetected potential categorical column %s", column_name)

            # for column_name in data_to_analyze.columns.tolist():
            #     # search for non categorical columns
            #     # if any(term in str.lower(column_name) for term in ["partnumber", "part_number", "_item", "_number", "plant", "program"]):
            #     if any(term in str.lower(column_name) for term in ["plant", "program"]):
            #         if column_name in categorical_cols:
            #             self.logger.info("Data profiler dataset proposed categorical column %s", column_name)
            #             data_to_analyze[column_name] = data_to_analyze[column_name].astype("category")

                # if any(term in str.lower(column_name) for term in ["partnumber", "part_number", "_item", "_number", "_timestamp", "_date"]):
                #     self.logger.info("Data profiler dataset detected non categorical column %s", column_name)
                #     data_to_analyze[column_name] = data_to_analyze[column_name].astype("str")
                if any(term in str.lower(column_name) for term in ["timestamp"]):
                    self.logger.info("Data profiler dataset detected datetime column %s", column_name)
                    try:
                        if pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d', errors='coerce').notnull().all():
                            data_to_analyze[column_name] = data_to_analyze[column_name].apply(pd.to_datetime)
                            # data_to_analyze[column_name] = data_to_analyze[column_name].astype(np.datetime64)
                        elif pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d %H:%M:%S', errors='coerce').notnull().all():
                            data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d %H:%M:%S')
                        elif data_to_analyze[column_name].dtypes in ['numpy.int64', 'int64']:
                            data_to_analyze[column_name] = data_to_analyze[column_name].apply(lambda x: datetime.fromtimestamp(int(x) / 1000))
                        elif data_to_analyze[column_name].dtypes == 'datetime64[ms]':
                            data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%dT%H:%M:%SZ')
                            data_to_analyze[column_name] = data_to_analyze[column_name].values.astype(dtype='datetime64[ns]')
                        else:
                            data_to_analyze[column_name] = data_to_analyze[column_name].astype('str')

                    #     if not isinstance(data_to_analyze[column_name].dtype, np.datetime64):
                    #         data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d %H:%M:%S')

                    #     # if not np.issubdtype(data_to_analyze[column_name].dtype, np.datetime64):
                    #     #     data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d %H:%M:%S', errors="coerce")
                    #     # elif is_datetime64_any_dtype(data_to_analyze[column_name]):
                    #     #     data_to_analyze[column_name] = data_to_analyze[column_name].astype(np.datetime64)
                    #         data_to_analyze[column_name] = data_to_analyze[column_name].values.astype(dtype='datetime64[ns]')

                    #     # elif data_to_analyze[column_name].dtype == 'datetime64[ns]':
                    #     #     data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%dT%H:%M:%SZ')
                    #     #     data_to_analyze[column_name] = data_to_analyze[column_name].values.astype(dtype='datetime64[ns]')
                    #     # else:
                    #     #     data_to_analyze[column_name] = data_to_analyze[column_name].astype('datetime64')
                    # except ValueError:
                    #     try:
                    #         data_to_analyze[column_name] = data_to_analyze[column_name].astype(np.date_time)
                    #     except ValueError:
                    #         try:
                    #             if (data_to_analyze[column_name].dtypes in ["numpy.int64", "int64"]):
                    #                 data_to_analyze[column_name] = data_to_analyze[column_name].apply(
                    #                             lambda x: datetime.fromtimestamp(int(x) / 1000))
                    except ValueError:
                        data_to_analyze[column_name] = data_to_analyze[column_name].astype('str')

                elif any(term in str.lower(column_name) for term in ["date"]):
                    self.logger.info("Data profiler dataset detected date column %s", column_name)
                    try:
                        if pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d', errors='coerce').notnull().all():
                            data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d').dt.date
                        elif pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d %H:%M:%S', errors='coerce').notnull().all():
                            data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%d %H:%M:%S')
                        elif data_to_analyze[column_name].dtypes in ['numpy.int64', 'int64']:
                            data_to_analyze[column_name] = data_to_analyze[column_name].apply(lambda x: datetime.fromtimestamp(int(x) / 1000))
                        elif data_to_analyze[column_name].dtypes == 'datetime64[ms]':
                            data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name], format='%Y-%m-%dT%H:%M:%SZ')
                            data_to_analyze[column_name] = data_to_analyze[column_name].values.astype(dtype='datetime64[ns]')
                        else:
                            data_to_analyze[column_name] = data_to_analyze[column_name].astype('str')

                    #     data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name]).dt.date
                    # except ValueError:
                    #     try:
                    #         data_to_analyze[column_name] = pd.to_datetime(data_to_analyze[column_name],
                    #                                                     format="%Y-%m-%d", errors="coerce")
                    #     except ValueError:
                    #         try:
                    #             if (data_to_analyze[column_name].dtypes in ["numpy.int64", "int64"]):
                    #                 data_to_analyze[column_name] = data_to_analyze[column_name].apply(
                    #                             lambda x: datetime.fromtimestamp(int(x) / 1000))
                    except ValueError:
                        pass
            self.logger.info("Data profiler changed dtypes %s", data_to_analyze.dtypes)

            # Downcast data types: If the precision of your data doesn't require float64,
            # consider downcasting to a lower precision data type like float32 or even int64.
            # This can significantly reduce memory usage and improve computational efficiency.
            try:
                float64_cols = list(data_to_analyze.select_dtypes(include="float64"))
                self.logger.info("Data profiler dataset detected float64 columns %s", float64_cols)

                data_to_analyze[float64_cols] = data_to_analyze[float64_cols].astype("float32")
                # data_to_analyze[
                #     data_to_analyze.select_dtypes(np.float64).columns
                # ] = data_to_analyze.select_dtypes(np.float64).astype(np.float32)
            except ValueError:
                pass
            data_to_analyze.reset_index(drop=True, inplace=True)
            self.logger.info("Data profiler dataset data types to analyze: %s", data_to_analyze.dtypes)

            # If dealing with large datasets, consider using sampling techniques
            # to reduce the amount of data processed is useful for exploratory
            # data analysis or initial profiling.
            # Sample 10.000 rows
            # if data_to_analyze.count() >= EDA_PROFILING_MODE_NB_RECORDS_LIMIT:
            #     data_to_analyze = data_to_analyze.sample(EDA_PROFILING_MODE_NB_RECORDS_LIMIT)

            # Generates a profile report, providing for time-series data,
            # an overview of the behaviour of time dependent variables
            # regarding behaviours such as time plots, seasonality, trends,
            # stationary and data gaps, and identifying gaps in the time series,
            # caused either by missing values or by entries missing in the time index

            profile = self.create_profile_report(dataset_to_analyze=data_to_analyze,
                                                 report_name=report_name,
                                                 dataset_description_url=description_url)
            return profile
        except Exception as exc:
            error_message = f"Unexpected error of type {type(exc).__name__} was raised while data exploratory profiler: {str(exc)}"
            self.logger.exception(
                "Run data exploratory analysis fails to generate report %s: %s",
                report_name, error_message,
            )
            raise RuntimeError(error_message) from exc

pandas-profiling version

v.4.6.3

Dependencies

Ipython-8.19.0
MarkupSafe-2.1.3
PyAthena-3.0.10
PyWavelets-1.5.0
SQLAlchemy-1.4.50
altair-4.2.2
annotated-types-0.6.0
anyio-4.2.0
argon2-cffi-23.1.0
argon2-cffi-bindings-21.2.0
arrow-1.3.0
asn1crypto-1.5.1
asttokens-2.4.1
async-lru-2.0.4
asyncio-3.4.3
awswrangler-3.4.2
babel-2.14.0
beautifulsoup4-4.12.2
bleach-6.1.0
boto-session-manager-1.7.1
boto3-1.34.9
boto3-helpers-1.4.0
botocore-1.34.9
cffi-1.16.0
colorama-0.4.6
comm-0.2.0
cryptography-41.0.7
dacite-1.8.1
debugpy-1.8.0
decorator-5.1.1
defusedxml-0.7.1 
delta-spark-2.3.0
deltalake-0.14.0
editorconfig-0.12.3
entrypoints-0.4
exceptiongroup-1.2.0
executing-2.0.1
fastjsonschema-2.19.1
flatten_dict-0.4.2
fqdn-1.5.1
fsspec-2023.12.2
func-args-0.1.1
great-expectations-0.18.7
greenlet-3.0.3
htmlmin-0.1.12
imagehash-4.3.1
ipykernel-6.28.0 
ipywidgets-8.1.1
isoduration-20.11.0 
iterproxy-0.3.1 
jedi-0.19.1 
jinja2-3.1.2 
jsbeautifier-1.14.11 
json2html-1.3.0 
json5-0.9.14
jsonpatch-1.33
jsonpath-ng-aerospike-1.5.3
jsonpointer-2.4
jsonschema-4.20.0
jsonschema-specifications-2023.12.1
jupyter-client-8.6.0 
jupyter-core-5.6.0 
jupyter-events-0.9.0
jupyter-lsp-2.2.1
jupyter-server-2.12.1 
jupyter-server-terminals-0.5.1
jupyterlab-4.0.9
jupyterlab-pygments-0.3.0
jupyterlab-server-2.25.2
jupyterlab-widgets-3.0.9
llvmlite-0.41.1 
lxml-4.9.4
makefun-1.15.2
markdown-it-py-3.0.0
marshmallow-3.20.1
matplotlib-inline-0.1.6
mdurl-0.1.2 
mistune-3.0.2 
mmhash3-3.0.1 
multimethod-1.10 
nbclient-0.9.0 
nbconvert-7.13.1 
nbformat-5.9.2 
nest-asyncio-1.5.8 
networkx-3.2.1 
notebook-7.0.6 
notebook-shim-0.2.3 
numba-0.58.1 
overrides-7.4.0 
pandas-2.0.3 
pandocfilters-1.5.0 
parso-0.8.3 
pathlib-mate-1.3.1 
pathlib2-2.3.7.post1 
patsy-0.5.5 
pexpect-4.9.0 
phik-0.12.3 
platformdirs-4.1.0
ply-3.11
prometheus-client-0.19.0
prompt-toolkit-3.0.43
psutil-5.9.7
ptyprocess-0.7.0
pure-eval-0.2.2
py4j-0.10.9.5
pyarrow-12.0.1
pycparser-2.21
pydantic-2.5.3
pydantic-core-2.14.6
pydeequ-1.2.0
pygments-2.17.2
pyiceberg-0.5.1 
pyparsing-3.1.1 
pyspark-3.3.4 
python-json-logger-2.0.7 
pytz-2023.3.post1 
pyzmq-25.1.2 
redshift_connector-2.0.918 
referencing-0.32.0 
requests-2.31.0 
rfc3339-validator-0.1.4 
rfc3986-validator-0.1.1
rich-13.7.0
rpds-py-0.16.2 
ruamel.yaml-0.17.17 
s3path-0.4.2 
s3pathlib-2.0.1 
s3transfer-0.10.0 
scramp-1.4.4 
send2trash-1.8.2 
smart-open-6.4.0 
sniffio-1.3.0 
sortedcontainers-2.4.0 
soupsieve-2.5 
sqlalchemy-redshift-0.8.14 
sqlalchemy_utils-0.41.1 
stack-data-0.6.3 
strictyaml-1.7.3 
tabulate-0.9.0 
tangled-up-in-unicode-0.2.0
terminado-0.18.0 
tinycss2-1.2.1
tomli-2.0.1 
toolz-0.12.0 
tornado-6.4 
traitlets-5.14.0
typeguard-4.1.5
types-python-dateutil-2.8.19.14 
typing-extensions-4.9.0
tzlocal-5.2
uri-template-1.3.0
urllib3-2.0.7
uuid7-0.1.0 
visions-0.7.5
wcwidth-0.2.12
webcolors-1.13
webencodings-0.5.1
websocket-client-1.7.0
widgetsnbextension-4.0.9
wordcloud-1.9.3

OS

linux

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.

tboz38 commented Dec 29, 2023

File "/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/summary_algorithms.py", line 42, in histogram_compute
    weights = weights if weights and len(weights) == hist_config.max_bins else None
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

WARNING:data_quality_job.scheduler.data_quality_glue_job:Processing dataset fails to provide an exploratory data analysis report : Unexpected error of type ValueError was raised while data exploratory profiler: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
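That line fails because weights is a NumPy array, and evaluating an array in a boolean context (weights and ...) is ambiguous. A minimal sketch of the same pattern, independent of ydata-profiling:

import numpy as np

weights = np.array([0.2, 0.8])
max_bins = 10

try:
    # Same shape as the failing line in histogram_compute
    result = weights if weights and len(weights) == max_bins else None
except ValueError as exc:
    print(exc)  # The truth value of an array with more than one element is ambiguous. ...

# An unambiguous variant tests the array's size explicitly instead
result = weights if weights.size and len(weights) == max_bins else None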

Note:
I had set the following plot rendering settings:

plot_rendering_settings = {
    "histogram": {"x_axis_labels": True, "bins": 0, "max_bins": 10},
    ...
}
