Bug Report #1534

Open · liyaskerj opened this issue Feb 6, 2024 · 0 comments

Current Behaviour

Overview of this issue:

The source dataset has only one row and five columns. Only one column is of int datatype; the rest are strings. Some of the string columns do contain numbers, but because the values have commas embedded between the digits they are treated as strings. With a single int column holding a single cell value, the Databricks job fails with the reason "Dispatch error". The cause is as follows.

In the ydata_profiling library, the file describe_numeric_spark.py contains a method called describe_numeric_1d_spark, which computes the coefficient of variation ('cv') with the following logic:
    summary["cv"] = summary["std"] / summary["mean"] if summary["mean"] else np.NaN

I suspect the issue occurs because the value of mean comes back as 0. I confirmed this in two ways: first, by changing the int column to a string (adding a comma between the digits of the cell value), and second, by leaving the column datatype unchanged and simply adding more rows to the dataset. In both scenarios the issue no longer reproduces and profiling works as expected.

Conclusion:

We have to add a null check before performing the division that calculates 'cv'.
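
A minimal sketch of the kind of guard being proposed (illustrative only; the summary dict below is a stand-in for the one built inside describe_numeric_1d_spark, and the actual patch may look different):

    import numpy as np

    # Stand-in for the summary produced for a single-row numeric column,
    # where std can come back null and mean can be 0.
    summary = {"mean": 0.0, "std": None}

    # Divide only when both statistics are present and mean is non-zero;
    # otherwise fall back to NaN instead of raising.
    std = summary.get("std")
    mean = summary.get("mean")
    summary["cv"] = std / mean if std is not None and mean else np.nan

    print(summary["cv"])  # nan for this degenerate single-row case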

Expected Behaviour

As noted above, the describe_numeric_1d_spark method in describe_numeric_spark.py (in the ydata_profiling library) computes the 'cv' value with the line below. A null check should be added here before proceeding with the division:
    summary["cv"] = summary["std"] / summary["mean"] if summary["mean"] else np.NaN

Data Description

On Databricks, the code is installed as a wheel file. It reads the source dataset from a CSV file that has only one row and five columns. Only one column is of int datatype; the rest are strings. Some of the string columns contain numbers, but because the values have commas embedded between the digits they are treated as strings.
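
For illustration, a CSV with this shape might look like the following (column names and values are hypothetical, not the reporter's actual data). With inferSchema enabled, id becomes the only int column, while the quoted "1,250" is read as a string because of the embedded comma:

    id,name,amount,region,notes
    42,alpha,"1,250",south,single sample row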

Code that reproduces the bug

from ydata_profiling import ProfileReport

# spark is the Databricks SparkSession; file_name_withpath points at the one-row CSV
df = spark.read.option("header", "true").option("inferSchema", "true").csv(file_name_withpath)
profile = ProfileReport(df, title="Test Profile", minimal=True, missing_diagrams=None, samples=None, interactions=None)
json_file = profile.to_json()

pandas-profiling version

v4.5.1

Dependencies

databricks-cli = 0.16.2
pymongo = 4.5.0
pycryptodome = 3.19.0
pydantic = 1.10.6
azure-storage-blob = 12.19.0
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.21

OS

On the Databricks cluster our code is installed as a wheel file; locally I'm using macOS.

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.