Bug Report #1534

Open · liyaskerj opened this issue Feb 6, 2024 · 0 comments

Current Behaviour

Overview of this issue:

The source dataset has only one row and five columns. Only one column is of int datatype; the rest are strings. Some of the string columns do contain numbers, but because the values have commas embedded between the digits they are treated as strings. With a single int column holding a single cell value, the Databricks job fails with the reason "Dispatch error". The cause is as follows.

In the ydata_profiling library, the file describe_numeric_spark.py contains a method called describe_numeric_1d_spark, which computes the coefficient of variation ('cv') with the following logic:
    summary["cv"] = summary["std"] / summary["mean"] if summary["mean"] else np.NaN

I suspect the issue occurs because the value of mean comes back as 0. I confirmed this in two ways: first, by changing the int column to a string (adding a comma between the digits of the cell value), and second, by leaving the column datatype unchanged and simply adding more rows to the dataset. In both scenarios the issue no longer reproduces and profiling works as expected.

Conclusion:

We have to add a null check before performing the division that calculates 'cv'.
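
A minimal sketch of the kind of guard being proposed (illustrative only; the summary dict below is a stand-in for the one built inside describe_numeric_1d_spark, and the actual patch may look different):

    import numpy as np

    # Stand-in for the summary produced for a single-row numeric column,
    # where std can come back null and mean can be 0.
    summary = {"mean": 0.0, "std": None}

    # Divide only when both statistics are present and mean is non-zero;
    # otherwise fall back to NaN instead of raising.
    std = summary.get("std")
    mean = summary.get("mean")
    summary["cv"] = std / mean if std is not None and mean else np.nan

    print(summary["cv"])  # nan for this degenerate single-row case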

Expected Behaviour

As noted above, the describe_numeric_1d_spark method in describe_numeric_spark.py (in the ydata_profiling library) computes the 'cv' value with the line below. A null check should be added here before proceeding with the division:
    summary["cv"] = summary["std"] / summary["mean"] if summary["mean"] else np.NaN

Data Description

On Databricks, the code is installed as a wheel file. It reads the source dataset from a CSV file that has only one row and five columns. Only one column is of int datatype; the rest are strings. Some of the string columns contain numbers, but because the values have commas embedded between the digits they are treated as strings.
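
For illustration, a CSV with this shape might look like the following (column names and values are hypothetical, not the reporter's actual data). With inferSchema enabled, id becomes the only int column, while the quoted "1,250" is read as a string because of the embedded comma:

    id,name,amount,region,notes
    42,alpha,"1,250",south,single sample row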

Code that reproduces the bug

from ydata_profiling import ProfileReport

# spark is the Databricks SparkSession; file_name_withpath points at the one-row CSV
df = spark.read.option("header", "true").option("inferSchema", "true").csv(file_name_withpath)
profile = ProfileReport(df, title="Test Profile", minimal=True, missing_diagrams=None, samples=None, interactions=None)
json_file = profile.to_json()

pandas-profiling version

v4.5.1

Dependencies

databricks-cli = 0.16.2
pymongo = 4.5.0
pycryptodome = 3.19.0
pydantic = 1.10.6
azure-storage-blob = 12.19.0
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.21

OS

On the Databricks cluster our code is installed as a wheel file; locally I'm using macOS.

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.