Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataFrameColumn Apply and DropNulls methods #7123

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

asmirnov82
Copy link
Contributor

@asmirnov82 asmirnov82 commented Apr 9, 2024

Fixes #7107 as was asked in #6144 (comment)

Additionaly:

  1. Fix incorrect IsNumeric method
  2. Fix error FillNulls crashes with NotImplemented exception on DataFrame with VBufferDataFrameColumn
  3. Improve speed and redesign API (legacy API marked as obsolete), internal methods are rewritten to use new API
  4. Add extra unit tests

Reasons for refactoring:

  1. Legacy implementation of ApplyElementwise is written not only to apply function on each element in the column, but also to be used for fast columns iteration, that's why it has rowIndex as one of it's arguments. This duplication of responsibilities is very confusing and absolutely not straightforward. More over, it is slow, as instead of only reading values, ApplyElementwise writes function results into into memory, also it converts read only columns to editable (and that again involves memory copy). So in some circumstances memory can be excessively copied even twice.
  2. Legacy method requires to provide function, that takes into account null values - this is not userfriendly. For working with Nulls there are already FillNulls and DropNulls methods. New method applies user function only to valuable values, this also leads to better performance

@asmirnov82
Copy link
Contributor Author

cc @JakeRadMSFT

Copy link

codecov bot commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 74.92355% with 82 lines in your changes missing coverage. Please review.

Project coverage is 68.50%. Comparing base (4bc753a) to head (e2f4d4e).
Report is 1 commits behind head on main.

Current head e2f4d4e differs from pull request most recent head 85bb4b7

Please upload reports for the commit 85bb4b7 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7123      +/-   ##
==========================================
- Coverage   68.66%   68.50%   -0.16%     
==========================================
  Files        1262     1263       +1     
  Lines      257774   255262    -2512     
  Branches    26660    26398     -262     
==========================================
- Hits       176991   174862    -2129     
+ Misses      73971    73696     -275     
+ Partials     6812     6704     -108     
Flag Coverage Δ
Debug 68.50% <74.92%> (-0.16%) ⬇️
production 62.88% <61.86%> (-0.07%) ⬇️
test 88.62% <100.00%> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/Microsoft.Data.Analysis/DataFrame.cs 85.81% <100.00%> (+0.39%) ⬆️
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 99.89% <100.00%> (-0.01%) ⬇️
...ta.Analysis.Tests/PrimitiveDataFrameColumnTests.cs 99.74% <100.00%> (+0.03%) ⬆️
....Data.Analysis.Tests/StringDataFrameColumnTests.cs 100.00% <100.00%> (ø)
...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs 86.52% <96.96%> (+0.43%) ⬆️
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 73.05% <97.29%> (+0.61%) ⬆️
src/Microsoft.Data.Analysis/DataFrameColumn.cs 62.96% <0.00%> (-0.22%) ⬇️
...nalysis/DataFrameColumns/VBufferDataFrameColumn.cs 44.74% <30.30%> (+0.14%) ⬆️
...sis/DataFrameColumns/ArrowStringDataFrameColumn.cs 59.16% <18.75%> (-4.01%) ⬇️
...Analysis/DataFrameColumns/StringDataFrameColumn.cs 69.23% <45.28%> (-1.95%) ⬇️

... and 53 files with indirect coverage changes

@asmirnov82 asmirnov82 changed the title Implement DataFrameColumn Apply method and DropNulls methods Implement DataFrameColumn Apply and DropNulls methods Apr 9, 2024
@@ -79,6 +79,27 @@ public void Append(string value)
Length++;
}

/// <summary>
/// Applies a function to all values in the column, that are not null.
/// </summary>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not null instead of using the isvalid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsValid can be used here as well, however for StringDataFrame column it doesn't make a lot of sense (actualy IsValid works a little bit slower in this case). Using IsValid instead of checking for null dramaticaly improves performance in case of PrimitivesDataFrame columns without Nulls (most use cases), because such columns uses validity buffers and checking for Null is very expensive and can be skiped if NullCount is 0

# Conflicts:
#	src/Microsoft.Data.Analysis/DataFrameColumns/VBufferDataFrameColumn.cs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make the Apply method available to StringDataFrame column
2 participants