Support CPUNearestNeighbor for benchmarking exact nearest neighbors. #655

Merged: 5 commits into NVIDIA:branch-24.06 on May 25, 2024

Conversation

lijinf2 (Collaborator) commented May 13, 2024:

CPU LSH will be moved to bench_approx_nearest_neighbors.py for benchmarking against GPU IVFFlat.
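
For context, a minimal sketch of the kind of exact (brute-force) CPU nearest-neighbors baseline this benchmark targets, using scikit-learn's NearestNeighbors; this is only an illustration with made-up sizes, not the PR's CPUNearestNeighbor implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# brute-force exact k-nearest-neighbors on CPU, usable as ground truth for recall
X_items = np.random.rand(10000, 64).astype(np.float32)  # item vectors (illustrative sizes)
Q = np.random.rand(100, 64).astype(np.float32)          # query vectors

nn = NearestNeighbors(n_neighbors=10, algorithm="brute", metric="euclidean")
nn.fit(X_items)
distances, indices = nn.kneighbors(Q)  # exact neighbors and distances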

lijinf2 (Collaborator, Author) commented May 13, 2024:

build

1 similar comment
lijinf2 (Collaborator, Author) commented May 14, 2024:

build

from pyspark.sql.functions import udf

# wrap the per-row feature generator as a Spark UDF returning array<float>
spark_func = udf(py_func, "array<float>")
df = spark.range(len(X)).select("id", spark_func("id").alias("features"))

Collaborator (reviewer):
Any advantage to doing it this way vs. createDataFrame from a pandas df?

lijinf2 (Collaborator, Author):
It does not seem to throw a "task size larger than 1000k" warning on large datasets, but there is no noticeable difference on small datasets.
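
For comparison, the createDataFrame-from-pandas alternative mentioned above would look roughly like this (a sketch only; X and spark are the names from the snippet above, and the benchmark's actual code may differ):

import pandas as pd

# alternative path discussed above: build the dataframe on the driver from a pandas frame;
# shipping the local data with the tasks is what can trigger the large-task-size warning
pdf = pd.DataFrame({"id": range(len(X)), "features": [row.tolist() for row in X]})
df = spark.createDataFrame(pdf)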

def cache_df(dfA: DataFrame, dfB: DataFrame) -> Tuple[DataFrame, DataFrame]:
    # cache both dataframes and trigger materialization via count()
    dfA = dfA.cache()
    dfB = dfB.cache()
    dfA.count()
    dfB.count()  # remaining lines assumed from the function signature; the diff excerpt ends above
    return dfA, dfB

Collaborator (reviewer):
Did you verify that count actually caches the dataframe? I think sometimes it can be short-circuited via metadata (e.g. parquet files).

lijinf2 (Collaborator, Author):
Revised.

lijinf2 (Collaborator, Author) commented May 24, 2024:

build

lijinf2 (Collaborator, Author) commented May 25, 2024:

build


def func_dummy(pdf_iter):  # signature assumed; only the yield and the call below appear in the diff
    for _ in pdf_iter:  # consume every batch so the cached dataframe is fully materialized
        pass
    yield pd.DataFrame({"dummy": [1]})

dfA.mapInPandas(func_dummy, schema="dummy int").count()

Collaborator (reviewer):
Better to avoid python udfs for this kind of thing, but probably ok.
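
For reference, one non-UDF way to force full materialization is Spark's built-in "noop" write format (available since Spark 3.0); this is only a sketch of such an alternative, not what the PR uses:

# force a full scan of the cached dataframes with no Python UDF involved;
# the "noop" data source reads every partition and discards the output
dfA.write.format("noop").mode("overwrite").save()
dfB.write.format("noop").mode("overwrite").save()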

eordentlich (Collaborator) left a review comment:
👍

lijinf2 merged commit d608e96 into NVIDIA:branch-24.06 on May 25, 2024
2 checks passed