Allow Neighbors to accept sparse data #6749
Comments
There is an alternative solution, which is a bit cumbersome: concatenate Reference and Data before Bag of Words (this requires that they have more or less the same variables), then separate them after Bag of Words with Select Rows, using some criterion that distinguishes Reference from Data. Finally, connect Matching Data to the Reference input of Neighbors and Non-matching Data to the Data input. As I said, rather cumbersome, but it works.
@wvdvegte, you could probably also use the Apply Domain widget. But I agree, this should have been done automatically. We discussed this, and internally we should have applied the domain of the data onto the reference when comparing.
Indeed, in my use case Apply Domain produces processable inputs for Neighbors, too.
What's your use case?
I want to use Neighbors to search a corpus of documents for items similar to one or more reference documents. Since Neighbors requires that Reference and Data have the same features, I have to apply Text Embedding, Similarity Hashing or Topic Modeling to represent the corpora quantitatively. But for most ML tasks with text, I find that Bag of Words usually produces more convincing results.
What's your proposed solution?
Allow Neighbors to accept datasets with different features, at least when it comes to sparse data from Bag of Words. So, before computing distances, the words that are in Reference but not in Data are added to Data with value 0, and the other way around.
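To illustrate the idea, here is a minimal sketch in plain Python. It does not use Orange's API and is not how Neighbors is implemented; the helper names (`bag_of_words`, `align_to_union`, `cosine_distance`) and the example texts are made up for this illustration. Each document is a word-count mapping, both sets are expanded to the union vocabulary with 0 for absent words, and distances are computed afterwards.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Hypothetical helper: count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


def align_to_union(reference, data):
    """Expand both sets of bags to the union vocabulary.

    Words present in only one of the two sets get the value 0 in the other.
    Returns the sorted vocabulary and dense rows for reference and data.
    """
    vocabulary = sorted(set().union(*reference, *data))

    def to_row(bag):
        return [bag.get(word, 0) for word in vocabulary]

    return vocabulary, [to_row(b) for b in reference], [to_row(b) for b in data]


def cosine_distance(u, v):
    """Cosine distance between two equal-length rows (1.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Made-up example: one reference document and two data documents.
reference = [bag_of_words("sparse data in neighbors")]
data = [
    bag_of_words("neighbors widget accepts sparse data"),
    bag_of_words("topic modeling of documents"),
]

vocabulary, ref_rows, data_rows = align_to_union(reference, data)
distances = [cosine_distance(ref_rows[0], row) for row in data_rows]
```

A real implementation would presumably keep the matrices sparse rather than building dense rows; the dense lists here only make the zero-filling visible.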
Are there any alternative solutions?
Not that I'm aware of.