-
-
Notifications
You must be signed in to change notification settings - Fork 303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Features/distances #694
base: main
Are you sure you want to change the base?
Features/distances #694
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @VascoSch92
Thank you so much for the contribution! This is looking really good.
I think the logic for the distance calculation is good. We need to make this transformer look like the other transformers that we have in the library.
Why don't you have a look at the class RelativeFeatures in the creation module, and try to incorporate that logic in this class as well? Mostly related to which parameters we need in the init, and which checks we normally do. And there you will see as well how you can import many premade bits of text for the documentation.
Let me know how you get along! Thank you!
self.variables = None | ||
|
||
@staticmethod | ||
def _check_column_name(column_name: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We allow integers as column names. I think I introduced this because it helps accommodate some of the compatibility tests with sklearn. In short, I don't think we need this check. We "outsource" this to pandas.
Could we remove?
drop_original: bool = False, | ||
) -> None: | ||
|
||
self.a_latitude = self._check_column_name(a_latitude) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably just assign whatever the user passes. We need to check if it is a string or a list only, and we have that function ready to import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok... but I don't understand. Should we allow a column name to be an integer?
Because you wrote:
We allow integers as column names. I think I introduced this because it helps accommodate some of the compatibility tests with sklearn. In short, I don't think we need this check. We "outsource" this to pandas.
Therefore, I should checks that what we get is a list of str or integer, not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some example checks here:
feature_engine/feature_engine/creation/relative_features.py
Lines 140 to 157 in 77ea405
if ( | |
not isinstance(variables, list) | |
or not all(isinstance(var, (int, str)) for var in variables) | |
or len(set(variables)) != len(variables) | |
): | |
raise ValueError( | |
"variables must be a list of strings or integers. " | |
f"Got {variables} instead." | |
) | |
if ( | |
not isinstance(reference, list) | |
or not all(isinstance(var, (int, str)) for var in reference) | |
or len(set(reference)) != len(reference) | |
): | |
raise ValueError( | |
"reference must be a list of strings or integers. " | |
f"Got {reference} instead." |
_check_param_drop_original(drop_original=drop_original) | ||
self.drop_original = drop_original | ||
|
||
self.variables = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this parameter. I'd suggest using RelativeFeatures as template to model this class: https://github.com/VascoSch92/feature_engine/blob/e1e927625678ee73c5c3a9edcf79e955ff9c5e8e/feature_engine/creation/relative_features.py
Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea. P.S. it could be quite interesting to add more measures of distance, for example Ruler distance |
Yes it is possible to compute the Haversine distance with sklearn. I was also thinking to use an apply and the Haversine distance method of Sklearn. The question is: is it faster than vectorisation? But I'm happy to change if it faster or If there is a faster method than mine ;-) |
I'm not sure I understand the question. |
I think the issue with the sklearn implementation is that it does a cartesian product between X and Y and yields a matrix. We only need a pairwise calculation between X and Y that yields a vector. |
|
I know that it is a simple way to code this, but from a time complexity perspective, it's not a great idea to use quadratic complexity when only linear complexity is needed. |
Yea, you are right, this way it will be better |
Hey @glevv thanks for the suggestion. If I understood this blog correctly, it has 3 computations: euclidean, harvesine (the one we are trying to implement here) and a more complicated one that has a smaller error (vincenty's formula). Is this correct? I'd suggest we stick to harvesine in this PR, and see if we create an issue to expand the class later with the Vincenty's. Is this formula commonly used? do we really need an error as small as 0.5mm for geo coordinates? |
They are all measures of distance between two points on ellipsoid. There were no Vincenty formula, but it's quite heavy to compute. In this particular blog post they talked about two simpler and faster formulas (Cheap Ruler and FCC equation) but with higher error.
Ye, let's go with haversine only, not sure about Vincenty tho |
Hey @solegalli |
No Problem at all @VascoSch92 . Same here. I am doing some big changes to the correlation transformers, I think we could release a new version when i got those finished, hopefully during February. It would be great if we can squeeze this transformer in that release 2. If you find the time, we look forward to your contribution :) |
Hey @solegalli :-) |
Sure! Contributions are welcome any time :) |
ok perfect. I will work on it. |
Hey @solegalli I still need some guidance for some point:
|
Just a first sketch.
Let me know what do you think :-)