Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Character based sorting #174

Open
nineteendo opened this issue Apr 6, 2024 · 14 comments
Open

Character based sorting #174

nineteendo opened this issue Apr 6, 2024 · 14 comments

Comments

@nineteendo
Copy link

nineteendo commented Apr 6, 2024

Describe the feature or enhancement

Currently natsort doesn't sort strings with optional numbers intuitively because we only split around numbers.
It would be better to use a character based approach (like macOS):

import re
from re import Pattern

def try_int(string: str) -> int | str:
    return int(string) if string.isdigit() else string

class BaseNatChar:
    def __init__(self, value: int | str) -> None:
        self.value: int | str = value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, BaseNatChar) and self.value == other.value

    def __lt__(self, other: 'BaseNatChar') -> bool:
        if isinstance(self.value, str):
            return isinstance(other.value, str) and self.value < other.value

        return isinstance(other.value, str) or self.value < other.value


class NatChar(BaseNatChar):
    def __lt__(self, other: BaseNatChar) -> bool:
        if not isinstance(self.value, str):
            if not isinstance(other.value, str):
                return self.value < other.value

            return other.value.isalpha()

        if not isinstance(other.value, str):
            return not self.value.isalpha()

        if self.value.isalpha() == other.value.isalpha():
            return self.value < other.value

        return other.value.isalpha()

def natsorted(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'\D|\d+')

    def key_func(string) -> tuple[NatChar, ...]:
        return tuple(map(NatChar, map(try_int, regex.findall(string))))

    return sorted(lst, key=key_func)

Provide a concrete example of how the feature or enhancement will improve natsort

>>> import natsort
>>> natsort.natsorted(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2'])
['a0-item-1', 'a1-item-2', 'a-category-1', 'b0-item-1', 'b1-item-2', 'b-category-2']

New behaviour: ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']

Would you be willing to submit a Pull Request for this feature?

Yes, but I'm not sure how to integrate this in the project.

@SethMMorton
Copy link
Owner

You can already enable this functionality by creating a simple key out of regular expressions.

In [1]: import re

In [2]: data = ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']

In [3]: import natsort

In [4]: natsort.natsorted(data)
Out[4]: 
['a0-item-1',
 'a1-item-2',
 'a-category-1',
 'b0-item-1',
 'b1-item-2',
 'b-category-2']

In [5]: character_splitter = re.compile(r"-")

In [6]: natsort.natsorted(data, key=character_splitter.split)
Out[6]: 
['a-category-1',
 'a0-item-1',
 'a1-item-2',
 'b-category-2',
 'b0-item-1',
 'b1-item-2']

You can customize this regex to be as flexible as you need: r"[-_/]" would split on any of the threee characters given, and adding parenthesis (e.g. r"([-_/])") would make it take the specific character into account when sorting.

@nineteendo
Copy link
Author

nineteendo commented Apr 9, 2024

I think a character approach would be more flexible, as you can still replicate the old existing behaviour.

@SethMMorton
Copy link
Owner

I'm not sure I follow - can you please demonstrate how it is more flexible? Also, your suggestion would require a complete re-write of natsort, so I think I will need compelling evidence that it is more performant as well before considering that level of intrusiveness.

@SethMMorton
Copy link
Owner

Or, are the class-based code snippets just examples of behavior and not actual examples of how to implement? If so, it would not require a re-write, but would likely be backwards incompatible if not made a mode that can be toggled.

I would still like to hear an explanation or demonstration of what more flexible means. In owning this library for 10+ years, I have found that one person's "obvious" way to sort a collection of strings is another person's "incorrect" way to sort a collection of strings. I have also found that there is no universal heuristic (hence the huge number of specifiers for the ns enum as well as the key argument for further customization). It is likely that this modification to the baseline algorithm would be useful in some cases, but it is also possible that not everyone would agree that it is the correct solution, so I would be significantly more likely to consider this enhancement as an alg option rather than the default behavior.

@nineteendo
Copy link
Author

nineteendo commented Apr 11, 2024

I'm not sure I follow - can you please demonstrate how it is more flexible?

Sure. We can still sort the same as before, but now we can also control the sorting on a character level (see the updated comment above for the helper functions):

def natsorted1(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'(\d+)')

    def key_func(string) -> tuple[int | str, ...]:
        return tuple(map(try_int, regex.split(string)))

    return sorted(lst, key=key_func)


def natsorted2(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'\D|\d+')

    def key_func(string) -> tuple[BaseNatChar, ...]:
        return tuple(map(BaseNatChar, map(try_int, regex.findall(string))))

    return sorted(lst, key=key_func)


def natsorted3(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'\D|\d+')

    def key_func(string) -> tuple[NatChar, ...]:
        return tuple(map(NatChar, map(try_int, regex.findall(string))))

    return sorted(lst, key=key_func)

print(natsorted1(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))
print(natsorted2(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))
print(natsorted3(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))

Output:

['a0-item-1', 'a1-item-2', 'a-category-1', 'b0-item-1', 'b1-item-2', 'b-category-2'] # current behaviour
['a0-item-1', 'a1-item-2', 'a-category-1', 'b0-item-1', 'b1-item-2', 'b-category-2'] # recreated behaviour
['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2'] # new behaviour

Also, your suggestion would require a complete re-write of natsort, so I think I will need compelling evidence that it is more performant as well before considering that level of intrusiveness.

That seems to be a problem. Maybe I'm doing something wrong, but there's a lot of overhead:

::test.bat
@echo off
echo 3 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 1)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 1)"
echo 30 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 10)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 10)"
echo 300 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 100)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 100)"
3 items
100000 loops, best of 5: 2.67 usec per loop # before
50000 loops, best of 5: 6.95 usec per loop # after
# -> 2.60x slower
30 items
10000 loops, best of 5: 20.2 usec per loop # before
5000 loops, best of 5: 81.9 usec per loop # after
# -> 4.05x slower
300 items
1000 loops, best of 5: 205 usec per loop # before
500 loops, best of 5: 962 usec per loop # after
# -> 4.69x slower

@nineteendo
Copy link
Author

Or, are the class-based code snippets just examples of behavior and not actual examples of how to implement? If so, it would not require a re-write, but would likely be backwards incompatible if not made a mode that can be toggled.

I have tried to update the example to closer match the actual implementation. The main difference is that we split on every character, but group numbers. However, then we need to use a wrapper to make them comparable (which is what BaseNatChar and NatChar are).

I would still like to hear an explanation or demonstration of what more flexible means.

Well, the main reason is very simple: matching the behaviour of macOS sorting on other platforms. This implementation gets really close.

@SethMMorton
Copy link
Owner

I would still like to hear an explanation or demonstration of what more flexible means.

Well, the main reason is very simple: matching the behaviour of macOS sorting on other > platforms. This implementation gets really close.

Sorry, I think you misunderstood my question. I had proposed a solution using existing behavior (the key argument) and you had responded with

I think a character approach would be more flexible, as you can still replicate the old existing behaviour.

I was looking to understand why the solution that uses existing functionality is insufficient and why natsort needs to be updated.

@nineteendo
Copy link
Author

OK, the solution proposed wouldn't work for me as the regex needs to be 136110 characters long to include a blacklist of all unicode letters.

@SethMMorton
Copy link
Owner

SethMMorton commented Apr 11, 2024

Would not r'(\D|\W)' work, which is "anything not numeric and anything not alphabetic?" Or, if you need to include _ then r'(\D|\W|_)'. That should cut down the characters from 136110 characters to 7 or 9.

In [1]: import re, natsort

In [2]: data = ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']

In [3]: character_splitter = re.compile(r"(\D|\W)")

In [4]: natsort.natsorted(data, key=character_splitter.split)
Out[4]: 
['a-category-1',
 'a0-item-1',
 'a1-item-2',
 'b-category-2',
 'b0-item-1',
 'b1-item-2']

@nineteendo
Copy link
Author

Sadly not. there are a lot of unicode letters (which macOS doesn't split on):

image.

And even more non letters:

image

@SethMMorton
Copy link
Owner

Let's start over.

It sounds to me like there is a set of requirements you have that were not present in the original problem statement:

Currently natsort doesn't sort strings with optional numbers intuitively because we only split around numbers.
It would be better to use a character based approach (like macOS)

As a result, I cannot clearly see the problems that you are having with the solutions I am providing you.

The solution I provided (pre-split on non-alpha numeric) is functionally identical to your solution (post-split on non-alphanumeric), except that I used the regex \W and you are using not str.isalpha - I would expect these to yield the same results, yet the previous post is indicating this is not what you want.

I'm afraid I cannot help any further until I get a complete and unambiguous specification of what desired behavior you are looking for. "Sorts like macOS", unfortunately, is too ambiguous because to my knowledge Apple's collation is not public.

@nineteendo
Copy link
Author

The solution I provided (pre-split on non-alpha numeric) is functionally identical to your solution...

Sadly it's not, not str.isalpha() matches every character that's not in the L* unicode character categories (Ll, Lm, Lo, Lt & Lu):
https://www.compart.com/en/unicode/category. \W matches [^A-Za-z0-9_].

@SethMMorton
Copy link
Owner

\W matches [^A-Za-z0-9_]

This is only true if the ASCII flag is used, which is not being used in this case. From the Python docs:

\W

Matches any character which is not a word character. This is the opposite of \w. By default, matches non-underscore (_) characters for which str.isalnum() returns False.

Matches [^a-zA-Z0-9_] if the ASCII flag is used.

So, the definitions of \W is directly tied to str.isalnum(), which is a superset of str.isalpha().

@nineteendo
Copy link
Author

nineteendo commented Apr 13, 2024

Thanks for correcting me. It's a bit confusing as it doesn't match other regex parsers.
Could we then maybe add key=re.compile(r"(\W+)") to the ns_enum?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants