Character based sorting #174

nineteendo · 2024-04-06T14:22:09Z

Describe the feature or enhancement

Currently natsort doesn't sort strings with optional numbers intuitively because we only split around numbers.
It would be better to use a character based approach (like macOS):

import re
from re import Pattern

def try_int(string: str) -> int | str:
    return int(string) if string.isdigit() else string

class BaseNatChar:
    def __init__(self, value: int | str) -> None:
        self.value: int | str = value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, BaseNatChar) and self.value == other.value

    def __lt__(self, other: 'BaseNatChar') -> bool:
        if isinstance(self.value, str):
            return isinstance(other.value, str) and self.value < other.value

        return isinstance(other.value, str) or self.value < other.value


class NatChar(BaseNatChar):
    def __lt__(self, other: BaseNatChar) -> bool:
        if not isinstance(self.value, str):
            if not isinstance(other.value, str):
                return self.value < other.value

            return other.value.isalpha()

        if not isinstance(other.value, str):
            return not self.value.isalpha()

        if self.value.isalpha() == other.value.isalpha():
            return self.value < other.value

        return other.value.isalpha()

def natsorted(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'\D|\d+')

    def key_func(string) -> tuple[NatChar, ...]:
        return tuple(map(NatChar, map(try_int, regex.findall(string))))

    return sorted(lst, key=key_func)

Provide a concrete example of how the feature or enhancement will improve `natsort`

>>> import natsort
>>> natsort.natsorted(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2'])
['a0-item-1', 'a1-item-2', 'a-category-1', 'b0-item-1', 'b1-item-2', 'b-category-2']

New behaviour: ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']

Would you be willing to submit a Pull Request for this feature?

Yes, but I'm not sure how to integrate this in the project.

The text was updated successfully, but these errors were encountered:

SethMMorton · 2024-04-09T16:25:44Z

You can already enable this functionality by creating a simple key out of regular expressions.

In [1]: import re

In [2]: data = ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']

In [3]: import natsort

In [4]: natsort.natsorted(data)
Out[4]: 
['a0-item-1',
 'a1-item-2',
 'a-category-1',
 'b0-item-1',
 'b1-item-2',
 'b-category-2']

In [5]: character_splitter = re.compile(r"-")

In [6]: natsort.natsorted(data, key=character_splitter.split)
Out[6]: 
['a-category-1',
 'a0-item-1',
 'a1-item-2',
 'b-category-2',
 'b0-item-1',
 'b1-item-2']

You can customize this regex to be as flexible as you need: r"[-_/]" would split on any of the threee characters given, and adding parenthesis (e.g. r"([-_/])") would make it take the specific character into account when sorting.

nineteendo · 2024-04-09T17:52:17Z

I think a character approach would be more flexible, as you can still replicate the old existing behaviour.

SethMMorton · 2024-04-09T19:31:00Z

I'm not sure I follow - can you please demonstrate how it is more flexible? Also, your suggestion would require a complete re-write of natsort, so I think I will need compelling evidence that it is more performant as well before considering that level of intrusiveness.

SethMMorton · 2024-04-09T20:05:09Z

Or, are the class-based code snippets just examples of behavior and not actual examples of how to implement? If so, it would not require a re-write, but would likely be backwards incompatible if not made a mode that can be toggled.

I would still like to hear an explanation or demonstration of what more flexible means. In owning this library for 10+ years, I have found that one person's "obvious" way to sort a collection of strings is another person's "incorrect" way to sort a collection of strings. I have also found that there is no universal heuristic (hence the huge number of specifiers for the ns enum as well as the key argument for further customization). It is likely that this modification to the baseline algorithm would be useful in some cases, but it is also possible that not everyone would agree that it is the correct solution, so I would be significantly more likely to consider this enhancement as an alg option rather than the default behavior.

nineteendo · 2024-04-11T13:59:12Z

I'm not sure I follow - can you please demonstrate how it is more flexible?

Sure. We can still sort the same as before, but now we can also control the sorting on a character level (see the updated comment above for the helper functions):

def natsorted1(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'(\d+)')

    def key_func(string) -> tuple[int | str, ...]:
        return tuple(map(try_int, regex.split(string)))

    return sorted(lst, key=key_func)


def natsorted2(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'\D|\d+')

    def key_func(string) -> tuple[BaseNatChar, ...]:
        return tuple(map(BaseNatChar, map(try_int, regex.findall(string))))

    return sorted(lst, key=key_func)


def natsorted3(lst: list[str]) -> list[str]:
    regex: Pattern = re.compile(r'\D|\d+')

    def key_func(string) -> tuple[NatChar, ...]:
        return tuple(map(NatChar, map(try_int, regex.findall(string))))

    return sorted(lst, key=key_func)

print(natsorted1(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))
print(natsorted2(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))
print(natsorted3(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))

Output:

['a0-item-1', 'a1-item-2', 'a-category-1', 'b0-item-1', 'b1-item-2', 'b-category-2'] # current behaviour
['a0-item-1', 'a1-item-2', 'a-category-1', 'b0-item-1', 'b1-item-2', 'b-category-2'] # recreated behaviour
['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2'] # new behaviour

Also, your suggestion would require a complete re-write of natsort, so I think I will need compelling evidence that it is more performant as well before considering that level of intrusiveness.

That seems to be a problem. Maybe I'm doing something wrong, but there's a lot of overhead:

::test.bat
@echo off
echo 3 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 1)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 1)"
echo 30 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 10)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 10)"
echo 300 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 100)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 100)"

3 items
100000 loops, best of 5: 2.67 usec per loop # before
50000 loops, best of 5: 6.95 usec per loop # after
# -> 2.60x slower
30 items
10000 loops, best of 5: 20.2 usec per loop # before
5000 loops, best of 5: 81.9 usec per loop # after
# -> 4.05x slower
300 items
1000 loops, best of 5: 205 usec per loop # before
500 loops, best of 5: 962 usec per loop # after
# -> 4.69x slower

nineteendo · 2024-04-11T14:11:58Z

Or, are the class-based code snippets just examples of behavior and not actual examples of how to implement? If so, it would not require a re-write, but would likely be backwards incompatible if not made a mode that can be toggled.

I have tried to update the example to closer match the actual implementation. The main difference is that we split on every character, but group numbers. However, then we need to use a wrapper to make them comparable (which is what BaseNatChar and NatChar are).

I would still like to hear an explanation or demonstration of what more flexible means.

Well, the main reason is very simple: matching the behaviour of macOS sorting on other platforms. This implementation gets really close.

SethMMorton · 2024-04-11T16:45:41Z

I would still like to hear an explanation or demonstration of what more flexible means.

Well, the main reason is very simple: matching the behaviour of macOS sorting on other > platforms. This implementation gets really close.

Sorry, I think you misunderstood my question. I had proposed a solution using existing behavior (the key argument) and you had responded with

I think a character approach would be more flexible, as you can still replicate the old existing behaviour.

I was looking to understand why the solution that uses existing functionality is insufficient and why natsort needs to be updated.

nineteendo · 2024-04-11T21:40:26Z

OK, the solution proposed wouldn't work for me as the regex needs to be 136110 characters long to include a blacklist of all unicode letters.

SethMMorton · 2024-04-11T23:22:57Z

Would not r'(\D|\W)' work, which is "anything not numeric and anything not alphabetic?" Or, if you need to include _ then r'(\D|\W|_)'. That should cut down the characters from 136110 characters to 7 or 9.

In [1]: import re, natsort

In [2]: data = ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']

In [3]: character_splitter = re.compile(r"(\D|\W)")

In [4]: natsort.natsorted(data, key=character_splitter.split)
Out[4]: 
['a-category-1',
 'a0-item-1',
 'a1-item-2',
 'b-category-2',
 'b0-item-1',
 'b1-item-2']

nineteendo · 2024-04-12T06:20:54Z

Sadly not. there are a lot of unicode letters (which macOS doesn't split on):

.

And even more non letters:

SethMMorton · 2024-04-13T03:08:23Z

Let's start over.

It sounds to me like there is a set of requirements you have that were not present in the original problem statement:

Currently natsort doesn't sort strings with optional numbers intuitively because we only split around numbers.
It would be better to use a character based approach (like macOS)

As a result, I cannot clearly see the problems that you are having with the solutions I am providing you.

The solution I provided (pre-split on non-alpha numeric) is functionally identical to your solution (post-split on non-alphanumeric), except that I used the regex \W and you are using not str.isalpha - I would expect these to yield the same results, yet the previous post is indicating this is not what you want.

I'm afraid I cannot help any further until I get a complete and unambiguous specification of what desired behavior you are looking for. "Sorts like macOS", unfortunately, is too ambiguous because to my knowledge Apple's collation is not public.

nineteendo · 2024-04-13T14:27:53Z

The solution I provided (pre-split on non-alpha numeric) is functionally identical to your solution...

Sadly it's not, not str.isalpha() matches every character that's not in the L* unicode character categories (Ll, Lm, Lo, Lt & Lu):
https://www.compart.com/en/unicode/category. \W matches [^A-Za-z0-9_].

SethMMorton · 2024-04-13T16:05:49Z

\W matches [^A-Za-z0-9_]

This is only true if the ASCII flag is used, which is not being used in this case. From the Python docs:

\W

Matches any character which is not a word character. This is the opposite of \w. By default, matches non-underscore (_) characters for which str.isalnum() returns False.

Matches [^a-zA-Z0-9_] if the ASCII flag is used.

So, the definitions of \W is directly tied to str.isalnum(), which is a superset of str.isalpha().

nineteendo · 2024-04-13T16:25:44Z

Thanks for correcting me. It's a bit confusing as it doesn't match other regex parsers.
Could we then maybe add key=re.compile(r"(\W+)") to the ns_enum?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Character based sorting #174

Character based sorting #174

nineteendo commented Apr 6, 2024 •

edited

SethMMorton commented Apr 9, 2024

nineteendo commented Apr 9, 2024 •

edited

SethMMorton commented Apr 9, 2024

SethMMorton commented Apr 9, 2024

nineteendo commented Apr 11, 2024 •

edited

nineteendo commented Apr 11, 2024

SethMMorton commented Apr 11, 2024

nineteendo commented Apr 11, 2024

SethMMorton commented Apr 11, 2024 •

edited

nineteendo commented Apr 12, 2024

SethMMorton commented Apr 13, 2024

nineteendo commented Apr 13, 2024

SethMMorton commented Apr 13, 2024

nineteendo commented Apr 13, 2024 •

edited

Character based sorting #174

Character based sorting #174

Comments

nineteendo commented Apr 6, 2024 • edited

Describe the feature or enhancement

Provide a concrete example of how the feature or enhancement will improve natsort

Would you be willing to submit a Pull Request for this feature?

SethMMorton commented Apr 9, 2024

nineteendo commented Apr 9, 2024 • edited

SethMMorton commented Apr 9, 2024

SethMMorton commented Apr 9, 2024

nineteendo commented Apr 11, 2024 • edited

nineteendo commented Apr 11, 2024

SethMMorton commented Apr 11, 2024

nineteendo commented Apr 11, 2024

SethMMorton commented Apr 11, 2024 • edited

nineteendo commented Apr 12, 2024

SethMMorton commented Apr 13, 2024

nineteendo commented Apr 13, 2024

SethMMorton commented Apr 13, 2024

nineteendo commented Apr 13, 2024 • edited

nineteendo commented Apr 6, 2024 •

edited

Provide a concrete example of how the feature or enhancement will improve `natsort`

nineteendo commented Apr 9, 2024 •

edited

nineteendo commented Apr 11, 2024 •

edited

SethMMorton commented Apr 11, 2024 •

edited

nineteendo commented Apr 13, 2024 •

edited