Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errant parsing of parameters that can be a list of integers #396

Open
jimustafa opened this issue Feb 18, 2022 · 5 comments
Open

Errant parsing of parameters that can be a list of integers #396

jimustafa opened this issue Feb 18, 2022 · 5 comments

Comments

@jimustafa
Copy link
Contributor

For parameters that can be specified as a list of integers, such as exclude_bands and select_projections, there is a convenient notation that allows for individual integers and integer ranges to be specified together. For example (from the wannier90-v3.1.0 manual),

exclude_bands : 2, 6-8, 12

will yield, in the nnkp file, the following:

begin exclude_bands
   5
   2
   6
   7
   8
  12
end exclude_bands

It appears that spaces around the - do not matter, so that

exclude_bands : 2, 6 -8, 12

gives the same result. Clearly, a negative integer makes no sense here, but it still seems that this should be a parsing error.

The suggestion here is that only integer ranges where there are no spaces around - (even 6 - 8) should throw an error.

@jryates
Copy link
Member

jryates commented Feb 18, 2022

This comes from the generic parsing routine get_range_vector.
The three valid separators are ' ', ',' ';'

So
exclude_bands : 2, 6 -8, 12
is interpreted the same as
exclude_bands : 2, 6 -8, 12
and
exclude_bands : 2 6 -8 12
but differently to
exclude_bands : 2, 6, -8, 12
The latter should return a negative number, which is hopefully trapped in the calling code (must check this!)

At the moment a limitation of get_range_vector is that it can only robustly return positive integers. In fact this is (at the moment) all we need it to do. But the header should state this (trap for it?).

We could enforce no white space around a range separator. That could break old input files - so I'd like a discussion with @mostofi and @giovannipizzi

@jimustafa do you have a use case in which this lead to an error - or is it just from a defensive coding pov?

@jimustafa
Copy link
Contributor Author

I encountered this when working to write a parser for the WIN file (see aiidateam/aiida-wannier90#99 for some context). Splitting on the valid separators should give the same result, so the fact that exclude_bands : 2 6 -8 12 and exclude_bands : 2,6,-8,12 give different results seems like a bug. Why does a whitespace separator between 6 and -8 cause those two terms to be considered as a compound term representing a range, while the equally valid , or ; separator does not.

I would be surprised if there are any WIN files out in the wild that actually rely on 6 -8 being treated as a range, so the risk of breakage may be low. In any case, and assuming there is agreement that the current behavior should be changed, deferring a fix to the next major release may make a breaking change acceptable.

@giovannipizzi
Copy link
Member

Another option (partially mentioned above): can we state that the syntax for lists of integers only holds for positive integers? Is there any field where this is not the case? If not, we can say that any negative integer is invalid, ensure that 2,6,-8,12 complains that -8 is negative, and then both 2, 6 -8, 12 and 2, 6-8, 12 express the same thing (even if indeed the first one might be a bit unclear in its meaning - but we can document it).

Otherwise (or, in addition) we can also enforce that one cannot put whitespaces around the dashes, but while I see that 6 -8 as a range is not probable, 6 - 8 might be in principle.

@jimustafa
Copy link
Contributor Author

In Python, one might parse exclude_bands : 2, 6-8, 12 as follows:

>>> re.split(r'[ ,;]', '2, 6-8, 12')
['2', '', '6-8', '', '12']

Here, we get 3 types of elements, ones that are empty (can be ignored), ones that can be parsed as integers, and ones that look like a range, and could be further parsed into a range of integers.

In the other cases, we get:

>>> re.split(r'[ ,;]', '2, 6 -8, 12')
['2', '', '6', '-8', '', '12']
>>> re.split(r'[ ,;]', '2, 6 - 8, 12')
['2', '', '6', '-', '8', '', '12']

For these, one needs to look at a sequence of elements to try and interpret them as a range.

It seems that when treating whitespace as a list separator, the only valid-looking range element is one that does not include whitespace. Otherwise, there needs to be an exception to the rule that whitespace separates elements, which only works when negative integers are not allowed.

@jimustafa
Copy link
Contributor Author

jimustafa commented Jun 3, 2022

I scanned through the results in https://github.com/search?q=path%3A*.win+exclude_bands&type=code and did find a couple of examples that rely on parsing m - n as a range (see here). However, the majority of the WIN files found on GitHub use ranges without the intervening whitespace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants