Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated the pull request from #469 so that it will cleanly apply against master. #1441

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

chotchki
Copy link
Contributor

Cleaned up the conflicts from #469 so that the parser can be merged cleanly.

@chotchki chotchki marked this pull request as ready for review October 24, 2021 16:46
@Stargateur
Copy link
Contributor

Stargateur commented Oct 24, 2021

Can you explain more clearly why it's perfect for your use case (with example & grammar) ? For me this parser is totally opposite to the concept of combinator. This instead of eat input and give the rest to other combinator, eat the end, and let the head to other combinator.

This lead to very ineffective parser.

Plus, it's unfortunate you remove the original author.

@chotchki
Copy link
Contributor Author

chotchki commented Oct 25, 2021

@Stargateur The issue that caused me to switch to using this particular combinator is the definition of sql string literals in Postgres found here: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS (Section 4.1.2.1).

It requires a single ', any combination of unicode characters and a terminating single quote '. Multiple sql strings are merged into one IF and ONLY IF they are joined by whitespace with a newline.

Valid example:
'foo'
'bar'

Invalid example:
'foo' 'bar'

This combinator is fairly greedy and seemed the best way to look ahead and match as much of the whole string as possible. My current code is here (other attempts have been commented out in the link): https://github.com/chotchki/feophant/blob/main/src/engine/sql_parser/constants.rs#L29

Further work is coming to support the variety of embedded escape sequences.


I only recreated the pull request and code because I couldn't get git rebase to replay the original author's commits (despite trying several times this week). From my googling I can't directly edit the original pull request so this was my attempted work around. I tagged the original pull request to link back and give credit, if I've made a mistake with that, my apologies!

@Stargateur
Copy link
Contributor

Stargateur commented Oct 25, 2021

Seem such syntax is parser very well with separated_list1 as separated_list1(new_line, constant_sql_string)(s). constant_sql_string as delimited(tag('\''), accepted_char, tag('\''))

I only recreated the pull request and code because I couldn't get git rebase to replay the original author's commits (despite trying several times this week). From my googling I can't directly edit the original pull request so this was my attempted work around. I tagged the original pull request to link back and give credit, if I've made a mistake with that, my apologies!

I didn't want to blame in anyway, I also struggle with git it's hard to use.

I think you can do something like that:

git checkout take_until_parser_matches
git branch -m take_until_parser_matches_tmp
git remote add tomalexander https://github.com/tomalexander/nom.git
git fetch tomalexander
git checkout --track tomalexander/take_until_parser_matches
git cherry-pick 706c89355ae56d57b8e181b1770d0dc2fab7fc70
# resolve conflict
git push --force

Best I can propose. But don't worry that much, that just unfortunate.

@Stargateur
Copy link
Contributor

See #1444 that allow empty sep if you want separated_list1(opt(new_line), constant_sql_string)(s)

@NickNick
Copy link

NickNick commented Oct 26, 2021

Can you explain more clearly why it's perfect for your use case (with example & grammar) ? For me this parser is totally opposite to the concept of combinator. This instead of eat input and give the rest to other combinator, eat the end, and let the head to other combinator.

I was parsing a language that relies on keywords, and allows multiple words as names in between, like <verb> <name> [modifier] e.g. <Rub> <the lovely kitten's belly> [<for 30 minutes> <with vigor>]. To figure out where 'name' ends, I look in this specific case for when a modifier (in that example for 30 minutes and with vigor) succeeds to parse. It's not efficient... the suggestion of starting the parse at the end sounds nice, but it would complicate the logic a bit (not all examples add with a list of modifiers like this). I feel this is a nice addition to nom because it is intuitive (for me anyway ;-)), albeit not the most efficient way to parse such grammars.

@Stargateur
Copy link
Contributor

Stargateur commented Oct 26, 2021

I was parsing a language that relies on keywords, and allows multiple words as names in between, like <verb> <name> [modifier] e.g. <Rub> <the lovely kitten's belly> [<for 30 minutes> <with vigor>]. To figure out where 'name' ends, I look in this specific case for when a modifier (in that example for 30 minutes and with vigor) succeeds to parse.

I don't get why you simply not parse <verb> then <name> then the modifiers. Since all of them are correctly delimited I don't see the problem here. I can produce an example if you wish with the following requirement:

let input = "<Rub> <the lovely kitten's belly> [<for 30 minutes> <with vigor>]";
let result = parse(input);
assert_eq!(result, Ok({
  verb: "Rub",
  names: vec!["the lively kitten's belly"],
  modifiers: vec!["for 30 minutes", "with vigor"],
}));

Thus like you say it's working for you, but I still don't see why "official" nom should have this. Tell me If I miss something, if needed be more precise I will try to help solve the problem.

In my opinion as user of nom, a new parser should solve a problem that can't be solve using other conbinator (or introduce a shortcut like for example many0). Add a parser that simply brute force every bytes of the input doesn't meet nom philosophy. Nom eat byte by byte. I agree there is already a number of parser like this in nom but I don't think there are a good addition to nom.

@NickNick
Copy link

The brackets were for clarity, the actual input is just Rub the lovely kitten's belly for 30 minutes with vigor. I am sure you can come up with a way to parse this specific example without this combinator though, but I don't get why you think this is such an unusual combinator while there is take_until. This is just a fancier version of it for when the thing you want to take_until is not a fixed value. That is not so weird right?

Copy link
Contributor

@Stargateur Stargateur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I would like improve this:

  • remove no consume variant (no need the sub parser can control this behavior using peek())
  • change signature to
    pub fn take_until0<F, G, Input, Output, Error>(f: F, g: G) ->
      impl FnMut(Input) -> IResult<Input, (Input, Output), Error>
    where this almost like many_till but instead of accumulate result drop them but return consumed input of f parser.
  • put it in multi module (or combinator module), take_* should not be in bytes module anymore, this combinator is simple a variant of many_till that doesn't accumulate. And no sense to put it in complete when they is no streaming version
  • add take_until1 variant
  • add take_until_m_n variant
  • Optionally depreciate other take_ function (maybe in another PR)

Why take_until name ? cause I think with that we can depreciate original all take function in bytes.

maybe take_until variant are not needed

I think we can factorize take_until0 and many_till into a fold_until one could use it as:

let (tail, (len, modifier)) = fold_until0(recognize(preceded(opt(char(' ')), word)), modifier,
  || 0, |acc, i| acc + i.len())(input)?;
let name = input[..len];

with this signature:

pub fn fold_until0<F, G, A, Input, Output, Error>(f: F, g: G, init: H, acc: A) ->
  impl FnMut(Input) -> IResult<Input, (Acc, Output), Error>

@chotchki
Copy link
Contributor Author

chotchki commented Oct 31, 2021

@Stargateur I like the approach, working on it.

@chotchki
Copy link
Contributor Author

chotchki commented Nov 3, 2021

@Stargateur This is pushing my understanding of nom but I'm trying to understand the signature fold_until0 completely.

I have the following signature with the where clause:

pub fn fold_until0<F, G, H, A, Input, Output, Error>(
  f: F,
  g: G,
  init: H,
  acc: A,
) -> impl FnMut(Input) -> IResult<Input, (A, Output), Error>
where
  Input: InputTake + InputIter + InputLength + Clone,
  F: Parser<Input, Output, Error>,
  G: FnMut(A, Output) -> (A, Output),
  H: FnMut() -> (A, Output),
  Error: ParseError<Input>,
  • "f" is the child parser that will recognize the end
  • "g": I don't understand in the context of the other arguments
  • "init": Function for the initial state for the accumulator
  • "acc": Function for building the accumulator up. I don't know if this really should be "g"
  • I'm returning the remaining input, the accumulation, the content until "f" matched (dropping "f"'s output).

Can you please help me understand if BOTH "g" and "acc" make sense? I'm reading https://docs.rs/nom/7.0.0/nom/multi/fn.fold_many0.html and it seems to just use g to accumulate.

@Stargateur
Copy link
Contributor

pub fn fold_until0<P, Until, Init, Acc, Fold, Input, Output, UntilOutput, Error>(
    parser: P,
    until: Until,
    init: Init,
    fold: Fold,
) -> impl FnMut(Input) -> IResult<Input, (Acc, UntilOutput), Error>
where
    Input: InputTake + InputIter + InputLength + Clone,
    F: Parser<Input, Output, Error>,
    Until: Parser<Input, UntilOutput, Error>,
    Fold: FnMut(Acc, Output) -> Acc,
    Init: FnMut() -> Acc,
    Error: ParseError<Input>;
  • parser is used to parse what is before could be anychar for quick and dirty parsing
  • until is the parser used to stop the output of this parser is returned by fold_until
  • init is the init fct used to get the accumulator of the user
  • fold is the user fct that use the accumulator and return it

the body of fold_until0 is very simulator to #1341

@Geal Geal added this to the 8.0 milestone Mar 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants