Problems with parsing emphasis/style markup #12

schoettl · 2020-05-03T22:13:55Z

Problems with the non-greedy and recursive nature of emphasis markup [/*_+] are documented in #9, specifically:

Parse content text that can contain style, links, footnotes, timestamps, ... #9 (comment)

A summary (translated from German):

A syntax element in Org mode looks similar to Markdown:

This is /italic/ text.

(* = bold, _ = underlined, + = strikethrough, ...)

The task now is to parse this text. In the grammar below, text-kursiv stands for italic text and text-normal for plain text.

text := { text-kursiv | text-normal }
text-kursiv := '/' text '/'
text-normal := regex|.[^/]*|

The difficulties are these:

There must be a space or punctuation before and after an /italic/ text.
The /italic text/ itself must not begin or end with a space.
The delimiter may also occur inside the /italic/italic text/.
The /italic/ text/ is as short as possible (so here only "italic").
The whole thing is recursive, so /italic text can also be bold/.
The delimiters are ambiguous: e.g. the slash can also occur as an ordinary character, and _ can mean underline or sub_script depending on context.
Besides the symbols text-kursiv and text-normal there are further syntax elements such as links, which are marked with an entirely different syntax.
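To make the first few difficulties concrete, here is a minimal Python sketch (not part of org-parser; the pattern and the function name `naive_italics` are made up for illustration) of what a naive "slash, shortest body, slash" rule does on ordinary text:

```python
import re

# Naive rule: a slash, the shortest possible body, a closing slash.
# Hypothetical pattern for illustration only.
NAIVE_ITALIC = re.compile(r"/(.+?)/")

def naive_italics(line: str) -> list[str]:
    """Return the bodies the naive pattern treats as italic."""
    return NAIVE_ITALIC.findall(line)

# Intended markup is found ...
print(naive_italics("This is /italic/ text."))       # ['italic']
# ... but plain slashes that are not markup are matched as well,
print(naive_italics("read/write and input/output"))  # ['write and input']
# and so are bodies that begin and end with a space.
print(naive_italics("a / b / c"))                    # [' b ']
```

The second and third results are exactly the false positives that the rules about surrounding characters and border whitespace are meant to exclude.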

Org mode itself apparently handles this differently on export: not with a BNF parser but programmatically, and in particular with a regex that matches not only the italic text but also the character before and after it.

Here, however, it is not that simple. First, I cannot specify a regex for the symbol text-kursiv (because of the recursion). Second, text-kursiv cannot know whether it is preceded by a space or not. The BNF in use supports look-ahead but not look-behind. And finally, the regex for text-normal is tricky because it has to stop at exactly the right place: sometimes after a space, if a / follows; sometimes without any additional criterion, if a [ follows (link or footnote).

See also org-emph-re in Emacs.
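A rough Python sketch of the regex-based idea, under stated assumptions: the character classes in `PRE` and `POST` are simplified guesses and are not copied from org-emph-re, and Python lookaround is used where an Emacs regex would have to match the neighbouring characters. The helper `find_emphasis` is hypothetical and handles the recursion by re-applying the pattern to each body.

```python
import re

# Assumed character classes; Org's real lists differ in detail.
# Negative lookarounds also succeed at the start and end of the text.
PRE = r"""(?<![^\s\-('"{])"""        # before: start, whitespace or - ( ' " {
POST = r"""(?![^\s\-.,:!?;'")}\[])"""  # after: end, whitespace or punctuation

EMPH = re.compile(
    PRE
    + r"(?P<marker>[*/_+])"    # opening delimiter
    + r"(?P<body>\S|\S.*?\S)"  # shortest body without whitespace at its borders
    + r"(?P=marker)"           # the same delimiter closes the span
    + POST
)

def find_emphasis(text):
    """Return (marker, body, nested) triples; nested holds emphasis inside body."""
    return [
        (m["marker"], m["body"], find_emphasis(m["body"]))
        for m in EMPH.finditer(text)
    ]

print(find_emphasis("The /italic/ text/ is short"))
# [('/', 'italic', [])]
print(find_emphasis("read/write and sub_script"))
# []
print(find_emphasis("*bold with /italic/ inside*"))
# [('*', 'bold with /italic/ inside', [('/', 'italic', [])])]
```

Because the surrounding characters are only inspected and not consumed, two emphasized words separated by a single space can both match in this sketch.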


munen · 2020-05-05T16:58:26Z

@branch14 Do you have any input or ideas on how to tackle this?

@schoettl provided more links on the issue in his last PR, too: #9

branch14 · 2020-05-13T08:15:25Z

@munen, @schoettl I acknowledge that Org-mode might have syntactic elements that cannot be properly parsed by EBNF/PEG. While the project's goal is to have as much of the Org-mode syntax as possible formalized in an EBNF/PEG, we need an alternative (more pragmatic) approach to provide a full-featured parser.

Some of the issues mentioned here can be implemented in the grammar, while others might need to be deferred to the transformation step mentioned in #9. E.g. multi-line text styles cannot be tackled with a line-based parser, but can easily be dealt with using regexes in a transformation.
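A small Python sketch of why a transformation step helps with multi-line styles (illustrative only; the `BOLD` pattern and `bold_spans` are hypothetical and much simpler than Org's rules): a pattern applied line by line finds nothing, while the same pattern applied to the joined lines does.

```python
import re

# Hypothetical bold pattern; re.DOTALL lets the body cross line breaks.
BOLD = re.compile(r"(?<!\S)\*(\S(?:.*?\S)?)\*(?!\w)", re.DOTALL)

def bold_spans(lines):
    """Join the lines of one paragraph, then look for bold spans."""
    paragraph = "\n".join(lines)
    return [m.group(1) for m in BOLD.finditer(paragraph)]

lines = ["This is *bold text that", "continues on the next line* here."]

# Line by line, neither line contains a complete span.
print([BOLD.findall(line) for line in lines])
# [[], []]

# After joining, the span is found across the line break.
print(bold_spans(lines))
# ['bold text that\ncontinues on the next line']
```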

branch14 · 2020-05-13T10:11:28Z

Here I laid out what the code for the transformation could look like: #15

For parsing multi-line styles a 2nd transformation step would be needed. (Not part of this PR.)

schoettl · 2020-05-13T13:14:20Z

Even single-line styles have severe problems in EBNF. I want to check whether it is reasonable to put all style markup (emphasis and verbatim, as it's called in the spec) into the transformations. An advantage of this approach is that we can reuse the logic and regexes from Org mode.

munen · 2020-05-17T08:33:57Z

> An advantage of this approach is that we can reuse the logic and regexes from orgmode.

👍

munen · 2020-05-17T08:35:32Z

That could actually prove to be a major benefit.

As long as Emacs is not using org-parser *cough*, it could prove very beneficial to keep some complicated parts close to the Elisp codebase.

schoettl · 2020-05-17T08:49:26Z

Note to self:

1. Check if it's OK or "allowed" for emphasis to span other elements that are already parsed via EBNF. E.g. *this [[url]] is bold and _O_2 is also included in the underlined_ text*
2. If 1. is good, there is one difficulty: When we apply the emphasis regexes on a parsed line, we need a string as input, not a parse tree containing links or otherwise parsed elements. ~~To get the string input, AFAIK we have to "export" the parse tree to get the original string. Similar to what is done in organice's org_export.js.~~

It would be great if instaparse had a way to get the original, unparsed text along with the parse tree, ~~but I don't think it has.~~ EDIT: It does: the meta and span functions.

schoettl · 2020-05-17T09:04:03Z

Regarding 2.: instaparse has a built-in way to get position/location meta information from the parse tree! Even if the parse tree looks like it only holds the parsed data, the meta and span functions return this information: https://cljdoc.org/d/instaparse/instaparse/1.4.9/doc/readme#character-spans

So, if we have the original input text, it's no problem to apply emphasis regexes on the original line. We do have all position information about elements parsed via EBNF.
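A Python sketch of how the position information could be combined with emphasis regexes (an illustration, not org-parser code): `LINK` is a stand-in for spans that would come from the parse tree, `EMPH` is a simplified pattern, and `emphasis_with_children` is a hypothetical helper. Emphasis is matched on the original line; already parsed elements are attached when fully enclosed, and matches that cut through a parsed element are dropped.

```python
import re

# Simplified emphasis pattern (an assumption, not Org's org-emph-re).
EMPH = re.compile(r"(?<!\S)([*/_+])(\S(?:.*?\S)?)\1(?!\w)")
# Stand-in for the grammar: real spans would come from the parse tree.
LINK = re.compile(r"\[\[.*?\]\]")

def parsed_spans(line):
    """Return (kind, start, end) for elements the grammar already parsed."""
    return [("link", *m.span()) for m in LINK.finditer(line)]

def emphasis_with_children(line, spans):
    """Match emphasis on the raw line and attach enclosed parsed elements."""
    results = []
    for m in EMPH.finditer(line):
        start, end = m.span()
        overlapping = [(k, s, e) for k, s, e in spans if s < end and start < e]
        children = [(k, s, e) for k, s, e in overlapping if start < s and e < end]
        if len(children) != len(overlapping):
            continue  # a delimiter falls inside an already parsed element
        results.append((m.group(1), m.group(2), children))
    return results

line = "*this [[url]] is bold* text"
print(emphasis_with_children(line, parsed_spans(line)))
# [('*', 'this [[url]] is bold', [('link', 6, 13)])]

line = "[[a /b]] c/ d"
print(emphasis_with_children(line, parsed_spans(line)))
# []
```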

jcguu95 · 2024-04-01T17:39:34Z

It's been 4 years, and this issue seems to be the last unfinished todo item in /org-parser/README.org. What is its current state?

Is this the only gap between org-parser and the official org parser?
Will this be done in the future, perhaps by extending EBNF and instaparse a bit more?

schoettl · 2024-04-01T18:22:03Z

The project hasn't been very active since then, and no one has worked on this specific problem. I guess it's not the only gap. The checklist in the README may be missing some less common Org features. There is also a lot of room for improvement on the transformation side (the step of converting the instaparse parse result into a more meaningful data structure), e.g. joining the lines of a paragraph, which would be a requirement for parsing style markup in the transformation step.

I'm personally more concerned about #56 – that's why I don't invest much time.

munen mentioned this issue May 17, 2020

Transform parse tree #15

Merged

schoettl self-assigned this Jun 8, 2020

schoettl mentioned this issue May 20, 2021

Parser will not parse styles within sections. #26

Closed

schoettl added the documentation Improvements or additions to documentation label May 20, 2021

schoettl mentioned this issue Jan 11, 2022

Performance problems #56

Open
