Problems with parsing emphasis/style markup #12

Open
schoettl opened this issue May 3, 2020 · 10 comments

@schoettl
Collaborator

schoettl commented May 3, 2020

Problems with the non-greedy and recursive nature of emphasis markup [/*_+] are documented in #9.

A summary (originally written in German):

One syntax element in Org mode works much like in Markdown:

This is /italic/ text.

(* = bold, _ = underlined, + = strike-through, ...)

The task now is to parse such text.

text := { text-kursiv | text-normal }
text-kursiv := '/' text '/'
text-normal := regex|.[^/]*|

The difficulties are these:

  1. An /italic/ span must be preceded and followed by whitespace or punctuation.
  2. The /italic text/ itself must not begin or end with whitespace.
  3. The delimiter may also occur inside the /italic/slanted text/.
  4. The /italic/ text/ is as short as possible (here only "italic").
  5. The markup is recursive, so /italic text can also be bold/.
  6. The delimiters are ambiguous: a slash can also appear as an ordinary slash, and _ can mean underlining or sub_script depending on context.
  7. Besides the symbols text-kursiv and text-normal there are further syntax elements such as links, which are marked with an entirely different syntax.
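
To make points 1, 3 and 6 concrete, here is a minimal Python sketch of the naive reading "anything between two slashes is italic". Python is used only for illustration (the project itself is built on instaparse), and the pattern and function name are hypothetical.

```python
import re

# Naive reading of the grammar: whatever sits between two slashes is italic.
NAIVE_ITALIC = re.compile(r"/(.+?)/")

def naive_italic(line):
    """Return every span the naive pattern would mark as italic."""
    return NAIVE_ITALIC.findall(line)

print(naive_italic("This is /italic/ text."))         # ['italic']
# False positive: ordinary slashes are paired up (points 1 and 6).
print(naive_italic("Use either/or and and/or."))      # ['or and and']
# The span is cut short at the inner slash (point 3).
print(naive_italic("The /italic/slash text/ here."))  # ['italic']
```

The second and third calls show why the surrounding characters have to be taken into account.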

Org mode itself apparently resolves this differently on export: not with a BNF parser but with hand-written code, in particular a regex that matches not only the italic text but also the character before and after it.
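
A minimal Python sketch of that idea, under stated assumptions: the character sets for what may precede and follow a marker are modeled loosely on Org's defaults and are not copied from Emacs, and lookarounds stand in for matching the neighboring characters as part of the pattern. All names are hypothetical.

```python
import re

# Assumed sets of characters allowed directly before / after the markers.
PRE = r"""(?:^|(?<=[\s\-('"{]))"""
POST = r"""(?=$|[\s\-.,:;!?'")}\[])"""

# Body: one non-space character, or the shortest run that both starts
# and ends with a non-space character (points 2 and 4).
ITALIC = re.compile(PRE + r"/(\S|\S.*?\S)/" + POST)

def italic_spans(line):
    """Return the bodies of all italic spans found in a single line."""
    return ITALIC.findall(line)

print(italic_spans("The /italic/ text/ is short."))   # ['italic']
print(italic_spans("The /italic/slash text/ here."))  # ['italic/slash text']
print(italic_spans("Use either/or and and/or."))      # []
print(italic_spans("/ not italic /"))                 # []
```

Because the closing slash only counts when an allowed character follows it, the inner slash in the second example stays part of the body.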

Here, though, that does not work so easily. First, I cannot specify a regex for the symbol text-kursiv (because of the recursion). Second, text-kursiv cannot know whether it is preceded by a space: the BNF dialect in use supports look-ahead but not look-behind. Finally, the regex for text-normal is difficult because it has to stop at exactly the right place: sometimes after a space when a / follows, sometimes without any further condition when a [ follows (link or footnote).

See also org-emph-re in Emacs.

@munen
Contributor

munen commented May 5, 2020

@branch14 Do you potentially have an input/idea on how to tackle this?

@schoettl provided more links on the issue in his last PR, too: #9

@branch14
Member

@munen, @schoettl I acknowledge that Org-mode might have syntactic elements that cannot be properly parsed by EBNF/PEG. While the project's goal is to have as much of the Org-mode syntax as possible formalized in an EBNF/PEG grammar, we need an alternative (more pragmatic) approach to provide a full-featured parser.

Some of the issues mentioned here can be implemented in the grammar, while others might need to be deferred to the transformation step mentioned in #9. E.g. multi-line text styles cannot be tackled with a line-based parser, but can easily be dealt with using regexes in a transformation.
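
As a rough illustration of the multi-line case, here is a Python sketch that joins the lines of a paragraph and lets an emphasis body contain at most one line break. The one-line-break limit, the character sets and all names are assumptions made for this sketch, not the project's transformation code.

```python
import re

# Italic span whose body may cross at most one line break.
ITALIC_MULTILINE = re.compile(
    r"""(?:^|(?<=[\s\-('"{]))"""               # assumed allowed predecessors
    r"""/(\S|\S[^\n]*?(?:\n[^\n]*?)?\S)/"""    # body with at most one "\n"
    r"""(?=$|[\s\-.,:;!?'")}\[])"""            # assumed allowed successors
)

def italic_spans_in_paragraph(lines):
    """Join the lines of one paragraph and return the italic bodies."""
    paragraph = "\n".join(lines)
    return [
        m.group(1).replace("\n", " ")
        for m in ITALIC_MULTILINE.finditer(paragraph)
    ]

print(italic_spans_in_paragraph(["This /italic text", "continues here/ and ends."]))
# ['italic text continues here']
```

A span that would need two or more line breaks is left alone by this pattern.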

@branch14
Member

Here I laid out what the code for the transformation could look like: #15

For parsing multi-line styles a second transformation step would be needed. (Not part of this PR.)

@schoettl
Collaborator Author

Even single-line styles have severe problems in EBNF. I want to check whether it is reasonable to move all style markup (emphasis and verbatim, as the spec calls it) into the transformations. An advantage of this approach is that we can reuse the logic and regexes from Org mode.
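
A minimal Python sketch of what such a transformation could look like for a single line, assuming one regex for all four markers and a recursive call for nested styles. The node shape, the character sets and the function name are hypothetical and not taken from org-parser or Org mode.

```python
import re

MARKERS = {"*": "bold", "/": "italic", "_": "underline", "+": "strike-through"}

EMPHASIS = re.compile(
    r"""(?:^|(?<=[\s\-('"{]))"""     # assumed allowed predecessors
    r"""([*/_+])(\S|\S.*?\S)\1"""    # marker, shortest body, same marker
    r"""(?=$|[\s\-.,:;!?'")}\[])"""  # assumed allowed successors
)

def parse_inline(text):
    """Split text into plain strings and [style, child, ...] nodes."""
    nodes, pos = [], 0
    for m in EMPHASIS.finditer(text):
        if m.start() > pos:
            nodes.append(text[pos:m.start()])
        # Recurse into the body so that styles can nest.
        nodes.append([MARKERS[m.group(1)], *parse_inline(m.group(2))])
        pos = m.end()
    if pos < len(text):
        nodes.append(text[pos:])
    return nodes

print(parse_inline("This is /italic and *bold*/ text."))
# ['This is ', ['italic', 'italic and ', ['bold', 'bold']], ' text.']
```

Markers that sit between word characters, as in snake_case_name, are left as plain text because no allowed character precedes them.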

@munen
Contributor

munen commented May 17, 2020

An advantage of this approach is that we can reuse the logic and regexes from orgmode.

👍

@munen
Contributor

munen commented May 17, 2020

That could actually prove to be a major benefit.

As long as Emacs is not using org-parser (cough), it could prove very beneficial to keep some complicated parts close to the Elisp codebase.

@schoettl
Collaborator Author

schoettl commented May 17, 2020

Note to self:

  1. Check if it's OK or "allowed" that emphasis spans other elements that are already parsed via EBNF. E.g. *this [[url]] is bold and _O_2 is also included in the underlined_ text*

  2. If 1. is good, there is one difficulty: When we apply the emphasis regexes on a parsed line, we need a string as input, not a parse tree containing links or otherwise parsed elements. To get the string input, AFAIK we have to "export" the parse tree to get the original string. Similar to what is done in organice's org_export.js.

It would be great if instaparse had a way to get the original, unparsed text along with the parse tree, but I don't think it does. EDIT: It does: the meta and span functions.

@schoettl
Collaborator Author

Regarding 2.: instaparse has a built-in way to get position/location meta information from the parse tree! Even if the parse tree looks like it only holds the parsed data, the meta and span functions return this information: https://cljdoc.org/d/instaparse/instaparse/1.4.9/doc/readme#character-spans

So, if we have the original input text, it's no problem to apply emphasis regexes on the original line. We do have all position information about elements parsed via EBNF.
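
A rough Python sketch of how that could fit together, with loud assumptions: the link spans below come from a stand-in regex rather than from the parse tree (in the real code they would be the spans instaparse reports for already parsed elements), and the emphasis regex and all names are hypothetical.

```python
import re

# Stand-in for the (start, end) spans of elements parsed via EBNF.
LINK = re.compile(r"\[\[[^\]]*\]\]")

EMPHASIS = re.compile(
    r"""(?:^|(?<=[\s\-('"{]))"""     # assumed allowed predecessors
    r"""([*/_+])(\S|\S.*?\S)\1"""    # marker, shortest body, same marker
    r"""(?=$|[\s\-.,:;!?'")}\[])"""  # assumed allowed successors
)

def emphasis_with_elements(line):
    """Match emphasis on the original line and list the element spans inside it."""
    element_spans = [m.span() for m in LINK.finditer(line)]
    result = []
    for m in EMPHASIS.finditer(line):
        start, end = m.span()
        inside = [s for s in element_spans if start <= s[0] and s[1] <= end]
        result.append((m.group(1), (start, end), inside))
    return result

line = "*this [[url]] is bold and _O_2 is also included in the underlined_ text*"
print(emphasis_with_elements(line) == [("*", (0, len(line)), [(6, 13)])])  # True
```

Only the outermost span is reported here; nested styles would need a recursive pass over the body with the offsets shifted accordingly.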

@schoettl schoettl self-assigned this Jun 8, 2020
@schoettl schoettl added the documentation Improvements or additions to documentation label May 20, 2021
@jcguu95

jcguu95 commented Apr 1, 2024

It's been 4 years, and this issue seems to be the last unfinished todo item in /org-parser/README.org. What is its current state?

  1. Is this the only gap between org-parser and the official org parser?
  2. Will this be done in the future, perhaps by extending EBNF and instaparse a bit more?

@schoettl
Collaborator Author

schoettl commented Apr 1, 2024

The project hasn't been very active since then and no one has worked on this specific problem. I guess it's not the only gap. The checklist in the README may miss some less common Org features. There is also plenty of room for improvement on the transformation side (the step of converting the instaparse parse result into a more meaningful data structure). For example, joining the lines of a paragraph, which would be a prerequisite for parsing style markup in the transformation step.

I'm personally more concerned about #56 – that's why I don't invest much time.
