Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add support for Pandoc's generic raw attributes #226

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kylebarron
Copy link
Contributor

See #225 and https://pandoc.org/MANUAL.html#generic-raw-attribute.

Pandoc added support for Inline and Block raw attributes with version 2.0.

image

LaTeX still isn't included perfectly. It might have something to do with the injectionsSelector property of the LaTeX grammars file.

With this branch of language-markdown:
image

With language-latex highlighting:
image

@Aerijo Your gist detailing injections is super helpful, and I see you've also worked on the language-latex package. Do you happen to quickly see the issue?

This is the relevant portion of the grammars file

  {
    begin: '^\\s{0,3}([`~]{3,})\\s*\\{\\s*=(latex|beamer)\\s*\\}\\s*?$'
    beginCaptures:
      1: name: 'punctuation.md'
    end: '^\\s{0,3}(\\1)$'
    endCaptures:
      1: name: 'punctuation.md'
    name: 'fenced.rawattribute.md'
    contentName: 'source.embedded.latex'
    patterns: [{ include: 'text.tex.latex' }]
  }

The scopes for the incorrectly-colored \begin{tabular} are:

Scopes at Cursor

text.md
fenced.rawattribute.md
source.embedded.latex
meta.function.environment.general.latex

and the scopes for the correctly-colored begin{center} are:

Scopes at Cursor

text.md
fenced.rawattribute.md
source.embedded.latex
meta.function.environment.general.latex
support.function.be.latex

When the file's scope is set to text.tex.latex, the scopes for \begin{tabular} are:

text.tex.latex
meta.function.environment.general.latex
meta.function.environment.tabular.latex
support.function.be.latex

I don't understand how the scopes are so different for the same text.

- See https://pandoc.org/MANUAL.html#generic-raw-attribute
- LaTeX still isn't included correctly. It might have something to do
with InjectionsSelector
@burodepeper
Copy link
Owner

Could you also add tests for this?
They might conflict with special attributes from one of the other flavors.
Will have a better look at it when I'm back home from my holiday.

@burodepeper
Copy link
Owner

@kylebarron Thought this through a bit. How widely used is this syntax?

Fenced code blocks already have the widely used syntax

``` js

and special attributes from Markdown Extra

``` js {.code-block #javascript}

if I remember correctly, don't use them myself.

Moving the language (detection) inside the curly brackets makes it quite a bit more complicated. Since the list of supported languages is dynamic (see https://github.com/burodepeper/language-markdown/blob/master/lib/GrammarCompiler.js#L82-L112 and https://github.com/burodepeper/language-markdown/blob/master/grammars/fixtures/fenced-code.cson), and is part of the grammar that is compiled, this change might be a tad more complicated than it seems.

@Aerijo
Copy link

Aerijo commented Jul 2, 2018

...the dreaded $base.

language-latex uses that pattern to refer back to it's top level, which is fine in a normal file, but it technically always means the base grammar of the file. Which in this case is language-markdown (hence the reversion to markdown scoping). The issue is well known with the C family of grammars, because they all share similar syntax and so use $base frequently.

I don't know if a solution is possible. The issue is less noticeable with my language-latex2e grammar, because I made a point of avoiding $base, but it seems some snuck in when I decided to copy and paste the table logic from language-latex.

On a different note, the patterns: [{ include: "text.tex.latex" }] is largely redundant, as it will automatically inject itself anywhere the source.embedded.latex scope is. Hypothetically, if the package were to remove it's injectionSelector key, then it would need patterns: .... But that probably won't happen. Having said that, it probably doesn't hurt to keep it.

@Aerijo
Copy link

Aerijo commented Jul 2, 2018

@burodepeper

Moving the language (detection) inside the curly brackets makes it quite a bit more complicated. Since the list of supported languages is dynamic

Taking capture groups and putting them in scope names is actually supported. I believe I've even seen a "to lower case" transformation (but can't confirm that). So you could capture the name in the braces, and use contentName: "source.embedded.$1".

This relies on the other language declaring itself with an injectionSelector, but could be good as a generic fallback. It seems to currently either include: "source" or scope as source.embedded, neither of which seem to do anything.

More on that here (I think; it's been a while since I've read it).

As for the note about spell checking with scopes: I need to follow up on that. I raised an issue, but nothing else has happened. I guess that's a while off, and you can ignore this comment for the time being.

@burodepeper
Copy link
Owner

Taking capture groups and putting them in scope names is actually supported. I believe I've even seen a "to lower case" transformation (but can't confirm that). So you could capture the name in the braces, and use contentName: "source.embedded.$1".

I've seen that as well. Found it in language-todo. Only works if all grammars use the same source.embedded.*** which they don't, and it also depends on the user knowing the scope of whatever language they use. js vs javascript for instance. It would work for a lot of circumstances, but not for all.

@burodepeper
Copy link
Owner

More on that here (I think; it's been a while since I've read it).

@Aerijo Did you write that guide? Never seen it before. I've just been hacking stuff together over the years.

@Aerijo
Copy link

Aerijo commented Jul 2, 2018

@burodepeper I wrote it at some point as a personal reference, but decided to make it a full guide. I've left some things unfinished, but I'm happy with it on the whole (and have referred back to it a couple of times).

@burodepeper
Copy link
Owner

@Aerijo 👍 I'll give it a read when I have some more time.

@kylebarron
Copy link
Contributor Author

kylebarron commented Jul 2, 2018

My main motivations for this change are that I often have sections of LaTeX, especially tables, within a Markdown document that I then use Pandoc to convert to PDF. The current highlighting of embedded latex is especially bad. Here's an example (copy-pastable text at the bottom):
image

Note that the HTML injection incorrectly colors all text after the table because it thinks a tag has been opened. With pandoc embedded_latex.md -o embedded_latex.pdf, that becomes:
image

Pandoc allows raw TeX either openly within the document or within raw attribute blocks, i.e.

```{=latex}
table here
```

There is no functional difference in Pandoc; it's just that for the syntax highlighting, the raw attribute blocks should be easier to color. Ideally, there would also be injections for common LaTeX environments like \begin{table}, \begin{equation}, etc.

@burodepeper

Moving the language (detection) inside the curly brackets makes it quite a bit more complicated.

I'm not moving the language detection inside the curly brackets. Instead I'm doing earlier detection for three specific rules. That's why #pandoc-raw-attribute-blocks has to be before #fenced-code-blocks. Unless the block starts with one of

```{=latex}
```{=html}
```{=openxml}

the block won't match any rule in #pandoc-raw-attribute-blocks and will continue on to #fenced-code-blocks for further classification.

You could argue that there's a cleaner way to add the code, but I don't see yet how this specific addition makes the language detection more complicated.

@Aerijo

...the dreaded $base.

Can you enlighten me on why language-latex needs to use $base? You write:

$base: similar to $self, but with some differences when embedded in another grammar. Not important right now, but just remember that $base is not the same as $self when your grammar is embedded in another. $self points to the grammar $self appears in (points to itself), whereas $base points to the base language of the file, which could be anything. If you don't know what I mean by embedded, don't use $base.

What's the benefit of $base then? It seems this is an example of why the grammar should use $self instead of $base...

embedded_latex.md:

---
title: Embedded LaTeX
header-includes:
    - \usepackage{booktabs}
---

\begin{table}[htbp]\centering
\def\sym#1{\ifmmode^{#1}\else\(^{#1}\)\fi}
\begin{tabular}{l*{5}{c}}
\toprule
&\multicolumn{1}{c}{(1)}&\multicolumn{1}{c}{(2)}&\multicolumn{1}{c}{(3)}&\multicolumn{1}{c}{(4)}&\multicolumn{1}{c}{(5)}\\
&\multicolumn{1}{c}{Dep. Var. }&\multicolumn{1}{c}{Dep. Var. }&\multicolumn{1}{c}{Dep. Var. }&\multicolumn{1}{c}{Dep. Var. }&\multicolumn{1}{c}{Dep. Var. }\\
Variable & 0.00\sym{**} & 0.00\sym{**} & 0.00 & 0.00 & 0.00\sym{**} \\
        & (0) & (0) & (0) & (0) & (0) \\
\addlinespace
Variable & 0.00\sym{***}& 0.00\sym{***}& 0.00\sym{***}& 0.00\sym{***}& 0.00\sym{***}\\
        & (0) & (0) & (0) & (0) & (0) \\
\addlinespace
Constant & 0.00\sym{***}& 0.00\sym{***}& 0.00\sym{***}& 0.00\sym{***}& 0.00\sym{***}\\
        & (0) & (0) & (0) & (0) & (0) \\
\midrule
N       & 0 & 0 & 0 & 0 & 0 \\
\bottomrule
\multicolumn{6}{l}{\footnotesize Standard errors in parentheses}\\
\multicolumn{6}{l}{\footnotesize \sym{*} \(p<0.05\), \sym{**} \(p<0.01\), \sym{***} \(p<0.001\)}\\
\end{tabular}
\end{table}

text is miscolored from now on

@burodepeper
Copy link
Owner

@kylebarron

Have you seen #190?

Additionally, if I use your testcase and fenced the latex with ```latex it doesn't work correctly either. The culprit seems to be the language-html that gets injected too early. That's something that needs to be fixed in any case.

I get the idea that you want to add it for just these three 'selectors'. Is that as far as it will go? Or will extra options have to be added in the future?

@kylebarron
Copy link
Contributor Author

I hadn't looked at #190 in a while. I'll take a look at it again.

Note that there is a difference between ```{=latex} and ```latex. The former tells Pandoc to evaluate the LaTeX code, whereas the latter means display the code verbatim in the final output.

Additionally, if I use your testcase and fenced the latex with ```latex it doesn't work correctly either.

Here's what I (and probably you) see:
image

At the beginning at least, LaTeX is correctly colored. That's why \begin{table} is colored well. Now I realize that the use of $base in the language-latex package is the reason why Markdown highlighting appears within the block. Every time they do recursion, it includes Markdown highlighting and not LaTeX highlighting.

Here's what I see when I change all the occurences of $base to $self in language-latex:
image

That's the correct syntax highlighting.

@burodepeper
Copy link
Owner

@kylebarron

Note that there is a difference between {=latex} and latex. The former tells Pandoc to evaluate the LaTeX code, whereas the latter means display the code verbatim in the final output.

Right, my bad...

If you could add some tests to the PR, I think I'm okay with adding it when all other issues have been ruled out. Would be nice to also add the attributes for inline code spans, if you haven't already.

@kylebarron
Copy link
Contributor Author

I have added highlighting for the inline spans as well.

It doesn't sound like I'll be able to write correct tests until we have a resolution of $base vs $self in the language-latex package.

@kylebarron
Copy link
Contributor Author

@Aerijo

Your language-latex2e works well embedded for general markup or math:
image

But you still use $base inside tabular environments, so it doesn't look great for tables:
image

Any chance your edits will make it back into language-latex?

@kylebarron
Copy link
Contributor Author

Everything should be fixed when my PR to language-latex gets merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants