Announcing v0.10 #70

Open
PoiScript opened this issue Nov 17, 2023 · 19 comments
@PoiScript
Owner

Hello everyone. After leaving this crate almost unmaintained for over three years, I finally have some time to pick up where we left off. :)

Three years have passed, and many new features and awesome crates have been introduced in the Rust world, so I think it's time for us to rebuild orgize with them in mind. I'm thrilled to announce that I'll be publishing orgize v0.10 in the next couple of weeks.

This version is a total rewrite of orgize. Some notable changes include:

Switching to rowan

In this version, we're switching the underlying data structure from indextree to rowan, one of the core crates of rust-analyzer.

To visualise that, the input * Heading *bold* will now become something like:

rowan helps us resolve some long-standing problems, such as high memory usage and being unable to get the position of each parsed element.

Also, thanks to the rich ecosystem around rowan, we might have an LSP server for org-mode in the future, just like rust-analyzer is for the Rust language.

Traversing org element tree

Another big change in this version is that org.iter() is now superseded by org.traverse(). When walking through the org element tree, traversal is much more flexible than simple iteration. For example, you can now skip a whole subtree and continue with its siblings, without aborting the entire traversal, by calling traversal_context.skip().
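To make the skip-and-continue idea concrete, here is a minimal sketch of an enter/leave walk over a toy tree. It does not use orgize: Node, Event, Ctx, traverse, sample_tree and walk_log are hypothetical names made up for this example, and only the idea of calling skip() on a context object comes from the announcement.

```rust
// Toy tree and walker for illustration only; these are NOT orgize's types.
struct Node {
    name: &'static str,
    children: Vec<Node>,
}

enum Event<'a> {
    Enter(&'a Node),
    Leave(&'a Node),
}

#[derive(Default)]
struct Ctx {
    skip_requested: bool,
}

impl Ctx {
    /// Ask the walker not to descend into the node that was just entered.
    fn skip(&mut self) {
        self.skip_requested = true;
    }
}

fn traverse<F>(node: &Node, ctx: &mut Ctx, handler: &mut F)
where
    F: FnMut(Event<'_>, &mut Ctx),
{
    handler(Event::Enter(node), ctx);
    // Take (and reset) the flag so that it only affects this node.
    let skipped = std::mem::take(&mut ctx.skip_requested);
    if !skipped {
        for child in &node.children {
            traverse(child, ctx, handler);
        }
    }
    // Leave is still reported, and the walk continues with the siblings.
    handler(Event::Leave(node), ctx);
}

/// doc -> [h1 -> [a, b], h2 -> [c]]
fn sample_tree() -> Node {
    let leaf = |name| Node { name, children: vec![] };
    Node {
        name: "doc",
        children: vec![
            Node { name: "h1", children: vec![leaf("a"), leaf("b")] },
            Node { name: "h2", children: vec![leaf("c")] },
        ],
    }
}

/// Records "+name" on enter and "-name" on leave, skipping below `skip_name`.
fn walk_log(root: &Node, skip_name: &str) -> Vec<String> {
    let mut log = Vec::new();
    let mut ctx = Ctx::default();
    traverse(root, &mut ctx, &mut |event: Event<'_>, c: &mut Ctx| match event {
        Event::Enter(node) => {
            log.push(format!("+{}", node.name));
            if node.name == skip_name {
                c.skip();
            }
        }
        Event::Leave(node) => log.push(format!("-{}", node.name)),
    });
    log
}
```

Calling walk_log(&sample_tree(), "h1") visits doc, h1, h2 and c but never a or b, which a plain iterator cannot express without extra bookkeeping.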

Lossless parsing

Orgize is now a lossless parser. This means you can always get the original input back from the Org struct:

let s = "* hello /world/!";
let org = Org::parse(s);
assert_eq!(s, org.to_org());
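The principle behind lossless parsing can be sketched without orgize at all: if every byte of the input, whitespace included, ends up in exactly one token, then concatenating the tokens reproduces the input. The toy tokenizer below only illustrates that property; Kind, Token, tokenize and to_source are hypothetical names and have nothing to do with how orgize or rowan are actually implemented.

```rust
// Illustration of the lossless property only; NOT orgize's parser.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Whitespace,
    Word,
}

struct Token<'a> {
    kind: Kind,
    text: &'a str,
}

/// Splits the input into alternating runs of whitespace and non-whitespace.
/// No character is dropped: every byte belongs to exactly one token.
fn tokenize(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut current: Option<Kind> = None;
    for (idx, ch) in input.char_indices() {
        let kind = if ch.is_whitespace() { Kind::Whitespace } else { Kind::Word };
        match current {
            // Same kind of run: keep extending it.
            Some(k) if k == kind => {}
            // The run changed kind: close the previous token.
            Some(k) => {
                tokens.push(Token { kind: k, text: &input[start..idx] });
                start = idx;
                current = Some(kind);
            }
            None => current = Some(kind),
        }
    }
    if let Some(k) = current {
        tokens.push(Token { kind: k, text: &input[start..] });
    }
    tokens
}

/// Rebuilds the source text by concatenating the token texts in order.
fn to_source(tokens: &[Token<'_>]) -> String {
    tokens.iter().map(|t| t.text).collect()
}
```

A parser that discarded whitespace or normalised markup while tokenizing could not offer this round trip.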

Try it out today

You can try out v0.10 by installing it from crates.io:

or by using the new demo website: https://poiscript.github.io/orgize

I hope you guys enjoy the new release, and feel free to send any feedback!

What's next

Before publishing v0.10, I want to fix as many of the other issues and merge as many pull requests as possible.

After that, I want to support parsing org entities and inlinetasks, and try to build an LSP server for org-mode using tower-lsp and orgize.

@PoiScript PoiScript pinned this issue Nov 17, 2023
@justinabrahms

Hey PoiScript. Thanks for the update. I've forked orgize to add a few things I needed, so I'd love to get them into 0.10. You've fixed the exporter need I have w/ the Traverser trait. The one thing I'm missing now is that I need access to an iterable of ancestors when evaluating each node.

You can see my expected use-case here: https://github.com/justinabrahms/org-roam-to-subtext/blob/main/src/main.rs#L65-L72

I need to know if I'm within a quote so I can prefix Text lines with > for the export format I'm targeting, because it's line based not block based.

This was my commit which added it to the older code base: 8f66d5c

@PoiScript
Owner Author

PoiScript commented Nov 21, 2023

hi @justinabrahms, it's possible to access the ancestors of a parsed element by using some low-level APIs in rowan:

use orgize::SyntaxKind;

fn text(&mut self, token: SyntaxToken, _ctx: &mut TraversalContext) {
    let inside_quote = token
        .parent_ancestors()
        .any(|n| n.kind() == SyntaxKind::QUOTE_BLOCK);

    // ... 
}

but that's not performant for your use case, since you have to check the ancestors for every text token. I would recommend storing a flag in your traverser and setting it in quote_block:

pub struct MyTraverser {
    inside_quote: bool
}

fn text(&mut self, token: SyntaxToken, _ctx: &mut TraversalContext) {
    if self.inside_quote {
        // ... e.g. prefix the line with "> "
    }
}

fn quote_block(&mut self, event: WalkEvent<&QuoteBlock>, _ctx: &mut TraversalContext) {
    self.inside_quote = matches!(event, WalkEvent::Enter(_));
}
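The flag approach can be tried out in isolation with a self-contained sketch. The Event enum, LineExporter and export below are hypothetical stand-ins invented for this example (they are not orgize's Traverser or WalkEvent types). The only part taken from the discussion is the idea of setting a boolean on enter, clearing it on leave, and consulting it for each text.

```rust
// Hypothetical stand-ins for traversal events; NOT orgize's real types.
enum Event<'a> {
    EnterQuote,
    LeaveQuote,
    Text(&'a str),
}

#[derive(Default)]
struct LineExporter {
    inside_quote: bool,
    output: String,
}

impl LineExporter {
    fn event(&mut self, event: Event<'_>) {
        match event {
            // Flip the flag when a quote block starts or ends.
            Event::EnterQuote => self.inside_quote = true,
            Event::LeaveQuote => self.inside_quote = false,
            Event::Text(line) => {
                // O(1) check per text, instead of walking the ancestors.
                if self.inside_quote {
                    self.output.push_str("> ");
                }
                self.output.push_str(line);
                self.output.push('\n');
            }
        }
    }
}

/// Feeds all events to a fresh exporter and returns the rendered text.
fn export(events: Vec<Event<'_>>) -> String {
    let mut exporter = LineExporter::default();
    for event in events {
        exporter.event(event);
    }
    exporter.output
}
```

If quotes could nest in the target format, a depth counter would be a safer choice than a single bool.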

@justinabrahms

So I've spent a bit of time this morning attempting to convert my tool to v0.10. Some observations:

  1. We need a plaintext handler to implement b/c overriding all of the htmlisms is going to be very tedious when implementing a plain-text format. Defaulting to org syntax seems fine to me.
  2. I can't actually fetch the underlying data in, for instance, a Keyword.. because all of the methods to actually get at the syntax are private or crate-private. This means I can't, for instance, get the #+title: document element. I'd expect to be able to get the key and the value separately.
  3. The forward macro errors if you don't import orgize::rowan::WalkEvent which was surprising. Not sure how common that is in rust macro land.
  4. I have some minor concerns about what might happen to the stability of the API if rowan either falls out of favor or gets a backwards incompatible change, given we're straight re-exporting their stuff. You know more about the library and their compatibility guarantees, so I'll trust you on it.
  5. finish() as a "gimmie my string please" indicates finality, but you could call it multiple times without error. I would have either expected something like render() or to just implement a string conversion.

@PoiScript
Owner Author

PoiScript commented Dec 5, 2023

hi, @justinabrahms, thanks for the feedback!

I also found it tedious to implement the Traverser trait, since it requires too many methods to be defined, so I've combined all of these methods into a single event method in the next release.

And now it should be pretty straightforward to create a plain text format:

struct PlainText(String);

impl Traverser for PlainText {
    fn event(&mut self, event: Event, ctx: &mut TraversalContext) {
        if let Event::Text(text) = event {
            self.0 += text.text();
        }
    }
}

> I have some minor concerns about what might happen to the stability of the API if rowan either falls out of favor or gets a backwards incompatible change, given we're straight re-exporting their stuff.

We re-export the rowan crate for convenience. Users should only need to call its API for advanced usage, so it'll be fine as long as rowan follows semantic versioning. If you do rely on the rowan API and are worried about breaking changes, you can pin its version in Cargo.toml.

> finish() as a "gimmie my string please" indicates finality, but you could call it multiple times without error.

You cannot call finish more than once, because it consumes self.
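For readers newer to Rust, this is what "consumes self" means in practice. The sketch uses a hypothetical Exporter type made up for illustration, not orgize's actual exporter. The point is only that a method taking self by value moves the receiver, so a second call is rejected at compile time.

```rust
/// Hypothetical exporter for illustration; NOT orgize's real type.
#[derive(Default)]
struct Exporter {
    buffer: String,
}

impl Exporter {
    fn push(&mut self, text: &str) {
        self.buffer.push_str(text);
    }

    /// Takes `self` by value: the exporter is moved into this call
    /// and cannot be used again afterwards.
    fn finish(self) -> String {
        self.buffer
    }
}

fn render(parts: &[&str]) -> String {
    let mut exporter = Exporter::default();
    for part in parts {
        exporter.push(part);
    }
    let out = exporter.finish();
    // A second call would not compile:
    // exporter.finish(); // error[E0382]: use of moved value: `exporter`
    out
}
```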

@justinabrahms

@PoiScript I've picked this back up and have tried to get it working, but I can't get it to 100%. The parsing changes do feel better, so thank you.

I'm unclear how to print out the AST of what the system thinks my document structure is. I need this for debugging. The main problem is all of the SyntaxNode attributes are hidden from me, so I can't dump them via serde or similar. Rowan appears to have support for this. Can we get a method for this on Document?

In a similar vein, you've removed the ability for me to look at the ancestry of my node b/c I no longer get a SyntaxToken in the event, so I've had to adopt various boolean flags, which I don't prefer.

The API for Headline feels a bit odd. I'd have expected a method there to just return a text string, rather than the iterator that's there now. I'm guessing org is weirder than I expect here?

@PoiScript
Owner Author

@justinabrahms

> I'm unclear how to print out the AST of what the system thinks my document structure is.

You can import the orgize::rowan::ast::AstNode trait to do so:

use orgize::Org;
use rowan::ast::AstNode;

fn main() {
    let org = Org::parse("* hello");
    println!("{:#?}", org.document().syntax());
}

AstNode is implemented for all elements.

Also, you can use the demo website (https://poiscript.github.io/orgize/): the structure of your input will be shown under the 'Syntax' tab.

> b/c I no longer get a SyntaxToken in the event

I don't really get it, did you mean the Event::Text variant?

> The API for Headline feels a bit odd. I'd have expected a method there to just return a text string, rather than the iterator that's there now.

A headline title can contain other elements, like bold and italic, so Headline::title() actually returns the "parsed" title. I just added a new method, Headline::title_raw(), that returns the raw title string.

@justinabrahms

> @justinabrahms
>
> > I'm unclear how to print out the AST of what the system thinks my document structure is.
>
> You can import the orgize::rowan::ast::AstNode trait to do so:
>
> use orgize::Org;
> use rowan::ast::AstNode;
>
> fn main() {
>     let org = Org::parse("* hello");
>     println!("{:#?}", org.document().syntax());
> }
>
> AstNode is implemented for all elements.

Wild. I've never seen an import change the behavior like that, though I'm a rust newbie. This is the error I was getting otherwise. Do you know the search term for what this pattern is called so I can learn more?

error[E0599]: no method named `syntax` found for struct `orgize::ast::Document` in the current scope
   --> src/main.rs:195:39
    |
195 |     println!("{:#?}", tree.document().syntax());
    |                                       ^^^^^^ private field, not a method
    |
   ::: /home/abrahms/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rowan-0.15.15/src/ast.rs:40:8
    |
40  |     fn syntax(&self) -> &SyntaxNode<Self::Language>;
    |        ------ the method is available for `orgize::ast::Document` here

> > b/c I no longer get a SyntaxToken in the event
>
> I don't really get it, did you mean the Event::Text variant?

I'm good now. I don't think I realized that the text element was being passed a rowan::SyntaxToken.

> > The API for Headline feels a bit odd. I'd have expected a method there to just return a text string, rather than the iterator that's there now.
>
> A headline title can contain other elements, like bold and italic, so Headline::title() actually returns the "parsed" title. I just added a new method, Headline::title_raw(), that returns the raw title string.

I don't think you need to add this. I forgot that headlines in org were rich so this makes sense now.

<3

I'm a happy camper now. Thanks for your patience.

@rynoV

rynoV commented Jan 12, 2024

> Do you know the search term for what this pattern is called so I can learn more?

The pattern is just Rust's traits: in this case there is a trait called AstNode, defined in rowan, which provides the syntax method, and a trait's methods can only be called when the trait is in scope. I also found it unintuitive while getting into Rust, since there's no obvious connection between what you have to import and what you're trying to use. It probably didn't help that orgize's Document has a private field with the same name, which made the error more confusing than it normally would have been, but the second part of the error you showed usually helps you figure out what to import in this scenario.
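The behaviour can be reproduced in a few lines with no external crates. The module and type names below (tree, Document, AstNode, dump) are chosen to mirror the situation in this thread, but they are a hypothetical reconstruction, not orgize's or rowan's actual definitions.

```rust
mod tree {
    pub struct Document {
        // Private field sharing its name with the trait method; this is
        // what leads to the "private field, not a method" hint in E0599.
        syntax: String,
    }

    impl Document {
        pub fn new(text: &str) -> Self {
            Document { syntax: text.to_string() }
        }
    }

    pub trait AstNode {
        fn syntax(&self) -> &str;
    }

    impl AstNode for Document {
        fn syntax(&self) -> &str {
            &self.syntax
        }
    }
}

// Trait methods are only callable while the trait is in scope.
// Remove this `use` and `doc.syntax()` below fails with error[E0599].
use tree::AstNode;
use tree::Document;

fn dump(doc: &Document) -> &str {
    doc.syntax()
}
```

Useful search terms are "trait methods in scope" and the error code E0599.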

@PoiScript
Owner Author

Thanks @rynoV for explaining! I don't actually run into this problem when writing Rust code, since rust-analyzer is usually smart enough to locate and import the trait you want:

[image]

@gmemstr

gmemstr commented Jan 28, 2024

I find the removal of the .keywords() functionality a little frustrating, and haven't found a satisfactory replacement for it - my use case is a static site generator, and my posts typically have #+title and #+date defined for displaying and also organization. Is this function planned to be re-implemented, or is it something we'll have to rely on the traversal functionality in our own code to access?

@rynoV

rynoV commented Jan 30, 2024

@gmemstr I'm not sure exactly what keywords did but I think this should do what you need:

use orgize::{Org, ast::Keyword};
use rowan::ast::{support, AstNode};

let org = Org::parse("#+KEY1: VALUE1\n#+KEY2: VALUE2\nabc");
let keywords: Vec<Keyword> =
    support::children::<Keyword>(org.document().section().expect("No section").syntax()).collect();

assert_eq!(keywords.len(), 2);

assert_eq!(keywords[0].key(), "KEY1");
assert_eq!(keywords[0].value(), " VALUE1");

assert_eq!(keywords[1].key(), "KEY2");
assert_eq!(keywords[1].value(), " VALUE2");

@rrix

rrix commented May 6, 2024

i'm finally coming around to rewriting my org tool to use the 0.10 api and trying to solve some of these issues

> @gmemstr I'm not sure exactly what keywords did but I think this should do what you need:
>
> use orgize::{Org, ast::Keyword};
> use rowan::ast::{support, AstNode};
>
> let org = Org::parse("#+KEY1: VALUE1\n#+KEY2: VALUE2\nabc");
> let keywords: Vec<Keyword> =
>     support::children::<Keyword>(org.document().section().expect("No section").syntax()).collect();

this will only report keywords under the "0th" heading/section. There is also a Document::keywords() in at least alpha8 but this also does not descend in to grand-child headings, reporting only the top-level keywords and the first child sections.

i cooked up a traverse handler fn that does this:

    let mut handler = from_fn(|event| match event {
        // others ...
        Event::Enter(Container::Section(the_section)) => {
            for kw in the_section.syntax().children().filter_map(Keyword::cast) {
                keywords.push(dbg!(kw))
            }
        }
        _ => {}
    });

@PoiScript
Owner Author

Document::keywords only returns keywords in the 0th section, aka top-level keywords, which is enough for @gmemstr's use case. I also introduced a new method, Document::title, in the last release to extract the title keyword directly.

@rrix actually, the traverser can handle keywords for you:

let org = Org::parse("#+KEY1: VALUE1\n* HELLO\n#+KEY2: VALUE2");

let mut keywords = vec![];

let mut handle = from_fn(|event| {
    if let Event::Enter(Container::Keyword(kw)) = event {
        keywords.push((kw.key(), kw.value()));
    }
});

org.traverse(&mut handle);

for (key, value) in keywords {
    println!("{} -> {}", key.as_ref(), value.as_ref());
}
// KEY1 ->  VALUE1
// KEY2 ->  VALUE2

@rrix

rrix commented May 8, 2024

> @rrix actually, the traverser can handle keywords for you:

ack, thank you, i did eventually see that in my implementation.

i took some notes as i've been going about rewriting my site's metadata extractor from 0.9 to 0.10 the last few days. my code is online at https://code.rix.si/rrix/arroyo_rs2/ (mind the built-in assumptions tailored for my site engine data model. mind the ugly state tracking, i still don't know rust very well and am not a great programmer in general. mind that that url may be dead in the future as it is folded in to a canonical url, a literate, self-hosted engine driven by orgize 0.9 https://cce.whatthefuck.computer/arroyo/arroyo-rs)

Overall, I think this iteration of the API is quite nice, well done and ty. i'm finding the rowan stuff a lot easier to work with than the previous arena thingy, and the handler fn is easier to work with as a newbie than the prior HtmlHandler trait and all the boilerplate i had to supply for the error types. i have a few questions, some with example snippets:

  • org-roam creates "level 0", top-level property drawers which aren't parsed in to PropertyDrawers. This isn't a regression, 0.9 left these as Drawers too, i had a fork of 0.9 to make an interface public to reach in and parse a drawer in to a property drawer, in 0.10 I take the raw drawer content and append it to a "* Throwaway\n" fake heading to extract them from[1]; this is better than changing an interface, but i wonder if there is a better way to take a drawer's string and get a PropertyDrawer; i might try to extend the Document type to have a properties() once i'm more comfortable with the lib
  • node properties with names like "header-args:rust" cause a propertydrawer to not be parsed. See header arguments
  • ast::Link has a has_description() but no description(), i had to do rowan::support chicanery to extract it[2]
  • is there any way to extend the rowan syntax, or should i implement my own? i use org-fc which has a "cloze" quiz type with a markup to have strings within a Text to be {{answer}{hint}@index} which i transform in to a span with alt-text, but if any of those sections has a link or other inline in it, it breaks my hacky regex.

[1]:

        Event::Enter(Container::Drawer(the_drawer)) => {
            if the_drawer.name() == "PROPERTIES" {
                let h = format!("* Throwaway\n{}", the_drawer.raw());
                let parsed = dbg!(Org::parse(h));
                let drawer = parsed.first_node::<PropertyDrawer>().unwrap();
                // do something with the orgize::ast::PropertyDrawer here
            }
        }

[2]:

            let desc: Option<String> =
                support::token(the_link.syntax(), orgize::SyntaxKind::TEXT).map(|t| t.to_string());

@PoiScript
Owner Author

PoiScript commented May 8, 2024

@rrix thanks for your feedback and that's a really cool project!

I'm not familiar with org-roam or org-fc, and their syntax appears to deviate from standard org-mode syntax.

would you be willing to create a new issue including some example usage? we could have a feature flag for parsing these extension syntaxes, allowing users to opt in manually.

@PoiScript
Owner Author

@rrix

> node properties with names like "header-args:rust" cause a propertydrawer to not be parsed.

this issue was fixed in 0.10.0-alpha.9

> • ast::Link has a has_description() but no description(), i had to do rowan::support chicanery to extract it[2]

actually, the previous implementation of has_description didn't handle descriptions that weren't plain text. This was also fixed in 0.10.0-alpha.9. This version also introduces Link::description and Link::description_raw, which are similar to Headline::title and Headline::title_raw.

@rrix

rrix commented May 8, 2024

> @rrix
>
> > node properties with names like "header-args:rust" cause a propertydrawer to not be parsed.
>
> this issue was fixed in 0.10.0-alpha.9
>
> > • ast::Link has a has_description() but no description(), i had to do rowan::support chicanery to extract it[2]
>
> actually, the previous implementation of has_description didn't handle descriptions that weren't plain text. This was also fixed in 0.10.0-alpha.9. This version also introduces Link::description and Link::description_raw, which are similar to Headline::title and Headline::title_raw.

nice, ty :)

i'll open some issues for the org-fc and org-roam syntaxes

@rrix

rrix commented May 8, 2024

> @rrix thanks for your feedback and that's a really cool project!
>
> I'm not familiar with org-roam or org-fc, and their syntax appears to deviate from standard org-mode syntax.
>
> would you be willing to create a new issue including some example usage? we could have a feature flag for parsing these extension syntaxes, allowing users to opt in manually.

yup, thanks! org-roam is actually relying on syntax which was recently allowed in org 9.5. I opened #78 and #79

cheers!
