Atom: use correct xml:base for decoded elements #222

cristoper · 2024-02-24T23:08:03Z

The issue is that goxpp's DecodeElement() pops the BaseStack after unmarshaling an element, as it should to keep the BaseStack in sync with the current document scope. But that means the atom parser needs to keep a reference to the current base before calling DecodeElement() so that it can correctly resolve URLs.

Without this fix, relative URLs inside elements that carry an xml:base attribute are erroneously resolved against the parent's xml:base.

This fix is a little awkward with all of its saving and restoring of BaseStacks. A simpler solution would be to read the xml:base attribute from the decoded element and resolve against that. However, that would require changes to goxpp: either changing XmlBaseResolveUrl() to accept a string rather than acting as a method, or adding a separate ResolveUrl() function that takes a base and a relative URL, which gofeed could use.

To keep tracking xml:base correctly, goxpp's `DecodeElement` pops the BaseStack if the start element added a base. That means the atom parser needs to keep track of the base *before* calling `DecodeElement` and use it to resolve relative URLs within the decoded element. Without this fix, elements with xml:base attributes will be erroneously resolved against the parent xml:base.

mmcdole · 2024-02-28T05:55:27Z

> This fix is a little awkward with all of its saving and restoring of BaseStacks. A simpler solution would be to read the xml:base attribute from the decoded element and resolve against that. However, that would require changes to goxpp: either changing XmlBaseResolveUrl() to accept a string rather than acting as a method, or adding a separate ResolveUrl() function that takes a base and a relative URL, which gofeed could use.

Maybe this makes sense, @cristoper?

You are suggesting that in goxpp we have:

func (p *XMLPullParser) ResolveUrl(baseURL, relativeUrl string) (string, error) {
    // Logic to resolve relativeUrl against baseURL + the current BaseStack URLs?
}

Then in the Atom parser, something like this?

func (ap *Parser) parseAtomText(p *xpp.XMLPullParser) (string, error) {
    var text struct {
        Type     string `xml:"type,attr"`
        Mode     string `xml:"mode,attr"`
        Base     string `xml:"base,attr"` // Added to capture xml:base attribute
        InnerXML string `xml:",innerxml"`
    }

    // DecodeElement is used to unmarshal the element
    err := p.DecodeElement(&text)
    if err != nil {
        return "", err
    }

    // Use the captured xml:base if present; otherwise, use the current base URL
    baseURL := text.Base
    if baseURL == "" {
        baseURL = p.GetCurrentBaseURL() // Fallback to current base URL from the parser
    }

    // Resolve URLs in InnerXML using the determined base URL
    resolvedInnerXML, err := p.ResolveUrl(baseURL, text.InnerXML)
    if err != nil {
        return "", err
    }

    return strings.TrimSpace(resolvedInnerXML), nil
}
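One detail the sketch above leaves open is that an element's xml:base may itself be relative, in which case it would need to be resolved against the base already in scope before it is used. The following is a minimal, self-contained illustration using only Go's `net/url`. `effectiveBase` and `resolveURL` are hypothetical names for this example, not goxpp or gofeed functions, and the URLs are made up.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL resolves rel against base using net/url's reference
// resolution (hypothetical helper for illustration).
func resolveURL(base, rel string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(rel)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

// effectiveBase returns the base that applies inside an element.
// parentBase is the base in scope outside the element; xmlBase is the
// element's own xml:base value, which may be empty, relative, or absolute.
func effectiveBase(parentBase, xmlBase string) (string, error) {
	if xmlBase == "" {
		return parentBase, nil
	}
	return resolveURL(parentBase, xmlBase)
}

func main() {
	// A relative xml:base is resolved against the parent's base first.
	base, err := effectiveBase("http://example.org/feed/", "entries/")
	if err != nil {
		panic(err)
	}

	// Links inside the element are then resolved against that result.
	link, err := resolveURL(base, "a.html")
	if err != nil {
		panic(err)
	}

	fmt.Println(base) // http://example.org/feed/entries/
	fmt.Println(link) // http://example.org/feed/entries/a.html
}
```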

cristoper · 2024-02-29T03:25:20Z

Yes, that's approximately what I had in mind. I just pushed an update to the PR which does something similar, but copies XmlBaseResolveUrl to gofeed so that we don't have to change goxpp.

However, if you'd prefer not to duplicate code from goxpp, we could instead make the change there (even by introducing a new function, so we don't make any backward-incompatible change to the goxpp API).

This provides an equivalent fix that doesn't do any inelegant swapping out of the BaseStack. It also avoids changing `goxpp`'s public API, essentially by copying `XmlBaseResolveUrl` into `gofeed`.

@JLugagne

PR mmcdole#220 introduced a failing test for detecting images in the "content" element. It should instead be testing the "content:encoded" element. But that uncovered an issue with how extensions were being detected (the "content" namespace was being detected as an extension namespace). As a more robust way of checking for the "content" namespace, this PR exposes `shared.PrefixForNamspace()` as a public function so it can be used in the rss parser. This provides a solution to PR mmcdole#211 (and includes @JLugagne's test case from that PR). Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix the remaining failing test reported in mmcdole#210.
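The idea of checking for the "content" namespace by its URI rather than by a hard-coded prefix can be sketched as follows. This is an illustration only: the lowercase `prefixForNamespace` and `isContentEncoded` below are hypothetical, the `spaces` map (namespace URI to declared prefix) is an assumed shape for the parser's namespace table, and none of it is the real signature of the shared helper mentioned above.

```go
package main

import "fmt"

// contentNS is the namespace URI of the RSS 1.0 content module.
const contentNS = "http://purl.org/rss/1.0/modules/content/"

// prefixForNamespace returns the prefix the document declared for the
// given namespace URI. spaces maps namespace URI -> prefix (assumed shape).
func prefixForNamespace(space string, spaces map[string]string) (string, bool) {
	prefix, ok := spaces[space]
	return prefix, ok
}

// isContentEncoded reports whether an element with the given prefix and
// local name is content:encoded, whatever prefix the feed chose for it.
func isContentEncoded(prefix, local string, spaces map[string]string) bool {
	want, ok := prefixForNamespace(contentNS, spaces)
	return ok && prefix == want && local == "encoded"
}

func main() {
	// A feed that binds the content module to the unusual prefix "c".
	spaces := map[string]string{contentNS: "c"}

	fmt.Println(isContentEncoded("c", "encoded", spaces))       // true
	fmt.Println(isContentEncoded("content", "encoded", spaces)) // false
}
```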

mmcdole

Super clean, love this approach.

cristoper mentioned this pull request Feb 24, 2024

Unit tests are failing #210

Closed

Depend on updated goxpp version without xml:base bug

cf5c66f

Resolve xml:base URLs without switching out the BaseStack

25bb801

cristoper force-pushed the fix-base-decode branch from d6a83eb to 25bb801 Compare February 29, 2024 04:32

cristoper mentioned this pull request Feb 29, 2024

Fix handling of RSS content:encoded #223

Merged

mmcdole approved these changes Mar 1, 2024

mmcdole merged commit 9455e2b into mmcdole:master Mar 1, 2024
1 check failed