Unicode non-breaking space and various characters not getting translated properly #55

Open
dwertheimer opened this issue Aug 29, 2021 · 9 comments

@dwertheimer

Thanks for this tool! I don't know if Evernote changed the .enex export formatting or what, but running the command on a normal note results in some weird Unicode characters that show up as strange characters in markdown.
For instance:

  1. My note ends up with a bunch of   <0xa0> characters in the resulting markdown (this is apparently a non-breaking space). Markdown readers show that as a   or something else.
  2. Several other characters come into the markdown file as escaped and look like this in the markdown:
    \( or \)
    \-
    \!

@wormi4ok, is there any chance I could get your help with this? I guess I could create my own post-processor, but I would really prefer not to do that. Seems like this may be an easy change for you. I don't know Go so I can't do it myself.
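
For readers who do end up writing the post-processor mentioned above, a minimal sketch in Python might look like the following. It is not part of evernote2md; the function name `postprocess` and the set of characters to unescape are assumptions taken only from the examples in this report. Note that blindly removing escapes can change how a note renders (for instance, an unescaped `-` at the start of a line becomes a list item).

```python
import re

NBSP = "\u00a0"  # non-breaking space, shown as <0xa0> by some editors

# Assumed set of escapes to undo: only the ones listed in the report above.
UNESCAPE = re.compile(r"\\([()\-!])")


def postprocess(markdown: str) -> str:
    """Replace non-breaking spaces and drop a few backslash escapes."""
    text = markdown.replace(NBSP, " ")
    # Keep the escaped character, drop the backslash in front of it.
    return UNESCAPE.sub(r"\1", text)


print(postprocess("Price\u00a0list \\(draft\\) \\- updated\\!"))
```

Escapes that protect link syntax, such as `\]`, are deliberately left alone in this sketch.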

@wormi4ok
Owner

Hello @dwertheimer, thanks for the feedback!
It would really help if you could upload a sample .enex file that includes those characters. You can remove everything irrelevant from the note, just the minimal content that shows the problem. It would be much easier for me to write a test case and find the issue.

@wormi4ok wormi4ok self-assigned this Aug 29, 2021
@dwertheimer
Author

dwertheimer commented Aug 29, 2021 via email

@wormi4ok
Owner

Now, it's a bit clearer what we are talking about.

  1. I don't see the <0xa0> in the resulting md file. It may depend on the editor, but at least in the TestCase2.enex that you've shared, the NBSP symbol looks correct to me.
  2. Talking about \ chars. Even though it does look ugly, I don't think it's possible to avoid escaping without breaking the resulting markdown. Here is a list of all escaped characters, explained: https://github.com/mattn/godown/blob/master/godown.go#L16-L28

If we don't escape these symbols, input html like this:

<a href="https://github.com">Test ] me</a>
becomes
[Test ] me](https://github.com)

which breaks the formatting.
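
The effect can be reproduced with a small Python sketch. It does not use godown; `make_link` is a hypothetical helper written only to show the two outputs side by side.

```python
def make_link(text: str, url: str, escape: bool = True) -> str:
    """Render a Markdown inline link, optionally escaping the link text."""
    if escape:
        # Escape backslashes first so that existing ones stay literal,
        # then the brackets that would otherwise end the link text early.
        text = text.replace("\\", "\\\\").replace("[", "\\[").replace("]", "\\]")
    return f"[{text}]({url})"


# Without escaping, the first "]" closes the link text too early.
print(make_link("Test ] me", "https://github.com", escape=False))
# With escaping, the bracket is kept as literal text.
print(make_link("Test ] me", "https://github.com"))
```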

@dge8
Contributor

dge8 commented Aug 30, 2021

I encountered this issue recently, and since my notes didn't contain any pathological cases like that one, I worked around it by enabling the DoNotEscape option in godown (commenting out lines 283-287 here would have the same effect).

I think the better solution would be to improve the escaping within godown to selectively escape characters within their contexts. For example, ] only needs to be escaped within link names, ) within linked URLs, * within bolded text, etc.

Some discretion is needed as well: IMHO, - doesn't need to be escaped anywhere, since if I start a line with a - or 1. I probably want it to be a list anyway. Other MD syntax (e.g. *emphasis*) is probably too troublesome not to escape.
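
A rough Python sketch of the context-sensitive idea is shown below. This is not how godown works today; the context names and the character sets per context are assumptions chosen to mirror the examples in this comment.

```python
# Hypothetical mapping: which characters are worth escaping in which context.
SPECIAL = {
    "link_text": "[]",   # brackets would end the link text early
    "link_url": "()",    # parentheses would end the URL early
    "emphasis": "*_",    # markers would end the bold/italic run early
    "plain": "*_`",      # "-" and "!" are deliberately left alone
}


def escape_in_context(text: str, context: str) -> str:
    """Backslash-escape only the characters that are special in `context`."""
    chars = SPECIAL[context] + "\\"  # always escape the backslash itself
    return "".join("\\" + c if c in chars else c for c in text)


print(escape_in_context("Test ] me", "link_text"))
print(escape_in_context("- item with *stars*!", "plain"))
```

With this approach a leading `-` or a trailing `!` passes through untouched, while `]` is escaped only where it could actually break a link.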

@dwertheimer
Author

dwertheimer commented Aug 30, 2021 via email

@bjj

bjj commented Oct 10, 2021

My note ends up with a bunch of <0xa0> characters in the resulting markdown (this is apparently a non-breaking space). Markdown readers show that as a  or something else.

That would suggest that your markdown reader is not processing the document as UTF-8. If it can't handle that character, it would fail to handle any non-ASCII characters.
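
A short Python sketch of what that misreading looks like, assuming the reader falls back to Latin-1 (the actual fallback encoding of any given markdown reader is not known here):

```python
nbsp = "\u00a0"                  # non-breaking space
encoded = nbsp.encode("utf-8")   # two bytes in UTF-8: 0xC2 0xA0

# A reader that wrongly assumes Latin-1 turns those two bytes into two characters.
misread = encoded.decode("latin-1")

print(encoded.hex(" "))
print([f"U+{ord(ch):04X}" for ch in misread])
```

The extra U+00C2 character produced by the wrong decoding is the typical visible symptom of a UTF-8 file read with a single-byte encoding.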

It may make sense for evernote2md to include a UTF-8 BOM in the file if it contains UTF-8 to help downstream consumers.
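
As an illustration of the BOM idea only (evernote2md itself is written in Go and does not do this), the following Python sketch writes a file with a UTF-8 BOM using the standard `utf-8-sig` codec. The helper name `write_markdown_with_bom` and the file name are made up for the example.

```python
import codecs
import tempfile
from pathlib import Path


def write_markdown_with_bom(path: Path, text: str) -> None:
    """Write text as UTF-8, prefixed with the BOM bytes EF BB BF."""
    path.write_text(text, encoding="utf-8-sig")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "note.md"  # hypothetical output file
    write_markdown_with_bom(out, "NBSP here:\u00a0done\n")
    raw = out.read_bytes()

print(raw.startswith(codecs.BOM_UTF8))
```

One trade-off to keep in mind: some tools may not expect a BOM before front matter that starts with `---`, so this is probably best treated as an opt-in behaviour.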

@jussihuotari

I'm having an issue with erroneous characters in my converted markup files as well. An example of my case is a webclip note that contains <b>1.&nbsp;</b>.... It looks ok in Evernote but after export and conversion, the characters become garbage in viewers and throw an error when reading: 'utf-8' codec can't decode byte 0xc2: invalid continuation byte. My "workaround" currently is just to ignore the encoding errors.

I tried the testcase provided above by @dwertheimer but there the decoding seems ok to me. TestCase2.md is UTF-8 text, and when viewing the file, the \0xa0 is rendered correctly (looks like a space). When reading the file content to a string, the non-breaking space is as expected: "---\ndate: '1970-01-17 10:42:10 +0000'\nupdated_at: '1970-01-17 10:42:10 +0000'\ntitle: TestCase2\nsource: desktop.mac\n\n---\n\n# TestCase2\n\nNonBreaking space character 0xa0 in brackets: \xa0 I wonder if....

@wormi4ok
Owner

Hey @jussihuotari! That's an interesting observation. I've searched for errors like invalid continuation byte, and it seems to me that the errors mentioned in this issue may have something to do with the exact encoding.
I assume all note content is encoded in UTF-8, and the default xml.Decoder doesn't support anything else. But the root XML element of the exported note has an encoding attribute that is currently ignored.
It would be great if you could check what the value of this attribute is in the faulty note, or share a reproducible example.
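
One quick way to check this from outside the tool is sketched below in Python. It assumes the attribute in question is the `encoding` pseudo-attribute of the XML declaration at the top of the .enex file; `declared_encoding` is a hypothetical helper, and the sample bytes are made up rather than taken from a real export.

```python
import re

# Matches the encoding="..." part of an XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(rb'\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._\-]+)["\']')


def declared_encoding(xml_bytes: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = XML_DECL.match(xml_bytes)
    return match.group(1).decode("ascii") if match else None


sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<en-export></en-export>'
print(declared_encoding(sample))
```

For a real export, the bytes could be read with `Path("note.enex").read_bytes()` and passed to the same function.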

@jussihuotari

Good question @wormi4ok. I checked the encoding attribute and it is UTF-8, as expected.

Regarding a reproducible example: I amended the earlier test case in this issue with the text causing the error, attached here. I'm not sure if this is actually a well-formed / realistic note format, but maybe it's useful as a test case? When I run evernote2md on TestCase3.enex, it generates the errors for the non-breaking space in the two bullet points (but no errors for the non-breaking space included in the previous test case file! So... is it the context around the character that causes the problem?).

In my test, these errors are not present in the .enex, only in the generated .md. Reading the .enex file to a string throws no errors, while reading the .md throws the familiar: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 383: invalid continuation byte.
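
To narrow down where the generated .md goes wrong, a small Python sketch like the one below can report the first invalid sequence and the bytes around it. The helper name `find_invalid_utf8` is made up, and the `bad` sample is only an assumed illustration of how a lone 0xC2 byte triggers this error; it is not taken from TestCase3.

```python
def find_invalid_utf8(data: bytes, context: int = 8):
    """Return (position, reason, nearby bytes) for the first invalid sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        start = max(err.start - context, 0)
        return err.start, err.reason, data[start:err.end + context]
    return None


good = "1.\u00a0item".encode("utf-8")  # NBSP encoded as the pair C2 A0
bad = b"1.\xc2 item"                   # assumed breakage: C2 followed by a plain space

print(find_invalid_utf8(good))
print(find_invalid_utf8(bad))
```

Running the same function on the bytes of the generated file (for example `Path("TestCase3.md").read_bytes()`) would show which byte follows the 0xC2 at the reported position.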

TestCase3.zip
