Unicode non-breaking space and various characters not getting translated properly #55

dwertheimer · 2021-08-29T16:47:34Z

Thanks for this tool! I don't know if Evernote changed the .enex export formatting or what, but running the command on a normal note results in some weird Unicode characters that show up as strange characters in markdown.
For instance:

My note ends up with a bunch of <0xa0> characters in the resulting markdown (this is apparently a non-breaking space). Markdown readers show that as a Â or something else.
Several other characters come into the markdown file as escaped and look like this in the markdown:
\( or \)
\-
\!

@wormi4ok, is there any chance I could get your help with this? I guess I could create my own post-processor, but I would really prefer not to do that. Seems like this may be an easy change for you. I don't know Go so I can't do it myself.


wormi4ok · 2021-08-29T16:54:14Z

Hello @dwertheimer, thanks for the feedback!
It would really help if you could upload a sample .enex file that includes those characters. You can remove everything irrelevant from the note, just the minimal content that shows the problem. It would be much easier for me to write a test case and find the issue.

dwertheimer · 2021-08-29T17:38:37Z

Thank you so much for the quick response. Here's a quick test case showing the ones that I've found so far. There may be more. Solving these would be a great start. Thanks again for the tool and getting back to me! Github won't let me attach an enex file. So trying as a zip. Let me know if this works. @wormi4ok [TestCase2.enex.zip](https://github.com/wormi4ok/evernote2md/files/7072974/TestCase2.enex.zip)

wormi4ok · 2021-08-29T20:02:38Z

Now, it's a bit clearer what we are talking about.

I don't see the <0xa0> in the resulting md file. It may depend on the editor, but at least in the TestCase2.enex that you've shared, the NBSP symbol looks correct to me.
As for the \ characters: even though they do look ugly, I don't think it's possible to avoid escaping without breaking the resulting markdown. Here is the list of all escaped characters, with explanations: https://github.com/mattn/godown/blob/master/godown.go#L16-L28

If we don't escape these symbols, input html like this:

<a href="https://github.com">Test ] me</a>
becomes
[Test ] me](https://github.com)

which breaks the formatting.
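To illustrate the point above, here is a minimal Python sketch. It is not how godown or any real markdown parser works: `LINK` is a deliberately rough regular expression for the `[text](url)` form, and `parse_link` is a hypothetical helper. It only shows that an unescaped `]` ends the link text early, while an escaped one does not.

```python
import re

# Rough approximation of inline-link syntax: [text](url), where the text may
# contain backslash-escaped characters. This is NOT a full CommonMark parser.
LINK = re.compile(r"\[((?:\\.|[^\]\\])*)\]\(([^)\s]*)\)")


def parse_link(markdown):
    """Return (text, url) if the whole string is one inline link, else None."""
    match = LINK.fullmatch(markdown)
    return (match.group(1), match.group(2)) if match else None


unescaped = "[Test ] me](https://github.com)"
escaped = r"[Test \] me](https://github.com)"

# The first "]" closes the link text, so no "(url)" follows it: not a link.
print(parse_link(unescaped))
# The escaped "]" stays part of the text, so the link survives.
print(parse_link(escaped))
```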

dge8 · 2021-08-30T02:55:06Z

I encountered this issue recently, and since my notes didn't contain any pathological cases like that one, I worked around it by enabling the DoNotEscape option in godown (commenting out lines 283-287 here would have the same effect).

I think the better solution would be to improve the escaping within godown to selectively escape characters within their contexts. For example, ] only needs to be escaped within link names, ) within linked URLs, * within bolded text, etc.

Some discretion is needed as well: IMHO, the - character doesn't need to be escaped anywhere, since if I start a line with a - or 1. I probably want it to be a list anyway. Other MD syntax (e.g. *emphasis*) is probably too troublesome not to escape.
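The context-sensitive idea could look roughly like the following Python sketch. All function names are hypothetical and the character sets are a simplification chosen for illustration; this is not godown's API, and a real implementation would need to cover more contexts (headings, code spans, tables, and so on).

```python
import re


def escape_link_text(text):
    # Inside [ ... ], brackets and backslashes can end or confuse the link text.
    return re.sub(r"([\\\[\]])", r"\\\1", text)


def escape_link_url(url):
    # Inside ( ... ), parentheses and backslashes are the problem characters.
    return re.sub(r"([\\()])", r"\\\1", url)


def escape_plain_text(text):
    # In running text, escape emphasis markers and backticks,
    # but leave characters such as "-", "!", "(" and ")" alone.
    return re.sub(r"([\\*_`])", r"\\\1", text)


def make_link(text, url):
    # Each part is escaped according to the context it ends up in.
    return "[" + escape_link_text(text) + "](" + escape_link_url(url) + ")"


print(make_link("Test ] me", "https://github.com"))
print(escape_plain_text("a - b (c)! *d*"))
```

The trade-off is that every output context needs its own rule, which is more code than one global escape table but produces far fewer backslashes in ordinary prose.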

dwertheimer · 2021-08-30T23:06:31Z

Hope you saw my comment on the other (closed) issue, which is still an issue for me. Please let me know. Thanks so much. Let me know what I can do to help track this down. David


bjj · 2021-10-10T05:42:08Z

> My note ends up with a bunch of <0xa0> characters in the resulting markdown (this is apparently a non-breaking space). Markdown readers show that as a Â or something else.

That would suggest that your markdown reader is not processing the document as UTF-8. If it can't handle that character, it would fail to handle any non-ASCII characters.

It may make sense for evernote2md to include a UTF-8 BOM in the file if it contains UTF-8 to help downstream consumers.
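A short Python sketch of why a stray Â points at the reader's encoding rather than at the file: U+00A0 is two bytes in UTF-8, and a reader that assumes a single-byte encoding such as Latin-1 shows the first byte as Â. The last lines show what a UTF-8 BOM looks like, using Python's `utf-8-sig` codec; whether evernote2md should write one is the open suggestion above, not something the tool is known to do.

```python
nbsp = "\u00a0"  # NO-BREAK SPACE

# In UTF-8 the non-breaking space is the two-byte sequence C2 A0.
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa0'

# A reader that wrongly assumes Latin-1 turns those two bytes into two
# characters: "Â" (0xC2) followed by a non-breaking space (0xA0).
misread = utf8_bytes.decode("latin-1")
print(repr(misread))  # 'Â\xa0'

# The "utf-8-sig" codec prepends the UTF-8 byte order mark (EF BB BF),
# which some consumers use to detect the encoding.
with_bom = ("a" + nbsp + "b").encode("utf-8-sig")
print(with_bom[:3])  # b'\xef\xbb\xbf'
```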

jussihuotari · 2022-01-12T08:03:48Z

I'm having an issue with erroneous characters in my converted markup files as well. An example of my case is a webclip note that contains <b>1. </b>.... It looks ok in Evernote but after export and conversion, the characters become garbage in viewers and throw an error when reading: 'utf-8' codec can't decode byte 0xc2: invalid continuation byte. My "workaround" currently is just to ignore the encoding errors.
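For reference, the error and the "ignore" workaround can be reproduced with a few bytes in Python. The input below is made up for illustration (a 0xc2 byte that is not followed by a continuation byte); it is an assumption about what the faulty output contains, not bytes taken from an actual converted note.

```python
# Hypothetical bytes: 0xc2 starts a two-byte UTF-8 sequence, but the next
# byte is a plain ASCII space instead of a continuation byte such as 0xa0.
data = b"1.\xc2 text"

try:
    data.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as err:
    strict_error = err
    print(err)

# errors="ignore" silently drops the offending byte.
ignored = data.decode("utf-8", errors="ignore")
print(repr(ignored))

# errors="replace" keeps a visible U+FFFD marker where data was lost,
# which makes the damaged spots easier to find later.
replaced = data.decode("utf-8", errors="replace")
print(repr(replaced))
```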

I tried the testcase provided above by @dwertheimer but there the decoding seems ok to me. TestCase2.md is UTF-8 text, and when viewing the file, the \0xa0 is rendered correctly (looks like a space). When reading the file content to a string, the non-breaking space is as expected: "---\ndate: '1970-01-17 10:42:10 +0000'\nupdated_at: '1970-01-17 10:42:10 +0000'\ntitle: TestCase2\nsource: desktop.mac\n\n---\n\n# TestCase2\n\nNonBreaking space character 0xa0 in brackets: \xa0 I wonder if....

wormi4ok · 2022-01-12T19:51:27Z

Hey @jussihuotari! That's an interesting observation. I've searched for errors like invalid continuation byte, and it seems to me that the errors mentioned in this issue may have something to do with the exact encoding.
I assume that all note content is encoded in UTF-8, and Go's default xml.Decoder doesn't support anything else. But the root XML element of the exported note has an encoding attribute that is currently ignored.
It would be great if you could check the value of this attribute in the faulty note, or share a reproducible example.
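One quick way to check the declared encoding is sketched below in Python. It assumes the attribute in question is the `encoding` pseudo-attribute of the XML declaration at the top of the .enex file; `declared_encoding` is a hypothetical helper and the sample bytes are made up for illustration.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?> at the very start of the data.
XML_DECL = re.compile(rb'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(head):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = XML_DECL.match(head)
    return match.group(1).decode("ascii") if match else None


# Illustrative input; for a real file, pass the first few hundred bytes
# read in binary mode.
sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<en-export>...'
print(declared_encoding(sample))
print(declared_encoding(b"<en-export>"))
```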

jussihuotari · 2022-01-13T07:16:09Z

Good question @wormi4ok. I checked the encoding attribute and it is UTF-8, as expected.

Regarding a reproducible example: I amended the earlier test case in this issue with the text causing the error, attached here. I'm not sure if this is actually well-formed / realistic note format, but maybe it's useful as a test case? When I run evernote2md on TestCase3.enex, it generates the errors for the non-breaking space in the two bullet points (but no errors for the non-breaking space included in the previous test case file! So... is it the context around the character that causes a problem?).

In my test, these errors are not present in the .enex, only in the generated .md. Reading the .enex file to a string throws no errors, while reading the .md throws the familiar: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 383: invalid continuation byte.
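To narrow down where the generated .md goes wrong, a small Python helper along these lines could list every invalid byte sequence with some surrounding context. `find_invalid_utf8` is a hypothetical name, and the sample bytes are invented to mimic the bullet-point case; they are not taken from TestCase3.

```python
def find_invalid_utf8(data, context=8):
    """Return (offset, bad_bytes, surrounding_bytes) for each invalid sequence."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start, end = pos + err.start, pos + err.end
            surrounding = data[max(0, start - context):end + context]
            problems.append((start, data[start:end], surrounding))
            pos = end  # continue scanning after the bad bytes
    return problems


# Invented sample: the first bullet has a lone 0xc2, the second a valid NBSP.
sample = b"* item\xc2 one\n* item\xc2\xa0two"
for offset, bad, around in find_invalid_utf8(sample):
    print(offset, bad, around)
```

For a real file, the bytes would come from reading the .md in binary mode, for example `open(path, "rb").read()`, so that no decoding happens before the check.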

TestCase3.zip

wormi4ok self-assigned this Aug 29, 2021

wormi4ok mentioned this issue Jun 10, 2023

Special characters are being escaped by a backslash #100

Closed
