Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subject with non-printable characters or new line is shown with ? #4276

Open
Yutsuten opened this issue Apr 25, 2024 · 30 comments
Open

Subject with non-printable characters or new line is shown with ? #4276

Yutsuten opened this issue Apr 25, 2024 · 30 comments
Assignees
Labels

Comments

@Yutsuten
Copy link
Member

Continuation of the discussion in #4272

[...] in my e-mail listing (index), some subject names shows those ?.

Google Payments              | ???      ? ? ?    ?  ?  ?  ?   ??Google Cloud Platform & APIs: 001112-001112-001112 の請求書の用意ができました

It is the expand of

set index_format = '%-28.28L | %s'

This is the raw data when seeing in the browser (copy-pasted from gmail):

Subject: 

 

Google Cloud Platform & APIs: 001112-001112-001112 の請求書の用意ができました
From: Google Payments <[email protected]>

gmail-subject-bug

(some data hidden for privacy)

%s (and Google) is the culprit. I wonder if this PR fixed this issue.

Expected Behaviour

Subject text without ?, like when we check the emails in the browser.

gmail

Actual Behaviour

Subject is shown as ??? ? ? ? ? ? ? ? ??Google Cloud Platform & APIs: 001112-001112-001112 の請求書の用意ができました.

That's probably what we want to display.

Really? I'm confused by this answer in the PR.

Steps to Reproduce

Not sure how to send an email like that, maybe only programmatically. I'll look into it.

How often does this happen?

  • On a particular email
    If the email isn't private, please attach it to this issue.

When did it start to happen?

  • Since I started using it, I guess.

NeoMutt Version

NeoMutt 20240416

Extra Info

  • Operating System and its version: Linux 6.8.7-arch1-1 (x86_64)
  • Were you using multiple copies of NeoMutt at once? No
  • Were you using 'screen' or 'tmux'? No
  • Is your email local (maildir) or remote (IMAP)? IMAP

I also have something to ask @flatcap
I'd like to try fixing this issue myself. My main programming language is python, and only used C while in the university, about 6 years ago. Do you think I would be able to do it?

@Yutsuten Yutsuten added the type:bug Bug label Apr 25, 2024
@Yutsuten Yutsuten changed the title Subject with invalid characters and new line is shown with ? in index Subject with invalid characters and new line is shown with ? Apr 25, 2024
@Yutsuten Yutsuten changed the title Subject with invalid characters and new line is shown with ? Subject with non-printable characters or new line is shown with ? Apr 25, 2024
@flatcap
Copy link
Member

flatcap commented Apr 25, 2024

Do you think I would be able to do it?

Sure :-)

Would you like to just dive in, or would you like a few hints?


NeoMutt's big and complicated.

Feel free to ask lots of questions, here, or on IRC: #neomutt on irc.libera.chat (web client)

@Yutsuten
Copy link
Member Author

Would you like to just dive in, or would you like a few hints?

Hints would be very welcome. I'll join IRC later, thanks!!

@flatcap
Copy link
Member

flatcap commented Apr 25, 2024

The next bit is a quick tour from the Index to the string formatting.
You don't need to understand it, unless you want to :-)


The Index displays a list of emails.
The format is controlled by the config, $index_format.
Release 2024-04-16 introduced a new parser for the format strings: libexpando.
Now, the string is parsed once and rendered efficiently when needed.

The Index is a wrapper around a struct Menu.
The Menu uses a callback function get the data it needs for display.
For the Index, this is index_make_entry(), which calls a helper mutt_make_string().
Finally this expando_filter() which wraps expando_render().

A parsed format string is stored as a tree of Expando Nodes.
expando_render() traverses this tree using supplied data and a table of callback functions.
See also: IndexFormatDef, IndexRenderData, index_s()

More detail in #3937

For the Index Subject Expando %s we finally arrive in node_expando_render(), where it calls format_string().
This function fits the data into the space available.


format_string() decides what to display and what fits.
It needs to calculate how many screen columns are used by each character,
whilst ignoring any colour codes or tree characters which have special representations.

Illegal characters, e.g. bad utf-8, are replaced with ?.

The function takes one flag, bool arboreal which controls whether tree characters are allowed.
index_make_entry() passes a flag (MUTT_FORMAT_TREE) to expando_render() that sets arboreal = true.

We might want to consider a flag that controls whether whitespace formatting is allowed.

@nobrowser

This comment was marked as resolved.

@flatcap

This comment was marked as resolved.

@nobrowser

This comment was marked as resolved.

@christian-heusel
Copy link

I'm seeing some alignment issue with an interesting unicode character (U+3000 : IDEOGRAPHIC SPACE) in the name:

$ echo "Stefan  " | xxd -
00000000: 5374 6566 616e 20e3 8080 0a              Stefan ....

causes broken alignment:

 2       23/03/29 07:51PM Stefan   (@sfan devtools | arch-nspawn: set console mode to autopipe (!137) (3.8K)
 3       23/06/04 07:23PM Stefan   (@sfan ├─> (3.8K)
 4       23/06/29 04:52PM Jan Alexander S ├─> (3.9K)
 5       23/07/05 08:59PM Stefan   (@sfan │ └─> (4.1K)
 6       24/01/10 12:01AM Stefan   (@sfan ├─> (3.3K)
 7 N     24/04/27 12:11PM Jan Alexander S └─> (4.4K)
 8 N     24/04/27 12:14PM Jan Alexander S   └─> (7.8K)

With the following $index_format:

set index_format="%2C %Z %?X?A& ? %D %-15.15F %s (%-4.4c)"

Commenting this here after having a chat with @flatcap on IRC

@flatcap
Copy link
Member

flatcap commented Apr 30, 2024

I've had a deeper dive into the code and we can do a better job (than my quick hack).
If you're still interested :-)

There are a lot of format strings (Expandos),
but with a couple of well-placed flags we can fine-tune the behaviour.

First lots of data (you can skim over this).

@flatcap

This comment was marked as outdated.

@flatcap

This comment was marked as outdated.

@flatcap

This comment was marked as outdated.

@Yutsuten
Copy link
Member Author

Yutsuten commented May 1, 2024

I finished doing a lot of setup in my dev environment to code in C. Now I'm doing a C language tutorial while taking notes (this is a website I use to record stuff I learn). Sorry for being slow.

Thank you so much for giving me so much information! I see you already a lot of things in mind.

About the issue in question here, I was thinking in creating a strip() function, like the one we have in python. I want to remove all the non-printable characters around the subject and names (both "to" and "from").
Not sure if there is something for this already in a C library or NeoMutt though. I may do it anyway just for studying purposes. 😃

Am I thinking in the correct direction?

@flatcap
Copy link
Member

flatcap commented May 2, 2024

while taking notes

ooh! very organised

Sorry for being slow

There's no rush

I see you already a lot of things in mind

Yes. I hope it makes some sense.
When I have an idea in mind, I tend to dump all the information I can think of.

There are two parts: render and display.

Render: turns an Expando + Data into a string.
This can have strict formatting rules, so it needs to measure the strings.
By adding flags to the Expando's config definition, we can control how non-printable characters are handled.

Display: Put the string on screen.
Ideally, this should be as simple as possible.
No special rules, no transformations.
Currently, it has special rules for colours, tree characters and non-printable characters.

I was thinking in creating a strip() function

Not a bad idea, but I think we want to preserve as much as possible.
A ? might give the user slightly more information than nothing.
(I can't think of a clever example :-)

  • Subject: I want ????

The current way NeoMutt does this is to use a standard "replacement character".
This often '?' (question mark), but can also be '�' (U+FFFD Replacement Character)

I may do it anyway just for studying purposes

Cool
(I'm already thinking how I'd do it :-)

@Yutsuten
Copy link
Member Author

Yutsuten commented May 2, 2024

Not a bad idea, but I think we want to preserve as much as possible.

I think this is personal taste. Using the e-mails I receive as a reference, I'd prefer these character stripped out instead of showing ? on them.

All emails in my inbox that have an ?, I'd be happier if those were not there.

I even have emails with ? between words (Why/how people do that??). In the browser those become spaces.

Subject: I want ????

I have never received an email like that.

Maybe if one sets their locale to something like LANG=C in their system, receiving an email that uses non-ASCII characters (Chinese or whatever I want バナナ lol), I agree that the ? would make sense though.

What do you think about making the behavior of non-printable characters being configurable? The default is to preserve as much as possible. Changing the default would replace all non-printable characters with spaces, then we strip them out. Personally I would love it.

@flatcap
Copy link
Member

flatcap commented May 2, 2024

What do you think about making the behavior of non-printable characters being configurable?

That sounds reasonable, but what would you apply it to? and where?

We probably only want to apply this to an email's subject.
So we probably don't want to alter the Expando code for this, just index_s().

How common are these bad subjects?
We don't want to add options just to make spam look prettier.

@Yutsuten
Copy link
Member Author

Yutsuten commented May 2, 2024

We don't want to add options just to make spam look prettier.

🤣 🤣 This comment made me laugh, actually. Probably because it's half true. What is spam is subjective, though. Google payment receipts are not spam for me, but it may be spam for others.

How common are these bad subjects?

Google and Amazon send me the bad subjects that I would love to strip. Both are payment receipt emails.

That sounds reasonable, but what would you apply it to? and where?

So far we have reports of issues in subject and names (from/to), so we could start here.

Ultimately I just want NeoMutt to behave as any other modern email client. I have never had issues like this back when I was using Thunderbird, for example. I also want to keep things simple, that's why I came to NeoMutt after all, but there is no need to neglect aesthetics of the inbox screen.
I know that the true culprits here are Google and Amazon for creating these bad subjects. But while they don't fix themselves, we have to do something to compensate that. It's not that hard or need heavy processing. Most email client are doing it already, actually.

Also, AFAIK if a system is using an UTF-8 locale there aren't that much non-printable characters that we would like to keep. But for systems that don't use UTF-8 I perfectly agree that we need to show the user there is something we cannot print by replacing it with ?. I may search later for "non-printable utf8 characters" in the internet if I can find cons for my argument, so I can stop being annoying 😆

@nobrowser
Copy link
Contributor

nobrowser commented May 5, 2024

Google and Amazon send me the bad subjects that I would love to strip. Both are payment receipt emails.

We're talking about properly RFC 2047 encoded headers, right? Otherwise those emails are just plain broken and shouldn't be catered for, IMO.

At the risk of sounding nasty / dismissive: neomutt should always stay an email client, not a gmail client.

Also, AFAIK if a system is using an UTF-8 locale there aren't that much non-printable characters that we would like to keep. But for systems that don't use UTF-8 I perfectly agree that we need to show the user there is something we cannot print by replacing it with ?. I may search later for "non-printable utf8 characters" in the internet if I can find cons for my argument, so I can stop being annoying 😆

The encoding in the locale is not relevant, IMO. Only the encoding specified by the RFC 2047 "word". And whether neomutt can render "prettily" or not depends just on the availability of the needed glyphs in the terminal font.

--
Ian

@Yutsuten
Copy link
Member Author

Yutsuten commented May 7, 2024

Ok, I've finished my re-learning of C, I'm ready to start working on this.

Thank you for your opinion @nobrowser , I'll quit about the strip() idea.

The issue

So I want to organize and confirm what actually I should do here. We have two issues reported:

  1. E-mails with bad characters in the subject
  2. E-mails where the sender have an U+3000 : IDEOGRAPHIC SPACE in their name

About (1), I downloaded the e-mail locally in a Maildir, and by checking the e-mail's raw file with a text editor we have this:

Subject: =?UTF-8?B?CgoKICAgICAgCiAKIAogICAgCiAgCiAgCiAgCiAgIAoKR29vZ2xlIENsb3VkIFBsYXRmbw==?=
	=?UTF-8?B?cm0gJiBBUElzOiAwMTQ4MTktRUI0Q0Q1LTFFOTA4RSDjga7oq4vmsYLmm7jjga7nlKjmhI/jgYzjgac=?=
	=?UTF-8?B?44GN44G+44GX44Gf?=

Not sure if it is useful. I don't know how to interpret this. Maybe @flatcap or @nobrowser can help me?

The fix

After we understand why the subject is becoming like that, I'd like to know what to do.

  • Are we going to replace any of the bad characters with space?
  • Are we going to replace U+3000 : IDEOGRAPHIC SPACE with space?

We also have a plan for the fix given by @flatcap

Plan (may need fine-tuning):

  • Add a new config flag, D_ALLOW_NEWLINES to allow newlines
  • Add it to the attribution config, etc
  • Add a flag to MuttFormatFlags for tree characters (see arborial param later)
  • Add a flag to MuttFormatFlags for newlines
  • Add a flags member to struct Expando
  • Alter expando_parse() to turn the config flag into a MuttFormatFlags flag
  • Alter format_string() to take MuttFormatFlags rather than bool arborial)
  • Alter format_string() to conditionally allow newlines

Just to understand, is this plan to allow newlines in places like the subject? By allowing newlines, would the newlines be replaced with whitespaces?

@nobrowser
Copy link
Contributor

nobrowser commented May 7, 2024

Yes, that is the RFC 2047 encoding I mentioned.

You see, email messages in transit are supposed to be encoded (for transport) entirely as 7 bit ASCII. So an encoding must be applied on top of whatever the native encoding is for the character set of the message. For the message body, the MIME standards specify how this transport-encoding is indicated to a recipient / client, but MIME doesn't apply to headers. For headers, there is this special mechanism defined in RFC 2047.

Here, the "?B?" substrings indicate the transport-encoding is Base64. So, if you apply a base 64 decoder to each of the parts between =?= pairs, you'll get the original UTF-8 encoded subject.

I take this as good news, meaning even Google & Amazon know how to RFC 2047 encode their headers.

I believe neomutt already handles newlines in Subjects correctly provided they're encoded as above. When there's a raw newline in the Subject header of a message, I believe it should be treated like newlines in other headers, which are indeed just generic separators without meaning to the recipient, so replacing with space is the right thing to do. I think neomutt does this too, but not 100% sure ATM.

--
Ian

@Yutsuten
Copy link
Member Author

Yutsuten commented May 7, 2024

Oooh now things are starting to make sense.

> echo 'CgoKICAgICAgCiAKIAogICAgCiAgCiAgCiAgCiAgIAoKR29vZ2xlIENsb3VkIFBsYXRmbw=' | base64 --decode | od -A x -t x1z -v
000000 0a 0a 0a 20 20 20 20 20 20 0a 20 0a 20 0a 20 20  >...      . . .  <
000010 20 20 0a 20 20 0a 20 20 0a 20 20 0a 20 20 20 0a  >  .  .  .  .   .<
000020 0a 47 6f 6f 67 6c 65 20 43 6c 6f 75 64 20 50 6c  >.Google Cloud Pl<
000030 61 74 66 6f                                      >atfo<
000034

Every time 0a appears, NeoMutt is converting it into a ?.
At least the dots shown in the right of the output of this command matches the ? placements in NeoMutt.

What 0a stands for, I still don't know though.

Edit: According to this, it is a LINE FEED (LF) (U+000A)

@flatcap
Copy link
Member

flatcap commented May 7, 2024

NeoMutt is already doing the base64 decoding for you (other encodings exist).

The plain subject is stored in Email.Envelope.subject
(struct Email, struct Envelope, member subject)

The Expando code needs to first measure this string, counting all whitespace as (space) and bad characters as ?.

Hmm...

I didn't consider Email.Envelope.disp_subj before.
It's normally just used for subjectrx transformations, see: Display Munging.

/me goes exploring in the code

@flatcap
Copy link
Member

flatcap commented May 7, 2024

OK, change of plan.
I think we should hijack subjrx_apply_mods() instead of all of the above.

If the subject contains any bad whitespace or other bad characters, replace them and store the result in disp_subj.
The Expando for the subject already uses this.

To check the string, it'll mean converting the chars from utf-8 into wide chars.
I'll find you and example to copy'n'paste.

@flatcap
Copy link
Member

flatcap commented May 7, 2024

The best template is format_string() and we can simplify it a lot.
We don't care about min/mix sizes of justification.
We don't care about colours or tree characters.

Copy format_string() into subjectrx.c and call it from subjrx_apply_mods().
We want it to put a sanitised version of the subject into a Buffer, then apply any mods as normal.

Have a look at the functions and see if that makes sense.

@Yutsuten
Copy link
Member Author

Yutsuten commented May 9, 2024

I'm still looking into the code, but I'll write some notes here.

bool subjrx_apply_mods(struct Envelope *env)
{
  // This is when there is no envelope or no subject, we can ignore this for this issue
  if (!env || !env->subject || (*env->subject == '\0'))
    return false;

  // This seems to be used when we merge envelopes. Not sure when it happens, but probably doesn't matter
  if (env->disp_subj)
    // Sanitize `env->disp_subj` here?
    return true;

  // Sanitize `env->subject` here

  if (STAILQ_EMPTY(&SubjectRegexList))
    // If the user have NOT configured the `subjectrx` option, keep the subject as is
    return false;

  // If the user have configured the `subjectrx` option, apply it here
  env->disp_subj = mutt_replacelist_apply(&SubjectRegexList, env->subject);
  return true;
}

The return values true and false are meaningless. At least I didn't find any place where it's value is being used.

--- a/expando/format.c
+++ b/expando/format.c
@@ -155,7 +155,7 @@ int format_string(struct Buffer *buf, int min_cols, int max_cols, enum FormatJus
       else
 #endif
           if (!IsWPrint(wc))
-        wc = '?';
+        wc = ReplacementChar;
       w = wcwidth(wc);
     }

Unrelated to this issue, but isn't better to use ReplacementChar here? I would fix the indentation of the code here too, the if line have 4 extra spaces. (Or because of the previous preprocessor #ifdef would the indentation rules change?)


Copying the whole format_string() into subjectrx.c feels hacky to me, but it's not like I've fully understand format_string() yet.

@flatcap did you mean using format_string() as a base, but modifying the copy in subjectrx.c with adjusts when to wc = ' ' or wc = ReplacementChar ?

Also wouldn't format_string() change the length of the subject by adding spaces at the end or trimming if its too long?

@flatcap
Copy link
Member

flatcap commented May 10, 2024

The return values true and false are meaningless. At least I didn't find any place where it's value is being used.

Yep :-)

did you mean using format_string() as a base

Yes

adjusts when to wc = ' ' or wc = ReplacementChar ?

Yes. If iswspace() then space, if !IsWPrint(), then ReplacementChar.

I would fix the indentation of the code here too

Yeah, that's clang-format being confused by the #ifdef.
As iswspace() is a bigger set than iswblank(), we can drop iswblank().

Also wouldn't format_string() change the length of the subject by adding spaces at the end or trimming if its too long?

Yes, we don't need those bits.

@Yutsuten
Copy link
Member Author

Yutsuten commented May 10, 2024

Ok, the logic seems to be correct... But it doesn't work 😢

If I use this in my new function:

if (iswspace(wc))
{
  wc = '-';
}

I get this subject:

???------?-?-?----?--?--?--?---??Google-Cloud-Platform

Now, if I try this:

if (iswspace(wc) || wc == '?')
{
  wc = '-';
}

I get the expected subject:

---------------------------------Google-Cloud-Platform

The LF character seems to be converted to ? already at subjrx_apply_mods().

That's it for today. I'll search more about it later... 😅

Edit: Tell me if it is better to push my changes to a branch already to help debugging stuff like this.

@flatcap
Copy link
Member

flatcap commented May 10, 2024

Tell me if it is better to push my changes to a branch

Yes, you're welcome to create a devel/SHORT-NAME branch that's yours to play with.
You can push anything you like to it and force-push or rebase,
then you can create a PR when it's ready for testing.

@Yutsuten
Copy link
Member Author

Pushed to devel/subject-bad-characters

@Yutsuten
Copy link
Member Author

Yutsuten commented May 11, 2024

I found the issue, it is in email/rfc2047.c L349.

The original subject is being modified by mutt_mb_filter_unprintable(). The call stack is like this:

  • rfc2047_decode_envelope()
  • rfc2047_decode(&subj)
  • finalize_chunk(buf, prev, prev_charset, prev_charsetlen)
  • mutt_mb_filter_unprintable(&buf->data)

So as soon as rfc2047_decode(&subj) finishes, we already have newlines converted to ?, which I think is undesirable.

I'll remove the mutt_mb_filter_unprintable() call from finalize_chunk(), hopefully this won't break anything. I think it's ok because filtering unprintable characters is the job of functions like format_string().

@nobrowser
Copy link
Contributor

I'll remove the mutt_mb_filter_unprintable() call from finalize_chunk(), hopefully this won't break anything. I think it's ok because filtering unprintable characters is the job of functions like format_string().

Be careful -- this has security implications because you don't want to send untrusted data (which email headers clearly are) to the terminal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants