Question:
Do Winston's headers cause charset issues for anyone else?
Or just me?
I am trying to understand something about how different newsreaders handle malformed headers because my home-grown "newsreader" has "problems" when responding to Winston's posts due to the way he formats his "FROM" header.
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
That line apparently contains non-ASCII characters in the display name:
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
To add further value to what Carlos kindly tested using Thunderbird, apparently, those on Thunderbird see not this (which is what I see):
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
Which, is comprised of...
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
But they actually see this instead (according to what Carlos reported):
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
Thank you for clarifying what I misunderstood from Carlos' tests, which is that you see what I see which Winston has subsequently confirmed are alt codes he manually typed in to set his FROM Usenet header long ago using
...w = ...w (literal)
¡ = Alt 0161 (Windows inserts byte A1 hexadecimal value)
ñ = Alt 0241 (Windows inserts byte F1 hexadecimal value)
§ = Alt 0167 (Windows inserts byte A7 hexadecimal value)
± = Alt 0177 (Windows inserts byte B1 hexadecimal value)
¤ = Alt 0164 (Windows inserts byte A4 hexadecimal value)
Thanks for confirming what I see Carlos has also confirmed, which is that
you see in Thunderbird what I see in my newsreader which is "...w¡ñ§±¤ñ".
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
On 3/11/2026 5:32 PM, Maria Sophia wrote:
Question:
Do Winston's headers cause charset issues for anyone else?
Or just me?
I am trying to understand something about how different newsreaders handle
malformed headers because my home-grown "newsreader" has "problems" when
responding to Winston's posts due to the way he formats his "FROM" header.
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
That line apparently contains non-ASCII characters in the display name:
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
w = standard lower case w keystroke
¡ = Alt 0161
ñ = Alt 0241
§ = Alt 0167 or § = Alt 21
± = Alt 0177
¤ = Alt 0164
ñ = Alt 0241
All from one or more fonts available in Character Map.
- I've come across other folks that use some available character codes that appear blank - just copy the code and paste into a field to meet
the '*' required character entry.
Asking TB to produce the raw message, it comes as
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?= <...>
Which is legal, obviously. (Reasoning: if TB does it, then it is legal)
On 12/03/2026 07:08, ...w¤?ñ?¤ wrote:
On 3/11/2026 5:32 PM, Maria Sophia wrote:
Question:
Do Winston's headers cause charset issues for anyone else?
Or just me?
I am trying to understand something about how different newsreaders handle
malformed headers because my home-grown "newsreader" has "problems" when
responding to Winston's posts due to the way he formats his "FROM" header.
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
That line apparently contains non-ASCII characters in the display name:
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
w = standard lower case w keystroke
¡ = Alt 0161
ñ = Alt 0241
§ = Alt 0167 or § = Alt 21
± = Alt 0177
¤ = Alt 0164
ñ = Alt 0241
All from one or more fonts available in Character Map.
- I've come across other folks that use some available character
codes that appear blank - just copy the code and paste into a field to
meet the '*' required character entry.
I also see your name as ...w¡ñ§±¤ñ (in Betterbird). It doesn't bother me unduly but it has puzzled me for a while. May I ask what you are doing
and why not simply use winston as in your email address?
On 2026-03-12 07:16, Maria Sophia wrote:
To add further value to what Carlos kindly tested using Thunderbird,
apparently, those on Thunderbird see not this (which is what I see):
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
Which, is comprised of...
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
But they actually see this instead (according to what Carlos reported):
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
No, that's what I see when looking at the raw version. What I see in the editor or the message viewer is
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
and on a follow up is "On 2026-03-12 08:08, ...w¡ñ§±¤ñ wrote:"
Notice that we are both using thunderbird, so what happens is
coordinated. It is sent as mime, but displayed as normal utf text.
That's on the header. The body is plain UTF, no need for any conversion.
The header needs to be compatible with older software.
On 3/12/2026 11:24 AM, MikeS wrote:
On 12/03/2026 07:08, ...w¤?ñ?¤ wrote:
On 3/11/2026 5:32 PM, Maria Sophia wrote:
Question:
Do Winston's headers cause charset issues for anyone else?
Or just me?
I am trying to understand something about how different newsreaders handle
malformed headers because my home-grown "newsreader" has "problems" when
responding to Winston's posts due to the way he formats his "FROM" header.
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
That line apparently contains non-ASCII characters in the display name:
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
w = standard lower case w keystroke
¡ = Alt 0161
ñ = Alt 0241
§ = Alt 0167 or § = Alt 21
± = Alt 0177
¤ = Alt 0164
ñ = Alt 0241
All from one or more fonts available in Character Map.
- I've come across other folks that use some available character
codes that appear blank - just copy the code and paste into a field
to meet the '*' required character entry.
I also see your name as ...w¡ñ§±¤ñ (in Betterbird). It doesn't bother
me unduly but it has puzzled me for a while. May I ask what you are
doing and why not simply use winston as in your email address?
Have used that form for nntp and signature since 1998
Html nntp, Text nntp[1], private nntp groups, private list servers,
private web groups, blogging...
[1] text nntp(e.g. Eternal Sept. like servers - no HTML formatting composition) users are the only source where questions, criticism,
comments occur...but less than 5% of where 'it's' being used. <g>
Before 1998, the nomenclature was slightly longer
=> W¡ñ§±¤ñ¬Ößóògîë
Maria Sophia wrote:
People have complained *to me* that my responses have mojibake in them.
So I'm trying to fix that problem *for them*.
Delving deeper in thought...
Given RFC 5322 says headers must be ASCII unless MIME-encoded, others have pointed out Big-5 & ISO-8859-1 sometimes gets inserted into my headers.
I don't add that. I can't add them. They're not in my dictionaries.
So "something else" must be adding them. But what?
I never really understood character encoding, and I've said so many times. But I wonder if what's happening is possibly
1. The "From:" display name contains raw CP1252 bytes
2. Which are not valid UTF-8
3. Where, if my outgoing message declares "charset=UTF-8"
4. Maybe some NNTP servers might respond by trying to be helpful
5. One way being by slapping a different charset label on the header
Given... these CP1252 bytes (0xA1, 0xA7, 0xB1, 0xF1) are
a. illegal in UTF-8
b. legal in ISO-8859-1
c. also legal byte patterns in Big-5
Maybe that's where some of my responses get ISO-8859-1 or Big-5 headers?
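To make the point above concrete, here is a minimal Python sketch (not taken from anyone's newsreader) that feeds the same six raw bytes to several decoders. The helper names `interpret` and `is_strict_utf8` are made up for illustration, and the Big-5 result is simply whatever Python's `big5` codec maps those byte pairs to.

```python
# The six raw bytes of the decorated part of the display name, as listed above.
raw = bytes([0xA1, 0xF1, 0xA7, 0xB1, 0xA4, 0xF1])


def interpret(data: bytes, charset: str) -> str:
    """Decode data under the given charset, using U+FFFD for invalid input."""
    return data.decode(charset, errors="replace")


def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 from start to end."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


as_latin1 = interpret(raw, "iso-8859-1")  # one character per byte
as_cp1252 = interpret(raw, "cp1252")      # same result for these bytes
as_big5 = interpret(raw, "big5")          # bytes are consumed in pairs
as_utf8 = interpret(raw, "utf-8")         # contains U+FFFD replacements

for label, text in [
    ("ISO-8859-1", as_latin1),
    ("CP1252", as_cp1252),
    ("Big-5", as_big5),
    ("UTF-8", as_utf8),
]:
    print(f"{label:12} {text!r}")

print("strict UTF-8?", is_strict_utf8(raw))
```

The same bytes read cleanly as Latin-1 or CP1252 but not as strict UTF-8, which is consistent with a relabelling step somewhere in the path. The sketch says nothing about which server, if any, actually does that.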
Maybe... given UTF-8 is not ASCII, but ASCII is valid UTF-8...
i. Declaring UTF-8 forces some nntp servers to validate all bytes.
ii. But CP1252 bytes are illegal in UTF-8
iii. Where UTF-8 replies trigger more server 'helpfulness'
An interesting related aside is that... for
I. 0xA1 is not a valid UTF-8 start byte
II. 0xF1 is a valid UTF-8 start byte (it opens a 4-byte sequence),
but only if followed by three bytes in 0x80-0xBF; here the next
three bytes (0xA7 0xB1 0xA4) happen to qualify
III. 0xA7 is illegal as a UTF-8 start byte
IV. 0xB1 is illegal as a UTF-8 start byte
V. 0xA4 is illegal as a UTF-8 start byte
VI. 0xF1 is a valid UTF-8 start byte,
but only if followed by 0x80-0xBF, which it isn't
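The byte-by-byte reasoning above can be checked with a short Python sketch. The function name `utf8_role` is hypothetical, and the classification follows the UTF-8 byte ranges: 0x80-0xBF are continuation bytes, 0xC2-0xF4 are lead bytes. One detail worth noticing is that the middle run F1 A7 B1 A4 is well-formed by itself; the string as a whole still fails because of the leading 0xA1 and the trailing 0xF1.

```python
def utf8_role(byte: int) -> str:
    """Classify a single byte by the role it can play in UTF-8."""
    if byte < 0x80:
        return "ascii"
    if byte < 0xC0:
        return "continuation"      # 0x80-0xBF can never start a character
    if byte in (0xC0, 0xC1) or byte > 0xF4:
        return "invalid"           # never appears in well-formed UTF-8
    if byte < 0xE0:
        return "lead-2"            # needs 1 continuation byte
    if byte < 0xF0:
        return "lead-3"            # needs 2 continuation bytes
    return "lead-4"                # 0xF0-0xF4, needs 3 continuation bytes


raw = bytes([0xA1, 0xF1, 0xA7, 0xB1, 0xA4, 0xF1])
roles = [utf8_role(b) for b in raw]
for b, role in zip(raw, roles):
    print(f"0x{b:02X}  {role}")

# The middle four bytes form one well-formed (if meaningless) code point.
middle = raw[1:5].decode("utf-8")
print(f"F1 A7 B1 A4 -> U+{ord(middle):05X}")

# The full run is still rejected by a strict decoder.
try:
    raw.decode("utf-8")
    whole_is_valid = True
except UnicodeDecodeError as exc:
    whole_is_valid = False
    print("strict decode failed at offset", exc.start)
```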
The RFC-correct solution would be:
From: =?UTF-8?Q?W=C2=A1=C3=B1=C2=A7=C2=B1=C2=A4=C3=B1=C2=AC=C3=96=C3=9F=C3=B3=C3=B2g=C3=AE=C3=AB?= <...>
But that's ugly.
Using W¡ñ§±¤ñ¬Ößóògîë would be even more so, given
VII. 0xAC is illegal as a UTF-8 start byte
VIII. 0xD6 is a valid start byte only if followed by a continuation byte
And so on, where the "W" in W¡ñ§±¤ñ and the "g" in ¬Ößóògîë are the only bytes in that entire (pre-1998) decorated name that are both ASCII and valid UTF-8. Everything else is raw CP1252.
The UTF-8 version of the whole name would be:
57 C2 A1 C3 B1 C2 A7 C2 B1 C2 A4 C3 B1 C2 AC C3 96 C3 9F C3 B3 C3 B2 67 C3 AE C3 AB
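As a cross-check of the hex dump above, this Python sketch starts from the single-byte values, converts them to UTF-8, and hand-builds a Q-encoded word. The `q_encode_word` helper is a simplified illustration that escapes everything except ASCII letters and digits; it is not a full RFC 2047 encoder.

```python
# Single-byte (Latin-1 / CP1252) values of the long decorated name.
latin1_bytes = bytes.fromhex("57 A1 F1 A7 B1 A4 F1 AC D6 DF F3 F2 67 EE EB")
name = latin1_bytes.decode("iso-8859-1")

# The same name re-encoded as UTF-8, shown as space-separated hex.
utf8_bytes = name.encode("utf-8")
utf8_hex = utf8_bytes.hex(" ").upper()
print(utf8_hex)


def q_encode_word(text: str) -> str:
    """Simplified 'Q' encoded-word: keep ASCII letters/digits, escape the rest."""
    out = []
    for b in text.encode("utf-8"):
        ch = chr(b)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(f"={b:02X}")
    return "=?UTF-8?Q?" + "".join(out) + "?="


encoded = q_encode_word(name)
print(encoded)
print(len(encoded), "characters")
```

One caveat: RFC 2047 limits a single encoded-word to 75 characters, and this one is longer, so a strictly conforming client would split it into two encoded-words separated by whitespace.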
But all this is only meaningful if it causes downstream issues,
where I think simply switching my headers to ASCII solved the
mojibake that Andy, Carlos and others asked me to try to fix.
John Hall wrote:
On 12/03/2026 06:16, Maria Sophia wrote:
To add further value to what Carlos kindly tested using Thunderbird,
apparently, those on Thunderbird see not this (which is what I see):
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
I'm using Thunderbird and I see exactly what you see. Maybe it's
something to do with which fonts we have installed or with our Windows
settings? (I'm using Windows 11 rather than Windows 10, but I doubt that
would make any difference.)
Thank you for clarifying what I misunderstood from Carlos' tests, which is that you see what I see which Winston has subsequently confirmed are alt codes he manually typed in to set his FROM Usenet header long ago using
...w = ...w (literal)
¡ = Alt 0161 (Windows inserts byte A1 hexadecimal value)
ñ = Alt 0241 (Windows inserts byte F1 hexadecimal value)
§ = Alt 0167 (Windows inserts byte A7 hexadecimal value)
± = Alt 0177 (Windows inserts byte B1 hexadecimal value)
¤ = Alt 0164 (Windows inserts byte A4 hexadecimal value)
Those are all valid Windows Alt-codes, but the important detail is that
they produce raw 8-bit bytes from the Windows-1252 character set (which matches ISO-8859-1, Latin-1, for these particular bytes).
I could be wrong as I never really understood this characters stuff, but
a. They are not UTF-8
b. They are not ASCII
c. They are not MIME-encoded
d. They are raw 8-bit bytes
The valid format is:
=?charset?encoding?encoded-text?=
Hence, if we break Winston's header down:
=?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
| |     | |
| |     | +-- Base64 text
| |     +---- Encoding type ("B" = Base64)
| +---------- Character set (UTF-8)
+------------ Begin encoded-word
The Base64 portion is:
Li4ud8Khw7HCp8KxwqTDsQ==
Decoding that Base64 string yields the UTF-8 text:
...w¡ñ§±¤ñ
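The breakdown above can be reproduced with Python's standard `email.header.decode_header`, which splits an encoded-word into its bytes and charset. This is only a sketch of the decoding step; the variable names are illustrative.

```python
from email.header import decode_header

encoded = "=?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?="

# decode_header returns a list of (chunk, charset) pairs.
parts = decode_header(encoded)

# Chunks from encoded-words arrive as bytes; plain text may arrive as str.
display_name = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in parts
)
print(display_name)

# List the code points of the non-ASCII characters in the decoded name.
code_points = [f"U+{ord(c):04X}" for c in display_name if ord(c) > 0x7F]
print(code_points)
```

The code points it prints match the six listed at the top of the thread, so the encoded-word and the Alt-code list describe the same display name.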
Carlos E.R. wrote:
But they actually see this instead (according to what Carlos reported):
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
No, that's what I see when looking at the raw version. What I see in the
editor or the message viewer is
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
and on a follow up is "On 2026-03-12 08:08, ...w¡ñ§±¤ñ wrote:"
Notice that we are both using thunderbird, so what happens is
coordinated. It is sent as mime, but displayed as normal utf text.
That's on the header. The body is plain UTF, no need for any conversion.
The header needs to be compatible with older software.
Hi Carlos,
Thanks for correcting my misconception, as I never really understood all
this mojibake character-set interaction. Now that Winston has explained he
is typing Windows Alt-codes, and after your clarification, I am only beginning to scratch the surface of what is actually happening.
It may be that Thunderbird *stores* or *shows* the header in MIME-encoded form when you view the raw source, but apparently Thunderbird does not MIME-encode Winston's display name when sending the message.
I'm not using Thunderbird (and I changed the header to reflect that since
TB users are on this thread) but it appears that in normal viewing mode, Thunderbird simply displays the raw 8-bit Windows-1252 bytes exactly as
they appear:
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
Which matches what I see on my end.
Apparently Thunderbird is perfectly happy to accept those raw 8-bit bytes
in the header, even though they are not valid UTF-8 and not legal ASCII.
My own workflow is strict ASCII, so when those bytes get copied into my attribution line, I think what happens is some NNTP servers try to repair
the mismatch and end up mangling my outgoing post, which is really the only reason I care (as I don't care to be a Usenet-rules enforcer by any means).
So, to clarify, I think you & Winston are saying the behavior is:
1. Winston types Windows-1252 Alt-codes.
2. Thunderbird displays them as-is in the UI.
3. Thunderbird shows a MIME-encoded version only when viewing
the raw message source.
4. My ASCII-only workflow exposes the illegal bytes,
which sometimes apparently triggers server rewrites (AFAICT)
Thanks again for checking this from the Thunderbird side, as knowing how
you see Winston's messages helps me figure out how to handle the mojibake.
THIS IS A TEST. IT'S AN EXACT COPY OF THE PREVIOUS POST.
THE ONLY DIFFERENCE IS THIS HAS UTF-8 DECLARED IN THE HEADER. NOT ASCII.
DO YOU SEE THE SAME OUTPUT or DO YOU SEE IT DIFFERENTLY?
The RFC-correct solution would be:--------************
From: =?UTF-8?Q?W=C2=A1=C3=B1=C2=A7=C2=B1=C2=A4=C3=B1=C2=AC=C3=96=C3=9F=C3=B3=C3=B2g=C3=AE=C3=AB?=
<...>
But that's ugly.
Using W???????g?? would be even more so, given
Carlos E.R. wrote:
I could be wrong as I never really understood this characters stuff, but
a. They are not UTF-8
b. They are not ASCII
c. They are not MIME-encoded
d. They are raw 8-bit bytes
Huh, no. They were typed as 8-bit bytes from Latin-1 charset at some
point in time, but today they are UTF-8. UTF in the body, and as MIME in
the header.
You said it yourself in another post:
Hi Carlos,
I agree. I apologize for the flip flop indecision. I don't know what's
going on, as I'm only trying to fix the trouble W¡ñ§±¤ñ¬Ößóògîë creates.
I will endlessly admit I never understood this charset stuff, and I will point out that the only reason I even care is you and others asked me to
fix the problems that sometimes my posts look like a Chinese jigsaw puzzle.
Since I don't mess with the characters, something else is messing with the characters, where a test in this very thread shows that when I use headers
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Then W¡ñ§±¤ñ¬Ößóògîë remains W¡ñ§±¤ñ¬Ößóògîë
But when I use headers
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Then W¡ñ§±¤ñ¬Ößóògîë turns the entire post into a ransom note.
Usenet (NNTP) follows email header rules (RFC 5322 + RFC 2047):
a. The body may be UTF-8, if declared.
b. Headers cannot contain raw 8-bit bytes.
c. Hence, non-ASCII characters must be encoded using MIME encoded-words
From: =?UTF-8?Q?W=C2=A1=C3=B1=C2=A7=C2=B1=C2=A4=C3=B1?= <winston@example.com>
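For comparison, here is a hedged Python sketch of how a client could generate such a header with the standard library instead of writing the encoded-word by hand. `email.utils.formataddr` encodes a non-ASCII display name as an RFC 2047 encoded-word; it may choose Base64 rather than the Q encoding shown above, and both decode to the same name. The address is the placeholder from the example line.

```python
from email.header import decode_header
from email.utils import formataddr

# Display name written with escapes so this source file stays pure ASCII.
display_name = "...w\u00a1\u00f1\u00a7\u00b1\u00a4\u00f1"
address = "winston@example.com"  # placeholder address from the example above

# formataddr MIME-encodes the name when it contains non-ASCII characters.
from_value = formataddr((display_name, address), charset="utf-8")
print("From:", from_value)

# The header value itself is now 7-bit clean.
print("ASCII only?", from_value.isascii())

# Round trip: decoding the first chunk gives back the original name.
chunk, charset = decode_header(from_value)[0]
round_trip = chunk.decode(charset)
print(round_trip == display_name)
```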
Given Winston's "FROM:" header has those characters, which are not ASCII,
all I can say is that they're not valid characters for *headers*, unless they're MIME encoded. Are they Mime-encoded? I don't know. I don't see it.
As you said, I belatedly realized Winston's characters are valid Unicode
and valid UTF-8 but they appear in a header, apparently without required
MIME encoding when Usenet servers are allowed to mangle or reject 8-bit header bytes. When I respond, the attribution line contains W¡ñ§±¤ñ¬Ößóògîë
What I'm trying to figure out is why my body gets mangled. The attribution line contains raw Latin-1 bytes, but my outgoing headers
declare UTF-8, so I think a server in the path re-encodes the body and corrupts it. But I'm not really sure what is causing the mojibake.
Carlos E.R. wrote:
Then W¡ñ§±¤ñ¬Ößóògîë turns the entire post into a ransom note.
Possibly because the text is not actually UTF-8
Yeah. In a later post you see I belatedly figured that out for myself.
Sorry for the flip flop indecision on whether I think it's UTF-8 or not.
Did I ever mention I never really understood this Usenet charset stuff?
I'm one of the few people whose ego isn't so huge that they can't admit
when they don't know something, where I openly and humbly easily admit that
I seriously lack charset understanding when it comes to Usenet headers.
Luckily, the two things I'm doing seem to work "most" of the time:
a. If I copy/paste from a variety of web sources (particularly Chromium),
I run my body through a text-normalizer to eliminate Unicode chars.
<shortcuts.xml>
b. I manually place a US-ASCII header which seems to tell the receiving
newsreaders not to bother trying to deal with W¡ñ§±¤ñ¬Ößóògîë's
Windows-1252 / ISO-8859-1 (Latin-1) character set.
W = 0x57 (ASCII)
¡ = 0xA1
ñ = 0xF1
§ = 0xA7
± = 0xB1
¤ = 0xA4
¬ = 0xAC
Ö = 0xD6
ß = 0xDF
ó = 0xF3
ò = 0xF2
g = 0x67 (ASCII)
î = 0xEE
ë = 0xEB
Every one of those bytes is a single-byte Latin-1 / Windows-1252 character. Apart from the two ASCII letters, none of them is valid UTF-8 on its own.
Given Winston's "FROM:" header has those characters, which are not ASCII,
all I can say is that they're not valid characters for *headers*, unless
they're MIME encoded. Are they Mime-encoded? I don't know. I don't see it.
Yes, they are MIME encoded. I posted the other day the section in HEX,
taken directly from the on disk file that Leafnode has written on my
system, so no translation from Thunderbird.
I may be wrong since I never understood this stuff, so I appreciate your clarifications, and I openly let you know I really don't understand this.
I think you are describing Thunderbird's behavior, not necessarily
Winston's behavior, while mostly I'm describing Winston's original bytes,
not Thunderbird's. (Although it appears that Winston uses TB after all.)
I think we can all presume Winston originally long ago typed raw
Windows-1252 bytes using Alt-codes for his display name, but I think it may be that TB does not actually send those bytes directly in the header.
Those are raw 8-bit Latin-1 bytes when he types them.
However, I think TB does not send those bytes directly.
When Winston posts using TB, I think TB perhaps converts the Latin-1 bytes to UTF-8, and then MIME-encodes the header using RFC 2047. That may
be why the raw source on your system shows something like:
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
On your side, TB perhaps then decodes that MIME-encoded header for display, so in the normal UI you see:
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
I'm rather confused, as I don't control anything but my side of the
equation, and all I'm doing is dealing with Winston's display name,
but maybe what's possibly happening overall, is this (maybe?):
1. Winston typed Windows-1252 Alt-codes for his display name long ago.
...w¡ñ§±¤ñ
2. His Thunderbird converts those Latin-1 bytes to UTF-8 internally.
3. His Thunderbird MIME-encodes the UTF-8 header before sending it.
4. Your Thunderbird decodes the MIME header & displays normal UTF-8 text.
5. My own newsreader client copies the original Latin-1 bytes from the
attribution line because it does not decode the MIME header.
6. That mismatch triggers mojibake in my outgoing posts when my headers
declare "charset=UTF-8" instead of "charset=US-ASCII".
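If steps 5 and 6 are roughly right, one possible workaround for an ASCII-only workflow is to decode the From header first and only then reduce the name to ASCII for the attribution line. The Python sketch below shows that idea; `decode_display_name` and `ascii_attribution` are hypothetical helpers, not part of any existing newsreader, and replacing each non-ASCII character with "?" is just one of several possible policies.

```python
from email.header import decode_header


def decode_display_name(header_value: str) -> str:
    """Decode any RFC 2047 encoded-words in a header value to plain text."""
    pieces = []
    for chunk, charset in decode_header(header_value):
        if isinstance(chunk, bytes):
            pieces.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            pieces.append(chunk)
    return "".join(pieces)


def ascii_attribution(header_value: str, date: str) -> str:
    """Build an attribution line that contains only 7-bit ASCII."""
    name = decode_display_name(header_value)
    # Each non-ASCII character becomes a single '?'.
    safe = name.encode("ascii", errors="replace").decode("ascii")
    return f"On {date}, {safe} wrote:"


line = ascii_attribution("=?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=", "3/12/2026")
print(line)
```

With this approach no raw 8-bit bytes reach the outgoing body, whichever charset the post declares.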
I never understood this stuff, but perhaps maybe that explains why you see
a valid MIME-encoded UTF-8 header in the raw view, while I see the original Latin-1 bytes in my ASCII world. Thunderbird is doing the right thing on Winston's end, but perhaps my own ASCII-only setup exposes the mismatch.
Thanks again for helping me sort out what Thunderbird is doing on your
side, as I used TB years ago for a client and hated how it thought Usenet
was email. Maybe it's better now as that had to be a decade or so ago.
As you said, I belatedly realized Winston's characters are valid Unicode
and valid UTF-8 but they appear in a header, apparently without required
MIME encoding when Usenet servers are allowed to mangle or reject 8-bit header bytes. When I respond, the attribution line contains W¡ñ§±¤ñ¬Ößóògîë
I think you are describing Thunderbird's behavior, not necessarily
Winston's behavior, while mostly I'm describing Winston's original bytes,
not Thunderbird's. (Although it appears that Winston uses TB after all.)
I think we can all presume Winston originally long ago typed raw
Windows-1252 bytes using Alt-codes for his display name, but I think it may be that TB does not actually send those bytes directly in the header.
Forget latin-1. The servers are sending mime encoded utf-8 in the
headers. Life is simple that way.