fix: Fix encoding of the email body
DOMDocument::loadHTML loads content as ISO-8859-1 by default. It means
that our UTF-8 $content variable is not correctly interpreted.

We could tell loadHTML to use UTF-8, but we would need to alter the
content (e.g. by adding an XML encoding declaration). We cannot be sure
of what the string looks like without inspecting its content, and I feel
that would be a rabbit hole of problems.

The alternative is to convert all the non-ASCII characters to HTML
entities with some weird `mb_encode_numericentity` function. In my
experience, it works pretty well.
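
To illustrate the conversion described above, here is a minimal sketch of what `mb_encode_numericentity` does with the conversion map used in this commit. The sample string and variable names are assumptions chosen for illustration; the snippet requires the mbstring extension.

```php
<?php
// Conversion map: [start, end, offset, mask].
// Every code point in the range 0x80..0x10FFFF is replaced by a decimal
// numeric entity. The offset 0 leaves the code point value unchanged and
// the mask -1 (all bits set) keeps every bit of it.
$convmap = [0x80, 0x10FFFF, 0, -1];

// Illustrative input, not taken from the commit.
$encoded = mb_encode_numericentity('café', $convmap, 'UTF-8');

echo $encoded, PHP_EOL; // prints "caf&#233;" (é is U+00E9, i.e. 233)
```

The result contains only ASCII bytes, so it reads the same whether it is interpreted as ISO-8859-1 or as UTF-8.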

Another alternative that I didn't choose would be to use a better
library to parse HTML.

Bug introduced in: 417a76d

References:

- loadHTML using ISO-8859-1 by default https://stackoverflow.com/a/8218649
- about the `mb_encode_numericentity` cryptic arguments https://stackoverflow.com/a/35858944
- HTML5 parser (no longer maintained?) https://github.com/Masterminds/html5-php
marien-probesys committed Dec 1, 2023
1 parent c4da224 commit 22ef430
Showing 1 changed file with 11 additions and 0 deletions.
11 changes: 11 additions & 0 deletions src/MessageHandler/CreateTicketsFromMailboxEmailsHandler.php
@@ -231,6 +231,17 @@ private function storeAttachments(MailboxEmail $mailboxEmail): array
    private function replaceAttachmentsUrls(string $content, array $messageDocuments): string
    {
        $contentDom = new \DOMDocument();

        // DOMDocument::loadHTML considers the source string to be encoded in
        // ISO-8859-1 by default. To avoid ending up with garbled characters,
        // we encode the non-ASCII chars (i.e. all chars from 0x80 upwards) to
        // HTML entities.
        $content = mb_encode_numericentity(
            $content,
            [0x80, 0x10FFFF, 0, -1],
            'UTF-8'
        );

        $contentDom->loadHTML($content);
        $contentDomXPath = new \DomXPath($contentDom);
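
As a rough check of the approach in the hunk above, the following sketch parses an entity-encoded fragment and reads the text back as UTF-8. The helper `loadUtf8Html` and the sample markup are hypothetical and not part of the commit; the snippet assumes the dom and mbstring extensions are available.

```php
<?php
// Hypothetical helper (not in the commit): parse a UTF-8 HTML string
// without relying on loadHTML's default encoding.
function loadUtf8Html(string $html): DOMDocument
{
    // Turn every non-ASCII character into a numeric entity so that the
    // string handed to loadHTML is plain ASCII.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, -1], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($encoded);

    return $dom;
}

// Illustrative input, not taken from the commit.
$dom = loadUtf8Html('<p>café</p>');

// The DOM extension exposes node text as UTF-8, so the entity is decoded
// back to the original character.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

The conversion leaves existing markup and entities untouched, since `<`, `>` and `&` are ASCII and fall outside the mapped range.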
