fix: Fix encoding of the email body

DOMDocument::loadHTML loads content as ISO-8859-1 by default. It means that our UTF-8 $content variable is not correctly interpreted.

We could tell loadHTML to use UTF-8, but we would need to alter the content (e.g. by adding an XML encoding declaration). We cannot be sure of how the string looks except by inspecting its content, and I feel that would be a rabbit hole of problems.

The alternative is to convert all the non-ASCII characters to numeric HTML entities with the somewhat cryptic `mb_encode_numericentity` function. In my experience, it works pretty well.

Another alternative, which I didn't choose, would be to use a better library to parse HTML.

Bug introduced in: 417a76d

References:

- loadHTML using ISO-8859-1 by default: https://stackoverflow.com/a/8218649
- about the cryptic `mb_encode_numericentity` arguments: https://stackoverflow.com/a/35858944
- HTML5 parser (no longer maintained?): https://github.com/Masterminds/html5-php
Probesys · Dec 1, 2023 · commit 22ef430 · 1 parent c4da224
Showing 1 changed file with 11 additions and 0 deletions.
diff --git a/src/MessageHandler/CreateTicketsFromMailboxEmailsHandler.php b/src/MessageHandler/CreateTicketsFromMailboxEmailsHandler.php
@@ -231,6 +231,17 @@ private function storeAttachments(MailboxEmail $mailboxEmail): array
     private function replaceAttachmentsUrls(string $content, array $messageDocuments): string
     {
         $contentDom = new \DOMDocument();
+
+        // DOMDocument::loadHTML considers the source string to be encoded in
+        // ISO-8859-1 by default. To avoid ending up with garbled characters,
+        // we encode the non-ASCII chars (i.e. all code points from 0x80 up to
+        // 0x10FFFF) as numeric HTML entities.
+        $content = mb_encode_numericentity(
+            $content,
+            [0x80, 0x10FFFF, 0, -1],
+            'UTF-8'
+        );
+
         $contentDom->loadHTML($content);
         $contentDomXPath = new \DomXPath($contentDom);
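
To illustrate the effect of the conversion described in the commit message, here is a minimal standalone sketch. The sample string and the variable names (`$sample`, `$encoded`, `$text`) are assumptions made for illustration and are not part of the handler; only the `mb_encode_numericentity` call mirrors the one in the diff.

```php
<?php

// Hypothetical UTF-8 input, standing in for an email body.
$sample = '<p>Café</p>';

// Same conversion map as in the commit: every code point from 0x80 to
// 0x10FFFF is replaced by its decimal numeric entity (offset 0, mask -1).
$encoded = mb_encode_numericentity(
    $sample,
    [0x80, 0x10FFFF, 0, -1],
    'UTF-8'
);

// The string is now pure ASCII, so the encoding loadHTML assumes no
// longer matters for these characters.
$dom = new \DOMDocument();
$dom->loadHTML($encoded);

// Text content read back from the DOM, as a UTF-8 string.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo $encoded, "\n";
echo $text, "\n";
```

Because the entities are resolved by the parser itself, the text read back from the DOM should match the original UTF-8 characters without adding any encoding declaration to the content.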