DOMDocument::loadHTML loads content as ISO-8859-1 by default, which means that our UTF-8 $content variable is not interpreted correctly.

We could tell loadHTML to use UTF-8, but we would need to alter the content (e.g. by adding an XML encoding declaration). We cannot know what the string looks like without inspecting its content, and I feel that would be a rabbit hole of problems.

The alternative is to convert all the non-ASCII characters to HTML entities with the somewhat obscure `mb_encode_numericentity` function. In my experience, it works pretty well.

Another alternative, which I didn't choose, would be to use a better library to parse HTML.

Bug introduced in: 417a76d

References:

- loadHTML using ISO-8859-1 by default: https://stackoverflow.com/a/8218649
- about the cryptic `mb_encode_numericentity` arguments: https://stackoverflow.com/a/35858944
- HTML5 parser (no longer maintained?): https://github.com/Masterminds/html5-php
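
For illustration, the numeric-entity workaround described in this message could look roughly like the sketch below. It is not the code from this commit: the helper name `load_utf8_html` and the sample markup are hypothetical, and the conversion map is one commonly used choice that targets every code point from 0x80 upwards.

```php
<?php
// Hypothetical helper, for illustration only (not taken from the commit).
function load_utf8_html(string $content): DOMDocument
{
    // Conversion map format: [first code point, last code point, offset, mask].
    // This map turns every non-ASCII character into a numeric entity
    // (e.g. "é" becomes "&#233;"), leaving a pure ASCII string that
    // loadHTML cannot misread as ISO-8859-1.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($content, $convmap, 'UTF-8');

    $dom = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = load_utf8_html('<p>Café ☕</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The design choice here is that the input never needs an encoding declaration: once the string is plain ASCII, the default encoding assumed by loadHTML no longer matters, and the entities are decoded back to the original characters while parsing.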