Editor and BOM (byte order mark) #3898
Comments
@fddima: What encoding is the file using? Brackets currently only supports UTF-8 files. |
The file is UTF-8 encoded with a BOM, i.e. the file begins with EF BB BF. The editor shows this as a red dot at the beginning of the file. |
Reviewed - low priority @RaymondLim to investigate. Should this be a user story? |
Thanks. But what do you mean by user story? This issue mostly affects files that contain non-Latin characters, but in fact a BOM can be present even when a file contains only Latin characters, and it is valid for any UTF-encoded file. I often see the same thing in similar editors or viewers (a red dot at the beginning of the file), but if you are aiming for quality, you must support the BOM, even if just for UTF-8 encoded files. |
A BOM is used to differentiate between the various Unicode encodings, and it has more forms than the one mentioned by @fddima. See http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding for all the encodings that a BOM can represent. Since it represents the encoding, maybe it should be part of this non-UTF-8 encoding user story https://trello.com/card/support-non-utf-8-encodings/51072abad1b1f8a4560086bf/30. For a document with a BOM, we need to implement the following tasks.
|
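As a rough illustration of the point that a BOM identifies the encoding, here is a minimal Python sketch that maps the leading bytes of a file to an encoding name. The function name `detect_bom` and the table `BOMS` are hypothetical, introduced only for this example; the byte sequences come from Python's standard `codecs` module. It is not taken from Brackets.

```python
import codecs

# BOM byte sequences from Python's standard codecs module.
# The UTF-32 entries come first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM,
    otherwise (None, 0). Hypothetical helper for illustration only."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

Checking the longer UTF-32 marks before the UTF-16 ones matters here; otherwise a UTF-32 LE file would be misreported as UTF-16 LE.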
A BOM at the beginning of the file is treated as content and, for example, prevents the PHP header function from working. |
@RaymondLim Even in UTF-8, though, the BOM can be present (but only in one form). So we should figure out what to do with it in that case. |
@njx Yes, we can initially focus on handling just the one form, EF BB BF, for UTF-8 documents. Then we still need to do the first two tasks that I listed above. @kirilloid Thanks for mentioning the PHP header example. I agree with you that the BOM is part of the content. I can also agree that the user should be aware of the BOM's presence in the file. But I don't think showing the BOM in the editor is the right thing to do, since you don't really want the user to edit it like normal text content. Maybe we should have something in the status bar to show the BOM, with some UI that allows the user to add or delete the BOM or convert the encoding. |
Having just filed a dupe :-) , I agree with @RaymondLim: we shouldn't show the BOM as part of the editor content. I like the idea of indicating it in the status bar, since we'd probably also use the sb to show/edit the overall encoding once we support multiple encodings. For that reason I think we could actually slice this in half: first just stop showing the red dot in the content (but presumably preserve the BOM on save); second, start indicating the BOM presence in the status bar -- we could potentially delay that part all the way until the broader encoding story happens. |
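The first half of that split (hide the BOM from the editor content, but write it back on save) could look roughly like the following Python sketch. The function names `load_for_editing` and `save_from_editor` are hypothetical and only UTF-8 is assumed, as in the discussion above; this is not how Brackets implements file I/O.

```python
import codecs


def load_for_editing(path):
    """Read a UTF-8 file and return (text_without_bom, had_bom).

    The BOM is removed from the text handed to the editor, but its
    presence is remembered so it can be restored on save.
    """
    with open(path, "rb") as f:
        raw = f.read()
    had_bom = raw.startswith(codecs.BOM_UTF8)
    if had_bom:
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8"), had_bom


def save_from_editor(path, text, had_bom):
    """Write text as UTF-8, re-adding the BOM only if the file had one."""
    data = text.encode("utf-8")
    if had_bom:
        data = codecs.BOM_UTF8 + data
    with open(path, "wb") as f:
        f.write(data)
```

Keeping `had_bom` alongside the document means a file without a BOM stays without one, and a file with a BOM keeps it, so a save does not change the first bytes of the file.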
@larz0 Can you suggest the UI for BOM in the status bar which can be extended to show the encoding in the future? |
@RaymondLim thoughts? |
@RaymondLim maybe a tooltip that says, e.g. "UTF-7". Or should it be the other way around? Display encoding in status bar with a tooltip for hex? |
@larz0 - note that this isn't about showing the encoding (we're still only supporting UTF-8). It's just some way of showing that the file had a BOM (even though we don't show it in the code) and that we'll properly write it out when we save it. Does Sublime show the BOM somehow? I actually wonder if it's important for us to have any visual indication in the UI that there was a BOM. FWIW, I think this is only on Windows. I don't see the red dot on Mac--I'm guessing we never see it because we're reading the file as text and the filesystem deals with the encoding somehow. |
@njx Actually, we want to show both encoding and BOM. @larz0 I think it should be the other way around, showing the encoding instead of hex, then a tooltip to indicate whether the unicode encoding has BOM or not. In the future we can make it a link so that the user can click on it to invoke a dialog to do encoding conversion or add/remove BOM. |
Good point! The red dot shows up only on Windows, and on Mac the shell code is dealing with utf-8 encoding and stripping BOM before returning the file content. So we have a different issue on Mac for files with BOM. If the user makes any changes to a file with BOM on Mac, we're not saving BOM back with the updated content. |
This is also visible on Linux in sprint 33, so it's not a Windows only issue. I don't think this is something handled by the file system as suggested by @njx. This is handled by the application and the fact that it's different on Mac might mean that there is already support for it somehow. Figuring out why it does not happen on Mac might prove useful. |
For me, and I guess for many other non-English users, this is a big problem. I'm forced to use another editor and re-save files after using Brackets, or not use Brackets at all. |
We have the same problem with Brackets messing up all non-English characters in files created with BOM. |
Couple of assumptions:
What about the following approach to resolve the issue:
Perhaps it makes sense to indicate BOM in red in the case of unsupported encoding. |
This would solve our problems. That, or a "Save with encoding ..." function with the choice UTF-8 BOM, similar to Sublime Text. |
Any updates on this issue? |
@webjohan Sprint 37 has already started and I don't think we have the cycles to handle it in this sprint. I'll be nominating it, and hopefully we can handle it for UTF-8 documents in sprint 38. We won't be handling other Unicode (and non-UTF-8) documents, since that involves encoding conversion when reading in and writing out. |
@RaymondLim This is great news! I look forward to this release! |
Nominating for Brackets 1.0. |
Reviewed. We discussed this, and we think that for UTF-8 specifically (not including other encodings), since the BOM is not important, the most we should do is preserve it, but it would be fine to just strip it. We don't want to do any UI for indicating that the BOM exists. We definitely do want to handle other encodings at some point (for which the BOM will be important), so we might want to defer this until we deal with encodings in general. Removing 1.0 milestone. |
Just preserving the BOM would make a big improvement until the "full encoding handling" is in place. |
I'm coding in Simplified Chinese. After updating to sprint 39, the red dot disappeared, but every time I change a file that is UTF-8 encoded with a BOM, Brackets automatically saves it without the BOM, which messes up my site. The red dot is ugly, but it worked fine. My solution is to use Notepad++ to add the BOM back after changing the file. Is there a better way to preserve the BOM? |
Same here, our files are mostly UTF8, which means every commit of a file modified by Brackets registers a change on the first line. I then either have to do like @Ashley2014 and use Notepad++ to change the encoding, or reset the first line of every file I want to commit. |
Hello everyone! I wrote this script to add BOM to those files edited by Brackets. |
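The script referred to in that comment is not reproduced in this thread. Purely as an illustration of the workaround being described (re-adding the BOM after an editor has stripped it), a minimal Python sketch might look like this. The function name `add_utf8_bom` is hypothetical and this is not the poster's script.

```python
import codecs


def add_utf8_bom(path):
    """Prepend the UTF-8 BOM (EF BB BF) to the file if it is missing.

    Returns True if the file was modified, False if it already had a BOM.
    Hypothetical helper; assumes the file is UTF-8 encoded.
    """
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + data)
    return True
```

Because the function checks for an existing BOM first, running it twice on the same file does not add a second mark.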
Some minification programs automatically use a BOM instead of a typical encoding declaration to squeeze out a few more bytes; in particular, Sass's production mode will output the BOM instead of |
I'd like to offer my two cents on this issue, as Unicode encoding has become something of an issue that is dear to me. First, I'd like to address a couple of remarks made here:
Please do reconsider your opinion on this. The Byte Order Mark really is very badly named, as all emphasis is laid on Byte Order, which (as you say) does not make sense for UTF-8, but in fact this marker serves two purposes. The byte-order thing is just half of the story.
The second purpose of the BOM is as a signature mark, to identify the content as Unicode encoded. In this world filled with legacy encodings, this second function is actually much more important than the first one. Why? Because in practical reality today, UTF-8 is the only Unicode encoding that is actually in widespread use. But one of the biggest advantages of UTF-8 encoding, backwards compatibility with ASCII, is also its Achilles' heel; it is nearly impossible for editors and tools opening UTF-8 encoded text to tell the difference between ASCII, ANSI, ISO-8859-x and UTF-8. We need to either ask the user, make assumptions (just assume UTF-8, for example) or make some guesstimate based on file content... Sometimes with very weird results.
Many applications allow the user to specify the encoding. Some even offer explicit control over the BOM character... But really, most users don't have the knowledge to make informed decisions about these issues. Brackets targets programmers, but even most programmers don't understand the intricacies of text encodings. As such, Brackets has an important function. The standards it chooses to set and the defaults it chooses will have a big impact on the community, because most programmers will simply use these defaults without ever thinking about them. As such I urge you to study this matter in great detail and take the time (and give this issue enough priority) such that you can pick those defaults that will leave us with as few problems as possible in 10 years' time, when thousands or even millions of files will have been created using these defaults.
Please write out the Unicode BOM as the first character of every new file by default.
The BOM is specified in the Unicode standard. It's the lowest possible level where this could have been standardized (as opposed to in some OS or filesystem, or even at the application level). That is fantastic because it means we can get out of the encoding hell.
Any application that claims to support Unicode must understand the BOM. So technically you can write out the BOM in front of every UTF-8 file you save and it should never cause any problems. If it does, the application reading the file is violating the Unicode standard. I just really badly want all those millions of files that are (hopefully :) going to be created with Brackets to have BOMs in front of them, so text editors and any other software processing text files won't have to guess about the encoding it's in, because it will know from the BOM that it is indeed UTF-8 encoded Unicode text and nothing else. Now in practical reality there is still software around that does not fully support Unicode and has trouble with the BOM... This very issue being one example. For those scenarios it's good to have the option to suppress it. The suggestion from @busykai for a small GUI in the status bar sounds fantastic. Just please, please, please set the BOM to enabled for new files. It will save us so much pain in the long run. |
In the meantime I read the user story on the backlog that was referenced in this issue and it is very interesting to read the discussion there in the context of the Unicode BOM. It specifically mentions that it is very hard to detect the encoding of a file. If you embrace the BOM, this makes the world so much easier! Just:
Starting to write out the BOM on saving UTF-8 files now will mean you will be able to get rid of the scanning for illegal UTF-8 sequences earlier, or maybe even skip it altogether, because you will be able to reliably detect which encoding was used to save the file. And with you, all other tools that support Unicode. Long live the BOM! :) |
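To make the contrast between BOM-based detection and content scanning concrete, here is a minimal Python sketch under stated assumptions: only UTF-8 is considered, the function name `guess_encoding` and the label "unknown" are hypothetical, and the fallback is a strict UTF-8 decode standing in for "scanning for illegal UTF-8 sequences". The labels "utf-8-sig" and "utf-8" are Python codec names.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess how to decode data, preferring the BOM as a signature.

    Hypothetical helper: returns "utf-8-sig" when a UTF-8 BOM is present,
    "utf-8" when the bytes merely happen to be valid UTF-8, and "unknown"
    when they contain sequences that are illegal in UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # explicit signature, no scanning needed
    try:
        data.decode("utf-8")  # strict decode scans the whole content
    except UnicodeDecodeError:
        return "unknown"  # some legacy encoding; cannot tell which one
    return "utf-8"  # a guess: pure ASCII also lands here
```

Only the first branch is certain. Without a BOM, plain ASCII text passes the UTF-8 check as well, which is the ambiguity described above.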
The editor shows the BOM (byte order mark) as a red dot at the beginning of the file.
UPD: Windows 7, Brackets Sprint 24.