This repository has been archived by the owner on Sep 6, 2021. It is now read-only.

Editor and BOM (byte order mark) #3898

Closed
ghost opened this issue May 18, 2013 · 32 comments

@ghost

ghost commented May 18, 2013

The editor shows the BOM (byte order mark) as a red dot at the beginning of the file.

UPD: Windows 7, Brackets Sprint 24.

@peterflynn
Member

@fddima: What encoding is the file using? Brackets currently only supports UTF-8 files.

@ghost
Author

ghost commented May 19, 2013

The file is UTF-8 encoded with a BOM, i.e. it begins with the bytes EF BB BF. The editor shows this as a red dot at the beginning of the file.
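
For illustration, here is a minimal Python sketch (not Brackets code; `has_utf8_bom` is a hypothetical helper name) of what those three bytes look like to a program. Decoded as plain UTF-8, the BOM survives as the character U+FEFF at the start of the text, which is presumably what the editor renders as the red dot.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Sample content for the example: the three BOM bytes followed by ASCII text.
raw = b"\xef\xbb\xbf<?php echo 'hi'; ?>"

print(has_utf8_bom(raw))             # True
# The plain "utf-8" codec does not strip the BOM; it becomes U+FEFF.
print(repr(raw.decode("utf-8")[0]))  # '\ufeff'
```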

@ghost ghost assigned RaymondLim May 20, 2013
@peterflynn
Member

Reviewed - low priority @RaymondLim to investigate. Should this be a user story?

@ghost
Author

ghost commented May 20, 2013

Thanks. But what do you mean by "user story"? This issue mostly affects files that contain non-Latin characters, but in fact a BOM can be present even when a file contains only Latin characters, and it is valid for any UTF-encoded file. I run into the same thing very often with similar editors and viewers (a red dot at the beginning of the file), but if you are aiming for quality, you must support the BOM, even if only for UTF-8 encoded files.

@RaymondLim
Contributor

The BOM is used to differentiate between the various Unicode encodings, and there are more forms of it than the one mentioned by @fddima. See http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding for all the encodings that a BOM can represent. Since it represents the encoding, maybe it should be part of this non-UTF-8 encoding user story: https://trello.com/card/support-non-utf-8-encodings/51072abad1b1f8a4560086bf/30.

For a document with a BOM, we need to implement the following tasks:

  • Detect and strip off BOM on opening files.
  • Re-add BOM on saving.
  • If other non-utf-8 encodings are supported, then we may need to do encoding conversion on loading/saving.
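
The first two tasks can be sketched roughly as follows. This is only an illustration in Python, not the Brackets implementation; `LoadedText`, `load_utf8` and `save_utf8` are made-up names, and the sketch works on byte strings rather than real files to stay self-contained.

```python
import codecs
from dataclasses import dataclass


@dataclass
class LoadedText:
    text: str      # document content without the BOM
    had_bom: bool  # remembered so the BOM can be restored on save


def load_utf8(data: bytes) -> LoadedText:
    """Detect and strip a leading UTF-8 BOM, then decode the rest."""
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        data = data[len(codecs.BOM_UTF8):]
    return LoadedText(text=data.decode("utf-8"), had_bom=had_bom)


def save_utf8(doc: LoadedText) -> bytes:
    """Encode the text and re-add the BOM if the original file had one."""
    body = doc.text.encode("utf-8")
    return codecs.BOM_UTF8 + body if doc.had_bom else body


doc = load_utf8(b"\xef\xbb\xbfhello")
doc.text += " world"   # the user edits the document
print(save_utf8(doc))  # b'\xef\xbb\xbfhello world'
```

Keeping the flag next to the text means an edit never touches the BOM, and a file that had no BOM is written back without one.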

@kirilloid

A BOM at the beginning of the file is treated as content and, for example, prevents the PHP header function from working.
I think the user should be aware of the BOM's presence in the file. Unfortunately, the solution proposed by @RaymondLim doesn't address that problem.
I'd replace the "red dot at begin of file" with maybe something like "[BOM]" above the first line.

@njx
Contributor

njx commented May 24, 2013

@RaymondLim Even in UTF-8, though, the BOM can be present (but only in one form). So we should figure out what to do with it in that case.

@RaymondLim
Contributor

@njx Yes, we can initially focus on handling just the single form EF BB BF for UTF-8 documents. We still need to do the first two tasks that I listed above.

@kirilloid Thanks for mentioning the PHP header example. I agree with you that the BOM is part of the content, and I can also agree that the user should be aware of its presence in the file. But I don't think showing the BOM in the editor is the right thing to do, since you don't really want the user to edit it like normal text content. Maybe we should have something in the status bar that shows the BOM, with some UI that allows the user to add or delete the BOM or convert the encoding.

@peterflynn
Member

Having just filed a dupe :-) , I agree with @RaymondLim: we shouldn't show the BOM as part of the editor content. I like the idea of indicating it in the status bar, since we'd probably also use the sb to show/edit the overall encoding once we support multiple encodings. For that reason I think we could actually slice this in half: first just stop showing the red dot in the content (but presumably preserve the BOM on save); second, start indicating the BOM presence in the status bar -- we could potentially delay that part all the way until the broader encoding story happens.

@RaymondLim
Contributor

@larz0 Can you suggest the UI for BOM in the status bar which can be extended to show the encoding in the future?

@larz0
Member

larz0 commented Sep 25, 2013

@RaymondLim thoughts?

[Screenshot attached: screen shot 2013-09-25 at 4 52 59 pm]

@larz0
Member

larz0 commented Sep 25, 2013

@RaymondLim maybe a tooltip that says, e.g., "UTF-7". Or should it be the other way around? Display the encoding in the status bar with a tooltip for the hex?

@njx
Contributor

njx commented Sep 26, 2013

@larz0 - note that this isn't about showing the encoding (we're still only supporting UTF-8). It's just some way of showing that the file had a BOM (even though we don't show it in the code) and that we'll properly write it out when we save it.

Does Sublime show the BOM somehow? I actually wonder if it's important for us to have any visual indication in the UI that there was a BOM.

FWIW, I think this is only on Windows. I don't see the red dot on Mac--I'm guessing we never see it because we're reading the file as text and the filesystem deals with the encoding somehow.

@RaymondLim
Contributor

@njx Actually, we want to show both encoding and BOM. @larz0 I think it should be the other way around, showing the encoding instead of hex, then a tooltip to indicate whether the unicode encoding has BOM or not. In the future we can make it a link so that the user can click on it to invoke a dialog to do encoding conversion or add/remove BOM.

@RaymondLim
Contributor

FWIW, I think this is only on Windows. I don't see the red dot on Mac--I'm guessing we never see it because we're reading the file as text and the filesystem deals with the encoding somehow.

Good point! The red dot shows up only on Windows, and on Mac the shell code is dealing with utf-8 encoding and stripping BOM before returning the file content. So we have a different issue on Mac for files with BOM. If the user makes any changes to a file with BOM on Mac, we're not saving BOM back with the updated content.

@carragom

carragom commented Nov 5, 2013

This is also visible on Linux in sprint 33, so it's not a Windows only issue. I don't think this is something handled by the file system as suggested by @njx. This is handled by the application and the fact that it's different on Mac might mean that there is already support for it somehow. Figuring out why it does not happen on Mac might prove useful.

@webjohan

For me, and I guess for many other non-English users, this is a big problem. I'm forced to either re-save files in another editor after using Brackets, or not use Brackets at all.

@johandahlgren

We have the same problem with Brackets messing up all non-English characters in files created with a BOM.
This is a major showstopper for us outside the Anglo-Saxon world. I really like Brackets and I hope there will be a fix for this soon.

@busykai
Contributor

busykai commented Nov 19, 2013

Couple of assumptions:

  1. BOM for UTF-8 does not make any sense (it's just 1 byte). A BOM indicating UTF-8 does exist though.
  2. Opening multi-byte documents as well as any strangely encoded documents should be detected and warned about.
  3. BOMs are well-known: http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

What about the following approach to resolve the issue:

  1. BOMs are automatically yanked on read and prepended on write.
  2. BOM indicator is displayed when BOM is present (except it should indicate encoding and tooltip should display literal, e.g. "BOM: 00 00 FE FF").
  3. When the BOM indicator is clicked, the BOM will not be written. BOM indicator text is struck through (and probably greyed out).
  4. When clicked the second time, the BOM will be written. BOM indicator text is normal.
  5. When the file is written and the BOM was removed as in 3, BOM indicator disappears.
  6. When a UTF-8 file with a BOM is opened, 1 and 2 silently happen.
  7. When a file is opened and the BOM is not a UTF-8, 1 happens, a warning (modal dialog) is displayed that the file might be displayed inappropriately, 2 happens.
  8. Non-printable sequence is read at the beginning, but not recognized as a known BOM: fallback to current behaviour -- display non-printables the CM way.

Perhaps it makes sense to indicate BOM in red in the case of unsupported encoding.
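
As a rough illustration of steps 1 to 4 above, here is a Python sketch of the state such an indicator would need. The names `KNOWN_BOMS`, `BomIndicator` and `detect_bom` are hypothetical, and the table covers only the UTF-8, UTF-16 and UTF-32 marks that Python's `codecs` module exposes as constants, not every entry on the Wikipedia page.

```python
import codecs
from dataclasses import dataclass
from typing import Optional

# Longest marks first: the UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so checking UTF-16 first would misidentify it.
KNOWN_BOMS = [
    ("UTF-32 BE", codecs.BOM_UTF32_BE),
    ("UTF-32 LE", codecs.BOM_UTF32_LE),
    ("UTF-8", codecs.BOM_UTF8),
    ("UTF-16 BE", codecs.BOM_UTF16_BE),
    ("UTF-16 LE", codecs.BOM_UTF16_LE),
]


@dataclass
class BomIndicator:
    encoding: str               # shown in the status bar
    bom: bytes                  # the literal mark that was read
    write_on_save: bool = True  # toggled by clicking the indicator

    def tooltip(self) -> str:
        return "BOM: " + " ".join(f"{b:02X}" for b in self.bom)

    def toggle(self) -> None:
        self.write_on_save = not self.write_on_save


def detect_bom(data: bytes) -> Optional[BomIndicator]:
    """Return an indicator for a recognized BOM, or None if there is none."""
    for name, bom in KNOWN_BOMS:
        if data.startswith(bom):
            return BomIndicator(encoding=name, bom=bom)
    return None


indicator = detect_bom(b"\x00\x00\xfe\xff\x00\x00\x00a")
print(indicator.encoding, "|", indicator.tooltip())  # UTF-32 BE | BOM: 00 00 FE FF
```

A `None` result corresponds to step 8: no known BOM was recognized, so the editor falls back to its current behaviour.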

@webjohan

This would solve our problems.

That, or a "Save with encoding ..." function with a "UTF-8 BOM" choice, similar to Sublime Text.

@webjohan

Any updates on this issue?

@RaymondLim
Contributor

@webjohan Sprint 37 has already started and I don't think we have the cycles to handle it in this sprint. I'll be nominating it, and hopefully we can handle it for UTF-8 documents in sprint 38. We won't be handling other Unicode (non-UTF-8) documents, since that involves encoding conversion when reading and writing.

@webjohan

@RaymondLim This is great news! I look forward to this release!

@RaymondLim RaymondLim added this to the Brackets 1.0 milestone Mar 20, 2014
@RaymondLim
Contributor

Nominating for Brackets 1.0.

@njx njx removed this from the Brackets 1.0 milestone Apr 11, 2014
@njx
Contributor

njx commented Apr 11, 2014

Reviewed. We discussed this, and we think that for UTF-8 specifically (not including other encodings), since the BOM is not important, the most we should do is preserve it, but it would be fine to just strip it. We don't want to do any UI for indicating that the BOM exists.

We definitely do want to handle other encodings at some point (for which the BOM will be important), so we might want to defer this until we deal with encodings in general.

Removing 1.0 milestone.

@webjohan

Just preserving the BOM would make a big improvement until the "full encoding handling" is in place.

@Ashley2014

I code in Simplified Chinese. After updating to sprint 39 the red dot disappeared, but every time I change a file that is UTF-8 encoded with a BOM, Brackets automatically saves it without the BOM, which messes up my site. The red dot was ugly, but it worked fine.

My solution is to use Notepad++ to add the BOM back after changing the file. Is there a better way to preserve the BOM?

@christianrondeau

Same here, our files are mostly UTF8, which means every commit of a file modified by Brackets registers a change on the first line. I then either have to do like @Ashley2014 and use Notepad++ to change the encoding, or reset the first line of every file I want to commit.

@cyrildtm

Hello everyone! I wrote this script to add BOM to those files edited by Brackets.
https://gist.github.com/cyrildtm/71c685cdd28a010511e2
I've tested it with my own project, but I'm not sure it will work for all of you. Beware of possible damage and make a backup before using it.

@tigt

tigt commented Jul 11, 2015

Some minification programs automatically use a BOM instead of a typical encoding declaration to squeeze out a few more bytes; in particular, Sass's production mode will output the BOM instead of @charset "utf-8"; when it detects non-ASCII characters inside the final file. I would love it if Brackets did BOM handling in a robust, discoverable way; the consequences can be dire for things like Ruby.

@Download

Download commented Jan 1, 2016

I'd like to offer my two cents on this issue, as Unicode encoding has become something of an issue that is dear to me.

First, I'd like to address a couple of remarks made here:

@busykai

Couple of assumptions:

BOM for UTF-8 does not make any sense (it's just 1 byte). A BOM indicating UTF-8 does exist though.

@njx

since the BOM is not important, the most we should do is preserve it, but it would be fine to just strip it.

Please do reconsider your opinion on this. The Byte Order Mark really is very badly named, as all emphasis is laid on Byte Order, which (as you say) does not make sense for UTF-8, but in fact this marker serves two purposes. The byte-order thing is just half of the story.

Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.
(Unicode FAQ, emphasis mine)

The second purpose of the BOM is as a signature mark, to identify the content as Unicode encoded. In this world filled with legacy encodings, this second function is actually much more important than the first one. Why? Because in practical reality today, UTF-8 is the only Unicode encoding that is actually in widespread use. But one of the biggest advantages of UTF-8 encoding, backwards compatibility with ASCII, is also its Achilles' heel: it's nearly impossible for editors and tools opening UTF-8 encoded text to tell the difference between ASCII, ANSI, ISO-8859-x and UTF-8. We need to either ask the user, make assumptions (just assume UTF-8, for example) or make some guesstimate based on file content... sometimes with very weird results.
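
A small Python example of that ambiguity, using the word "café" as arbitrary sample data: the same bytes decode without any error under both UTF-8 and ISO-8859-1, only with different results, and pure ASCII bytes look identical under all three encodings.

```python
# "é" is two bytes in UTF-8 (C3 A9); ISO-8859-1 reads them as two characters.
utf8_bytes = "café".encode("utf-8")

print(utf8_bytes.decode("utf-8"))       # café
print(utf8_bytes.decode("iso-8859-1"))  # cafÃ©

# Pure ASCII content gives a reader no hint at all about the encoding.
ascii_bytes = b"plain text"
same = (
    ascii_bytes.decode("ascii")
    == ascii_bytes.decode("utf-8")
    == ascii_bytes.decode("iso-8859-1")
)
print(same)  # True
```

Nothing in the bytes themselves says which reading was intended, which is the gap a leading BOM fills.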

Many applications allow the user to specify the encoding. Some even offer explicit control over the BOM character... But really, most users don't have the knowledge to make informed decisions about these issues. Brackets targets programmers, but even most programmers don't understand the intricacies of text encodings. As such, Brackets has an important function. The standards it chooses to set and the defaults it chooses will have a big impact on the community, because most programmers will simply use these defaults without ever thinking about them. I therefore urge you to study this matter in great detail and take the time (and give this issue enough priority) so that you can pick the defaults that will leave us with as few problems as possible in 10 years' time, when thousands or even millions of files will have been created using these defaults.

Please write out the Unicode BOM as the first character of every new file by default

The BOM is specified in the Unicode standard. It's the lowest possible level where this could have been standardized (as opposed to in some OS or filesystem, or even at the application level). That is fantastic because it means we can get out of the encoding hell. Any application that claims to support Unicode must understand the BOM. So technically you can write out the BOM in front of every UTF-8 file you save and it should never cause any problems. If it does, the application reading the file is violating the Unicode standard.

I just really badly want all those millions of files that are (hopefully :) going to be created with Brackets to have BOMs in front of them, so text editors and any other software processing text files won't have to guess about the encoding, because they will know from the BOM that it is indeed UTF-8 encoded Unicode text and nothing else.

Now in practical reality there is still software around that does not fully support Unicode and has trouble with the BOM... This very issue being one example. For those scenarios it's good to have the option to suppress it. The suggestion from @busykai for a small GUI in the status bar sounds fantastic. Just please, please, please set the BOM to enabled for new files. It will save us so much pain in the long run.

@Download

Download commented Jan 2, 2016

In the meantime I read the user story on the backlog that was referenced in this issue and it is very interesting to read the discussion there in the context of the Unicode BOM. It specifically mentions that it is very hard to detect the encoding of a file.

If you embrace the BOM, this makes the world so much easier! Just:

  • read the first bytes of the file to detect the BOM
  • If it's there, great! You will know the encoding and you are done.
  • If it's not there life becomes more complex. You have to somehow decide on the encoding to use
    • Assume UTF-8 and scan the file for byte sequences that violate UTF-8. Have a look at this Wikipedia page: UTF-8#Codepage layout. The red cells are invalid byte sequences. This allows us to rule out UTF-8 in some scenarios.
    • If no illegal UTF-8 bytes were found, just open the file as UTF-8. This allows you to stay backward-compatible with older versions of Brackets
    • If illegal UTF-8 characters were found, open up a 'Specify encoding' dialog that lets the user pick from the list of whichever encodings you choose to support and then just open the file in that format.
    • Add an "Open non-Unicode file" menu option, or somehow integrate an encoding option in the normal Open dialog, that just skips the scanning for illegal UTF-8 byte sequences and directly asks the user to choose the input file encoding.
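
The decision flow in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: `choose_encoding` is a hypothetical name, the return values are Python codec names (`utf-8-sig`, `utf-16` and `utf-32` consume the BOM when decoding), and returning `None` stands in for opening the "Specify encoding" dialog.

```python
import codecs
from typing import Optional

# UTF-32 is checked before UTF-16 because FF FE 00 00 starts with FF FE.
BOM_TO_CODEC = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]


def choose_encoding(data: bytes) -> Optional[str]:
    """Pick a codec for the file content, or None if the user must be asked."""
    # Step 1: a BOM settles the question immediately.
    for bom, codec in BOM_TO_CODEC:
        if data.startswith(bom):
            return codec
    # Step 2: no BOM, so assume UTF-8 and look for invalid byte sequences.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # Step 3: not valid UTF-8, ask the user
    return "utf-8"


print(choose_encoding(codecs.BOM_UTF8 + b"abc"))  # utf-8-sig
print(choose_encoding("café".encode("utf-8")))    # utf-8
print(choose_encoding(b"caf\xe9"))                # None
```

The last call uses ISO-8859-1 bytes for "café": the lone E9 byte is not a valid UTF-8 sequence, so the sketch rules out UTF-8 and defers to the user.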

Starting to write out the BOM on saving UTF-8 files now will mean you will be able to get rid of the scanning for illegal UTF-8 sequences earlier or maybe even just skip it at all, because you will be able to reliably detect which encoding was used to save the file. And with you, all other tools that support Unicode. Long live the BOM! :)


16 participants