
Egg #5: Unicode Support


This post is a first draft of a proposal for adding Unicode support for strings in Idris.

The goal of this document is to outline different approaches along with their advantages and disadvantages. Hopefully this will lead to a discussion and the best alternative being chosen.

Implementation Alternatives

There are two possible ways to add Unicode capabilities to Idris strings:

  1. Unicode strings are defined in Idris source code as in Data.Text by @ziman.

  2. A new primitive type for Unicode strings is introduced and primitive operations are extended to this new string type.

The first approach has two advantages: it requires fewer primitive operations (no new ones have to be added), and more string operations can be implemented directly in Idris code. Furthermore, Data.Text already implements a number of String operations, and these could be reused.

The main advantage of the second method is that the different backends would be free to choose how they implement Unicode strings. This allows them to re-use their platform's native Unicode string libraries where these exist (e.g. Java and JavaScript). Furthermore, interoperability with other code on the platform becomes easier, because the conversion step between Idris' internal representation of Unicode strings and the platform's representation is avoided.

How Strings Currently Work

A brief overview of how strings currently work in Idris, to put the proposed changes in context.

  • Literal strings in the Idris source code are represented as Haskell Strings during compilation. The C back-end escapes non-ASCII characters when these are written out as C-string literals.

  • Literal characters in the Idris source code are represented as Haskell Chars during compilation. These are mapped to ints in the C back-end.

  • Idris has built-in types String and Char. Occurrences of these types in Idris code are mapped in TT to the constructors StrType and AType ITChar of the data type Const. AType stands (as far as I understand) for arithmetic type.

  • The following primitive functions on Strings and Chars exist; the different back-ends may or may not implement each of them. They are given in src/IRTS/Lang.hs.

    -- Primitive operators. Backends are not *required* to implement all
    -- of these, but should report an error if they are unable
    data PrimFn =
        -- ---- 8< ----
        | LStrHead          -- first character of a string
        | LStrTail          -- everything but the first character
        | LStrCons          -- prepend a character
        | LStrIndex         -- character at a given index
        | LStrRev           -- reverse a string
        | LStrConcat        -- append two strings
        | LStrLt            -- lexicographic less-than
        | LStrEq            -- string equality
        | LStrLen           -- string length
        | LIntStr IntTy     -- integer to string
        | LStrInt IntTy     -- string to integer
        | LFloatStr         -- float to string
        | LStrFloat         -- string to float
        | LPrintNum         -- print a number
        | LPrintStr         -- print a string
        | LReadStr          -- read a string from input
        | LChInt IntTy      -- character to its integer code point
        | LIntCh IntTy      -- integer code point to character
        -- ---- >8 ----
    
  • libs/prelude/Prelude/Strings.idr defines functions on String in terms of the primitive string functions described above. libs/prelude/Prelude/Chars.idr uses the primitive functions LChInt and LIntCh and defines additional functions for characters based on these (see the sketch after this list).

  • Functions for reading from and writing to file handles are defined in libs/prelude/Prelude.idr and libs/prelude/IO.idr. They are implemented as calls through the foreign function interface and rely on the fact that Strings and Chars are easily convertible to FString and FInt.
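To make the layering concrete, here is a minimal sketch of how such prelude definitions are built on the primitives. The prim__ names are approximations (I have not checked them against the source); the point is the shape: thin wrappers over primitives, with everything else in ordinary Idris code.

    -- Sketch only: the prim__ names approximate the real bindings
    -- behind LChInt and LIntCh.
    ord : Char -> Int
    ord = prim__charToInt

    chr : Int -> Char
    chr = prim__intToChar

    -- Character predicates are then plain Idris code on top:
    isDigit : Char -> Bool
    isDigit c = ord c >= ord '0' && ord c <= ord '9'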

Alternative 1: Defining Unicode Strings in Idris, on top of Bytestrings

This alternative consists of the following tasks:

  1. Change the current String type to ByteString.

  2. Move Data.Text into the main Idris repository and make the String type what Data.Text.Text is now. Data.Text.Text represents Unicode strings as UTF-8 encoded bytestrings (see the sketch after this list).

  3. Let literal strings be polymorphic in the same way that literal numbers are.

  4. Let characters represent Unicode codepoints and change primitive functions which use characters (e.g. consing characters onto strings) accordingly.

  5. The FFIs must handle strings as UTF-8 sequences and marshal these to the platform's strings when needed.
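As a self-contained model of task 2 (the real Data.Text differs in detail): a Unicode string is a bytestring that holds valid UTF-8 by convention, and the string operations respect character boundaries. The List-based ByteString below is only a stand-in for a packed primitive type.

    -- Model a bytestring as a list of byte values for the sketch.
    ByteString : Type
    ByteString = List Int          -- each entry is one byte, 0..255

    -- Invariant (by convention): the bytes are valid UTF-8.
    data Text = UTF8 ByteString

    -- Length in code points rather than bytes: count every byte that
    -- is not a UTF-8 continuation byte (continuation bytes have the
    -- form 10xxxxxx, i.e. dividing by 64 yields 2).
    textLength : Text -> Nat
    textLength (UTF8 bs) = length (filter (\b => b `div` 64 /= 2) bs)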

Alternative 2: Adding a Built-In Unicode String Type

Literals

  • Idris source files will be expected to be UTF-8 encoded (anything else is a compile-time error).

  • Literal strings and characters are thus UTF-8 encoded.

  • Unicode escapes: It must also be possible to write characters by escaping their code points. Such a character can be given as '\uXXXX', where XXXX is the code point as a hexadecimal number (examples below).
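Hypothetical examples of the proposed escape syntax (note that code points above U+FFFF need more than four hexadecimal digits; how to write those is left open here):

    -- These literals use the *proposed* syntax; they are not valid
    -- Idris today.
    lambda : Char
    lambda = '\u03BB'            -- GREEK SMALL LETTER LAMDA

    snowman : String
    snowman = "\u2603 melts"     -- a snowman, followed by " melts"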

Built-in Types

  • The Idris built-in type String will be Unicode capable. To make this as obvious as possible, it will be renamed to UStrType in the internal representation.

  • A new built-in type ByteString will be introduced. Internally it will map to a new constructor, BStrType.

  • A Char will represent a Unicode code point.

Primitive Functions

  • The built-in primitive functions will need versions for Unicode and byte strings. Some will only need to be implemented for one of the two.

    • Both Unicode and byte string versions:

      LStrHead, LStrTail, LStrCons, LStrIndex, LStrRev, LStrConcat, LStrLt, LStrEq, LStrLen

    • Only Unicode (byte string versions may be done in user space if needed):

      LIntStr IntTy, LStrInt IntTy, LFloatStr, LStrFloat, LPrintNum

    • Only byte string:

      LPrintStr, LReadStr

    It is expected that we will have more user space functions that wrap the respective primitive versions for convenience.

  • The character primitives will work on the new Unicode characters.

    LChInt IntTy, LIntCh IntTy

    No such functions are required for byte strings; their elements are just bytes ;)

  • Conversion functions: Additionally, the following primitive functions will be introduced to convert between Unicode and byte strings:

    LFromUtf8: Takes a byte string which must be legal UTF-8 and returns a Unicode string (or produces an error if the byte string is not legal UTF-8).

    LToUtf8: Takes a Unicode string and returns the UTF-8 representation as a byte string.

    Other conversion functions (e.g. LFromUtf16) are possible but need not be primitives (although this may be more efficient for back-ends which use these encodings internally for their strings).
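In user space these could be exposed roughly as follows. Everything here is hypothetical: the postulates stand in for the real primitive bindings, decoding failure is modelled with Either rather than a runtime error, and ByteString is the proposed built-in (or the List-based model from the Alternative 1 sketch).

    -- Hypothetical error type for decoding failures.
    data EncodingError = InvalidUtf8

    -- Wrappers over the proposed LFromUtf8 / LToUtf8 primitives.
    postulate fromUtf8 : ByteString -> Either EncodingError String
    postulate toUtf8   : String -> ByteString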

Foreign Function Interface

I think that conversion between Unicode strings and FFI strings should also be handled by the respective back-ends, because a conversion step may be necessary on some platforms (e.g. C) but not on others (e.g. JavaScript). This could be done implicitly by the back-end or with new built-in conversion functions LUStrToFStr and LFStrToUStr.

Functions in the Prelude

Additionally, functions in the prelude, especially in Prelude.idr, IO.idr and Prelude/Chars.idr, will have to be updated. The functions dealing with IO in Prelude.idr and IO.idr will need to work with ByteStrings and convert to and from Unicode correctly. The character functions will have to be beefed up to deal with all of Unicode. (I hope that I can steal most of these things from Rust's implementation.) I believe that most functions in Prelude/Strings.idr can stay the way they are.
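A sketch of the intended IO layering, continuing the hypothetical names from the conversion sketch above: the primitive reader yields raw bytes, and decoding to Unicode happens on top of it.

    -- Hypothetical primitive-level reader: yields whatever bytes
    -- were read, with no decoding.
    postulate readLineBytes : IO ByteString

    -- The user-facing reader decodes and surfaces encoding errors.
    getLine' : IO (Either EncodingError String)
    getLine' = map fromUtf8 readLineBytes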

Internal Representation of Unicode Strings

This still leaves us with the question of how we will encode Unicode strings in the back-ends we control (C and LLVM).

The choice of the internal representation is fairly orthogonal to the rest of this proposal: Since the representation is opaque in user-space it can be swapped out later.

That said, I believe that Unicode strings should be UTF-8 encoded. UTF-8 has the following advantages:

  • It is the most space-efficient encoding for typical text.

  • It is the standard in the *nix world and on the net. This avoids re-encodings.

  • It is not byte-order sensitive.

  • It is backward compatible: 7-bit ASCII strings are valid UTF-8. No null bytes.

The main disadvantage is that UTF-8 does not allow O(1) string indexing.

I think this might actually be an advantage, because indexing does not seem to be a very common operation with strings. Furthermore, giving up O(1) indexing allows us to choose more advanced internal representations (e.g. Ropes (PDF) or something similar) which will perform other common string operations (e.g. appending and prepending) more efficiently.
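To see why indexing is O(n), here is a sketch over the list-of-bytes model from the Alternative 1 section: finding code point number i means walking the bytes from the start, because characters vary in width.

    -- Byte offset of code point number i, or Nothing if the string
    -- has fewer code points. A lead byte (anything not of the form
    -- 10xxxxxx) starts a new code point.
    byteOffset : Nat -> Text -> Maybe Nat
    byteOffset i (UTF8 bs) = go i 0 bs
      where
        isLead : Int -> Bool
        isLead b = b `div` 64 /= 2

        go : Nat -> Nat -> List Int -> Maybe Nat
        go k off []          = Nothing
        go k off (b :: rest) =
          if isLead b
             then case k of
                       Z   => Just off
                       S j => go j (off + 1) rest
             else go k (off + 1) rest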

Rust, Go and Ruby currently use UTF-8 as their internal representation.

Additional Ideas (aka Future Work)

  • Operations on strings should be abstracted into one or more typeclasses (see the sketch after this list). This will allow us to swap in a different string type (e.g. a lazy string) where appropriate. Many properties are probably already covered by common typeclasses (e.g. Foldable, Traversable, Monoid).

  • We may want to have more (non-primitive) string types (e.g. lazy Unicode and byte strings) in the future.

  • We probably want more efficient primitive string operations. This will likely include destructive update; I am not sure what the right way to do this is in Idris (Haskell's ST monad?).

  • Heavier string and Unicode operations (e.g. transcoding, normalization, ...) will be implemented in a separate library (and may be backend specific - in C/LLVM we may want to depend on ICU).
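A hypothetical class along these lines; all names are placeholders, and instances could be given for both String and ByteString (and for lazy variants later):

    -- A minimal string abstraction as an Idris type class.
    class StringLike s where
      empty  : s
      cons   : Char -> s -> s
      uncons : s -> Maybe (Char, s)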

Related Work


Appendix: String Encodings

Some additional thoughts on the common Unicode string encodings out there:

UTF-8

UTF-8 has the following advantages (see also UTF-8 Everywhere):

  • Space efficient: Common ASCII characters take only one byte. This matters because text processed by a computer (text files, XML, HTML, ...) often consists largely of ASCII characters such as markup, whitespace and identifiers. UTF-8 characters take at most 4 bytes, which is the same worst case as the other encodings (a small sketch below spells out the ranges).

  • Prevalent: UTF-8 is the default encoding on *nix systems and is also used for much of HTML, XML and other text-based file formats. Such text need not be decoded when read or encoded when written, which makes this format more efficient.

  • Not byte-order sensitive.

  • Backwards compatible: 7-bit ASCII is valid UTF-8. No null bytes.

And the following disadvantages:

  • No constant-time string indexing (see the separate point above on why I think this is acceptable).
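The worst-case claim can be read off the standard code-point ranges; a minimal sketch:

    -- Bytes needed to encode a code point in UTF-8 (standard ranges;
    -- code points end at 0x10FFFF).
    utf8Width : Int -> Nat
    utf8Width cp =
      if cp < 0x80 then 1          -- ASCII
      else if cp < 0x800 then 2
      else if cp < 0x10000 then 3
      else 4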

UTF-8 is now used as the internal encoding for strings by Rust, Go and Ruby.

UTF-16

UTF-16 was initially the same as UCS-2 and was intended as a fixed-width character encoding, but it is now also variable-width (using surrogate pairs) because Unicode code points require 21 bits.

This is what Java and C# use. At least in the case of Java this was motivated by historical reasons; for C# I am not so sure (maybe it was chosen because this encoding is prevalent on Windows).

It can be argued that this encoding has the worst of both variable-width and fixed-width encodings: it requires a byte order mark, it is less memory efficient than UTF-8 for much of the text found on a computer, and one still loses constant-time character indexing. Worse, characters outside the BMP (encoded as surrogate pairs) appear infrequently enough that many implementations don't get them right (UTF-8 Everywhere, Should UTF-16 be considered harmful?).

UTF-32 / UCS-4

UTF-32 is a fixed-width encoding which uses four bytes per character.

This has the advantage of giving strings which allow O(1) access.

However, this encoding is very space-inefficient. Unicode needs at most 21 bits to represent all characters, which means that 11 of the 32 bits of each character will never be used. This is in addition to the fact that many characters are in the 7-bit ASCII charset and need only one byte in UTF-8 and two bytes in UTF-16.

Mixed Encodings

Python represents Unicode strings in one of three ways: as Latin-1, as UCS-2 (if there is at least one character that is not in Latin-1), or as UCS-4 (the same as UTF-32) if there is at least one character that is not in UCS-2.

The advantage of this encoding is that all characters have the same width. However, this encoding has the following disadvantages:

  • UCS-4 is very space inefficient (see above). If there is a single character that is not in the Basic Multilingual Plane (such as, for example, an emoji smiley), then the entire string has to be UCS-4 encoded (see the worked example after this list).

  • Re-encoding: if a character is appended to a string and this new character is outside the previously used character set, the entire string not only has to be copied but must also be re-encoded.

  • Byte-ordering: Multi-byte encodings other than UTF-8 have the byte ordering issue which increases the error potential.

  • Re-encoding of strings from elsewhere: A lot of text nowadays is UTF-8 encoded and a lot of protocols have UTF-8 as their standard encoding. Reading these strings requires an additional decoding step and writing them back requires an encoding step.

  • Complex: The implementation seems quite complex. The main benefit of this approach seems to be that one has constant time indexing.
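A worked example of the size cliff, assuming a 1000-character ASCII string to which a single emoji outside the Basic Multilingual Plane is appended: under the mixed scheme the string grows from 1000 bytes (Latin-1) to 1001 × 4 = 4004 bytes, and every existing character must be re-encoded along the way; under UTF-8 it simply grows from 1000 bytes to 1004 bytes.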


