Strings and bytes in Cython

Giovanni Torres edited this page Aug 19, 2017 · 1 revision

Python 2

The default string type, str, is a byte string; unicode is a separate type.

>>> type("a")
<type 'str'>
>>> type(b'a')
<type 'str'>
>>> type(u'a')
<type 'unicode'>
>>> type("a".encode("UTF-8"))
<type 'str'>
>>> type("a".decode("UTF-8"))
<type 'unicode'>

Python 3

The default string type, str, is unicode text; bytes is a separate type.

>>> type("a")
<class 'str'>
>>> type(b'a')
<class 'bytes'>
>>> type(u'a')
<class 'str'>
>>> type("a".encode("UTF-8"))
<class 'bytes'>
>>> type("a".decode("UTF-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

Receiving char* from C

A function to decode a C character pointer:

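cdef unicode tounicode(char* s):
    # A NULL pointer has no string to decode.
    if s == NULL:
        return None
    # Decode the C bytes as UTF-8; invalid sequences become U+FFFD.
    return s.decode("UTF-8", "replace")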

In Python 2, c_string is decoded to type unicode.

>>> type(c_string.decode("UTF-8"))
<type 'unicode'>

In Python 3, c_string is decoded to type str, which is unicode.

>>> type(c_string.decode("UTF-8"))
<class 'str'>
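
As a rough illustration of what the "replace" error handler in tounicode does, here is a pure-Python stand-in. The name tounicode_py is hypothetical, and None plays the role of the NULL pointer; this is a sketch of the decoding behaviour only, not of the Cython function itself.

```python
def tounicode_py(raw):
    """Pure-Python stand-in for the Cython tounicode() helper (hypothetical name)."""
    if raw is None:  # plays the role of the NULL check
        return None
    # Invalid UTF-8 sequences are replaced with U+FFFD instead of raising.
    return raw.decode("UTF-8", "replace")


valid = tounicode_py(b"caf\xc3\xa9")   # well-formed UTF-8
broken = tounicode_py(b"caf\xe9")      # Latin-1 byte, not valid UTF-8
missing = tounicode_py(None)
```

Here valid is u"caf\u00e9", broken is u"caf\ufffd", and missing is None. Without "replace", the second call would raise UnicodeDecodeError.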

Passing a string to a C function

c_function(item)

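Python 2: item should be a str (which is bytes in Python 2) and needs no conversion. Calling .encode("UTF-8") on it still returns a str that can be passed to C, but only for ASCII data: Python 2 first decodes the str implicitly as ASCII, so non-ASCII bytes raise UnicodeDecodeError. A unicode item must be encoded.

Python 3: C needs bytes, so a str item must be encoded first; .encode("UTF-8") converts it to bytes, which can then be passed to C.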
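
The conversion described above can be wrapped in a small helper that behaves the same on both Python versions. This is a minimal sketch: to_c_bytes and text_type are hypothetical names, not part of Cython or of this page's code, and the helper encodes only text so that byte strings pass through untouched.

```python
import sys

# Text type: `unicode` on Python 2, `str` on Python 3.
# The conditional expression never evaluates `unicode` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_c_bytes(item):
    """Return a bytes object suitable for passing as a char* (hypothetical helper)."""
    if isinstance(item, bytes):
        # Already bytes (this includes a Python 2 str): pass through unchanged.
        return item
    if isinstance(item, text_type):
        # Unicode text: encode to UTF-8 bytes.
        return item.encode("UTF-8")
    raise TypeError("expected text or bytes, got %s" % type(item).__name__)


encoded = to_c_bytes(u"caf\u00e9")
unchanged = to_c_bytes(b"abc")
```

With such a helper the call would become c_function(to_c_bytes(item)). Checking for bytes first avoids the implicit ASCII decode that .encode() triggers on a Python 2 str.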

Summary

  • .encode() when passing to C (converts to bytes; a Python 2 str is already bytes)
  • .decode() when receiving from C (converts to unicode; a Python 3 str is unicode)