aboutsummaryrefslogtreecommitdiff
path: root/docs/advanced/cast/strings.rst
blob: 271716b4b300a076be22cf005d004f615f0d0470 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
Strings, bytes and Unicode conversions
######################################

Passing Python strings to C++
=============================

When a Python ``str`` is passed from Python to a C++ function that accepts
``std::string`` or ``char *`` as arguments, pybind11 will encode the Python
string to UTF-8. All Python ``str`` can be encoded in UTF-8, so this operation
does not fail.

The C++ language is encoding agnostic. It is the responsibility of the
programmer to track encodings. It's often easiest to simply `use UTF-8
everywhere <http://utf8everywhere.org/>`_.

.. code-block:: c++

    m.def("utf8_test",
        [](const std::string &s) {
            cout << "utf-8 is icing on the cake.\n";
            cout << s;
        }
    );
    m.def("utf8_charptr",
        [](const char *s) {
            cout << "My favorite food is\n";
            cout << s;
        }
    );

.. code-block:: pycon

    >>> utf8_test("🎂")
    utf-8 is icing on the cake.
    🎂

    >>> utf8_charptr("🍕")
    My favorite food is
    🍕

.. note::

    Some terminal emulators do not support UTF-8 or emoji fonts and may not
    display the example above correctly.

The results are the same whether the C++ function accepts arguments by value or
reference, and whether or not ``const`` is used.

Passing bytes to C++
--------------------

A Python ``bytes`` object will be passed to C++ functions that accept
``std::string`` or ``char*`` *without* conversion.  In order to make a function
*only* accept ``bytes`` (and not ``str``), declare it as taking a ``py::bytes``
argument.


Returning C++ strings to Python
===============================

When a C++ function returns a ``std::string`` or ``char*`` to a Python caller,
**pybind11 will assume that the string is valid UTF-8** and will decode it to a
native Python ``str``, using the same API as Python uses to perform
``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will
raise a ``UnicodeDecodeError``.

.. code-block:: c++

    m.def("std_string_return",
        []() {
            return std::string("This string needs to be UTF-8 encoded");
        }
    );

.. code-block:: pycon

    >>> isinstance(example.std_string_return(), str)
    True


Because UTF-8 is inclusive of pure ASCII, there is never any issue with
returning a pure ASCII string to Python. If there is any possibility that the
string is not pure ASCII, it is necessary to ensure the encoding is valid
UTF-8.

.. warning::

    Implicit conversion assumes that a returned ``char *`` is null-terminated.
    If there is no null terminator a buffer overrun will occur.

Explicit conversions
--------------------

If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one
can perform a explicit conversion and return a ``py::str`` object. Explicit
conversion has the same overhead as implicit conversion.

.. code-block:: c++

    // This uses the Python C API to convert Latin-1 to Unicode
    m.def("str_output",
        []() {
            std::string s = "Send your r\xe9sum\xe9 to Alice in HR"; // Latin-1
            py::handle py_s = PyUnicode_DecodeLatin1(s.data(), s.length(), nullptr);
            if (!py_s) {
                throw py::error_already_set();
            }
            return py::reinterpret_steal<py::str>(py_s);
        }
    );

.. code-block:: pycon

    >>> str_output()
    'Send your résumé to Alice in HR'

The `Python C API
<https://docs.python.org/3/c-api/unicode.html#built-in-codecs>`_ provides
several built-in codecs. Note that these all return *new* references, so
use :cpp:func:`reinterpret_steal` when converting them to a :cpp:class:`str`.


One could also use a third party encoding library such as libiconv to transcode
to UTF-8.

Return C++ strings without conversion
-------------------------------------

If the data in a C++ ``std::string`` does not represent text and should be
returned to Python as ``bytes``, then one can return the data as a
``py::bytes`` object.

.. code-block:: c++

    m.def("return_bytes",
        []() {
            std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
            return py::bytes(s);  // Return the data without transcoding
        }
    );

.. code-block:: pycon

    >>> example.return_bytes()
    b'\xba\xd0\xba\xd0'


Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without
encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly.

.. code-block:: c++

    m.def("asymmetry",
        [](std::string s) {  // Accepts str or bytes from Python
            return s;  // Looks harmless, but implicitly converts to str
        }
    );

.. code-block:: pycon

    >>> isinstance(example.asymmetry(b"have some bytes"), str)
    True

    >>> example.asymmetry(b"\xba\xd0\xba\xd0")  # invalid utf-8 as bytes
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte


Wide character strings
======================

When a Python ``str`` is passed to a C++ function expecting ``std::wstring``,
``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be
encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each
type, in the platform's native endianness. When strings of these types are
returned, they are assumed to contain valid UTF-16 or UTF-32, and will be
decoded to Python ``str``.

.. code-block:: c++

    #define UNICODE
    #include <windows.h>

    m.def("set_window_text",
        [](HWND hwnd, std::wstring s) {
            // Call SetWindowText with null-terminated UTF-16 string
            ::SetWindowText(hwnd, s.c_str());
        }
    );
    m.def("get_window_text",
        [](HWND hwnd) {
            const int buffer_size = ::GetWindowTextLength(hwnd) + 1;
            auto buffer = std::make_unique< wchar_t[] >(buffer_size);

            ::GetWindowText(hwnd, buffer.data(), buffer_size);

            std::wstring text(buffer.get());

            // wstring will be converted to Python str
            return text;
        }
    );

Strings in multibyte encodings such as Shift-JIS must transcoded to a
UTF-8/16/32 before being returned to Python.


Character literals
==================

C++ functions that accept character literals as input will receive the first
character of a Python ``str`` as their input. If the string is longer than one
Unicode character, trailing characters will be ignored.

When a character literal is returned from C++ (such as a ``char`` or a
``wchar_t``), it will be converted to a ``str`` that represents the single
character.

.. code-block:: c++

    m.def("pass_char", [](char c) { return c; });
    m.def("pass_wchar", [](wchar_t w) { return w; });

.. code-block:: pycon

    >>> example.pass_char("A")
    'A'

While C++ will cast integers to character types (``char c = 0x65;``), pybind11
does not convert Python integers to characters implicitly. The Python function
``chr()`` can be used to convert integers to characters.

.. code-block:: pycon

    >>> example.pass_char(0x65)
    TypeError

    >>> example.pass_char(chr(0x65))
    'A'

If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t``
as the argument type.

Grapheme clusters
-----------------

A single grapheme may be represented by two or more Unicode characters. For
example 'é' is usually represented as U+00E9 but can also be expressed as the
combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by
a combining acute accent). The combining character will be lost if the
two-character sequence is passed as an argument, even though it renders as a
single grapheme.

.. code-block:: pycon

    >>> example.pass_wchar("é")
    'é'

    >>> combining_e_acute = "e" + "\u0301"

    >>> combining_e_acute
    'é'

    >>> combining_e_acute == "é"
    False

    >>> example.pass_wchar(combining_e_acute)
    'e'

Normalizing combining characters before passing the character literal to C++
may resolve *some* of these issues:

.. code-block:: pycon

    >>> example.pass_wchar(unicodedata.normalize("NFC", combining_e_acute))
    'é'

In some languages (Thai for example), there are `graphemes that cannot be
expressed as a single Unicode code point
<http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`_, so there is
no way to capture them in a C++ character type.


C++17 string views
==================

C++17 string views are automatically supported when compiling in C++17 mode.
They follow the same rules for encoding and decoding as the corresponding STL
string type (for example, a ``std::u16string_view`` argument will be passed
UTF-16-encoded data, and a returned ``std::string_view`` will be decoded as
UTF-8).

References
==========

* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
* `C++ - Using STL Strings at Win32 API Boundaries <https://msdn.microsoft.com/en-ca/magazine/mt238407.aspx>`_