| MBRTOWC(3) | Library Functions Manual | MBRTOWC(3) |
mbrtowc, mbrtoc32 - convert a multibyte character to a wide character (restartable)
#include <wchar.h>
size_t
mbrtowc(wchar_t * restrict wc, const char * restrict s, size_t n, mbstate_t * restrict mbs);
#include <uchar.h>
size_t
mbrtoc32(char32_t * restrict wc, const char * restrict s, size_t n, mbstate_t * restrict mbs);
The
mbrtowc()
and
mbrtoc32()
functions examine at most n bytes of the multibyte
character byte string pointed to by s, convert those
bytes to a wide character, and store the wide character into
*wc if wc is not
NULL and s points to a valid
character.
Conversion happens in accordance with the
conversion state *mbs, which must be initialized to
zero before the application's first call to
mbrtowc()
or
mbrtoc32().
If the previous call did not return (size_t)-1,
mbs can safely be reused without reinitialization.
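As a minimal sketch of the state handling described above, the following zeroes an mbstate_t once and then reuses it across successive calls. The helper name count_chars() is hypothetical and not part of any library.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: counts the characters in the NUL-terminated
 * multibyte string s.  Returns (size_t)-1 if the string contains an
 * invalid or incomplete sequence.
 */
size_t
count_chars(const char *s)
{
    mbstate_t mbs;
    wchar_t wc;
    size_t len, count = 0;
    size_t remaining = strlen(s);

    /* Initialize the conversion state to zero before the first call. */
    memset(&mbs, 0, sizeof(mbs));

    while (remaining > 0) {
        len = mbrtowc(&wc, s, remaining, &mbs);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)
            break;          /* NUL character reached */
        /* No error occurred, so mbs is reused as-is for the next call. */
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

For an ASCII-only string, count_chars("hello") yields 5 in any locale; for other input the result depends on the LC_CTYPE category in effect.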
The input encoding that
mbrtowc()
and
mbrtoc32()
use for s is determined by the
LC_CTYPE category of the current locale. If the
locale is changed without reinitialization of *mbs,
the behaviour is undefined.
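One way to respect this ordering is to select the locale first and only then create the conversion state. The sketch below assumes the locale is taken from the environment; first_char32() is a hypothetical helper used only for illustration.

```c
#include <locale.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: decodes the first character of the
 * NUL-terminated multibyte string s with mbrtoc32().
 * Returns 1 and stores the character in *out on success, 0 otherwise.
 */
int
first_char32(const char *s, char32_t *out)
{
    mbstate_t mbs;
    size_t len;

    /*
     * Choose the input encoding before the state object exists.
     * If setlocale() fails, the previous locale stays in effect.
     */
    (void)setlocale(LC_CTYPE, "");

    /* The state is zeroed after the locale change, never before. */
    memset(&mbs, 0, sizeof(mbs));

    /* Include the terminating NUL so an empty string converts too. */
    len = mbrtoc32(out, s, strlen(s) + 1, &mbs);
    return len != (size_t)-1 && len != (size_t)-2 && len != (size_t)-3;
}
```

A real program would normally call setlocale(3) once at startup rather than inside a helper; it is done here only to keep the ordering visible in one place.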
Unlike
mbtowc(3),
mbrtowc()
and
mbrtoc32()
accept an incomplete byte sequence pointed to by s
which does not form a complete character but is potentially part of a valid
character. In this case, both functions consume all such bytes. The
conversion state saved in *mbs will be used to restart
the suspended conversion during the next call.
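The restart behaviour can be illustrated by feeding input one byte at a time, as might happen when reading from a slow stream. This is a sketch under that assumption; decode_bytewise() is a hypothetical name.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: passes buf to mbrtowc() one byte per call and
 * stores up to outmax wide characters in out.  Returns the number of
 * wide characters stored, or (size_t)-1 on an encoding error.
 */
size_t
decode_bytewise(const char *buf, size_t buflen, wchar_t *out, size_t outmax)
{
    mbstate_t mbs;
    wchar_t wc;
    size_t i, len, nout = 0;

    memset(&mbs, 0, sizeof(mbs));
    for (i = 0; i < buflen && nout < outmax; i++) {
        len = mbrtowc(&wc, buf + i, 1, &mbs);
        if (len == (size_t)-1)
            return (size_t)-1;
        if (len == (size_t)-2)
            continue;       /* byte consumed; mbs remembers the partial character */
        out[nout++] = wc;   /* this byte completed a character */
    }
    return nout;
}
```

In a UTF-8 locale, the first byte of a two-byte character makes mbrtowc() return (size_t)-2, and the call with the second byte completes the character. With ASCII input every call completes a character immediately.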
On systems other than
OpenBSD that support state-dependent encodings,
s may point to a special sequence of bytes called a
“shift sequence”. Shift sequences switch between character
code sets available within an encoding scheme. One encoding scheme using
shift sequences is ISO/IEC 2022-JP, which can switch e.g. from ASCII (which
uses one byte per character) to JIS X 0208 (which uses two bytes per
character). Shift sequence bytes correspond to no individual wide character,
so
mbrtowc()
and
mbrtoc32()
treat them as if they were part of the subsequent multibyte character.
Therefore they do contribute to the number of bytes in the multibyte
character.
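The following arguments cause special processing:
wc == NULL
The conversion is performed and the conversion state may be updated, but the resulting wide character is discarded.
s == NULL
The arguments wc and n are ignored and the call behaves as if s pointed to an empty string; the conversion result is discarded.
mbs == NULL
mbrtowc()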
and
mbrtoc32()
each use their own internal state object instead of the
mbs argument. Both internal state objects are
initialized at startup time of the program, and no other libc function
ever changes either of them.
If
mbrtowc()
or
mbrtoc32()
is called with a NULL mbs
argument and that call returns (size_t)-1, the
internal conversion state of the respective function becomes permanently
undefined and there is no way to reset it to any defined state.
Consequently, after such a mishap, it is not safe to call the same
function with a NULL mbs
argument ever again until the program is terminated.
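The functions return:
0
The bytes pointed to by s form a terminating NUL character. If wc is not NULL, a NUL wide character has been stored in the wchar_t object pointed to by wc.
positive
s points to a valid character, and the value returned is the number of bytes completing the character. If wc is not NULL, the corresponding wide character has been stored in the wchar_t object pointed to by wc.
(size_t)-1
s points to an invalid byte sequence, or mbs points to an invalid or uninitialized mbstate_t object. errno is set to EILSEQ or EINVAL, respectively. The conversion state object pointed to by mbs is left in an undefined state and must be reinitialized before being used again.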
Because applications using mbrtowc()
or mbrtoc32() are shielded from the specifics of
the multibyte character encoding scheme, it is impossible to repair byte
sequences containing encoding errors. Such byte sequences must be
treated as invalid and potentially malicious input. Applications must
stop processing the byte string pointed to by s
and either discard any wide characters already converted, or cope with
truncated input.
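(size_t)-2
s points to an incomplete byte sequence of length n which has been consumed and is potentially part of a valid multibyte character. The function may be called again with additional bytes to continue the conversion.
(size_t)-3
A character resulting from a previous call has been stored without consuming any bytes from s. This never happens for mbrtowc(), and on OpenBSD, it never happens for mbrtoc32() either.
mbrtowc() and mbrtoc32() cause an error in the following cases:
[EILSEQ]
s points to an invalid multibyte character.
[EINVAL]
mbs points to an invalid or uninitialized mbstate_t object.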
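The advice to stop at the first encoding error and to reinitialize the conversion state might be followed along these lines. This is a sketch only; convert_until_error() and its bad flag are hypothetical names.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: converts the NUL-terminated multibyte string s
 * into dst, storing at most dstmax wide characters.  Sets *bad to 1
 * if conversion stopped at an invalid or incomplete sequence, so the
 * caller can discard dst or treat it as truncated input.
 * Returns the number of wide characters stored.
 */
size_t
convert_until_error(const char *s, wchar_t *dst, size_t dstmax, int *bad)
{
    mbstate_t mbs;
    size_t len, n = 0;
    size_t remaining = strlen(s);

    *bad = 0;
    memset(&mbs, 0, sizeof(mbs));
    while (remaining > 0 && n < dstmax) {
        len = mbrtowc(&dst[n], s, remaining, &mbs);
        if (len == (size_t)-1 || len == (size_t)-2) {
            /* Do not try to skip or repair bytes: stop here. */
            *bad = 1;
            /* The state is no longer usable; zero it before any reuse. */
            memset(&mbs, 0, sizeof(mbs));
            break;
        }
        if (len == 0)
            break;          /* NUL character reached */
        s += len;
        remaining -= len;
        n++;
    }
    return n;
}
```

A caller that cannot tolerate truncated input would check the bad flag and throw away everything in dst when it is set.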
mbrtowc() conforms to
ISO/IEC 9899/AMD1:1995 (“ISO C90, Amendment
1”). The restrict qualifier was added at
ISO/IEC 9899:1999
(“ISO C99”).
mbrtoc32() conforms to
ISO/IEC 9899:2011
(“ISO C11”).
mbrtowc() has been available since
OpenBSD 3.8 and has provided support for UTF-8 since
OpenBSD 4.8.
mbrtoc32() has been available since
OpenBSD 7.4.
mbrtowc() and
mbrtoc32() are not suitable for programs that care
about internals of the character encoding scheme used by the byte string
pointed to by s.
It is possible that these functions fail because of locale configuration errors. An “invalid” character sequence may simply be encoded in a different encoding than that of the current locale.
The special cases for s
== NULL and
mbs ==
NULL do not make any sense. Instead of passing
NULL for mbs,
mbtowc(3) can be used.
Earlier versions of this man page implied that calling
mbrtowc() with a NULL
s argument would always set mbs
to the initial conversion state. But this is true only if the previous call
to mbrtowc() using mbs did not
return (size_t)-1 or (size_t)-2. It is recommended to zero the mbstate_t
object instead.
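Zeroing the object, as recommended above, needs nothing more than memset(3). In the sketch below, reset_state() is a hypothetical helper name.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: returns *mbs to the initial conversion state
 * regardless of what the previous conversion call returned.
 */
void
reset_state(mbstate_t *mbs)
{
    memset(mbs, 0, sizeof(*mbs));
}
```

After reset_state(&mbs), mbsinit(&mbs) reports the initial conversion state, and the object can be passed to mbrtowc() or mbrtoc32() again.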
| September 12, 2023 | openbsd |