UTF-16/UCS-2

UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode that can encode the entire Unicode repertoire by mapping each code point to a sequence of one or two 16-bit code units. For characters in the Basic Multilingual Plane (BMP), the encoding is a single code unit equal to the code point. For characters in the other planes, the encoding is a pair of code units called a surrogate pair.
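
The mapping described above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name utf16_code_units is hypothetical, and the surrogate arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves added to 0xD800 and 0xDC00) follows the usual description of the encoding form.

```python
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (as integers) for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are reserved and do not encode characters.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        # BMP: a single code unit equal to the code point.
        return [code_point]
    # Other planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return [high, low]


# U+20AC (euro sign) is in the BMP; U+1D11E (musical G clef) is not.
bmp_units = utf16_code_units(0x20AC)
supplementary_units = utf16_code_units(0x1D11E)
```

Here bmp_units holds one code unit and supplementary_units holds two, which is the variable-length behavior the paragraph describes.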

UTF-16 is officially defined in Annex Q of the international standard ISO/IEC 10646-1. It is also described in The Unicode Standard version 2.0 and higher, as well as in the IETF's RFC 2781.

The older UCS-2 (2-byte Universal Character Set) standard is a similar character encoding that was superseded by UTF-16 in Unicode version 2.0, though it remains in use. UCS-2 is a fixed-length encoding: it always encodes a character as a single 16-bit code unit. It does not support surrogate pairs and can therefore encode only characters in the BMP range U+0000 through U+FFFF.
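
As a rough illustration of the BMP restriction, the following Python sketch checks whether a string could be represented in UCS-2 at all. The function name fits_in_ucs2 is hypothetical, and the check assumes only what is stated above: every character must have a code point no greater than U+FFFF.

```python
def fits_in_ucs2(text: str) -> bool:
    """Return True if every character in text lies in the BMP.

    Hypothetical helper for illustration only.
    """
    # ord() gives the code point of each character in a Python 3 str.
    return all(ord(ch) <= 0xFFFF for ch in text)


euro_ok = fits_in_ucs2("\u20ac")        # U+20AC, inside the BMP
clef_ok = fits_in_ucs2("\U0001d11e")    # U+1D11E, outside the BMP
```

Under these assumptions euro_ok is True and clef_ok is False, since U+1D11E needs a surrogate pair that UCS-2 does not provide.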

Because of the technical similarities and the upward compatibility from UCS-2 to UTF-16, the two are often conflated and treated as interchangeable, so that strings encoded in UTF-16 are sometimes misidentified as UCS-2.
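
One way to see the difference in practice is to read a byte string as plain 16-bit units, as a UCS-2 reader would, and look for a surrogate pair. The sketch below assumes big-endian data without a byte order mark; the names read_16bit_units and has_surrogate_pair are hypothetical.

```python
import struct
from typing import List


def read_16bit_units(data: bytes) -> List[int]:
    """Split big-endian bytes into 16-bit units, as a UCS-2 reader would."""
    if len(data) % 2 != 0:
        raise ValueError("data length must be a multiple of 2 bytes")
    count = len(data) // 2
    return list(struct.unpack(f">{count}H", data))


def has_surrogate_pair(units: List[int]) -> bool:
    """Return True if a high surrogate is directly followed by a low one."""
    for first, second in zip(units, units[1:]):
        if 0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF:
            return True
    return False


# "a" followed by U+1D11E, encoded with Python's built-in UTF-16 codec.
data = "a\U0001d11e".encode("utf-16-be")
units = read_16bit_units(data)
looks_like_utf16 = has_surrogate_pair(units)
```

A reader that treats these units as UCS-2 sees three units for a two-character string. The presence of a surrogate pair indicates that the data is UTF-16, although its absence does not distinguish the two encodings, because BMP-only text is encoded identically in both.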