D1515R0
Code points, scalar values, UTF-8, and WTF-8

Draft Proposal,

Author:
Audience:
SG16
Project:
ISO/IEC JTC1/SC22/WG21 14882: Programming Language — C++
This version:
https://wg21.link/p1515r0
Latest version:
https://wg21.link/p1515

Abstract

This paper tries to frame a discussion around the question of whether SG16 should try to support code points or scalar values or both.

1. Definitions

A Unicode code point is a value in the Unicode codespace, that is, it is a value in the range of integers from 0 to 10FFFF16.

A Unicode scalar value is a code point that is not a surrogate code point, that is, it is a value in the ranges of integers 0 to D7FF16 and E00016 to 10FFFF16.

2. Does it matter?

All of the UTF encodings in [Unicode] are defined to support only the ranges of scalar values. However, there are other encoding formats in use that are able to encode the entire range of code points, as is the case with [WTF-8] or of Windows filenames and JavaScript strings.

The Unicode standard explicitly acknowledges and allows this in Unicode strings:

Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences.

[...]

Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form.

3. The tradeoff

When designing APIs, SG16 should take in consideration whether to give prominence to code points and/or to scalar values.

3.1. We like code points

Preferring code points supports lossless decoding of encodings that support unpaired surrogates. These encodings are usually hacks to handle idiosyncracies of particular systems, but such systems exist that are widely deployed; notorious offenders being JavaScript and Windows.

On the other hand, for encodings that cannot encode unpaired surrogates, like UTF-8, UTF-16, and UTF-32, accepting code point streams as input means that error handling mechanisms must be invoked for dealing with surrogates in the input.

3.2. We like scalar values

Preferring scalar values guarantees that encodings like UTF-8, UTF-16, and UTF-32 can always losslessly encode any input stream. Encoding a valid stream of scalar values with these encodings thus requires no error checking.

However, this also means that encodings like WTF-8 cannot be decoded losslessly, as the sequences that encode surrogate code points will trigger the error handling mechanisms in order to produce scalar values.

3.3. We like both

Supporting both code points and scalar values effectively punts the tradeoff described here to the user. The cost of this is an API that is more complex and less intuitive, and perhaps easier to misuse.

Appendix A: Revision History

Revision 0 - 06 March 2019

Drafting

References

Informative References

[Unicode]
The Unicode Standard. URL: https://www.unicode.org/versions/latest/
[WTF-8]
Simon Sapin. The WTF-8 encoding. 12 May 2018. URL: https://simonsapin.github.io/wtf-8/