=item *
-If Unicode is used in hash keys, there is a subtle effect on the hashes.
-The hash becomes "Unicode-sticky" so that keys retrieved from the hash
-(either by %hash, each %hash, or keys %hash) will be in Unicode, not
-in bytes, even when the keys were bytes went they "went in". This
-"stickiness" persists unless the hash is completely emptied, either by
-using delete() or clearing the with undef() or assigning an empty list
-to the hash. Most of the time this difference is negligible, but
-there are few places where it matters: for example the regular
-expression character classes like C<\w> behave differently for
-bytes and characters.
-
-=item *
-
If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. (You are currently on your own when it comes to using the
Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000...U+D7FF,
the 90..BF in U+10000..U+3FFFF, and the 80...8F in U+100000..U+10FFFF.
+The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings:
+it is technically possible to UTF-8-encode a single code point in different
+ways, but that is explicitly forbidden, and the shortest possible encoding
+must always be used (and that is what Perl does).
+
Or, another way to look at it, as bits:
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
and the decoding is
- $uni = 0x10000 + ($hi - 0xD8000) * 0x400 + ($lo - 0xDC00);
+ $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on (C<-w> or C<use