Perl normally assumes character semantics in the presence of
character data (i.e. data that has come from a source that has
-been marked as being of a particular character encoding) or when
-the global $^U flag is enabled. [XXX: implement -C command line
-switch and mention that instead of $^U?]
+been marked as being of a particular character encoding).
To understand the implications and differences between character
semantics and byte semantics, see L<perlunicode>.
package utf8;
-$^U = 1 if caller and caller eq 'main'; # they are unicode aware
- # XXX split this out?
-
sub import {
$^H |= 0x00800000;
$enc{caller()} = $_[1] if $_[1];
=item *
-As a side effect, when this pragma is used within the main package,
-it also enables Unicode character semantics for the entire program.
-See L<perlunicode> for more on that.
-
-[XXX: split this out into separate "pragma" and/or -C command-line
-switch?]
-
-=item *
-
In the absence of inputs marked as UTF-8, regular expressions within the
scope of this pragma will default to using character semantics instead
of byte semantics.
@chars = split //, $data; # splits characters
}
-[XXX: Should this should be enabled like chr()/sprintf("%c") by looking
-at $^U instead?]
-
=head1 SEE ALSO
L<perlunicode>, L<byte>
L</Character encodings for input and output>, we'll see how such
inputs may be marked as being Unicode character data sources.
-One particular condition will enable character semantics on the entire
-program, bypassing the compatibility mode: if the C<$^U> global flag is
-set to C<1>, nearly all operations will use character semantics by
-default. As an added convenience, if the C<utf8> pragma is used in the
-C<main> package, C<$^U> is enabled automatically. [XXX: Should there
-be a -C switch to enable $^U?]
+If the C<$^U> global flag is set to C<1>, all system calls will use the
+corresponding wide character APIs. This is currently only implemented
+on Windows. [XXX: Should there be a -C switch to enable $^U?]
Regardless of the above, the C<byte> pragma can always be used to force
byte semantics in a particular lexical scope. See L<byte>.
=item $^U
-Global flag that switches on Unicode character support in the Perl
-interpreter. The initial value is usually C<0> for compatibility
-with Perl versions earlier than 5.6, but may be automatically set
-to C<1> by Perl if the system provides a user-settable default
-(e.g., C<$ENV{LC_CTYPE}>). It is also implicitly set to C<1>
-whenever the utf8 pragma is loaded.
+Global flag that enables system calls made by Perl to use wide character
+APIs native to the system, if available. This is currently only implemented
+on the Windows platform.
-Setting it to C<1> has the following effects:
+The initial value is typically C<0> for compatibility with Perl versions
+earlier than 5.6, but may be automatically set to C<1> by Perl if the system
+provides a user-settable default (e.g., C<$ENV{LC_CTYPE}>).
-=over
-
-=item *
-
-C<chr> produces UTF-8 encoded Unicode characters. These are the same
-as the corresponding ASCII characters if the argument is less than 128.
-
-=item *
-
-The C<%c> format in C<sprintf> generates a UTF-8 encoded Unicode
-character. This is the same as the corresponding ASCII character
-if the argument is less than 128.
-
-=item *
-
-Any system calls made by Perl will use wide character APIs native to
-the system, if available. This is currently only implemented on the
-Windows platform.
-
-=back
-
-The C<byte> pragma overrides the value of this flag in the current
+The C<byte> pragma always overrides the effect of this flag in the current
lexical scope. See L<byte>.
=item $^V
as a "version tuple". Version tuples have both a numeric value and a
string value. The numeric value is a floating point number that amounts
to revision + version/1000 + subversion/1000000, and the string value
-is made of utf8 characters:
+is made of characters possibly in the UTF-8 range:
C<chr($revision) . chr($version) . chr($subversion)>.
This can be used to determine whether the Perl interpreter executing a
char *tmps;
U32 value = POPu;
- SvUTF8_off(TARG); /* decontaminate */
(void)SvUPGRADE(TARG,SVt_PV);
- if (value >= 128 && PL_bigchar && !IN_BYTE) {
+ if (value > 255 && !IN_BYTE) {
SvGROW(TARG,8);
tmps = SvPVX(TARG);
tmps = (char*)uv_to_utf8((U8*)tmps, (UV)value);
tmps = SvPVX(TARG);
*tmps++ = value;
*tmps = '\0';
+ SvUTF8_off(TARG); /* decontaminate */
(void)SvPOK_only(TARG);
XPUSHs(TARG);
RETURN;
uv = va_arg(*args, int);
else
uv = (svix < svmax) ? SvIVx(svargs[svix++]) : 0;
- if (uv >= 128 && PL_bigchar && !IN_BYTE) {
+ if ((uv > 255 || (uv > 127 && SvUTF8(sv))) && !IN_BYTE) {
eptr = (char*)utf8buf;
elen = uv_to_utf8((U8*)eptr, uv) - utf8buf;
is_utf = TRUE;