X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=c5ffbaf0e49b0fadf44253ee0603c95cd4b60963;hb=35f2feb095c3dd2b77eb6efc2bf725b5886b6931;hp=b0efcca8dfcffcc31057fba59656eae1bf0d19a6;hpb=393fec973b1b95a178b4b9600173880d9f93debf;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index b0efcca..c5ffbaf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -14,13 +14,46 @@ uses the UTF-8 encoding. In future, Perl-level operations will expect to work with characters rather than bytes, in general.
-However, Perl v5.6 aims to provide a safe migration path from byte
-semantics to character semantics for programs. To preserve compatibility
-with earlier versions of Perl which allowed byte semantics in Perl
-operations (owing to the fact that the internal representation for
-characters was in bytes) byte semantics will continue to be in effect
-until a the C pragma is used in the C package, or the C<$^U>
-global flag is explicitly set.
+However, as strictly an interim compatibility measure, Perl v5.6 aims to
+provide a safe migration path from byte semantics to character semantics
+for programs. For operations where Perl can unambiguously decide that the
+input data is characters, Perl now switches to character semantics.
+For operations where this determination cannot be made without additional
+information from the user, Perl decides in favor of compatibility, and
+chooses to use byte semantics.
+
+This behavior preserves compatibility with earlier versions of Perl,
+which allowed byte semantics in Perl operations, but only as long as
+none of the program's inputs are marked as being as source of Unicode
+character data. Such data may come from filehandles, from calls to
+external programs, from information provided by the system (such as %ENV),
+or from literals and constants in the source text. Later, in
+L, we'll see how such
+inputs may be marked as being Unicode character data sources.
+
+If the C<-C> command line switch is used, (or the ${^WIDE_SYSTEM_CALLS}
+global flag is set to C<1>), all system calls will use the
+corresponding wide character APIs. This is currently only implemented
+on Windows.
+
+Regardless of the above, the C pragma can always be used to force
+byte semantics in a particular lexical scope. See L.
+
+The C pragma is primarily a compatibility device that enables
+recognition of UTF-8 in literals encountered by the parser. It is also
+used for enabling some of the more experimental Unicode support features.
+Note that this pragma is only required until a future version of Perl
+in which character semantics will become the default. This pragma may
+then become a no-op. See L.
+
+Unless mentioned otherwise, Perl operators will use character semantics
+when they are dealing with Unicode data, and byte semantics otherwise.
+Thus, character semantics for these operations apply transparently; if
+the input data came from a Unicode source (for example, by adding a
+character encoding discipline to the filehandle whence it came, or a
+literal UTF-8 string constant in the program), character semantics
+apply; otherwise, byte semantics are in effect. To force byte semantics
+on Unicode data, the C pragma should be used.
 Under character semantics, many operations that formerly operated on bytes change to operating on characters. For ASCII data this makes
@@ -33,15 +66,7 @@ ranging from 0 to 2**32 or so. Larger characters encode to longer sequences of bytes internally, but again, this is just an internal detail which is hidden at the Perl level.
-The C pragma can be used to force byte semantics in a particular
-lexical scope. See L.
-
-The C pragma is a compatibility device to enables recognition
-of UTF-8 in literals encountered by the parser. It is also used
-for enabling some experimental Unicode support features. Note that
-this pragma is only required until a future version of Perl in which
-character semantics will become the default. This pragma may then
-become a no-op. See L.
+=head2 Effects of character semantics
 Character semantics have the following effects:
@@ -73,9 +98,9 @@ characters, including ideographs. (You are currently on your own when it comes to using the canonical forms of characters--Perl doesn't (yet) attempt to canonicalize variable names for you.)
-This also needs C currently. [XXX: Why? High-bit chars were
+This also needs C currently. [XXX: Why?!? High-bit chars were
 syntax errors when they occurred within identifiers in previous versions,
-so this should be enabled by default.]
+so this should probably be enabled by default.]
 =item *
@@ -86,7 +111,8 @@ C<\C>).) Unicode support in regular expressions needs C currently. [XXX: Because the SWASH routines need to be loaded. And the RE engine
-appears to need an overhaul to Unicode by default anyway.]
+appears to need an overhaul to dynamically match Unicode anyway--the
+current RE compiler creates different nodes with and without C.]
 =item *
@@ -180,14 +206,18 @@ And finally, C reverses by character rather than by byte.
 =back
+=head2 Character encodings for input and output
+
+[XXX: This feature is not yet implemented.]
+
 =head1 CAVEATS
 As of yet, there is no method for automatically coercing input and output to some encoding other than UTF-8. This is planned in the near future, however.
-Whether a piece of data will be treated as "characters" or "bytes"
-by internal operations cannot be divined at the current time.
+Whether an arbitrary piece of data will be treated as "characters" or
+"bytes" by internal operations cannot be divined at the current time.
 Use of locales with utf8 may lead to odd results. Currently there is some attempt to apply 8-bit locale info to characters in the range
@@ -197,6 +227,6 @@ tend to run slower. Avoidance of locales is strongly encouraged.
 =head1 SEE ALSO
-L, L, L
+L, L, L
 =cut