From: Jarkko Hietaniemi
Date: Thu, 16 Aug 2001 01:07:21 +0000 (+0000)
Subject: Document the bytes-to-Unicode upgrading.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=7dedd01fe68e1bc71e98f1f13b6e607814dec07b;p=p5sagit%2Fp5-mst-13.2.git

Document the bytes-to-Unicode upgrading.

p4raw-id: //depot/perl@11688
---

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index f429be7..9609cdc 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -49,7 +49,8 @@ need not normally be used.
 However, as a compatibility measure, this pragma must be explicitly used
 to enable recognition of UTF-8 in the Perl scripts themselves on ASCII
 based machines or recognize UTF-EBCDIC on EBCDIC based machines.
-B<NOTE: this should be the only place where an explicit C<use utf8> is needed>.
+B<NOTE: this should be the only place where an explicit C<use utf8> is
+needed>.
 
 =back
 
@@ -88,10 +89,9 @@ force byte semantics in a particular lexical scope. See L<bytes>.
 
 The C<use utf8> pragma is primarily a compatibility device that enables
 recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
-It may also be used for enabling some of the more experimental Unicode
-support features. Note that this pragma is only required until a
-future version of Perl in which character semantics will become the
-default. This pragma may then become a no-op. See L<utf8>.
+Note that this pragma is only required until a future version of Perl
+in which character semantics will become the default. This pragma may
+then become a no-op. See L<utf8>.
 
 Unless mentioned otherwise, Perl operators will use character semantics
 when they are dealing with Unicode data, and byte semantics otherwise.
@@ -102,6 +102,11 @@ literal UTF-8 string constant in the program), character semantics
 apply; otherwise, byte semantics are in effect. To force byte semantics
 on Unicode data, the C<bytes> pragma should be used.
 
+Notice that if you have a string with byte semantics and you then
+add character data into it, the bytes will be upgraded I<as if they were
+ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a translation
+to ISO 8859-1).
+
 Under character semantics, many operations that formerly operated on
 bytes change to operating on characters. For ASCII data this makes
 no difference, because UTF-8 stores ASCII in single bytes, but for any