From: Jarkko Hietaniemi
Date: Sat, 17 Nov 2001 22:22:47 +0000 (+0000)
Subject: Banish "use utf8".
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=72ff290864ea88cc224b5d3af7058f500755f94a;p=p5sagit%2Fp5-mst-13.2.git

Banish "use utf8".

p4raw-id: //depot/perl@13064
---

diff --git a/pod/perlre.pod b/pod/perlre.pod
index 6c68749..5c7e76b 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -184,7 +184,9 @@ In addition, Perl defines the following:
     \PP Match non-P
     \X Match eXtended Unicode "combining character sequence",
         equivalent to C<(?:\PM\pM*)>
-    \C Match a single C char (octet) even under utf8.
+    \C Match a single C char (octet) even under Unicode.
+        B breaks up characters into their UTF-8 bytes,
+        so you may end up with malformed pieces of UTF-8.

 A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
 Use C<\w+> to match a string of Perl-identifier characters (which isn't
@@ -193,7 +195,7 @@ list of alphabetic characters generated by C<\w> is taken from the
 current locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>,
 C<\d>, and C<\D> within character classes, but if you try to use them
 as endpoints of a range, that's not a range, the "-" is understood literally.
-See L for details about C<\pP>, C<\PP>, and C<\X>.
+See L for details about C<\pP>, C<\PP>, and C<\X>.

 The POSIX character class syntax

@@ -230,9 +232,10 @@ whole character class. For example:

 matches zero, one, any alphabetic character, and the percentage sign.

-If the C pragma is used, the following equivalences to Unicode
-\p{} constructs and equivalent backslash character classes (if available),
-will hold:
+The following equivalences to Unicode \p{} constructs and equivalent
+backslash character classes (if available), will hold:
+
+    [:...:] \p{...} backslash

     alpha IsAlpha
     alnum IsAlnum
@@ -291,7 +294,7 @@ work just fine) it is included for completeness.
 You can negate the [::] character classes by prefixing the class name
 with a '^'.
This is a Perl extension. For example:

-    POSIX trad. Perl utf8 Perl
+    POSIX traditional Unicode

     [:^digit:] \D \P{IsDigit}
     [:^space:] \S \P{IsSpace}
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index f4e9bb6..bb2423b 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1653,12 +1653,11 @@ Unicode characters in the range of 128-255 use two hexadecimal digits
 with braces: C<\x{ab}>. Note that this is different than C<\xab>,
 which is just a hexadecimal byte with no Unicode significance.

-B: in perl 5.6.0 it used to be that one needed to say C
-to use any Unicode features. This is no more the case: for almost all
-Unicode processing, the explicit C pragma is not needed.
-(The only case where it matters is if your Perl script is in Unicode,
-that is, encoded in UTF-8/UTF-16/UTF-EBCDIC: then an explicit C
-is needed.)
+B: in Perl 5.6.0 it used to be that one needed to say C to use any Unicode features. This is no more the case: for
+almost all Unicode processing, the explicit C pragma is not
+needed. (The only case where it matters is if your Perl script is in
+Unicode and encoded in UTF-8, then an explicit C is needed.)

 Figuring out the hexadecimal sequence of a Unicode character you want
 or deciphering someone else's hexadecimal Unicode regexp is about as
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 2fca714..64116bc 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -782,7 +782,7 @@ for more discussion of the issues.

 =head1 SEE ALSO

-L, L, L, L, L, L,
-L
+L, L, L, L, L, L,
+L, L

 =cut