From: Jarkko Hietaniemi
Date: Fri, 12 Apr 2002 13:36:52 +0000 (+0000)
Subject: Discuss the magic of \w in security terms.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=0d7c09bbda3e156731933b5a499b81784cf6adf2;p=p5sagit%2Fp5-mst-13.2.git

Discuss the magic of \w in security terms.

p4raw-id: //depot/perl@15876
---

diff --git a/pod/perlsec.pod b/pod/perlsec.pod
index 8616c64..d5effd9 100644
--- a/pod/perlsec.pod
+++ b/pod/perlsec.pod
@@ -390,6 +390,13 @@ Your access to it does not give you permission to use it blah blah blah."
 You should see a lawyer to be sure your licence's wording will stand up in
 court.
 
+=head2 Unicode
+
+Unicode is a new and complex technology and one may easily overlook
+certain security pitfalls.  See L for an overview and
+L for details, and L for security implications in particular.
+
 =head1 SEE ALSO
 
 L for its description of cleaning up environment variables.
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 26e704a..45c5932 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -840,7 +840,13 @@ transport/storage is not eight-bit safe. Defined by RFC 2152.
 
 =back
 
-=head2 Security Implications of Malformed UTF-8
+=head2 Security Implications of Unicode
+
+=over 4
+
+=item *
+
+Malformed UTF-8
 
 Unfortunately, the specification of UTF-8 leaves some room for
 interpretation of how many bytes of encoded output one should generate
@@ -853,6 +859,37 @@ warnings;>) Perl will warn about non-shortest length UTF-8 (and other
 malformations, too, such as the surrogates, which are not real
 Unicode code points.)
 
+=item *
+
+Regular expressions behave slightly differently between byte data and
+character (Unicode) data.  For example, the "word character" character
+class C<\w> will work differently when the data is all eight-bit bytes
+or when the data is Unicode.
+
+In the first case, the set of C<\w> characters is either small (the
+default set of alphabetic characters, digits, and the "_"), or, if you
+are using a locale (see L), the C<\w> might contain a few
+more letters according to your language and country.
+
+In the second case, the C<\w> set of characters is much, much larger,
+and most importantly, even in the set of the first 256 characters, it
+will most probably be different: as opposed to most locales (which are
+specific to a language and country pair) Unicode classifies all the
+characters that are letters as C<\w>.  For example: your locale might
+not think that LATIN SMALL LETTER ETH is a letter (unless you happen
+to speak Icelandic), but Unicode does.
+
+As discussed elsewhere, Perl tries to stand with one leg (two legs, being
+a quadruped camel?) in each of two worlds: the old world of bytes and the
+new world of characters, upgrading from bytes to characters when necessary.
+If your legacy code is not explicitly using Unicode, no automatic
+switchover to characters should happen, and characters shouldn't get
+downgraded back to bytes, either.  It is possible to accidentally mix
+bytes and characters, however (see L), in which case the
+C<\w> might start behaving differently.  Review your code.
+
+=back
+
 =head2 Unicode in Perl on EBCDIC
 
 The way Unicode is handled on EBCDIC platforms is still rather