From: Jarkko Hietaniemi <jhi@iki.fi>
Date: Fri, 12 Apr 2002 13:36:52 +0000 (+0000)
Subject: Discuss the magic of \w in security terms.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=0d7c09bbda3e156731933b5a499b81784cf6adf2;p=p5sagit%2Fp5-mst-13.2.git

Discuss the magic of \w in security terms.

p4raw-id: //depot/perl@15876
---
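A minimal sketch of the C<\w> difference this patch documents, assuming an
ASCII platform and code that enables neither a locale nor the later
unicode_strings feature. The variable names are illustrative only; code
point 0xF0 is LATIN SMALL LETTER ETH, the example used in the new text.

```perl
use strict;
use warnings;

# One character, code point 0xF0 (LATIN SMALL LETTER ETH),
# held as a plain eight-bit byte string.
my $bytes = "\xF0";

# The same code point, but upgraded to Perl's internal character
# (UTF-8) representation.  utf8::upgrade is a core function.
my $chars = $bytes;
utf8::upgrade($chars);

# Under the default matching rules the two strings are treated
# differently by \w, although they compare equal with eq.
for my $case ([bytes => $bytes], [chars => $chars]) {
    my ($label, $string) = @$case;
    printf "%s: %s\n", $label, ($string =~ /\w/ ? "match" : "no match");
}
```

Under those assumptions the byte string should not match C<\w> while the
upgraded string should, which is the kind of silent change in validation
behaviour the new section warns about.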

diff --git a/pod/perlsec.pod b/pod/perlsec.pod
index 8616c64..d5effd9 100644
--- a/pod/perlsec.pod
+++ b/pod/perlsec.pod
@@ -390,6 +390,13 @@ Your access to it does not give you permission to use it blah blah
 blah."  You should see a lawyer to be sure your licence's wording will
 stand up in court.
 
+=head2 Unicode
+
+Unicode is a new and complex technology and one may easily overlook
+certain security pitfalls.  See L<perluniintro> for an overview and
+L<perlunicode> for details, and L<perlunicode/"Security Implications
+of Unicode"> for security implications in particular.
+
 =head1 SEE ALSO
 
 L<perlrun> for its description of cleaning up environment variables.
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 26e704a..45c5932 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -840,7 +840,13 @@ transport/storage is not eight-bit safe.  Defined by RFC 2152.
 
 =back
 
-=head2 Security Implications of Malformed UTF-8
+=head2 Security Implications of Unicode
+
+=over 4
+
+=item *
+
+Malformed UTF-8
 
 Unfortunately, the specification of UTF-8 leaves some room for
 interpretation of how many bytes of encoded output one should generate
@@ -853,6 +859,37 @@ warnings;>) Perl will warn about non-shortest length UTF-8 (and other
 malformations, too, such as the surrogates, which are not real
 Unicode code points.)
 
+=item *
+
+Regular expressions behave slightly differently on byte data and on
+character (Unicode) data.  For example, the "word character" character
+class C<\w> works differently depending on whether the data is all
+eight-bit bytes or Unicode.
+
+In the first case, the set of C<\w> characters is either small (the
+default set of alphabetic characters, digits, and the "_"), or, if you
+are using a locale (see L<perllocale>), the C<\w> might contain a few
+more letters according to your language and country.
+
+In the second case, the C<\w> set of characters is much, much larger,
+and most importantly, even in the set of the first 256 characters, it
+will most probably be different: as opposed to most locales (which are
+specific to a language and country pair) Unicode classifies all the
+characters that are letters as C<\w>.  For example: your locale might
+not think that LATIN SMALL LETTER ETH is a letter (unless you happen
+to speak Icelandic), but Unicode does.
+
+As discussed elsewhere, Perl tries to keep one leg (two legs, being a
+quadruped camel?) in each of two worlds: the old world of bytes and the
+new world of characters, upgrading from bytes to characters as needed.
+If your legacy code is not explicitly using Unicode, no automatic
+switchover to characters should happen, and characters shouldn't get
+downgraded back to bytes, either.  It is possible to accidentally mix
+bytes and characters, however (see L<perluniintro>), in which case the
+C<\w> might start behaving differently.  Review your code.
+
+=back
+
 =head2 Unicode in Perl on EBCDIC
 
 The way Unicode is handled on EBCDIC platforms is still rather