From: Jeff Pinyan
Date: Wed, 14 Apr 2004 17:01:38 +0000 (-0400)
Subject: Re: [PATCH] lib/utf8_heavy.pl -- cascading classes and '&' support
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=bac0b42524fd3607268d7139a21b07697a1c978b;p=p5sagit%2Fp5-mst-13.2.git

Re: [PATCH] lib/utf8_heavy.pl -- cascading classes and '&' support

From: "Jeff 'japhy' Pinyan"
Message-ID:

p4raw-id: //depot/perl@22713
---

diff --git a/MANIFEST b/MANIFEST
index 1f8239d..cf87b5f 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -2952,6 +2952,7 @@ t/TestInit.pm Preamble library for core tests
 t/test.pl                      Simple testing library
 t/uni/case.pl                  See if Unicode casing works
 t/uni/chomp.t                  See if Unicode chomp works
+t/uni/class.t                  See if Unicode classes work (\p)
 t/uni/fold.t                   See if Unicode folding works
 t/uni/lower.t                  See if Unicode casing works
 t/uni/sprintf.t                See if Unicode sprintf works
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 7de87ac..0817bb3 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -632,10 +632,21 @@ And finally, C reverses by character rather than by byte.
 =head2 User-Defined Character Properties
 
 You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is". The subroutines must be defined
-in the C<main> package. The user-defined properties can be used in the
-regular expression C<\p> and C<\P> constructs. Note that the effect
-is compile-time and immutable once defined.
+whose names begin with "In" or "Is". The subroutines can be defined in
+any package. The user-defined properties can be used in the regular
+expression C<\p> and C<\P> constructs; if you are using a user-defined
+property from a package other than the one you are in, you must specify
+its package in the C<\p> or C<\P> construct.
+
+    # assuming property IsForeign defined in Lang::
+    package main; # property package name required
+    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
+
+    package Lang; # property package name not required
+    if ($txt =~ /\p{IsForeign}+/) { ... }
+
+
+Note that the effect is compile-time and immutable once defined.
 
 The subroutines must return a specially-formatted string, with one
 or more newline-separated lines. Each line must be one of the following:
@@ -650,23 +661,30 @@ tabular characters) denoting a range of Unicode code points to include.
 =item *
 
 Something to include, prefixed by "+": a built-in character
-property (prefixed by "utf8::"), to represent all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
 
 =item *
 
 Something to exclude, prefixed by "-": an existing character
-property (prefixed by "utf8::"), for all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
 
 =item *
 
 Something to negate, prefixed "!": an existing character
-property (prefixed by "utf8::") for all the characters except the
-characters in the property; two hexadecimal code points for a range;
-or a single hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
+
+=item *
+
+Something to intersect with, prefixed by "&": an existing character
+property (prefixed by "utf8::") or a user-defined character property,
+for all the characters except the characters in the property; two
+hexadecimal code points for a range; or a single hexadecimal code point.
 
 =back
 
@@ -714,6 +732,19 @@ The negation is useful for defining (surprise!) negated classes.
     END
     }
 
+Intersection is useful for getting the common characters matched by
+two (or more) classes.
+
+    sub InFooAndBar {
+        return <<'END';
+    +main::Foo
+    &main::Bar
+    END
+    }
+
+It's important to remember not to use "&" for the first set -- that
+would be intersecting with nothing (resulting in an empty set).
+
 You can also define your own mappings to be used in the lc(),
 lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
 The principle is the same: define subroutines in the C<main> package