From: Ævar Arnfjörð Bjarmason
Date: Sun, 17 Jun 2007 18:09:25 +0000 (+0000)
Subject: perlreapi.pod documentation for flags & cleanup
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=c998b2457229a340c2e4cb581a814fd9362e5b45;p=p5sagit%2Fp5-mst-13.2.git
perlreapi.pod documentation for flags & cleanup
From: "Ævar Arnfjörð Bjarmason"
Message-ID: <51dd1af80706171109r37c294c4h78a51083c3b851ba@mail.gmail.com>
p4raw-id: //depot/perl@31411
---
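Reader's note on the RXf_SKIPWHITE, RXf_WHITE and RXf_START_ONLY text
added by the patch below: the following is a minimal, self-contained
sketch of what "splitting without invoking the regex engine" can look
like. It is not perl's implementation. The DEMO_* constants, the
demo_field type and demo_fast_split() are hypothetical names invented
for illustration, the bit values are not the real RXf_* values, and
whitespace is plain C-locale isspace() with no UTF-8 or locale handling.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for RXf_SKIPWHITE, RXf_WHITE and RXf_START_ONLY.
 * The real constants are defined by perl and have different values. */
#define DEMO_SKIPWHITE  0x1u
#define DEMO_WHITE      0x2u
#define DEMO_START_ONLY 0x4u

/* One field of the result: a pointer into the subject plus a length. */
typedef struct {
    const char *start;
    size_t len;
} demo_field;

/* Split s into at most max fields using only the flag bits, never a
 * regex engine. Returns the number of fields written. With neither
 * DEMO_WHITE nor DEMO_START_ONLY set it writes nothing and returns 0;
 * a real caller would then hand the string to the engine. The sketch
 * simply stops at max fields and does not model split's LIMIT. */
static size_t demo_fast_split(const char *s, unsigned extflags,
                              demo_field *out, size_t max)
{
    size_t n = 0;

    assert(s != NULL && out != NULL);

    /* SKIPWHITE: drop leading whitespace before anything else happens. */
    if (extflags & DEMO_SKIPWHITE) {
        while (*s && isspace((unsigned char)*s))
            s++;
    }

    if (extflags & DEMO_WHITE) {
        /* Fields are runs of non-whitespace separated by whitespace. */
        while (*s && n < max) {
            const char *begin = s;
            while (*s && !isspace((unsigned char)*s))
                s++;
            out[n].start = begin;
            out[n].len = (size_t)(s - begin);
            n++;
            while (*s && isspace((unsigned char)*s))
                s++;
        }
    } else if (extflags & DEMO_START_ONLY) {
        /* One field per line; the newline stays with its line. */
        while (*s && n < max) {
            const char *begin = s;
            while (*s && *s != '\n')
                s++;
            if (*s == '\n')
                s++;
            out[n].start = begin;
            out[n].len = (size_t)(s - begin);
            n++;
        }
    }

    return n;
}
```

With both DEMO_SKIPWHITE and DEMO_WHITE set, the subject "  a b" gives
the two fields "a" and "b". With DEMO_WHITE alone, " a b" gives a
leading empty field followed by "a" and "b", which is why the two flags
are described separately.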
diff --git a/pod/perlreapi.pod b/pod/perlreapi.pod
index c218c10..3b5dc85 100644
--- a/pod/perlreapi.pod
+++ b/pod/perlreapi.pod
@@ -4,9 +4,9 @@ perlreapi - perl regular expression plugin interface
=head1 DESCRIPTION
-As of Perl 5.9.5 there is a new interface for using other regexp engines than
-the default one. Each engine is supposed to provide access to a constant
-structure of the following format:
+As of Perl 5.9.5 there is a new interface for using other regexp
+engines than the default one. Each engine is supposed to provide
+access to a constant structure of the following format:
typedef struct regexp_engine {
REGEXP* (*comp) (pTHX_ const SV * const pattern, const U32 flags);
@@ -76,80 +76,115 @@ stringify everything using the snippet above but that doesn't mean
other engines have to.
The C<flags> parameter is a bitfield which indicates which of the
-C flags the regex was compiled with. In addition it contains
-info about whether C
- RXf_PMf_KEEPCOPY
-The C flag. Guaranteed not to be used outside the regex engine.
+=back
-=item RXf_PMf_FOLD
+Additional flags:
-The C flag. Guaranteed not to be used outside the regex engine.
+=over 4
-=item RXf_PMf_EXTENDED
+=item RXf_SKIPWHITE
+
+If C is invoked as C or with no arguments (which
+really means C, see L), perl will set
+this flag and change the pattern from C<" "> to C<"\s+"> before it's
+passed to the comp routine.
-The C flag. Guaranteed not to be used outside the regex
-engine. However if present on a regex C<#> comments will be stripped
-by the tokenizer regardless of the engine currently in use.
+If the flag is present in C<< rx->extflags >>, C<split> will delete
+whitespace from the start of the subject string before it's operated
+on. What is considered whitespace depends on whether the subject is a
+UTF-8 string and whether the C flag is set.
-=item RXf_PMf_KEEPCOPY
+This should probably always be preserved verbatim in C<< rx->extflags >>.
+
+=item RXf_PMf_LOCALE
-The C flag.
+Set if C is in effect. If present in C<< rx->extflags >>
+C<split> will use the locale-dependent definition of whitespace
+when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII whitespace
+is defined as per L, and by the internal
+macros C under UTF-8 and C under C.
=item RXf_UTF8
Set if the pattern is L, set by Perl_pmruntime.
+A regex engine may want to set or disable this flag during
+compilation. The perl engine for instance may upgrade non-UTF-8
+strings to UTF-8 if the pattern includes constructs such as C<\x{...}>
+that can only match Unicode values.
+
=back
-In general these flags should be preserved in C<< rx->extflags >>
-after compilation, although it is possible the regex includes
-constructs that changes them. The perl engine for instance may upgrade
-non-utf8 strings to utf8 if the pattern includes constructs such as
-C<\x{...}> that can only match unicode values. RXf_SKIPWHITE should
-always be preserved verbatim in C<< regex->extflags >>.
+These flags can be set during compilation to enable optimizations in
+the C<split> operator.
+
+=over 4
+
+=item RXf_START_ONLY
+
+Tells the split operator to split the target string on newlines
+(C<\n>) without invoking the regex engine.
+
+Perl's engine sets this if the pattern is C</^/> (C), even under C</^/s>, see L. Of course a
+different regex engine might want to use the same optimizations
+with a different syntax.
+
+=item RXf_WHITE
+
+Tells the split operator to split the target string on whitespace
+without invoking the regex engine. The definition of whitespace varies
+depending on whether the target string is a UTF-8 string and on
+whether RXf_PMf_LOCALE is set.
+
+Perl's engine sets this flag if the pattern is C<\s+>, which it will be if
+the pattern actually was C<\s+> or if it was originally C<" "> (see
+C<RXf_SKIPWHITE> above).
+
+=back
=head2 exec
@@ -215,7 +250,7 @@ redundant. The scalar can be set with C, C and
friends, see L.
This callback is where perl untaints its own capture variables under
-taint mode (see L). See the C
+taint mode (see L). See the C
function in F for how to untaint capture variables if
that's something you'd like your engine to do as well.
@@ -325,7 +360,7 @@ Whether C<%+> or C<%-> is being operated on, if any.
RXf_HASH_ALL /* %- */
Whether this is being called as C, C or
-C, if any. The first two will be combined with
+C, if any. The first two will be combined with
C or C.
RXf_HASH_REGNAME
@@ -462,7 +497,6 @@ values.
I32 prelen; /* length of precomp */
const char *precomp; /* pre-compilation regular expression */
- /* wrapped can't be const char*, as it is returned by sv_2pv_flags */
char *wrapped; /* wrapped version of the pattern */
I32 wraplen; /* length of wrapped */
@@ -494,7 +528,8 @@ TODO, see L
This will be used by perl to see what flags the regexp was compiled
with, this will normally be set to the value of the flags parameter by
-the L callback.
+the L callback. See the L documentation for
+valid flags.
=head2 C C
@@ -526,7 +561,7 @@ Left offset from pos() to start match at.
Substring data about strings that must appear in the final match. This
is currently only used internally by perl's engine but might be
-used in the future for all engines for optimisations like C.
+used in the future for all engines for optimisations.
=head2 C, C, and C
@@ -589,7 +624,7 @@ pv being an embedded array of I32. The values may also be contained
independently in the data array in cases where named backreferences are
used.
-=head2 C
+=head2 C
Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
@@ -599,25 +634,17 @@ the regex engine at all, and if so where in the string to search.
=head2 C C C
- #define SAVEPVN(p,n) ((p) ? savepvn(p,n) : NULL)
- if (RX_MATCH_COPIED(ret))
- ret->subbeg = SAVEPVN(ret->subbeg, ret->sublen);
- else
- ret->subbeg = NULL;
-
-Cextflags & RXf_PMf_KEEPCOPY>
-
-These are used during execution phase for managing search and replace
-patterns.
+Used during the execution phase for managing search and replace patterns.
=head2 C C
-Stores the string C stringifies to, for example C<(?-xism:eek)>
-in the case of C.
+Stores the string C stringifies to. The perl engine for example
+stores C<(?-xism:eek)> in the case of C.
-When using a custom engine that doesn't support the C<(?:)> construct for
-inline modifiers it's best to have C stringify to the supplied pattern,
-note that this will create invalid patterns in cases such as:
+When using a custom engine that doesn't support the C<(?:)> construct
+for inline modifiers, it's probably best to have C stringify to
+the supplied pattern. Note that this will create undesired patterns in
+cases such as:
my $x = qr/a|b/; # "a|b"
my $y = qr/c/i; # "c"
@@ -626,8 +653,6 @@ note that this will create invalid patterns in cases such as:
There's no solution for this problem other than making the custom
engine understand a construct like C<(?:)>.
-The C in F does the stringification work.
-
=head2 C
This stores the number of eval groups in the pattern. This is used for security