From: Ilya Zakharevich
Date: Sat, 20 Jun 1998 21:11:37 +0000 (-0400)
Subject: Re docs
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=75e14d17912ce8a35d5c2b04c0c6e30b903ab97f;p=p5sagit%2Fp5-mst-13.2.git

Re docs

Message-Id: <199806210111.VAA17752@monk.mps.ohio-state.edu>

p4raw-id: //depot/perl@1181
---

diff --git a/pod/perlop.pod b/pod/perlop.pod
index b3202e5..c534234 100644
--- a/pod/perlop.pod
+++ b/pod/perlop.pod
@@ -695,6 +695,18 @@ evaluation of variables when used within double quotes.
 Here are the quote-like operators that apply to pattern
 matching and related activities.
 
+Most of this section is related to use of regular expressions from Perl.
+Such a use may be considered from two points of view: Perl hands a
+string and a "pattern" to the RE (regular expression) engine to match,
+the RE engine finds (or does not find) the match, and Perl uses the findings
+of the RE engine for its operation, possibly asking the engine for other matches.
+
+The RE engine has no idea what Perl is going to do with what it finds;
+similarly, the rest of Perl has no idea what a particular regular expression
+means to the RE engine. This creates a clean separation, and in this section
+we discuss matching from Perl's point of view only. The other point of
+view may be found in L<perlre>.
+
 =over 8
 
 =item ?PATTERN?
@@ -1172,6 +1184,182 @@ an eval():
 
 =back
 
+=head2 Gory details of parsing quoted constructs
+
+When presented with something which may have several different
+interpretations, Perl uses the principle B<DWIM> (expanded to Do What I Mean
+- not what I wrote) to pick the most probable interpretation of the
+source. This strategy is so successful that Perl users usually do not
+suspect any ambiguity in what they write. However, from time to time Perl's ideas
+differ from what the author meant.
+
+The aim of this section is to clarify Perl's way of interpreting
+quoted constructs.
The most frequent reason one may have to want to know the
+details discussed in this section is hairy regular expressions. However, the
+first steps of parsing are the same for all Perl quoting operators, so here
+they are discussed together.
+
+Some of the passes discussed below are performed concurrently, but as
+long as the results are the same, we consider them one-by-one. For different
+quoting constructs Perl performs a different number of passes, from
+one to five, but they are always performed in the same order.
+
+=over
+
+=item Finding the end
+
+The first pass is finding the end of the quoted construct, be it the multichar terminator
+C<"\nEOF\n"> of the C<<<EOF> construct, C</> which terminates the C<qq/> construct,
+C<]> which terminates the C<qq[> construct, or C<E<gt>> which terminates a
+fileglob started with C<<>.
+
+When searching for a multichar terminator no skipping is performed. When
+searching for a one-char non-matching delimiter, such as C</>, combinations
+C<\\> and C<\/> are skipped. When searching for a one-char matching delimiter,
+such as C<]>, combinations C<\\>, C<\]> and C<\[> are skipped, and
+nested C<[>, C<]> are skipped as well.
+
+For 3-part constructs, C<s///> etc., the search is repeated once more.
+
+During this search no attention is paid to the semantics of the construct, thus
+
+    "$hash{"$foo/$bar"}"
+
+or
+
+    m/
+      bar       #  This is not a comment, this slash / terminated m//!
+     /x
+
+do not form legal quoted expressions. Note that since the slash which
+terminated C<m//> was followed by a C<SPACE>, this is not C<m//x>,
+thus C<#> was interpreted as a literal C<#>.
+
+=item Removal of backslashes before delimiters
+
+During the second pass the text between the starting delimiter and
+the ending delimiter is copied to a safe location, and the C<\> is
+removed from combinations consisting of C<\> and delimiter(s) (both starting
+and ending delimiter if they differ).
+
+The removal does not happen for multi-char delimiters.
+
+Note that the combination C<\\> is left as it was!
+
+Starting from this step no information about the delimiter(s) is used in the
+parsing.
+
+=item Interpolation
+
+The next step is interpolation in the obtained delimiter-independent text.
+There are many different cases.
+
+=over
+
+=item C<<<'EOF'>, C<m''>, C<s'''>, C<tr///>, C<y///>
+
+No interpolation is performed.
+
+=item C<''>, C<q//>
+
+The only interpolation is removal of C<\> from pairs C<\\>.
+
+=item C<"">, C<``>, C<qq//>, C<qx//>, C<<file*glob>>
+
+C<\Q>, C<\U>, C<\u>, C<\L>, C<\l> (possibly paired with C<\E>) are converted
+to corresponding Perl constructs, thus C<"$foo\Qbaz$bar"> is converted to
+
+   $foo . (quotemeta("baz" . $bar));
+
+Other combinations of C<\> with following chars are substituted with
+appropriate expansions.
+
+Interpolated scalars and arrays are converted to C<join> and C<.> Perl
+constructs, thus C<"'@arr'"> becomes
+
+   "'" . (join $", @arr) . "'";
+
+Since all three above steps are performed simultaneously left-to-right,
+there is no way to insert a literal C<$> or C<@> inside a C<\Q\E> pair: it
+cannot be protected by C<\>, since any C<\> (except in C<\E>) is
+interpreted as a literal inside C<\Q\E>, and any $ is
+interpreted as starting an interpolated scalar.
+
+Note also that the interpolating code needs to decide where the
+interpolated scalar ends, say, whether C<"a $b -> {c}"> means
+
+   "a " . $b . " -> {c}";
+
+or
+
+   "a " . $b -> {c};
+
+Most of the time the decision is to take the longest possible text which does
+not include spaces between components and contains matching braces/brackets.
+
+=item C<?RE?>, C</RE/>, C<m/RE/>, C<s/RE/foo/>
+
+Processing of C<\Q>, C<\U>, C<\u>, C<\L>, C<\l> and interpolation happens
+(almost) as with qq// constructs, but I<the substitution of C<\> followed by
+other chars is not performed>! Moreover, inside C<(?{BLOCK})> no processing
+is performed at all.
+
+Interpolation has several quirks: $|, $( and $) are not interpolated, and
+constructs C<$var[SOMETHING]> are I<voted> (by several different estimators)
+to be an array element or $var followed by a RE alternative.
This is
+the place where the notation C<${arr[$bar]}> comes in handy: C</${arr[0-9]}/>
+is interpreted as an array element -9, not as a regular expression from
+variable $arr followed by a digit, which is the interpretation of
+C</$arr[0-9]/>.
+
+Note that the absence of processing of C<\\> creates specific restrictions on the
+post-processed text: if the delimiter is C</>, one cannot get the combination
+C<\/> into the result of this step: C</> will finish the regular expression,
+C<\/> will be stripped to C</> on the previous step, and C<\\/> will be left
+as is. Since C</> is equivalent to C<\/> inside a regular expression, this
+does not matter unless the delimiter is a special character for the RE engine, as
+in C<s*foo*bar*>, C<m[foo]>, or C<?foo?>.
+
+=back
+
+This step is the last one for all the constructs except regular expressions,
+which are processed further.
+
+=item Interpolation of regular expressions
+
+All the previous steps were performed during the compilation of Perl code;
+this one happens at run time (though it may be optimized to be calculated
+at compile time if appropriate). After all the preprocessing performed
+above (and possibly after evaluation if catenation, joining, up/down-casing
+and quotemeta()ing are involved) the resulting I<string> is passed to the RE
+engine for compilation.
+
+Whatever happens in the RE engine is better discussed in L<perlre>,
+but for the sake of continuity let us do it here.
+
+This is the first step where the presence of the C<//x> switch is relevant.
+The RE engine scans the string left-to-right, and converts it to a finite
+automaton.
+
+Backslashed chars are either substituted by corresponding literal
+strings, or generate special nodes of the finite automaton. Characters
+which are special to the RE engine generate corresponding nodes. C<(?#...)>
+comments are ignored. All the rest is either converted to literal strings
+to match, or is ignored (as is whitespace and C<#>-style comments if
+C<//x> is present).
+
+Note that the parsing of the construct C<[...]> is performed using
+absolutely different rules than the rest of the regular expression.
+Similarly, the C<(?{...})> is only checked for matching braces.
+
+=item Optimization of regular expressions
+
+This step is listed for completeness only. Since it does not change
+semantics, details of this step are not documented and are subject
+to change.
+
+=back
+
 =head2 I/O Operators
 
 There are several I/O operators you should know about.
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 89bfb8d..b8c5662 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -6,13 +6,13 @@ perlre - Perl regular expressions
 
 This page describes the syntax of regular expressions in Perl. For a
 description of how to I<use> regular expressions in matching
-operations, plus various examples of the same, see C<m//> and C<s///> in
-L<perlop>.
+operations, plus various examples of the same, see discussion
+of C, C, and C in L.
 
 The matching operations can have various modifiers. The modifiers
 that relate to the interpretation of the regular expression inside
-are listed below. For the modifiers that alter the behaviour of the
-operation, see L and L.
+are listed below. For the modifiers that alter the way a regular expression
+is used by Perl, see L.
 
 =over 4
 
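Not part of the patch: the first three passes described in the perlop hunk (finding the end, removing backslashes before delimiters, interpolation) can be observed directly from Perl. What follows is a minimal sketch under stated assumptions; the helper name `quoting_examples` and the values given to `$foo`, `$bar` and `@arr` are invented for illustration and do not come from the commit.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the patch): returns a few quoted constructs
# together with the strings Perl actually builds from them.
sub quoting_examples {
    my $foo = 'x';          # illustrative values, chosen for this sketch
    my $bar = 'a.b';
    my @arr = (1, 2, 3);

    return (
        # Finding the end skips \/ ; the backslash before the delimiter
        # is then removed in the second pass.
        slash     => q/a\/b/,
        # Nested matching brackets are skipped while looking for the final ].
        nested    => q[a[b]c],
        # \] is skipped as well, then reduced to a plain ].
        escaped   => q[a\]b],
        # For q//, the only interpolation is \\ becoming a single backslash.
        backslash => q/a\\b/,
        # \Q applies quotemeta() to everything after it, including $bar.
        quoted    => "$foo\Qbaz$bar",
        # An interpolated array is joined with $" (a space by default).
        joined    => "'@arr'",
    );
}

my %got = quoting_examples();
print "$_ => $got{$_}\n" for sort keys %got;
```

Running the sketch prints one line per construct. The `quoted` entry comes out as `xbaza\.b`, which matches the `$foo . (quotemeta("baz" . $bar))` expansion given in the patch, and `joined` comes out as `'1 2 3'`.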