=head1 NAME

perlreref - Perl Regular Expressions Reference

=head1 DESCRIPTION

This is a quick reference to Perl's regular expressions.
For full information see L<perlre> and L<perlop>, as well
as the L</"SEE ALSO"> section in this document.

=head2 OPERATORS

C<=~> determines to which variable the regex is applied.
In its absence, C<$_> is used.

    $var =~ /foo/;

C<!~> determines to which variable the regex is applied,
and negates the result of the match; it returns
false if the match succeeds, and true if it fails.

    $var !~ /foo/;

C<m/pattern/msixpogc> searches a string for a pattern match,
applying the given options.

    m  Multiline mode - ^ and $ match internal lines
    s  match as a Single line - . matches \n
    i  case-Insensitive
    x  eXtended legibility - free whitespace and comments
    p  Preserve a copy of the matched string -
       ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
    o  compile pattern Once
    g  Global - all occurrences
    c  don't reset pos on failed matches when using /g

If 'pattern' is an empty string, the last I<successfully> matched
regex is used. Delimiters other than '/' may be used for both this
operator and the following ones. The leading C<m> can be omitted
if the delimiter is '/'.
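
As a rough sketch of how several modifiers combine in one match (the
variable names and sample text are invented for illustration):

    my $text = "one=1\nTWO=2\n";

    # /g in list context returns every capture; /i ignores case;
    # /m lets ^ match after the embedded newline; /x permits
    # whitespace and comments inside the pattern.
    my @pairs = $text =~ /
        ^ ([a-z]+)   # key at the start of a line
        = (\d+)      # value after the equals sign
    /gimx;

    # @pairs now holds ("one", "1", "TWO", "2")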

C<qr/pattern/msixpo> lets you store a regex in a variable,
or pass one around. The modifiers are the same as for C<m//>,
and are stored within the regex.
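
A minimal sketch of storing and reusing a compiled regex; the pattern
and the variable names are hypothetical:

    # A word followed by digits, matched case-insensitively.
    my $re = qr/([a-z]+)(\d+)/i;

    # The stored /i travels with $re, both when it is used directly
    # and when it is interpolated into a larger pattern.
    my ($name, $num) = "ABC123" =~ $re;        # ("ABC", "123")
    my $whole        = "ABC123" =~ /\A$re\z/;  # true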

C<s/pattern/replacement/msixpogce> substitutes matches of
'pattern' with 'replacement'. Modifiers as for C<m//>,
with one addition:

    e  Evaluate 'replacement' as an expression

'e' may be specified multiple times. 'replacement' is interpreted
as a double-quoted string unless a single-quote (C<'>) is the delimiter.
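
For instance, C</e> can compute each replacement from the matched text.
This is only a sketch, with made-up data and variable names:

    my $prices = "apple 3, pear 5";

    # /e evaluates the replacement as Perl code;
    # /g repeats the substitution for every match.
    (my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;

    # $doubled is now "apple 6, pear 10"; $prices is unchanged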

C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
delimiters can be used.  To match again, it must be reset with C<reset()>.

=head2 SYNTAX

   \       Escapes the character immediately following it
   .       Matches any single character except a newline (unless /s is used)
   ^       Matches at the beginning of the string (or line, if /m is used)
   $       Matches at the end of the string (or line, if /m is used)
   *       Matches the preceding element 0 or more times
   +       Matches the preceding element 1 or more times
   ?       Matches the preceding element 0 or 1 times
   {...}   Specifies a range of occurrences for the element preceding it
   [...]   Matches any one of the characters contained within the brackets
   (...)   Groups subexpressions for capturing to $1, $2...
   (?:...) Groups subexpressions without capturing (cluster)
   |       Matches either the subexpression preceding or following it
   \1, \2 ...  Matches the text from the Nth group

=head2 ESCAPE SEQUENCES

These work as in normal strings.

   \a       Alarm (beep)
   \e       Escape
   \f       Formfeed
   \n       Newline
   \r       Carriage return
   \t       Tab
   \037     Any octal ASCII value
   \x7f     Any hexadecimal ASCII value
   \x{263a} A wide hexadecimal value
   \cx      Control-x
   \N{name} A named character

   \l  Lowercase next character
   \u  Titlecase next character
   \L  Lowercase until \E
   \U  Uppercase until \E
   \Q  Disable pattern metacharacters until \E
   \E  End modification

For Titlecase, see L</Titlecase>.
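
As an illustration of C<\Q> and C<\E> from the list above, the sketch
below matches an interpolated string literally. The variable names and
the sample string are assumptions made for the example:

    # Hypothetical user-supplied text containing metacharacters.
    my $literal = "a.b*c";

    # \Q...\E quotes the interpolated text, so . and * match themselves.
    my $quoted   = "a.b*c" =~ /\A\Q$literal\E\z/;  # true

    # Without quoting, the pattern is a, any character, zero or more b, c,
    # which cannot match the literal "*".
    my $unquoted = "a.b*c" =~ /\A$literal\z/;      # false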

This one works differently from normal strings:

   \b  An assertion, not backspace, except in a character class

=head2 CHARACTER CLASSES

   [amy]    Match 'a', 'm' or 'y'
   [f-j]    Dash specifies "range"
   [f-j-]   Dash escaped or at start or end means 'dash'
   [^f-j]   Caret indicates "match any character _except_ these"

The following sequences work inside or outside a character class.
The first six are locale aware, all are Unicode aware. See L<perllocale>
and L<perlunicode> for details.

   \d      A digit
   \D      A nondigit
   \w      A word character
   \W      A non-word character
   \s      A whitespace character
   \S      A non-whitespace character
   \h      A horizontal whitespace character
   \H      A non-horizontal whitespace character
   \v      A vertical whitespace character
   \V      A non-vertical whitespace character
   \R      A generic newline           (?>\v|\x0D\x0A)

   \C      Match a byte (with Unicode, '.' matches a character)
   \pP     Match P-named (Unicode) property
   \p{...} Match Unicode property with long name
   \PP     Match non-P
   \P{...} Match lack of Unicode property with long name
   \X      Match extended Unicode combining character sequence

POSIX character classes and their Unicode and Perl equivalents:

   alnum   IsAlnum              Alphanumeric
   alpha   IsAlpha              Alphabetic
   ascii   IsASCII              Any ASCII char
   blank   IsSpace  [ \t]       Horizontal whitespace (GNU extension)
   cntrl   IsCntrl              Control characters
   digit   IsDigit  \d          Digits
   graph   IsGraph              Alphanumeric and punctuation
   lower   IsLower              Lowercase chars (locale and Unicode aware)
   print   IsPrint              Alphanumeric, punct, and space
   punct   IsPunct              Punctuation
   space   IsSpace  [\s\ck]     Whitespace
           IsSpacePerl   \s     Perl's whitespace definition
   upper   IsUpper              Uppercase chars (locale and Unicode aware)
   word    IsWord   \w          Alphanumeric plus _ (Perl extension)
   xdigit  IsXDigit [0-9A-Fa-f] Hexadecimal digit

Within a character class:

    POSIX       traditional   Unicode
    [:digit:]       \d        \p{IsDigit}
    [:^digit:]      \D        \P{IsDigit}

=head2 ANCHORS

All are zero-width assertions.

   ^  Match string start (or line, if /m is used)
   $  Match string end (or line, if /m is used) or before newline
   \b Match word boundary (between \w and \W)
   \B Match except at word boundary (between \w and \w or \W and \W)
   \A Match string start (regardless of /m)
   \Z Match string end (before optional newline)
   \z Match absolute string end
   \G Match where previous m//g left off
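
The difference between the end-of-string anchors shows up when the
string ends in a newline. A small sketch, with illustrative variable
names:

    my $line = "foo\n";

    my $big_z   = $line =~ /foo\Z/;  # true:  \Z allows one trailing newline
    my $small_z = $line =~ /foo\z/;  # false: \z demands the very end
    my $dollar  = $line =~ /foo$/;   # true:  without /m, $ behaves like \Z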

=head2 QUANTIFIERS

Quantifiers are greedy by default -- they match the B<longest> leftmost substring.

   Maximal Minimal Allowed range
   ------- ------- -------------
   {n,m}   {n,m}?  Must occur at least n times but no more than m times
   {n,}    {n,}?   Must occur at least n times
   {n}     {n}?    Must occur exactly n times
   *       *?      0 or more times (same as {0,})
   +       +?      1 or more times (same as {1,})
   ?       ??      0 or 1 time (same as {0,1})

There is no quantifier {,n} -- that gets understood as a literal string.
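
To see maximal and minimal matching side by side, consider this short
sketch (the sample string and variable names are invented):

    my $html = "<b>bold</b>";

    # Maximal: .+ runs to the last ">" in the string.
    my ($greedy)  = $html =~ /<(.+)>/;    # "b>bold</b"

    # Minimal: .+? stops at the first ">" that lets the match succeed.
    my ($minimal) = $html =~ /<(.+?)>/;   # "b"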

=head2 EXTENDED CONSTRUCTS

   (?#text)         A comment
   (?imxs-imsx:...) Enable/disable option (as per m// modifiers)
   (?=...)          Zero-width positive lookahead assertion
   (?!...)          Zero-width negative lookahead assertion
   (?<=...)         Zero-width positive lookbehind assertion
   (?<!...)         Zero-width negative lookbehind assertion
   (?>...)          Grab what we can, prohibit backtracking
   (?{ code })      Embedded code, return value becomes $^R
   (??{ code })     Dynamic regex, return value used as regex
   (?(cond)yes|no)  cond being integer corresponding to capturing parens
   (?(cond)yes)        or a lookaround/eval zero-width assertion
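
As a sketch of the lookaround assertions, the following picks out
numbers by their context without including that context in the match.
The data and variable names are hypothetical:

    my $str = "price: 100 USD, 200 EUR";

    # Lookahead: digits that are followed by " USD".
    my @usd = $str =~ /(\d+)(?= USD)/g;           # ("100")

    # Lookbehind: digits that are preceded by ": ".
    my @after_colon = $str =~ /(?<=: )(\d+)/g;    # ("100")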

=head2 VARIABLES

   $_    Default variable for operators to use

   $`    Everything prior to matched string
   $&    Entire matched string
   $'    Everything after matched string

   ${^PREMATCH}   Everything prior to matched string
   ${^MATCH}      Entire matched string
   ${^POSTMATCH}  Everything after matched string

The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
within your program. Consult L<perlvar> for C<@LAST_MATCH_START>
to see equivalent expressions that won't cause a slowdown.
See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
and C<${^POSTMATCH}>, but for them to be defined, you have to
specify the C</p> (preserve) modifier on your regular expression.
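
A minimal sketch of the C</p> variables, assuming Perl 5.10 or later
(the variable names are illustrative):

    my ($pre, $match, $post);

    if ("hello world" =~ /o w/p) {
        ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    }

    # $pre is "hell", $match is "o w", $post is "orld"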

   $1, $2 ...  Hold the Nth captured expr
   $+    Last parenthesized pattern match
   $^N   Holds the most recently closed capture
   $^R   Holds the result of the last (?{...}) expr
   @-    Offsets of starts of groups. $-[0] holds start of whole match
   @+    Offsets of ends of groups. $+[0] holds end of whole match
   %+    Named capture buffers
   %-    Named capture buffers, as array refs

Captured groups are numbered according to their I<opening> paren.
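
The numbering rule and the offset arrays can be sketched as follows.
The example assumes Perl 5.10 or later for the named group
C<< (?<year>...) >>; the date string and variable names are made up:

    my $date = "2009-03-15";

    # Groups are numbered by opening paren, so the outer group
    # "03-15" is $2 and the groups nested inside it are $3 and $4.
    my @groups = $date =~ /(?<year>\d{4})-((\d\d)-(\d\d))/;
    # @groups is ("2009", "03-15", "03", "15")

    my $year = $+{year};                  # "2009", via the %+ hash
    my ($start, $end) = ($-[0], $+[0]);   # (0, 10) for the whole match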

=head2 FUNCTIONS

   lc          Lowercase a string
   lcfirst     Lowercase first char of a string
   uc          Uppercase a string
   ucfirst     Titlecase first char of a string

   pos         Return or set current match position
   quotemeta   Quote metacharacters
   reset       Reset ?pattern? status
   study       Analyze string for optimizing matching

   split       Use a regex to split a string into parts

The first four of these are like the escape sequences C<\L>, C<\l>,
C<\U>, and C<\u>.  For Titlecase, see L</Titlecase>.
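
A brief sketch of C<split>, C<pos> and C<quotemeta> in use, with
invented data and variable names:

    # split on commas with optional surrounding whitespace
    my @fields = split /\s*,\s*/, "a, b,c";   # ("a", "b", "c")

    # A scalar-context /g match records where it stopped.
    my $s     = "aXbXc";
    my $found = $s =~ /X/g;
    my $where = pos($s);                      # 2, just past the first "X"

    # quotemeta backslashes non-word characters.
    my $safe = quotemeta("1+1");              # 1\+1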

=head2 TERMINOLOGY

=head3 Titlecase

A Unicode concept that most often is equal to uppercase, but differs
for certain characters such as the German "sharp s".

=head1 AUTHOR

Iain Truskett.

This document may be distributed under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item *

L<perlretut> for a tutorial on regular expressions.

=item *

L<perlrequick> for a rapid tutorial.

=item *

L<perlre> for more details.

=item *

L<perlvar> for details on the variables.

=item *

L<perlop> for details on the operators.

=item *

L<perlfunc> for details on the functions.

=item *

L<perlfaq6> for FAQs on regular expressions.

=item *

The L<re> module to alter behaviour and aid
debugging.

=item *

L<perldebug/"Debugging regular expressions">

=item *

L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
for details on regexes and internationalisation.

=item *

I<Mastering Regular Expressions> by Jeffrey Friedl
(F<http://regex.info/>) for a thorough grounding and
reference on the topic.

=back

=head1 THANKS

David P.C. Wollmann,
Richard Soderberg,
Sean M. Burke,
Tom Christiansen,
Jim Cromie,
and
Jeffrey Goff
for useful advice.

=cut