=head1 NAME
-perlfaq4 - Data Manipulation ($Revision: 1.61 $, $Date: 2005/03/11 16:27:53 $)
+perlfaq4 - Data Manipulation ($Revision: 1.73 $, $Date: 2005/12/31 00:54:37 $)
=head1 DESCRIPTION
(despite appearances caused by bugs in your programs :-). see the
F<random> article in the "Far More Than You Ever Wanted To Know"
collection in http://www.cpan.org/misc/olddoc/FMTEYEWTK.tgz , courtesy of
-Tom Phoenix, talks more about this. John von Neumann said, ``Anyone
+Tom Phoenix, talks more about this. John von Neumann said, "Anyone
who attempts to generate random numbers by deterministic means is, of
-course, living in a state of sin.''
+course, living in a state of sin."
If you want numbers that are more random than C<rand> with C<srand>
provides, you should also check out the Math::TrulyRandom module from
CPAN. It uses the imperfections in your system's timer to generate
random numbers, but this takes quite a while. If you want a better
pseudorandom generator than comes with your operating system, look at
-``Numerical Recipes in C'' at http://www.nr.com/ .
+"Numerical Recipes in C" at http://www.nr.com/ .
=head2 How do I get a random number between X and Y?
The localtime function returns the day of the year. Without an
argument localtime uses the current time.
- $day_of_year = (localtime)[7];
+ $day_of_year = (localtime)[7];
The POSIX module can also format a date as the day of the year or
week of the year.
use POSIX qw/strftime/;
use Time::Local;
- my $week_of_year = strftime "%W",
+ my $week_of_year = strftime "%W",
localtime( timelocal( 0, 0, 0, 18, 11, 1987 ) );
-The Date::Calc module provides two functions for to calculate these.
+The Date::Calc module provides two functions to calculate these.
use Date::Calc;
my $day_of_year = Day_of_Year( 1987, 12, 18 );
sub get_century {
return int((((localtime(shift || time))[5] + 1999))/100);
}
-
+
sub get_millennium {
return 1+int((((localtime(shift || time))[5] + 1899))/1000);
}
=head2 How can I compare two dates and find the difference?
-If you're storing your dates as epoch seconds then simply subtract one
-from the other. If you've got a structured date (distinct year, day,
-month, hour, minute, seconds values), then for reasons of accessibility,
-simplicity, and efficiency, merely use either timelocal or timegm (from
-the Time::Local module in the standard distribution) to reduce structured
-dates to epoch seconds. However, if you don't know the precise format of
-your dates, then you should probably use either of the Date::Manip and
-Date::Calc modules from CPAN before you go hacking up your own parsing
-routine to handle arbitrary date formats.
+(contributed by brian d foy)
+
+You could just store all your dates as a number and then subtract. Life
+isn't always that simple though. If you want to work with formatted
+dates, the Date::Manip, Date::Calc, or DateTime modules can help you.
+
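As a rough illustration of the "store dates as numbers and subtract" idea, here is a minimal sketch that uses only the core Time::Local module. The helper name C<days_between> and the (year, month, day) array layout are assumptions made for this example. It works in UTC with C<timegm> so that summer time changes do not skew the count.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: whole days between two calendar dates.
# Each date is an array reference of [ year, month (1-12), day ].
sub days_between {
    my ( $from, $to ) = @_;

    # timegm wants a zero-based month; midnight UTC keeps every day 86,400 seconds long
    my $from_epoch = timegm( 0, 0, 0, $from->[2], $from->[1] - 1, $from->[0] );
    my $to_epoch   = timegm( 0, 0, 0, $to->[2],   $to->[1] - 1,   $to->[0] );

    return ( $to_epoch - $from_epoch ) / 86_400;
}

print days_between( [ 1987, 12, 18 ], [ 1987, 12, 25 ] ), "\n";   # prints 7
```

For anything beyond plain calendar dates (time zones, parsing formatted strings), the modules named above are still the safer route.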
=head2 How can I take a string and turn it into epoch seconds?
=head2 How do I find yesterday's date?
-If you only need to find the date (and not the same time), you
-can use the Date::Calc module.
-
- use Date::Calc qw(Today Add_Delta_Days);
+(contributed by brian d foy)
- my @date = Add_Delta_Days( Today(), -1 );
+Use one of the Date modules. The C<DateTime> module makes it simple, and
+gives you the same time of day, only the day before.
- print "@date\n";
+ use DateTime;
-Most people try to use the time rather than the calendar to
-figure out dates, but that assumes that your days are
-twenty-four hours each. For most people, there are two days
-a year when they aren't: the switch to and from summer time
-throws this off. Russ Allbery offers this solution.
-
- sub yesterday {
- my $now = defined $_[0] ? $_[0] : time;
- my $then = $now - 60 * 60 * 24;
- my $ndst = (localtime $now)[8] > 0;
- my $tdst = (localtime $then)[8] > 0;
- $then - ($tdst - $ndst) * 60 * 60;
- }
+ my $yesterday = DateTime->now->subtract( days => 1 );
-Should give you "this time yesterday" in seconds since epoch relative to
-the first argument or the current time if no argument is given and
-suitable for passing to localtime or whatever else you need to do with
-it. $ndst is whether we're currently in daylight savings time; $tdst is
-whether the point 24 hours ago was in daylight savings time. If $tdst
-and $ndst are the same, a boundary wasn't crossed, and the correction
-will subtract 0. If $tdst is 1 and $ndst is 0, subtract an hour more
-from yesterday's time since we gained an extra hour while going off
-daylight savings time. If $tdst is 0 and $ndst is 1, subtract a
-negative hour (add an hour) to yesterday's time since we lost an hour.
+ print "Yesterday was $yesterday\n";
-All of this is because during those days when one switches off or onto
-DST, a "day" isn't 24 hours long; it's either 23 or 25.
+You can also use the C<Date::Calc> module with its C<Today_and_Now>
+function.
-The explicit settings of $ndst and $tdst are necessary because localtime
-only says it returns the system tm struct, and the system tm struct at
-least on Solaris doesn't guarantee any particular positive value (like,
-say, 1) for isdst, just a positive value. And that value can
-potentially be negative, if DST information isn't available (this sub
-just treats those cases like no DST).
+ use Date::Calc qw( Today_and_Now Add_Delta_DHMS );
-Note that between 2am and 3am on the day after the time zone switches
-off daylight savings time, the exact hour of "yesterday" corresponding
-to the current hour is not clearly defined. Note also that if used
-between 2am and 3am the day after the change to daylight savings time,
-the result will be between 3am and 4am of the previous day; it's
-arguable whether this is correct.
-
-This sub does not attempt to deal with leap seconds (most things don't).
+ my @date_time = Add_Delta_DHMS( Today_and_Now(), -1, 0, 0, 0 );
+ print "@date_time\n";
+Most people try to use the time rather than the calendar to figure out
+dates, but that assumes that days are twenty-four hours each. For
+most people, there are two days a year when they aren't: the switch to
+and from summer time throws this off. Let the modules do the work.
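If neither DateTime nor Date::Calc is installed, one calendar-based alternative is to let the core POSIX module's C<mktime> do the arithmetic. This is only a sketch: the helper name C<same_time_yesterday> is made up for this example, and the result on the days when the clocks change depends on how your C library resolves ambiguous or nonexistent local times.

```perl
use strict;
use warnings;
use POSIX qw(mktime strftime);

# Hypothetical helper: epoch seconds for the same wall-clock time on the
# previous calendar day, relative to the argument or to the current time.
sub same_time_yesterday {
    my $now = defined $_[0] ? $_[0] : time;
    my ( $sec, $min, $hour, $mday, $mon, $year ) = localtime $now;

    # mktime normalizes an out-of-range day (such as day 0 of a month);
    # passing -1 for isdst asks the C library to work out summer time itself
    return mktime( $sec, $min, $hour, $mday - 1, $mon, $year, 0, 0, -1 );
}

print strftime( "%Y-%m-%d %H:%M:%S", localtime same_time_yesterday() ), "\n";
```

Because the subtraction happens on the day-of-month field rather than on a count of seconds, the sketch does not assume that a day is twenty-four hours long.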
=head2 Does Perl have a Year 2000 problem? Is Perl Y2K compliant?
That doesn't mean that Perl can't be used to create non-Y2K compliant
programs. It can. But so can your pencil. It's the fault of the user,
-not the language. At the risk of inflaming the NRA: ``Perl doesn't
-break Y2K, people do.'' See http://www.perl.org/about/y2k.html for
+not the language. At the risk of inflaming the NRA: "Perl doesn't
+break Y2K, people do." See http://www.perl.org/about/y2k.html for
a longer exposition.
=head1 Data: Strings
=head2 How do I validate input?
-The answer to this question is usually a regular expression, perhaps
-with auxiliary logic. See the more specific questions (numbers, mail
-addresses, etc.) for details.
+(contributed by brian d foy)
+
+There are many ways to ensure that values are what you expect or
+want to accept. Besides the specific examples that we cover in the
+perlfaq, you can also look at the modules with "Assert" and "Validate"
+in their names, along with other modules such as C<Regexp::Common>.
+
+Some modules have validation for particular types of input, such
+as C<Business::ISBN>, C<Business::CreditCard>, C<Email::Valid>,
+and C<Data::Validate::IP>.
=head2 How do I unescape a string?
-It depends just what you mean by ``escape''. URL escapes are dealt
+It depends just what you mean by "escape". URL escapes are dealt
with in L<perlfaq9>. Shell escapes with the backslash (C<\>)
character are removed with
=head2 How do I remove consecutive pairs of characters?
-To turn C<"abbcccd"> into C<"abccd">:
+(contributed by brian d foy)
+
+You can use the substitution operator to find pairs of characters (or
+runs of characters) and replace them with a single instance. In this
+substitution, we find a character in C<(.)>. The memory parentheses
+store the matched character in the back-reference C<\1> and we use
+that to require that the same thing immediately follow it. We replace
+that part of the string with the character in C<$1>.
- s/(.)\1/$1/g; # add /s to include newlines
+ s/(.)\1/$1/g;
-Here's a solution that turns "abbcccd" to "abcd":
+We can also use the transliteration operator, C<tr///>. In this
+example, the search list side of our C<tr///> contains nothing, but
+the C<c> option complements that so it contains everything. The
+replacement list also contains nothing, so the transliteration is
+almost a no-op since it won't do any replacements (or more exactly,
+replace the character with itself). However, the C<s> option squashes
+duplicated and consecutive characters in the string so a character
+does not show up next to itself.
- y///cs; # y == tr, but shorter :-)
+ my $str = 'Haarlem'; # in the Netherlands
+ $str =~ tr///cs; # Now Harlem, like in New York
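The two approaches are not interchangeable, which a side-by-side run makes clear. In this sketch the helper names C<collapse_pairs> and C<squeeze_runs> are invented for the example; the substitution removes one character from each pair it finds, while the transliteration squashes a whole run down to one character.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each consecutive pair with a single character
sub collapse_pairs {
    my $str = shift;
    $str =~ s/(.)\1/$1/g;
    return $str;
}

# Hypothetical helper: squash every run of a repeated character to one
sub squeeze_runs {
    my $str = shift;
    $str =~ tr///cs;
    return $str;
}

print collapse_pairs('abbcccd'), "\n";   # abccd
print squeeze_runs('abbcccd'),   "\n";   # abcd
```

With three consecutive C<c> characters, the substitution consumes the first two as a pair and leaves the third alone, so two remain.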
=head2 How do I expand function calls in a string?
-This is documented in L<perlref>. In general, this is fraught with
-quoting and readability problems, but it is possible. To interpolate
-a subroutine call (in list context) into a string:
+(contributed by brian d foy)
+
+This is documented in L<perlref>, and although it's not the easiest
+thing to read, it does work. In each of these examples, we call the
+function inside the braces used to dereference a reference. If we
+have more than one return value, we can construct and dereference an
+anonymous array. In this case, we call the function in list context.
+
+ print "The time values are @{ [localtime] }.\n";
+
+If we want to call the function in scalar context, we have to do a bit
+more work. We can have any code we like inside the braces, as long as
+it ends with a scalar reference; how you produce that reference is up
+to you.
+
+ print "The time is ${\(scalar localtime)}.\n";
- print "My sub returned @{[mysub(1,2,3)]} that time.\n";
+ print "The time is ${ my $x = localtime; \$x }.\n";
+
+If your function already returns a reference, you don't need to create
+the reference yourself.
+
+ sub timestamp { my $t = localtime; \$t }
+
+ print "The time is ${ timestamp() }.\n";
+
+The C<Interpolation> module can also do a lot of magic for you. You can
+specify a variable name, in this case C<E>, to set up a tied hash that
+does the interpolation for you. It has several other methods to do this
+as well.
+
+ use Interpolation E => 'eval';
+ print "The time values are $E{localtime()}.\n";
+
+In most cases, it is probably easier to simply use string concatenation,
+which also forces scalar context.
+
+ print "The time is " . localtime . ".\n";
=head2 How do I find matching/nesting anything?
characters, a pattern like C</x([^x]*)x/> will get the intervening
bits in $1. For multiple ones, then something more like
C</alpha(.*?)omega/> would be needed. But none of these deals with
-nested patterns. For balanced expressions using C<(>, C<{>, C<[>
-or C<< < >> as delimiters, use the CPAN module Regexp::Common, or see
-L<perlre/(??{ code })>. For other cases, you'll have to write a parser.
+nested patterns. For balanced expressions using C<(>, C<{>, C<[> or
+C<< < >> as delimiters, use the CPAN module Regexp::Common, or see
+L<perlre/(??{ code })>. For other cases, you'll have to write a
+parser.
If you are serious about writing a parser, there are a number of
modules or oddities that will make your life a lot easier. There are
the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced;
-and the byacc program. Starting from perl 5.8 the Text::Balanced
-is part of the standard distribution.
+and the byacc program. Starting with perl 5.8, Text::Balanced is
+part of the standard distribution.
One simple destructive, inside-out approach that you might try is to
pull out the smallest nesting parts one at a time:
=head2 How do I strip blank space from the beginning/end of a string?
-Although the simplest approach would seem to be
+(contributed by brian d foy)
- $string =~ s/^\s*(.*?)\s*$/$1/;
+A substitution can do this for you. For a single line, you want to
+replace all the leading or trailing whitespace with nothing. You
+can do that with a pair of substitutions.
-not only is this unnecessarily slow and destructive, it also fails with
-embedded newlines. It is much faster to do this operation in two steps:
+ s/^\s+//;
+ s/\s+$//;
- $string =~ s/^\s+//;
- $string =~ s/\s+$//;
+You can also write that as a single substitution, although it turns
+out the combined statement is slower than the separate ones. That
+might not matter to you, though.
-Or more nicely written as:
+ s/^\s+|\s+$//g;
- for ($string) {
- s/^\s+//;
- s/\s+$//;
- }
+In this regular expression, the alternation matches either at the
+beginning or the end of the string since the alternation has a lower
+precedence than the anchors. With the C</g> flag, the substitution
+makes all possible matches, so it gets both. Remember, the trailing
+newline matches the C<\s+>, and the C<$> anchor can match to the
+physical end of the string, so the newline disappears too. Just add
+the newline to the output, which has the added benefit of preserving
+"blank" (consisting entirely of whitespace) lines which the C<^\s+>
+would remove all by itself.
-This idiom takes advantage of the C<foreach> loop's aliasing
-behavior to factor out common code. You can do this
-on several strings at once, or arrays, or even the
-values of a hash if you use a slice:
+ while( <> )
+ {
+ s/^\s+|\s+$//g;
+ print "$_\n";
+ }
- # trim whitespace in the scalar, the array,
- # and all the values in the hash
- foreach ($scalar, @array, @hash{keys %hash}) {
- s/^\s+//;
- s/\s+$//;
- }
+For a multi-line string, you can apply the regular expression
+to each logical line in the string by adding the C</m> flag (for
+"multi-line"). With the C</m> flag, the C<$> matches I<before> an
+embedded newline, so it doesn't remove it. It still removes the
+newline at the end of the string.
+
+ $string =~ s/^\s+|\s+$//gm;
+
+Remember that lines consisting entirely of whitespace will disappear,
+since the first part of the alternation can match the entire string
+and replace it with nothing. If you need to keep embedded blank lines,
+you have to do a little more work. Instead of matching any whitespace
+(since that includes a newline), just match the other whitespace.
+
+ $string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
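To see the difference between the single-line and the multi-line forms, here is a small sketch that wraps each in a subroutine. The names C<trim> and C<trim_lines> are assumptions for this example, not functions that Perl provides.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace, including
# a trailing newline, from a single string
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

# Hypothetical helper: strip blanks from both ends of every logical line
# while leaving the newlines, and therefore the blank lines, in place
sub trim_lines {
    my $string = shift;
    $string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
    return $string;
}

print '[', trim("  hello world \n"), "]\n";    # [hello world]
print trim_lines("  a  \n\n  b  \n");          # "a", an empty line, then "b"
```

The second helper never matches a newline, so the line count of the string is the same before and after.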
=head2 How do I pad a string with blanks or pad a number with zeroes?
=head2 How can I remove duplicate elements from a list or array?
-There are several possible ways, depending on whether the array is
-ordered and whether you wish to preserve the ordering.
-
-=over 4
-
-=item a)
-
-If @in is sorted, and you want @out to be sorted:
-(this assumes all true values in the array)
-
- $prev = "not equal to $in[0]";
- @out = grep($_ ne $prev && ($prev = $_, 1), @in);
-
-This is nice in that it doesn't use much extra memory, simulating
-uniq(1)'s behavior of removing only adjacent duplicates. The ", 1"
-guarantees that the expression is true (so that grep picks it up)
-even if the $_ is 0, "", or undef.
-
-=item b)
-
-If you don't know whether @in is sorted:
-
- undef %saw;
- @out = grep(!$saw{$_}++, @in);
-
-=item c)
-
-Like (b), but @in contains only small integers:
+(contributed by brian d foy)
- @out = grep(!$saw[$_]++, @in);
+Use a hash. When you think the words "unique" or "duplicated", think
+"hash keys".
-=item d)
+If you don't care about the order of the elements, you could just
+create the hash then extract the keys. It's not important how you
+create that hash: just that you use C<keys> to get the unique
+elements.
-A way to do (b) without any loops or greps:
+ my %hash = map { $_, 1 } @array;
+ # or a hash slice: @hash{ @array } = ();
+ # or a foreach: $hash{$_} = 1 foreach ( @array );
- undef %saw;
- @saw{@in} = ();
- @out = sort keys %saw; # remove sort if undesired
+ my @unique = keys %hash;
-=item e)
+You can also go through each element and skip the ones you've seen
+before. Use a hash to keep track. The first time the loop sees an
+element, that element has no key in C<%seen>. The C<next> statement
+creates the key and immediately uses its value, which is C<undef>, so
+the loop continues to the C<push> and increments the value for that
+key. The next time the loop sees that same element, its key exists in
+the hash I<and> the value for that key is true (since it's not 0 or
+undef), so the C<next> skips that iteration and the loop goes to the next
+element.
-Like (d), but @in contains only small positive integers:
+ my @unique = ();
+ my %seen = ();
- undef @ary;
- @ary[@in] = @in;
- @out = grep {defined} @ary;
+ foreach my $elem ( @array )
+ {
+ next if $seen{ $elem }++;
+ push @unique, $elem;
+ }
-=back
+You can write this more briefly using a grep, which does the
+same thing.
-But perhaps you should have been using a hash all along, eh?
+ my %seen = ();
+ my @unique = grep { ! $seen{ $_ }++ } @array;
=head2 How can I tell whether a certain element is contained in a list or array?
+(portions of this answer contributed by Anno Siegel)
+
Hearing the word "in" is an I<in>dication that you probably should have
used a hash, not a list or array, to store your data. Hashes are
designed to answer this question quickly and efficiently. Arrays aren't.
Now check whether C<vec($read,$n,1)> is true for some C<$n>.
-Please do not use
+These methods guarantee fast individual tests but require a re-organization
+of the original list or array. They only pay off if you have to test
+multiple values against the same array.
- ($is_there) = grep $_ eq $whatever, @array;
+If you are testing only once, the standard module List::Util exports
+the function C<first> for this purpose. It works by stopping once it
+finds the element. It's written in C for speed, and its Perl equivalent
+looks like this subroutine:
-or worse yet
+ sub first (&@) {
+ my $code = shift;
+ foreach (@_) {
+ return $_ if &{$code}();
+ }
+ undef;
+ }
- ($is_there) = grep /$whatever/, @array;
+If speed is of little concern, the common idiom uses grep in scalar context
+(which returns the number of items that passed its condition) to traverse the
+entire list. This does have the benefit of telling you how many matches it
+found, though.
-These are slow (checks every element even if the first matches),
-inefficient (same reason), and potentially buggy (what if there are
-regex characters in $whatever?). If you're only testing once, then
-use:
+ my $is_there = grep $_ eq $whatever, @array;
- $is_there = 0;
- foreach $elt (@array) {
- if ($elt eq $elt_to_find) {
- $is_there = 1;
- last;
- }
- }
- if ($is_there) { ... }
+If you want to actually extract the matching elements, simply use grep in
+list context.
+
+ my @matches = grep $_ eq $whatever, @array;
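Putting the one-off tests together, a sketch might look like the following. The helper names C<is_in_list> and C<count_in_list> and the sample data are made up for this example. Note that C<first> returns the matching element itself, so the sketch tests the result with C<defined> in case the element is a false value such as 0.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical helper: true if $want is in @list; stops at the first hit
sub is_in_list {
    my ( $want, @list ) = @_;
    my $hit = first { $_ eq $want } @list;
    return defined $hit ? 1 : 0;
}

# Hypothetical helper: number of times $want appears; scans the whole list
sub count_in_list {
    my ( $want, @list ) = @_;
    return scalar grep { $_ eq $want } @list;
}

my @array = qw( apple banana cherry banana );

print "banana is there\n" if is_in_list( 'banana', @array );
print count_in_list( 'banana', @array ), " matches\n";    # 2 matches
```

If you expect to run many such tests against the same data, building a hash once, as described at the start of this answer, is still the better trade.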
=head2 How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
same thing. Once you find the element, you stop the loop with last.
my $found;
- foreach my $element ( @array )
+ foreach ( @array )
{
- if( /Perl/ ) { $found = $element; last }
+ if( /Perl/ ) { $found = $_; last }
}
If you want the array index, you can iterate through the indices
that satisfies the condition.
my( $found, $index ) = ( undef, -1 );
- for( $i = 0; $i < @array; $i++ )
- {
- if( $array[$i] =~ /Perl/ )
- {
- $found = $array[$i];
- $index = $i;
- last;
- }
- }
+ for( $i = 0; $i < @array; $i++ )
+ {
+ if( $array[$i] =~ /Perl/ )
+ {
+ $found = $array[$i];
+ $index = $i;
+ last;
+ }
+ }
=head2 How do I handle linked lists?
sub fisher_yates_shuffle {
my $deck = shift; # $deck is a reference to an array
my $i = @$deck;
- while ($i--) {
+ while (--$i) {
my $j = int rand ($i+1);
@$deck[$i,$j] = @$deck[$j,$i];
}
Use C<for>/C<foreach>:
for (@lines) {
- s/foo/bar/; # change that word
- y/XZ/ZX/; # swap those letters
+ s/foo/bar/; # change that word
+ tr/XZ/ZX/; # swap those letters
}
Here's another; let's compute spherical volumes:
for (@volumes = @radii) { # @volumes has changed parts
- $_ **= 3;
- $_ *= (4/3) * 3.14159; # this will be constant folded
+ $_ **= 3;
+ $_ *= (4/3) * 3.14159; # this will be constant folded
}
which can also be done with map() which is made to transform
case), you modify the value.
for $orbit ( values %orbits ) {
- ($orbit **= 3) *= (4/3) * 3.14159;
+ ($orbit **= 3) *= (4/3) * 3.14159;
}
Prior to perl 5.6 C<values> returned copies of the values,
=head2 How do I sort a hash (optionally by value instead of key)?
-Internally, hashes are stored in a way that prevents you from imposing
-an order on key-value pairs. Instead, you have to sort a list of the
-keys or values:
+(contributed by brian d foy)
+
+To sort a hash, start with the keys. In this example, we give the list of
+keys to the sort function which then compares them ASCIIbetically (which
+might be affected by your locale settings). The output list has the keys
+in ASCIIbetical order. Once we have the keys, we can go through them to
+create a report which lists the keys in ASCIIbetical order.
+
+ my @keys = sort { $a cmp $b } keys %hash;
- @keys = sort keys %hash; # sorted by key
- @keys = sort {
- $hash{$a} cmp $hash{$b}
- } keys %hash; # and by value
+ foreach my $key ( @keys )
+ {
+ printf "%-20s %6d\n", $key, $hash{$key};
+ }
-Here we'll do a reverse numeric sort by value, and if two keys are
-identical, sort by length of key, or if that fails, by straight ASCII
-comparison of the keys (well, possibly modified by your locale--see
-L<perllocale>).
+We could get more fancy in the C<sort()> block though. Instead of
+comparing the keys, we can compute a value with them and use that
+value as the comparison.
- @keys = sort {
- $hash{$b} <=> $hash{$a}
- ||
- length($b) <=> length($a)
- ||
- $a cmp $b
- } keys %hash;
+For instance, to make our report order case-insensitive, we use
+the C<\L> sequence in a double-quoted string to make everything
+lowercase. The C<sort()> block then compares the lowercased
+values to determine in which order to put the keys.
+
+ my @keys = sort { "\L$a" cmp "\L$b" } keys %hash;
+
+Note: if the computation is expensive or the hash has many elements,
+you may want to look at the Schwartzian Transform to cache the
+computation results.
+
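For reference, a Schwartzian Transform version of the case-insensitive sort could be sketched like this. The subroutine name C<caseless_sorted> is an assumption for this example. Each item is lowercased exactly once, paired with the original, sorted on the lowercased copy, and then unpacked.

```perl
use strict;
use warnings;

# Hypothetical helper: sort strings case-insensitively, computing the
# lowercased form of each item only once
sub caseless_sorted {
    my @items = @_;

    return
        map  { $_->[1] }                # 3. keep only the original string
        sort { $a->[0] cmp $b->[0] }    # 2. compare the cached lowercase copies
        map  { [ lc($_), $_ ] }         # 1. pair each string with its lowercase copy
        @items;
}

print join( ' ', caseless_sorted(qw( pear Apple banana )) ), "\n";   # Apple banana pear
```

Lowercasing is cheap, so the transform pays off only when the computed sort key is expensive; the shape of the code stays the same either way.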
+If we want to sort by the hash value instead, we use the hash key
+to look it up. We still get out a list of keys, but this time they
+are ordered by their value.
+
+ my @keys = sort { $hash{$a} <=> $hash{$b} } keys %hash;
+
+From there we can get more complex. If the hash values are the same,
+we can provide a secondary sort on the hash key.
+
+ my @keys = sort {
+ $hash{$a} <=> $hash{$b}
+ or
+ "\L$a" cmp "\L$b"
+ } keys %hash;
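Here is the combined sort run against some made-up data. The subroutine name C<keys_by_value> and the sample hash are assumptions for this example; the sort block is the one shown above, applied through a hash reference.

```perl
use strict;
use warnings;

# Hypothetical helper: keys ordered by numeric value, with ties broken
# by a case-insensitive comparison of the keys. Call it in list context.
sub keys_by_value {
    my ($hash) = @_;

    return sort {
        $hash->{$a} <=> $hash->{$b}
            or
        "\L$a" cmp "\L$b"
    } keys %$hash;
}

my %count = ( apple => 3, Banana => 1, cherry => 3, date => 1 );

foreach my $key ( keys_by_value( \%count ) ) {
    printf "%-20s %6d\n", $key, $count{$key};
}
```

The report lists C<Banana> and C<date> first (both have the value 1), followed by C<apple> and C<cherry> (both 3).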
=head2 How can I always keep my hash sorted?
=head2 How can I use a reference as a hash key?
-You can't do this directly, but you could use the standard Tie::RefHash
-module distributed with Perl.
+(contributed by brian d foy)
+
+Hash keys are strings, so you can't really use a reference as the key.
+When you try to do that, perl turns the reference into its stringified
+form (for instance, C<HASH(0xDEADBEEF)>). From there you can't get back
+the reference from the stringified form, at least without doing some
+extra work on your own. Also remember that hash keys must be unique, but
+two different variables can store the same reference (and those variables
+can change later).
+
+The Tie::RefHash module, which is distributed with perl, might be what
+you want. It handles that extra work.
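A minimal sketch of Tie::RefHash in use follows. The helper name C<roundtrip_keys> and the sample data are invented for this example. The point is that the keys you get back from the tied hash are still real references that you can dereference.

```perl
use strict;
use warnings;
use Tie::RefHash;

# Hypothetical helper: store each reference as a hash key, then return the keys
sub roundtrip_keys {
    my @refs = @_;

    tie my %seen, 'Tie::RefHash';
    $seen{$_} = 1 for @refs;

    return keys %seen;
}

my $point = [ 3, 4 ];
my ($key) = roundtrip_keys($point);

print "x is $key->[0]\n";    # x is 3
```

With an ordinary hash, C<$key> would be a plain string such as C<ARRAY(0x...)>, and the dereference would fail under C<use strict>.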
=head1 Data: Misc
wrapper function for more convenient access. This function takes
a string and returns the number it found, or C<undef> for input that
isn't a C float. The C<is_numeric> function is a front end to C<getnum>
-if you just want to say, ``Is this a float?''
+if you just want to say, "Is this a float?"
sub getnum {
use POSIX qw(strtod);
=head1 AUTHOR AND COPYRIGHT
-Copyright (c) 1997-2005 Tom Christiansen, Nathan Torkington, and
+Copyright (c) 1997-2006 Tom Christiansen, Nathan Torkington, and
other authors as noted. All rights reserved.
This documentation is free; you can redistribute it and/or modify it