=head1 Perl Slurp Ease

=head2 Introduction

One of the common Perl idioms is processing text files line by line:

        while( <FH> ) {
                # do something with $_
        }

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and being
easily taught to and understood by Perl newbies. In fact, newbies are
the ones who do silly things like this:

        while( <FH> ) {
                push @lines, $_ ;
        }

        foreach ( @lines ) {
                # do something with $_
        }

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, which is commonly known as slurping. Slurping has
somewhat of a poor reputation and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and it
is not something you should do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line processing, if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. I advocate slurping
only disk files and only when you know their size is reasonable and you
have a real reason to process the file as a whole. Note that a reasonable
size these days is larger than in the bad old days of limited RAM.
Slurping in a megabyte-sized file is not an issue on most systems, but
most of the files I tend to slurp in are much smaller than that. Typical
files that work well with slurping are configuration files,
(mini)language scripts, some data (especially binary) files, and other
files of known sizes which need fast processing.

Another major win for slurping over line by line is speed. Perl's IO
system (like many others) is slow. Calling <> for each line requires a
check for the end of line, checks for EOF, copying a line, munging the
internal handle structure, etc. That is plenty of work for each line
read in. Slurping, if done correctly, will usually involve only one
IO call and no extra data copying. The same is true for writing files to
disk and we will cover that as well (even though the term slurping
traditionally refers to a read operation, I use it for the concept of
doing IO with an entire file in one operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or not easily done with
line by line processing. These include global search/replace (without
regard for newlines), grabbing all matches with one call of m//g,
complex parsing (which in many cases must ignore newlines), processing
*ML (where line endings are just white space) and performing complex
transformations such as template expansion.

=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

        my $text = read_file( $file ) ;
        my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the /m modifier), the '=' char and the text to the end of the
line (again /m makes that work). In fact the ending $ is not even needed
since . will not normally match a newline. Since the key and value are
grabbed and the m// is in list context with the /g modifier, it will
grab all key/value pairs and return them. The %config hash will be
assigned this list and now you have the file fully parsed into a hash.
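
To try that parse without touching the disk, here is a minimal sketch.
The contents of $text are made up and only stand in for whatever
read_file would return:

```perl
use strict ;
use warnings ;

# Made-up file contents, standing in for the result of read_file( $file ).
my $text = "name=slurp\nmode=fast\nsize=42\n" ;

# /m lets ^ and $ match at every line inside the string; /g in list
# context returns all the captured key/value pairs as one flat list.
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$_ => $config{$_}\n" for sort keys %config ;
```

Each line of the string ends up as one key and one value in the hash,
with no explicit loop over lines.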

Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

        $text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

        $text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/template($1, $2)/sge ;

Just supply a template sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the /s modifier. This is
something that is much trickier with line by line processing.
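
Here is a small sketch of both substitutions working together. The
template text, the %template values and the body of the template sub
are all invented for illustration; your own template sub would do
whatever expansion you need:

```perl
use strict ;
use warnings ;

# Invented template data for this example.
my %template = ( name => 'World', count => 2 ) ;

# Hypothetical chunk handler: repeats the chunk body 'count' times.
sub template {
        my( $chunk_name, $body ) = @_ ;
        return $body x $template{count} ;
}

my $text = "Hello <%name%>\n<%ROW_START%>row\n<%ROW_END%>done\n" ;

# Expand the chunks first (/s lets . cross newlines, /e calls the sub),
# then fill in the simple scalar tags.
$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/template($1, $2)/sge ;
$text =~ s/<%(\w+)%>/$template{$1}/g ;

print $text ;
```

The chunk pass runs first so that the START and END markers are gone
before the simple tag pass sees the text.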

Note that this is a very simple templating system and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad template modules on CPAN, you will gain by having
speedier ways to read/write files.

Slurping a file into an array also offers some useful advantages.

=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping
a file into a list of lines is trivial; just call the <> operator in
list context:

        my @lines = <FH> ;

Slurping to a scalar isn't much more work. Just set the built-in
variable $/ (the input record separator) to the undefined value and read
in the file with <>:

        {
                local( $/, *FH ) ;
                open( FH, $file ) or die "sudden flaming death\n" ;
                $text = <FH> ;
        }

Notice the use of local(). It sets $/ to undef for you and when the
scope exits it will revert $/ back to its previous value (most likely
"\n"). Here is a Perl idiom that allows the $text variable to be
declared without the need for a tightly nested block. The do block
will execute the <FH> in scalar context and slurp in the file, which is
assigned to $text.

        local( *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:

        {
                local( $/ ) ;
                open( my $fh, $file ) or die "sudden flaming death\n" ;
                $text = <$fh> ;
        }

        open( my $fh, $file ) or die "sudden flaming death\n" ;
        my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

        my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in $file is assigned to a localized @ARGV and the null
filehandle is used, which reads the data from the files in @ARGV.

Instead of assigning to a scalar, all the above slurps can assign to an
array and it will get the file but split into lines (using $/ as the end
of line marker, so $/ must not be localized to undef for this).
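
The following sketch runs both kinds of slurp against a scratch file so
you can compare the results. The scratch file comes from File::Temp and
I use the three-argument form of open here; both are my choices for the
example and not something the idioms above require:

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Create a small scratch file to slurp.
my( $out_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $out_fh "one\ntwo\nthree\n" ;
close $out_fh or die "close $file: $!" ;

# Scalar slurp: $/ is undef only inside the do block.
open( my $fh, '<', $file ) or die "open $file: $!" ;
my $text = do { local( $/ ) ; <$fh> } ;
close $fh ;

# List slurp: $/ is back to its default, so we get one element per line.
open( $fh, '<', $file ) or die "open $file: $!" ;
my @lines = <$fh> ;
close $fh ;

printf "%d bytes, %d lines\n", length( $text ), scalar( @lines ) ;
```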

There is one common variant of those slurps which is very slow and not
good code. You see it around and it is almost always cargo cult code:

        my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (join provides a list
context to <FH>) and then joins up those lines again. The original coder
of this idiom obviously never read perlvar and learned how to use $/ to
allow scalar slurping.

=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it slurping when we read in files but there
is no commonly accepted term for the write operation. I asked some Perl
colleagues and got two interesting nominations. Peter Scott said to call
it burping (rhymes with slurping and the noise is the opposite
direction). Others suggested spewing which has a stronger visual image
:-) Tell me your favorite or suggest your own. I will use both in this
section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp sub:

        sub burp {
                my( $file_name ) = shift ;
                open( my $fh, ">$file_name" ) ||
                                 die "can't create $file_name $!" ;
                print $fh @_ ;
        }

Note that it doesn't copy the input text but passes @_ directly to
print. We will look at faster variations of that later on.
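
As a quick check, here is a sketch that burps a couple of lines and
slurps them back. The scratch directory and the file name are invented
for the example, and I swapped in the three-argument open so that odd
characters in the file name can't change the open mode:

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

# The burp sub from above, with a three-argument open.
sub burp {
        my( $file_name ) = shift ;
        open( my $fh, '>', $file_name ) ||
                         die "can't create $file_name $!" ;
        print $fh @_ ;
}

# Made-up scratch location for this example.
my $dir  = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/burp.txt" ;

burp( $file, "line 1\n", "line 2\n" ) ;

# $fh went out of scope when burp returned, so the data has been
# flushed and the file closed by the time we read it back.
open( my $in, '<', $file ) or die "can't open $file $!" ;
my $text = do { local( $/ ) ; <$in> } ;
close $in ;

print $text ;
```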

=head2 Slurp on the CPAN

As you would expect there are modules on CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }

    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }

    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }

The sub slurp uses the magic undefined value of $/ and the magic file
handle ARGV to support slurping into a scalar or array. It also provides
two wrapper subs that allow the caller to control the context of the
slurp. The to_array sub will return the list of slurped lines or an
anonymous array of them according to its caller's context by checking
wantarray. It has 'slurp' in @EXPORT and all three subs in @EXPORT_OK.
A final point is that Slurp.pm is poorly named and it shouldn't be in
the top level namespace.

File::Slurp.pm has this code:

        sub read_file
        {
                my ($file) = @_;

                local($/) = wantarray ? $/ : undef;
                local(*F);
                my $r;
                my (@r);

                open(F, "<$file") || croak "open $file: $!";
                @r = <F>;
                close(F) || croak "close $file: $!";

                return $r[0] unless wantarray;
                return @r;
        }

This module provides several subs including read_file (more on the
others later). read_file behaves similarly to Slurp::slurp in that it
will slurp a list of lines or a single scalar depending on the caller's
context. It also uses the magic undefined value of $/ for scalar
slurping but it uses an explicit open call rather than using a localized
@ARGV as the other module did. Also it doesn't provide a way to get an
anonymous array of the lines but that can easily be rectified by calling
it inside an anonymous array constructor [].

Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic $/ to slurp in scalar mode and the
natural behavior of <> in list context to slurp as lines. But neither is
optimized for speed nor can they handle binmode to support binary or
Unicode files. See below for more on slurp features and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have a very simple API and don't support
binmode. This section will cover various API design issues such as
efficient return by reference, binmode and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines and as an anonymous array of lines. But the caller can only
provide two contexts, scalar or list. So we have to either provide an
API with more than one sub as Slurp.pm did or just provide one sub which
only returns a scalar or a list (no anonymous array) as File::Slurp.pm
does.

I have used my own read_file sub for years and it has the same API as
File::Slurp.pm: a single sub which returns a scalar or a list of lines
depending on context. But I recognize the interest of those that want an
anonymous array for line slurping. For one thing, it is easier to pass
around to other subs and for another, it eliminates the extra copying of
the lines via return. So my module will support multiple subs: one that
returns the file based on context and another that returns only lines
(either as a list or as an anonymous array). So this API is in between
the two CPAN modules. There is no need for a specific slurp-as-a-scalar
sub as the general slurp will do that in scalar context. If you
want to slurp a scalar into an array element, just assign to the desired
array element and that will provide scalar context to the read_file sub.
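
To make the context rules concrete, here is a sketch with a stand-in
sub. The name fake_read_file and its fixed data are invented so that
only the context handling is on display; it is not the real read_file:

```perl
use strict ;
use warnings ;

# Hypothetical stand-in for read_file: it "reads" a fixed string and
# returns lines in list context or the whole text in scalar context.
sub fake_read_file {
        my $data = "a\nb\nc\n" ;
        return wantarray ? split( /(?<=\n)/, $data ) : $data ;
}

my $text      = fake_read_file() ;      # scalar context: whole file
my @lines     = fake_read_file() ;      # list context: one line each
my $lines_ref = [ fake_read_file() ] ;  # anonymous array of lines

my @files ;
$files[0] = fake_read_file() ;          # element assignment is scalar context

printf "%d lines, %d bytes in \$files[0]\n",
        scalar( @lines ), length( $files[0] ) ;
```

The anonymous array constructor supplies list context, which is how you
get a reference to the lines from a sub that only knows about scalar and
list context.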

The next area to cover is what to name these subs. I will go with
read_file and read_file_lines. They are descriptive, simple and don't
use the 'slurp' nickname (though that nick is in the module name).

Another critical area when designing APIs is how to pass in
arguments. The read_file sub takes one required argument which is the
file name. To support binmode we need another optional argument. And a
third optional argument is needed to support returning a slurped scalar
by reference. My first thought was to design the API with 3 positional
arguments - file name, buffer reference and binmode. But if you want to
set the binmode and not pass in a buffer reference, you have to fill the
second argument with undef and that is ugly. So I decided to make the
filename argument positional and the other two pass by name.
The sub will start off like this:

        sub read_file {

                my( $file_name, %args ) = @_ ;

                my $buf ;
                my $buf_ref = $args{'buf_ref'} || \$buf ;

The binmode argument will be handled later (see code below).

The other sub, read_file_lines, will only take an optional binmode (so
you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if called
in scalar context. So this sub could use positional arguments but to
keep its API similar to the API of read_file, it will also use pass by
name for the optional arguments. This also means that new optional
arguments can be added later without breaking any legacy code. A bonus
of keeping the API the same for both subs will be seen in how the two
subs are optimized to work together.

Write slurping (or spewing or burping :-) needs to have its API designed
as well. The biggest issue is that it needs to support not only optional
arguments but also a list of data to be written. Perl 6 can
handle that with optional named arguments and a final slurpy
argument. Since this is Perl 5 we have to do it using some
cleverness. The first argument is the file name and it will be
positional as with the read_file sub. But how can we pass in the
optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it holds the optional arguments and the
rest of the arguments are the data list. So the write_file sub will
start off like this:

        sub write_file {

                my $file_name = shift ;

                my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in @_ to minimize any more copying. You call write_file like this:

        write_file( 'foo', { binmode => ':raw' }, @data ) ;
        write_file( 'junk', { append => 1 }, @more_junk ) ;
        write_file( 'bar', @spew ) ;
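
Here is a sketch of just the argument handling. The helper name
parse_write_args is made up; it only reports how the arguments were
split up, where a real write_file would go on to open the file and
write out @_:

```perl
use strict ;
use warnings ;

# Hypothetical helper: returns the file name, the options hash ref and
# the number of data items left in @_.
sub parse_write_args {
        my $file_name = shift ;

        # a leading hash ref is taken to be the options, never data
        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

        return( $file_name, $args, scalar @_ ) ;
}

my( $name1, $opts1, $count1 ) =
        parse_write_args( 'junk', { append => 1 }, 'a', 'b' ) ;
my( $name2, $opts2, $count2 ) =
        parse_write_args( 'bar', 'x', 'y', 'z' ) ;

print "$name1: append=$opts1->{append}, $count1 data items\n" ;
print "$name2: no options, $count2 data items\n" ;
```

When no hash reference is passed, the empty default hash means the rest
of the sub can look up options without checking whether any were given.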

=head2 Fast Slurping

=head2 Benchmarks

=head2 Error Handling

Slurp subs are subject to conditions such as not being able to open the
file or I/O errors. How these errors are handled and what the caller
will see are important aspects of the design of an API. The classic
error handling for slurping has been to call die or, even better,
croak. But sometimes you want the slurp to either warn/carp or stay
quiet and allow your code to handle the error. Sure, this can be done by
wrapping the slurp in an eval block to catch a fatal error, but not
everyone wants all that extra code. So I have added another option to
all the subs which selects the error handling. If the 'err_mode' option
is 'croak' (which is also the default), the called sub will croak. If set
to 'carp' then carp will be called. Set to any other string (use 'quiet'
by convention) and no error handler call is made. Then the caller can
use the error status from the call.

C<write_file> doesn't use the return value for data so it can return a
false status value in-band to mark an error. C<read_file> does use its
return value for data but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a (scalar or array) reference. So a bare return
would work here. But if you slurp in lines by calling it in list
context, a bare return will return an empty list, which is the same value
it would return for an existing but empty file. So now C<read_file> will
do something I normally strongly advocate against, which is explicitly
returning undef. In the scalar contexts this still signals an error and
in list context the returned first value will be undef, which is not
legal data for the first element. So the list context also gets an error
status it can detect:

        my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
        your_handle_error( "$file_name can't be read\n" )
                                        if @lines && ! defined $lines[0] ;
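
Here is one way the err_mode dispatch could be sketched. The names
_error and read_file_sketch, and the path /no/such/file, are invented
for this example and I am assuming that path does not exist on your
system. It uses a plain open to keep the focus on the error handling:

```perl
use strict ;
use warnings ;
use Carp ;

# Hypothetical helper: report an error according to the err_mode option.
sub _error {
        my( $args, $err_msg ) = @_ ;

        my $mode = $args->{'err_mode'} || 'croak' ;

        croak $err_msg if $mode eq 'croak' ;
        carp  $err_msg if $mode eq 'carp' ;

        # explicit undef on purpose, so that list context gets ( undef )
        return undef ;
}

sub read_file_sketch {
        my( $file_name, %args ) = @_ ;

        open( my $fh, '<', $file_name )
                or return _error( \%args, "can't open $file_name: $!" ) ;

        return <$fh> if wantarray ;

        local( $/ ) ;
        return scalar <$fh> ;
}

my @lines = read_file_sketch( '/no/such/file', err_mode => 'quiet' ) ;
print "read failed\n" if @lines && ! defined $lines[0] ;
```

An empty file comes back as an empty list, so only a list whose first
element is undef counts as an error.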

=head2 File::FastSlurp

        use Carp ;
        use Fcntl ;     # O_RDONLY, O_WRONLY, O_CREAT and friends

        sub read_file {

                my( $file_name, %args ) = @_ ;

                my $buf ;
                my $buf_ref = $args{'buf_ref'} || \$buf ;
                ${$buf_ref} = '' ;

                my $mode = O_RDONLY ;
                $mode |= O_BINARY if $args{'binmode'} ;

                local( *FH ) ;
                unless( sysopen( FH, $file_name, $mode ) ) {

                        carp "Can't open $file_name: $!" ;
                        return undef ;
                }

                my $size_left = -s FH ;

                while( $size_left > 0 ) {

                        my $read_cnt = sysread( FH, ${$buf_ref},
                                        $size_left, length ${$buf_ref} ) ;

        # undef means a read error, 0 means we hit EOF early

                        unless( $read_cnt ) {

                                carp "read error in file $file_name: $!"
                                                unless defined $read_cnt ;
                                last ;
                        }

                        $size_left -= $read_cnt ;
                }

        # handle void context (return scalar by buffer reference)

                return unless defined wantarray ;

        # handle list context (split after each line ending)

                my $sep = $/ ;
                return split m|(?<=\Q$sep\E)|, ${$buf_ref} if wantarray ;

        # handle scalar context

                return ${$buf_ref} ;
        }


        sub read_file_lines {

        # handle list context

                return &read_file if wantarray ;

        # otherwise handle scalar context

                return [ &read_file ] ;
        }

        sub write_file {

                my $file_name = shift ;

                my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
                my $buf = join '', @_ ;

        # create the file if needed, then append to it or truncate it

                my $mode = O_WRONLY | O_CREAT ;
                $mode |= O_BINARY if $args->{'binmode'} ;
                $mode |= $args->{'append'} ? O_APPEND : O_TRUNC ;

                local( *FH ) ;
                unless( sysopen( FH, $file_name, $mode ) ) {

                        carp "Can't open $file_name: $!" ;
                        return ;
                }

                my $size_left = length( $buf ) ;
                my $offset = 0 ;

                while( $size_left > 0 ) {

                        my $write_cnt = syswrite( FH, $buf,
                                        $size_left, $offset ) ;

                        unless( $write_cnt ) {

                                carp "write error in file $file_name: $!" ;
                                return ;
                        }

                        $size_left -= $write_cnt ;
                        $offset += $write_cnt ;
                }

        # true status on success, the bare returns above mark an error

                return 1 ;
        }

=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put out
to pasture. Perl 6 will allow you to set a 'slurp' property on file
handles and when you read from such a handle, the file is slurped. List
and scalar context will still be supported so you can slurp into lines
or a scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still take care not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte-sized files
is not a major issue on today's systems with the typical amount of
RAM installed. When Perl was first being used in depth (Perl 4),
slurping was limited by the small RAM sizes of 10 years ago. Now you
should be able to slurp almost any reasonably sized file, be it
configuration, source code, data, etc.
 515 should be able to slurp most any reasonably sized file be they
 516 configurations, source code, data, etc.