slurp_talk/slurp_article.pod

   1 =head1 Perl Slurp Ease
   2
   3 =head2 Introduction
   4
   5
   6 One of the common Perl idioms is processing text files line by line:
   7
   8         while( <FH> ) {
   9                 # do something with $_
  10         }
  11
  12 This idiom has several variants, but the key point is that it reads in
  13 only one line from the file in each loop iteration. This has several
  14 advantages: it limits memory use to one line, it can handle a file of
  15 any size (including data piped in via STDIN), and it is easily taught
  16 to and understood by Perl newbies. In fact newbies are the
  17 ones who do silly things like this:
  18
  19         while( <FH> ) {
  20                 push @lines, $_ ;
  21         }
  22
  23         foreach ( @lines ) {
  24                 # do something with $_
  25         }
  26
  27 Line by line processing is fine, but it isn't the only way to deal with
  28 reading files. The other common style is reading the entire file into a
  29 scalar or array, and that is commonly known as slurping. Now, slurping has
  30 somewhat of a poor reputation, and this article is an attempt at
  31 rehabilitating it. Slurping files has advantages and limitations, and is
  32 not something you should just do when line by line processing is fine.
  33 It is best when you need the entire file in memory for processing all at
  34 once. Slurping with in memory processing can be faster and lead to
  35 simpler code than line by line if done properly.
  36
  37 The biggest issue to watch for with slurping is file size. Slurping very
  38 large files or unknown amounts of data from STDIN can be disastrous to
  39 your memory usage and cause swap disk thrashing.  You can slurp STDIN if
  40 you know that you can handle the maximum size input without
  41 detrimentally affecting your memory usage. So I advocate slurping only
  42 disk files and only when you know their size is reasonable and you have
  43 a real reason to process the file as a whole.  Note that reasonable size
  44 these days is larger than the bad old days of limited RAM. Slurping in a
  45 megabyte is not an issue on most systems. But most of the
  46 files I tend to slurp in are much smaller than that. Typical files that
  47 work well with slurping are configuration files, (mini-)language scripts,
  48 some data (especially binary) files, and other files of known sizes
  49 which need fast processing.
  50
  51 Another major win for slurping over line by line is speed. Perl's IO
  52 system (like many others) is slow. Calling C<< <> >> for each line
  53 requires a check for the end of line, checks for EOF, copying a line,
  54 munging the internal handle structure, etc. Plenty of work for each line
  55 read in. Whereas slurping, if done correctly, will usually involve only
  56 one I/O call and no extra data copying. The same is true for writing
  57 files to disk, and we will cover that as well (even though the term
  58 slurping is traditionally a read operation, I use the term ``slurp'' for
  59 the concept of doing I/O with an entire file in one operation).
  60
  61 Finally, when you have slurped the entire file into memory, you can do
  62 operations on the data that are not possible or easily done with line by
  63 line processing. These include global search/replace (without regard for
  64 newlines), grabbing all matches with one call of C<//g>, complex parsing
  65 (which in many cases must ignore newlines), processing *ML (where line
  66 endings are just white space) and performing complex transformations
  67 such as template expansion.
  68
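As a minimal sketch of the whole-buffer operations just listed (the
sample text is made up and stands in for a slurped file), one C<//g>
match in list context grabs every hit and one substitution can match
across a line ending:

```perl
use strict ;
use warnings ;

# Stand-in for a slurped file; the content is invented for illustration.
my $text = "alpha beta\ngamma\nbeta delta\nalpha\nbeta\n" ;

# grab every match with one call of //g in list context
my @betas = $text =~ /\bbeta\b/g ;

# a search/replace that may span a line ending, which a line by line
# loop would never see as a single match
( my $joined = $text ) =~ s/alpha\s+beta/ALPHA-BETA/g ;

print scalar( @betas ), " matches\n" ;
print $joined ;
```

A line by line loop would also find the three C<beta> words, but it
could not treat C<alpha> and C<beta> on consecutive lines as one match.
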
  69 =head2 Global Operations
  70
  71 Here are some simple global operations that can be done quickly and
  72 easily on an entire file that has been slurped in. They could also be
  73 done with line by line processing but that would be slower and require
  74 more code.
  75
  76 A common problem is reading in a file with key/value pairs. There are
  77 modules which do this but who needs them for simple formats? Just slurp
  78 in the file and do a single parse to grab all the key/value pairs.
  79
  80         my $text = read_file( $file ) ;
  81         my %config = $text =~ /^(\w+)=(.+)$/mg ;
  82
  83 That matches a key which starts a line (anywhere inside the string
  84 because of the C</m> modifier), the '=' char and the text to the end of the
  85 line (again, C</m> makes that work). In fact the ending C<$> is not even needed
  86 since C<.> will not normally match a newline. Since the key and value are
  87 grabbed and the C<m//> is in list context with the C</g> modifier, it will
  88 grab all key/value pairs and return them. The C<%config> hash will be
  89 assigned this list and now you have the file fully parsed into a hash.
  90
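Here is a self-contained sketch of that parse. It assumes a literal
string in place of the C<read_file()> call, and the keys and values are
invented for the example:

```perl
use strict ;
use warnings ;

# Stand-in for: my $text = read_file( $file ) ;
my $text = "# settings\nname=slurp\n\nmode=fast\nsize=121000\n" ;

# one match in list context returns every key/value pair
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$_ => $config{$_}\n" for sort keys %config ;
```

Lines that do not look like C<key=value> (the comment and the blank
line here) never match, so they need no special handling.
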
  91 Various projects I have worked on needed some simple templating and I
  92 wasn't in the mood to use a full module (please, no flames about your
  93 favorite template module :-). So I rolled my own by slurping in the
  94 template file, setting up a template hash and doing this one line:
  95
  96         $text =~ s/<%(.+?)%>/$template{$1}/g ;
  97
  98 That only works if the entire file was slurped in. With a little
  99 extra work it can handle chunks of text to be expanded:
 100
 101         $text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;
 102
 103 Just supply a C<template> sub to expand the text between the markers and
 104 you have yourself a simple system with minimal code. Note that this will
 105 work and grab over multiple lines due to the C</s> modifier. This is
 106 something that is much trickier with line by line processing.
 107
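One possible C<template> sub is sketched below. It is only an
assumption about what such a sub might do: the C<%rows> data, the
record fields and the rule of repeating the chunk once per record are
all invented for this example.

```perl
use strict ;
use warnings ;

# Hypothetical data: each named chunk is repeated once per record.
my %rows = (
        USER => [ { name => 'alice' }, { name => 'bob' } ],
) ;

# Hypothetical expander: fill the <%field%> tags inside the chunk
# from each record and concatenate the results.
sub template {
        my( $name, $chunk ) = @_ ;

        return join '', map {
                my $row = $_ ;
                ( my $out = $chunk ) =~ s/<%(\w+)%>/$row->{$1}/g ;
                $out ;
        } @{ $rows{$name} || [] } ;
}

my $text = "Users:\n<%USER_START%>  * <%name%>\n<%USER_END%>Done\n" ;

$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/template( $1, $2 )/sge ;

print $text ;
```

The chunk between the markers includes its newline, which is why the
C</s> modifier is needed for C<.+?> to match it.
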
 108 Note that this is a very simple templating system, and it can't directly
 109 handle nested tags and other complex features. But even if you use one
 110 of the myriad of template modules on the CPAN, you will gain by having
 111 speedier ways to read and write files.
 112
 113 Slurping in a file into an array also offers some useful advantages.
 114 One simple example is reading in a flat database where each record has
 115 fields separated by a character such as C<:>:
 116
 117         my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;
 118
 119 Random access to any line of the slurped file is another advantage. Also
 120 a line index could be built to speed up searching the array of lines.
 121
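A sketch of such an index follows. It uses a made-up list of records in
place of the C<read_file()> call, and the C<%index_of> hash is just one
way to build a line index:

```perl
use strict ;
use warnings ;

# Stand-in for the list that read_file( '/etc/passwd' ) would return;
# the records are invented.
my @lines = (
        "root:x:0:0:root:/root:/bin/sh\n",
        "uri:x:1000:1000:Uri:/home/uri:/bin/bash\n",
) ;

# one array of fields per record (chomp first so the last field is clean)
chomp @lines ;
my @pw_fields = map [ split /:/ ], @lines ;

# index from login name to position in the array of records
my %index_of ;
$index_of{ $pw_fields[$_][0] } = $_ for 0 .. $#pw_fields ;

my $shell = $pw_fields[ $index_of{uri} ][6] ;
print "shell for uri is $shell\n" ;
```
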
 122
 123 =head2 Traditional Slurping
 124
 125 Perl has always supported slurping files with minimal code. Slurping of
 126 a file to a list of lines is trivial, just call the C<< <> >> operator
 127 in a list context:
 128
 129         my @lines = <FH> ;
 130
 131 and slurping to a scalar isn't much more work. Just set the built in
 132 variable C<$/> (the input record separator) to the undefined value and
 133 read in the file with C<< <> >>:
 134
 135         {
 136                 local( $/, *FH ) ;
 137                 open( FH, $file ) or die "sudden flaming death\n" ;
 138                 $text = <FH> ;
 139         }
 140
 141 Notice the use of C<local()>. It sets C<$/> to C<undef> for you and when
 142 the scope exits it will revert C<$/> back to its previous value (most
 143 likely "\n").
 144
 145 Here is a Perl idiom that allows the C<$text> variable to be declared,
 146 and there is no need for a tightly nested block. The C<do> block will
 147 execute C<< <FH> >> in a scalar context and slurp in the file named by
 148 C<$file>:
 149
 150         local( *FH ) ;
 151         open( FH, $file ) or die "sudden flaming death\n" ;
 152         my $text = do { local( $/ ) ; <FH> } ;
 153
 154 Both of those slurps used localized filehandles to be compatible with
 155 5.005. Here they are with 5.6.0 lexical autovivified handles:
 156
 157         {
 158                 local( $/ ) ;
 159                 open( my $fh, $file ) or die "sudden flaming death\n" ;
 160                 $text = <$fh> ;
 161         }
 162
 163         open( my $fh, $file ) or die "sudden flaming death\n" ;
 164         my $text = do { local( $/ ) ; <$fh> } ;
 165
 166 And this is a variant of that idiom that removes the need for the open
 167 call:
 168
 169         my $text = do { local( @ARGV, $/ ) = $file ; <> } ;
 170
 171 The filename in C<$file> is assigned to a localized C<@ARGV> and the
 172 null filehandle is used which reads the data from the files in C<@ARGV>.
 173
 174 Instead of assigning to a scalar, all the above slurps can assign to an
 175 array and it will get the file but split into lines (using C<$/> as the
 176 end of line marker).
 177
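To make the scalar and list variants easy to compare, here is a small
sketch that slurps the same file both ways. The scratch file from
C<File::Temp> and its three lines are only there to make the example
self-contained:

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# make a small scratch file to slurp
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh "one\ntwo\nthree\n" ;
close $tmp_fh or die "close $file: $!" ;

# scalar slurp: $/ is undef inside the do block only
my $text = do {
        open( my $fh, '<', $file ) or die "open $file: $!" ;
        local $/ ;
        <$fh> ;
} ;

# list slurp: $/ keeps its normal value so we get lines
my @lines = do {
        open( my $fh, '<', $file ) or die "open $file: $!" ;
        <$fh> ;
} ;

print length( $text ), " bytes, ", scalar( @lines ), " lines\n" ;
```
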
 178 There is one common variant of those slurps which is very slow and not
 179 good code. You see it around, and it is almost always cargo cult code:
 180
 181         my $text = join( '', <FH> ) ;
 182
 183 That needlessly splits the input file into lines (C<join> provides a
 184 list context to C<< <FH> >>) and then joins up those lines again. The
 185 original coder of this idiom obviously never read I<perlvar> and learned
 186 how to use C<$/> to allow scalar slurping.
 187
 188 =head2 Write Slurping
 189
 190 While reading in entire files at one time is common, writing out entire
 191 files is also done. We call it ``slurping'' when we read in files, but
 192 there is no commonly accepted term for the write operation. I asked some
 193 Perl colleagues and got two interesting nominations. Peter Scott said to
 194 call it ``burping'' (rhymes with ``slurping'' and suggests movement in
 195 the opposite direction). Others suggested ``spewing'' which has a
 196 stronger visual image :-) Tell me your favorite or suggest your own. I
 197 will use both in this section so you can see how they work for you.
 198
 199 Spewing a file is a much simpler operation than slurping. You don't have
 200 context issues to worry about and there is no efficiency problem with
 201 returning a buffer. Here is a simple burp subroutine:
 202
 203         sub burp {
 204                 my( $file_name ) = shift ;
 205                 open( my $fh, ">$file_name" ) ||
 206                                  die "can't create $file_name $!" ;
 207                 print $fh @_ ;
 208         }
 209
 210 Note that it doesn't copy the input text but passes @_ directly to
 211 print. We will look at faster variations of that later on.
 212
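A slightly more defensive variation is sketched here. It is not the
code of any module, just the same idea with a three-argument C<open>
and checks on C<print> and C<close>; the scratch directory is only for
the demonstration:

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

# Three-argument open keeps odd characters in the file name from being
# taken as a mode, and close is checked because buffered write errors
# may only show up there.
sub burp {
        my $file_name = shift ;

        open( my $fh, '>', $file_name )
                or die "can't create $file_name: $!" ;
        print $fh @_
                or die "can't write $file_name: $!" ;
        close $fh
                or die "can't close $file_name: $!" ;
}

my $dir = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/burp.txt" ;

burp( $file, "line one\n", "line two\n" ) ;
```
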
 213 =head2 Slurp on the CPAN
 214
 215 As you would expect there are modules in the CPAN that will slurp files
 216 for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
 217 CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).
 218
 219 Here is the code from Slurp.pm:
 220
 221     sub slurp {
 222         local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
 223         return <ARGV>;
 224     }
 225
 226     sub to_array {
 227         my @array = slurp( @_ );
 228         return wantarray ? @array : \@array;
 229     }
 230
 231     sub to_scalar {
 232         my $scalar = slurp( @_ );
 233         return $scalar;
 234     }
 235
 236 The subroutine C<slurp()> uses the magic undefined value of C<$/> and
 237 the magic file handle C<ARGV> to support slurping into a scalar or
 238 array. It also provides two wrapper subs that allow the caller to
 239 control the context of the slurp. And the C<to_array()> subroutine will
 240 return the list of slurped lines or an anonymous array of them according
 241 to its caller's context by checking C<wantarray>. It has 'slurp' in
 242 C<@EXPORT> and all three subroutines in C<@EXPORT_OK>.
 243
 244 <Footnote: Slurp.pm is poorly named and it shouldn't be in the top level
 245 namespace.>
 246
 247 The original File::Slurp.pm has this code:
 248
 249     sub read_file
 250     {
 251             my ($file) = @_;
 252
 253             local($/) = wantarray ? $/ : undef;
 254             local(*F);
 255             my $r;
 256             my (@r);
 257
 258             open(F, "<$file") || croak "open $file: $!";
 259             @r = <F>;
 260             close(F) || croak "close $file: $!";
 261
 262             return $r[0] unless wantarray;
 263             return @r;
 264     }
 265
 266 This module provides several subroutines including C<read_file()> (more
 267 on the others later). C<read_file()> behaves similarly to
 268 C<Slurp::slurp()> in that it will slurp a list of lines or a single
 269 scalar depending on the caller's context. It also uses the magic
 270 undefined value of C<$/> for scalar slurping but it uses an explicit
 271 open call rather than using a localized C<@ARGV> as the other module
 272 did. Also it doesn't provide a way to get an anonymous array of the
 273 lines but that can easily be rectified by calling it inside an anonymous
 274 array constructor C<[]>.
 275
 276 Both of these modules make it easier for Perl coders to slurp in
 277 files. They both use the magic C<$/> to slurp in scalar mode and the
 278 natural behavior of C<< <> >> in list context to slurp as lines. But
 279 neither is optimized for speed nor can they handle C<binmode()> to
 280 support binary or unicode files. See below for more on slurp features
 281 and speedups.
 282
 283 =head2 Slurping API Design
 284
 285 The slurp modules on CPAN have a very simple API and don't support
 286 C<binmode()>. This section will cover various API design issues such as
 287 efficient return by reference, C<binmode()> and calling variations.
 288
 289 Let's start with the call variations. Slurped files can be returned in
 290 four formats: as a single scalar, as a reference to a scalar, as a list
 291 of lines or as an anonymous array of lines. But the caller can only
 292 provide two contexts: scalar or list. So we have to either provide an
 293 API with more than one subroutine (as Slurp.pm did) or just provide one
 294 subroutine which only returns a scalar or a list (not an anonymous
 295 array) as File::Slurp does.
 296
 297 I have used my own C<read_file()> subroutine for years and it has the
 298 same API as File::Slurp: a single subroutine that returns a scalar or a
 299 list of lines depending on context. But I recognize the interest of
 300 those that want an anonymous array for line slurping. For one thing, it
 301 is easier to pass around to other subs and for another, it eliminates
 302 the extra copying of the lines via C<return>. So my module provides only
 303 one slurp subroutine that returns the file data based on context and any
 304 format options passed in. There is no need for a specific
 305 slurp-in-as-a-scalar or list subroutine as the general C<read_file()>
 306 sub will do that by default in the appropriate context. If you want
 307 C<read_file()> to return a scalar reference or anonymous array of lines,
 308 you can request those formats with options. You can even pass in a
 309 reference to a scalar (e.g. a previously allocated buffer) and have that
 310 filled with the slurped data (and that is one of the fastest slurp
 311 modes; see the benchmark section for more on that).  If you want to
 312 slurp a scalar into an array, just select the desired array element and
 313 that will provide scalar context to the C<read_file()> subroutine.
 314
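The calling variations described above can be sketched with a small
stand-in sub. This is not the module's code: the sub name
C<read_file_sketch> and the option names C<scalar_ref>, C<array_ref>
and C<buf_ref> are assumptions chosen for the illustration.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Hypothetical reader; option names are assumptions for this sketch.
sub read_file_sketch {
        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{buf_ref} || \$buf ;

        open( my $fh, '<', $file_name ) or die "open $file_name: $!" ;
        ${$buf_ref} = do { local $/ ; <$fh> } ;

        # scalar formats: a reference on request, else the text itself
        return $buf_ref if $args{scalar_ref} ;
        return ${$buf_ref} unless wantarray || $args{array_ref} ;

        # line formats: a list, or an anonymous array on request
        my @lines = ${$buf_ref} =~ /(.*\n|.+)/g ;
        return $args{array_ref} ? \@lines : @lines ;
}

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh "a\nb\n" ;
close $tmp_fh or die "close $file: $!" ;

my $text      = read_file_sketch( $file ) ;
my @lines     = read_file_sketch( $file ) ;
my $lines_ref = read_file_sketch( $file, array_ref => 1 ) ;
my $text_ref  = read_file_sketch( $file, scalar_ref => 1 ) ;

# fill a buffer supplied by the caller
my $buffer ;
read_file_sketch( $file, buf_ref => \$buffer ) ;

# an array element provides scalar context
my @files ;
$files[0] = read_file_sketch( $file ) ;
```
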
 315 The next area to cover is what to name the slurp sub. I will go with
 316 C<read_file()>. It is descriptive, keeps compatibility with the
 317 current simple API and doesn't use the 'slurp' nickname (though that
 318 nickname is in the module name). Also I decided to keep the File::Slurp
 319 namespace which was graciously handed over to me by its current owner,
 320 David Muir.
 321
 322 Another critical area when designing APIs is how to pass in
 323 arguments. The C<read_file()> subroutine takes one required argument
 324 which is the file name. To support C<binmode()> we need another optional
 325 argument. A third optional argument is needed to support returning a
 326 slurped scalar by reference. My first thought was to design the API with
 327 3 positional arguments - file name, buffer reference and binmode. But if
 328 you want to set the binmode and not pass in a buffer reference, you have
 329 to fill the second argument with C<undef> and that is ugly. So I decided
 330 to make the filename argument positional and the other two named. The
 331 subroutine starts off like this:
 332
 333         sub read_file {
 334
 335                 my( $file_name, %args ) = @_ ;
 336
 337                 my $buf ;
 338                 my $buf_ref = $args{'buf_ref'} || \$buf ;
 339
 340 The other sub (C<read_file_lines()>) will only take an optional binmode
 341 (so you can read files with binary delimiters). It doesn't need a buffer
 342 reference argument since it can return an anonymous array if called
 343 in a scalar context. So this subroutine could use positional arguments,
 344 but to keep its API similar to the API of C<read_file()>, it will also
 345 use pass by name for the optional arguments. This also means that new
 346 optional arguments can be added later without breaking any legacy
 347 code. A bonus of keeping the API the same for both subs will be seen
 348 in how the two subs are optimized to work together.
 349
 350 Write slurping (or spewing or burping :-)) needs to have its API
 351 designed as well. The biggest issue is not only needing to support
 352 optional arguments but a list of arguments to be written is needed. Perl
 353 6 will be able to handle that with optional named arguments and a final
 354 slurp argument. Since this is Perl 5 we have to do it using some
 355 cleverness. The first argument is the file name and it will be
 356 positional as with the C<read_file> subroutine. But how can we pass in
 357 the optional arguments and also a list of data? The solution lies in the
 358 fact that the data list should never contain a reference.
 359 Burping/spewing works only on plain data. So if the next argument is a
 360 hash reference, we can assume it contains the optional arguments and
 361 the rest of the arguments is the data list. So the C<write_file()>
 362 subroutine will start off like this:
 363
 364         sub write_file {
 365
 366                 my $file_name = shift ;
 367
 368                 my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
 369
 370 Whether or not optional arguments are passed in, we leave the data list
 371 in C<@_> to minimize any more copying. You call C<write_file()> like this:
 372
 373         write_file( 'foo', { binmode => ':raw' }, @data ) ;
 374         write_file( 'junk', { append => 1 }, @more_junk ) ;
 375         write_file( 'bar', @spew ) ;
 376
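To see the argument handling in isolation, here is a sketch of a
hypothetical helper (C<parse_write_args> is not part of any module)
that only does the parsing described above and reports what it found:

```perl
use strict ;
use warnings ;

# Hypothetical helper: same parsing as the start of write_file()
sub parse_write_args {
        my $file_name = shift ;

        # a leading hash reference holds the options, otherwise none
        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

        # whatever is left in @_ is the data list
        return( $file_name, $args, scalar @_ ) ;
}

my( $name1, $opts1, $count1 ) =
        parse_write_args( 'foo', { binmode => ':raw' }, 'a', 'b' ) ;

my( $name2, $opts2, $count2 ) =
        parse_write_args( 'bar', 'a', 'b', 'c' ) ;

print "$name1: $count1 data items, $name2: $count2 data items\n" ;
```
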
 377 =head2 Fast Slurping
 378
 379 Somewhere along the line, I learned about a way to slurp files faster
 380 than by setting C<$/> to undef. The method is very simple: you do a
 381 single read call with the size of the file (which the C<-s> operator
 382 provides). This bypasses the I/O loop inside perl that checks for EOF
 383 and does all sorts of processing. I then decided to experiment and found
 384 that C<sysread> is even faster, as you would expect. C<sysread> bypasses
 385 all of Perl's stdio and reads the file from the kernel buffers directly
 386 into a Perl scalar. This is why the slurp code in File::Slurp uses
 387 C<sysopen/sysread/syswrite>. All the rest of the code is just to support
 388 the various options and data passing techniques.
 389
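Stripped of all options, the technique can be sketched like this. The
sub name C<sysread_slurp> and the scratch file are assumptions for the
example, and the loop is only there in case a read returns fewer bytes
than requested:

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Minimal sketch: read the whole file with sysread sized by -s
sub sysread_slurp {
        my( $file_name ) = @_ ;

        open( my $fh, '<', $file_name ) or die "open $file_name: $!" ;
        binmode $fh ;

        my $size_left = -s $fh ;
        my $buf = '' ;

        while( $size_left > 0 ) {

                # append to the end of what has been read so far
                my $read_cnt = sysread( $fh, $buf, $size_left, length $buf ) ;

                die "read $file_name: $!" unless defined $read_cnt ;
                last if $read_cnt == 0 ;

                $size_left -= $read_cnt ;
        }

        return $buf ;
}

my $data = 'abc' x 1000 ;

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh $data ;
close $tmp_fh or die "close $file: $!" ;

my $text = sysread_slurp( $file ) ;
print length( $text ), " bytes slurped\n" ;
```
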
 390
 391 =head2 Benchmarks
 392
 393 Benchmarks can be enlightening, informative, frustrating and
 394 deceiving. It would make no sense to create a new and more complex slurp
 395 module unless it also gained significantly in speed. So I created a
 396 benchmark script which compares various slurp methods with differing
 397 file sizes and calling contexts. This script can be run from the main
 398 directory of the tarball like this:
 399
 400         perl -Ilib extras/slurp_bench.pl
 401
 402 If you pass in an argument on the command line, it will be passed to
 403 C<timethese()> and it will control the duration. It defaults to -2 which
 404 makes each benchmark run for at least 2 seconds of cpu time.
 405
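For readers who have not used the C<Benchmark> module, a much smaller
comparison can be sketched as follows. This is not the script from the
tarball: the two entries and their names are made up, and a fixed count
of 200 runs is used where the real script defaults to -2.

```perl
use strict ;
use warnings ;
use Benchmark qw( timethese timestr ) ;
use File::Temp qw( tempfile ) ;

# scratch file with the same shape as the short benchmark file
my $data = join '', ( 'abc' x 30 . "\n" ) x 100 ;

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh $data ;
close $tmp_fh or die "close $file: $!" ;

# a positive count is a number of runs; a negative one such as -2
# would mean at least that many cpu seconds per entry
my $results = timethese( 200, {

        undef_rs => sub {
                open( my $fh, '<', $file ) or die "open $file: $!" ;
                my $text = do { local $/ ; <$fh> } ;
        },

        join_list => sub {
                open( my $fh, '<', $file ) or die "open $file: $!" ;
                my $text = join( '', <$fh> ) ;
        },

}, 'none' ) ;

print "$_: ", timestr( $results->{$_} ), "\n" for sort keys %{$results} ;
```

With so few runs the timings are too noisy to mean much; the point is
only to show how entries are handed to C<timethese()>.
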
 406 The following numbers are from a run I did on my 300Mhz sparc. You will
 407 most likely get much faster counts on your boxes but the relative speeds
 408 shouldn't change by much. If you see major differences on your
 409 benchmarks, please send me the results and your Perl and OS
 410 versions. Also you can play with the benchmark script and add more slurp
 411 variations or data files.
 412
 413 The rest of this section will be discussing the results of the
 414 benchmarks. You can refer to extras/slurp_bench.pl to see the code for
 415 the individual benchmarks. If the benchmark name starts with cpan_, it
 416 is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
 417 from the new File::Slurp.pm. Those that start with file_contents_ are
 418 from a client's code base. The rest are variations I created to
 419 highlight certain aspects of the benchmarks.
 420
 421 The short and long file data is made like this:
 422
 423         my @lines = ( 'abc' x 30 . "\n")  x 100 ;
 424         my $text = join( '', @lines ) ;
 425
 426         @lines = ( 'abc' x 40 . "\n")  x 1000 ;
 427         $text = join( '', @lines ) ;
 428
 429 So the short file is 9,100 bytes and the long file is 121,000
 430 bytes.
 431
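Those sizes follow from the line lengths: 3 * 30 + 1 = 91 bytes for
each short line and 3 * 40 + 1 = 121 bytes for each long one. A quick
check (the variable names are chosen for this sketch):

```perl
use strict ;
use warnings ;

my $short_text = join( '', ( 'abc' x 30 . "\n" ) x 100 ) ;
my $long_text  = join( '', ( 'abc' x 40 . "\n" ) x 1000 ) ;

print length( $short_text ), "\n" ;     # 91 * 100
print length( $long_text ), "\n" ;      # 121 * 1000
```
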
 432 =head3  Scalar Slurp of Short File
 433
 434         file_contents        651/s
 435         file_contents_no_OO  828/s
 436         cpan_read_file      1866/s
 437         cpan_slurp          1934/s
 438         read_file           2079/s
 439         new                 2270/s
 440         new_buf_ref         2403/s
 441         new_scalar_ref      2415/s
 442         sysread_file        2572/s
 443
 444 =head3  Scalar Slurp of Long File
 445
 446         file_contents_no_OO 82.9/s
 447         file_contents       85.4/s
 448         cpan_read_file       250/s
 449         cpan_slurp           257/s
 450         read_file            323/s
 451         new                  468/s
 452         sysread_file         489/s
 453         new_scalar_ref       766/s
 454         new_buf_ref          767/s
 455
 456 The primary inference you get from looking at the numbers above is that
 457 when slurping a file into a scalar, the longer the file, the more time
 458 you save by returning the result via a scalar reference. The time for
 459 the extra buffer copy can add up. The new module came out on top overall
 460 except for the very simple sysread_file entry, which was added to
 461 show that the overhead of the more flexible new module is small.
 462 The file_contents entries are always the worst since they do a
 463 list slurp and then a join, a classic newbie and cargo culted
 464 style which is extremely slow. Also the OO code in file_contents slows
 465 it down even more (I added the file_contents_no_OO entry to show this).
 466 The two CPAN modules are decent with small files but they are laggards
 467 compared to the new module when the file gets much larger.
 468
 469 =head3  List Slurp of Short File
 470
 471         cpan_read_file          589/s
 472         cpan_slurp_to_array     620/s
 473         read_file               824/s
 474         new_array_ref           824/s
 475         sysread_file            828/s
 476         new                     829/s
 477         new_in_anon_array       833/s
 478         cpan_slurp_to_array_ref 836/s
 479
 480 =head3  List Slurp of Long File
 481
 482         cpan_read_file          62.4/s
 483         cpan_slurp_to_array     62.7/s
 484         read_file               92.9/s
 485         sysread_file            94.8/s
 486         new_array_ref           95.5/s
 487         new                     96.2/s
 488         cpan_slurp_to_array_ref 96.3/s
 489         new_in_anon_array       97.2/s
 490
 491 This is perhaps the most interesting result of this benchmark. Five
 492 different entries have effectively tied for the lead. The logical
 493 conclusion is that splitting the input into lines is the bounding
 494 operation, no matter how the file gets slurped. This is the only
 495 benchmark where the new module isn't the clear winner (in the long file
 496 entries - it is no worse than a close second in the short file
 497 entries).
 498
 499
 500 Note: In the benchmark information for all the spew entries, the extra
 501 number at the end of each line is how many wallclock seconds the whole
 502 entry took. The benchmarks were run for at least 2 CPU seconds per
 503 entry. The unusually large wallclock times will be discussed below.
 504
 505 =head3  Scalar Spew of Short File
 506
 507         cpan_write_file 1035/s  38
 508         print_file      1055/s  41
 509         syswrite_file   1135/s  44
 510         new             1519/s  2
 511         print_join_file 1766/s  2
 512         new_ref         1900/s  2
 513         syswrite_file2  2138/s  2
 514
 515 =head3  Scalar Spew of Long File
 516
 517         cpan_write_file 164/s   20
 518         print_file      211/s   26
 519         syswrite_file   236/s   25
 520         print_join_file 277/s   2
 521         new             295/s   2
 522         syswrite_file2  428/s   2
 523         new_ref         608/s   2
 524
 525 In the scalar spew entries, the new module API wins when it is passed a
 526 reference to the scalar buffer. The C<syswrite_file2> entry beats it
 527 with the shorter file due to its simpler code. The old CPAN module is
 528 the slowest due to its extra copying of the data and its use of print.
 529
 530 =head3 List Spew of Short File
 531
 532         cpan_write_file  794/s  29
 533         syswrite_file   1000/s  38
 534         print_file      1013/s  42
 535         new             1399/s  2
 536         print_join_file 1557/s  2
 537
 538 =head3  List Spew of Long File
 539
 540         cpan_write_file 112/s   12
 541         print_file      179/s   21
 542         syswrite_file   181/s   19
 543         print_join_file 205/s   2
 544         new             228/s   2
 545
 546 Again, the simple C<print_join_file> entry beats the new module when
 547 spewing a short list of lines to a file. But it loses to the new module
 548 when the file size gets longer. The old CPAN module lags behind the
 549 others since it first makes an extra copy of the lines and then it calls
 550 C<print> on the output list and that is much slower than passing to
 551 C<print> a single scalar generated by join. The C<print_file> entry
 552 shows the advantage of directly printing C<@_> and the
 553 C<print_join_file> adds the join optimization.
 554
 555 Now about those long wallclock times. If you look carefully at the
 556 benchmark code of all the spew entries, you will find that some always
 557 write to new files and some overwrite existing files. When I asked David
 558 Muir why the old File::Slurp module had an C<overwrite> subroutine, he
 559 answered that by overwriting a file, you always guarantee something
 560 readable is in the file. If you create a new file, there is a moment
 561 when the new file is created but has no data in it. I feel this is not a
 562 good enough answer. Even when overwriting, you can write a shorter file
 563 than the existing file and then you have to truncate the file to the new
 564 size. There is a small race window there where another process can slurp
 565 in the file with the new data followed by leftover junk from the
 566 previous version of the file. This reinforces the point that the only
 567 way to ensure consistent file data is the proper use of file locks.
 568
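As a sketch of what proper locking can look like (the sub names
C<locked_write> and C<locked_read> are invented, and C<flock> is
advisory, so it only helps when every reader and writer uses it):

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT :flock ) ;
use File::Temp qw( tempdir ) ;

# writer: take an exclusive lock before changing anything
sub locked_write {
        my( $file_name, $data ) = @_ ;

        sysopen( my $fh, $file_name, O_WRONLY | O_CREAT )
                or die "open $file_name: $!" ;
        flock( $fh, LOCK_EX ) or die "lock $file_name: $!" ;

        truncate( $fh, 0 ) or die "truncate $file_name: $!" ;
        print $fh $data or die "write $file_name: $!" ;

        # close flushes the data and releases the lock
        close $fh or die "close $file_name: $!" ;
}

# reader: a shared lock waits until no writer holds the file
sub locked_read {
        my( $file_name ) = @_ ;

        open( my $fh, '<', $file_name ) or die "open $file_name: $!" ;
        flock( $fh, LOCK_SH ) or die "lock $file_name: $!" ;

        local $/ ;
        return scalar <$fh> ;
}

my $dir = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/data.txt" ;

locked_write( $file, "a long first version\n" ) ;
locked_write( $file, "short\n" ) ;

my $text = locked_read( $file ) ;
print $text ;
```
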
 569 But what about those long times? Well it is all about the difference
 570 between creating files and overwriting existing ones. The former have to
 571 allocate new inodes (or the equivalent on other file systems) and the
 572 latter can reuse the existing inode. This means the overwrite will save on
 573 disk seeks as well as on cpu time. In fact when running this benchmark,
 574 I could hear my disk going crazy allocating inodes during the spew
 575 operations. This speedup in both cpu and wallclock is why the new module
 576 always does overwriting when spewing files. It also does the proper
 577 truncate (and this is checked in the tests by spewing shorter files
 578 after longer ones had previously been written). The C<overwrite>
 579 subroutine is just a typeglob alias to C<write_file> and is there for
 580 backwards compatibility with the old File::Slurp module.
 581
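The overwrite-then-truncate idea can be sketched as below. It is an
illustration under assumptions, not the module's code: the sub name
C<overwrite_file> is invented and errors simply C<die>.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT ) ;
use File::Temp qw( tempdir ) ;

sub overwrite_file {
        my( $file_name, $buf ) = @_ ;

        # no O_TRUNC: an existing file keeps its inode and its data
        # until we write over it
        sysopen( my $fh, $file_name, O_WRONLY | O_CREAT )
                or die "open $file_name: $!" ;

        my $size_left = length $buf ;
        my $offset = 0 ;

        while( $size_left > 0 ) {

                my $write_cnt = syswrite( $fh, $buf, $size_left, $offset ) ;
                die "write $file_name: $!" unless defined $write_cnt ;

                $size_left -= $write_cnt ;
                $offset += $write_cnt ;
        }

        # cut off anything left over from a longer previous version
        truncate( $fh, $offset ) or die "truncate $file_name: $!" ;
        close $fh or die "close $file_name: $!" ;
}

my $dir = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/spew.txt" ;

overwrite_file( $file, 'x' x 100 ) ;
overwrite_file( $file, "short\n" ) ;

print -s $file, " bytes on disk\n" ;
```
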
 582 =head3 Benchmark Conclusion
 583
 584 Other than a few cases where a simpler entry beat it out, the new
 585 File::Slurp module is either the speed leader or among the leaders. Its
 586 special APIs for passing buffers by reference prove to be very useful
 587 speedups. Also it uses all the other optimizations including using
 588 C<sysread/syswrite> and joining output lines. I expect many projects
 589 that extensively use slurping will notice the speed improvements,
 590 especially if they rewrite their code to take advantage of the new API
 591 features. Even if they don't touch their code and use the simple API
 592 they will get a significant speedup.
 593
 594 =head2 Error Handling
 595
 596 Slurp subroutines are subject to conditions such as not being able to
 597 open the file, or I/O errors. How these errors are handled, and what the
 598 caller will see, are important aspects of the design of an API. The
 599 classic error handling for slurping has been to call C<die()> or even
 600 better, C<croak()>. But sometimes you want the slurp to either
 601 C<warn()>/C<carp()> or allow your code to handle the error. Sure, this
 602 can be done by wrapping the slurp in a C<eval> block to catch a fatal
 603 error, but not everyone wants all that extra code. So I have added
 604 another option to all the subroutines which selects the error
 605 handling. If the 'err_mode' option is 'croak' (which is also the
 606 default), the called subroutine will croak. If set to 'carp' then carp
 607 will be called. Set to any other string (use 'quiet' when you want to
 608 be explicit) and no error handler is called. Then the caller can use the
 609 error status from the call.
 610
 611 C<write_file()> doesn't use the return value for data so it can return a
 612 false status value in-band to mark an error. C<read_file()> does use its
 613 return value for data, but we can still make it pass back the error
 614 status. A successful read in any scalar mode will return either a
 615 defined data string or a reference to a scalar or array. So a bare
 616 return would work here. But if you slurp in lines by calling it in a
 617 list context, a bare C<return> will return an empty list, which is the
 618 same value it would get from an existing but empty file. So now,
 619 C<read_file()> will do something I normally strongly advocate against,
 620 i.e., returning an explicit C<undef> value. In the scalar context this
 621 still returns an error, and in list context, the returned first value
 622 will be C<undef>, and that is not legal data for the first element. So
 623 the list context also gets an error status it can detect:
 624
 625         my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
 626         your_handle_error( "$file_name can't be read\n" ) if
 627                                         @lines && ! defined $lines[0] ;
 628
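How the modes might be dispatched is sketched below. The helper
C<handle_error>, the reader C<read_lines_sketch> and the missing file
path are all hypothetical; only the C<err_mode> values come from the
text above.

```perl
use strict ;
use warnings ;
use Carp ;

# Hypothetical helper: act on the error according to err_mode and
# hand back an explicit undef for the caller to detect.
sub handle_error {
        my( $args, $msg ) = @_ ;

        my $mode = $args->{err_mode} || 'croak' ;

        croak $msg if $mode eq 'croak' ;
        carp $msg  if $mode eq 'carp' ;

        return undef ;
}

# Hypothetical line reader that uses it
sub read_lines_sketch {
        my( $file_name, %args ) = @_ ;

        open( my $fh, '<', $file_name )
                or return handle_error( \%args, "can't open $file_name: $!" ) ;

        return <$fh> ;
}

# assumed not to exist on the machine running the example
my $missing = '/no/such/dir/no_such_file' ;

# quiet mode: a single undef comes back, which no real line can be
my @lines = read_lines_sketch( $missing, err_mode => 'quiet' ) ;
my $failed = @lines == 1 && ! defined $lines[0] ;

print "could not read $missing\n" if $failed ;
```

An empty file returns an empty list, so it is not mistaken for a
failure by this check.
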
 629
 630 =head2 File::FastSlurp
 631
 632         sub read_file {
 633
 634                 my( $file_name, %args ) = @_ ;
 635
 636                 my $buf ;
 637                 my $buf_ref = $args{'buf_ref'} || \$buf ;
 638
 639                 my $mode = O_RDONLY ;
 640                 $mode |= O_BINARY if $args{'binmode'} ;
 641
 642                 local( *FH ) ;
 643                 sysopen( FH, $file_name, $mode ) or
 644                                         carp "Can't open $file_name: $!" ;
 645
 646                 my $size_left = -s FH ;
 647
 648                 while( $size_left > 0 ) {
 649
 650                         my $read_cnt = sysread( FH, ${$buf_ref},
 651                                         $size_left, length ${$buf_ref} ) ;
 652
 653                         unless( $read_cnt ) {
 654
 655                                 carp "read error in file $file_name: $!" ;
 656                                 last ;
 657                         }
 658
 659                         $size_left -= $read_cnt ;
 660                 }
 661
 662         # handle void context (return scalar by buffer reference)
 663
 664                 return unless defined wantarray ;
 665
 666         # handle list context
 667
 668                 return split m|(?<=\Q$/\E)|, ${$buf_ref} if wantarray ;
 669
 670         # handle scalar context
 671
 672                 return ${$buf_ref} ;
 673         }
 674
 675         sub write_file {
 676
 677                 my $file_name = shift ;
 678
 679                 my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
 680                 my $buf = join '', @_ ;
 681
 682
 683                 my $mode = O_WRONLY | O_CREAT ;
 684                 $mode |= O_BINARY if $args->{'binmode'} ;
 685                 $mode |= O_APPEND if $args->{'append'} ;
 686
 687                 local( *FH ) ;
 688                 sysopen( FH, $file_name, $mode ) or
 689                                         carp "Can't open $file_name: $!" ;
 690
 691                 my $size_left = length( $buf ) ;
 692                 my $offset = 0 ;
 693
 694                 while( $size_left > 0 ) {
 695
 696                         my $write_cnt = syswrite( FH, $buf,
 697                                         $size_left, $offset ) ;
 698
 699                         unless( $write_cnt ) {
 700
 701                                 carp "write error in file $file_name: $!" ;
 702                                 last ;
 703                         }
 704
 705                         $size_left -= $write_cnt ;
 706                         $offset += $write_cnt ;
 707                 }
 708
 709                 return ;
 710         }
 711
 712 =head2 Slurping in Perl 6
 713
 714 As usual with Perl 6, much of the work in this article will be put to
 715 pasture. Perl 6 will allow you to set a 'slurp' property on file handles
 716 and when you read from such a handle, the file is slurped. List and
 717 scalar context will still be supported so you can slurp into lines or a
 718 scalar. I would expect that support for slurping in Perl 6 will be
 719 optimized and bypass the stdio subsystem since it can use the slurp
 720 property to trigger a call to special code. Otherwise some enterprising
 721 individual will just create a File::FastSlurp module for Perl 6. The
 722 code in the Perl 5 module could easily be modified to Perl 6 syntax and
 723 semantics. Any volunteers?
 724
 725 =head2 In Summary
 726
 727 We have compared classic line by line processing with munging a whole
 728 file in memory. Slurping files can speed up your programs and simplify
 729 your code if done properly. You must still take care not to slurp
 730 humongous files (logs, DNA sequences, etc.) or STDIN where you don't
 731 know how much data you will read in. But slurping megabyte sized files
 732 is not a major issue on today's systems with the typical amount of RAM
 733 installed. When Perl was first being used in depth (Perl 4), slurping
 734 was limited by the smaller RAM size of 10 years ago. Now, you should be
 735 able to slurp almost any reasonably sized file, whether it contains
 736 configuration, source code, data, etc.
 737
 738 =head2 Acknowledgements
 739
 740
 741
 742
 743