slurp_talk/slurp_article.html

   1 <HTML>
   2 <HEAD>
   3 <TITLE>Perl Slurp Ease</TITLE>
   4 <LINK REV="made" HREF="mailto:steve@dewitt.vnet.net">
   5 </HEAD>
   6
   7 <BODY>
   8
   9 <A NAME="__index__"></A>
  10 <!-- INDEX BEGIN -->
  11
  12 <UL>
  13
  14         <LI><A HREF="#perl slurp ease">Perl Slurp Ease</A></LI>
  15         <UL>
  16
  17                 <LI><A HREF="#introduction">Introduction</A></LI>
  18                 <LI><A HREF="#global operations">Global Operations</A></LI>
  19                 <LI><A HREF="#traditional slurping">Traditional Slurping</A></LI>
  20                 <LI><A HREF="#write slurping">Write Slurping</A></LI>
  21                 <LI><A HREF="#slurp on the cpan">Slurp on the CPAN</A></LI>
  22                 <LI><A HREF="#slurping api design">Slurping API Design</A></LI>
  23                 <LI><A HREF="#fast slurping">Fast Slurping</A></LI>
  24                 <UL>
  25
  26                         <LI><A HREF="#scalar slurp of short file">Scalar Slurp of Short File</A></LI>
  27                         <LI><A HREF="#scalar slurp of long file">Scalar Slurp of Long File</A></LI>
  28                         <LI><A HREF="#list slurp of short file">List Slurp of Short File</A></LI>
  29                         <LI><A HREF="#list slurp of long file">List Slurp of Long File</A></LI>
  30                         <LI><A HREF="#scalar spew of short file">Scalar Spew of Short File</A></LI>
  31                         <LI><A HREF="#scalar spew of long file">Scalar Spew of Long File</A></LI>
  32                         <LI><A HREF="#list spew of short file">List Spew of Short File</A></LI>
  33                         <LI><A HREF="#list spew of long file">List Spew of Long File</A></LI>
  34                         <LI><A HREF="#benchmark conclusion">Benchmark Conclusion</A></LI>
  35                 </UL>
  36
  37                 <LI><A HREF="#error handling">Error Handling</A></LI>
  38                 <LI><A HREF="#file::fastslurp">File::FastSlurp</A></LI>
  39                 <LI><A HREF="#slurping in perl 6">Slurping in Perl 6</A></LI>
  40                 <LI><A HREF="#in summary">In Summary</A></LI>
  41                 <LI><A HREF="#acknowledgements">Acknowledgements</A></LI>
  42         </UL>
  43
  44 </UL>
  45 <!-- INDEX END -->
  46
  47 <HR>
  48 <P>
  49 <H1><A NAME="perl slurp ease">Perl Slurp Ease</A></H1>
  50 <P>
  51 <H2><A NAME="introduction">Introduction</A></H2>
  52 <P>One of the common Perl idioms is processing text files line by line:</P>
  53 <PRE>
  54         while( &lt;FH&gt; ) {
  55                 do something with $_
  56         }</PRE>
<P>This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and being
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:</P>
  63 <PRE>
  64         while( &lt;FH&gt; ) {
  65                 push @lines, $_ ;
  66         }</PRE>
  67 <PRE>
  68         foreach ( @lines ) {
  69                 do something with $_
  70         }</PRE>
  71 <P>Line by line processing is fine, but it isn't the only way to deal with
  72 reading files. The other common style is reading the entire file into a
  73 scalar or array, and that is commonly known as slurping. Now, slurping has
  74 somewhat of a poor reputation, and this article is an attempt at
  75 rehabilitating it. Slurping files has advantages and limitations, and is
  76 not something you should just do when line by line processing is fine.
  77 It is best when you need the entire file in memory for processing all at
  78 once. Slurping with in memory processing can be faster and lead to
  79 simpler code than line by line if done properly.</P>
  80 <P>The biggest issue to watch for with slurping is file size. Slurping very
  81 large files or unknown amounts of data from STDIN can be disastrous to
  82 your memory usage and cause swap disk thrashing.  You can slurp STDIN if
  83 you know that you can handle the maximum size input without
  84 detrimentally affecting your memory usage. So I advocate slurping only
  85 disk files and only when you know their size is reasonable and you have
  86 a real reason to process the file as a whole.  Note that reasonable size
  87 these days is larger than the bad old days of limited RAM. Slurping in a
  88 megabyte is not an issue on most systems. But most of the
  89 files I tend to slurp in are much smaller than that. Typical files that
  90 work well with slurping are configuration files, (mini-)language scripts,
  91 some data (especially binary) files, and other files of known sizes
  92 which need fast processing.</P>
  93 <P>Another major win for slurping over line by line is speed. Perl's IO
  94 system (like many others) is slow. Calling <CODE>&lt;&gt;</CODE> for each line
  95 requires a check for the end of line, checks for EOF, copying a line,
  96 munging the internal handle structure, etc. Plenty of work for each line
  97 read in. Whereas slurping, if done correctly, will usually involve only
  98 one I/O call and no extra data copying. The same is true for writing
  99 files to disk, and we will cover that as well (even though the term
 100 slurping is traditionally a read operation, I use the term ``slurp'' for
 101 the concept of doing I/O with an entire file in one operation).</P>
 102 <P>Finally, when you have slurped the entire file into memory, you can do
 103 operations on the data that are not possible or easily done with line by
 104 line processing. These include global search/replace (without regard for
 105 newlines), grabbing all matches with one call of <CODE>//g</CODE>, complex parsing
 106 (which in many cases must ignore newlines), processing *ML (where line
 107 endings are just white space) and performing complex transformations
 108 such as template expansion.</P>
 109 <P>
 110 <H2><A NAME="global operations">Global Operations</A></H2>
 111 <P>Here are some simple global operations that can be done quickly and
 112 easily on an entire file that has been slurped in. They could also be
 113 done with line by line processing but that would be slower and require
 114 more code.</P>
 115 <P>A common problem is reading in a file with key/value pairs. There are
 116 modules which do this but who needs them for simple formats? Just slurp
 117 in the file and do a single parse to grab all the key/value pairs.</P>
<PRE>
        my $text = read_file( $file ) ;
        my %config = $text =~ /^(\w+)=(.+)$/mg ;</PRE>
<P>That matches a key which starts a line (anywhere inside the string
because of the <CODE>/m</CODE> modifier), the '=' char and the text to the end of the
line (again, <CODE>/m</CODE> makes that work). In fact the ending <CODE>$</CODE> is not even needed
since <CODE>.</CODE> will not normally match a newline. Since the key and value are
grabbed and the <CODE>m//</CODE> is in list context with the <CODE>/g</CODE> modifier, it will
grab all key/value pairs and return them. The <CODE>%config</CODE> hash will be
assigned this list and now you have the file fully parsed into a hash.</P>
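<P>As a quick, self-contained check of that one-line parse, the sketch
below uses a literal string in place of a slurped file. The sample keys and
values are made up for illustration:</P>
<PRE>
        use strict ;
        use warnings ;

        # Stand-in for the slurped file contents (made-up sample data).
        my $text = &quot;name=slurp\nmode=fast\n# a comment\n\nretries=3\n&quot; ;

        # A list context match with /g returns every ( key, value ) pair at once.
        # The comment line and the blank line simply fail to match.
        my %config = $text =~ /^(\w+)=(.+)$/mg ;

        print &quot;$_ =&gt; $config{$_}\n&quot; for sort keys %config ;</PRE>
<P>Lines that do not look like <CODE>key=value</CODE> are silently skipped,
which is convenient for simple formats but also means malformed lines go
unreported.</P>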
 128 <P>Various projects I have worked on needed some simple templating and I
 129 wasn't in the mood to use a full module (please, no flames about your
 130 favorite template module :-). So I rolled my own by slurping in the
 131 template file, setting up a template hash and doing this one line:</P>
 132 <PRE>
 133         $text =~ s/&lt;%(.+?)%&gt;/$template{$1}/g ;</PRE>
 134 <P>That only works if the entire file was slurped in. With a little
 135 extra work it can handle chunks of text to be expanded:</P>
 136 <PRE>
 137         $text =~ s/&lt;%(\w+)_START%&gt;(.+?)&lt;%\1_END%&gt;/ template($1, $2)/sge ;</PRE>
<P>Just supply a <CODE>template</CODE> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the <CODE>/s</CODE> modifier. This is
something that is much trickier with line by line processing.</P>
 142 <P>Note that this is a very simple templating system, and it can't directly
 143 handle nested tags and other complex features. But even if you use one
 144 of the myriad of template modules on the CPAN, you will gain by having
 145 speedier ways to read and write files.</P>
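<P>To see the two substitutions working together, here is a small
runnable sketch. The template text, the <CODE>%template</CODE> values and the
body of the <CODE>template</CODE> sub (it just upper-cases the chunk) are
placeholders invented for this example:</P>
<PRE>
        use strict ;
        use warnings ;

        # Made-up template data; normally $text would come from read_file().
        my %template = ( name =&gt; 'World', lang =&gt; 'Perl' ) ;
        my $text = &quot;Hello &lt;%name%&gt;,\nwelcome to &lt;%lang%&gt;.\n&quot; .
                   &quot;&lt;%LIST_START%&gt;one\ntwo\n&lt;%LIST_END%&gt;\n&quot; ;

        # Placeholder chunk expander: upper-cases the text between the markers.
        sub template {
                my( $name, $chunk ) = @_ ;
                return uc $chunk ;
        }

        # Expand the multi-line chunks first so that the simple tag pattern
        # never sees the START/END markers, then expand the simple tags.
        $text =~ s/&lt;%(\w+)_START%&gt;(.+?)&lt;%\1_END%&gt;/template( $1, $2 )/sge ;
        $text =~ s/&lt;%(.+?)%&gt;/$template{$1}/g ;

        print $text ;</PRE>
<P>The order of the two substitutions matters in this sketch: run the
other way around, the simple pattern would also match the chunk markers
and look them up in <CODE>%template</CODE>.</P>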
 146 <P>Slurping in a file into an array also offers some useful advantages.
 147 One simple example is reading in a flat database where each record has
 148 fields separated by a character such as <CODE>:</CODE>:</P>
 149 <PRE>
 150         my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;</PRE>
 151 <P>Random access to any line of the slurped file is another advantage. Also
 152 a line index could be built to speed up searching the array of lines.</P>
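<P>A line index of that kind could be sketched as follows. The records
are made-up stand-ins for the result of a list slurp, and the
<CODE>%index</CODE> hash is just one possible way to build the lookup:</P>
<PRE>
        use strict ;
        use warnings ;

        # Stand-in for read_file( '/etc/passwd' ) in list context (made-up records).
        my @lines = (
                &quot;root:x:0:0:root:/root:/bin/sh\n&quot;,
                &quot;daemon:x:1:1:daemon:/usr/sbin:/bin/false\n&quot;,
                &quot;uri:x:1000:1000:Uri:/home/uri:/bin/bash\n&quot;,
        ) ;

        # One anonymous array of fields per record; chomp a copy of each
        # line so the last field does not carry the newline.
        my @pw_fields = map { chomp( my $line = $_ ) ; [ split /:/, $line ] } @lines ;

        # Index from user name to position in the slurped array.
        my %index ;
        $index{ $pw_fields[$_][0] } = $_ for 0 .. $#pw_fields ;

        # Random access by line number, or by name through the index.
        my $second_user = $pw_fields[1][0] ;
        my $shell = $pw_fields[ $index{'uri'} ][6] ;
        print &quot;second user: $second_user, shell for uri: $shell\n&quot; ;</PRE>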
 153 <P>
 154 <H2><A NAME="traditional slurping">Traditional Slurping</A></H2>
 155 <P>Perl has always supported slurping files with minimal code. Slurping of
 156 a file to a list of lines is trivial, just call the <CODE>&lt;&gt;</CODE> operator
 157 in a list context:</P>
 158 <PRE>
 159         my @lines = &lt;FH&gt; ;</PRE>
<P>and slurping to a scalar isn't much more work. Just set the built in
variable <CODE>$/</CODE> (the input record separator) to the undefined value and read
in the file with <CODE>&lt;&gt;</CODE>:</P>
<PRE>
        {
                local( $/, *FH ) ;
                open( FH, $file ) or die &quot;sudden flaming death\n&quot; ;
                $text = &lt;FH&gt; ;
        }</PRE>
 169 <P>Notice the use of <CODE>local()</CODE>. It sets <CODE>$/</CODE> to <CODE>undef</CODE> for you and when
 170 the scope exits it will revert <CODE>$/</CODE> back to its previous value (most
 171 likely ``\n'').</P>
<P>Here is a Perl idiom that allows the <CODE>$text</CODE> variable to be declared,
and there is no need for a tightly nested block. The <CODE>do</CODE> block will
execute <CODE>&lt;FH&gt;</CODE> in a scalar context and slurp in the file named by
<CODE>$file</CODE>:</P>
<PRE>
        local( *FH ) ;
        open( FH, $file ) or die &quot;sudden flaming death\n&quot; ;
        my $text = do { local( $/ ) ; &lt;FH&gt; } ;</PRE>
 180 <P>Both of those slurps used localized filehandles to be compatible with
 181 5.005. Here they are with 5.6.0 lexical autovivified handles:</P>
<PRE>
        {
                local( $/ ) ;
                open( my $fh, $file ) or die &quot;sudden flaming death\n&quot; ;
                $text = &lt;$fh&gt; ;
        }</PRE>
<PRE>
        open( my $fh, $file ) or die &quot;sudden flaming death\n&quot; ;
        my $text = do { local( $/ ) ; &lt;$fh&gt; } ;</PRE>
 191 <P>And this is a variant of that idiom that removes the need for the open
 192 call:</P>
 193 <PRE>
 194         my $text = do { local( @ARGV, $/ ) = $file ; &lt;&gt; } ;</PRE>
 195 <P>The filename in <CODE>$file</CODE> is assigned to a localized <CODE>@ARGV</CODE> and the
 196 null filehandle is used which reads the data from the files in <CODE>@ARGV</CODE>.</P>
 197 <P>Instead of assigning to a scalar, all the above slurps can assign to an
 198 array and it will get the file but split into lines (using <CODE>$/</CODE> as the
 199 end of line marker).</P>
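<P>The following sketch puts the scalar and list forms side by side on
a scratch file so it can be run as is. The use of
<CODE>File::Temp</CODE> and of the three argument form of <CODE>open</CODE>
are choices made for this example and are not taken from the snippets above:</P>
<PRE>
        use strict ;
        use warnings ;
        use File::Temp qw( tempfile ) ;

        # Make a small scratch file so the example is self-contained.
        my( $tmp_fh, $file ) = tempfile( UNLINK =&gt; 1 ) ;
        print $tmp_fh &quot;one\ntwo\nthree\n&quot; ;
        close( $tmp_fh ) or die &quot;close $file: $!&quot; ;

        # Scalar slurp: $/ is undef only inside the do block.
        open( my $fh, '&lt;', $file ) or die &quot;can't open $file: $!&quot; ;
        my $text = do { local( $/ ) ; &lt;$fh&gt; } ;
        close( $fh ) ;

        # List slurp: same handle style, but $/ keeps its normal value.
        open( $fh, '&lt;', $file ) or die &quot;can't open $file: $!&quot; ;
        my @lines = &lt;$fh&gt; ;
        close( $fh ) ;

        printf &quot;%d bytes, %d lines\n&quot;, length( $text ), scalar( @lines ) ;</PRE>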
 200 <P>There is one common variant of those slurps which is very slow and not
 201 good code. You see it around, and it is almost always cargo cult code:</P>
 202 <PRE>
 203         my $text = join( '', &lt;FH&gt; ) ;</PRE>
 204 <P>That needlessly splits the input file into lines (<CODE>join</CODE> provides a
 205 list context to <CODE>&lt;FH&gt;</CODE>) and then joins up those lines again. The
 206 original coder of this idiom obviously never read <EM>perlvar</EM> and learned
 207 how to use <CODE>$/</CODE> to allow scalar slurping.</P>
 208 <P>
 209 <H2><A NAME="write slurping">Write Slurping</A></H2>
 210 <P>While reading in entire files at one time is common, writing out entire
 211 files is also done. We call it ``slurping'' when we read in files, but
 212 there is no commonly accepted term for the write operation. I asked some
 213 Perl colleagues and got two interesting nominations. Peter Scott said to
 214 call it ``burping'' (rhymes with ``slurping'' and suggests movement in
 215 the opposite direction). Others suggested ``spewing'' which has a
 216 stronger visual image :-) Tell me your favorite or suggest your own. I
 217 will use both in this section so you can see how they work for you.</P>
 218 <P>Spewing a file is a much simpler operation than slurping. You don't have
 219 context issues to worry about and there is no efficiency problem with
 220 returning a buffer. Here is a simple burp subroutine:</P>
 221 <PRE>
 222         sub burp {
 223                 my( $file_name ) = shift ;
 224                 open( my $fh, &quot;&gt;$file_name&quot; ) ||
 225                                  die &quot;can't create $file_name $!&quot; ;
 226                 print $fh @_ ;
 227         }</PRE>
 228 <P>Note that it doesn't copy the input text but passes @_ directly to
 229 print. We will look at faster variations of that later on.</P>
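<P>A short usage sketch for <CODE>burp</CODE> follows. The sub is repeated
here (with a three argument <CODE>open</CODE>, which is a substitution made
for this example) so that the block runs on its own, and the scratch
directory and file name are invented:</P>
<PRE>
        use strict ;
        use warnings ;
        use File::Temp qw( tempdir ) ;

        sub burp {
                my( $file_name ) = shift ;
                open( my $fh, '&gt;', $file_name ) ||
                                 die &quot;can't create $file_name $!&quot; ;
                # @_ now holds only the data and is passed to print uncopied.
                print $fh @_ ;
                # $fh is closed (and flushed) when it goes out of scope here.
        }

        my $file = tempdir( CLEANUP =&gt; 1 ) . '/burp_demo.txt' ;
        burp( $file, &quot;first line\n&quot;, &quot;second line\n&quot; ) ;

        # Read it back with the @ARGV idiom to confirm what was written.
        my $text = do { local( @ARGV, $/ ) = $file ; &lt;&gt; } ;
        print $text ;</PRE>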
 230 <P>
 231 <H2><A NAME="slurp on the cpan">Slurp on the CPAN</A></H2>
 232 <P>As you would expect there are modules in the CPAN that will slurp files
 233 for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
 234 CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).</P>
 235 <P>Here is the code from Slurp.pm:</P>
 236 <PRE>
 237     sub slurp {
 238         local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
 239         return &lt;ARGV&gt;;
 240     }</PRE>
 241 <PRE>
 242     sub to_array {
 243         my @array = slurp( @_ );
 244         return wantarray ? @array : \@array;
 245     }</PRE>
 246 <PRE>
 247     sub to_scalar {
 248         my $scalar = slurp( @_ );
 249         return $scalar;
 250     }</PRE>
<P>The subroutine <CODE>slurp()</CODE> uses the magic undefined value of <CODE>$/</CODE> and
the magic file handle <CODE>ARGV</CODE> to support slurping into a scalar or
array. It also provides two wrapper subs that allow the caller to
control the context of the slurp. And the <CODE>to_array()</CODE> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking <CODE>wantarray</CODE>. It has 'slurp' in
<CODE>@EXPORT</CODE> and all three subroutines in <CODE>@EXPORT_OK</CODE>.</P>
 258 <P>&lt;Footnote: Slurp.pm is poorly named and it shouldn't be in the top level
 259 namespace.&gt;</P>
 260 <P>The original File::Slurp.pm has this code:</P>
<PRE>
sub read_file
{
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, &quot;&lt;$file&quot;) || croak &quot;open $file: $!&quot;;
        @r = &lt;F&gt;;
        close(F) || croak &quot;close $file: $!&quot;;

        return $r[0] unless wantarray;
        return @r;
}</PRE>
<P>This module provides several subroutines including <CODE>read_file()</CODE> (more
on the others later). <CODE>read_file()</CODE> behaves similarly to
<CODE>Slurp::slurp()</CODE> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of <CODE>$/</CODE> for scalar slurping but it uses an explicit
open call rather than using a localized <CODE>@ARGV</CODE> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines but that can easily be rectified by calling it inside an anonymous
array constructor <CODE>[]</CODE>.</P>
<P>Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic <CODE>$/</CODE> to slurp in scalar mode and the
natural behavior of <CODE>&lt;&gt;</CODE> in list context to slurp as lines. But
neither is optimized for speed nor can they handle <CODE>binmode()</CODE> to
support binary or unicode files. See below for more on slurp features
and speedups.</P>
 292 <P>
 293 <H2><A NAME="slurping api design">Slurping API Design</A></H2>
<P>The slurp modules on CPAN have a very simple API and don't support
<CODE>binmode()</CODE>. This section will cover various API design issues such as
efficient return by reference, <CODE>binmode()</CODE> and calling variations.</P>
 297 <P>Let's start with the call variations. Slurped files can be returned in
 298 four formats: as a single scalar, as a reference to a scalar, as a list
 299 of lines or as an anonymous array of lines. But the caller can only
 300 provide two contexts: scalar or list. So we have to either provide an
 301 API with more than one subroutine (as Slurp.pm did) or just provide one
 302 subroutine which only returns a scalar or a list (not an anonymous
 303 array) as File::Slurp does.</P>
 304 <P>I have used my own <CODE>read_file()</CODE> subroutine for years and it has the
 305 same API as File::Slurp: a single subroutine that returns a scalar or a
 306 list of lines depending on context. But I recognize the interest of
 307 those that want an anonymous array for line slurping. For one thing, it
 308 is easier to pass around to other subs and for another, it eliminates
 309 the extra copying of the lines via <CODE>return</CODE>. So my module provides only
 310 one slurp subroutine that returns the file data based on context and any
 311 format options passed in. There is no need for a specific
 312 slurp-in-as-a-scalar or list subroutine as the general <CODE>read_file()</CODE>
 313 sub will do that by default in the appropriate context. If you want
 314 <CODE>read_file()</CODE> to return a scalar reference or anonymous array of lines,
 315 you can request those formats with options. You can even pass in a
 316 reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that).  If you want to
 319 slurp a scalar into an array, just select the desired array element and
 320 that will provide scalar context to the <CODE>read_file()</CODE> subroutine.</P>
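<P>The context rules can be illustrated with a small stand-in. The
<CODE>read_file</CODE> below is a simplified sketch written for this
example (it is not the module's code and takes no options); it only shows
how the caller's context selects the return format:</P>
<PRE>
        use strict ;
        use warnings ;
        use Carp ;
        use File::Temp qw( tempfile ) ;

        # Simplified stand-in: returns a scalar or a list of lines by context.
        sub read_file {
                my( $file ) = @_ ;
                local( $/ ) = wantarray ? $/ : undef ;
                open( my $fh, '&lt;', $file ) or croak &quot;open $file: $!&quot; ;
                my @r = &lt;$fh&gt; ;
                close( $fh ) or croak &quot;close $file: $!&quot; ;
                return wantarray ? @r : $r[0] ;
        }

        my( $tmp_fh, $file ) = tempfile( UNLINK =&gt; 1 ) ;
        print $tmp_fh &quot;a\nb\n&quot; ;
        close( $tmp_fh ) or die &quot;close $file: $!&quot; ;

        my $text      = read_file( $file ) ;      # scalar context: whole file
        my @lines     = read_file( $file ) ;      # list context: list of lines
        my $lines_ref = [ read_file( $file ) ] ;  # anonymous array of lines

        # Assigning to a single array element supplies scalar context.
        my @slots ;
        $slots[0] = read_file( $file ) ;</PRE>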
<P>The next area to cover is what to name the slurp sub. I will go with
<CODE>read_file()</CODE>. It is descriptive, keeps compatibility with the
current simple API and doesn't use the 'slurp' nickname (though that nickname
is in the module name). Also I decided to keep the File::Slurp
namespace which was graciously handed over to me by its current owner,
David Muir.</P>
 327 <P>Another critical area when designing APIs is how to pass in
 328 arguments. The <CODE>read_file()</CODE> subroutine takes one required argument
 329 which is the file name. To support <CODE>binmode()</CODE> we need another optional
 330 argument. A third optional argument is needed to support returning a
 331 slurped scalar by reference. My first thought was to design the API with
 332 3 positional arguments - file name, buffer reference and binmode. But if
 333 you want to set the binmode and not pass in a buffer reference, you have
 334 to fill the second argument with <CODE>undef</CODE> and that is ugly. So I decided
 335 to make the filename argument positional and the other two named. The
 336 subroutine starts off like this:</P>
 337 <PRE>
 338         sub read_file {</PRE>
 339 <PRE>
 340                 my( $file_name, %args ) = @_ ;</PRE>
 341 <PRE>
 342                 my $buf ;
 343                 my $buf_ref = $args{'buf'} || \$buf ;</PRE>
<P>The other sub (<CODE>read_file_lines()</CODE>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if called
in a scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of <CODE>read_file()</CODE>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus of keeping the API the same for both subs will be seen
in how the two subs are optimized to work together.</P>
 353 <P>Write slurping (or spewing or burping :-)) needs to have its API
 354 designed as well. The biggest issue is not only needing to support
 355 optional arguments but a list of arguments to be written is needed. Perl
 356 6 will be able to handle that with optional named arguments and a final
 357 slurp argument. Since this is Perl 5 we have to do it using some
 358 cleverness. The first argument is the file name and it will be
 359 positional as with the <CODE>read_file</CODE> subroutine. But how can we pass in
 360 the optional arguments and also a list of data? The solution lies in the
 361 fact that the data list should never contain a reference.
 362 Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments are the data list. So the <CODE>write_file()</CODE>
 365 subroutine will start off like this:</P>
 366 <PRE>
 367         sub write_file {</PRE>
 368 <PRE>
 369                 my $file_name = shift ;</PRE>
 370 <PRE>
 371                 my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;</PRE>
 372 <P>Whether or not optional arguments are passed in, we leave the data list
 373 in <CODE>@_</CODE> to minimize any more copying. You call <CODE>write_file()</CODE> like this:</P>
 374 <PRE>
 375         write_file( 'foo', { binmode =&gt; ':raw' }, @data ) ;
 376         write_file( 'junk', { append =&gt; 1 }, @more_junk ) ;
 377         write_file( 'bar', @spew ) ;</PRE>
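<P>To make the calling convention concrete, here is one way the rest of
such a sub might be filled in. This is only a sketch under assumptions: it
uses a plain <CODE>open</CODE> and <CODE>print</CODE> rather than whatever the
module does internally, and the handling of the <CODE>append</CODE> and
<CODE>binmode</CODE> options is guessed from the calls shown above:</P>
<PRE>
        use strict ;
        use warnings ;
        use Carp ;
        use File::Temp qw( tempdir ) ;

        sub write_file {
                my $file_name = shift ;

                # A leading hash reference holds the options; plain data never does.
                my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

                my $mode = $args-&gt;{'append'} ? '&gt;&gt;' : '&gt;' ;
                open( my $fh, $mode, $file_name )
                        or croak &quot;can't open $file_name: $!&quot; ;
                binmode( $fh, $args-&gt;{'binmode'} ) if $args-&gt;{'binmode'} ;

                # The data list is still in @_ ; print it as one joined scalar.
                print $fh join( '', @_ ) ;
                close( $fh ) or croak &quot;close $file_name: $!&quot; ;
        }

        my $file = tempdir( CLEANUP =&gt; 1 ) . '/bar' ;
        write_file( $file, &quot;line 1\n&quot;, &quot;line 2\n&quot; ) ;
        write_file( $file, { append =&gt; 1 }, &quot;line 3\n&quot; ) ;

        # Read the file back to confirm the result of the two calls.
        my $check = do { local( @ARGV, $/ ) = $file ; &lt;&gt; } ;
        print $check ;</PRE>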
 378 <P>
 379 <H2><A NAME="fast slurping">Fast Slurping</A></H2>
 380 <P>Somewhere along the line, I learned about a way to slurp files faster
 381 than by setting $/ to undef. The method is very simple, you do a single
 382 read call with the size of the file (which the -s operator provides).
 383 This bypasses the I/O loop inside perl that checks for EOF and does all
 384 sorts of processing. I then decided to experiment and found that
 385 sysread is even faster as you would expect. sysread bypasses all of
 386 Perl's stdio and reads the file from the kernel buffers directly into a
 387 Perl scalar. This is why the slurp code in File::Slurp uses
 388 sysopen/sysread/syswrite. All the rest of the code is just to support
 389 the various options and data passing techniques.</P>
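<P>The core of the technique can be sketched in a few lines. The sub
name <CODE>sysread_file</CODE> is borrowed from the benchmark entries, but
the body below is an illustration written for this article, not the module's
source; the loop guards against a short read even though a single call is
the usual case for disk files:</P>
<PRE>
        use strict ;
        use warnings ;
        use Fcntl qw( O_RDONLY ) ;
        use File::Temp qw( tempfile ) ;

        sub sysread_file {
                my( $file_name ) = @_ ;

                sysopen( my $fh, $file_name, O_RDONLY )
                        or die &quot;can't open $file_name: $!&quot; ;

                # -s gives the size to ask for; an empty file yields 0 here.
                my $size = ( -s $fh ) || 0 ;
                my $buf = '' ;
                my $offset = 0 ;

                while ( $offset &lt; $size ) {
                        my $read_cnt = sysread( $fh, $buf, $size - $offset, $offset ) ;
                        die &quot;read error on $file_name: $!&quot; unless defined $read_cnt ;
                        last if $read_cnt == 0 ;
                        $offset += $read_cnt ;
                }

                return $buf ;
        }

        # Scratch file so the sketch can be run as is.
        my( $tmp_fh, $file ) = tempfile( UNLINK =&gt; 1 ) ;
        my $data = ( 'abc' x 30 . &quot;\n&quot; ) x 100 ;
        print $tmp_fh $data ;
        close( $tmp_fh ) or die &quot;close $file: $!&quot; ;

        my $text = sysread_file( $file ) ;
        print length( $text ), &quot; bytes slurped\n&quot; ;</PRE>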
<P>
<H2><A NAME="benchmarks">Benchmarks</A></H2>
<P>Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:</P>
 400 <PRE>
 401         perl -Ilib extras/slurp_bench.pl</PRE>
<P>If you pass in an argument on the command line, it will be passed to
<CODE>timethese()</CODE> and it will control the duration. It defaults to -2 which
makes each benchmark run for at least 2 seconds of cpu time.</P>
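<P>For readers without the tarball at hand, a cut-down benchmark in the
same spirit might look like the sketch below. It is not
<CODE>extras/slurp_bench.pl</CODE>; the entry names are invented, only two
slurp styles are compared, and the default duration is shortened to -1 (at
least one CPU second per entry) to keep the run brief:</P>
<PRE>
        use strict ;
        use warnings ;
        use Benchmark qw( timethese cmpthese ) ;
        use File::Temp qw( tempfile ) ;

        # Build a short test file like the one described in this section.
        my( $tmp_fh, $file ) = tempfile( UNLINK =&gt; 1 ) ;
        my @lines = ( 'abc' x 30 . &quot;\n&quot; ) x 100 ;
        print $tmp_fh @lines ;
        close( $tmp_fh ) or die &quot;close $file: $!&quot; ;

        # A negative count means CPU seconds, as with timethese().
        my $duration = shift || -1 ;

        my $results = timethese( $duration, {
                join_lines =&gt; sub {
                        open( my $fh, '&lt;', $file ) or die &quot;open $file: $!&quot; ;
                        my $text = join( '', &lt;$fh&gt; ) ;
                },
                undef_rs =&gt; sub {
                        open( my $fh, '&lt;', $file ) or die &quot;open $file: $!&quot; ;
                        my $text = do { local( $/ ) ; &lt;$fh&gt; } ;
                },
        } ) ;

        # Print a comparison chart of the relative speeds.
        cmpthese( $results ) ;</PRE>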
<P>The following numbers are from a run I did on my 300MHz sparc. You will
 406 most likely get much faster counts on your boxes but the relative speeds
 407 shouldn't change by much. If you see major differences on your
 408 benchmarks, please send me the results and your Perl and OS
 409 versions. Also you can play with the benchmark script and add more slurp
 410 variations or data files.</P>
 411 <P>The rest of this section will be discussing the results of the
 412 benchmarks. You can refer to extras/slurp_bench.pl to see the code for
 413 the individual benchmarks. If the benchmark name starts with cpan_, it
 414 is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
 415 from the new File::Slurp.pm. Those that start with file_contents_ are
 416 from a client's code base. The rest are variations I created to
 417 highlight certain aspects of the benchmarks.</P>
 418 <P>The short and long file data is made like this:</P>
 419 <PRE>
 420         my @lines = ( 'abc' x 30 . &quot;\n&quot;)  x 100 ;
 421         my $text = join( '', @lines ) ;</PRE>
 422 <PRE>
 423         @lines = ( 'abc' x 40 . &quot;\n&quot;)  x 1000 ;
 424         $text = join( '', @lines ) ;</PRE>
 425 <P>So the short file is 9,100 bytes and the long file is 121,000
 426 bytes.</P>
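<P>Those sizes follow from the line lengths: each short line is
3 * 30 + 1 = 91 bytes and each long line is 3 * 40 + 1 = 121 bytes. A
quick check (the variable names are chosen for this example):</P>
<PRE>
        use strict ;
        use warnings ;

        my $short = join( '', ( 'abc' x 30 . &quot;\n&quot; ) x 100 ) ;
        my $long  = join( '', ( 'abc' x 40 . &quot;\n&quot; ) x 1000 ) ;

        # 91 bytes * 100 lines and 121 bytes * 1000 lines.
        printf &quot;short: %d bytes, long: %d bytes\n&quot;,
                length( $short ), length( $long ) ;</PRE>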
 427 <P>
 428 <H3><A NAME="scalar slurp of short file">Scalar Slurp of Short File</A></H3>
 429 <PRE>
 430         file_contents        651/s
 431         file_contents_no_OO  828/s
 432         cpan_read_file      1866/s
 433         cpan_slurp          1934/s
 434         read_file           2079/s
 435         new                 2270/s
 436         new_buf_ref         2403/s
 437         new_scalar_ref      2415/s
 438         sysread_file        2572/s</PRE>
 439 <P>
 440 <H3><A NAME="scalar slurp of long file">Scalar Slurp of Long File</A></H3>
 441 <PRE>
 442         file_contents_no_OO 82.9/s
 443         file_contents       85.4/s
 444         cpan_read_file       250/s
 445         cpan_slurp           257/s
 446         read_file            323/s
 447         new                  468/s
 448         sysread_file         489/s
 449         new_scalar_ref       766/s
 450         new_buf_ref          767/s</PRE>
<P>The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall
except for the very simple sysread_file entry, which was added to
highlight the overhead of the more flexible new module (and that overhead
isn't much). The file_contents entries are always the worst since they do a
list slurp and then a join, a classic newbie and cargo culted
style which is extremely slow. Also the OO code in file_contents slows
it down even more (I added the file_contents_no_OO entry to show this).
The two CPAN modules are decent with small files but they are laggards
compared to the new module when the file gets much larger.</P>
 463 <P>
 464 <H3><A NAME="list slurp of short file">List Slurp of Short File</A></H3>
 465 <PRE>
 466         cpan_read_file          589/s
 467         cpan_slurp_to_array     620/s
 468         read_file               824/s
 469         new_array_ref           824/s
 470         sysread_file            828/s
 471         new                     829/s
 472         new_in_anon_array       833/s
 473         cpan_slurp_to_array_ref 836/s</PRE>
 474 <P>
 475 <H3><A NAME="list slurp of long file">List Slurp of Long File</A></H3>
 476 <PRE>
 477         cpan_read_file          62.4/s
 478         cpan_slurp_to_array     62.7/s
 479         read_file               92.9/s
 480         sysread_file            94.8/s
 481         new_array_ref           95.5/s
 482         new                     96.2/s
 483         cpan_slurp_to_array_ref 96.3/s
 484         new_in_anon_array       97.2/s</PRE>
 485 <P>This is perhaps the most interesting result of this benchmark. Five
 486 different entries have effectively tied for the lead. The logical
 487 conclusion is that splitting the input into lines is the bounding
 488 operation, no matter how the file gets slurped. This is the only
 489 benchmark where the new module isn't the clear winner (in the long file
 490 entries - it is no worse than a close second in the short file
 491 entries).</P>
 492 <P>Note: In the benchmark information for all the spew entries, the extra
 493 number at the end of each line is how many wallclock seconds the whole
 494 entry took. The benchmarks were run for at least 2 CPU seconds per
 495 entry. The unusually large wallclock times will be discussed below.</P>
 496 <P>
 497 <H3><A NAME="scalar spew of short file">Scalar Spew of Short File</A></H3>
 498 <PRE>
 499         cpan_write_file 1035/s  38
 500         print_file      1055/s  41
 501         syswrite_file   1135/s  44
 502         new             1519/s  2
 503         print_join_file 1766/s  2
 504         new_ref         1900/s  2
 505         syswrite_file2  2138/s  2</PRE>
 506 <P>
 507 <H3><A NAME="scalar spew of long file">Scalar Spew of Long File</A></H3>
 508 <PRE>
 509         cpan_write_file 164/s   20
 510         print_file      211/s   26
 511         syswrite_file   236/s   25
 512         print_join_file 277/s   2
 513         new             295/s   2
 514         syswrite_file2  428/s   2
 515         new_ref         608/s   2</PRE>
 516 <P>In the scalar spew entries, the new module API wins when it is passed a
 517 reference to the scalar buffer. The <CODE>syswrite_file2</CODE> entry beats it
 518 with the shorter file due to its simpler code. The old CPAN module is
 519 the slowest due to its extra copying of the data and its use of print.</P>
 520 <P>
 521 <H3><A NAME="list spew of short file">List Spew of Short File</A></H3>
 522 <PRE>
 523         cpan_write_file  794/s  29
 524         syswrite_file   1000/s  38
 525         print_file      1013/s  42
 526         new             1399/s  2
 527         print_join_file 1557/s  2</PRE>
 528 <P>
 529 <H3><A NAME="list spew of long file">List Spew of Long File</A></H3>
 530 <PRE>
 531         cpan_write_file 112/s   12
 532         print_file      179/s   21
 533         syswrite_file   181/s   19
 534         print_join_file 205/s   2
 535         new             228/s   2</PRE>
 536 <P>Again, the simple <CODE>print_join_file</CODE> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
 538 when the file size gets longer. The old CPAN module lags behind the
 539 others since it first makes an extra copy of the lines and then it calls
 540 <CODE>print</CODE> on the output list and that is much slower than passing to
 541 <CODE>print</CODE> a single scalar generated by join. The <CODE>print_file</CODE> entry
 542 shows the advantage of directly printing <CODE>@_</CODE> and the
 543 <CODE>print_join_file</CODE> adds the join optimization.</P>
 544 <P>Now about those long wallclock times. If you look carefully at the
 545 benchmark code of all the spew entries, you will find that some always
 546 write to new files and some overwrite existing files. When I asked David
 547 Muir why the old File::Slurp module had an <CODE>overwrite</CODE> subroutine, he
 548 answered that by overwriting a file, you always guarantee something
 549 readable is in the file. If you create a new file, there is a moment
 550 when the new file is created but has no data in it. I feel this is not a
 551 good enough answer. Even when overwriting, you can write a shorter file
 552 than the existing file and then you have to truncate the file to the new
 553 size. There is a small race window there where another process can slurp
 554 in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.</P>
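<P>As an illustration of what proper locking could look like, here is a
sketch using advisory <CODE>flock</CODE> locks. The sub names
<CODE>locked_write</CODE> and <CODE>locked_read</CODE> are hypothetical, the
scheme only protects processes that all use the same locking discipline,
and it is not code from either version of the module:</P>
<PRE>
        use strict ;
        use warnings ;
        use Fcntl qw( :flock O_WRONLY O_CREAT ) ;
        use File::Temp qw( tempdir ) ;

        sub locked_write {
                my( $file_name, $data ) = @_ ;

                # Open without truncating; the file is only cut down to size
                # once the exclusive lock is held.
                sysopen( my $fh, $file_name, O_WRONLY | O_CREAT )
                        or die &quot;can't open $file_name: $!&quot; ;
                flock( $fh, LOCK_EX ) or die &quot;can't lock $file_name: $!&quot; ;
                truncate( $fh, 0 ) or die &quot;can't truncate $file_name: $!&quot; ;
                defined( syswrite( $fh, $data ) )
                        or die &quot;write error on $file_name: $!&quot; ;

                # Closing the handle releases the lock.
                close( $fh ) or die &quot;close $file_name: $!&quot; ;
        }

        sub locked_read {
                my( $file_name ) = @_ ;

                open( my $fh, '&lt;', $file_name )
                        or die &quot;can't open $file_name: $!&quot; ;
                flock( $fh, LOCK_SH ) or die &quot;can't lock $file_name: $!&quot; ;
                my $text = do { local( $/ ) ; &lt;$fh&gt; } ;
                close( $fh ) ;
                return $text ;
        }

        # Write a longer file, then a shorter one over it, and read it back.
        my $file = tempdir( CLEANUP =&gt; 1 ) . '/locked.txt' ;
        locked_write( $file, &quot;a much longer first version\n&quot; ) ;
        locked_write( $file, &quot;short\n&quot; ) ;
        my $text = locked_read( $file ) ;
        print $text ;</PRE>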
<P>But what about those long times? It is all about the difference
between creating files and overwriting existing ones. The former has to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite saves on
disk seeks as well as on CPU time. In fact, when running this benchmark,
I could hear my disk going crazy allocating inodes during the spew
operations. This speedup in both CPU and wallclock time is why the new module
always overwrites when spewing files. It also does the proper
truncate (and this is checked in the tests by spewing shorter files
after longer ones had previously been written). The <CODE>overwrite</CODE>
subroutine is just a typeglob alias to <CODE>write_file</CODE> and is there for
backwards compatibility with the old File::Slurp module.</P>
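<P>A minimal sketch of the overwrite-then-truncate idea and of a typeglob
alias is shown here. The names <CODE>spew_in_place</CODE> and <CODE>overwrite_in_place</CODE>
are hypothetical and this is not the module's actual code; it only assumes
the standard <CODE>sysopen</CODE>, <CODE>syswrite</CODE>, <CODE>sysseek</CODE> and <CODE>truncate</CODE> builtins.</P>
<PRE>
        use strict ;
        use warnings ;
        use Fcntl qw( :DEFAULT :seek ) ;

        # hypothetical name; a sketch, not the module's real code
        sub spew_in_place {

                my( $file_name, $data ) = @_ ;

                # O_CREAT without O_TRUNC: an existing file keeps its
                # inode and is overwritten from the start
                sysopen( my $fh, $file_name, O_WRONLY | O_CREAT ) or
                                        die &quot;Can't open $file_name: $!&quot; ;

                my $offset = 0 ;
                while( $offset &lt; length( $data ) ) {

                        my $cnt = syswrite( $fh, $data,
                                        length( $data ) - $offset, $offset ) ;
                        die &quot;write error in file $file_name: $!&quot; unless $cnt ;
                        $offset += $cnt ;
                }

                # cut off what is left of a longer previous version
                truncate( $fh, sysseek( $fh, 0, SEEK_CUR ) ) or
                                        die &quot;Can't truncate $file_name: $!&quot; ;
                close( $fh ) or die &quot;Can't close $file_name: $!&quot; ;
                return 1 ;
        }

        # typeglob alias: both names now call the same subroutine
        {
                no warnings 'once' ;
                *overwrite_in_place = \&amp;spew_in_place ;
        }</PRE>
<P>Because the alias is installed at run time, call it with parentheses,
as in <CODE>overwrite_in_place( $file_name, $data )</CODE>.</P>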
 569 <P>
 570 <H3><A NAME="benchmark conclusion">Benchmark Conclusion</A></H3>
 571 <P>Other than a few cases where a simpler entry beat it out, the new
 572 File::Slurp module is either the speed leader or among the leaders. Its
 573 special APIs for passing buffers by reference prove to be very useful
 574 speedups. Also it uses all the other optimizations including using
 575 <CODE>sysread/syswrite</CODE> and joining output lines. I expect many projects
 576 that extensively use slurping will notice the speed improvements,
 577 especially if they rewrite their code to take advantage of the new API
 578 features. Even if they don't touch their code and use the simple API
 579 they will get a significant speedup.</P>
 580 <P>
 581 <H2><A NAME="error handling">Error Handling</A></H2>
<P>Slurp subroutines are subject to conditions such as not being able to
open the file, or I/O errors. How these errors are handled, and what the
caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call <CODE>die()</CODE> or even
better, <CODE>croak()</CODE>. But sometimes you want the slurp to either
<CODE>warn()</CODE>/<CODE>carp()</CODE> or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an <CODE>eval</CODE> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If set to 'carp' then carp
will be called. Set it to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.</P>
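<P>One way such an option could be dispatched is sketched here. The helper
names <CODE>handle_error</CODE> and <CODE>slurp_or_status</CODE> are hypothetical, and the
module's real internals may well differ; only the 'err_mode' values come
from the description above.</P>
<PRE>
        use strict ;
        use warnings ;
        use Carp ;

        # hypothetical helper: act on the 'err_mode' option
        sub handle_error {

                my( $args, $err_msg ) = @_ ;

                # 'croak' is the default mode
                my $err_mode = $args-&gt;{'err_mode'} || 'croak' ;

                croak $err_msg if $err_mode eq 'croak' ;
                carp $err_msg if $err_mode eq 'carp' ;

                # 'quiet' or any other string: no handler is called and
                # the caller only gets a false status
                return undef ;
        }

        # hypothetical scalar slurp that reports errors this way
        sub slurp_or_status {

                my( $file_name, %args ) = @_ ;

                open( my $fh, '&lt;', $file_name ) or return handle_error(
                                \%args, &quot;Can't open $file_name: $!&quot; ) ;

                local $/ ;
                my $data = &lt;$fh&gt; ;
                close( $fh ) ;
                return $data ;
        }</PRE>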
<P><CODE>write_file()</CODE> doesn't use the return value for data so it can return a
false status value in-band to mark an error. <CODE>read_file()</CODE> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
return would work here. But if you slurp in lines by calling it in a
list context, a bare <CODE>return</CODE> will return an empty list, which is the
same value it would get from an existing but empty file. So now,
<CODE>read_file()</CODE> will do something I normally strongly advocate against,
i.e., returning an explicit <CODE>undef</CODE> value. In scalar context this
still signals an error, and in list context, the first returned value
will be <CODE>undef</CODE>, which is not legal data for the first element. So
the list context also gets an error status it can detect:</P>
 609 <PRE>
 610         my @lines = read_file( $file_name, err_mode =&gt; 'quiet' ) ;
 611         your_handle_error( &quot;$file_name can't be read\n&quot; ) unless
 612                                         @lines &amp;&amp; defined $lines[0] ;</PRE>
 613 <P>
 614 <H2><A NAME="file::fastslurp">File::FastSlurp</A></H2>
<PRE>
        use Carp ;
        use Fcntl qw( :DEFAULT ) ;</PRE>
<PRE>
        sub read_file {</PRE>
<PRE>
                my( $file_name, %args ) = @_ ;</PRE>
<PRE>
                my $buf ;
                my $buf_ref = $args{'buf_ref'} || \$buf ;
                ${$buf_ref} = '' unless defined ${$buf_ref} ;</PRE>
<PRE>
                my $mode = O_RDONLY ;
                $mode |= O_BINARY if $args{'binmode'} ;</PRE>
<PRE>
                local( *FH ) ;
                unless( sysopen( FH, $file_name, $mode ) ) {</PRE>
<PRE>
                        carp &quot;Can't open $file_name: $!&quot; ;
                        return undef ;
                }</PRE>
<PRE>
                my $size_left = -s FH ;</PRE>
<PRE>
                while( $size_left &gt; 0 ) {</PRE>
<PRE>
                        my $read_cnt = sysread( FH, ${$buf_ref},
                                        $size_left, length ${$buf_ref} ) ;</PRE>
<PRE>
                        unless( $read_cnt ) {</PRE>
<PRE>
        # undef is an error, 0 is an early end of file</PRE>
<PRE>
                                carp &quot;read error in file $file_name: $!&quot;
                                                unless defined $read_cnt ;
                                last ;
                        }</PRE>
<PRE>
                        $size_left -= $read_cnt ;
                }</PRE>
<PRE>
        # handle void context (return scalar by buffer reference)</PRE>
<PRE>
                return unless defined wantarray ;</PRE>
<PRE>
        # handle list context (split after each record separator)</PRE>
<PRE>
                return split m|(?&lt;=\Q$/\E)|, ${$buf_ref} if wantarray ;</PRE>
<PRE>
        # handle scalar context</PRE>
<PRE>
                return ${$buf_ref} ;
        }</PRE>
<PRE>
        sub write_file {</PRE>
<PRE>
                my $file_name = shift ;</PRE>
<PRE>
                my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
                my $buf = join '', @_ ;</PRE>
<PRE>
        # O_CREAT lets the call create a file that does not exist yet</PRE>
<PRE>
                my $mode = O_WRONLY | O_CREAT ;
                $mode |= O_BINARY if $args-&gt;{'binmode'} ;
                $mode |= O_APPEND if $args-&gt;{'append'} ;</PRE>
<PRE>
                local( *FH ) ;
                unless( sysopen( FH, $file_name, $mode ) ) {</PRE>
<PRE>
                        carp &quot;Can't open $file_name: $!&quot; ;
                        return ;
                }</PRE>
<PRE>
                my $size_left = length( $buf ) ;
                my $offset = 0 ;</PRE>
<PRE>
                while( $size_left &gt; 0 ) {</PRE>
<PRE>
                        my $write_cnt = syswrite( FH, $buf,
                                        $size_left, $offset ) ;</PRE>
<PRE>
                        unless( $write_cnt ) {</PRE>
<PRE>
                                carp &quot;write error in file $file_name: $!&quot; ;
                                return ;
                        }</PRE>
<PRE>
                        $size_left -= $write_cnt ;
                        $offset += $write_cnt ;
                }</PRE>
<PRE>
        # drop leftover data from a longer previous version</PRE>
<PRE>
                truncate( FH, $offset ) unless $args-&gt;{'append'} ;</PRE>
<PRE>
                return 1 ;
        }</PRE>
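<P>The list context return in <CODE>read_file</CODE> relies on splitting the buffer
with a zero-width lookbehind so each line keeps its terminator. Here is a
small standalone sketch of that technique; <CODE>split_lines</CODE> is a hypothetical
helper, and it assumes <CODE>$/</CODE> is either undefined or a plain string such as
the default newline (paragraph mode is not handled).</P>
<PRE>
        use strict ;
        use warnings ;

        # hypothetical helper: split a slurped buffer into lines and
        # keep each line's terminator
        sub split_lines {

                my( $buf ) = @_ ;

                my $sep = $/ ;

                # no separator means the whole buffer is one record
                return $buf unless defined $sep ;

                # zero-width lookbehind: split after each separator
                # without consuming it
                return split m|(?&lt;=\Q$sep\E)|, $buf ;
        }

        my @lines = split_lines( &quot;one\ntwo\nthree&quot; ) ;
        # @lines is now ( &quot;one\n&quot;, &quot;two\n&quot;, &quot;three&quot; )</PRE>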
 694 <P>
 695 <H2><A NAME="slurping in perl 6">Slurping in Perl 6</A></H2>
<P>As usual with Perl 6, much of the work in this article will be put to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?</P>
 706 <P>
 707 <H2><A NAME="in summary">In Summary</A></H2>
<P>We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still take care not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.</P>
 718 <P>
 719 <H2><A NAME="acknowledgements">Acknowledgements</A></H2>
 720
 721 </BODY>
 722
 723 </HTML>