<HTML>
<HEAD>
<TITLE>Perl Slurp Ease</TITLE>
<LINK REV="made" HREF="mailto:steve@dewitt.vnet.net">
</HEAD>

<BODY>

<A NAME="__index__"></A>
<!-- INDEX BEGIN -->

<UL>

	<LI><A HREF="#perl slurp ease">Perl Slurp Ease</A></LI>
	<UL>

		<LI><A HREF="#introduction">Introduction</A></LI>
		<LI><A HREF="#global operations">Global Operations</A></LI>
		<LI><A HREF="#traditional slurping">Traditional Slurping</A></LI>
		<LI><A HREF="#write slurping">Write Slurping</A></LI>
		<LI><A HREF="#slurp on the cpan">Slurp on the CPAN</A></LI>
		<LI><A HREF="#slurping api design">Slurping API Design</A></LI>
		<LI><A HREF="#fast slurping">Fast Slurping</A></LI>
		<LI><A HREF="#benchmarks">Benchmarks</A></LI>
		<UL>

			<LI><A HREF="#scalar slurp of short file">Scalar Slurp of Short File</A></LI>
			<LI><A HREF="#scalar slurp of long file">Scalar Slurp of Long File</A></LI>
			<LI><A HREF="#list slurp of short file">List Slurp of Short File</A></LI>
			<LI><A HREF="#list slurp of long file">List Slurp of Long File</A></LI>
			<LI><A HREF="#scalar spew of short file">Scalar Spew of Short File</A></LI>
			<LI><A HREF="#scalar spew of long file">Scalar Spew of Long File</A></LI>
			<LI><A HREF="#list spew of short file">List Spew of Short File</A></LI>
			<LI><A HREF="#list spew of long file">List Spew of Long File</A></LI>
			<LI><A HREF="#benchmark conclusion">Benchmark Conclusion</A></LI>
		</UL>

		<LI><A HREF="#error handling">Error Handling</A></LI>
		<LI><A HREF="#file::fastslurp">File::FastSlurp</A></LI>
		<LI><A HREF="#slurping in perl 6">Slurping in Perl 6</A></LI>
		<LI><A HREF="#in summary">In Summary</A></LI>
		<LI><A HREF="#acknowledgements">Acknowledgements</A></LI>
	</UL>

</UL>
<!-- INDEX END -->

<HR>
<P>
<H1><A NAME="perl slurp ease">Perl Slurp Ease</A></H1>
<P>
<H2><A NAME="introduction">Introduction</A></H2>
<P>One of the common Perl idioms is processing text files line by line:</P>
<PRE>
    while( &lt;FH&gt; ) {
        do something with $_
    }</PRE>
<P>This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and it is
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:</P>
<PRE>
    while( &lt;FH&gt; ) {
        push @lines, $_ ;
    }</PRE>
<PRE>
    foreach ( @lines ) {
        do something with $_
    }</PRE>
<P>Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now, slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line if done properly.</P>
<P>The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. You can slurp STDIN if
you know that you can handle the maximum size input without
detrimentally affecting your memory usage. So I advocate slurping only
disk files and only when you know their size is reasonable and you have
a real reason to process the file as a whole. Note that reasonable size
these days is much larger than in the bad old days of limited RAM.
Slurping in a megabyte is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini-)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.</P>
<P>Another major win for slurping over line by line is speed. Perl's I/O
system (like many others) is slow. Calling <CODE>&lt;&gt;</CODE> for each line
requires a check for the end of line, checks for EOF, copying a line,
munging the internal handle structure, etc. That is plenty of work for
each line read in, whereas slurping, if done correctly, will usually
involve only one I/O call and no extra data copying. The same is true
for writing files to disk, and we will cover that as well (even though
the term slurping traditionally refers to a read operation, I use the
term ``slurp'' for the concept of doing I/O with an entire file in one
operation).</P>
<P>Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of <CODE>//g</CODE>, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space) and performing complex transformations
such as template expansion.</P>
<P>
<H2><A NAME="global operations">Global Operations</A></H2>
<P>Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing but that would be slower and require
more code.</P>
<P>A common problem is reading in a file with key/value pairs. There are
modules which do this but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.</P>
<PRE>
    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;</PRE>
<P>That matches a key which starts a line (anywhere inside the string
because of the <CODE>/m</CODE> modifier), the '=' char and the text to the end of the
line (again, <CODE>/m</CODE> makes that work). In fact the ending <CODE>$</CODE> is not even needed
since <CODE>.</CODE> will not normally match a newline. Since the key and value are
grabbed and the <CODE>m//</CODE> is in list context with the <CODE>/g</CODE> modifier, it will
grab all key/value pairs and return them. The <CODE>%config</CODE> hash will be
assigned this list and now you have the file fully parsed into a hash.</P>
<P>Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:</P>
<PRE>
    $text =~ s/&lt;%(.+?)%&gt;/$template{$1}/g ;</PRE>
<P>That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:</P>
<PRE>
    $text =~ s/&lt;%(\w+)_START%&gt;(.+?)&lt;%\1_END%&gt;/ template($1, $2)/sge ;</PRE>
<P>Just supply a <CODE>template</CODE> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the <CODE>/s</CODE> modifier. This is
something that is much trickier with line by line processing.</P>
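<P>As an illustration, here is one way such a <CODE>template</CODE> sub could
look. This is just a sketch; the <CODE>%fill_of</CODE> hash of per-block
expansions is made up for this example and is not code from any
module:</P>
<PRE>
    my %fill_of = ( GREETING =&gt; { name =&gt; 'world' } ) ;

    sub template {

        my( $block_name, $body ) = @_ ;

        # expand each &lt;%key%&gt; marker from this block's own hash
        $body =~ s/&lt;%(\w+)%&gt;/$fill_of{$block_name}{$1}/g ;
        return $body ;
    }</PRE>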
<P>Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read and write files.</P>
<P>Slurping a file into an array also offers some useful advantages.
One simple example is reading in a flat database where each record has
fields separated by a character such as <CODE>:</CODE>:</P>
<PRE>
    my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;</PRE>
<P>Random access to any line of the slurped file is another advantage. Also
a line index could be built to speed up searching the array of lines.</P>
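<P>Here is a minimal sketch of such a line index, built on the
<CODE>/etc/passwd</CODE> slurp above (where <CODE>read_file()</CODE> is any of the
slurp subs discussed in this article):</P>
<PRE>
    my @lines = read_file( '/etc/passwd' ) ;

    # index each user name to its line number for fast random access
    my %line_of ;
    for my $i ( 0 .. $#lines ) {
        my( $user ) = split /:/, $lines[$i] ;
        $line_of{$user} = $i ;
    }

    my $root_entry = $lines[ $line_of{'root'} ] ;</PRE>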
<P>
<H2><A NAME="traditional slurping">Traditional Slurping</A></H2>
<P>Perl has always supported slurping files with minimal code. Slurping of
a file to a list of lines is trivial, just call the <CODE>&lt;&gt;</CODE> operator
in a list context:</P>
<PRE>
    my @lines = &lt;FH&gt; ;</PRE>
<P>and slurping to a scalar isn't much more work. Just set the built in
variable <CODE>$/</CODE> (the input record separator) to the undefined value and read
in the file with <CODE>&lt;&gt;</CODE>:</P>
<PRE>
    {
        local( $/, *FH ) ;
        open( FH, $file ) or die &quot;sudden flaming death\n&quot; ;
        $text = &lt;FH&gt; ;
    }</PRE>
<P>Notice the use of <CODE>local()</CODE>. It sets <CODE>$/</CODE> to <CODE>undef</CODE> for you and when
the scope exits it will revert <CODE>$/</CODE> back to its previous value (most
likely ``\n'').</P>
<P>Here is a Perl idiom that allows the <CODE>$text</CODE> variable to be declared,
and there is no need for a tightly nested block. The <CODE>do</CODE> block will
execute <CODE>&lt;FH&gt;</CODE> in a scalar context and slurp in the file named by
<CODE>$file</CODE>:</P>
<PRE>
    local( *FH ) ;
    open( FH, $file ) or die &quot;sudden flaming death\n&quot; ;
    my $text = do { local( $/ ) ; &lt;FH&gt; } ;</PRE>
<P>Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:</P>
<PRE>
    {
        local( $/ ) ;
        open( my $fh, $file ) or die &quot;sudden flaming death\n&quot; ;
        $text = &lt;$fh&gt; ;
    }</PRE>
<PRE>
    open( my $fh, $file ) or die &quot;sudden flaming death\n&quot; ;
    my $text = do { local( $/ ) ; &lt;$fh&gt; } ;</PRE>
<P>And this is a variant of that idiom that removes the need for the open
call:</P>
<PRE>
    my $text = do { local( @ARGV, $/ ) = $file ; &lt;&gt; } ;</PRE>
<P>The filename in <CODE>$file</CODE> is assigned to a localized <CODE>@ARGV</CODE> and the
null filehandle is used which reads the data from the files in <CODE>@ARGV</CODE>.</P>
<P>Instead of assigning to a scalar, all the above slurps can assign to an
array and it will get the file, but split into lines (using <CODE>$/</CODE> as the
end of line marker).</P>
<P>There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:</P>
<PRE>
    my $text = join( '', &lt;FH&gt; ) ;</PRE>
<P>That needlessly splits the input file into lines (<CODE>join</CODE> provides a
list context to <CODE>&lt;FH&gt;</CODE>) and then joins up those lines again. The
original coder of this idiom obviously never read <EM>perlvar</EM> to learn
how to use <CODE>$/</CODE> to allow scalar slurping.</P>
<P>
<H2><A NAME="write slurping">Write Slurping</A></H2>
<P>While reading in entire files at one time is common, writing out entire
files is also done. We call it ``slurping'' when we read in files, but
there is no commonly accepted term for the write operation. I asked some
Perl colleagues and got two interesting nominations. Peter Scott said to
call it ``burping'' (rhymes with ``slurping'' and suggests movement in
the opposite direction). Others suggested ``spewing'', which has a
stronger visual image :-) Tell me your favorite or suggest your own. I
will use both in this section so you can see how they work for you.</P>
<P>Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp subroutine:</P>
<PRE>
    sub burp {
        my( $file_name ) = shift ;
        open( my $fh, &quot;&gt;$file_name&quot; ) ||
            die &quot;can't create $file_name $!&quot; ;
        print $fh @_ ;
    }</PRE>
<P>Note that it doesn't copy the input text but passes <CODE>@_</CODE> directly to
print. We will look at faster variations of that later on.</P>
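<P>Calling it is as simple as passing the file name and then whatever data
you want written (the file names here are just for illustration):</P>
<PRE>
    burp( 'lines.txt', @lines ) ;
    burp( 'whole.txt', $text ) ;</PRE>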
<P>
<H2><A NAME="slurp on the cpan">Slurp on the CPAN</A></H2>
<P>As you would expect there are modules in the CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).</P>
<P>Here is the code from Slurp.pm:</P>
<PRE>
    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return &lt;ARGV&gt;;
    }</PRE>
<PRE>
    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }</PRE>
<PRE>
    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }</PRE>
<P>The subroutine <CODE>slurp()</CODE> uses the magic undefined value of <CODE>$/</CODE> and
the magic file handle <CODE>ARGV</CODE> to support slurping into a scalar or
array. It also provides two wrapper subs that allow the caller to
control the context of the slurp. And the <CODE>to_array()</CODE> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking <CODE>wantarray</CODE>. It has 'slurp' in
<CODE>@EXPORT</CODE> and all three subroutines in <CODE>@EXPORT_OK</CODE>.</P>
<P>&lt;Footnote: Slurp.pm is poorly named and it shouldn't be in the top level
namespace.&gt;</P>
<P>The original File::Slurp.pm has this code:</P>
<PRE>
    sub read_file
    {
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, &quot;&lt;$file&quot;) || croak &quot;open $file: $!&quot;;
        @r = &lt;F&gt;;
        close(F) || croak &quot;close $file: $!&quot;;

        return $r[0] unless wantarray;
        return @r;
    }</PRE>
<P>This module provides several subroutines including <CODE>read_file()</CODE> (more
on the others later). <CODE>read_file()</CODE> behaves similarly to
<CODE>Slurp::slurp()</CODE> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of <CODE>$/</CODE> for scalar slurping, but it uses an explicit
open call rather than a localized <CODE>@ARGV</CODE> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines, but that can easily be rectified by calling it inside an anonymous
array constructor <CODE>[]</CODE>.</P>
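<P>That is, something like this one-liner:</P>
<PRE>
    my $lines_ref = [ read_file( $file ) ] ;</PRE>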
<P>Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic <CODE>$/</CODE> to slurp in scalar mode and the
natural behavior of <CODE>&lt;&gt;</CODE> in list context to slurp as lines. But
neither is optimized for speed, nor can they handle <CODE>binmode()</CODE> to
support binary or Unicode files. See below for more on slurp features
and speedups.</P>
<P>
<H2><A NAME="slurping api design">Slurping API Design</A></H2>
<P>The slurp modules on CPAN have a very simple API and don't support
<CODE>binmode()</CODE>. This section will cover various API design issues such as
efficient return by reference, <CODE>binmode()</CODE> and calling variations.</P>
<P>Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines or as an anonymous array of lines. But the caller can only
provide two contexts: scalar or list. So we have to either provide an
API with more than one subroutine (as Slurp.pm did) or just provide one
subroutine which only returns a scalar or a list (not an anonymous
array) as File::Slurp does.</P>
<P>I have used my own <CODE>read_file()</CODE> subroutine for years and it has the
same API as File::Slurp: a single subroutine that returns a scalar or a
list of lines depending on context. But I recognize the interest of
those that want an anonymous array for line slurping. For one thing, it
is easier to pass around to other subs and for another, it eliminates
the extra copying of the lines via <CODE>return</CODE>. So my module provides only
one slurp subroutine that returns the file data based on context and any
format options passed in. There is no need for a specific
slurp-in-as-a-scalar or list subroutine as the general <CODE>read_file()</CODE>
sub will do that by default in the appropriate context. If you want
<CODE>read_file()</CODE> to return a scalar reference or anonymous array of lines,
you can request those formats with options. You can even pass in a
reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that). If you want to
slurp a scalar into an array, just select the desired array element and
that will provide scalar context to the <CODE>read_file()</CODE> subroutine.</P>
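<P>For illustration, here is how those call variations look. The
<CODE>buf_ref</CODE> option name appears in the code later in this article; the
<CODE>scalar_ref</CODE> and <CODE>array_ref</CODE> option names follow the released
File::Slurp module and are shown here only as examples:</P>
<PRE>
    # plain scalar context - the whole file as one string
    my $text = read_file( $file ) ;

    # ask for a scalar reference instead (saves a buffer copy)
    my $text_ref = read_file( $file, scalar_ref =&gt; 1 ) ;

    # fill a previously allocated buffer via a passed-in reference
    my $buf ;
    read_file( $file, buf_ref =&gt; \$buf ) ;

    # assigning to one array element provides scalar context
    my @slurps ;
    $slurps[0] = read_file( $file ) ;</PRE>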
<P>The next area to cover is what to name the slurp sub. I will go with
<CODE>read_file()</CODE>. It is descriptive, keeps compatibility with the
current simple API, and doesn't use the 'slurp' nickname (though that
nickname is in the module name). Also I decided to keep the File::Slurp
namespace which was graciously handed over to me by its current owner,
David Muir.</P>
<P>Another critical area when designing APIs is how to pass in
arguments. The <CODE>read_file()</CODE> subroutine takes one required argument
which is the file name. To support <CODE>binmode()</CODE> we need another optional
argument. A third optional argument is needed to support returning a
slurped scalar by reference. My first thought was to design the API with
3 positional arguments - file name, buffer reference and binmode. But if
you want to set the binmode and not pass in a buffer reference, you have
to fill the second argument with <CODE>undef</CODE> and that is ugly. So I decided
to make the filename argument positional and the other two named. The
subroutine starts off like this:</P>
<PRE>
    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;</PRE>
<P>The other sub (<CODE>read_file_lines()</CODE>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if it is called
in a scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of <CODE>read_file()</CODE>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus with keeping the API the same for both subs will be seen
in how the two subs are optimized to work together.</P>
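<P>Based on that API description, calls to it would look something like
this (a sketch, not code from the module itself):</P>
<PRE>
    # list context - a plain list of lines
    my @lines = read_file_lines( $file ) ;

    # scalar context - an anonymous array of lines
    my $lines_ref = read_file_lines( $file, binmode =&gt; ':raw' ) ;</PRE>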
<P>Write slurping (or spewing or burping :-)) needs to have its API
designed as well. The biggest issue is that it needs to support not only
optional arguments but also a list of arguments to be written. Perl
6 will be able to handle that with optional named arguments and a final
slurp argument. Since this is Perl 5 we have to do it using some
cleverness. The first argument is the file name and it will be
positional as with the <CODE>read_file</CODE> subroutine. But how can we pass in
the optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments is the data list. So the <CODE>write_file()</CODE>
subroutine will start off like this:</P>
<PRE>
    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;</PRE>
<P>Whether or not optional arguments are passed in, we leave the data list
in <CODE>@_</CODE> to minimize any more copying. You call <CODE>write_file()</CODE> like this:</P>
<PRE>
    write_file( 'foo', { binmode =&gt; ':raw' }, @data ) ;
    write_file( 'junk', { append =&gt; 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;</PRE>
<P>
<H2><A NAME="fast slurping">Fast Slurping</A></H2>
<P>Somewhere along the line, I learned about a way to slurp files faster
than by setting <CODE>$/</CODE> to undef. The method is very simple: you do a single
read call with the size of the file (which the <CODE>-s</CODE> operator provides).
This bypasses the I/O loop inside perl that checks for EOF and does all
sorts of processing. I then decided to experiment and found that
sysread is even faster, as you would expect. sysread bypasses all of
Perl's stdio and reads the file from the kernel buffers directly into a
Perl scalar. This is why the slurp code in File::Slurp uses
sysopen/sysread/syswrite. All the rest of the code is just to support
the various options and data passing techniques.</P>
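<P>Here is a minimal sketch of that technique, with no error handling and
no loop to handle short reads (the robust version is the File::FastSlurp
code shown later in this article):</P>
<PRE>
    use Fcntl ;

    sysopen( my $fh, $file, O_RDONLY ) or die &quot;can't open $file: $!&quot; ;

    # one sysread call sized by -s bypasses perl's line I/O machinery
    my $text ;
    sysread( $fh, $text, -s $fh ) ;</PRE>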
<P>
<H2><A NAME="benchmarks">Benchmarks</A></H2>
<P>Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:</P>
<PRE>
    perl -Ilib extras/slurp_bench.pl</PRE>
<P>If you pass in an argument on the command line, it will be passed to
<CODE>timethese()</CODE> and it will control the duration. It defaults to -2 which
makes each benchmark run to at least 2 seconds of CPU time.</P>
<P>The following numbers are from a run I did on my 300MHz SPARC. You will
most likely get much faster counts on your boxes but the relative speeds
shouldn't change by much. If you see major differences on your
benchmarks, please send me the results and your Perl and OS
versions. Also you can play with the benchmark script and add more slurp
variations or data files.</P>
<P>The rest of this section will be discussing the results of the
benchmarks. You can refer to extras/slurp_bench.pl to see the code for
the individual benchmarks. If the benchmark name starts with cpan_, it
is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
from the new File::Slurp.pm. Those that start with file_contents_ are
from a client's code base. The rest are variations I created to
highlight certain aspects of the benchmarks.</P>
<P>The short and long file data is made like this:</P>
<PRE>
    my @lines = ( 'abc' x 30 . &quot;\n&quot; ) x 100 ;
    my $text = join( '', @lines ) ;</PRE>
<PRE>
    @lines = ( 'abc' x 40 . &quot;\n&quot; ) x 1000 ;
    $text = join( '', @lines ) ;</PRE>
<P>So the short file is 9,100 bytes and the long file is 121,000
bytes.</P>
<P>
<H3><A NAME="scalar slurp of short file">Scalar Slurp of Short File</A></H3>
<PRE>
    file_contents        651/s
    file_contents_no_OO  828/s
    cpan_read_file      1866/s
    cpan_slurp          1934/s
    read_file           2079/s
    new                 2270/s
    new_buf_ref         2403/s
    new_scalar_ref      2415/s
    sysread_file        2572/s</PRE>
<P>
<H3><A NAME="scalar slurp of long file">Scalar Slurp of Long File</A></H3>
<PRE>
    file_contents_no_OO  82.9/s
    file_contents        85.4/s
    cpan_read_file        250/s
    cpan_slurp            257/s
    read_file             323/s
    new                   468/s
    sysread_file          489/s
    new_scalar_ref        766/s
    new_buf_ref           767/s</PRE>
<P>The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall
except for the very simple sysread_file entry, which was added to
highlight the overhead of the more flexible new module (an overhead that
isn't that much). The file_contents entries are always the worst since
they do a list slurp and then a join, a classic newbie, cargo-culted
style which is extremely slow. Also the OO code in file_contents slows
it down even more (I added the file_contents_no_OO entry to show this).
The two CPAN modules are decent with small files but they are laggards
compared to the new module when the file gets much larger.</P>
<P>
<H3><A NAME="list slurp of short file">List Slurp of Short File</A></H3>
<PRE>
    cpan_read_file           589/s
    cpan_slurp_to_array      620/s
    read_file                824/s
    new_array_ref            824/s
    sysread_file             828/s
    new                      829/s
    new_in_anon_array        833/s
    cpan_slurp_to_array_ref  836/s</PRE>
<P>
<H3><A NAME="list slurp of long file">List Slurp of Long File</A></H3>
<PRE>
    cpan_read_file           62.4/s
    cpan_slurp_to_array      62.7/s
    read_file                92.9/s
    sysread_file             94.8/s
    new_array_ref            95.5/s
    new                      96.2/s
    cpan_slurp_to_array_ref  96.3/s
    new_in_anon_array        97.2/s</PRE>
<P>This is perhaps the most interesting result of this benchmark. Five
different entries have effectively tied for the lead. The logical
conclusion is that splitting the input into lines is the bounding
operation, no matter how the file gets slurped. This is the only
benchmark where the new module isn't the clear winner (in the long file
entries - it is no worse than a close second in the short file
entries).</P>
<P>Note: In the benchmark information for all the spew entries, the extra
number at the end of each line is how many wallclock seconds the whole
entry took. The benchmarks were run for at least 2 CPU seconds per
entry. The unusually large wallclock times will be discussed below.</P>
<P>
<H3><A NAME="scalar spew of short file">Scalar Spew of Short File</A></H3>
<PRE>
    cpan_write_file  1035/s  38
    print_file       1055/s  41
    syswrite_file    1135/s  44
    new              1519/s   2
    print_join_file  1766/s   2
    new_ref          1900/s   2
    syswrite_file2   2138/s   2</PRE>
<P>
<H3><A NAME="scalar spew of long file">Scalar Spew of Long File</A></H3>
<PRE>
    cpan_write_file  164/s  20
    print_file       211/s  26
    syswrite_file    236/s  25
    print_join_file  277/s   2
    new              295/s   2
    syswrite_file2   428/s   2
    new_ref          608/s   2</PRE>
<P>In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The <CODE>syswrite_file2</CODE> entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of print.</P>
<P>
<H3><A NAME="list spew of short file">List Spew of Short File</A></H3>
<PRE>
    cpan_write_file   794/s  29
    syswrite_file    1000/s  38
    print_file       1013/s  42
    new              1399/s   2
    print_join_file  1557/s   2</PRE>
<P>
<H3><A NAME="list spew of long file">List Spew of Long File</A></H3>
<PRE>
    cpan_write_file  112/s  12
    print_file       179/s  21
    syswrite_file    181/s  19
    print_join_file  205/s   2
    new              228/s   2</PRE>
<P>Again, the simple <CODE>print_join_file</CODE> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then it calls
<CODE>print</CODE> on the output list, and that is much slower than passing
<CODE>print</CODE> a single scalar generated by join. The <CODE>print_file</CODE> entry
shows the advantage of directly printing <CODE>@_</CODE> and the
<CODE>print_join_file</CODE> adds the join optimization.</P>
<P>Now about those long wallclock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an <CODE>overwrite</CODE> subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.</P>
<P>But what about those long times? Well it is all about the difference
between creating files and overwriting existing ones. The former have to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite will save on
disk seeks as well as on CPU time. In fact when running this benchmark,
I could hear my disk going crazy allocating inodes during the spew
operations. This speedup in both CPU and wallclock is why the new module
always does overwriting when spewing files. It also does the proper
truncate (and this is checked in the tests by spewing shorter files
after longer ones had previously been written). The <CODE>overwrite</CODE>
subroutine is just a typeglob alias to <CODE>write_file</CODE> and is there for
backwards compatibility with the old File::Slurp module.</P>
<P>
<H3><A NAME="benchmark conclusion">Benchmark Conclusion</A></H3>
<P>Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. Also it uses all the other optimizations including using
<CODE>sysread/syswrite</CODE> and joining output lines. I expect many projects
that extensively use slurping will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API
they will get a significant speedup.</P>
<P>
<H2><A NAME="error handling">Error Handling</A></H2>
<P>Slurp subroutines are subject to conditions such as not being able to
open the file, or I/O errors. How these errors are handled, and what the
caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call <CODE>die()</CODE> or even
better, <CODE>croak()</CODE>. But sometimes you want the slurp to either
<CODE>warn()</CODE>/<CODE>carp()</CODE> or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an <CODE>eval</CODE> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If set to 'carp' then carp
will be called. Set to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.</P>
<P><CODE>write_file()</CODE> doesn't use the return value for data so it can return a
false status value in-band to mark an error. <CODE>read_file()</CODE> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
return would work here. But if you slurp in lines by calling it in a
list context, a bare <CODE>return</CODE> will return an empty list, which is the
same value it would get from an existing but empty file. So now,
<CODE>read_file()</CODE> will do something I normally strongly advocate against,
i.e., returning an explicit <CODE>undef</CODE> value. In the scalar context this
still returns an error, and in list context, the returned first value
will be <CODE>undef</CODE>, and that is not legal data for the first element. So
the list context also gets an error status it can detect:</P>
<PRE>
    my @lines = read_file( $file_name, err_mode =&gt; 'quiet' ) ;
    your_handle_error( &quot;$file_name can't be read\n&quot; ) unless
        @lines &amp;&amp; defined $lines[0] ;</PRE>
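<P>The scalar context check is simpler, since a failed slurp returns
<CODE>undef</CODE>: just test for definedness:</P>
<PRE>
    my $text = read_file( $file_name, err_mode =&gt; 'quiet' ) ;
    your_handle_error( &quot;$file_name can't be read\n&quot; )
        unless defined $text ;</PRE>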
<P>
<H2><A NAME="file::fastslurp">File::FastSlurp</A></H2>
<P>Here is the core code of the module. The <CODE>use Carp</CODE> and
<CODE>use Fcntl</CODE> lines supply <CODE>carp()</CODE> and the <CODE>O_*</CODE> flags the
subs need (note that <CODE>O_BINARY</CODE> is only available from Fcntl on
platforms which define it):</P>
<PRE>
    use Carp ;
    use Fcntl qw( :DEFAULT ) ;

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

        my $mode = O_RDONLY ;
        $mode |= O_BINARY if $args{'binmode'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            carp &quot;Can't open $file_name: $!&quot; ;

        my $size_left = -s FH ;

        while( $size_left &gt; 0 ) {

            my $read_cnt = sysread( FH, ${$buf_ref},
                    $size_left, length ${$buf_ref} ) ;

            unless( $read_cnt ) {

                carp &quot;read error in file $file_name: $!&quot; ;
                last ;
            }

            $size_left -= $read_cnt ;
        }

        # handle void context (return scalar by buffer reference)

        return unless defined wantarray ;

        # handle list context

        return split m|(?&lt;=$/)|g, ${$buf_ref} if wantarray ;

        # handle scalar context

        return ${$buf_ref} ;
    }</PRE>
<PRE>
    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
        my $buf = join '', @_ ;

        # O_CREAT is needed so new files can be created
        my $mode = O_WRONLY | O_CREAT ;
        $mode |= O_BINARY if $args-&gt;{'binmode'} ;
        $mode |= O_APPEND if $args-&gt;{'append'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            carp &quot;Can't open $file_name: $!&quot; ;

        my $size_left = length( $buf ) ;
        my $offset = 0 ;

        while( $size_left &gt; 0 ) {

            my $write_cnt = syswrite( FH, $buf,
                    $size_left, $offset ) ;

            unless( $write_cnt ) {

                carp &quot;write error in file $file_name: $!&quot; ;
                last ;
            }

            $size_left -= $write_cnt ;
            $offset += $write_cnt ;
        }

        return ;
    }</PRE>
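<P>As a quick illustration of the pair in use (the file names and the
config line are just examples), a munge-and-rewrite cycle looks like
this:</P>
<PRE>
    # slurp, edit in memory, spew back out
    my $text = read_file( 'config.txt' ) ;
    $text =~ s/^port=\d+$/port=8080/m ;
    write_file( 'config.txt', $text ) ;

    # binary-safe round trip using the binmode option
    my $data = read_file( 'in.png', binmode =&gt; 1 ) ;
    write_file( 'out.png', { binmode =&gt; 1 }, $data ) ;</PRE>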
<P>
<H2><A NAME="slurping in perl 6">Slurping in Perl 6</A></H2>
<P>As usual with Perl 6, much of the work in this article will be put to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?</P>
<P>
<H2><A NAME="in summary">In Summary</A></H2>
<P>We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still be careful not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.</P>
<P>
<H2><A NAME="acknowledgements">Acknowledgements</A></H2>

</BODY>

</HTML>