=head1 Perl Slurp Ease

=head2 Introduction

One of the common Perl idioms is processing text files line by line:

    while( <FH> ) {
        do something with $_
    }

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and it is
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:

    while( <FH> ) {
        push @lines, $_ ;
    }

    foreach ( @lines ) {
        do something with $_
    }

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now, slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line processing if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. You can slurp STDIN if
you know that you can handle the maximum size input without
detrimentally affecting your memory usage. So I advocate slurping only
disk files and only when you know their size is reasonable and you have
a real reason to process the file as a whole. Note that a reasonable
size these days is far larger than in the bad old days of limited RAM.
Slurping in a megabyte is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini-)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.

Another major win for slurping over line by line is speed. Perl's IO
system (like many others) is slow. Calling C<< <> >> for each line
requires a check for the end of line, checks for EOF, copying a line,
munging the internal handle structure, etc. Plenty of work for each line
read in. Whereas slurping, if done correctly, will usually involve only
one I/O call and no extra data copying. The same is true for writing
files to disk, and we will cover that as well (even though the term
slurping is traditionally a read operation, I use the term ``slurp'' for
the concept of doing I/O with an entire file in one operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of C<//g>, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space) and performing complex transformations
such as template expansion.

=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the C</m> modifier), the '=' char and the text to the end of the
line (again, C</m> makes that work). In fact the ending C<$> is not even needed
since C<.> will not normally match a newline. Since the key and value are
grabbed and the C<m//> is in list context with the C</g> modifier, it will
grab all key/value pairs and return them. The C<%config> hash will be
assigned this list and now you have the file fully parsed into a hash.

Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

    $text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

    $text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;

Just supply a C<template> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
match and grab text over multiple lines due to the C</s> modifier. This is
something that is much trickier with line by line processing.

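A minimal sketch of what such a C<template> sub might look like
(hypothetical; the real expansion logic is whatever your application
needs, and it assumes the same C<%template> hash as the one liner above):

    sub template {
        my( $name, $body ) = @_ ;

        # expand any simple <%key%> tags inside this chunk
        $body =~ s/<%(\w+)%>/$template{$1}/g ;

        # a real version might also dispatch on the chunk name in $name
        return $body ;
    }
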
Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read and write files.

Slurping a file into an array also offers some useful advantages.
One simple example is reading in a flat database where each record has
fields separated by a character such as C<:>:

    my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;

Random access to any line of the slurped file is another advantage. A
line index could also be built to speed up searching the array of lines.

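For example, here is one way (a hypothetical sketch) to index the lines
of the passwd example above by their first field, so later lookups avoid
a linear search:

    my @lines = read_file( '/etc/passwd' ) ;

    # map each line's first field (the user name) to its line number
    my %line_index ;
    for my $i ( 0 .. $#lines ) {
        my( $key ) = split /:/, $lines[$i], 2 ;
        $line_index{$key} = $i ;
    }

    my $root_line = $lines[ $line_index{'root'} ] ;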

=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping a
file to a list of lines is trivial, just call the C<< <> >> operator
in a list context:

    my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built in
variable C<$/> (the input record separator) to the undefined value and read
in the file with C<< <> >>:

    {
        local( $/, *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        $text = <FH> ;
    }

Notice the use of C<local()>. It sets C<$/> to C<undef> for you and when
the scope exits it will revert C<$/> back to its previous value (most
likely "\n").

Here is a Perl idiom that allows the C<$text> variable to be declared,
and there is no need for a tightly nested block. The C<do> block will
execute C<< <FH> >> in a scalar context and slurp in the file named by
C<$file>:

    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:

    {
        local( $/ ) ;
        open( my $fh, $file ) or die "sudden flaming death\n" ;
        $text = <$fh> ;
    }

    open( my $fh, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

    my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in C<$file> is assigned to a localized C<@ARGV> and the
null filehandle is used which reads the data from the files in C<@ARGV>.

Instead of assigning to a scalar, all the above slurps can assign to an
array and they will get the file split into lines (using C<$/> as the
end of line marker).

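For example, a list flavor of the C<@ARGV> idiom shown above (a minimal
sketch) just leaves C<$/> alone so C<< <> >> returns lines:

    my @lines = do { local @ARGV = ( $file ) ; <> } ;
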
There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:

    my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (C<join> provides a
list context to C<< <FH> >>) and then joins up those lines again. The
original coder of this idiom obviously never read I<perlvar> to learn
how to use C<$/> to allow scalar slurping.

=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it ``slurping'' when we read in files, but
there is no commonly accepted term for the write operation. I asked some
Perl colleagues and got two interesting nominations. Peter Scott said to
call it ``burping'' (rhymes with ``slurping'' and suggests movement in
the opposite direction). Others suggested ``spewing'' which has a
stronger visual image. :-) Tell me your favorite or suggest your own. I
will use both in this section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp subroutine:

    sub burp {
        my( $file_name ) = shift ;
        open( my $fh, ">$file_name" ) ||
            die "can't create $file_name $!" ;
        print $fh @_ ;
    }

Note that it doesn't copy the input text but passes C<@_> directly to
C<print>. We will look at faster variations of that later on.

=head2 Slurp on the CPAN

As you would expect there are modules in the CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }

    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }

    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }

The subroutine C<slurp()> uses the magic undefined value of C<$/> and
the magic file handle C<ARGV> to support slurping into a scalar or
array. It also provides two wrapper subs that allow the caller to
control the context of the slurp. And the C<to_array()> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking C<wantarray>. It has 'slurp' in
C<@EXPORT> and all three subroutines in C<@EXPORT_OK>.

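Going by that code, typical use would look something like this (a
sketch; C<to_array> must be imported explicitly since only C<slurp> is
in C<@EXPORT>):

    use Slurp qw( slurp to_array ) ;

    my $text      = slurp( $file ) ;       # scalar context slurps the whole file
    my @lines     = slurp( $file ) ;       # list context slurps a list of lines
    my $lines_ref = to_array( $file ) ;    # scalar context returns an array ref
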
<Footnote: Slurp.pm is poorly named and it shouldn't be in the top level
namespace.>

The original File::Slurp.pm has this code:

    sub read_file
    {
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, "<$file") || croak "open $file: $!";
        @r = <F>;
        close(F) || croak "close $file: $!";

        return $r[0] unless wantarray;
        return @r;
    }

This module provides several subroutines including C<read_file()> (more
on the others later). C<read_file()> behaves similarly to
C<Slurp::slurp()> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of C<$/> for scalar slurping but it uses an explicit
open call rather than a localized C<@ARGV> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines but that can easily be rectified by calling it inside an anonymous
array constructor C<[]>.

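That workaround is a one liner:

    my $lines_ref = [ read_file( $file ) ] ;
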
Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic C<$/> to slurp in scalar mode and the
natural behavior of C<< <> >> in list context to slurp as lines. But
neither is optimized for speed nor can they handle C<binmode()> to
support binary or Unicode files. See below for more on slurp features
and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have a very simple API and don't support
C<binmode()>. This section will cover various API design issues such as
efficient return by reference, C<binmode()> and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines or as an anonymous array of lines. But the caller can only
provide two contexts: scalar or list. So we have to either provide an
API with more than one subroutine (as Slurp.pm did) or just provide one
subroutine which only returns a scalar or a list (not an anonymous
array) as File::Slurp does.

I have used my own C<read_file()> subroutine for years and it has the
same API as File::Slurp: a single subroutine that returns a scalar or a
list of lines depending on context. But I recognize the interest of
those that want an anonymous array for line slurping. For one thing, it
is easier to pass around to other subs and for another, it eliminates
the extra copying of the lines via C<return>. So my module provides only
one slurp subroutine that returns the file data based on context and any
format options passed in. There is no need for a specific
slurp-in-as-a-scalar or list subroutine as the general C<read_file()>
sub will do that by default in the appropriate context. If you want
C<read_file()> to return a scalar reference or anonymous array of lines,
you can request those formats with options. You can even pass in a
reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that). If you want to
slurp a scalar into an array, just select the desired array element and
that will provide scalar context to the C<read_file()> subroutine.

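Put together, the calling variations look like this (a sketch;
C<buf_ref> appears in the code later in this article, while
C<scalar_ref> and C<array_ref> are the option names File::Slurp uses
for the other formats):

    my $text      = read_file( $file ) ;                    # scalar context
    my @lines     = read_file( $file ) ;                    # list context
    my $text_ref  = read_file( $file, scalar_ref => 1 ) ;   # reference to a scalar
    my $lines_ref = read_file( $file, array_ref => 1 ) ;    # anonymous array of lines

    my $buf ;
    read_file( $file, buf_ref => \$buf ) ;                  # fill an existing buffer

    $lines[0] = read_file( $file ) ;                        # array element forces scalar context
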
The next area to cover is what to name the slurp sub. I will go with
C<read_file()>. It is descriptive, it keeps compatibility with the
current simple API, and it doesn't use the 'slurp' nickname (though that
nickname is in the module name). Also I decided to keep the File::Slurp
namespace which was graciously handed over to me by its current owner,
David Muir.

Another critical area when designing APIs is how to pass in
arguments. The C<read_file()> subroutine takes one required argument
which is the file name. To support C<binmode()> we need another optional
argument. A third optional argument is needed to support returning a
slurped scalar by reference. My first thought was to design the API with
3 positional arguments - file name, buffer reference and binmode. But if
you want to set the binmode and not pass in a buffer reference, you have
to fill the second argument with C<undef> and that is ugly. So I decided
to make the filename argument positional and the other two named. The
subroutine starts off like this:

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

The other sub (C<read_file_lines()>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if called
in a scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of C<read_file()>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus of keeping the API the same for both subs will be seen in
how the two subs are optimized to work together.

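So calls to C<read_file_lines()> would look like this (a hypothetical
sketch following the API just described):

    my @lines     = read_file_lines( $file ) ;
    my $lines_ref = read_file_lines( $file, binmode => ':raw' ) ;
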
Write slurping (or spewing or burping :-)) needs to have its API
designed as well. The biggest issue is that we need to support not only
optional arguments but also a list of data to be written. Perl
6 will be able to handle that with optional named arguments and a final
slurp argument. Since this is Perl 5 we have to do it using some
cleverness. The first argument is the file name and it will be
positional as with the C<read_file> subroutine. But how can we pass in
the optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments are the data list. So the C<write_file()>
subroutine will start off like this:

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in C<@_> to minimize any more copying. You call C<write_file()> like this:

    write_file( 'foo', { binmode => ':raw' }, @data ) ;
    write_file( 'junk', { append => 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;

=head2 Fast Slurping

Somewhere along the line, I learned about a way to slurp files faster
than by setting C<$/> to undef. The method is very simple, you do a single
read call with the size of the file (which the C<-s> operator provides).
This bypasses the I/O loop inside perl that checks for EOF and does all
sorts of processing. I then decided to experiment and found that
C<sysread> is even faster, as you would expect. C<sysread> bypasses all of
Perl's stdio and reads the file from the kernel buffers directly into a
Perl scalar. This is why the slurp code in File::Slurp uses
C<sysopen>/C<sysread>/C<syswrite>. All the rest of the code is just to
support the various options and data passing techniques.

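Stripped of all the option handling, the core of that technique is just
this (a minimal sketch; the full code later in this article loops on
C<sysread> to handle short reads):

    use Fcntl qw( O_RDONLY ) ;

    sysopen( my $fh, $file_name, O_RDONLY )
        or die "Can't open $file_name: $!" ;

    my $size = -s $fh ;
    sysread( $fh, my $buf, $size ) ;    # one syscall slurps the whole file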

=head2 Benchmarks

Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:

    perl -Ilib extras/slurp_bench.pl

If you pass in an argument on the command line, it will be passed to
C<timethese()> and it will control the duration. It defaults to -2 which
makes each benchmark run for at least 2 seconds of CPU time.

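In other words, the script's timing loop is conceptually just the
standard C<timethese()> from the core Benchmark module (a sketch; the
entry names here are illustrative):

    use Benchmark qw( timethese ) ;
    use File::Slurp qw( read_file ) ;

    # a negative count means "run for at least that many CPU seconds"
    my $duration = shift @ARGV || -2 ;

    timethese( $duration, {
        new_scalar => sub { my $text  = read_file( $file ) },
        new_list   => sub { my @lines = read_file( $file ) },
    } ) ;
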
The following numbers are from a run I did on my 300MHz SPARC. You will
most likely get much faster counts on your boxes but the relative speeds
shouldn't change by much. If you see major differences on your
benchmarks, please send me the results and your Perl and OS
versions. Also you can play with the benchmark script and add more slurp
variations or data files.

The rest of this section will be discussing the results of the
benchmarks. You can refer to extras/slurp_bench.pl to see the code for
the individual benchmarks. If the benchmark name starts with cpan_, it
is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
from the new File::Slurp.pm. Those that start with file_contents_ are
from a client's code base. The rest are variations I created to
highlight certain aspects of the benchmarks.

The short and long file data is made like this:

    my @lines = ( 'abc' x 30 . "\n") x 100 ;
    my $text = join( '', @lines ) ;

    @lines = ( 'abc' x 40 . "\n") x 1000 ;
    $text = join( '', @lines ) ;

So the short file is 9,100 bytes and the long file is 121,000
bytes.

=head3 Scalar Slurp of Short File

    file_contents           651/s
    file_contents_no_OO     828/s
    cpan_read_file         1866/s
    cpan_slurp             1934/s
    read_file              2079/s
    new                    2270/s
    new_buf_ref            2403/s
    new_scalar_ref         2415/s
    sysread_file           2572/s

=head3 Scalar Slurp of Long File

    file_contents_no_OO    82.9/s
    file_contents          85.4/s
    cpan_read_file          250/s
    cpan_slurp              257/s
    read_file               323/s
    new                     468/s
    sysread_file            489/s
    new_scalar_ref          766/s
    new_buf_ref             767/s

The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall
except for the very simple sysread_file entry which was added to
highlight the overhead of the more flexible new module, which isn't that
much. The file_contents entries are always the worst since they do a
list slurp and then a join, which is a classic newbie and cargo cult
style that is extremely slow. Also the OO code in file_contents slows
it down even more (I added the file_contents_no_OO entry to show this).
The two CPAN modules are decent with small files but they are laggards
compared to the new module when the file gets much larger.

=head3 List Slurp of Short File

    cpan_read_file             589/s
    cpan_slurp_to_array        620/s
    read_file                  824/s
    new_array_ref              824/s
    sysread_file               828/s
    new                        829/s
    new_in_anon_array          833/s
    cpan_slurp_to_array_ref    836/s

=head3 List Slurp of Long File

    cpan_read_file             62.4/s
    cpan_slurp_to_array        62.7/s
    read_file                  92.9/s
    sysread_file               94.8/s
    new_array_ref              95.5/s
    new                        96.2/s
    cpan_slurp_to_array_ref    96.3/s
    new_in_anon_array          97.2/s

This is perhaps the most interesting result of this benchmark. Five
different entries have effectively tied for the lead. The logical
conclusion is that splitting the input into lines is the bounding
operation, no matter how the file gets slurped. This is the only
benchmark where the new module isn't the clear winner (in the long file
entries - it is no worse than a close second in the short file
entries).

Note: In the benchmark information for all the spew entries, the extra
number at the end of each line is how many wallclock seconds the whole
entry took. The benchmarks were run for at least 2 CPU seconds per
entry. The unusually large wallclock times will be discussed below.

=head3 Scalar Spew of Short File

    cpan_write_file    1035/s    38
    print_file         1055/s    41
    syswrite_file      1135/s    44
    new                1519/s     2
    print_join_file    1766/s     2
    new_ref            1900/s     2
    syswrite_file2     2138/s     2

=head3 Scalar Spew of Long File

    cpan_write_file    164/s    20
    print_file         211/s    26
    syswrite_file      236/s    25
    print_join_file    277/s     2
    new                295/s     2
    syswrite_file2     428/s     2
    new_ref            608/s     2

In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The C<syswrite_file2> entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of C<print>.

=head3 List Spew of Short File

    cpan_write_file     794/s    29
    syswrite_file      1000/s    38
    print_file         1013/s    42
    new                1399/s     2
    print_join_file    1557/s     2

=head3 List Spew of Long File

    cpan_write_file    112/s    12
    print_file         179/s    21
    syswrite_file      181/s    19
    print_join_file    205/s     2
    new                228/s     2

Again, the simple C<print_join_file> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then it calls
C<print> on the output list, and that is much slower than passing
C<print> a single scalar generated by join. The C<print_file> entry
shows the advantage of directly printing C<@_> and the
C<print_join_file> entry adds the join optimization.

Now about those long wallclock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an C<overwrite> subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.

But what about those long times? Well it is all about the difference
between creating files and overwriting existing ones. The former have to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite will save
on disk seeks as well as on CPU time. In fact when running this benchmark,
I could hear my disk going crazy allocating inodes during the spew
operations. This speedup in both CPU and wallclock is why the new module
always does overwriting when spewing files. It also does the proper
truncate (and this is checked in the tests by spewing shorter files
after longer ones had previously been written). The C<overwrite>
subroutine is just a typeglob alias to C<write_file> and is there for
backwards compatibility with the old File::Slurp module.

=head3 Benchmark Conclusion

Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. Also it uses all the other optimizations including using
C<sysread/syswrite> and joining output lines. I expect many projects
that extensively use slurping will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API
they will get a significant speedup.

=head2 Error Handling

Slurp subroutines are subject to error conditions such as not being able
to open the file, or I/O errors. How these errors are handled, and what
the caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call C<die()> or even
better, C<croak()>. But sometimes you want the slurp to either
C<warn()>/C<carp()> or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an C<eval> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If set to 'carp' then carp
will be called. Set to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.

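The three choices look like this (a sketch using the C<err_mode> option
just described):

    my $text = read_file( $file_name ) ;                     # croak on error (default)
    $text = read_file( $file_name, err_mode => 'carp' ) ;    # warn and keep going
    $text = read_file( $file_name, err_mode => 'quiet' ) ;   # caller checks the result
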
C<write_file()> doesn't use the return value for data so it can return a
false status value in-band to mark an error. C<read_file()> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
return would work here. But if you slurp in lines by calling it in a
list context, a bare C<return> will return an empty list, which is the
same value it would get from an existing but empty file. So now,
C<read_file()> will do something I normally strongly advocate against,
i.e., returning an explicit C<undef> value. In the scalar context this
still returns an error, and in list context, the returned first value
will be C<undef>, and that is not legal data for the first element. So
the list context also gets an error status it can detect:

    my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
    your_handle_error( "$file_name can't be read\n" ) unless
        @lines && defined $lines[0] ;

=head2 File::FastSlurp

    use Carp ;
    use Fcntl qw( :DEFAULT ) ;    # O_BINARY is only available on some systems

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

        my $mode = O_RDONLY ;
        $mode |= O_BINARY if $args{'binmode'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            do { carp "Can't open $file_name: $!" ; return } ;

        my $size_left = -s FH ;

        while( $size_left > 0 ) {

            my $read_cnt = sysread( FH, ${$buf_ref},
                $size_left, length ${$buf_ref} ) ;

            unless( $read_cnt ) {

                carp "read error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $read_cnt ;
        }

        # handle void context (return scalar by buffer reference)

        return unless defined wantarray ;

        # handle list context (split after each record separator)

        return split m|(?<=$/)|, ${$buf_ref} if wantarray ;

        # handle scalar context

        return ${$buf_ref} ;
    }

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
        my $buf = join '', @_ ;

        my $mode = O_WRONLY | O_CREAT ;
        $mode |= O_BINARY if $args->{'binmode'} ;
        $mode |= O_APPEND if $args->{'append'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            do { carp "Can't open $file_name: $!" ; return } ;

        my $size_left = length( $buf ) ;
        my $offset = 0 ;

        while( $size_left > 0 ) {

            my $write_cnt = syswrite( FH, $buf,
                $size_left, $offset ) ;

            unless( $write_cnt ) {

                carp "write error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $write_cnt ;
            $offset += $write_cnt ;
        }

        # truncate any leftover data from a longer previous version
        truncate( FH, $offset ) unless $args->{'append'} ;

        return ;
    }

=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still be careful not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.

=head2 Acknowledgements