=head1 Perl Slurp Ease

=head2 Introduction

One of the common Perl idioms is processing text files line by line:

    while( <FH> ) {
        do something with $_
    }

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and it is
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:

    while( <FH> ) {
        push @lines, $_ ;
    }

    foreach ( @lines ) {
        do something with $_
    }

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and it
is not something you should do when line by line processing works
fine. It is best when you need the entire file in memory for processing
all at once. Slurping with in-memory processing can be faster and lead
to simpler code than line by line processing, if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. I advocate slurping
only disk files, and only when you know their size is reasonable and you
have a real reason to process the file as a whole. Note that a
reasonable size these days is larger than in the bad old days of limited
RAM. Slurping in a megabyte-sized file is not an issue on most
systems. But most of the files I tend to slurp in are much smaller than
that. Typical files that work well with slurping are configuration
files, (mini)language scripts, some data (especially binary) files, and
other files of known sizes which need fast processing.

Another major win for slurping over line by line processing is
speed. Perl's IO system (like many others) is slow. Calling <> for each
line requires a check for the end of line, checks for EOF, copying a
line, munging the internal handle structure, etc. That is plenty of work
for each line read in. Slurping, if done correctly, will usually involve
only one IO call and no extra data copying. The same is true for writing
files to disk, and we will cover that as well (even though the term
slurping traditionally refers to a read operation, I use the term slurp
for the concept of doing IO on an entire file in one operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of m//g, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space), and performing complex transformations
such as template expansion.

=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing, but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this, but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the /m modifier), the '=' char and the text to the end of the
line (again /m makes that work). In fact the ending $ is not even needed
since . will not normally match a newline. Since the key and value are
grabbed and the m// is in list context with the /g modifier, it will
grab all key/value pairs and return them. The %config hash will be
assigned this list, and now you have the file fully parsed into a hash.

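As a quick sanity check, here is that same parse run on an in-memory
string, so no file or read_file call is needed (the key names are made
up for illustration):

```perl
use strict ;
use warnings ;

# same regex as above, applied to an inline string instead of a slurped file
my $text = "host=localhost\nport=8080\nuser=uri\n" ;
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$config{'port'}\n" ;    # prints 8080
```
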
Various projects I have worked on needed some simple templating, and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

    $text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

    $text =~ s/<%(\w+)_START%>(.+)<%\1_END%>/ template( $1, $2 )/sge ;

Just supply a template sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the /s modifier. This is
something that is much trickier with line by line processing.

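To make that concrete, here is a minimal self-contained sketch of such a
template sub; the %template contents and the tag name are invented for
illustration:

```perl
use strict ;
use warnings ;

my %template = ( name => 'Perl' ) ;

# expand the <%KEY%> markers inside one chunk of template text;
# the first argument (the chunk name) is unused in this tiny sketch
sub template {
    my( $marker, $chunk ) = @_ ;
    $chunk =~ s/<%(\w+)%>/$template{$1}/g ;
    return $chunk ;
}

my $text = "<%row_START%>hello <%name%>\n<%row_END%>" ;
$text =~ s/<%(\w+)_START%>(.+)<%\1_END%>/ template( $1, $2 )/sge ;

print $text ;    # prints "hello Perl"
```
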
Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on CPAN, you will gain by having
speedier ways to read/write files.

Slurping a file into an array also offers some useful advantages.

=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping a
file into a list of lines is trivial: just call the <> operator in
list context:

    my @lines = <FH> ;

and slurping into a scalar isn't much more work. Just set the built-in
variable $/ (the input record separator) to the undefined value and read
in the file with <>:

    {
        local( $/, *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        $text = <FH> ;
    }

Notice the use of local(). It sets $/ to undef for you, and when the
scope exits it will revert $/ back to its previous value (most likely
"\n"). Here is a Perl idiom that allows the $text variable to be
declared, with no need for a tightly nested block. The do block
will execute <FH> in a scalar context and slurp in the file, which is
then assigned to $text.

    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:

    {
        local( $/ ) ;
        open( my $fh, $file ) or die "sudden flaming death\n" ;
        $text = <$fh> ;
    }

    open( my $fh, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

    my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in $file is assigned to a localized @ARGV, and the null
filehandle <> is used, which reads the data from the files in
@ARGV. Since the list assignment runs out of values after filling @ARGV,
the localized $/ is left undefined, which enables the scalar slurp.

Instead of assigning to a scalar, all the above slurps can assign to an
array, and it will get the file split into lines (using $/ as the end
of line marker).

There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:

    my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (join provides a list
context to <FH>) and then joins up those lines again. The original coder
of this idiom obviously never read perlvar to learn how to use $/ to
enable scalar slurping.

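Both forms produce the same string; only the amount of work
differs. Here is a quick self-contained check (the temp file path is
made up for this demo):

```perl
use strict ;
use warnings ;

# write a small demo file (hypothetical temp path, for illustration only)
my $file = "/tmp/slurp_demo.$$" ;
open( my $out, ">$file" ) or die "can't create $file: $!" ;
print $out "line 1\nline 2\nline 3\n" ;
close $out ;

# the slow cargo cult way: split into lines, then join them again
open( my $fh1, $file ) or die "can't read $file: $!" ;
my $joined = join( '', <$fh1> ) ;

# the proper way: undef $/ and read the whole file in one IO call
open( my $fh2, $file ) or die "can't read $file: $!" ;
my $slurped = do { local( $/ ) ; <$fh2> } ;

print $joined eq $slurped ? "same\n" : "different\n" ;    # prints "same"
unlink $file ;
```
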
=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it slurping when we read in files, but there
is no commonly accepted term for the write operation. I asked some Perl
colleagues and got two interesting nominations. Peter Scott said to call
it burping (it rhymes with slurping, and the noise goes in the opposite
direction). Others suggested spewing, which has a stronger visual image
:-) Tell me your favorite or suggest your own. I will use both in this
section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about, and there is no efficiency problem with
returning a buffer. Here is a simple burp sub:

    sub burp {
        my $file_name = shift ;
        open( my $fh, ">$file_name" ) ||
            die "can't create $file_name $!" ;
        print $fh @_ ;
    }

Note that it doesn't copy the input text but passes @_ directly to
print. We will look at faster variations of that later on.

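Here is a quick self-contained roundtrip using that burp sub (the temp
file path is invented for the demo):

```perl
use strict ;
use warnings ;

# same burp sub as above
sub burp {
    my $file_name = shift ;
    open( my $fh, ">$file_name" ) ||
        die "can't create $file_name $!" ;
    print $fh @_ ;
}

# hypothetical temp path, for illustration only
my $file = "/tmp/burp_demo.$$" ;
burp( $file, "line 1\n", "line 2\n" ) ;

# slurp it back to verify the write (the lexical handle in burp
# was closed and flushed when the sub exited)
open( my $fh, $file ) or die "can't read $file: $!" ;
my $text = do { local( $/ ) ; <$fh> } ;
print $text ;    # prints the two lines back
unlink $file ;
```
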
=head2 Slurp on the CPAN

As you would expect, there are modules on CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }

    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }

    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }

The sub slurp uses the magic undefined value of $/ and the magic file
handle ARGV to support slurping into a scalar or array. It also provides
two wrapper subs that allow the caller to control the context of the
slurp. The to_array sub will return the list of slurped lines or an
anonymous array of them according to its caller's context, by checking
wantarray. It has 'slurp' in @EXPORT and all three subs in @EXPORT_OK.
A final point is that Slurp.pm is poorly named, and it shouldn't be in
the top level namespace.

File::Slurp.pm has this code:

    sub read_file
    {
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, "<$file") || croak "open $file: $!";
        @r = <F>;
        close(F) || croak "close $file: $!";

        return $r[0] unless wantarray;
        return @r;
    }

This module provides several subs, including read_file (more on the
others later). read_file behaves similarly to Slurp::slurp in that it
will slurp a list of lines or a single scalar depending on the caller's
context. It also uses the magic undefined value of $/ for scalar
slurping, but it uses an explicit open call rather than a localized
@ARGV as the other module did. Also it doesn't provide a way to get an
anonymous array of the lines, but that can easily be rectified by
calling it inside an anonymous array constructor [].

Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic $/ to slurp in scalar mode and the
natural behavior of <> in list context to slurp as lines. But neither is
optimized for speed, nor can they handle binmode to support binary or
Unicode files. See below for more on slurp features and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have very simple APIs and don't support
binmode. This section will cover various API design issues such as
efficient return by reference, binmode and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines, or as an anonymous array of lines. But the caller can only
provide two contexts, scalar or list. So we have to either provide an
API with more than one sub, as Slurp.pm did, or provide just one sub
which only returns a scalar or a list (no anonymous array), as
File::Slurp.pm does.

I have used my own read_file sub for years, and it has the same API as
File::Slurp.pm: a single sub which returns a scalar or a list of lines
depending on context. But I recognize the interest of those that want an
anonymous array for line slurping. For one thing it is easier to pass
around to other subs, and for another it eliminates the extra copying of
the lines via return. So my module will support multiple subs, with one
that returns the file based on context and another that returns only
lines (either as a list or as an anonymous array). So this API is in
between the two CPAN modules. There is no need for a specific
slurp-into-a-scalar sub, as the general slurp will do that in scalar
context. If you want to slurp a file as a scalar into an array element,
just assign to the desired array element; that will provide scalar
context to the read_file sub.

The next area to cover is what to name these subs. I will go with
read_file and read_file_lines. They are descriptive, simple and don't
use the 'slurp' nickname (though that nick is in the module name).

Another critical area when designing APIs is how to pass in
arguments. The read_file sub takes one required argument, which is the
file name. To support binmode we need another, optional, argument. And a
third optional argument is needed to support returning a slurped scalar
by reference. My first thought was to design the API with 3 positional
arguments - file name, buffer reference and binmode. But if you want to
set the binmode and not pass in a buffer reference, you have to fill the
second argument with undef, and that is ugly. So I decided to make the
filename argument positional and pass the other two by name.
The sub will start off like this:

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

The binmode argument will be handled later (see code below).

The other sub, read_file_lines, will only take an optional binmode (so
you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if it is
called in a scalar context. So this sub could use positional arguments,
but to keep its API similar to the API of read_file, it will also use
pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus of keeping the API the same for both subs will be seen in
how the two subs are optimized to work together.

Write slurping (or spewing or burping :-) needs to have its API designed
as well. The biggest issue is that we need to support not only optional
arguments but also a list of data arguments to be written. Perl 6 can
handle that with optional named arguments and a final slurpy
argument. Since this is Perl 5, we have to do it with some
cleverness. The first argument is the file name, and it will be
positional as with the read_file sub. But how can we pass in the
optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it holds the optional arguments, and the
rest of the arguments are the data list. So the write_file sub will
start off like this:

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in @_ to avoid any more copying. You call write_file like this:

    write_file( 'foo', { binmode => ':raw' }, @data ) ;
    write_file( 'junk', { append => 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;

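The hash-ref detection trick can be exercised on its own; here is a tiny
self-contained sketch (the sub name args_demo is invented for
illustration):

```perl
use strict ;
use warnings ;

# split an optional leading hash ref of options from the trailing data list
sub args_demo {
    my $file_name = shift ;
    my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
    return ( $file_name, $args, scalar @_ ) ;
}

my( $name, $opts, $data_count ) =
    args_demo( 'junk', { append => 1 }, 'line 1', 'line 2' ) ;
print "$name $opts->{'append'} $data_count\n" ;    # prints "junk 1 2"

# with no options hash, all trailing arguments are data
my( $name2, $no_opts, $count2 ) = args_demo( 'bar', 'a', 'b', 'c' ) ;
print "$count2\n" ;    # prints 3
```
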
=head2 Fast Slurping


=head2 Benchmarks


=head2 Error Handling

Slurp subs are subject to error conditions such as not being able to
open the file or I/O errors. How these errors are handled, and what the
caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call die, or even
better, croak. But sometimes you want the slurp to warn/carp instead
and allow your code to handle the error. Sure, this can be done by
wrapping the slurp in an eval block to catch a fatal error, but not
everyone wants all that extra code. So I have added another option to
all the subs which selects the error handling. If the 'err_mode' option
is 'croak' (which is also the default), the called sub will croak. If it
is set to 'carp' then carp will be called. Set it to any other string
(use 'quiet' by convention) and no error handler call is made. Then the
caller can use the error status from the call.

C<write_file> doesn't use the return value for data, so it can return a
false status value in-band to mark an error. C<read_file> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a (scalar or array) reference. So a bare return
would work here. But if you slurp in lines by calling it in a list
context, a bare return will return an empty list, which is the same
value it would get from an existing but empty file. So now C<read_file>
will do something I strongly advocate against, which is returning a call
to undef. In scalar context this still returns an error, and now in list
context the returned first value will be undef, and that is not legal
data for the first element. So the list context also gets an error
status it can detect:

    my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
    your_handle_error( "$file_name can't be read\n" ) unless
        @lines && defined $lines[0] ;

=head2 File::FastSlurp

    use Carp ;
    use Fcntl qw( :DEFAULT ) ;

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

        my $mode = O_RDONLY ;
        $mode |= O_BINARY if $args{'binmode'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or do {
            carp "Can't open $file_name: $!" ;
            return ;
        } ;

        my $size_left = -s FH ;

        while( $size_left > 0 ) {

            my $read_cnt = sysread( FH, ${$buf_ref},
                    $size_left, length ${$buf_ref} ) ;

            unless( $read_cnt ) {

                carp "read error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $read_cnt ;
        }

        # handle void context (return scalar by buffer reference)

        return unless defined wantarray ;

        # handle list context

        return split m|(?<=$/)|, ${$buf_ref} if wantarray ;

        # handle scalar context

        return ${$buf_ref} ;
    }


    sub read_file_lines {

        # handle list context

        return &read_file if wantarray ;

        # otherwise handle scalar context

        return [ &read_file ] ;
    }

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
        my $buf = join '', @_ ;

        my $mode = O_WRONLY | O_CREAT ;
        $mode |= O_BINARY if $args->{'binmode'} ;
        $mode |= O_APPEND if $args->{'append'} ;
        $mode |= O_TRUNC unless $args->{'append'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or do {
            carp "Can't open $file_name: $!" ;
            return ;
        } ;

        my $size_left = length( $buf ) ;
        my $offset = 0 ;

        while( $size_left > 0 ) {

            my $write_cnt = syswrite( FH, $buf,
                    $size_left, $offset ) ;

            unless( $write_cnt ) {

                carp "write error in file $file_name: $!" ;
                return ;
            }

            $size_left -= $write_cnt ;
            $offset += $write_cnt ;
        }

        return 1 ;
    }

=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put out
to pasture. Perl 6 will allow you to set a 'slurp' property on file
handles, and when you read from such a handle, the file is
slurped. List and scalar context will still be supported, so you can
slurp into lines or a scalar. I would expect that support for slurping
in Perl 6 will be optimized and bypass the stdio subsystem, since it can
use the slurp property to trigger a call to special code. Otherwise some
enterprising individual will just create a File::FastSlurp module for
Perl 6. The code in the Perl 5 module could easily be modified to Perl 6
syntax and semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code, if done properly. You must still be careful not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN, where you don't
know how much data you will read in. But slurping megabyte-sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the small RAM sizes of 10 years ago. Now you should be
able to slurp almost any reasonably sized file, be it a configuration
file, source code, data, etc.