=head1 Perl Slurp Ease

=head2 Introduction


One of the common Perl idioms is processing text files line by line

    while( <FH> ) {
        do something with $_
    }

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and it is
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:

    while( <FH> ) {
        push @lines, $_ ;
    }

    foreach ( @lines ) {
        do something with $_
    }

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. I advocate slurping
only disk files, and only when you know their size is reasonable and you
have a real reason to process the file as a whole. Note that a reasonable
size these days is larger than in the bad old days of limited RAM. Slurping
in a megabyte-sized file is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.

Another major win for slurping over line by line is speed. Perl's IO
system (like many others) is slow. Calling <> for each line requires a
check for the end of line, checks for EOF, copying a line, munging the
internal handle structure, etc. That is plenty of work for each line read
in. Whereas slurping, if done correctly, will usually involve only one
IO call and no extra data copying. The same is true for writing files to
disk, and we will cover that as well (even though the term slurping is
traditionally a read operation, I use the term slurp for the concept of
doing IO with an entire file in one operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of m//g, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space) and performing complex transformations
such as template expansion.

=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing, but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this, but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the /m modifier), the '=' char and the text to the end of the
line (again /m makes that work). In fact the ending $ is not even needed
since . will not normally match a newline. Since the key and value are
grabbed and the m// is in list context with the /g modifier, it will
grab all key/value pairs and return them. The %config hash will be
assigned this list and now you have the file fully parsed into a hash.

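Here is a self-contained sketch of that parse; the config text is
inlined instead of coming from read_file, purely for illustration:

```perl
# hypothetical config text; normally this would come from read_file()
my $text = "host=example.com\nport=8080\nuser=uri\n" ;

# one list-context match with /mg grabs every key/value pair at once
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$config{'host'}\n" ;   # example.com
print "$config{'port'}\n" ;   # 8080
```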
Various projects I have worked on needed some simple templating, and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

    $text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

    $text =~ s/<%(\w+)_START%>(.+)<%\1_END%>/ template( $1, $2 )/sge ;

Just supply a template sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the /s modifier. This is
something that is much trickier with line by line processing.

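Here is a minimal sketch of such a template sub; the marker name, the
%template hash contents and the sample text are made up for
illustration:

```perl
my %template = ( name => 'world' ) ;

# expand simple <%key%> markers inside one chunk of text
sub template {
    my( $chunk_name, $body ) = @_ ;
    $body =~ s/<%(\w+)%>/$template{$1}/g ;
    return $body ;
}

my $text = "<%GREETING_START%>Hello, <%name%>!<%GREETING_END%>" ;
$text =~ s/<%(\w+)_START%>(.+)<%\1_END%>/ template( $1, $2 )/sge ;

print "$text\n" ;   # Hello, world!
```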
Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read/write files.

Slurping a file into an array also offers some useful advantages.


=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping of
a file to a list of lines is trivial: just call the <> operator in a
list context:

    my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built-in
variable $/ (the input record separator) to the undefined value and read
in the file with <>:

    {
        local( $/, *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        $text = <FH> ;
    }

Notice the use of local(). It sets $/ to undef for you, and when the
scope exits it will revert $/ back to its previous value (most likely
"\n"). Here is a Perl idiom that allows the $text variable to be
declared, and there is no need for a tightly nested block. The do block
will execute the <FH> in a scalar context and slurp in the file, which is
assigned to $text.

    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:

    {
        local( $/ ) ;
        open( my $fh, $file ) or die "sudden flaming death\n" ;
        $text = <$fh> ;
    }

    open( my $fh, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

    my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in $file is assigned to a localized @ARGV and the null
filehandle is used, which reads the data from the files in @ARGV.

Instead of assigning to a scalar, all the above slurps can assign to an
array and it will get the file but split into lines (using $/ as the end
of line marker).

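For instance, a list-context slurp hands back one element per line,
newlines included. This sketch uses an in-memory filehandle (Perl 5.8+)
in place of a real disk file, just to stay self-contained:

```perl
my $data = "one\ntwo\nthree\n" ;
open( my $fh, '<', \$data ) or die "can't open in-memory file: $!" ;

my @lines = <$fh> ;             # list context: split on $/, newlines kept

print scalar( @lines ), "\n" ;  # 3
print $lines[1] ;               # two
```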
There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:

    my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (join provides a list
context to <FH>) and then joins up those lines again. The original coder
of this idiom obviously never read perlvar and learned how to use $/ to
allow scalar slurping.

=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it slurping when we read in files, but there
is no commonly accepted term for the write operation. I asked some Perl
colleagues and got two interesting nominations. Peter Scott said to call
it burping (rhymes with slurping and the noise is in the opposite
direction). Others suggested spewing, which has a stronger visual image
:-). Tell me your favorite or suggest your own. I will use both in this
section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp sub:

    sub burp {
        my $file_name = shift ;
        open( my $fh, ">$file_name" ) ||
            die "can't create $file_name $!" ;
        print $fh @_ ;
    }

Note that it doesn't copy the input text but passes @_ directly to
print. We will look at faster variations of that later on.

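A round-trip usage sketch of that burp sub (the sub is repeated so the
sketch stands alone; the temp file and contents are arbitrary):

```perl
use File::Temp qw( tempfile ) ;

sub burp {
    my $file_name = shift ;
    open( my $fh, ">$file_name" ) ||
        die "can't create $file_name $!" ;
    print $fh @_ ;
}

# burp two lines out, then slurp them back to verify the round trip
my( undef, $tmp_file ) = tempfile( UNLINK => 1 ) ;
burp( $tmp_file, "line 1\n", "line 2\n" ) ;

my $text = do { local( @ARGV, $/ ) = $tmp_file ; <> } ;
print $text ;   # prints both lines back
```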
=head2 Slurp on the CPAN

As you would expect, there are modules on CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }

    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }

    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }

The sub slurp uses the magic undefined value of $/ and the magic file
handle ARGV to support slurping into a scalar or array. It also provides
two wrapper subs that allow the caller to control the context of the
slurp. And the to_array sub will return the list of slurped lines or an
anonymous array of them according to its caller's context by checking
wantarray. It has 'slurp' in @EXPORT and all three subs in @EXPORT_OK.
A final point is that Slurp.pm is poorly named, and it shouldn't be in
the top level namespace.

File::Slurp.pm has this code:

    sub read_file
    {
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, "<$file") || croak "open $file: $!";
        @r = <F>;
        close(F) || croak "close $file: $!";

        return $r[0] unless wantarray;
        return @r;
    }

This module provides several subs including read_file (more on the
others later). read_file behaves similarly to Slurp::slurp in that it
will slurp a list of lines or a single scalar depending on the caller's
context. It also uses the magic undefined value of $/ for scalar
slurping, but it uses an explicit open call rather than a localized
@ARGV as the other module did. Also it doesn't provide a way to get an
anonymous array of the lines, but that can easily be rectified by calling
it inside an anonymous array constructor [].

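A sketch of that rectification. The read_file below is a minimal
stand-in with File::Slurp's context behavior (not the real module), and
the in-memory filehandle just keeps the example self-contained:

```perl
# minimal stand-in for File::Slurp::read_file's context behavior
sub read_file {
    my( $file ) = @_ ;
    local $/ = wantarray ? $/ : undef ;
    open( my $fh, '<', $file ) or die "open: $!" ;
    return <$fh> ;
}

my $data = "a\nb\nc\n" ;

# wrap the list-context call in [] to get an anonymous array of lines
my $lines_ref = [ read_file( \$data ) ] ;

print scalar( @{$lines_ref} ), "\n" ;   # 3
```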
Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic $/ to slurp in scalar mode and the
natural behavior of <> in list context to slurp as lines. But neither is
optimized for speed, nor can they handle binmode to support binary or
Unicode files. See below for more on slurp features and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have a very simple API and don't support
binmode. This section will cover various API design issues such as
efficient return by reference, binmode and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines, and as an anonymous array of lines. But the caller can only
provide two contexts, scalar or list. So we have to either provide an
API with more than one sub as Slurp.pm did, or just provide one sub which
only returns a scalar or a list (no anonymous array) as File::Slurp.pm
does.

I have used my own read_file sub for years, and it has the same API as
File::Slurp.pm: a single sub which returns a scalar or a list of lines
depending on context. But I recognize the interest of those who want an
anonymous array for line slurping. For one thing, it is easier to pass
around to other subs, and for another, it eliminates the extra copying
of the lines via return. So my module will support multiple subs, with
one that returns the file based on context and another that returns only
lines (either as a list or as an anonymous array). So this API is in
between the two CPAN modules. There is no need for a specific
slurp-into-a-scalar sub, as the general slurp will do that in scalar
context. If you want to slurp a scalar into an array, just select the
desired array element and that will provide scalar context to the
read_file sub.

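To illustrate that last point: assigning to a single array element is
scalar context, so the whole file lands in one slot as a single string.
Again, read_file here is a simplified stand-in sketch, not the final
module code:

```perl
# simplified context-sensitive read_file stand-in
sub read_file {
    my( $file ) = @_ ;
    local $/ = wantarray ? $/ : undef ;
    open( my $fh, '<', $file ) or die "open: $!" ;
    return <$fh> ;
}

my $data = "a\nb\n" ;
my @slots ;

# the lvalue $slots[0] supplies scalar context to read_file
$slots[0] = read_file( \$data ) ;

print scalar( @slots ), "\n" ;        # 1 element, not 2 lines
print $slots[0] =~ tr/\n//, "\n" ;    # 2 newlines inside that one string
```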
The next area to cover is what to name these subs. I will go with
read_file and read_file_lines. They are descriptive, simple and don't
use the 'slurp' nickname (though that nick is in the module name).

Another critical area when designing APIs is how to pass in
arguments. The read_file sub takes one required argument, which is the
file name. To support binmode we need another optional argument. And a
third optional argument is needed to support returning a slurped scalar
by reference. My first thought was to design the API with 3 positional
arguments - file name, buffer reference and binmode. But if you want to
set the binmode and not pass in a buffer reference, you have to fill the
second argument with undef and that is ugly. So I decided to make the
filename argument positional and the other two passed by name.
The sub will start off like this:

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

The binmode argument will be handled later (see code below).

The other sub, read_file_lines, will only take an optional binmode (so you
can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if it is called
in a scalar context. So this sub could use positional arguments, but to
keep its API similar to the API of read_file, it will also use pass by
name for the optional arguments. This also means that new optional
arguments can be added later without breaking any legacy code. A bonus
of keeping the API the same for both subs will be seen in how the two
subs are optimized to work together.

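A context-driven usage sketch of the proposed read_file_lines API (the
body is a simplified stand-in without the optional binmode handling, and
the in-memory filehandle just keeps it self-contained):

```perl
# stand-in: list of lines in list context, anon array in scalar context
sub read_file_lines {
    my( $file, %args ) = @_ ;
    open( my $fh, '<', $file ) or die "open: $!" ;
    return wantarray ? <$fh> : [ <$fh> ] ;
}

my $data = "x\ny\n" ;

my @lines    = read_file_lines( \$data ) ;   # list context: plain lines
my $line_ref = read_file_lines( \$data ) ;   # scalar context: anon array

print scalar( @lines ), ' ', scalar( @{$line_ref} ), "\n" ;   # 2 2
```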
Write slurping (or spewing or burping :-) needs to have its API designed
as well. The biggest issue is that we need to support not only optional
arguments but also a list of arguments to be written. Perl 6 can
handle that with optional named arguments and a final slurp
argument. Since this is Perl 5, we have to do it using some
cleverness. The first argument is the file name, and it will be
positional as with the read_file sub. But how can we pass in the
optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it is the optional arguments and the rest
of the arguments is the data list. So the write_file sub will start off
like this:

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in @_ to minimize any more copying. You call write_file like this:

    write_file( 'foo', { binmode => ':raw' }, @data ) ;
    write_file( 'junk', { append => 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;

=head2 Fast Slurping


=head2 Benchmarks


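This section is a stub in the draft; as a placeholder, here is a minimal
Benchmark.pm sketch comparing the cargo-cult join slurp with the local
$/ slurp (the file size and iteration count are arbitrary):

```perl
use Benchmark qw( cmpthese ) ;
use File::Temp qw( tempfile ) ;

# build a throwaway test file
my( $fh, $file ) = tempfile( UNLINK => 1 ) ;
print $fh "some line of text\n" x 5_000 ;
close $fh ;

cmpthese( 50, {

    join_lines => sub {
        open( my $in, '<', $file ) or die $! ;
        my $text = join '', <$in> ;
    },

    local_slurp => sub {
        open( my $in, '<', $file ) or die $! ;
        my $text = do { local $/ ; <$in> } ;
    },
} ) ;
```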
=head2 Error Handling

Slurp subs are subject to conditions such as not being able to open the
file or I/O errors. How these errors are handled, and what the caller
will see, are important aspects of the design of an API. The classic
error handling for slurping has been to call die or, even better,
croak. But sometimes you want the slurp to just warn/carp
and allow your code to handle the error. Sure, this can be done by
wrapping the slurp in an eval block to catch a fatal error, but not
everyone wants all that extra code. So I have added another option to
all the subs which selects the error handling. If the 'err_mode' option
is 'croak' (which is also the default), the called sub will croak. If set
to 'carp' then carp will be called. Set to any other string (use 'quiet'
by convention) and no error handler call is made. Then the caller can
use the error status from the call.

C<write_file> doesn't use the return value for data, so it can return a
false status value in-band to mark an error. C<read_file> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a (scalar or array) reference. So a bare return
would work here. But if you slurp in lines by calling it in a list
context, a bare return will return an empty list, which is the same value
it would get from an existing but empty file. So now, C<read_file> will
do something I strongly advocate against, which is returning a call to
undef. In the scalar contexts this still returns an error, and now in list
context, the returned first value will be undef, and that is not legal
data for the first element. So the list context also gets an error status
it can detect:

    my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
    your_handle_error( "$file_name can't be read\n" ) unless
        @lines && defined $lines[0] ;


=head2 File::FastSlurp

    use Carp ;
    use Fcntl qw( :DEFAULT ) ;

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

        my $mode = O_RDONLY ;
        $mode |= O_BINARY if $args{'binmode'} ;

        local( *FH ) ;
        unless( sysopen( FH, $file_name, $mode ) ) {

            carp "Can't open $file_name: $!" ;
            return undef ;
        }

        my $size_left = -s FH ;

        while( $size_left > 0 ) {

            my $read_cnt = sysread( FH, ${$buf_ref},
                    $size_left, length ${$buf_ref} ) ;

            unless( $read_cnt ) {

                carp "read error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $read_cnt ;
        }

    # handle void context (return scalar by buffer reference)

        return unless defined wantarray ;

    # handle list context

        return split( m|(?<=$/)|, ${$buf_ref} ) if wantarray ;

    # handle scalar context

        return ${$buf_ref} ;
    }


    sub read_file_lines {

    # handle list context

        return &read_file if wantarray ;

    # otherwise handle scalar context

        return [ &read_file ] ;
    }


    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
        my $buf = join '', @_ ;


        my $mode = O_WRONLY | O_CREAT ;
        $mode |= O_BINARY if $args->{'binmode'} ;
        $mode |= O_APPEND if $args->{'append'} ;

        # clobber any existing contents unless appending
        $mode |= O_TRUNC unless $args->{'append'} ;

        local( *FH ) ;
        unless( sysopen( FH, $file_name, $mode ) ) {

            carp "Can't open $file_name: $!" ;
            return ;
        }

        my $size_left = length( $buf ) ;
        my $offset = 0 ;

        while( $size_left > 0 ) {

            my $write_cnt = syswrite( FH, $buf,
                    $size_left, $offset ) ;

            unless( $write_cnt ) {

                carp "write error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $write_cnt ;
            $offset += $write_cnt ;
        }

        return ;
    }

=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles,
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem, since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still be aware not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte-sized files
is not a major issue on today's systems with the typical amount of
RAM installed. When Perl was first being used in depth (Perl 4),
slurping was limited by the small RAM sizes of 10 years ago. Now you
should be able to slurp almost any reasonably sized file, be it
configurations, source code, data, etc.