=head1 Perl Slurp Ease

=head2 Introduction


One of the common Perl idioms is processing text files line by line:

    while( <FH> ) {
        do something with $_
    }

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and it is
easily taught to and understood by Perl newbies. In fact, newbies are the
ones who do silly things like this:

    while( <FH> ) {
        push @lines, $_ ;
    }

    foreach ( @lines ) {
        do something with $_
    }

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now, slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line processing if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap thrashing. You can slurp STDIN if
you know that you can handle the maximum size input without
detrimentally affecting your memory usage. So I advocate slurping only
disk files and only when you know their size is reasonable and you have
a real reason to process the file as a whole. Note that a reasonable size
these days is larger than in the bad old days of limited RAM. Slurping in a
megabyte is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini-)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.

Another major win for slurping over line by line is speed. Perl's I/O
system (like many others) is slow. Calling C<< <> >> for each line
requires a check for the end of line, checks for EOF, copying a line,
munging the internal handle structure, etc. That is plenty of work for
each line read in. Slurping, if done correctly, will usually involve only
one I/O call and no extra data copying. The same is true for writing
files to disk, and we will cover that as well (even though the term
slurping is traditionally a read operation, I use the term ``slurp'' for
the concept of doing I/O with an entire file in one operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of C<//g>, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space) and performing complex transformations
such as template expansion.

=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the C</m> modifier), the '=' char and the text to the end of the
line (again, C</m> makes that work). In fact the ending C<$> is not even needed
since C<.> will not normally match a newline. Since the key and value are
grabbed and the C<m//> is in list context with the C</g> modifier, it will
grab all key/value pairs and return them. The C<%config> hash will be
assigned this list and now you have the file fully parsed into a hash.

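The key/value parse above can be tried end to end with inline data in
place of a slurped file; the config text, keys and values here are
invented for illustration:

```perl
# Stand-in for read_file(): inline text with three key/value lines.
my $text = "host=example.com\nport=8080\nuser=alice\n";

# /m lets ^ and $ match at embedded newlines; in list context with
# /g, the match returns all (key, value) pairs, which assign
# directly to a hash.
my %config = $text =~ /^(\w+)=(.+)$/mg;

print "$config{host}\n";   # example.com
print "$config{port}\n";   # 8080
```
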
Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

    $text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

    $text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template( $1, $2 )/sge ;

Just supply a C<template> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the C</s> modifier. This is
something that is much trickier with line by line processing.

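Here is a runnable sketch of that one-line expansion; the template text
and the C<%template> keys are made up for illustration:

```perl
# Hypothetical template hash and text; only the s/<%...%>//g idiom
# itself comes from the article.
my %template = ( name => 'Perl', year => '1987' );

my $text = "<%name%> was released in <%year%>.";

# Replace each <%key%> marker with its hash value; .+? keeps the
# match non-greedy so adjacent markers don't merge into one.
$text =~ s/<%(.+?)%>/$template{$1}/g;

print "$text\n";   # Perl was released in 1987.
```
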
Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read and write files.

Slurping a file into an array also offers some useful advantages.
One simple example is reading in a flat database where each record has
fields separated by a character such as C<:>:

    my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;

Random access to any line of the slurped file is another advantage. Also
a line index could be built to speed up searching the array of lines.

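As a sketch of that line-index idea (the data below stands in for a
slurped file, and keying on the first word is just one possible index):

```perl
# Pretend these lines were slurped from a file in list context.
my @lines = ( "alpha one\n", "beta two\n", "gamma three\n", "beta four\n" );

# Map the first word of each line to the indexes where it appears,
# giving direct lookup instead of a scan per search.
my %index;
for my $i ( 0 .. $#lines ) {
    my( $word ) = $lines[$i] =~ /^(\w+)/;
    push @{ $index{$word} }, $i;
}

# Random access: jump straight to every line starting with 'beta'.
print "beta at lines: @{ $index{beta} }\n";   # beta at lines: 1 3
```
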

=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping of
a file to a list of lines is trivial, just call the C<< <> >> operator
in a list context:

    my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built in
variable C<$/> (the input record separator) to the undefined value and read
in the file with C<< <> >>:

    {
        local( $/, *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        $text = <FH> ;
    }

Notice the use of C<local()>. It sets C<$/> to C<undef> for you and when
the scope exits it will revert C<$/> back to its previous value (most
likely "\n").

Here is a Perl idiom that allows the C<$text> variable to be declared,
and there is no need for a tightly nested block. The C<do> block will
execute C<< <FH> >> in a scalar context and slurp in the file named by
C<$file>:

    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:

    {
        local( $/ ) ;
        open( my $fh, $file ) or die "sudden flaming death\n" ;
        $text = <$fh> ;
    }

    open( my $fh, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

    my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in C<$file> is assigned to a localized C<@ARGV> and the
null filehandle is used which reads the data from the files in C<@ARGV>.

Instead of assigning to a scalar, all the above slurps can assign to an
array and it will get the file but split into lines (using C<$/> as the
end of line marker).

There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:

    my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (C<join> provides a
list context to C<< <FH> >>) and then joins up those lines again. The
original coder of this idiom obviously never read I<perlvar> and learned
how to use C<$/> to allow scalar slurping.

=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it ``slurping'' when we read in files, but
there is no commonly accepted term for the write operation. I asked some
Perl colleagues and got two interesting nominations. Peter Scott said to
call it ``burping'' (rhymes with ``slurping'' and suggests movement in
the opposite direction). Others suggested ``spewing'' which has a
stronger visual image :-) Tell me your favorite or suggest your own. I
will use both in this section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp subroutine:

    sub burp {
        my( $file_name ) = shift ;
        open( my $fh, ">$file_name" ) ||
            die "can't create $file_name $!" ;
        print $fh @_ ;
    }

Note that it doesn't copy the input text but passes C<@_> directly to
print. We will look at faster variations of that later on.

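A quick round trip shows C<burp()> working together with the do-block
slurp from earlier; File::Temp supplies a scratch file name, so nothing
here touches real files:

```perl
use File::Temp qw( tempfile );

# The burp() sub from above, reproduced so this sketch is
# self-contained.
sub burp {
    my( $file_name ) = shift ;
    open( my $fh, ">$file_name" ) ||
        die "can't create $file_name $!" ;
    print $fh @_ ;
}

# Scratch file, removed automatically at exit.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 );
close $tmp_fh ;

burp( $file, "line one\n", "line two\n" );

# Slurp it back with the classic do-block idiom.
my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

print $text ;
```
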
=head2 Slurp on the CPAN

As you would expect there are modules on the CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }

    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }

    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }

The subroutine C<slurp()> uses the magic undefined value of C<$/> and
the magic filehandle C<ARGV> to support slurping into a scalar or
array. It also provides two wrapper subs that allow the caller to
control the context of the slurp. And the C<to_array()> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking C<wantarray>. It has 'slurp' in
C<@EXPORT> and all three subroutines in C<@EXPORT_OK>.

<Footnote: Slurp.pm is poorly named and it shouldn't be in the top level
namespace.>

The original File::Slurp.pm has this code:

    sub read_file
    {
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, "<$file") || croak "open $file: $!";
        @r = <F>;
        close(F) || croak "close $file: $!";

        return $r[0] unless wantarray;
        return @r;
    }

This module provides several subroutines including C<read_file()> (more
on the others later). C<read_file()> behaves similarly to
C<Slurp::slurp()> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of C<$/> for scalar slurping but it uses an explicit
open call rather than a localized C<@ARGV> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines but that can easily be rectified by calling it inside an anonymous
array constructor C<[]>.

Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic C<$/> to slurp in scalar mode and the
natural behavior of C<< <> >> in list context to slurp as lines. But
neither is optimized for speed nor can they handle C<binmode()> to
support binary or Unicode files. See below for more on slurp features
and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have a very simple API and don't support
C<binmode()>. This section will cover various API design issues such as
efficient return by reference, C<binmode()> and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines or as an anonymous array of lines. But the caller can only
provide two contexts: scalar or list. So we have to either provide an
API with more than one subroutine (as Slurp.pm did) or just provide one
subroutine which only returns a scalar or a list (not an anonymous
array) as File::Slurp does.

I have used my own C<read_file()> subroutine for years and it has the
same API as File::Slurp: a single subroutine that returns a scalar or a
list of lines depending on context. But I recognize the interest of
those that want an anonymous array for line slurping. For one thing, it
is easier to pass around to other subs and for another, it eliminates
the extra copying of the lines via C<return>. So my module provides only
one slurp subroutine that returns the file data based on context and any
format options passed in. There is no need for a specific
slurp-in-as-a-scalar or list subroutine as the general C<read_file()>
sub will do that by default in the appropriate context. If you want
C<read_file()> to return a scalar reference or anonymous array of lines,
you can request those formats with options. You can even pass in a
reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that). If you want to
slurp a scalar into an array, just select the desired array element and
that will provide scalar context to the C<read_file()> subroutine.

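The context point in that last sentence is easy to demonstrate without
the module itself; C<pseudo_read_file()> below is a made-up stand-in
that merely reports its calling context the way a context-sensitive
slurp sub would:

```perl
# Hypothetical stand-in for read_file(): returns lines in list
# context and the whole "file" in scalar context, via wantarray.
sub pseudo_read_file {
    my @lines = ( "a\n", "b\n" );
    return wantarray ? @lines : join '', @lines;
}

my @lines = pseudo_read_file();   # list context: lines
my $text  = pseudo_read_file();   # scalar context: whole file

# Assigning to a single array element is scalar context too, so
# this slurps the whole "file" into $files[0].
my @files;
$files[0] = pseudo_read_file();

print scalar( @lines ), "\n";   # 2
```
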
The next area to cover is what to name the slurp sub. I will go with
C<read_file()>. It is descriptive, keeps compatibility with the current
File::Slurp API, and doesn't use the 'slurp' nickname (though that
nickname is in the module name). Also I decided to keep the File::Slurp
namespace which was graciously handed over to me by its previous owner,
David Muir.

Another critical area when designing APIs is how to pass in
arguments. The C<read_file()> subroutine takes one required argument
which is the file name. To support C<binmode()> we need another optional
argument. A third optional argument is needed to support returning a
slurped scalar by reference. My first thought was to design the API with
3 positional arguments - file name, buffer reference and binmode. But if
you want to set the binmode and not pass in a buffer reference, you have
to fill the second argument with C<undef> and that is ugly. So I decided
to make the filename argument positional and the other two named. The
subroutine starts off like this:

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

The other sub (C<read_file_lines()>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if it is called
in a scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of C<read_file()>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus of keeping the API the same for both subs will be seen in
how the two subs are optimized to work together.

Write slurping (or spewing or burping :-)) needs to have its API
designed as well. The biggest issue is that we need to support not only
optional arguments but also a list of data to be written. Perl
6 will be able to handle that with optional named arguments and a final
slurp argument. Since this is Perl 5 we have to do it using some
cleverness. The first argument is the file name and it will be
positional as with the C<read_file> subroutine. But how can we pass in
the optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments are the data list. So the C<write_file()>
subroutine will start off like this:

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in C<@_> to minimize any more copying. You call C<write_file()> like this:

    write_file( 'foo', { binmode => ':raw' }, @data ) ;
    write_file( 'junk', { append => 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;

=head2 Fast Slurping

Somewhere along the line, I learned about a way to slurp files faster
than by setting C<$/> to C<undef>. The method is very simple: you do a
single read call with the size of the file (which the C<-s> operator
provides). This bypasses the I/O loop inside perl that checks for EOF
and does all sorts of processing. I then decided to experiment and found
that C<sysread> is even faster, as you would expect. C<sysread> bypasses
all of Perl's stdio and reads the file from the kernel buffers directly
into a Perl scalar. This is why the slurp code in File::Slurp uses
C<sysopen/sysread/syswrite>. All the rest of the code is just to support
the various options and data passing techniques.

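A minimal sketch of that C<-s> plus C<sysread> technique, using a temp
file so it is self-contained (error handling trimmed to bare die calls):

```perl
use Fcntl qw( O_RDONLY );
use File::Temp qw( tempfile );

# Make a scratch file with known contents.
my( $out, $file ) = tempfile( UNLINK => 1 );
print $out "some test data\n";
close $out ;

# One sysopen + one sysread of the full size: no line loop and no
# stdio buffering in the way.
sysopen( my $in, $file, O_RDONLY ) or die "can't open $file: $!" ;
my $size = -s $in ;                     # size taken from the handle
my $buf ;
sysread( $in, $buf, $size ) == $size
    or die "short read on $file: $!" ;

print $buf ;   # some test data
```

A production version still needs a read loop, since C<sysread> is
allowed to return fewer bytes than requested; the File::FastSlurp code
later in this article shows that loop.
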

=head2 Benchmarks

Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:

    perl -Ilib extras/slurp_bench.pl

If you pass in an argument on the command line, it will be passed to
C<timethese()> and it will control the duration. It defaults to -2 which
makes each benchmark run for at least 2 seconds of CPU time.

The following numbers are from a run I did on my 300MHz SPARC. You will
most likely get much faster counts on your boxes but the relative speeds
shouldn't change by much. If you see major differences on your
benchmarks, please send me the results and your Perl and OS
versions. Also you can play with the benchmark script and add more slurp
variations or data files.

The rest of this section will be discussing the results of the
benchmarks. You can refer to extras/slurp_bench.pl to see the code for
the individual benchmarks. If the benchmark name starts with cpan_, it
is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
from the new File::Slurp.pm. Those that start with file_contents_ are
from a client's code base. The rest are variations I created to
highlight certain aspects of the benchmarks.

The short and long file data is made like this:

    my @lines = ( 'abc' x 30 . "\n") x 100 ;
    my $text = join( '', @lines ) ;

    @lines = ( 'abc' x 40 . "\n") x 1000 ;
    $text = join( '', @lines ) ;

So the short file is 9,100 bytes and the long file is 121,000
bytes.

=head3 Scalar Slurp of Short File

    file_contents        651/s
    file_contents_no_OO  828/s
    cpan_read_file      1866/s
    cpan_slurp          1934/s
    read_file           2079/s
    new                 2270/s
    new_buf_ref         2403/s
    new_scalar_ref      2415/s
    sysread_file        2572/s

=head3 Scalar Slurp of Long File

    file_contents_no_OO 82.9/s
    file_contents       85.4/s
    cpan_read_file       250/s
    cpan_slurp           257/s
    read_file            323/s
    new                  468/s
    sysread_file         489/s
    new_scalar_ref       766/s
    new_buf_ref          767/s

The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall
except for the very simple sysread_file entry, which was added to
highlight the overhead of the more flexible new module (an overhead
which isn't that much). The file_contents entries are always the worst
since they do a list slurp and then a join, which is a classic newbie
and cargo culted style that is extremely slow. Also the OO code in
file_contents slows it down even more (I added the file_contents_no_OO
entry to show this). The two CPAN modules are decent with small files
but they are laggards compared to the new module when the file gets much
larger.

=head3 List Slurp of Short File

    cpan_read_file          589/s
    cpan_slurp_to_array     620/s
    read_file               824/s
    new_array_ref           824/s
    sysread_file            828/s
    new                     829/s
    new_in_anon_array       833/s
    cpan_slurp_to_array_ref 836/s

=head3 List Slurp of Long File

    cpan_read_file          62.4/s
    cpan_slurp_to_array     62.7/s
    read_file               92.9/s
    sysread_file            94.8/s
    new_array_ref           95.5/s
    new                     96.2/s
    cpan_slurp_to_array_ref 96.3/s
    new_in_anon_array       97.2/s

This is perhaps the most interesting result of this benchmark. Five
different entries have effectively tied for the lead. The logical
conclusion is that splitting the input into lines is the bounding
operation, no matter how the file gets slurped. This is the only
benchmark where the new module isn't the clear winner (in the long file
entries - it is no worse than a close second in the short file
entries).


Note: In the benchmark information for all the spew entries, the extra
number at the end of each line is how many wallclock seconds the whole
entry took. The benchmarks were run for at least 2 CPU seconds per
entry. The unusually large wallclock times will be discussed below.

=head3 Scalar Spew of Short File

    cpan_write_file 1035/s 38
    print_file      1055/s 41
    syswrite_file   1135/s 44
    new             1519/s  2
    print_join_file 1766/s  2
    new_ref         1900/s  2
    syswrite_file2  2138/s  2

=head3 Scalar Spew of Long File

    cpan_write_file 164/s 20
    print_file      211/s 26
    syswrite_file   236/s 25
    print_join_file 277/s  2
    new             295/s  2
    syswrite_file2  428/s  2
    new_ref         608/s  2

In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The C<syswrite_file2> entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of print.

=head3 List Spew of Short File

    cpan_write_file  794/s 29
    syswrite_file   1000/s 38
    print_file      1013/s 42
    new             1399/s  2
    print_join_file 1557/s  2

=head3 List Spew of Long File

    cpan_write_file 112/s 12
    print_file      179/s 21
    syswrite_file   181/s 19
    print_join_file 205/s  2
    new             228/s  2

Again, the simple C<print_join_file> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then it calls
C<print> on the output list and that is much slower than passing to
C<print> a single scalar generated by join. The C<print_file> entry
shows the advantage of directly printing C<@_> and the
C<print_join_file> adds the join optimization.

Now about those long wallclock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an C<overwrite> subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.

But what about those long times? Well, it is all about the difference
between creating files and overwriting existing ones. The former have to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite will save
on disk seeks as well as on CPU time. In fact when running this
benchmark, I could hear my disk going crazy allocating inodes during the
spew operations. This speedup in both CPU and wallclock is why the new
module always does overwriting when spewing files. It also does the
proper truncate (and this is checked in the tests by spewing shorter
files after longer ones had previously been written). The C<overwrite>
subroutine is just a typeglob alias to C<write_file> and is there for
backwards compatibility with the old File::Slurp module.

=head3 Benchmark Conclusion

Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. Also it uses all the other optimizations including using
C<sysread/syswrite> and joining output lines. I expect many projects
that extensively use slurping will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API
they will get a significant speedup.

=head2 Error Handling

Slurp subroutines are subject to conditions such as not being able to
open the file, or I/O errors. How these errors are handled, and what the
caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call C<die()> or even
better, C<croak()>. But sometimes you want the slurp to either
C<warn()>/C<carp()> or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an C<eval> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If set to 'carp' then carp
will be called. Set to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.

C<write_file()> doesn't use the return value for data so it can return a
false status value in-band to mark an error. C<read_file()> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
return would work here. But if you slurp in lines by calling it in a
list context, a bare C<return> will return an empty list, which is the
same value it would get from an existing but empty file. So now,
C<read_file()> will do something I normally strongly advocate against,
i.e., returning an explicit C<undef> value. In the scalar context this
still returns an error, and in list context, the returned first value
will be C<undef>, and that is not legal data for the first element. So
the list context also gets an error status it can detect:

    my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
    your_handle_error( "$file_name can't be read\n" ) unless
        @lines && defined $lines[0] ;

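The same defined-vs-empty distinction can be sketched without the
module; C<maybe_read_file()> below is a hypothetical quiet-mode slurp,
not the real C<read_file()>:

```perl
# Hypothetical quiet-mode slurp: undef means error, '' means an
# empty but readable file, so 'defined' is the error test.
sub maybe_read_file {
    my( $file_name ) = @_ ;
    open( my $fh, '<', $file_name ) or return undef ;  # quiet failure
    local $/ ;
    my $text = <$fh> ;
    return defined $text ? $text : '' ;  # an empty file reads as undef
}

my $text = maybe_read_file( '/no/such/file/we/hope' );
print defined $text ? "read ok\n" : "read failed\n";   # read failed
```
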

=head2 File::FastSlurp

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

        my $mode = O_RDONLY ;
        $mode |= O_BINARY if $args{'binmode'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            carp "Can't open $file_name: $!" ;

        my $size_left = -s FH ;

        while( $size_left > 0 ) {

            my $read_cnt = sysread( FH, ${$buf_ref},
                    $size_left, length ${$buf_ref} ) ;

            unless( $read_cnt ) {

                carp "read error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $read_cnt ;
        }

    # handle void context (return scalar by buffer reference)

        return unless defined wantarray ;

    # handle list context

        return split m|(?<=$/)|, ${$buf_ref} if wantarray ;

    # handle scalar context

        return ${$buf_ref} ;
    }

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
        my $buf = join '', @_ ;


        my $mode = O_WRONLY ;
        $mode |= O_BINARY if $args->{'binmode'} ;
        $mode |= O_APPEND if $args->{'append'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            carp "Can't open $file_name: $!" ;

        my $size_left = length( $buf ) ;
        my $offset = 0 ;

        while( $size_left > 0 ) {

            my $write_cnt = syswrite( FH, $buf,
                    $size_left, $offset ) ;

            unless( $write_cnt ) {

                carp "write error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $write_cnt ;
            $offset += $write_cnt ;
        }

        return ;
    }

=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still be aware not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.

=head2 Acknowledgements
