<HTML>
<HEAD>
<TITLE>Perl Slurp Ease</TITLE>
<LINK REV="made" HREF="mailto:steve@dewitt.vnet.net">
</HEAD>

<BODY>

<A NAME="__index__"></A>
<!-- INDEX BEGIN -->

<UL>

<LI><A HREF="#perl slurp ease">Perl Slurp Ease</A></LI>
<UL>

<LI><A HREF="#introduction">Introduction</A></LI>
<LI><A HREF="#global operations">Global Operations</A></LI>
<LI><A HREF="#traditional slurping">Traditional Slurping</A></LI>
<LI><A HREF="#write slurping">Write Slurping</A></LI>
<LI><A HREF="#slurp on the cpan">Slurp on the CPAN</A></LI>
<LI><A HREF="#slurping api design">Slurping API Design</A></LI>
<LI><A HREF="#fast slurping">Fast Slurping</A></LI>
<UL>

<LI><A HREF="#scalar slurp of short file">Scalar Slurp of Short File</A></LI>
<LI><A HREF="#scalar slurp of long file">Scalar Slurp of Long File</A></LI>
<LI><A HREF="#list slurp of short file">List Slurp of Short File</A></LI>
<LI><A HREF="#list slurp of long file">List Slurp of Long File</A></LI>
<LI><A HREF="#scalar spew of short file">Scalar Spew of Short File</A></LI>
<LI><A HREF="#scalar spew of long file">Scalar Spew of Long File</A></LI>
<LI><A HREF="#list spew of short file">List Spew of Short File</A></LI>
<LI><A HREF="#list spew of long file">List Spew of Long File</A></LI>
<LI><A HREF="#benchmark conclusion">Benchmark Conclusion</A></LI>
</UL>

<LI><A HREF="#error handling">Error Handling</A></LI>
<LI><A HREF="#file::fastslurp">File::FastSlurp</A></LI>
<LI><A HREF="#slurping in perl 6">Slurping in Perl 6</A></LI>
<LI><A HREF="#in summary">In Summary</A></LI>
<LI><A HREF="#acknowledgements">Acknowledgements</A></LI>
</UL>

</UL>
<!-- INDEX END -->

<HR>
<P>
<H1><A NAME="perl slurp ease">Perl Slurp Ease</A></H1>
<P>
<H2><A NAME="introduction">Introduction</A></H2>
<P>One of the common Perl idioms is processing text files line by line:</P>
<PRE>
    while( <FH> ) {
        do something with $_
    }</PRE>
<P>This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and being
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:</P>
<PRE>
    while( <FH> ) {
        push @lines, $_ ;
    }</PRE>
<PRE>
    foreach ( @lines ) {
        do something with $_
    }</PRE>
<P>Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now, slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line if done properly.</P>
<P>The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap-disk thrashing. You can slurp STDIN if
you know that you can handle the maximum size input without
detrimentally affecting your memory usage. So I advocate slurping only
disk files and only when you know their size is reasonable and you have
a real reason to process the file as a whole. Note that reasonable size
these days is larger than in the bad old days of limited RAM. Slurping in a
megabyte is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini-)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.</P>
<P>Another major win for slurping over line by line is speed. Perl's IO
system (like many others) is slow. Calling <CODE><></CODE> for each line
requires a check for the end of line, checks for EOF, copying a line,
munging the internal handle structure, etc. That is plenty of work for each
line read in. Slurping, if done correctly, will usually involve only
one I/O call and no extra data copying. The same is true for writing
files to disk, and we will cover that as well (even though the term
slurping traditionally refers to a read operation, I use the term ``slurp''
for the concept of doing I/O with an entire file in one operation).</P>
<P>Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of <CODE>//g</CODE>, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space) and performing complex transformations
such as template expansion.</P>
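<P>To make those whole-file operations concrete, here is a small self-contained
sketch (the sample text is made up for illustration) showing a substitution
that crosses line boundaries and a single <CODE>//g</CODE> match in list context:</P>

```perl
use strict ;
use warnings ;

# A slurped file is just one long scalar, so a substitution can match
# across what used to be line boundaries:
my $text = "foo\nbar baz\nfoo bar\n" ;

# global search/replace without regard for newlines (\s+ spans the "\n")
( my $fixed = $text ) =~ s/foo\s+bar/QUUX/g ;

# grab all matches with one call of //g in list context
my @words = $fixed =~ /(\w+)/g ;
```

<P>Neither operation is possible on a single line at a time without extra
bookkeeping to stitch lines back together.</P>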
<P>
<H2><A NAME="global operations">Global Operations</A></H2>
<P>Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing but that would be slower and require
more code.</P>
<P>A common problem is reading in a file with key/value pairs. There are
modules which do this but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.</P>
<PRE>
    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;</PRE>
<P>That matches a key which starts a line (anywhere inside the string
because of the <CODE>/m</CODE> modifier), the '=' char and the text to the end of the
line (again, <CODE>/m</CODE> makes that work). In fact the ending <CODE>$</CODE> is not even needed
since <CODE>.</CODE> will not normally match a newline. Since the key and value are
grabbed and the <CODE>m//</CODE> is in list context with the <CODE>/g</CODE> modifier, it will
grab all key/value pairs and return them. The <CODE>%config</CODE> hash will be
assigned this list and now you have the file fully parsed into a hash.</P>
<P>Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:</P>
<PRE>
    $text =~ s/<%(.+?)%>/$template{$1}/g ;</PRE>
<P>That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:</P>
<PRE>
    $text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template( $1, $2 )/sge ;</PRE>
<P>Just supply a <CODE>template</CODE> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the <CODE>/s</CODE> modifier. This is
something that is much trickier with line by line processing.</P>
<P>Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read and write files.</P>
<P>Slurping a file into an array also offers some useful advantages.
One simple example is reading in a flat database where each record has
fields separated by a character such as <CODE>:</CODE>:</P>
<PRE>
    my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;</PRE>
<P>Random access to any line of the slurped file is another advantage. Also
a line index could be built to speed up searching the array of lines.</P>
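<P>A minimal sketch of that line-index idea (the sample lines and the
key:value field layout are invented for illustration):</P>

```perl
use strict ;
use warnings ;

# Pretend these came from a list slurp of a small flat database.
my @lines = ( "alpha:1\n", "beta:2\n", "gamma:3\n" ) ;

# Random access to any line is just an array index.
my $second_line = $lines[1] ;

# Build a simple index mapping each line's key field to its line number,
# so later lookups are hash lookups instead of linear scans of @lines.
my %index ;
for my $i ( 0 .. $#lines ) {
    my( $key ) = $lines[$i] =~ /^(\w+):/ ;
    $index{$key} = $i ;
}

my $gamma_line = $lines[ $index{'gamma'} ] ;
```

<P>The index costs one pass over the array up front, after which every
lookup by key is constant time.</P>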
<P>
<H2><A NAME="traditional slurping">Traditional Slurping</A></H2>
<P>Perl has always supported slurping files with minimal code. Slurping of
a file to a list of lines is trivial, just call the <CODE><></CODE> operator
in a list context:</P>
<PRE>
    my @lines = <FH> ;</PRE>
<P>and slurping to a scalar isn't much more work. Just set the built in
variable <CODE>$/</CODE> (the input record separator) to the undefined value and read
in the file with <CODE><></CODE>:</P>
<PRE>
    {
        local( $/, *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        $text = <FH> ;
    }</PRE>
<P>Notice the use of <CODE>local()</CODE>. It sets <CODE>$/</CODE> to <CODE>undef</CODE> for you and when
the scope exits it will revert <CODE>$/</CODE> back to its previous value (most
likely ``\n'').</P>
<P>Here is a Perl idiom that allows the <CODE>$text</CODE> variable to be declared,
and there is no need for a tightly nested block. The <CODE>do</CODE> block will
execute <CODE><FH></CODE> in a scalar context and slurp in the file named by
<CODE>$file</CODE>:</P>
<PRE>
    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;</PRE>
<P>Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:</P>
<PRE>
    {
        local( $/ ) ;
        open( my $fh, $file ) or die "sudden flaming death\n" ;
        $text = <$fh> ;
    }</PRE>
<PRE>
    open( my $fh, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <$fh> } ;</PRE>
<P>And this is a variant of that idiom that removes the need for the open
call:</P>
<PRE>
    my $text = do { local( @ARGV, $/ ) = $file ; <> } ;</PRE>
<P>The filename in <CODE>$file</CODE> is assigned to a localized <CODE>@ARGV</CODE> and the
null filehandle is used which reads the data from the files in <CODE>@ARGV</CODE>.</P>
<P>Instead of assigning to a scalar, all the above slurps can assign to an
array and they will get the file but split into lines (using <CODE>$/</CODE> as the
end of line marker).</P>
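<P>For example, here is the localized <CODE>@ARGV</CODE> idiom from above assigned
to an array instead of a scalar. Note that <CODE>$/</CODE> is left alone this time,
so the input is split into lines; the demo file is created here just so the
snippet is self-contained:</P>

```perl
use strict ;
use warnings ;

# Create a small demo file to slurp (illustration only).
my $file = "slurp_demo.$$" ;
open( my $out, '>', $file ) or die "can't create $file: $!" ;
print $out "one\ntwo\nthree\n" ;
close( $out ) ;

# Same idiom as the scalar slurp, but in list context and with $/
# untouched, so <> returns the file as a list of lines.
my @lines = do { local @ARGV = $file ; <> } ;

unlink $file ;
```
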
<P>There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:</P>
<PRE>
    my $text = join( '', <FH> ) ;</PRE>
<P>That needlessly splits the input file into lines (<CODE>join</CODE> provides a
list context to <CODE><FH></CODE>) and then joins up those lines again. The
original coder of this idiom obviously never read <EM>perlvar</EM> and learned
how to use <CODE>$/</CODE> to allow scalar slurping.</P>
<P>
<H2><A NAME="write slurping">Write Slurping</A></H2>
<P>While reading in entire files at one time is common, writing out entire
files is also done. We call it ``slurping'' when we read in files, but
there is no commonly accepted term for the write operation. I asked some
Perl colleagues and got two interesting nominations. Peter Scott said to
call it ``burping'' (rhymes with ``slurping'' and suggests movement in
the opposite direction). Others suggested ``spewing'' which has a
stronger visual image :-) Tell me your favorite or suggest your own. I
will use both in this section so you can see how they work for you.</P>
<P>Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp subroutine:</P>
<PRE>
    sub burp {
        my( $file_name ) = shift ;
        open( my $fh, ">$file_name" ) ||
            die "can't create $file_name $!" ;
        print $fh @_ ;
    }</PRE>
<P>Note that it doesn't copy the input text but passes <CODE>@_</CODE> directly to
print. We will look at faster variations of that later on.</P>
<P>
<H2><A NAME="slurp on the cpan">Slurp on the CPAN</A></H2>
<P>As you would expect there are modules in the CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).</P>
<P>Here is the code from Slurp.pm:</P>
<PRE>
    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }</PRE>
<PRE>
    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }</PRE>
<PRE>
    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }</PRE>
<P>The subroutine <CODE>slurp()</CODE> uses the magic undefined value of <CODE>$/</CODE> and
the magic filehandle <CODE>ARGV</CODE> to support slurping into a scalar or
array. It also provides two wrapper subs that allow the caller to
control the context of the slurp. And the <CODE>to_array()</CODE> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking <CODE>wantarray</CODE>. It has 'slurp' in
<CODE>@EXPORT</CODE> and all three subroutines in <CODE>@EXPORT_OK</CODE>.</P>
<P><Footnote: Slurp.pm is poorly named and it shouldn't be in the top level
namespace.></P>
<P>The original File::Slurp.pm has this code:</P>
<PRE>
    sub read_file
    {
        my ($file) = @_;</PRE>
<PRE>
        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);</PRE>
<PRE>
        open(F, "<$file") || croak "open $file: $!";
        @r = <F>;
        close(F) || croak "close $file: $!";</PRE>
<PRE>
        return $r[0] unless wantarray;
        return @r;
    }</PRE>
<P>This module provides several subroutines including <CODE>read_file()</CODE> (more
on the others later). <CODE>read_file()</CODE> behaves similarly to
<CODE>Slurp::slurp()</CODE> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of <CODE>$/</CODE> for scalar slurping but it uses an explicit
open call rather than a localized <CODE>@ARGV</CODE> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines but that can easily be rectified by calling it inside an anonymous
array constructor <CODE>[]</CODE>.</P>
<P>Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic <CODE>$/</CODE> to slurp in scalar mode and the
natural behavior of <CODE><></CODE> in list context to slurp as lines. But
neither is optimized for speed nor can they handle <CODE>binmode()</CODE> to
support binary or Unicode files. See below for more on slurp features
and speedups.</P>
<P>
<H2><A NAME="slurping api design">Slurping API Design</A></H2>
<P>The slurp modules on CPAN have a very simple API and don't support
<CODE>binmode()</CODE>. This section will cover various API design issues such as
efficient return by reference, <CODE>binmode()</CODE> and calling variations.</P>
<P>Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines or as an anonymous array of lines. But the caller can only
provide two contexts: scalar or list. So we have to either provide an
API with more than one subroutine (as Slurp.pm did) or just provide one
subroutine which only returns a scalar or a list (not an anonymous
array) as File::Slurp does.</P>
<P>I have used my own <CODE>read_file()</CODE> subroutine for years and it has the
same API as File::Slurp: a single subroutine that returns a scalar or a
list of lines depending on context. But I recognize the interest of
those that want an anonymous array for line slurping. For one thing, it
is easier to pass around to other subs and for another, it eliminates
the extra copying of the lines via <CODE>return</CODE>. So my module provides only
one slurp subroutine that returns the file data based on context and any
format options passed in. There is no need for a specific
slurp-in-as-a-scalar or list subroutine as the general <CODE>read_file()</CODE>
sub will do that by default in the appropriate context. If you want
<CODE>read_file()</CODE> to return a scalar reference or anonymous array of lines,
you can request those formats with options. You can even pass in a
reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that). If you want to
slurp a scalar into an array, just select the desired array element and
that will provide scalar context to the <CODE>read_file()</CODE> subroutine.</P>
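<P>Here is a stripped-down sketch of such a single, option-driven
<CODE>read_file()</CODE> - not the real File::Slurp code, just enough to show the
calling styles described above (the option names <CODE>buf_ref</CODE>,
<CODE>scalar_ref</CODE> and <CODE>array_ref</CODE> match the module's):</P>

```perl
use strict ;
use warnings ;

# Sketch of a context- and option-driven read_file() -- NOT the real
# File::Slurp implementation, just the calling styles it supports.
sub read_file {
    my( $file_name, %args ) = @_ ;
    open( my $fh, '<', $file_name ) or die "can't open $file_name: $!" ;
    my $text = do { local $/ ; <$fh> } ;

    if ( my $buf_ref = $args{'buf_ref'} ) {     # fill caller's buffer
        ${$buf_ref} = $text ;
        return ;
    }
    return \$text if $args{'scalar_ref'} ;      # return by reference
    if ( $args{'array_ref'} or wantarray ) {    # line-splitting cases
        my @lines = $text =~ /(.*\n|.+)/g ;
        return \@lines if $args{'array_ref'} ;
        return @lines ;
    }
    return $text ;                              # plain scalar slurp
}

# Demo file for the usage examples below (illustration only).
my $file = "read_file_demo.$$" ;
open( my $out, '>', $file ) or die "can't create $file: $!" ;
print $out "one\ntwo\n" ;
close( $out ) ;

my $text      = read_file( $file ) ;                    # scalar slurp
my @lines     = read_file( $file ) ;                    # list slurp
my $text_ref  = read_file( $file, scalar_ref => 1 ) ;   # by reference
my $lines_ref = read_file( $file, array_ref  => 1 ) ;   # anon array
my $buf ;
read_file( $file, buf_ref => \$buf ) ;                  # fill a buffer

unlink $file ;
```

<P>One sub, two contexts, and the options cover the two reference-returning
formats that context alone cannot express.</P>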
<P>The next area to cover is what to name the slurp sub. I will go with
<CODE>read_file()</CODE>. It is descriptive and keeps compatibility with the
current simple API, and it doesn't use the 'slurp' nickname (though that
nickname is in the module name). Also I decided to keep the File::Slurp
namespace which was graciously handed over to me by its current owner,
David Muir.</P>
<P>Another critical area when designing APIs is how to pass in
arguments. The <CODE>read_file()</CODE> subroutine takes one required argument
which is the file name. To support <CODE>binmode()</CODE> we need another optional
argument. A third optional argument is needed to support returning a
slurped scalar by reference. My first thought was to design the API with
3 positional arguments - file name, buffer reference and binmode. But if
you want to set the binmode and not pass in a buffer reference, you have
to fill the second argument with <CODE>undef</CODE> and that is ugly. So I decided
to make the filename argument positional and the other two named. The
subroutine starts off like this:</P>
<PRE>
    sub read_file {</PRE>
<PRE>
        my( $file_name, %args ) = @_ ;</PRE>
<PRE>
        my $buf ;
        my $buf_ref = $args{'buf'} || \$buf ;</PRE>
<P>The other sub (<CODE>read_file_lines()</CODE>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if it is called
in a scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of <CODE>read_file()</CODE>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus of keeping the API the same for both subs will be seen in
how the two subs are optimized to work together.</P>
<P>Write slurping (or spewing or burping :-)) needs to have its API
designed as well. The biggest issue is that we need to support not only
optional arguments but also a list of data to be written. Perl
6 will be able to handle that with optional named arguments and a final
slurp argument. Since this is Perl 5 we have to do it using some
cleverness. The first argument is the file name and it will be
positional as with the <CODE>read_file</CODE> subroutine. But how can we pass in
the optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments are the data list. So the <CODE>write_file()</CODE>
subroutine will start off like this:</P>
<PRE>
    sub write_file {</PRE>
<PRE>
        my $file_name = shift ;</PRE>
<PRE>
        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;</PRE>
<P>Whether or not optional arguments are passed in, we leave the data list
in <CODE>@_</CODE> to minimize any more copying. You call <CODE>write_file()</CODE> like this:</P>
<PRE>
    write_file( 'foo', { binmode => ':raw' }, @data ) ;
    write_file( 'junk', { append => 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;</PRE>
<P>
<H2><A NAME="fast slurping">Fast Slurping</A></H2>
<P>Somewhere along the line, I learned about a way to slurp files faster
than by setting <CODE>$/</CODE> to undef. The method is very simple: you do a single
read call with the size of the file (which the <CODE>-s</CODE> operator provides).
This bypasses the I/O loop inside perl that checks for EOF and does all
sorts of processing. I then decided to experiment and found that
sysread is even faster, as you would expect. sysread bypasses all of
Perl's stdio and reads the file from the kernel buffers directly into a
Perl scalar. This is why the slurp code in File::Slurp uses
sysopen/sysread/syswrite. All the rest of the code is just to support
the various options and data passing techniques.</P>
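<P>The core of the technique can be sketched in a few lines (error handling
and all the options stripped away; this is an illustration, not the module's
actual code):</P>

```perl
use strict ;
use warnings ;
use Fcntl qw( O_RDONLY ) ;

# Bare-bones fast scalar slurp: sysopen plus a single sysread of -s
# bytes, bypassing perl's line-oriented I/O machinery entirely.
sub sysread_file {
    my( $file_name ) = @_ ;
    sysopen( my $fh, $file_name, O_RDONLY )
        or die "can't open $file_name: $!" ;
    my $size = -s $fh ;
    sysread( $fh, my $buf, $size ) == $size
        or die "short read on $file_name: $!" ;
    return $buf ;
}

# Demo: create a 1000-byte file and slurp it back (illustration only).
my $file = "sysread_demo.$$" ;
open( my $out, '>', $file ) or die "can't create $file: $!" ;
print $out 'x' x 1000 ;
close( $out ) ;

my $data = sysread_file( $file ) ;
unlink $file ;
```

<P>One system call to read the whole file, and the buffer is sized exactly
right because <CODE>-s</CODE> told us how much to ask for.</P>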
<P>
<H2><A NAME="benchmarks">Benchmarks</A></H2>
<P>Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:</P>
<PRE>
    perl -Ilib extras/slurp_bench.pl</PRE>
<P>If you pass in an argument on the command line, it will be passed to
<CODE>timethese()</CODE> and it will control the duration. It defaults to -2 which
makes each benchmark run for at least 2 seconds of CPU time.</P>
<P>The following numbers are from a run I did on my 300MHz Sparc. You will
most likely get much faster counts on your boxes but the relative speeds
shouldn't change by much. If you see major differences on your
benchmarks, please send me the results and your Perl and OS
versions. Also you can play with the benchmark script and add more slurp
variations or data files.</P>
<P>The rest of this section will be discussing the results of the
benchmarks. You can refer to extras/slurp_bench.pl to see the code for
the individual benchmarks. If the benchmark name starts with cpan_, it
is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
from the new File::Slurp.pm. Those that start with file_contents_ are
from a client's code base. The rest are variations I created to
highlight certain aspects of the benchmarks.</P>
<P>The short and long file data is made like this:</P>
<PRE>
    my @lines = ( 'abc' x 30 . "\n" ) x 100 ;
    my $text = join( '', @lines ) ;</PRE>
<PRE>
    @lines = ( 'abc' x 40 . "\n" ) x 1000 ;
    $text = join( '', @lines ) ;</PRE>
<P>So the short file is 9,100 bytes and the long file is 121,000
bytes.</P>
<P>
<H3><A NAME="scalar slurp of short file">Scalar Slurp of Short File</A></H3>
<PRE>
    file_contents            651/s
    file_contents_no_OO      828/s
    cpan_read_file          1866/s
    cpan_slurp              1934/s
    read_file               2079/s
    new                     2270/s
    new_buf_ref             2403/s
    new_scalar_ref          2415/s
    sysread_file            2572/s</PRE>
<P>
<H3><A NAME="scalar slurp of long file">Scalar Slurp of Long File</A></H3>
<PRE>
    file_contents_no_OO     82.9/s
    file_contents           85.4/s
    cpan_read_file           250/s
    cpan_slurp               257/s
    read_file                323/s
    new                      468/s
    sysread_file             489/s
    new_scalar_ref           766/s
    new_buf_ref              767/s</PRE>
<P>The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall
except for the very simple sysread_file entry, which was added to
highlight the overhead of the more flexible new module (which isn't that
much). The file_contents entries are always the worst since they do a
list slurp and then a join, which is a classic newbie and cargo-culted
style that is extremely slow. Also the OO code in file_contents slows
it down even more (I added the file_contents_no_OO entry to show this).
The two CPAN modules are decent with small files but they are laggards
compared to the new module when the file gets much larger.</P>
464 | <H3><A NAME="list slurp of short file">List Slurp of Short File</A></H3> |
465 | <PRE> |
466 | cpan_read_file 589/s |
467 | cpan_slurp_to_array 620/s |
468 | read_file 824/s |
469 | new_array_ref 824/s |
470 | sysread_file 828/s |
471 | new 829/s |
472 | new_in_anon_array 833/s |
473 | cpan_slurp_to_array_ref 836/s</PRE> |
474 | <P> |
475 | <H3><A NAME="list slurp of long file">List Slurp of Long File</A></H3> |
476 | <PRE> |
477 | cpan_read_file 62.4/s |
478 | cpan_slurp_to_array 62.7/s |
479 | read_file 92.9/s |
480 | sysread_file 94.8/s |
481 | new_array_ref 95.5/s |
482 | new 96.2/s |
483 | cpan_slurp_to_array_ref 96.3/s |
484 | new_in_anon_array 97.2/s</PRE> |
485 | <P>This is perhaps the most interesting result of this benchmark. Five |
486 | different entries have effectively tied for the lead. The logical |
487 | conclusion is that splitting the input into lines is the bounding |
488 | operation, no matter how the file gets slurped. This is the only |
489 | benchmark where the new module isn't the clear winner (in the long file |
490 | entries - it is no worse than a close second in the short file |
491 | entries).</P> |
492 | <P>Note: In the benchmark information for all the spew entries, the extra |
493 | number at the end of each line is how many wallclock seconds the whole |
494 | entry took. The benchmarks were run for at least 2 CPU seconds per |
495 | entry. The unusually large wallclock times will be discussed below.</P> |
<P>
<H3><A NAME="scalar spew of short file">Scalar Spew of Short File</A></H3>
<PRE>
    cpan_write_file    1035/s    38
    print_file         1055/s    41
    syswrite_file      1135/s    44
    new                1519/s     2
    print_join_file    1766/s     2
    new_ref            1900/s     2
    syswrite_file2     2138/s     2</PRE>
<P>
<H3><A NAME="scalar spew of long file">Scalar Spew of Long File</A></H3>
<PRE>
    cpan_write_file     164/s    20
    print_file          211/s    26
    syswrite_file       236/s    25
    print_join_file     277/s     2
    new                 295/s     2
    syswrite_file2      428/s     2
    new_ref             608/s     2</PRE>
<P>In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The <CODE>syswrite_file2</CODE> entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of print.</P>
<P>
<H3><A NAME="list spew of short file">List Spew of Short File</A></H3>
<PRE>
    cpan_write_file     794/s    29
    syswrite_file      1000/s    38
    print_file         1013/s    42
    new                1399/s     2
    print_join_file    1557/s     2</PRE>
<P>
<H3><A NAME="list spew of long file">List Spew of Long File</A></H3>
<PRE>
    cpan_write_file     112/s    12
    print_file          179/s    21
    syswrite_file       181/s    19
    print_join_file     205/s     2
    new                 228/s     2</PRE>
<P>Again, the simple <CODE>print_join_file</CODE> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then it calls
<CODE>print</CODE> on the output list, and that is much slower than passing
<CODE>print</CODE> a single scalar generated by join. The <CODE>print_file</CODE> entry
shows the advantage of directly printing <CODE>@_</CODE> and the
<CODE>print_join_file</CODE> adds the join optimization.</P>
<P>Now about those long wallclock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an <CODE>overwrite</CODE> subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.</P>
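<P>For completeness, here is what the locking side of that looks like - a
sketch of a slurp that takes a shared <CODE>flock</CODE> first. This is not part
of File::Slurp's API; it only works if writers cooperate by holding an
exclusive lock while they rewrite and truncate the file:</P>

```perl
use strict ;
use warnings ;
use Fcntl qw( :flock ) ;

# Slurp under a shared lock: a cooperating writer that holds LOCK_EX
# while rewriting the file can never be seen mid-update by this reader.
sub locked_slurp {
    my( $file_name ) = @_ ;
    open( my $fh, '<', $file_name ) or die "can't open $file_name: $!" ;
    flock( $fh, LOCK_SH ) or die "can't lock $file_name: $!" ;
    my $text = do { local $/ ; <$fh> } ;
    close( $fh ) ;    # closing the handle releases the lock
    return $text ;
}

# Demo file so the snippet is self-contained (illustration only).
my $file = "locked_demo.$$" ;
open( my $out, '>', $file ) or die "can't create $file: $!" ;
print $out "locked data\n" ;
close( $out ) ;

my $text = locked_slurp( $file ) ;
unlink $file ;
```
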
<P>But what about those long times? Well, it is all about the difference
between creating files and overwriting existing ones. The former has to
allocate new inodes (or the equivalent on other file systems), while the
latter can reuse the existing inode. This means the overwrite will save on
disk seeks as well as on cpu time. In fact, when running this benchmark,
I could hear my disk going crazy allocating inodes during the spew
operations. This speedup in both cpu and wallclock is why the new module
always does overwriting when spewing files. It also does the proper
truncate (and this is checked in the tests by spewing shorter files
after longer ones had previously been written). The <CODE>overwrite</CODE>
subroutine is just a typeglob alias to <CODE>write_file</CODE> and is there for
backwards compatibility with the old File::Slurp module.</P>
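<P>A typeglob alias like that takes one line. In this sketch the
<CODE>write_file</CODE> body is a stand-in, but the aliasing line itself is the
same technique the module uses:</P>

```perl
use strict ;
use warnings ;

sub write_file { return "wrote: @_" }	# stand-in for the real write_file

# the typeglob alias: both names now refer to the very same subroutine,
# so there is no wrapper call overhead for the legacy name
*overwrite = \&write_file ;

print overwrite( 'foo' ), "\n" ;
```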
<P>
<H3><A NAME="benchmark conclusion">Benchmark Conclusion</A></H3>
<P>Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. It also uses all the other optimizations, including
<CODE>sysread/syswrite</CODE> and joining output lines. I expect many projects
that use slurping extensively will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API,
they will still get a significant speedup.</P>
<P>
<H2><A NAME="error handling">Error Handling</A></H2>
<P>Slurp subroutines are subject to error conditions such as not being able
to open the file, or I/O errors. How these errors are handled, and what the
caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call <CODE>die()</CODE> or, even
better, <CODE>croak()</CODE>. But sometimes you want the slurp to
<CODE>warn()</CODE>/<CODE>carp()</CODE>, or to allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an <CODE>eval</CODE> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If it is set to 'carp', then carp
will be called. Set it to any other string (use 'quiet' when you want to
be explicit) and no error handler is called; the caller can then check the
error status from the call.</P>
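<P>The dispatch on 'err_mode' can be sketched as a small helper. This is an
illustration, not the module's actual internals; the name
<CODE>handle_error</CODE> is an assumption:</P>

```perl
use strict ;
use warnings ;
use Carp ;

# hypothetical dispatcher for the 'err_mode' option
sub handle_error {

	my( $err_mode, $msg ) = @_ ;

	croak $msg if $err_mode eq 'croak' ;
	carp $msg  if $err_mode eq 'carp' ;

	return ;	# 'quiet' (or anything else): caller checks the status
}

# caller side: quiet mode just returns, so execution continues
my $status = eval { handle_error( 'quiet', "can't open file" ) ; 1 } ;
```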
<P><CODE>write_file()</CODE> doesn't use its return value for data, so it can return a
false status value in-band to mark an error. <CODE>read_file()</CODE> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array, so a bare
<CODE>return</CODE> would work there. But if you slurp in lines by calling it in
list context, a bare <CODE>return</CODE> will return an empty list, which is the
same value the caller would get from an existing but empty file. So now,
<CODE>read_file()</CODE> will do something I normally strongly advocate against,
i.e., return an explicit <CODE>undef</CODE> value. In scalar context this
still signals an error, and in list context the first returned value
will be <CODE>undef</CODE>, which is not legal data for the first element. So
the list context caller also gets an error status it can detect:</P>
<PRE>
	my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
	your_handle_error( "$file_name can't be read\n" ) unless
			@lines && defined $lines[0] ;</PRE>
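<P>Why the explicit <CODE>undef</CODE> is detectable in both contexts can be seen
in a self-contained sketch. This toy uses a buffered <CODE>open</CODE> rather
than the module's <CODE>sysread</CODE> loop, and the name
<CODE>read_file_quiet</CODE> is illustrative:</P>

```perl
use strict ;
use warnings ;

# illustrative quiet-mode slurp: an explicit undef marks an error
sub read_file_quiet {

	my( $file_name ) = @_ ;

	open( my $fh, '<', $file_name ) or return undef ;

	local $/ ;		# slurp mode
	my $text = <$fh> ;

	return wantarray ? split( m|(?<=\n)|, $text ) : $text ;
}

# scalar context: the error shows up as a plain undef
my $text = read_file_quiet( 'no_such_file' ) ;

# list context: the error shows up as a one-element list holding undef,
# which no real (even empty) file can produce
my @lines = read_file_quiet( 'no_such_file' ) ;
my $ok = @lines && defined $lines[0] ;
```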
<P>
<H2><A NAME="file::fastslurp">File::FastSlurp</A></H2>
<P>Here is the code for a minimal File::FastSlurp module. It supports the
most common options and is easy to inline into your own code:</P>
<PRE>
	use Carp ;
	use Fcntl qw( :DEFAULT ) ;

	sub read_file {

		my( $file_name, %args ) = @_ ;

		my $buf ;
		my $buf_ref = $args{'buf_ref'} || \$buf ;

		my $mode = O_RDONLY ;
		$mode |= O_BINARY if $args{'binmode'} ;

		local( *FH ) ;
		unless( sysopen( FH, $file_name, $mode ) ) {
			carp "Can't open $file_name: $!" ;
			return ;
		}

		my $size_left = -s FH ;

		while( $size_left > 0 ) {

			my $read_cnt = sysread( FH, ${$buf_ref},
					$size_left, length ${$buf_ref} ) ;

			unless( defined $read_cnt ) {

				carp "read error in file $file_name: $!" ;
				last ;
			}

			last if $read_cnt == 0 ;	# premature EOF

			$size_left -= $read_cnt ;
		}

	# handle void context (return scalar by buffer reference)

		return unless defined wantarray ;

	# handle list context

		return split m|(?<=$/)|, ${$buf_ref} if wantarray ;

	# handle scalar context

		return ${$buf_ref} ;
	}</PRE>
<PRE>
	sub write_file {

		my $file_name = shift ;

		my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
		my $buf = join '', @_ ;

		my $mode = O_WRONLY | O_CREAT ;
		$mode |= O_BINARY if $args->{'binmode'} ;
		$mode |= O_APPEND if $args->{'append'} ;

		local( *FH ) ;
		unless( sysopen( FH, $file_name, $mode ) ) {
			carp "Can't open $file_name: $!" ;
			return ;
		}

		my $size_left = length( $buf ) ;
		my $offset = 0 ;

		while( $size_left > 0 ) {

			my $write_cnt = syswrite( FH, $buf,
					$size_left, $offset ) ;

			unless( defined $write_cnt ) {

				carp "write error in file $file_name: $!" ;
				last ;
			}

			$size_left -= $write_cnt ;
			$offset += $write_cnt ;
		}

	# drop leftover data from a longer previous version of the file

		truncate( FH, $offset ) unless $args->{'append'} ;

		return ;
	}</PRE>
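<P>The <CODE>syswrite</CODE> loop above matters because <CODE>syswrite</CODE> may write
fewer bytes than requested, in which case the offset and remaining count
must be advanced and the call retried. A standalone sketch of just that
loop, using File::Temp for a scratch file (an assumption for the demo):</P>

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

my( $fh, $file ) = tempfile( UNLINK => 1 ) ;

my $buf = 'x' x 100_000 ;
my $size_left = length $buf ;
my $offset = 0 ;

# keep calling syswrite until every byte is out; a single call is
# allowed to write a short count, so we track offset and size_left
while( $size_left > 0 ) {

	my $write_cnt = syswrite( $fh, $buf, $size_left, $offset ) ;
	defined $write_cnt or die "write error: $!" ;

	$size_left -= $write_cnt ;
	$offset += $write_cnt ;
}

close $fh ;
```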
<P>
<H2><A NAME="slurping in perl 6">Slurping in Perl 6</A></H2>
<P>As usual with Perl 6, much of the work in this article will be put out to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles,
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported, so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and will bypass the stdio subsystem, since it can use the slurp
property to trigger a call to special code. Otherwise, some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?</P>
<P>
<H2><A NAME="in summary">In Summary</A></H2>
<P>We have compared classic line-by-line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still be careful not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN, where you don't
know how much data you will read in. But slurping megabyte-sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.</P>
<P>
<H2><A NAME="acknowledgements">Acknowledgements</A></H2>

</BODY>

</HTML>