extras/new_text

Somewhere along the line, I learned about a way to slurp files faster
than by setting $/ to undef. The method is very simple: you do a single
read call with the size of the file (which the -s operator provides).
This bypasses the I/O loop inside perl that checks for EOF and does all
sorts of processing. I then decided to experiment and found that
sysread is even faster, as you would expect. sysread bypasses all of
Perl's stdio and reads the file from the kernel buffers directly into a
Perl scalar. This is why the slurp code in File::Slurp uses
sysopen/sysread/syswrite. All the rest of the code is just there to
support the various options and data-passing techniques.
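
The three approaches described above can be sketched roughly as
follows. This is a minimal illustration and not the actual File::Slurp
code: the sub names are made up for this example, error handling is
reduced to die, and the loop around sysread is an assumption added
because sysread may return fewer bytes than requested.

```perl
use strict;
use warnings;
use Fcntl qw( O_RDONLY );

# Method 1: the traditional slurp. With $/ undefined, <$fh> reads to EOF.
sub slurp_undef_rs {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    local $/;
    my $text = <$fh>;
    return $text;
}

# Method 2: a single buffered read() call, sized by the -s operator.
sub slurp_read {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    my $size = -s $fh;
    my $text = '';
    my $got  = read( $fh, $text, $size );
    die "Can't read $path: $!" unless defined $got;
    return $text;
}

# Method 3: sysopen plus sysread, which skips Perl's buffered I/O.
sub slurp_sysread {
    my ($path) = @_;
    sysopen( my $fh, $path, O_RDONLY ) or die "Can't open $path: $!";
    my $size   = -s $fh;
    my $text   = '';
    my $offset = 0;

    # sysread may return a short count, so keep going until done.
    while ( $offset < $size ) {
        my $got = sysread( $fh, $text, $size - $offset, $offset );
        die "Can't sysread $path: $!" unless defined $got;
        last if $got == 0;    # unexpected EOF
        $offset += $got;
    }
    return $text;
}
```

All three subs return the same string for a given file; only the I/O
path they take differs.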

Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:

        perl -Ilib extras/slurp_bench.pl

If you pass an argument on the command line, it is passed to
timethese() and controls the duration. It defaults to -2, which makes
each benchmark run for at least 2 seconds of CPU time.
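
As a rough sketch of how a duration argument can reach timethese(), the
following uses the core Benchmark module. The sub names
(bench_duration, run_bench) and the two benchmarked snippets are
placeholders invented for this example; the real tests live in
extras/slurp_bench.pl and may be wired up differently.

```perl
use strict;
use warnings;
use Benchmark qw( timethese );

# Pick the duration: the first argument if given, otherwise -2.
sub bench_duration {
    my ($arg) = @_;
    return defined $arg ? $arg : -2;
}

# Run two placeholder tests and return the hash ref from timethese().
sub run_bench {
    my ($duration) = @_;
    my $text = "some line of text\n" x 1000;

    return timethese(
        $duration,
        {
            copy   => sub { my $copy = $text;  return length $copy },
            by_ref => sub { my $ref  = \$text; return length $$ref },
        }
    );
}
```

A script would call this as run_bench( bench_duration(@ARGV) ). A
negative value asks for at least that many CPU seconds per test, while a
positive value is taken as an iteration count.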

The following numbers are from a run I did on my 300MHz SPARC. You will
most likely get much faster counts on your boxes, but the relative
speeds shouldn't change by much. If you see major differences in your
benchmarks, please send me the results and your Perl and OS
versions. You can also play with the benchmark script and add more slurp
variations or data files.

The rest of this section discusses the results of the benchmarks. You
can refer to extras/slurp_bench.pl to see the code for the individual
benchmarks. If the benchmark name starts with cpan_, it is either from
Slurp.pm or File::Slurp.pm. Those starting with new are from the new
File::Slurp.pm. Those that start with file_contents are from a client's
code base. The rest are variations I created to highlight certain
aspects of the benchmarks.

The short and long file data is made like this:

        my @lines = ( 'abc' x 30 . "\n")  x 100 ;
        my $text = join( '', @lines ) ;

        @lines = ( 'abc' x 40 . "\n")  x 1000 ;
        $text = join( '', @lines ) ;

So the short file is 9,100 bytes and the long file is 121,000
bytes.
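
The sizes follow from operator precedence: x binds tighter than the
concatenation dot, so each line is 'abc' repeated and then a single
newline. The short script below checks that arithmetic. The variable
names are chosen for this example only.

```perl
use strict;
use warnings;

# 'abc' x 30 is 90 bytes; adding "\n" gives a 91-byte line.
my $short_line  = 'abc' x 30 . "\n";
my @short_lines = ($short_line) x 100;
my $short_text  = join( '', @short_lines );

# 'abc' x 40 is 120 bytes; adding "\n" gives a 121-byte line.
my $long_line  = 'abc' x 40 . "\n";
my @long_lines = ($long_line) x 1000;
my $long_text  = join( '', @long_lines );

printf "short: %d lines of %d bytes = %d bytes\n",
    scalar(@short_lines), length($short_line), length($short_text);
printf "long:  %d lines of %d bytes = %d bytes\n",
    scalar(@long_lines), length($long_line), length($long_text);
```

This prints 100 lines of 91 bytes (9100 bytes) for the short file and
1000 lines of 121 bytes (121000 bytes) for the long one.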

=head3 Scalar Slurp of Short File

        file_contents        651/s
        file_contents_no_OO  828/s
        cpan_read_file      1866/s
        cpan_slurp          1934/s
        read_file           2079/s
        new                 2270/s
        new_buf_ref         2403/s
        new_scalar_ref      2415/s
        sysread_file        2572/s

=head3 Scalar Slurp of Long File

        file_contents_no_OO 82.9/s
        file_contents       85.4/s
        cpan_read_file       250/s
        cpan_slurp           257/s
        read_file            323/s
        new                  468/s
        sysread_file         489/s
        new_scalar_ref       766/s
        new_buf_ref          767/s

The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference: new_scalar_ref
is about 6% faster than new on the short file and about 64% faster on
the long one. The time for the extra buffer copy can add up. The new
module came out on top overall, except for the very simple sysread_file
entry, which was added to highlight the overhead of the more flexible
new module (which isn't that much). The file_contents entries are always
the worst since they do a list slurp and then a join, a classic newbie
and cargo-culted style that is extremely slow. On the short file, the OO
code in file_contents slows it down even more (I added the
file_contents_no_OO entry to show this); on the long file the two are
about even. The two CPAN modules are decent with small files, but they
are laggards compared to the new module when the file gets much larger.
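
To make the difference between these styles concrete, here is a minimal
sketch of the three ways of handing back the data that the paragraph
above compares. The sub names are hypothetical and this is not the code
from the benchmark script or from File::Slurp; it only illustrates
where the extra copy and the join come from.

```perl
use strict;
use warnings;

# Return by value: the caller's assignment copies the whole buffer.
sub slurp_by_value {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    local $/;
    my $buf = <$fh>;
    return $buf;
}

# Return a reference: only the reference is copied, not the data.
sub slurp_by_ref {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    local $/;
    my $buf = <$fh>;
    return \$buf;
}

# List slurp then join: every line becomes its own scalar first,
# and join() then has to glue them all back together.
sub slurp_list_join {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    my @lines = <$fh>;
    return join( '', @lines );
}
```

The caller of slurp_by_ref dereferences the result when it needs the
text, for example length( ${ slurp_by_ref($path) } ).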

=head3 List Slurp of Short File

        cpan_read_file          589/s
        cpan_slurp_to_array     620/s
        read_file               824/s
        new_array_ref           824/s
        sysread_file            828/s
        new                     829/s
        new_in_anon_array       833/s
        cpan_slurp_to_array_ref 836/s

=head3 List Slurp of Long File

        cpan_read_file          62.4/s
        cpan_slurp_to_array     62.7/s
        read_file               92.9/s
        sysread_file            94.8/s
        new_array_ref           95.5/s
        new                     96.2/s
        cpan_slurp_to_array_ref 96.3/s
        new_in_anon_array       97.2/s

=head3 Scalar Spew of Short File

        cpan_write_file 1035/s
        print_file      1055/s
        syswrite_file   1135/s
        new             1519/s
        print_join_file 1766/s
        new_ref         1900/s
        syswrite_file2  2138/s

=head3 Scalar Spew of Long File

        cpan_write_file 164/s   20
        print_file      211/s   26
        syswrite_file   236/s   25
        print_join_file 277/s   2
        new             295/s   2
        syswrite_file2  428/s   25
        new_ref         608/s   2

=head3 List Spew of Short File

        cpan_write_file  794/s
        syswrite_file   1000/s
        print_file      1013/s
        new             1399/s
        print_join_file 1557/s

=head3 List Spew of Long File

        cpan_write_file 112/s   12
        print_file      179/s   21
        syswrite_file   181/s   19
        print_join_file 205/s   2
        new             228/s   2