Commit | Line | Data |
d54256af |
1 | |
2 | =head1 NAME |
3 | |
319fab50 |
4 | IO::Compress::FAQ -- Frequently Asked Questions about IO::Compress |
d54256af |
5 | |
6 | =head1 DESCRIPTION |
7 | |
8 | Common questions answered. |
9 | |
10 | =head2 Compatibility with Unix compress/uncompress. |
11 | |
319fab50 |
12 | Although C<Compress::Zlib> has a pair of functions called C<compress> and |
13 | C<uncompress>, they are I<not> related to the Unix programs of the same |
14 | name. The C<Compress::Zlib> module is not compatible with Unix |
15 | C<compress>. |
d54256af |
16 | |
17 | If you have the C<uncompress> program available, you can use this to read |
18 | compressed files |
19 | |
20 | open F, "uncompress -c $filename |"; |
21 | while (<F>) |
22 | { |
23 | ... |
24 | |
25 | Alternatively, if you have the C<gunzip> program available, you can use |
26 | this to read compressed files |
27 | |
28 | open F, "gunzip -c $filename |"; |
29 | while (<F>) |
30 | { |
31 | ... |
32 | |
33 | and this to write compressed files, if you have the C<compress> program |
34 | available |
35 | |
36 | open F, "| compress -c >$filename"; |
37 | print F "data"; |
38 | ... |
39 | close F ; |
40 | |
41 | =head2 Accessing .tar.Z files |
42 | |
319fab50 |
43 | The C<Archive::Tar> module can optionally use C<Compress::Zlib> (via the |
44 | C<IO::Zlib> module) to access tar files that have been compressed with |
45 | C<gzip>. Unfortunately tar files compressed with the Unix C<compress> |
46 | utility cannot be read by C<Compress::Zlib> and so cannot be directly |
47 | accessed by C<Archive::Tar>. |
d54256af |
48 | |
319fab50 |
49 | If the C<uncompress> or C<gunzip> programs are available, you can use one |
50 | of these workarounds to read C<.tar.Z> files from C<Archive::Tar> |
d54256af |
51 | |
52 | Firstly with C<uncompress> |
53 | |
54 | use strict; |
55 | use warnings; |
56 | use Archive::Tar; |
57 | |
58 | open F, "uncompress -c $filename |"; |
59 | my $tar = Archive::Tar->new(*F); |
60 | ... |
61 | |
62 | and this with C<gunzip> |
63 | |
64 | use strict; |
65 | use warnings; |
66 | use Archive::Tar; |
67 | |
68 | open F, "gunzip -c $filename |"; |
69 | my $tar = Archive::Tar->new(*F); |
70 | ... |
71 | |
72 | Similarly, if the C<compress> program is available, you can use this to |
73 | write a C<.tar.Z> file |
74 | |
75 | use strict; |
76 | use warnings; |
77 | use Archive::Tar; |
78 | use IO::File; |
79 | |
80 | my $fh = new IO::File "| compress -c >$filename"; |
81 | my $tar = Archive::Tar->new(); |
82 | ... |
83 | $tar->write($fh); |
84 | $fh->close ; |
85 | |
86 | =head2 Accessing Zip Files |
87 | |
88 | This module provides support for reading/writing zip files using the |
89 | C<IO::Compress::Zip> and C<IO::Uncompress::Unzip> modules. |
90 | |
91 | The primary focus of the C<IO::Compress::Zip> and C<IO::Uncompress::Unzip> |
92 | modules is to provide an C<IO::File> compatible streaming read/write |
93 | interface to zip files/buffers. They are not fully fledged archivers. If |
94 | you are looking for an archiver check out the C<Archive::Zip> module. You |
95 | can find it on CPAN at |
96 | |
97 | http://www.cpan.org/modules/by-module/Archive/Archive-Zip-*.tar.gz |
98 | |
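As a rough illustration of the streaming interface described above, the sketch below writes a one-member zip archive into an in-memory buffer and reads it back as if it were a filehandle. The member name C<hello.txt> and the variable names are made up for the example; it is a minimal sketch, not a replacement for a real archiver.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw($UnzipError);

# Write a single-member zip archive into an in-memory buffer.
# 'hello.txt' is a hypothetical member name chosen for this sketch.
my $text = "first line\nsecond line\n";
my $zipped;
zip \$text => \$zipped, Name => 'hello.txt'
    or die "zip failed: $ZipError\n";

# Read it back through the filehandle-style interface.
my $unzip = IO::Uncompress::Unzip->new(\$zipped)
    or die "unzip failed: $UnzipError\n";

# The member name comes from the zip local header.
my $member = $unzip->getHeaderInfo()->{Name};

my $line_count = 0;
while (<$unzip>)
{
    ++ $line_count;
}
$unzip->close();

print "$member: $line_count lines\n";
```

The same calls accept a filename or an open filehandle in place of the buffer references.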
99 | =head2 Compressed files and Net::FTP |
100 | |
101 | The C<Net::FTP> module provides two low-level methods called C<stor> and |
102 | C<retr> that both return filehandles. These filehandles can be used with the |
103 | C<IO::Compress/Uncompress> modules to compress or uncompress files read |
104 | from or written to an FTP Server on the fly, without having to create a |
105 | temporary file. |
106 | |
107 | Firstly, here is code that uses C<retr> to uncompress a file as it is |
108 | read from the FTP Server. |
109 | |
110 | use Net::FTP; |
111 | use IO::Uncompress::Gunzip qw(:all); |
112 | |
113 | my $ftp = new Net::FTP ... |
114 | |
115 | my $retr_fh = $ftp->retr($compressed_filename); |
116 | gunzip $retr_fh => $outFilename, AutoClose => 1 |
117 | or die "Cannot uncompress '$compressed_filename': $GunzipError\n"; |
118 | |
119 | and this to compress a file as it is written to the FTP Server |
120 | |
121 | use Net::FTP; |
122 | use IO::Compress::Gzip qw(:all); |
123 | |
124 | my $stor_fh = $ftp->stor($filename); |
125 | gzip $filename => $stor_fh, AutoClose => 1 |
126 | or die "Cannot compress '$filename': $GzipError\n"; |
127 | |
128 | =head2 How do I recompress using a different compression? |
129 | |
130 | This is easier than you might expect if you realise that all the |
131 | C<IO::Compress::*> objects are derived from C<IO::File> and that all the |
132 | C<IO::Uncompress::*> modules can read from an C<IO::File> filehandle. |
133 | |
134 | So, for example, say you have a file compressed with gzip that you want to |
135 | recompress with bzip2. Here is all that is needed to carry out the |
136 | recompression. |
137 | |
138 | use IO::Uncompress::Gunzip ':all'; |
139 | use IO::Compress::Bzip2 ':all'; |
140 | |
141 | my $gzipFile = "somefile.gz"; |
142 | my $bzipFile = "somefile.bz2"; |
143 | |
144 | my $gunzip = new IO::Uncompress::Gunzip $gzipFile |
145 | or die "Cannot gunzip $gzipFile: $GunzipError\n" ; |
146 | |
147 | bzip2 $gunzip => $bzipFile |
148 | or die "Cannot bzip2 to $bzipFile: $Bzip2Error\n" ; |
149 | |
150 | Note, there is a limitation of this technique. Some compression file |
151 | formats store extra information along with the compressed data payload. For |
152 | example, gzip can optionally store the original filename and Zip stores a |
153 | lot of information about the original file. If the original compressed file |
154 | contains any of this extra information, it will not be transferred to the |
155 | new compressed file using the technique above. |
156 | |
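If the extra header information matters, one possible approach is to capture it before recompressing. The sketch below assumes an in-memory gzip stream whose header records a made-up filename, C<report.txt>; it reads that name with C<getHeaderInfo> and then recompresses the payload with bzip2. Where the saved name ends up afterwards is left to the application.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Bzip2    qw(bzip2 $Bzip2Error);

# Build a gzip stream in memory that records an original filename.
# 'report.txt' is a hypothetical name used only for this sketch.
my $text = "some payload\n";
my $gz;
gzip \$text => \$gz, Name => 'report.txt'
    or die "gzip failed: $GzipError\n";

my $gunzip = IO::Uncompress::Gunzip->new(\$gz)
    or die "Cannot gunzip: $GunzipError\n";

# Capture the optional header field before it is discarded.
my $originalName = $gunzip->getHeaderInfo()->{Name};

# Recompress; only the payload travels to the bzip2 stream.
my $bz;
bzip2 $gunzip => \$bz
    or die "Cannot bzip2: $Bzip2Error\n";

print "gzip header named the file '$originalName'\n";
```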
157 | =head2 Apache::GZip Revisited |
158 | |
159 | Below is a mod_perl Apache compression module, called C<Apache::GZip>, |
160 | taken from |
161 | F<http://perl.apache.org/docs/tutorials/tips/mod_perl_tricks/mod_perl_tricks.html#On_the_Fly_Compression> |
162 | |
163 | package Apache::GZip; |
164 | #File: Apache::GZip.pm |
165 | |
166 | use strict vars; |
167 | use Apache::Constants ':common'; |
168 | use Compress::Zlib; |
169 | use IO::File; |
170 | use constant GZIP_MAGIC => 0x1f8b; |
171 | use constant OS_MAGIC => 0x03; |
172 | |
173 | sub handler { |
174 | my $r = shift; |
175 | my ($fh,$gz); |
176 | my $file = $r->filename; |
177 | return DECLINED unless $fh=IO::File->new($file); |
178 | $r->header_out('Content-Encoding'=>'gzip'); |
179 | $r->send_http_header; |
180 | return OK if $r->header_only; |
181 | |
182 | tie *STDOUT,'Apache::GZip',$r; |
183 | print($_) while <$fh>; |
184 | untie *STDOUT; |
185 | return OK; |
186 | } |
187 | |
188 | sub TIEHANDLE { |
189 | my($class,$r) = @_; |
190 | # initialize a deflation stream |
191 | my $d = deflateInit(-WindowBits=>-MAX_WBITS()) || return undef; |
192 | |
193 | # gzip header -- don't ask how I found out |
194 | $r->print(pack("nccVcc",GZIP_MAGIC,Z_DEFLATED,0,time(),0,OS_MAGIC)); |
195 | |
196 | return bless { r => $r, |
197 | crc => crc32(undef), |
198 | d => $d, |
199 | l => 0 |
200 | },$class; |
201 | } |
202 | |
203 | sub PRINT { |
204 | my $self = shift; |
205 | foreach (@_) { |
206 | # deflate the data |
207 | my $data = $self->{d}->deflate($_); |
208 | $self->{r}->print($data); |
209 | # keep track of its length and crc |
210 | $self->{l} += length($_); |
211 | $self->{crc} = crc32($_,$self->{crc}); |
212 | } |
213 | } |
214 | |
215 | sub DESTROY { |
216 | my $self = shift; |
217 | |
218 | # flush the output buffers |
219 | my $data = $self->{d}->flush; |
220 | $self->{r}->print($data); |
221 | |
222 | # print the CRC and the total length (uncompressed) |
223 | $self->{r}->print(pack("LL",@{$self}{qw/crc l/})); |
224 | } |
225 | |
226 | 1; |
227 | |
228 | Here's the Apache configuration entry you'll need to make use of it. Once |
229 | set it will result in everything in the /compressed directory will be |
230 | compressed automagically. |
231 | |
232 | <Location /compressed> |
233 | SetHandler perl-script |
234 | PerlHandler Apache::GZip |
235 | </Location> |
236 | |
237 | Although at first sight there seems to be quite a lot going on in |
238 | C<Apache::GZip>, you could sum up what the code was doing as follows -- |
239 | read the contents of the file in C<< $r->filename >>, compress it and write |
240 | the compressed data to standard output. That's all. |
241 | |
242 | This code has to jump through a few hoops to achieve this because |
243 | |
244 | =over |
245 | |
246 | =item 1. |
247 | |
248 | The gzip support in C<Compress::Zlib> version 1.x can only work with a real |
249 | filesystem filehandle. The filehandles used by Apache modules are not |
250 | associated with the filesystem. |
251 | |
252 | =item 2. |
253 | |
254 | That means all the gzip support has to be done by hand - in this case by |
255 | creating a tied filehandle to deal with creating the gzip header and |
256 | trailer. |
257 | |
258 | =back |
259 | |
260 | C<IO::Compress::Gzip> doesn't have that filehandle limitation (this was one |
261 | of the reasons for writing it in the first place). So if |
262 | C<IO::Compress::Gzip> is used instead of C<Compress::Zlib> the whole tied |
263 | filehandle code can be removed. Here is the rewritten code. |
264 | |
265 | package Apache::GZip; |
266 | |
267 | use strict vars; |
268 | use Apache::Constants ':common'; |
269 | use IO::Compress::Gzip; |
270 | use IO::File; |
271 | |
272 | sub handler { |
273 | my $r = shift; |
274 | my ($fh,$gz); |
275 | my $file = $r->filename; |
276 | return DECLINED unless $fh=IO::File->new($file); |
277 | $r->header_out('Content-Encoding'=>'gzip'); |
278 | $r->send_http_header; |
279 | return OK if $r->header_only; |
280 | |
281 | $gz = new IO::Compress::Gzip '-', Minimal => 1 |
282 | or return DECLINED ; |
283 | |
284 | print $gz $_ while <$fh>; |
285 | |
286 | return OK; |
287 | } |
288 | |
289 | or even more succinctly, like this, using a one-shot gzip |
290 | |
291 | package Apache::GZip; |
292 | |
293 | use strict vars; |
294 | use Apache::Constants ':common'; |
295 | use IO::Compress::Gzip qw(gzip); |
296 | |
297 | sub handler { |
298 | my $r = shift; |
299 | $r->header_out('Content-Encoding'=>'gzip'); |
300 | $r->send_http_header; |
301 | return OK if $r->header_only; |
302 | |
303 | gzip $r->filename => '-', Minimal => 1 |
304 | or return DECLINED ; |
305 | |
306 | return OK; |
307 | } |
308 | |
309 | 1; |
310 | |
311 | The use of one-shot C<gzip> above just reads from C<< $r->filename >> and |
312 | writes the compressed data to standard output. |
313 | |
314 | Note the use of the C<Minimal> option in the code above. When using gzip |
315 | for Content-Encoding you should I<always> use this option. In the example |
316 | above it will prevent the filename being included in the gzip header and |
317 | make the size of the gzip data stream slightly smaller. |
318 | |
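To see the effect of C<Minimal> in isolation, the following sketch compresses the same buffer twice, once with an explicit C<Name> in the header and once with C<Minimal>, and prints both sizes. The sample body and the name C<index.html> are invented for the illustration.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical response body used only for this comparison.
my $body = "<html><body>hello</body></html>\n";

# This header carries an explicit original filename.
my $named;
gzip \$body => \$named, Name => 'index.html'
    or die "gzip failed: $GzipError\n";

# Minimal => 1 leaves the optional header fields out.
my $minimal;
gzip \$body => \$minimal, Minimal => 1
    or die "gzip failed: $GzipError\n";

printf "with name: %d bytes, minimal: %d bytes\n",
    length($named), length($minimal);
```

The saving is only a handful of bytes per response, but it also keeps the filename out of the data sent to the client.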
319 | =head2 Using C<InputLength> to uncompress data embedded in a larger file/buffer. |
320 | |
321 | A fairly common use-case is where compressed data is embedded in a larger |
322 | file/buffer and you want to read both. |
323 | |
324 | As an example consider the structure of a zip file. This is a well-defined |
325 | file format that mixes both compressed and uncompressed sections of data in |
326 | a single file. |
327 | |
328 | For the purposes of this discussion you can think of a zip file as a sequence |
329 | of compressed data streams, each of which is prefixed by an uncompressed |
330 | local header. The local header contains information about the compressed |
331 | data stream, including the name of the compressed file and, in particular, |
332 | the length of the compressed data stream. |
333 | |
334 | To illustrate how to use C<InputLength> here is a script that walks a zip |
335 | file and prints out how many lines are in each compressed file (if you |
336 | intend to write code that walks through a zip file for real, see |
337 | L<IO::Uncompress::Unzip/"Walking through a zip file"> ) |
338 | |
339 | use strict; |
340 | use warnings; |
341 | |
342 | use IO::File; |
343 | use IO::Uncompress::RawInflate qw(:all); |
344 | |
345 | use constant ZIP_LOCAL_HDR_SIG => 0x04034b50; |
346 | use constant ZIP_LOCAL_HDR_LENGTH => 30; |
347 | |
348 | my $file = $ARGV[0] ; |
349 | |
350 | my $fh = new IO::File "<$file" |
351 | or die "Cannot open '$file': $!\n"; |
352 | |
353 | while (1) |
354 | { |
355 | my $sig; |
356 | my $buffer; |
357 | |
358 | my $x ; |
359 | ($x = $fh->read($buffer, ZIP_LOCAL_HDR_LENGTH)) == ZIP_LOCAL_HDR_LENGTH |
360 | or die "Truncated file: $!\n"; |
361 | |
362 | my $signature = unpack ("V", substr($buffer, 0, 4)); |
363 | |
364 | last unless $signature == ZIP_LOCAL_HDR_SIG; |
365 | |
366 | # Read Local Header |
367 | my $gpFlag = unpack ("v", substr($buffer, 6, 2)); |
368 | my $compressedMethod = unpack ("v", substr($buffer, 8, 2)); |
369 | my $compressedLength = unpack ("V", substr($buffer, 18, 4)); |
370 | my $uncompressedLength = unpack ("V", substr($buffer, 22, 4)); |
371 | my $filename_length = unpack ("v", substr($buffer, 26, 2)); |
372 | my $extra_length = unpack ("v", substr($buffer, 28, 2)); |
373 | |
374 | my $filename ; |
375 | $fh->read($filename, $filename_length) == $filename_length |
376 | or die "Truncated file\n"; |
377 | |
378 | $fh->read($buffer, $extra_length) == $extra_length |
379 | or die "Truncated file\n"; |
380 | |
381 | if ($compressedMethod != 8 && $compressedMethod != 0) |
382 | { |
383 | warn "Skipping file '$filename' - not deflated $compressedMethod\n"; |
384 | $fh->read($buffer, $compressedLength) == $compressedLength |
385 | or die "Truncated file\n"; |
386 | next; |
387 | } |
388 | |
389 | if ($compressedMethod == 0 && ($gpFlag & 8) == 8) |
390 | { |
391 | die "Streamed Stored not supported for '$filename'\n"; |
392 | } |
393 | |
394 | next if $compressedLength == 0; |
395 | |
396 | # Done reading the Local Header |
397 | |
398 | my $inf = new IO::Uncompress::RawInflate $fh, |
399 | Transparent => 1, |
400 | InputLength => $compressedLength |
401 | or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ; |
402 | |
403 | my $line_count = 0; |
404 | |
405 | while (<$inf>) |
406 | { |
407 | ++ $line_count; |
408 | } |
409 | |
410 | print "$filename: $line_count\n"; |
411 | } |
412 | |
413 | The majority of the code above is concerned with reading the zip local |
414 | header data. The code that I want to focus on is at the bottom. |
415 | |
416 | while (1) { |
417 | |
418 | # read local zip header data |
419 | # get $filename |
420 | # get $compressedLength |
421 | |
422 | my $inf = new IO::Uncompress::RawInflate $fh, |
423 | Transparent => 1, |
424 | InputLength => $compressedLength |
425 | or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ; |
426 | |
427 | my $line_count = 0; |
428 | |
429 | while (<$inf>) |
430 | { |
431 | ++ $line_count; |
432 | } |
433 | |
434 | print "$filename: $line_count\n"; |
435 | } |
436 | |
437 | The call to C<IO::Uncompress::RawInflate> creates a new filehandle C<$inf> |
438 | that can be used to read from the parent filehandle C<$fh>, uncompressing |
439 | it as it goes. The use of the C<InputLength> option will guarantee that |
440 | I<at most> C<$compressedLength> bytes of compressed data will be read from |
441 | the C<$fh> filehandle (The only exception is for an error case like a |
442 | truncated file or a corrupt data stream). |
443 | |
444 | This means that once RawInflate is finished C<$fh> will be left at the |
445 | byte directly after the compressed data stream. |
446 | |
447 | Now consider what the code looks like without C<InputLength> |
448 | |
449 | while (1) { |
450 | |
451 | # read local zip header data |
452 | # get $filename |
453 | # get $compressedLength |
454 | |
455 | # read all the compressed data into $data |
456 | read($fh, my $data, $compressedLength); |
457 | |
458 | my $inf = new IO::Uncompress::RawInflate \$data, |
459 | Transparent => 1 |
460 | or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ; |
461 | |
462 | my $line_count = 0; |
463 | |
464 | while (<$inf>) |
465 | { |
466 | ++ $line_count; |
467 | } |
468 | |
469 | print "$filename: $line_count\n"; |
470 | } |
471 | |
472 | The difference here is the addition of the temporary variable C<$data>. |
473 | This is used to store a copy of the compressed data while it is being |
474 | uncompressed. |
475 | |
476 | If you know that C<$compressedLength> isn't that big then using temporary |
477 | storage won't be a problem. But if C<$compressedLength> is very large or |
478 | you are writing an application that other people will use, and so have no |
479 | idea how big C<$compressedLength> will be, it could be an issue. |
480 | |
481 | Using C<InputLength> avoids the use of temporary storage and means the |
482 | application can cope with large compressed data streams. |
483 | |
484 | One final point -- obviously C<InputLength> can only be used whenever you |
485 | know the length of the compressed data beforehand, like here with a zip |
486 | file. |
487 | |
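The positioning behaviour described above can be checked with a much smaller container than a zip file. The sketch below invents a toy layout (a 4-byte little-endian length, a raw deflate stream, then the literal string C<TRAILER>), writes it to a temporary file, and confirms that after reading through C<InputLength> the parent filehandle sits at the start of the trailer. The layout and all names are assumptions made for this example only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::RawDeflate   qw(rawdeflate $RawDeflateError);
use IO::Uncompress::RawInflate qw($RawInflateError);

my $payload = "line one\nline two\nline three\n";
my $compressed;
rawdeflate \$payload => \$compressed
    or die "rawdeflate failed: $RawDeflateError\n";

# Hypothetical container: length prefix, compressed data, trailer.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} pack("V", length $compressed), $compressed, "TRAILER";
close $tmp or die "Cannot close '$path': $!\n";

open my $fh, '<', $path or die "Cannot open '$path': $!\n";
binmode $fh;

# Read the length prefix, as the zip walker reads its local header.
read($fh, my $hdr, 4) == 4 or die "Truncated file\n";
my $compressedLength = unpack("V", $hdr);

my $inf = IO::Uncompress::RawInflate->new($fh,
                InputLength => $compressedLength)
    or die "Cannot uncompress: $RawInflateError\n";

my $line_count = 0;
while (<$inf>)
{
    ++ $line_count;
}
$inf->close();

# $fh should now be positioned directly after the compressed data.
read($fh, my $rest, 7);
print "lines: $line_count, next bytes: $rest\n";
close $fh;
```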
488 | =head1 SEE ALSO |
489 | |
490 | L<Compress::Zlib>, L<IO::Compress::Gzip>, L<IO::Uncompress::Gunzip>, L<IO::Compress::Deflate>, L<IO::Uncompress::Inflate>, L<IO::Compress::RawDeflate>, L<IO::Uncompress::RawInflate>, L<IO::Compress::Bzip2>, L<IO::Uncompress::Bunzip2>, L<IO::Compress::Lzop>, L<IO::Uncompress::UnLzop>, L<IO::Compress::Lzf>, L<IO::Uncompress::UnLzf>, L<IO::Uncompress::AnyInflate>, L<IO::Uncompress::AnyUncompress> |
491 | |
492 | L<Compress::Zlib::FAQ|Compress::Zlib::FAQ> |
493 | |
494 | L<File::GlobMapper|File::GlobMapper>, L<Archive::Zip|Archive::Zip>, |
495 | L<Archive::Tar|Archive::Tar>, |
496 | L<IO::Zlib|IO::Zlib> |
497 | |
498 | =head1 AUTHOR |
499 | |
500 | This module was written by Paul Marquess, F<pmqs@cpan.org>. |
501 | |
502 | =head1 MODIFICATION HISTORY |
503 | |
504 | See the Changes file. |
505 | |
506 | =head1 COPYRIGHT AND LICENSE |
507 | |
319fab50 |
508 | Copyright (c) 2005-2009 Paul Marquess. All rights reserved. |
d54256af |
509 | |
510 | This program is free software; you can redistribute it and/or |
511 | modify it under the same terms as Perl itself. |
512 | |