Commit | Line | Data |
3fea05b9 |
1 | =head1 NAME |
2 | |
3 | lwptut -- An LWP Tutorial |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | LWP (short for "Library for WWW in Perl") is a very popular group of |
8 | Perl modules for accessing data on the Web. Like most Perl |
9 | module-distributions, each of LWP's component modules comes with |
10 | documentation that is a complete reference to its interface. However, |
11 | there are so many modules in LWP that it's hard to know where to start |
12 | looking for information on how to do even the simplest, most common |
13 | things. |
14 | |
15 | Really introducing you to using LWP would require a whole book -- a book |
16 | that just happens to exist, called I<Perl & LWP>. But this article |
17 | should give you a taste of how you can go about some common tasks with |
18 | LWP. |
19 | |
20 | |
21 | =head2 Getting documents with LWP::Simple |
22 | |
23 | If you just want to get what's at a particular URL, the simplest way |
24 | to do it is LWP::Simple's functions. |
25 | |
26 | In a Perl program, you can call its C<get($url)> function. It will try |
27 | getting that URL's content. If it works, then it'll return the |
28 | content; but if there's some error, it'll return undef. |
29 | |
30 | my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current'; |
31 | # Just an example: the URL for the most recent /Fresh Air/ show |
32 | |
33 | use LWP::Simple; |
34 | my $content = get $url; |
35 | die "Couldn't get $url" unless defined $content; |
36 | |
37 | # Then go do things with $content, like this: |
38 | |
39 | if($content =~ m/jazz/i) { |
40 | print "They're talking about jazz today on Fresh Air!\n"; |
41 | } |
42 | else { |
43 | print "Fresh Air is apparently jazzless today.\n"; |
44 | } |
45 | |
46 | The handiest variant on C<get> is C<getprint>, which is useful in Perl |
47 | one-liners. If it can get the page whose URL you provide, it sends it |
48 | to STDOUT; otherwise it complains to STDERR. |
49 | |
50 | % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" |
51 | |
52 | That is the URL of a plain text file that lists new files in CPAN in |
53 | the past two weeks. You can easily make it part of a tidy little |
54 | shell command, like this one that mails you the list of new |
55 | C<Acme::> modules: |
56 | |
57 | % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" \ |
58 | | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER |
59 | |
60 | There are other useful functions in LWP::Simple, including one function |
61 | for running a HEAD request on a URL (useful for checking links, or |
62 | getting the last-revised time of a URL), and two functions for |
63 | saving/mirroring a URL to a local file. See L<the LWP::Simple |
64 | documentation|LWP::Simple> for the full details, or chapter 2 of I<Perl |
65 | & LWP> for more examples. |
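As a rough sketch of those other functions, the following exercises C<head> and C<getstore> against a local C<file:> URL, so it runs without network access. The temporary directory, file names, and file contents are made up for illustration; with a real site you would pass an C<http:> URL instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::Simple;
use URI::file;

# Illustrative setup: a small local file standing in for a remote document.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/source.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "Fresh Air plays jazz today.\n";
close $out;

my $url = URI::file->new_abs($source)->as_string;    # a file:///... URL

# head() in list context returns the content type, length, and so on.
my ($type, $length, $modified) = head($url);
print "Type: $type, length: $length\n" if defined $length;

# getstore() saves the document to a local file and returns a status code.
my $copy   = "$dir/copy.txt";
my $status = getstore($url, $copy);
die "Couldn't store $url -- status $status" unless is_success($status);

my $size = -s $copy;
print "Stored $size bytes in $copy\n";
```

C<getstore> always fetches the document, whereas C<mirror> takes the same arguments and is meant to skip the download when the local copy is already up to date.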
66 | |
67 | |
68 | |
69 | =for comment |
70 | ########################################################################## |
71 | |
72 | |
73 | |
74 | =head2 The Basics of the LWP Class Model |
75 | |
76 | LWP::Simple's functions are handy for simple cases, but they |
77 | don't support cookies or authorization, don't support setting header |
78 | lines in the HTTP request, and generally don't support reading header lines |
79 | in the HTTP response (notably the full HTTP error message, in case of an |
80 | error). To get at all those features, you'll have to use the full LWP |
81 | class model. |
82 | |
83 | While LWP consists of dozens of classes, the main two that you have to |
84 | understand are L<LWP::UserAgent> and L<HTTP::Response>. LWP::UserAgent |
85 | is a class for "virtual browsers" which you use for performing requests, |
86 | and L<HTTP::Response> is a class for the responses (or error messages) |
87 | that you get back from those requests. |
88 | |
89 | The basic idiom is C<< $response = $browser->get($url) >>, or more fully |
90 | illustrated: |
91 | |
92 | # Early in your program: |
93 | |
94 | use LWP 5.64; # Loads all important LWP classes, and makes |
95 | # sure your version is reasonably recent. |
96 | |
97 | my $browser = LWP::UserAgent->new; |
98 | |
99 | ... |
100 | |
101 | # Then later, whenever you need to make a get request: |
102 | my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current'; |
103 | |
104 | my $response = $browser->get( $url ); |
105 | die "Can't get $url -- ", $response->status_line |
106 | unless $response->is_success; |
107 | |
108 | die "Hey, I was expecting HTML, not ", $response->content_type |
109 | unless $response->content_type eq 'text/html'; |
110 | # or whatever content-type you're equipped to deal with |
111 | |
112 | # Otherwise, process the content somehow: |
113 | |
114 | if($response->decoded_content =~ m/jazz/i) { |
115 | print "They're talking about jazz today on Fresh Air!\n"; |
116 | } |
117 | else { |
118 | print "Fresh Air is apparently jazzless today.\n"; |
119 | } |
120 | |
121 | There are two objects involved: C<$browser>, which holds an object of |
122 | class LWP::UserAgent, and then the C<$response> object, which is of |
123 | class HTTP::Response. You really need only one browser object per |
124 | program; but every time you make a request, you get back a new |
125 | HTTP::Response object, which will have some interesting attributes: |
126 | |
127 | =over |
128 | |
129 | =item * |
130 | |
131 | A status code indicating |
132 | success or failure |
133 | (which you can test with C<< $response->is_success >>). |
134 | |
135 | =item * |
136 | |
137 | An HTTP status |
138 | line that is hopefully informative if there's failure (which you can |
139 | see with C<< $response->status_line >>, |
140 | returning something like "404 Not Found"). |
141 | |
142 | =item * |
143 | |
144 | A MIME content-type like "text/html", "image/gif", |
145 | "application/xml", etc., which you can see with |
146 | C<< $response->content_type >> |
147 | |
148 | =item * |
149 | |
150 | The actual content of the response, in C<< $response->decoded_content >>. |
151 | If the response is HTML, that's where the HTML source will be; if |
152 | it's a GIF, then C<< $response->decoded_content >> will be the binary |
153 | GIF data. |
154 | |
155 | =item * |
156 | |
157 | And dozens of other convenient and more specific methods that are |
158 | documented in the docs for L<HTTP::Response>, and its superclasses |
159 | L<HTTP::Message> and L<HTTP::Headers>. |
160 | |
161 | =back |
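To get a feel for those attributes without making a request, you can build HTTP::Response objects by hand. This is only a sketch: the two responses below are invented stand-ins for what C<< $browser->get($url) >> would return.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built responses, standing in for real ones from a server.
my $ok = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    "<html><body>All jazz, all day.</body></html>",
);
my $missing = HTTP::Response->new( 404, 'Not Found' );

for my $response ( $ok, $missing ) {
    if ( $response->is_success ) {
        # In scalar context, content_type gives just the MIME type,
        # without the charset parameter.
        my $type = $response->content_type;
        print "Success: ", $response->status_line, " ($type)\n";
        print "Mentions jazz\n"
            if $response->decoded_content =~ m/jazz/i;
    }
    else {
        print "Failure: ", $response->status_line, "\n";
    }
}
```

Note that C<content_type> is called in scalar context here; in list context it also returns any parameters such as the charset.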
162 | |
163 | |
164 | |
165 | =for comment |
166 | ########################################################################## |
167 | |
168 | |
169 | |
170 | =head2 Adding Other HTTP Request Headers |
171 | |
172 | The most commonly used syntax for requests is C<< $response = |
173 | $browser->get($url) >>, but in truth, you can add extra HTTP header |
174 | lines to the request by adding a list of key-value pairs after the URL, |
175 | like so: |
176 | |
177 | $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... ); |
178 | |
179 | For example, here's how to send some more Netscape-like headers, in case |
180 | you're dealing with a site that would otherwise reject your request: |
181 | |
182 | |
183 | my @ns_headers = ( |
184 | 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)', |
185 | 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*', |
186 | 'Accept-Charset' => 'iso-8859-1,*,utf-8', |
187 | 'Accept-Language' => 'en-US', |
188 | ); |
189 | |
190 | ... |
191 | |
192 | $response = $browser->get($url, @ns_headers); |
193 | |
194 | If you weren't reusing that array, you could just go ahead and do this: |
195 | |
196 | $response = $browser->get($url, |
197 | 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)', |
198 | 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*', |
199 | 'Accept-Charset' => 'iso-8859-1,*,utf-8', |
200 | 'Accept-Language' => 'en-US', |
201 | ); |
202 | |
203 | If you were only ever changing the 'User-Agent' line, you could just change |
204 | the C<$browser> object's default line from "libwww-perl/5.65" (or the like) |
205 | to whatever you like, using the LWP::UserAgent C<agent> method: |
206 | |
207 | $browser->agent('Mozilla/4.76 [en] (Win98; U)'); |
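If you want to see which header lines such a list produces without contacting any server, one option is to build the request object yourself with HTTP::Request::Common's C<GET> function, which takes the same URL-plus-pairs argument list. This sketch assumes that C<< $browser->get >> treats its header arguments the same way; the URL is a placeholder.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

# Placeholder URL; nothing is sent over the network here.
my $request = GET( 'http://www.example.com/', @ns_headers );

# Show the request line and headers as they would be written out.
print $request->as_string;

my $language = $request->header('Accept-Language');
print "Language header: $language\n";
```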
208 | |
209 | |
210 | |
211 | =for comment |
212 | ########################################################################## |
213 | |
214 | |
215 | |
216 | =head2 Enabling Cookies |
217 | |
218 | A default LWP::UserAgent object acts like a browser with its cookies |
219 | support turned off. There are various ways of turning it on, by setting |
220 | its C<cookie_jar> attribute. A "cookie jar" is an object representing |
221 | a little database of all |
222 | the HTTP cookies that a browser can know about. It can correspond to a |
223 | file on disk (the way Netscape uses its F<cookies.txt> file), or it can |
224 | be just an in-memory object that starts out empty, and whose collection of |
225 | cookies will disappear once the program is finished running. |
226 | |
227 | To give a browser an in-memory empty cookie jar, you set its C<cookie_jar> |
228 | attribute like so: |
229 | |
230 | $browser->cookie_jar({}); |
231 | |
232 | To give it a copy that will be read from a file on disk, and will be saved |
233 | to it when the program is finished running, set the C<cookie_jar> attribute |
234 | like this: |
235 | |
236 | use HTTP::Cookies; |
237 | $browser->cookie_jar( HTTP::Cookies->new( |
238 | 'file' => '/some/where/cookies.lwp', |
239 | # where to read/write cookies |
240 | 'autosave' => 1, |
241 | # save it to disk when done |
242 | )); |
243 | |
244 | That file will be in an LWP-specific format. If you want to access the |
245 | cookies in your Netscape cookies file, you can use the |
246 | HTTP::Cookies::Netscape class: |
247 | |
248 | use HTTP::Cookies; |
249 | # yes, loads HTTP::Cookies::Netscape too |
250 | |
251 | $browser->cookie_jar( HTTP::Cookies::Netscape->new( |
252 | 'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt', |
253 | # where to read cookies |
254 | )); |
255 | |
256 | You could add an C<< 'autosave' => 1 >> line as further above, but at |
257 | time of writing, it's uncertain whether Netscape might discard some of |
258 | the cookies you could be writing back to disk. |
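Normally LWP fills the cookie jar for you from the responses it receives, but you can poke at a jar directly to see how it behaves. The following sketch uses an in-memory jar; the cookie name, value, and host names are all invented for illustration.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;    # in-memory jar, nothing written to disk

# Arguments: version, key, value, path, domain, port,
#            path_spec, secure, maxage (in seconds), discard
$jar->set_cookie( 0, 'session', 'abc123', '/', 'www.example.com',
                  undef, 0, 0, 3600, 0 );

# A request to the same (made-up) host picks the cookie up...
my $request = HTTP::Request->new( GET => 'http://www.example.com/account' );
$jar->add_cookie_header($request);
my $cookie = $request->header('Cookie');
print "Cookie header: $cookie\n";

# ...but a request to a different host does not.
my $other = HTTP::Request->new( GET => 'http://www.example.org/' );
$jar->add_cookie_header($other);
my $other_cookie = $other->header('Cookie');
print "Other host gets no cookie\n" unless defined $other_cookie;
```

When the jar is attached to a browser object with C<cookie_jar>, LWP does the equivalent bookkeeping on each request and response.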
259 | |
260 | |
261 | |
262 | =for comment |
263 | ########################################################################## |
264 | |
265 | |
266 | |
267 | =head2 Posting Form Data |
268 | |
269 | Many HTML forms send data to their server using an HTTP POST request, which |
270 | you can send with this syntax: |
271 | |
272 | $response = $browser->post( $url, |
273 | [ |
274 | formkey1 => value1, |
275 | formkey2 => value2, |
276 | ... |
277 | ], |
278 | ); |
279 | |
280 | Or if you need to send HTTP headers: |
281 | |
282 | $response = $browser->post( $url, |
283 | [ |
284 | formkey1 => value1, |
285 | formkey2 => value2, |
286 | ... |
287 | ], |
288 | headerkey1 => value1, |
289 | headerkey2 => value2, |
290 | ); |
291 | |
292 | For example, the following program makes a search request to AltaVista |
293 | (by sending some form data via an HTTP POST request), and extracts from |
294 | the HTML the report of the number of matches: |
295 | |
296 | use strict; |
297 | use warnings; |
298 | use LWP 5.64; |
299 | my $browser = LWP::UserAgent->new; |
300 | |
301 | my $word = 'tarragon'; |
302 | |
303 | my $url = 'http://www.altavista.com/sites/search/web'; |
304 | my $response = $browser->post( $url, |
305 | [ 'q' => $word, # the Altavista query string |
306 | 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX', |
307 | ] |
308 | ); |
309 | die "$url error: ", $response->status_line |
310 | unless $response->is_success; |
311 | die "Weird content type at $url -- ", $response->content_type |
312 | unless $response->content_is_html; |
313 | |
314 | if( $response->decoded_content =~ m{AltaVista found ([0-9,]+) results} ) { |
315 | # The substring will be like "AltaVista found 2,345 results" |
316 | print "$word: $1\n"; |
317 | } |
318 | else { |
319 | print "Couldn't find the match-string in the response\n"; |
320 | } |
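To see what a form POST actually carries, you can build the request with HTTP::Request::Common's C<POST> function and inspect it instead of sending it. This is a sketch with a placeholder URL and made-up form fields; it assumes that C<< $browser->post >> encodes its form data in the same way.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

my $word = 'tarragon';

# Placeholder URL; the request is only built, never sent.
my $request = POST( 'http://www.example.com/search',
    [ 'q' => $word, 'pg' => 'q', 'note' => 'herbs & spices' ],
);

my $type = $request->content_type;
my $body = $request->content;

print $request->method, " ", $request->uri, "\n";
print "Content-Type: $type\n";
print "Body: $body\n";
```

The form pairs end up URL-encoded in the request body, with spaces and characters such as "&" escaped, which is why you pass the values unescaped.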
321 | |
322 | |
323 | |
324 | =for comment |
325 | ########################################################################## |
326 | |
327 | |
328 | |
329 | =head2 Sending GET Form Data |
330 | |
331 | Some HTML forms convey their form data not by sending the data |
332 | in an HTTP POST request, but by making a normal GET request with |
333 | the data stuck on the end of the URL. For example, if you went to |
334 | C<imdb.com> and ran a search on "Blade Runner", the URL you'd see |
335 | in your browser window would be: |
336 | |
337 | http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV |
338 | |
339 | To run the same search with LWP, you'd use this idiom, which involves |
340 | the URI class: |
341 | |
342 | use URI; |
343 | my $url = URI->new( 'http://us.imdb.com/Tsearch' ); |
344 | # makes an object representing the URL |
345 | |
346 | $url->query_form( # And here the form data pairs: |
347 | 'title' => 'Blade Runner', |
348 | 'restrict' => 'Movies and TV', |
349 | ); |
350 | |
351 | my $response = $browser->get($url); |
352 | |
353 | See chapter 5 of I<Perl & LWP> for a longer discussion of HTML forms |
354 | and of form data, and chapters 6 through 9 for a longer discussion of |
355 | extracting data from HTML. |
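A quick way to check what C<query_form> builds is to print the URI object, which needs no network access. This sketch reuses the search URL from above.

```perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://us.imdb.com/Tsearch');
$url->query_form(
    'title'    => 'Blade Runner',
    'restrict' => 'Movies and TV',
);

# A URI object turns into the full URL string when interpolated.
print "$url\n";

# Called without arguments, query_form returns the decoded pairs.
my %form = $url->query_form;
print "title is '$form{title}'\n";
```

The spaces come out encoded as "+" rather than "%20"; in a query string, both forms decode to a space.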
356 | |
357 | |
358 | |
359 | =head2 Absolutizing URLs |
360 | |
361 | The URI class that we just mentioned above provides all sorts of methods |
362 | for accessing and modifying parts of URLs (such as asking what sort of URL it |
363 | is with C<< $url->scheme >>, and asking what host it refers to with C<< |
364 | $url->host >>, and so on), as described in L<the docs for the URI |
365 | class|URI>. However, the methods of most immediate interest |
366 | are the C<query_form> method seen above, and now the C<new_abs> method |
367 | for taking a probably-relative URL string (like "../foo.html") and getting |
368 | back an absolute URL (like "http://www.perl.com/stuff/foo.html"), as |
369 | shown here: |
370 | |
371 | use URI; |
372 | $abs = URI->new_abs($maybe_relative, $base); |
373 | |
374 | For example, consider this program that matches URLs in the HTML |
375 | list of new modules in CPAN: |
376 | |
377 | use strict; |
378 | use warnings; |
379 | use LWP; |
380 | my $browser = LWP::UserAgent->new; |
381 | |
382 | my $url = 'http://www.cpan.org/RECENT.html'; |
383 | my $response = $browser->get($url); |
384 | die "Can't get $url -- ", $response->status_line |
385 | unless $response->is_success; |
386 | |
387 | my $html = $response->decoded_content; |
388 | while( $html =~ m/<A HREF=\"(.*?)\"/g ) { |
389 | print "$1\n"; |
390 | } |
391 | |
392 | When run, it emits output that starts out something like this: |
393 | |
394 | MIRRORING.FROM |
395 | RECENT |
396 | RECENT.html |
397 | authors/00whois.html |
398 | authors/01mailrc.txt.gz |
399 | authors/id/A/AA/AASSAD/CHECKSUMS |
400 | ... |
401 | |
402 | However, if you actually want to have those be absolute URLs, you |
403 | can use the URI module's C<new_abs> method, by changing the C<while> |
404 | loop to this: |
405 | |
406 | while( $html =~ m/<A HREF=\"(.*?)\"/g ) { |
407 | print URI->new_abs( $1, $response->base ) ,"\n"; |
408 | } |
409 | |
410 | (The C<< $response->base >> method from L<HTTP::Response|HTTP::Response> |
411 | is for returning what URL |
412 | should be used for resolving relative URLs -- it's usually just |
413 | the same as the URL that you requested.) |
414 | |
415 | That program then emits nicely absolute URLs: |
416 | |
417 | http://www.cpan.org/MIRRORING.FROM |
418 | http://www.cpan.org/RECENT |
419 | http://www.cpan.org/RECENT.html |
420 | http://www.cpan.org/authors/00whois.html |
421 | http://www.cpan.org/authors/01mailrc.txt.gz |
422 | http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS |
423 | ... |
424 | |
425 | See chapter 4 of I<Perl & LWP> for a longer discussion of URI objects. |
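Here is a small sketch of how C<new_abs> treats different kinds of link against one base URL. The base and the link strings are hypothetical.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base: the URL of the page the links were found on.
my $base = 'http://www.perl.com/stuff/bar/index.html';

my %abs;
for my $link ( '../foo.html', 'baz.html', '/top.html',
               'http://www.cpan.org/RECENT' ) {
    # Stringify the resulting URI object so it can be stored and printed.
    $abs{$link} = URI->new_abs( $link, $base )->as_string;
    printf "%-28s => %s\n", $link, $abs{$link};
}
```

A link that is already absolute comes back unchanged, so it is safe to pass every link through C<new_abs> without checking first.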
426 | |
427 | Of course, using a regexp to match hrefs is a bit simplistic, and for |
428 | more robust programs, you'll probably want to use an HTML-parsing module |
429 | like L<HTML::LinkExtor> or L<HTML::TokeParser> or even maybe |
430 | L<HTML::TreeBuilder>. |
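As a minimal sketch of the parser-based approach, here is HTML::LinkExtor run over a made-up HTML fragment. Passing a base URL as the second constructor argument makes it return absolute URLs, and it copes with attribute quoting and letter case that the regexp above would miss.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up HTML fragment and base URL, for illustration only.
my $html = <<'END_HTML';
<a href="RECENT">Recent</a>
<A HREF='authors/00whois.html'>Authors</A>
<img src="/misc/logo.png">
END_HTML
my $base = 'http://www.cpan.org/';

# With no callback, the parser collects links; the base absolutizes them.
my $parser = HTML::LinkExtor->new( undef, $base );
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;
    # Keep only the href attributes of <a> elements.
    next unless $tag eq 'a' and defined $attr{href};
    push @hrefs, "$attr{href}";
}
print "$_\n" for @hrefs;
```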
431 | |
432 | |
433 | |
434 | |
435 | =for comment |
436 | ########################################################################## |
437 | |
438 | =head2 Other Browser Attributes |
439 | |
440 | LWP::UserAgent objects have many attributes for controlling how they |
441 | work. Here are a few notable ones: |
442 | |
443 | =over |
444 | |
445 | =item * |
446 | |
447 | C<< $browser->timeout(15); >> |
448 | |
449 | This sets this browser object to give up on requests that don't answer |
450 | within 15 seconds. |
451 | |
452 | |
453 | =item * |
454 | |
455 | C<< $browser->protocols_allowed( [ 'http', 'gopher'] ); >> |
456 | |
457 | This sets this browser object to not speak any protocols other than HTTP |
458 | and gopher. If it tries accessing any other kind of URL (like an "ftp:" |
459 | or "mailto:" or "news:" URL), then it won't actually try connecting, but |
460 | instead will immediately return an error code 500, with a message like |
461 | "Access to 'ftp' URIs has been disabled". |
462 | |
463 | |
464 | =item * |
465 | |
466 | C<< use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new()); >> |
467 | |
468 | This tells the browser object to try using the HTTP/1.1 "Keep-Alive" |
469 | feature, which speeds up requests by reusing the same socket connection |
470 | for multiple requests to the same server. |
471 | |
472 | |
473 | =item * |
474 | |
475 | C<< $browser->agent( 'SomeName/1.23 (more info here maybe)' ) >> |
476 | |
477 | This changes how the browser object will identify itself in |
478 | the default "User-Agent" line in its HTTP requests. By default, |
479 | it'll send "libwww-perl/I<versionnumber>", like |
480 | "libwww-perl/5.65". You can change that to something more descriptive |
481 | like this: |
482 | |
483 | $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' ); |
484 | |
485 | Or if need be, you can go in disguise, like this: |
486 | |
487 | $browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' ); |
488 | |
489 | |
490 | =item * |
491 | |
492 | C<< push @{ $browser->requests_redirectable }, 'POST'; >> |
493 | |
494 | This tells this browser to obey redirection responses to POST requests |
495 | (like most modern interactive browsers), even though the HTTP RFC says |
496 | that should not normally be done. |
497 | |
498 | |
499 | =back |
500 | |
501 | |
502 | For more options and information, see L<the full documentation for |
503 | LWP::UserAgent|LWP::UserAgent>. |
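Several of these attributes can be set together and then read back. The sketch below also shows the effect of C<protocols_allowed>: the request for an "ftp:" URL (a placeholder host) is refused locally, so no connection is attempted.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(15);
$browser->agent('SomeName/3.14 (contact@robotplexus.int)');
$browser->protocols_allowed( [ 'http', 'gopher' ] );
push @{ $browser->requests_redirectable }, 'POST';

# An ftp: URL is now refused without any connection attempt.
my $response = $browser->get('ftp://ftp.example.com/pub/README');
print "Status: ", $response->status_line, "\n";

# Calling an attribute method with no argument returns its current value.
print "Timeout: ", $browser->timeout, " seconds\n";
print "Redirectable: @{ $browser->requests_redirectable }\n";
```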
504 | |
505 | |
506 | |
507 | =for comment |
508 | ########################################################################## |
509 | |
510 | |
511 | |
512 | =head2 Writing Polite Robots |
513 | |
514 | If you want to make sure that your LWP-based program respects F<robots.txt> |
515 | files and doesn't make too many requests too fast, you can use the LWP::RobotUA |
516 | class instead of the LWP::UserAgent class. |
517 | |
518 | The LWP::RobotUA class is just like LWP::UserAgent, and you can use it like so: |
519 | |
520 | use LWP::RobotUA; |
521 | my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com'); |
522 | # Your bot's name and your email address |
523 | |
524 | my $response = $browser->get($url); |
525 | |
526 | But LWP::RobotUA adds these features: |
527 | |
528 | |
529 | =over |
530 | |
531 | =item * |
532 | |
533 | If the F<robots.txt> on C<$url>'s server forbids you from accessing |
534 | C<$url>, then the C<$browser> object (assuming it's of class LWP::RobotUA) |
535 | won't actually request it, but instead will give you back (in C<$response>) a 403 error |
536 | with a message "Forbidden by robots.txt". That is, if you have this line: |
537 | |
538 | die "$url -- ", $response->status_line, "\nAborted" |
539 | unless $response->is_success; |
540 | |
541 | then the program would die with an error message like this: |
542 | |
543 | http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt |
544 | Aborted at whateverprogram.pl line 1234 |
545 | |
546 | =item * |
547 | |
548 | If this C<$browser> object sees that the last time it talked to |
549 | C<$url>'s server was too recently, then it will pause (via C<sleep>) to |
550 | avoid making too many requests too often. The length of that pause is |
551 | by default one minute -- but you can control it with the C<< |
552 | $browser->delay( I<minutes> ) >> attribute. |
553 | |
554 | For example, this code: |
555 | |
556 | $browser->delay( 7/60 ); |
557 | |
558 | ...means that this browser will pause when it needs to avoid talking to |
559 | any given server more than once every 7 seconds. |
560 | |
561 | =back |
562 | |
563 | For more options and information, see L<the full documentation for |
564 | LWP::RobotUA|LWP::RobotUA>. |
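The F<robots.txt> check itself can be tried out offline with the WWW::RobotRules class, which LWP::RobotUA relies on for this purpose. This is only a sketch: the host name and the F<robots.txt> content are made up, and in real use LWP::RobotUA fetches and parses the file for you.

```perl
use strict;
use warnings;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('YourSuperBot/1.34');

# A made-up robots.txt, as if fetched from the made-up host below.
my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /private/
END_ROBOTS
$rules->parse( 'http://www.example.com/robots.txt', $robots_txt );

my %verdict;
for my $url ( 'http://www.example.com/index.html',
              'http://www.example.com/private/x.html' ) {
    # allowed() reports whether the parsed rules permit fetching this URL.
    $verdict{$url} = $rules->allowed($url) ? 'allowed'
                                           : 'forbidden by robots.txt';
    print "$url -- $verdict{$url}\n";
}
```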
565 | |
566 | |
567 | |
568 | |
569 | |
570 | =for comment |
571 | ########################################################################## |
572 | |
573 | =head2 Using Proxies |
574 | |
575 | In some cases, you will want to (or will have to) use proxies for |
576 | accessing certain sites and/or using certain protocols. This is most |
577 | commonly the case when your LWP program is running (or could be running) |
578 | on a machine that is behind a firewall. |
579 | |
580 | To make a browser object use proxies that are defined in the usual |
581 | environment variables (C<HTTP_PROXY>, etc.), just call the C<env_proxy> |
582 | method on a user-agent object before you go making any requests on it. |
583 | Specifically: |
584 | |
585 | use LWP::UserAgent; |
586 | my $browser = LWP::UserAgent->new; |
587 | |
588 | # And before you go making any requests: |
589 | $browser->env_proxy; |
590 | |
591 | For more information on proxy parameters, see L<the LWP::UserAgent |
592 | documentation|LWP::UserAgent>, specifically the C<proxy>, C<env_proxy>, |
593 | and C<no_proxy> methods. |
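If you would rather name the proxy in the program than rely on environment variables, the C<proxy> and C<no_proxy> methods can be called directly. A minimal sketch, in which the proxy host, port, and exempted domains are all hypothetical:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy host and port; substitute your site's real proxy.
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.int:8080/' );

# Hosts in these (made-up) domains are contacted directly instead.
$browser->no_proxy( 'localhost', 'intranet.example.int' );

# Called with just a scheme, proxy() returns the current setting.
my $http_proxy = $browser->proxy('http');
print "HTTP proxy: $http_proxy\n";
```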
594 | |
595 | |
596 | |
597 | =for comment |
598 | ########################################################################## |
599 | |
600 | =head2 HTTP Authentication |
601 | |
602 | Many web sites restrict access to documents by using "HTTP |
603 | Authentication". This isn't just any form of "enter your password" |
604 | restriction, but is a specific mechanism where the HTTP server sends the |
605 | browser an HTTP code that says "That document is part of a protected |
606 | 'realm', and you can access it only if you re-request it and add some |
607 | special authorization headers to your request". |
608 | |
609 | For example, the Unicode.org admins stop email-harvesting bots from |
610 | harvesting the contents of their mailing list archives, by protecting |
611 | them with HTTP Authentication, and then publicly stating the username |
612 | and password (at C<http://www.unicode.org/mail-arch/>) -- namely |
613 | username "unicode-ml" and password "unicode". |
614 | |
615 | For example, consider this URL, which is part of the protected |
616 | area of the web site: |
617 | |
618 | http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html |
619 | |
620 | If you access that with a browser, you'll get a prompt |
621 | like |
622 | "Enter username and password for 'Unicode-MailList-Archives' at server |
623 | 'www.unicode.org'". |
624 | |
625 | In LWP, if you just request that URL, like this: |
626 | |
627 | use LWP; |
628 | my $browser = LWP::UserAgent->new; |
629 | |
630 | my $url = |
631 | 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html'; |
632 | my $response = $browser->get($url); |
633 | |
634 | die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing', |
635 | # ('WWW-Authenticate' is the realm-name) |
636 | "\n ", $response->status_line, "\n at $url\n Aborting" |
637 | unless $response->is_success; |
638 | |
639 | Then you'll get this error: |
640 | |
641 | Error: Basic realm="Unicode-MailList-Archives" |
642 | 401 Authorization Required |
643 | at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html |
644 | Aborting at auth1.pl line 9. [or wherever] |
645 | |
646 | ...because the C<$browser> doesn't know the username and password |
647 | for that realm ("Unicode-MailList-Archives") at that host |
648 | ("www.unicode.org"). The simplest way to let the browser know about this |
649 | is to use the C<credentials> method to let it know about a username and |
650 | password that it can try using for that realm at that host. The syntax is: |
651 | |
652 | $browser->credentials( |
653 | 'servername:portnumber', |
654 | 'realm-name', |
655 | 'username' => 'password' |
656 | ); |
657 | |
658 | In most cases, the port number is 80, the default TCP/IP port for HTTP; and |
659 | you usually call the C<credentials> method before you make any requests. |
660 | For example: |
661 | |
662 | $browser->credentials( |
663 | 'reports.mybazouki.com:80', |
664 | 'web_server_usage_reports', |
665 | 'plinky' => 'banjo123' |
666 | ); |
667 | |
668 | So if we add the following to the program above, right after the C<< |
669 | $browser = LWP::UserAgent->new; >> line... |
670 | |
671 | $browser->credentials( # add this to our $browser 's "key ring" |
672 | 'www.unicode.org:80', |
673 | 'Unicode-MailList-Archives', |
674 | 'unicode-ml' => 'unicode' |
675 | ); |
676 | |
677 | ...then when we run it, the request succeeds, instead of causing the |
678 | C<die> to be called. |
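You can check what the browser object's "key ring" would offer for a given realm and URL without making a request, using the C<get_basic_credentials> method of LWP::UserAgent. This sketch reuses the values from above; the second realm name is invented to show a lookup that finds nothing.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $browser = LWP::UserAgent->new;
$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode',
);

my $url = URI->new(
    'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html');

# Look up the username and password registered for this realm and host.
my ( $user, $pass ) =
    $browser->get_basic_credentials( 'Unicode-MailList-Archives', $url );
print "Would log in as '$user'\n" if defined $user;

# A realm name that was never registered yields no username.
my ($no_user) = $browser->get_basic_credentials( 'Some-Other-Realm', $url );
print "No credentials for Some-Other-Realm\n" unless defined $no_user;
```

The realm name and the "host:port" string must both match what the server reports, so a typo in either leaves you with the same 401 error as before.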
679 | |
680 | |
681 | |
682 | =for comment |
683 | ########################################################################## |
684 | |
685 | =head2 Accessing HTTPS URLs |
686 | |
687 | When you access an HTTPS URL, it'll work for you just like an HTTP URL |
688 | would -- if your LWP installation has HTTPS support (via an appropriate |
689 | Secure Sockets Layer library). For example: |
690 | |
691 | use LWP; |
692 | my $url = 'https://www.paypal.com/'; # Yes, HTTPS! |
693 | my $browser = LWP::UserAgent->new; |
694 | my $response = $browser->get($url); |
695 | die "Error at $url\n ", $response->status_line, "\n Aborting" |
696 | unless $response->is_success; |
697 | print "Whee, it worked! I got that ", |
698 | $response->content_type, " document!\n"; |
699 | |
700 | If your LWP installation doesn't have HTTPS support set up, then the |
701 | response will be unsuccessful, and you'll get this error message: |
702 | |
703 | Error at https://www.paypal.com/ |
704 | 501 Protocol scheme 'https' is not supported |
705 | Aborting at paypal.pl line 7. [or whatever program and line] |
706 | |
707 | If your LWP installation I<does> have HTTPS support installed, then the |
708 | response should be successful, and you should be able to consult |
709 | C<$response> just like with any normal HTTP response. |
710 | |
711 | For information about installing HTTPS support for your LWP |
712 | installation, see the helpful F<README.SSL> file that comes in the |
713 | libwww-perl distribution. |
714 | |
715 | |
716 | =for comment |
717 | ########################################################################## |
718 | |
719 | |
720 | |
721 | =head2 Getting Large Documents |
722 | |
723 | When you're requesting a large (or at least potentially large) document, |
724 | a problem with the normal way of using the request methods (like C<< |
725 | $response = $browser->get($url) >>) is that the response object in |
726 | memory will have to hold the whole document -- I<in memory>. If the |
727 | response is a thirty megabyte file, this is likely to be quite an |
728 | imposition on this process's memory usage. |
729 | |
730 | A notable alternative is to have LWP save the content to a file on disk, |
731 | instead of saving it up in memory. This is the syntax to use: |
732 | |
733 | $response = $browser->get($url, |
734 | ':content_file' => $filespec, |
735 | ); |
736 | |
737 | For example, |
738 | |
739 | $response = $browser->get('http://search.cpan.org/', |
740 | ':content_file' => '/tmp/sco.html' |
741 | ); |
742 | |
743 | When you use this C<:content_file> option, the C<$response> will have |
744 | all the normal header lines, but C<< $response->content >> will be |
745 | empty. |
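The following sketch tries this out offline by fetching a local C<file:> URL in place of a large remote document. The temporary directory, file names, and file contents are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent;
use URI::file;

# Illustrative setup: a local file plays the part of the large document.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/big.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "line $_\n" for 1 .. 10_000;
close $out;

my $browser  = LWP::UserAgent->new;
my $url      = URI::file->new_abs($source);
my $filespec = "$dir/saved.txt";

# The content goes straight to $filespec instead of into the response.
my $response = $browser->get( $url, ':content_file' => $filespec );
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;

my $saved  = -s $filespec;
my $in_mem = length $response->content;
print "Saved $saved bytes to $filespec\n";
print "Content held in memory: $in_mem bytes\n";
```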
746 | |
747 | Note that this ":content_file" option isn't supported under older |
748 | versions of LWP, so you should consider adding C<use LWP 5.66;> to check |
749 | the LWP version, if you think your program might run on systems with |
750 | older versions. |
751 | |
752 | If you need to be compatible with older LWP versions, then use |
753 | this syntax, which does the same thing: |
754 | |
755 | use HTTP::Request::Common; |
756 | $response = $browser->request( GET($url), $filespec ); |
757 | |
758 | |
759 | =for comment |
760 | ########################################################################## |
761 | |
762 | |
763 | =head1 SEE ALSO |
764 | |
765 | Remember, this article is just the most rudimentary introduction to |
766 | LWP -- to learn more about LWP and LWP-related tasks, you really |
767 | must read from the following: |
768 | |
769 | =over |
770 | |
771 | =item * |
772 | |
773 | L<LWP::Simple> -- simple functions for getting/heading/mirroring URLs |
774 | |
775 | =item * |
776 | |
777 | L<LWP> -- overview of the libwww-perl modules |
778 | |
779 | =item * |
780 | |
781 | L<LWP::UserAgent> -- the class for objects that represent "virtual browsers" |
782 | |
783 | =item * |
784 | |
785 | L<HTTP::Response> -- the class for objects that represent the response to |
786 | an LWP request, as in C<< $response = $browser->get(...) >> |
787 | |
788 | =item * |
789 | |
790 | L<HTTP::Message> and L<HTTP::Headers> -- classes that provide more methods |
791 | to HTTP::Response. |
792 | |
793 | =item * |
794 | |
795 | L<URI> -- class for objects that represent absolute or relative URLs |
796 | |
797 | =item * |
798 | |
799 | L<URI::Escape> -- functions for URL-escaping and URL-unescaping strings |
800 | (like turning "this & that" to and from "this%20%26%20that"). |
801 | |
802 | =item * |
803 | |
804 | L<HTML::Entities> -- functions for HTML-escaping and HTML-unescaping strings |
805 | (like turning "C. & E. BrontE<euml>" to and from "C. &amp; E. Bront&euml;") |
806 | |
807 | =item * |
808 | |
809 | L<HTML::TokeParser> and L<HTML::TreeBuilder> -- classes for parsing HTML |
810 | |
811 | =item * |
812 | |
813 | L<HTML::LinkExtor> -- class for finding links in HTML documents |
814 | |
815 | =item * |
816 | |
817 | The book I<Perl & LWP> by Sean M. Burke. O'Reilly & Associates, 2002. |
818 | ISBN: 0-596-00178-9. C<http://www.oreilly.com/catalog/perllwp/> |
819 | |
820 | =back |
821 | |
822 | |
823 | =head1 COPYRIGHT |
824 | |
825 | Copyright 2002, Sean M. Burke. You can redistribute this document and/or |
826 | modify it, but only under the same terms as Perl itself. |
827 | |
828 | =head1 AUTHOR |
829 | |
830 | Sean M. Burke C<sburke@cpan.org> |
831 | |
832 | =for comment |
833 | ########################################################################## |
834 | |
835 | =cut |
836 | |
837 | # End of Pod |