=head1 NAME

WWW::Mechanize::FAQ - Frequently Asked Questions about WWW::Mechanize

=head1 How to get help with WWW::Mechanize

If your question isn't answered here in the FAQ, please turn to the
communities at:

=over

=item * L<http://perlmonks.org>

=item * The libwww-perl mailing list at L<http://lists.perl.org>

=back

=head1 JavaScript

=head2 I have this web page that has JavaScript on it, and my Mech program doesn't work.

That's because WWW::Mechanize doesn't operate on the JavaScript. It only
understands the HTML parts of the page.

=head2 I thought Mech was supposed to work like a web browser.

It does pretty much, but it doesn't support JavaScript.

I added some basic attempts at picking up URLs in C<window.open()>
calls and returning them in C<< $mech->links >>. They work sometimes.
Beyond that, there's no support for JavaScript.

=head2 Are you going to add JavaScript support?

I will if anyone sends me the code to do it. I'm not going to write a
JavaScript processor myself.

=head2 Wouldn't that be a great thing to have in WWW::Mechanize?

Yes.

=head2 Would it be hard to do?

Hard enough that I don't want to deal with it myself. Plus, I don't
use JavaScript myself, so I don't have an itch to scratch.

=head2 Is anyone working on it?

I've heard noises from people every so often over the past couple of
years, but nothing you'd pin your hopes on.

=head2 It would really help me with a project I'm working on.

I'm sure it would.

=head2 Do you know when it might get added?

I have no idea if or when such a thing will ever get done. I can
guarantee that as soon as there's anything close to JavaScript support
I will let everyone know.

=head2 Maybe I'll ask around and see if anyone else knows of a solution.

If you must, but I doubt that anyone's written JavaScript support for
Mechanize and neglected to tell me about it.

=head2 So what can I do?

Since JavaScript is completely visible to the client, it cannot be used
to prevent a scraper from following links. But it can make life difficult,
and until someone writes a JavaScript interpreter for Perl or a Mechanize
clone to control Firefox, there will be no general solution. But if
you want to scrape specific pages, then a solution is always possible.

One typical use of JavaScript is to perform argument checking before
posting to the server. The URL you want is probably just buried in the
JavaScript function. Do a regular expression match on
C<< $mech->content() >>
to find the link that you want and C<< $mech->get >> it directly (this
assumes that you know what you are looking for in advance).

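As a minimal sketch of that regular expression approach, the following
pulls a URL out of a JavaScript call. The function name C<checkAndGo>,
the sample markup, and the helper C<url_from_js> are all hypothetical;
in a real program the content would come from C<< $mech->content() >>
and the pattern would have to match whatever the target page actually uses.

```perl
use strict;
use warnings;

# Hypothetical page content; in a real program this would come from
# $mech->content().
my $content = <<'HTML';
<a href="#" onclick="return checkAndGo('/reports/view.cgi?id=42');">Report</a>
HTML

# Capture the first quoted argument passed to the (assumed) checkAndGo()
# function. Returns undef when the call is not found.
sub url_from_js {
    my ($html) = @_;
    return $html =~ /checkAndGo\(\s*(['"])(.*?)\1/ ? $2 : undef;
}

my $url = url_from_js($content);
print defined $url ? "Found: $url\n" : "No link found\n";

# With a live Mech object, the next step would be: $mech->get($url);
```
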
In more difficult cases, the JavaScript is used for URL mangling to
satisfy the needs of some middleware. In this case you need to figure
out what the JavaScript is doing (why are these URLs always really
long?). There is probably some function with one or more arguments which
calculates the new URL. Step one: using your favorite browser, get the
before and after URLs and save them to files. Edit each file, converting
the argument separators ('?', '&' or ';') into newlines. Now it is
easy to use diff or comm to find out what JavaScript did to the URL.
Step two: find the function call which created the URL. You will need
to parse and interpret its argument list. Using the JavaScript Debugger
extension for Firefox may help with the analysis. At this point, it is
fairly trivial to write your own function which emulates the JavaScript
for the pages you want to process.

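Step one can also be done in Perl rather than with an editor and diff.
The sketch below splits two URLs on the same separators and reports
which pieces were added or removed. The URLs and the helper names
C<url_parts> and C<url_diff> are made up for illustration.

```perl
use strict;
use warnings;

# Split a URL into its base and its individual arguments, breaking on
# the argument separators '?', '&' and ';'.
sub url_parts {
    my ($url) = @_;
    return split /[?&;]/, $url;
}

# Return two array references: the parts found only in the "before"
# URL, and the parts found only in the "after" URL.
sub url_diff {
    my ( $before, $after ) = @_;
    my %in_before = map { ( $_ => 1 ) } url_parts($before);
    my %in_after  = map { ( $_ => 1 ) } url_parts($after);
    my @removed = grep { !$in_after{$_} } url_parts($before);
    my @added   = grep { !$in_before{$_} } url_parts($after);
    return ( \@removed, \@added );
}

# Hypothetical before/after URLs captured from the browser.
my $before = 'http://www.example.com/app?page=list&sid=abc';
my $after  = 'http://www.example.com/app?page=list&sid=abc&token=9f3&mode=full';

my ( $removed, $added ) = url_diff( $before, $after );
print "Removed: @$removed\n";
print "Added:   @$added\n";
```

Whatever shows up as added is what your own emulating function has to
compute.
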
Here's another approach that answers the question, "It works in Firefox,
but why not Mech?" Everything the web server knows about the client is
present in the HTTP request. If two requests are identical, the results
should be identical. So the real question is "What is different between
the Mech request and the Firefox request?"

The Firefox extension "Tamper Data" is an effective tool for examining
the headers of the requests to the server. Compare that with what LWP
is sending. Once the two are identical, the action of the server should
be the same as well.

I say "should", because this is an oversimplification. Some values
are naturally unique, e.g. a SessionID, but if a SessionID is present,
that is probably sufficient, even though the value will be different
between the LWP request and the Firefox request. The server could use
the session to store information which is troublesome, but that's not
the first place to look (and highly unlikely to be relevant when you
are requesting the login page of your site).

Generally the problem is to be found in missing or incorrect POSTDATA
arguments, Cookies, User-Agents, Accepts, etc. If you are using Mech,
then redirects and cookies should not be a problem, but are listed here
for completeness. If you are missing headers, C<< $mech->add_header >>
can be used to add the headers that you need.

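Once you have both sets of headers written down, the comparison itself
is mechanical. This is only a sketch: the header values are invented,
C<header_differences> is a hypothetical helper, and it assumes both
sides spell header names with the same capitalization.

```perl
use strict;
use warnings;

# Compare the headers a browser sent with the headers a script sent and
# report each header the script is missing or sends with another value.
sub header_differences {
    my ( $browser, $script ) = @_;
    my @report;
    for my $name ( sort keys %$browser ) {
        if ( !exists $script->{$name} ) {
            push @report, "missing: $name";
        }
        elsif ( $script->{$name} ne $browser->{$name} ) {
            push @report, "differs: $name";
        }
    }
    return @report;
}

# Hypothetical values, as copied from a browser tool and an LWP trace.
my %firefox = (
    'User-Agent' => 'Mozilla/5.0',
    'Accept'     => 'text/html',
    'Referer'    => 'http://www.example.com/login',
);
my %mech = (
    'User-Agent' => 'WWW-Mechanize/1.0',
    'Accept'     => 'text/html',
);

print "$_\n" for header_differences( \%firefox, \%mech );
```

A header reported as missing is a candidate for
C<< $mech->add_header( Referer => $value ) >>.
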
=head1 How do I do X?

=head2 Can I do [such-and-such] with WWW::Mechanize?

If it's possible with LWP::UserAgent, then yes. WWW::Mechanize is
a subclass of L<LWP::UserAgent>, so all the wondrous magic of that
class is inherited.

=head2 How do I use WWW::Mechanize through a proxy server?

See the docs in L<LWP::UserAgent> on how to use the proxy. Short version:

    $mech->proxy(['http', 'ftp'], 'http://proxy.example.com:8000/');

or get the specs from the environment:

    $mech->env_proxy();

    # Environment set like so:
    gopher_proxy=http://proxy.my.place/
    wais_proxy=http://proxy.my.place/
    no_proxy="localhost,my.domain"
    export gopher_proxy wais_proxy no_proxy

=head2 How can I see what fields are on the forms?

Use the F<mech-dump> utility, optionally installed with Mechanize.

    $ mech-dump --forms http://search.cpan.org
    Dumping forms
    GET http://search.cpan.org/search
    query=
    mode=all (option) [*all|module|dist|author]
    <NONAME>=CPAN Search (submit)

=head2 How do I get Mech to handle authentication?

    use MIME::Base64;

    # USER, PASS, ADDRESS, REALM and URL stand in for your own values.
    my $agent = WWW::Mechanize->new();
    my @args = (
        Authorization => "Basic " .
            # The empty second argument stops encode_base64() from
            # appending a newline to the header value.
            encode_base64( USER . ':' . PASS, '' )
    );

    $agent->credentials( ADDRESS, REALM, USER, PASS );
    $agent->get( URL, @args );

If you want to use the credentials for all future requests, you can
also use the L<LWP::UserAgent> C<default_header()> method instead
of the extra arguments to C<get()>:

    $agent->default_header(
        Authorization => 'Basic ' . encode_base64( USER . ':' . PASS, '' ) );

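To see what the header value looks like, here is a small standalone
sketch using only L<MIME::Base64>. The helper name C<basic_auth_value>
and the credentials are made up. By default C<encode_base64()> appends
a newline, which does not belong in a header value, so the sketch passes
an empty string as the second argument.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Build the value of a Basic Authorization header from a user name and
# a password.
sub basic_auth_value {
    my ( $user, $pass ) = @_;

    # The empty second argument suppresses the trailing newline that
    # encode_base64() would otherwise add.
    return 'Basic ' . encode_base64( "$user:$pass", '' );
}

print basic_auth_value( 'user', 'secret' ), "\n";
```
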
=head2 How can I get WWW::Mechanize to execute this JavaScript?

You can't. JavaScript is entirely client-based, and WWW::Mechanize
is a client that doesn't understand JavaScript. See the top part
of this FAQ.

=head2 How do I check a checkbox that doesn't have a value defined?

Set it to the value of "on".

    $mech->field( my_checkbox => 'on' );

=head2 How do I handle frames?

You don't deal with them as frames, per se, but as links. Extract
them with

    my @frame_links = $mech->find_all_links( tag => "frame" );

=head2 How do I get a list of HTTP headers and their values?

All L<HTTP::Headers> methods work on an L<HTTP::Response> object, which is
returned by the I<get()>, I<reload()>, I<response()/res()>, I<click()>,
I<submit_form()>, and I<request()> methods.

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get( 'http://my.site.com' );
    my $response = $mech->response();
    for my $key ( $response->header_field_names() ) {
        print $key, " : ", $response->header( $key ), "\n";
    }

=head2 How do I enable keep-alive?

Since L<WWW::Mechanize> is a subclass of L<LWP::UserAgent>, you can
use the same mechanism to enable keep-alive:

    use LWP::ConnCache;
    ...
    $mech->conn_cache(LWP::ConnCache->new);

=head2 How can I change/specify the action parameter of an HTML form?

You can access the action of the form through the L<HTML::Form>
object returned by one of the form selection methods.

Using C<< $mech->form_number($number) >>:

    my $mech = WWW::Mechanize->new;
    $mech->get('http://someurlhere.com');
    # Select the form by its position on the page; forms are numbered from 1
    $mech->form_number(1)->action('http://newAction'); # absolute URL

Using C<< $mech->form_name($name) >>:

    my $mech = WWW::Mechanize->new;
    $mech->get('http://someurlhere.com');
    # Select the form by the value of its name attribute
    $mech->form_name('trgForm')->action('http://newAction'); # absolute URL

=head1 Why doesn't this work: Debugging your Mechanize program

=head2 My Mech program doesn't work, but it works in the browser.

Mechanize acts like a browser, but apparently something you're doing
is not matching the browser's behavior. Maybe the server is expecting a
certain web client, or maybe you're not handling a field properly.
For some reason, your Mech program isn't doing exactly what the
browser is doing, and when you find that, you'll have the answer.

=head2 My Mech program gets these 500 errors.

A 500 error from the web server says that the program on the server
side died. Probably the web server program was expecting certain
inputs that you didn't supply, and instead of handling it nicely,
the program died.

Whatever the cause of the 500 error, if it works in the browser,
but not in your Mech program, you're not acting like the browser.
See the previous question.

=head2 Why doesn't my program handle this form correctly?

Run F<mech-dump> on your page and see what it says.

F<mech-dump> is a marvelous diagnostic tool for figuring out what forms
and fields are on the page. If you're scraping CNN.com, you'd get this:

    $ mech-dump http://www.cnn.com/
    GET http://search.cnn.com/cnn/search
    source=cnn (hidden readonly)
    invocationType=search/top (hidden readonly)
    sites=web (radio) [*web/The Web ??|cnn/CNN.com ??]
    query= (text)
    <NONAME>=Search (submit)

    POST http://cgi.money.cnn.com/servlets/quote_redirect
    query= (text)
    <NONAME>=GET (submit)

    POST http://polls.cnn.com/poll
    poll_id=2112 (hidden readonly)
    question_1=<UNDEF> (radio) [1/Simplistic option|2/VIEW RESULTS]
    <NONAME>=VOTE (submit)

    GET http://search.cnn.com/cnn/search
    source=cnn (hidden readonly)
    invocationType=search/bottom (hidden readonly)
    sites=web (radio) [*web/??CNN.com|cnn/??]
    query= (text)
    <NONAME>=Search (submit)

Four forms, including the first one duplicated at the end. All the
fields, all their defaults, lovingly generated by HTML::Form's C<dump>
method.

If you want to run F<mech-dump> on something that doesn't lend itself
to a quick URL fetch, then use the C<save_content()> method to write
the HTML to a file, and run F<mech-dump> on the file.

=head2 Why don't https:// URLs work?

You need either L<IO::Socket::SSL> or L<Crypt::SSLeay> installed.

=head2 Why do I get "Input 'fieldname' is readonly"?

You're trying to change the value of a hidden field and you have
warnings on.

First, make sure that you actually mean to change the field that you're
changing, and that you don't have a typo. Usually, hidden variables are
set by the site you're working on for a reason. If you change the value,
you might be breaking some functionality by faking it out.

If you really do want to change a hidden value, make the changes in a
scope that has warnings turned off:

    {
        local $^W = 0;
        $agent->field( name => $value );
    }

=head2 I tried to [such-and-such] and I got this weird error.

Are you checking your errors?

Are you sure?

Are you checking that your action succeeded after every action?

Are you sure?

For example, if you try this:

    $mech->get( "http://my.site.com" );
    $mech->follow_link( text => "foo" );

and the C<get> call fails for some reason, then the Mech internals
will be unusable for the C<follow_link> and you'll get a weird
error. You B<must>, after every action that GETs or POSTs a page,
check that Mech succeeded, or all bets are off.

    $mech->get( "http://my.site.com" );
    die "Can't even get the home page: ", $mech->response->status_line
        unless $mech->success;

    $mech->follow_link( text => "foo" );
    die "Foo link failed: ", $mech->response->status_line
        unless $mech->success;

=head2 How do I figure out why C<< $mech->get($url) >> doesn't work?

There are many reasons why a C<< get() >> can fail. The server can take
you to someplace you didn't expect. It can generate redirects which are
not properly handled. You can get time-outs. Servers are down more often
than you think. A couple of places to start:

=over 4

=item 1 Check C<< $mech->status() >> after each call

=item 2 Check the URL with C<< $mech->uri() >> to see where you ended up

=item 3 Try debugging with C<< LWP::Debug >>.

=back

If things are really strange, turn on debugging with
C<< use LWP::Debug qw(+); >>
Just put this in the main program. This causes LWP to print out a trace
of the HTTP traffic between client and server and can be used to figure
out what is happening at the protocol level.

It is also useful to set many traps to verify that processing is
proceeding as expected. A Mech program should always have an "I didn't
expect to get here" or "I don't recognize the page that I am processing"
case and bail out.

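One way to set such a trap is to name every page you expect and die on
anything else. The page names, patterns, and the helper
C<identify_page> below are hypothetical; with a live Mech object you
would pass it C<< $mech->content >>.

```perl
use strict;
use warnings;

# Each expected page is paired with a pattern that identifies it.
# These names and patterns are invented for the example.
my @known_pages = (
    [ login   => qr/<title>\s*Sign in/i ],
    [ results => qr/<h1>\s*Search results/i ],
);

# Return the name of the page that matches, or bail out if none does.
sub identify_page {
    my ($html) = @_;
    for my $page (@known_pages) {
        my ( $name, $pattern ) = @$page;
        return $name if $html =~ $pattern;
    }
    die "I don't recognize the page that I am processing\n";
}

print identify_page('<html><title>Sign in</title></html>'), "\n";

# An unexpected page trips the trap instead of being processed blindly.
my $ok = eval { identify_page('<html><title>Maintenance</title></html>'); 1 };
print "Bailing out: $@" unless $ok;
```
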
Since errors can be transient, by the time you notice that the error
has occurred, it might not be possible to reproduce it manually. So
for automated processing it is useful to email yourself the following
information:

=over 4

=item * Where processing is taking place

=item * An error message

=item * C<< $mech->uri >>

=item * C<< $mech->content >>

=back

You can also save the content of the page with C<< $mech->save_content( 'filename.html' ); >>.

=head2 I submitted a form, but the server ignored everything! I got an empty form back!

The post is handled by application software. It is common for PHP
programmers to use the same file both to display a form and to process
the arguments returned. So the first task of the application programmer
is to decide whether there are arguments to process. The program can
check whether a particular parameter has been set, whether a hidden
parameter has been set, or whether the submit button has been clicked.
(There are probably other ways that I haven't thought of.)

In any case, if your form is not setting the parameter (e.g. the submit
button) which the web application is keying on (and as an outsider there
is no way to know what it is keying on), it will not notice that the form
has been submitted. Try using C<< $mech->click() >> instead of
C<< $mech->submit() >> or vice versa.

=head2 I've logged in to the server, but I get 500 errors when I try to get to protected content.

Some web sites use distributed databases for their processing. It
can take a few seconds for the login/session information to percolate
through to all the servers. For human users with their slow reaction
times, this is not a problem, but a Perl script can outrun the server.
So try adding a C<sleep(5)> between logging in and actually doing anything
(the optimal delay must be determined experimentally).

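If a fixed sleep turns out to be either too short or wastefully long,
one variation is to retry a few times with a pause in between. This is
a sketch rather than anything Mech provides: C<retry_with_delay> is a
hypothetical helper, and the request is faked with a counter so the
example runs on its own.

```perl
use strict;
use warnings;

# Call $action until it returns true, sleeping $delay seconds between
# attempts, for at most $tries attempts. Returns the number of the
# attempt that succeeded, or 0 if none did.
sub retry_with_delay {
    my ( $action, $tries, $delay ) = @_;
    for my $attempt ( 1 .. $tries ) {
        return $attempt if $action->();
        sleep $delay if $attempt < $tries;
    }
    return 0;
}

# Stand-in for a request that only succeeds on the third attempt.
my $calls = 0;
my $fake_request = sub { return ++$calls >= 3 };

my $attempts = retry_with_delay( $fake_request, 5, 1 );
print $attempts ? "Succeeded after $attempts attempts\n" : "Gave up\n";
```

With a real Mech object, the action would fetch the protected page and
return C<< $mech->success >>. That assumes the object was created with
C<< autocheck => 0 >>, so that a failed request returns instead of dying.
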
=head2 Mech is a big memory pig! I'm running out of RAM!

Mech keeps a history of every page, and the state it was in. It actually
keeps a clone of the full Mech object at every step along the way.

You can limit this stack size with the C<stack_depth> parameter in the C<new()>
constructor. If you set C<stack_depth> to 0, Mech will not keep any history.

=head1 AUTHOR

Copyright 2005-2009 Andy Lester C<< <andy at petdance.com> >>

=cut