=encoding UTF-8

=head1 Name

Catalyst::UTF8 - All About UTF8 and Catalyst Encoding

=head1 Description

Starting in 5.90080 L<Catalyst> will enable UTF8 encoding by default for
text like body responses. In addition we've made many fixes around encoding
and utf8 scattered throughout the codebase. This document attempts to give
an overview of the assumptions and practices that L<Catalyst> uses when
dealing with UTF8 and encoding issues. You should also review the
Changes file, L<Catalyst::Delta> and L<Catalyst::Upgrading> for more.

We attempt to describe all relevant processes, give some advice,
and explain where we have made exceptions in order to respect our commitment
to backwards compatibility.

=head1 UTF8 in Controller Actions

Using UTF8 characters in your Controller classes and actions.

=head2 Summary

In this section we will review changes to how UTF8 characters can be used in
controller actions, how they look in the debugging screens (and your logs),
and how you construct URL objects to actions with UTF8 paths
(or using UTF8 args or captures).

=head2 Unicode in Controllers and URLs

    package MyApp::Controller::Root;

    use utf8;
    use base 'Catalyst::Controller';

    sub heart_with_arg :Path('♥') Args(1) {
        my ($self, $c, $arg) = @_;
    }

    sub base :Chained('/') CaptureArgs(0) {
        my ($self, $c) = @_;
    }

    sub capture :Chained('base') PathPart('♥') CaptureArgs(1) {
        my ($self, $c, $capture) = @_;
    }

    sub arg :Chained('capture') PathPart('♥') Args(1) {
        my ($self, $c, $arg) = @_;
    }

=head2 Discussion

In the example controller above we have constructed two matchable URL routes:

    http://localhost/root/♥/{arg}
    http://localhost/base/♥/{capture}/♥/{arg}

The first one is a classic Path type action and the second uses Chaining, and
spans three actions in total. As you can see, you can use unicode characters
in your Path and PathPart attributes (remember to use the C<utf8> pragma to allow
these multibyte characters in your source). The two constructed matchable routes
would match the following incoming URLs:

    (heart_with_arg) -> http://localhost/root/%E2%99%A5/{arg}
    (base/capture/arg) -> http://localhost/base/%E2%99%A5/{capture}/%E2%99%A5/{arg}

The path part C<%E2%99%A5> is URL encoded unicode (assuming you are hitting this with
a reasonably modern browser). It's basically what goes over HTTP when you type a
browser location that has the unicode 'heart' in it. However we will use the unicode
symbol in your debugging messages:

    [debug] Loaded Path actions:
    .-------------------------------------+--------------------------------------.
    | Path                                | Private                              |
    +-------------------------------------+--------------------------------------+
    | /root/♥/*                           | /root/heart_with_arg                 |
    '-------------------------------------+--------------------------------------'

    [debug] Loaded Chained actions:
    .-------------------------------------+--------------------------------------.
    | Path Spec                           | Private                              |
    +-------------------------------------+--------------------------------------+
    | /base/♥/*/♥/*                       | /root/base (0)                       |
    |                                     | -> /root/capture (1)                 |
    |                                     | => /root/arg                         |
    '-------------------------------------+--------------------------------------'

And if the requested URL uses unicode characters in your captures or args (such as
C<http://localhost/base/♥/♥/♥/♥>) you should see the arguments and captures as their
unicode characters as well:

    [debug] Arguments are "♥"
    [debug] "GET" request for "base/♥/♥/♥/♥" from "127.0.0.1"
    .------------------------------------------------------------+-----------.
    | Action                                                     | Time      |
    +------------------------------------------------------------+-----------+
    | /root/base                                                 | 0.000080s |
    | /root/capture                                              | 0.000075s |
    | /root/arg                                                  | 0.000755s |
    '------------------------------------------------------------+-----------'

Again, remember that we are displaying the unicode character and using it to match actions
containing such multibyte characters BUT over HTTP you are getting these as URL encoded
bytes. For example if you looked at the L<PSGI> C<$env> value for C<REQUEST_URI> you
would see (for the above request):

    REQUEST_URI => "/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5"

So on the incoming request we decode so that we can match and display unicode characters
(after decoding the URL encoding). This makes it straightforward to use these types of
multibyte characters in your actions and see them incoming in captures and arguments. Please
keep this in mind if you are doing, for example, regular expression matching, length determination
or other string comparisons: you will need to treat these incoming variables as UTF8
character strings. For example in the following action:

    sub arg :Chained('capture') PathPart('♥') Args(1) {
        my ($self, $c, $arg) = @_;
    }

when C<$arg> is "♥" you should expect C<length($arg)> to be C<1> since it is indeed one character
although it will take more than one byte to store.
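
The difference between characters and bytes can be checked outside of L<Catalyst> with
the core L<Encode> module. The following is only a minimal sketch and not how L<Catalyst>
itself is implemented: the variable names are made up for illustration, and the percent
decoding is done with a plain substitution.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A path segment as it arrives over HTTP (percent encoded UTF-8 bytes).
my $raw = '%E2%99%A5';

# Undo the percent encoding first; the result is a string of three bytes.
(my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Decode the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints "bytes: 3, characters: 1"
```

The second number is what you should expect to see for C<$arg> inside an action.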

=head2 UTF8 in constructing URLs via $c->uri_for

For the reverse (constructing meaningful URLs to actions that contain multibyte characters
in their paths or path parts, or when you want to include such characters in your captures
or arguments) L<Catalyst> will do the right thing (again just remember to use the C<utf8>
pragma).

    use utf8;
    my $url = $c->uri_for( $c->controller('Root')->action_for('arg'), ['♥','♥']);

When you stringify this object (for use in a template, for example) it will automatically
do the right thing regarding utf8 encoding and URL encoding.

    http://localhost/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5

This is what you want: a properly URL encoded version of the path. In this case your string
length will reflect URL encoded bytes, not the character length. Ultimately what you
send over the wire via HTTP needs to be bytes.
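
To make the two steps concrete (UTF-8 encode the characters, then percent encode the
resulting bytes), here is a small sketch using only core Perl. It is an illustration of the
transformation, not the code path C<uri_for> uses; C<escape_segment> is a hypothetical
helper invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn one decoded path segment into URL-safe bytes.
sub escape_segment {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);    # characters -> UTF-8 bytes
    # Percent encode every byte outside the RFC 3986 unreserved set.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $heart = "\x{2665}";    # the heart character used throughout this document
my $path  = join '/', '', 'base', map( escape_segment($_), ($heart) x 4 );

print "$path\n";
# prints "/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5"

print length($heart), " character, ", length(escape_segment($heart)), " bytes once escaped\n";
# prints "1 character, 9 bytes once escaped"
```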

=head1 UTF8 in GET Query and Form POST

What Catalyst does with UTF8 in your GET and classic HTML Form POST.

=head2 UTF8 in URL query and keywords

The same rules that we find in URL paths also cover URL query parts. That is, if
one types a URL like this into the browser (again assuming a modernish UI that
allows unicode):

    http://localhost/example?♥=♥♥

When this goes 'over the wire' to your application server it's going to be as
percent encoded bytes:

    http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5

When L<Catalyst> encounters this we decode the percent encoding and the utf8
so that we can properly display this information (such as in the debugging
logs or in a response).

    [debug] Query Parameters are:
    .-------------------------------------+--------------------------------------.
    | Parameter                           | Value                                |
    +-------------------------------------+--------------------------------------+
    | ♥                                   | ♥♥                                   |
    '-------------------------------------+--------------------------------------'

All the values and keys that are part of $c->req->query_parameters will be
utf8 decoded. So you should not need to do anything special to take those
values/keys and send them to the body response (since as we will see later
L<Catalyst> will do all the necessary encoding for you).

Again, remember that values of your parameters are now decoded into Unicode strings, so
for example you'd expect the result of length to reflect the character length not
the byte length.
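
As a rough illustration of what "decoded" means for keys and values, the following
sketch parses the query string from the example by hand. C<decode_component> is a
hypothetical helper written for this document; it is not part of L<Catalyst>, which does
this work for you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: percent decode, then UTF-8 decode, one query component.
sub decode_component {
    my ($raw) = @_;
    $raw =~ tr/+/ /;                                # '+' means space in a query
    $raw =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # percent encoding -> bytes
    return decode('UTF-8', $raw);                   # bytes -> characters
}

my $query = '%E2%99%A5=%E2%99%A5%E2%99%A5';

my %params;
for my $pair (split /[&;]/, $query) {
    my ($key, $value) = map( decode_component($_), split(/=/, $pair, 2) );
    $params{$key} = $value;
}

my ($key) = keys %params;
printf "key length: %d, value length: %d\n", length($key), length($params{$key});
# prints "key length: 1, value length: 2"
```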

Just like with arguments and captures, you can use utf8 literals (or utf8
strings) in $c->uri_for:

    use utf8;
    my $url = $c->uri_for( $c->controller('Root')->action_for('example'), {'♥' => '♥♥'});

When you stringify this object (for use in a template, for example) it will automatically
do the right thing regarding utf8 encoding and URL encoding.

    http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5

Again, what you want here is a properly URL encoded version of the query. Ultimately what you
send over the wire via HTTP needs to be bytes (not unicode characters).

Remember if you use any utf8 literals in your source code, you should use the
C<use utf8> pragma.

=head2 UTF8 in Form POST

In general most modern browsers will follow the specification, which says that POSTed
form fields should be encoded in the same way that the document was served with. That means
that if you are using modern Catalyst and serving UTF8 encoded responses, a browser is
supposed to notice that and encode the form POSTs accordingly.

As a result, since L<Catalyst> now serves UTF8 encoded responses by default, you can
mostly rely on incoming form POSTs to be so encoded. L<Catalyst> will make this
assumption and decode accordingly (unless you explicitly turn off encoding). If you are
running Catalyst in developer debug, then you will see the correct unicode characters in
the debug output. For example if you generate a POST request:

    use utf8;
    use Catalyst::Test 'MyApp';
    use HTTP::Request::Common;    # exports POST

    my $res = request POST "/example/posted", ['♥'=>'♥', '♥♥'=>'♥'];

Running in CATALYST_DEBUG=1 mode you should see output like this:

    [debug] Body Parameters are:
    .-------------------------------------+--------------------------------------.
    | Parameter                           | Value                                |
    +-------------------------------------+--------------------------------------+
    | ♥                                   | ♥                                    |
    | ♥♥                                  | ♥                                    |
    '-------------------------------------+--------------------------------------'

And if you had a controller like this:

    package MyApp::Controller::Example;

    use utf8;
    use base 'Catalyst::Controller';

    sub posted :POST Local {
        my ($self, $c) = @_;
        $c->res->content_type('text/plain');
        $c->res->body("hearts => ${\$c->req->body_parameters->{'♥'}}");
    }

The following test case would be true:

    use Encode 2.21 'decode_utf8';
    is decode_utf8($res->content), 'hearts => ♥';

In this case we decode so that we can print and compare strings with multibyte characters.

B<NOTE> In some cases a browser may not follow the specification and may fail to set the form POST
encoding based on the server response. Catalyst itself doesn't attempt any workarounds, but one
common approach is to use a hidden form field with a UTF8 value (you might be familiar with
this from how Ruby on Rails has HTML form helpers that do that automatically). In that case
some browsers will send UTF8 encoded data if they notice the hidden input field contains such a
character. Also, you can add an HTML attribute to your form tag which many modern browsers
will respect to set the encoding (C<accept-charset="utf-8">). And lastly there are some JavaScript
based tricks and workarounds for even more odd cases (searching the web for this will return
a number of approaches). Hopefully as more compliant browsers become popular these edge cases
will fade.

=head1 UTF8 Encoding in Body Response

When does L<Catalyst> encode your response body and what rules does it use to
determine when that is needed?

=head2 Summary

    use utf8;
    use warnings;
    use strict;

    package MyApp::Controller::Root;

    use base 'Catalyst::Controller';
    use File::Spec;

    sub scalar_body :Local {
        my ($self, $c) = @_;
        $c->response->content_type('text/html');
        $c->response->body("<p>This is scalar_body action ♥</p>");
    }

    sub stream_write :Local {
        my ($self, $c) = @_;
        $c->response->content_type('text/html');
        $c->response->write("<p>This is stream_write action ♥</p>");
    }

    sub stream_write_fh :Local {
        my ($self, $c) = @_;
        $c->response->content_type('text/html');

        my $writer = $c->res->write_fh;
        $writer->write_encoded('<p>This is stream_write_fh action ♥</p>');
        $writer->close;
    }

    sub stream_body_fh :Local {
        my ($self, $c) = @_;
        my $path = File::Spec->catfile('t', 'utf8.txt');
        open(my $fh, '<', $path) || die "trouble: $!";
        $c->response->content_type('text/html');
        $c->response->body($fh);
    }

=head2 Discussion

Beginning with L<Catalyst> version 5.90080 you no longer need to set the encoding
configuration (although doing so won't hurt anything).

Currently we only encode if the content type is one of the types which generally expects a
UTF8 encoding. This is determined by the following regular expression:

    our $DEFAULT_ENCODE_CONTENT_TYPE_MATCH = qr{text|xml$|javascript$};
    $c->response->content_type =~ /$DEFAULT_ENCODE_CONTENT_TYPE_MATCH/

This is a global variable in L<Catalyst::Response> which is stored in the C<encodable_content_type>
attribute of $c->response. You may currently alter this directly on the response or globally. In
the future we may offer a configuration setting for this.

This would match content types like the following (examples):

    text/plain
    text/html
    text/xml
    application/javascript
    application/xml
    application/vnd.user+xml

You should set your content type prior to header finalization if you want L<Catalyst> to
encode.

B<NOTE> We do not attempt to encode C<application/json> since the two most commonly used
approaches (L<Catalyst::View::JSON> and L<Catalyst::Action::REST>) have already configured
their JSON encoders to produce properly encoded UTF8 responses. If you are rolling your
own JSON encoding, you may need to set the encoder to do the right thing (or override
the global regular expression to include the JSON media type).
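
If you are unsure whether a given content type is covered, you can try the pattern quoted
above in isolation. This sketch only exercises the regular expression; C<$with_json> is a
hypothetical widened pattern shown for comparison, not something L<Catalyst> ships.

```perl
use strict;
use warnings;

# The default pattern quoted above.
my $match = qr{text|xml$|javascript$};

# Hypothetical wider pattern that would also cover JSON media types.
my $with_json = qr{text|xml$|javascript$|json$};

for my $type (qw(
    text/plain text/html application/xml application/vnd.user+xml
    application/javascript application/json image/png
)) {
    printf "%-26s default: %-3s  with json: %s\n",
        $type,
        ($type =~ $match     ? 'yes' : 'no'),
        ($type =~ $with_json ? 'yes' : 'no');
}
```

With the default pattern the first five types are reported as encodable, while
C<application/json> and C<image/png> are not.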

=head2 Encoding with Scalar Body

L<Catalyst> supports several methods of supplying your response with body content. The first
and currently most common is to set the L<Catalyst::Response> ->body with a scalar string
(as in the example):

    use utf8;

    sub scalar_body :Local {
        my ($self, $c) = @_;
        $c->response->content_type('text/html');
        $c->response->body("<p>This is scalar_body action ♥</p>");
    }

In general you should need to do nothing else since L<Catalyst> will automatically encode
this string during body finalization. The only matter to watch out for is to make sure
the string has not already been encoded, as this will result in double encoding errors.
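
Double encoding is easy to reproduce with the core L<Encode> module. The sketch below is
independent of L<Catalyst> and uses made-up variable names; it simply encodes the same
string twice to show what a client would receive if you encoded the body yourself and
then let the framework encode it again.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{2665}";                 # one character
my $once  = encode('UTF-8', $chars);    # correct: 3 bytes (E2 99 A5)
my $twice = encode('UTF-8', $once);     # mistake: each byte is treated as a character

printf "once: %d bytes, twice: %d bytes\n", length($once), length($twice);
# prints "once: 3 bytes, twice: 6 bytes"

# A client that decodes once gets three junk characters back, not the heart.
my $seen = decode('UTF-8', $twice);
print "round trip ", ($seen eq $chars ? "ok" : "broken"), "\n";
# prints "round trip broken"
```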

B<NOTE> Pay attention to the content-type setting in the example. L<Catalyst> inspects that
content type carefully to determine if the body needs encoding.

B<NOTE> If you set the character set of the response L<Catalyst> will skip encoding IF the
character set is set to something that doesn't match $c->encoding->mime_name. We will assume
if you are setting an alternative character set, that means you want to handle the encoding
yourself. However it might be easier to set $c->encoding for a given response cycle since
you can override this for a given response. For example here's how to override the default
encoding and set the correct character set in the response:

    use utf8;
    use Encode ();

    sub override_encoding :Local {
        my ($self, $c) = @_;
        $c->res->content_type('text/plain');
        $c->encoding(Encode::find_encoding('Shift_JIS'));
        $c->response->body("テスト");
    }

This will use the alternative encoding for a single response.
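
To see what an encoding object like the one passed to $c->encoding does on its own, you
can call it directly. This is a standalone sketch using only L<Encode>; it does not touch
a L<Catalyst> context, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# The three katakana characters from the example, written as code points.
my $text = "\x{30C6}\x{30B9}\x{30C8}";

my $sjis = Encode::find_encoding('Shift_JIS') or die "Shift_JIS not available";
my $utf8 = Encode::find_encoding('UTF-8')     or die "UTF-8 not available";

# The same three characters produce different byte sequences per encoding:
# Shift_JIS needs 6 bytes for this text, UTF-8 needs 9.
printf "%s: %d bytes\n", $_->mime_name, length($_->encode($text)) for $sjis, $utf8;
```

The C<mime_name> printed here is the value that the character set check described above
compares against.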

B<NOTE> If you manually set the content-type character set to whatever $c->encoding->mime_name
is set to, we STILL encode, rather than assume your manual setting is a flag to override. This
is done to support backward compatible assumptions. If you are going to handle encoding
manually you may set $c->clear_encoding for a single request response cycle.

=head2 Encoding with streaming type responses

L<Catalyst> offers two approaches to streaming your body response. Again, you must remember
to set your content type prior to streaming, since invoking a streaming response will automatically
finalize and send your HTTP headers (and your content type MUST be one that matches the regular
expression given above).

Also, if you are going to override $c->encoding (or invoke $c->clear_encoding), you should do
that before anything else!

The first streaming method is to use the C<write> method on the response object. This method
allows 'inlined' streaming and is generally used with blocking style servers.

    sub stream_write :Local {
        my ($self, $c) = @_;
        $c->response->content_type('text/html');
        $c->response->write("<p>This is stream_write action ♥</p>");
    }

You may call the C<write> method as often as you need to finish streaming all your content.
L<Catalyst> will encode each chunk in turn as long as the content-type meets the 'encodable types'
requirement and $c->encoding is set (which it is, as long as you did not change it).

B<NOTE> If you try to change the encoding after you start the stream, this will invoke an error
response. However since you've already started streaming this will not show up as an HTTP error
status code, but rather as error information in your body response and an error in your logs.

The second way to stream a response is to get the response writer object and invoke methods
on that directly:

    sub stream_write_fh :Local {
        my ($self, $c) = @_;
        $c->response->content_type('text/html');

        my $writer = $c->res->write_fh;
        $writer->write_encoded('<p>This is stream_write_fh action ♥</p>');
        $writer->close;
    }

This can be used just like the C<write> method, but typically you request this object when
you want to do a nonblocking style response since the writer object can be closed over or
sent to a model that will invoke it in a non blocking manner. For more on using the writer
object for non blocking responses you should review the L<Catalyst> documentation and also
you can look at several articles from last year's advent calendar, in particular:

L<http://www.catalystframework.org/calendar/2013/10>, L<http://www.catalystframework.org/calendar/2013/11>,
L<http://www.catalystframework.org/calendar/2013/12>, L<http://www.catalystframework.org/calendar/2013/13>,
L<http://www.catalystframework.org/calendar/2013/14>.

The main difference this year is that previously calling C<write_fh> would return the actual
L<Plack> writer object that was supplied by your plack application handler, whereas now we wrap
that object in a lightweight decorator object that proxies the C<write> and C<close> methods
and supplies an additional C<write_encoded> method. C<write_encoded> does the exact same thing
as C<write> except that it will first encode the string when necessary. In general if you are
streaming encodable content such as HTML this is the method to use. If you are streaming
binary content, you should just use the C<write> method (although if the content type is set
correctly we would skip encoding anyway, but you may as well avoid the extra no-op overhead).
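
The decorator idea can be sketched in a few lines of plain Perl. Everything below is
hypothetical and written only to illustrate the pattern: C<My::FakeWriter> stands in for
the writer a PSGI server would supply, and C<My::EncodingWriter> is not the class
L<Catalyst> actually uses.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the writer object a PSGI server would supply.
package My::FakeWriter;
sub new   { return bless { chunks => [], closed => 0 }, shift }
sub write { my ($self, $bytes) = @_; push @{ $self->{chunks} }, $bytes; return }
sub close { my ($self) = @_; $self->{closed} = 1; return }

# Hypothetical decorator: proxies write/close and adds write_encoded.
package My::EncodingWriter;
use Encode ();

sub new {
    my ($class, %args) = @_;    # expects writer => $obj, encoding => $encoding_obj
    return bless {%args}, $class;
}
sub write { my ($self, @args) = @_; return $self->{writer}->write(@args) }
sub close { my ($self) = @_; return $self->{writer}->close }

sub write_encoded {
    my ($self, $chars) = @_;
    my $encoding = $self->{encoding};
    # Encode only when an encoding is configured; otherwise pass through.
    my $bytes = $encoding ? $encoding->encode($chars) : $chars;
    return $self->{writer}->write($bytes);
}

package main;

my $inner  = My::FakeWriter->new;
my $writer = My::EncodingWriter->new(
    writer   => $inner,
    encoding => Encode::find_encoding('UTF-8'),
);
$writer->write_encoded("<p>\x{2665}</p>");
$writer->close;

printf "%d bytes written, closed: %d\n", length($inner->{chunks}[0]), $inner->{closed};
# prints "10 bytes written, closed: 1"
```

Because the wrapper holds a reference to the underlying writer, it can be captured in a
closure or handed to a model in the same way as the raw writer.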

The last style of content response that L<Catalyst> supports is setting the body to a filehandle
like object. In this case the object is passed down to the Plack application handler directly
and currently we do nothing to set encoding.

    sub stream_body_fh :Local {
        my ($self, $c) = @_;
        my $path = File::Spec->catfile('t', 'utf8.txt');
        open(my $fh, '<', $path) || die "trouble: $!";
        $c->response->content_type('text/html');
        $c->response->body($fh);
    }

In this example we create a filehandle to a text file that contains UTF8 encoded characters. We
pass this down without modification, which we think is correct since we don't want to double
encode. However this may change in a future development release so please be sure to double
check the current docs and changelog. It's possible a future release will require you to set
an encoding on the IO layer level so that we can be sure to properly encode at body finalization.
So this is still an edge case we are writing test examples for. But for now if you are returning
a filehandle like response, you are expected to make sure you are following the L<PSGI> specification
and that the filehandle returns bytes, not decoded characters.

=head2 Override the Encoding on Context

As already noted you may change the current encoding (or remove it) by setting an alternative
encoding on the context:

    $c->encoding(Encode::find_encoding('Shift_JIS'));

Please note that you can continue to change encoding UNTIL the headers have been finalized. The
last setting always wins. Trying to change encoding after header finalization is an error.

=head2 Setting the Content Encoding HTTP Header

In some cases you may set a content encoding on your response, for example if you are encoding
your response with gzip. In this case you are again on your own. If we notice that the
content encoding header is set when we hit finalization, we skip automatic encoding:

    use Encode;
    use Compress::Zlib;
    use utf8;

    sub gzipped :Local {
        my ($self, $c) = @_;

        $c->res->content_type('text/plain');
        $c->res->content_type_charset('UTF-8');
        $c->res->content_encoding('gzip');

        $c->response->body(
            Compress::Zlib::memGzip(
                Encode::encode_utf8("manual_1 ♥")));
    }

If you are using L<Catalyst::Plugin::Compress> you need to upgrade to the most recent version
in order to be compatible with changes introduced in L<Catalyst> 5.90080. Other plugins may
require updates (please open bugs if you find them).

B<NOTE> Content encoding may be set to 'identity' and we will still perform automatic encoding
if the content type is encodable and an encoding is present for the context.

=head2 Using Common Views

The following common views have been updated so that their tests pass with default UTF8
encoding for L<Catalyst>:

L<Catalyst::View::TT>, L<Catalyst::View::Mason>, L<Catalyst::View::HTML::Mason>,
L<Catalyst::View::Xslate>

See L<Catalyst::Upgrading> for additional information on L<Catalyst> extensions that require
upgrades.

In general for the common views you should not need to do anything special. If your actual
template files contain UTF8 literals you should set configuration on your View to enable that.
For example in TT, if your template has actual UTF8 characters in it you should do the following:

    MyApp::View::TT->config(ENCODING => 'utf-8');

However L<Catalyst::View::Xslate> wants to do the UTF8 encoding for you (we assume that the
authors of that view did this as a workaround to the fact that until now encoding was not core
to L<Catalyst>). So if you use that view, you either need to tell it not to encode, or you need
to turn off encoding for Catalyst.

    MyApp::View::Xslate->config(encode_body => 0);

or

    MyApp->config(encoding=>undef);

Preference is to disable it in the View.

Other views may be similar. You should review View documentation and test during upgrading.
We tried to make sure most common views worked properly and noted all workarounds, but if we
missed something please alert the development team (instead of introducing a local hack into
your application that will mean nobody will ever upgrade it...).

=head2 Disabling default UTF8 encoding

You may encounter issues with your legacy code running under default UTF8 body encoding. If
so you can disable this with the following configuration setting:

    MyApp->config(encoding=>undef);

Where C<MyApp> is your L<Catalyst> subclass.

If you believe you have discovered a bug in UTF8 body encoding, I strongly encourage you to
report it (and not try to hack a workaround in your local code). We also recommend that you
regard such a workaround as a temporary solution. It is ideal if L<Catalyst> extension
authors can start to count on L<Catalyst> doing the right thing for encoding.

=head1 Conclusion

This document has attempted to be a complete review of how UTF8 and encoding works in the
current version of L<Catalyst> and also to document known issues, gotchas and backward
compatible hacks. Please report issues to the development team.

=head1 Author

John Napiorkowski L<jjnapiork@cpan.org|email:jjnapiork@cpan.org>

=cut