Commit | Line | Data |
a09b49d2 |
1 | =encoding UTF-8 |
2 | |
3 | =head1 Name |
4 | |
d63cc9c8 |
5 | Catalyst::UTF8 - All About UTF8 and Catalyst Encoding |
a09b49d2 |
6 | |
7 | =head1 Description |
8 | |
b596572b |
9 | Starting in 5.90080 L<Catalyst> will enable UTF8 encoding by default for |
a09b49d2 |
10 | text-like body responses. In addition we've made a ton of fixes around encoding |
11 | and utf8 scattered throughout the codebase. This document attempts to give |
12 | an overview of the assumptions and practices that L<Catalyst> uses when |
13 | dealing with UTF8 and encoding issues. You should also review the |
14 | Changes file, L<Catalyst::Delta> and L<Catalyst::Upgrading> for more. |
15 | |
d63cc9c8 |
16 | We attempt to describe all relevant processes, try to give some advice |
a09b49d2 |
17 | and explain where we have made exceptions in order to respect our commitment |
18 | to backwards compatibility. |
19 | |
b596572b |
20 | =head1 UTF8 in Controller Actions |
a09b49d2 |
21 | |
22 | Using UTF8 characters in your Controller classes and actions. |
23 | |
24 | =head2 Summary |
25 | |
26 | In this section we will review changes to how UTF8 characters can be used in |
27 | controller actions, how it looks in the debugging screens (and your logs) |
28 | as well as how you construct L<URI> objects to actions with UTF8 paths |
29 | (or using UTF8 args or captures). |
30 | |
31 | =head2 Unicode in Controllers and URLs |
32 | |
33 | package MyApp::Controller::Root; |
34 | |
473078ff |
35 | use utf8; |
a09b49d2 |
36 | use base 'Catalyst::Controller'; |
37 | |
38 | sub heart_with_arg :Path('♥') Args(1) { |
39 | my ($self, $c, $arg) = @_; |
40 | } |
41 | |
42 | sub base :Chained('/') CaptureArgs(0) { |
43 | my ($self, $c) = @_; |
44 | } |
45 | |
46 | sub capture :Chained('base') PathPart('♥') CaptureArgs(1) { |
47 | my ($self, $c, $capture) = @_; |
48 | } |
49 | |
50 | sub arg :Chained('capture') PathPart('♥') Args(1) { |
51 | my ($self, $c, $arg) = @_; |
52 | } |
53 | |
54 | =head2 Discussion |
55 | |
56 | In the example controller above we have constructed two matchable URL routes: |
57 | |
58 | http://localhost/root/♥/{arg} |
59 | http://localhost/base/♥/{capture}/♥/{arg} |
60 | |
61 | The first one is a classic Path type action and the second uses Chaining, and |
62 | spans three actions in total. As you can see, you can use unicode characters |
473078ff |
63 | in your Path and PathPart attributes (remember to use the C<utf8> pragma to allow |
a09b49d2 |
64 | these multibyte characters in your source). The two constructed matchable routes |
65 | would match the following incoming URLs: |
66 | |
67 | (heart_with_arg) -> http://localhost/root/%E2%99%A5/{arg} |
68 | (base/capture/arg) -> http://localhost/base/%E2%99%A5/{capture}/%E2%99%A5/{arg} |
69 | |
70 | That path part C<%E2%99%A5> is URL-encoded Unicode (assuming you are hitting this with |
71 | a reasonably modern browser). It's basically what goes over HTTP when you type a |
72 | browser location that has the Unicode 'heart' in it. However we will use the Unicode |
73 | symbol in your debugging messages: |
74 | |
75 | [debug] Loaded Path actions: |
76 | .-------------------------------------+--------------------------------------. |
77 | | Path | Private | |
78 | +-------------------------------------+--------------------------------------+ |
79 | | /root/♥/* | /root/heart_with_arg | |
80 | '-------------------------------------+--------------------------------------' |
81 | |
82 | [debug] Loaded Chained actions: |
83 | .-------------------------------------+--------------------------------------. |
84 | | Path Spec | Private | |
85 | +-------------------------------------+--------------------------------------+ |
86 | | /base/♥/*/♥/* | /root/base (0) | |
87 | | | -> /root/capture (1) | |
88 | | | => /root/arg | |
89 | '-------------------------------------+--------------------------------------' |
90 | |
91 | And if the requested URL uses unicode characters in your captures or args (such as |
92 | C<http://localhost/base/♥/♥/♥/♥>) you should see the arguments and captures as their |
93 | unicode characters as well: |
94 | |
95 | [debug] Arguments are "♥" |
96 | [debug] "GET" request for "base/♥/♥/♥/♥" from "127.0.0.1" |
97 | .------------------------------------------------------------+-----------. |
98 | | Action | Time | |
99 | +------------------------------------------------------------+-----------+ |
100 | | /root/base | 0.000080s | |
101 | | /root/capture | 0.000075s | |
102 | | /root/arg | 0.000755s | |
103 | '------------------------------------------------------------+-----------' |
104 | |
105 | Again, remember that we are displaying the unicode character and using it to match actions |
106 | containing such multibyte characters BUT over HTTP you are getting these as URL encoded |
b596572b |
107 | bytes. For example if you looked at the L<PSGI> C<$env> value for C<REQUEST_URI> you |
108 | would see (for the above request) |
a09b49d2 |
109 | |
110 | REQUEST_URI => "/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5" |
111 | |
112 | So on the incoming request we decode so that we can match and display unicode characters |
113 | (after decoding the URL encoding). This makes it straightforward to use these types of |
114 | multibyte characters in your actions and see them incoming in captures and arguments. Please |
115 | keep this in mind if you are doing, for example, regular expression matching, length determination |
116 | or other string comparisons: you will need to treat these incoming variables as decoded Unicode |
117 | strings. For example in the following action: |
118 | |
119 | sub arg :Chained('capture') PathPart('♥') Args(1) { |
120 | my ($self, $c, $arg) = @_; |
121 | } |
122 | |
123 | when $arg is "♥" you should expect C<length($arg)> to be C<1> since it is indeed one character |
124 | although it will take more than one byte to store. |
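The character-versus-byte distinction can be checked outside of Catalyst. The following is a minimal sketch using only the core Encode module; the variable names are illustrative and are not part of any Catalyst API.

```perl
use strict;
use warnings;
use utf8;                      # the source below contains a UTF-8 literal
use Encode qw(encode);

# A decoded character string, as an action would receive it in $arg.
my $arg = "♥";

# The same text as UTF-8 bytes, as it would travel over the wire.
my $arg_bytes = encode('UTF-8', $arg);

my $char_length = length($arg);        # counts characters
my $byte_length = length($arg_bytes);  # counts bytes

print "characters: $char_length, bytes: $byte_length\n";
# prints "characters: 1, bytes: 3"
```

The heart is a single code point (U+2665) that UTF-8 stores as the three bytes E2 99 A5, which is why the two lengths differ.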
125 | |
126 | =head2 UTF8 in constructing URLs via $c->uri_for |
127 | |
128 | For the reverse (constructing meaningful URLs to actions that contain multibyte characters |
129 | in their paths or path parts, or when you want to include such characters in your captures |
130 | or arguments) L<Catalyst> will do the right thing (again just remember to use the C<utf8> |
131 | pragma). |
132 | |
133 | use utf8; |
134 | my $url = $c->uri_for( $c->controller('Root')->action_for('arg'), ['♥','♥']); |
135 | |
473078ff |
136 | When you stringify this object (for use in a template, for example) it will automatically |
a09b49d2 |
137 | do the right thing regarding utf8 encoding and url encoding. |
138 | |
139 | http://localhost/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5 |
140 | |
141 | Since again what you want is a properly url encoded version of this. In this case your string |
142 | length will reflect URL encoded bytes, not the character length. Ultimately what you want |
143 | to send over the wire via HTTP needs to be bytes. |
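To make the two layers (UTF-8 encoding, then URL encoding) concrete, here is a rough sketch of the round trip for a single path segment. The helpers C<percent_encode_segment> and C<percent_decode_segment> are hypothetical names written for this illustration; they are not what L<Catalyst> or L<URI> use internally, and they only handle the simple case shown.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Hypothetical helper: characters -> UTF-8 bytes -> percent-encoded ASCII.
sub percent_encode_segment {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    # Escape every byte that is not an RFC 3986 "unreserved" character.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: percent-encoded ASCII -> UTF-8 bytes -> characters.
sub percent_decode_segment {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

my $escaped = percent_encode_segment("♥");
my $decoded = percent_decode_segment($escaped);

print "escaped: $escaped (", length($escaped), " characters long)\n";
# prints "escaped: %E2%99%A5 (9 characters long)"
```

Decoding the escaped form gives back a one-character string, which matches what an action sees in its captures and arguments.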
144 | |
145 | =head1 UTF8 in GET Query and Form POST |
146 | |
147 | What Catalyst does with UTF8 in your GET and classic HTML Form POST |
148 | |
149 | =head2 UTF8 in URL query and keywords |
150 | |
473078ff |
151 | The same rules that we find in URL paths also cover URL query parts. That is |
152 | if one types a URL like this into the browser |
a09b49d2 |
153 | |
154 | http://localhost/example?♥=♥♥ |
155 | |
156 | When this goes 'over the wire' to your application server it's going to be sent as |
157 | percent encoded bytes: |
158 | |
159 | |
160 | http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5 |
161 | |
162 | When L<Catalyst> encounters this we decode the percent encoding and the utf8 |
163 | so that we can properly display this information (such as in the debugging |
164 | logs or in a response.) |
165 | |
166 | [debug] Query Parameters are: |
167 | .-------------------------------------+--------------------------------------. |
168 | | Parameter | Value | |
169 | +-------------------------------------+--------------------------------------+ |
170 | | ♥ | ♥♥ | |
171 | '-------------------------------------+--------------------------------------' |
172 | |
173 | All the values and keys that are part of $c->req->query_parameters will be |
174 | utf8 decoded. So you should not need to do anything special to take those |
175 | values/keys and send them to the body response (since as we will see later |
176 | L<Catalyst> will do all the necessary encoding for you). |
177 | |
178 | Again, remember that the values of your parameters are now decoded into Unicode strings, so |
179 | for example you'd expect the result of length to reflect the character length not |
b596572b |
180 | the byte length. |
a09b49d2 |
181 | |
182 | Just like with arguments and captures, you can use utf8 literals (or utf8 |
183 | strings) in $c->uri_for: |
184 | |
185 | use utf8; |
186 | my $url = $c->uri_for( $c->controller('Root')->action_for('example'), {'♥' => '♥♥'}); |
187 | |
473078ff |
188 | When you stringify this object (for use in a template, for example) it will automatically |
a09b49d2 |
189 | do the right thing regarding utf8 encoding and url encoding. |
190 | |
191 | http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5 |
192 | |
193 | Since again what you want is a properly url encoded version of this. Ultimately what you want |
b596572b |
194 | to send over the wire via HTTP needs to be bytes (not unicode characters). |
a09b49d2 |
195 | |
196 | Remember if you use any utf8 literals in your source code, you should use the |
197 | C<use utf8> pragma. |
198 | |
f9d5afbc |
199 | B<NOTE:> Assuming UTF-8 in your query parameters and keywords may be an issue if you have |
200 | legacy code where you created URLs in templates manually and used an encoding other than UTF-8. |
201 | In these cases you may find that Catalyst 5.90080 and later will decode them incorrectly. For |
202 | backwards compatibility we offer three configuration settings, described here in order of |
203 | precedence: |
204 | |
205 | C<do_not_decode_query> |
206 | |
207 | If true, then do not try to character decode any wide characters in your |
77b90892 |
208 | request URL query or keywords. You will need to handle this manually in your action code |
209 | (although if you choose this setting, chances are you already do this). |
f9d5afbc |
210 | |
211 | C<default_query_encoding> |
212 | |
213 | This setting allows one to specify a fixed value for how to decode your query, instead of using |
214 | the default, UTF-8. |
215 | |
216 | C<decode_query_using_global_encoding> |
217 | |
218 | If this is true we decode using whatever you set C<encoding> to. |
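The kind of mis-decoding these settings guard against can be reproduced with the core Encode module alone. This is only an illustration of the problem, not Catalyst's actual decoding code; the sample word and the choice of ISO-8859-1 as the legacy encoding are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café" as a legacy template might have emitted it: ISO-8859-1 bytes,
# i.e. what is left of "caf%E9" after percent-decoding.
my $legacy_bytes = "caf\xE9";

# Decoding as UTF-8 (the default assumption) cannot interpret the lone
# 0xE9 byte; decoding as ISO-8859-1 recovers the intended text.
my $as_utf8   = decode('UTF-8',      $legacy_bytes);
my $as_latin1 = decode('ISO-8859-1', $legacy_bytes);

printf "UTF-8 decode matches Latin-1 decode: %s\n",
    ($as_utf8 eq $as_latin1 ? 'yes' : 'no');
# prints "UTF-8 decode matches Latin-1 decode: no"
```

In an application with such legacy URLs, setting C<default_query_encoding> to the legacy encoding is the option described above for this situation.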
219 | |
a09b49d2 |
220 | =head2 UTF8 in Form POST |
221 | |
222 | In general most modern browsers will follow the specification, which says that POSTed |
223 | form fields should be encoded in the same way that the document was served with. That means |
224 | that if you are using modern Catalyst and serving UTF8 encoded responses, a browser is |
225 | supposed to notice that and encode the form POSTs accordingly. |
226 | |
227 | As a result since L<Catalyst> now serves UTF8 encoded responses by default, this means that |
228 | you can mostly rely on incoming form POSTs to be so encoded. L<Catalyst> will make this |
229 | assumption and decode accordingly (unless you explicitly turn off encoding...) If you are |
b596572b |
230 | running Catalyst in developer debug, then you will see the correct unicode characters in |
a09b49d2 |
231 | the debug output. For example if you generate a POST request: |
232 | |
233 | use Catalyst::Test 'MyApp'; |
234 | use utf8; |
235 | use HTTP::Request::Common; # exports POST |
236 | my $res = request POST "/example/posted", ['♥'=>'♥', '♥♥'=>'♥']; |
237 | |
238 | Running in CATALYST_DEBUG=1 mode you should see output like this: |
239 | |
240 | [debug] Body Parameters are: |
241 | .-------------------------------------+--------------------------------------. |
242 | | Parameter | Value | |
243 | +-------------------------------------+--------------------------------------+ |
244 | | ♥ | ♥ | |
245 | | ♥♥ | ♥ | |
246 | '-------------------------------------+--------------------------------------' |
247 | |
248 | And if you had a controller like this: |
249 | |
250 | package MyApp::Controller::Example; |
b596572b |
251 | use utf8; |
a09b49d2 |
252 | use base 'Catalyst::Controller'; |
253 | |
254 | sub posted :POST Local { |
255 | my ($self, $c) = @_; |
256 | $c->res->content_type('text/plain'); |
257 | $c->res->body("hearts => ${\$c->req->body_parameters->{'♥'}}"); |
258 | } |
259 | |
260 | The following test case would be true: |
261 | |
262 | use Encode 2.21 'decode_utf8'; |
263 | is decode_utf8($res->content), 'hearts => ♥'; |
264 | |
b596572b |
265 | In this case we decode so that we can print and compare strings with multibyte characters. |
a09b49d2 |
266 | |
267 | B<NOTE> In some cases a browser may not follow the specification and may not set the form POST |
268 | encoding based on the server response. Catalyst itself doesn't attempt any workarounds, but one |
269 | common approach is to use a hidden form field with a UTF8 value (you might be familiar with |
270 | this from how Ruby on Rails has HTML form helpers that do that automatically). In that case |
271 | some browsers will send UTF8 encoded data if they notice the hidden input field contains such a |
272 | character. Also, you can add an HTML attribute to your form tag which many modern browsers |
273 | will respect to set the encoding (accept-charset="utf-8"). And lastly there are some javascript |
274 | based tricks and workarounds for even more odd cases (searching the web for this will return |
275 | a number of approaches). Hopefully as more compliant browsers become popular these edge cases |
276 | will fade. |
277 | |
b16a64af |
278 | B<NOTE> It is possible for a form POST multipart request (normally a file upload) to contain |
279 | inline content with mixed content character sets and encoding. For example one might create |
280 | a POST like this: |
281 | |
282 | use utf8; |
283 | use HTTP::Request::Common; |
284 | use Encode; # provides Encode::encode |
285 | my $utf8 = 'test ♥'; |
286 | my $shiftjs = 'test テスト'; |
287 | my $req = POST '/root/echo_arg', |
288 | Content_Type => 'form-data', |
289 | Content => [ |
290 | arg0 => 'helloworld', |
291 | Encode::encode('UTF-8','♥') => Encode::encode('UTF-8','♥♥'), |
292 | arg1 => [ |
293 | undef, '', |
294 | 'Content-Type' =>'text/plain; charset=UTF-8', |
295 | 'Content' => Encode::encode('UTF-8', $utf8)], |
296 | arg2 => [ |
297 | undef, '', |
298 | 'Content-Type' =>'text/plain; charset=SHIFT_JIS', |
299 | 'Content' => Encode::encode('SHIFT_JIS', $shiftjs)], |
300 | arg2 => [ |
301 | undef, '', |
302 | 'Content-Type' =>'text/plain; charset=SHIFT_JIS', |
303 | 'Content' => Encode::encode('SHIFT_JIS', $shiftjs)], |
304 | ]; |
305 | |
306 | In this case we've created a POST request but each part specifies its own content |
307 | character set (and setting a content encoding would also be possible). Generally one |
308 | would not run into this situation in a web browser context but for completeness sake |
309 | Catalyst will notice if a multipart POST contains parts with complex or extended |
310 | header information and in those cases it will not attempt to apply decoding to the |
311 | form values. Instead the part will be represented as an instance of an object |
312 | L<Catalyst::Request::PartData> which will contain all the header information needed |
313 | for you to perform custom parsing of the data. |
314 | |
a09b49d2 |
315 | =head1 UTF8 Encoding in Body Response |
316 | |
317 | When does L<Catalyst> encode your response body, and what rules does it use to |
318 | determine when that is needed? |
319 | |
320 | =head2 Summary |
321 | |
322 | use utf8; |
323 | use warnings; |
324 | use strict; |
325 | |
326 | package MyApp::Controller::Root; |
327 | |
328 | use base 'Catalyst::Controller'; |
329 | use File::Spec; |
330 | |
331 | sub scalar_body :Local { |
332 | my ($self, $c) = @_; |
333 | $c->response->content_type('text/html'); |
334 | $c->response->body("<p>This is scalar_body action ♥</p>"); |
335 | } |
336 | |
337 | sub stream_write :Local { |
338 | my ($self, $c) = @_; |
339 | $c->response->content_type('text/html'); |
340 | $c->response->write("<p>This is stream_write action ♥</p>"); |
b596572b |
341 | } |
a09b49d2 |
342 | |
343 | sub stream_write_fh :Local { |
344 | my ($self, $c) = @_; |
345 | $c->response->content_type('text/html'); |
346 | |
347 | my $writer = $c->res->write_fh; |
348 | $writer->write_encoded('<p>This is stream_write_fh action ♥</p>'); |
349 | $writer->close; |
350 | } |
351 | |
352 | sub stream_body_fh :Local { |
353 | my ($self, $c) = @_; |
354 | my $path = File::Spec->catfile('t', 'utf8.txt'); |
355 | open(my $fh, '<', $path) || die "trouble: $!"; |
356 | $c->response->content_type('text/html'); |
357 | $c->response->body($fh); |
358 | } |
359 | |
360 | =head2 Discussion |
361 | |
362 | Beginning with L<Catalyst> version 5.90080 you no longer need to set the encoding |
363 | configuration (although doing so won't hurt anything). |
364 | |
365 | Currently we only encode if the content type is one of the types which generally expects a |
366 | UTF8 encoding. This is determined by the following regular expression: |
367 | |
368 | our $DEFAULT_ENCODE_CONTENT_TYPE_MATCH = qr{text|xml$|javascript$}; |
369 | $c->response->content_type =~ /$DEFAULT_ENCODE_CONTENT_TYPE_MATCH/ |
370 | |
371 | This is a global variable in L<Catalyst::Response> which is stored in the C<encodable_content_type> |
372 | attribute of $c->response. You may currently alter this directly on the response or globally. In |
373 | the future we may offer a configuration setting for this. |
374 | |
375 | This would match content-types like the following (examples) |
376 | |
377 | text/plain |
378 | text/html |
379 | text/xml |
380 | application/javascript |
381 | application/xml |
382 | application/vnd.user+xml |
383 | |
b596572b |
384 | You should set your content type prior to header finalization if you want L<Catalyst> to |
a09b49d2 |
385 | encode. |
386 | |
387 | B<NOTE> We do not attempt to encode C<application/json> since the two most commonly used |
388 | approaches (L<Catalyst::View::JSON> and L<Catalyst::Action::REST>) have already configured |
389 | their JSON encoders to produce properly encoded UTF8 responses. If you are rolling your |
390 | own JSON encoding, you may need to set the encoder to do the right thing (or override |
391 | the global regular expression to include the JSON media type). |
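To see which media types the regular expression quoted above selects, it can be run against a few sample values. This is a small standalone sketch; the list of types and the C<%is_encodable> hash are illustrative only and are not part of L<Catalyst::Response>.

```perl
use strict;
use warnings;

# Same pattern as quoted above for $DEFAULT_ENCODE_CONTENT_TYPE_MATCH.
my $encodable = qr{text|xml$|javascript$};

my @types = qw(
    text/plain text/html application/xml application/vnd.user+xml
    application/javascript application/json image/png
);

# Record, for each sample type, whether the pattern matches it.
my %is_encodable = map { $_ => ($_ =~ $encodable ? 1 : 0) } @types;

printf "%-28s %s\n", $_, ($is_encodable{$_} ? 'encoded' : 'left alone')
    for @types;
```

Only C<application/json> and C<image/png> come out as 'left alone', which is consistent with the note about JSON above.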
392 | |
393 | =head2 Encoding with Scalar Body |
394 | |
395 | L<Catalyst> supports several methods of supplying your response with body content. The first |
396 | and currently most common is to set the L<Catalyst::Response> ->body with a scalar string |
397 | (as in the example): |
398 | |
399 | use utf8; |
400 | |
401 | sub scalar_body :Local { |
402 | my ($self, $c) = @_; |
403 | $c->response->content_type('text/html'); |
404 | $c->response->body("<p>This is scalar_body action ♥</p>"); |
405 | } |
406 | |
407 | In general you should need to do nothing else since L<Catalyst> will automatically encode |
408 | this string during body finalization. The only matter to watch out for is to make sure |
409 | the string has not already been encoded, as this will result in double encoding errors. |
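What double encoding looks like can be shown with the core Encode module. This is a sketch of the failure mode only, with illustrative variable names; it does not reproduce how L<Catalyst> finalizes the body.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

my $text = "♥";

my $encoded_once  = encode('UTF-8', $text);          # the correct wire form
my $encoded_twice = encode('UTF-8', $encoded_once);  # each byte re-encoded as a character

my $once_length  = length($encoded_once);
my $twice_length = length($encoded_twice);

# A client decodes only once, so it does not get the original text back.
my $round_trip_ok = decode('UTF-8', $encoded_twice) eq $text ? 1 : 0;

print "once: $once_length bytes, twice: $twice_length bytes, ",
      "round trip ok: $round_trip_ok\n";
# prints "once: 3 bytes, twice: 6 bytes, round trip ok: 0"
```

This is why the body you hand to C<< $c->response->body >> should be a decoded character string when automatic encoding is active.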
410 | |
411 | B<NOTE> Pay attention to the content-type setting in the example. L<Catalyst> inspects that |
412 | content type carefully to determine if the body needs encoding. |
413 | |
414 | B<NOTE> If you set the character set of the response L<Catalyst> will skip encoding IF the |
473078ff |
415 | character set is set to something that doesn't match $c->encoding->mime_name. We will assume |
a09b49d2 |
416 | if you are setting an alternative character set, that means you want to handle the encoding |
417 | yourself. However it might be easier to set $c->encoding for a given response cycle since |
418 | you can override this for a given response. For example here's how to override the default |
419 | encoding and set the correct character set in the response: |
420 | |
421 | sub override_encoding :Local { |
422 | my ($self, $c) = @_; |
423 | $c->res->content_type('text/plain'); |
424 | $c->encoding(Encode::find_encoding('Shift_JIS')); |
425 | $c->response->body("テスト"); |
426 | } |
427 | |
428 | This will use the alternative encoding for a single response. |
429 | |
430 | B<NOTE> If you manually set the content-type character set to whatever $c->encoding->mime_name |
431 | is set to, we STILL encode, rather than assume your manual setting is a flag to override. This |
aca337aa |
432 | is done to support backward compatible assumptions (in particular L<Catalyst::View::TT> has set |
433 | a utf-8 character set in its default content-type for ages, even though it does not itself do any |
434 | encoding on the body response). If you are going to handle encoding manually you may set |
435 | $c->clear_encoding for a single request response cycle, or as in the above example set an alternative |
436 | encoding. |
a09b49d2 |
437 | |
438 | =head2 Encoding with streaming type responses |
439 | |
440 | L<Catalyst> offers two approaches to streaming your body response. Again, you must remember |
441 | to set your content type prior to streaming, since invoking a streaming response will automatically |
442 | finalize and send your HTTP headers (and your content type MUST be one that matches the regular |
443 | expression given above.) |
444 | |
445 | Also, if you are going to override $c->encoding (or invoke $c->clear_encoding), you should do |
446 | that before anything else! |
447 | |
448 | The first streaming method is to use the C<write> method on the response object. This method |
449 | allows 'inlined' streaming and is generally used with blocking style servers. |
450 | |
451 | sub stream_write :Local { |
452 | my ($self, $c) = @_; |
453 | $c->response->content_type('text/html'); |
454 | $c->response->write("<p>This is stream_write action ♥</p>"); |
455 | } |
456 | |
457 | You may call the C<write> method as often as you need to finish streaming all your content. |
458 | L<Catalyst> will encode each line in turn as long as the content-type meets the 'encodable types' |
459 | requirement and $c->encoding is set (which it is, as long as you did not change it). |
460 | |
461 | B<NOTE> If you try to change the encoding after you start the stream, this will invoke an error |
473078ff |
462 | response. However since you've already started streaming this will not show up as an HTTP error |
a09b49d2 |
463 | status code, but rather error information in your body response and an error in your logs. |
464 | |
465 | The second way to stream a response is to get the response writer object and invoke methods |
466 | on that directly: |
467 | |
468 | sub stream_write_fh :Local { |
469 | my ($self, $c) = @_; |
470 | $c->response->content_type('text/html'); |
471 | |
472 | my $writer = $c->res->write_fh; |
473 | $writer->write_encoded('<p>This is stream_write_fh action ♥</p>'); |
474 | $writer->close; |
475 | } |
476 | |
473078ff |
477 | This can be used just like the C<write> method, but typically you request this object when |
a09b49d2 |
478 | you want to do a nonblocking style response since the writer object can be closed over or |
479 | sent to a model that will invoke it in a non blocking manner. For more on using the writer |
480 | object for non blocking responses you should review the C<Catalyst> documentation and also |
481 | you can look at several articles from last year's advent calendar, in particular: |
482 | |
483 | L<http://www.catalystframework.org/calendar/2013/10>, L<http://www.catalystframework.org/calendar/2013/11>, |
484 | L<http://www.catalystframework.org/calendar/2013/12>, L<http://www.catalystframework.org/calendar/2013/13>, |
485 | L<http://www.catalystframework.org/calendar/2013/14>. |
486 | |
487 | The main difference this year is that previously calling ->write_fh would return the actual |
488 | L<Plack> writer object that was supplied by your plack application handler, whereas now we wrap |
489 | that object in a lightweight decorator object that proxies the C<write> and C<close> methods |
490 | and supplies an additional C<write_encoded> method. C<write_encoded> does the exact same thing |
491 | as C<write> except that it will first encode the string when necessary. In general if you are |
492 | streaming encodable content such as HTML this is the method to use. If you are streaming |
493 | binary content, you should just use the C<write> method (although if the content type is set |
494 | correctly we would skip encoding anyway, but you may as well avoid the extra noop overhead). |
495 | |
496 | The last style of content response that L<Catalyst> supports is setting the body to a filehandle |
497 | like object. In this case the object is passed down to the Plack application handler directly |
498 | and currently we do nothing to set encoding. |
499 | |
500 | sub stream_body_fh :Local { |
501 | my ($self, $c) = @_; |
502 | my $path = File::Spec->catfile('t', 'utf8.txt'); |
503 | open(my $fh, '<', $path) || die "trouble: $!"; |
504 | $c->response->content_type('text/html'); |
505 | $c->response->body($fh); |
506 | } |
507 | |
508 | In this example we create a filehandle to a text file that contains UTF8 encoded characters. We |
509 | pass this down without modification, which I think is correct since we don't want to double |
510 | encode. However this may change in a future development release so please be sure to double |
511 | check the current docs and changelog. It's possible a future release will require you to set |
512 | an encoding on the IO layer level so that we can be sure to properly encode at body finalization. |
513 | So this is still an edge case we are writing test examples for. But for now if you are returning |
514 | a filehandle like response, you are expected to make sure you are following the L<PSGI> specification |
473078ff |
515 | and return raw bytes. |
a09b49d2 |
516 | |
517 | =head2 Override the Encoding on Context |
518 | |
519 | As already noted you may change the current encoding (or remove it) by setting an alternative |
520 | encoding on the context: |
521 | |
522 | $c->encoding(Encode::find_encoding('Shift_JIS')); |
523 | |
524 | Please note that you can continue to change encoding UNTIL the headers have been finalized. The |
525 | last setting always wins. Trying to change encoding after header finalization is an error. |
526 | |
527 | =head2 Setting the Content Encoding HTTP Header |
528 | |
529 | In some cases you may set a content encoding on your response, for example if you are encoding |
530 | your response with gzip. In this case you are again on your own. If we notice that the |
531 | content encoding header is set when we hit finalization, we skip automatic encoding: |
532 | |
533 | use Encode; |
534 | use Compress::Zlib; |
535 | use utf8; |
536 | |
537 | sub gzipped :Local { |
538 | my ($self, $c) = @_; |
539 | |
540 | $c->res->content_type('text/plain'); |
541 | $c->res->content_type_charset('UTF-8'); |
542 | $c->res->content_encoding('gzip'); |
543 | |
544 | $c->response->body( |
545 | Compress::Zlib::memGzip( |
546 | Encode::encode_utf8("manual_1 ♥"))); |
547 | } |
548 | |
549 | |
550 | If you are using L<Catalyst::Plugin::Compress> you need to upgrade to the most recent version |
551 | in order to be compatible with changes introduced in L<Catalyst> 5.90080. Other plugins may |
552 | require updates (please open bugs if you find them). |
553 | |
554 | B<NOTE> Content encoding may be set to 'identity' and we will still perform automatic encoding |
555 | if the content type is encodable and an encoding is present for the context. |
556 | |
557 | =head2 Using Common Views |
558 | |
559 | The following common views have been updated so that their tests pass with default UTF8 |
560 | encoding for L<Catalyst>: |
561 | |
562 | L<Catalyst::View::TT>, L<Catalyst::View::Mason>, L<Catalyst::View::HTML::Mason>, |
563 | L<Catalyst::View::Xslate> |
564 | |
565 | See L<Catalyst::Upgrading> for additional information on L<Catalyst> extensions that require |
566 | upgrades. |
567 | |
568 | In general for the common views you should not need to do anything special. If your actual |
569 | template files contain UTF8 literals you should set configuration on your View to enable that. |
570 | For example in TT, if your template has actual UTF8 characters in it you should do the following: |
571 | |
572 | MyApp::View::TT->config(ENCODING => 'utf-8'); |
573 | |
574 | However L<Catalyst::View::Xslate> wants to do the UTF8 encoding for you (we assume that the |
575 | authors of that view did this as a workaround to the fact that until now encoding was not core |
576 | to L<Catalyst>). So if you use that view, you either need to tell it to not encode, or you need |
577 | to turn off encoding for Catalyst. |
578 | |
579 | MyApp::View::Xslate->config(encode_body => 0); |
580 | |
581 | or |
582 | |
583 | MyApp->config(encoding=>undef); |
584 | |
585 | Preference is to disable it in the View. |
586 | |
587 | Other views may be similar. You should review View documentation and test during upgrading. |
588 | We tried to make sure most common views worked properly and noted all workarounds but if we |
589 | missed something please alert the development team (instead of introducing a local hack into |
590 | your application that will mean nobody will ever upgrade it...). |
591 | |
aca337aa |
592 | =head2 Setting the response from an external PSGI application. |
593 | |
594 | L<Catalyst::Response> allows one to set the response from an external L<PSGI> application. |
595 | If you do this, and that external application sets a character set on the content-type, we |
596 | C<clear_encoding> for the rest of the response. This is done to prevent double encoding. |
597 | |
598 | B<NOTE> Even if the character set of the content type is the same as the encoding set in |
599 | $c->encoding, we still skip encoding. This is a regrettable difference from the general rule |
600 | outlined above, where if the current character set is the same as the current encoding, we |
601 | encode anyway. Nevertheless I think this is the correct behavior since the earlier rule exists |
602 | only to support backward compatibility with L<Catalyst::View::TT>. |
603 | |
604 | In general if you want L<Catalyst> to handle encoding, you should avoid setting the content |
605 | type character set since Catalyst will do so automatically based on the requested response |
606 | encoding. It's best to request alternative encodings by setting $c->encoding and if you really |
607 | want manual control of encoding you should always $c->clear_encoding so that programmers that |
608 | come after you are very clear as to your intentions. |
609 | |
a09b49d2 |
610 | =head2 Disabling default UTF8 encoding |
611 | |
612 | You may encounter issues with your legacy code running under default UTF8 body encoding. If |
613 | so you can disable this with the following configuration setting: |
614 | |
615 | MyApp->config(encoding=>undef); |
616 | |
617 | Where C<MyApp> is your L<Catalyst> subclass. |
618 | |
b16a64af |
619 | If you do not wish to disable all the Catalyst encoding features, you may disable specific |
620 | features via two additional configuration options: 'skip_body_param_unicode_decoding' |
621 | and 'skip_complex_post_part_handling'. The first will skip any attempt to decode POST |
622 | parameters when creating body parameters and the second will skip creation of instances |
623 | of L<Catalyst::Request::PartData> in the case that the multipart form upload contains parts |
624 | with a mix of content character sets. |
625 | |
a09b49d2 |
626 | If you believe you have discovered a bug in UTF8 body encoding, I strongly encourage you to |
627 | report it (and not try to hack a workaround in your local code). We also recommend that you |
628 | regard such a workaround as a temporary solution. It is ideal if L<Catalyst> extension |
b16a64af |
629 | authors can start to count on L<Catalyst> doing the right thing for encoding. |
a09b49d2 |
630 | |
631 | =head1 Conclusion |
632 | |
633 | This document has attempted to be a complete review of how UTF8 and encoding works in the |
634 | current version of L<Catalyst> and also to document known issues, gotchas and backward |
635 | compatible hacks. Please report issues to the development team. |
636 | |
637 | =head1 Author |
638 | |
639 | John Napiorkowski L<jjnapiork@cpan.org|email:jjnapiork@cpan.org> |
640 | |
641 | =cut |
642 | |