Commit | Line | Data |
22d4bb9c |
1 | Use Markus Kuhn's UTF-8 Decode Stress Tester at |
2 | |
3 | http://www.cl.cam.ac.uk/~mgk25/ucs/examples/ |
4 | |
5 | Markus: |
6 | > |
7 | > What exactly is malformed UTF-8 data here? |
8 | > |
9 | > Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2. |
10 | > |
11 | > Does it also cover overlong UTF-8 sequences, i.e. any string |
12 | > containing any of the following five bit patterns
13 | > |
14 | > 1100000x, |
15 | > 11100000 100xxxxx, |
16 | > 11110000 1000xxxx, |
17 | > 11111000 10000xxx, |
18 | > 11111100 100000xx |
19 | > |
20 | > Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16 |
21 | > surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must |
22 | > not occur in proper UTF-8 and UTF-32 data according to the standard |
23 | > (see note 3 in section R.4 of UCS)? |
24 | > |
25 | > It might be useful if the spec were clearer here.
26 | > |
27 | > References: |
28 | > |
29 | > - ISO/IEC 10646-1:1993(E), Amd. 2, |
30 | > http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html |
31 | > |
32 | > - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 |
33 | > |
34 | |
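The five bit patterns and the code positions asked about above can be checked mechanically. The following is a minimal Python sketch, not part of the stress tester or of any specification. The names OVERLONG_PATTERNS, contains_overlong, and is_questioned_code_position are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: all names here are hypothetical.
# Each entry is (lead mask, lead value, continuation mask, continuation value).
OVERLONG_PATTERNS = [
    (0xFE, 0xC0, None, None),  # 1100000x
    (0xFF, 0xE0, 0xE0, 0x80),  # 11100000 100xxxxx
    (0xFF, 0xF0, 0xF0, 0x80),  # 11110000 1000xxxx
    (0xFF, 0xF8, 0xF8, 0x80),  # 11111000 10000xxx
    (0xFF, 0xFC, 0xFC, 0x80),  # 11111100 100000xx
]


def contains_overlong(data: bytes) -> bool:
    """Return True if any position in data matches one of the five patterns."""
    for i in range(len(data)):
        for lead_mask, lead_val, cont_mask, cont_val in OVERLONG_PATTERNS:
            if (data[i] & lead_mask) != lead_val:
                continue
            if cont_mask is None:
                # A lead byte of 0xC0 or 0xC1 is overlong on its own.
                return True
            if i + 1 < len(data) and (data[i + 1] & cont_mask) == cont_val:
                return True
    return False


def is_questioned_code_position(cp: int) -> bool:
    """True for U+D800..U+DFFF (surrogates), U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)
```

Under these assumptions, contains_overlong(b"\xc0\xaf") is true, while the shortest-form encoding of U+0800, b"\xe0\xa0\x80", is not flagged.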
35 | Markus: |
36 | > |
37 | > It is commonly considered good practice to reject at least
38 | > overlong UTF-8 sequences; otherwise one permits multiple encodings for
39 | > characters, which makes pattern matching far more difficult in
40 | > applications where strings are processed in both coded and decoded form.
41 | > It has been argued that this could easily lead to security |
42 | > vulnerabilities. See |
43 | > |
44 | > http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 |
45 | > http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (section 4) |
46 | > ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt (section 6) |
47 | > |
48 | > for a brief discussion. |
49 | > |
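The pattern-matching concern in the message above can be made concrete with a small Python sketch. It assumes a deliberately lax decoder, lenient_decode (a hypothetical function written for this illustration, limited to 1- and 2-byte forms), that skips the shortest-form check. A byte-level search for "/" then misses a slash that the lax decoder later produces, whereas Python's built-in strict decoder rejects the same input.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical lax decoder: 1- and 2-byte forms only, no overlong check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif (b & 0xE0) == 0xC0 and i + 1 < len(data) \
                and (data[i + 1] & 0xC0) == 0x80:
            # Combine 5 payload bits from the lead byte with 6 from the
            # continuation byte, without verifying the result is >= 0x80.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed byte at offset %d" % i)
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# 0xC0 0xAF is an overlong two-byte encoding of "/" (U+002F).
payload = b"..\xc0\xaf..\xc0\xafetc"
slash_in_bytes = b"/" in payload                  # check on the coded form
slash_in_text = "/" in lenient_decode(payload)    # check on the decoded form
```

Here the check on the coded form finds no slash while the check on the decoded form does, which is the kind of mismatch the message warns about. A decoder that rejects overlong sequences avoids it.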