ext/Encode/Todo

Use Markus Kuhn's UTF-8 Decode Stress Tester at

        http://www.cl.cam.ac.uk/~mgk25/ucs/examples/

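A rough sketch of what such a stress run could look like, written in
Python purely for illustration (it is not Encode code). The sample list
is hand-picked and is an assumption, not a copy of the test file; the
names MALFORMED_SAMPLES, strict_decode and accepted_by are hypothetical.
U+FFFE and U+FFFF are left out because Python's own codec accepts them.

```python
# Hand-picked malformed inputs (assumption: chosen for illustration,
# not copied from UTF-8-test.txt).
MALFORMED_SAMPLES = [
    b"\x80",            # lone continuation byte
    b"\xe2\x82",        # truncated three-byte sequence
    b"\xc0\xaf",        # overlong two-byte encoding of "/"
    b"\xe0\x80\xaf",    # overlong three-byte encoding of "/"
    b"\xed\xa0\x80",    # UTF-16 surrogate U+D800
    b"\xfe",            # byte that can never appear in UTF-8
    b"\xff",            # likewise
]


def strict_decode(buf: bytes) -> str:
    """Decoder under test; here simply Python's strict UTF-8 codec."""
    return buf.decode("utf-8", errors="strict")


def accepted_by(decode, samples):
    """Return the samples that the decoder failed to reject."""
    leaked = []
    for sample in samples:
        try:
            decode(sample)
        except ValueError:  # UnicodeDecodeError is a ValueError subclass
            continue
        leaked.append(sample)
    return leaked
```

An empty result from accepted_by(strict_decode, MALFORMED_SAMPLES) means
every sample was rejected; any entry returned is a sequence the decoder
let through.
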
Markus:
>
> What exactly is malformed UTF-8 data here?
>
> Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.
>
> Does it also cover overlong UTF-8 sequences, i.e. any string
> containing any of the five bit sequences
>
>   1100000x,
>   11100000 100xxxxx,
>   11110000 1000xxxx,
>   11111000 10000xxx,
>   11111100 100000xx
>
> Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
> surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
> not occur in proper UTF-8 and UTF-32 data according to the standard
> (see note 3 in section R.4 of UCS)?
>
> It might be useful, if the spec were clearer here.
>
> References:
>
>   - ISO/IEC 10646-1:1993(E), Amd. 2,
>     http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
>
>   - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>

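The five bit patterns and the code-position ranges quoted above can be
turned into checks fairly directly. The following is a minimal Python
sketch, not Encode's implementation; the function names are hypothetical
and it only answers "does this pattern occur", not "is the rest of the
string well-formed".

```python
def overlong_at(buf: bytes, i: int) -> bool:
    """True if one of the five overlong patterns starts at buf[i]."""
    b0 = buf[i]
    if (b0 & 0xFE) == 0xC0:                 # 1100000x
        return True
    if i + 1 >= len(buf):
        return False                        # no second byte to inspect
    b1 = buf[i + 1]
    return (
        (b0 == 0xE0 and (b1 & 0xE0) == 0x80)    # 11100000 100xxxxx
        or (b0 == 0xF0 and (b1 & 0xF0) == 0x80) # 11110000 1000xxxx
        or (b0 == 0xF8 and (b1 & 0xF8) == 0x80) # 11111000 10000xxx
        or (b0 == 0xFC and (b1 & 0xFC) == 0x80) # 11111100 100000xx
    )


def has_overlong(buf: bytes) -> bool:
    """True if any overlong pattern occurs anywhere in buf."""
    return any(overlong_at(buf, i) for i in range(len(buf)))


def is_forbidden_code_point(cp: int) -> bool:
    """Surrogates U+D800..U+DFFF, plus U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)
```

Scanning every byte offset is safe here because all five patterns begin
with a lead byte, so a continuation byte can never start a match.
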
Markus:
>
> It is commonly considered to be good practice to reject at least
> overlong UTF-8 sequences, otherwise one permits multiple encodings for
> characters, which makes pattern matching far more difficult in
> applications where strings are processed in both coded and decoded form.
> It has been argued that this could easily lead to security
> vulnerabilities. See
>
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>   http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt   (section 4)
>   ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt          (section 6)
>
> for a brief discussion.
>
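
To make the "multiple encodings" concern concrete, here is a small
Python sketch of a deliberately permissive decoder. It is an
illustration of the failure mode only; lax_decode is a hypothetical name
and the code is not how Encode or any real decoder is written.

```python
def lax_decode(buf: bytes) -> str:
    """Permissive decoder that does NOT reject overlong forms (unsafe)."""
    out = []
    i = 0
    while i < len(buf):
        b0 = buf[i]
        if b0 < 0x80:
            cp, n = b0, 0                   # single ASCII byte
        elif (b0 & 0xE0) == 0xC0:
            cp, n = b0 & 0x1F, 1            # two-byte sequence
        elif (b0 & 0xF0) == 0xE0:
            cp, n = b0 & 0x0F, 2            # three-byte sequence
        elif (b0 & 0xF8) == 0xF0:
            cp, n = b0 & 0x07, 3            # four-byte sequence
        else:
            raise ValueError("bad lead byte at offset %d" % i)
        tail = buf[i + 1 : i + 1 + n]
        if len(tail) != n or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError("bad continuation at offset %d" % i)
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)     # append six payload bits
        out.append(chr(cp))                 # no minimal-length check
        i += 1 + n
    return "".join(out)


# A byte-level filter looking for "/" sees nothing suspicious here,
# yet the permissive decoder produces a path with slashes in it.
raw = b"..\xc0\xaf..\xc0\xafetc"
slash_in_bytes = b"/" in raw
slash_in_text = "/" in lax_decode(raw)
```

With this input slash_in_bytes is False while slash_in_text is True,
which is the coded-versus-decoded mismatch the quoted message warns
about. A strict decoder such as Python's raw.decode("utf-8") refuses
the same bytes.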