Use Markus Kuhn's UTF-8 Decode Stress Tester at
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/

Markus:
>
> What exactly is malformed UTF-8 data here?
>
> Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.
>
> Does it also cover overlong UTF-8 sequences, i.e. any string
> containing any of the five bit sequences
>
>   1100000x,
>   11100000 100xxxxx,
>   11110000 1000xxxx,
>   11111000 10000xxx,
>   11111100 100000xx
>
> Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
> surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
> not occur in proper UTF-8 and UTF-32 data according to the standard
> (see note 3 in section R.4 of UCS)?
>
> It might be useful, if the spec were clearer here.
>
> References:
>
>   - ISO/IEC 10646-1:1993(E), Amd. 2,
>     http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
>
>   - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>

Markus:
>
> It is commonly considered to be good practice to reject at least
> overlong UTF-8 sequences, otherwise one permits multiple encodings for
> characters, which makes pattern matching far more difficult in
> applications where strings are processed in both coded and decoded form.
> It has been argued that this could easily lead to security
> vulnerabilities. See
>
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>   http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (section 4)
>   ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt (section 6)
>
> for a brief discussion.
>
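
To make the byte patterns discussed above concrete, here is a rough Python sketch that scans a byte string for the five overlong bit sequences, for encoded surrogates (U+D800 to U+DFFF), and for U+FFFE and U+FFFF. The function names overlong_at and find_problems are made up for this illustration and do not come from any library or from the stress tester. It is a pattern scan only, not a full UTF-8 validator: it does not check for truncated sequences or stray continuation bytes.

```python
# Sketch only: overlong_at and find_problems are hypothetical names.

def overlong_at(data, i):
    """Return True if data[i:] begins with one of the five overlong patterns."""
    b0 = data[i]
    if b0 & 0xFE == 0xC0:  # 1100000x (lead bytes 0xC0 and 0xC1)
        return True
    if i + 1 >= len(data):
        return False
    b1 = data[i + 1]
    # For each longer form: (lead byte, mask applied to the second byte).
    # The sequence is overlong when the masked second byte equals 0x80.
    for lead, mask in ((0xE0, 0xE0),   # 11100000 100xxxxx
                       (0xF0, 0xF0),   # 11110000 1000xxxx
                       (0xF8, 0xF8),   # 11111000 10000xxx
                       (0xFC, 0xFC)):  # 11111100 100000xx
        if b0 == lead and b1 & mask == 0x80:
            return True
    return False


def find_problems(data):
    """Return (offset, description) pairs for each suspicious sequence."""
    problems = []
    for i in range(len(data)):
        if overlong_at(data, i):
            problems.append((i, "overlong"))
        elif data[i] == 0xED and i + 1 < len(data) and data[i + 1] & 0xE0 == 0xA0:
            # ED A0..BF xx is the 3-byte encoding of U+D800..U+DFFF
            problems.append((i, "surrogate"))
        elif data[i:i + 3] in (b"\xef\xbf\xbe", b"\xef\xbf\xbf"):
            # EF BF BE is U+FFFE, EF BF BF is U+FFFF
            problems.append((i, "noncharacter"))
    return problems


if __name__ == "__main__":
    print(find_problems(b"\xc0\xaf"))               # overlong encoding of "/"
    print(find_problems("h\u00e9llo".encode("utf-8")))  # well-formed input
```

A decoder that wants to reject such input could run a check like this before decoding, or fold the same tests into its state machine. Whether U+FFFE and U+FFFF should be rejected is a policy choice that depends on which version of the standard one follows, so that branch is best treated as optional.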