By default, if you submit a web form containing certain non-ASCII (Unicode) characters, the browser sends ambiguous encodings of these characters, such that completely different text appears indistinguishable to the server.
For instance, if you enter the Chinese character for “one” into some text box in a web form:

一

the raw characters that the server receives off the socket are:

%26%2319968%3B

which is the URL encoding of the NCR of that Unicode character:

&#19968;

The NCR is just an HTML entity that refers to the numeric ID of a Unicode character; this character’s numeric ID is 19968.
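To see the correspondence concretely, here is a quick sketch using Python’s standard-library html module (the variable names are just illustrative) that checks the character, its numeric ID, and its NCR against each other:

```python
import html

ch = "\u4e00"              # the Chinese character for "one"
assert ord(ch) == 19968    # its decimal Unicode code point

ncr = "&#%d;" % ord(ch)    # the numeric character reference (NCR)
assert ncr == "&#19968;"

# html.unescape resolves the HTML entity back to the character.
assert html.unescape(ncr) == ch
```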
Now if you instead wrote that NCR literally into the same text box:

&#19968;

the raw characters that the server receives off the socket are again:

%26%2319968%3B

Notice that it is impossible for the server to determine whether the user typed in a Unicode character or the NCR thereof. Ideally, we want the server to see the direct URL encoding of the Unicode character (not its NCR):

%E4%B8%80
Note: we always need URL encoding because that’s how HTML form data is transmitted. HTTP is a (mostly) ASCII protocol, and form data also needs to be delimited (key=value&key=value&...), which means that at least the & characters must be URL encoded.
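Both wire encodings can be reproduced with urllib.parse.quote, which makes the ambiguity concrete: URL-encoding the NCR string produces exactly the bytes a user would cause by typing the NCR literally.

```python
from urllib.parse import quote

# What we'd like the browser to send: the URL encoding of the
# character's UTF-8 bytes (quote() encodes as UTF-8 by default).
assert quote("\u4e00") == "%E4%B8%80"

# What an ASCII-charset browser actually sends: the URL encoding
# of the NCR string "&#19968;" -- indistinguishable from a user
# literally typing "&#19968;" into the text box.
assert quote("&#19968;") == "%26%2319968%3B"
```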
Why does the browser send the URL encoding of the NCR instead of the URL encoding of the original character? Because the browser is using an unsuitable character set. By default, the browser uses an ASCII charset, meaning it assumes the server only handles ASCII characters. When the user enters a non-ASCII character, the browser can’t transmit it as-is. Instead of failing, however, it silently makes a best effort to capture the character in ASCII by sending its NCR.
This implicit and ambiguous encoding behavior is AFAICT undefined, but all browsers I’ve tried seem to do this. More relevant information can be found in the HTML 4 specs on forms.
One way to specify a suitable charset (e.g., UTF-8) that the browser should use for submitting data is to add accept-charset="utf-8" to the form element. According to the specs:
The content type “multipart/form-data” should be used for submitting forms that contain files, non-ASCII data, and binary data.
This means that your form data doesn’t need to get URL encoded when it is submitted as multipart/form-data. Leaving it out, so that the URL encoding of the UTF-8 encoding is sent over the wire, is also fine.
Another way is to leave the form element alone and to specify the charset either in the HTTP response header:
Content-type: text/html; charset=utf-8
or in the HTML page head:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
These specify the page encoding, and the form submission encoding should default to the page encoding:
The default value for this attribute is the reserved string “UNKNOWN”. User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.
Specifying a Unicode encoding for the entire page also allows you to display Unicode characters, as opposed to just accepting them in form input. More details on specifying the page encoding are in the HTML specs on charsets.
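As a minimal sketch of the second approach, here is a toy server built on Python’s stdlib http.server that declares charset=utf-8 in both the HTTP header and the meta tag (the page content, handler name, and port are illustrative assumptions, not from the original setup):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = """<!DOCTYPE html>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<form method="post" action="/"><input name="q"></form>
</body></html>"""

class FormPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        # Declare the charset in the HTTP response header as well;
        # form submissions then default to UTF-8.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("localhost", 8000), FormPageHandler).serve_forever()
```

Either declaration alone suffices; using both simply keeps them from disagreeing.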
Notice that the charset specifications are actually specifying a particular encoding for that charset (e.g., UTF-8 for the Unicode character set). In Python, this data can be decoded using bytes.decode('utf-8') (str.decode('utf-8') also works in Python 2), which returns a unicode string (a str in Python 3).
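Putting the pieces together on the server side (in Python 3), the percent-encoded form body can be split and decoded with urllib.parse.parse_qs, which percent-decodes each value using the given encoding:

```python
from urllib.parse import parse_qs

# Raw body of a UTF-8 form submission: "q=" followed by the
# percent-encoded UTF-8 bytes of the character.
raw = "q=%E4%B8%80"

# parse_qs splits key=value pairs and percent-decodes as UTF-8.
assert parse_qs(raw, encoding="utf-8") == {"q": ["\u4e00"]}

# Equivalently, decode the three UTF-8 bytes by hand:
assert bytes([0xE4, 0xB8, 0x80]).decode("utf-8") == "\u4e00"
```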
The moral of the story: to make sure your application is prepared to work with Unicode properly, you will always want to specify a Unicode encoding like UTF-8, either in your form’s accept-charset attribute or in your page’s Content-Type charset.
I was able to find a good discussion of this topic in this Bugzilla ticket.