Interfacing with older APIs that don’t support Unicode can be a pain. Too often I’ve seen errors from third parties saying:
Error: characters must be in the range 0 to 127.
Suppose that we’re writing an API for some site that internally will be interacting with these types of third parties. This API will allow people to create users and sign up. We want to allow people to sign up with their real names with accents and all (such as José), but still only give third parties ASCII characters.
The best way to do this is to store the full Unicode version of the name in the database, since we obviously don’t want José to look up his name and discover that we’re storing it as Jose. However, we still need to do some sort of translation from José -> Jose.
To make this transformation, we need to understand a little about how Unicode works. The character é can be stored in two different forms: composed form and decomposed form. In composed form, “é” is stored as a single Unicode code point (character). In decomposed form, “é” is stored as two separate code points, the letter e followed by a combining accent ´, which together render as “é”.
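To see the two forms side by side, here’s a quick sketch using Python’s standard unicodedata module (the character names are pulled straight from the Unicode database):

import unicodedata

composed = "\u00e9"  # “é” as a single code point
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 1 2
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']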
So to convert é to e, we first want to make sure that we always use the decomposed form. In whatever language you’re using, this often comes down to choosing the function with “NFD” (Normalization Form D, where the D stands for decomposition) in its name. Here it is in Python:
import unicodedata

def map_unicode_to_ascii(value):
    # Split composed characters into base character + combining marks
    decomposed_value = unicodedata.normalize("NFD", value)
Now that the accents have been pulled out into their own code points, we can encode the string to ASCII and ignore any non-ASCII characters. Here’s the final Python function:
import unicodedata

def map_unicode_to_ascii(value):
    # Split composed characters into base character + combining marks
    decomposed_value = unicodedata.normalize("NFD", value)
    # Dropping everything outside ASCII discards the combining marks.
    # encode() returns bytes in Python 3, so decode back to a str.
    return decomposed_value.encode("ascii", "ignore").decode("ascii")
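A quick sanity check with the function above:

print(map_unicode_to_ascii("José"))  # Jose
print(map_unicode_to_ascii("成龍"))  # prints an empty string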
Now, this means that if a user provides a name made up entirely of non-Latin characters, we will pass an empty string on to the third party, which would normally cause various sorts of errors. The best fix is to prevent people from signing up with a name where any character is not mappable. Unfortunately, this does mean that people cannot sign up with, say, their Chinese names, but as far as I know there’s no consistent mapping from names in Chinese characters to Latin characters (English spellings of names can differ between Mandarin and Cantonese).
So to validate, we want to normalize the name to the composed form (NFC this time, so that each accented character counts as a single code point) and then check whether its length is the same as the length of the result of map_unicode_to_ascii.
import unicodedata

def can_map_unicode_to_ascii(value):
    # NFC gives one code point per visible character, so the lengths
    # match exactly when every character survives the ASCII mapping
    composed_value = unicodedata.normalize("NFC", value)
    mapped_value = map_unicode_to_ascii(value)
    return len(composed_value) == len(mapped_value)
This function would return True if “José” is provided, but False if “成龍” is provided.
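In the signup path, the check might look something like this (a sketch building on can_map_unicode_to_ascii above; the function name and error message are hypothetical, not part of any real framework):

def validate_name(name):
    # Hypothetical signup validation built on the function above
    if not can_map_unicode_to_ascii(name):
        raise ValueError("Name contains characters that can't be passed to downstream APIs.")
    return name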