Ambiguous names in Java due to non-normalised Unicode - but all OK in Python

August 11, 2016 [Java, Programming, Programming Languages, Python]

In Java and several other languages, identifiers (e.g. method names) are allowed to contain non-ASCII Unicode characters.

Unfortunately, some sequences of Unicode characters are canonically equivalent: they use different code points but represent the same text. For example, á written as one character (Latin Small Letter A with Acute, U+00E1) is equivalent to á written as two characters (Latin Small Letter A, U+0061, followed by the non-spacing acute accent, U+0301, whose current Unicode name is Combining Acute Accent). These two forms are not just similar - Unicode defines them as identical.
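To make the equivalence concrete, here is a small Python sketch using the standard unicodedata module. The helper name describe is my own, purely for illustration. It shows that the two spellings are different strings made of different code points, yet become equal once both are normalised to the same form (NFC here).

```python
import unicodedata

precomposed = "\u00e1"   # one code point: a with acute
decomposed = "a\u0301"   # two code points: a, then combining acute accent


def describe(text):
    """List each code point in text with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(precomposed))
print(describe(decomposed))

# Compared code point by code point, the strings differ...
print(precomposed == decomposed)  # False

# ...but after normalisation to NFC they are the same string.
print(unicodedata.normalize("NFC", precomposed)
      == unicodedata.normalize("NFC", decomposed))  # True
```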

Java does not normalise your code before compiling it, so two identifiers that are equivalent but spelled with different Unicode code point sequences are treated as different identifiers (ref: JLS 7 section 3.8).

$ cat U.java
public class U {
    static String \u00e1() { return "A WITH ACUTE"; }
    static String a\u0301() { return "A + NON-SPACING ACUTE"; }
    public static void main(String[] a) {
        System.out.println(á());
        System.out.println(á());
    }
}
$ javac U.java && java U
A WITH ACUTE
A + NON-SPACING ACUTE

We can define and call two methods whose names both display as á, and they are totally independent entities.

But don't do this.
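If you want to guard against this by accident rather than by discipline, one option is to scan source files for identifiers that collide after normalisation. Below is a rough sketch in Python, not a real Java lexer. The function names (decode_unicode_escapes, identifiers, find_ambiguous_identifiers) are hypothetical, the notion of "identifier character" is only an approximation of Java's rules, and comments and string literals are scanned along with everything else.

```python
import re
import unicodedata
from collections import defaultdict

# Java allows \uXXXX escapes (with one or more 'u's) anywhere in source code.
# Simplification: escaped backslashes (\\u...) are not handled specially.
UNICODE_ESCAPE = re.compile(r"\\u+([0-9a-fA-F]{4})")


def decode_unicode_escapes(source):
    """Replace \\uXXXX escapes with the characters they stand for."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


def is_identifier_char(ch):
    # Approximation: letters, digits, '_', '$' and combining marks (M*).
    return ch in "_$" or ch.isalnum() or unicodedata.category(ch).startswith("M")


def identifiers(source):
    """Yield maximal runs of identifier-like characters."""
    word = []
    for ch in source:
        if is_identifier_char(ch):
            word.append(ch)
        elif word:
            yield "".join(word)
            word = []
    if word:
        yield "".join(word)


def find_ambiguous_identifiers(source):
    """Map each NFC-normalised name to its spellings, if it has more than one."""
    spellings = defaultdict(set)
    for name in identifiers(decode_unicode_escapes(source)):
        spellings[unicodedata.normalize("NFC", name)].add(name)
    return {norm: names for norm, names in spellings.items() if len(names) > 1}


# Raw string, so the \u escapes reach the scanner exactly as Java would see them.
sample = r'static String \u00e1() { } static String a\u0301() { }'
for normalised, names in find_ambiguous_identifiers(sample).items():
    print(ascii(normalised), sorted(ascii(n) for n in names))
```

Run against the two method declarations from U.java, this reports a single normalised name with two distinct spellings. A production check would want a proper Java tokenizer instead of the character test used here.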

Python 3 also allows Unicode characters in identifiers, but it avoids the above problem by converting all identifiers to normal form NFKC while parsing (ref: Python 3 Reference, section 2.3):

$ cat U.py
#!/usr/bin/env python3

def á():
    print("A WITH ACUTE")

def á():
    print("A + NON-SPACING ACUTE")

á()
á()

$ hexdump -C U.py
23 21 2f 75 73 72 2f 62  69 6e 2f 65 6e 76 20 70  |#!/usr/bin/env p|
79 74 68 6f 6e 33 0a 0a  64 65 66 20 c3 a1 28 29  |ython3..def ..()|
3a 0a 20 20 20 20 70 72  69 6e 74 28 22 41 20 57  |:.    print("A W|
49 54 48 20 41 43 55 54  45 22 29 0a 0a 64 65 66  |ITH ACUTE")..def|
20 61 cc 81 28 29 3a 0a  20 20 20 20 70 72 69 6e  | a..():.    prin|
74 28 22 41 20 2b 20 4e  4f 4e 2d 53 50 41 43 49  |t("A + NON-SPACI|
4e 47 20 41 43 55 54 45  22 29 0a 0a c3 a1 28 29  |NG ACUTE")....()|
0a 61 cc 81 28 29 0a 0a                           |.a..()..|
$ ./U.py
A + NON-SPACING ACUTE
A + NON-SPACING ACUTE

The second definition overwrites the first because Python considers the two names identical. You can then call the function using either spelling of its name. Both ways of working are scary, but I'd definitely choose the Python 3 way if I had to.
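You can also watch the normalisation happen without relying on how your editor saves the file. This sketch builds the source text from explicit escapes and runs it with exec into a scratch dictionary (the variable names source, namespace and defined are mine). Only one name ends up defined, and it is stored in the precomposed form.

```python
# Build the program text with explicit escapes so both spellings are
# guaranteed to be present, whatever the editor or terminal does.
source = (
    "def \u00e1():\n"                       # precomposed spelling
    "    return 'A WITH ACUTE'\n"
    "def a\u0301():\n"                      # decomposed spelling
    "    return 'A + NON-SPACING ACUTE'\n"
)

namespace = {}
exec(source, namespace)

# Ignore entries such as __builtins__ that exec adds itself.
defined = [name for name in namespace if not name.startswith("__")]

print(len(defined))                  # 1
print(defined == ["\u00e1"])         # True: stored in normalised form
print(namespace["\u00e1"]())         # A + NON-SPACING ACUTE
```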