Character Encodings

Andy Balaam
artificialworlds.net/blog

Contents

Concepts

Words - Octets

Words - (Octet String)

Words - Code Units

Words - Code Points

Words - Characters
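The four terms above can be told apart by counting each of them for the same string. This is a minimal sketch; the helper name `count_units` is made up for illustration, and "characters" in the user-perceived sense cannot be counted with the standard library alone, so it is left out.

```python
import unicodedata


def count_units(text: str) -> dict:
    """Count octets, UTF-16 code units and code points for one string."""
    return {
        # Octets: the bytes of one particular encoding (UTF-8 here).
        "octets_utf8": len(text.encode("utf-8")),
        # Code units: UTF-16 uses 16-bit units, so two octets each.
        "code_units_utf16": len(text.encode("utf-16-le")) // 2,
        # Code points: Python strings are sequences of code points.
        "code_points": len(text),
    }


# One visible emoji built from two code points.
sample = "\U0001F9D1\U0001F3FF"
print(count_units(sample))
for cp in sample:
    print(f"U+{ord(cp):04X}", unicodedata.name(cp, "<unnamed>"))
```

A reader sees one character here, but it takes 2 code points, 4 UTF-16 code units and 8 UTF-8 octets.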

Interesting Characters

Character: A  U+0041 LATIN CAPITAL LETTER A
Encoding     Hex          Binary
UTF-8:       41           01000001
UTF-16BE:    00 41        00000000 01000001
UTF-16LE:    41 00        01000001 00000000
UTF-32:      00 00 00 41  00000000 00000000 00000000 01000001
US-ASCII:    41           01000001
ISO-8859-1:  41           01000001
ISO-8859-7:  41           01000001
GB18030:     41           01000001
Shift_JIS:   41           01000001
GSM_0338:    41           01000001
EBCDIC 1047: C1           11000001

Interesting Characters

Character: é  U+00E9 LATIN SMALL LETTER E WITH ACUTE
Encoding     Hex          Binary
UTF-8:       C3 A9        11000011 10101001
UTF-16BE:    00 E9        00000000 11101001
UTF-16LE:    E9 00        11101001 00000000
UTF-32:      00 00 00 E9  00000000 00000000 00000000 11101001
US-ASCII:    -
ISO-8859-1:  E9           11101001
ISO-8859-7:  -
GB18030:     A8 A6        10101000 10100110
Shift_JIS:   -
GSM_0338:    05           00000101
EBCDIC 1047: 51           01010001

Interesting Characters

Character: Ω  U+03A9 GREEK CAPITAL LETTER OMEGA
Encoding     Hex          Binary
UTF-8:       CE A9        11001110 10101001
UTF-16BE:    03 A9        00000011 10101001
UTF-16LE:    A9 03        10101001 00000011
UTF-32:      00 00 03 A9  00000000 00000000 00000011 10101001
US-ASCII:    -
ISO-8859-1:  -
ISO-8859-7:  D9           11011001
GB18030:     A6 B8        10100110 10111000
Shift_JIS:   83 B6        10000011 10110110
GSM_0338:    15           00010101
EBCDIC 1047: -

Interesting Characters

Character: ⺟  U+2E9F CJK RADICAL MOTHER
Encoding     Hex          Binary
UTF-8:       E2 BA 9F     11100010 10111010 10011111
UTF-16BE:    2E 9F        00101110 10011111
UTF-16LE:    9F 2E        10011111 00101110
UTF-32:      00 00 2E 9F  00000000 00000000 00101110 10011111
US-ASCII:    -
ISO-8859-1:  -
ISO-8859-7:  -
GB18030:     81 39 82 33  10000001 00111001 10000010 00110011
Shift_JIS:   -
GSM_0338:    -
EBCDIC 1047: -

Interesting Characters

Character: ︘  U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
Encoding     Hex          Binary
UTF-8:       EF B8 98     11101111 10111000 10011000
UTF-16BE:    FE 18        11111110 00011000
UTF-16LE:    18 FE        00011000 11111110
UTF-32:      00 00 FE 18  00000000 00000000 11111110 00011000
US-ASCII:    -
ISO-8859-1:  -
ISO-8859-7:  -
GB18030:     84 31 83 34  10000100 00110001 10000011 00110100
Shift_JIS:   -
GSM_0338:    -
EBCDIC 1047: -

Interesting Characters

Character: į̇́  U+012F LATIN SMALL LETTER I WITH OGONEK, U+0307 COMBINING DOT ABOVE, U+0301 COMBINING ACUTE ACCENT
Encoding     Hex                                  Binary
UTF-8:       C4 AF CC 87 CC 81                    11000100 10101111 11001100 10000111 11001100 10000001
UTF-16BE:    01 2F 03 07 03 01                    00000001 00101111 00000011 00000111 00000011 00000001
UTF-16LE:    2F 01 07 03 01 03                    00101111 00000001 00000111 00000011 00000001 00000011
UTF-32:      00 00 01 2F 00 00 03 07 00 00 03 01  00000000 00000000 00000001 00101111 00000000 00000000 00000011 00000111 00000000 00000000 00000011 00000001
US-ASCII:    - - -                                00111111 00111111 00111111
ISO-8859-1:  - - -                                00111111 00111111 00111111
ISO-8859-7:  - - -                                00111111 00111111 00111111
GB18030:     81 30 90 31 81 30 BD 33 81 30 BC 37  10000001 00110000 10010000 00110001 10000001 00110000 10111101 00110011 10000001 00110000 10111100 00110111
Shift_JIS:   - - -                                00111111 00111111 00111111
GSM_0338:    - - -                                00111111 00111111 00111111
EBCDIC 1047: - - -                                00111111 00111111 00111111

Interesting Characters

Character: 💩  U+1F4A9 PILE OF POO
Encoding     Hex          Binary
UTF-8:       F0 9F 92 A9  11110000 10011111 10010010 10101001
UTF-16BE:    D8 3D DC A9  11011000 00111101 11011100 10101001
UTF-16LE:    3D D8 A9 DC  00111101 11011000 10101001 11011100
UTF-32:      00 01 F4 A9  00000000 00000001 11110100 10101001
US-ASCII:    -
ISO-8859-1:  -
ISO-8859-7:  -
GB18030:     94 39 DA 33  10010100 00111001 11011010 00110011
Shift_JIS:   -
GSM_0338:    -
EBCDIC 1047: -

Interesting Characters

Character: 🧑🏿  U+1F9D1 ADULT, U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
Encoding     Hex                      Binary
UTF-8:       F0 9F A7 91 F0 9F 8F BF  11110000 10011111 10100111 10010001 11110000 10011111 10001111 10111111
UTF-16BE:    D8 3E DD D1 D8 3C DF FF  11011000 00111110 11011101 11010001 11011000 00111100 11011111 11111111
UTF-16LE:    3E D8 D1 DD 3C D8 FF DF  00111110 11011000 11010001 11011101 00111100 11011000 11111111 11011111
UTF-32:      00 01 F9 D1 00 01 F3 FF  00000000 00000001 11111001 11010001 00000000 00000001 11110011 11111111
US-ASCII:    - -                      00111111 00111111
ISO-8859-1:  - -                      00111111 00111111
ISO-8859-7:  - -                      00111111 00111111
GB18030:     95 30 E0 33 94 39 C9 33  10010101 00110000 11100000 00110011 10010100 00111001 11001001 00110011
Shift_JIS:   - -                      00111111 00111111
GSM_0338:    - -                      00111111 00111111
EBCDIC 1047: - -                      00111111 00111111
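Most rows in the "Interesting Characters" slides can be reproduced with Python's built-in codecs. This is a rough sketch: `encode_row` and `ENCODINGS` are names invented here, and GSM 03.38 and EBCDIC 1047 are left out because, as far as I know, the standard library has no codec for them.

```python
# Codec names as Python spells them; UTF-32 is shown big-endian, as on the slides.
ENCODINGS = [
    "utf-8", "utf-16-be", "utf-16-le", "utf-32-be",
    "ascii", "iso-8859-1", "iso-8859-7", "gb18030", "shift_jis",
]


def encode_row(text: str, encoding: str):
    """Return the encoded bytes as spaced hex, or None if not representable."""
    try:
        data = text.encode(encoding)
    except UnicodeEncodeError:
        return None
    return " ".join(f"{byte:02X}" for byte in data)


for name in ENCODINGS:
    row = encode_row("\u00e9", name)  # é
    print(f"{name + ':':<12} {row if row is not None else '-'}")
```

Unlike the slides, this prints "-" instead of substituting "?" (3F) when a character cannot be encoded.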

Character Sets and Encodings

ASCII ("US-ASCII")

Latin-1 ("ISO-8859-1")

Latin-n ("ISO-8859-n")
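The point of the Latin-n family is that the same octet means different characters depending on which part you pick. A small sketch using byte values from the tables above (`decode_as` is a made-up helper name):

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Interpret the same octets under a chosen single-byte encoding."""
    return raw.decode(encoding)


octet = bytes([0xD9])
# Same octet, two different characters.
print(decode_as(octet, "iso-8859-1"))  # Latin-1
print(decode_as(octet, "iso-8859-7"))  # Latin/Greek
```

Nothing in the octet itself says which interpretation is right; that information has to travel alongside the data.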

EBCDIC

Unicode character set

UTF-32

UTF-16

UCS-2

UTF-16

The top 6 bits identify the type of code unit:

110110xxxxxxxxxx - high surrogate
110111xxxxxxxxxx - low surrogate
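A small sketch of that check, shifting away the low 10 bits and comparing what is left with the two patterns (the function name is just for illustration):

```python
def classify_code_unit(unit: int) -> str:
    """Classify a 16-bit UTF-16 code unit by its top 6 bits."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("a UTF-16 code unit must fit in 16 bits")
    top6 = unit >> 10  # drop the 10 payload bits
    if top6 == 0b110110:
        return "high surrogate"
    if top6 == 0b110111:
        return "low surrogate"
    return "not a surrogate"


for unit in (0xD83D, 0xDCA9, 0x0041):
    print(f"{unit:04X}", classify_code_unit(unit))
```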

UTF-16

The top 6 bits identify the type of code unit:

💩 U+1F4A9 PILE OF POO
UTF-16: D83D DCA9  1101100000111101 1101110010101001

Stick the lower 10 bits of each code unit together:

0000111101 0010101001 = 00001111010010101001 = 0xF4A9

Add 0x10000:

0xF4A9 + 0x10000 = 0x1F4A9
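The same arithmetic as a sketch in Python (`combine_surrogates` is a name chosen here, not a library function):

```python
def combine_surrogates(high: int, low: int) -> int:
    """Turn a UTF-16 surrogate pair into the code point it represents."""
    if high >> 10 != 0b110110 or low >> 10 != 0b110111:
        raise ValueError("expected a high surrogate followed by a low surrogate")
    # Keep the lower 10 bits of each unit and join them into 20 bits.
    joined = ((high & 0x3FF) << 10) | (low & 0x3FF)
    # The pair encodes (code point - 0x10000), so add it back.
    return joined + 0x10000


code_point = combine_surrogates(0xD83D, 0xDCA9)
print(f"U+{code_point:04X}", chr(code_point))
```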

Unicode character set

Credit: @tunameltsmyheart

Unicode character set

UTF-16

Which byte of my code unit comes first?

💩 U+1F4A9 PILE OF POO
UTF-16BE: D8 3D DC A9  11011000 00111101 11011100 10101001
UTF-16LE: 3D D8 A9 DC  00111101 11011000 10101001 11011100
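The code units are the same in both cases; only the order of the two octets inside each unit changes. A sketch using `struct` to serialise the units either way (`code_units_to_bytes` is a hypothetical helper):

```python
import struct


def code_units_to_bytes(units, big_endian: bool) -> bytes:
    """Serialise 16-bit code units with the chosen byte order."""
    prefix = ">" if big_endian else "<"
    return struct.pack(prefix + "H" * len(units), *units)


units = [0xD83D, 0xDCA9]  # surrogate pair for U+1F4A9
print(code_units_to_bytes(units, big_endian=True).hex())   # UTF-16BE order
print(code_units_to_bytes(units, big_endian=False).hex())  # UTF-16LE order

# Python's plain "utf-16" codec prepends a byte order mark (U+FEFF),
# so a reader can tell which order was used.
print("\U0001F4A9".encode("utf-16")[:2].hex())
```

Which BOM bytes the plain "utf-16" codec emits depends on the machine's native byte order, so no output is shown for the last line.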

UTF-8

UTF-8

Character: A  U+0041 LATIN CAPITAL LETTER A
UTF-8: 41  01000001

UTF-8

Character: é  U+00E9 LATIN SMALL LETTER E WITH ACUTE
UTF-8: C3 A9  11000011 10101001

UTF-8

Character: ︘  U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
UTF-8: EF B8 98  11101111 10111000 10011000

UTF-8

Character: 💩  U+1F4A9 PILE OF POO
UTF-8: F0 9F 92 A9  11110000 10011111 10010010 10101001
Payload bits: 000 011111 010010 101001 = 000011111010010101001 = 0x1F4A9
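The four examples follow one pattern: the leading bits of the first octet give the sequence length, every later octet starts with 10, and the remaining bits are concatenated. A minimal sketch of that decoding, which deliberately skips the extra checks a real decoder needs (overlong forms, surrogates, values above U+10FFFF); the function name is invented here:

```python
def decode_utf8_sequence(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence into a code point (no strict checks)."""
    first = data[0]
    if first >> 7 == 0b0:          # 0xxxxxxx: 1 octet
        length, value = 1, first
    elif first >> 5 == 0b110:      # 110xxxxx: 2 octets
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:     # 1110xxxx: 3 octets
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:    # 11110xxx: 4 octets
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a valid first octet")
    if len(data) != length:
        raise ValueError("wrong number of octets for this first octet")
    for octet in data[1:]:
        if octet >> 6 != 0b10:     # continuation octets are 10xxxxxx
            raise ValueError("expected a continuation octet")
        value = (value << 6) | (octet & 0x3F)
    return value


print(hex(decode_utf8_sequence(bytes.fromhex("F09F92A9"))))
```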

GB18030

GSM 03.38 ("GSM 7")

Similar to ASCII:

Character: A  U+0041 LATIN CAPITAL LETTER A
GSM_0338: 41  1000001

GSM 03.38 ("GSM 7")

But quirky:

Character: @  U+0040 COMMERCIAL AT
GSM_0338: 00  0000000

GSM 03.38 ("GSM 7")

But quirky:

Character: {  U+007B LEFT CURLY BRACKET
GSM_0338: 1B 28  0011011 0101000
(In Basic Character Set Extension)
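To make the quirks concrete, here is a toy encoder that knows only the five characters whose GSM 03.38 values appear in these slides. `GSM_SAMPLE` and `encode_gsm_sample` are illustrative names, the table is nowhere near the full alphabet, and each 7-bit value is stored in its own byte rather than packed as a real SMS would be.

```python
# Partial table, values taken from the slides; 0x1B is the escape
# into the Basic Character Set Extension.
GSM_SAMPLE = {
    "@": b"\x00",
    "\u00e9": b"\x05",   # é
    "\u03a9": b"\x15",   # Ω
    "A": b"\x41",
    "{": b"\x1b\x28",
}


def encode_gsm_sample(text: str) -> bytes:
    """Encode text with the partial table above, one septet per byte."""
    out = bytearray()
    for ch in text:
        if ch not in GSM_SAMPLE:
            raise ValueError(f"{ch!r} is not in this partial table")
        out += GSM_SAMPLE[ch]
    return bytes(out)


print(encode_gsm_sample("@A{").hex(" "))
```

Three characters become four septets, because "{" costs an escape plus a second value.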

More info