Character: A U+0041 LATIN CAPITAL LETTER A
UTF-8: 41 01000001
UTF-16BE: 00 41 00000000 01000001
UTF-16LE: 41 00 01000001 00000000
UTF-32: 00 00 00 41 00000000 00000000 00000000 01000001
US-ASCII: 41 01000001
ISO-8859-1: 41 01000001
ISO-8859-7: 41 01000001
GB18030: 41 01000001
Shift_JIS: 41 01000001
GSM_0338: 41 01000001
EBCDIC 1047: C1 11000001
Character: é U+00E9 LATIN SMALL LETTER E WITH ACUTE
UTF-8: C3 A9 11000011 10101001
UTF-16BE: 00 E9 00000000 11101001
UTF-16LE: E9 00 11101001 00000000
UTF-32: 00 00 00 E9 00000000 00000000 00000000 11101001
US-ASCII: -
ISO-8859-1: E9 11101001
ISO-8859-7: -
GB18030: A8 A6 10101000 10100110
Shift_JIS: -
GSM_0338: 05 00000101
EBCDIC 1047: 51 01010001
Character: Ω U+03A9 GREEK CAPITAL LETTER OMEGA
UTF-8: CE A9 11001110 10101001
UTF-16BE: 03 A9 00000011 10101001
UTF-16LE: A9 03 10101001 00000011
UTF-32: 00 00 03 A9 00000000 00000000 00000011 10101001
US-ASCII: -
ISO-8859-1: -
ISO-8859-7: D9 11011001
GB18030: A6 B8 10100110 10111000
Shift_JIS: 83 B6 10000011 10110110
GSM_0338: 15 00010101
EBCDIC 1047: -
Character: ⺟ U+2E9F CJK RADICAL MOTHER
UTF-8: E2 BA 9F 11100010 10111010 10011111
UTF-16BE: 2E 9F 00101110 10011111
UTF-16LE: 9F 2E 10011111 00101110
UTF-32: 00 00 2E 9F 00000000 00000000 00101110 10011111
US-ASCII: -
ISO-8859-1: -
ISO-8859-7: -
GB18030: 81 39 82 33 10000001 00111001 10000010 00110011
Shift_JIS: -
GSM_0338: -
EBCDIC 1047: -
Character: ︘ U+FE18 PRESENTATION FORM FOR VERTICAL
RIGHT WHITE LENTICULAR BRAKCET
UTF-8: EF B8 98 11101111 10111000 10011000
UTF-16BE: FE 18 11111110 00011000
UTF-16LE: 18 FE 00011000 11111110
UTF-32: 00 00 FE 18 00000000 00000000 11111110 00011000
US-ASCII: -
ISO-8859-1: -
ISO-8859-7: -
GB18030: 84 31 83 34 10000100 00110001 10000011 00110100
Shift_JIS: -
GSM_0338: -
EBCDIC 1047: -
Character: U+012F LATIN SMALL LETTER I WITH OGONEK U+0307 COMBINING DOT ABOVE U+0301 COMBINING ACUTE ACCENT
UTF-8: C4 AF CC 87 CC 81 11000100 10101111 11001100 10000111 11001100 10000001
UTF-16BE: 01 2F 03 07 03 01 00000001 00101111 00000011 00000111 00000011 00000001
UTF-16LE: 2F 01 07 03 01 03 00101111 00000001 00000111 00000011 00000001 00000011
UTF-32: 00 00 01 2F 00 00 03 07 00 00 03 01 00000000 00000000 00000001 00101111 00000000 00000000 00000011 00000111 00000000 00000000 00000011 00000001
US-ASCII: - - - 00111111 00111111 00111111
ISO-8859-1: - - - 00111111 00111111 00111111
ISO-8859-7: - - - 00111111 00111111 00111111
GB18030: 81 30 90 31 81 30 BD 33 81 30 BC 37 10000001 00110000 10010000 00110001 10000001 00110000 10111101 00110011 10000001 00110000 10111100 00110111
Shift_JIS: - - - 00111111 00111111 00111111
GSM_0338: - - - 00111111 00111111 00111111
EBCDIC 1047: - - - 00111111 00111111 00111111
Character: 💩 U+1F4A9 PILE OF POO
UTF-8: F0 9F 92 A9 11110000 10011111 10010010 10101001
UTF-16BE: D8 3D DC A9 11011000 00111101 11011100 10101001
UTF-16LE: 3D D8 A9 DC 00111101 11011000 10101001 11011100
UTF-32: 00 01 F4 A9 00000000 00000001 11110100 10101001
US-ASCII: -
ISO-8859-1: -
ISO-8859-7: -
GB18030: 94 39 DA 33 10010100 00111001 11011010 00110011
Shift_JIS: -
GSM_0338: -
EBCDIC 1047: -
Character: 🧑🏿 U+1F9D1 ADULT U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
UTF-8: F0 9F A7 91 F0 9F 8F BF 11110000 10011111 10100111 10010001 11110000 10011111 10001111 10111111
UTF-16BE: D8 3E DD D1 D8 3C DF FF 11011000 00111110 11011101 11010001 11011000 00111100 11011111 11111111
UTF-16LE: 3E D8 D1 DD 3C D8 FF DF 00111110 11011000 11010001 11011101 00111100 11011000 11111111 11011111
UTF-32: 00 01 F9 D1 00 01 F3 FF 00000000 00000001 11111001 11010001 00000000 00000001 11110011 11111111
US-ASCII: - - 00111111 00111111
ISO-8859-1: - - 00111111 00111111
ISO-8859-7: - - 00111111 00111111
GB18030: 95 30 E0 33 94 39 C9 33 10010101 00110000 11100000 00110011 10010100 00111001 11001001 00110011
Shift_JIS: - - 00111111 00111111
GSM_0338: - - 00111111 00111111
EBCDIC 1047: - - 00111111 00111111
The top 6 bits identify the type of code unit:
110110xxxxxxxxxx - high surrogate
110111xxxxxxxxxx - low surrogate
💩 U+1F4A9 PILE OF POO
UTF-16: D83D DCA9 1101100000111101 1101110010101001
Concatenate the lower 10 bits of each code unit:
0000111101 0010101001 = 0xF4A9
Add 0x10000:
0xF4A9 + 0x10000 = 0x1F4A9
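The surrogate arithmetic can be written out as a short sketch. The function names `combine_surrogates` and `split_into_surrogates` are hypothetical, chosen for illustration; the second function simply runs the same steps in reverse.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Turn a UTF-16 high/low surrogate pair into a code point."""
    if not 0xD800 <= high <= 0xDBFF:   # top 6 bits must be 110110
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:    # top 6 bits must be 110111
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Keep the lower 10 bits of each unit, join them, then add 0x10000.
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000


def split_into_surrogates(code_point: int) -> tuple:
    """Inverse operation: code point above U+FFFF -> (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)


print(hex(combine_surrogates(0xD83D, 0xDCA9)))
print([hex(u) for u in split_into_surrogates(0x1F4A9)])
```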
Which byte of my code unit comes first?
💩 U+1F4A9 PILE OF POO
UTF-16BE: D8 3D DC A9 11011000 00111101 11011100 10101001
UTF-16LE: 3D D8 A9 DC 00111101 11011000 10101001 11011100
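The two byte orders differ only in how each 16-bit code unit is serialized; the order of the code units themselves does not change. A minimal sketch, with `pack_code_units` as a hypothetical helper name:

```python
def pack_code_units(units, byteorder: str) -> bytes:
    """Serialize 16-bit code units; byteorder is 'big' or 'little'."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)


pile_of_poo = (0xD83D, 0xDCA9)  # surrogate pair for U+1F4A9

big = pack_code_units(pile_of_poo, "big")        # UTF-16BE layout
little = pack_code_units(pile_of_poo, "little")  # UTF-16LE layout

print(" ".join(f"{b:02X}" for b in big))
print(" ".join(f"{b:02X}" for b in little))
```

Python's built-in `utf-16-be` and `utf-16-le` codecs should produce the same bytes for the same character.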
Character: A U+0041 LATIN CAPITAL LETTER A
UTF-8: 41 01000001
Character: é U+00E9 LATIN SMALL LETTER E WITH ACUTE
UTF-8: C3 A9 11000011 10101001
Character: ︘ U+FE18 PRESENTATION FORM FOR VERTICAL
RIGHT WHITE LENTICULAR BRAKCET
UTF-8: EF B8 98 11101111 10111000 10011000
Character: 💩 U+1F4A9 PILE OF POO
UTF-8: F0 9F 92 A9 11110000 10011111 10010010 10101001
000011111010010101001 = 0x1F4A9
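The bit extraction for UTF-8 can be sketched as follows. The function name `decode_utf8_code_point` is hypothetical, and this is an illustration of the bit layout rather than a full decoder: it checks the lead and continuation byte prefixes but does not reject overlong forms or encoded surrogates, which a conforming decoder must.

```python
def decode_utf8_code_point(data: bytes) -> int:
    """Decode exactly one UTF-8 encoded code point from data."""
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx: 1 byte, 7 payload bits
        length, value = 1, first
    elif first >> 5 == 0b110:        # 110xxxxx: 2 bytes, 5 payload bits here
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: 3 bytes, 4 payload bits here
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: 4 bytes, 3 payload bits here
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:        # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)  # append 6 payload bits
    return value


print(hex(decode_utf8_code_point(bytes.fromhex("F09F92A9"))))
```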
Similar to ASCII:
Character: A U+0041 LATIN CAPITAL LETTER A
GSM_0338: 41 1000001
But quirky:
Character: @ U+0040 COMMERCIAL AT
GSM_0338: 00 0000000
Character: { U+007B LEFT CURLY BRACKET
GSM_0338: 1B 28 0011011 0101000
(In Basic Character Set Extension)
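The escape mechanism can be sketched with a deliberately tiny table. Everything here is an assumption made for illustration: the names `GSM_BASIC`, `GSM_EXTENSION` and `gsm_encode` are hypothetical, the tables hold only the characters shown in this document rather than the full GSM 03.38 alphabet, and the output is one septet per byte (unpacked), not the packed 7-bit form used on the wire.

```python
ESCAPE = 0x1B  # switches to the extension table for the next septet

# Partial tables: only the characters that appear in the examples above.
GSM_BASIC = {"@": 0x00, "\u00e9": 0x05, "\u03a9": 0x15, "A": 0x41}
GSM_EXTENSION = {"{": 0x28}


def gsm_encode(text: str) -> bytes:
    """Encode text to unpacked GSM 03.38 septets (one septet per byte)."""
    out = bytearray()
    for ch in text:
        if ch in GSM_BASIC:
            out.append(GSM_BASIC[ch])
        elif ch in GSM_EXTENSION:
            out += bytes([ESCAPE, GSM_EXTENSION[ch]])  # two septets
        else:
            raise ValueError(f"not in this partial table: {ch!r}")
    return bytes(out)


print(" ".join(f"{b:02X}" for b in gsm_encode("A@{")))
```

Characters from the extension table cost two septets each, which matters when counting how many characters fit in a message.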