I read this excellent article about Unicode and wanted to summarize my understanding about Unicode and UTF-8.
Long ago, the only characters computers understood were the English alphabet, numbers and control characters. In all there were 128 characters, so 7 bits (2 ^ 7 = 128) were enough to represent them. This character set is called ASCII. The standard ASCII chart shows all 128 characters (8 rows * 16 cols). Each of these characters has a unique code called its ASCII code. For example, A is 65, B is 66, a is 97 and <space> is 32.
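A quick way to check these codes is Python's built-in ord(). This is just a small sketch; the variable names are my own.

```python
# Look up the ASCII code of a few characters with the built-in ord().
samples = ["A", "B", "a", " "]
ascii_codes = {ch: ord(ch) for ch in samples}

for ch, code in ascii_codes.items():
    # Show each code in decimal and as 7 binary digits.
    print(repr(ch), code, format(code, "07b"))
```

Every code here is below 128, so each one fits in 7 bits.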
Most computers in those days used 8 bits (1 byte) to store each ASCII character. Since only 7 bits were needed to represent the 128 characters (0 to 127), 1 bit was free. What happens when you have a free bit? Creativity kicked in, and everyone wanted to use this extra bit for their own purposes, which included supporting languages other than English. Hence different systems emerged for representing characters from 128 upward. These systems are called code pages. Given below are the Windows OEM code pages 437 (US) and 862 (Hebrew).
Code Page 437 (US)
Code Page 862 (Hebrew)
As you can see from the 2 code pages:
- The codes from 0 to 127 are ASCII and are the same in both code pages.
- Some of the codes from 128 upward differ between the two.
- Without knowing which code page is in use, codes from 128 upward cannot be interpreted.
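To see the problem in action, here is a small sketch that decodes the same byte under both code pages. It assumes Python's standard cp437 and cp862 codecs; the variable names are my own.

```python
# One byte value at or above 128, interpreted under two OEM code pages.
raw = bytes([0x80])

as_cp437 = raw.decode("cp437")  # Latin capital letter C with cedilla
as_cp862 = raw.decode("cp862")  # Hebrew letter alef

# Bytes below 128 are plain ASCII and decode identically under both.
ascii_raw = b"Hello"
same_ascii = ascii_raw.decode("cp437") == ascii_raw.decode("cp862")

# Print code points rather than glyphs so the output is terminal-safe.
print(f"U+{ord(as_cp437):04X}", f"U+{ord(as_cp862):04X}", same_ascii)
```

The byte is identical in both cases; only the code page chosen for decoding changes the character you get.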
The challenges did not stop there. Asian languages have thousands of characters, which will not fit in a single byte. Hence DBCS, the double-byte character set, was used. In it, some characters took 1 byte and others took 2 bytes. To simplify all this complexity, we needed a single character set that covered all the writing systems. The Unicode Consortium was thus born.
In Unicode, a letter maps to something called a code point. For example, A maps to U+0041 (the number is in hexadecimal), which is 65 in decimal. In fact, the 128 ASCII characters keep the same values as Unicode code points. Some of the biggest advantages are:
- It is a standard governed by a consortium.
- Single character set to include all the writing systems.
- ASCII codes and Unicode code points are the same for the first 128 characters. Hence it is backward compatible with most systems.
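Code points are easy to inspect in Python. The helper below is a minimal sketch; the name code_point is my own, not a standard function.

```python
def code_point(ch: str) -> str:
    """Format the Unicode code point of one character as U+XXXX."""
    return f"U+{ord(ch):04X}"


# The letter A has the same number in ASCII and in Unicode.
ascii_code = "A".encode("ascii")[0]
unicode_code = ord("A")

print(code_point("A"), ascii_code, unicode_code)
print(code_point("\u20ac"))  # the euro sign, well outside ASCII
```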
Unicode is an abstract concept: a bunch of numbers. Computers only understand binary (1s and 0s), so we need a way to represent these numbers in binary. This is where encoding comes into the picture. UTF-8 is one encoding for Unicode. It stands for UCS Transformation Format, 8-bit, where UCS stands for Universal Character Set. Some of the properties of UTF-8 are:
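- It uses a variable width to encode Unicode code points: a minimum of 1 byte and, in the original design, a maximum of 6 bytes. The current standard (RFC 3629) limits UTF-8 to 4 bytes, because Unicode code points stop at U+10FFFF.
- For the first 128 characters it is byte-for-byte identical to ASCII, which makes UTF-8 backward compatible with most systems. The first 256 Unicode code points match ISO 8859-1, but in UTF-8 the characters from 128 to 255 take 2 bytes.
- It handles memory space efficiently, as it uses a variable width for encoding and the minimum size is 1 byte.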
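The variable width is easy to observe with Python's built-in encoder. A small sketch with four sample characters of my own choosing:

```python
# Four sample characters; the escapes are e-acute, the euro sign and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the byte count and the bytes in hexadecimal.
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Plain English text stays at 1 byte per character, and only characters further up the code point range cost more.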
Given below are the ranges of numbers and the number of bytes needed in UTF-8, expressed in different bases.
Base 10
- 0 - 127 (1 byte)
- 128 - 2047 (2 bytes)
- 2048 - 65535 (3 bytes)
- 65536 - 2097151 (4 bytes)
- 2097152 - 67108863 (5 bytes)
- 67108864 - 2147483647 (6 bytes)
Base 16
- 0x0 - 0x7F (1 byte)
- 0x80 - 0x7FF (2 bytes)
- 0x800 - 0xFFFF (3 bytes)
- 0x10000 - 0x1FFFFF (4 bytes)
- 0x200000 - 0x3FFFFFF (5 bytes)
- 0x4000000 - 0x7FFFFFFF (6 bytes)
Base 2
- 00000000 - 01111111 (1 byte)
- 11000010 10000000 - 11011111 10111111 (2 bytes)
- 11100000 10100000 10000000 - 11101111 10111111 10111111 (3 bytes)
- 11110000 10010000 10000000 10000000 - 11110111 10111111 10111111 10111111 (4 bytes)
- 11111000 10001000 10000000 10000000 10000000 - 11111011 10111111 10111111 10111111 10111111 (5 bytes)
- 11111100 10000100 10000000 10000000 10000000 10000000 - 11111101 10111111 10111111 10111111 10111111 10111111 (6 bytes)
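The ranges above can be turned into a small hand-written encoder. This is only a sketch of the original 1-to-6-byte layout; the names utf8_length and utf8_encode are my own, and unlike Python's built-in encoder it does not reject surrogates or values above U+10FFFF.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed for a code point under the original 1-to-6-byte scheme."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return n
    raise ValueError("code point out of range")


def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the bit layout listed above."""
    n = utf8_length(code_point)
    if n == 1:
        return bytes([code_point])  # 0xxxxxxx
    # Continuation bytes are 10xxxxxx, filled from the low 6 bits upward.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The lead byte is n one-bits, a zero, then the remaining high bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([lead_prefix | code_point] + tail[::-1])


print(utf8_encode(0x20AC).hex(" "))  # the euro sign
```

For code points that real Unicode allows, the result should match what str.encode("utf-8") produces.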
What is the maximum number of characters possible in UTF-8?
6 bytes is the maximum possible length in the original design. Of the 48 bits in these 6 bytes, 31 carry data and the remaining 17 are control bits. Hence it can represent up to 2 ^ 31 = 2,147,483,648 values (0 to 2,147,483,647). Just to make sense of this number, it could support 82 million languages the size of English. The math is 2,147,483,647 / 26. In practice, Unicode itself stops at U+10FFFF, which gives 1,114,112 code points.
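The bit counts can be double-checked with a little arithmetic. In this sketch (the function names are my own), a lead byte of an n-byte sequence keeps 7 - n data bits and every continuation byte keeps 6.

```python
def payload_bits(n: int) -> int:
    """Data bits available in an n-byte sequence (n from 1 to 6)."""
    if n == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n ones and a zero leave 7 - n bits; each other byte gives 6.
    return (7 - n) + 6 * (n - 1)


def control_bits(n: int) -> int:
    """Bits spent on length markers and continuation markers."""
    return 8 * n - payload_bits(n)


for n in range(1, 7):
    print(n, payload_bits(n), control_bits(n))

largest_value = 2 ** payload_bits(6) - 1
```

The data bit counts line up with the upper limit of each range: 2 ^ 11 - 1 is 2047, 2 ^ 16 - 1 is 65535, and so on.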
How do I know the number of bytes to read for a character?
If the first bit of the first byte is 0, then it is a single byte.
If the first 3 bits of the first byte are 110, then it has 2 bytes.
If the first 4 bits of the first byte are 1110, then it has 3 bytes.
The pattern continues in the same way: 11110 means 4 bytes and 111110 means 5 bytes.
If the first 7 bits of the first byte are 1111110, then it has 6 bytes.
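These rules translate directly into bit masks. A minimal sketch, with a function name of my own, which also covers the 4-byte and 5-byte patterns (11110 and 111110) that follow the same progression:

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the character that starts with first_byte."""
    if first_byte & 0x80 == 0x00:  # 0xxxxxxx
        return 1
    if first_byte & 0xE0 == 0xC0:  # 110xxxxx
        return 2
    if first_byte & 0xF0 == 0xE0:  # 1110xxxx
        return 3
    if first_byte & 0xF8 == 0xF0:  # 11110xxx
        return 4
    if first_byte & 0xFC == 0xF8:  # 111110xx
        return 5
    if first_byte & 0xFE == 0xFC:  # 1111110x
        return 6
    # 10xxxxxx is a continuation byte; 1111111x is never used.
    raise ValueError("not a valid first byte")


euro = "\u20ac".encode("utf-8")
print(sequence_length(euro[0]), len(euro))
```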
Why do bytes 2 to 6 always start with the bits 10?
These bits are a marker indicating that the byte is a continuation of the character being read. Without this, there is no way to tell whether a byte starts a new character or continues the current one.
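Because of this marker, character boundaries can be found without decoding anything. The sketch below (function names are my own) groups a byte string into characters just by checking for the 10 prefix; it does no validation of the input.

```python
def is_continuation(byte: int) -> bool:
    """True when the top two bits of the byte are 10."""
    return byte & 0xC0 == 0x80


def split_characters(data: bytes) -> list:
    """Group UTF-8 bytes into one bytes object per character."""
    chunks = []
    for byte in data:
        if is_continuation(byte) and chunks:
            # Belongs to the character that is already being read.
            chunks[-1] += bytes([byte])
        else:
            # Anything else starts a new character.
            chunks.append(bytes([byte]))
    return chunks


parts = split_characters("a\u20acb".encode("utf-8"))
print([p.hex(" ") for p in parts])
```

The same check lets a reader that lands in the middle of a stream skip forward to the next byte that is not a continuation and resume from there.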
For representing 2 bytes, why not have 11 in the first byte instead of 110?
Let us assume that we drop the 0 and have only 11. If the third bit happened to be 1, how would we differentiate this from the indicator for 3 bytes, which also starts with 111? Hence we need the 0.
Content-Type and Meta Tag
On the Internet, for the client and the server to communicate properly, they need to exchange the encoding being used. The HTTP protocol has a header named Content-Type which can be used for this, for example Content-Type: text/plain; charset="UTF-8". This works fine if the web server only handles one type of application and all the pages use the same encoding. What if the web server handles multiple applications with different encodings? HTML has meta tags, and we can specify the encoding information in such a tag.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
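On the receiving side, the charset has to be pulled out of the header value before the body can be decoded. Here is a minimal sketch using Python's standard email.message module to parse the value; the helper name and the UTF-8 fallback are my own assumptions, not something HTTP mandates.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    # Falling back to UTF-8 is an assumption made for this example.
    return msg.get_content_charset() or default


header = 'text/plain; charset="UTF-8"'
body = "caf\u00e9".encode("utf-8")  # bytes as they would arrive on the wire

charset = charset_from_content_type(header)
text = body.decode(charset)
print(charset, len(body), len(text))
```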
I have just scratched the surface of Unicode and UTF-8. For more information, go to the Unicode Consortium.