Base64

Imagine you want to store the contents of a PDF document in an XML file. How would you do it? XML only supports textual data, while a PDF contains binary data. Likewise, the transport system for email was designed to handle plain ASCII text only. How do you send a PDF attachment in an email?

What we need is an encoding scheme that converts binary data to text and text back to binary data. Base64 encoding is the solution to this problem. It uses an alphabet of 64 characters, hence the 64 in its name. The 64 characters are A – Z, a – z, 0 – 9, + and /, numbered 0 to 63 in that order (A is 0, a is 26, 0 is 52, + is 62 and / is 63).
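As a small sketch of that alphabet, the snippet below builds the 64 characters in index order using Python's standard string module. The name BASE64_ALPHABET is my own choice for illustration and is not part of any library.

```python
import string

# The 64-character alphabet in index order: A-Z, a-z, 0-9, then '+' and '/'.
# BASE64_ALPHABET is a hypothetical name used only for this example.
BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

print(len(BASE64_ALPHABET))  # 64
print(BASE64_ALPHABET[16])   # Q
```

The position of a character in this string is the 6-bit value it stands for.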

Representing 64 distinct characters takes 6 bits, because 2 ^ 6 = 64. Computers, however, generally operate on bytes, and a byte is 8 bits. What is the common ground between 6 and 8 bits? The Least Common Multiple of 6 and 8 is 24. Hence Base64 encoding operates on blocks of 24 bits: 3 bytes (3 x 8 bits) go in and 4 groups of 6 bits (4 x 6 bits) come out.

Encoding a 3 Byte input – ABC

ABC (8 bit grouping) - 01000001 01000010 01000011
                         (65)    (66)     (67)

ABC (6 bit grouping) - 010000 010100 001001 000011
                         (16)  (20)   (9)    (3)

ABC -> QUJD
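
The regrouping above can be sketched in a few lines of Python. This is only an illustration of the idea for one complete 3-byte group, not how any particular library implements it; the names ALPHABET and encode_group are made up for the example.

```python
import string

# Index-ordered Base64 alphabet (hypothetical helper name).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_group(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes into 4 Base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Pack the three 8-bit values into one 24-bit integer.
    n = int.from_bytes(three_bytes, "big")
    # Slice the 24 bits into four 6-bit indices, most significant first.
    indices = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)


print(encode_group(b"ABC"))  # QUJD
```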

What if the input only contains 2 Bytes – AB

AB (8 bit grouping) - 01000001 01000010
                        (65)    (66)

AB (6 bit grouping) - 010000 010100 001000
                        (16)  (20)   (8)

AB -> QUI=

The 16 input bits do not divide evenly into groups of 6, so 00 is added to the right to complete 3 groups of 6 bits (18 bits). The output is padded with one special trailing character, =, to indicate that the input only had 2 bytes.

What if the input only contains 1 Byte – A

A (8 bit grouping) - 01000001
                       (65)

A (6 bit grouping) - 010000 010000
                       (16)  (16)

A -> QQ==

0000 is added to the right to complete 2 groups of 6 bits (12 bits). The output is padded with two special trailing characters, ==, to indicate that the input only had 1 byte.
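
Putting the three cases together, here is a minimal sketch of an encoder that handles full groups as well as the 2-byte and 1-byte leftovers. It assumes the standard alphabet and = padding described above; the names ALPHABET and b64_encode are mine, and in real code you would simply call base64.b64encode from the standard library.

```python
import string

# Index-ordered Base64 alphabet (hypothetical helper name).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64_encode(data: bytes) -> str:
    """Encode bytes as Base64 text, padding the last group with '='."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 bytes short of a full group
        # Fill the gap with zero bytes so the chunk is always 24 bits wide.
        n = int.from_bytes(chunk + b"\x00" * missing, "big")
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Characters that hold nothing but filler bits are replaced by '='.
        chars[4 - missing:] = ["="] * missing
        out.extend(chars)
    return "".join(out)


print(b64_encode(b"ABC"))  # QUJD
print(b64_encode(b"AB"))   # QUI=
print(b64_encode(b"A"))    # QQ==
```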

33% Overhead

In all the examples shown above the output is 4 characters long. Each group of 3 bytes, whether complete or padded, results in 4 characters. Thus N bytes produce 4 * ceil(N/3) characters, which for large N is approximately (4/3) * N = 1.33 * N. Hence the encoded output is about 33% larger than the input.
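
To check that arithmetic, the sketch below compares the formula with the length that Python's standard base64.b64encode actually produces. The function name encoded_length is made up for the example.

```python
import base64


def encoded_length(n_bytes: int) -> int:
    """Base64 characters needed for n_bytes of input, padding included."""
    # Integer form of 4 * ceil(n_bytes / 3).
    return 4 * ((n_bytes + 2) // 3)


for n in (1, 2, 3, 100, 1000):
    actual = len(base64.b64encode(bytes(n)))
    print(n, encoded_length(n), actual, round(actual / n, 3))
```

For small inputs the ratio is well above 1.33 because of the padding, and it settles toward 4/3 as the input grows.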

Decoding

Decoding is the reverse of encoding. Each group of 4 characters is converted back into 3 bytes.

QUJD(6 bit grouping) - 010000 010100 001001 000011
                         (16)  (20)   (9)    (3)

    (8 bit grouping) - 01000001 01000010 01000011
                          (65)    (66)     (67)
QUJD -> ABC

Decoding with = trailing character

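If the input ends with a single = character, the final group is converted back into 2 bytes. The last 2 bits of the third 6-bit group are the 00 filler added during encoding, and they are discarded.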

QUI=(6 bit grouping) - 010000 010100 001000
                         (16)  (20)   (8)

    (8 bit grouping) - 01000001 01000010
                          (65)    (66)

QUI= -> AB

Decoding with == trailing characters

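If the input ends with two = characters, the final group is converted back into 1 byte. The last 4 bits of the second 6-bit group are the 0000 filler added during encoding, and they are discarded.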

QQ== (6 bit grouping) - 010000 010000
                          (16)  (16)

     (8 bit grouping) - 01000001
                          (65)

QQ== -> A
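
The three decoding cases can be sketched with one small function. It is deliberately lenient: it assumes well-formed input with the standard alphabet and does not check that = appears only at the very end. The names ALPHABET and b64_decode are my own; real code would use base64.b64decode.

```python
import string

# Index-ordered Base64 alphabet (hypothetical helper name).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64_decode(text: str) -> bytes:
    """Decode padded Base64 text back into bytes."""
    if len(text) % 4 != 0:
        raise ValueError("length of padded Base64 text must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(text), 4):
        group = text[i:i + 4]
        pad = group.count("=")  # 0, 1 or 2 padding characters
        n = 0
        for ch in group:
            # '=' contributes six zero bits; other characters contribute their index.
            n = (n << 6) | (0 if ch == "=" else ALPHABET.index(ch))
        # The 24 bits form 3 bytes; drop one byte per padding character.
        out += n.to_bytes(3, "big")[:3 - pad]
    return bytes(out)


print(b64_decode("QUJD"))  # b'ABC'
print(b64_decode("QUI="))  # b'AB'
print(b64_decode("QQ=="))  # b'A'
```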

Gmail Testing

Just to make sure that my understanding was correct, I sent an email to myself with an attachment containing the characters ABC. The Base64-encoded data from the email message is given below. Sure enough, it contains QUJD, which is the Base64 encoding of ABC.

Content-Type: text/plain; charset=US-ASCII; name="test.txt"
Content-Disposition: attachment; filename="test.txt"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_hgri5dn60

QUJD
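
The same check can be done in code. The sketch below feeds the headers and body shown above to Python's standard email package, which decodes the payload according to the Content-Transfer-Encoding header. The variable raw_part is just the text from this post pasted into a string; a real message would contain this part inside a larger multipart message.

```python
from email import message_from_string

# The MIME part from the email, copied from the post into a string.
raw_part = (
    'Content-Type: text/plain; charset=US-ASCII; name="test.txt"\n'
    'Content-Disposition: attachment; filename="test.txt"\n'
    'Content-Transfer-Encoding: base64\n'
    'X-Attachment-Id: f_hgri5dn60\n'
    '\n'
    'QUJD\n'
)

part = message_from_string(raw_part)
# decode=True applies the Content-Transfer-Encoding (base64 here).
payload = part.get_payload(decode=True)

print(part.get_filename())  # test.txt
print(payload)              # b'ABC'
```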
