Storing text in binary
Binary Representation
Computers store every type of data, including letters and symbols, as binary numbers. Text is stored using established encodings that map each character to a binary value.
Basic Encoding Process
Example of encoding symbols:
Heart = 10
Peace symbol = 01
Smiling emoji = 11
This is known as HPE coding. Programmers must agree on a standard encoding to ensure consistent binary representation, enabling correct retrieval and display of characters.
Strings and Character Storage
A sequence of characters (a string) is stored by concatenating the binary encodings of its characters. For example, a file named msg.hpe might contain: 010111111010. Read two bits at a time, that is peace, peace, smile, smile, heart, heart. Programs can read these files and display the corresponding characters if they understand the encoding used.
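The HPE scheme is small enough to sketch in a few lines of Python. This is only an illustration of the idea above: the names HPE_CODES, hpe_encode, and hpe_decode are made up for this sketch, and the symbols are represented by plain words rather than real glyphs.

```python
# Hypothetical table for the HPE example: two bits per symbol.
HPE_CODES = {"heart": "10", "peace": "01", "smile": "11"}
HPE_SYMBOLS = {bits: name for name, bits in HPE_CODES.items()}

def hpe_encode(names):
    """Concatenate the 2-bit code of each symbol name."""
    return "".join(HPE_CODES[name] for name in names)

def hpe_decode(bits):
    """Split the bit string into 2-bit chunks and look each one up."""
    if len(bits) % 2 != 0:
        raise ValueError("HPE data must contain an even number of bits")
    return [HPE_SYMBOLS[bits[i:i + 2]] for i in range(0, len(bits), 2)]

decoded = hpe_decode("010111111010")
print(decoded)  # ['peace', 'peace', 'smile', 'smile', 'heart', 'heart']
```

Decoding only works because both sides share the same table. A program that assumed a different mapping would read the same bits as different symbols.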
ASCII Encoding
ASCII (American Standard Code for Information Interchange) is one of the first standardized character encodings, developed in the 1960s. It encodes each character in binary using seven bits, which allows 128 different codes. The first 32 codes are control characters, which were used to control transmission and devices rather than to display text, such as BEL (bell) = 0000111. Some control characters are still in use: CR (Carriage Return) and LF (Line Feed) mark line endings in text and in protocols like HTTP.
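As a rough illustration, Python's built-in ASCII codec can show the seven-bit pattern for each character. The helper name ascii_bits is an assumption made for this sketch.

```python
def ascii_bits(text):
    """Return the 7-bit binary string for each ASCII character in text."""
    codes = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    return [format(code, "07b") for code in codes]

print(ascii_bits("Hi"))       # ['1001000', '1101001']
print(ascii_bits("\a\r\n"))   # BEL, CR, LF: ['0000111', '0001101', '0001010']
```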
Limitations of ASCII
ASCII is limited to the English alphabet and a small set of symbols, so it cannot represent multilingual text. Because computers store data in 8-bit bytes, each 7-bit ASCII character leaves one bit unused. Various systems used that extra bit differently, leading to compatibility issues: for example, HP used it for European characters while the TRS-80 used it for graphics. These limitations highlighted the need for a universal encoding.
Unicode Development
Unicode was developed to provide a single character set covering all languages. It assigns each character a unique code point (a number, usually written in hexadecimal) and a defined name. As of 2019, Unicode includes 137,929 characters, covering the major scripts and languages as well as emojis.
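Code points and names can be looked up with Python's standard unicodedata module. The describe helper below is a name invented for this sketch, and the sample characters are arbitrary choices.

```python
import unicodedata

def describe(ch):
    """Format a character as 'U+XXXX NAME' using its code point and Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Escapes are used so the sample does not depend on how this file is saved.
for ch in ["A", "\u00e9", "\U0001F499"]:
    print(describe(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+1F499 BLUE HEART
```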
Encoding Unicode with UTF-8
UTF-8 is an encoding of Unicode that is backward compatible with ASCII. It uses one to four bytes per character:
One byte for ASCII characters
Two bytes for accented Latin letters and nearby scripts such as Greek and Cyrillic
Three bytes for most characters used in Asian languages
Four bytes for rare characters and emojis
UTF-8 dominates HTML files and is prevalent on the web (94.5% of web pages as of December 2019).
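The byte counts can be checked directly in Python. This is a small sketch with one arbitrarily chosen sample character per length; utf8_bytes is a helper name made up here.

```python
def utf8_bytes(ch):
    """Return the UTF-8 bytes of a character as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

# One sample per length: A, e with acute accent, a CJK ideograph, blue heart emoji.
samples = ["A", "\u00e9", "\u65e5", "\U0001F499"]
for ch in samples:
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")), "byte(s):", utf8_bytes(ch))
# U+0041 1 byte(s): 41
# U+00E9 2 byte(s): c3 a9
# U+65E5 3 byte(s): e6 97 a5
# U+1F499 4 byte(s): f0 9f 92 99
```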
Alternatives to UTF-8
Other options include UTF-16 and UTF-32, which can also represent all Unicode characters. Some encodings cater to a particular language, such as Shift JIS for Japanese. Programmers choose the encoding that best suits their software, aiming to represent the characters they need in as few bits as possible.
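To see the trade-off, the same text can be encoded several ways and the sizes compared. The sketch below uses codecs that ship with Python; the little-endian variants are chosen so that no byte-order mark is added to the count, and the sample strings are arbitrary.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "shift_jis"]

def encoded_sizes(text):
    """Map each encoding name to the number of bytes it needs for text."""
    return {name: len(text.encode(name)) for name in ENCODINGS}

print(encoded_sizes("Hello"))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20, 'shift_jis': 5}
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # the word for "Japanese language"
# {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12, 'shift_jis': 6}
```

English text is smallest in UTF-8, while this Japanese sample is smaller in UTF-16 or Shift JIS, which is one reason different software makes different choices.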
Example of UTF-8 Decoding:
For example, suppose you had these 5 bytes: 01001001 11110000 10011111 10010010 10011001
They represent 2 characters, and here’s why 👇
01001001 begins with a zero, meaning it is a byte that represents one character on its own (here, the ASCII letter I).
11110000 begins with four 1s, meaning that it is the first of 4 bytes that together represent one character. The three bytes after it each begin with 10, which marks them as continuation bytes. But why do you need four bytes to represent 1 character?
Well, in UTF-8 the number of leading 1s in the first byte tells you how many bytes the character occupies, and characters with larger code points need more bytes. Four-byte sequences are used for emojis and rare characters; this one is U+1F499, the blue heart 💙.
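The walk-through can be reproduced in Python, with the second group written as the full eight bits 11110000. This is a sketch: sequence_length is a helper written for this example, and it only inspects the first byte of each sequence without validating the continuation bytes. The built-in decoder is used for the actual conversion.

```python
def sequence_length(first_byte):
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte >> 7 == 0b0:
        return 1          # 0xxxxxxx: single-byte (ASCII) character
    if first_byte >> 5 == 0b110:
        return 2          # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3          # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")

bits = "01001001 11110000 10011111 10010010 10011001"
data = bytes(int(group, 2) for group in bits.split())

# Walk the data one character at a time, using only the lead bytes.
lengths = []
i = 0
while i < len(data):
    n = sequence_length(data[i])
    lengths.append(n)
    i += n

text = data.decode("utf-8")
print(lengths)    # [1, 4]
print(len(text))  # 2
print(text)       # the letter I followed by the blue heart emoji
```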