Unicode

Jacob Harvey
4 min read · Sep 11, 2021
Photo by Timothée Gidenne on Unsplash

What Is Unicode?

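Unicode is the international standard for representing written text on computers. It assigns every character a unique number, called a code point, so that the same text can be stored and exchanged reliably across systems. The Unicode Standard covers scripts including Latin, Greek, Cyrillic, Arabic, Hebrew, Syriac, Mongolian, Devanagari, Armenian, the scripts used to write Chinese and Japanese, and many others. To put this in perspective, consider how we communicate in English. Most people write English with the Latin alphabet, which comprises the 26 letters A–Z, and Unicode gives each of those letters its own code point, just as it does for the characters of every other script.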
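To make the idea of one number per character concrete, here is a minimal Python sketch. Unicode calls these numbers code points and conventionally writes them as U+ followed by hexadecimal digits. The helper name `code_point` and the sample characters are my own choices for illustration; `ord` and `chr` are Python built-ins.

```python
# Map characters to their Unicode code points and back using Python built-ins.

def code_point(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# One sample letter each from Latin, Greek, Cyrillic, Hebrew, and Han.
samples = ["A", "z", "Ω", "я", "א", "中"]
for ch in samples:
    print(ch, code_point(ch))

# chr() is the inverse of ord(): it turns a code point back into a character.
assert chr(0x41) == "A"
```

The same functions work for any script, which is the point: the program does not need to know which language a character belongs to.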

A Little About Unicode

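Unicode is a character encoding standard whose first version was published in 1991. It provides a unified way to represent the world's written languages, regardless of their writing system. The design work began in the late 1980s with a small number of engineers at Xerox and Apple. The Unicode Consortium, a non-profit organization formed to develop and promote the standard, was incorporated in 1991, and it keeps Unicode synchronized with ISO/IEC 10646, the corresponding standard from the International Organization for Standardization (ISO). Unicode ended up covering much more than was originally planned: the early design assumed that 16 bits (65,536 code points) would be enough, but the code space was later extended to 1,114,112 code points.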

Photo by Tim Mossholder on Unsplash

Why should you care about Unicode?

The Unicode Standard provides the backbone of much of what we use on a daily basis, whether it be a Microsoft operating system, a Facebook app, or a web page. All modern personal computers can display documents encoded in UTF-8, UTF-16, or UTF-32, the three encoding forms that Unicode defines. Each one can represent every Unicode character; they differ only in how many bytes they use per character. UTF-8 uses one to four bytes, UTF-16 uses two or four, and UTF-32 always uses four. UTF-8 is by far the most common choice on the web, which makes picking an encoding for international communication simple. The character repertoire itself is kept in step with ISO/IEC 10646, so both standards describe the same set of characters for the world's writing systems.

Unicode encodes characters, not words or meanings. There may be many ways to spell a word, but a given letter has the same code point no matter which language or application uses it. When you type the letters lol into Google, you are sending three code points (U+006C, U+006F, U+006C); deciding that they mean laughing out loud is up to the search engine, not Unicode. Likewise, the Japanese word tsundere is usually written in four katakana characters (ツンデレ), each with its own code point. Chinese and Japanese also share many Han characters, and Unicode generally encodes such a shared character once even though it is pronounced differently in each language. Being able to handle different languages in the same web application is important for universal use.
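As a rough illustration of how the UTF encodings differ, the sketch below encodes a few strings with Python's built-in codecs and counts the resulting bytes. The function name `encoded_sizes` and the sample strings are my own; the little-endian codec variants are used only so that Python does not prepend a byte order mark, which would inflate the counts.

```python
# Compare how many bytes the same text needs in UTF-8, UTF-16, and UTF-32.

def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding form."""
    # The "-le" codecs write no byte order mark, so only the text is counted.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

for sample in ["lol", "ツンデレ", "😀"]:
    print(sample, encoded_sizes(sample))
```

ASCII letters take one byte each in UTF-8, katakana take three, and the emoji takes four bytes in all three encodings because it lies outside the range that UTF-16 can store in a single 16-bit unit.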

How is Unicode designed?

As mentioned previously, Unicode's goal is to provide a way to represent every language in whatever writing system it uses. For instance, Greek has letters, accents, and punctuation that a character set designed only for Latin characters cannot represent. The standard addresses this by providing not only the full set of Latin characters but also the letters, marks, and symbols of other scripts. Letters that carry diacritics, such as accented Greek vowels, can often be represented in two ways: as a single precomposed character, or as a base letter followed by one or more combining marks. Since its first release in 1991, Unicode has grown to include more than 140,000 characters, far more than the 256 values available in 8-bit character sets such as those in the ISO/IEC 8859 series.
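Diacritics are a good place to see the design in action. In Unicode, an accented Greek letter can be stored either as one precomposed code point or as a base letter followed by a combining mark, and normalization converts between the two. The sketch below uses Python's standard `unicodedata` module; the variable names and the helper `same_text` are my own, chosen for illustration.

```python
import unicodedata

# Greek small alpha with an acute accent (tonos), written two ways.
precomposed = "\u03ac"        # a single code point
decomposed = "\u03b1\u0301"   # alpha followed by a combining acute accent

# Both render as the same letter, but the code point sequences differ,
# so a plain string comparison treats them as different.
print(precomposed == decomposed)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(precomposed, decomposed))
```

Normalizing before comparing or storing text is a common precaution, since users and keyboards may produce either form.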

Photo by Yuyeung Lau on Unsplash

The impact of Unicode on languages other than English

Bidi is not a language. It is short for bidirectional text: text that mixes right-to-left scripts, such as Arabic and Hebrew, with left-to-right ones, such as Latin. Arabic in particular is one of the most widely used scripts in the world, and Unicode supports such text with the Unicode Bidirectional Algorithm, which specifies how mixed-direction text is ordered for display. Emoji are another part of Unicode that matters well beyond English, because they are encoded as ordinary characters and can be used alongside any script. To keep emoji consistent across devices, Google's EmojiCompat library lets Android apps display newer emoji on older versions of Android by supplying an up-to-date emoji font. Unicode defines several thousand emoji, although how many a given device can display depends on its fonts and software version. Emoji and what they represent are becoming a fundamental part of people's communication.
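In Unicode terminology, Bidi refers to bidirectional text, and every character carries a bidirectional class that display software uses to decide its direction. The sketch below reads that class with Python's standard `unicodedata` module and also shows that a single visible emoji can consist of more than one code point. The helper name `bidi_class` and the sample characters are my own choices.

```python
import unicodedata

def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)

# A Latin letter, a Hebrew letter, an Arabic letter, and a digit.
for ch in ["A", "\u05d0", "\u0627", "1"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), bidi_class(ch))

# A thumbs-up emoji followed by a skin tone modifier: it is drawn as one
# symbol but stored as two code points.
thumbs_up = "\U0001F44D\U0001F3FD"
print(len(thumbs_up))
```

The classes printed here (L for left-to-right, R and AL for right-to-left, EN for European digits) are inputs to the layout process; the actual reordering is done by the text rendering system, not by this snippet.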

Conclusion

Unicode makes the world go round! Multilingual people today face an interesting challenge: they may learn and use multiple languages from a very young age, but their devices have not always handled every one of those languages equally well. Sometimes people end up with incomprehensible garbled text on their devices, typically because the text was decoded with the wrong encoding, which I imagine is super annoying. At a time when older character sets had room for only a limited number of characters, Unicode was introduced to help with this problem. Although Unicode itself is an industry standard, a lot of the work on making it accessible to non-English users is carried out by the Unicode Consortium, a non-profit organization, and its contributors. Thank you for reading my blog and stop by soon for my next one!


Jacob Harvey

Software Developer building the future of banking.