UTF-8 and Unicode
Notes on what UTF-8 and Unicode really are. The GAWK info manual has a good explanation of the topic.
- Unicode
- An extensive character set in which a single character (char) can be larger than one byte.
- UTF-8
- The most widely used encoding for the Unicode character set.
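The difference between the character set and its encoding can be seen in a minimal Python sketch. Python is an assumption made here for illustration and is not part of the original note; the sample string is arbitrary. A string holds Unicode code points, and encoding it as UTF-8 produces the bytes that would be stored in a file:

```python
# A str is a sequence of Unicode code points; UTF-8 turns it into bytes.
text = "n\u00e9"                 # two characters: 'n' and 'é' (U+00E9)
encoded = text.encode("utf-8")   # the UTF-8 byte sequence for the string

char_count = len(text)           # counts code points
byte_count = len(encoded)        # counts bytes

print(char_count, byte_count)    # 2 3
print(hex(ord("\u00e9")), encoded)  # 0xe9 b'n\xc3\xa9'
```

The single character 'é' has one code point (U+00E9) but occupies two bytes in UTF-8, which is why the character count and the byte count differ.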
Full content
info "(gawk) Bytes vs. Characters" | head -n20
File: gawk.info, Node: Bytes vs. Characters, Next: Using extensions, Up: Wc Program

11.2.7.1 Modern Character Sets
..............................

In the early days of computing, single bytes were used for storing characters. The most common character sets were ASCII and EBCDIC, which each provided all the English upper- and lowercase letters, the 10 Hindu-Arabic numerals from 0 through 9, and a number of other standard punctuation and control characters.

Today, the most popular character set in use is Unicode (of which ASCII is a pure subset). Unicode provides tens of thousands of unique characters (called “code points”) to cover most existing human languages (living and dead) and a number of nonhuman ones as well (such as Klingon and J.R.R. Tolkien's elvish languages).

To save space in files, Unicode code points are “encoded”, where each character takes from one to four bytes in the file. UTF-8 is possibly the most popular of such “multibyte encodings”.
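The "one to four bytes" statement in the quoted passage can be checked with a short Python sketch. The sample characters below are arbitrary choices for illustration and do not come from the manual:

```python
# Sample characters chosen to need 1, 2, 3, and 4 bytes in UTF-8:
# 'A' (U+0041), 'é' (U+00E9), '€' (U+20AC), and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")

# Pure ASCII text produces the same bytes under ASCII and UTF-8,
# consistent with ASCII being a subset of Unicode.
ascii_same = "gawk".encode("ascii") == "gawk".encode("utf-8")
print(ascii_same)
```

Code points in the ASCII range stay at one byte each, so existing ASCII files are already valid UTF-8; only characters outside that range grow to two, three, or four bytes.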