Character encoding

Character encoding

Character encoding is used to represent a repertoire of characters by some kind of encoding system.^[1] Depending on the abstraction level and context, corresponding code pointsand the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data. "Character set", "character map", "codeset" and "code page" are related, but not identical, terms. Early character codes associated with the optical or electrical telegraph could only represent a subset of the characters used in written languages, sometimes restricted to upper case letters, numerals and some punctuation only. The low cost of digital representation of data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages

Examples of character codes

The basics of ASCII

The name ASCII, originally an abbreviation for "American Standard Code for Information Interchange", denotes an old character repertoire, code, and encoding. ASCII has been used and is used so widely that often the word ASCII refers to "text" or "plain text" in general, even if the character code is something else! The words "ASCII file" quite often mean any text file as opposite to a binary file.

The definition of ASCII also specifies a set of control codes ("control characters") such as linefeed (LF) and escape (ESC). But the character repertoire proper, consisting of the printable characters of ASCII, is the following (where the first item is the blank, or space, character):

! " # $ % & ' ( ) * + , - . /

0 1 2 3 4 5 6 7 8 9 : ; < = > ?

@ A B C D E F G H I J K L M N O

P Q R S T U V W X Y Z [ \ ] ^ _

` a b c d e f g h i j k l m n o

p q r s t u v w x y z { | } ~

A formal view on ASCII

The character code defined by the ASCII standard is the following: code values are assigned to characters consecutively in the order in which the characters are listed above (rowwise), starting from 32 (assigned to the blank) and ending up with 126 (assigned to the tilde character ~). Positions 0 through 31 and 127 are reserved for control codes. They have standardized names and descriptions, but in fact their usage varies a lot. The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value.

National variants of ASCII

There are several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986. The phrase "original ASCII" is perhaps not quite adequate, since the creation of ASCII started in late 1950s, and several additions and modifications were made in the 1960s. The 1963 version had several unassigned code positions. The ANSI standard, where those positions were assigned, mainly to accommodate lower case letters, was approved in 1967/1968, later modified slightly. For the early history, including pre-ASCII character codes, see Steven J. Searle's A Brief History of Character Codes in North America, Europe, and East Asia and Tom Jennings'ASCII: American Standard Code for Information Infiltration

. See also Jim Price's ASCII Chart, Mary Brandel's1963: ASCII Debuts, and the computer history documents, including the background and creation of ASCII, written by Bob Bemer, "father of ASCII". The international standard ISO 646 defines a character set similar to US-ASCII but with code positions corresponding to US-ASCII characters @[\]{|} as "national use positions". It also gives some liberties with characters #$^`~. The standard also defines "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII. Ecma International has issued the ECMA-6 standard, which is equivalent to ISO 646 and is freely available on the Web.

Within the framework of ISO 646, and partly otherwise too, several "national variants of ASCII" have been defined, assigning different letters and symbols to the "national use" positions. Thus, the characters that appear in those positions - including those in US-ASCII - are somewhat "unsafe" in international data transfer, although this problem is losing significance. The trend is towards using the corresponding codes strictly for US-ASCII meanings; national characters are handled otherwise, giving them their own, unique and universal code positions in character codes larger than ASCII. But old software and devices may still reflect various "national variants of ASCII". The following table lists ASCII characters which might be replaced by other characters in national variants of ASCII. (That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use.) The lists of characters appearing in national variants are not intended to be exhaustive, just typical examples.

dec	oct	hex	glyph	official Unicode name	National variants

35	43	23	#	number sign	£ Ù
36	44	24	$	dollar sign	¤
64	100	40	@	commercial at	É § Ä à ³
91	133	5B	[	left square bracket	Ä Æ ° â ¡ ÿ é
92	134	5C	\	reverse solidus	Ö Ø ç Ñ ½ ¥
93	135	5D	]	right square bracket	Å Ü § ê é ¿ \|
94	136	5E	^	circumflex accent	Ü î
95	137	5F	_	low line	è
96	140	60	`	grave accent	é ä µ ô ù
123	173	7B	{	left curly bracket	ä æ é à ° ¨
124	174	7C	\|	vertical line	ö ø ù ò ñ f
125	175	7D	}	right curly bracket	å ü è ç ¼
126	176	7E	~	tilde	ü ¯ ß ¨ û ì ´ _

Almost all of the characters used in the national variants have been incorporated into ISO Latin 1. Systems that support ISO Latin 1 in principle may still reflect the use of national variants of ASCII in some details; for example, an ASCII character might get printed or displayed according to some national variant. Thus, even "plain ASCII text" is thereby not always portable from one system or application to another.

More information about national variants and their impact:

· Johan van Wingen: International standardization of 7-bit codes, ISO 646; contains a comparison table of national variants

· Digression on national 7-bit codes by Alan J. Flavell

· The ISO 646 page by Roman Czyborra

· Character tables by Koichi Yasuoka.

Subsets of ASCII for safety

Mainly due to the "national variants" discussed above, some characters are less "safe" than others, i.e. more often transferred or interpreted incorrectly.

In addition to the letters of the English alphabet ("A" to "Z", and "a" to "z"), the digits ("0" to "9") and the space (" "), only the following characters can be regarded as really "safe" in data transmission:

! " % & ' ( ) * + , - . / : ; < = > ?

Even these characters might eventually be interpreted wrongly by the recipient, e.g. by a human reader seeing a glyph for "&" as something else than what it is intended to denote, or by a program interpreting "<" as starting some special markup, "?" as being a so-called wildcard character, etc. When you need to name things (e.g. files, variables, data fields, etc.), it is often best to use only the characters listed above, even if a wider character repertoire is possible. Naturally you need to take into account any additional restrictions imposed by the applicable syntax. For example, the rules of a programming language might restrict the character repertoire in identifier names to letters, digits and one or two other characters.

The misnomer "8-bit ASCII"

Sometimes the phrase "8-bit ASCII" is used. It follows from the discussion above that in realityASCII is strictly and unambiguously a 7-bit code in the sense that all code positions are in the range 0 - 127.

It is a misnomer used to refer to various character codes which are extensions of ASCII in the following sense: the character repertoire contains ASCII as a subset, the code numbers are in the range 0 - 255, and the code numbers of ASCII characters equal their ASCII codes.

Another example: ISO Latin 1 alias ISO 8859-1

The ISO 8859-1 standard (which is part of the ISO 8859 family of standards) defines a character repertoire identified as "Latin alphabet No. 1", commonly called "ISO Latin 1", as well as acharacter code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one octet.

In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 - 255, and they are:

° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿

À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï

Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß

à á â ã ä å æ ç è é ê ë ì í î ï

ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ