National Computer Applications Basics: Chinese Characters and Character Encoding

(1) Information units in computers

The units used to represent information in a computer include the bit, byte, word, and word length. They are the basic concepts used to express the amount of information.

① Bit: The smallest unit of data storage in a computer is one binary digit, called a bit (short for "binary digit"). It is denoted by the lowercase letter b.

② Byte: A group of eight binary bits is called a byte, denoted by the capital letter B. It is the basic unit of computer storage. The eight bits of a byte are numbered from left to right as b7, b6, b5, b4, b3, b2, b1, b0. In computers, storage capacity is usually expressed as a number of bytes, in units of KB, MB, GB, and TB. The conversion relationships between them are as follows:

1 KB = 2^10 B = 1024 B

1 MB = 2^10 KB = 1024 KB

1 GB = 2^10 MB = 1024 MB

1 TB = 2^10 GB = 1024 GB
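The conversions above can be checked with a short Python sketch. The names `UNITS`, `to_bytes`, and `bit_weights` are illustrative assumptions, not part of any standard library.

```python
# Binary capacity units: each step up is a factor of 2**10 = 1024.
UNITS = ["B", "KB", "MB", "GB", "TB"]

def to_bytes(value, unit):
    """Convert a capacity given in B/KB/MB/GB/TB to a number of bytes."""
    return value * 1024 ** UNITS.index(unit)

def bit_weights():
    """Weights of bits b7..b0 of one byte, listed from left to right."""
    return [2 ** i for i in range(7, -1, -1)]

print(to_bytes(1, "KB"))   # 1024
print(to_bytes(1, "MB"))   # 1048576
print(bit_weights())       # [128, 64, 32, 16, 8, 4, 2, 1]
```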

③ Word: A group of binary digits that the computer stores, transmits, or operates on as a single unit is called a computer word, or word for short.

④ Word length: The number of digits contained in each word is called the word length. Since word length is the number of binary digits that a computer can process at one time, it is related to the rate at which the computer processes data and is an important factor in measuring computer performance.

(2) Character encoding.

① ASCII code.

Computers can only work with binary numbers, so the digits, letters, and symbols in a computer must also be encoded in binary. There are many encoding methods; the one commonly used in microcomputers is ASCII (American Standard Code for Information Interchange). ASCII has been adopted by the International Organization for Standardization (ISO) as an international standard, ISO 646. ASCII comes in a 7-bit version and an 8-bit version, and the internationally common one is the 7-bit version. It contains 10 Arabic numerals, 52 uppercase and lowercase English letters, 32 punctuation marks and operators, and 34 control codes (counting the space character among them), for a total of 128 characters, so each character can be represented by a 7-bit binary number. The 7-bit ASCII characters are shown in the figure below:

To find the ASCII code of a digit, letter, symbol, or control character, first locate it in the table and then read off the corresponding decimal or binary value. For example, the ASCII code of the lowercase letter "a" is 97 in decimal and 1100001B in binary (the suffix B marks a binary number); in hexadecimal it is 61H (the suffix H marks a hexadecimal number). As the table shows, the ASCII codes of the digits 0 to 9 are 30H to 39H, those of the uppercase letters A to Z are 41H to 5AH, and those of the lowercase letters a to z are 61H to 7AH. Characters are compared by comparing their ASCII code values.
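The values quoted above can be reproduced with Python's built-in `ord` and `format`. This is a minimal sketch; the helper name `ascii_forms` is an assumption made for illustration.

```python
def ascii_forms(ch):
    """Return the decimal, 7-bit binary, and hexadecimal forms of an ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return code, format(code, "07b") + "B", format(code, "02X") + "H"

print(ascii_forms("a"))   # (97, '1100001B', '61H')
print(ascii_forms("0"))   # (48, '0110000B', '30H')
print(ascii_forms("Z"))   # (90, '1011010B', '5AH')

# Characters compare by code value: digits < uppercase letters < lowercase letters.
print("9" < "A" < "a")    # True
```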

In the table, NUL, BEL, BS, LF, FF, CR, DEL, and so on are control characters: NUL is the null character, BEL is the bell (alert) character, BS is the backspace character, LF is the line feed character, FF is the form feed character, CR is the carriage return character, and DEL is the delete character. SP is the space character.

② BCD code.

When a computer processes numbers, it has to convert between binary and decimal, which requires a way to encode decimal numbers in binary. BCD (Binary-Coded Decimal) is such a code: each decimal digit is represented by its own group of binary digits.

The most commonly used BCD code is the 8421BCD code.

It uses a group of 4 binary digits to represent one decimal digit. The bit weights of the 4 binary digits from left to right are 8, 4, 2, 1. Four bits can form 16 states; only the first 10 states, 0000 to 1001, are used to encode the digits 0 to 9, and the remaining 6 states are unused. To encode a multi-digit decimal number, use as many 4-bit groups as the number has decimal digits and encode each digit separately, in order. Table 1-4 shows the correspondence between 8421BCD codes and decimal numbers.

Table 1-4 Correspondence between BCD codes and decimal numbers
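The 8421BCD scheme described above can be sketched in a few lines of Python. The function names `to_bcd` and `from_bcd` are illustrative assumptions; the decoder rejects the six unused states 1010 to 1111.

```python
def to_bcd(number):
    """Encode a non-negative decimal integer as 8421 BCD, one 4-bit group per digit."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    return " ".join(format(int(digit), "04b") for digit in str(number))

def from_bcd(groups):
    """Decode space-separated 4-bit groups back to a decimal integer."""
    digits = []
    for group in groups.split():
        value = int(group, 2)
        if len(group) != 4 or value > 9:
            raise ValueError("invalid 8421 BCD group: " + group)
        digits.append(str(value))
    return int("".join(digits))

# Digit-to-code correspondence for 0..9.
for d in range(10):
    print(d, to_bcd(d))

print(to_bcd(1980))                     # 0001 1001 1000 0000
print(from_bcd("0001 1001 1000 0000"))  # 1980
```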

③ Unicode encoding

ASCII provides 128 characters and extended ASCII provides 256, but that is not enough to represent the scripts of countries around the world. More characters and symbols needed to be represented, so Unicode was developed.

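Unicode was originally designed as a 16-bit encoding that can represent more than 65,000 characters or symbols. By the count of about 34,000 letters and symbols then in use across the world's languages, 16 bits was enough for any language. The Unicode code space has since been extended beyond 16 bits, to the range U+0000 to U+10FFFF (1,114,112 code points). Unicode is compatible with ASCII: the first 128 code points are the same as 7-bit ASCII, and the first 256 are the same as ISO 8859-1 (Latin-1), one common 8-bit extension of ASCII.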
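The relationship between Unicode and ASCII can be checked directly in Python, whose strings are sequences of Unicode code points. This is a sketch only; the helper `utf16_units` is an assumed name. Note that present-day Unicode extends beyond 16 bits, so a character above U+FFFF needs two 16-bit units in UTF-16.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed for one character in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2

# The first 128 code points coincide with 7-bit ASCII.
ascii_matches = all(chr(i).encode("ascii") == bytes([i]) for i in range(128))
# The first 256 code points coincide with ISO 8859-1 (Latin-1).
latin1_matches = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))

print(ascii_matches, latin1_matches)   # True True
print(hex(ord("a")), hex(ord("中")))    # 0x61 0x4e2d
print(utf16_units("中"))                # 1
print(utf16_units("\U00020000"))       # 2 (a CJK character outside the 16-bit range)
```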

(3) Encoding of Chinese characters

Chinese characters form a logographic script with a very large number of characters (modern Chinese has six to seven thousand commonly used characters, and the total number of characters exceeds 50,000) and complex glyphs. Each Chinese character has three elements: sound, shape, and meaning, and there are many homophones and variant forms. All of this makes Chinese characters difficult for computers to process. To process Chinese characters in a computer, the following problems must be solved: first, input, that is, how to enter structurally complex square characters into the computer, which is the key to Chinese character processing; second, how to represent and store Chinese characters inside the computer and remain compatible with Western text; finally, how to output the results of processing from the computer. For this reason, Chinese characters must be encoded.

Corresponding to the three main stages of input, internal processing, and output, the encodings of a Chinese character include the input code, exchange code, internal code, and glyph code. In a Chinese character information processing system, the following code conversions take place: input code → exchange code → internal code → glyph code. The above briefly describes the basic idea and process by which computers handle Chinese characters. The four encodings are introduced in detail below.

① Input code.

To enter Chinese characters with the existing standard Western keyboard, an input encoding must be designed for them. The input code is also called the external code. At present, six to seven hundred Chinese character input schemes have been the subject of patent applications, and new input methods keep appearing, so much so that the situation has been described as "ten thousand codes galloping". According to their design ideas, these input codes fall into four categories: numeric codes, pinyin codes, glyph codes, and phonetic codes. Among them, pinyin codes and glyph codes are currently the most widely used.

a. Numeric code: A numeric code assigns each Chinese character a fixed-length string of digits and uses that number as the character's input code; the location code, telephone XX, etc. are numeric codes. The rules of such codes are simple and they convert easily to the internal code of Chinese characters, but they are hard to memorize and are suitable only for certain specialized departments.

b. Pinyin code: A pinyin code is an input code based on the pronunciation of Chinese characters. It is simple to use, quick to learn, and easy to popularize. Its disadvantage is a high rate of duplicate codes (because Chinese has many homophones), so the user often has to pick the intended character from candidates on screen, which slows input. Because input follows Hanyu Pinyin, the user must know the standard pronunciation; dialect pronunciations cannot be used. Pinyin codes are particularly suitable for non-professional users who do not need high input speed.
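The duplicate-code problem can be illustrated with a toy lookup. The table `CANDIDATES` below is a hypothetical, hand-picked sample, and `lookup` and `choose` are assumed names; a real input method uses a full dictionary, usually ordered by frequency.

```python
# Hypothetical sample table: one pinyin syllable maps to several homophones.
CANDIDATES = {
    "ma": ["妈", "麻", "马", "骂"],
    "shi": ["是", "时", "十", "事"],
    "wo": ["我"],
}

def lookup(pinyin):
    """Return the candidate characters for a pinyin syllable (empty if unknown)."""
    return CANDIDATES.get(pinyin, [])

def choose(pinyin, index):
    """Pick the candidate at a 1-based index, as a user would on screen."""
    candidates = lookup(pinyin)
    if not 1 <= index <= len(candidates):
        raise IndexError("no such candidate")
    return candidates[index - 1]

print(lookup("ma"))      # several candidates, so the user must select one
print(choose("ma", 3))   # 马
print(lookup("wo"))      # a single candidate needs no selection
```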

c. Glyph code: A glyph code is an input code based on the glyph structure of Chinese characters.

The Wubi code (Wang code), widely used on microcomputers, is a typical glyph code. Its main feature is fast input. The current record is 293 Chinese characters per minute (held by a female soldier XXXX), a speed that approaches the limit of how fast the human eye can scan text. However, this input method takes a long time to learn at the start, because the user must memorize character roots and practice splitting characters into them. In addition, a very small number of Chinese characters are hard to split, and the codes assigned to them do not match the usual way the characters are written.

d. Phonetic code: A phonetic code, or phonetic-glyph code, is an input code that takes both the pronunciation and the glyph of a Chinese character into account. The most commonly used phonetic-glyph code at present is the Natural Code.

② Exchange code.

The exchange code is used to convert between the external code and the internal code of Chinese characters. The "Chinese Coded Character Set for Information Exchange, Basic Set" (standard number GB2312-1980), promulgated by China in 1981, is the national standard for exchange codes, so the exchange code is also called the national standard code. The national standard code is a two-byte code, that is, each Chinese character is encoded in two bytes, and the highest bit of each byte is "0". GB2312-1980 includes 6763 commonly used Chinese characters (3755 first-level characters, arranged in pinyin order, and 3008 second-level characters, arranged in radical order) and 682 other letters and graphic symbols (such as serial numbers, numerals, Roman numerals, English letters, Japanese kana, Russian letters, and Chinese phonetic symbols), 7445 characters in total. These 7445 characters are arranged in a table of 94 rows × 94 columns, which forms the GB2312-1980 character set encoding table. Each character in the table corresponds to a unique row number (called the area number) and column number (called the position number), which together form its location code. The national standard code of a character is derived from its location code by adding 20H to the area number and to the position number, and the two results are stored in two bytes. Due to space limitations, this book does not list the GB2312-1980 character encoding table; readers can refer to relevant books.

③ Internal code.

The internal code is the basic representation of Chinese characters inside the computer. It is the encoding the computer uses to recognize, store, process, and transmit Chinese characters. The internal code is also a double-byte code: setting the highest bit of each of the two bytes of the national standard code to "1" gives the internal code of the character. A computer information processing system distinguishes Chinese characters from ASCII characters by whether the highest bit of a byte is "1" or "0".
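The chain from location code to national standard code to internal code can be sketched as follows. In GB2312, the national standard code is obtained by adding 20H to the area number and to the position number, and the internal code by setting the highest bit of each byte. The function names are assumptions for illustration; the result is checked against Python's built-in `gb2312` codec.

```python
def location_to_national(area, position):
    """Location code (area 1-94, position 1-94) -> two national standard code bytes."""
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must be in 1..94")
    return area + 0x20, position + 0x20

def national_to_internal(high, low):
    """Set the highest bit of each byte to 1 to get the internal code."""
    return high | 0x80, low | 0x80

def is_chinese_byte(byte):
    """A byte with its highest bit set belongs to a Chinese character, not ASCII."""
    return (byte & 0x80) != 0

# Location code 1601 (area 16, position 01) is the first first-level character.
national = location_to_national(16, 1)
internal = national_to_internal(*national)
print([format(b, "02X") for b in national])   # ['30', '21']
print([format(b, "02X") for b in internal])   # ['B0', 'A1']
print(bytes(internal).decode("gb2312"))       # 啊
print(is_chinese_byte(internal[0]), is_chinese_byte(ord("a")))  # True False
```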

④ Glyph code.

Glyph code is a code that represents the glyph information (structure, shape, strokes, etc.) of Chinese characters. It is used to output Chinese characters from the computer (display, printing). Since Chinese characters are square characters, the most common representation of glyph codes is the dot matrix, such as 16×16, 24×24, and 48×48 dot matrices. For example, a 16×16 dot matrix uses 256 points (16×16=256) to represent the glyph of one Chinese character. Each point has two states, "on" or "off", represented by the binary digit "1" or "0". Storing a 16×16 dot matrix of a Chinese character therefore requires 256 binary bits, that is, 32 bytes (256 bits / 8 bits per byte). The dot matrix size can be chosen according to the needs of the output. The more points in the dot matrix, the more accurate and attractive the output characters. Dot-matrix glyphs take up a lot of storage space, so they are usually stored in the machine's external memory as a font library, which is consulted when needed to output the glyph of a given character.
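The storage arithmetic above generalizes to other dot-matrix sizes. The sketch below uses assumed helper names (`glyph_bytes`, `render_row`) and an uncompressed font library holding the 6763 Chinese characters of GB2312-1980 as an example; the row value passed to `render_row` is made up for illustration, not taken from a real font.

```python
def glyph_bytes(size):
    """Bytes needed to store one size x size dot-matrix glyph (1 bit per point)."""
    return size * size // 8

def render_row(row_bits, width):
    """Show one row of a dot matrix: '#' for an 'on' point, '.' for an 'off' point."""
    return format(row_bits, "0{}b".format(width)).replace("1", "#").replace("0", ".")

for size in (16, 24, 48):
    print(size, glyph_bytes(size))    # 16 32, then 24 72, then 48 288

# Size of an uncompressed 16x16 font library for 6763 characters.
print(6763 * glyph_bytes(16))         # 216416

print(render_row(0b1000000000000001, 16))   # #..............#
```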