
Encoding Types

Unicode

Unicode is a computing industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, the latest version of Unicode consists of

  • a repertoire of more than 107,000 characters covering 90 scripts,
  • a set of code charts for visual reference,
  • an encoding methodology and set of standard character encodings,
  • an enumeration of character properties such as upper and lower case,
  • a set of reference data computer files,
  • a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

 

Character encoding

In computer science, the terms character encoding, character set, and sometimes character map or code page were historically synonymous, as the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units — usually with a single character per code unit. The terms now have related but distinct meanings, reflecting the efforts of standards bodies to use precise terminology when writing about and unifying many different encoding systems.[1] Regardless, the terms are still used interchangeably, with character set being nearly ubiquitous.

Unicode encoding model

Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set together constitute a modern, unified character encoding. Rather than mapping characters directly to octets (bytes), they separately define what characters are available, their numbering, how those numbers are encoded as a series of "code units" (limited-size numbers), and finally how those units are encoded as a stream of octets. The idea behind this decomposition is to establish a universal set of characters that can be encoded in a variety of ways.[1] To correctly describe this model one needs more precise terms than "character set" and "character encoding". The terms used in the modern model follow:[1]

A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and to a limited extent the Windows code pages). The characters in a given repertoire reflect decisions that have been made about how to divide writing systems into linear information units. The basic variants of the Latin, Greek, and Cyrillic alphabets can be broken down into letters, digits, punctuation, and a few special characters like the space, which can all be arranged in simple linear sequences that are displayed in the same order they are read. Even with these alphabets, however, diacritics pose a complication: they can be regarded either as part of a single character containing a letter and diacritic (known in modern terminology as a precomposed character), or as separate characters. The former allows a far simpler text handling system, but the latter allows any letter/diacritic combination to be used in text. Other writing systems, such as Arabic and Hebrew, are represented with more complex character repertoires due to the need to accommodate things like bidirectional text and glyphs that are joined together in different ways for different situations.
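The two ways of treating a diacritic can be seen directly in Python. This is a small sketch, assuming the standard-library `unicodedata` module; the choice of "é" as the sample letter is only for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single precomposed character
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same when rendered but are different
# sequences of code points, so a plain comparison reports a mismatch.
same_code_points = precomposed == combining

# Normalization converts between the two representations.
decomposed = unicodedata.normalize("NFD", precomposed)
recomposed = unicodedata.normalize("NFC", combining)

print(same_code_points)
print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in recomposed])
```

Comparing text only after normalizing both sides to the same form avoids treating the two spellings as different strings.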

A coded character set specifies how to represent a repertoire of characters using a number of non-negative integer codes called code points. For example, in a given repertoire, a character representing the capital letter "A" in the Latin alphabet might be assigned to the integer 65, the character for "B" to 66, and so on. A complete set of characters and corresponding integers is a coded character set. Multiple coded character sets may share the same repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map them to different codes. In a coded character set, each code point only represents one character, i.e., a coded character set is a function.
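As a rough illustration of one repertoire mapped to different codes, the sketch below asks Python's built-in codecs for the byte assigned to a character. The codec names `latin-1`, `cp037`, and `cp500` are Python's names for ISO/IEC 8859-1 and IBM code pages 037 and 500; the helper `codes_for` is hypothetical.

```python
# Python's standard-library codec names for the three coded character sets.
CODECS = ["latin-1", "cp037", "cp500"]

def codes_for(char):
    """Return the single byte value each coded character set assigns to char."""
    return {name: char.encode(name)[0] for name in CODECS}

for ch in "A[":
    print(ch, {name: hex(code) for name, code in codes_for(ch).items()})
```

The letter "A" is 0x41 in ISO/IEC 8859-1 but 0xC1 in the two EBCDIC code pages, and the two IBM code pages disagree with each other on "[".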

A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units would only be able to directly represent integers from 0 to 65,535 in each unit, but larger integers could be represented if more than one 16-bit unit could be used. This is what a CEF accommodates: it defines a way of mapping a single code point from a range of, say, 0 to 1.4 million, to a series of one or more code values from a range of, say, 0 to 65,535.

The simplest CEF system is simply to choose large enough units that the values from the coded character set can be encoded directly (one code point to one code value). This works well for coded character sets that fit in 8 bits (as most legacy non-CJK encodings do) and reasonably well for coded character sets that fit in 16 bits (such as early versions of Unicode). However, as the size of the coded character set increases (e.g. modern Unicode requires at least 21 bits/character), this becomes less and less efficient, and it is difficult to adapt existing systems to use larger code values. Therefore, most systems working with later versions of Unicode use either UTF-8, which maps Unicode code points to variable-length sequences of octets, or UTF-16, which maps Unicode code points to variable-length sequences of 16-bit words (its predecessor UCS-2 uses exactly one 16-bit word per code point and so covers only the BMP).
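To make the notion of variable-length code unit sequences concrete, here is a minimal sketch that counts the code units Python's built-in UTF-8 and UTF-16 codecs produce for a string. The helper name `code_unit_counts` and the sample characters are illustrative choices.

```python
def code_unit_counts(text):
    """Count code points, UTF-8 code units (octets), and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_units": len(text.encode("utf-8")),
        # "utf-16-le" emits no byte order mark; each code unit is 2 bytes.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }

for sample in ["A", "\u20ac", "\U00024b62"]:
    print(f"U+{ord(sample):04X}", code_unit_counts(sample))
```

A BMP character such as U+20AC takes three UTF-8 octets but one UTF-16 word, while a character outside the BMP takes four octets or two words.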

Next, a character encoding scheme (CES) specifies how the fixed-size integer code values should be mapped into an octet sequence suitable for saving on an octet-based file system or transmitting over an octet-based network. With Unicode, a simple character encoding scheme is used in most cases, simply specifying whether the bytes for each integer should be in big-endian or little-endian order (even this isn't needed with UTF-8). However, there are also compound character encoding schemes, which use escape sequences to switch between several simple schemes (such as ISO/IEC 2022), and compressing schemes, which try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
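The byte-order choice can be observed with Python's built-in codecs. The sketch below serializes the same single 16-bit code unit under both simple UTF-16 schemes and, for comparison, under UTF-8; the euro sign is just a convenient sample.

```python
euro = "\u20ac"  # one UTF-16 code unit with the value 0x20AC

big_endian = euro.encode("utf-16-be")     # most significant byte first
little_endian = euro.encode("utf-16-le")  # least significant byte first
utf8 = euro.encode("utf-8")               # octet-based, no byte order to choose

print(big_endian.hex(), little_endian.hex(), utf8.hex())
```

The code unit is the same in both UTF-16 cases; only the order of its two octets in the stream differs.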

Finally, there may be a higher level protocol which supplies additional information that can be used to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as the same character. An example is the XML attribute xml:lang.

The Unicode model reserves the term character map for historical systems which directly assign a sequence of characters to a sequence of bytes.[1] Such systems include entities which IBM's Character Data Representation Architecture (CDRA) designates with coded character set identifiers (CCSIDs) and each of which is variously called a charset, character set, code page, or CHARMAP.[1] The term charset is also used for similar mappings by MIME and systems based on it.[1]

Popular character encodings

ISO 8859:

  • ISO 8859-1 Western Europe
  • ISO 8859-2 Western and Central Europe

Chinese Guobiao:

  • GB 2312
  • GBK (Microsoft Code page 936)
  • GB 18030

Taiwan Big5 (a more famous variant is Microsoft Code page 950)

Hong Kong HKSCS

Korean

 

Universal Character Set

The Universal Character Set (UCS), defined by the International Standard ISO/IEC 10646, Information technology — Universal multiple-octet coded character set (UCS) (plus amendments to that standard), is a standard set of characters upon which many character encodings are based.

 

Mapping of Unicode character planes

Unicode characters can be categorized in many different ways. Unicode code points can be logically divided into 17 planes, each with 65,536 (= 2^16) code points, although currently only a few planes are used:

  • Plane 0 (0000–FFFF): Basic Multilingual Plane (BMP). This is the plane containing most of the character assignments so far. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing systems in current use.
  • Plane 1 (10000–1FFFF): Supplementary Multilingual Plane (SMP).
  • Plane 2 (20000–2FFFF): Supplementary Ideographic Plane (SIP)
  • Planes 3 to 13 (30000–DFFFF) are unassigned
  • Plane 14 (E0000–EFFFF): Supplementary Special-purpose Plane (SSP)
  • Plane 15 (F0000–FFFFF) reserved for the Private Use Area (PUA)
  • Plane 16 (100000–10FFFF), reserved for the Private Use Area (PUA)
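Because each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below assumes that reading of the list above; `plane_of`, `plane_name`, and the `PLANE_NAMES` table are illustrative names, with planes 3 to 13 reported as unassigned as in the list.

```python
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Private Use Area (PUA)",
    16: "Private Use Area (PUA)",
}

def plane_of(code_point):
    """Return the plane number (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16  # each plane spans 0x10000 code points

def plane_name(code_point):
    """Return the plane's name, or 'unassigned' for planes 3 to 13."""
    return PLANE_NAMES.get(plane_of(code_point), "unassigned")

for cp in (0x20AC, 0x1D11E, 0x24B62, 0x30000, 0x10FFFD):
    print(f"U+{cp:04X}", plane_of(cp), plane_name(cp))
```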

Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively blocked out for every current and ancient writing system (script) the Unicode consortium has been able to identify: (see [1]). While Unicode may eventually need to use another of the spare 11 planes for ideographic characters, other planes remain, if previously unknown scripts with tens of thousands of characters are discovered. This 21-bit limit is therefore unlikely to be reached in the near future.

 

The first plane (plane 0), the Basic Multilingual Plane (BMP),[Chandler1]  is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.

 


Legend: Unicode 1.0, Unicode 1.1, Unicode 2.0, Unicode 2.1, Unicode 3.0, Unicode 3.1, Unicode 3.2, Unicode 4.0, Unicode 4.1, Unicode 5.0, Unicode 5.1, Unicode 5.2, Reserved, Noncharacter

Unicode characters

BMP          BMP          SMP            SIP            SIP            SSP
0000–0FFF    8000–8FFF    10000–10FFF    20000–20FFF    28000–28FFF    E0000–E0FFF
1000–1FFF    9000–9FFF    11000–11FFF    21000–21FFF    29000–29FFF
2000–2FFF    A000–AFFF    12000–12FFF    22000–22FFF    2A000–2AFFF
3000–3FFF    B000–BFFF    13000–13FFF    23000–23FFF    2B000–2BFFF
4000–4FFF    C000–CFFF                   24000–24FFF
5000–5FFF    D000–DFFF    1D000–1DFFF    25000–25FFF
6000–6FFF    E000–EFFF                   26000–26FFF
7000–7FFF    F000–FFFF    1F000–1FFFF    27000–27FFF    2F000–2FFFF

 

Note: Unicode characters visualization will depend on the character support of your web browser and the fonts installed on your system.

[Code chart for U+6000–U+60FF: columns 0 through F, rows 6000 through 60F0. Only the first cell is preserved: U+6000 怀.]

 

GB18030-2000

Introduction

GB18030-2000 is a new character set standard from the PRC that specifies an extended codepage and a mapping table to Unicode.

On March 17, 2000, the Chinese government issued regulations mandating that all operating systems on non-handheld computers sold in the PRC after January 1, 2001 would have to comply with the new multibyte GB18030-2000 standard. However, the initial implementation deadline of January 1, 2001 was later postponed until September 1, 2001.

Evolution of GB18030-2000

All character set standards that originate in the PRC have designations that begin with "GB". GB is an abbreviation for Guojia Biaozhun, meaning "national standard". The GB 2312-1980 character set standard was established in 1981 to represent simplified Chinese characters. GB 2312-1980 is a coded character set that contains 7,445 characters, including 6,763 Hanzi and 682 non-Hanzi characters. With the release of ISO 10646-1/Unicode 2.1 in 1993, the PRC expressed its fundamental consent to support the combined efforts of the ISO/IEC and the Unicode Consortium through publishing a Chinese National Standard that was code- and character-compatible with ISO 10646-1/Unicode 2.1. This standard was named GB 13000.1. Whenever the ISO and the Unicode Consortium changed or revised their common standard, GB 13000.1 subsequently adopted these changes.

To accommodate all additional Hanzi characters specified in GB 13000.1 that are not included in GB 2312-1980, a new specification known as GBK was then introduced. GBK is an abbreviation for "Guojia biaozhun kuozhan", which is the Chinese for "Rules/Specifications defining the extensions of internal codes for Chinese ideograms". GBK is an extension of GB 2312-1980 and the key significant property of GBK is that it leaves the characters and codes as defined in GB 2312-1980 untouched and positions all additional characters around it. The additional characters are mainly those of the Unified Han portion of Unicode 2.1 that go beyond the character repertoire of GB 2312-1980. Thus, code and character compatibility between GBK and GB 2312-1980 is ensured while, at the same time, the complete Unicode Unified Han character set is made available. At the time when GBK was defined, other characters were added that were not available in Unicode.

GBK defines 23,940 code points containing 21,886 characters. At the same time, GBK provides mappings to the code points of Unicode 2.1. However, due to the packed code space used to define GBK, it became obvious that there was no space left for a major addition. The 1,894 code points of GBK's three user-defined areas were not even close to providing sufficient space for the CJK Unified Ideographs Extension A, which defines 6,582 new characters in plane 0 of Unicode, version 3.0, the Basic Multilingual Plane (BMP).

Therefore, GB18030-2000 was created as an update of GBK for Unicode 3.0 with an extension that covers all of Unicode. It is fully backward-compatible with GB 2312-1980 and GBK. The mapping table from GB18030-2000 to Unicode is backward-compatible with the mapping table from GB 2312-1980 to Unicode; the GBK to Unicode table, however, has a few differences, because GBK contains characters which were not defined in Unicode 2.1 but were added in later versions of Unicode.

GB18030-2000 specifies a mapping table that covers all Unicode code points and maintains compatibility of GB-encoded text with GBK and GB 2312-1980.
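The layering of the three standards can be probed with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. This is only a sketch: those codecs are Python's implementations and may differ in small details from the tables discussed here, and the helper `encodes_as` and the three sample characters are illustrative.

```python
def encodes_as(char, codec):
    """Return the encoded bytes, or None if the codec cannot represent char."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None

samples = {
    "U+4E2D (in GB 2312)": "\u4e2d",
    "U+4E02 (added by GBK)": "\u4e02",
    "U+24B62 (outside the BMP)": "\U00024b62",
}

for label, ch in samples.items():
    row = {codec: encodes_as(ch, codec) for codec in ("gb2312", "gbk", "gb18030")}
    print(label, {codec: (b.hex() if b else None) for codec, b in row.items()})
```

A character already in GB 2312 keeps the same two bytes in all three encodings, a GBK addition keeps its two bytes in GB 18030, and a character outside the BMP is reachable only through GB 18030, as a four-byte code.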

GBK Encoding Ranges

The "code points" column gives the number of possible values in each range. The last four columns give the number of characters, i.e. the code points actually used, in each standard.

range          byte 1   byte 2            code points   GB 18030   GBK 1.0   Codepage 936   GB 2312
Level GBK/1    A1–A9    A1–FE             846           728        717       702            682
Level GBK/2    B0–F7    A1–FE             6,768         6,763      6,763     6,763          6,763
Level GBK/3    81–A0    40–FE except 7F   6,080         6,080      6,080     6,080
Level GBK/4    AA–FE    40–A0 except 7F   8,160         8,160      8,160     8,080
Level GBK/5    A8–A9    40–A0 except 7F   192           166        166       166
user-defined   AA–AF    A1–FE             564
user-defined   F8–FE    A1–FE             658
user-defined   A1–A7    40–A0 except 7F   672
total:                                    23,940        21,897     21,886    21,791         7,445
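The byte ranges in the table above can be checked mechanically. The sketch below encodes each range as lists of allowed first and second bytes, classifies a two-byte code, and recomputes the code point counts. The names `span`, `REGIONS`, and `region_of`, and the numbering of the three user-defined areas, are illustrative and not part of any standard.

```python
def span(lo, hi, skip=()):
    """Byte values lo..hi inclusive, minus any values in skip."""
    return [b for b in range(lo, hi + 1) if b not in skip]

# (name, allowed first bytes, allowed second bytes), taken from the table.
REGIONS = [
    ("GBK/1", span(0xA1, 0xA9), span(0xA1, 0xFE)),
    ("GBK/2", span(0xB0, 0xF7), span(0xA1, 0xFE)),
    ("GBK/3", span(0x81, 0xA0), span(0x40, 0xFE, skip=(0x7F,))),
    ("GBK/4", span(0xAA, 0xFE), span(0x40, 0xA0, skip=(0x7F,))),
    ("GBK/5", span(0xA8, 0xA9), span(0x40, 0xA0, skip=(0x7F,))),
    ("user-defined 1", span(0xAA, 0xAF), span(0xA1, 0xFE)),
    ("user-defined 2", span(0xF8, 0xFE), span(0xA1, 0xFE)),
    ("user-defined 3", span(0xA1, 0xA7), span(0x40, 0xA0, skip=(0x7F,))),
]

def region_of(byte1, byte2):
    """Return the region name for a two-byte code, or None if it is invalid."""
    for name, firsts, seconds in REGIONS:
        if byte1 in firsts and byte2 in seconds:
            return name
    return None

# Number of possible codes per region: first-byte count times second-byte count.
sizes = {name: len(firsts) * len(seconds) for name, firsts, seconds in REGIONS}
total = sum(sizes.values())

print(sizes)
print(total)
print(region_of(0x81, 0x40), region_of(0x81, 0x39))
```

The computed sizes reproduce the "code points" column, and the three user-defined areas add up to the 1,894 code points mentioned earlier.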

 

In graphical form, the following figure shows the space of all 64K possible 2-byte codes. Green and yellow areas are assigned GBK codepoints, red are for user-defined characters. The uncolored areas are invalid byte combinations.

 

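The table and figure above show clearly which two-byte value ranges are covered by GB 18030, GBK 1.0, Microsoft Codepage 936, and GB 2312. The same can be seen in the "Mapping Tables for Character Sets" below: first the 128 single-byte GBK values, then the two-byte case, using first byte 0x81 as the example. Note that the second byte starts at 0x40, so the codes from 0x8100 to 0x813F have no valid two-byte values and are left empty. This corresponds to the statement above that "The uncolored areas are invalid byte combinations."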

Mapping Tables for Character Sets - GB2312

How to read this chart: each cell lists the symbol, its UTF-8 bytes (hex), and its UTF-16 code unit (hex).

main

For bytes 00–7F the UTF-8 encoding is the byte itself and the UTF-16 code unit is the byte prefixed with 00 (for example, "A" is 41 in UTF-8 and 0041 in UTF-16), so only the symbols are listed.

      00   01   02   03   04   05   06   07   08   09   0A   0B   0C   0D   0E   0F
00    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
10    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
20    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
30    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
40    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
50    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
60    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
70    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
80    €    (UTF-8 E282AC, UTF-16 20AC)

Bytes 81–FE have no single-byte mapping. Each is the first byte of a two-byte code and has its own table; the table for first byte 81 follows.

 


main | 81

Two-byte codes with first byte 81. Each line covers eight values of the second byte; each cell gives the UTF-16 code unit followed by the UTF-8 bytes in parentheses (hex). Glyphs are omitted. Second bytes 00–3F, 7F, and FF are unused.

40–47:  4E02 (E4B882)  4E04 (E4B884)  4E05 (E4B885)  4E06 (E4B886)  4E0F (E4B88F)  4E12 (E4B892)  4E17 (E4B897)  4E1F (E4B89F)
48–4F:  4E20 (E4B8A0)  4E21 (E4B8A1)  4E23 (E4B8A3)  4E26 (E4B8A6)  4E29 (E4B8A9)  4E2E (E4B8AE)  4E2F (E4B8AF)  4E31 (E4B8B1)
50–57:  4E33 (E4B8B3)  4E35 (E4B8B5)  4E37 (E4B8B7)  4E3C (E4B8BC)  4E40 (E4B980)  4E41 (E4B981)  4E42 (E4B982)  4E44 (E4B984)
58–5F:  4E46 (E4B986)  4E4A (E4B98A)  4E51 (E4B991)  4E55 (E4B995)  4E57 (E4B997)  4E5A (E4B99A)  4E5B (E4B99B)  4E62 (E4B9A2)
60–67:  4E63 (E4B9A3)  4E64 (E4B9A4)  4E65 (E4B9A5)  4E67 (E4B9A7)  4E68 (E4B9A8)  4E6A (E4B9AA)  4E6B (E4B9AB)  4E6C (E4B9AC)
68–6F:  4E6D (E4B9AD)  4E6E (E4B9AE)  4E6F (E4B9AF)  4E72 (E4B9B2)  4E74 (E4B9B4)  4E75 (E4B9B5)  4E76 (E4B9B6)  4E77 (E4B9B7)
70–77:  4E78 (E4B9B8)  4E79 (E4B9B9)  4E7A (E4B9BA)  4E7B (E4B9BB)  4E7C (E4B9BC)  4E7D (E4B9BD)  4E7F (E4B9BF)  4E80 (E4BA80)
78–7E:  4E81 (E4BA81)  4E82 (E4BA82)  4E83 (E4BA83)  4E84 (E4BA84)  4E85 (E4BA85)  4E87 (E4BA87)  4E8A (E4BA8A)
80–87:  4E90 (E4BA90)  4E96 (E4BA96)  4E97 (E4BA97)  4E99 (E4BA99)  4E9C (E4BA9C)  4E9D (E4BA9D)  4E9E (E4BA9E)  4EA3 (E4BAA3)
88–8F:  4EAA (E4BAAA)  4EAF (E4BAAF)  4EB0 (E4BAB0)  4EB1 (E4BAB1)  4EB4 (E4BAB4)  4EB6 (E4BAB6)  4EB7 (E4BAB7)  4EB8 (E4BAB8)
90–97:  4EB9 (E4BAB9)  4EBC (E4BABC)  4EBD (E4BABD)  4EBE (E4BABE)  4EC8 (E4BB88)  4ECC (E4BB8C)  4ECF (E4BB8F)  4ED0 (E4BB90)
98–9F:  4ED2 (E4BB92)  4EDA (E4BB9A)  4EDB (E4BB9B)  4EDC (E4BB9C)  4EE0 (E4BBA0)  4EE2 (E4BBA2)  4EE6 (E4BBA6)  4EE7 (E4BBA7)
A0–A7:  4EE9 (E4BBA9)  4EED (E4BBAD)  4EEE (E4BBAE)  4EEF (E4BBAF)  4EF1 (E4BBB1)  4EF4 (E4BBB4)  4EF8 (E4BBB8)  4EF9 (E4BBB9)
A8–AF:  4EFA (E4BBBA)  4EFC (E4BBBC)  4EFE (E4BBBE)  4F00 (E4BC80)  4F02 (E4BC82)  4F03 (E4BC83)  4F04 (E4BC84)  4F05 (E4BC85)
B0–B7:  4F06 (E4BC86)  4F07 (E4BC87)  4F08 (E4BC88)  4F0B (E4BC8B)  4F0C (E4BC8C)  4F12 (E4BC92)  4F13 (E4BC93)  4F14 (E4BC94)
B8–BF:  4F15 (E4BC95)  4F16 (E4BC96)  4F1C (E4BC9C)  4F1D (E4BC9D)  4F21 (E4BCA1)  4F23 (E4BCA3)  4F28 (E4BCA8)  4F29 (E4BCA9)
C0–C7:  4F2C (E4BCAC)  4F2D (E4BCAD)  4F2E (E4BCAE)  4F31 (E4BCB1)  4F33 (E4BCB3)  4F35 (E4BCB5)  4F37 (E4BCB7)  4F39 (E4BCB9)
C8–CF:  4F3B (E4BCBB)  4F3E (E4BCBE)  4F3F (E4BCBF)  4F40 (E4BD80)  4F41 (E4BD81)  4F42 (E4BD82)  4F44 (E4BD84)  4F45 (E4BD85)
D0–D7:  4F47 (E4BD87)  4F48 (E4BD88)  4F49 (E4BD89)  4F4A (E4BD8A)  4F4B (E4BD8B)  4F4C (E4BD8C)  4F52 (E4BD92)  4F54 (E4BD94)
D8–DF:  4F56 (E4BD96)  4F61 (E4BDA1)  4F62 (E4BDA2)  4F66 (E4BDA6)  4F68 (E4BDA8)  4F6A (E4BDAA)  4F6B (E4BDAB)  4F6D (E4BDAD)
E0–E7:  4F6E (E4BDAE)  4F71 (E4BDB1)  4F72 (E4BDB2)  4F75 (E4BDB5)  4F77 (E4BDB7)  4F78 (E4BDB8)  4F79 (E4BDB9)  4F7A (E4BDBA)
E8–EF:  4F7D (E4BDBD)  4F80 (E4BE80)  4F81 (E4BE81)  4F82 (E4BE82)  4F85 (E4BE85)  4F86 (E4BE86)  4F87 (E4BE87)  4F8A (E4BE8A)
F0–F7:  4F8C (E4BE8C)  4F8E (E4BE8E)  4F90 (E4BE90)  4F92 (E4BE92)  4F93 (E4BE93)  4F95 (E4BE95)  4F96 (E4BE96)  4F98 (E4BE98)
F8–FE:  4F99 (E4BE99)  4F9A (E4BE9A)  4F9C (E4BE9C)  4F9E (E4BE9E)  4F9F (E4BE9F)  4FA1 (E4BEA1)  4FA2 (E4BEA2)

 

 

UTF-8

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard.

Unicode             Byte1      Byte2      Byte3      Byte4      example
U+0000–U+007F       0xxxxxxx                                    '$' U+0024 → 00100100 → 0x24
U+0080–U+07FF       110yyyxx   10xxxxxx                         '¢' U+00A2 → 11000010,10100010 → 0xC2,0xA2
U+0800–U+FFFF       1110yyyy   10yyyyxx   10xxxxxx              '€' U+20AC → 11100010,10000010,10101100 → 0xE2,0x82,0xAC
U+10000–U+10FFFF    11110zzz   10zzyyyy   10yyyyxx   10xxxxxx   '𤭢' U+024B62 → 11110000,10100100,10101101,10100010 → 0xF0,0xA4,0xAD,0xA2
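The bit layouts above translate directly into shifts and masks. The following is a minimal sketch of an encoder for a single code point, written for illustration rather than for use in place of Python's built-in `str.encode`; the function name `utf8_encode` is an assumption, and it rejects the surrogate range just as the built-in codec does.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8, following the table's bit layouts."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"cannot encode {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x24, 0xA2, 0x20AC, 0x24B62):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```

The four sample code points are the ones in the example column, so the output can be compared against the table and against the built-in codec side by side.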

 

 

 

 

UTF-16/UCS-2

In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire.

UCS-2 (2-byte Universal Character Set): the UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF.

Encoding of characters outside the BMP

The improvement that UTF-16 made over UCS-2 is its ability to encode characters in planes 1–16, not just those in plane 0 (BMP). This was done by taking an unassigned portion of the 16-bit UCS-2 space, the range U+D800–U+DFFF, and reserving it for surrogates.

Note: in the BMP map above (Mapping of Unicode character planes), rows 0xD8 through 0xF8 are labeled "UTF-16 surrogates and private use". The first eight of those regions (0xD8 through 0xDF) are the ones UTF-16 uses to represent code points outside the BMP.

 

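Rows are the first surrogate, columns are the second surrogate, and each cell is the resulting code point:

         DC00      DC01      …    DFFF
D800     010000    010001    …    0103FF
D801     010400    010401    …    0107FF
⋮        ⋮         ⋮              ⋮
DBFF     10FC00    10FC01    …    10FFFF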

UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First, 0x10000 is subtracted from the code point to give a 20-bit value. This is then split into two separate 10-bit values, each of which is represented as a surrogate, with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00–0xDFFF for the second, least significant surrogate.

For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFD, the highest code point that is not a noncharacter, becomes the sequence 0xDBFF 0xDFFD.[Chandler2] Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair does not ever represent a character.
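The subtract-and-split procedure can be written out in a few lines. This is a sketch for illustration; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, and Python's built-in `utf-16-be` codec is used only to cross-check the result.

```python
def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into (first, second) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} is not outside the BMP")
    offset = cp - 0x10000                # 20-bit value
    first = 0xD800 + (offset >> 10)      # most significant 10 bits
    second = 0xDC00 + (offset & 0x3FF)   # least significant 10 bits
    return first, second

def from_surrogate_pair(first, second):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)

for cp in (0x10000, 0x24B62, 0x10FFFD):
    first, second = to_surrogate_pair(cp)
    # Cross-check against the code units produced by the built-in codec.
    builtin = chr(cp).encode("utf-16-be").hex()
    print(f"U+{cp:04X} -> {first:04X} {second:04X} (built-in: {builtin})")
```

Because the two surrogate ranges do not overlap, a decoder can tell from any single 16-bit word whether it is a first surrogate, a second surrogate, or an ordinary BMP character.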


 [Chandler1]

The following site provides a lookup for the Unicode BMP:

http://www.atm.ox.ac.uk/user/iwi/charmap.html

 [Chandler2] U+10FFFD minus 0x10000 gives 0xFFFFD. Split into two 10-bit halves, the first is 0x3FF and the second is 0x3FD, so:

1st surrogate: 0xD800 + 0x3FF = 0xDBFF

2nd surrogate: 0xDC00 + 0x3FD = 0xDFFD