Why You Should Not Use the char Type in Java


Background


I recently ran into a problem on a project, and only after repeated testing did I discover that the cause lay in the database. Because we used Hibernate as our ORM framework, the EntityBean classes written in Java were mapped directly onto the Oracle database, and that mapping introduced quite a few problems. After consulting a good deal of material I finally solved the issue, and along the way gained a deeper understanding of Oracle's data types. Below is my translation of the relevant text (the original is in English).


Translation


To understand the char type, you first have to understand the Unicode encoding scheme. Unicode was invented to overcome the limitations of traditional character encoding schemes. Before Unicode, there were many different standards: ASCII in the United States, ISO 8859-1 for Western European languages, KOI-8 for Russian, GB18030 and BIG-5 for Chinese, and so on. This caused two problems. First, a particular code value corresponds to different letters in the various encoding schemes. Second, the encodings for languages with large character sets have variable length: some common characters are encoded as single bytes, while others require two or more bytes.

Unicode was designed to solve these problems. When the unification effort started in the 1980s, a fixed 2-byte code was more than wide enough to encode all the characters used in all the world's languages, with room to spare for future expansion (or so everyone thought at the time). In 1991, Unicode 1.0 was released, using slightly less than half of the available 65,536 code values. While other programming languages were still using 8-bit characters, Java was designed from the ground up to use 16-bit Unicode characters, which was a major advance.

Unfortunately, over time, the inevitable happened. Unicode grew beyond 65,536 characters, primarily because of the addition of a very large set of ideographs used for Chinese, Japanese, and Korean. Today, the 16-bit char type is insufficient to describe all Unicode characters.

We need some terminology to explain how this problem is resolved in Java, beginning with JDK 5.0. A code point is a code value that is associated with a character in an encoding scheme. In the standard Unicode encoding, code points are written in hexadecimal with a U+ prefix; for example, U+0041 is the code point of the capital letter A. Unicode's code points are grouped into 17 code planes. The first code plane, called the basic multilingual plane, consists of the "classic" Unicode characters with code points U+0000 through U+FFFF. Sixteen additional planes, with code points U+10000 through U+10FFFF, hold the supplementary characters.
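To make the terminology concrete, here is a minimal sketch (assuming JDK 5.0 or later, since the code-point API was introduced then) that checks whether a code point is supplementary and how many char code units it occupies:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        int a = 0x0041;   // U+0041, capital letter A, in the basic multilingual plane
        int z = 0x1D56B;  // U+1D56B, a supplementary character

        // Supplementary code points lie in the range U+10000..U+10FFFF.
        System.out.println(Character.isSupplementaryCodePoint(a)); // false
        System.out.println(Character.isSupplementaryCodePoint(z)); // true

        // charCount reports how many char code units a code point needs.
        System.out.println(Character.charCount(a)); // 1
        System.out.println(Character.charCount(z)); // 2
    }
}
```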

UTF-16 is a variable-length encoding that can represent all Unicode code points. Characters in the basic multilingual plane are represented as single 16-bit values, called code units. Supplementary characters are encoded as consecutive pairs of code units. Each value in such a pair falls into an unused range of 2048 values within the basic multilingual plane, called the surrogates area. This is rather clever, because you can immediately tell whether a code unit encodes a single character or whether it is the first or second part of a supplementary character. For example, the mathematical symbol for the set of integers has code point U+1D56B and is encoded by the two code units U+D835 and U+DD6B.
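The following small sketch shows that encoding at work: Character.toChars splits U+1D56B into its surrogate pair, and the pair can be recognized and reassembled unambiguously:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Split the supplementary code point U+1D56B into UTF-16 code units.
        char[] units = Character.toChars(0x1D56B);
        System.out.printf("%x %x%n", (int) units[0], (int) units[1]); // d835 dd6b

        // Each half is recognizable on its own as part of a surrogate pair.
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // The pair reassembles into exactly the original code point.
        System.out.printf("%x%n", Character.toCodePoint(units[0], units[1])); // 1d56b
    }
}
```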

In Java, the char type simply describes a code unit in the UTF-16 encoding.

It is strongly recommended not to use the char type in your programs unless you are actually manipulating UTF-16 code units. Otherwise, you are almost always better off treating strings as an abstract data type.
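As a sketch of what that advice means in practice, the String class provides code-point-based methods (since JDK 5.0) that let you work with whole characters rather than raw code units:

```java
public class StringDemo {
    public static void main(String[] args) {
        // "a" + U+1D56B (as a surrogate pair) + "b": three characters, four chars.
        String s = "a\uD835\uDD6Bb";

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 characters

        // Traverse by code point, advancing by charCount at each step.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp); // U+0061, U+1D56B, U+0062
            i += Character.charCount(cp);
        }
    }
}
```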

P.S.: A code point actually represents a true Unicode character, that is, a position (code value) in the Unicode character set.


Original


To understand the char type, you have to know about the Unicode encoding scheme. Unicode was invented to overcome the limitations of traditional character encoding schemes. Before Unicode, there were many different standards: ASCII in the United States, ISO 8859-1 for Western European languages, KOI-8 for Russian, GB18030 and BIG-5 for Chinese, and so on. This causes two problems. A particular code value corresponds to different letters in the various encoding schemes. Moreover, the encodings for languages with large character sets have variable length: some common characters are encoded as single bytes, others require two or more bytes. 

Unicode was designed to solve these problems. When the unification effort started in the 1980s, a fixed 2-byte width code was more than sufficient to encode all characters used in all languages in the world, with room to spare for future expansion—or so everyone thought at the time. In 1991, Unicode 1.0 was released, using slightly less than half of the available 65,536 code values. Java was designed from the ground up to use 16-bit Unicode characters, which was a major advance over other programming languages that used 8-bit characters. 

Unfortunately, over time, the inevitable happened. Unicode grew beyond 65,536 characters, primarily due to the addition of a very large set of ideographs used for Chinese, Japanese, and Korean. Now, the 16-bit char type is insufficient to describe all Unicode characters. 

We need a bit of terminology to explain how this problem is resolved in Java, beginning with JDK 5.0. A code point is a code value that is associated with a character in an encoding scheme. In the Unicode standard, code points are written in hexadecimal and prefixed with U+, such as U+0041 for the code point of the letter A. Unicode has code points that are grouped into 17 code planes. The first code plane, called the basic multilingual plane, consists of the "classic" Unicode characters with code points U+0000 to U+FFFF. Sixteen additional planes, with code points U+10000 to U+10FFFF, hold the supplementary characters. 

The UTF-16 encoding is a method of representing all Unicode code points in a variable-length code. The characters in the basic multilingual plane are represented as 16-bit values, called code units. The supplementary characters are encoded as consecutive pairs of code units. Each of the values in such an encoding pair falls into an unused 2048-byte range of the basic multilingual plane, called the surrogates area (U+D800 to U+DBFF for the first code unit, U+DC00 to U+DFFF for the second code unit). This is rather clever, because you can immediately tell whether a code unit encodes a single character or whether it is the first or second part of a supplementary character. For example, the mathematical symbol for the set of integers has code point U+1D56B and is encoded by the two code units U+D835 and U+DD6B. (See http://en.wikipedia.org/wiki/UTF-16 for a description of the encoding algorithm.)

In Java, the char type describes a code unit in the UTF-16 encoding. 

Our strong recommendation is not to use the char type in your programs unless you are actually manipulating UTF-16 code units. You are almost always better off treating strings (which we will discuss starting on page 51) as abstract data types. 


Summary


After all that, you may still not have fully grasped it. Of course, the text above is rather formal, so to finish, let me summarize the main point of this article in my own words. It will certainly not be entirely accurate, and I hope the experts among you will point out any mistakes.

So why exactly is the char type discouraged in Java? In short, one Java char is not always one Unicode character. A char holds a single 16-bit code unit (Java originally used the now-obsolete fixed-width UCS-2 encoding, later generalized to UTF-16), so it can distinguish at most 65,536 values, far fewer than the more than 110,000 characters Unicode defines today. For the supplementary characters added later, Java has no choice but to combine two chars into one Unicode character, which means the number of chars in a String is not necessarily equal to the number of Unicode characters.
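That failure mode is easy to reproduce. In this small sketch, slicing a String at a char index splits a surrogate pair and leaves behind half a character:

```java
public class SplitPairDemo {
    public static void main(String[] args) {
        // "z" followed by the single character U+1D56B (two chars).
        String s = "z\uD835\uDD6B";

        System.out.println(s.length());                      // 3 chars, although...
        System.out.println(s.codePointCount(0, s.length())); // ...only 2 characters

        // Cutting at a char index can land in the middle of a surrogate pair:
        String cut = s.substring(0, 2); // "z" plus a lone high surrogate
        System.out.println(Character.isHighSurrogate(cut.charAt(1))); // true
    }
}
```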

Meanwhile, as everyone knows, CHAR in Oracle is a fixed-width string type (a so-called fixed-length string type): values shorter than the declared length are automatically padded with trailing spaces. In certain queries this padding causes problems, and they are subtle enough that developers rarely notice them. Once the root cause is found, it means the schema has to be changed, and you can imagine what a disaster that is.
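Here is a hedged sketch of the kind of query that can go wrong (the table, column, and connection details are all hypothetical). Oracle stores a CHAR(10) value blank-padded, and because a JDBC java.lang.String parameter is bound as VARCHAR2, the comparison uses nonpadded semantics and quietly matches nothing:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CharColumnPitfall {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: CREATE TABLE t_user (code CHAR(10));
        // INSERT INTO t_user VALUES ('A01');  -- stored as 'A01       '
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/ORCL", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT code FROM t_user WHERE code = ?")) {
            ps.setString(1, "A01"); // bound as VARCHAR2: no blank-padding applies
            try (ResultSet rs = ps.executeQuery()) {
                System.out.println(rs.next()); // false: 'A01' != 'A01       '
            }
        }
        // Workarounds: trim in SQL (WHERE RTRIM(code) = ?), pad the bind value
        // yourself, or change the column type to VARCHAR2.
    }
}
```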


Closing Remarks


A dike of a thousand miles can be breached by a single ant hole. Never underestimate a seemingly trivial detail: it is exactly this little char type that can reduce all your hard work to ashes.