Subject:            Determining if the data is valid UTF-8
Doc ID:             Note:162608.1
Type:               PROBLEM
Last Revision Date: 18-NOV-2004
Status:             REVIEWED
Problem Description
-------------------
JDBC programs often get exceptions when converting character data from
the database, even though the same data can be viewed in SQL*Plus.
For example:
java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv
Solution Description
--------------------
Verify that the data consists of valid UTF-8 byte sequences by dumping the column:
SQL> SELECT dump(utf8_column) FROM utf8_table;
DUMP(UTF8_COLUMN)
--------------------------------------------------------------------------------
Typ=1 Len=17: 80,228,101,112,112,101,114,32,69,110,101,114,103,105,32,65,66
In this example, the second byte is 228 (a-umlaut, 'ä', in WE8ISO8859P1). For
228 to begin a valid UTF-8 character, the following two bytes must each be
greater than 127 (see the rules below). Here they are 101 and 112, so the
sequence is not valid UTF-8.
In this case, although the data can be viewed in SQL*Plus, it is not valid
UTF-8 and cannot be converted to UCS2.
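The same check can be reproduced outside the database. The following is a
minimal sketch, not part of the original note: it feeds the byte list from the
DUMP output to the standard JDK UTF-8 decoder with error reporting switched on.
The class and method names (DumpCheck, parseDump, decodesAsUtf8) are
hypothetical. Note that the JDK decoder follows current UTF-8 rules, so it is
stricter than the rules table in this note (for example, it rejects 5-byte and
6-byte sequences).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DumpCheck {
    // Parses the comma-separated decimal byte list printed by DUMP()
    // (the part after "Len=17: ") into a byte array.
    static byte[] parseDump(String byteList) {
        String[] parts = byteList.trim().split("\\s*,\\s*");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i]);
        }
        return out;
    }

    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean decodesAsUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] row = parseDump(
            "80,228,101,112,112,101,114,32,69,110,101,114,103,105,32,65,66");
        // Prints false: 228 is not followed by two bytes in the range 128-191.
        System.out.println(decodesAsUtf8(row));
    }
}
```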
The data must be scrubbed before it will work in Java. This is not a failure
in Java. The forgiving nature of Oracle makes the data appear correct in
SQL*Plus, even though the stored data is not UTF-8.
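One possible shape for the scrubbing step is sketched below. It is an
illustration under a strong assumption, not a procedure from this note: it
assumes the stored bytes are really ISO-8859-1 (WE8ISO8859P1) text, as the byte
228 in the example suggests, and re-encodes them as UTF-8. The class and method
names (Scrub, latin1ToUtf8) are hypothetical, and how the raw bytes are read
from and written back to the database is left out.

```java
import java.nio.charset.StandardCharsets;

class Scrub {
    // ASSUMPTION: the input bytes are ISO-8859-1 text that was stored
    // unconverted in a UTF8 database. If the real source character set
    // is different, this produces wrong characters.
    static byte[] latin1ToUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First seven bytes of the DUMP example: 80,228,101,112,112,101,114
        byte[] raw = {80, (byte) 228, 101, 112, 112, 101, 114};
        byte[] fixed = latin1ToUtf8(raw);
        // The single byte 228 (0xE4) becomes the two bytes 0xC3 0xA4,
        // so the result is one byte longer than the input.
        System.out.println(fixed.length); // prints 8
    }
}
```

Data that is already valid UTF-8 must not be passed through such a routine,
because every multi-byte character would be re-encoded a second time.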
You can also call DUMP with format 1017. It reports the character set name and
prints each byte as a character when it is printable, and in hexadecimal
otherwise:
SQL> SELECT dump(utf8_column, 1017) FROM utf8_table;
DUMP(UTF8_COLUMN,1017)
----------------------------------------------------------------------
Typ=1 Len=17 CharacterSet=UTF8: P,e4,e,p,p,e,r, ,E,n,e,r,g,i, ,A,B
If the reported character set is something other than UTF8 or US7ASCII, Java
may have problems converting the data.
Explanation
-----------
The data in the database is invalid. The usual cause is that an OCI-based
program (JDBC OCI8, OCI, Precompiler, SQL*Loader) loaded the data with an
incorrect character set setting. The data displays properly in SQL*Plus because
228 is a valid character in the host character set. It fails in Java because
Java holds all character data as UCS2, so the bytes must first be converted
from UTF-8.
The conversion from UTF-8 to UCS2 follows fixed rules. Oracle applies the
rules and tries to convert each character. If the conversion fails, the data
is passed through as is (garbage in, garbage out).
The rules are as follows:
When the first byte of the multi-byte character is:
Decimal   Binary     Total number of bytes   Subsequent bytes
-------   --------   ---------------------   ----------------
<128      0xxxxxxx   1                       N/A
>=192     110xxxxx   2                       10xxxxxx
>=224     1110xxxx   3                       10xxxxxx
>=240     11110xxx   4                       10xxxxxx
>=248     111110xx   5                       10xxxxxx
>=252     1111110x   6                       10xxxxxx
Note that every subsequent byte (bytes 2 through 6, as needed) is greater than
127 and less than 192. These values never overlap with the range of a first
byte, so a decoder can always tell where a character starts.
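The rules in the table above can be applied mechanically to the bytes returned
by DUMP. The following is a minimal sketch with hypothetical names (Utf8Rules,
sequenceLength, firstInvalidIndex). It checks only the structural rules listed
in this note. It does not reject overlong encodings or other forms that current
UTF-8 definitions disallow, such as sequences longer than 4 bytes.

```java
class Utf8Rules {
    // Total sequence length implied by the first byte, per the table.
    // Returns 0 for a byte that cannot start a character:
    // 128-191 (10xxxxxx, continuation only) and 254-255.
    static int sequenceLength(int firstByte) {
        if (firstByte < 128) return 1; // 0xxxxxxx
        if (firstByte < 192) return 0; // 10xxxxxx
        if (firstByte < 224) return 2; // 110xxxxx
        if (firstByte < 240) return 3; // 1110xxxx
        if (firstByte < 248) return 4; // 11110xxx
        if (firstByte < 252) return 5; // 111110xx
        if (firstByte < 254) return 6; // 1111110x
        return 0;
    }

    // Returns the index of the first byte that starts a sequence breaking
    // the rules, or -1 if every sequence follows them.
    static int firstInvalidIndex(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int len = sequenceLength(data[i] & 0xFF);
            if (len == 0) return i;
            for (int k = 1; k < len; k++) {
                if (i + k >= data.length) return i;      // sequence is cut off
                int next = data[i + k] & 0xFF;
                if (next < 128 || next > 191) return i;  // not 10xxxxxx
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Bytes from the DUMP example in this note.
        byte[] row = {80, (byte) 228, 101, 112, 112, 101, 114, 32, 69,
                      110, 101, 114, 103, 105, 32, 65, 66};
        // Prints 1: byte 228 announces a 3-byte sequence, but the next
        // byte, 101, is below 128.
        System.out.println(firstInvalidIndex(row));
    }
}
```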