Handling the invalid XML character (Unicode: 0x7) exception when parsing XML files
Error message:
2015-01-29 00:10:22,075  ERROR commonapi.CommonApiAction - errorCode:5000,5000-00;Description:Program exception. Error on line 1 of document  : An invalid XM
org.dom4j.DocumentException: Error on line 1 of document  : An invalid XML character (Unicode: 0x19) was found in the CDATA section. Nested exception: An in
at org.dom4j.ad(SAXReader.java:482)
at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
at WapDocsSearchJsonInfo(CommonApiAction.java:1866)
flect.GeneratedMethodAccessor43.invoke(Unknown Source)
flect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at flect.Method.invoke(Method.java:597)
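Cause of the error:
The XML file is parsed with dom4j.
Some documents contain these invalid characters as control codes left behind by a word processor (Microsoft chose the characters between 0x82 and 0x95 for its "smart" punctuation). Unicode also reserves them as control codes, and XML discourages them (and forbids most control characters below 0x20 outright). "Invalid characters" here does not mean characters such as < and > that cannot appear outside of tags in an XML file, nor garbled text caused by an encoding problem; it means invisible characters that fall outside the range of legal XML characters. The W3C standard lists characters that should not appear in XML files: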
// Document authors are encouraged to avoid "compatibility characters", as defined in
// Unicode [Unicode]. The characters defined in the following ranges are also discouraged.
// They are either control characters or permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
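The failure described above can be reproduced in a few lines. The sketch below is only an illustration: it uses the JDK's built-in DOM parser rather than dom4j (an assumption made so the example has no external dependency), and the class and method names are hypothetical. It checks whether a parser accepts a CDATA section that contains the control character 0x19.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it uses the JDK parser, not dom4j.
class InvalidCharRepro {

    // Returns true if the text parses as well-formed XML, false if the parser rejects it.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an invalid XML character
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0x19 is a control character outside the legal XML 1.0 range.
        System.out.println(parses("<doc><![CDATA[abc\u0019def]]></doc>"));
        System.out.println(parses("<doc><![CDATA[abcdef]]></doc>"));
    }
}
```

Wrapping the text in CDATA does not help, because CDATA only disables markup interpretation; the character itself still has to be a legal XML character.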
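Solution:
To make sure that common XML parsing tools can parse the XML files you generate, filter the invalid characters out of the file first, or check each character's validity while generating the XML and discard the invalid ones.
Unicode is a character encoding scheme, defined by an international organization, that is intended to hold all the world's scripts and symbols. Unicode code points currently run from 0x0000 to 0x10FFFF and are arranged in 17 groups called planes; each plane has 65,536 code points, for 1,114,112 in total. Only a few planes are in use so far. Encoding schemes such as UTF-8, UTF-16, and UTF-32 convert these numbers into program data.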
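As a small illustration of the plane arithmetic, the sketch below (class and method names are hypothetical) derives the plane of a code point and shows that the same code point occupies a different number of bytes depending on the encoding scheme. It is only a demonstration of the numbers quoted above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for the plane arithmetic.
class UnicodePlanes {

    // Each plane holds 0x10000 (65,536) code points, so the plane is the code point shifted right by 16 bits.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(17 * 65536);        // 1114112 code points in total
        System.out.println(planeOf('A'));      // 0, the Basic Multilingual Plane
        System.out.println(planeOf(0x10FFFF)); // 16, the last plane

        // The same code point is serialized differently by each encoding scheme.
        String cjk = "\u4E2D";
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```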
The W3C definition of XML 1.0 gives the following range of legal Unicode characters (in hexadecimal):
Character Range
[2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
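Before stripping anything, it can be useful to find out where the offending character sits. The following is a minimal sketch, with hypothetical class and method names, that transcribes the Char production above into a predicate and reports the index of the first illegal code point. It iterates by code point so that supplementary characters (stored as surrogate pairs in a Java String) are tested as a whole.

```java
// Hypothetical helper class; a direct transcription of the XML 1.0 Char production.
class XmlCharCheck {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the char index of the first illegal code point, or -1 if the text is clean.
    static int indexOfFirstInvalid(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfFirstInvalid("abc\u0019def")); // 3
        System.out.println(indexOfFirstInvalid("line1\nline2")); // -1
    }
}
```

Logging the index and the hexadecimal value of the character makes it easier to trace which upstream data source produced it.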
Method 1:
// Keep only the legal characters
public String stripNonValidXMLCharacters(String in) {
    StringBuilder out = new StringBuilder(); // Used to hold the output.
    if (in == null || ("".equals(in))) return ""; // vacancy test.
    for (int i = 0; i < in.length(); ) {
        // Read a whole code point: a char is only 16 bits, so testing chars one by one
        // would never match the range above 0xFFFF and would drop valid surrogate pairs.
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
    }
    return out.toString();
}
Method 2:
// Filter out illegal characters
// Note: the regular expression below is not exhaustive; it only removes
//  0x00 - 0x08
//  0x0b - 0x0c
//  0x0e - 0x1f
public static String stripNonValidXMLChars(String str) {
    if (str == null || "".equals(str)) {
        return str;
    }
    return str.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
}
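Because the pattern in Method 2 only covers control characters below 0x20, a regular expression that mirrors the whole XML 1.0 Char production may be preferable. The version below is a sketch under two assumptions: Java 7 or later (for the `\x{...}` code point syntax) and hypothetical class and method names. It uses a negated character class, so anything outside the legal ranges is removed, including 0xFFFE, 0xFFFF, and unpaired surrogates.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the pattern is precompiled once and reused.
class XmlRegexFilter {

    // Matches every code point that is NOT allowed by the XML 1.0 Char production.
    private static final Pattern INVALID_XML_10 = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns the input with all illegal XML 1.0 characters removed.
    static String stripInvalid(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        return INVALID_XML_10.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripInvalid("a\u0007b\u0019c")); // abc
    }
}
```

Java's regex engine matches by code point, so a well-formed surrogate pair is treated as one supplementary character and kept.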
