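Detecting a CSV file's encoding to fix garbled Chinese text when reading CSV
The problem we hit most often when parsing CSV files is garbled text. Some readers will say that hard-coding the charset as GBK or GB2312 while parsing is enough to fix garbled Chinese, as follows: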
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3

public static List<List<String>> readTxtOrCsvFile(InputStream input) {
    List<List<String>> data = new ArrayList<>();
    if (input == null) {
        return data;
    }
    // The charset is hard-coded. Closing the BufferedReader also closes
    // the wrapped InputStreamReader and the input stream.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(input, "GB2312"))) {
        String line;
        while ((line = br.readLine()) != null) {
            // Skip blank lines.
            if (StringUtils.isNotBlank(line)) {
                List<String> row = new ArrayList<>();
                // Naive split: quoted fields that contain commas are not handled.
                for (String cellData : line.split(",")) {
                    // buildText is the author's own helper; its body is not shown in this article.
                    row.add(buildText(cellData));
                }
                data.add(row);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return data;
}
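This fixes the garbled text for some users. But what if the file is encoded as UTF-8? Then the parsed output is still garbled. How can the problem be solved for good? I ran into this myself, thought about it for a long time, searched online for a long time, and arrived at a solution.
The solution is to detect the encoding of the file being parsed and then decode it with that encoding. The encoding can be detected with the following steps: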
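1. Read the first three bytes of the stream into a byte[3] array.
2. Convert each of the three bytes to its hexadecimal string with Integer.toHexString(b & 0xFF).
3. Compare the resulting string with the UTF-8 file header (the BOM), EFBBBF. A match means the file is UTF-8. This check identifies only UTF-8 files that were saved with a BOM.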
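The three steps above can be sketched in isolation as follows. This is a minimal illustration, not the author's code: the class name BomCheck and its methods are hypothetical, and the hex conversion uses String.format("%02x", ...) instead of Integer.toHexString so that a byte such as 0x0A becomes "0a" rather than "a".

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper that mirrors the three steps; names are illustrative only.
class BomCheck {

    // Step 1: read up to the first three bytes of the stream.
    static byte[] readHead(InputStream in) {
        try {
            byte[] head = new byte[3];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // stream ended before three bytes were available
                }
                n += r;
            }
            return Arrays.copyOf(head, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 2: turn the first `count` bytes into lowercase hex, two digits per byte.
    static String toHex(byte[] bytes, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count && i < bytes.length; i++) {
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Step 3: compare against the UTF-8 BOM, EF BB BF.
    static boolean hasUtf8Bom(byte[] head) {
        return "efbbbf".equals(toHex(head, 3));
    }
}
```

Calling BomCheck.hasUtf8Bom(BomCheck.readHead(stream)) consumes the first bytes of the stream, so a real reader has to rewind or re-open the stream afterwards.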
/**
 * Reads a txt or csv file, choosing the charset from the first bytes of the stream.
 *
 * @return one list of cell values per non-blank line
 */
public static List<List<String>> readTxtOrCsvFile(InputStream input) {
    List<List<String>> data = new ArrayList<>();
    if (input == null) {
        return data;
    }
    // BufferedInputStream supports mark/reset, which getCharSet relies on.
    // Closing the resources also closes the underlying input stream.
    try (BufferedInputStream bb = new BufferedInputStream(input);
         BufferedReader br = new BufferedReader(new InputStreamReader(bb, getCharSet(bb)))) {
        String line;
        while ((line = br.readLine()) != null) {
            // Skip blank lines.
            if (StringUtils.isNotBlank(line)) {
                List<String> row = new ArrayList<>();
                // Naive split: quoted fields that contain commas are not handled.
                for (String cellData : line.split(",")) {
                    // buildText is the author's own helper; its body is not shown in this article.
                    row.add(buildText(cellData));
                }
                data.add(row);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return data;
}
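/**
 * Returns the charset name for the stream, judged from its first three bytes.
 *
 * @param bb buffered stream positioned at the start of the file; it is rewound before returning
 * @return "UTF-8" if the stream starts with the UTF-8 BOM, otherwise "GB2312"
 * @throws Exception if the stream cannot be read or reset
 */
private static String getCharSet(BufferedInputStream bb) throws Exception {
    String charSet;
    byte[] buffer = new byte[3];
    // Reading consumes bytes from the stream, so mark the position first,
    // read the header, then reset to restore the stream.
    bb.mark(3);
    bb.read(buffer);
    bb.reset();
    String s = Integer.toHexString(buffer[0] & 0xFF)
            + Integer.toHexString(buffer[1] & 0xFF)
            + Integer.toHexString(buffer[2] & 0xFF);
    switch (s) {
        // GBK and GB2312 have no file header; d5cbba is presumably just how the
        // author's sample file begins. It is parsed as GB2312, same as the default.
        case "d5cbba":
            charSet = "GB2312";
            break;
        // UTF-8 BOM. After reset() the BOM is still in the stream, so the first
        // cell starts with U+FEFF unless it is stripped later.
        case "efbbbf":
            charSet = "UTF-8";
            break;
        default:
            charSet = "GB2312";
            break;
    }
    return charSet;
}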
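The BOM check above cannot recognise UTF-8 files that were saved without a BOM; those fall through to GB2312 and stay garbled. One possible complement, sketched here under assumptions and not taken from the original article, is to attempt a strict UTF-8 decode and fall back to GBK only when that fails. The class name CharsetGuess, its methods, and the choice of GBK as the fallback are all hypothetical. The heuristic expects the whole file content in memory, and a short GBK input whose bytes happen to form valid UTF-8 would be misjudged.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original article.
class CharsetGuess {

    // Assumed fallback for content that is not valid UTF-8.
    static final Charset FALLBACK = Charset.forName("GBK");

    // True if the bytes decode as UTF-8 with no malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // UTF-8 when the content is valid UTF-8 (with or without BOM), otherwise the fallback.
    static Charset guess(byte[] data) {
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : FALLBACK;
    }

    // Decodes with the guessed charset and drops a leading BOM, which
    // Java's UTF-8 decoder would otherwise keep as the character U+FEFF.
    static String decode(byte[] data) {
        String text = new String(data, guess(data));
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

In the reader above, this would mean loading the file into a byte array first and splitting the result of CharsetGuess.decode(bytes) into lines, instead of picking the charset from three header bytes.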
With that, the problem is solved.