Best Practices for Reading Word and Excel Files with POI -- 688IT编程网

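Preface

POI is a well-known Apache library for reading and writing Microsoft document formats. Many people have probably used it to export reports or to create and read Word documents, and it does make these tasks much more convenient. A tool I built recently reads the Word and Excel files on a computer.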

POI structure

Package   Description
HSSF      Reads and writes the Microsoft Excel XLS format.
XSSF      Reads and writes the Microsoft Excel OOXML XLSX format.
HWPF      Reads and writes the Microsoft Word DOC format.
HSLF      Reads and writes the Microsoft PowerPoint format.
HDGF      Reads the Microsoft Visio format.
HPBF      Reads the Microsoft Publisher format.
HSMF      Reads the Microsoft Outlook format.

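Below I go through some of the pitfalls I ran into, first with Word and then with Excel.

Word

For Word files, all I need is the body text of the document. So we can write a method that reads a doc or docx file: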

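import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

private static String readDoc(String filePath, InputStream is) {
    String text = "";
    try {
        if (filePath.endsWith("doc")) {
            // Legacy binary format, handled by HWPF
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (filePath.endsWith("docx")) {
            // OOXML format, handled by XWPF
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }
    } catch (Exception e) {
        // logger is assumed to be the enclosing class's logging field
        logger.error(filePath, e);
    } finally {
        try {
            if (is != null) {
                is.close();
            }
        } catch (IOException ignored) {
            // nothing more can be done if closing fails
        }
    }
    return text;
}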

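In theory, this code should work for most doc and docx files. But I ran into a strange problem: when reading certain doc files, it frequently threw this exception:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents.

What does this exception mean? Put simply, the file you opened is not really a doc file, and you should read it with the docx code path. But the file we opened clearly has a doc extension!

The thing is, doc and docx are fundamentally different: doc is an OLE2 file, whereas docx is OOXML. If you open a docx file with an archive tool, you will find a handful of folders inside.

In essence, a docx file is a zip archive containing a set of XML files. So a docx file may be small on disk while the XML inside it is quite large, which is why reading some docx files that do not look big can still consume a lot of memory.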
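To make the size point concrete, here is a minimal sketch that uses only java.util.zip to list the entries of a zip-based document together with their uncompressed sizes. The class name ZipInspector and the method uncompressedSizes are made up for illustration; they are not part of POI or of the author's tool.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipInspector {

    /**
     * Walks a zip stream (a docx or xlsx file, for example) and returns
     * each entry name mapped to the number of bytes it inflates to.
     */
    public static Map<String, Long> uncompressedSizes(InputStream in) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                long total = 0;
                int read;
                // Reading the entry to its end counts the inflated bytes,
                // which works even when the entry header omits the size.
                while ((read = zip.read(buffer)) != -1) {
                    total += read;
                }
                sizes.put(entry.getName(), total);
            }
        }
        return sizes;
    }
}
```

Running this over a docx that misbehaves should show whether an entry such as word/document.xml inflates to far more than the file's size on disk.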

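I then opened the problematic doc file with an archive tool and, sure enough, it had the same internal layout, so for all practical purposes it is a docx file. It was probably saved in some compatibility mode, which led to this nasty problem. The upshot is that the file extension is no longer a reliable way to tell doc from docx.

Honestly, I doubt this is a rare problem, yet I could not find anything about it on Google. One example I did come across uses ZipInputStream to check whether a file is a docx: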

boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;

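I don't think this is a good approach, though: it forces me to construct a ZipInputStream just for the check. Worse, the probe appears to consume bytes from the InputStream, so reading a regular doc file afterwards runs into problems. Alternatively, you could use a File object to check whether the file is a zip. That does not work for me either, because I also need to read doc and docx files that sit inside archives, so my input has to be an InputStream. I went back and forth on Stack Overflow for quite a while, and at times I honestly wondered whether people understood what I was asking, but in the end someone came up with a solution that delighted me. It is a feature newly added in POI 3.17: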

public enum FileMagic {
    /** OLE2 / BIFF8+ stream used for Office 97 and higher documents */
    OLE2(HeaderBlockConstants._signature),
    /** OOXML / ZIP stream */
    OOXML(OOXML_FILE_HEADER),
    /** XML file */
    XML(RAW_XML_FILE_HEADER),
    /** BIFF2 raw stream - for Excel 2 */
    BIFF2(new byte[]{
        0x09, 0x00, // sid=0x0009
        0x04, 0x00, // size=0x0004
        0x00, 0x00, // unused
        0x70, 0x00  // 0x70 = multiple values
    }),
    /** BIFF3 raw stream - for Excel 3 */
    BIFF3(new byte[]{
        0x09, 0x02, // sid=0x0209
        0x06, 0x00, // size=0x0006
        0x00, 0x00, // unused
        0x70, 0x00  // 0x70 = multiple values
    }),
    /** BIFF4 raw stream - for Excel 4 */
    BIFF4(new byte[]{
        0x09, 0x04, // sid=0x0409
        0x06, 0x00, // size=0x0006
        0x00, 0x00, // unused
        0x70, 0x00  // 0x70 = multiple values
    }, new byte[]{
        0x09, 0x04, // sid=0x0409
        0x06, 0x00, // size=0x0006
        0x00, 0x00, // unused
        0x00, 0x01
    }),
    /** Old MS Write raw stream */
    MSWRITE(
        new byte[]{0x31, (byte)0xbe, 0x00, 0x00 },
        new byte[]{0x32, (byte)0xbe, 0x00, 0x00 }),
    /** RTF document */
    RTF("{\\rtf"),
    /** PDF document */
    PDF("%PDF"),
    // keep UNKNOWN always as last enum!
    /** UNKNOWN magic */
    UNKNOWN(new byte[0]);

    final byte[][] magic;

    FileMagic(long magic) {
        this.magic = new byte[1][8];
        LittleEndian.putLong(this.magic[0], 0, magic);
    }

    FileMagic(byte[]... magic) {
        this.magic = magic;
    }

    FileMagic(String magic) {
        this(magic.getBytes(LocaleUtil.CHARSET_1252));
    }

    public static FileMagic valueOf(byte[] magic) {
        for (FileMagic fm : values()) {
            int i = 0;
            boolean found = true;
            for (byte[] ma : fm.magic) {
                for (byte m : ma) {
                    byte d = magic[i++];
                    if (!(d == m || (m == 0x70 && (d == 0x10 || d == 0x20 || d == 0x40)))) {
                        found = false;
                        break;
                    }
                }
                if (found) {
                    return fm;
                }
            }
        }
        return UNKNOWN;
    }

    /**
     * Get the file magic of the supplied InputStream (which MUST
     * support mark and reset).<p>
     * If unsure if your InputStream does support mark / reset,
     * use {@link #prepareToCheckMagic(InputStream)} to wrap it and make
     * sure to always use that, and not the original!<p>
     * Even if this method returns {@link FileMagic#UNKNOWN} it could potentially mean,
     * that the ZIP stream has leading junk bytes
     * @param inp An InputStream which supports either mark/reset
     */
    public static FileMagic valueOf(InputStream inp) throws IOException {
        if (!inp.markSupported()) {
            throw new IOException("getFileMagic() only operates on streams which support mark(int)");
        }
        // Grab the first 8 bytes
        byte[] data = IOUtils.peekFirst8Bytes(inp);
        return FileMagic.valueOf(data);
    }

    /**
     * Checks if an {@link InputStream} can be reseted (i.e. used for checking the header magic) and wraps it if not
     * @param stream stream to be checked for wrapping
     * @return a mark enabled stream
     */
    public static InputStream prepareToCheckMagic(InputStream stream) {
        if (stream.markSupported()) {
            return stream;
        }
        // we used to process the data via a PushbackInputStream, but user code could provide a too small one
        // so we use a BufferedInputStream instead now
        return new BufferedInputStream(stream);
    }
}

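Only the main code is shown here. It identifies the file type from the first 8 bytes of the InputStream, which is without doubt the most elegant solution. Early on I had in fact wondered whether the leading bytes of a compressed file might be defined differently from those of other files. Because FileMagic's dependencies are compatible with version 3.16, all I had to do was add this one class to my project. The correct way to read a Word file is therefore: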

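private static String readDoc(String filePath, InputStream is) {
    String text = "";
    // Wrap the stream if needed so that it supports mark/reset
    is = FileMagic.prepareToCheckMagic(is);
    try {
        // Decide by content, not by file extension
        FileMagic magic = FileMagic.valueOf(is);
        if (magic == FileMagic.OLE2) {
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (magic == FileMagic.OOXML) {
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }
    } catch (Exception e) {
        // logger is assumed to be the enclosing class's logging field
        logger.error("for file " + filePath, e);
    } finally {
        try {
            if (is != null) {
                is.close();
            }
        } catch (IOException ignored) {
            // nothing more can be done if closing fails
        }
    }
    return text;
}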
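For readers who want to see the idea behind FileMagic in isolation, the following is a minimal standard-library sketch of the same technique: peek at the leading bytes with mark/reset, compare them against known signatures, and leave the stream positioned at its start. MagicSniffer and its Kind enum are hypothetical names used for illustration, and only two signatures are covered (the OLE2 header and the zip local-file header), so this is not a replacement for POI's class.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MagicSniffer {

    public enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature (legacy doc/xls/ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Zip local file header signature "PK\3\4" (docx/xlsx/pptx)
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Returns a stream that supports mark/reset, wrapping the input if needed. */
    public static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Peeks at up to 8 leading bytes and rewinds the stream afterwards. */
    public static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        int n = 0;
        in.mark(head.length);
        try {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        } finally {
            in.reset(); // the caller sees the stream from its first byte again
        }
        if (startsWith(head, n, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, n, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int len, byte[] magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Unlike the ZipInputStream probe, the check does not advance the stream, so the same InputStream can be handed to the parser afterwards. Always pass on the stream returned by markable, not the original one.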

Excel

For Excel I will skip the comparison between my earlier approach and the current one and simply give what I now consider the best approach:

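import java.io.File;
import java.io.InputStream;
import com.monitorjbl.xlsx.StreamingReader; // from the excel-streaming-reader library
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

@SuppressWarnings("deprecation")
private static String readExcel(String filePath, InputStream inp) throws Exception {
    Workbook wb;
    StringBuilder sb = new StringBuilder();
    try {
        if (filePath.endsWith(".xls")) {
            wb = new HSSFWorkbook(inp);
        } else {
            wb = StreamingReader.builder()
                    .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
                    .bufferSize(4096)   // buffer size to use when reading InputStream to file (defaults to 1024)
                    .open(inp);         // InputStream or File for XLSX file (required)
        }
        sb = readSheet(wb, sb, filePath.endsWith(".xls"));
        wb.close();
    } catch (OLE2NotOfficeXmlFileException e) {
        // logger is assumed to be the enclosing class's logging field
        logger.error(filePath, e);
    } finally {
        if (inp != null) {
            inp.close();
        }
    }
    return sb.toString();
}

private static String readExcelByFile(String filepath, File file) {
    Workbook wb;
    StringBuilder sb = new StringBuilder();
    try {
        if (filepath.endsWith(".xls")) {
            wb = WorkbookFactory.create(file);
        } else {
            wb = StreamingReader.builder()
                    .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
                    .bufferSize(4096)   // buffer size to use when reading InputStream to file (defaults to 1024)
                    .open(file);        // InputStream or File for XLSX file (required)
        }
        sb = readSheet(wb, sb, filepath.endsWith(".xls"));
        wb.close();
    } catch (Exception e) {
        logger.error(filepath, e);
    }
    return sb.toString();
}

// Cell.CELL_TYPE_* are the int constants of POI 3.x; they are deprecated in 3.17.
private static StringBuilder readSheet(Workbook wb, StringBuilder sb, boolean isXls) throws Exception {
    DataFormatter formatter = new DataFormatter();
    for (Sheet sheet : wb) {
        for (Row r : sheet) {
            for (Cell cell : r) {
                if (cell.getCellType() == Cell.CELL_TYPE_STRING) {
                    sb.append(cell.getStringCellValue());
                    sb.append(" ");
                } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
                    if (isXls) {
                        // Format numbers the way Excel displays them
                        sb.append(formatter.formatCellValue(cell));
                    } else {
                        sb.append(cell.getStringCellValue());
                    }
                    sb.append(" ");
                }
            }
        }
    }
    return sb;
}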

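For Excel, the biggest problem my tool faced was running out of memory: reading very large Excel files frequently ended in an out-of-memory error. Eventually I found an excellent tool (the StreamingReader used above) that reads xlsx files as a stream, working through a very large file in small pieces instead of loading it whole.

Another optimization: wherever a File object is available, I read from the File rather than from an InputStream, because an InputStream has to be loaded into memory in its entirety, which is very memory-hungry.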
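The same File-versus-stream trade-off can be seen with the JDK's own zip classes, which gives a feel for why a File helps. The sketch below is an analogy built on java.util.zip.ZipFile, not POI's internal code: given a File, the reader can jump straight to one named entry and never touches the rest of the archive, whereas a plain InputStream can only be scanned from the beginning. SingleEntryReader and readEntry are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class SingleEntryReader {

    /**
     * Reads one named entry of a zip-based file (an xlsx, for example)
     * as UTF-8 text, or returns null if the entry does not exist.
     */
    public static String readEntry(File file, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(file)) {
            // Entry lookup uses the archive's index, so no other entry is read
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

For example, readEntry(file, "xl/workbook.xml") would return just the workbook part of an xlsx file without inflating the worksheet data.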

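Finally, one small trick is to use the cell type to cut down the amount of data, since I only need the string content of text and numeric cells.

Those are my explorations and findings from reading files with POI; I hope they help. The examples above come from a tool of mine (it mainly helps you run full-text content searches on your computer). If you are interested, take a look; stars and PRs are welcome.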

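Summary

That is all for this article. I hope it is a useful reference for your study or work. If you have questions, feel free to leave a comment. Thank you for your support.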
