Using Java StAX to Parse a Huge XML File (over 60 GB) and Store It in a Database (MySQL)
The Problem
I needed to parse the Stack Overflow data dump (in xml format) and store the data in a database. The Posts xml file alone is over 60 GB.
So how do you parse an xml file that large? (There are solutions for this on Stack Overflow as well.)
Solution
Perhaps you have already thought of reading the file in chunks and then parsing them. But how do you parse in chunks? Java offers two ways of processing xml: DOM and event driven. DOM has to read the entire file and build a DOM tree in memory, so it is clearly out of the question here; the only viable choice is an event-based parser. I went with StAX, although SAX is probably the more common pick. So what is the difference between SAX and StAX? In short: SAX pushes events into callback handlers you register, while StAX is a pull parser, meaning you advance a cursor through the document and ask for the next event yourself.
What you need to understand is what event driven means. (Once you understand that, you also understand why StAX effectively processes the file piece by piece; the minimal sketch below makes this concrete.)
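Here is a minimal pull-parsing skeleton (a sketch only, separate from the full importer later in this post). Because the reader holds just the current event, memory use stays flat no matter how big the file is:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class StaxSkeleton {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The reader streams the file: only the current event lives in memory
        XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream(args[0]));
        while (reader.hasNext()) {
            reader.next(); // we pull the next event ourselves (StAX), instead of being called back (SAX)
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("row")) {
                System.out.println("Id = " + reader.getAttributeValue(null, "Id"));
            }
        }
        reader.close();
    }
}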
Code Implementation
First, the format of the Posts.xml file (rows truncated here for display):
<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="573" ViewCount="37080" Body="" OwnerUserId="8" ... />
  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="256" ViewCount="16306" Body="" OwnerUserId="9" ... />
  <row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="401" Body="" OwnerUserId="9" LastEditorUserId="4020527" ... />
  <row Id="9" PostTypeId="1" AcceptedAnswerId="1404" CreationDate="2008-07-31T23:40:59.743" Score="1743" ViewCount="480476" Body="" ... />
  <row Id="11" PostTypeId="1" AcceptedAnswerId="1248" CreationDate="2008-07-31T23:55:37.967" Score="1348" ViewCount="136033" Body="" ... />
  <row Id="12" PostTypeId="2" ParentId="11" CreationDate="2008-07-31T23:56:41.303" Score="320" Body="" OwnerUserId="1" LastEditorUserId="1271898" ... />
  <row Id="13" PostTypeId="1" CreationDate="2008-08-01T00:42:38.903" Score="539" ViewCount="157009" Body="" OwnerUserId="9" ... />
</posts>
Database Design
Posts come in two kinds, questions and answers (distinguished by the value of the PostTypeId attribute), and because the data volume is large (about 42 million rows), I designed two tables.
-- PostTypeId = 1
-- Id CreationDate Score ViewCount OwnerUserId Tags AnswerCount FavoriteCount
CREATE TABLE `questions` (
  `Id` int,
  `CreationDate` datetime,
  `Score` int,
  `ViewCount` int,
  `OwnerUserId` int,
  `Tags` varchar(250),
  `AnswerCount` int,
  `FavoriteCount` int,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- PostTypeId = 2
-- Id ParentId CreationDate Score OwnerUserId CommentCount
CREATE TABLE `answers` (
  `Id` int,
  `ParentId` int,
  `CreationDate` datetime,
  `Score` int,
  `OwnerUserId` int,
  `CommentCount` int,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Java Code
import com.mysql.jdbc.PreparedStatement;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Date;

/**
 * Desc: Parse & Import (to mysql db) Stack Overflow dump xml file
 * Created by Myth on 10/12/2018
 */
public class XmlProcessor {

    private Connection connection = null;

    /**
     * Get Db Connection
     * @throws ClassNotFoundException
     * @throws SQLException
     */
    public void openMysqlConnection() throws ClassNotFoundException, SQLException {
        String driver = "com.mysql.jdbc.Driver";
        String url = "jdbc:mysql://localhost:3306/stackoverflow";
        String username = "root";
        String password = "123456";
        Class.forName(driver);
        // Keep the connection in the field so the other methods can use it
        this.connection = DriverManager.getConnection(url, username, password);
    }

    public void closeConnection() throws SQLException {
        this.connection.close();
    }
    /**
     * Parse the Posts xml file and import the rows into mysql
     * @param filePath path of the xml file
     * @param commitCount commit once every commitCount rows
     * @throws SQLException
     * @throws FileNotFoundException
     * @throws XMLStreamException
     */
    public void parsePosts(String filePath, int commitCount)
            throws SQLException, FileNotFoundException, XMLStreamException {
        // Timer starts
        Long begin = new Date().getTime();
        // Prefixes of the multi-row insert statements
        String prefixQuestions = "INSERT INTO questions VALUES ";
        String prefixAnswers = "INSERT INTO answers VALUES ";
        StringBuffer suffixQuestions = new StringBuffer();
        StringBuffer suffixAnswers = new StringBuffer();
        // Disable auto-commit so rows are committed in batches
        this.connection.setAutoCommit(false);
        // PreparedStatement executes the sql statements
        PreparedStatement pst = (PreparedStatement) this.connection.prepareStatement("");
        // Parse the xml to extract the data
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        inputFactory.setProperty("http://www.oracle.com/xml/jaxp/properties/getEntityCountInfo", "yes");
        // Raise the entity size limit, otherwise error JAXP00010004 is thrown
        inputFactory.setProperty("http://www.oracle.com/xml/jaxp/properties/totalEntitySizeLimit", Integer.MAX_VALUE);
        File file = new File(filePath);
        InputStream isS = new FileInputStream(file);
        XMLStreamReader streamReader = inputFactory.createXMLStreamReader(isS);
        int countRow = 0;
        // Q: Id CreationDate Score ViewCount OwnerUserId Tags AnswerCount FavoriteCount
        // A: Id ParentId CreationDate Score OwnerUserId CommentCount
        String id, creationDate, score, viewCount, ownerUserId, tags, answerCount, favoriteCount, parentId, commentCount;
        String postTypeId;
        String sqlQuestions = null, sqlAnswers = null;
        // Read the rows and store the data
        while (streamReader.hasNext()) {
            streamReader.next();
            if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT) {
                if (streamReader.getLocalName().equals("row")) {
                    postTypeId = streamReader.getAttributeValue(null, "PostTypeId");
                    id = streamReader.getAttributeValue(null, "Id");
                    creationDate = streamReader.getAttributeValue(null, "CreationDate");
                    score = streamReader.getAttributeValue(null, "Score");
                    viewCount = streamReader.getAttributeValue(null, "ViewCount");
                    ownerUserId = streamReader.getAttributeValue(null, "OwnerUserId");
                    tags = streamReader.getAttributeValue(null, "Tags");
                    answerCount = streamReader.getAttributeValue(null, "AnswerCount");
                    favoriteCount = streamReader.getAttributeValue(null, "FavoriteCount");
                    parentId = streamReader.getAttributeValue(null, "ParentId");
                    commentCount = streamReader.getAttributeValue(null, "CommentCount");
                    // 1 Question, 2 Answer
                    if ("1".equals(postTypeId)) {
                        suffixQuestions.append("(" + id + "," + "\"" + creationDate + "\"" + "," +
                                score + "," + viewCount + "," + ownerUserId + "," +
                                "\"" + tags + "\"" + "," + answerCount + "," + favoriteCount + "),");
                    } else {
                        suffixAnswers.append("(" + id + "," + parentId + "," + "\"" + creationDate + "\"" + "," +
                                score + "," + ownerUserId + "," + commentCount + "),");
                    }
                    countRow += 1; // count rows
                    if (countRow % commitCount == 0) {
                        // Build the complete sql statements (skip a table whose buffer is empty)
                        if (suffixQuestions.length() != 0) {
                            sqlQuestions = prefixQuestions + suffixQuestions.substring(0, suffixQuestions.length() - 1);
                            pst.addBatch(sqlQuestions);
                        }
                        if (suffixAnswers.length() != 0) {
                            sqlAnswers = prefixAnswers + suffixAnswers.substring(0, suffixAnswers.length() - 1);
                            pst.addBatch(sqlAnswers);
                        }
                        // Execute the batch and commit the transaction
                        pst.executeBatch();
                        this.connection.commit();
                        // Clear the data accumulated in this round
                        suffixQuestions = new StringBuffer();
                        suffixAnswers = new StringBuffer();
                        System.out.println("Committed: " + countRow + " √");
                    }
                }
            }
        }
        // Flush whatever is left over after the last full chunk
        if (suffixQuestions.length() != 0) {
            sqlQuestions = prefixQuestions + suffixQuestions.substring(0, suffixQuestions.length() - 1);
            pst.addBatch(sqlQuestions);
        }
        if (suffixAnswers.length() != 0) {
            sqlAnswers = prefixAnswers + suffixAnswers.substring(0, suffixAnswers.length() - 1);
            pst.addBatch(sqlAnswers);
        }
        pst.executeBatch();
        this.connection.commit();
        System.out.println("Committed All: " + countRow + " √");
        pst.close();
        // Elapsed time
        Long end = new Date().getTime();
        System.out.println("Cost: " + (end - begin) / 1000 + " s");
    }
}
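To tie it together, a driver might look like the following (the file path and the commit size of 10000 are example values, not anything mandated by the code above):

public class Main {
    public static void main(String[] args) throws Exception {
        XmlProcessor processor = new XmlProcessor();
        processor.openMysqlConnection();                    // connect to the stackoverflow database
        processor.parsePosts("D:/data/Posts.xml", 10000);   // commit every 10000 parsed rows
        processor.closeConnection();
    }
}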
Importing all of the data (more than 42 million rows) takes a little over ten minutes.
Summary
Bug
Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1077158,4084]
Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".
Cause: the Java XML parser ships with built-in limits that cap how much input it reads, to keep memory under control. The xml file here is so large that it exceeds the default limit, which is why the Java code above raises the limit via the totalEntitySizeLimit property. Even set to the maximum, though, it only just manages the 40-odd million rows; past that point this approach stops working, and you can instead split the oversized xml into several smaller xml files and process each one as described above.
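A rough sketch of such a splitter, under the assumption that each <row .../> element sits on its own line (which matches the dump format shown earlier); the chunk size and output file names are arbitrary choices:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XmlSplitter {
    // Line-oriented splitter sketch: assumes one <row .../> per line, as in the official dumps
    public static void split(String inPath, int rowsPerFile) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(inPath), StandardCharsets.UTF_8)) {
            PrintWriter out = null;
            String line;
            int rows = 0, part = 0;
            while ((line = in.readLine()) != null) {
                if (!line.trim().startsWith("<row")) continue;  // skip the xml header and <posts> wrapper
                if (rows % rowsPerFile == 0) {                  // start the next chunk file
                    if (out != null) { out.println("</posts>"); out.close(); }
                    out = new PrintWriter(Files.newBufferedWriter(
                            Paths.get(inPath + ".part" + part++ + ".xml"), StandardCharsets.UTF_8));
                    out.println("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
                    out.println("<posts>");
                }
                out.println(line);
                rows++;
            }
            if (out != null) { out.println("</posts>"); out.close(); }
        }
    }
}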
Other Notes
Using batch inserts improves throughput; a more idiomatic variant of the batching is sketched below.
Using XMLStreamReader rather than XMLEventReader (see the official documentation and tutorials for the difference) is also faster.
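For reference, the textbook JDBC batching pattern uses a parameterized statement instead of hand-concatenated SQL strings, which also sidesteps quoting and escaping issues; with MySQL Connector/J, appending rewriteBatchedStatements=true to the JDBC URL lets the driver rewrite such batches into multi-row INSERTs. A sketch for the answers table (the sample values are illustrative only):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsertSketch {
    // Placeholder-based batching for the answers table; the sample values are made up
    public static void insertAnswers(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        String sql = "INSERT INTO answers VALUES (?, ?, ?, ?, ?, ?)";
        try (PreparedStatement pst = conn.prepareStatement(sql)) {
            for (int i = 0; i < 3; i++) {                // in the real importer: loop over parsed rows
                pst.setInt(1, i + 1);                    // Id
                pst.setInt(2, 4);                        // ParentId
                pst.setString(3, "2008-07-31 22:17:57"); // CreationDate
                pst.setInt(4, 401);                      // Score
                pst.setInt(5, 9);                        // OwnerUserId
                pst.setInt(6, 0);                        // CommentCount
                pst.addBatch();
            }
            pst.executeBatch();
            conn.commit();
        }
    }
}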
