Classification No.:        Student ID: M*********        University Code: 10487        Confidentiality Level:
Master's Thesis
The Design and Implementation of a Forum Data Analysis System
Based on a Web Crawler
Candidate: Li Xi
Major: Software Engineering
Supervisor: Assoc. Prof. Cao Hua
Defense Date: December 17, 2018
A Thesis Submitted in Partial Fulfillment of the Requirements
for the Degree of Master of Engineering
The Design and Implementation of Forum Data Analysis System Based on Web Crawler
Candidate  : Li Xi
Major : Software Engineering
Supervisor : Assoc. Prof. Cao Hua
Huazhong University of Science & Technology
Wuhan 430074, P.R.China
December, 2018
Abstract
Game forums are an important channel for player suggestions and feedback: game developers typically follow forum opinion closely to discover existing and potential problems in a game. Forum data quality is uneven, however, and manually sifting valuable posts out of a large volume of forum content costs considerable time and effort and is prone to oversights. To respond to useful player feedback more quickly and efficiently, automatically acquiring forum data and screening and analyzing it becomes a key way to improve work efficiency.
The system analyzes and processes data from a game suggestion-and-feedback forum and consists of three main modules: data extraction, data analysis, and display of analysis results. Development used the Python IDE PyCharm and the MySQL database management system; the main Python libraries involved are jieba for Chinese natural-language processing, PyMySQL for accessing the MySQL database, and wordcloud for drawing word-cloud images. A web crawler fetches the forum content, the Beautiful Soup library parses it, and the data to be analyzed is extracted and saved to the database. The jieba library then performs Chinese word segmentation on that data, the segmentation results are scored for value, and the results are rendered as HTML. In addition, the high-frequency words matching different filter conditions can be displayed as a word cloud, so that users can quickly grasp the forum's high-frequency information.
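The extraction step described above (crawl a page, parse it, collect the post data) can be sketched in Python. The thesis uses the Beautiful Soup library; for a dependency-free illustration, the standard library's html.parser stands in here, and the markup (a div with class "post-title") is a hypothetical example of forum HTML, not the real forum's structure.

```python
# Minimal sketch of the extraction step, using the standard-library
# html.parser in place of Beautiful Soup. The tag/class names are
# hypothetical -- real forum markup will differ.
from html.parser import HTMLParser


class PostTitleParser(HTMLParser):
    """Collects the text of every <div class="post-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "post-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


# A fabricated page fragment standing in for a crawled forum page.
page = (
    '<div class="post-title">Lag after patch 1.2</div>'
    '<div class="post-body">details...</div>'
    '<div class="post-title">Suggestion: add a replay mode</div>'
)
parser = PostTitleParser()
parser.feed(page)
print(parser.titles)  # the extracted titles, ready to save to the database
```

In the real system, the collected fields would then be written to MySQL through PyMySQL rather than printed.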
The system extracts and analyzes forum data effectively, making forum data acquisition more convenient, faster, and more intuitive. To a certain extent it saves the time and effort that data watchers would otherwise invest in manually screening post content, improving work efficiency.
Keywords: Web Crawler; Chinese Word Segmentation; Word Cloud
Abstract
Game forums are an important feedback channel for players' suggestions. Game developers usually need to pay close attention to the trend of forum opinion to find existing and potential problems in a game. However, the quality of forum data is uneven: manually collecting valuable post information from a large number of forum posts takes considerable time and effort and is prone to errors. To respond to useful player feedback more quickly and efficiently, automatically acquiring forum data and screening and analyzing it has become a key way to improve work efficiency.
The system mainly realizes the analysis and processing of game suggestion-and-feedback forum data and consists of three main modules: data extraction, data analysis, and display of analysis results. The Python IDE PyCharm and the MySQL database management system were used in development. The main Python libraries involved are the natural-language-processing toolkit jieba, the PyMySQL library for operating the MySQL database, and the wordcloud library for drawing word clouds. The system crawls the forum content with a web crawler, parses it with the Beautiful Soup library, extracts the data to be analyzed, and saves it to the database; the jieba library then performs Chinese word segmentation on that data. On this basis, the segmentation results are scored for value and the results are displayed as HTML. In addition, the high-frequency words satisfying different screening conditions can be displayed as a word cloud, so that users can quickly grasp the high-frequency information in the forum.
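The segmentation-and-frequency step above can be sketched as follows. In the actual system, jieba.lcut(text) produces the token list from raw post text; here a pre-segmented sample list and a hypothetical stopword set stand in so the sketch stays dependency-free, and the top-word ranking is what the wordcloud library would then visualize.

```python
# Sketch of the analysis step: count token frequencies and keep the top
# words for the word cloud. In the real system the token list comes from
# jieba.lcut(post_text); a pre-segmented sample stands in here, and the
# stopword set is a hypothetical example.
from collections import Counter


def top_words(tokens, stopwords, n=3):
    """Return the n most frequent tokens, excluding stopwords."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Sample tokens as jieba might segment a few post snippets.
tokens = ["游戏", "卡顿", "建议", "卡顿", "优化", "卡顿", "建议", "的"]
stopwords = {"的"}
print(top_words(tokens, stopwords))  # [('卡顿', 3), ('建议', 2), ('游戏', 1)]
```

The resulting (word, count) pairs map directly onto word sizes in the word-cloud image.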
The system effectively extracts and analyzes forum data, making forum data acquisition more convenient, faster, and more intuitive. To a certain extent, it saves the time and effort that data watchers invest in manually screening post content and improves work efficiency.
Keywords: Web Crawler; Chinese Word Segmentation; Word Cloud
