Classification No.:        Student ID: M*********        University Code: 10487        Confidentiality Level:
Master's Thesis
The Design and Implementation of a Forum Data Analysis System
Based on a Web Crawler
Candidate: Li Xi
Major: Software Engineering
Supervisor: Assoc. Prof. Cao Hua
Defense Date: December 17, 2018
A Thesis Submitted in Partial Fulfillment of the Requirements
for the Degree of Master of Engineering
The Design and Implementation of Forum Data Analysis System Based on Web Crawler
Candidate  : Li Xi
Major : Software Engineering
Supervisor : Assoc. Prof. Cao Hua
Huazhong University of Science & Technology
Wuhan 430074, P.R.China
December, 2018
Abstract
Game forums are an important channel for player suggestions and feedback: game developers typically follow forum opinion closely to discover existing and potential problems in a game. Forum data quality is uneven, however, and manually sifting valuable posts out of a large volume of forum content costs considerable time and effort and is prone to oversights. To respond to useful player feedback more quickly and efficiently, automatically acquiring forum data and screening and analyzing it becomes a key way to improve work efficiency.
The system analyzes and processes data from a game suggestion-and-feedback forum and consists of three main modules: data extraction, data analysis, and display of analysis results. Development used the Python IDE PyCharm and the MySQL database management system; the main Python libraries involved are jieba for Chinese natural-language processing, PyMySQL for accessing the MySQL database, and wordcloud for drawing word-cloud images. A web crawler fetches the forum content, the Beautiful Soup library parses it, and the data to be analyzed is extracted and saved to the database. The jieba library then performs Chinese word segmentation on that data, the segmentation results are scored for value, and the results are rendered as HTML. In addition, the high-frequency words matching different filter conditions can be displayed as a word cloud, so that users can quickly grasp the forum's high-frequency information.
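The extraction step described above (crawl a page, parse it, collect the post data) can be sketched in Python. The thesis uses the Beautiful Soup library; for a dependency-free illustration, the standard library's html.parser stands in here, and the markup (a div with class "post-title") is a hypothetical example of forum HTML, not the real forum's structure.

```python
# Minimal sketch of the extraction step, using the standard-library
# html.parser in place of Beautiful Soup. The tag/class names are
# hypothetical -- real forum markup will differ.
from html.parser import HTMLParser


class PostTitleParser(HTMLParser):
    """Collects the text of every <div class="post-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "post-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


# A fabricated page fragment standing in for a crawled forum page.
page = (
    '<div class="post-title">Lag after patch 1.2</div>'
    '<div class="post-body">details...</div>'
    '<div class="post-title">Suggestion: add a replay mode</div>'
)
parser = PostTitleParser()
parser.feed(page)
print(parser.titles)  # the extracted titles, ready to save to the database
```

In the real system, the collected fields would then be written to MySQL through PyMySQL rather than printed.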
The system extracts and analyzes forum data effectively, making forum data acquisition more convenient, faster, and more intuitive. To a certain extent it saves the time and effort that data watchers would otherwise invest in manually screening post content, improving work efficiency.
Keywords: Web Crawler; Chinese Word Segmentation; Word Cloud
Abstract
Game forums are an important feedback channel for players' suggestions. Game developers usually need to pay close attention to the trend of forum opinion to find existing and potential problems in a game. However, the quality of forum data is uneven: manually collecting valuable post information from a large number of forum posts takes considerable time and effort and is prone to errors. To respond to useful player feedback more quickly and efficiently, automatically acquiring forum data and screening and analyzing it has become a key way to improve work efficiency.
The system mainly realizes the analysis and processing of game suggestion-and-feedback forum data and consists of three main modules: data extraction, data analysis, and display of analysis results. The Python IDE PyCharm and the MySQL database management system were used in development. The main Python libraries involved are the natural-language-processing toolkit jieba, the PyMySQL library for operating the MySQL database, and the wordcloud library for drawing word clouds. The system crawls the forum content with a web crawler, parses it with the Beautiful Soup library, extracts the data to be analyzed, and saves it to the database; the jieba library then performs Chinese word segmentation on that data. On this basis, the segmentation results are scored for value and the results are displayed as HTML. In addition, the high-frequency words satisfying different screening conditions can be displayed as a word cloud, so that users can quickly grasp the high-frequency information in the forum.
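The segmentation-and-frequency step above can be sketched as follows. In the actual system, jieba.lcut(text) produces the token list from raw post text; here a pre-segmented sample list and a hypothetical stopword set stand in so the sketch stays dependency-free, and the top-word ranking is what the wordcloud library would then visualize.

```python
# Sketch of the analysis step: count token frequencies and keep the top
# words for the word cloud. In the real system the token list comes from
# jieba.lcut(post_text); a pre-segmented sample stands in here, and the
# stopword set is a hypothetical example.
from collections import Counter


def top_words(tokens, stopwords, n=3):
    """Return the n most frequent tokens, excluding stopwords."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Sample tokens as jieba might segment a few post snippets.
tokens = ["游戏", "卡顿", "建议", "卡顿", "优化", "卡顿", "建议", "的"]
stopwords = {"的"}
print(top_words(tokens, stopwords))  # [('卡顿', 3), ('建议', 2), ('游戏', 1)]
```

The resulting (word, count) pairs map directly onto word sizes in the word-cloud image.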
The system effectively extracts and analyzes forum data, making forum data acquisition more convenient, faster, and more intuitive. To a certain extent, it saves the time and effort that data watchers invest in manually screening post content and improves work efficiency.
Keywords: Web Crawler; Chinese Word Segmentation; Word Cloud
