Integration of Big Data: A Survey
Jingya Hui, Lingli Li(&), and Zhaogong Zhang
Department of Computer Science and Technology,
Heilongjiang University, Harbin, China
ahhjy0807@163, lilingli_grace@163,
zhaogong.zhang@qq
Abstract. Data integration provides users with a uniform interface to multiple
heterogeneous data sources. This problem has attracted considerable
attention from both research and industry. In this paper, we survey the
state-of-the-art approaches to data integration, which are roughly divided into five
parts: schema matching, entity resolution, data fusion, integration systems, and
newly arising problems.
Keywords: Data integration · Schema matching · Entity resolution ·
Data fusion
1 Introduction
Data integration systems offer users a uniform interface to a set of data sources. A typical interaction proceeds in three steps: first, the user submits a query over a mediated schema; second, the system reformulates the query over the relevant sources based on schema matching; third, the answers from the relevant sources are combined. Because of its importance, data integration has attracted significant research attention. On the one hand, challenges in the basic operations, such as schema matching, entity resolution, and data fusion, continue to appear in new contexts; on the other hand, new problems have attracted interest, such as integrating extremely large data sets, extremely large numbers of data sets, and extremely heterogeneous data sets, as in a data lake.
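The three-step query flow described above can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration, not part of any surveyed system: the mediated attributes (`name`, `city`), the two sources `s1` and `s2`, their attribute names, and the restriction to simple equality queries.

```python
# Hypothetical sources, each using its own attribute names.
SOURCES = {
    "s1": [{"full_name": "Ann Lee", "town": "Harbin"},
           {"full_name": "Bo Chen", "town": "Beijing"}],
    "s2": [{"nm": "Cai Wu", "location": "Harbin"}],
}

# Assumed output of schema matching: mediated attribute -> source attribute.
MAPPINGS = {
    "s1": {"name": "full_name", "city": "town"},
    "s2": {"name": "nm", "city": "location"},
}


def reformulate(query, mapping):
    """Rewrite an equality query on the mediated schema into source terms."""
    return {mapping[attr]: value for attr, value in query.items()}


def answer(query):
    """Run the query on every source and combine answers in mediated terms."""
    results = []
    for source_id, records in SOURCES.items():
        mapping = MAPPINGS[source_id]
        source_query = reformulate(query, mapping)
        for rec in records:
            # Keep records that satisfy every equality condition.
            if all(rec.get(a) == v for a, v in source_query.items()):
                # Translate the record back to mediated attribute names.
                results.append({m: rec[s] for m, s in mapping.items()})
    return results


if __name__ == "__main__":
    print(answer({"city": "Harbin"}))
```

The user only ever sees the mediated attribute names; the per-source mappings carry all knowledge of how each source names its attributes.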
The work of a general data integration system can be roughly divided into three steps: (1) schema matching: generating correspondences between elements of two given schemas; (2) entity resolution: identifying duplicate records that represent the same real-world entity; and (3) data fusion: resolving conflicts and finding the truth among different data sources.
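To make the three steps concrete, the following is a deliberately simplified sketch. The techniques are stand-ins chosen for brevity and are assumptions rather than methods taken from the surveyed work: attribute names are matched by string similarity, duplicates are detected by an exact match on a normalized key, and conflicts are resolved by majority vote. All function names and the threshold value are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(schema_a, schema_b, threshold=0.6):
    """Step 1: pair each attribute of schema_a with its most similar
    attribute name in schema_b, if the similarity reaches the threshold."""
    matches = {}
    for a in schema_a:
        best = max(schema_b, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            matches[a] = best
    return matches


def resolve_entities(records, key="name"):
    """Step 2: group records whose normalized key value is identical."""
    clusters = defaultdict(list)
    for rec in records:
        # Normalize case and collapse repeated whitespace.
        normalized = " ".join(rec[key].lower().split())
        clusters[normalized].append(rec)
    return list(clusters.values())


def fuse(cluster):
    """Step 3: for each attribute, keep the most frequent non-empty value."""
    fused = {}
    attrs = {a for rec in cluster for a in rec}
    for attr in attrs:
        values = [rec[attr] for rec in cluster if rec.get(attr)]
        if values:
            fused[attr] = Counter(values).most_common(1)[0][0]
    return fused


if __name__ == "__main__":
    print(match_schemas(["name", "phone"], ["Name", "phone_no"]))
    records = [
        {"name": "Ann Lee", "city": "Harbin"},
        {"name": "ann  lee", "city": "Harbin"},
        {"name": "Ann Lee", "city": "Haerbin"},
        {"name": "Bo Chen", "city": "Beijing"},
    ]
    for cluster in resolve_entities(records):
        print(fuse(cluster))
```

Real systems replace each stand-in with something more robust, but the division of labor is the same: correspondences between attributes first, then clusters of records per entity, then one fused record per cluster.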
1.1 Schema Matching
Schema matching is the problem of finding correspondences between elements of two schemas. It is usually the first step in data integration. The process of schema matching often consists of the following steps: (1) mediated schema, which provides a unified view of the different data sources; (2) attribute matching, which matches the attributes in each source schema to the corresponding attributes in the mediated schema; and
© Springer Nature Singapore Pte Ltd. 2018
Q. Zhou et al. (Eds.): ICPCSEE 2018, CCIS 901, pp. 101–121, 2018.
/10.1007/978-981-13-2203-7_9
