Apache Atlas: A Deep Dive


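Atlas is Apache's metadata management and data governance platform for big data. It is an open-source project that the Hadoop community created to solve metadata governance in the Hadoop ecosystem. It gives Hadoop clusters core metadata governance capabilities, including data classification, a centralized policy engine, data lineage, security, and lifecycle management. It supports metadata management for Hive, Storm, Kafka, HBase, Sqoop, and other components, and presents data lineage as a graph.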

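• Pre-defined types for various Hadoop and non-Hadoop metadata

• Ability to define new types for the metadata to be managed

• Types can have primitive attributes, complex attributes, and object references, and can inherit from other types

• Instances of types, called entities, capture metadata object details and their relationships

• REST APIs to work with types and instances allow easier integration

• Ability to dynamically create classifications, such as PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE

• Classifications can include attributes, such as the expiry_date attribute in the EXPIRES_ON classification

• Entities can be associated with multiple classifications, enabling easier discovery and security enforcement

• Propagation of classifications via lineage automatically ensures that classifications follow the data as it goes through various processing steps

• Intuitive UI to view the lineage of data as it moves through various processes

• REST APIs to access and update lineage

• Intuitive UI to search entities by type, classification, attribute value, or free text

• Rich REST APIs to search by complex criteria

• SQL-like query language to search entities: Domain Specific Language (DSL)

• Fine-grained security for metadata access, enabling controls on access to entity instances and operations such as add/update/remove classifications

• Integration with Apache Ranger enables authorization and data masking on data access based on the classifications associated with entities in Apache Atlas. For example:

• who can access data classified as PII, SENSITIVE

• customer-service users can only see the last 4 digits of columns classified as NATIONAL_ID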
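To make the last example concrete, here is a minimal Python sketch of a tag-based masking rule. It is not Ranger's implementation: the `Column` class, the `read_value` function, and the `customer-service` group name are hypothetical and exist only to illustrate how a classification attached to a column can drive what a user sees.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """A column together with the classifications (tags) attached to it."""
    name: str
    classifications: set = field(default_factory=set)


def mask_show_last_4(value: str) -> str:
    """Replace every character except the last four with 'x'."""
    if len(value) <= 4:
        return value
    return "x" * (len(value) - 4) + value[-4:]


def read_value(column: Column, value: str, user_groups: set) -> str:
    """Apply a tag-based masking rule before returning a column value."""
    # Hypothetical rule: customer-service users see only the last four
    # characters of anything classified as NATIONAL_ID.
    if "NATIONAL_ID" in column.classifications and "customer-service" in user_groups:
        return mask_show_last_4(value)
    return value


ssn = Column("ssn", {"PII", "NATIONAL_ID"})
print(read_value(ssn, "123456789", {"customer-service"}))  # xxxxx6789
print(read_value(ssn, "123456789", {"compliance"}))        # 123456789
```

The point of the design is that the rule refers to the classification, not to the column name, so any column that later receives the same tag is covered without a new rule.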

I. Architecture

The overall architecture is shown in the figure below:

Type System: Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called 'types'. Instances of 'types', called 'entities', represent the actual metadata objects that are managed. The Type System is the component that allows users to define and manage types and entities. All metadata objects managed by Atlas out of the box (such as Hive tables) are modelled using types and represented as entities. To store new types of metadata in Atlas, one needs to understand the concepts of the type system component.

One key point to note is that the generic nature of the modelling in Atlas allows data stewards and integrators to define both technical metadata and business metadata. It is also possible to define rich relationships between the two using features of Atlas.
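The relationship between types, super-types, and entities can be sketched in a few lines of Python. This is a toy model, not the Atlas type system API: `TypeDef`, `REGISTRY`, `all_attributes`, and `create_entity` are hypothetical names, and the attribute lists are shortened. The only thing it tries to show is how a type inherits attributes from its super-types and how an entity is validated against its type.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """A simplified type definition: a name, own attributes, super-types."""
    name: str
    attributes: list
    super_types: list = field(default_factory=list)


REGISTRY = {}


def register(type_def):
    REGISTRY[type_def.name] = type_def


def all_attributes(type_name):
    """Collect attributes of a type and its super-types, parents first."""
    type_def = REGISTRY[type_name]
    collected = []
    for parent in type_def.super_types:
        for attr in all_attributes(parent):
            if attr not in collected:
                collected.append(attr)
    for attr in type_def.attributes:
        if attr not in collected:
            collected.append(attr)
    return collected


def create_entity(type_name, **values):
    """Create an entity (an instance of a type), rejecting unknown attributes."""
    unknown = set(values) - set(all_attributes(type_name))
    if unknown:
        raise ValueError(f"unknown attributes for {type_name}: {sorted(unknown)}")
    return {"typeName": type_name, "attributes": values}


# Shortened, illustrative attribute lists.
register(TypeDef("Referenceable", ["qualifiedName"]))
register(TypeDef("Asset", ["name", "description", "owner"], ["Referenceable"]))
register(TypeDef("DataSet", [], ["Asset"]))
register(TypeDef("hive_table", ["db", "columns", "tableType"], ["DataSet"]))

print(all_attributes("hive_table"))
table = create_entity("hive_table", qualifiedName="sales.orders@primary", name="orders")
print(table["typeName"])
```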

Graph Engine: Internally, Atlas persists the metadata objects it manages using a graph model. This approach provides great flexibility and enables efficient handling of rich relationships between the metadata objects. The graph engine component is responsible for translating between the types and entities of the Atlas type system and the underlying graph persistence model. In addition to managing the graph objects, the graph engine also creates the appropriate indices for the metadata objects so that they can be searched efficiently. Atlas uses JanusGraph to store the metadata objects.

The underlying storage for JanusGraph data can be hbase, cassandra, embeddedcassandra, berkeleyje, inmemory (data held directly in memory), and others.

Ingest / Export: The Ingest component allows metadata to be added to Atlas. Similarly, the Export component exposes metadata changes detected by Atlas as events. Consumers can consume these change events to react to metadata changes in real time.

The Atlas search engine supports Solr and Elasticsearch.

Applications:

Atlas Admin UI: This component is a web-based application that allows data stewards and scientists to discover and annotate metadata. Of primary importance here is a search interface and a SQL-like query language that can be used to query the metadata types and objects managed by Atlas. The Admin UI uses the REST API of Atlas for building its functionality.

Tag Based Policies: Apache Ranger is an advanced security management solution for the Hadoop ecosystem, with wide integration across a variety of Hadoop components. By integrating with Atlas, Ranger allows security administrators to define metadata-driven security policies for effective governance. Ranger is a consumer of the metadata change events notified by Atlas.

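- Business Taxonomy: The metadata objects ingested into Atlas from metadata sources are primarily a technical form of metadata. To enhance discoverability and governance, Atlas provides a business taxonomy interface that lets users first define a set of business terms representing their business domain and then associate those terms with the metadata entities that Atlas manages. The business taxonomy is a web application that is currently part of the Atlas Admin UI and integrates with Atlas through the REST API.

- In HDP 2.5, Business Taxonomy is provided as a Technical Preview. To enable it, add atlas.able=true under Atlas > Configs > Advanced > Custom application-properties and restart the Atlas service.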

Users can manage metadata in Atlas using two methods:

API: All functionality of Atlas is exposed to end users via a REST API that allows types and entities to be created, updated and deleted. It is also the primary mechanism to query and discover the types and entities managed by Atlas.
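As a rough illustration of querying through the REST API, the sketch below builds (but does not send) a basic-search request. It assumes an Atlas server on `localhost:21000` and the v2 basic-search endpoint with `typeName`, `query`, and `limit` parameters; check these against the API documentation of the Atlas version you run. The function name `basic_search_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local Atlas server listening on port 21000.
ATLAS_BASE_URL = "http://localhost:21000"


def basic_search_request(type_name, query=None, limit=25):
    """Build a GET request for the (assumed) v2 basic-search endpoint."""
    params = {"typeName": type_name, "limit": limit}
    if query:
        params["query"] = query
    url = f"{ATLAS_BASE_URL}/api/atlas/v2/search/basic?{urlencode(params)}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


req = basic_search_request("hive_table", query="sales")
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) additionally requires credentials that match your deployment, which is why the sketch stops at building it.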

Messaging: In addition to the API, users can choose to integrate with Atlas using a messaging interface that is based on Kafka. This is useful both for communicating metadata objects to Atlas, and also to consume metadata change events from Atlas using which applications can be built. The messaging interface is particularly useful if one wishes to use a more loosely coupled integration with Atlas that could allow for better scalability, reliability etc. Atlas uses Apache Kafka as a notification server for communication between hooks and downstream consumers of metadata notification events. Events are written by the hooks and Atlas to different Kafka topics.
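A downstream consumer of these change events essentially decodes each message and dispatches on the kind of change. The sketch below shows only that dispatch step, without a Kafka client. The message shape (`operationType` plus an `entity` object) is a simplified assumption rather than the exact Atlas notification schema, and the handler names and the GUID are hypothetical.

```python
import json


def on_entity_create(entity):
    return f"index new {entity['typeName']} {entity['guid']}"


def on_classification_add(entity):
    return f"re-evaluate policies for {entity['typeName']} {entity['guid']}"


# Map the kind of change to the code that reacts to it.
HANDLERS = {
    "ENTITY_CREATE": on_entity_create,
    "CLASSIFICATION_ADD": on_classification_add,
}


def handle_entity_event(raw_message, handlers=HANDLERS):
    """Decode one change event and dispatch it; unknown operations are ignored."""
    event = json.loads(raw_message)
    handler = handlers.get(event.get("operationType"))
    if handler is None:
        return None
    return handler(event.get("entity", {}))


# Hypothetical message, as it might arrive from the entity-change topic.
message = json.dumps({
    "operationType": "CLASSIFICATION_ADD",
    "entity": {"typeName": "hive_column", "guid": "hypothetical-guid-1"},
})
print(handle_entity_event(message))
```

In a real consumer, `handle_entity_event` would be called from the poll loop of whatever Kafka client library you use.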

Metadata source

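Atlas supports integration with many metadata sources, and more integrations will be added in the future. Currently, Atlas supports ingesting and managing metadata from the following sources:

- Hive: through the Hive bridge, Atlas can ingest Hive metadata, including hive_db/hive_table/hive_column/hive_process

- Sqoop: through the Sqoop bridge, Atlas can ingest metadata for relational databases, including sqoop_operation_type/sqoop_dbstore_usage/sqoop_process/sqoop_dbdatastore

- Falcon: through the Falcon bridge, Atlas can ingest Falcon metadata, including falcon_cluster/falcon_feed/falcon_feed_creation/falcon_feed_replication/falcon_process

- Storm: through the Storm bridge, Atlas can ingest stream-processing metadata, including storm_topology/storm_spout/storm_bolt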

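Integrating a big-data component as a metadata source in Atlas requires two things:

- First, define a metadata model, based on the Atlas type system, that can express the component's metadata objects (for example, the Hive metadata model is implemented in org.apache.del.HiveDataModelGenerator);

- Then, provide a hook component that extracts metadata objects from the component's metadata source, listens for metadata changes in real time, and reports them to Atlas.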
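The second requirement, turning a source-side object into entities that Atlas can store, might look like the sketch below. It assumes the commonly used naming convention in which a Hive table's qualifiedName is `db.table@clusterName` and a column's is `db.table.column@clusterName`; treat that convention, and the helper names `table_qualified_name` and `hive_table_entities`, as assumptions of this example rather than the hook's actual code.

```python
def table_qualified_name(db, table, cluster="primary"):
    """Build a unique name for a table (assumed db.table@cluster convention)."""
    return f"{db.lower()}.{table.lower()}@{cluster}"


def hive_table_entities(db, table, columns, cluster="primary"):
    """Turn one table description into a list of entity dictionaries."""
    table_entity = {
        "typeName": "hive_table",
        "attributes": {
            "qualifiedName": table_qualified_name(db, table, cluster),
            "name": table.lower(),
        },
    }
    column_entities = [
        {
            "typeName": "hive_column",
            "attributes": {
                "qualifiedName": f"{db.lower()}.{table.lower()}.{name.lower()}@{cluster}",
                "name": name.lower(),
                "type": col_type,
            },
        }
        for name, col_type in columns
    ]
    return [table_entity] + column_entities


entities = hive_table_entities("Sales", "Orders", [("id", "bigint"), ("amount", "double")])
for entity in entities:
    print(entity["typeName"], entity["attributes"]["qualifiedName"])
```

A real hook would build such entities inside the component's event callback and hand them to Atlas through the messaging interface.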

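The overall metadata processing flow is shown in the figure below:

Querying a single metadata object in Atlas often requires traversing multiple vertices and edges in the graph database, which is considerably more complex than reading a single row from a relational database. Using a graph database as the underlying store does have its advantages, however: it supports complex data types and handles reads and writes of lineage data better.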
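The traversal cost mentioned above is easy to see in a toy example. The sketch below walks a small in-memory graph breadth-first to collect everything upstream of a dataset. The graph contents and the name `upstream_lineage` are hypothetical, and a plain dictionary stands in for the graph database.

```python
from collections import deque

# vertex -> vertices it was derived from (one hop upstream).
# Datasets and processes alternate, as they do in a lineage graph.
EDGES = {
    "report_view": ["ctas_report"],
    "ctas_report": ["orders_clean"],
    "orders_clean": ["etl_clean"],
    "etl_clean": ["orders_raw", "customers_raw"],
}


def upstream_lineage(start):
    """Return every vertex reachable upstream of `start`, in BFS order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for parent in EDGES.get(vertex, []):
            if parent not in visited:
                visited.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(upstream_lineage("report_view"))
```

Answering "where did report_view come from" touches five vertices here, whereas a relational lookup of the view itself would touch one row. That is the trade-off the paragraph describes.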

II. Installation and Configuration

2. After downloading the source, build the package as follows:

tar xvfz apache-atlas-1.0.

cd apache-atlas-sources-1.0.0/

export MAVEN_OPTS="-Xms2g -Xmx2g"

Install: mvn clean -DskipTests install

mvn clean -DskipTests package -Pdist

To bundle embedded HBase and Solr in the package: mvn clean -DskipTests package -Pdist,embedded-hbase-solr

To bundle embedded Cassandra and Solr in the package: mvn clean package -Pdist,embedded-cassandra-solr

3. Configuration and startup

tar -xzvf apache-atlas-{project.version}-

cd atlas-{project.version}/conf

Edit the atlas-application.properties configuration file.

Graph Persistence engine - HBase configuration:

aph.storage.backend=hbase

Graph Index Search Engine configuration:

Graph Search Index - Solr:

aph.index.search.backend=solr5
# ZK quorum setup for solr as comma separated value. Example: 10.1.6.4:2181,10.1.6.5:
aph.index.keeper-url=
# SolrCloud Zookeeper Connection Timeout. Default value is 60000ms
# SolrCloud Zookeeper Session Timeout. Default value is 60000ms

Graph Search Index - Elasticsearch (Tech Preview):

aph.index.search.backend=elasticsearch

Notification Configs:

atlas.able=false
#Kafka servers. Example: localhost:6667
atlas.kafka.bootstrap.servers=
atlas.up.id=atlas
#Zookeeper connect URL for Kafka. Example: localhost:2181
tion.timeout.ms=30000
keeper.session.timeout.ms=60000
keeper.sync.time.ms=20
#Setup the following configurations only in test deployments where Kafka is started within Atlas in embedded mode
#bedded=true
#atlas.kafka.data={sys:atlas.home}/data/kafka
#Setup the following two properties if Kafka is running in Kerberized mode.
#ification.kafka.service.principal=kafka/_HOST@EXAMPLE.COM
#ification.kafka.keytab.location=/etc/security/keytabs/kafka.service.keytab

Client Configs:

adTimeoutMSecs=60000
tTimeoutMSecs=60000
# URL to access Atlas server. For example: localhost:
st.address=

SSL config:

High Availability Properties:

# Set the following property to true, to enable High Availability. Default = false.

atlas.abled=true

# Specify the list of Atlas instances

atlas.server.ids=id1,id2

# For each instance defined above, define the host and port on which Atlas server listens.

atlas.server.address.id1=host1pany:21000

atlas.server.address.id2=host2pany:31000

# Specify Zookeeper properties needed for HA.

# Specify the list of services running Zookeeper servers as a comma separated list.

atlas.t=zk1pany:2181,zk2pany:2181,zk3pany:2181

# Specify how many times a connection to the Zookeeper cluster should be attempted, in case of any connection issues.

atlas.ies=3

# Specify how much time should the server wait before attempting connections to Zookeeper, in case of any connection issues.

atlas.sleeptime.ms=1000

# Specify how long a session to Zookeeper may stay inactive before it is deemed unreachable.

atlas.keeper.session.timeout.ms=20000

# Specify the scheme and the identity to be used for setting up ACLs on nodes created in Zookeeper for HA.

# The format of these options is <scheme:identity>.

# For more information refer to /doc/r3.2.2/zookeeperProgrammers.html#sc_ZooKeeperAccessControl

# The 'acl' option specifies a scheme, identity pair for which to set up an ACL.

atlas.keeper.acl=sasl:client@company

# The 'auth' option specifies the authentication that should be used for connecting to Zookeeper.

atlas.keeper.auth=sasl:client@company

# Since Zookeeper is a shared service that is typically used by many components,

# it is preferable for each component to set its znodes under a namespace.

# Specify the namespace under which the znodes should be written. Default = /apache_atlas

atlas.keeper.zkroot=/apache_atlas

# Specify number of times a client should retry with an instance before selecting another active instance, or failing an operation.

atlas.ies=4

# Specify interval between retries for a client.

atlas.client.ha.sleep.interval.ms=5000
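The last two client properties describe a retry-then-fail-over behaviour: try one instance a fixed number of times with a pause in between, then move on to the next. A minimal sketch of that logic follows. It is not the Atlas client's implementation; `call_with_failover`, the fake operation, and the short host names are hypothetical, and the sleep function is injectable so that the example runs instantly.

```python
import time


def call_with_failover(instances, operation, retries=4,
                       sleep_interval_ms=5000, sleep=time.sleep):
    """Try each instance up to `retries` times before moving to the next one."""
    last_error = None
    for address in instances:
        for _ in range(retries):
            try:
                return operation(address)
            except ConnectionError as err:
                last_error = err
                sleep(sleep_interval_ms / 1000)  # pause between retries
    raise RuntimeError("no active Atlas instance reachable") from last_error


calls = []


def fake_operation(address):
    """Stand-in for a REST call: host1 is down, host2 answers."""
    calls.append(address)
    if address.startswith("host1"):
        raise ConnectionError("instance unreachable")
    return f"ok from {address}"


result = call_with_failover(["host1:21000", "host2:31000"], fake_operation,
                            sleep=lambda seconds: None)
print(result, len(calls))
```

With `retries=4`, the first instance is tried four times before the second one answers, so the fake operation is called five times in total.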

cd atlas-{project.version}

bin/atlas_start.py

III. Setting Up the Hive Hook

Supported Hive model:

Hive model includes the following types:

Entity types:

hive_db
super-types: Asset
attributes: qualifiedName, name, description, owner, clusterName, location, parameters, ownerName

hive_table
super-types: DataSet
attributes: qualifiedName, name, description, owner, db, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary

hive_column
super-types: DataSet
attributes: qualifiedName, name, description, owner, type, comment, table

hive_storagedesc
super-types: Referenceable
attributes: qualifiedName, table, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories

hive_process
super-types: Process
attributes: qualifiedName, name, description, owner, inputs, outputs, startTime, endTime, userName, operationType, queryText, queryPlan, queryId, clusterName

hive_column_lineage
super-types: Process
attributes: qualifiedName, name, description, owner, inputs, outputs, query, depenendencyType, expression

Enum types:

hive_principal_type
values: USER, ROLE, GROUP

Struct types:

hive_order
attributes: col, order

hive_serde
attributes: name, serializationLib, parameters

Add the following configuration to Hive's l configuration file:

<property>

<name>post.hooks</name>

<value>org.apache.atlas.hive.hook.HiveHook</value>

</property>

untar apache-atlas-${project.version}-

cd apache-atlas-hive-hook-${project.version}

Copy the entire contents of the folder apache-atlas-hive-hook-${project.version}/hook/hive to <atlas package>/hook/hive

Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' to hive-env.sh in your Hive configuration

Copy <atlas-conf>/atlas-application.properties to the Hive conf directory.

An example atlas-application.properties configuration:

# whether to run the hook synchronously. false recommended to avoid delays in Hive query completion. Default: false
atlas.hook.hive.synchronous=false
# number of retries for notification failure. Default: 3
atlas.hook.hive.numRetries=3
# queue size for the threadpool. Default: 10000
atlas.hook.hive.queueSize=10000
# clusterName to use in qualifiedName of entities. Default: primary
atlas.cluster.name=primary
# Zookeeper connect URL for Kafka. Example: localhost:2181
t=
# Zookeeper connection timeout. Default: 30000
tion.timeout.ms=30000
# Zookeeper session timeout. Default: 60000
keeper.session.timeout.ms=60000
# Zookeeper sync time. Default: 20
keeper.sync.time.ms=20
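The first three hook settings interact: with synchronous=false, notifications are handed to a bounded queue and sent by a background worker, which retries on failure. The sketch below illustrates that pattern in plain Python. It is not the Hive hook's code; `AsyncNotifier` and its methods are hypothetical, and it treats the retry count as the total number of send attempts, which is an assumption.

```python
import queue
import threading


class AsyncNotifier:
    """Queue notifications and send them from a background thread."""

    def __init__(self, send, queue_size=10000, num_retries=3):
        self._send = send
        self._num_retries = num_retries
        self._queue = queue.Queue(maxsize=queue_size)
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, message):
        """Called on the query path; never blocks. False means the queue is full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            message = self._queue.get()
            try:
                for _ in range(self._num_retries):
                    try:
                        self._send(message)
                        break
                    except Exception:
                        pass  # try again until the attempts are exhausted
                # If every attempt failed, the message is dropped here.
            finally:
                self._queue.task_done()

    def drain(self):
        """Block until every queued message has been processed."""
        self._queue.join()


delivered = []
attempts = {"count": 0}


def flaky_send(message):
    """Stand-in for publishing a notification: fails once, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise IOError("broker unavailable")
    delivered.append(message)


notifier = AsyncNotifier(flaky_send, queue_size=10, num_retries=3)
notifier.submit("CREATE TABLE orders")
notifier.drain()
print(delivered, attempts["count"])
```

Because `submit` returns immediately, a slow or unavailable broker delays only the background worker and not the query that triggered the hook. That is the reason the comment in the configuration recommends the asynchronous mode.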

Usage 1: <atlas package>/hook-bin/import-hive.sh

Usage 2: <atlas package>/hook-bin/import-hive.sh [-d <database regex> OR --database <database regex>] [-t <table regex> OR --table <table regex>]

Usage 3: <atlas package>/hook-bin/import-hive.sh [-f <filename>]

File Format:

database1:tbl1

database1:tbl2

database2:tbl1
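The file passed with -f is a plain list of database:table pairs, one per line. The following sketch parses that format into a dictionary, which can be handy for checking a file before running the import. `parse_import_file` is a hypothetical helper, and it accepts only the database:table form shown above; whether the script itself accepts other line forms is not covered here.

```python
def parse_import_file(text):
    """Parse 'database:table' lines into {database: [tables]}."""
    selection = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        database, sep, table = line.partition(":")
        if not sep or not database or not table:
            raise ValueError(
                f"line {line_number}: expected database:table, got {line!r}"
            )
        selection.setdefault(database, []).append(table)
    return selection


sample = "database1:tbl1\ndatabase1:tbl2\ndatabase2:tbl1\n"
print(parse_import_file(sample))
```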

To be continued; the remaining sections will be added soon.
