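Apache Atlas is a metadata management and data governance platform for big data under the Apache umbrella. It is an open-source project created by the Hadoop community to address metadata governance in the Hadoop ecosystem, and it gives Hadoop clusters core metadata governance capabilities, including data classification, a centralized policy engine, data lineage, security, and lifecycle management. It supports metadata management for Hive, Storm, Kafka, HBase, Sqoop, and other components, and presents data lineage as a graph.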
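• Pre-defined types for various Hadoop and non-Hadoop metadata
• Ability to define new types for the metadata to be managed
• Types can have primitive attributes, complex attributes, and object references, and can inherit from other types
• Instances of types, called entities, capture the details of metadata objects and their relationships
• REST APIs for working with types and instances make integration easier
• Ability to create classifications dynamically, such as PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE
• Classifications can include attributes, such as the expiry_date attribute in the EXPIRES_ON classification
• Entities can be associated with multiple classifications, which makes discovery and security enforcement easier
• Propagation of classifications via lineage automatically ensures that classifications follow the data as it goes through various processing steps
• Intuitive UI to view the lineage of data as it moves through various processing steps
• REST APIs to access and update lineage
• Intuitive UI to search entities by type, classification, attribute value, or free text
• Rich REST APIs to search by complex criteria
• SQL-like query language to search entities: a domain-specific language (DSL)
• Fine-grained security for metadata access, enabling controls on access to entity instances and on operations such as adding, updating, or removing classifications
• Integration with Apache Ranger enables authorization and data masking on data access based on the classifications associated with entities in Apache Atlas. For example:
• Who can access data classified as PII or SENSITIVE
• Customer-service users can only see the last 4 digits of columns classified as a national ID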
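As a minimal sketch of the classification features listed above, the following Python builds the request for attaching an EXPIRES_ON classification with an expiry_date attribute to an entity. It assumes the Atlas v2 REST endpoint POST /api/atlas/v2/entity/guid/{guid}/classifications; the base URL, the GUID, and the date value are hypothetical placeholders, and nothing is sent over the network.

```python
import json


def classification_request(base_url, guid, type_name, attributes, propagate=True):
    """Return (url, body) for attaching one classification to an entity.

    Assumption: the Atlas v2 endpoint
    POST {base_url}/api/atlas/v2/entity/guid/{guid}/classifications
    accepts a JSON list of classification objects.
    """
    url = f"{base_url.rstrip('/')}/api/atlas/v2/entity/guid/{guid}/classifications"
    body = [{
        "typeName": type_name,
        "attributes": dict(attributes),
        # propagate=True asks Atlas to carry the classification along lineage
        "propagate": propagate,
    }]
    return url, json.dumps(body)


# Hypothetical server address, GUID, and expiry date, for illustration only.
url, body = classification_request(
    "http://localhost:21000/",
    "abc-123",
    "EXPIRES_ON",
    {"expiry_date": "2030-01-01T00:00:00.000Z"},
)
print(url)
print(body)
```

The returned pair can be handed to any HTTP client together with the credentials your deployment requires.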
Type System: Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called ‘types’. Instances of ‘types’, called ‘entities’, represent the actual metadata objects that are managed. The Type System is the component that allows users to define and manage types and entities. All metadata objects managed by Atlas out of the box (Hive tables, for example) are modelled using types and represented as entities. To store new types of metadata in Atlas, one needs to understand the concepts of the type system component.
One key point to note is that the generic nature of the modelling in Atlas allows data stewards and integrators to define both technical metadata and business metadata. It is also possible to define rich relationships between the two using features of Atlas.
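To make the idea of a type concrete, here is a minimal sketch that assembles a type definition as JSON. It assumes the payload shape accepted by the Atlas v2 typedefs endpoint (POST /api/atlas/v2/types/typedefs); the type name sample_report and its attributes are hypothetical and exist only for illustration.

```python
import json


def attribute_def(name, type_name, optional=True):
    """Build one attribute definition in the shape assumed for Atlas typedefs."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": False,
    }


def entity_typedef(name, super_types, attributes):
    """Wrap a single entity type in the typedefs envelope."""
    return {
        "entityDefs": [{
            "name": name,
            "superTypes": list(super_types),
            "attributeDefs": [attribute_def(n, t) for n, t in attributes],
        }]
    }


# Hypothetical type that inherits from the built-in DataSet type.
payload = entity_typedef(
    "sample_report",
    ["DataSet"],
    [("format", "string"), ("rowCount", "long")],
)
print(json.dumps(payload, indent=2))
```

Inheriting from DataSet means the new type picks up the inherited attributes such as qualifiedName and name, so only the additional attributes need to be declared.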
Graph Engine: Internally, Atlas persists the metadata objects it manages using a graph model. This approach provides great flexibility and enables efficient handling of rich relationships between the metadata objects. The graph engine component is responsible for translating between the types and entities of the Atlas type system and the underlying graph persistence model. In addition to managing the graph objects, the graph engine also creates the appropriate indices for the metadata objects so that they can be searched efficiently. Atlas uses JanusGraph to store the metadata objects.
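To illustrate why a graph model suits metadata with rich relationships, here is a toy in-memory sketch. It is not Atlas or JanusGraph code: the class, the edge labels, and the entity names are all hypothetical. Entities are vertices, relationships are labelled edges, and a small index by type stands in for the search indices the graph engine maintains.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Toy graph: vertices are entity names, edges are labelled relationships."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # vertex -> [(label, target)]
        self.by_type = defaultdict(set)     # simple index: type name -> vertices

    def add_entity(self, name, type_name):
        self.by_type[type_name].add(name)

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))

    def downstream(self, start):
        """Breadth-first walk over outgoing edges; returns vertices in visit order."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            for _, target in self.out_edges[queue.popleft()]:
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


graph = MetadataGraph()
graph.add_entity("raw_orders", "hive_table")
graph.add_entity("load_orders", "hive_process")
graph.add_entity("orders_clean", "hive_table")
graph.add_edge("raw_orders", "input_to", "load_orders")
graph.add_edge("load_orders", "outputs", "orders_clean")
print(graph.downstream("raw_orders"))  # ['load_orders', 'orders_clean']
```

Following edges outward from a table is the same kind of traversal that a lineage view performs, which is why relationship-heavy metadata maps naturally onto a graph store.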
JanusGraph supports several underlying storage backends, including HBase, Cassandra, embedded Cassandra, BerkeleyJE, and in-memory storage (data held directly in memory). The Atlas search index supports Solr and Elasticsearch.
Ingest / Export: The Ingest component allows metadata to be added to Atlas. Similarly, the Export component exposes metadata changes detected by Atlas as events. Consumers can consume these change events to react to metadata changes in real time.
Atlas Admin UI: This component is a web-based application that allows data stewards and scientists to discover and annotate metadata. Of primary importance here are a search interface and a SQL-like query language that can be used to query the metadata types and objects managed by Atlas. The Admin UI uses the REST API of Atlas to build its functionality.
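The SQL-like query language is also reachable outside the UI. The sketch below only builds the URL for such a search and does not call a server. It assumes the Atlas v2 endpoint GET /api/atlas/v2/search/dsl with query, limit, and offset parameters; the host, port, and table name are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def dsl_search_url(base_url, query, limit=25, offset=0):
    """Build a GET URL for a DSL search (assumed endpoint: /api/atlas/v2/search/dsl)."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/dsl?{params}"


# Hypothetical server and table name.
url = dsl_search_url("http://localhost:21000", 'hive_table where name = "orders"')
print(url)

# Decoding the query string shows the DSL text survives URL encoding intact.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Building the URL with urlencode avoids hand-escaping the spaces and quotes that DSL queries usually contain.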
Tag Based Policies: Apache Ranger is an advanced security management solution for the Hadoop ecosystem, with wide integration across Hadoop components. By integrating with Atlas, Ranger allows security administrators to define metadata-driven security policies for effective governance. Ranger is a consumer of the metadata change events published by Atlas.
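The effect of a tag-based masking policy can be illustrated with a toy example. This is not Ranger policy syntax or a Ranger API: the tag name NATIONAL_ID, the group name customer_service, and both functions are hypothetical and only mimic the outcome of a "show last 4 digits" rule.

```python
def mask_show_last4(value, mask_char="x"):
    """Mask every character except the last four."""
    text = str(value)
    if len(text) <= 4:
        return text
    return mask_char * (len(text) - 4) + text[-4:]


def apply_tag_policy(row, column_tags, user_groups):
    """Return a copy of row with NATIONAL_ID columns masked for customer-service users."""
    masked = dict(row)
    if "customer_service" in user_groups:
        for column, tags in column_tags.items():
            if "NATIONAL_ID" in tags and column in masked:
                masked[column] = mask_show_last4(masked[column])
    return masked


row = {"name": "Ann", "ssn": "123456789"}
column_tags = {"ssn": {"NATIONAL_ID"}}  # tags as they might arrive from Atlas
print(apply_tag_policy(row, column_tags, {"customer_service"}))
print(apply_tag_policy(row, column_tags, {"auditors"}))
```

The point of the design is that the rule is keyed on the tag rather than on the column, so any column that later receives the same classification is covered without editing the policy.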
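Business Taxonomy: The metadata objects ingested into Atlas from metadata sources are primarily technical metadata. To improve discoverability and governance, Atlas provides a business taxonomy interface that lets users first define a set of business terms representing their business domain and then associate those terms with the metadata entities that Atlas manages. The business taxonomy is a web application that is currently part of the Atlas Admin UI and integrates with Atlas through the REST API.
In HDP 2.5, Business Taxonomy is provided as a Technical Preview. To enable it, add atlas.able=true under Atlas > Configs > Advanced > Custom application-properties and restart the Atlas service.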
Users can manage metadata in Atlas using two methods:
API: All functionality of Atlas is exposed to end users via a REST API that allows types and entities to be created, updated and deleted. It is also the primary mechanism to query and discover the types and entities managed by Atlas.
Messaging: In addition to the API, users can integrate with Atlas through a messaging interface based on Kafka. This is useful both for sending metadata objects to Atlas and for consuming metadata change events from Atlas, on top of which applications can be built. The messaging interface is particularly useful for a more loosely coupled integration with Atlas, which can allow better scalability and reliability. Atlas uses Apache Kafka as a notification server for communication between hooks and downstream consumers of metadata notification events. Events are written by the hooks and by Atlas to different Kafka topics.
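A downstream consumer typically reads change events from Kafka and reacts to each one. The sketch below covers only the reacting part: it takes already-received JSON strings and tallies them. The message shape (an envelope whose "message" object carries "operationType" and "entity") is a simplified assumption, the sample events are made up, and the Kafka client is deliberately omitted.

```python
import json
from collections import Counter


def summarize_events(raw_messages):
    """Count change events per (operationType, typeName) from JSON strings."""
    counts = Counter()
    for raw in raw_messages:
        message = json.loads(raw).get("message", {})
        operation = message.get("operationType", "UNKNOWN")
        type_name = message.get("entity", {}).get("typeName", "UNKNOWN")
        counts[(operation, type_name)] += 1
    return counts


# Hypothetical events in the assumed, simplified shape.
samples = [
    json.dumps({"message": {"operationType": "ENTITY_CREATE",
                            "entity": {"typeName": "hive_table"}}}),
    json.dumps({"message": {"operationType": "ENTITY_UPDATE",
                            "entity": {"typeName": "hive_table"}}}),
    json.dumps({"message": {"operationType": "ENTITY_CREATE",
                            "entity": {"typeName": "hive_table"}}}),
]
print(summarize_events(samples))
```

Because the consumer only depends on the event payload, it stays decoupled from the Atlas server itself, which is the loose coupling the messaging interface is meant to provide.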
Metadata source
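Atlas supports integration with many metadata sources, and more integrations will be added in the future. Currently, Atlas supports ingesting and managing metadata from the following sources:
- Hive: through the Hive bridge, Atlas can ingest Hive metadata, including hive_db/hive_table/hive_column/hive_process
- Sqoop: through the Sqoop bridge, Atlas can ingest relational database metadata, including sqoop_operation_type/
- Falcon: through the Falcon bridge, Atlas can ingest Falcon metadata, including falcon_cluster/falcon_feed/falcon_feed_creation/falcon_feed_replication/falcon_process
- Storm: through the Storm bridge, Atlas can ingest stream-processing metadata, including storm_topology/storm_spout/storm_bolt
- First, a metadata model that can express the metadata objects of the big data component must be defined on top of the Atlas type system (for example, the Hive metadata model is implemented in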
Run cd atlas-{project.version}/conf and edit the atlas-application.properties configuration file.
Graph Persistence engine - HBase configuration:
Graph Index Search Engine configuration: Graph Search Index - Solr:
Client Configs:
SSL config:
High Availability Properties:
3. Setting up the Hive Hook
Supported Hive model:
Hive model includes the following types:
Entity types:
super-types: Asset
attributes: qualifiedName, name, description, owner, clusterName, location, parameters, ownerName
hive_table
super-types: DataSet
attributes: qualifiedName, name, description, owner, db, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary
super-types: DataSet
attributes: qualifiedName, name, description, owner, type, comment, table
super-types: Referenceable
attributes: qualifiedName, table, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories
super-types: Process
attributes: qualifiedName, name, description, owner, inputs, outputs, startTime, endTime, userName, operationType, queryText, queryPlan, queryId, clusterName
super-types: Process
attributes: qualifiedName, name, description, owner, inputs, outputs, query, depenendencyType, expression
Enum types:
Struct types:
attributes: col, order
attributes: name, serializationLib, parameters
Add the following configuration to the Hive l configuration file:
atlas.hook.hive.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Hive query completion. Default: false
atlas.hook.hive.numRetries=3 # number of retries for notification failure. Default: 3
atlas.hook.hive.queueSize=10000 # queue size for the threadpool. Default: 10000
atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary
t= # Zookeeper connect URL for Kafka. Example: localhost:2181
tion.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000
keeper.session.timeout.ms=60000 # Zookeeper session timeout. Default: 60000
keeper.sync.time.ms=20 # Zookeeper sync time. Default: 20
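The listing above annotates each setting with a trailing # comment. Java-style properties files only treat # as a comment at the start of a line, so in a real atlas-application.properties the annotations belong on their own lines. The sketch below parses the annotated style for documentation purposes and shows how atlas.cluster.name feeds into an entity's qualifiedName. The <db>.<table>@<cluster> convention with lower-cased names is stated here as an assumption, and both helper functions and the sample names are hypothetical.

```python
def parse_annotated_properties(text):
    """Parse key=value lines, dropping trailing '# ...' annotations and empty values."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key and value:
            settings[key] = value
    return settings


def hive_table_qualified_name(db, table, cluster):
    """Assumed convention: <db>.<table>@<cluster>, with db and table lower-cased."""
    return f"{db.lower()}.{table.lower()}@{cluster}"


listing = """
atlas.hook.hive.synchronous=false # run the hook asynchronously
atlas.hook.hive.numRetries=3 # retries for notification failure
atlas.cluster.name=primary # clusterName used in qualifiedName
"""
settings = parse_annotated_properties(listing)
cluster = settings.get("atlas.cluster.name", "primary")
print(settings)
print(hive_table_qualified_name("Sales", "Orders", cluster))  # sales.orders@primary
```

Keeping the cluster name in the qualifiedName is what lets metadata from several clusters coexist in one Atlas instance without name clashes.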
