Sqoop Installation and Usage (sqoop-1.4.5 on hadoop 1.0.4)
1. What is Sqoop
Sqoop, short for "SQL to Hadoop", is a convenient tool for migrating data between traditional relational databases and Hadoop. It exploits MapReduce's parallelism to speed up batch data transfer, and it has evolved into two major versions, Sqoop1 and Sqoop2.
Sqoop is the bridge between relational databases and Hadoop: it supports transferring data in both directions between relational databases and Hive, HDFS, and HBase, with both full-table and incremental imports.
Why choose Sqoop?
- Efficient, controllable use of resources: task parallelism and timeouts can be tuned (see the sketch below)
- Data type mapping and conversion: handled automatically, and customizable by the user
- Support for many mainstream databases: MySQL, Oracle, SQL Server, DB2, and so on
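A minimal sketch of the first two points, assuming a hypothetical MySQL database named demo with an orders table whose primary key is id (adjust the JDBC URL and credentials to your environment):
# Full-table import with 4 parallel map tasks, splitting the work on the id column
sqoop import --connect jdbc:mysql://localhost:3306/demo --username hadoop -P \
  --table orders --split-by id -m 4 --target-dir /user/sqoop/orders
# Incremental import: fetch only rows whose id exceeds the last imported value
sqoop import --connect jdbc:mysql://localhost:3306/demo --username hadoop -P \
  --table orders --incremental append --check-column id --last-value 1000 \
  --target-dir /user/sqoop/orders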
2. Similarities and Differences between Sqoop1 and Sqoop2
- They are two different versions that are completely incompatible with each other.
- Version numbering: in Apache releases, 1.4.x is Sqoop1 and 1.99.x is Sqoop2; in CDH releases, Sqoop-1.4.3-cdh4 is Sqoop1 and Sqoop2-1.99.2-cdh4.5.0 is Sqoop2.
- Improvements of Sqoop2 over Sqoop1: a Sqoop server that centrally manages connectors; multiple access methods (CLI, Web UI, REST API); and a role-based security mechanism.
3. Architecture Diagrams of Sqoop1 and Sqoop2
[Figure: Sqoop architecture diagram 1 (Sqoop1)]
[Figure: Sqoop architecture diagram 2 (Sqoop2)]
4. Pros and Cons of Sqoop1 and Sqoop2

Comparison | Sqoop1 | Sqoop2
Architecture | Only a Sqoop client is used | Introduces a Sqoop server that centrally manages connectors, adds a REST API and Web UI, and brings in a permission/security mechanism
Deployment | Simple to deploy, but installation requires root privileges and connectors must conform to the JDBC model | Architecture is somewhat more complex; configuration and deployment are more involved
Usage | The command-line style is error-prone and the format is tightly coupled; not every data type is supported; the security mechanism is incomplete (e.g. password exposure) | Multiple interaction methods (command line, Web UI, REST API); connectors are centrally managed and all connections live on the Sqoop server; a complete permission-management mechanism; connectors are standardized and responsible only for reading and writing data
5. Installing and Deploying Sqoop
5.0 Installation Environment
hadoop: hadoop-1.0.4
sqoop: sqoop-1.4.5.bin__hadoop-1.0.0
5.1 Download and Extract the Package
tar -zxvf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
ln -s ./package/sqoop-1.4.5.bin__hadoop-1.0.0/ sqoop
5.2 Configure Environment Variables and Configuration Files
cd sqoop/conf/
mv sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
Add the following to sqoop-env.sh:
#Set path to where bin/hadoop is available
#export HADOOP_COMMON_HOME=
export HADOOP_COMMON_HOME=/home/hadoop/hadoop-2.2.0
#Set path to where hadoop-*-core.jar is available
#export HADOOP_MAPRED_HOME=
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop-2.2.0
#set the path to where bin/hbase is available
#export HBASE_HOME=
export HBASE_HOME=/home/hadoop/hbase-0.96.2-hadoop2
#Set the path to where bin/hive is available
#export HIVE_HOME=
export HIVE_HOME=/home/hadoop/apache-hive-0.13.1-bin
#Set the path for where the zookeeper config dir is
#export ZOOCFGDIR=
export ZOOCFGDIR=/home/hadoop/zookeeper-3.4.5
(If the data transfer does not involve HBase or Hive, the HBase and Hive settings can be omitted. If the cluster has a standalone ZooKeeper ensemble, set the ZooKeeper configuration; otherwise it is not needed.)
5.3 Copy the Required Jars into sqoop/lib
Required jars: the hadoop-core jar, the Oracle JDBC driver, and the MySQL JDBC driver (Oracle jar: ojdbc6.jar; MySQL jar: mysql-connector-java-5.1.17.jar).
cp ~/hadoop/hadoop-core-2.2.0.jar ~/sqoop/lib/
cp ~/ojdbc6.jar ~/sqoop/lib/
cp ~/mysql-connector-java-5.1.17.jar ~/sqoop/lib/
5.4 Add Environment Variables
vi ~/.bash_profile
Add the following:
#Sqoop
export SQOOP_HOME=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source ~/.bash_profile 
5.5 Test the Oracle Database Connection
Connect to the Oracle database and list all databases:
[hadoop@eb179 sqoop]$sqoop list-databases --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq -P
or: sqoop list-databases --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq --password 123456
or, for MySQL: sqoop list-databases --connect jdbc:mysql://172.19.17.119:3306/ --username hadoop --password hadoop
Warning: /home/hadoop/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: $HADOOP_HOME is deprecated.
14/08/17 11:59:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Enter password:
14/08/17 11:59:27 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
14/08/17 11:59:27 INFO manager.SqlManager: Using default fetchSize of 1000
14/08/17 11:59:51 INFO manager.OracleManager: Time zone has been set to GMT
MRDRP
MKFOW_QH
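The two names above are the schemas returned by the query. As a related, hedged check (not part of the original run), sqoop list-tables takes the same connection arguments and enumerates the tables visible to the user:
sqoop list-tables --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq -P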
Importing an Oracle Table into HDFS
Note:
- By default Sqoop runs 4 map tasks, each writing the data it imports to a separate file, with all files placed in the same directory; in this example -m 1 means only one map task is used.
- Text files cannot hold binary fields and cannot distinguish a null value from the string "null" (see the hedged aside below for options that address this).
- Running the import command further below generates an ORD_UV.java file (visible via ls ORD_UV.java). Code generation is a necessary part of the Sqoop import process: before writing the source database's data to HDFS, Sqoop uses the generated code to deserialize it.
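A hedged aside on the null-handling caveat (hypothetical target directory; --null-string and --null-non-string are standard Sqoop import arguments): you can choose the text that represents database NULLs, or avoid text files entirely with --as-sequencefile.
# Store NULL column values as \N in the generated text files
sqoop import --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq -P \
  --table ORD_UV -m 1 --target-dir /user/sqoop/test_nulls \
  --null-string '\\N' --null-non-string '\\N'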
[hadoop@eb179 ~]$ sqoop import --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq --password 123456 --table ORD_UV -m 1 --target-dir /user/sqoop/test --direct-split-size 67108864
Warning: /home/hadoop/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: $HADOOP_HOME is deprecated.
14/08/17 15:21:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/08/17 15:21:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/08/17 15:21:34 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
14/08/17 15:21:34 INFO manager.SqlManager: Using default fetchSize of 1000
14/08/17 15:21:34 INFO tool.CodeGenTool: Beginning code generation
14/08/17 15:21:46 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:46 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ORD_UV t WHERE 1=0
14/08/17 15:21:46 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop/hadoop
Note: /tmp/sqoop-hadoop/compile/328657d577512bd2c61e07d66aaa9bb7/ORD_UV.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/08/17 15:21:47 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/328657d577512bd2c61e07d66aaa9bb7/ORD_UV.jar
14/08/17 15:21:47 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:47 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:47 INFO mapreduce.ImportJobBase: Beginning import of ORD_UV
14/08/17 15:21:47 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:49 INFO db.DBInputFormat: Using read commited transaction isolation
14/08/17 15:21:49 INFO mapred.JobClient: Running job: job_201408151734_0027
14/08/17 15:21:50 INFO mapred.JobClient:  map 0% reduce 0%
14/08/17 15:22:12 INFO mapred.JobClient:  map 100% reduce 0%
14/08/17 15:22:17 INFO mapred.JobClient: Job complete: job_201408151734_0027
14/08/17 15:22:17 INFO mapred.JobClient: Counters: 18
14/08/17 15:22:17 INFO mapred.JobClient:  Job Counters 
14/08/17 15:22:17 INFO mapred.JobClient:    SLOTS_MILLIS_MAPS=15862
14/08/17 15:22:17 INFO mapred.JobClient:    Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/17 15:22:17 INFO mapred.JobClient:    Total time spent by all maps waiting after reserving slots (ms)=0
14/08/17 15:22:17 INFO mapred.JobClient:    Launched map tasks=1
14/08/17 15:22:17 INFO mapred.JobClient:    SLOTS_MILLIS_REDUCES=0
14/08/17 15:22:17 INFO mapred.JobClient:  File Output Format Counters 
14/08/17 15:22:17 INFO mapred.JobClient:    Bytes Written=1472
14/08/17 15:22:17 INFO mapred.JobClient:  FileSystemCounters
14/08/17 15:22:17 INFO mapred.JobClient:    HDFS_BYTES_READ=87
14/08/17 15:22:17 INFO mapred.JobClient:    FILE_BYTES_WRITTEN=33755
14/08/17 15:22:17 INFO mapred.JobClient:    HDFS_BYTES_WRITTEN=1472
14/08/17 15:22:17 INFO mapred.JobClient:  File Input Format Counters 
14/08/17 15:22:17 INFO mapred.JobClient:    Bytes Read=0
14/08/17 15:22:17 INFO mapred.JobClient:  Map-Reduce Framework
14/08/17 15:22:17 INFO mapred.JobClient:    Map input records=81
14/08/17 15:22:17 INFO mapred.JobClient:    Physical memory (bytes) snapshot=192405504
14/08/17 15:22:17 INFO mapred.JobClient:    Spilled Records=0
14/08/17 15:22:17 INFO mapred.JobClient:    CPU time spent (ms)=1540
14/08/17 15:22:17 INFO mapred.JobClient:    Total committed heap usage (bytes)=503775232
14/08/17 15:22:17 INFO mapred.JobClient:    Virtual memory (bytes) snapshot=2699571200
14/08/17 15:22:17 INFO mapred.JobClient:    Map output records=81
14/08/17 15:22:17 INFO mapred.JobClient:    SPLIT_RAW_BYTES=87
14/08/17 15:22:17 INFO mapreduce.ImportJobBase: Transferred 1.4375 KB in 29.3443 seconds (50.1631 bytes/sec)
14/08/17 15:22:17 INFO mapreduce.ImportJobBase: Retrieved 81 records.
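To verify the import (a quick check, assuming the job above succeeded; part-m-00000 is the conventional name of a single map task's output file):
hadoop fs -ls /user/sqoop/test
hadoop fs -cat /user/sqoop/test/part-m-00000 | head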
