Setting up a Spark development environment in IDEA, and common problems
1. Preparation
1.1 Install JDK 1.8
1.2 Install Scala 2.11.8
1.3 Install IDEA
Choose the versions you need. The installation steps above are not covered in detail in this article; if needed, they will be shared in other articles.
1.4 Notes
1. Both the JDK and Scala require the JAVA_HOME and SCALA_HOME environment variables to be configured (a quick verification sketch follows this list).
2. The Scala plugin needs to be installed in IDEA.
3. Create the project through Maven; the Scala SDK needs to be downloaded.
4. Download the Maven package, unpack it, configure Maven's settings.xml, and set the local repository location in it.
5. After the Maven project is created, right-click the project name, choose Add Framework Support, and add Scala support.
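If you want to double-check the setup from inside a project, a minimal sketch like the following can be run (the object name EnvCheck and the printed labels are placeholders, not part of the original article):

object EnvCheck {
  def main(args: Array[String]): Unit = {
    // Versions actually seen by the JVM and by the Scala library on the classpath
    println("java.version  = " + System.getProperty("java.version"))
    println("scala.version = " + scala.util.Properties.versionNumberString)
    // Environment variables from note 1 above
    println("JAVA_HOME  = " + sys.env.getOrElse("JAVA_HOME", "<not set>"))
    println("SCALA_HOME = " + sys.env.getOrElse("SCALA_HOME", "<not set>"))
  }
}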
2. Spark environment configuration
2.1 Add the Spark dependencies to pom.xml
The main ones are spark-core, spark-sql, spark-mllib, spark-hive, and so on; add the dependencies your project needs. Note that the _2.11 suffix of each artifact must match the installed Scala version (2.11.x, see 1.2).
Use the reload button in the Maven tool window to load the imported dependencies; the jar packages will be downloaded automatically.
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.8</version>
        <!--<scope>provided</scope>--> <!-- keep commented out for local runs, see 2.3.1 -->
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.4.8</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.4.8</version>
        <!--<scope>provided</scope>-->
    </dependency>
</dependencies>
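Note that the pom above mixes spark-core 2.3.0 with 2.4.8 for the other modules; normally all Spark artifacts should share one version. After reloading, a quick way to confirm which Spark version actually ended up on the classpath is to print the SPARK_VERSION constant from spark-core. This is only a sanity-check sketch, not part of the original setup:

object SparkVersionCheck {
  def main(args: Array[String]): Unit = {
    // SPARK_VERSION is a constant defined in the org.apache.spark package object (spark-core)
    println("Spark version on classpath: " + org.apache.spark.SPARK_VERSION)
  }
}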
2.2 Create a Scala object and add the configuration to start the Spark environment
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

object readcsv_demo {
  def main(args: Array[String]): Unit = {
    // Point hadoop.home.dir at a local winutils distribution (Windows only, see 2.3.2)
    System.setProperty("hadoop.home.dir", "D:\\Regent Wan\\install\\hadoop-common-2.2.0-bin-master")
    // Local-mode configuration; the lazy vals are only initialized when first used
    lazy val cfg: SparkConf = new SparkConf().setAppName("local_demo").setMaster("local[*]")
    lazy val spark: SparkSession = SparkSession.builder().config(cfg).enableHiveSupport().getOrCreate()
    lazy val sc: SparkContext = spark.sparkContext
  }
}
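As a usage sketch of the session configured above (the object name, CSV path, and options below are placeholders, not from the original article), reading and showing a CSV could look like this:

import org.apache.spark.sql.SparkSession

object readcsv_usage {
  def main(args: Array[String]): Unit = {
    // Same winutils workaround as in readcsv_demo (Windows only, see 2.3.2)
    System.setProperty("hadoop.home.dir", "D:\\Regent Wan\\install\\hadoop-common-2.2.0-bin-master")
    // Hive support is omitted here to keep the sketch minimal
    val spark = SparkSession.builder().appName("local_demo").master("local[*]").getOrCreate()

    val df = spark.read
      .option("header", "true")   // assume the first line holds column names
      .csv("D:\\data\\input.csv") // placeholder path, replace with a real file
    df.printSchema()
    df.show(5)

    spark.stop()
  }
}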
2.3 Common problems
2.3.1 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
Cause: when importing the Spark module, the following code was copied from the Maven repository page, and it includes <scope>provided</scope> by default.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.8</version>
    <scope>provided</scope>
</dependency>
Solution: comment out <scope>provided</scope> and reload the Maven project.
2.3.2 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
21/08/24 20:27:59 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Cause: Hadoop is not configured; on Windows the Hadoop client libraries look for winutils.exe under %HADOOP_HOME%\bin.
Solution: download a Hadoop/winutils package and add the configuration: System.setProperty("hadoop.home.dir", "D:\\Regent Wan\\install\\hadoop-common-2.2.0-bin-master"). A quick check is sketched below.
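One way to verify the path is correct (this check is only an illustrative sketch, not from the original article) is to confirm that bin\winutils.exe exists under the directory passed to hadoop.home.dir:

import java.nio.file.{Files, Paths}

object WinutilsCheck {
  def main(args: Array[String]): Unit = {
    // hadoop.home.dir must point at a directory containing bin\winutils.exe
    val hadoopHome = "D:\\Regent Wan\\install\\hadoop-common-2.2.0-bin-master" // adjust to your path
    val winutils = Paths.get(hadoopHome, "bin", "winutils.exe")
    println(s"winutils.exe found: ${Files.exists(winutils)}")
  }
}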
2.3.3 A lot of INFO output appears when running
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/08/24 20:33:23 INFO SparkContext: Running Spark version 2.3.0
21/08/24 20:33:23 WARN NativeCodeLoader: Unable to load native-hadoop library for using builtin-java classes where applicable
21/08/24 20:33:23 INFO SparkContext: Submitted application: local_demo
21/08/24 20:33:23 INFO SecurityManager: Changing view acls to: Administrator
21/08/24 20:33:23 INFO SecurityManager: Changing modify acls to: Administrator
21/08/24 20:33:23 INFO SecurityManager: Changing view acls groups to:
21/08/24 20:33:23 INFO SecurityManager: Changing modify acls groups to:
21/08/24 20:33:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Administrator); groups with view permissions: Set(); users with modify permissions: Set(Administrator); groups with modify permissions: Set()
21/08/24 20:33:24 INFO Utils: Successfully started service 'sparkDriver' on port 12914.
21/08/24 20:33:24 INFO SparkEnv: Registering MapOutputTracker
21/08/24 20:33:25 INFO SparkEnv: Registering BlockManagerMaster
21/08/24 20:33:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/08/24 20:33:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/08/24 20:33:25 INFO DiskBlockManager: Created local directory at C:\Users\Administrator\AppData\Local\Temp\blockmgr-82e75467-4dcc-405f-9f06-94374e10f55b
21/08/24 20:33:25 INFO MemoryStore: MemoryStore started with capacity 877.2 MB
21/08/24 20:33:25 INFO SparkEnv: Registering OutputCommitCoordinator
21/08/24 20:33:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/08/24 20:33:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://john-PC:4040
21/08/24 20:33:25 INFO Executor: Starting executor ID driver on host localhost
21/08/24 20:33:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 12935.
21/08/24 20:33:25 INFO NettyBlockTransferService: Server created on john-PC:12935
21/08/24 20:33:25 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/24 20:33:25 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:25 INFO BlockManagerMasterEndpoint: Registering block manager john-PC:12935 with 877.2 MB RAM, BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:25 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:25 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/D:/Regent%20Wan/保存/InScala/spark-warehouse').
21/08/24 20:33:26 INFO SharedState: Warehouse path is 'file:/D:/Regent%20Wan/保存/InScala/spark-warehouse'.
21/08/24 20:33:27 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/08/24 20:33:27 INFO InMemoryFileIndex: It took 95 ms to list leaf files for 1 paths.
21/08/24 20:33:27 INFO InMemoryFileIndex: It took 2 ms to list leaf files for 1 paths.
Solution: create a log4j.properties file under the resources directory and add the following configuration.
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
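If you prefer not to add a properties file, an alternative (not mentioned in the original article) is to raise the log level programmatically once the context exists; note that the startup INFO lines printed before this call will still appear:

import org.apache.spark.sql.SparkSession

object QuietLogsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("local_demo").master("local[*]").getOrCreate()
    // Only affects logging from this point on
    spark.sparkContext.setLogLevel("ERROR")
    // ... run the job here ...
    spark.stop()
  }
}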