Spark Tuning: Spark SQL Parameter Tuning -- 688IT编程网


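Preface

Spark SQL has a large number of parameters, and many of them are not clearly explained on the official Spark website, perhaps because there are simply too many. You can list the parameters supported by your current spark-sql version by running the set -v command in spark-sql.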
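The same listing can also be produced from application code. The following is a minimal sketch, assuming a local session created only for illustration; the object name and the row limit are arbitrary choices, not something the article prescribes.

```scala
import org.apache.spark.sql.SparkSession

object ListSqlConfs {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration; a real job would use the cluster's master.
    val spark = SparkSession.builder()
      .appName("list-sql-confs")
      .master("local[*]")
      .getOrCreate()

    // SET -v returns one row per supported SQL parameter.
    spark.sql("SET -v").show(numRows = 1000, truncate = false)

    spark.stop()
  }
}
```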

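This article covers the parameter-related tuning problems I ran into recently while taking part in a migration from Hive to Spark.

The content is split into two parts. The first covers tuning done to resolve exceptions by setting parameters; the second covers tuning done to improve performance.

Exception tuning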

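spark.sql.hive.convertMetastoreParquet

Parquet is a columnar storage format that both spark-sql and Hive can use as a table storage format. In Spark, creating a table with using parquet produces a Spark DataSource table, whereas creating it with stored as parquet produces a Hive table.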
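To make the distinction concrete, here is a small sketch that creates one table of each kind. The table names are hypothetical, and the session assumes Hive support (the spark-hive module) is available on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object ParquetTableKinds {
  def main(args: Array[String]): Unit = {
    // Hive support is needed for STORED AS; assumes spark-hive is on the classpath.
    val spark = SparkSession.builder()
      .appName("parquet-table-kinds")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table names, used only for this illustration.
    spark.sql("CREATE TABLE IF NOT EXISTS demo_datasource (id BIGINT, name STRING) USING parquet")
    spark.sql("CREATE TABLE IF NOT EXISTS demo_hive (id BIGINT, name STRING) STORED AS parquet")

    // The table metadata printed here should differ between the two kinds of table.
    spark.sql("DESCRIBE FORMATTED demo_datasource").show(100, truncate = false)
    spark.sql("DESCRIBE FORMATTED demo_hive").show(100, truncate = false)

    spark.stop()
  }
}
```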

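spark.sql.hive.convertMetastoreParquet defaults to true, which means Spark SQL's built-in Parquet reader and writer (that is, its own deserialization and serialization) are used; these perform better. When it is set to false, Hive's serialization and deserialization (SerDe) is used instead.

However, with the setting at true, it sometimes happens that querying a table through Hive returns data while querying it through Spark returns nothing.

On the other hand, setting spark.sql.hive.convertMetastoreParquet to false can in some cases trigger the following exception (spark-2.3.2).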

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:36)

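This happens because, when the setting is false, the data is read using the metadata stored in the Hive metastore. If the table is a Parquet table created through the Spark SQL DataSource path, the data types can be inconsistent. For example, the type read from the metastore is IntWritable, so a WritableIntObjectInspector is created to parse the data, but the actual value is a LongWritable, which causes the class cast exception.

A related parameter is spark.sql.hive.convertMetastoreParquet.mergeSchema. If it is also true, Spark tries to merge the schemas of the individual Parquet files to produce a schema compatible with all of them.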
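One way to check whether a table is affected by the choice of reader is to run the same query under both settings and compare the results. This is only a sketch: the table name is a hypothetical placeholder, and the job is assumed to be submitted to an environment where Hive support and the metastore are already configured.

```scala
import org.apache.spark.sql.SparkSession

object CompareParquetReaders {
  // Counts the rows of a Hive Parquet table under the given convertMetastoreParquet setting.
  def countRows(spark: SparkSession, table: String, useBuiltInReader: Boolean): Long = {
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", useBuiltInReader.toString)
    spark.table(table).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compare-parquet-readers")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table name; pass a real one as the first argument.
    val table = args.headOption.getOrElse("some_db.some_parquet_table")

    val builtIn = countRows(spark, table, useBuiltInReader = true)
    val hiveSerde = countRows(spark, table, useBuiltInReader = false)
    println(s"built-in reader: $builtIn rows, Hive SerDe: $hiveSerde rows")

    spark.stop()
  }
}
```

If the two counts differ, or the Hive SerDe run fails with a ClassCastException like the one shown earlier, the table is a candidate for the issues described in this section.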

spark.sql.files.ignoreMissingFiles && spark.sql.files.ignoreCorruptFiles

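These two parameters only take effect when querying Spark DataSource tables; they have no effect on operations against Hive tables.

When querying a Spark DataSource table, you may hit exceptions caused by missing or corrupt files in a non-partitioned table, or by missing or corrupt files under a partition path of a partitioned table. Setting these two parameters makes Spark ignore those exceptions. Both default to false; the suggestion here is to set both to true in production, bearing in mind that skipped files are silently left out of the result.

The source logic is shown below. In short: when a FileNotFoundException occurs, it is ignored if ignoreMissingFiles=true and rethrown otherwise. When the exception is not a FileNotFoundException but an IOException (the parent class of FileNotFoundException) or a RuntimeException, the file is treated as corrupt, and the exception is ignored if ignoreCorruptFiles=true.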

    catch {
      case e: FileNotFoundException if ignoreMissingFiles =>
        logWarning(s"Skipped missing file: $currentFile", e)
        finished = true
        null
      // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
      case e: FileNotFoundException if !ignoreMissingFiles => throw e
      case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
        logWarning(
          s"Skipped the rest of the content in the corrupted file: $currentFile", e)
        finished = true
        null
    }
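The excerpt above is a fragment and cannot be run on its own. The following standalone sketch models the same decision order so the edge cases can be tried out directly; the object, trait, and method names are invented for illustration and are not part of Spark.

```scala
import java.io.{FileNotFoundException, IOException}

object SkipPolicy {
  sealed trait Outcome
  case object SkipMissing extends Outcome // log a warning and stop reading this file
  case object SkipCorrupt extends Outcome // log a warning and keep the rows read so far
  case object Rethrow extends Outcome     // let the exception fail the task

  // Mirrors the order of the case clauses in the excerpt above.
  def decide(e: Throwable, ignoreMissingFiles: Boolean, ignoreCorruptFiles: Boolean): Outcome =
    e match {
      case _: FileNotFoundException if ignoreMissingFiles => SkipMissing
      // A missing file is never treated as corrupt, even if ignoreCorruptFiles is true.
      case _: FileNotFoundException => Rethrow
      case _: RuntimeException | _: IOException if ignoreCorruptFiles => SkipCorrupt
      case _ => Rethrow
    }

  def main(args: Array[String]): Unit = {
    val missing = new FileNotFoundException("part-00000")
    val corrupt = new IOException("bad footer")
    println(decide(missing, ignoreMissingFiles = false, ignoreCorruptFiles = true))
    println(decide(corrupt, ignoreMissingFiles = false, ignoreCorruptFiles = true))
  }
}
```

With only ignoreCorruptFiles enabled, the missing file yields Rethrow and the corrupt file yields SkipCorrupt, which is why both flags are needed to tolerate both kinds of failure.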

spark.sql.hive.verifyPartitionPath

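For partitioned tables, the two parameters above handle the case where the partition path exists but files under it are missing or corrupt. A different case is when the partition path itself no longer exists. The exception then looks like this: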

java.io.FileNotFoundException: File does not exist: hdfs://hz-cluster10/user/da_haitao/da_hivesrc/haitao_dev_log/integ_browse_app_dt/day=2019-06-25/os=Android/000067_0

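spark.sql.hive.verifyPartitionPath defaults to false. When set to true, Spark checks whether each partition path exists while resolving the partition paths and filters out those that do not, which avoids the error above.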

spark.files.ignoreCorruptFiles && spark.files.ignoreMissingFiles

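These two parameters look a lot like spark.sql.files.ignoreCorruptFiles above, but the difference is significant. The spark.sql.files.* settings only take effect when Spark queries a DataSource table; when Spark queries a Hive table, execution goes through the HadoopRDD path instead.

As a result, you can still get a FileNotFoundException even with spark.sql.files.ignoreMissingFiles set. The stack trace is shown below. Execution passes through HadoopRDD, and the frames above it come from org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper, which shows that a Hive table is being queried.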

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 107052 in stage 914.0 failed 4 times, most recent failure: Lost task 107052.3 in stage 914.0 (TID 387381, hadoop2698., executor 266): java.io.FileNotFoundException: File does not exist: hdfs://hz-cluster10/user/da_haitao/da_hivesrc/haitao_dev_log/integ_browse_app_dt/day=2019-06-25/os=Android/000067_0
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:371)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:99)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:85)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)

In this case, set spark.files.ignoreCorruptFiles and spark.files.ignoreMissingFiles to true. The code logic is essentially the same as for spark.sql.files.* above, so it is not repeated here.
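Since the parameters discussed so far apply to different read paths, a job that reads both DataSource tables and Hive tables may want all of them set together. The sketch below sets them when the session is built, on the assumption that the spark.files.* entries are core settings that should be in place before the SparkContext starts. The object name is made up, and whether a given key is honored depends on the Spark build in use.

```scala
import org.apache.spark.sql.SparkSession

object TolerantSession {
  def build(): SparkSession =
    SparkSession.builder()
      .appName("tolerant-session")
      // DataSource tables: missing or corrupt files under an existing path.
      .config("spark.sql.files.ignoreMissingFiles", "true")
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      // Hive tables read through HadoopRDD.
      .config("spark.files.ignoreMissingFiles", "true")
      .config("spark.files.ignoreCorruptFiles", "true")
      // Hive tables whose partition directories may no longer exist.
      .config("spark.sql.hive.verifyPartitionPath", "true")
      .enableHiveSupport()
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build()
    // Print the effective values so the configuration can be checked in the job log.
    Seq(
      "spark.sql.files.ignoreMissingFiles",
      "spark.sql.files.ignoreCorruptFiles",
      "spark.sql.hive.verifyPartitionPath"
    ).foreach(key => println(s"$key = ${spark.conf.get(key)}"))
    spark.stop()
  }
}
```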

Performance tuning

Beyond adjusting parameters reactively when exceptions occur, we can also adjust them proactively to tune performance.

spark.hadoopRDD.ignoreEmptySplits

Defaults to false. When true, empty splits are ignored, which reduces the number of tasks.

spark.hadoop.mapreduce.input.fileinputformat.split.minsize

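Sets the minimum size of an input split, which controls how much input each map task reads. It is used here to coarsen the input so that too many tasks are not generated when there are many small files.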
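Both of these input-side settings can be supplied when the session is created. A minimal sketch follows; the 128 MB minimum split size is an arbitrary illustrative value, not a recommendation from the article.

```scala
import org.apache.spark.sql.SparkSession

object InputSplitTuning {
  def main(args: Array[String]): Unit = {
    // 128 MB, chosen only for illustration.
    val minSplitBytes = 128L * 1024 * 1024

    val spark = SparkSession.builder()
      .appName("input-split-tuning")
      .master("local[*]")
      // Skip splits that contain no data.
      .config("spark.hadoopRDD.ignoreEmptySplits", "true")
      // Passed through to the Hadoop input format as the minimum split size.
      .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", minSplitBytes.toString)
      .getOrCreate()

    println(spark.sparkContext.getConf.get("spark.hadoopRDD.ignoreEmptySplits"))

    spark.stop()
  }
}
```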

spark.sql.autoBroadcastJoinThreshold && spark.sql.broadcastTimeout

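autoBroadcastJoinThreshold controls the table size threshold below which Spark SQL uses a broadcast join. Increasing it moderately lets more tables use a broadcast join and improves performance, but setting it too high puts memory pressure on the driver. broadcastTimeout controls the timeout of the broadcast Future; the default is 300s and it can be adjusted as needed.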
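A quick way to see the effect of the threshold is to set it and inspect the physical plan of a join. The sketch below uses two small in-memory DataFrames invented for the demonstration; the 600 second timeout is an arbitrary value.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 200 MB threshold (209715200 bytes), matching the value in this article's parameter table.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200L * 1024 * 1024)
    // Timeout in seconds; 600 is an arbitrary illustrative value.
    spark.conf.set("spark.sql.broadcastTimeout", 600L)

    val large = spark.range(0L, 1000000L).toDF("id")
    val small = Seq((1L, "a"), (2L, "b")).toDF("id", "tag")

    // The printed physical plan shows which join strategy was chosen.
    large.join(small, "id").explain()

    spark.stop()
  }
}
```

Setting the threshold to -1 and rerunning the same join is a simple way to compare against a plan with broadcasting disabled.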

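spark.sql.adaptive.enabled && spark.sql.adaptive.shuffle.targetPostShuffleInputSize

The first parameter enables Spark's adaptive execution; this is the older adaptive execution implementation in Spark. targetPostShuffleInputSize controls the average input size per task in the post-shuffle stage, preventing too many tasks from being generated.

The adaptive-execution developed by Intel's big data team is more practical than Spark's current AE. The feature has also been added to the community roadmap for 3.0 and later, which is something to look forward to.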
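As a rough illustration of these two settings on the Spark 2.x line discussed here, the sketch below enables adaptive execution and prints how many post-shuffle partitions a small aggregation ends up with. The query is invented for the demonstration, and newer Spark releases may treat the targetPostShuffleInputSize key differently.

```scala
import org.apache.spark.sql.SparkSession

object AdaptiveExecutionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("adaptive-execution-demo")
      .master("local[*]")
      .config("spark.sql.adaptive.enabled", "true")
      // 64 MB (67108864 bytes), the value listed in this article's parameter table.
      .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
      .getOrCreate()

    // A small aggregation that forces a shuffle.
    val counts = spark.range(0L, 100000L)
      .selectExpr("id % 10 AS bucket")
      .groupBy("bucket")
      .count()

    // Number of post-shuffle partitions actually used for the result.
    println(counts.rdd.getNumPartitions)

    spark.stop()
  }
}
```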

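spark.sql.parquet.mergeSchema

Defaults to false. When true, the Parquet data source merges the schemas of all Parquet files. Otherwise the schema is read directly from the Parquet summary file or, when there is no summary file, from a randomly chosen data file.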
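Schema merging can also be requested for a single read through the mergeSchema data source option instead of the session-wide setting. The sketch below writes two small directories with different columns into a temporary location and reads them back merged; the column names and values are made up for the demonstration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object MergeSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-schema-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Temporary directory, so the demo does not touch real data.
    val base = Files.createTempDirectory("merge-schema-demo").toString

    // Two partitions of the same table written with different columns.
    Seq((1, 1), (2, 4)).toDF("value", "square").write.parquet(s"$base/key=1")
    Seq((3, 27), (4, 64)).toDF("value", "cube").write.parquet(s"$base/key=2")

    // Per-read counterpart of spark.sql.parquet.mergeSchema=true.
    val merged = spark.read.option("mergeSchema", "true").parquet(base)
    merged.printSchema()

    spark.stop()
  }
}
```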

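spark.sql.files.openCostInBytes

Defaults to 4M. It is the estimated cost of opening a file, expressed as the number of bytes that could be scanned in the same time, and it is used when packing multiple files into one partition. The effect is that small files are merged into shared partitions rather than each small file occupying a partition of its own.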
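To see how the open cost influences packing, the arithmetic can be modelled in plain Scala without a cluster. This sketch follows the split-size formula and greedy packing used by Spark 2.3's file scan planning as I understand it. It assumes every file is smaller than the computed split size, so no file needs to be cut into chunks, and all names and input sizes are hypothetical.

```scala
object SplitSizing {
  // Target split size: capped by maxPartitionBytes, never below the open cost.
  def maxSplitBytes(fileSizes: Seq[Long], maxPartitionBytes: Long,
                    openCostInBytes: Long, defaultParallelism: Int): Long = {
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Greedy packing: largest files first, each file charged its size plus the open cost.
  def countPartitions(fileSizes: Seq[Long], maxSplit: Long, openCostInBytes: Long): Int = {
    var partitions = 0
    var currentSize = 0L
    var filesInCurrent = 0
    for (len <- fileSizes.sortBy(-_)) {
      // Close the current partition when the next file would not fit.
      if (filesInCurrent > 0 && currentSize + len > maxSplit) {
        partitions += 1
        currentSize = 0L
        filesInCurrent = 0
      }
      currentSize += len + openCostInBytes
      filesInCurrent += 1
    }
    if (filesInCurrent > 0) partitions + 1 else partitions
  }

  def main(args: Array[String]): Unit = {
    val mib = 1024L * 1024L
    val files = Seq.fill(100)(1 * mib) // 100 hypothetical files of 1 MiB each
    val split = maxSplitBytes(files, 128 * mib, 4 * mib, defaultParallelism = 8)
    println(s"maxSplitBytes = $split")
    println(s"partitions    = ${countPartitions(files, split, 4 * mib)}")
  }
}
```

For these inputs the split size works out to 65536000 bytes, and the 100 small files are packed 13 to a partition, giving 8 partitions instead of 100.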

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version

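Either 1 or 2; the default is 1. MAPREDUCE-4815 describes how the FileOutputCommitter works in detail. In practice, jobs with version=2 spent over 70% less time in commit than with the default version=1, but version 1 is more robust and can handle failures in some situations that version 2 cannot.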
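The committer version is a Hadoop setting, so it is passed with the spark.hadoop. prefix. A minimal sketch, assuming the session is created fresh so that the builder options reach the Hadoop configuration:

```scala
import org.apache.spark.sql.SparkSession

object CommitterVersion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("committer-version")
      .master("local[*]")
      // spark.hadoop.* entries are copied into the Hadoop configuration used by the job.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Read the value back from the Hadoop configuration to check what was applied.
    println(spark.sparkContext.hadoopConfiguration
      .get("mapreduce.fileoutputcommitter.algorithm.version"))

    spark.stop()
  }
}
```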

Spark SQL parameter table (spark-2.3.2)

key | value | meaning

spark.sql.adaptive.enabled | TRUE | When true, enable adaptive query execution.

spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 67108864b | The target post-shuffle input size in bytes of a task.

spark.sql.autoBroadcastJoinThreshold | 209715200 | Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run, and file-based data source tables where the statistics are computed directly on the files of data.

spark.sql.broadcastTimeout | 300000ms | Timeout in seconds for the broadcast wait time in broadcast joins.

spark.sql.cbo.enabled | FALSE | Enables CBO for estimation of plan statistics when set true.

spark.sql.cbo.joinReorder.dp.star.filter | FALSE | Applies star-join filter heuristics to cost based join enumeration.

spark.sql.cbo.joinReorder.dp.threshold | 12 | The maximum number of joined nodes allowed in the dynamic programming algorithm.

spark.sql.cbo.joinReorder.enabled | FALSE | Enables join reorder in CBO.

spark.sql.cbo.starSchemaDetection | FALSE | When true, it enables join reordering based on star schema detection.

spark.sql.columnNameOfCorruptRecord | _corrupt_record | The name of internal column for storing raw/un-parsed JSON and CSV records that fail to parse.

spark.sql.crossJoin.enabled | TRUE | When false, we will throw an error if a query contains a cartesian product without explicit CROSS JOIN syntax.

spark.sql.execution.arrow.enabled | FALSE | When true, make use of Apache Arrow for columnar data transfers. Currently available for use with pyspark.sql.DataFrame.toPandas, and pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame. The following data types are unsupported: BinaryType, MapType, ArrayType of TimestampType, and nested StructType.

spark.sql.execution.arrow.maxRecordsPerBatch | 10000 | When using Apache Arrow, limit the maximum number of records that can be written to a single ArrowRecordBatch in memory. If set to zero or negative there is no limit.

spark.sql.extensions | | Name of the class used to configure Spark Session extensions. The class should implement Function1[SparkSessionExtension, Unit], and must have a no-args constructor.

spark.sql.files.ignoreCorruptFiles | FALSE | Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned.

spark.sql.files.ignoreMissingFiles | FALSE | Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned.

spark.sql.files.maxPartitionBytes | 134217728 | The maximum number of bytes to pack into a single partition when reading files.

spark.sql.files.maxRecordsPerFile | 0 | Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.

spark.sql.function.concatBinaryAsString | FALSE | When this option is set to false and all inputs are binary, functions.concat returns an output as binary. Otherwise, it returns as a string.

spark.sql.function.eltOutputAsString | FALSE | When this option is set to false and all inputs are binary, elt returns an output as binary. Otherwise, it returns as a string.

spark.sql.groupByAliases | TRUE | When true, aliases in a select list can be used in group by clauses. When false, an analysis exception is thrown in the case.

spark.sql.groupByOrdinal | TRUE | When true, the ordinal numbers in group by clauses are treated as the position in the select list. When false, the ordinal numbers are ignored.

spark.sql.hive.caseSensitiveInferenceMode | INFER_AND_SAVE | Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default mode: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema instead of inferring).

spark.sql.hive.convertMetastoreParquet | TRUE | When set to true, the built-in Parquet reader and writer are used to process parquet tables created by using the HiveQL syntax, instead of Hive serde.

spark.sql.hive.convertMetastoreParquet.mergeSchema | FALSE | When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when "spark.sql.hive.convertMetastoreParquet" is true.

spark.sql.hive.filesourcePartitionFileCacheSize | 262144000 | When nonzero, enable caching of partition file metadata in memory. All tables share a cache that can use up to specified num bytes for file metadata. This conf only has an effect when hive filesource partition management is enabled.

spark.sql.hive.manageFilesourcePartitions | TRUE | When true, enable metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partition in the Hive metastore, and use the metastore to prune partitions during query planning.

spark.sql.hive.metastore.barrierPrefixes | | A comma separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*).

spark.sql.hive.metastore.jars | builtin | Location of the jars that should be used to instantiate the HiveMetastoreClient. This property can be one of three options: 1. "builtin": use Hive 1.2.1, which is bundled with the Spark assembly when -Phive is enabled. When this option is chosen, spark.sql.hive.metastore.version must be either 1.2.1 or not defined. 2. "maven": use Hive jars of specified version downloaded from Maven repositories. 3. A classpath in the standard format for both Hive and Hadoop.

spark.sql.hive.metastore.sharedPrefixes | com.mysql.jdbc, org.postgresql, com.microsoft.sqlserver, oracle.jdbc | A comma separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared. For example, custom appenders that are used by log4j.

spark.sql.hive.metastore.version | 1.2.1 | Version of the Hive metastore. Available options are 0.12.0 through 2.1.1.

spark.sql.hive.metastorePartitionPruning | TRUE | When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).

spark.sql.hive.thriftServer.async | TRUE | When set to true, Hive Thrift server executes SQL queries in an asynchronous way.

spark.sql.hive.thriftServer.singleSession | FALSE | When set to true, Hive Thrift server is running in a single session mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.

spark.sql.hive.verifyPartitionPath | FALSE | When true, check all the partition paths under the table's root directory when reading data stored in HDFS.

spark.sql.hive.version | 1.2.1 | Deprecated, please use spark.sql.hive.metastore.version to get the Hive version in Spark.

spark.sql.inMemoryColumnarStorage.batchSize | 10000 | Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.

spark.sql.inMemoryColumnarStorage.compressed | TRUE | When set to true Spark SQL will automatically select a compression codec for each column based on statistics of the data.

spark.sql.inMemoryColumnarStorage.enableVectorizedReader | TRUE | Enables vectorized reader for columnar caching.

spark.sql.optimizer.metadataOnly | TRUE | When true, enable the metadata-only query optimization that use the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics.

spark.sql.orc.compression.codec | snappy | Sets the compression codec used when writing ORC files. If either compression or orc.compress is specified in the table-specific options/properties, the precedence would be compression, orc.compress, spark.sql.orc.compression.codec. Acceptable values include: none, uncompressed, snappy, zlib, lzo.

spark.sql.orc.enableVectorizedReader | TRUE | Enables vectorized orc decoding.

spark.sql.orc.filterPushdown | FALSE | When true, enable filter pushdown for ORC files.

spark.sql.orderByOrdinal | TRUE | When true, the ordinal numbers are treated as the position in the select list. When false, the ordinal numbers in order/sort by clause are ignored.

spark.sql.parquet.binaryAsString | FALSE | Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

spark.sql.parquet.compression.codec | snappy | Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec. Acceptable values include: none, uncompressed, snappy, gzip, lzo.

spark.sql.parquet.enableVectorizedReader | TRUE | Enables vectorized parquet decoding.

spark.sql.parquet.filterPushdown | TRUE | Enables Parquet filter push-down optimization when set to true.

spark.sql.parquet.int64AsTimestampMillis | FALSE | (Deprecated since Spark 2.3, please set spark.sql.parquet.outputTimestampType.) When true, timestamp values will be stored as INT64 with TIMESTAMP_MILLIS as the extended type. In this mode, the microsecond portion of the timestamp value will be truncated.

spark.sql.parquet.int96AsTimestamp | TRUE | Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also store Timestamp as INT96 because we need to avoid precision lost of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

spark.sql.parquet.int96TimestampConversion | FALSE | This controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala. This is necessary because Impala stores INT96 data with a different timezone offset than Hive & Spark.

spark.sql.parquet.mergeSchema | FALSE | When true, the Parquet data source merges schemas collected from all data files, otherwise the schema is picked from the summary file or a random data file if no summary file is available.

spark.sql.parquet.outputTimestampType | INT96 | Sets which Parquet timestamp type to use when Spark writes data to Parquet files. INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.

spark.sql.parquet.recordLevelFilter.enabled | FALSE | If true, enables Parquet's native record-level filtering using the pushed down filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.

spark.sql.parquet.respectSummaryFiles | FALSE | When true, we make assumption that all part-files of Parquet are consistent with summary files and we will ignore them when merging schema. Otherwise, if this is false, which is the default, we will merge all part-files. This should be considered as expert-only option, and shouldn't be enabled before knowing what it means exactly.

spark.sql.parquet.writeLegacyFormat | FALSE | Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior versions, when converting Parquet schema to Spark SQL schema and vice versa.

spark.sql.parser.quotedRegexColumnNames | FALSE | When true, quoted Identifiers (using backticks) in SELECT statement are interpreted as regular expressions.

spark.sql.pivotMaxValues | 10000 | When doing a pivot without specifying values for the pivot column this is the maximum number of (distinct) values that will be collected without error.

spark.sql.queryExecutionListeners | | List of class names implementing QueryExecutionListener that will be automatically added to newly created sessions. The classes should have either a no-arg constructor, or a constructor that expects a SparkConf argument.

spark.sql.redaction.options.regex | (?i)url | Regex to decide which keys in a Spark SQL command's options map contain sensitive information. The values of options whose names that match this regex will be redacted in the explain output. This redaction is applied on top of the global redaction configuration defined by spark.redaction.regex.

spark.sql.redaction.string.regex | | Regex to decide which parts of strings produced by Spark contain sensitive information. When this regex matches a string part, that string part is replaced by a dummy value. This is currently used to redact the output of SQL explain commands. When this conf is not set, the value from spark.redaction.string.regex is used.

spark.sql.session.timeZone | Asia/Shanghai | The ID of session local timezone, e.g. "GMT", "America/Los_Angeles", etc.

spark.sql.shuffle.partitions | 4096 | The default number of partitions to use when shuffling data for joins or aggregations.

spark.sql.sources.bucketing.enabled | TRUE | When false, we will treat bucketed table as normal table.

spark.sql.sources.default | parquet | The default data source to use in input/output.

spark.sql.sources.parallelPartitionDiscovery.threshold | 32 | The maximum number of paths allowed for listing files at driver side. If the number of detected paths exceeds this value during partition discovery, it tries to list the files with another Spark distributed job. This applies to
