How do you enable Snappy in Hive to improve Spark performance on Parquet files?

最编程 2024-08-02 22:31:45


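This post starts as a translation; the concrete integration steps will follow in a later post. It mainly covers an overview, the Spark parameter settings, and how to handle the problems you may run into. It says little about setting up the environment.
As for the issues section, there are too many error messages and they are hard to format cleanly. See the original document for the full error text.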
Spark Installation
Configuring YARN
Configuring Hive
Configuring Spark
Issues
Recommended Configuration
Design Documents
Hive on Spark has been part of Hive since the Hive 1.1 release. It is under active development on the spark branch, which is merged into master periodically. For details, see HIVE-7292 (https://issues.apache.org/jira/browse/HIVE-7292).
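Spark Installation
Install Spark by following http://spark.apache.org/docs/latest/running-on-yarn.html (or https://spark.apache.org/docs/latest/spark-standalone.html if you want to run Spark in standalone mode). Hive on Spark supports Spark on YARN mode by default. During installation, pay particular attention to the following points:
1. Install Spark (either download a pre-built Spark or build it from source). The translator notes that the build appears to be managed with Maven.
Install or build a compatible version. The spark.version property in Hive's pom.xml defines the Spark version that Hive was built against.
Once Spark is built, locate the spark-assembly-*.jar.
Note that you must use a Spark build that does not include the Hive jars, meaning one that was built without the Hive dependencies. If you use Parquet tables, it is recommended to also enable the parquet-provided profile; otherwise the Parquet dependencies may conflict. To leave the Hive jars out of the build, build Spark with the following command: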

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

2. Start the Spark cluster (both standalone and Spark on YARN are supported).
Keep a note of the Spark master URL, which you can find in the Spark master's web UI.

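Configuring YARN
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
Configuring Hive
1. There are several ways to add the Spark dependency to Hive:
a. Set the spark.home property to point to the Spark installation directory: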

hive> set spark.home=/location/to/sparkHome;

b. Define the SPARK_HOME environment variable before starting the Hive CLI or HiveServer2:

export SPARK_HOME=/usr/lib/spark....

c. Copy the spark-assembly jar into the HIVE_HOME/lib directory.
2. Set Hive's execution engine to Spark:

hive> set hive.execution.engine=spark;

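See the Spark section of the Hive Configuration Properties page for the other properties that configure Hive and the remote Spark driver.
3. Configure the Spark application settings for Hive. For details, see http://spark.apache.org/docs/latest/configuration.html. You can either add a spark-defaults.conf file containing these settings to the Hive classpath, or set them in Hive's hive-site.xml. For example (set here on the command line; the same values can be written directly into hive-site.xml):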

hive> set spark.master=<Spark Master URL>;

hive> set spark.eventLog.enabled=true;

hive> set spark.eventLog.dir=<Spark event log folder (must exist)>;

hive> set spark.executor.memory=512m;             

hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
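Since the same properties can live in hive-site.xml, the following is a minimal sketch of how the command-line settings above map onto that file's property layout. The helper name to_hive_site_xml is hypothetical and used only for illustration; spark.master and spark.eventLog.dir are left out because the text only gives placeholders for them.

```python
import xml.etree.ElementTree as ET

# Values copied from the `set` commands above. spark.master and
# spark.eventLog.dir are omitted: the article only shows placeholders for them.
SPARK_PROPS = {
    "hive.execution.engine": "spark",
    "spark.eventLog.enabled": "true",
    "spark.executor.memory": "512m",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def to_hive_site_xml(props):
    """Render a dict of properties as Hadoop-style <configuration> XML.

    Hypothetical helper for illustration; it is not part of Hive or Spark.
    """
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_hive_site_xml(SPARK_PROPS))
```

The output is a single unindented line; in practice you would merge these property elements into your existing hive-site.xml by hand rather than overwrite the file.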

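A little more detail on some of these configuration properties:
spark.executor.memory: the amount of memory each executor process uses.
spark.executor.cores: the number of CPU cores each executor uses.
spark.yarn.executor.memoryOverhead: when running Spark on YARN, the amount of off-heap memory allocated per executor. This memory accounts for things like VM overheads. Beyond the executor's own memory, the container that runs the executor needs some extra memory for system processes, and this setting provides it.
spark.executor.instances: the number of executors assigned to each application.
spark.driver.memory: the amount of memory assigned to the remote Spark context. We recommend 4g.
spark.yarn.driver.memoryOverhead: we recommend 400m.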
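Configuring Spark
Choosing an appropriate executor memory size is better than simply making it as large as possible. Several things need to be taken into account:
1. More executor memory means optimizations can be applied to more queries.
2. On the other hand, more executor memory becomes unwieldy from a GC perspective.
3. Some experiments show that the HDFS client does not handle concurrent writers well, so an executor that is too large (too many cores) may run into race conditions.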
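When running Spark on YARN, we recommend setting spark.executor.cores to 5, 6, or 7, depending on which value divides a typical node's cores most evenly (the translator reads this as the value that leaves the fewest cores unused on each node). For example, if yarn.nodemanager.resource.cpu-vcores (the total vcores on one node) is 19, then 6 is the better choice: all executors must have the same number of cores, so choosing 5 fits 3 executors and wastes 4 cores, choosing 6 fits 3 executors and wastes only 1, and choosing 7 fits only 2 executors and wastes 5. If the node has 20 vcores, 5 is the better choice, since you get 4 executors and no core is wasted.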
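The arithmetic behind this choice can be sketched as follows. This is an illustration under the assumption that the only criterion is the number of vcores left unused on a node; the function names are hypothetical and not part of Hive or Spark.

```python
def executors_and_waste(node_vcores, cores_per_executor):
    """Return (executors that fit on one node, vcores left unused)."""
    executors = node_vcores // cores_per_executor
    wasted = node_vcores - executors * cores_per_executor
    return executors, wasted


def best_executor_cores(node_vcores, candidates=(5, 6, 7)):
    """Pick the candidate that leaves the fewest vcores unused.

    On a tie, the first candidate in the tuple wins.
    """
    return min(candidates, key=lambda c: executors_and_waste(node_vcores, c)[1])


if __name__ == "__main__":
    for vcores in (19, 20):
        for cores in (5, 6, 7):
            executors, wasted = executors_and_waste(vcores, cores)
            print(f"{vcores} vcores, {cores} cores/executor: "
                  f"{executors} executors, {wasted} wasted")
        print(f"suggested spark.executor.cores: {best_executor_cores(vcores)}")
```

For a node with 19 vcores this suggests 6, and for 20 vcores it suggests 5, matching the reasoning in the paragraph above.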
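For spark.executor.memory, we suggest computing yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then splitting the result between spark.executor.memory and spark.yarn.executor.memoryOverhead. Based on our environment, we recommend setting spark.yarn.executor.memoryOverhead to 15%-20% of the computed total.
Once you have decided how much memory each executor gets, you need to decide how many executors to allocate to queries. The GA release will support Spark dynamic executor allocation, but the beta release supports only static resource allocation. Based on the memory of each node and the values of spark.executor.memory and spark.yarn.executor.memoryOverhead, you choose how many instances can run and set spark.executor.instances accordingly.
Now a real-world example. Assume 10 nodes, each with 64GB of memory and 12 virtual cores, so yarn.nodemanager.resource.cpu-vcores=12. One node is the master and the other 9 are slaves. We recommend setting spark.executor.cores to 6. Given 64GB of RAM per node, yarn.nodemanager.resource.memory-mb is set to 50GB. The memory for each executor is then computed as 50GB * (6/12) = 25GB. We assign 20% of it to spark.yarn.executor.memoryOverhead (5GB) and 80% to spark.executor.memory (20GB).
On this 9-node cluster, each machine runs 2 executors, so spark.executor.instances can reasonably be set anywhere between 2 and 2*9 = 18. A value of 18 uses the entire cluster.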
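The worked example above can be reproduced with a short calculation. This is a sketch that assumes a 20% overhead share and memory expressed in MB; the function name size_executors is hypothetical, and the returned keys simply reuse the Spark property names from the text.

```python
def size_executors(node_memory_mb, node_vcores, executor_cores,
                   worker_nodes, overhead_percent=20):
    """Derive static executor settings from per-node YARN resources.

    node_memory_mb corresponds to yarn.nodemanager.resource.memory-mb and
    node_vcores to yarn.nodemanager.resource.cpu-vcores.
    """
    # Memory share of one executor, proportional to its share of the vcores.
    per_executor_mb = node_memory_mb * executor_cores // node_vcores
    # Split that share between overhead and executor memory.
    overhead_mb = per_executor_mb * overhead_percent // 100
    heap_mb = per_executor_mb - overhead_mb
    executors_per_node = node_vcores // executor_cores
    return {
        "spark.executor.cores": executor_cores,
        "spark.executor.memory": f"{heap_mb}m",
        "spark.yarn.executor.memoryOverhead": overhead_mb,
        # Upper bound: this many executors would use the whole cluster.
        "spark.executor.instances": executors_per_node * worker_nodes,
    }


if __name__ == "__main__":
    # 9 worker nodes, 50GB (51200 MB) and 12 vcores available to YARN per node.
    settings = size_executors(51200, 12, 6, worker_nodes=9)
    for key, value in settings.items():
        print(f"set {key}={value};")
```

With the example's numbers this yields 20480m of executor memory, 5120 MB of overhead, and at most 18 instances. Any smaller value for spark.executor.instances, down to 2, is also consistent with the text.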
Issues

Issue	Cause	Resolution
Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit	The Spark dependency is not set correctly.	Add the Spark jar dependency to Hive; see above.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable	The Spark serializer is not set to Kryo.	Set spark.serializer to org.apache.spark.serializer.KryoSerializer; see above.
Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected	Hive ships the jline2 jar, but a jline 0.94 jar is present in the Hadoop lib directory.	1. Delete jline from the Hadoop lib directory. 2. export HADOOP_USER_CLASSPATH_FIRST=true. 3. If the error occurs during mvn test, run clean install first and then run the tests.
Spark executor gets killed all the time and Spark keeps retrying the failed stage; you may find similar information in the YARN nodemanager log. WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=217989,containerID=container_1421717252700_0716_01_50767235] is running beyond physical memory limits. Current usage: 43.1 GB of 43 GB physical memory used; 43.9 GB of 90.3 GB virtual memory used. Killing container.	With Spark on YARN, the NodeManager kills a Spark executor that uses more memory than spark.executor.memory + spark.yarn.executor.memoryOverhead.	Increase spark.yarn.executor.memoryOverhead so that the limit is not exceeded.
Running a query fails with: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask	Happens on Mac; this is a common Snappy problem on macOS.	Before starting Hive or HiveServer2, run: export HADOOP_OPTS="-Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib $HADOOP_OPTS"
Stack trace: ExitCodeException exitCode=1: …/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR.../usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*: bad substitution	The key mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml contains a variable that is invalid in bash.	Remove :/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar from mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml.
Exception in thread "Driver" scala.MatchError: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/TaskAttemptContext (of class java.lang.NoClassDefFoundError)	MR is not on the YARN classpath.	Change /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework to /hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework

Recommended Configuration

# see HIVE-9153
mapreduce.input.fileinputformat.split.maxsize=750000000
hive.vectorized.execution.enabled=true

hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch

Design Documents
Hive on Spark: Overall Design from HIVE-7292
Hive on Spark: Join Design (HIVE-7613)
Hive on Spark Configuration (HIVE-9449)
Hive on Spark Explain Plan
Note:
Hive on Spark has its own setting that controls whether small files are merged.
Set hive.merge.sparkfiles=true to merge small files.
