比较Hive表使用Parquet、Gzip、Snappy和未压缩的性能差异
最编程
2024-08-02 22:34:40
...
创建两张表,通过一种是parquet , 一种使用parquet snappy压缩
创建表
使用snappy
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='SNAPPY');
使用gzip
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='GZIP');
使用uncompressed
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='UNCOMPRESSED');
使用默认
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET;
也可以在执行语句前执行 set parquet.compression=SNAPPY; 会对之后跑的数据进行压缩,之前已经存在的不会进行snappy压缩
通过 desc formatted tableName 查看表结构
使用parquet snappy
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
numFiles 25
numPartitions 1
numRows 0
parquet.compression SNAPPY
rawDataSize 0
totalSize 4570350557
transient_lastDdlTime 1552269085
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
serialization.format \u0001
使用parquet默认
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
numFiles 25
numPartitions 1
numRows 0
rawDataSize 0
totalSize 4570650197
transient_lastDdlTime 1552269039
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
serialization.format \u0001
测试数据量:20208432
UNCOMPRESSED :4570325699
PARQUET 默认 :4570650197
parquet gzip :4570314033
parquet snappy :4570350557
textfile :10356207038
通过对比发现,当数据量较少时parquet各压缩方式差别不大,但相比TEXTFILE压缩减少了1倍以上,后续再做一下性能对比测试一下。