
Comparing Hive Table Performance with Parquet: Gzip, Snappy, and Uncompressed

最编程 2024-08-02 22:34:40


 

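Create the test tables: one stored as Parquet with the default settings, and one stored as Parquet with Snappy compression. GZIP and UNCOMPRESSED variants are created the same way for comparison.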

Creating the tables

Using Snappy
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='SNAPPY');

Using GZIP
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='GZIP');

Using UNCOMPRESSED
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='UNCOMPRESSED');

Using the default
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET;

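Alternatively, run set parquet.compression=SNAPPY; before the statement that writes the data. Data written afterwards is compressed with Snappy; data that already exists is not recompressed.
Run desc formatted tableName to inspect the table's metadata.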

Parquet with Snappy

Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        numFiles                25                  
        numPartitions           1                   
        numRows                 0                   
        parquet.compression     SNAPPY              
        rawDataSize             0                   
        totalSize               4570350557          
        transient_lastDdlTime   1552269085          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \u0001              
        serialization.format    \u0001

Parquet with the default settings

Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        numFiles                25                  
        numPartitions           1                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               4570650197          
        transient_lastDdlTime   1552269039          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \u0001              
        serialization.format    \u0001
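
When several tables need to be compared, the Table Parameters section of the desc formatted output can be pulled apart programmatically instead of read by eye. The following is a minimal Python sketch, assuming the output has been saved as text in the layout shown above; the function name parse_table_parameters is hypothetical and not part of Hive.

```python
def parse_table_parameters(desc_output: str) -> dict:
    """Extract key/value pairs from the 'Table Parameters:' section
    of saved `desc formatted` output (layout assumed as shown above)."""
    params = {}
    in_section = False
    for line in desc_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Table Parameters:"):
            in_section = True
            continue
        if in_section:
            # A blank line or a '#' heading ends the section.
            if not stripped or stripped.startswith("#"):
                break
            parts = stripped.split(None, 1)
            if len(parts) == 2:
                params[parts[0]] = parts[1].strip()
    return params


# Sample text abbreviated from the Snappy table's output above.
sample = """\
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        numFiles                25
        parquet.compression     SNAPPY
        totalSize               4570350557

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
"""

params = parse_table_parameters(sample)
# The default table has no parquet.compression entry, so fall back to a marker.
print(params.get("parquet.compression", "(not set)"), int(params["totalSize"]))
```

For the default table, the parquet.compression key is simply absent from the parsed result, which matches the second output above.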

Test data volume: 20208432

Total size (totalSize, in bytes) by storage format:

parquet uncompressed : 4570325699
parquet default      : 4570650197
parquet gzip         : 4570314033
parquet snappy       : 4570350557
textfile             : 10356207038

 

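The comparison shows that at this relatively small data volume, the Parquet compression settings make almost no difference in size. Every Parquet variant, however, is less than half the size of TEXTFILE (about 4.57 GB versus 10.36 GB). A follow-up performance comparison is planned.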
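
The size comparison can be checked with a quick calculation. The sketch below is in Python, uses only the byte counts reported in this post, and the variable names are illustrative.

```python
# Sizes in bytes, copied from the results reported in this post.
sizes = {
    "parquet uncompressed": 4570325699,
    "parquet default": 4570650197,
    "parquet gzip": 4570314033,
    "parquet snappy": 4570350557,
    "textfile": 10356207038,
}

textfile = sizes["textfile"]
parquet = {name: size for name, size in sizes.items() if name != "textfile"}

# Each Parquet variant as a fraction of the textfile size.
for name, size in parquet.items():
    print(f"{name:22s} {size / textfile:.1%} of textfile")

# Gap between the largest and smallest Parquet variant.
spread = max(parquet.values()) - min(parquet.values())
print(f"spread across parquet variants: {spread} bytes "
      f"({spread / min(parquet.values()):.4%})")
```

Each Parquet table comes out at roughly 44% of the textfile size, and the largest and smallest Parquet variants differ by only 336164 bytes.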