The most powerful and most comprehensive Hive SQL developer guide, explained in full in over 40,000 words (I)--Part 1:

最编程 2024-05-24 16:06:17

...

Fuzzy search for tables in Hive: show tables like '*name*';
View a table's structure: desc table_name;
View partition information: show partitions table_name;
Load a local file: load data local inpath '/xxx/test.txt' overwrite into table dm.table_name;
Insert data into a table from a query: insert overwrite table table_name partition(dt) select * from table_name;
Export data to the local file system: insert overwrite local directory '/tmp/text' select a.* from table_name a order by 1;
Some properties that can be specified when creating a table:

Field delimiter: row format delimited fields terminated by '\t'
Line delimiter: row format delimited lines terminated by '\n'
Store the file as plain text: stored as textfile
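
The three properties above are usually combined in a single create table statement. The following is a minimal sketch; the table name demo_score and its columns are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table and columns, for illustration only.
create table if not exists demo_score (
  s_id    string comment 'student id',
  c_id    string comment 'course id',
  s_score int    comment 'score'
)
partitioned by (month string)
row format delimited
  fields terminated by '\t'   -- field delimiter
  lines terminated by '\n'    -- line delimiter
stored as textfile;           -- plain-text storage
```

Note that fields terminated by and lines terminated by belong to one row format delimited clause; the keyword pair is written only once.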

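Command-line usage: hive -e 'select table_column from table' runs a single query. MapReduce progress is shown in the terminal while it runs, the query result is printed to the terminal when it finishes, and then the hive process exits without entering interactive mode.

hive -S -e 'select table_column from table': with -S (silent mode), MapReduce progress is not shown in the terminal; only the query result is printed when the query finishes.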

Rename a table: alter table old_table_name rename to new_table_name;
Copy a table's structure: create table new_table_name like table_name;
Add a column: alter table table_name add columns(columns_values bigint comment 'comm_text');
Change a column: alter table table_name change old_column new_column string comment 'comm_text';
Drop a partition: alter table table_name drop partition(dt='2021-11-30');
Add a partition: alter table table_name add partition (dt='2021-11-30');
Drop an empty database: drop database myhive2;
Force-drop a database together with its tables: drop database myhive2 cascade;
Drop a table: drop table score5;
Truncate a table: truncate table score6;

Loading data into Hive tables

Insert data directly into a partitioned table: insert into table score partition(month ='202107') values ('001','002','100');

Load data with load: load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

Load data from a query: insert overwrite table score2 partition(month = '202106') select s_id,c_id,s_score from score1;

Create a table and load data in one query statement: create table score2 as select * from score1;

Specify the data path with location when creating a table: create external table score6 (s_id string,c_id string,s_score int) row format delimited fields terminated by ',' location '/myscore';
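
The insert statements in this section name the partition value explicitly (a static partition). The earlier form partition(dt) with no value is a dynamic partition insert, where Hive takes the partition value from the query result. A minimal sketch, assuming score1 has a month column (an assumption, not stated in the original):

```sql
-- Dynamic partitioning is typically enabled with these two settings.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Assumption: score1 carries a `month` column.
-- The partition column must be the last column in the select list.
insert overwrite table score2 partition(month)
select s_id, c_id, s_score, month
from score1;
```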

Exporting and importing Hive table data with export and import (managed/internal tables):

create table techer2 like techer; -- create a table from an existing table's structure

export table techer to '/export/techer';

import table techer2 from '/export/techer';

Exporting data from Hive tables

Export with insert

Export query results to the local file system: insert overwrite local directory '/export/servers/exporthive' select * from score;

Export formatted query results to the local file system: insert overwrite local directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;

Export query results to HDFS (no local keyword): insert overwrite directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;

Export to the local file system with a Hadoop command: dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;

Export with hive shell commands

Basic syntax: (hive -f/-e statement or script > file) hive -e "select * from myhive.score;" > /export/servers/exporthive/score.txt

hive -f export.sh > /export/servers/exporthive/score.txt

Export to HDFS with export: export table score to '/export/exporthive/score';

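Hive query statements

GROUP BY grouping: select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85; use having to filter the data after grouping.
join: inner join; left join (left outer join); right join (right outer join); full join (full outer join).
order by sorting: ASC (ascending, the default); DESC (descending).
sort by local sorting: rows are sorted within each reducer, so the overall result set is not globally sorted.
distribute by partitioning: similar to partition in MR, it decides which reducer each row is sent to; used together with sort by.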
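
The difference between order by and distribute by plus sort by is easiest to see side by side. This is a sketch against the score table used above; the reducer count of 3 is an arbitrary assumption chosen only to make the effect visible.

```sql
-- Assumption: three reducers, so that sort by produces three sorted runs.
set mapreduce.job.reduces=3;

-- Global ordering: the whole result set is sorted.
select s_id, s_score
from score
order by s_score desc;

-- Rows with the same s_id go to the same reducer, and each reducer
-- sorts its own rows by s_score. The combined output is not globally ordered.
select s_id, s_score
from score
distribute by s_id
sort by s_score desc;
```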

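Hive functions

1. Aggregate functions

Number of values in a column: count()
Sum of a column: sum()
Maximum of a column: max()
Minimum of a column: min()
Average of a column: avg()
Population variance of the non-null values: var_pop(col)
Sample variance of the non-null values: var_samp (col)
Population standard deviation: stddev_pop(col)
Percentile function: percentile(BIGINT col, p)
Median function: percentile(BIGINT col, 0.5)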
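
As an illustration, several of these aggregates can be computed per course in one query. This sketch assumes the score table from the earlier examples; the cast is there because percentile expects an integer column, and the column type of score.s_score is not stated in the original.

```sql
select
  c_id,
  count(*)                                  as n_rows,
  max(s_score)                              as max_score,
  avg(s_score)                              as avg_score,
  var_pop(s_score)                          as var_population,
  stddev_pop(s_score)                       as std_population,
  percentile(cast(s_score as bigint), 0.5)  as median_score
from score
group by c_id;
```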

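2. Relational operators

A LIKE B: SQL pattern match. TRUE if string A matches the simple SQL pattern B, where _ matches any single character and % matches any number of characters. B is not a regular expression.
A RLIKE B: regular-expression match. TRUE if any substring of A matches the Java regular expression B.
A REGEXP B: same as RLIKE.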
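
A small sketch of how the two operators differ on the same input. It uses literal values only and assumes a Hive version that allows select without a from clause.

```sql
select
  'foobar' like 'foo%'   as like_prefix,    -- true
  'foobar' like 'foo'    as like_exact,     -- false: LIKE must match the whole string
  'foobar' rlike 'foo'   as rlike_substr,   -- true: a substring matches the regex
  'foobar' rlike '^bar'  as rlike_anchored; -- false: the string does not start with bar
```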

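3. Arithmetic operators

All numeric types are supported: addition (+), subtraction (-), multiplication (*), division (/), modulo (%), bitwise AND (&), bitwise OR (|), bitwise XOR (^), bitwise NOT (~)

4. Logical operators

Supported: logical AND (and), logical OR (or), logical NOT (not)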

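5. Numeric functions

Rounding function: round(double a)
Rounding to a given precision: round(double a, int d)
Floor function: floor(double a)
Ceiling function: ceil(double a)
Random number function: rand(),rand(int seed)
Natural exponential function: exp(double a)
Base-10 logarithm: log10(double a)
Base-2 logarithm: log2()
Logarithm function: log()
Power function: pow(double a, double p)
Square root function: sqrt(double a)
Binary representation: bin(BIGINT a)
Hexadecimal representation: hex()
Absolute value: abs()
Positive modulo: pmod()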
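
A few of these functions evaluated on literals, as a quick sketch (again assuming select without from is available). The contrast between % and pmod on a negative number is the one most worth remembering.

```sql
select
  round(3.14159, 2)  as rounded,    -- 3.14
  floor(-1.5)        as floored,    -- -2
  ceil(-1.5)         as ceiled,     -- -1
  -7 % 3             as remainder,  -- -1 (sign follows the dividend)
  pmod(-7, 3)        as pos_mod,    -- 2  (non-negative for a positive divisor)
  bin(7)             as binary_str, -- '111'
  hex(255)           as hex_str;    -- 'FF'
```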

6. Conditional functions

if
case when
coalesce(c1,c2,c3)
nvl(c1,c2)
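
The four conditional forms can be sketched against the score table as follows. The thresholds 60 and 85 and the output labels are hypothetical, chosen only for illustration.

```sql
select
  s_id,
  -- if(condition, value_if_true, value_if_false)
  if(s_score >= 60, 'pass', 'fail')  as passed,
  -- case when evaluates branches in order and returns the first match
  case
    when s_score >= 85 then 'A'
    when s_score >= 60 then 'B'
    else 'C'
  end                                as grade,
  -- nvl returns the second argument when the first is null
  nvl(s_score, 0)                    as score_or_zero,
  -- coalesce returns the first non-null argument
  coalesce(c_id, s_id, 'unknown')    as first_non_null
from score;
```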

7. Date functions

Get the current UNIX timestamp: unix_timestamp()
UNIX timestamp to date string: from_unixtime()
Date string to UNIX timestamp: unix_timestamp(string date)
Datetime to date: to_date(string timestamp)
Year of a date: year(string date)
Month of a date: month (string date)
Day of a date: day (string date)
Hour of a date: hour (string date)
Minute of a date: minute (string date)
Second of a date: second (string date)
Week of the year of a date: weekofyear (string date)
Difference between two dates in days: datediff(string enddate, string startdate)
Add days to a date: date_add(string startdate, int days)
Subtract days from a date: date_sub (string startdate, int days)
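
A short sketch of the date functions on literal values. Functions whose result depends on the session time zone or the current time (unix_timestamp, from_unixtime) are left out so that the commented results are deterministic.

```sql
select
  to_date('2021-11-30 10:03:01')        as d,         -- 2021-11-30
  year('2021-11-30 10:03:01')           as y,         -- 2021
  month('2021-11-30 10:03:01')          as m,         -- 11
  hour('2021-11-30 10:03:01')           as h,         -- 10
  datediff('2021-11-30', '2021-11-01')  as diff_days, -- 29 (enddate minus startdate)
  date_add('2021-11-30', 1)             as next_day,  -- 2021-12-01
  date_sub('2021-11-30', 30)            as earlier;   -- 2021-10-31
```

The argument order of datediff is end date first; swapping the two arguments flips the sign of the result.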

8. String functions

String length: length(string A)
Reverse a string: reverse(string A)
Concatenate strings: concat(string A, string B…)
Concatenate strings with a separator: concat_ws(string SEP, string A, string B…)
Substring: substr(string A, int start, int len)
Convert to upper case: upper(string A)
Convert to lower case: lower(string A)
Trim spaces on both sides: trim(string A)
Trim spaces on the left: ltrim(string A)
Trim spaces on the right: rtrim(string A)
Regular-expression replace: regexp_replace(string A, string B, string C)
Regular-expression extract: regexp_extract(string subject, string pattern, int index)
URL parsing: parse_url(string urlString, string partToExtract [, string keyToExtract]) returns: string
JSON parsing: get_json_object(string json_string, string path)
String of spaces: space(int n)
Repeat a string: repeat(string str, int n)
ASCII code of the first character: ascii(string str)
Left pad: lpad(string str, int len, string pad)
Right pad: rpad(string str, int len, string pad)
Split a string: split(string str, string pat)
Find in a comma-separated set: find_in_set(string str, string strList)
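
Some of the less obvious string functions, sketched on literal inputs. The URL and JSON values are made up for illustration.

```sql
select
  concat_ws('-', 'a', 'b', 'c')                     as joined,    -- 'a-b-c'
  substr('hello', 2, 3)                             as sub,       -- 'ell' (start is 1-based)
  lpad('hi', 5, '*')                                as padded,    -- '***hi'
  regexp_replace('foobar', 'oo|ar', '')             as replaced,  -- 'fb'
  regexp_extract('foothebar', 'foo(.*?)(bar)', 1)   as extracted, -- 'the' (capture group 1)
  split('a,b,c', ',')                               as parts,     -- ["a","b","c"]
  find_in_set('ab', 'abc,b,ab,c,def')               as pos,       -- 3
  parse_url('http://example.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') as k1, -- 'v1'
  get_json_object('{"a":{"b":"x"}}', '$.a.b')       as json_val;  -- 'x'
```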

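9. Window functions

Sum within a group: sum(pv) over(partition by cookieid order by createtime). There is a pitfall here: the result differs greatly depending on whether order by is present; details are in Part 2 below.
Rank within a group, numbered consecutively from 1: ROW_NUMBER(), e.g. 1234567
Rank within a group, where ties leave gaps in the ranking: RANK(), e.g. 1233567
Rank within a group, where ties leave no gaps in the ranking: DENSE_RANK(), e.g. 1233456
Distribute an ordered data set as evenly as possible into the given number (num) of buckets: NTILE()
Value from the n-th row before the current row in the window: LAG(col,n,DEFAULT)
Value from the n-th row after the current row in the window: LEAD(col,n,DEFAULT)
After ordering within the group, the first value up to the current row: FIRST_VALUE(col)
After ordering within the group, the last value up to the current row: LAST_VALUE(col)
Number of rows less than or equal to the current value / total rows in the group: CUME_DIST()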
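
The order by pitfall mentioned for sum() comes from the default window frame: with order by, the frame runs from the start of the partition to the current row, giving a running total; without it, the frame is the whole partition. A sketch, assuming a hypothetical table cookie_pv(cookieid, createtime, pv) built from the column names used above:

```sql
-- Assumption: cookie_pv(cookieid string, createtime string, pv int) is a hypothetical table.
select
  cookieid,
  createtime,
  pv,
  -- With order by: running total up to the current row.
  sum(pv) over (partition by cookieid order by createtime)  as running_pv,
  -- Without order by: the partition total, repeated on every row.
  sum(pv) over (partition by cookieid)                       as total_pv,
  row_number() over (partition by cookieid order by pv desc) as rn,
  rank()       over (partition by cookieid order by pv desc) as rk,
  dense_rank() over (partition by cookieid order by pv desc) as dense_rk,
  -- Previous row's pv, or 0 when there is no previous row.
  lag(pv, 1, 0) over (partition by cookieid order by createtime) as prev_pv
from cookie_pv;
```

The same default frame explains why LAST_VALUE(col) with order by tends to return the current row's value rather than the last value of the whole partition.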

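The following features are only summarized here; see Part 2 for a detailed explanation.

Write several group by queries in one SQL statement: GROUPING SETS
Aggregate over every combination of the GROUP BY dimensions: CUBE
A subset of CUBE that aggregates hierarchically, starting from the leftmost dimension: ROLLUP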
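
A sketch of the three forms on the score table, using s_id and c_id as the two dimensions. The choice of dimensions is an assumption made for illustration; Part 2 of the original guide may use different tables.

```sql
-- GROUPING SETS: one statement instead of a union of three group by queries.
-- Groups produced: (s_id, c_id), (s_id), and the grand total ().
select s_id, c_id, count(*) as n_rows
from score
group by s_id, c_id
grouping sets ((s_id, c_id), s_id, ());

-- CUBE: every combination of the dimensions:
-- (s_id, c_id), (s_id), (c_id), ().
select s_id, c_id, count(*) as n_rows
from score
group by s_id, c_id with cube;

-- ROLLUP: hierarchy from the leftmost dimension:
-- (s_id, c_id), (s_id), ().
select s_id, c_id, count(*) as n_rows
from score
group by s_id, c_id with rollup;
```

In the output, a dimension that was aggregated away in a given row appears as NULL.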
