基于蜂巢的电影评论情感分析和可视化蜂巢电影评论数据分析

最编程 2024-04-04 11:12:44

...

Hive应用实践

题目描述

现有如此三份数据：
1、users.dat 数据格式为： 2::M::56::16::70072，共有6040条数据
对应字段为：UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释：用户id，性别，年龄，职业，邮政编码
2、movies.dat 数据格式为： 2::Jumanji (1995)::Adventure|Children’s|Fantasy，共有3883条数据
对应字段为：MovieID BigInt, Title String, Genres String
对应字段中文解释：电影ID，电影名字，电影类型
3、ratings.dat 数据格式为： 1::1193::5::978300760，共有1000209条数据
对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释：用户ID，电影ID，评分，评分时间戳
数据要求：
（1）使用Hive能解析的方式进行数据导入。
Hive要求：
（1）正确建表，导入数据（三张表，三份数据），并验证是否正确
（2）求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）
（3）分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）
（4）求movieid = 2116这部电影各年龄段（因为年龄就只有7个，就按这个7个分就好了）的平均影评（年龄段，影评分）
（5）求最喜欢看电影（影评次数最多）的那位女性评最高分的10部电影的平均影评分（观影者，电影名，影评分）
（6）求好片（评分>=4.0）最多的那个年份的最好看的10部电影
（7）求1997年上映的电影中，评分最高的10部Comedy类电影
（8）该影评库中各种类型电影中评价最高的5部电影（类型，电影名，平均影评分）
（9）各年评分最高的电影类型（年份，类型，影评分）
（10）每个地区最高评分的电影名，把结果存入HDFS（地区，电影名，影评分）

解答

打开虚拟机，在终端执行以下命令以打开ssh连接、hadoop服务,运行hive

ssh hadoopmaster
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
hive

数据准备：

首先将需求的文件：users.dat、ratings.dat、movies.dat在windows操作系统中发送到centos7中的Downloads文件夹下，在hdfs中创建新的目录：/usr/hadoop/hive/input,然后将上述的三个文件发送到hdfs集群中新创建的文件夹下。

hdfs dfs -put ./Downloads/users.dat ./hive/input
hdfs dfs -put ./Downloads/ratings.dat ./hive/input
hdfs dfs -put ./Downloads/movies.dat ./hive/input

第一题：
（1）正确建表，导入数据（三张表，三份数据），并验证是否正确
注意：hive不支持解析多字节的分隔符，也就是说hive只能解析’:’, 不支持解析’::’，所以用普通方式建表来使用是行不通的，要求对数据做一次简单清洗.
所以需要使用能解析多字节分隔符的Serde并且使用RegexSerde需要两个参数：
input.regex = “(.):: (.):: (.*)”
output.format.string = “%1s %3$s”

创建数据库

drop database if exists test_movie;
create database if not exists test_movie;
use test_movie;

创建user表

drop table if exists users;
create table users(
userid bigint,
gender string,
age int,
occupation string,
zipcode string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;

创建rating表

drop table if exists ratings;
create table ratings(
userid bigint,
movieid bigint,
rating double,
timestamped string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;

创建movie表

drop table if exists movies;
create table movies(
movieid bigint,
title string,
genres string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;

MovieID BigInt, Title String, Genres String

对应字段中文解释：电影ID，电影名字，电影类型

导入数据

load data inpath "hdfs://192.168.2.2:9000/user/hadoop/hive/input/users.dat" into table users;
load data inpath "hdfs://192.168.2.2:9000/user/hadoop/hive/input/movies.dat" into table movies;
load data inpath "hdfs://192.168.2.2:9000/user/hadoop/hive/input/ratings.dat" into table ratings;

验证

select * from users u;
select * from ratings r;
select * from movies m;

第二题：
（2）求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）
思路：
1、需求字段：电影名 t_movie.moviename 评分次数 t_rating.rate count()
2、核心SQL：按照电影名进行分组统计，求出每部电影的评分次数并按照评分次数降序排序

select  m.title as title, count(m.title) as total
from  movies as m join ratings as r on m.movieid=r.movieid 
group by m.title 
order by total desc
limit 10;

第三题：
（3）分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）
女性：

select 'F' as gender, m.title as title, avg(r.rating) as avgrate
from ratings r 
join users u on r.userid=u.userid 
join movies m on r.movieid=m.movieid 
where u.gender="F" 
group by m.title
order by avgrate desc 
limit 10;

男性：

select 'M' as gender, m.title as title, avg(r.rating) as avgrate
from ratings r 
join users u on r.userid=u.userid 
join movies m on r.movieid=m.movieid 
where u.gender="M" 
group by m.title
order by avgrate desc 
limit 10;

第四题
（4）求movieid = 2116这部电影各年龄段（因为年龄就只有7个，就按这个7个分就好了）的平均影评（年龄段，影评分）

select u.age as age, avg(r.rating) as avgrate
from users as u join ratings as r on u.userid = r.userid
where r.movieid = 2116
group by u.age;

第五题
（5）求最喜欢看电影（影评次数最多）的那位女性评最高分的10部电影的平均影评分（观影者，电影名，影评分）

With a_users as(
select u.userid as userid,count(r.userid) as total 
from ratings as r join users as u on u.userid=r.userid
where u.gender="F" 
group by u.userid 
order by total desc 
limit 1),
a_movies as(
Select a_users.userid,m.title,m.movieid,r.rating 
from movies as m join ratings as r on m.movieid=r.movieid,a_users
where r.userid=a_users.userid 
order by r.rating desc 
limit 10),
ave as(
Select movieid,avg(rating) as avgrate 
from ratings 
group by movieid)
Select a_movies.userid,a_movies.movieid,ave.avgrate 
from a_movies join ave on ave.movieid=a_movies.movieid;

第六题:
（6）求好片（评分>=4.0）最多的那个年份的最好看的10部电影

with years as (
select  m.movieid as movieid, m.title as title, substr(m.title,-5,4) as year, avg(r.rating) as avgrate
from movies m join ratings r on m.movieid=r.movieid 
group by m.movieid, m.title),
goal_year as (
select year, count(year) as total 
from years 
where avgrate >= 4.0 
group by year
order by total desc 
limit 1)
select y.year as year,y.title as title, y.avgrate as rating 
from years as y,goal_year as g 
where y.year =  g.year 
order by rating desc 
limit 10;

第七题
（7）求1997年上映的电影中，评分最高的10部Comedy类电影

with a_movies as(
select substr(m.title,-5,4) as year,m.movieid as movieid,m.title as title,m.genres as genres
from movies as m
where instr(lcase(m.genres),'comedy') >0),
b_movies as (
select m.movieid as movieid,m.title as title
from a_movies as m
where m.year = 1997
)
select m.movieid as movieid, m.title as title,avg(r.rating) as avgrate
from b_movies as m join ratings as r on m.movieid = r.movieid
group by m.movieid,m.title
order by avgrate desc
limit 10;

第八题
（8）该影评库中各种类型电影中评价最高的5部电影（类型，电影名，平均影评分）

With types as(
select movieid,title,tv.tp type 
from movies 
lateral view explode(split(genres,"\\|")) tv as tp),
goals as(
select types.title as title,types.movieid as movieid,types.type as type,avg(ratings.rating) as avgrating 
from types join ratings on types.movieid=ratings.movieid 
group by types.title,types.movieid,types.type)
Select * 
from (
Select type,title,avgrating,row_number() over(partition by type order by avgrating desc) as number 
from goals
) as result
where result.number<=5;

第九题
（9）各年评分最高的电影类型（年份，类型，影评分）

With types as(
select title,movieid,substr(title,-5,4) as year,tv.tp as type 
from movies lateral view explode(split(genres,"\\|")) tv as tp
),
a_ratings as(
select title,year,type,avg(rating) as avgrating 
from ratings as r join types as t on t.movieid=r.movieid 
group by t.movieid,t.title,t.year,t.type)
select * 
from (
select year,type,avgrating,row_number() over(partition by year order by avgrating desc) as number 
from a_ratings as r) 
as result 
where result.number=1;

第十题
（10）每个地区最高评分的电影名，把结果存入HDFS（地区，电影名，影评分）

with local_movie as(
select u.zipcode as zipcode,m.title as title,avg(r.rating) as avgrating
from users as u 
join ratings as r on u.userid = r.userid 
join movies as m on r.movieid = m.movieid
group by m.title,u.zipcode)
insert overwrite directory "./hive/output/"
select * 
from(
select *,row_number() over(partition by l.zipcode order by avgrating desc) as number 
from local_movie as l) 
as result 
where result.number=1;

上一篇： [机器学习]创建自己的电影推荐系统

下一篇：星火电影评分数据分析星火大数据分析实战豆瓣

基于蜂巢的电影评论情感分析和可视化蜂巢电影评论数据分析

Hive应用实践

题目描述

解答