欢迎您访问 最编程 本站为您分享编程语言代码,编程技术文章!
您现在的位置是: 首页

基于蜂巢的电影评论情感分析和可视化蜂巢电影评论数据分析

最编程 2024-04-04 11:12:44
...


Hive应用实践

题目描述

现有如此三份数据:
1、users.dat 数据格式为: 2::M::56::16::70072,共有6040条数据
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码
2、movies.dat 数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy,共有3883条数据
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型
3、ratings.dat 数据格式为: 1::1193::5::978300760,共有1000209条数据
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳
数据要求:
(1)使用Hive能解析的方式进行数据导入。
Hive要求:
(1)正确建表,导入数据(三张表,三份数据),并验证是否正确
(2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
(3)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
(4)求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
(5)求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
(6)求好片(评分>=4.0)最多的那个年份的最好看的10部电影
(7)求1997年上映的电影中,评分最高的10部Comedy类电影
(8)该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
(9)各年评分最高的电影类型(年份,类型,影评分)
(10)每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)

解答

打开虚拟机,在终端执行以下命令以打开ssh连接、hadoop服务,运行hive

ssh hadoopmaster
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
hive

数据准备:

首先将需求的文件:users.dat、ratings.dat、movies.dat在windows操作系统中发送到centos7中的Downloads文件夹下,在hdfs中创建新的目录:/usr/hadoop/hive/input,然后将上述的三个文件发送到hdfs集群中新创建的文件夹下。

hdfs dfs -put ./Downloads/users.dat ./hive/input
hdfs dfs -put ./Downloads/ratings.dat ./hive/input
hdfs dfs -put ./Downloads/movies.dat ./hive/input
  • 第一题:
    (1)正确建表,导入数据(三张表,三份数据),并验证是否正确
    注意:hive不支持解析多字节的分隔符,也就是说hive只能解析’:’, 不支持解析’::’,所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗.
    所以需要使用能解析多字节分隔符的Serde并且使用RegexSerde需要两个参数:
    input.regex = “(.):: (.):: (.*)”
    output.format.string = “%1 s %3$s”
  1. 创建数据库
drop database if exists test_movie;
create database if not exists test_movie;
use test_movie;
  1. 创建user表
drop table if exists users;
create table users(
userid bigint,
gender string,
age int,
occupation string,
zipcode string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
  1. 创建rating表
drop table if exists ratings;
create table ratings(
userid bigint,
movieid bigint,
rating double,
timestamped string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
  1. 创建movie表
drop table if exists movies;
create table movies(
movieid bigint,
title string,
genres string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;

MovieID BigInt, Title String, Genres String

对应字段中文解释:电影ID,电影名字,电影类型

  1. 导入数据
load data inpath "hdfs://192.168.2.2:9000/user/hadoop/hive/input/users.dat" into table users;
load data inpath "hdfs://192.168.2.2:9000/user/hadoop/hive/input/movies.dat" into table movies;
load data inpath "hdfs://192.168.2.2:9000/user/hadoop/hive/input/ratings.dat" into table ratings;
  1. 验证
select * from users u;
select * from ratings r;
select * from movies m;
  • 第二题:
    (2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
    思路:
    1、需求字段:电影名 t_movie.moviename 评分次数 t_rating.rate count()
    2、核心SQL:按照电影名进行分组统计,求出每部电影的评分次数并按照评分次数降序排序
select  m.title as title, count(m.title) as total
from  movies as m join ratings as r on m.movieid=r.movieid 
group by m.title 
order by total desc
limit 10;
  • 第三题:
    (3)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
    女性 :
select 'F' as gender, m.title as title, avg(r.rating) as avgrate
from ratings r 
join users u on r.userid=u.userid 
join movies m on r.movieid=m.movieid 
where u.gender="F" 
group by m.title
order by avgrate desc 
limit 10;

男性 :

select 'M' as gender, m.title as title, avg(r.rating) as avgrate
from ratings r 
join users u on r.userid=u.userid 
join movies m on r.movieid=m.movieid 
where u.gender="M" 
group by m.title
order by avgrate desc 
limit 10;
  • 第四题
    (4)求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
select u.age as age, avg(r.rating) as avgrate
from users as u join ratings as r on u.userid = r.userid
where r.movieid = 2116
group by u.age;
  • 第五题
    (5)求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
With a_users as(
select u.userid as userid,count(r.userid) as total 
from ratings as r join users as u on u.userid=r.userid
where u.gender="F" 
group by u.userid 
order by total desc 
limit 1),
a_movies as(
Select a_users.userid,m.title,m.movieid,r.rating 
from movies as m join ratings as r on m.movieid=r.movieid,a_users
where r.userid=a_users.userid 
order by r.rating desc 
limit 10),
ave as(
Select movieid,avg(rating) as avgrate 
from ratings 
group by movieid)
Select a_movies.userid,a_movies.movieid,ave.avgrate 
from a_movies join ave on ave.movieid=a_movies.movieid;
  • 第六题:
    (6)求好片(评分>=4.0)最多的那个年份的最好看的10部电影
with years as (
select  m.movieid as movieid, m.title as title, substr(m.title,-5,4) as year, avg(r.rating) as avgrate
from movies m join ratings r on m.movieid=r.movieid 
group by m.movieid, m.title),
goal_year as (
select year, count(year) as total 
from years 
where avgrate >= 4.0 
group by year
order by total desc 
limit 1)
select y.year as year,y.title as title, y.avgrate as rating 
from years as y,goal_year as g 
where y.year =  g.year 
order by rating desc 
limit 10;
  • 第七题
    (7)求1997年上映的电影中,评分最高的10部Comedy类电影
with a_movies as(
select substr(m.title,-5,4) as year,m.movieid as movieid,m.title as title,m.genres as genres
from movies as m
where instr(lcase(m.genres),'comedy') >0),
b_movies as (
select m.movieid as movieid,m.title as title
from a_movies as m
where m.year = 1997
)
select m.movieid as movieid, m.title as title,avg(r.rating) as avgrate
from b_movies as m join ratings as r on m.movieid = r.movieid
group by m.movieid,m.title
order by avgrate desc
limit 10;
  • 第八题
    (8)该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
With types as(
select movieid,title,tv.tp type 
from movies 
lateral view explode(split(genres,"\\|")) tv as tp),
goals as(
select types.title as title,types.movieid as movieid,types.type as type,avg(ratings.rating) as avgrating 
from types join ratings on types.movieid=ratings.movieid 
group by types.title,types.movieid,types.type)
Select * 
from (
Select type,title,avgrating,row_number() over(partition by type order by avgrating desc) as number 
from goals
) as result
where result.number<=5;
  • 第九题
    (9)各年评分最高的电影类型(年份,类型,影评分)
With types as(
select title,movieid,substr(title,-5,4) as year,tv.tp as type 
from movies lateral view explode(split(genres,"\\|")) tv as tp
),
a_ratings as(
select title,year,type,avg(rating) as avgrating 
from ratings as r join types as t on t.movieid=r.movieid 
group by t.movieid,t.title,t.year,t.type)
select * 
from (
select year,type,avgrating,row_number() over(partition by year order by avgrating desc) as number 
from a_ratings as r) 
as result 
where result.number=1;
  • 第十题
    (10)每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)
with local_movie as(
select u.zipcode as zipcode,m.title as title,avg(r.rating) as avgrating
from users as u 
join ratings as r on u.userid = r.userid 
join movies as m on r.movieid = m.movieid
group by m.title,u.zipcode)
insert overwrite directory "./hive/output/"
select * 
from(
select *,row_number() over(partition by l.zipcode order by avgrating desc) as number 
from local_movie as l) 
as result 
where result.number=1;