Python Web Scraping: Collecting and Analyzing Second-Hand Housing Prices and Related Data in Fujian
I. Background of the Topic
As more and more cities' real-estate markets enter the existing-stock era, the second-hand housing market is becoming increasingly important, and its trends have a growing influence on the real-estate market as a whole. In many first- and second-tier cities where second-hand homes make up a large share of the market, the second-hand and new-home markets are tightly linked through a "sell old, buy new" chain: second-hand sellers free up funds by trading in their current home and then upgrade in the new-home market.
Before buying a home, people search for housing information on the relevant websites, but faced with massive data from many sources, web-crawling techniques are needed to shorten download time and to sift useful information out of large volumes of data. This article uses a Python crawler to scrape second-hand housing data for Fujian province from Lianjia (lianjia.com); the scraped fields include district, layout, decoration status, total price, unit price, and so on, and the collected data are then cleaned. Finally, the cleaned data are visualized with ECharts to explore how floor area, location, layout, and other factors affect second-hand housing prices in Fujian.
II. Steps of the Data Visualization Analysis
1. Crawler name: Lianjia second-hand housing crawler
2. Content scraped: house name, total price, price per square meter, floor area, city, district, layout, etc.
3. Approach: observe the URL pattern of Lianjia's second-hand listing pages and build the list-page URLs by concatenation (e.g. https://xm.lianjia.com/ershoufang/pg2/ for page 2 of Xiamen); then scrape the detail-page URL of every house on each list page, loop over the detail pages to extract each house's information, and store it in a database for the subsequent analysis and visualization.
Technical difficulties:
1. Besides the target data, the list pages also contain ads and other entries, which must be skipped via exception handling.
2. Using pandas to process and analyze the data stored in the database (a sketch follows below).
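For the second point, the original write-up never shows how the scraped records are loaded back out of the database. A minimal sketch, assuming the records were saved to a MySQL table named house in a database house_db (the connection URL, database, and table names are all assumptions, not from the original code), using pandas' read_sql through SQLAlchemy:

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection URL, database, and table names; adjust to the real setup.
engine = create_engine('mysql+pymysql://root:password@localhost:3306/house_db?charset=utf8mb4')

# Load all scraped records into a DataFrame for cleaning and analysis.
df = pd.read_sql('SELECT * FROM house', engine)
print(df.shape)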
III. Structural Analysis of the Target Pages
1. Page structure
We scrape second-hand housing data for every Fujian city that Lianjia covers: Fuzhou, Xiamen, Quanzhou, and Zhangzhou. Each city has 100 list pages, and each page holds 30 listings, so we first collect the 30 detail-page URLs on each page and then extract the data from each one. The crawler therefore has three layers: an outer loop over cities, a middle loop over list pages, and an inner loop over the listings on each page, as sketched below.
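A minimal sketch of the three-layer loop (city → list page → listing), built from the URL pattern and XPath used in section IV; the real crawler there adds request headers, throttling, and detail-page parsing, while here each detail-page URL is simply printed:

import requests
from lxml import etree

# Layer 1: the four Fujian cities covered by Lianjia
for city in ['fz', 'xm', 'quanzhou', 'zhangzhou']:
    # Layer 2: the 100 list pages per city
    for page in range(1, 101):
        list_url = f'https://{city}.lianjia.com/ershoufang/pg{page}/'
        html = etree.HTML(requests.get(list_url).text)
        # Layer 3: the ~30 detail-page links on this list page
        for house_url in html.xpath("//ul[@class='sellListContent']/li/a/@href"):
            print(house_url)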
2. Node (tag) lookup and traversal
name = html.xpath("//h1/text()")
price = html.xpath("//span[@class='total']/text()")
area = html.xpath("//div[@class='area']/div[1]/text()")
priceAvg = html.xpath("//span[@class='unitPriceValue']/text()")
houseType = html.xpath("//div[@class='room']/div[@class='mainInfo']/text()")
orientation = html.xpath("//div[@class='type']/div[@class='mainInfo']/text()")
city = html.xpath("//div[4]/div/div/a[1]/text()")
district = html.xpath("//div[@class='areaName']/span[@class='info']/a[1]/text()")
community = html.xpath("//div[@class='communityName']//a[1]/text()")
decoration = html.xpath("//div[@class='base']/div[@class='content']/ul/li[9]/text()")
propertyRight = html.xpath("//div[@class='transaction']/div[@class='content']/ul/li[6]/span[2]/text()")
lift = html.xpath("//div[@class='base']/div[@class='content']/ul/li[11]/text()")
lifeRate = html.xpath("//div[@class='base']/div[@class='content']/ul/li[10]/text()")
builtType = html.xpath("//div[@class='base']/div[@class='content']/ul/li[6]/text()")
builtStructure = html.xpath("//div[@class='base']/div[@class='content']/ul/li[8]/text()")
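Each xpath() call returns a list; on pages that are actually advertisements the list comes back empty, so indexing with [0] raises an IndexError, which is exactly what the try/except in section IV silently skips. A gentler alternative, given an html tree parsed as above, is a small helper that supplies a default instead (first() is a hypothetical name, not part of the original code):

# Hypothetical helper: return the first xpath match, or a default,
# instead of letting an empty list raise IndexError on ad pages.
def first(nodes, default=None):
    return nodes[0].strip() if nodes else default

name = first(html.xpath("//h1/text()"))
price = first(html.xpath("//span[@class='total']/text()"))
if name is None or price is None:
    print("skipping a non-listing page (probably an ad)")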
IV. Web Crawler Program Design
1. Data crawling and collection
import csv
import random
import time

import requests
from lxml import etree


def start():
    # City abbreviations used for URL concatenation. Lianjia covers only these
    # four cities in Fujian; the supported list can be checked at
    # https://map.lianjia.com/map/350200/ESF/
    cityList = ['xm', 'fz', 'quanzhou', 'zhangzhou']
    # Request headers, to avoid being identified as a crawler
    headers = {
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46",
        "Cookie": "lianjia_uuid=60c54ae6-6b42-4222-8692-efe4fb2c554e; crosSdkDT2019DeviceId=eu2b5l-mfq32x-njycw4jyhevekso-ud4cezlw8; _smt_uid=632e762f.20c8106d; _ga=GA1.2.203719967.1663989297; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1669339641; lianjia_ssid=917d080d-af96-4114-92ce-e9ec6ded0cde; _gid=GA1.2.1343492366.1671084252; lianjia_token=2.0111ad29317b6152f82536df72a8e72e0777cf78c4; beikeBaseData=%7B%22parentSceneId%22:%226413099907731748097%22%7D; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221836d7da82957c-0d0e66116b0617-78565473-1327104-1836d7da82a940%22%2C%22%24device_id%22%3A%221836d7da82957c-0d0e66116b0617-78565473-1327104-1836d7da82a940%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1671089180; select_city=350200"
    }
    # Write the CSV header; this file is only a backup copy and is not used later
    with open(r'二手房房价2.csv', 'a', encoding='utf_8_sig', newline='') as f:
        table_label = ['房屋名称', '房屋总价', '建筑面积', '每平方价格', '房屋户型', '房屋朝向',
                       '城市', '地区', '小区名称', '装修情况', '房屋产权', '电梯情况', '梯户比例', '建筑类型', '建筑结构']
        wt = csv.writer(f)
        wt.writerow(table_label)
    # First round of URL concatenation: one base URL per city
    for i in cityList:
        url = 'https://' + i + '.lianjia.com/ershoufang/'
        # Second round: expand each city URL into its 100 list pages
        all_url = Get_url(url)
        # Pass all page URLs and the headers on for collection
        Get_house_url(all_url, headers)


# Each city has 100 list pages; return the concatenated page URLs as a list
def Get_url(url):
    all_url = []
    for i in range(1, 101):
        all_url.append(url + 'pg' + str(i) + '/')
    return all_url


# Collect the URL of every house on each list page
def Get_house_url(all_url, headers):
    num = 0  # simple page counter
    for i in all_url:
        # Fetch the HTML of the current list page
        r = requests.get(i, headers=headers)
        html = etree.HTML(r.text)
        # Match every house URL on this page; xpath returns a list
        url_ls = html.xpath("//ul[@class ='sellListContent']/li/a/@href")
        # Crawl the content of each house's detail page
        Analysis_html(url_ls, headers)
        time.sleep(4)
        num += 1
        print("第%d页爬完了" % num)  # num counts the pages crawled so far


# Collect the detail information of each house
def Analysis_html(url_ls, headers):
    for i in url_ls:
        r = requests.get(i, headers=headers)
        html = etree.HTML(r.text)
        name = html.xpath("//h1/text()")  # house name
        price = html.xpath("//span[@class='total']/text()")  # total price
        area = html.xpath("//div[@class='area']/div[1]/text()")  # floor area
        priceAvg = html.xpath("//span[@class='unitPriceValue']/text()")  # price per square meter
        houseType = html.xpath("//div[@class='room']/div[@class='mainInfo']/text()")  # layout
        orientation = html.xpath("//div[@class='type']/div[@class='mainInfo']/text()")  # orientation
        city = html.xpath("//div[4]/div/div/a[1]/text()")  # city
        district = html.xpath("//div[@class='areaName']/span[@class='info']/a[1]/text()")  # district
        community = html.xpath("//div[@class='communityName']//a[1]/text()")  # community
        decoration = html.xpath("//div[@class='base']/div[@class='content']/ul/li[9]/text()")  # decoration status
        propertyRight = html.xpath(
            "//div[@class='transaction']/div[@class='content']/ul/li[6]/span[2]/text()")  # property rights
        lift = html.xpath("//div[@class='base']/div[@class='content']/ul/li[11]/text()")  # elevator
        lifeRate = html.xpath("//div[@class='base']/div[@class='content']/ul/li[10]/text()")  # elevator-to-household ratio
        builtType = html.xpath("//div[@class='base']/div[@class='content']/ul/li[6]/text()")  # building type
        builtStructure = html.xpath("//div[@class='base']/div[@class='content']/ul/li[8]/text()")  # building structure
        # Ads may appear among the listings; skip them via exception handling
        try:
            # Store the scraped fields in the database
            Save_data(name, price, area, priceAvg, houseType, orientation, city, district, community, decoration,
                      propertyRight, lift, lifeRate, builtType, builtStructure)
            print(name, price, area, priceAvg, houseType, orientation, city, district, community, decoration,
                  propertyRight, lift, lifeRate, builtType, builtStructure)
        except Exception:
            continue
        # Random sleep to throttle the request rate
        time.sleep(random.randint(1, 3))
2. Data cleaning and processing
Strip the redundant characters and convert the columns needed for calculation to numeric types:
import numpy as np  # imports shared by the analysis functions in this section
import pandas as pd


# Data cleaning
def data_clean(df):
    print(df.houseCity)
    # Strip the trailing '房产网' characters from the city column
    df['houseCity'] = df['houseCity'].str.strip('房产网')
    # Convert strings to floats for the statistics that follow
    df['housePrice'] = df['housePrice'].astype(float, errors='raise')
    df['housePriceAvg'] = df['housePriceAvg'].astype(float, errors='raise')
    print(df.houseCity)
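Note that pandas' str.strip('房产网') removes any of those three characters from both ends of the string, not the literal suffix; it works here only because none of the four city names begins or ends with one of those characters. A quick check (the sample values are illustrative):

s = pd.Series(['厦门房产网', '福州房产网'])
print(s.str.strip('房产网').tolist())  # ['厦门', '福州'] — character-set strip, not suffix removal
# A stricter alternative removes only the exact trailing suffix:
print(s.str.replace('房产网$', '', regex=True).tolist())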
Prepare the average total price and per-square-meter price for each city:
# Build the data for the city price chart and the per-square-meter price chart
def get_house_price(df):
    # Highest total price in the dataset
    maxPrice = df['housePrice'].max(axis=0)
    # Group by city and take the mean total price and mean unit price, rounded to 1 decimal
    housePriceList = round(df.groupby(['houseCity'])['housePrice'].mean(), 1)
    houseUnitPriceList = round(df.groupby(['houseCity'])['housePriceAvg'].mean(), 1)
    # Convert back to DataFrames (mean on a grouped column returns a Series)
    housePriceList = pd.DataFrame({'houseCity': housePriceList.index, 'housePriceAvg': housePriceList.values})
    houseUnitPriceList = pd.DataFrame(
        {'houseCity': houseUnitPriceList.index, 'houseUnitPriceAvg': houseUnitPriceList.values})
    # Sort by average price; ascending=True sorts from low to high
    housePriceList.sort_values(by=['housePriceAvg'], axis=0, ascending=[True], inplace=True)
    # Merge the two frames on the city name so the rows stay aligned
    cityAvg = pd.merge(housePriceList, houseUnitPriceList, on='houseCity', how='inner')
    # Convert the columns to plain lists for the front-end charts
    cityList = np.array(cityAvg.houseCity)
    cityList = cityList.tolist()
    priceList = np.array(cityAvg.housePriceAvg)
    priceList = priceList.tolist()
    unitPriceList = np.array(cityAvg.houseUnitPriceAvg)
    unitPriceList = unitPriceList.tolist()
    print(cityList, priceList, unitPriceList)
    return cityList, priceList, unitPriceList, len(cityList), len(df), maxPrice, housePriceList.houseCity[0]
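To sanity-check the function, it can be run on a tiny hand-made DataFrame (the numbers are invented for illustration; assumes get_house_price and the imports above are in scope):

# Invented sample rows, for illustration only
df = pd.DataFrame({
    'houseCity': ['厦门', '厦门', '福州'],
    'housePrice': [500.0, 300.0, 200.0],           # total price, 万元
    'housePriceAvg': [55000.0, 40000.0, 18000.0],  # price per square meter
})
cityList, priceList, unitPriceList, n, rows, maxPrice, firstCity = get_house_price(df)
# cityList -> ['福州', '厦门'], priceList -> [200.0, 400.0], maxPrice -> 500.0

One quirk worth noting: the returned housePriceList.houseCity[0] uses label-based indexing in current pandas, so after the ascending sort it still picks the row whose original index label is 0 rather than the first row of the sorted frame; housePriceList.houseCity.iloc[0] would be needed to get the lowest-priced city positionally.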
Prepare the data on layouts, listings per district, decoration status, and building types for the selected city:
# Build the data for the four pie charts
def get_pie(df, cityName=None):
    # If a city is given, narrow the frame to that city's rows
    if cityName is not None:
        df = df[df['houseCity'].str.contains(cityName)]
    # size() counts how often each value occurs in the column
    houseTypeList = df.groupby(['houseType']).size()
    houseDistrictList = df.groupby(['houseDistrict']).size()
    houseDecorationList = df.groupby(['houseDecoration']).size()
    builtTypeList = df.groupby(['builtType']).size()

    # Collect each column's counts as {'value': ..., 'name': ...} dicts,
    # the format the ECharts pie charts expect
    templist = []
    for i, j in zip(houseTypeList.index, houseTypeList.values):
        templist.append({'value': str(j), 'name': str(i)})
    templist1 = []
    for i, j in zip(houseDistrictList.index, houseDistrictList.values):
        templist1.append({'value': str(j), 'name': str(i)})
    templist2 = []
    for i, j in zip(houseDecorationList.index, houseDecorationList.values):
        templist2.append({'value': str(j), 'name': str(i)})
    templist3 = []
    for i, j in zip(builtTypeList.index, builtTypeList.values):
        templist3.append({'value': str(j), 'name': str(i)})
    all_list = [templist, templist1, templist2, templist3]
    print(all_list)
    return all_list
Prepare the per-district average price data for each city:
# Average total price per district within a city
def analyse_district(df, cityName=None):
    if cityName is not None:
        df = df[df['houseCity'].str.contains(cityName)]
    houseDistrictPrice = round(df.groupby(['houseDistrict'])['housePrice'].mean(), 1)

    districtList = np.array(houseDistrictPrice.index)
    districtList = districtList.tolist()
    priceList = np.array(houseDistrictPrice.values)
    priceList = priceList.tolist()
    print(districtList, '\n', priceList)
    return districtList, priceList
3. Text analysis
Analyze the popularity of residential communities in each city:
import os

import jieba
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud


# Word cloud generation
def wordCloud(df):
    # Generate one word cloud per city
    for i in df.houseCity.unique():
        if os.path.exists(r'D:\Python\workspace\HousePriceAnalysis\static\images\\' + i + '.jpg'):
            pass
        else:
            # Use a per-city view so the original frame is not overwritten
            # (the original code reassigned df, shrinking it on every iteration)
            cityDf = df[df['houseCity'].str.contains(i)]
            # Tokenize the community names
            strAll = ''
            for j in cityDf['houseCommunity']:
                strAll += j
            cut = jieba.cut(strAll)
            strList = " ".join(cut)
            print(strList)
            # Build the mask
            img = Image.open(r'D:\Python\workspace\HousePriceAnalysis\static\images\遮罩.jpg')
            img_array = np.array(img)  # convert the image to an array
            wc = WordCloud(
                background_color='white',
                mask=img_array,
                font_path="simhei.ttf",
                height=100,
                width=300
            )
            wc.generate_from_text(strList)
            # fig = plt.figure(1)
            plt.imshow(wc)
            plt.axis('off')
            plt.savefig(r'D:\Python\workspace\HousePriceAnalysis\static\images\%s.jpg' % i)
4. Data analysis and visualization
The city average-price chart and the price-per-square-meter chart make the differences among the four Fujian cities easy to see at a glance.
var chartDom = document.getElementById('main1');
var myChart = echarts.init(chartDom);
var option;
option = {
    title: {
        text: '每平米价格图'
    },
    tooltip: {
        trigger: 'axis',
        axisPointer: {
            type: ''
        }
    },
    legend: {},
    grid: {
        left: '3%',
        right: '4%',
        bottom: '3%',
        containLabel: true
    },
    xAxis: {
        type: 'value',
        boundaryGap: [0, 0.01]
    },
    yAxis: {
        type: 'category',
        data: {{cityList|safe}}
    },
    series: [
        {
            name: '每平米价格/元',
            type: 'bar',
            data: {{unitPriceList|safe}}
        }
    ]
};
option && myChart.setOption(option);

var chartDom = document.getElementById('main');
var myChart = echarts.init(chartDom);
var option;
option = {
    title: {
        text: '城市房价图'
    },
    tooltip: {
        trigger: 'axis',
        axisPointer: {
            type: ''
        }
    },
    legend: {},
    grid: {
        left: '3%',
        right: '4%',
        bottom: '3%',
        containLabel: true
    },
    xAxis: {
        type: 'value',
        boundaryGap: [0, 0.01]
    },
    yAxis: {
        type: 'category',
        data: {{cityList|safe}}
    },
    series: [
        {
            name: '平均房价/万元',
            type: 'bar',
            data: {{priceList|safe}}
        }
    ]
};
option && myChart.setOption(option);
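The {{cityList|safe}} placeholders show that this page is a Jinja2 template rendered by a Python web framework. The original back end is not shown; below is a minimal sketch of how a route might wire the analysis functions to the template, assuming Flask, a template file named index.html, and a hypothetical load_df_from_db helper like the read_sql sketch in section II (all three are assumptions):

from flask import Flask, render_template

app = Flask(__name__)


@app.route('/')
def index():
    # load_df_from_db is a hypothetical loader (see the read_sql sketch above);
    # data_clean and get_house_price are the functions shown earlier.
    df = load_df_from_db()
    data_clean(df)
    (cityList, priceList, unitPriceList,
     cityCount, rowCount, maxPrice, firstCity) = get_house_price(df)
    # The template's {{cityList|safe}}, {{priceList|safe}}, and
    # {{unitPriceList|safe}} placeholders are filled from these arguments.
    return render_template('index.html', cityList=cityList,
                           priceList=priceList, unitPriceList=unitPriceList)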