Python Web Scraping: Collecting and Analyzing Second-Hand Housing Prices and Related Data in Fujian
I. Background of the Topic
As more and more cities' real-estate markets enter the existing-stock era, the second-hand housing market is becoming increasingly important, and its trends have a growing influence on the real-estate market as a whole. In many first- and second-tier cities where second-hand homes make up a large share of the market, the second-hand and new-home markets are tightly linked through a "sell old, buy new" chain: second-hand sellers free up funds by trading in their current home and then upgrade in the new-home market.
Before buying a home, people search for housing information on the relevant websites, but faced with massive data from many sources, web-crawling techniques are needed to shorten download time and to sift useful information out of large volumes of data. This article uses a Python crawler to scrape second-hand housing data for Fujian province from Lianjia (lianjia.com); the scraped fields include district, layout, decoration status, total price, unit price, and so on, and the collected data are then cleaned. Finally, the cleaned data are visualized with ECharts to explore how floor area, location, layout, and other factors affect second-hand housing prices in Fujian.
II. Steps of the Data Visualization Analysis
1. Crawler name: Lianjia second-hand housing crawler
2. Content scraped: house name, total price, price per square meter, floor area, city, district, layout, etc.
3. Approach: observe the URL pattern of Lianjia's second-hand listing pages and build the list-page URLs by concatenation (e.g. https://xm.lianjia.com/ershoufang/pg2/ for page 2 of Xiamen); then scrape the detail-page URL of every house on each list page, loop over the detail pages to extract each house's information, and store it in a database for the subsequent analysis and visualization.
Technical difficulties:
1. Besides the target data, the list pages also contain ads and other entries, which must be skipped via exception handling.
2. Using pandas to process and analyze the data stored in the database (a sketch follows below).
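For the second point, the original write-up never shows how the scraped records are loaded back out of the database. A minimal sketch, assuming the records were saved to a MySQL table named house in a database house_db (the connection URL, database, and table names are all assumptions, not from the original code), using pandas' read_sql through SQLAlchemy:

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection URL, database, and table names; adjust to the real setup.
engine = create_engine('mysql+pymysql://root:password@localhost:3306/house_db?charset=utf8mb4')

# Load all scraped records into a DataFrame for cleaning and analysis.
df = pd.read_sql('SELECT * FROM house', engine)
print(df.shape)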
III. Structural Analysis of the Target Pages
1. Page structure
We scrape second-hand housing data for every Fujian city that Lianjia covers: Fuzhou, Xiamen, Quanzhou, and Zhangzhou. Each city has 100 list pages, and each page holds 30 listings, so we first collect the 30 detail-page URLs on each page and then extract the data from each one. The crawler therefore has three layers: an outer loop over cities, a middle loop over list pages, and an inner loop over the listings on each page, as sketched below.
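A minimal sketch of the three-layer loop (city → list page → listing), built from the URL pattern and XPath used in section IV; the real crawler there adds request headers, throttling, and detail-page parsing, while here each detail-page URL is simply printed:

import requests
from lxml import etree

# Layer 1: the four Fujian cities covered by Lianjia
for city in ['fz', 'xm', 'quanzhou', 'zhangzhou']:
    # Layer 2: the 100 list pages per city
    for page in range(1, 101):
        list_url = f'https://{city}.lianjia.com/ershoufang/pg{page}/'
        html = etree.HTML(requests.get(list_url).text)
        # Layer 3: the ~30 detail-page links on this list page
        for house_url in html.xpath("//ul[@class='sellListContent']/li/a/@href"):
            print(house_url)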
2. Node (tag) lookup and traversal
name = html.xpath("//h1/text()")
price = html.xpath("//span[@class='total']/text()")
area = html.xpath("//div[@class='area']/div[1]/text()")
priceAvg = html.xpath("//span[@class='unitPriceValue']/text()")
houseType = html.xpath("//div[@class='room']/div[@class='mainInfo']/text()")
orientation = html.xpath("//div[@class='type']/div[@class='mainInfo']/text()")
city = html.xpath("//div[4]/div/div/a[1]/text()")
district = html.xpath("//div[@class='areaName']/span[@class='info']/a[1]/text()")
community = html.xpath("//div[@class='communityName']//a[1]/text()")
decoration = html.xpath("//div[@class='base']/div[@class='content']/ul/li[9]/text()")
propertyRight = html.xpath("//div[@class='transaction']/div[@class='content']/ul/li[6]/span[2]/text()")
lift = html.xpath("//div[@class='base']/div[@class='content']/ul/li[11]/text()")
lifeRate = html.xpath("//div[@class='base']/div[@class='content']/ul/li[10]/text()")
builtType = html.xpath("//div[@class='base']/div[@class='content']/ul/li[6]/text()")
builtStructure = html.xpath("//div[@class='base']/div[@class='content']/ul/li[8]/text()")
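Each xpath() call returns a list; on pages that are actually advertisements the list comes back empty, so indexing with [0] raises an IndexError, which is exactly what the try/except in section IV silently skips. A gentler alternative, given an html tree parsed as above, is a small helper that supplies a default instead (first() is a hypothetical name, not part of the original code):

# Hypothetical helper: return the first xpath match, or a default,
# instead of letting an empty list raise IndexError on ad pages.
def first(nodes, default=None):
    return nodes[0].strip() if nodes else default

name = first(html.xpath("//h1/text()"))
price = first(html.xpath("//span[@class='total']/text()"))
if name is None or price is None:
    print("skipping a non-listing page (probably an ad)")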
IV. Web Crawler Program Design
1. Data crawling and collection
import csv
import random
import time

import requests
from lxml import etree


def start():
    # City abbreviations used for URL concatenation. Lianjia covers only these
    # four cities in Fujian; the supported list can be checked at
    # https://map.lianjia.com/map/350200/ESF/
    cityList = ['xm', 'fz', 'quanzhou', 'zhangzhou']
    # Request headers, to avoid being identified as a crawler
    headers = {
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46",
        "Cookie": "lianjia_uuid=60c54ae6-6b42-4222-8692-efe4fb2c554e; crosSdkDT2019DeviceId=eu2b5l-mfq32x-njycw4jyhevekso-ud4cezlw8; _smt_uid=632e762f.20c8106d; _ga=GA1.2.203719967.1663989297; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1669339641; lianjia_ssid=917d080d-af96-4114-92ce-e9ec6ded0cde; _gid=GA1.2.1343492366.1671084252; lianjia_token=2.0111ad29317b6152f82536df72a8e72e0777cf78c4; beikeBaseData=%7B%22parentSceneId%22:%226413099907731748097%22%7D; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221836d7da82957c-0d0e66116b0617-78565473-1327104-1836d7da82a940%22%2C%22%24device_id%22%3A%221836d7da82957c-0d0e66116b0617-78565473-1327104-1836d7da82a940%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1671089180; select_city=350200"
    }
    # Write the CSV header; this file is only a backup copy and is not used later
    with open(r'二手房房价2.csv', 'a', encoding='utf_8_sig', newline='') as f:
        table_label = ['房屋名称', '房屋总价', '建筑面积', '每平方价格', '房屋户型', '房屋朝向',
                       '城市', '地区', '小区名称', '装修情况', '房屋产权', '电梯情况', '梯户比例', '建筑类型', '建筑结构']
        wt = csv.writer(f)
        wt.writerow(table_label)
    # First round of URL concatenation: one base URL per city
    for i in cityList:
        url = 'https://' + i + '.lianjia.com/ershoufang/'
        # Second round: expand each city URL into its 100 list pages
        all_url = Get_url(url)
        # Pass all page URLs and the headers on for collection
        Get_house_url(all_url, headers)


# Each city has 100 list pages; return the concatenated page URLs as a list
def Get_url(url):
    all_url = []
    for i in range(1, 101):
        all_url.append(url + 'pg' + str(i) + '/')
    return all_url


# Collect the URL of every house on each list page
def Get_house_url(all_url, headers):
    num = 0  # simple page counter
    for i in all_url:
        # Fetch the HTML of the current list page
        r = requests.get(i, headers=headers)
        html = etree.HTML(r.text)
        # Match every house URL on this page; xpath returns a list
        url_ls = html.xpath("//ul[@class ='sellListContent']/li/a/@href")
        # Crawl the content of each house's detail page
        Analysis_html(url_ls, headers)
        time.sleep(4)
        num += 1
        print("第%d页爬完了" % num)  # num counts the pages crawled so far


# Collect the detail information of each house
def Analysis_html(url_ls, headers):
    for i in url_ls:
        r = requests.get(i, headers=headers)
        html = etree.HTML(r.text)
        name = html.xpath("//h1/text()")  # house name
        price = html.xpath("//span[@class='total']/text()")  # total price
        area = html.xpath("//div[@class='area']/div[1]/text()")  # floor area
        priceAvg = html.xpath("//span[@class='unitPriceValue']/text()")  # price per square meter
        houseType = html.xpath("//div[@class='room']/div[@class='mainInfo']/text()")  # layout
        orientation = html.xpath("//div[@class='type']/div[@class='mainInfo']/text()")  # orientation
        city = html.xpath("//div[4]/div/div/a[1]/text()")  # city
        district = html.xpath("//div[@class='areaName']/span[@class='info']/a[1]/text()")  # district
        community = html.xpath("//div[@class='communityName']//a[1]/text()")  # community
        decoration = html.xpath("//div[@class='base']/div[@class='content']/ul/li[9]/text()")  # decoration status
        propertyRight = html.xpath(
            "//div[@class='transaction']/div[@class='content']/ul/li[6]/span[2]/text()")  # property rights
        lift = html.xpath("//div[@class='base']/div[@class='content']/ul/li[11]/text()")  # elevator
        lifeRate = html.xpath("//div[@class='base']/div[@class='content']/ul/li[10]/text()")  # elevator-to-household ratio
        builtType = html.xpath("//div[@class='base']/div[@class='content']/ul/li[6]/text()")  # building type
        builtStructure = html.xpath("//div[@class='base']/div[@class='content']/ul/li[8]/text()")  # building structure
        # Ads may appear among the listings; skip them via exception handling
        try:
            # Store the scraped fields in the database
            Save_data(name, price, area, priceAvg, houseType, orientation, city, district, community, decoration,
                      propertyRight, lift, lifeRate, builtType, builtStructure)
            print(name, price, area, priceAvg, houseType, orientation, city, district, community, decoration,
                  propertyRight, lift, lifeRate, builtType, builtStructure)
        except Exception:
            continue
        # Random sleep to throttle the request rate
        time.sleep(random.randint(1, 3))
2. Data cleaning and processing
Strip the redundant characters and convert the columns needed for calculation to numeric types:
import numpy as np  # imports shared by the analysis functions in this section
import pandas as pd


# Data cleaning
def data_clean(df):
    print(df.houseCity)
    # Strip the trailing '房产网' characters from the city column
    df['houseCity'] = df['houseCity'].str.strip('房产网')
    # Convert strings to floats for the statistics that follow
    df['housePrice'] = df['housePrice'].astype(float, errors='raise')
    df['housePriceAvg'] = df['housePriceAvg'].astype(float, errors='raise')
    print(df.houseCity)
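Note that pandas' str.strip('房产网') removes any of those three characters from both ends of the string, not the literal suffix; it works here only because none of the four city names begins or ends with one of those characters. A quick check (the sample values are illustrative):

s = pd.Series(['厦门房产网', '福州房产网'])
print(s.str.strip('房产网').tolist())  # ['厦门', '福州'] — character-set strip, not suffix removal
# A stricter alternative removes only the exact trailing suffix:
print(s.str.replace('房产网$', '', regex=True).tolist())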
Prepare the average total price and per-square-meter price for each city:
# Build the data for the city price chart and the per-square-meter price chart
def get_house_price(df):
    # Highest total price in the dataset
    maxPrice = df['housePrice'].max(axis=0)
    # Group by city and take the mean total price and mean unit price, rounded to 1 decimal
    housePriceList = round(df.groupby(['houseCity'])['housePrice'].mean(), 1)
    houseUnitPriceList = round(df.groupby(['houseCity'])['housePriceAvg'].mean(), 1)
    # Convert back to DataFrames (mean on a grouped column returns a Series)
    housePriceList = pd.DataFrame({'houseCity': housePriceList.index, 'housePriceAvg': housePriceList.values})
    houseUnitPriceList = pd.DataFrame(
        {'houseCity': houseUnitPriceList.index, 'houseUnitPriceAvg': houseUnitPriceList.values})
    # Sort by average price; ascending=True sorts from low to high
    housePriceList.sort_values(by=['housePriceAvg'], axis=0, ascending=[True], inplace=True)
    # Merge the two frames on the city name so the rows stay aligned
    cityAvg = pd.merge(housePriceList, houseUnitPriceList, on='houseCity', how='inner')
    # Convert the columns to plain lists for the front-end charts
    cityList = np.array(cityAvg.houseCity)
    cityList = cityList.tolist()
    priceList = np.array(cityAvg.housePriceAvg)
    priceList = priceList.tolist()
    unitPriceList = np.array(cityAvg.houseUnitPriceAvg)
    unitPriceList = unitPriceList.tolist()
    print(cityList, priceList, unitPriceList)
    return cityList, priceList, unitPriceList, len(cityList), len(df), maxPrice, housePriceList.houseCity[0]
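To sanity-check the function, it can be run on a tiny hand-made DataFrame (the numbers are invented for illustration; assumes get_house_price and the imports above are in scope):

# Invented sample rows, for illustration only
df = pd.DataFrame({
    'houseCity': ['厦门', '厦门', '福州'],
    'housePrice': [500.0, 300.0, 200.0],           # total price, 万元
    'housePriceAvg': [55000.0, 40000.0, 18000.0],  # price per square meter
})
cityList, priceList, unitPriceList, n, rows, maxPrice, firstCity = get_house_price(df)
# cityList -> ['福州', '厦门'], priceList -> [200.0, 400.0], maxPrice -> 500.0

One quirk worth noting: the returned housePriceList.houseCity[0] uses label-based indexing in current pandas, so after the ascending sort it still picks the row whose original index label is 0 rather than the first row of the sorted frame; housePriceList.houseCity.iloc[0] would be needed to get the lowest-priced city positionally.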
Prepare the data on layouts, listings per district, decoration status, and building types for the selected city:
# Build the data for the four pie charts
def get_pie(df, cityName=None):
    # If a city is given, narrow the frame to that city's rows
    if cityName is not None:
        df = df[df['houseCity'].str.contains(cityName)]
    # size() counts how often each value occurs in the column
    houseTypeList = df.groupby(['houseType']).size()
    houseDistrictList = df.groupby(['houseDistrict']).size()
    houseDecorationList = df.groupby(['houseDecoration']).size()
    builtTypeList = df.groupby(['builtType']).size()

    # Collect each column's counts as {'value': ..., 'name': ...} dicts,
    # the format the ECharts pie charts expect
    templist = []
    for i, j in zip(houseTypeList.index, houseTypeList.values):
        templist.append({'value': str(j), 'name': str(i)})
    templist1 = []
    for i, j in zip(houseDistrictList.index, houseDistrictList.values):
        templist1.append({'value': str(j), 'name': str(i)})
    templist2 = []
    for i, j in zip(houseDecorationList.index, houseDecorationList.values):
        templist2.append({'value': str(j), 'name': str(i)})
    templist3 = []
    for i, j in zip(builtTypeList.index, builtTypeList.values):
        templist3.append({'value': str(j), 'name': str(i)})
    all_list = [templist, templist1, templist2, templist3]
    print(all_list)
    return all_list
Prepare the per-district average price data for each city:
# Average total price per district within a city
def analyse_district(df, cityName=None):
    if cityName is not None:
        df = df[df['houseCity'].str.contains(cityName)]
    houseDistrictPrice = round(df.groupby(['houseDistrict'])['housePrice'].mean(), 1)

    districtList = np.array(houseDistrictPrice.index)
    districtList = districtList.tolist()
    priceList = np.array(houseDistrictPrice.values)
    priceList = priceList.tolist()
    print(districtList, '\n', priceList)
    return districtList, priceList
3. Text analysis
Analyze the popularity of residential communities in each city:
import os

import jieba
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud


# Word cloud generation
def wordCloud(df):
    # Generate one word cloud per city
    for i in df.houseCity.unique():
        if os.path.exists(r'D:\Python\workspace\HousePriceAnalysis\static\images\\' + i + '.jpg'):
            pass
        else:
            # Use a per-city view so the original frame is not overwritten
            # (the original code reassigned df, shrinking it on every iteration)
            cityDf = df[df['houseCity'].str.contains(i)]
            # Tokenize the community names
            strAll = ''
            for j in cityDf['houseCommunity']:
                strAll += j
            cut = jieba.cut(strAll)
            strList = " ".join(cut)
            print(strList)
            # Build the mask
            img = Image.open(r'D:\Python\workspace\HousePriceAnalysis\static\images\遮罩.jpg')
            img_array = np.array(img)  # convert the image to an array
            wc = WordCloud(
                background_color='white',
                mask=img_array,
                font_path="simhei.ttf",
                height=100,
                width=300
            )
            wc.generate_from_text(strList)
            # fig = plt.figure(1)
            plt.imshow(wc)
            plt.axis('off')
            plt.savefig(r'D:\Python\workspace\HousePriceAnalysis\static\images\%s.jpg' % i)
4. Data analysis and visualization
The city average-price chart and the price-per-square-meter chart make the differences among the four Fujian cities easy to see at a glance.
var chartDom = document.getElementById('main1');
var myChart = echarts.init(chartDom);
var option;
option = {
    title: {
        text: '每平米价格图'
    },
    tooltip: {
        trigger: 'axis',
        axisPointer: {
            type: ''
        }
    },
    legend: {},
    grid: {
        left: '3%',
        right: '4%',
        bottom: '3%',
        containLabel: true
    },
    xAxis: {
        type: 'value',
        boundaryGap: [0, 0.01]
    },
    yAxis: {
        type: 'category',
        data: {{cityList|safe}}
    },
    series: [
        {
            name: '每平米价格/元',
            type: 'bar',
            data: {{unitPriceList|safe}}
        }
    ]
};
option && myChart.setOption(option);

var chartDom = document.getElementById('main');
var myChart = echarts.init(chartDom);
var option;
option = {
    title: {
        text: '城市房价图'
    },
    tooltip: {
        trigger: 'axis',
        axisPointer: {
            type: ''
        }
    },
    legend: {},
    grid: {
        left: '3%',
        right: '4%',
        bottom: '3%',
        containLabel: true
    },
    xAxis: {
        type: 'value',
        boundaryGap: [0, 0.01]
    },
    yAxis: {
        type: 'category',
        data: {{cityList|safe}}
    },
    series: [
        {
            name: '平均房价/万元',
            type: 'bar',
            data: {{priceList|safe}}
        }
    ]
};
option && myChart.setOption(option);
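The {{cityList|safe}} placeholders show that this page is a Jinja2 template rendered by a Python web framework. The original back end is not shown; below is a minimal sketch of how a route might wire the analysis functions to the template, assuming Flask, a template file named index.html, and a hypothetical load_df_from_db helper like the read_sql sketch in section II (all three are assumptions):

from flask import Flask, render_template

app = Flask(__name__)


@app.route('/')
def index():
    # load_df_from_db is a hypothetical loader (see the read_sql sketch above);
    # data_clean and get_house_price are the functions shown earlier.
    df = load_df_from_db()
    data_clean(df)
    (cityList, priceList, unitPriceList,
     cityCount, rowCount, maxPrice, firstCity) = get_house_price(df)
    # The template's {{cityList|safe}}, {{priceList|safe}}, and
    # {{unitPriceList|safe}} placeholders are filled from these arguments.
    return render_template('index.html', cityList=cityList,
                           priceList=priceList, unitPriceList=unitPriceList)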