Hands-On: Scraping Baidu Baike from Scratch
最编程
2024-01-15 16:20:33
1. Page Analysis
Taking Luo Tianyi's (洛天依) Baidu Baike entry as an example, the goal is to extract the content inside the red boxes in the screenshot (text only; tables, images, videos, and links are ignored for now).
Result preview:
2. Crawling Approach
Because of this page's structure, the relevant elements are siblings rather than parent/child nodes. The analysis therefore relies on relationships between attributes such as class, data-index, and label-module. (Note: the data-pid in red box 2 is generated dynamically, so it cannot be read from the static page source.)
Approach 1: For the table-of-contents part, the first idea was to loop over data-pid values; but since the project uses the Scrapy framework, this approach was dropped.
Approach 2 (the one actually used):
Red box 1
1. From the markup, each dt pairs with exactly one dd. Step 1: count the dt (or dd) tags:
xpath("count(//dl[@class='basicInfo-block basicInfo-left']/dt[@class='basicInfo-item name'])")
Step 2: iterate over the indices and extract each value.
Red box 2
Step 1: get the number of data-index values (or read the value of the last data-index).
Step 2: using data-index as delimiters, collect the sibling divs with label-module="para-title" or label-module="para" that lie between two consecutive data-index anchors.
Step 3: check each div's label-module: a para-title can be taken directly; for a para, the text fragments must be concatenated.
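The two XPath techniques above — counting tags with count() and selecting the divs lying between two data-index anchors via sibling axes — can be tried on a minimal, self-contained snippet. The markup below is a simplified stand-in for the real page, not Baidu Baike's actual source:

```python
from lxml import etree

html = """
<div class="main-content J-content">
  <div class="para-title level-2 J-chapter" data-index="1"><h2>Profile</h2></div>
  <div label-module="para">First paragraph.</div>
  <div label-module="para">Second paragraph.</div>
  <div class="para-title level-2 J-chapter" data-index="2"><h2>Career</h2></div>
  <div label-module="para">Third paragraph.</div>
  <div id="J-main-content-end-dom"></div>
</div>
"""
tree = etree.HTML(html)

# count() returns a float, so cast to int before looping
n = int(tree.xpath('count(//div[@class="para-title level-2 J-chapter"])'))
print(n)  # 2

# divs strictly between the data-index=1 and data-index=2 anchors,
# selected via the preceding-sibling / following-sibling axes
between = tree.xpath(
    "//div[@label-module='para']"
    "[preceding-sibling::div[@data-index='1']"
    " and following-sibling::div[@data-index='2']]"
)
print([d.text for d in between])  # ['First paragraph.', 'Second paragraph.']
```

Note that "Third paragraph." is excluded: it has no following sibling with data-index=2, so the predicate filters it out — exactly the between-two-anchors behavior the crawl needs.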
3. Implementation
import requests
from lxml import etree

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

def request_url(url):
    response = requests.get(url=url, headers=headers)
    response_data = etree.HTML(response.text)
    return response_data

def get_basicInfo(response_data):
    basicinfo = {}
    # count the dd/dt tags; count() returns a float, so cast to int
    basicinfo_left_count = response_data.xpath("count(//dl[@class='basicInfo-block basicInfo-left']/dt[@class='basicInfo-item name'])")
    basicinfo_right_count = response_data.xpath("count(//dl[@class='basicInfo-block basicInfo-right']/dt[@class='basicInfo-item name'])")
    left_num = int(basicinfo_left_count)
    right_num = int(basicinfo_right_count)
    # left column of the infobox
    for i in range(1, left_num + 1):
        name = response_data.xpath(
            "//dl[@class='basicInfo-block basicInfo-left']/dt[@class='basicInfo-item name'][" + str(i) + "]/text()")[0]
        name = name.replace("\xa0", "")
        value = response_data.xpath(
            "//dl[@class='basicInfo-block basicInfo-left']/dd[@class='basicInfo-item value'][" + str(i) + "]//text()")
        value = " ".join(value).replace("\n", "")
        print(name, value)
        basicinfo.update({name: value})
    # right column of the infobox
    for i in range(1, right_num + 1):
        name = response_data.xpath(
            "//dl[@class='basicInfo-block basicInfo-right']/dt[@class='basicInfo-item name'][" + str(i) + "]/text()")[0]
        name = name.replace("\xa0", "")
        value = response_data.xpath(
            "//dl[@class='basicInfo-block basicInfo-right']/dd[@class='basicInfo-item value'][" + str(i) + "]//text()")
        value = " ".join(value).replace("\n", "")
        print(name, value)
        basicinfo.update({name: value})
    return basicinfo

def get_catalogue(response_data):
    catalogue = {}
    # data-index of the last div[@class="para-title level-2 J-chapter"]
    data_index_num = response_data.xpath('//div[@class="main-content J-content"]/div[@class="para-title level-2 J-chapter"][last()]/@data-index')[0]
    for num in range(1, int(data_index_num) + 1):
        start_num = num
        # section title (h2)
        para_title = response_data.xpath('//div[@class="main-content J-content"]/div[@data-index=' + str(start_num) + ']/h2/text()')[0]
        print(para_title)
        if num == int(data_index_num):
            # last section: sibling divs (label-module 'para' or 'para-title')
            # between the final data-index anchor and the end-of-content marker
            para_items = response_data.xpath("//div[contains(@label-module,'para') or contains(@label-module,'para-title')][preceding-sibling::div[@data-index = " + str(start_num) + "] and following-sibling::div[@id = 'J-main-content-end-dom']]")
        else:
            end_num = num + 1
            # sibling divs (label-module 'para' or 'para-title') between two data-index anchors
            para_items = response_data.xpath("//div[contains(@label-module,'para') or contains(@label-module,'para-title')][preceding-sibling::div[@data-index = " + str(start_num) + "] and following-sibling::div[@data-index = " + str(end_num) + "]]")
        # accumulated text for this h2 section
        para_list = ""
        for para_item in para_items:
            if para_item.xpath("./@label-module")[0] == 'para-title':
                listitem = para_item.xpath("./h3/text()")[0] + " : "
            else:
                listitem = para_item.xpath(".//text()")
                listitem = ''.join(listitem).replace("\n", "").replace("\xa0", "") + '\n'
            para_list = para_list + listitem
        print(para_list)
        catalogue.update({para_title: para_list})
    return catalogue

if __name__ == "__main__":
    requ_url = 'https://baike.baidu.com/item/%E6%B4%9B%E5%A4%A9%E4%BE%9D/6753346?fromModule=lemma_search-box'
    response_data = request_url(requ_url)
    basicinfo_dict = get_basicInfo(response_data)
    catalogue_dict = get_catalogue(response_data)
    print(basicinfo_dict, catalogue_dict)
4. Improvements
1. Storing only selected fields
If only certain fields should be stored, define a dictionary dict1 of the wanted fields in advance; when updating the output dictionary dict2, check whether the field name appears in dict1 and update only if it does:
basicinfo_dict = {
    '中文名': 'basic_name',
    '外文名': 'en_name',
    '别名': 'alias'
}
if name in basicinfo_dict:
    basicinfo.update({basicinfo_dict[name]: value})
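The whitelist idea above can be sketched as a small standalone function. The English field names are illustrative mappings, not part of the original code:

```python
# Map the infobox's Chinese field labels to the field names we want to store;
# any label not listed here is simply dropped.
FIELD_MAP = {
    '中文名': 'basic_name',
    '外文名': 'en_name',
    '别名': 'alias',
}

def filter_fields(raw_info: dict) -> dict:
    """Keep only whitelisted fields, renaming keys via FIELD_MAP."""
    return {FIELD_MAP[k]: v for k, v in raw_info.items() if k in FIELD_MAP}

scraped = {'中文名': '洛天依', '外文名': 'Luo Tianyi', '出生日期': '2012年7月12日'}
print(filter_fields(scraped))  # {'basic_name': '洛天依', 'en_name': 'Luo Tianyi'}
```

The unlisted 出生日期 field is silently discarded, which is the desired "no entry, no update" behavior.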
2. Batch URL crawling
For batch crawling, the main task is obtaining the URLs. Searches for the same type of keyword follow the same pattern, so one option is to build the URLs directly: the lemmaId is unique, so even when two entries share a name, the right page can still be reached through its lemmaId.
baike.baidu.com/item/keywor…
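Judging from the example URL used in the code above (/item/&lt;percent-encoded keyword&gt;/&lt;lemmaId&gt;), batch URLs can be assembled as below. The keyword/lemmaId pairs are placeholders you would supply yourself:

```python
from urllib.parse import quote

def build_baike_url(keyword: str, lemma_id: int) -> str:
    # Percent-encode the keyword, matching the pattern of the example URL
    return f"https://baike.baidu.com/item/{quote(keyword)}/{lemma_id}"

# (keyword, lemmaId) pairs to crawl -- placeholders for your own list
targets = [("洛天依", 6753346)]
urls = [build_baike_url(k, i) for k, i in targets]
print(urls[0])
# https://baike.baidu.com/item/%E6%B4%9B%E5%A4%A9%E4%BE%9D/6753346
```

Each generated URL can then be fed to request_url() (or to a Scrapy spider's start_urls) in a loop.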