GitHub project: crawl4ai

最编程 2024-10-02 16:51:47


- HTML output
- Markdown output
- Structured data output
- Comparison with BeautifulSoup

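crawl4ai is a GitHub project that, if I remember correctly, gained more than 3,000 stars yesterday and another 2,000 today. It is a crawling and parsing tool, so let's write a simple demo to get a feel for it.

Here we use crawl4ai to scrape GitHub's daily trending page and send the result to our own mailbox every day.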

HTML output

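from crawl4ai import AsyncWebCrawler


async def github_trend_html():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://github.com/trending",
        )
        assert result.success, "Failed to fetch GitHub data"
        # cleaned_html is the page HTML after crawl4ai's cleanup pass
        return result.cleaned_html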

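The output is still HTML, but the original page has been processed: irrelevant elements and dynamic elements are removed, and the HTML structure is simplified.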
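To make the idea of "cleaned" HTML concrete, here is a minimal standard-library sketch that drops script and style content and strips most attributes. It only illustrates the general idea and is not crawl4ai's actual cleaning logic; the names `SimpleCleaner`, `clean_html`, `DROPPED_TAGS`, and `KEPT_ATTRS` are made up for this example, and the choice of which tags and attributes to keep is an assumption.

```python
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: these tags are dropped together with their content.
DROPPED_TAGS = {"script", "style", "noscript"}
# Assumption for this sketch: only these attributes survive on the remaining tags.
KEPT_ATTRS = {"href", "src", "alt"}


class SimpleCleaner(HTMLParser):
    """Rebuilds the markup while skipping dropped tags and unwanted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            kept = "".join(
                f' {name}="{escape(value)}"'
                for name, value in attrs
                if name in KEPT_ATTRS and value is not None
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text only; whitespace-only runs between tags are dropped.
        if not self.skip_depth and data.strip():
            self.parts.append(escape(data, quote=False))


def clean_html(raw_html):
    """Return a simplified version of raw_html."""
    cleaner = SimpleCleaner()
    cleaner.feed(raw_html)
    cleaner.close()  # flush any text still buffered by the parser
    return "".join(cleaner.parts)
```

For example, `clean_html('<div class="a"><script>alert(1)</script><a href="/r">repo</a></div>')` keeps the link and its text but drops the script and the `class` attribute.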


Markdown output

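from crawl4ai import AsyncWebCrawler


async def github_trend_md():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://github.com/trending",
        )
        assert result.success, "Failed to fetch GitHub data"
        # markdown is the same page converted to Markdown text
        return result.markdown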

Opening the output in a Markdown viewer shows the result:

[Screenshot: the Markdown output rendered in a Markdown viewer]

Structured data output

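import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def github_trend_json():
    # Each element matching baseSelector yields one record;
    # the field selectors are evaluated relative to that element.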
    schema = {
        "name": "Github trending",
        "baseSelector": ".Box-row",
        "fields": [
            {
                "name": "repository",
                "selector": ".lh-condensed a[href]",
                "type": "text",
            },
            {
                "name": "description",
                "selector": "p",
                "type": "text",
            },
            {
                "name": "lang",
                "type": "text",
                "selector": "span[itemprop='programmingLanguage']",
            },
            {
                "name": "stars",
                "type": "text",
                "selector": "a[href*='/stargazers']"
            },
            {
                "name": "today_star",
                "type": "text",
                "selector": "span.float-sm-right",
            },
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://github.com/trending",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )
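        assert result.success, "Failed to fetch GitHub data"
        # extracted_content is a JSON string holding one object per matched row
        github_trending_json = json.loads(result.extracted_content)
        for ele in github_trending_json:
            # The link text contains whitespace around the slash; drop it to build the URL
            ele['repository'] = 'https://github.com/' + ''.join(ele['repository'].split())
        return github_trending_json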

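Unlike the first two approaches, structured output requires a custom schema that defines the structure of the parsed data. The console prints standard JSON data that follows the schema we defined. The data is then placed into an HTML template and sent by email every day. Here is how the email looks:

[Screenshot: the trending list as displayed in the email]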
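The post does not show the template or the mail code, so the following is only a minimal sketch of that step using the Python standard library. It assumes the records carry the field names from the schema (`repository`, `description`, `lang`, `stars`, `today_star`). The function names, the table layout, the subject line, and the use of SMTP over SSL are all assumptions for illustration.

```python
import smtplib
from email.message import EmailMessage
from html import escape

# Field names taken from the extraction schema; "repository" is handled separately.
FIELDS = ("description", "lang", "stars", "today_star")
HEADER = (
    "<tr><th>Repository</th><th>Description</th><th>Language</th>"
    "<th>Stars</th><th>Stars today</th></tr>"
)


def render_trending_html(repos):
    """Render the extracted records as a simple HTML table (hypothetical template)."""
    rows = []
    for repo in repos:
        url = escape(str(repo.get("repository") or ""))
        # Missing or empty fields become empty cells instead of raising KeyError.
        cells = "".join(
            f"<td>{escape(str(repo.get(field) or '').strip())}</td>"
            for field in FIELDS
        )
        rows.append(f'<tr><td><a href="{url}">{url}</a></td>{cells}</tr>')
    return f"<table>{HEADER}{''.join(rows)}</table>"


def build_email(repos, sender, recipient):
    """Build a message with a plain-text fallback and an HTML body."""
    msg = EmailMessage()
    msg["Subject"] = "GitHub trending"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("This message requires an HTML-capable mail client.")
    msg.add_alternative(render_trending_html(repos), subtype="html")
    return msg


def send_email(msg, host, port, user, password):
    """Send the message over SMTP with SSL; host and credentials are placeholders."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)
```

A daily run would call `github_trend_json()`, pass the result to `build_email`, and hand the message to `send_email` from a scheduler such as cron.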

Comparison with BeautifulSoup

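I remember the first time I used soup: having only ever parsed XML with Java SAX, I found it incredibly convenient. Today I ran a quick test of crawl4ai. Compared with soup:

crawl4ai makes data collection and analysis more convenient

soup has to be paired with requests to fetch the page, while BeautifulSoup itself only handles the HTML parsing

HTML parsing is somewhat similar in both, since both rely on CSS selectors, but crawl4ai's schema definition makes parsing more convenient

For data extraction, besides Markdown and simplified HTML, crawl4ai also offers structured data extraction through OpenAI integration (not tried yet)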
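For a side-by-side feel, here is a rough sketch of the same extraction written with requests and BeautifulSoup. It reuses the CSS selectors from the schema in the structured-data example; the function names are made up for illustration, and the selectors depend on GitHub's current page markup, which may change.

```python
from bs4 import BeautifulSoup

TRENDING_URL = "https://github.com/trending"


def fetch_trending_html(url=TRENDING_URL):
    """Download the page; fetching is a separate step from parsing."""
    # Imported here so parse_trending() can be used on saved HTML without requests.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def text_of(row, selector):
    """Return the stripped text of the first match, or "" if nothing matches."""
    element = row.select_one(selector)
    return element.get_text(strip=True) if element else ""


def parse_trending(html):
    """Extract one dict per trending repository, mirroring the schema fields."""
    soup = BeautifulSoup(html, "html.parser")
    repos = []
    for row in soup.select(".Box-row"):
        # Remove all whitespace from the link text to get "owner/repo".
        name = "".join(text_of(row, ".lh-condensed a[href]").split())
        repos.append({
            "repository": "https://github.com/" + name,
            "description": text_of(row, "p"),
            "lang": text_of(row, "span[itemprop='programmingLanguage']"),
            "stars": text_of(row, "a[href*='/stargazers']"),
            "today_star": text_of(row, "span.float-sm-right"),
        })
    return repos
```

Calling `parse_trending(fetch_trending_html())` returns a list of dicts with the same field names as the schema. The selectors are the same, but the loop, the missing-element handling, and the fetch step all have to be written by hand, which is the convenience gap described above.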
