
KeyBERT


KeyBERT is a tool that uses BERT embeddings to extract keywords. The idea is to encode the document into a vector with BERT, tokenize the document into candidate words with scikit-learn's CountVectorizer, then compare each candidate word's embedding with the document embedding and select the words with the highest similarity as keywords.

Basic usage:

from keybert import KeyBERT

# English keyword-extraction example; if no embedding model is specified, sentence-transformers' all-MiniLM-L6-v2 is used by default
doc = """
When we want to understand key information from specific documents, we typically turn towards keyword extraction. Keyword extraction is the automated process of extracting the words and phrases that are most relevant to an input text.
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

# Chinese keyword-extraction example
# For Chinese, a custom CountVectorizer with a tokenizer must be provided; here jieba is used for word segmentation
from sklearn.feature_extraction.text import CountVectorizer
import jieba
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words
vectorizer = CountVectorizer(tokenizer=tokenize_zh)
kw_model = KeyBERT(model='paraphrase-multilingual-MiniLM-L12-v2')
doc = """
    强化学习是机器通过与环境交互来实现目标的一种计算方法。机器和环境的一轮交互是指,机器在环境的一个状态下做一个动作决策,把这个动作作用到环境当中,这个环境发生相应的改变并且将相应的奖励反馈和下一轮状态传回机器。这种交互是迭代进行的,机器的目标是最大化在多轮交互过程中获得的累积奖励的期望。"""
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
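
To make the pipeline described above concrete, here is a minimal sketch of the basic scoring step, assuming scikit-learn and sentence-transformers are installed. This is an illustrative re-implementation rather than KeyBERT's actual code, and the helper name extract_by_similarity is made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def extract_by_similarity(doc: str, top_n: int = 5):
    # Embed with the same default model KeyBERT uses
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Tokenize the document into candidate words
    vectorizer = CountVectorizer().fit([doc])
    candidates = list(vectorizer.get_feature_names_out())

    # 2. Embed the document and every candidate word
    doc_embedding = model.encode([doc])
    word_embeddings = model.encode(candidates)

    # 3. Rank candidates by cosine similarity to the document and keep the top_n
    similarities = cosine_similarity(doc_embedding, word_embeddings)[0]
    top_idx = similarities.argsort()[-top_n:][::-1]
    return [(candidates[i], round(float(similarities[i]), 4)) for i in top_idx]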

If we simply pick the words most similar to the document as keywords, the selected keywords are likely to be very similar to one another. The author of KeyBERT therefore implemented the following two algorithms to make the selected keywords more diverse (a usage sketch showing how to enable both options follows after the list):

  • Max Sum Distance: let top_n be the number of keywords to return, and choose a number nr_candidates larger than top_n. First pick the nr_candidates words most similar to the document as candidates; then, among all combinations of top_n words drawn from those nr_candidates candidates, return the combination whose sum of pairwise similarities is smallest. A larger nr_candidates makes the selected keywords more diverse, but may also pick words that do not represent the document well; the author recommends keeping nr_candidates below 20% of the number of words in the document.

    # Code from the KeyBERT source: https://github.com/MaartenGr/KeyBERT/blob/master/keybert/_maxsum.py
    import numpy as np
    import itertools
    from sklearn.metrics.pairwise import cosine_similarity
    from typing import List, Tuple
    
    
    def max_sum_distance(
        doc_embedding: np.ndarray,
        word_embeddings: np.ndarray,
        words: List[str],
        top_n: int,
        nr_candidates: int,
    ) -> List[Tuple[str, float]]:
        """Calculate Max Sum Distance for extraction of keywords
    
        We take the nr_candidates most similar words/phrases to the document.
        Then, from all combinations of top_n words among those candidates, we
        extract the combination whose words are the least similar to each
        other by cosine similarity.
    
        This is O(n^2) and therefore not advised if you use a large `top_n`
    
        Arguments:
            doc_embedding: The document embeddings
            word_embeddings: The embeddings of the selected candidate keywords/phrases
            words: The selected candidate keywords/keyphrases
            top_n: The number of keywords/keyphrases to return
            nr_candidates: The number of candidates to consider
    
        Returns:
             List[Tuple[str, float]]: The selected keywords/keyphrases with their distances
        """
        if nr_candidates < top_n:
            raise Exception(
                "Make sure that the number of candidates exceeds the number "
                "of keywords to return."
            )
        elif top_n > len(words):
            return []
    
        # Calculate distances and extract keywords
        distances = cosine_similarity(doc_embedding, word_embeddings)
        distances_words = cosine_similarity(word_embeddings, word_embeddings)
    
        # Get the nr_candidates words most similar to the document as candidates
        words_idx = list(distances.argsort()[0][-nr_candidates:])
        words_vals = [words[index] for index in words_idx]
        candidates = distances_words[np.ix_(words_idx, words_idx)]
    
        # Calculate the combination of words that are the least similar to each other
        min_sim = 100_000
        candidate = None
        for combination in itertools.combinations(range(len(words_idx)), top_n):
            sim = sum(
                [candidates[i][j] for i in combination for j in combination if i != j]
            )
            if sim < min_sim:
                candidate = combination
                min_sim = sim
    
        return [
            (words_vals[idx], round(float(distances[0][words_idx[idx]]), 4))
            for idx in candidate
        ]
    
    
  • Maximal Marginal Relevance (MMR): besides making the keywords as similar to the document as possible, MMR also reduces the similarity (redundancy) among the keywords themselves. The diversity parameter controls how diverse the candidates are: the smaller diversity is, the more similar the selected keywords may be to each other; the larger it is, the less redundant they are. The algorithm is: 1. Compute the similarity matrix between candidate words and the document, and the similarity matrix among candidate words. 2. Pick the candidate most similar to the document as the first keyword. 3. Pick the remaining keywords one by one: let can_sim be the similarities between the remaining candidates and the document, and tar_sim be each remaining candidate's maximum similarity to the already selected keywords; compute mmr = (1 - diversity) * can_sim - diversity * tar_sim and pick the candidate with the largest mmr as the next keyword.

# Code from the KeyBERT source: https://github.com/MaartenGr/KeyBERT/blob/master/keybert/_mmr.py
import numpy as np
from operator import itemgetter
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Tuple

def mmr(
    doc_embedding: np.ndarray,
    word_embeddings: np.ndarray,
    words: List[str],
    top_n: int = 5,
    diversity: float = 0.8,
) -> List[Tuple[str, float]]:
    """Calculate Maximal Marginal Relevance (MMR)
    between candidate keywords and the document.


    MMR considers the similarity of keywords/keyphrases with the
    document, along with the similarity of already selected
    keywords and keyphrases. This results in a selection of keywords
    that are relevant to the document while remaining diverse.

    Arguments:
        doc_embedding: The document embeddings
        word_embeddings: The embeddings of the selected candidate keywords/phrases
        words: The selected candidate keywords/keyphrases
        top_n: The number of keywords/keyphrases to return
        diversity: How diverse the selected keywords/keyphrases are.
                   Values between 0 and 1 with 0 being not diverse at all
                   and 1 being most diverse.

    Returns:
         List[Tuple[str, float]]: The selected keywords/keyphrases with their distances

    """

    # Extract similarity within words, and between words and the document
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)

    # Initialize candidates and choose the best keyword/keyphrase first
    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(min(top_n - 1, len(words) - 1)):
        # Extract similarities within candidates and
        # between candidates and selected keywords/phrases
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(
            word_similarity[candidates_idx][:, keywords_idx], axis=1
        )

        # Calculate MMR
        mmr = (
            1 - diversity
        ) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr)]

        # Update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    # Extract and sort keywords in descending similarity
    keywords = [
        (words[idx], round(float(word_doc_similarity.reshape(1, -1)[0][idx]), 4))
        for idx in keywords_idx
    ]
    keywords = sorted(keywords, key=itemgetter(1), reverse=True)
    return keywords
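
Both strategies are exposed as parameters of extract_keywords, so you normally do not call the functions above directly. A short usage sketch, reusing kw_model, doc, and vectorizer from the Chinese example earlier (the parameter values are illustrative):

# Max Sum Distance: pick top_n keywords out of nr_candidates candidates
keywords = kw_model.extract_keywords(
    doc, top_n=5, use_maxsum=True, nr_candidates=20, vectorizer=vectorizer
)

# MMR: trade off relevance against redundancy via the diversity parameter
keywords = kw_model.extract_keywords(
    doc, top_n=5, use_mmr=True, diversity=0.7, vectorizer=vectorizer
)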

Once you understand how KeyBERT works, it becomes clear that the quality of the extracted keywords depends heavily on the quality of the embedding model, so you should choose an embedding model that suits your own data and use case.
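
For example, to swap in a different embedding model you can pass either a sentence-transformers model name or an already loaded model object when constructing KeyBERT (the model names below are only examples):

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Pass a model name string ...
kw_model = KeyBERT(model='paraphrase-multilingual-mpnet-base-v2')

# ... or a preloaded sentence-transformers model
st_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
kw_model = KeyBERT(model=st_model)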
