
KeyBERT


KeyBERT is a tool that uses BERT embeddings to extract keywords. The idea is to encode the document into a vector with BERT, tokenize the document into candidate words with scikit-learn's CountVectorizer, then compare each candidate word's embedding with the document embedding and select the words with the highest similarity as keywords.

Basic usage:

from keybert import KeyBERT

# English keyword-extraction example; if no embedding model is specified, sentence-transformers' all-MiniLM-L6-v2 is used by default
doc = """
When we want to understand key information from specific documents, we typically turn towards keyword extraction. Keyword extraction is the automated process of extracting the words and phrases that are most relevant to an input text.
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

# Chinese keyword-extraction example
# For Chinese, a custom CountVectorizer with a tokenizer must be provided; here jieba is used for word segmentation
from sklearn.feature_extraction.text import CountVectorizer
import jieba
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words
vectorizer = CountVectorizer(tokenizer=tokenize_zh)
kw_model = KeyBERT(model='paraphrase-multilingual-MiniLM-L12-v2')
doc = """
    强化学习是机器通过与环境交互来实现目标的一种计算方法。机器和环境的一轮交互是指,机器在环境的一个状态下做一个动作决策,把这个动作作用到环境当中,这个环境发生相应的改变并且将相应的奖励反馈和下一轮状态传回机器。这种交互是迭代进行的,机器的目标是最大化在多轮交互过程中获得的累积奖励的期望。"""
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
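
To make the pipeline described above concrete, here is a minimal sketch of the basic scoring step, assuming scikit-learn and sentence-transformers are installed. This is an illustrative re-implementation rather than KeyBERT's actual code, and the helper name extract_by_similarity is made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def extract_by_similarity(doc: str, top_n: int = 5):
    # Embed with the same default model KeyBERT uses
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Tokenize the document into candidate words
    vectorizer = CountVectorizer().fit([doc])
    candidates = list(vectorizer.get_feature_names_out())

    # 2. Embed the document and every candidate word
    doc_embedding = model.encode([doc])
    word_embeddings = model.encode(candidates)

    # 3. Rank candidates by cosine similarity to the document and keep the top_n
    similarities = cosine_similarity(doc_embedding, word_embeddings)[0]
    top_idx = similarities.argsort()[-top_n:][::-1]
    return [(candidates[i], round(float(similarities[i]), 4)) for i in top_idx]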

If we simply pick the words most similar to the document as keywords, the selected keywords are likely to be very similar to one another. The author of KeyBERT therefore implemented the following two algorithms to make the selected keywords more diverse (a usage sketch showing how to enable both options follows after the list):

  • Max Sum Distance: let top_n be the number of keywords to return, and choose a number nr_candidates larger than top_n. First pick the nr_candidates words most similar to the document as candidates; then, among all combinations of top_n words drawn from those nr_candidates candidates, return the combination whose sum of pairwise similarities is smallest. A larger nr_candidates makes the selected keywords more diverse, but may also pick words that do not represent the document well; the author recommends keeping nr_candidates below 20% of the number of words in the document.

    # Code from the KeyBERT source: https://github.com/MaartenGr/KeyBERT/blob/master/keybert/_maxsum.py
    import numpy as np
    import itertools
    from sklearn.metrics.pairwise import cosine_similarity
    from typing import List, Tuple
    
    
    def max_sum_distance(
        doc_embedding: np.ndarray,
        word_embeddings: np.ndarray,
        words: List[str],
        top_n: int,
        nr_candidates: int,
    ) -> List[Tuple[str, float]]:
        """Calculate Max Sum Distance for extraction of keywords
    
        We take the nr_candidates most similar words/phrases to the document.
        Then, from all combinations of top_n words among those candidates, we
        extract the combination whose words are the least similar to each
        other by cosine similarity.
    
        This is O(n^2) and therefore not advised if you use a large `top_n`
    
        Arguments:
            doc_embedding: The document embeddings
            word_embeddings: The embeddings of the selected candidate keywords/phrases
            words: The selected candidate keywords/keyphrases
            top_n: The number of keywords/keyphrases to return
            nr_candidates: The number of candidates to consider
    
        Returns:
             List[Tuple[str, float]]: The selected keywords/keyphrases with their distances
        """
        if nr_candidates < top_n:
            raise Exception(
                "Make sure that the number of candidates exceeds the number "
                "of keywords to return."
            )
        elif top_n > len(words):
            return []
    
        # Calculate distances and extract keywords
        distances = cosine_similarity(doc_embedding, word_embeddings)
        distances_words = cosine_similarity(word_embeddings, word_embeddings)
    
        # Get the nr_candidates words most similar to the document as candidates
        words_idx = list(distances.argsort()[0][-nr_candidates:])
        words_vals = [words[index] for index in words_idx]
        candidates = distances_words[np.ix_(words_idx, words_idx)]
    
        # Calculate the combination of words that are the least similar to each other
        min_sim = 100_000
        candidate = None
        for combination in itertools.combinations(range(len(words_idx)), top_n):
            sim = sum(
                [candidates[i][j] for i in combination for j in combination if i != j]
            )
            if sim < min_sim:
                candidate = combination
                min_sim = sim
    
        return [
            (words_vals[idx], round(float(distances[0][words_idx[idx]]), 4))
            for idx in candidate
        ]
    
    
  • Maximal Marginal Relevance (MMR): besides making the keywords as similar to the document as possible, MMR also reduces the similarity (redundancy) among the keywords themselves. The diversity parameter controls how diverse the candidates are: the smaller diversity is, the more similar the selected keywords may be to each other; the larger it is, the less redundant they are. The algorithm is: 1. Compute the similarity matrix between candidate words and the document, and the similarity matrix among candidate words. 2. Pick the candidate most similar to the document as the first keyword. 3. Pick the remaining keywords one by one: let can_sim be the similarities between the remaining candidates and the document, and tar_sim be each remaining candidate's maximum similarity to the already selected keywords; compute mmr = (1 - diversity) * can_sim - diversity * tar_sim and pick the candidate with the largest mmr as the next keyword.

# Code from the KeyBERT source: https://github.com/MaartenGr/KeyBERT/blob/master/keybert/_mmr.py
import numpy as np
from operator import itemgetter
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Tuple

def mmr(
    doc_embedding: np.ndarray,
    word_embeddings: np.ndarray,
    words: List[str],
    top_n: int = 5,
    diversity: float = 0.8,
) -> List[Tuple[str, float]]:
    """Calculate Maximal Marginal Relevance (MMR)
    between candidate keywords and the document.


    MMR considers the similarity of keywords/keyphrases with the
    document, along with the similarity of already selected
    keywords and keyphrases. This results in a selection of keywords
    that are relevant to the document while remaining diverse.

    Arguments:
        doc_embedding: The document embeddings
        word_embeddings: The embeddings of the selected candidate keywords/phrases
        words: The selected candidate keywords/keyphrases
        top_n: The number of keywords/keyphrases to return
        diversity: How diverse the selected keywords/keyphrases are.
                   Values between 0 and 1 with 0 being not diverse at all
                   and 1 being most diverse.

    Returns:
         List[Tuple[str, float]]: The selected keywords/keyphrases with their distances

    """

    # Extract similarity within words, and between words and the document
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)

    # Initialize candidates and choose the best keyword/keyphrase first
    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(min(top_n - 1, len(words) - 1)):
        # Extract similarities within candidates and
        # between candidates and selected keywords/phrases
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(
            word_similarity[candidates_idx][:, keywords_idx], axis=1
        )

        # Calculate MMR
        mmr = (
            1 - diversity
        ) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr)]

        # Update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    # Extract and sort keywords in descending similarity
    keywords = [
        (words[idx], round(float(word_doc_similarity.reshape(1, -1)[0][idx]), 4))
        for idx in keywords_idx
    ]
    keywords = sorted(keywords, key=itemgetter(1), reverse=True)
    return keywords
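
Both strategies are exposed as parameters of extract_keywords, so you normally do not call the functions above directly. A short usage sketch, reusing kw_model, doc, and vectorizer from the Chinese example earlier (the parameter values are illustrative):

# Max Sum Distance: pick top_n keywords out of nr_candidates candidates
keywords = kw_model.extract_keywords(
    doc, top_n=5, use_maxsum=True, nr_candidates=20, vectorizer=vectorizer
)

# MMR: trade off relevance against redundancy via the diversity parameter
keywords = kw_model.extract_keywords(
    doc, top_n=5, use_mmr=True, diversity=0.7, vectorizer=vectorizer
)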

Once you understand how KeyBERT works, it becomes clear that the quality of the extracted keywords depends heavily on the quality of the embedding model, so you should choose an embedding model that suits your own data and use case.
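
For example, to swap in a different embedding model you can pass either a sentence-transformers model name or an already loaded model object when constructing KeyBERT (the model names below are only examples):

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Pass a model name string ...
kw_model = KeyBERT(model='paraphrase-multilingual-mpnet-base-v2')

# ... or a preloaded sentence-transformers model
st_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
kw_model = KeyBERT(model=st_model)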
