干货|Elasticsearch BM25 模型评分细节的逐步分解
Elasticsearch 5 之前的版本,评分机制或者打分模型基于 TF-IDF 实现。
从 Elasticsearch 5 开始,Elasticsearch 的默认相似度算法是 Okapi BM25,Okapi BM25模型于 1994 年提出,BM25 的 BM 是缩写自 Best Match, 25 是经过 25 次迭代调整之后得出的算法,该模型也是基于 TF/IDF 进化来的,Okapi 信息检索系统是第一个实现此功能的系统,之后被广泛应用在不同系统里。
相似性(评分/排名模型)定义了匹配文档的评分方式, 对一组文档执行搜索并提供按相关性排序的结果。在这篇文章中,我们将一步步拆解 Okapi BM25 模型的内部工作原理。
在拆解评分算法之前,必须简单解释一下背后的理论——Elasticsearch 基于 Lucene。要了解 Elasticsearch,我们必须了解 Lucene。
1、Okapi BM25 基本概念
Okapi BM25 模型的计算公式如下:
类似的公式,我看到后的第一反应:这是科研人员才能搞懂的事情,我等只能围观。
但,为了进一步深入算分机制,我们一个个参数拆解一下,期望能“拨开云天、豁然开朗”!
上述公式中:
- D:代表文档。
- Q:代表查询。
- K1:*参数,默认值:1.2。
- b:*参数,默认值:0.75。
参见 Lucene 官方文档:
https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
2、词频 TF
词频英文释义:TF(Term Frequency) ,即:分词单元(Term)在文档中出现的频率。
由于每个文本的长度不同,一个单词在长文档中出现的次数可能比短文档中出现的次数要多得多。
一个词出现的次数越多,它的得分就越高。
可以简记为:
特定分词单元 Term 出现次数 (Number of times term t appears in a document)
TF = ----------------------------------------------
所在文档 Terms 总数 (Total number of terms in the document)
3、逆文档频率 IDF
逆文档频率英文释义: IDF(Inverse Document Frequency),衡量分词单元Term的重要性。
但是,众所周知,诸如“the”、“is”、“of、“that”、“的”、“吗”等之类的特定词可能会出现很多次但重要性不大。
因此,我们需要通过计算以下公式来降低常用分词单元的权重,同时扩大稀有分词单元的权重。
文档数(Total number of documents)
IDF = log ---------------------------------------
包含特定分词单元 Term 的文档数 (Number of documents with term t in it)
4、实战探索
4.1 索引准备
本文基于:7.12.0 版本的 Elasticsearch 进行拆解验证。
创建索引:got,并制定字段 quote 为 text 类型,同时指定:english 分词器。
DELETE got
PUT got
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"quote": {
"type": "text",
"analyzer": "english"
}
}
}
}
4.2 数据准备
bulk 批量导入数据,数据来自《权利的游戏》电视剧的台词。
POST _bulk
{ "index" : { "_index" : "got", "_id" : "1" } }
{ "quote" : "A mind needs books as a sword needs a whetstone, if it is to keep its edge." }
{ "index" : { "_index" : "got", "_id" : "2" } }
{ "quote" : "Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armor yourself in it, and it will never be used to hurt you." }
{ "index" : { "_index" : "got", "_id" : "3" } }
{ "quote" : "Let them see that their words can cut you, and you’ll never be free of the mockery. If they want to give you a name, take it, make it your own. Then they can’t hurt you with it anymore." }
{ "index" : { "_index" : "got", "_id" : "4" } }
{ "quote" : "When you play the game of thrones, you win or you die. There is no middle ground." }
{ "index" : { "_index" : "got", "_id" : "5" } }
{ "quote" : "The common people pray for rain, healthy children, and a summer that never ends. It is no matter to them if the high lords play their game of thrones, so long as they are left in peace." }
{ "index" : { "_index" : "got", "_id" : "6" } }
{ "quote" : "If you would take a man’s life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die." }
{ "index" : { "_index" : "got", "_id" : "7" } }
{ "quote" : "Sorcery is the sauce fools spoon over failure to hide the flavor of their own incompetence." }
{ "index" : { "_index" : "got", "_id" : "8" } }
{ "quote" : "Power resides where men believe it resides. No more and no less." }
{ "index" : { "_index" : "got", "_id" : "9" } }
{ "quote" : "There’s no shame in fear, my father told me, what matters is how we face it." }
{ "index" : { "_index" : "got", "_id" : "10" } }
{ "quote" : "Love is poison. A sweet poison, yes, but it will kill you all the same." }
{ "index" : { "_index" : "got", "_id" : "11" } }
{ "quote" : "What good is this, I ask you? He who hurries through life hurries to his grave." }
{ "index" : { "_index" : "got", "_id" : "12" } }
{ "quote" : "Old stories are like old friends, she used to say. You have to visit them from time to time." }
{ "index" : { "_index" : "got", "_id" : "13" } }
{ "quote" : "The greatest fools are ofttimes more clever than the men who laugh at them." }
{ "index" : { "_index" : "got", "_id" : "14" } }
{ "quote" : "Everyone wants something, Alayne. And when you know what a man wants you know who he is, and how to move him." }
{ "index" : { "_index" : "got", "_id" : "15" } }
{ "quote" : "Always keep your foes confused. If they are never certain who you are or what you want, they cannot know what you are like to do next. Sometimes the best way to baffle them is to make moves that have no purpose, or even seem to work against you." }
{ "index" : { "_index" : "got", "_id" : "16" } }
{ "quote" : "One voice may speak you false, but in many there is always truth to be found." }
{ "index" : { "_index" : "got", "_id" : "17" } }
{ "quote" : "History is a wheel, for the nature of man is fundamentally unchanging." }
{ "index" : { "_index" : "got", "_id" : "18" } }
{ "quote" : "Knowledge is a weapon, Jon. Arm yourself well before you ride forth to battle." }
{ "index" : { "_index" : "got", "_id" : "19" } }
{ "quote" : "I prefer my history dead. Dead history is writ in ink, the living sort in blood." }
{ "index" : { "_index" : "got", "_id" : "20" } }
{ "quote" : "In the game of thrones, even the humblest pieces can have wills of their own. Sometimes they refuse to make the moves you’ve planned for them. Mark that well, Alayne. It’s a lesson that Cersei Lannister still has yet to learn." }
{ "index" : { "_index" : "got", "_id" : "21" } }
{ "quote" : "Every man should lose a battle in his youth, so he does not lose a war when he is old." }
{ "index" : { "_index" : "got", "_id" : "22" } }
{ "quote" : "A reader lives a thousand lives before he dies. The man who never reads lives only one." }
{ "index" : { "_index" : "got", "_id" : "23" } }
{ "quote" : "The fisherman drowned, but his daughter got Stark to the Sisters before the boat went down. They say he left her with a bag of silver and a bastard in her belly. Jon Snow, she named him, after Arryn." }
{ "index" : { "_index" : "got", "_id" : "24" } }
{ "quote" : "You could make a poultice out of mud to cool a fever. You could plant seeds in mud and grow a crop to feed your children. Mud would nourish you, where fire would only consume you, but fools and children and young girls would choose fire every time." }
{ "index" : { "_index" : "got", "_id" : "25" } }
{ "quote" : "Men live their lives trapped in an eternal present, between the mists of memory and the sea of shadow that is all we know of the days to come." }
{ "index" : { "_index" : "got", "_id" : "26" } }
{ "quote" : "No. Hear me, Daenerys Targaryen. The glass candles are burning. Soon comes the pale mare, and after her the others. Kraken and dark flame, lion and griffin, the sun’s son and the mummer’s dragon. Trust none of them. Remember the Undying. Beware the perfumed seneschal." }
4.3 实施检索
GET got/_search
{
"query": {
"match": {
"quote": "live"
}
}
}
返回结果(仅列举评分、Quote 字段)如下:
Score |
Quote |
---|---|
3.3297362 |
A reader lives a thousand lives before he dies. The man who never reads lives only one. |
2.847715 |
Men live their lives trapped in an eternal present, between the mists of memory and the sea of shadow that is all we know of the days to come. |
2.313831 |
I prefer my history dead. Dead history is writ in ink, the living sort in blood. |
---|
这时候会面临我们的终极疑惑——这些评分咋来的?咋计算的呢?
别急,我们一步步拆解。
5、评分拆解
加上 "explain":true 一探究竟。
GET got/_search
{
"query": {
"match": {
"quote": "you"
}
},
"explain": true
}
拿第一个返回文档也就是评分为:3.3297362 的结果数据为例,自顶向下的方法有利于理解计算。
如下拆解结果所示,分数 3.3297362 是分词单元 live 的 boost * IDF * TF 三者的乘积,简记为:
总评分 = 2.2 * 2.043074 * 0.74080354 = 3.3297362。
explain 执行后的结果,核心部分如下所示:
{
"_shard" : "[got][0]",
"_node" : "m9VCQfPDRqqMuupU_Xz5Eg",
"_index" : "got",
"_type" : "_doc",
"_id" : "22",
"_score" : 3.3297362,
"_source" : {
"quote" : "A reader lives a thousand lives before he dies. The man who never reads lives only one."
},
"_explanation" : {
"value" : 3.3297362,
"description" : "weight(quote:live in 21) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 3.3297362,
"description" : "score(freq=3.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 2.043074,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
},
{
"value" : 0.7408035,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
}
]
}
]
}
}
5.1 词频 TF 拆解
执行 explain 后,词频 TF 拆解计算如下,
{
"value":0.7408035,
"description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details":[
{
"value":"3.0",
"description":"freq, occurrences of term within document",
"details":[
]
},
{
"value":1.2,
"description":"k1, term saturation parameter",
"details":[
]
},
{
"value":0.75,
"description":"b, length normalization parameter",
"details":[
]
},
{
"value":"14.0",
"description":"dl, length of field",
"details":[
]
},
{
"value":16.807692,
"description":"avgdl, average length of field",
"details":[
]
}
]
}
词频计算涉及参数如下:
- freq = 分词单元 live 在文档中出现的次数为 3 次,如下图所示:
- k1:1.2,缺省值。
- b:0.75 缺省值。
- dl:该文档的分词后分词单元的个数(number of tokens),为 14。
可以借助——analyze API 验证:
POST got/_analyze
{
"text": "A reader lives a thousand lives before he dies. The man who never reads lives only one",
"analyzer": "english"
}
分词后的 token 为:
- avgdl:等于所有文档的分词单元的总数 / 文档个数) ,计算结果为:16.807692。
如何计算的呢?这里有同学会有疑惑,解读如下:
avgdl 计算步骤 1:所有文档的分词单元的总数。
如下所示:共 437个。文档数为 26 个。
为了方面查看,我把 26 个文档的全部 document 内容集合到一个文档里面,求得的分词后的结果值为 437 。
avgdl 计算步骤 2:avgdl = 437 / 26 = 16.807692。
最终 TF 词频 求解结果为:0.740803524(该手算值精度和最终 Elasticsearch 返回结果精度值不完全一致,属于精度问题,不影响理解全局),其求解步骤如下:
TF = freq / (freq + k1 * (1 - b + b * dl / avgdl))
= 3 / (3 + 1.2 *( 1 - 0.75 + 0.75 * 14 / 16.807692))
= 3 / (3 + 1.2 *0.87471397)
= 3/(3+1.049656764)
= 3/4.049656764
= 0.740803524
5.2 逆文档频率 IDF 拆解
执行 explain 后,逆文档频率 IDF 拆解如下:
{
"value":2.043074,
"description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details":[
{
"value":3,
"description":"n, number of documents containing term",
"details":[
]
},
{
"value":26,
"description":"N, total number of documents with field",
"details":[
]
}
]
}
- N:待检索文档数,本示例为 26。
- n:包含分词单元 live 的文档数目,本示例为 3。
最终 IDF 求解结果为:2.043074,其计算公式如下:
IDF = log(1 + (N - n + 0.5) / (n + 0.5))
= log( 1 + ( 26 - 3 + 0.5) / ( 3 + 0.5))
= log( 1 + 23.5/3.5)
= log( 1 + 6.714285)
= log(7.714285)
= 2.043074
如上计算对数, 底数为 e。
5.3 总评分拆解
总评分
= boost * TF * IDF
= 2.2 * 0.74080354 * 2.043074 = 3.3297362
boost 为什么等于 2.2 ?
如果我们不指定 boost,boost 就是使用缺省值,缺省值是 2.2。
boost 参见:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
https://www.infoq.com/articles/similarity-scoring-elasticsearch/
6、小结
一步步拆解,才能知道 BM25 模型的评分‘奥秘’所在,原来难懂的数学计算公式,也变得清晰明朗!
有了拆解,再来看其他的检索评分问题自然会“毫不费力"。
本文由英文博客:https://blog.mimacom.com/bm25-got/ 翻译而来,较原来博客内容,增加了计算的细节和个人解读,确保每一个计算细节小学生都能看懂。
欢迎就评分问题留言交流细节。
参考
- https://blog.mimacom.com/bm25-got/
- 实战 | Elasticsearch自定义评分的N种方法
- https://ruby-china.org/topics/31934
- https://www.elastic.co/guide/en/elasticsearch/reference/current/similarity.html
- https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
- https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
上一篇: 你对字符串匹配算法了解多少?
推荐阅读
-
干货|Elasticsearch BM25 模型评分细节的逐步分解
-
纯干货分享 | 研发效能提升——敏捷需求篇-而敏捷需求是提升效能的方式中不可或缺的模块之一。 云智慧的敏捷教练——Iris Xu近期在公司做了一场分享,主题为「敏捷需求挖掘和组织方法,交付更高业务价值的产品」。Iris具有丰富的团队敏捷转型实施经验,完成了企业多个团队从传统模式到敏捷转型的落地和实施,积淀了很多的经验。 这次分享主要包含以下2个部分: 第一部分是用户影响地图 第二部分是事件驱动的业务分析Event driven business analysis(以下简称EDBA) 用户影响地图,是一种从业务目标到产品需求映射的需求挖掘和组织的方法。 在软件开发过程中可能会遇到一些问题,比如大家使用不同的业务语言、技术语言,造成角色间的沟通阻碍,还会导致一些问题,比如需求误解、需求传递错误等;这会直接导致产品的功能需求和要实现的业务目标不是映射关系。 但在交付期间,研发人员必须要将这些需求实现交付,他们实则并不清楚这些功能需求产生的原因是什么、要解决客户的哪些痛点。研发人员往往只是拿到了解决方案,需要把它实现,但没有和业务侧一起去思考解决方案是否正确,能否真正的帮助客户解决问题。而用户影响地图通常是能够连接业务目标和产品功能的一种手段。 我们在每次迭代里加入的假设,也就是功能需求。首先把它先实现,再逐步去验证我们每一个小目标是否已经实现,再看下一个目标要是什么。那影响地图就是在这个过程中帮我们不断地去梳理目标和功能之间的关系。 我们在软件开发中可能存在的一些问题 针对这些问题,我们如何避免?先简单介绍做敏捷转型的常规思路: 先做团队级的敏捷,首先把产品、开发、测试人员,还有一些更后端的人员比如交互运维的同学放在一起,组成一个特训团队做交付。这个团队要包含交付过程中所涉及的所有角色。 接着业务敏捷要打通整个业务环节和研发侧的一个交付。上图中可以看到在敏捷中需求是分层管理的,第一层是业务需求,在这个层级是以用户目标和业务目标作为输入进行规划,同时需要去考虑客户的诉求。业务人员通过获取到的业务需求,进一步的和团队一起将其分解为产品需求。所以业务需求其实是我们真正去发布和运营的单元,它可以被独立发布到我们的生产环境上。我们的产品需求其实就是产品的具体功能,它是我们集成和测试的对象,也就是我们最终去部署到系统上的一个基本单元。产品需求再到了我们的开发团队,映射到迭代计划会上要把它分解为相应的技术任务,包括我们平时所说的比如一些前端的开发、后端的开发、测试都是相应的技术任务。所以业务敏捷要达到的目标是需要去持续顺畅高质量的交付业务价值。 将这几个点串起来,形成金字塔结构。最上层我们会把业务目标放在整个金字塔的塔尖。这个业务目标是通过用户的目标以及北极星指标确立的。确认业务目标后再去梳理相应的业务流程,最后生产。另外产品需求包含了操作流程和业务规则,具需求交付时间、工程时间以及我们的一些质量标准的要求。 谈到用户影响的地图,在敏捷江湖上其实有一个传说,大家都有一个说法叫做敏捷需求的“任督二脉”。用户影响地图其实就是任脉,在黑客马拉松上用过的用户故事地图其实叫督脉。所以说用户影响地图是在用户故事地图之前,先帮我们去梳理出我们要做哪些东西。当我们真正识别出我们要实现的业务活动之后,用户故事地图才去梳理我们整个的业务工作流,以及每个工作流节点下所要包含的具体功能和用户故事。所以说用户影响地图需要解决的问题,我们包括以下这些: 首先是范围蔓延,我们在整张地图上,功能和对应的业务目标是要去有一个映射的。这就避免了一些在我们比如有很多干系人参与的会议上,那大家都有不同想法些立场,会提出很多需求(正确以及错误的需求)。这个时候我们会依据目标去看这些需求是否真的是会影响我们的目标。 这里提到的错误需求,比如是利益相关的人提出的、客户认为产品应该有的、某个产品经理需求分析师认为可以有的....但是这些功能在用户影响地图中匹配不到对应目标的话,就需要降低优先级或弃掉。另外,通常我们去制定解决方案的时候,会考虑较完美的实现,导致解决方案括很多的功能。这个时候关键目标至关重要,会帮助我们梳理筛选、确定优先级。 看一下用户影响到地图概貌 总共分为一个三层的结构: 第一层why,你的业务目标哪个是最重要的,为什么?涉及到的角色有哪些? 第二层how ,怎样产生影响?影响用户角色什么样的行为? (不需要去列出所有的影响,基于业务目标) 第三层what,最关键的是在梳理需求时不需一次把所有细节想全,这通常团队中经常遇到的问题。 我们用这个例子来看一下 这是一个客服中心的影响地图,业务目标是 3个月内不增加客服人数的前提下能支持1.5倍的用户数。此业务目标设定是符合 smart 原则的,specific非常的具体,miserable 是可以衡量的,action reoriented是面向活动的, real list 也是很实际的。 量化的目标会指引我们接下来的行动,梳理一个业务目标,尽量去量化,比如 :我们通过打造一条什么样的流水线,能够提高整个部署的效率,时间是原来的 1/2 。这样才是一个能量化的有意义的目标。 回到这幅图, how 层级识别出来的内容,客服角色:想要对它施加的影响,把客户引导到论坛上,帮助客户更容易的跟踪问题,更快速的去定位问题。初级用户:方论坛上找到问题。高级用户:在论坛上回答问题。通过我们这些用户角色,进行活动,完成在不增加客户客服人数的前提下支持更多的用户数量。 最后一个层级,才是我们日常接触比较多的真正的功能的特性和需求,比如引导到客户到论坛上,其实这个产品就需要有一个常见问题的论坛的链接。这个层次需要我们团队进一步地在交付,在每个迭代之前做进一步的梳理,细化成相应的用户故事。 这个是云智慧团队中,自己做的影响地图的范例,可以看下整个的层级结构。序号表示优先级。 那我们用户影响地图可以总结为: