zero-shot:MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning

最编程 2024-07-16 08:58:07

...

Zero-Shot Learning

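Does Zero-Shot Learning mean finding the image that matches a given description??
If so, it works like this:
The inputs are $Y_{te}$ and $A_{te}$.
The model predicts $A_{pre}$ from the image $X_{te}$, then measures how similar $A_{pre}$ is to $A_{te}$, and the most similar image is selected.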

The goal of Zero-Shot Learning: tell the model the attributes A, and the model can find the images that have attributes A.
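A minimal sketch of the matching step described above, assuming the model has already produced a predicted attribute vector $A_{pre}$ for one image and that each unseen class comes with one attribute vector. The function name, the use of cosine similarity, and the toy values are assumptions for illustration only.

```python
import numpy as np

def predict_unseen_class(a_pre, class_attributes):
    """Return the index of the class whose attribute vector is closest to a_pre.

    a_pre:            (K,) attribute vector predicted from one image.
    class_attributes: (C, K) matrix, one attribute vector per unseen class.
    """
    # Cosine similarity between the prediction and every class attribute vector.
    a_norm = a_pre / np.linalg.norm(a_pre)
    c_norm = class_attributes / np.linalg.norm(class_attributes, axis=1, keepdims=True)
    similarity = c_norm @ a_norm
    return int(np.argmax(similarity))

# Toy example: 3 unseen classes described by 4 attributes (made-up values).
A_te = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])
A_pre = np.array([0.9, 0.1, 0.8, 0.2])
predicted = predict_unseen_class(A_pre, A_te)
```

The same comparison can be read in the other direction (rank images for a given $A_{te}$), which is the retrieval view the notes start from.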

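During training, let the samples that belong to $Y_{tr}$

If we select the image whose $A_{pre}$ is most similar to $A_{tr}$, and then judge

Every $Y_{te}$ has a corresponding $A_{te}$.
How are $A_{te}$ and $Y_{te}$ matched to each other?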

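So how is the model trained??
Answer: the model may pick the wrong image, and when it does the loss should be large. In that sense a cross-entropy loss is enough.
When the predicted A is close to the target A but the labels differ, the loss should be large.
When the predicted A is close to the target A and the labels agree, the loss should be the gap between the predicted A and the target A.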

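Mapping between the feature space and the semantic space: a classic classification network always outputs a feature vector at the end, and this feature vector is still different from A. However, the feature vector can be used to obtain A.
Each element of the feature vector represents some part of the image's features, which is somewhat similar to what A encodes.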

The link covers:
Basic concepts of zero-shot learning
An introduction to the datasets
Open problems in ZSL

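Before reading this paper, first read An embarrassingly simple approach to zero-shot learning.
It has a strong theoretical foundation and the algorithm is simple and effective. Although it was published many years ago, it is still one of the baselines that new work is expected to compare against.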
https://github.com/sbharadwajj/embarrassingly-simple-zero-shot-learning
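To give a feel for why that method is called simple, here is a minimal sketch of its closed-form solution, $V = (XX^\top + \gamma I)^{-1} X Y S^\top (SS^\top + \lambda I)^{-1}$. The function names, the toy data, and the default regularization values are assumptions for illustration; this is not code from the linked repository.

```python
import numpy as np

def eszsl_fit(X, Y, S, gamma=1.0, lam=1.0):
    """Closed-form bilinear map between features and attributes.

    X: (d, m) training features, one column per sample.
    Y: (m, z) labels in {-1, +1}, one column per seen class.
    S: (a, z) attribute signatures, one column per seen class.
    Returns V with shape (d, a).
    """
    d, a = X.shape[0], S.shape[0]
    left = np.linalg.inv(X @ X.T + gamma * np.eye(d))
    right = np.linalg.inv(S @ S.T + lam * np.eye(a))
    return left @ X @ Y @ S.T @ right

def eszsl_predict(X_test, V, S_unseen):
    """Score every test sample against every unseen class signature."""
    scores = X_test.T @ V @ S_unseen  # (n, z_unseen)
    return np.argmax(scores, axis=1)

# Toy data (made up): 3-d features, 2 attributes, 2 seen and 2 unseen classes.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Y = np.array([[1.0, -1.0], [-1.0, 1.0]])
S = np.eye(2)
S_unseen = np.array([[0.9, 0.1], [0.1, 0.9]])
V = eszsl_fit(X, Y, S)
preds = eszsl_predict(X, V, S_unseen)
```

No iterative optimization is involved: training is two matrix inversions and a few products, which is why the method remains a convenient baseline.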

Look up the tricks commonly used in CV, few-shot learning, or specific areas such as classification and segmentation.

Abstract

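1. The ZSL concept: learn visual features and attribute features, then apply the learned knowledge to unseen classes.
2. Shortcomings of previous work: earlier methods align an image's global feature (e.g. the feature vector output by ResNet101) with the semantic vector of its class, which yields only a limited semantic representation. They cannot effectively discover the intrinsic semantic knowledge between visual features and attribute features.
3. MSDN designs two sub-nets: 1) one learns attribute-based visual features from the attributes; 2) the other learns visual-based attribute features from the visual features.
Through a semantic distillation loss, the two sub-nets learn collaboratively and teach each other throughout training.
4. The datasets commonly used for zero-shot learning are CUB, SUN, and AWA2.
Code link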

1. Introduction

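1. Introduces the origin of zero-shot learning.
2. ZSL can be divided into conventional ZSL (CZSL) and generalized ZSL (GZSL):
CZSL aims to predict unseen classes.
GZSL can predict both seen and unseen ones.
3. Explanation of the different "space" concepts:
semantic space: every class has a set of attributes. The attributes of a class are encoded as a vector called the class semantic vector, and the set of class semantic vectors is the semantic space.
visual space: the set of feature vectors learned by a CNN.
common space: a latent space shared by the visual mapping and the semantic mapping.
attribute space: obtained from a language model.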

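4. Common ZSL approaches: embedding-based methods, generative methods, and common space learning-based methods.

embedding-based methods: map the visual features into the semantic space.
generative methods: generate visual features from the semantics. The generated features play the same role as the features a CNN produces from an image, so the problem is turned into an ordinary classification problem. Currently, generative ZSL is usually based on variational autoencoders (VAEs), generative adversarial nets (GANs), and generative flows.
common space learning-based methods: map both the visual features and the semantic representations into one shared space.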

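5. MSDN designs two sub-nets: 1) one learns attribute-based visual features from the attributes; 2) the other learns visual-based attribute features from the visual features.

Both sub-nets are optimized with an attribute-based cross-entropy loss with self-calibration.
A semantic distillation loss matches the probability estimates of each sub-net to those of its peer (I do not understand this part yet), so that the two sub-nets learn collaboratively and teach each other during training.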

2. Related Work

2.1.ZSL

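Concept: ZSL learns a mapping between the visual and attribute/semantic domains.
The paper then goes over embedding-based methods, generative methods, and common space learning-based methods once more.
None of these three families captures the subtle differences between seen classes and unseen classes. Attention-based ZSL methods use attribute descriptions as guidance to discover more fine-grained features, but they rely only on unidirectional attention, so the semantic information they obtain is limited.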

2.2.Knowledge Distillation

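Knowledge Distillation: a teacher network trains a smaller student network, which compresses the model.
Inspired by this, the authors design a mutually semantic distillation network in which the two sub-nets learn from and teach each other.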

3. Mutually Semantic Distillation Network

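1. Restates the shortcomings of earlier methods and the overview of the proposed method (a lot of repetition here).
2. Defines the notation, including samples, labels, and semantic vectors. The attributes that each class has are represented by semantic vectors.
Is each attribute represented by a vector in A???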

3.1. Attribute→Visual Attention Sub-net

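For each attribute, the sub-net locates the image region most relevant to that attribute and extracts an attribute-based visual feature from that region.
$V = \{v_1, \ldots, v_R\}$ denotes the visual features of one image; each visual feature corresponds to one region of the image.
$A = \{a_1, \ldots, a_K\}$ denotes the attribute vectors; each attribute vector in $A$ corresponds to one attribute.
The attention weight between the r-th region and the k-th attribute is computed as

$$\beta_k^r = \frac{\exp(a_k^\top W_1 v_r)}{\sum_{r'=1}^{R} \exp(a_k^\top W_1 v_{r'})}$$

where $W_1$ is a learnable parameter.
The k-th attribute-based visual feature $F_k$ is

$$F_k = \sum_{r=1}^{R} \beta_k^r v_r$$

If the image clearly exhibits attribute $a_k$, it receives a high positive score; otherwise it receives a negative score. The result is a set of attribute-based visual features $F = \{F_1, F_2, \cdots, F_K\}$.
Finally, the function $M_1$ maps $F$ into the semantic embedding space:

$$\psi_k = M_1(F_k) = a_k^\top W_2 F_k$$

which gives $\psi(x) = \{\psi_1, \psi_2, \cdots, \psi_K\}$.
$\psi_k$ represents the confidence of the k-th attribute in the image, i.e. how "likely" it is that the k-th attribute is present in the image.
What is $\psi(x)$ used for?
How does the semantic embedding space relate to $A$??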
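A minimal NumPy sketch of this sub-net, assuming the attention weights are a softmax over regions of a bilinear score $a_k^\top W_1 v_r$ and that $M_1$ is also bilinear, $a_k^\top W_2 F_k$. The function names, the matrix `W2`, the shapes, and the random inputs are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attribute_to_visual(V, A, W1, W2):
    """V: (R, dv) region features, A: (K, da) attribute vectors.
    W1, W2: (da, dv) learnable matrices (random here)."""
    logits = A @ W1 @ V.T                 # (K, R): a_k^T W1 v_r
    beta = softmax(logits, axis=1)        # attention over the R regions
    F = beta @ V                          # (K, dv): attribute-based visual features
    psi = np.sum((A @ W2) * F, axis=1)    # (K,): a_k^T W2 F_k, one score per attribute
    return beta, F, psi

rng = np.random.default_rng(0)
R, K, dv, da = 6, 4, 8, 5
V = rng.normal(size=(R, dv))
A = rng.normal(size=(K, da))
W1 = rng.normal(size=(da, dv))
W2 = rng.normal(size=(da, dv))
beta, F, psi = attribute_to_visual(V, A, W1, W2)
```

The output `psi` has one entry per attribute, so it lives in the same K-dimensional space as a class semantic vector and can be compared against it directly.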

3.2. Visual→Attribute Attention Sub-net

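Analogously, the paper designs a visual→attribute attention sub-net to learn visual-based semantic attribute representations. It mirrors Section 3.1.
The attention weight between the r-th region and the k-th attribute is computed as

$$\gamma_r^k = \frac{\exp(v_r^\top W_3 a_k)}{\sum_{k'=1}^{K} \exp(v_r^\top W_3 a_{k'})}$$

where $W_3$ is a learnable parameter.
The r-th visual-based attribute feature $S_r$ is

$$S_r = \sum_{k=1}^{K} \gamma_r^k a_k$$

Finally, the function $M_2$ maps $S$ into the semantic space:

$$\bar{\Psi}_r = M_2(S_r) = v_r^\top W_4 S_r$$

which gives $\bar{\Psi}(x) = \{\bar{\Psi}_1, \bar{\Psi}_2, \cdots, \bar{\Psi}_R\}$.
Because the semantic vector is K-dimensional, the R-dimensional $\bar{\Psi}(x)$ is converted into the K-dimensional $\Psi(x)$, i.e. mapped into the semantic attribute space:

$$\Psi(x) = \bar{\Psi}(x) \times Att = \bar{\Psi}(x) \times (V^\top W_{att} A)$$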
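A minimal NumPy sketch of this sub-net, assuming the attention is a softmax over attributes of a bilinear score $v_r^\top W_3 a_k$, that $M_2$ is bilinear as well, and that the final R-to-K conversion multiplies by $V^\top W_{att} A$ as in the last formula. The function names, `W4`, the shapes, and the random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def visual_to_attribute(V, A, W3, W4, W_att):
    """V: (R, dv) region features, A: (K, da) attribute vectors.
    W3, W4, W_att: (dv, da) learnable matrices (random here)."""
    logits = V @ W3 @ A.T                     # (R, K): v_r^T W3 a_k
    gamma = softmax(logits, axis=1)           # attention over the K attributes
    S = gamma @ A                             # (R, da): visual-based attribute features
    psi_bar = np.sum((V @ W4) * S, axis=1)    # (R,): v_r^T W4 S_r, one score per region
    att = V @ W_att @ A.T                     # (R, K): region-to-attribute map
    Psi = psi_bar @ att                       # (K,): scores in the attribute space
    return gamma, S, psi_bar, Psi

rng = np.random.default_rng(0)
R, K, dv, da = 6, 4, 8, 5
V = rng.normal(size=(R, dv))
A = rng.normal(size=(K, da))
W3 = rng.normal(size=(dv, da))
W4 = rng.normal(size=(dv, da))
W_att = rng.normal(size=(dv, da))
gamma, S, psi_bar, Psi = visual_to_attribute(V, A, W3, W4, W_att)
```

Note the asymmetry with Section 3.1: here the softmax runs over attributes rather than regions, so the raw scores are indexed by region and need the extra multiplication to end up K-dimensional.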

3.3. Model Optimization

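Both sub-nets are trained with the attribute-based cross-entropy loss with self-calibration, and the semantic distillation loss lets the two networks learn from each other.

Attribute-Based Cross-Entropy Loss:

The class semantic vector $z^c$ acts as the label, and $f(x_i)$ should be as close to $z^c$ as possible.
What is $z^c$, and how does it relate to A????? Answer: probably this: each $a_i$ in A is the vector that stands for attribute i, while $z^c$ says whether class c has each attribute, or how likely it is to have it.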
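A minimal sketch of the cross-entropy part, assuming the score of class c is the dot product between $f(x_i)$ and $z^c$ and that the scores are turned into probabilities with a softmax. The self-calibration term is left out, and the function name and toy values are assumptions for illustration.

```python
import numpy as np

def attribute_cross_entropy(f_x, Z, label):
    """f_x: (K,) attribute scores of one image.
    Z: (C, K) class semantic vectors, one row per class.
    label: index of the ground-truth class."""
    scores = Z @ f_x                      # compatibility with every class
    scores = scores - scores.max()        # stabilize the softmax
    log_prob = scores - np.log(np.sum(np.exp(scores)))
    return float(-log_prob[label])

# Toy example with 2 classes and 3 attributes (made-up values).
Z = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
f_x = np.array([2.0, -1.0, 1.5])
loss_correct = attribute_cross_entropy(f_x, Z, 0)
loss_wrong = attribute_cross_entropy(f_x, Z, 1)
```

The image's attribute scores agree with class 0, so the loss is small when the label is 0 and large when the label is 1, which matches the intuition in the notes that a wrong match should be penalized heavily.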

Semantic Distillation Loss:

$p_1 = \{\psi(x_i) \times z^1, \psi(x_i) \times z^2, \cdots, \psi(x_i) \times z^C\}$,
$p_2 = \{\Psi(x_i) \times z^1, \Psi(x_i) \times z^2, \cdots, \Psi(x_i) \times z^C\}$,
$p_1$ and $p_2$ are the predictions of the two sub-nets.
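One way to make the two predictions agree is sketched below, assuming the distillation term combines a symmetric KL divergence between the softmax-normalized $p_1$ and $p_2$ with a squared L2 distance between $\psi(x_i)$ and $\Psi(x_i)$. The exact form used in the paper may differ; the function names and toy values are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl(p, q):
    # KL divergence between two strictly positive distributions.
    return float(np.sum(p * (np.log(p) - np.log(q))))

def distillation_loss(psi, Psi, Z):
    """psi, Psi: (K,) outputs of the two sub-nets for one image.
    Z: (C, K) class semantic vectors."""
    p1 = softmax(Z @ psi)                 # class probabilities from sub-net 1
    p2 = softmax(Z @ Psi)                 # class probabilities from sub-net 2
    sym_kl = 0.5 * (kl(p1, p2) + kl(p2, p1))
    l2 = float(np.sum((psi - Psi) ** 2))
    return sym_kl + l2

Z = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
psi = np.array([0.8, 0.1, 0.7])
Psi = np.array([0.6, 0.3, 0.9])
loss_same = distillation_loss(psi, psi, Z)
loss_diff = distillation_loss(psi, Psi, Z)
```

Because the term is symmetric, neither sub-net is a fixed teacher: each one is pulled toward the other's estimate.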

Overall Loss:
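Given the two loss weights $\lambda_{cal}$ and $\lambda_{distill}$ listed in the implementation details, the overall objective is presumably a weighted sum of the cross-entropy, self-calibration, and distillation terms. The symbols $\mathcal{L}_{ACE}$, $\mathcal{L}_{cal}$, and $\mathcal{L}_{distill}$ are names assumed here for those three terms.

```latex
\mathcal{L} = \mathcal{L}_{ACE} + \lambda_{cal}\,\mathcal{L}_{cal} + \lambda_{distill}\,\mathcal{L}_{distill}
```

Under this reading, setting $\lambda_{cal} = 0$ simply switches the self-calibration term off.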

3.4. Zero-Shot Prediction

First obtain $\psi(x_i)$ and $\Psi(x_i)$ for the sample $x_i$ to be classified, then compute the class of $x_i$ with the following formula:
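A minimal sketch of this prediction step, assuming the two outputs are fused by a weighted sum and the class whose semantic vector gives the highest score is returned. The weights `alpha1` and `alpha2`, their default values, and the function name are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def predict_class(psi, Psi, Z, alpha1=0.5, alpha2=0.5):
    """psi, Psi: (K,) outputs of the two sub-nets for one image.
    Z: (C, K) class semantic vectors of the candidate classes."""
    fused = alpha1 * psi + alpha2 * Psi   # combine both sub-nets
    scores = Z @ fused                    # compatibility with every class
    return int(np.argmax(scores))

Z = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
psi = np.array([0.8, 0.1, 0.7])
Psi = np.array([0.6, 0.3, 0.9])
predicted = predict_class(psi, Psi, Z)
```

In the CZSL setting `Z` would hold only the unseen classes, while in the GZSL setting it would hold both seen and unseen classes.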

4. Experiments

Datasets used:
CUB (Caltech-UCSD Birds-200), SUN (SUN Attribute), and AWA2 (Animals with Attributes 2).
CUB and SUN are fine-grained datasets, whereas AWA2 is a coarse-grained dataset.
What is the difference between a coarse-grained and a fine-grained dataset?
Dataset statistics:
CUB includes 11,788 images of 200 bird classes (seen/unseen classes = 150/50) with 312 attributes.
SUN has 14,340 images from 717 scene classes (seen/unseen classes = 645/72) with 102 attributes.
AWA2 consists of 37,322 images from 50 animal classes (seen/unseen classes = 40/10) with 85 attributes.
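Evaluation:
In the CZSL setting, only unseen data is predicted, and the resulting accuracy is called acc.
In the GZSL setting, both seen (S) and unseen (U) data are predicted.
Harmonic mean: defined as H = (2 × S × U)/(S + U).
What is top-1 accuracy? The accuracy of the top one percent? The best accuracy?
What is the difference between CZSL and GZSL, and why do both exist? Answer: GZSL can also predict seen data. But wait, why can't CZSL predict seen data then? Aren't both trained on seen data?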
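On the top-1 question: top-1 accuracy counts a prediction as correct only when the single highest-scoring class equals the true label. The sketch below shows that metric together with the harmonic mean H defined above. The function names and toy numbers are assumptions for illustration, and the accuracy here is averaged over samples, whereas benchmark protocols may average per class.

```python
def top1_accuracy(scores, labels):
    """scores: list of per-class score lists, one per sample.
    labels: list of ground-truth class indices."""
    correct = 0
    for row, label in zip(scores, labels):
        # The predicted class is the index of the highest score.
        predicted = max(range(len(row)), key=lambda c: row[c])
        correct += int(predicted == label)
    return correct / len(labels)

def harmonic_mean(s, u):
    """H = 2 * S * U / (S + U), with S and U the seen and unseen accuracies."""
    if s + u == 0:
        return 0.0
    return 2 * s * u / (s + u)

scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.5, 0.1],
]
labels = [1, 0, 2, 0]
acc = top1_accuracy(scores, labels)
h = harmonic_mean(0.6, 0.3)
```

The harmonic mean is dominated by the smaller of the two accuracies, so a model that does well on seen classes but poorly on unseen ones still gets a low H.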
Implementation details:
1. A ResNet101 pre-trained on ImageNet is used.
2. The optimizer is RMSProp (momentum = 0.9, weight decay = 0.0001).
3. learning rate = 0.0001
batch size = 50
loss weights $\{\lambda_{cal}, \lambda_{distill}\}$ are set to {0.1, 0.001} for CUB and AWA2, and {0.0, 0.01} for SUN
