
Latest Advances in Computer Vision and Pattern Recognition (December 15 Edition)


cs.CV: 63 papers in total today

Transformer (5 papers)

【1】 AdaViT: Adaptive Tokens for Efficient Vision Transformer
Link: https://arxiv.org/abs/2112.07658

Authors: Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov
Affiliations: NVIDIA
Abstract: We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformers (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We demonstrate that AdaViT requires no extra parameters or sub-network for halting, as we base the learning of adaptive halting on the original network parameters. We further introduce distributional prior regularization that stabilizes training compared to prior ACT approaches. On the image classification task (ImageNet1K), we show that our proposed AdaViT yields high efficacy in filtering informative spatial features and cutting down on the overall compute. The proposed method improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only a 0.3% accuracy drop, outperforming prior art by a large margin.
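The per-token halting described above can be pictured with a short sketch. Below is a minimal, hypothetical PyTorch illustration of ACT-style token halting; the sigmoid halting head, threshold, and toy blocks are assumptions for exposition, not the authors' implementation (which derives halting from existing network parameters).

    import torch

    def act_token_halting(tokens, blocks, halt_threshold=0.99):
        # tokens: (B, N, D); blocks: pairs of (transformer_block, halting_head)
        B, N, _ = tokens.shape
        cum_halt = torch.zeros(B, N)                 # cumulative halting probability per token
        active = torch.ones(B, N, dtype=torch.bool)  # tokens still being processed
        for block, halt_head in blocks:
            tokens = block(tokens)                   # attention masking of halted tokens omitted
            h = torch.sigmoid(halt_head(tokens)).squeeze(-1)  # (B, N) halting scores
            cum_halt = cum_halt + h * active
            active = active & (cum_halt < halt_threshold)     # halt tokens past the threshold
            tokens = tokens * active.unsqueeze(-1)            # zero out (discard) halted tokens
        return tokens, active

    # toy usage
    blocks = [(torch.nn.TransformerEncoderLayer(64, 4, batch_first=True), torch.nn.Linear(64, 1))
              for _ in range(4)]
    out, active = act_token_halting(torch.randn(2, 16, 64), blocks)
    print(active.sum(dim=1))  # number of surviving tokens per image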

【2】 Geometry-Contrastive Transformer for Generalized 3D Pose Transfer
Link: https://arxiv.org/abs/2112.07374

Authors: Haoyu Chen, Hao Tang, Zitong Yu, Nicu Sebe, Guoying Zhao
Affiliations: CMVS, University of Oulu; Computer Vision Lab, ETH Zurich; DISI, University of Trento
Note: AAAI 2022
Abstract: We present a customized 3D mesh Transformer model for the pose transfer task. As 3D pose transfer is essentially a deformation procedure dependent on the given meshes, the intuition of this work is to perceive the geometric inconsistency between the given meshes with the powerful self-attention mechanism. Specifically, we propose a novel geometry-contrastive Transformer with an efficient 3D structured perceiving ability for the global geometric inconsistencies across the given meshes. Moreover, locally, a simple yet efficient central geodesic contrastive loss is further proposed to improve regional geometric-inconsistency learning. Finally, we present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task towards unknown spaces. Extensive experimental results prove the efficacy of our approach, showing state-of-the-art quantitative performance on the SMPL-NPT, FAUST, and our newly proposed SMG-3D datasets, as well as promising qualitative results on the MG-cloth and SMAL datasets. The results demonstrate that our method achieves robust 3D pose transfer and generalizes to challenging meshes from unknown spaces in cross-dataset tasks. The code and dataset are available at https://github.com/mikecheninoulu/CGT.

【3】 Temporal Transformer Networks with Self-Supervision for Action Recognition
Link: https://arxiv.org/abs/2112.07338

Authors: Yongkang Zhang, Jun Li, Guoming Wu, Han Zhang, Zhiping Shi, Zhaoxun Liu, Zizhang Wu, Na Jiang
Affiliations: Beihang University
Abstract: In recent years, 2D convolutional network-based video action recognition has gained wide popularity; however, constrained by the lack of long-range non-linear temporal relation modeling and reverse motion information modeling, the performance of existing models is seriously undercut. To address this problem, we introduce a Temporal Transformer Network with Self-supervision (TTSN). Our high-performance TTSN mainly consists of a temporal transformer module and a temporal sequence self-supervision module. Concisely speaking, we utilize the efficient temporal transformer module to model the non-linear temporal dependencies among non-local frames, which significantly enhances complex motion feature representations. The temporal sequence self-supervision module adopts the streamlined strategy of "random batch random channel" to reverse the sequence of video frames, allowing robust extraction of motion information representations from inverted temporal dimensions and improving the generalization capability of the model. Extensive experiments on three widely used datasets (HMDB51, UCF101, and Something-Something V1) demonstrate that our proposed TTSN is promising, as it successfully achieves state-of-the-art performance for action recognition.
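The frame-reversal self-supervision can be sketched in a few lines. The following is a minimal PyTorch sketch, assuming sample-level reversal with a binary order-prediction target; the paper's channel-level randomization is omitted for brevity.

    import torch

    def random_temporal_reverse(clips, p=0.5):
        # clips: (B, C, T, H, W); reverse the frame order of randomly chosen samples
        # and return a 0/1 target for the self-supervised order-prediction head.
        flip = torch.rand(clips.size(0)) < p
        out = clips.clone()
        out[flip] = out[flip].flip(dims=(2,))   # flip along the temporal axis T
        return out, flip.long()

    clips = torch.randn(4, 3, 8, 32, 32)
    aug, target = random_temporal_reverse(clips)
    print(target)  # e.g. tensor([1, 0, 1, 0]) -- which clips were reversed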

【4】 Co-training Transformer with Videos and Images Improves Action Recognition
Link: https://arxiv.org/abs/2112.07175

Authors: Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, Fei Sha
Affiliations: USC, Google Brain, Apple AI, Google Research
Abstract: In learning action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on target action recognition with videos. This approach has achieved good empirical performance, especially with recent transformer-based video architectures. While many recent works aim to design more advanced transformer architectures for action recognition, less effort has been made on how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pretrained on ImageNet-21K based on the TimeSformer architecture, CoVeR improves Kinetics-400 Top-1 accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When pretrained on larger-scale image datasets following previous state-of-the-art work, CoVeR achieves the best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), with a simple spatio-temporal video transformer.
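The second finding, images co-trained as single-frame videos, reduces to a small tensor manipulation. A minimal sketch, assuming a generic video model and a simple weighted loss mix (the actual loss weighting is not specified here):

    import torch

    def images_as_clips(images, num_frames=1):
        # images: (B, C, H, W) -> clips: (B, C, T, H, W)
        return images.unsqueeze(2).repeat(1, 1, num_frames, 1, 1)

    def cover_step(video_model, loss_fn, clips, clip_labels, images, image_labels, w_img=1.0):
        # One hypothetical co-training step: a video batch and an image batch
        # (treated as single-frame videos) contribute to a joint loss.
        loss_video = loss_fn(video_model(clips), clip_labels)
        loss_image = loss_fn(video_model(images_as_clips(images)), image_labels)
        return loss_video + w_img * loss_image

    print(images_as_clips(torch.randn(2, 3, 224, 224)).shape)  # (2, 3, 1, 224, 224)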

【5】 Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
Link: https://arxiv.org/abs/2112.07074

Authors: Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown
Affiliations: Google Research; University of California, Los Angeles
Note: preliminary work
Abstract: In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads. To efficiently pre-train the proposed model jointly on unpaired images and text, we propose two novel techniques: (i) we employ the separately trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals for the joint training; (ii) we propose a novel gradient masking strategy to balance the parameter updates from the image and text pre-training losses. We evaluate the jointly pre-trained transformer by fine-tuning it on image classification tasks and natural language understanding tasks, respectively. The experiments show that the resultant unified foundation transformer works surprisingly well on both vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift performance to approach the level of separately trained models.
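The gradient-masking idea can be illustrated with a toy step. This is a hypothetical sketch only; the paper's actual masking rule is not reproduced here. It merely shows the mechanism of letting each pre-training loss update a masked subset of shared-parameter entries so that neither modality dominates.

    import torch

    def masked_joint_backward(model, loss_image, loss_text, image_ratio=0.5):
        # Compute per-parameter gradients for each loss, then combine them with
        # complementary binary masks (the random masking criterion is an assumption).
        params = list(model.parameters())
        g_img = torch.autograd.grad(loss_image, params, retain_graph=True)
        g_txt = torch.autograd.grad(loss_text, params)
        for p, gi, gt in zip(params, g_img, g_txt):
            mask = (torch.rand_like(gi) < image_ratio).float()
            p.grad = mask * gi + (1.0 - mask) * gt
        # the caller then runs optimizer.step()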

Detection (10 papers)

【1】 Out-of-Distribution Detection without Class Labels
Link: https://arxiv.org/abs/2112.07662

Authors: Niv Cohen, Ron Abutbul, Yedid Hoshen
Affiliations: School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Abstract: Anomaly detection methods identify samples that deviate from the normal behavior of the dataset. The task is typically tackled for training sets containing normal data from multiple labeled classes or from a single unlabeled class. Current methods struggle when faced with training data consisting of multiple classes but no labels. In this work, we first discover that classifiers learned by self-supervised image clustering methods provide a strong baseline for anomaly detection on unlabeled multi-class datasets. Perhaps surprisingly, we find that initializing clustering methods with pre-trained features does not improve over their self-supervised counterparts, due to the phenomenon of catastrophic forgetting. Instead, we suggest a two-stage approach. We first cluster images using self-supervised methods and obtain a cluster label for every image. We use the cluster labels as "pseudo supervision" for out-of-distribution (OOD) methods. Specifically, we finetune pretrained features on the task of classifying images by their cluster labels. We provide extensive analyses of our method and demonstrate the necessity of the two-stage approach. We evaluate it against state-of-the-art self-supervised and pretrained methods and demonstrate superior performance.
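The two-stage recipe — cluster, then use cluster labels as pseudo-supervision for OOD scoring — can be sketched compactly. A minimal sketch, using k-means as a stand-in for the self-supervised clustering and a linear head in place of full feature finetuning:

    import torch
    import torch.nn.functional as F
    from sklearn.cluster import KMeans

    def fit_pseudo_classifier(features, n_clusters=10, steps=200):
        # Stage 1: cluster unlabeled features (placeholder for self-supervised clustering).
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features.numpy())
        pseudo = torch.as_tensor(labels, dtype=torch.long)
        # Stage 2: train a classifier on the cluster pseudo-labels.
        head = torch.nn.Linear(features.size(1), n_clusters)
        opt = torch.optim.Adam(head.parameters(), lr=1e-2)
        for _ in range(steps):
            opt.zero_grad()
            F.cross_entropy(head(features), pseudo).backward()
            opt.step()
        return head

    def ood_score(head, feats):
        # Low maximum softmax probability => more likely out-of-distribution.
        return 1.0 - F.softmax(head(feats), dim=1).max(dim=1).values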

【2】 Approaches Toward Physical and General Video Anomaly Detection
Link: https://arxiv.org/abs/2112.07661

Authors: Laura Kart, Niv Cohen
Affiliations: School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Abstract: In recent years, many works have addressed the problem of finding never-seen-before anomalies in videos. Yet, most work has focused on detecting anomalous frames in surveillance videos taken from security cameras. Meanwhile, the task of anomaly detection (AD) in videos exhibiting anomalous mechanical behavior has been mostly overlooked. Anomaly detection in such videos is of both academic and practical interest, as it may enable automatic detection of malfunctions in many manufacturing, maintenance, and real-life settings. To assess the potential of different approaches to detect such anomalies, we evaluate two simple baseline approaches: (i) temporal-pooled image AD techniques; (ii) density estimation of videos represented with features pretrained for video classification. Development of such methods calls for new benchmarks to allow evaluation of different possible approaches. We introduce the Physical Anomalous Trajectory or Motion (PHANTOM) dataset, which contains six different video classes. Each class consists of normal and anomalous videos. The classes differ in the phenomena they present, the variability of the normal class, and the kind of anomalies in the videos. We also suggest an even harder benchmark where anomalous activities should be spotted in highly variable scenes.
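The first baseline, temporal-pooled image AD, is essentially a one-liner. A minimal sketch, assuming mean (or max) pooling over frames before handing the result to any off-the-shelf image anomaly detector:

    import torch

    def temporal_pool(video, mode="mean"):
        # video: (T, C, H, W) -> pooled pseudo-image: (C, H, W)
        return video.mean(dim=0) if mode == "mean" else video.max(dim=0).values

    pooled = temporal_pool(torch.rand(16, 3, 64, 64))
    print(pooled.shape)  # torch.Size([3, 64, 64]); feed this to an image AD method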

【3】 CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning
Link: https://arxiv.org/abs/2112.07513

Authors: Jingyang Lin, Yingwei Pan, Rongfeng Lai, Xuehang Yang, Hongyang Chao, Ting Yao
Affiliations: Sun Yat-sen University, Guangzhou, China; The Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, Guangzhou, China; JD AI Research, Beijing, China
Note: ICME 2021 (Oral); code is publicly available at https://github.com/jylins/CORE-Text
Abstract: Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem of localizing only fragments of a text instance (i.e., sub-texts). In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, the COntrastive RElation (CORE) module, to mitigate it. CORE first leverages a vanilla relation block to model the relations among all text proposals (sub-texts of multiple text instances) and further enhances relational reasoning via instance-level sub-text discrimination in a contrastive manner. This naturally learns instance-aware representations of text proposals and thus facilitates scene text detection. We integrate the CORE module into a two-stage Mask R-CNN text detector and devise our text detector CORE-Text. Extensive experiments on four benchmarks demonstrate the superiority of CORE-Text. Code is available at https://github.com/jylins/CORE-Text.

【4】 Improving Human-Object Interaction Detection via Phrase Learning and Label Composition
Link: https://arxiv.org/abs/2112.07383

Authors: Zhimin Li, Cheng Zou, Yu Zhao, Boxun Li, Sheng Zhong
Affiliations: National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Ant Group; Megvii Technology
Note: Accepted to AAAI 2022
Abstract: Human-Object Interaction (HOI) detection is a fundamental task in high-level human-centric scene understanding. We propose PhraseHOI, containing a HOI branch and a novel phrase branch, to leverage language priors and improve relation expression. Specifically, the phrase branch is supervised by semantic embeddings whose ground truths are automatically converted from the original HOI annotations without extra human effort. Meanwhile, a novel label composition method is proposed to deal with the long-tailed problem in HOI; it composes novel phrase labels from semantic neighbors. Further, to optimize the phrase branch, a loss composed of a distilling loss and a balanced triplet loss is proposed. Extensive experiments are conducted to prove the effectiveness of the proposed PhraseHOI, which achieves significant improvement over the baseline and surpasses previous state-of-the-art methods on the Full and NonRare splits of the challenging HICO-DET benchmark.

【5】 Static-Dynamic Co-Teaching for Class-Incremental 3D Object Detection
Link: https://arxiv.org/abs/2112.07241

Authors: Na Zhao, Gim Hee Lee
Affiliations: Department of Computer Science, National University of Singapore
Note: Accepted at AAAI 2022
Abstract: Deep learning-based approaches have shown remarkable performance in the 3D object detection task. However, they suffer from a catastrophic performance drop on the originally trained classes when incrementally learning new classes without revisiting the old data. This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection approaches in real-world scenarios, where continuous learning systems are needed. In this paper, we study the unexplored yet important class-incremental 3D object detection problem and present the first solution, SDCoT, a novel static-dynamic co-teaching method. Our SDCoT alleviates the catastrophic forgetting of old classes via a static teacher, which provides pseudo annotations for old classes in the new samples and regularizes the current model by extracting previous knowledge with a distillation loss. At the same time, SDCoT consistently learns the underlying knowledge from new data via a dynamic teacher. We conduct extensive experiments on two benchmark datasets and demonstrate the superior performance of our SDCoT over baseline approaches in several incremental learning scenarios.
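The co-teaching signals can be summarized in a compact loss sketch. The following is a hedged illustration on classification logits only (the real method operates on 3D detection outputs), assuming the dynamic teacher is an EMA copy of the student — a common mean-teacher design, not necessarily the paper's exact choice:

    import torch
    import torch.nn.functional as F

    def sdcot_loss(student_logits, static_teacher_logits, dynamic_teacher_logits, new_labels, T=2.0):
        sup = F.cross_entropy(student_logits, new_labels)             # supervision on new classes
        distill = F.kl_div(F.log_softmax(student_logits / T, dim=1),  # retain old-class knowledge
                           F.softmax(static_teacher_logits / T, dim=1),
                           reduction="batchmean") * T * T
        consist = F.mse_loss(student_logits, dynamic_teacher_logits)  # follow the dynamic teacher
        return sup + distill + consist

    def ema_update(teacher, student, momentum=0.999):
        with torch.no_grad():
            for tp, sp in zip(teacher.parameters(), student.parameters()):
                tp.mul_(momentum).add_(sp, alpha=1.0 - momentum)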

【6】 Noise Reduction and Driving Event Extraction Method for Performance Improvement on Driving Noise-based Surface Anomaly Detection
Link: https://arxiv.org/abs/2112.07214

Authors: YeongHyeon Park, JoonSung Lee, Myung Jin Kim, Wonseok Park
Affiliations: SK Planet Co., Ltd.
Note: 3 pages, 3 figures, 2 tables
Abstract: Foreign substances on the road surface, such as rainwater or black ice, reduce the friction between the tire and the surface. This reduces braking performance and makes it difficult to control the vehicle body posture, risking property damage at the least and personal injury in the worst case. To avoid this problem, a road anomaly detection model based on vehicle driving noise has previously been proposed. However, the prior proposal does not account for extra noise mixed with the driving noise, nor does it skip computation for moments without vehicle driving. In this paper, we propose a simple driving event extraction method and a noise reduction method that improve computational efficiency and anomaly detection performance.
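A driving-event extractor of the kind described can be as simple as a short-time energy gate. A minimal sketch, assuming an energy threshold relative to the median frame energy (the paper's actual criterion may differ):

    import numpy as np

    def extract_driving_events(signal, frame_len=2048, hop=1024, k=1.5):
        # Frame the driving-noise signal, compute short-time energy, and keep only
        # frames above a relative threshold, skipping moments without driving.
        frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, hop)]
        energy = np.array([float(np.mean(f ** 2)) for f in frames])
        keep = energy > k * np.median(energy)
        return [f for f, m in zip(frames, keep) if m]

    events = extract_driving_events(np.random.randn(48000))
    print(len(events))  # number of frames retained for anomaly detection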

【7】 Joint 3D Object Detection and Tracking Using Spatio-Temporal Representation of Camera Image and LiDAR Point Clouds
Link: https://arxiv.org/abs/2112.07116

Authors: Junho Koh, Jaekyum Kim, Jinhyuk Yoo, Yecheol Kim, Jun Won Choi
Affiliations: Hanyang University; Korea Advanced Institute of Science and Technology (KAIST)
Abstract: In this paper, we propose a new joint object detection and tracking (JoDT) framework for 3D object detection and tracking based on camera and LiDAR sensors. The proposed method, referred to as 3D DetecTrack, enables the detector and tracker to cooperate to generate a spatio-temporal representation of the camera and LiDAR data, with which 3D object detection and tracking are then performed. The detector constructs the spatio-temporal features via weighted temporal aggregation of the spatial features obtained from the camera and LiDAR fusion. Then, the detector reconfigures the initial detection results using information from the tracklets maintained up to the previous time step. Based on the spatio-temporal features generated by the detector, the tracker associates the detected objects with previously tracked objects using a graph neural network (GNN). We devise a fully-connected GNN facilitated by a combination of rule-based edge pruning and attention-based edge gating, which exploits both spatial and temporal object contexts to improve tracking performance. Experiments conducted on both the KITTI and nuScenes benchmarks demonstrate that the proposed 3D DetecTrack achieves significant improvements in both detection and tracking performance over baseline methods and achieves state-of-the-art performance among existing methods through collaboration between the detector and tracker.

【8】 EMDS-6: Environmental Microorganism Image Dataset Sixth Version for Image Denoising, Segmentation, Feature Extraction, Classification and Detection Methods Evaluation
Link: https://arxiv.org/abs/2112.07111

Authors: Peng Zhao, Chen Li, Md Mamunur Rahaman, Hao Xu, Pingli Ma, Hechen Yang, Hongzan Sun, Tao Jiang, Ning Xu, Marcin Grzegorzek
Affiliations: Microscopic Image and Medical Image Analysis Group, MBIE College, Northeastern University, Shenyang, PR China; School of Control Engineering, Chengdu University of Information Technology, Chengdu, China; University of Lübeck, Germany
Abstract: Environmental microorganisms (EMs) are ubiquitous around us and have an important impact on the survival and development of human society. However, the high standards and strict requirements for the preparation of environmental microorganism (EM) data have led to a shortage of existing related databases, not to mention databases with ground-truth (GT) images. This problem seriously affects the progress of related experiments. Therefore, this study develops the Environmental Microorganism Dataset Sixth Version (EMDS-6), which contains 21 types of EMs. Each EM type contains 40 original and 40 GT images, for a total of 1680 EM images. To test the effectiveness of EMDS-6, we choose classic image processing algorithms such as image denoising, image segmentation, and object detection. The experimental results show that EMDS-6 can be used to evaluate the performance of image denoising, image segmentation, image feature extraction, image classification, and object detection methods.

【9】 Improving COVID-19 CXR Detection with Synthetic Data Augmentation
Link: https://arxiv.org/abs/2112.07529

Authors: Daniel Schaudt, Christopher Kloth, Christian Spaete, Andreas Hinteregger, Meinrad Beer, Reinhold von Schwerin
Affiliations: Technische Hochschule Ulm - Ulm University of Applied Sciences; Universitätsklinikum Ulm - Ulm University Medical Center
Note: Accepted at the Upper-Rhine Artificial Intelligence Symposium 2021, arXiv:2112.05657
Abstract: Since the beginning of the COVID-19 pandemic, researchers have developed deep learning models to classify COVID-19-induced pneumonia. As with many medical imaging tasks, the quality and quantity of the available data is often limited. In this work we train a deep learning model on publicly available COVID-19 image data and evaluate the model on local hospital chest X-ray data. The data has been reviewed and labeled by two radiologists to ensure a high-quality estimation of the generalization capabilities of the model. Furthermore, we use a Generative Adversarial Network to generate synthetic X-ray images based on this data. Our results show that using these synthetic images for data augmentation can improve the model's performance significantly. This can be a promising approach for many sparse-data domains.

【10】 COVID-19 Pneumonia and Influenza Pneumonia Detection Using Convolutional Neural Networks
Link: https://arxiv.org/abs/2112.07102

Authors: Julianna Antonchuk, Benjamin Prescott, Philip Melanchthon, Robin Singh
Affiliations: Northwestern University
Note: for the associated Azure ML notebook code, see this https URL
Abstract: In this research, we developed a computer vision solution to support diagnostic radiology in differentiating between COVID-19 pneumonia, influenza virus pneumonia, and normal biomarkers. The chest radiograph appearance of COVID-19 pneumonia is thought to be nonspecific, presenting a challenge in identifying an optimal architecture of a convolutional neural network (CNN) that classifies with high sensitivity among the pulmonary inflammation features of COVID-19 and non-COVID-19 types of pneumonia. Rahman (2021) states that COVID-19 radiography images suffer from unavailability and quality issues that impact the diagnostic process and affect the accuracy of deep learning detection models. A significant scarcity of COVID-19 radiography images introduced an imbalance in the data, motivating us to use over-sampling techniques. In this study, we include an extensive set of X-ray images of human lungs (CXR) with COVID-19 pneumonia, influenza virus pneumonia, and normal biomarkers to achieve an extensible and accurate CNN model. In the experimentation phase of the research, we evaluated a variety of convolutional network architectures, selecting a sequential convolutional network with two traditional convolutional layers and two max-pooling layers. In its classification performance, the best-performing model demonstrated a validation accuracy of 93% and an F1 score of 0.95. We chose the Azure Machine Learning service to perform network experimentation and solution deployment. The auto-scaling compute clusters offered a significant reduction in network training time. We would like to see scientists across the fields of artificial intelligence and human biology collaborate and expand on the proposed solution to provide rapid and comprehensive diagnostics, effectively mitigating the spread of the virus.
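The selected topology — two convolutional layers, each followed by max pooling — is easy to write down. A minimal sketch, assuming 224x224 grayscale CXR inputs, 3x3 kernels, and 32/64 filters; these hyperparameters are assumptions, not the study's exact configuration:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                       # 224 -> 112
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                       # 112 -> 56
        nn.Flatten(),
        nn.Linear(64 * 56 * 56, 3),            # 3 classes: COVID-19 / influenza / normal
    )

    print(model(torch.randn(1, 1, 224, 224)).shape)  # torch.Size([1, 3])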

Classification & Recognition (6 papers)

【1】 Text Classification Models for Form Entity Linking
Link: https://arxiv.org/abs/2112.07443

Authors: María Villota, César Domínguez, Jónathan Heras, Eloy Mata, Vico Pascual
Affiliations: Department of Mathematics and Computer Science, University of La Rioja, Spain
Abstract: Forms are a widespread type of template-based document used in a great variety of fields including, among others, administration, medicine, finance, and insurance. The automatic extraction of the information included in these documents is in great demand due to the increasing volume of forms generated on a daily basis. However, this is not a straightforward task when working with scanned forms, because of the great diversity of templates with different locations of form entities, and the variable quality of the scanned documents. In this context, there is a feature shared by all forms: they contain a collection of interlinked entities built as key-value (or label-value) pairs, together with other entities such as headers or images. In this work, we have tackled the problem of entity linking in forms by combining image processing techniques and a text classification model based on the BERT architecture. This approach achieves state-of-the-art results with an F1-score of 0.80 on the FUNSD dataset, a 5% improvement over the best previous method. The code of this project is available at https://github.com/mavillot/FUNSD-Entity-Linking.
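One way to realize "entity linking as text classification" is to score candidate key-value pairs with a BERT sentence-pair classifier. A minimal sketch with the Hugging Face transformers library; the checkpoint and the untrained binary head are placeholders (the paper fine-tunes on FUNSD):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Candidate pair: a key entity and a value entity extracted from the form.
    inputs = tok("Date:", "03/12/2021", return_tensors="pt")
    with torch.no_grad():
        prob_linked = torch.softmax(model(**inputs).logits, dim=1)[0, 1].item()
    print(prob_linked)  # probability that the two entities form a key-value link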

【2】 Federated Learning for Face Recognition with Gradient Correction
Link: https://arxiv.org/abs/2112.07246

Authors: Yifan Niu, Weihong Deng
Affiliations: Beijing University of Posts and Telecommunications
Note: Accepted by AAAI 2022
Abstract: With increasing attention to privacy issues in face recognition, federated learning has emerged as one of the most prevalent approaches to studying the unconstrained face recognition problem with private decentralized data. However, conventional decentralized federated algorithms that share whole network parameters among clients suffer from privacy leakage in the face recognition scenario. In this work, we introduce a framework, FedGC, to tackle federated learning for face recognition and guarantee higher privacy. We explore a novel idea of correcting gradients from the perspective of backward propagation and propose a softmax-based regularizer to correct gradients of class embeddings by precisely injecting a cross-client gradient term. Theoretically, we show that FedGC constitutes a valid loss function similar to the standard softmax. Extensive experiments have been conducted to validate the superiority of FedGC, which can match the performance of conventional centralized methods utilizing the full training dataset on several popular benchmark datasets.

【3】 Margin Calibration for Long-Tailed Visual Recognition
Link: https://arxiv.org/abs/2112.07225

Authors: Yidong Wang, Bowen Zhang, Wenxin Hou, Zhen Wu, Jindong Wang, Takahiro Shinozaki
Affiliations: Tokyo Institute of Technology; Nanjing University; Microsoft Research Asia
Note: Technical report; 9 pages
Abstract: The long-tailed class distribution in visual recognition tasks poses great challenges for neural networks in handling the biased predictions between head and tail classes, i.e., the model tends to classify tail classes as head classes. While existing research has focused on data resampling and loss function engineering, in this paper we take a different perspective: the classification margins. We study the relationship between the margins and logits (classification scores) and empirically observe that the biased margins and the biased logits are positively correlated. We propose MARC, a simple yet effective MARgin Calibration function to dynamically calibrate the biased margins for unbiased logits. We validate MARC through extensive experiments on common long-tailed benchmarks including CIFAR-LT, ImageNet-LT, Places-LT, and iNaturalist-LT. Experimental results demonstrate that our MARC achieves favorable results on these benchmarks. In addition, MARC is extremely easy to implement, with just three lines of code. We hope this simple method will motivate people to rethink the biased margins and biased logits in long-tailed visual recognition.
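Given the "three lines of code" claim, a plausible reading of the calibration function is a learnable per-class affine transform on the frozen model's logits. The parameterization below is a hedged sketch, not necessarily the paper's exact formulation:

    import torch
    import torch.nn as nn

    class MarginCalibration(nn.Module):
        # Learnable per-class scale and shift applied to the biased logits.
        def __init__(self, num_classes):
            super().__init__()
            self.scale = nn.Parameter(torch.ones(num_classes))
            self.shift = nn.Parameter(torch.zeros(num_classes))

        def forward(self, logits):            # logits: (B, num_classes)
            return self.scale * logits + self.shift

    calib = MarginCalibration(num_classes=10)
    print(calib(torch.randn(4, 10)).shape)    # torch.Size([4, 10])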

【4】 Exploring Category-correlated Feature for Few-shot Image Classification
Link: https://arxiv.org/abs/2112.07224

Authors: Jing Xu, Xinglin Pan, Xu Luo, Wenjie Pei, Zenglin Xu
Affiliations: Harbin Institute of Technology, Shenzhen; University of Electronic Science and Technology of China
Note: 10 pages, 9 figures
Abstract: Few-shot classification aims to adapt classifiers to novel classes with only a few training samples. However, the insufficiency of training data may cause a biased estimation of the feature distribution in a certain class. To alleviate this problem, we present a simple yet effective feature rectification method that explores the category correlation between novel and base classes as prior knowledge. We explicitly capture such correlation by mapping features into a latent vector whose dimension matches the number of base classes, treating it as the log probability of the feature over the base classes. Based on this latent vector, the rectified feature is directly constructed by a decoder, which we expect to maintain category-related information while removing other stochastic factors, and consequently be closer to its class centroid. Furthermore, by changing the temperature value in softmax, we can re-balance feature rectification and reconstruction for better performance. Our method is generic, flexible, and agnostic to any feature extractor and classifier, and can readily be embedded into existing FSL approaches. Experiments verify that our method is capable of rectifying biased features, especially when the feature is far from the class centroid. The proposed approach consistently obtains considerable performance gains on three widely used benchmarks, evaluated with different backbones and classifiers. The code will be made public.
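The rectification pipeline — encode a feature into a base-class-sized latent, apply a temperature-scaled softmax, decode back — can be sketched directly from the description. The dimensions and the linear encoder/decoder are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureRectifier(nn.Module):
        def __init__(self, feat_dim, num_base_classes):
            super().__init__()
            self.enc = nn.Linear(feat_dim, num_base_classes)  # latent ~ logits over base classes
            self.dec = nn.Linear(num_base_classes, feat_dim)  # reconstructs the rectified feature

        def forward(self, x, tau=1.0):
            # tau re-balances rectification vs. reconstruction, as in the abstract.
            return self.dec(F.softmax(self.enc(x) / tau, dim=-1))

    rect = FeatureRectifier(feat_dim=64, num_base_classes=60)
    print(rect(torch.randn(5, 64), tau=0.5).shape)  # torch.Size([5, 64])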

【5】 Multi-Expert Human Action Recognition with Hierarchical Super-Class Learning
Link: https://arxiv.org/abs/2112.07015

Authors: Hojat Asgarian Dehkordi, Ali Soltani Nezhad, Hossein Kashiani, Shahriar Baradaran Shokouhi, Ahmad Ayatollahi
Affiliations: School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
Note: 47 pages
Abstract: In still-image human action recognition, existing studies have mainly leveraged extra bounding-box information along with class labels to mitigate the lack of temporal information in still images; however, preparing extra data with manual annotation is time-consuming and prone to human error. Moreover, existing studies have not addressed action recognition with long-tailed distributions. In this paper, we propose a two-phase multi-expert classification method for human action recognition that copes with long-tailed distributions by means of super-class learning and without any extra information. To choose the best configuration for each super-class and characterize the inter-class dependency between different action classes, we propose a novel Graph-Based Class Selection (GCS) algorithm. In the proposed approach, a coarse-grained phase selects the most relevant fine-grained experts. Then, the fine-grained experts encode the intricate details within each super-class so that the inter-class variation increases. Extensive experimental evaluations are conducted on various public human action recognition datasets, including Stanford40, Pascal VOC 2012 Action, BU101+, and IHAR. The experimental results demonstrate that the proposed method yields promising improvements. To be more specific, on the IHAR, Stanford40, Pascal VOC 2012 Action, and BU101+ benchmarks, the proposed approach outperforms the state-of-the-art studies by 8.92%, 0.41%, 0.66%, and 2.11%, with much less computational cost and without any auxiliary annotation information. Moreover, it is proven that in addressing action recognition with long-tailed distributions, the proposed method outperforms its counterparts by a significant margin.

【6】 Classification of histopathology images using ConvNets to detect Lupus Nephritis
Link: https://arxiv.org/abs/2112.07555

Authors: Akash Gupta, Anirudh Reddy, CV Jawahar, PK Vinod
Affiliations: New York University; IIIT Hyderabad
Note: Accepted at the 2021 Medical Imaging meets NeurIPS Workshop
Abstract: Systemic lupus erythematosus (SLE) is an autoimmune disease in which the immune system of the patient starts attacking healthy tissues of the body. Lupus Nephritis (LN) refers to the inflammation of kidney tissues resulting in renal failure due to these attacks. The International Society of Nephrology/Renal Pathology Society (ISN/RPS) has released a classification system based on various patterns observed during renal injury in SLE. Traditional methods require meticulous pathological assessment of the renal biopsy and are time-consuming. Recently, computational techniques have helped to alleviate this issue by using virtual microscopy or Whole Slide Imaging (WSI). With the use of deep learning and modern computer vision techniques, we propose a pipeline that is able to automate the process of 1) detection of the various glomeruli patterns present in these whole-slide images and 2) classification of each image using the extracted glomeruli features.

Segmentation & Semantics (5 papers)

【1】 n-CPS: Generalising Cross Pseudo Supervision to n Networks for Semi-Supervised Semantic Segmentation
Link: https://arxiv.org/abs/2112.07528

Authors: Dominik Filipiak, Piotr Tempczyk, Marek Cygan
Affiliations: AI Clearing, Inc.; Semantic Technology Institute, Department of Computer Science, University of Innsbruck; Informatics and Mechanics, University of Warsaw
Abstract: We present n-CPS, a generalisation of the recent state-of-the-art cross pseudo supervision (CPS) approach for semi-supervised semantic segmentation. In n-CPS, there are n simultaneously trained subnetworks that learn from each other through one-hot encoding perturbation and consistency regularisation. We also show that ensembling techniques applied to the subnetworks' outputs can significantly improve performance. To the best of our knowledge, n-CPS paired with CutMix outperforms CPS and sets a new state of the art for Pascal VOC 2012 (1/16, 1/8, 1/4, and 1/2 supervised regimes) and Cityscapes (1/16 supervised).
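The n-network generalization of CPS reduces to a cross-entropy term between every ordered pair of subnetworks, with hard pseudo-labels detached from the "teacher" side. A minimal sketch of that unsupervised term (the consistency weighting and supervised branch are omitted):

    import torch
    import torch.nn.functional as F

    def n_cps_loss(logits_list):
        # logits_list: n segmentation outputs, each of shape (B, C, H, W).
        n = len(logits_list)
        loss = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                pseudo = logits_list[j].argmax(dim=1).detach()   # one-hot pseudo-labels
                loss = loss + F.cross_entropy(logits_list[i], pseudo)
        return loss / (n * (n - 1))

    outs = [torch.randn(2, 21, 8, 8) for _ in range(3)]          # n = 3 subnetworks
    print(n_cps_loss(outs))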

【2】 A Style and Semantic Memory Mechanism for Domain Generalization
Link: https://arxiv.org/abs/2112.07517

Authors: Yang Chen, Yu Wang, Yingwei Pan, Ting Yao, Xinmei Tian, Tao Mei
Affiliations: University of Science and Technology of China, Hefei, China; JD AI Research, Beijing, China
Note: ICCV 2021
Abstract: Mainstream state-of-the-art domain generalization algorithms tend to prioritize the assumption of semantic invariance across domains, while the inherent intra-domain style invariance is usually underappreciated and set aside. In this paper, we reveal that leveraging intra-domain style invariance is also of pivotal importance in improving the efficiency of domain generalization. We verify that it is critical for the network to be informative about which domain features are invariant and shared among instances, so that the network sharpens its understanding and improves its semantic discriminative ability. Correspondingly, we also propose a novel "jury" mechanism, which is particularly effective in learning useful semantic feature commonalities among domains. Our complete model, called STEAM, can be interpreted as a novel probabilistic graphical model whose implementation requires the convenient construction of two kinds of memory banks: a semantic feature bank and a style feature bank. Empirical results show that our proposed framework surpasses the state-of-the-art methods by clear margins.

【3】 Uncertainty Estimation via Response Scaling for Pseudo-mask Noise Mitigation in Weakly-supervised Semantic Segmentation
Link: https://arxiv.org/abs/2112.07431

Authors: Yi Li, Yiqun Duan, Zhanghui Kuang, Yimin Chen, Wayne Zhang, Xiaomeng Li
Affiliations: Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; SenseTime Research; Department of Computer Science, University of Technology Sydney
Note: Accepted at AAAI 2022; code is available at https://github.com/XMed-Lab/URN
Abstract: Weakly-Supervised Semantic Segmentation (WSSS) segments objects without the heavy burden of dense annotation. As a price, however, the generated pseudo-masks contain obvious noisy pixels, which results in sub-optimal segmentation models trained on these pseudo-masks. Few studies have noticed or worked on this problem, even though these noisy pixels are inevitable after pseudo-mask refinement. Therefore, we try to improve WSSS from the aspect of noise mitigation. We observe that many noisy pixels are of high confidence, especially when the response range is too wide or too narrow, presenting an uncertain status. Thus, in this paper, we simulate noisy variations of the response by scaling the prediction map multiple times for uncertainty estimation. The uncertainty is then used to weight the segmentation loss to mitigate noisy supervision signals. We call this method URN, abbreviated from Uncertainty estimation via Response scaling for Noise mitigation. Experiments validate the benefits of URN, and our method achieves state-of-the-art results of 71.2% and 41.5% on PASCAL VOC 2012 and MS COCO 2014, respectively, without extra models such as saliency detection. Code is available at https://github.com/XMed-Lab/URN.
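The response-scaling step translates naturally into code. A hedged sketch, assuming spatial down/up-scaling of the prediction map as the perturbation and an exponential down-weighting of uncertain pixels — both the scaling operator and the weighting function are assumptions, not the paper's exact design:

    import torch
    import torch.nn.functional as F

    def urn_weighted_loss(logits, pseudo_mask, scales=(0.5, 1.0, 2.0), lam=1.0):
        # logits: (B, C, H, W); pseudo_mask: (B, H, W) noisy pseudo-labels.
        h, w = logits.shape[-2:]
        probs = []
        for s in scales:
            scaled = F.interpolate(logits, scale_factor=s, mode="bilinear", align_corners=False)
            back = F.interpolate(scaled, size=(h, w), mode="bilinear", align_corners=False)
            probs.append(F.softmax(back, dim=1))
        uncertainty = torch.stack(probs).var(dim=0).sum(dim=1)    # per-pixel variance, (B, H, W)
        weight = torch.exp(-lam * uncertainty).detach()           # down-weight uncertain pixels
        pixel_loss = F.cross_entropy(logits, pseudo_mask, reduction="none")
        return (weight * pixel_loss).mean()

    logits = torch.randn(2, 21, 16, 16)
    mask = torch.randint(0, 21, (2, 16, 16))
    print(urn_weighted_loss(logits, mask))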