CLIP [1]
Abstract
This paper proposes a new pre-training task: predicting the caption of an image. By pre-training on 400 million (image, text) pairs collected from the internet, the method achieves SOTA results on a wide range of downstream tasks spanning more than 30 datasets. Because it grounds visual concepts in natural language, the method matches SOTA models on some tasks without any fine-tuning data, for example matching a fully supervised ResNet-50 [2] on ImageNet [3] classification.
Motivation
The backbone networks of traditional computer vision models are usually trained with supervised learning on the ImageNet [3] classification task, which makes them heavily dependent on labeled data. Self-supervised learning, represented by contrastive [4, 5] and reconstruction-based [6, 7] methods, alleviates the shortage of labeled data, but these models still need fine-tuning before they can be used on downstream tasks.
Novelty
Since Mori et al. (1999) [8], several methods have tried to associate images with their captions, but their performance has been unsatisfactory: the approach of Li et al. (2017) [9] reaches only \(11.5\%\) zero-shot accuracy on ImageNet [3] classification, far below SOTA.
Several earlier works have trained on large-scale data with weak supervision [10, 11, 12, 13] and achieved excellent results. However, they all rely on a carefully designed, fixed label set and therefore cannot fully exploit the expressiveness of natural language, which limits their zero-shot performance.
Although MS-COCO [14] and Visual Genome [15] provide rich, high-quality annotations, each contains only about 100,000 images and is therefore too small. YFCC100M [16] contains over 100 million images, but its metadata is sparse and of uneven quality; after keeping only images whose titles or descriptions contain natural English text, roughly 15 million images remain, about the size of ImageNet [3].
The core contributions of this paper are threefold:
- It performs contrastive learning between images and their captions using a simplified version of the ConVIRT [17] architecture.
- It builds a large-scale dataset of 400 million images with paired caption text to train the proposed model.
- It trains eight models spanning two orders of magnitude of compute, showing that transfer performance is a smooth, predictable function of compute [18, 19].
Method
Model
Images and texts are encoded by two independent encoders, and the resulting features are mapped into a shared embedding space by a linear projection layer. The extra non-linear projection head used in SimCLRv2 [20] is omitted because it brought no additional gain; the authors conjecture that a non-linear projection is only necessary for contrastive learning within a single modality.
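Below is a minimal PyTorch-style sketch of the joint embedding step described above. It is an illustration only: the dimensions `d_img`, `d_txt`, `d_joint` and the function `joint_embed` are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

# Assumed dimensions: d_img/d_txt are the encoder output widths, d_joint the shared width.
d_img, d_txt, d_joint = 2048, 512, 512

# One linear projection per modality (no non-linear head, as noted above).
img_proj = torch.nn.Linear(d_img, d_joint, bias=False)
txt_proj = torch.nn.Linear(d_txt, d_joint, bias=False)

def joint_embed(img_feat: torch.Tensor, txt_feat: torch.Tensor):
    """Project both modalities into the shared space and L2-normalize,
    so that dot products become cosine similarities."""
    img_emb = F.normalize(img_proj(img_feat), dim=-1)
    txt_emb = F.normalize(txt_proj(txt_feat), dim=-1)
    return img_emb, txt_emb

# Example: a batch of 8 feature vectors from each (hypothetical) encoder.
img_emb, txt_emb = joint_embed(torch.randn(8, d_img), torch.randn(8, d_txt))
similarity = img_emb @ txt_emb.T  # (8, 8) cosine-similarity matrix
```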
Vision
Two families of vision encoders are explored: ResNet [2] and ViT [21]. All models are trained from scratch, without any pre-trained weights.
- ResNet [2]: a modified ResNet-D [22] is used. Its pooling layers are replaced with the antialiased rect-2 blur pooling of Zhang (2019), and the final global average pooling is replaced with an attention pooling layer that works like self-attention (see the sketch after this list).
- ViT [21]: the architecture is left almost unchanged, except that an extra LayerNorm (`LN`) layer is added after the `stem` and `pos embed`.
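A minimal sketch of an attention-style pooling layer of the kind described in the ResNet bullet above. This is an assumed, simplified module (the head count and the use of the mean feature as the query are illustrative choices), not CLIP's exact `AttentionPool2d` implementation.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pools a grid of spatial features into one vector by letting a query
    derived from the mean feature attend over all spatial positions."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height*width, dim) feature map flattened over space
        query = x.mean(dim=1, keepdim=True)   # (batch, 1, dim), acts like a [CLS] query
        pooled, _ = self.attn(query, x, x)    # attend over all positions
        return pooled.squeeze(1)              # (batch, dim)

# Example: pool a 7x7 grid of 2048-d ResNet features.
pool = AttentionPool(dim=2048)
features = torch.randn(4, 7 * 7, 2048)
print(pool(features).shape)  # torch.Size([4, 2048])
```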
Language
The text encoder is a Transformer [24] language model in the style of GPT-2 [23]. For training efficiency, the sequence length is capped at 76. Each sentence is bracketed by `[SOS]` and `[EOS]` tokens, and the activation of the `[EOS]` token at the final layer is used as the text feature.
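A minimal sketch of selecting the final-layer `[EOS]` activation as the sentence feature. The token ids, `eos_id`, and padding scheme below are made-up placeholders rather than the paper's actual tokenizer.

```python
import torch

def text_feature(last_hidden: torch.Tensor, token_ids: torch.Tensor, eos_id: int) -> torch.Tensor:
    """Select the final-layer activation at each sequence's [EOS] position.

    last_hidden: (batch, seq_len, dim) output of the transformer's last layer
    token_ids:   (batch, seq_len) padded token ids
    eos_id:      id of the [EOS] token (assumed to appear once per sequence)
    """
    eos_pos = (token_ids == eos_id).int().argmax(dim=1)  # position of [EOS] in each row
    return last_hidden[torch.arange(last_hidden.size(0)), eos_pos]

# Example with made-up ids: eos_id = 2, sequences padded to length 6.
ids = torch.tensor([[1, 5, 9, 2, 0, 0],
                    [1, 7, 2, 0, 0, 0]])
hidden = torch.randn(2, 6, 512)
print(text_feature(hidden, ids, eos_id=2).shape)  # torch.Size([2, 512])
```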
Scaling
Following EfficientNet [25], the vision encoder is scaled by increasing its depth, width, and input resolution jointly; the text encoder is scaled in width only, since CLIP was found to be insensitive to the capacity of the language encoder.
Data
All bi-grams occurring at least 100 times in English Wikipedia [26] were collected, yielding a query list of 500,000 entries. (image, text) pairs were then crawled from the internet using these queries, and the crawled data was balanced across queries so that each query contributes roughly 20,000 (image, text) pairs. The resulting dataset is called WIT (WebImageText); its total word count is comparable to that of the WebText dataset used to train GPT-2 [23].
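A minimal sketch of the per-query balancing step described above. The cap of 20,000 and the `(query, image_url, caption)` record layout are assumptions for illustration; the paper does not describe the crawling pipeline in this much detail.

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # approximate per-query cap mentioned above

def balance(pairs):
    """Keep at most MAX_PAIRS_PER_QUERY (image_url, caption) pairs per query.

    pairs: iterable of (query, image_url, caption) tuples from the crawler.
    """
    kept, counts = [], defaultdict(int)
    for query, image_url, caption in pairs:
        if counts[query] < MAX_PAIRS_PER_QUERY:
            counts[query] += 1
            kept.append((image_url, caption))
    return kept

# Example with a toy crawl result.
crawl = [("golden retriever", "http://example.com/a.jpg", "a golden retriever puppy"),
         ("golden retriever", "http://example.com/b.jpg", "dog playing fetch")]
print(len(balance(crawl)))  # 2
```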
Training
Initially, similar to VirTex [27], an image CNN and a text transformer were trained jointly to predict the caption of each image. However, this approach was slow and difficult to scale efficiently.
The authors attribute this in part to the objective itself: predicting the exact wording of each image's caption is very hard because captions are extremely diverse. Replacing the target with a bag-of-words encoding of the caption already yields a 3x speedup.
Since prior work has found that contrastive objectives learn better representations than predictive ones [28], the objective was further replaced with a contrastive one, similar to ConVIRT [17], which gives an additional 4x efficiency gain.
The training setup follows the ConVIRT [17] implementation; a sketch of the loss function is given below.
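Below is a minimal sketch of the symmetric contrastive loss, in the spirit of the pseudocode in the CLIP paper [1]. The fixed `temperature` is a simplification (in the paper it is a learned parameter), and the inputs are assumed to be the L2-normalized embeddings produced by the projection step in the Model section.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings in the joint space;
    the i-th image and i-th text form the only positive pair in each row/column.
    """
    logits = img_emb @ txt_emb.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(img_emb.size(0))      # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, labels)    # image -> text direction
    loss_t = F.cross_entropy(logits.T, labels)  # text -> image direction
    return (loss_i + loss_t) / 2

# Example with random normalized embeddings for a batch of 8 pairs.
img_emb = F.normalize(torch.randn(8, 512), dim=-1)
txt_emb = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_loss(img_emb, txt_emb))
```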
Experiments
TODO
References

1. A. Radford et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th international conference on machine learning, 2021, vol. 139, pp. 8748–8763. Available: https://proceedings.mlr.press/v139/radford21a.html
2. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016.
3. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2009, pp. 248–255.
4. K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020.
5. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th international conference on machine learning, 2020, vol. 119, pp. 1597–1607. Available: https://proceedings.mlr.press/v119/chen20j.html
6. K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022, pp. 16000–16009.
7. H. Bao, L. Dong, S. Piao, and F. Wei, “BEiT: BERT pre-training of image transformers,” in International conference on learning representations, 2022. Available: https://openreview.net/forum?id=p-BhZSz59o4
8. Y. Mori, H. Takahashi, and R. Oka, “Image-to-word transformation based on dividing and vector quantizing images with words,” in First international workshop on multimedia intelligent storage and retrieval management, 1999, pp. 1–9.
9. A. Li, A. Jabri, A. Joulin, and L. van der Maaten, “Learning visual n-grams from web data,” in Proceedings of the IEEE international conference on computer vision (ICCV), 2017.
10. C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable effectiveness of data in deep learning era,” in Proceedings of the IEEE international conference on computer vision (ICCV), 2017.
11. X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022, pp. 12104–12113.
12. Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves ImageNet classification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020.
13. D. Mahajan et al., “Exploring the limits of weakly supervised pretraining,” in Proceedings of the european conference on computer vision (ECCV), 2018.
14. T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proceedings of the european conference on computer vision (ECCV), 2014, pp. 740–755. Available: https://www.microsoft.com/en-us/research/publication/microsoft-coco-common-objects-in-context/
15. R. Krishna et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, May 2017, doi: 10.1007/s11263-016-0981-7.
16. B. Thomee et al., “YFCC100M: The new data in multimedia research,” Commun. ACM, vol. 59, no. 2, pp. 64–73, Jan. 2016, doi: 10.1145/2812802.
17. Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, “Contrastive learning of medical visual representations from paired images and text,” CoRR, vol. abs/2010.00747, 2020. Available: https://arxiv.org/abs/2010.00747
18. J. Hestness et al., “Deep learning scaling is predictable, empirically,” CoRR, vol. abs/1712.00409, 2017. Available: http://arxiv.org/abs/1712.00409
19. J. Kaplan et al., “Scaling laws for neural language models,” CoRR, vol. abs/2001.08361, 2020. Available: https://arxiv.org/abs/2001.08361
20. T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, “Big self-supervised models are strong semi-supervised learners,” in Advances in neural information processing systems, 2020, vol. 33, pp. 22243–22255. Available: https://proceedings.neurips.cc/paper/2020/file/fcbc95ccdd551da181207c0c1400c655-Paper.pdf
21. A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International conference on learning representations, 2021. Available: https://openreview.net/forum?id=YicbFdNTTy
22. T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, “Bag of tricks for image classification with convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2019.
23. A. Radford et al., “Language models are unsupervised multitask learners,” 2018.
24. A. Vaswani et al., “Attention is all you need,” in Advances in neural information processing systems, 2017, vol. 30. Available: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
25. M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th international conference on machine learning, 2019, vol. 97, pp. 6105–6114. Available: https://proceedings.mlr.press/v97/tan19a.html
26. W. Foundation, “Wikimedia downloads.” https://dumps.wikimedia.org
27. K. Desai and J. Johnson, “VirTex: Learning visual representations from textual annotations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2021, pp. 11162–11173.
28. Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in Proceedings of the european conference on computer vision (ECCV), 2020, pp. 776–794.