Show and Tell: A Neural Image Caption Generator

Preface

Image captioning is a classic application of two-dimensional image information recognition; similar tasks include table recognition and formula recognition. As shown in Figure 1, the input of image captioning is an image and the output is a textual description of that image. Image captioning poses two main difficulties:

  1. The model must not only be able to classify every object in the image, but also understand and describe the spatial relationships between them.

  2. Generating the description requires semantic context: the current output depends heavily on the content generated before it.

This paper provides a basic framework for image captioning: a CNN serves as the feature extractor that converts the image into a feature vector, and an RNN then serves as the decoder (generator) that produces the description of the image.

1. Show and Tell in Detail

1.1 Network Architecture

Show and Tell also adopts the Encoder-Decoder framework: when designing the algorithm, the authors borrowed the ideas of neural machine translation and therefore used a similar network architecture. The network structure given in the paper is shown in Figure 2; for ease of understanding, the RNN is unrolled over time, but the unrolled cells are in fact a single LSTM. The left half of Figure 2 is the encoder, a CNN (the paper uses GoogLeNet, but in practice any CNN can be chosen as needed). The right half of Figure 2 is a unidirectional LSTM, whose internal structure is not repeated here.
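To make the encoder-decoder structure concrete, here is a minimal PyTorch-style sketch of the two modules. This is not the authors' implementation: the class names, the 512-dimensional embedding, the vocabulary size of 10,000, and the use of torchvision's ResNet-18 in place of GoogLeNet are assumptions made purely for illustration.

```python
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """CNN encoder: maps an image I to a single feature vector x_{-1}."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet18()  # stand-in for the GoogLeNet used in the paper
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # keep everything up to global pooling
        self.proj = nn.Linear(backbone.fc.in_features, embed_dim)   # project to the LSTM input size

    def forward(self, images):               # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)  # (B, 512)
        return self.proj(feats)              # (B, embed_dim)

class DecoderRNN(nn.Module):
    """Single-layer unidirectional LSTM decoder with word-embedding matrix W_e."""
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, embed_dim)        # word embedding W_e
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)           # maps hidden state to word logits
```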

During training, the CNN encoding of the input image is fed in only at the initial time step $t=-1$; the authors report that their experiments show that feeding it at every time step easily causes overfitting and makes training very sensitive to noise. When predicting the content at time step $t+1$, the word embedding of the output at time step $t$ serves as the input feature. The whole process can be written as:

$$x_{-1} = \text{CNN}(I)$$

$$x_t = W_e S_t, \quad t \in \{0, \ldots, N-1\}$$

$$p_{t+1} = \text{LSTM}(x_t)$$

where $I$ is the input image, and $S_t$ is the ground-truth token at time step $t$ during training (at test time it is instead the prediction from the previous time step). $S_0$ and $S_N$ are two special tokens marking the start and end of the sentence, $W_e$ is the word-embedding matrix, and $p_{t+1}$ is the probability distribution over the next word; the output at that time step is obtained by taking the word with the highest probability.
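The recurrence above translates into a short teacher-forced forward pass. The sketch below reuses the hypothetical EncoderCNN/DecoderRNN modules from the previous snippet; the shapes in the comments follow the notation of the equations.

```python
import torch

def caption_logits(encoder, decoder, images, captions):
    """Teacher-forced forward pass implementing the recurrence above.

    images:   (B, 3, H, W) batch of input images I
    captions: (B, N+1) token ids S_0 ... S_N (start and end tokens included)
    returns:  (B, N, vocab_size) logits for p_1 ... p_N
    """
    x_img = encoder(images).unsqueeze(1)         # x_{-1} = CNN(I), shape (B, 1, E)
    x_words = decoder.W_e(captions[:, :-1])      # x_t = W_e S_t for t = 0 ... N-1, shape (B, N, E)
    # The image embedding is fed only at the very first time step.
    inputs = torch.cat([x_img, x_words], dim=1)  # (B, N+1, E)
    hidden, _ = decoder.lstm(inputs)             # (B, N+1, H)
    # Drop the output of the image step; the remaining steps predict S_1 ... S_N.
    return decoder.fc(hidden[:, 1:, :])          # (B, N, vocab_size)
```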

1.2 Loss Function

As in machine translation, the objective of image captioning is to maximize the probability of the ground-truth labels, where the label here is the description $S$ from the training set:

$$\theta^* = \arg\max_{\theta} \sum_{(I, S)} \log p(S|I; \theta)$$

where $I$ is the input image and $\theta$ denotes the model parameters. $\log p(S|I; \theta)$ is expressed as the sum of the log-probabilities of the $N$ outputs, where the content at time step $t$ is a posterior probability conditioned on the image encoding and on the outputs from time step $0$ to $t-1$:

$$\log p(S|I; \theta) = \sum_{t=0}^{N} \log p(S_t | I, S_0, \cdots, S_{t-1})$$

The loss function of the model is therefore the sum of the negative log-likelihoods over all time steps:

$$L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)$$
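In code, this loss is just token-level cross entropy accumulated over the time steps. The sketch below assumes the caption_logits helper from the previous snippet and a hypothetical padding id that is excluded from the loss; using reduction="mean" instead of "sum" is an equally common choice in practice.

```python
import torch.nn.functional as F

def caption_loss(logits, captions, pad_id=0):
    """L(I, S) = -sum_t log p_t(S_t), as token-level cross entropy.

    logits:   (B, N, vocab_size) from caption_logits above
    captions: (B, N+1) token ids S_0 ... S_N
    pad_id:   assumed id of the padding token, excluded from the loss
    """
    targets = captions[:, 1:]  # S_1 ... S_N
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id,
                           reduction="sum")  # sum over all time steps
```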

1.3 Testing (Inference)

During inference, image captioning can use one of two decoding strategies: greedy search or beam search. The similarities and differences between the two are explained in my article on CTC.
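Below is a minimal greedy-decoding sketch based on the hypothetical modules above; start_id, end_id, and max_len=20 are illustrative assumptions. A beam-search variant would keep the k highest-scoring partial captions at every step instead of a single one.

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, image, start_id, end_id, max_len=20):
    """Greedy inference: keep only the single most probable word at each step."""
    x_img = encoder(image.unsqueeze(0)).unsqueeze(1)  # x_{-1} = CNN(I), (1, 1, E)
    _, state = decoder.lstm(x_img)                    # feed the image once to set the LSTM state

    token = torch.tensor([[start_id]])                # S_0, the start token
    caption = []
    for _ in range(max_len):
        out, state = decoder.lstm(decoder.W_e(token), state)  # one LSTM step on the last word
        token = decoder.fc(out).argmax(dim=-1)                # pick the most probable next word
        if token.item() == end_id:                            # stop at the end token S_N
            break
        caption.append(token.item())
    return caption
```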

2. Summary

As the founding paper of this field, the algorithm's structure and ideas are quite simple: the whole design is almost a direct copy of the Encoder-Decoder architecture from machine translation. That such a straightforward transfer could still substantially improve on the state-of-the-art results shows the power of deep learning.

Limited by the small datasets available at the time, the authors tried feeding the image features into every time step, but this led to overfitting. As datasets grow larger, feeding the image features at every time step would no doubt make convergence easier.
