DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images


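Preface

Most text detection papers from around 2016 tie back in some way to [2], which was hugely popular that year, and DeepText (Figure 1) is no exception. Overall it is still the Faster R-CNN framework, with the following improvements on top: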

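  1. Inception-RPN: replaces the $3\times3$ convolutional sliding window of the RPN with a sliding window based on [3]. This is the highlight of the paper;

  2. ATC: extends the classes to "text region", "ambiguous region", and "background region";

  3. MLRP: uses multi-scale features. The grid-based pooling of ROI Pooling is a natural way to fuse Feature Maps of different sizes;

  4. IBBV: applies NMS to the union of the bounding boxes predicted in multiple iterations.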

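Figure 1: DeepText network architecture

Before reading this article, make sure you understand Faster R-CNN. This article only explains what DeepText changes relative to Faster R-CNN and does not repeat the parts they share.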

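1. DeepText in Detail

The structure of DeepText closely mirrors Faster R-CNN: the feature extractor is VGG-16, and both algorithms consist of an RPN that generates region proposals followed by a Fast R-CNN that performs detection.

The four improvements made by DeepText are explained below.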

1.1 Inception-RPN

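First, DeepText replaces the $3\times3$ convolution that Faster R-CNN slides over Conv5_3 with the Inception structure proposed in GoogLeNet. For the role of Inception, see the discussion of GoogLeNet.

The Inception module in DeepText consists of 3 parallel branches:

  • a $3\times3$ Max Pooling with padding=1, followed by 128 $1\times1$ convolutions used for dimensionality reduction;

  • 384 $3\times3$ convolutions with padding=1;

  • 128 $5\times5$ convolutions with padding=2.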

Because none of the 3 Inception branches changes the spatial size of the Feature Map, the Concatenate operation yields $128+384+128 = 640$ Feature Map channels.
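The shape bookkeeping above can be checked with a few lines of plain Python. This is a minimal sketch, not the authors' code: the helper names `conv_out_size` and `inception_rpn_shape` are made up for illustration, stride 1 is an assumption (the text only gives kernel sizes and padding), and the 38 by 50 input size is an arbitrary example.

```python
def conv_out_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# (description, kernel, padding, output channels) for the three branches.
# The 1x1 convolution after the max pooling does not change the spatial size,
# so only the pooling window matters for that branch.
BRANCHES = [
    ("3x3 max pool -> 1x1 conv", 3, 1, 128),
    ("3x3 conv", 3, 1, 384),
    ("5x5 conv", 5, 2, 128),
]


def inception_rpn_shape(height, width):
    """Return (channels, height, width) after concatenating the branches."""
    spatial_sizes = set()
    channels = 0
    for _desc, kernel, padding, out_channels in BRANCHES:
        spatial_sizes.add((conv_out_size(height, kernel, padding),
                           conv_out_size(width, kernel, padding)))
        channels += out_channels
    # Channel-wise concatenation requires identical spatial sizes.
    assert len(spatial_sizes) == 1, "branches disagree on spatial size"
    out_h, out_w = spatial_sizes.pop()
    return channels, out_h, out_w


print(inception_rpn_shape(38, 50))  # (640, 38, 50)
```

Each padding value is exactly half of (kernel size minus one), which is why every branch preserves the input resolution and the outputs can be stacked along the channel axis.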

To match the characteristics of Ground Truth boxes in scene text detection, DeepText uses anchors that differ from those of Faster R-CNN: four scales $(32, 48, 64, 80)$ and six aspect ratios $(0.2, 0.5, 0.8, 1.0, 1.2, 1.5)$, for a total of $4\times6=24$ anchors.
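As a rough illustration of how the 24 anchors could be enumerated, here is a small sketch. The function name `make_anchors` is hypothetical, and two conventions are assumed rather than taken from the text: the ratio is read as height / width, and each anchor keeps an area of scale squared, as many Faster R-CNN implementations do.

```python
import math

SCALES = (32, 48, 64, 80)
RATIOS = (0.2, 0.5, 0.8, 1.0, 1.2, 1.5)


def make_anchors(scales=SCALES, ratios=RATIOS):
    """Return one (width, height) pair per scale/ratio combination.

    Assumptions: ratio = height / width, and area = scale ** 2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((width, height))
    return anchors


anchors = make_anchors()
print(len(anchors))  # 24
```

Under this reading, ratios below 1 give wide, flat boxes, which fits horizontal text lines; if the ratio were width / height instead, the two formulas would simply swap.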

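The sampling thresholds of DeepText also differ from Faster R-CNN: an anchor is positive when $\text{IoU} > 0.5$ and negative when $\text{IoU} < 0.3$.

Inception-RPN filters the anchors with NMS at a threshold of 0.7, and the final proposals are the top-2000 samples.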
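The thresholds above translate into a simple labeling rule. The sketch below assumes boxes in `(x1, y1, x2, y2)` form and, following the usual Faster R-CNN convention, marks anchors whose best IoU falls between the two thresholds as ignored; the text does not say how those anchors are treated, so that part is an assumption. The names `iou` and `label_anchor` are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored, an assumption)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


gt = [(0, 0, 10, 10)]
print(label_anchor((0, 0, 10, 6), gt))    # IoU 0.6 -> 1
print(label_anchor((0, 0, 10, 4), gt))    # IoU 0.4 -> -1
print(label_anchor((20, 20, 30, 30), gt))  # IoU 0.0 -> 0
```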

1.2 Ambiguous Text Classification (ATC)

DeepText divides the samples into 3 classes:

  • Text: $\text{IoU} > 0.5$;

  • Ambiguous: $0.2 < \text{IoU} < 0.5$;

  • Background: $\text{IoU} < 0.2$.

The purpose is to let the model see samples across the whole IoU range during training. This noticeably improves the recall of the model.
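The three-way split can be written as a small function of a sample's best IoU with any Ground Truth box. This is only a sketch: `atc_label` is a hypothetical name, and since the text leaves the exact boundary values 0.2 and 0.5 unassigned, the code places them in the ambiguous class as an assumption.

```python
def atc_label(max_iou, text_thresh=0.5, background_thresh=0.2):
    """Map a sample's best IoU to one of the three ATC classes.

    Assumption: IoU values exactly equal to a threshold count as ambiguous.
    """
    if max_iou > text_thresh:
        return "text"
    if max_iou < background_thresh:
        return "background"
    return "ambiguous"


for value in (0.05, 0.35, 0.8):
    print(value, atc_label(value))
```

Compared with the two-class RPN labeling, samples in the middle band are not discarded here but become a class of their own.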

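1.3 Multi Layer ROI Pooling (MLRP)

DeepText uses multi-scale features from Conv4_3 and Conv5_3 of VGG-16. Grid-based ROI Pooling turns the two Feature Maps of different sizes into tensors of size $7\times7\times512$ each, and a $1\times1$ convolution then reduces the concatenated 1024-channel Feature Map to 512 channels, as shown in Figure 1.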
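To make the "same grid from different resolutions" idea concrete, here is a toy single-channel sketch of grid-based ROI max pooling in plain Python. It is not the authors' implementation: `scale_roi` and `roi_max_pool` are illustrative names, the ROI is rounded outward to whole cells as an assumption, and a 2 by 2 grid on tiny maps stands in for the 7 by 7 grid on real features. The strides 8 and 16 are those of Conv4_3 and Conv5_3 in VGG-16.

```python
import math


def scale_roi(roi, stride):
    """Project an image-space ROI (x1, y1, x2, y2) onto a feature map."""
    x1, y1, x2, y2 = roi
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))


def roi_max_pool(channel, roi, out_size=7):
    """Max-pool one ROI of a 2-D channel into an out_size x out_size grid.

    channel is a list of rows; roi is in feature-map cells with x2/y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    pooled = []
    for i in range(out_size):
        r0 = y1 + math.floor(i * bin_h)
        r1 = min(max(y1 + math.ceil((i + 1) * bin_h), r0 + 1), y2)
        row = []
        for j in range(out_size):
            c0 = x1 + math.floor(j * bin_w)
            c1 = min(max(x1 + math.ceil((j + 1) * bin_w), c0 + 1), x2)
            row.append(max(channel[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy maps for a 64x64 image: stride 8 gives 8x8, stride 16 gives 4x4.
conv4 = [[r * 8 + c for c in range(8)] for r in range(8)]
conv5 = [[r * 4 + c for c in range(4)] for r in range(4)]
roi = (16, 16, 64, 64)

pooled4 = roi_max_pool(conv4, scale_roi(roi, 8), out_size=2)
pooled5 = roi_max_pool(conv5, scale_roi(roi, 16), out_size=2)
print(pooled4)  # [[36, 39], [60, 63]]
print(pooled5)  # [[10, 11], [14, 15]]
```

Both outputs have the same grid size even though the inputs differ in resolution, so with 512 channels per layer they can be stacked into 1024 channels before the $1\times1$ reduction to 512.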

1.4 Iterative Bounding Box Voting (IBBV)

During training, each iteration predicts a set of detection boxes: $D_c^t = \{B_{i,c}^t, S_{i,c}^t\}_{i=1}^{N_{c,t}}$, where $t=1,2,\ldots,T$ indexes the training stage, $N_{c,t}$ is the number of detection boxes of class $c$ at stage $t$, and $B$ and $S$ denote a detection box and its confidence score. NMS is applied to the union over all training stages: $D_c=\cup_{t=1}^{T} D_c^t$. The merging threshold used by NMS is $0.3$.
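The merge step described above can be sketched for a single class as follows. This is a simplified illustration under stated assumptions: detections are `(box, score)` pairs with boxes as `(x1, y1, x2, y2)`, NMS is the standard greedy variant, and the function names are made up for this example rather than taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, thresh):
    """Greedy NMS over (box, score) pairs, highest score first."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept


def iterative_bbox_voting(per_iteration_detections, thresh=0.3):
    """Union the detections D_c^t of all iterations, then apply NMS."""
    merged = [det for dets in per_iteration_detections for det in dets]
    return nms(merged, thresh)


d1 = [((0, 0, 10, 10), 0.9)]
d2 = [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
print(iterative_bbox_voting([d1, d2]))
# [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.7)]
```

In the example, the second iteration contributes a box that the first one missed, while its near-duplicate of the first box (IoU of about 0.68) is suppressed at the 0.3 threshold.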

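After IBBV, DeepText applies a filter to remove redundant detection boxes. The details of this filter are not known and will be added later.

Summary

Given the state of research at the time, DeepText was designed by combining the then state-of-the-art Faster R-CNN with Inception. The algorithm itself is not especially strong in technical depth or novelty, but the ATC and MLRP it introduced were reused many times in later object detection algorithms, and IBBV is well worth trying in real-world scenarios.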

[2] Faster R-CNN
[3] Inception