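Preface
In the article on YOLO, we noted that YOLO has three shortcomings:
The two bounding boxes duplicate each other's function, which lowers the model's accuracy;
The fully connected layers not only strip the feature vector of its positional information but also introduce a large number of parameters, which slows the algorithm down;
Using only the top-level feature vector makes the algorithm perform poorly on small objects.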
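SSD was proposed to solve these problems. SSD stands for Single Shot MultiBox Detector: Single Shot means that SSD, like YOLO, is a single-pass detection algorithm; MultiBox means that SSD can detect multiple objects at once; Detector means that SSD is used for object detection.
To address these problems, SSD makes the following improvements:
It uses an anchor mechanism similar to the one proposed for the RPN in Faster R-CNN, which increases the diversity of bounding boxes;
It uses feature maps from several stages of the network, which increases the diversity of features.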
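The SSD algorithm is shown in Figure 1.
Figure 1: The SSD pipeline
From a certain point of view, SSD is also very similar to the RPN: both are fully convolutional and both sample with anchors. They differ in two ways:
The RPN uses only the top-level features of the convolutional network, although FPN and Mask R-CNN have since improved on this;
The RPN performs binary classification (foreground/background), whereas SSD performs multi-class classification that includes the object categories.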
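SSD in Detail
1. Algorithm Pipeline
SSD's pipeline is the same as YOLO's: an input image yields a set of candidate regions, and NMS produces the final detection boxes. Unlike YOLO, SSD uses feature maps from different stages for detection. Figure 2 compares YOLO and SSD.
Figure 2: SSD vs YOLO
Before going into the details of SSD, I list its hyperparameters in Listing 1 (./models/keras_ssd300.py); the following sections explain how these hyperparameters are used.
Listing 1: SSD hyperparameters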
def ssd_300(image_size,
            n_classes,
            mode='training',
            l2_regularization=0.0005,
            min_scale=None,
            max_scale=None,
            scales=None,
            aspect_ratios_global=None,
            aspect_ratios_per_layer=[[1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5]],
            two_boxes_for_ar1=True,
            steps=[8, 16, 32, 64, 100, 300],
            offsets=None,
            clip_boxes=False,
            variances=[0.1, 0.1, 0.2, 0.2],
            coords='centroids',
            normalize_coords=True,
            subtract_mean=[123, 117, 104],
            divide_by_stddev=None,
            swap_channels=[2, 1, 0],
            confidence_thresh=0.01,
            iou_threshold=0.45,
            top_k=200,
            nms_max_output_size=400,
            return_predictor_sizes=False):
    ...  # function body omitted in this excerpt
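1.1 The SSD Backbone Network
Let us first look at the source code of the SSD backbone (./models/keras_ssd300.py), and then use the source together with Figure 2 to dissect the details of the algorithm.
Listing 2: Source code of the SSD backbone. Note that the variable names in the source differ from those in Figure 2; I have corrected the variable names in the code (the name= arguments still carry the original layer names).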
conv1_1 = Conv2D(64, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv1_1')(x1)
conv1_2 = Conv2D(64, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv1_2')(conv1_1)
pool1 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same', name='pool1')(conv1_2)
conv2_1 = Conv2D(128, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv2_1')(pool1)
conv2_2 = Conv2D(128, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv2_2')(conv2_1)
pool2 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same', name='pool2')(conv2_2)
conv3_1 = Conv2D(256, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv3_1')(pool2)
conv3_2 = Conv2D(256, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv3_2')(conv3_1)
conv3_3 = Conv2D(256, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv3_3')(conv3_2)
pool3 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same', name='pool3')(conv3_3)
conv4_1 = Conv2D(512, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv4_1')(pool3)
conv4_2 = Conv2D(512, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv4_2')(conv4_1)
conv4_3 = Conv2D(512, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv4_3')(conv4_2)
pool4 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same', name='pool4')(conv4_3)
conv5_1 = Conv2D(512, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv5_1')(pool4)
conv5_2 = Conv2D(512, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv5_2')(conv5_1)
conv5_3 = Conv2D(512, (3, 3), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv5_3')(conv5_2)
pool5 = MaxPooling2D(pool_size=(3, 3), strides=(1, 1), padding='same', name='pool5')(conv5_3)
fc6 = Conv2D(1024, (3, 3), dilation_rate=(6, 6), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='fc6')(pool5)
fc7 = Conv2D(1024, (1, 1), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='fc7')(fc6)
conv8_1 = Conv2D(256, (1, 1), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv6_1')(fc7)
conv8_1 = ZeroPadding2D(padding=((1, 1), (1, 1)), name='conv6_padding')(conv8_1)
conv8_2 = Conv2D(512, (3, 3), strides=(2, 2), activation='relu', padding='valid', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv6_2')(conv8_1)
conv9_1 = Conv2D(128, (1, 1), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv7_1')(conv8_2)
conv9_1 = ZeroPadding2D(padding=((1, 1), (1, 1)), name='conv7_padding')(conv9_1)
conv9_2 = Conv2D(256, (3, 3), strides=(2, 2), activation='relu', padding='valid', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv7_2')(conv9_1)
conv10_1 = Conv2D(128, (1, 1), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv8_1')(conv9_2)
conv10_2 = Conv2D(256, (3, 3), strides=(1, 1), activation='relu', padding='valid', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv8_2')(conv10_1)
conv11_1 = Conv2D(128, (1, 1), activation='relu', padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv9_1')(conv10_2)
conv11_2 = Conv2D(256, (3, 3), strides=(1, 1), activation='relu', padding='valid', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv9_2')(conv11_1)
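After the convolutional part of VGG, the fully connected layers are replaced with convolutions. The convolution in fc6 carries the parameter dilation_rate=(6, 6) (rate=6 in TensorFlow terms), which makes it a dilated convolution, also known as atrous convolution. In TensorFlow it is available as tf.nn.atrous_conv2d().
Figure 3: Example of dilated convolution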
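To make the effect of the dilation rate concrete, here is a minimal sketch that computes how large an area a dilated kernel covers. The helper names are hypothetical and are not part of the SSD source; the 19x19 input size assumes a 300x300 image that has passed through pool4.

```python
def effective_kernel_size(kernel_size: int, dilation_rate: int) -> int:
    """Side length of the input area covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


def same_padding_output_size(input_size: int, stride: int = 1) -> int:
    """Output size of a convolution with padding='same' (ceiling division)."""
    return -(-input_size // stride)


# fc6 uses a 3x3 kernel with dilation rate 6.
fc6_extent = effective_kernel_size(3, 6)
# With padding='same' and stride 1 the spatial size is unchanged.
fc6_output = same_padding_output_size(19)

print(fc6_extent, fc6_output)
```

A 3x3 kernel with rate 6 therefore covers a 13x13 area while still using only nine weights, which is why it can stand in for the large fully connected layer cheaply.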
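1.2 Multi-Scale Prediction
In a convolutional network, feature maps at different depths tend to respond to features at different levels of abstraction, so SSD uses several feature maps of the backbone to predict detection boxes. From Figures 1 and 2 we can see that SSD uses conv4_3, fc7, conv8_2, conv9_2, conv10_2 and conv11_2 to detect objects of increasing size, as shown in Listing 3 (./models/keras_ssd300.py).
Listing 3: SSD predicts detection boxes with fully convolutional layers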
# Feed conv4_3 into the L2 normalization layer
conv4_3_norm = L2Normalization(gamma_init=20, name='conv4_3_norm')(conv4_3)
### Build the convolutional predictor layers on top of the base network
# We predict `n_classes` confidence values for each box, hence the confidence predictors have depth `n_boxes * n_classes`
# Output shape of the confidence layers: `(batch, height, width, n_boxes * n_classes)`
conv4_3_norm_mbox_conf = Conv2D(n_boxes[0] * n_classes, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv4_3_norm_mbox_conf')(conv4_3_norm)
fc7_mbox_conf = Conv2D(n_boxes[1] * n_classes, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='fc7_mbox_conf')(fc7)
conv8_2_mbox_conf = Conv2D(n_boxes[2] * n_classes, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv8_2_mbox_conf')(conv8_2)
conv9_2_mbox_conf = Conv2D(n_boxes[3] * n_classes, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv9_2_mbox_conf')(conv9_2)
conv10_2_mbox_conf = Conv2D(n_boxes[4] * n_classes, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv10_2_mbox_conf')(conv10_2)
conv11_2_mbox_conf = Conv2D(n_boxes[5] * n_classes, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv11_2_mbox_conf')(conv11_2)
# We predict 4 box coordinates for each box, hence the localization predictors have depth `n_boxes * 4`
# Output shape of the localization layers: `(batch, height, width, n_boxes * 4)`
conv4_3_norm_mbox_loc = Conv2D(n_boxes[0] * 4, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv4_3_norm_mbox_loc')(conv4_3_norm)
fc7_mbox_loc = Conv2D(n_boxes[1] * 4, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='fc7_mbox_loc')(fc7)
conv8_2_mbox_loc = Conv2D(n_boxes[2] * 4, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv8_2_mbox_loc')(conv8_2)
conv9_2_mbox_loc = Conv2D(n_boxes[3] * 4, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv9_2_mbox_loc')(conv9_2)
conv10_2_mbox_loc = Conv2D(n_boxes[4] * 4, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv10_2_mbox_loc')(conv10_2)
conv11_2_mbox_loc = Conv2D(n_boxes[5] * 4, (3, 3), padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='conv11_2_mbox_loc')(conv11_2)
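The L2Normalization on the second line is the normalization proposed in ParseNet: the feature vector at each pixel is normalized along the channel dimension, and gamma is a trainable scaling variable.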
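The following NumPy sketch illustrates what such a layer computes in the forward pass. It is an illustration under stated assumptions, not the repository's L2Normalization class: the function name and the eps guard are hypothetical, and gamma is fixed here rather than trained.

```python
import numpy as np


def l2_normalize_channels(feature_map, gamma, eps=1e-12):
    """Normalize each pixel's channel vector to unit L2 norm, then scale by gamma.

    feature_map: array of shape (batch, height, width, channels)
    gamma:       array of shape (channels,), one scale per channel
    """
    norm = np.sqrt(np.sum(feature_map ** 2, axis=-1, keepdims=True))
    return feature_map / np.maximum(norm, eps) * gamma


rng = np.random.default_rng(0)
x = rng.normal(size=(1, 38, 38, 512))   # stand-in for the conv4_3 output
gamma = np.full(512, 20.0)               # mirrors gamma_init=20 in the code above
y = l2_normalize_channels(x, gamma)

# With a constant gamma, every pixel's channel vector now has norm 20.
print(np.linalg.norm(y[0, 0, 0]))
```

The point of the layer is that conv4_3 sits much earlier in the network than the other prediction layers, so its activations have a different magnitude; normalizing and rescaling brings them into a comparable range.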
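Listing 3 shows that SSD does not use fully connected layers to produce its predictions. Instead, it uses 3*3 convolutions to produce the classification and regression predictions separately. For classification, the number of output feature maps is (C+1)*n_boxes[i], where C is the number of object classes and the extra one is the background class; for regression, it is 4*n_boxes[i].
1.3 Anchors in SSD
Section 1.2 used n_boxes, the number of anchors per feature-map cell on each prediction layer; with the default hyperparameters, n_boxes=[4,6,6,6,4,4]. Let us now look in detail at what SSD's anchors look like.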
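As a quick check, n_boxes can be derived from the aspect_ratios_per_layer and two_boxes_for_ar1 hyperparameters listed earlier. The sketch below uses a hypothetical helper, boxes_per_cell, and assumes 20 object classes (as in Pascal VOC) purely for illustration.

```python
aspect_ratios_per_layer = [
    [1.0, 2.0, 0.5],
    [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0],
    [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0],
    [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0],
    [1.0, 2.0, 0.5],
    [1.0, 2.0, 0.5],
]


def boxes_per_cell(aspect_ratios, two_boxes_for_ar1=True):
    """One box per aspect ratio, plus an extra box for ratio 1 if requested."""
    extra = 1 if (two_boxes_for_ar1 and 1.0 in aspect_ratios) else 0
    return len(aspect_ratios) + extra


n_boxes = [boxes_per_cell(ar) for ar in aspect_ratios_per_layer]

num_object_classes = 20  # assumption for illustration only
conf_channels = [(num_object_classes + 1) * n for n in n_boxes]
loc_channels = [4 * n for n in n_boxes]

print(n_boxes)
print(conf_channels)
print(loc_channels)
```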
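SSD uses multi-scale feature maps so that different levels can detect objects of different sizes, so the anchor scales of conv4_3, fc7, conv8_2, conv9_2, conv10_2 and conv11_2 also grow from small to large. In the paper, the scale increases linearly from 0.2 to 0.9: s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), k = 1, ..., m, where s_min = 0.2, s_max = 0.9 and m is the number of feature maps used for prediction.
Figure 4: Example anchors. The figure also shows how the anchors respond to the ground truth.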
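The linear rule from the paper, s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), can be evaluated with a few lines of Python. The function name is hypothetical, and m = 6 is assumed to match the six prediction layers.

```python
def linear_scales(s_min=0.2, s_max=0.9, num_layers=6):
    """Anchor scales that grow linearly from s_min to s_max across the layers."""
    return [s_min + (s_max - s_min) * k / (num_layers - 1) for k in range(num_layers)]


scales = linear_scales()
print([round(s, 2) for s in scales])
```

Each scale is a fraction of the input image size, so on a 300x300 image the smallest anchors are roughly 60 pixels wide and the largest roughly 270 pixels.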
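How to design the anchors is a matter of judgment. For example, the anchor scales in the source code differ from those in the paper; in the source they are defined in the Jupyter notebook ./ssd300_training.ipynb. There is no need to worry too much about exactly how these anchors are defined: their role is to provide a prior for the detection boxes, and the candidate boxes that the network finally outputs are still corrected by the ground truth.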
scales_pascal = [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05] # The anchor box scaling factors used in the original SSD300 for the Pascal VOC datasets
scales_coco = [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05] # The anchor box scaling factors used in the original SSD300 for the MS COCO datasets
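The scale lists above contain seven values for six layers because each layer also needs the scale of the next layer. A minimal sketch of how a scale and a set of aspect ratios turn into box shapes, following the rule in the SSD paper (width = s * sqrt(ar), height = s / sqrt(ar), plus one extra square box of scale sqrt(s_k * s_{k+1}) for aspect ratio 1), is shown below. The function name is hypothetical.

```python
import math


def anchor_shapes(this_scale, next_scale, aspect_ratios, two_boxes_for_ar1=True):
    """Return (width, height) pairs as fractions of the image size."""
    shapes = []
    for ar in aspect_ratios:
        shapes.append((this_scale * math.sqrt(ar), this_scale / math.sqrt(ar)))
        if ar == 1.0 and two_boxes_for_ar1:
            # Extra square box whose scale lies between this layer and the next.
            extra = math.sqrt(this_scale * next_scale)
            shapes.append((extra, extra))
    return shapes


# First prediction layer with the Pascal VOC scales 0.1 and 0.2.
shapes = anchor_shapes(0.1, 0.2, [1.0, 2.0, 0.5])
for w, h in shapes:
    print(f"{w * 300:6.1f} x {h * 300:6.1f} px")
```

All boxes generated from the same scale have the same area; only the extra square box is larger.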
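Besides the anchor scales, the source code also implements the anchor centers differently from the paper. The source computes them from precomputed steps plus an offset, namely the hyperparameter steps=[8, 16, 32, 64, 100, 300]. conv4_3 has been downsampled three times, so one step on its feature map corresponds to 8 steps on the original image. This scheme has a problem, however: 75 is not divisible by 2, so downsampling 75 to 38 is not an exact halving, and the error of such inexact steps accumulates over successive layers. For example, with an offset of 0.5, the anchors in the last row and last column of the 10x10 conv8_2 layer (step 32) have their centers at 9.5 * 32 = 304, which lies outside the 300-pixel image. Interested readers can print the values to check.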
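The anchor centers can be printed with a short sketch. It assumes a 300x300 input, feature-map sizes of 38, 19, 10, 5, 3 and 1, and an offset of 0.5 cells (the behavior assumed here when offsets=None); the helper name is hypothetical.

```python
import numpy as np


def anchor_centers_1d(feature_map_size, step, offset=0.5):
    """Center coordinates (in pixels) along one axis of a feature map."""
    return (np.arange(feature_map_size) + offset) * step


image_size = 300
feature_map_sizes = [38, 19, 10, 5, 3, 1]
steps = [8, 16, 32, 64, 100, 300]

last_centers = [
    float(anchor_centers_1d(size, step)[-1])
    for size, step in zip(feature_map_sizes, steps)
]
outside = [center > image_size for center in last_centers]

for size, step, center, out in zip(feature_map_sizes, steps, last_centers, outside):
    print(f"size={size:2d} step={step:3d} last center={center:5.1f} outside={out}")
```

Under these assumptions only the 10x10 layer places its last center beyond the image border, while the 38x38 layer lands exactly on it.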
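In the source code, anchors are implemented in keras_layers/keras_layer_AnchorBoxes and used through AnchorBoxes. The six feature maps in the network produce six groups of prior boxes, 8732 in total, as shown in Listing 4.
Listing 4: Computing the prior boxes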
# Output shape of anchors: `(batch, height, width, n_boxes, 8)`
# Variable and layer names follow Figure 2, consistent with Listing 3.
conv4_3_norm_mbox_priorbox = AnchorBoxes(img_height, img_width, this_scale=scales[0], next_scale=scales[1], aspect_ratios=aspect_ratios[0],
                                         two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[0], this_offsets=offsets[0], clip_boxes=clip_boxes,
                                         variances=variances, coords=coords, normalize_coords=normalize_coords, name='conv4_3_norm_mbox_priorbox')(conv4_3_norm_mbox_loc)
fc7_mbox_priorbox = AnchorBoxes(img_height, img_width, this_scale=scales[1], next_scale=scales[2], aspect_ratios=aspect_ratios[1],
                                two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[1], this_offsets=offsets[1], clip_boxes=clip_boxes,
                                variances=variances, coords=coords, normalize_coords=normalize_coords, name='fc7_mbox_priorbox')(fc7_mbox_loc)
conv8_2_mbox_priorbox = AnchorBoxes(img_height, img_width, this_scale=scales[2], next_scale=scales[3], aspect_ratios=aspect_ratios[2],
                                    two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[2], this_offsets=offsets[2], clip_boxes=clip_boxes,
                                    variances=variances, coords=coords, normalize_coords=normalize_coords, name='conv8_2_mbox_priorbox')(conv8_2_mbox_loc)
conv9_2_mbox_priorbox = AnchorBoxes(img_height, img_width, this_scale=scales[3], next_scale=scales[4], aspect_ratios=aspect_ratios[3],
                                    two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[3], this_offsets=offsets[3], clip_boxes=clip_boxes,
                                    variances=variances, coords=coords, normalize_coords=normalize_coords, name='conv9_2_mbox_priorbox')(conv9_2_mbox_loc)
conv10_2_mbox_priorbox = AnchorBoxes(img_height, img_width, this_scale=scales[4], next_scale=scales[5], aspect_ratios=aspect_ratios[4],
                                     two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[4], this_offsets=offsets[4], clip_boxes=clip_boxes,
                                     variances=variances, coords=coords, normalize_coords=normalize_coords, name='conv10_2_mbox_priorbox')(conv10_2_mbox_loc)
conv11_2_mbox_priorbox = AnchorBoxes(img_height, img_width, this_scale=scales[5], next_scale=scales[6], aspect_ratios=aspect_ratios[5],
                                     two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[5], this_offsets=offsets[5], clip_boxes=clip_boxes,
                                     variances=variances, coords=coords, normalize_coords=normalize_coords, name='conv11_2_mbox_priorbox')(conv11_2_mbox_loc)
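The total of 8732 prior boxes can be verified by hand. The sketch below assumes a 300x300 input, for which the six prediction layers have spatial sizes 38, 19, 10, 5, 3 and 1, and uses n_boxes=[4,6,6,6,4,4].

```python
feature_map_sizes = [38, 19, 10, 5, 3, 1]
n_boxes = [4, 6, 6, 6, 4, 4]

# Each cell of a size x size feature map carries n anchors.
boxes_per_layer = [size * size * n for size, n in zip(feature_map_sizes, n_boxes)]
total_boxes = sum(boxes_per_layer)

print(boxes_per_layer)
print(total_boxes)
```

Roughly two thirds of all prior boxes come from conv4_3, the layer responsible for the smallest objects.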
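1.4 SSD's Matching Rules
Once the anchors have been generated from the feature maps, we need to decide which anchor each ground truth box is matched with; the matched anchor is then responsible for predicting that ground truth box. In YOLO, the cell in which the center of a ground truth box falls provides the bounding box responsible for predicting its exact boundary. SSD matches anchors with two strategies, 'bipartite' and 'multi'; the matching source code is located in the ./ssd_encoder_decoder/ directory.
In bipartite mode, each ground truth box is matched to the anchor with which it has the highest IoU (the paper calls it Jaccard overlap). If one anchor is the best match for several ground truth boxes, it is matched only to the ground truth box with the highest IoU, and the other ground truth boxes choose the highest-IoU anchor among those remaining. Bipartite matching guarantees that every ground truth box is matched to exactly one anchor. The source is shown in Listing 5.
Listing 5: bipartite matching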
import numpy as np

def match_bipartite_greedy(weight_matrix):
    '''
    Parameters:
        weight_matrix (array): A 2D Numpy array that represents the weight matrix
            for the matching process. If `(m,n)` is the shape of the weight matrix,
            it must be `m <= n`. The weights can be integers or floating point
            numbers. The matching process will maximize, i.e. larger weights are
            preferred over smaller weights.
    Returns:
        A 1D Numpy array of length `weight_matrix.shape[0]` that represents
        the matched index along the second axis of `weight_matrix` for each index
        along the first axis.
    '''
    weight_matrix = np.copy(weight_matrix)
    num_ground_truth_boxes = weight_matrix.shape[0]
    all_gt_indices = list(range(num_ground_truth_boxes))
    # `np.int` was removed in NumPy 1.24; the builtin `int` is equivalent here.
    matches = np.zeros(num_ground_truth_boxes, dtype=int)
    for _ in range(num_ground_truth_boxes):
        # Find the maximal anchor-ground truth pair in two steps: First, reduce
        # over the anchor boxes and then reduce over the ground truth boxes.
        anchor_indices = np.argmax(weight_matrix, axis=1) # Reduce along the anchor box axis.
        overlaps = weight_matrix[all_gt_indices, anchor_indices]
        ground_truth_index = np.argmax(overlaps) # Reduce along the ground truth box axis.
        anchor_index = anchor_indices[ground_truth_index]
        matches[ground_truth_index] = anchor_index # Set the match.
        # Set the row of the matched ground truth box and the column of the matched
        # anchor box to all zeros. This ensures that those boxes will not be matched again,
        # because they will never be the best matches for any other boxes.
        weight_matrix[ground_truth_index] = 0
        weight_matrix[:,anchor_index] = 0
    return matches
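In multi mode, each anchor is matched to the ground truth box with which it has the highest IoU, provided that this IoU reaches a given threshold, so one ground truth box can be matched by several anchors. The source is shown in Listing 6.
Listing 6: multi matching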
import numpy as np

def match_multi(weight_matrix, threshold):
    '''
    Matches every anchor box (second axis) to its best ground truth box
    (first axis) and keeps only the matches whose weight meets `threshold`.

    Returns:
        Two 1D Numpy arrays of equal length that represent the matched indices. The first
        array contains the indices along the first axis of `weight_matrix`, the second array
        contains the indices along the second axis.
    '''
    num_anchor_boxes = weight_matrix.shape[1]
    all_anchor_indices = list(range(num_anchor_boxes))
    # Find the best ground truth match for every anchor box.
    ground_truth_indices = np.argmax(weight_matrix, axis=0)
    overlaps = weight_matrix[ground_truth_indices, all_anchor_indices]
    # Filter out the matches with a weight below the threshold.
    anchor_indices_thresh_met = np.nonzero(overlaps >= threshold)[0]
    gt_indices_thresh_met = ground_truth_indices[anchor_indices_thresh_met]
    return gt_indices_thresh_met, anchor_indices_thresh_met
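Both matching functions take a weight matrix whose entry (i, j) is the IoU between ground truth box i and anchor j. A minimal sketch of how such a matrix could be built is given below. The function iou_matrix is a hypothetical helper, not the repository's implementation, and it assumes boxes in corner format (xmin, ymin, xmax, ymax) with positive area.

```python
import numpy as np


def iou_matrix(gt_boxes, anchors):
    """IoU between every ground truth box and every anchor.

    Both inputs are sequences of (xmin, ymin, xmax, ymax) boxes.
    Returns an array of shape (num_gt_boxes, num_anchors).
    """
    gt = np.asarray(gt_boxes, dtype=float)[:, np.newaxis, :]
    an = np.asarray(anchors, dtype=float)[np.newaxis, :, :]
    inter_w = np.clip(np.minimum(gt[..., 2], an[..., 2]) - np.maximum(gt[..., 0], an[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(gt[..., 3], an[..., 3]) - np.maximum(gt[..., 1], an[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_gt = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    area_an = (an[..., 2] - an[..., 0]) * (an[..., 3] - an[..., 1])
    return inter / (area_gt + area_an - inter)


gt_boxes = [[0, 0, 10, 10]]
anchors = [[0, 0, 10, 10], [5, 0, 15, 10], [20, 20, 30, 30]]
weights = iou_matrix(gt_boxes, anchors)
print(weights)
```

In this toy example the single ground truth box overlaps the first anchor completely, the second by one third and the third not at all, so bipartite matching would pick the first anchor, and multi matching with a threshold of 0.3 would accept the first two.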
1.5 SSD's Loss Function
The localization loss is expressed as the Smooth L1 loss between the actual offsets and the predicted offsets.
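As an illustration, the standard Smooth L1 function is 0.5 * x^2 for |x| < 1 and |x| - 0.5 otherwise. The sketch below applies it to the difference between predicted and actual offsets of the matched anchors. The function names are hypothetical, and the normalization by the number of matched anchors used in the paper is left out for brevity.

```python
import numpy as np


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)


def localization_loss(true_offsets, pred_offsets):
    """Sum of Smooth L1 over the 4 offsets of all matched anchors.

    Both arguments have shape (num_matched_anchors, 4).
    """
    diff = np.asarray(pred_offsets, dtype=float) - np.asarray(true_offsets, dtype=float)
    return float(np.sum(smooth_l1(diff)))


true_offsets = [[0.0, 0.0, 0.0, 0.0]]
pred_offsets = [[0.5, -2.0, 0.0, 1.0]]
loss = localization_loss(true_offsets, pred_offsets)
print(loss)
```

Compared with a plain L2 loss, the linear branch keeps a single badly predicted box from dominating the gradient.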
1.6 SSD's Detection Process
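2. DSSD
A very interesting variant of SSD is DSSD, which uses deconvolution to add context information, or, put differently, replaces upsampling based on bilinear interpolation with deconvolution. Below we explain how DSSD further improves on SSD.
2.1 The DSSD Backbone Network
For the backbone, DSSD uses the deeper ResNet-101. The detection-specific layers start after conv5_x, and the feature maps used for detection include conv3_x, conv5_x and the added detection layers, as shown in Figure 5.
Figure 5: The DSSD backbone network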
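DSSD does not make the deconvolution part very deep, for two reasons:
Too many deconvolution layers would slow detection down, which goes against SSD's original intent;
Training relies on initialization through transfer learning, and there is no pretrained model to transfer for the deconvolution part. If the randomly initialized part is too deep, the model converges more slowly.
Simply swapping the backbone does not improve detection performance; the defining feature of DSSD is the red deconvolution part on the right of Figure 5.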
2.2 Deconvolution
The Deconvolution Module in Figure 5 is expanded in Figure 6.
Figure 6: The DSSD deconvolution module
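2.3 Prediction Module
The authors tried several prediction modules after the deconvolution module, shown in Figure 7. (a) is the most common prediction module, used for example by SSD and YOLO; (b) and (c) are the modules adopted by YOLOv2 and YOLOv3 respectively, the difference being that YOLO has to upsample or downsample the feature maps to the same size. (c) is the prediction module that DSSD adopts: the authors tried all the modules in Figure 7, and the experiments show that (c) performs best in DSSD.
Figure 7: Variants of the prediction module in DSSD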
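2.4 Anchor Clustering in DSSD
Summary
The core ideas of the SSD algorithm are:
1. Use multi-scale feature maps to extract features;
2. Use the anchor mechanism of Faster R-CNN to improve the candidate boxes.
DSSD was proposed later, and its main feature is the introduction of deconvolution. Recent trends show that object detection and semantic segmentation overlap more and more, with each field continually drawing inspiration from the other to improve its own task.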