The secret formula

The “secret formula” is a method found by Tartaglia to solve cubic equations. The 16th century priority fight between Tartaglia and Cardano’s student Ferrari over this formula is well known. A brief summary: Niccolo Tartaglia, an abbaco master in Venice, was challenged in a mathematical duel by Antonio Fior, and solved the questions, all reducible to solving cubic equations, in a couple of hours. This made him instantly famous, since Luca Pacioli had stated that, contrary to the quadratic equation, there did not exist a general formula for cubic equations. Tartaglia did not want to reveal his secret formula, probably to keep up his fame and to have an advantage in future duels. Gerolamo Cardano however wanted to include the formula in his book on algebra that he was preparing. Only after a lot of nagging, Tartaglia discloses to him his method in verse form, subject to strict secrecy. When Cardano and his student Ludovico Ferrari discovered that the formula was known in unpublished work from Scipione dal Ferro, they considered the promise to Tartaglia not binding any more, and the method was included in the book with proper reference. This resulted in a public fight between Tartaglia and Ferrari publishing insulting pamphlets back and forth.

Niccolo Tartaglia was born in Brescia around 1500. His last name is often believed to be Fontana, but Toscano claims that there is no solid proof for that. Niccolo preferred to use his nickname Tartaglia, i.e. the stammerer, because he had stammered ever since, at the age of 12, a French soldier’s sabre mutilated his jaw and left him for dead during a reprisal attack of the troops of Louis XII on his home town. Against all odds, his mother managed to keep him alive. Later he became an abbaco teacher. That means that he taught and applied practical numerical calculation using the newly introduced Hindu-Arabic numerals as described in Fibonacci’s Liber Abbaci, rather than the impractical Roman numerals. This practical kind of mathematics was needed in commerce for bookkeeping or, for example, to convert between different measures or currencies. It was quite different from the geometry of Euclid’s Elements that was taught at an academic level.

The Renaissance habit of having public challenges and scientific duels had some historical background. A number of questions were formulated by the challenger to be solved within a certain time span. It was a gentlemen’s agreement though that no questions should be asked that the challenger was not able to solve himself. The one that was challenged then riposted with a set of questions for the challenger, and the winner of the duel was the one who first solved all the problems or who solved most of the problems. In 1530 Tartaglia received two questions from Zuanne de Tonini da Coi, and that surprised him because the problems reduced to the solution of a cubic equation, which was claimed to be impossible by Pacioli. So he assumed that da Coi did not know how to solve it either. Nevertheless he started thinking about the problem, and obviously found a solution, at least for some cases. It should be noted that what we write in modern notation as $x^3+ax=c$ and $x^3=bx+c$ were considered to be two different types of equations. Because the terms represented positive quantities (often lengths with a geometric interpretation), they could not be zero or negative. Only positive coefficients were allowed, which made it difficult to switch terms to the other side of the equal sign.
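
For orientation, the rule now usually called the Cardano-Tartaglia formula can be restated in modern notation for the first of these two types. This is a present-day restatement, not a quotation from Toscano’s book or from Tartaglia’s verses. For $x^3+ax=c$ with $a, c > 0$, one looks for two positive quantities $u$ and $v$ with $u-v=c$ and $uv=(a/3)^3$; then $x=\sqrt[3]{u}-\sqrt[3]{v}$, that is,

$$x = \sqrt[3]{\sqrt{(c/2)^2+(a/3)^3}+c/2} \;-\; \sqrt[3]{\sqrt{(c/2)^2+(a/3)^3}-c/2}.$$

For example, $x^3+6x=20$ gives $u=\sqrt{108}+10$ and $v=\sqrt{108}-10$, so that $u-v=20$ and $uv=8=(6/3)^3$, and the formula yields $x=(1+\sqrt{3})-(\sqrt{3}-1)=2$.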

To explain the state of the art of algebraic manipulation, Toscano sketches in a second chapter the history of how algebra came to Europe from the Babylonian Plimpton 322 tablet and the Egyptian Rhind papyrus to Al-Khwarizmi’s Algebra book (The Compendious Book on Calculation by Completion and Balancing) in which is explained how to solve equations without explicitly switching terms to the other side of the equal sign. It would have been simpler if negative numbers were allowed and if our symbolic notation was used. Although the latter was once promoted by Diophantus of Alexandria (3rd century), the habit was lost over time. It is only because of the 16th century events described here that our modern notation and manipulation came about.

The next chapter describes the 1535 duel with Fior that started Tartaglia’s fame. Antonio Fior challenged Tartaglia with problems that all reduced to cubic equations, and Tartaglia, who had figured out how to solve them since da Coi’s questions, gave the answers in a few hours, long before Fior could solve one of Tartaglia’s problems. Whether Fior was able to solve the problems himself is not clear, since he kept begging Tartaglia to reveal his method, although he claimed that the method had been explained to him by “some mathematician” 30 years earlier. This was most likely dal Ferro, since Fior was his assistant.

But Tartaglia was now also approached by Cardano, first through his publisher, who wanted to include Tartaglia’s method in his book on algebra. When Cardano invited him later to Milan, Tartaglia finally disclosed his method after Cardano had sworn not to publish it. Toscano explains the rhyme that Tartaglia used to summarize how to solve the two forms of the cubic equation mentioned above and how a third form is reduced to one of those. When Cardano’s book Ars Magna sive de regulis algebraicis was published in 1545 it was a big success and historians consider it as the beginning of modern mathematics. It contained, besides Ferrari’s solution for the fourth degree equation, also the method for the cubic equation with proper reference to Dal Ferro and Tartaglia. Tartaglia was however furious and he published his Quesiti et inventioni nuove containing his account of what has happened, alongside some insulting remarks about Cardano. This was published one year after Cardano’s book, while he had neglected for many years to publish his method himself. Cardano had accepted a long-cherished physician’s position and had left mathematics teaching, so Ferrari took up the defence of his supervisor and there was quite some verbal abuse in public pamphlets exchanged between Ferrari and Tartaglia in subsequent months. This culminated in a duel between both in 1548 in Milan, with Ferrari as victor.

Cardano’s book is important for the history of mathematics because it initiated some ideas that led to complex numbers. On the other hand, the Cardano-Tartaglia-Fior-Ferrari interaction is a juicy topic that easily lends itself to be discussed in popular science books. So it has been told by many authors, but it is often Cardano who is placed at the center of the story. For example in P.J. Nahin’s An Imaginary Tale: The Story of √-1 (1998/2010) some time is devoted to this melee, and in the novel by M. Brooks, The Quantum Astrologer’s Handbook (2017), Cardano is the main historical character. In the present book, it is basically the same story all over, but told more from Tartaglia’s viewpoint. A lot is taken from Tartaglia’s own account, with many translated quotations in which the mutual scolding in the pamphlets is made blatantly clear. There is of course some background and history of mathematics, but Toscano’s main focus is the solution of the cubic equation, leaving some other work of Tartaglia and Cardano in the shadow. For example Tartaglia wrote a treatise on ballistics and found that the maximum reach was obtained by firing at an angle of 45 degrees. The result is correct, although the treatise has several mistakes for which Ferrari reproached him later. He also translated Euclid’s Elements to Italian (working on this was his excuse for not publishing his formula).

Toscano has a pleasant writing style (and/or the translation by Arturo Sangalli is smooth). The opener of the book describes Niccolo with his mother and sister lost in the chaos of French soldiers attacking Brescia. Niccolo is hit twice by the sabre of a soldier and left for dead. That is like the opener of a dramatic novel. The attention of the reader is immediately caught. As Toscano unravels the historical development, he makes use of many quotations, which are fortunately provided by Tartaglia himself. This implies that the story is told close to how Tartaglia experienced the events. However, Toscano does not hesitate to give some interpretations and place some question marks where appropriate. The gist of the story has been told already many times, but it has never been told like Toscano does in this book.

Adhemar Bultheel

This is a translation of the Italian original published in 2009. Against the background of 16th century Italy, Toscano describes how Tartaglia learned how to solve cubic equations, thus winning in a spectacular way a mathematical duel against Antonio Fior. Tartaglia does not want to share his method with others, but eventually he lifts a corner of the veil for Cardano, subject to strict secrecy. Cardano publishes it anyway because he has discovered that the formula was described in older unpublished work of Dal Ferro. This results in a fierce public pamphlet war between Tartaglia and Cardano’s apprentice Ferrari.



9780691183671 (hbk), 9780691200323 (ebk)
USD 24.95 (hbk)

How to Improve YOLOv3

YOLOv3 is a popular and fast object detection algorithm, but unfortunately not as accurate as RetinaNet or Faster RCNN, which you can see in the image below. In this article I will discuss two simple yet powerful approaches suggested in recent object detection literature to improve YOLOv3. These are: 1) Different Training Heuristics for Object Detection, and 2) Adaptive Spatial Fusion of Feature Pyramids. We will look at them one by one. Let’s dig into it.

[Figure: comparison of YOLOv3 with other object detectors. Source: YOLOv3 paper]

Different Training Heuristics for Object Detection

The performance of image classification networks has improved a lot with the use of refined training procedures. A brief discussion of these training tricks, presented at CVPR 2019, can be found here. Similarly, for object detection networks, some have suggested different training heuristics (1), like:

  • Image mix-up with geometry preserved alignment
  • Using cosine learning rate scheduler
  • Synchronized batch normalization
  • Data augmentation
  • Label smoothing

These modifications improved the mAP@(.5:.95) score of YOLOv3 from 33.0 to 37.0 without any extra computation cost during inference, and with a negligible increase in computation cost during training (1). The improved YOLOv3 with pre-trained weights can be found here. To understand the intuition behind these heuristics, we will look at them one by one.


Let’s start with mixup training. In image classification networks, image mixup is just the linear interpolation of the pixels of two images (e.g. the left image below). The blending ratio in the mixup algorithm for image classification is drawn from a beta distribution, B(0.2, 0.2), and the same ratio is used to mix the one-hot image labels. To perform the mix-up both images must have the same dimensions, so they are generally resized; however, this would require the bounding boxes of the objects in the images to be resized as well. To avoid this hassle, a new image mix-up strategy is used. It takes a canvas whose width and height are the maxima over the two images (pixel values range from 0 to 255) and adds the linear interpolation of the two images to it. For this mixup strategy, blending ratios are drawn from the beta distribution B(1.5, 1.5), because (1) found that for object detection B(1.5, 1.5) gives a visually coherent mixed-up image and an empirically better mAP score. Object labels are merged as a new array. This is demonstrated below. Now we have one method of mixup for image classification, and another for object detection.

Natural co-occurrence of objects in training images plays a significant role in the performance of object detection networks. For instance, a bowl, a cup, and a refrigerator should appear together more frequently than a refrigerator and an elephant. This makes detecting an object outside of its typical environment difficult. Using image mixup with an increased blending ratio makes the network more robust to such detection problems. Mixup also acts as a regularizer and forces the network to favor simple linear behavior.

import numpy as np

def object_det_mix_up_(image1, image2, mixup_ratio):
    """
    image1, image2: images to be mixed up, type=ndarray
    mixup_ratio: ratio in which two images are mixed up
    Returns the mixed-up image (object labels are merged separately)
    """
    # The canvas is as large as the larger of the two images in each dimension
    height = max(image1.shape[0], image2.shape[0])
    width = max(image1.shape[1], image2.shape[1])
    mix_img = np.zeros((height, width, 3), dtype=np.float32)
    # Both images are anchored at the top-left corner, so box coordinates stay valid
    mix_img[:image1.shape[0], :image1.shape[1], :] = \
        image1.astype(np.float32) * mixup_ratio
    mix_img[:image2.shape[0], :image2.shape[1], :] += \
        image2.astype(np.float32) * (1 - mixup_ratio)
    return mix_img
Image mixup code
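
To tie the pieces of the mixup description together, here is a minimal sketch that draws the blending ratio from B(1.5, 1.5), mixes two images on a shared canvas, and merges the box arrays. The function name `mixup_detection_sample`, the (N, 4) box layout, and the per-box weights it returns are assumptions made for illustration; they are not taken from (1) or from the repository linked in this article.

```python
import numpy as np

def mixup_detection_sample(image1, boxes1, image2, boxes2, alpha=1.5, rng=None):
    """Mix two images on a shared canvas and merge their box arrays.

    boxes1 / boxes2 are assumed to be (N, 4) arrays in pixel coordinates.
    Returns the mixed image, the concatenated boxes, and one weight per box.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # blending ratio drawn from B(alpha, alpha)

    height = max(image1.shape[0], image2.shape[0])
    width = max(image1.shape[1], image2.shape[1])
    mixed = np.zeros((height, width, 3), dtype=np.float32)
    mixed[:image1.shape[0], :image1.shape[1]] = image1.astype(np.float32) * lam
    mixed[:image2.shape[0], :image2.shape[1]] += image2.astype(np.float32) * (1.0 - lam)

    # Both images sit at the top-left corner, so boxes need no coordinate shift.
    boxes = np.vstack([boxes1, boxes2])
    # Hypothetical per-box weights: each box carries the ratio of its source image.
    weights = np.concatenate([np.full(len(boxes1), lam),
                              np.full(len(boxes2), 1.0 - lam)])
    return mixed, boxes, weights
```

Whether and how the per-box weights enter the loss is a design choice left open here; the sketch only shows that geometry is preserved, so no box has to be resized.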

Learning Rate Scheduler

Most of the popular object detection networks (Faster RCNN, YOLO, etc.) use a step learning rate scheduler. According to (1), the resulting sharp learning rate transition may cause the optimizer to re-stabilize the learning momentum in the following iterations. Using a cosine scheduler (where the learning rate decreases slowly) with proper warmup (two epochs) can give even better validation accuracy than using a step scheduler, as shown below.

[Figure: Comparison of step scheduler vs cosine scheduler on the PASCAL VOC 2007 test set (source)]
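
A cosine schedule with warmup can be sketched in a few lines of plain Python. The function below is an illustration only: the name `cosine_lr_with_warmup`, the linear shape of the warmup, and counting in steps rather than epochs are all assumptions, not details taken from (1).

```python
import math

def cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr, final_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumed linear ramp from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase that has elapsed, clipped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

Unlike a step schedule, the value changes smoothly from one step to the next, so there is no abrupt transition for the optimizer to recover from.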

Classification Head Label Smoothing

In label smoothing we convert our one-hot encoded labels to a smooth probability distribution using:

$$q'_i = (1 - \varepsilon)\, q_i + \frac{\varepsilon}{K}$$

Where K is the number of classes, ε is a small constant, and q is the ground truth distribution. This acts as a regularizer by reducing the model’s confidence.
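
As a small illustration of label smoothing, the sketch below applies (1 - ε)·q + ε/K to a one-hot vector in plain Python. The function name `smooth_labels` and the default ε = 0.1 are assumptions chosen for the example.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Return (1 - epsilon) * q_i + epsilon / K for every class i."""
    num_classes = len(one_hot)  # K in the formula
    return [(1.0 - epsilon) * q + epsilon / num_classes for q in one_hot]

smoothed = smooth_labels([0, 0, 1, 0], epsilon=0.1)
# The true class keeps most of the probability mass; the rest is spread evenly.
```

The smoothed vector still sums to 1, so it remains a valid probability distribution.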

Synchronized Batch Normalisation

In current deep convolutional architectures, batch normalization is considered an essential layer. It’s responsible for speeding up the training process and making the network less sensitive to weight initialization by normalizing the activations of hidden layers. Due to large input image size, presence of feature pyramid architectures, and a large number of candidate object proposals (in case of multi-stage networks), the batch sizes one can fit on a single GPU become very small (i.e. less than 8 or so images per batch).

In the distributed training paradigm, the hidden activations are normalized within each GPU. This causes the calculation of noisy mean and variance estimates, which hinders the whole batch normalization process. Synchronized batch normalization has therefore been suggested to help increase the batch size by considering activations over several GPUs for the calculation of statistical estimates. As a result, this makes the calculations less noisy.

Synchronized batch normalization can be achieved easily using the Apex library from NVIDIA for mixed-precision and distributed training in PyTorch. We can also convert any standard BatchNorm module in PyTorch to SyncBatchNorm using the convert_syncbn_model method, which recursively traverses the passed module and its children to replace all instances of torch.nn.modules.batchnorm._BatchNorm with apex.parallel.SyncBatchNorm, where apex.parallel.SyncBatchNorm is a PyTorch module to perform synchronized batch norm on NVIDIA GPUs.

import apex
sync_bn_model = apex.parallel.convert_syncbn_model(model)
Converting standard batch normalization to synchronized batch normalization in PyTorch using Apex

Data Augmentation

Data augmentation techniques also seem to improve object detection models, although they improve single-stage detectors more than multi-stage detectors. According to (1), the reason behind this is that in multi-stage detectors like Faster-RCNN, where a certain number of candidate object proposals are sampled from a large pool of generated ROIs, the detection results are produced by repeatedly cropping the corresponding regions on feature maps. This cropping operation substitutes for randomly cropping the input images, hence these networks do not require extensive geometric augmentations during the training stage.

Empirically, augmentation methods like random cropping (with constraints), expansion, horizontal flip, resize (with random interpolation), and color jittering (including brightness, hue, saturation, and contrast) work better during training. During testing, images are just resized by randomly choosing one of the popular interpolation techniques and then normalizing.

import math
import random

import numpy as np

def horizontal_flip(image, boxes):
    """Flips the image and its bounding boxes horizontally (with probability 0.5)."""
    _, width, _ = image.shape
    if random.randrange(2):
        image = image[:, ::-1]
        boxes = boxes.copy()
        # Boxes are (x1, y1, x2, y2): new x1 = width - old x2, new x2 = width - old x1
        boxes[:, 0::2] = width - boxes[:, 2::-2]
    return image, boxes

def random_crop(image, boxes, labels, ratios=None):
    """Performs random crop on image and its bounding boxes."""
    height, width, _ = image.shape

    if len(boxes) == 0:
        return image, boxes, labels, ratios

    while True:
        # Each mode is a (min_iou, max_iou) constraint; None means "do not crop"
        mode = random.choice((
            None,
            (0.1, None),
            (0.3, None),
            (0.5, None),
            (0.7, None),
            (0.9, None),
            (None, None),
        ))

        if mode is None:
            return image, boxes, labels, ratios

        min_iou, max_iou = mode
        if min_iou is None:
            min_iou = float('-inf')
        if max_iou is None:
            max_iou = float('inf')

        for _ in range(50):
            scale = random.uniform(0.3, 1.)
            min_ratio = max(0.5, scale * scale)
            max_ratio = min(2, 1. / scale / scale)
            ratio = math.sqrt(random.uniform(min_ratio, max_ratio))
            w = int(scale * ratio * width)
            h = int((scale / ratio) * height)

            l = random.randrange(width - w)
            t = random.randrange(height - h)
            roi = np.array((l, t, l + w, t + h))

            # matrix_iou is a helper defined elsewhere (not shown here) that
            # returns the pairwise IoU between the boxes and the crop region
            iou = matrix_iou(boxes, roi[np.newaxis])

            if not (min_iou <= iou.min() and iou.max() <= max_iou):
                continue

            image_t = image[roi[1]:roi[3], roi[0]:roi[2]]

            # Keep only the boxes whose centers fall inside the crop
            centers = (boxes[:, :2] + boxes[:, 2:]) / 2
            mask = np.logical_and(roi[:2] < centers, centers < roi[2:]).all(axis=1)
            boxes_t = boxes[mask].copy()
            labels_t = labels[mask].copy()
            ratios_t = ratios[mask].copy() if ratios is not None else None

            if len(boxes_t) == 0:
                continue

            # Clip the boxes to the crop and shift them into crop coordinates
            boxes_t[:, :2] = np.maximum(boxes_t[:, :2], roi[:2])
            boxes_t[:, :2] -= roi[:2]
            boxes_t[:, 2:] = np.minimum(boxes_t[:, 2:], roi[2:])
            boxes_t[:, 2:] -= roi[:2]

            return image_t, boxes_t, labels_t, ratios_t
Several data augmentations to be applied during training
import random

import cv2
import numpy as np

def preproc_for_test(image, input_size, mean, std):
    """
    Data Augmentation applied during testing/validation
    :image: an ndarray object
    :input_size: tuple of int with two elements (H,W) of image
    :mean: mean of training dataset or image_net rgb mean
    :std: standard deviation of training dataset or imagenet rgb std
    """
    interp_methods = [cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_AREA,
                      cv2.INTER_NEAREST, cv2.INTER_LANCZOS4]
    # Pick one of the interpolation methods at random
    interp_method = interp_methods[random.randrange(5)]
    image = cv2.resize(image, input_size, interpolation=interp_method)
    image = image.astype(np.float32)
    image = image[:, :, ::-1]  # reverse the channel order
    image /= 255.
    if mean is not None:
        image -= mean
    if std is not None:
        image /= std
    # Channels-last (H, W, C) to channels-first (C, H, W)
    return image.transpose(2, 0, 1)
Data augmentation to be applied for testing/validation

In addition to the heuristics above, training the YOLOv3 model at different input image scales (320 x 320, 352 x 352, 384 x 384, 416 x 416, 448 x 448, 480 x 480, 512 x 512, 544 x 544, 576 x 576, and 608 x 608) reduces the risk of overfitting and improves the model’s generalization capabilities, just like in standard YOLOv3 training. These changes have improved YOLOv3 performance a lot, but next we will look at another approach called Adaptive Spatial Fusion of Feature Pyramids. If combined with these training heuristics, this technique can make YOLOv3 perform even better than baselines like Faster RCNN or Mask RCNN (2).
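
The ten input scales listed above are the multiples of 32 between 320 and 608, so picking one at random is straightforward. The helper below is a hypothetical sketch; how often the size is re-sampled during training (for example every few batches) is left to the training loop and is not specified here.

```python
import random

def sample_training_size(rng, min_size=320, max_size=608, step=32):
    """Pick a square input size from the multiples of `step` in [min_size, max_size]."""
    sizes = list(range(min_size, max_size + 1, step))  # 320, 352, ..., 608
    side = rng.choice(sizes)
    return side, side

rng = random.Random(0)  # seeded only to make the example repeatable
height, width = sample_training_size(rng)
```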

Adaptive Spatial Fusion of Feature Pyramids

Object detection networks that use feature pyramids make predictions at different scales of features, or the fusion of different scales of features. For instance, YOLOv3 makes predictions at three different scales with strides 32, 16 and 8. In other words, if given an input image of 416 x 416, it makes predictions on scales of 13 x 13, 26 x 26, and 52 x 52.
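
The relation between input size, stride, and prediction grid can be checked with a one-line helper. The function name `prediction_grid_sizes` is made up for this illustration.

```python
def prediction_grid_sizes(input_size, strides=(32, 16, 8)):
    """Side length of the prediction grid at each stride for a square input."""
    return [input_size // stride for stride in strides]

print(prediction_grid_sizes(416))  # [13, 26, 52]
```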

Low-resolution features have high semantic value, while high-resolution features have semantically low value. Low-resolution feature maps also contain grid cells that cover larger regions of the image and are, therefore, more suitable for detecting larger objects. On the contrary, grid cells from higher resolution feature maps are better for detecting smaller objects. This means that detecting objects of different scales using features of only one scale is difficult. To combat this issue, detection can be done on different scales individually to detect objects of different scales like in single shot detector (SSD) architecture. However, although the approach requires little extra cost in computation, it is still sub-optimal since the high-resolution feature maps cannot sufficiently obtain semantic features from the images. Architectures like RetinaNet, YOLOv3, etc. therefore combine both high and low semantic value features to create a semantically and spatially strong feature. Performing detection on those features presents a better trade-off between speed and accuracy.

Combining different resolution features is done by concatenating or adding them element-wise. Some have suggested an approach to combine these feature maps in a manner so that only relevant information from each scale feature map is kept for combination (2). The figure below summarizes this. In short, instead of making predictions on features at each level like in standard YOLOv3, features from the three levels are first rescaled and then adaptively combined at each level, and then prediction/detection is performed on those new features.

To understand this better we will look at the two important steps of this approach: 1) Identical Rescaling and 2) Adaptive Feature Fusion.

[Figure: Illustration of Adaptive Spatial Fusion of Feature Pyramids (source)]

Identical Rescaling

The features at each level are rescaled and their number of channels is adjusted. Suppose an image of size 416 x 416 is given as input, and we have to combine features at level 2 (where the feature map size is 26 x 26 and the number of channels is 512) with the higher-resolution features at level 3 (resolution 52 x 52, number of channels 256). The level 3 features would then be downsampled to 26 x 26 while the number of channels is increased to 512. On the other hand, the features at the lower-resolution level 1 (resolution 13 x 13, number of channels 1024) would be upsampled to 26 x 26 while the number of channels is reduced to 512.

For up-sampling, first a 1 x 1 convolution layer is applied to compress the number of channels of features, and then upscaling is done with interpolation. For down-sampling with a 1/2 ratio, a 3 x 3 convolution layer with a stride of 2 is used to modify the number of channels and the resolution simultaneously. For the scale ratio of 1/4, a 2-stride max pooling layer before the 2-stride convolution is used. The code below defines and performs these operations using PyTorch.

import torch.nn as nn

def add_conv(in_ch, out_ch, ksize, stride, leaky=True):
    """
    Add a conv2d / batchnorm / leaky ReLU block.
    Args:
        in_ch (int): number of input channels of the convolution layer.
        out_ch (int): number of output channels of the convolution layer.
        ksize (int): kernel size of the convolution layer.
        stride (int): stride of the convolution layer.
    Returns:
        stage (Sequential) : Sequential layers composing a convolution block.
    """
    stage = nn.Sequential()
    pad = (ksize - 1) // 2
    stage.add_module('conv', nn.Conv2d(in_channels=in_ch,
                                       out_channels=out_ch, kernel_size=ksize, stride=stride,
                                       padding=pad, bias=False))
    stage.add_module('batch_norm', nn.BatchNorm2d(out_ch))
    # Use either a leaky ReLU or a ReLU6 activation, not both
    if leaky:
        stage.add_module('leaky', nn.LeakyReLU(0.1))
    else:
        stage.add_module('relu6', nn.ReLU6(inplace=True))
    return stage
Adds a convolutional block with a sequence of conv, batchnorm and relu layers
import torch.nn.functional as F

def scaling_ops(level, x_level_0, x_level_1, x_level_2):
    """
    Performs upscaling/downscaling operation for each level of features
    Args:
        level (int): level number of features.
        x_level_0 (Tensor): features obtained from standard YOLOv3 at level 0.
        x_level_1 (Tensor): features obtained from standard YOLOv3 at level 1.
        x_level_2 (Tensor): features obtained from standard YOLOv3 at level 2.
    Returns:
        resized features at all three levels and a conv block
    """
    dim = [512, 256, 256]
    inter_dim = dim[level]
    if level == 0:
        # Lowest resolution: downsample levels 1 and 2
        stride_level_1 = add_conv(256, inter_dim, 3, 2)
        stride_level_2 = add_conv(256, inter_dim, 3, 2)
        expand = add_conv(inter_dim, 1024, 3, 1)

        level_0_resized = x_level_0
        level_1_resized = stride_level_1(x_level_1)
        # 1/4 ratio: max pooling with stride 2, then a stride-2 convolution
        level_2_downsampled_inter = F.max_pool2d(x_level_2, 3, stride=2, padding=1)
        level_2_resized = stride_level_2(level_2_downsampled_inter)
    elif level == 1:
        # Middle resolution: upsample level 0, downsample level 2
        compress_level_0 = add_conv(512, inter_dim, 1, 1)
        stride_level_2 = add_conv(256, inter_dim, 3, 2)
        expand = add_conv(inter_dim, 512, 3, 1)

        level_0_compressed = compress_level_0(x_level_0)
        level_0_resized = F.interpolate(level_0_compressed, scale_factor=2, mode='nearest')
        level_1_resized = x_level_1
        level_2_resized = stride_level_2(x_level_2)
    elif level == 2:
        # Highest resolution: upsample levels 0 and 1
        compress_level_0 = add_conv(512, inter_dim, 1, 1)
        expand = add_conv(inter_dim, 256, 3, 1)

        level_0_compressed = compress_level_0(x_level_0)
        level_0_resized = F.interpolate(level_0_compressed, scale_factor=4, mode='nearest')
        level_1_resized = F.interpolate(x_level_1, scale_factor=2, mode='nearest')
        level_2_resized = x_level_2

    return level_0_resized, level_1_resized, level_2_resized, expand
Performs upscaling or downscaling given the level number and set of features

Adaptive Feature Fusion

Once features are rescaled, they are combined by taking the weighted average of each pixel of all three rescaled feature maps (assuming the same weight across all channels). These weights are learned dynamically as we train the network. This equation can explain it better:

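$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \rightarrow l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \rightarrow l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \rightarrow l}$$

Following (2), $x_{ij}^{n \rightarrow l}$ is the feature vector at position (i, j) of the feature map rescaled from level n to level l, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$ are the learned spatial weights, which are shared across all channels and satisfy $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$. They are obtained with a softmax over control parameters $\lambda$:

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}$$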

Here these operations are defined using PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F

def adaptive_feature_fusion(level, level_0_resized, level_1_resized, level_2_resized, expand):
    """
    Combines the features adaptively.
    Args:
        level (int): level number of features.
        level_0_resized (Tensor): features obtained after rescaling at level 0.
        level_1_resized (Tensor): features obtained after rescaling at level 1.
        level_2_resized (Tensor): features obtained after rescaling at level 2.
        expand (Sequential): a conv block
    Returns:
        out (Tensor): new combined feature on which detection will be performed.
    """
    dim = [512, 256, 256]
    inter_dim = dim[level]
    compress_c = 16
    weight_level_0 = add_conv(inter_dim, compress_c, 1, 1)
    weight_level_1 = add_conv(inter_dim, compress_c, 1, 1)
    weight_level_2 = add_conv(inter_dim, compress_c, 1, 1)

    weight_levels = nn.Conv2d(compress_c*3, 3, kernel_size=1, stride=1, padding=0)
    level_0_weight_v = weight_level_0(level_0_resized)
    level_1_weight_v = weight_level_1(level_1_resized)
    level_2_weight_v = weight_level_2(level_2_resized)
    # Concatenate along the channel dimension, then reduce to one weight map per level
    levels_weight_v = torch.cat((level_0_weight_v, level_1_weight_v, level_2_weight_v), 1)
    levels_weight = weight_levels(levels_weight_v)
    # Softmax over the three levels, so the weights at each pixel sum to 1
    levels_weight = F.softmax(levels_weight, dim=1)

    fused_out_reduced = (level_0_resized * levels_weight[:, 0:1, :, :] +
                         level_1_resized * levels_weight[:, 1:2, :, :] +
                         level_2_resized * levels_weight[:, 2:, :, :])

    out = expand(fused_out_reduced)
    return out
Performs Adaptive Feature Fusion given rescaled features
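
To see the weighting step in isolation, the NumPy sketch below fuses three already-rescaled feature maps with per-pixel softmax weights. It is a simplified illustration under assumptions: the function name `fuse_with_softmax_weights` is made up, the weight logits are passed in directly instead of being produced by learned 1 x 1 convolutions, and there is no batch dimension.

```python
import numpy as np

def fuse_with_softmax_weights(features, logits):
    """Weighted per-pixel average of three feature maps.

    features: sequence of three arrays of shape (C, H, W), already rescaled.
    logits:   array of shape (3, H, W), one unnormalized weight map per level.
    """
    logits = logits - logits.max(axis=0, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)            # softmax over the 3 levels
    stacked = np.stack(features)                             # (3, C, H, W)
    # The same weight is applied to every channel at a given pixel
    fused = (stacked * weights[:, None, :, :]).sum(axis=0)   # (C, H, W)
    return fused, weights
```

With all logits equal, the result is the plain average of the three maps; as one logit grows, the fused feature at that pixel approaches the corresponding level's feature.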

According to (2), the newly adapted features filter out the inconsistency across different scales (using adaptive spatial fusion weights), which is a primary limitation of single-shot detectors with feature pyramids. When used with a YOLOv3 model trained using the training heuristics mentioned above, it significantly improves on the YOLOv3 baseline: mAP@(.5:.95) rises from 33.0 to 42.4 on COCO test-dev 2014 (2). The increase in computation cost during inference is small: speed, also measured on COCO test-dev 2014, drops from 52 FPS (frames per second) for the YOLOv3 baseline to 45.5 FPS (1). Also, integrating a few other modules like DropBlock, RFB, etc. on top of adaptive feature fusion can surpass the Faster RCNN and Mask RCNN baselines (2). One can download the pre-trained weights here.

End Notes

In this article we saw how the YOLOv3 baseline can be improved significantly by using simple training heuristics for object detection and the novel technique of adaptive feature fusion, with either no increase or only a small increase in the inference cost. These approaches require minimal architectural changes and can be easily integrated. The training heuristics mentioned above can also be used directly for fine-tuning a pre-trained YOLOv3 model. The improved YOLOv3 certainly offers a better trade-off between speed and accuracy. You can find the complete code to fine-tune YOLOv3 on your custom data using the above-mentioned approaches here.

Few training heuristics and small architectural changes that can significantly improve YOLOv3 performance with tiny increase in inference cost. – SKRohit/Improving-YOLOv3


  1. Bag of Freebies for Training Object Detection Neural Networks
  2. Learning Spatial Fusion for Single-Shot Object Detection

5 Genetic Algorithm Applications Using PyGAD

This tutorial introduces PyGAD, an open-source Python library for implementing the genetic algorithm and training machine learning algorithms. PyGAD supports 19 parameters for customizing the genetic algorithm for various applications.

Within this tutorial we’ll discuss 5 different applications of the genetic algorithm and build them using PyGAD.

The outline of the tutorial is as follows:

  • PyGAD Installation
  • Getting Started with PyGAD
  • Fitting a Linear Model
  • Reproducing Images
  • 8 Queen Puzzle
  • Training Neural Networks
  • Training Convolutional Neural Networks

You can follow along with each of these projects and run them for free on the ML Showcase. Let’s get started.

PyGAD Installation

PyGAD is available through PyPI (Python Package Index) and thus it can be installed simply using pip. For Windows, simply use the following command:

pip install pygad

For Mac/Linux, use pip3 instead of pip in the terminal command:

pip3 install pygad

Then make sure the library is installed by importing it from the Python shell:

import pygad

The latest PyGAD version is currently 2.3.2, which was released on June 1st 2020. Using the __version__ special variable, the current version can be returned.

import pygad

print(pygad.__version__)


Now that PyGAD is installed, let’s cover a brief introduction to PyGAD.

Getting Started with PyGAD

The main goal of PyGAD is to provide a simple implementation of the genetic algorithm. It offers a range of parameters that allow the user to customize the genetic algorithm for a wide range of applications. Five such applications are discussed in this tutorial.

The full documentation of PyGAD is available at Read the Docs. Here we’ll cover a more digestible breakdown of the library.

In PyGAD 2.3.2 there are 5 modules:

  1. pygad: The main module comes already imported.
  2. pygad.nn: For implementing neural networks.
  3. pygad.gann: For training neural networks using the genetic algorithm.
  4. pygad.cnn: For implementing convolutional neural networks.
  5. pygad.gacnn: For training convolutional neural networks using the genetic algorithm.

Each module has its own repository on GitHub, linked below.

  1. pygad
  2. pygad.nn
  3. pygad.gann
  4. pygad.cnn
  5. pygad.gacnn

The main module of the library is named pygad. This module has a single class named GA. Just create an instance of the pygad.GA class to use the genetic algorithm.

The steps to use the pygad module are:

  1. Create the fitness function.
  2. Prepare the necessary parameters for the pygad.GA class.
  3. Create an instance of the pygad.GA class.
  4. Run the genetic algorithm.

In PyGAD 2.3.2, the constructor of the pygad.GA class has 19 parameters, of which 16 are optional. The three required parameters are:

  1. num_generations: Number of generations.
  2. num_parents_mating: Number of solutions to be selected as parents.
  3. fitness_func: The fitness function that calculates the fitness value for the solutions.

The fitness_func parameter is what allows the genetic algorithm to be customized for different problems. This parameter accepts a user-defined function that calculates the fitness value for a single solution. The function takes two parameters: the solution, and its index within the population.

Let’s see an example to make this clearer. Assume there is a population with 3 solutions, as given below.

[221, 342, 213]
[675, 32, 242]
[452, 23, -212]

The function assigned to the fitness_func parameter must return a single number representing the fitness of the solution passed to it. Here is an example that returns the sum of the solution.

def fitness_function(solution, solution_idx):
    return sum(solution)

The fitness values for the 3 solutions are then:

  1. 776
  2. 949
  3. 263

The parents are selected based on such fitness values. The higher the fitness value, the better the solution.
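
As a quick check of the numbers above, the sketch below applies the same sum-based fitness function to the three example solutions and ranks them. The names population and ranked are only illustrative, and the ranking step is a simplified stand-in for parent selection, not PyGAD's own code.

```python
# Example population from the text: 3 solutions with 3 genes each.
population = [
    [221, 342, 213],
    [675, 32, 242],
    [452, 23, -212],
]

def fitness_function(solution, solution_idx):
    # The fitness of one solution is simply the sum of its genes.
    return sum(solution)

fitness_values = [fitness_function(sol, idx) for idx, sol in enumerate(population)]
print(fitness_values)

# Rank solution indices from highest to lowest fitness (illustrative only).
ranked = sorted(range(len(population)), key=lambda idx: fitness_values[idx], reverse=True)
print(ranked)
```

The second solution comes first in the ranking because it has the highest fitness, so it would be the first candidate when choosing parents.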

For the complete list of parameters in the pygad.GA class constructor, check out this page.

After creating an instance of the pygad.GA class, the next step is to call the run() method which goes through the generations that evolve the solutions.

import pygad

ga_instance = pygad.GA(...)

ga_instance.run()

These are the essential steps for using PyGAD. Of course there are additional steps that can be taken as well, but this is the minimum needed.
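
To give a feel for what happens across the generations, here is a minimal genetic algorithm written with only the Python standard library. It is a sketch under stated assumptions, not PyGAD's implementation: the function run_simple_ga, its parameters, and the toy fitness function are all hypothetical, and the fitness function here takes only the solution (PyGAD's also receives the solution index).

```python
import random

def run_simple_ga(fitness_func, num_generations, num_parents_mating,
                  sol_per_pop, num_genes, low, high, seed=0):
    rng = random.Random(seed)
    # Random initial population of real-valued genes.
    population = [[rng.uniform(low, high) for _ in range(num_genes)]
                  for _ in range(sol_per_pop)]
    best_per_generation = []
    for _ in range(num_generations):
        # Rank the solutions by fitness and keep the best ones as parents.
        ranked = sorted(population, key=fitness_func, reverse=True)
        parents = ranked[:num_parents_mating]
        best_per_generation.append(fitness_func(parents[0]))
        # Parents survive unchanged; children fill the rest of the population.
        children = []
        while len(parents) + len(children) < sol_per_pop:
            p1, p2 = rng.sample(parents, 2)
            point = rng.randrange(1, num_genes)   # single-point crossover
            child = p1[:point] + p2[point:]
            gene = rng.randrange(num_genes)       # mutate one random gene
            child[gene] += rng.uniform(-1.0, 1.0)
            children.append(child)
        population = parents + children
    best = max(population, key=fitness_func)
    return best, best_per_generation

# Toy problem: maximize the negative squared distance to the point (1, 2, 3).
target = [1.0, 2.0, 3.0]

def toy_fitness(solution):
    return -sum((g - t) ** 2 for g, t in zip(solution, target))

best, history = run_simple_ga(toy_fitness, num_generations=50, num_parents_mating=4,
                              sol_per_pop=10, num_genes=3, low=-2, high=5)
print(best, history[-1])
```

Because the parents are carried over unchanged, the best fitness recorded in history never decreases from one generation to the next.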

The next sections discuss using PyGAD for several different use cases.

Fitting a Linear Model

Assume there is an equation with 6 inputs, 1 output, and 6 parameters, as follows:

 y = f(w1:w6) = w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

Let’s assume that the inputs are (4,-2,3.5,5,-11,-4.7) and the output is 44. What are the values for the 6 parameters to satisfy the equation? The genetic algorithm can be used to find the answer.

The first thing to do is to prepare the fitness function as given below. It calculates the sum of products between each input and its corresponding parameter. The absolute difference between the desired output and the sum of products is calculated. Because the fitness function must be a maximization function, the returned fitness is equal to 1.0/difference. The solutions with the highest fitness values are selected as parents.

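import numpy

function_inputs = [4,-2,3.5,5,-11,-4.7]  # Function inputs.
desired_output = 44  # Function output.

def fitness_func(solution, solution_idx):
    # Sum of products between each input and its corresponding parameter.
    output = numpy.sum(solution*function_inputs)
    # The reciprocal of the absolute error turns minimization into maximization.
    fitness = 1.0 / numpy.abs(output - desired_output)
    return fitness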
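
To see what this fitness looks like for concrete solutions, the sketch below re-implements the same idea in plain Python and evaluates two hand-picked candidates. The function name fitness_plain and the small epsilon term are assumptions added for illustration; the epsilon only avoids a division by zero when a candidate matches the desired output exactly.

```python
function_inputs = [4, -2, 3.5, 5, -11, -4.7]
desired_output = 44

def fitness_plain(solution, epsilon=1e-6):
    # Sum of products between each parameter and its input.
    output = sum(w * x for w, x in zip(solution, function_inputs))
    # Reciprocal of the absolute error; epsilon guards against division by zero.
    return 1.0 / (abs(output - desired_output) + epsilon)

candidate = [1.0] * 6                       # output is -5.2, far from 44
exact = [11.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # 11 * 4 = 44, an exact fit

print(fitness_plain(candidate))
print(fitness_plain(exact))
```

The exact candidate gets a much larger fitness than the poor one, which is what lets the genetic algorithm prefer it when selecting parents.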

Now that we’ve prepared the fitness function, here’s a list with other important parameters.

sol_per_pop = 50
num_genes = len(function_inputs)

init_range_low = -2
init_range_high = 5

mutation_percent_genes = 1

You should also specify your desired mandatory parameters as you see fit. After the necessary parameters are prepared, the pygad.GA class is instantiated. For information about each of the parameters, refer to this page.

ga_instance = pygad.GA(num_generations=num_generations,

The next step is to call the run() method which starts the generations.

After the run() method completes, the plot_result() method can be used to show the fitness values over the generations.


Using the best_solution() method we can also retrieve what the best solution was, its fitness, and its index within the population.

solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Parameters of the best solution : {solution}".format(solution=solution))
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
print("Index of the best solution : {solution_idx}".format(solution_idx=solution_idx))

The full code for this project can be found in the Fitting a Linear Model notebook on the ML Showcase.

Reproducing Images

In this application we’ll start from a random image (random pixel values), then evolve the value of each pixel using the genetic algorithm.

The tricky part of this application is that an image is 2D or 3D, and the genetic algorithm expects the solutions to be 1D vectors. To tackle this issue we’ll use the img2chromosome() function defined below to convert an image to a 1D vector.

import functools
import operator
import numpy

def img2chromosome(img_arr):
    return numpy.reshape(img_arr, functools.reduce(operator.mul, img_arr.shape))

The chromosome2img() function (below) can then be used to restore the 2D or 3D image back from the vector.

def chromosome2img(vector, shape):
    # Check if the vector can be reshaped according to the specified shape.
    if len(vector) != functools.reduce(operator.mul, shape):
        raise ValueError("A vector of length {vector_length} cannot be reshaped into an array of shape {shape}.".format(vector_length=len(vector), shape=shape))

    return numpy.reshape(vector, shape)

Besides the regular steps for using PyGAD, we’ll need one additional step to read the image.

import imageio
import numpy

target_im = imageio.imread('fruit.jpg')
target_im = numpy.asarray(target_im/255, dtype=float)

This sample image can be downloaded here.

Next, the fitness function is prepared. This will calculate the difference between the pixels in the solution and the target images. To make it a maximization function, the difference is subtracted from the sum of all pixels in the target image.

target_chromosome = img2chromosome(target_im)

def fitness_fun(solution, solution_idx):
    # Sum of absolute pixel differences between the target and the solution.
    fitness = numpy.sum(numpy.abs(target_chromosome-solution))

    # Subtract the difference from the sum of the target pixels so that a
    # smaller difference gives a higher fitness.
    fitness = numpy.sum(target_chromosome) - fitness
    return fitness

The next step is to create an instance of the pygad.GA class, as shown below. Using appropriate parameters is critical to the success of the application. If the range of pixel values in the target image is 0 to 255, then init_range_low and init_range_high must be set to 0 and 255, respectively. This initializes the population with images whose pixel values lie in the same range as the target image. If the image pixel values range from 0 to 1, then the two parameters must be set to 0 and 1, respectively.

import pygad

ga_instance = pygad.GA(num_generations=20000,

When the mutation_type argument is set to random, then the default behavior is to add a random value to each gene selected for mutation. This random value is selected from the range specified by the random_mutation_min_val and random_mutation_max_val parameters.

Assume the range of pixel values is 0 to 1. If a pixel has the value 0.9 and a random value of 0.3 is generated, then the new pixel value is 1.2. Because the pixel values must fall within the 0 to 1 range, the new pixel value is therefore invalid. To work around this issue, it is very important to set the mutation_by_replacement parameter to True. This causes the random value to replace the current pixel rather than being added to the pixel.
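
The difference between the two behaviors can be sketched in plain Python. This is only an illustration of the idea described above, not PyGAD's mutation code; the helper mutate_gene and its arguments are hypothetical.

```python
import random

def mutate_gene(value, min_val, max_val, by_replacement, rng):
    # Draw the random value from the configured mutation range.
    random_value = rng.uniform(min_val, max_val)
    if by_replacement:
        return random_value          # the random value replaces the gene
    return value + random_value      # the random value is added to the gene

rng = random.Random(0)
pixel = 0.9
added = [mutate_gene(pixel, 0.0, 1.0, False, rng) for _ in range(1000)]
replaced = [mutate_gene(pixel, 0.0, 1.0, True, rng) for _ in range(1000)]

print(max(added))      # addition can push the pixel above 1.0
print(max(replaced))   # replacement stays inside the mutation range
```

With replacement, the mutated pixel is always drawn from the 0 to 1 range, so it remains a valid pixel value regardless of what the gene held before.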

After the parameters are prepared, then the genetic algorithm can run.

The plot_result() method can be used to show how the fitness value evolves by generation.


After the generations complete, some information can be returned about the best solution.

solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
print("Index of the best solution : {solution_idx}".format(solution_idx=solution_idx))

The best solution can be converted into an image to be displayed.

import matplotlib.pyplot

result = chromosome2img(solution, target_im.shape)
matplotlib.pyplot.imshow(result)
matplotlib.pyplot.show()

Here is the result.

You can run this project for free on the ML Showcase.

8 Queen Puzzle

The 8 Queen Puzzle involves 8 chess queens distributed across an 8×8 matrix, with one queen per row. The goal is to place these queens such that no queen can attack another one vertically, horizontally, or diagonally. The genetic algorithm can be used to find a solution that satisfies such conditions.

This project is available on GitHub. It has a GUI built using Kivy that shows an 8×8 matrix, as shown in the next figure.

The GUI has three buttons at the bottom of the screen. The functions of these buttons are as follows:

  • The Initial Population button creates the initial population of the GA.
  • The Show Best Solution button shows the best solution from the last generation the GA stopped at.
  • The Start GA button starts the GA iterations/generations.

To use this project, start by pressing the Initial Population button, followed by the Start GA button. Below is the method called by the Initial Population button which, as you might have guessed, generates the initial population.

def initialize_population(self, *args):
    self.num_solutions = 10


    self.population_1D_vector = numpy.zeros(shape=(self.num_solutions, 8))

    for solution_idx in range(self.num_solutions):
        initial_queens_y_indices = numpy.random.rand(8)*8
        initial_queens_y_indices = initial_queens_y_indices.astype(numpy.uint8)
        self.population_1D_vector[solution_idx, :] = initial_queens_y_indices


    self.pop_created = 1
    self.num_attacks_Label.text = "Initial Population Created."

Each solution in the population is a vector with 8 elements referring to the column indices of the 8 queens. To show the queens’ locations on the screen, the 1D vector is converted into a 2D matrix using the vector_to_matrix() method. The next figure shows the queens on the screen.

Now that the GUI is built, we’ll build and run the genetic algorithm using PyGAD.

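The fitness function used in this project is given below. It counts the attacks that can be made between the 8 queens and returns the reciprocal of that count as the fitness value, so that fewer attacks mean a higher fitness. A solution with no attacks receives a fixed fitness of 1.1.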

def fitness(solution_vector, solution_idx):

    if solution_vector.ndim == 2:
        # The solution is already an 8x8 board.
        solution = solution_vector
    else:
        # Convert the 1D vector of column indices into an 8x8 board.
        solution = numpy.zeros(shape=(8, 8))

        row_idx = 0
        for col_idx in solution_vector:
            solution[row_idx, int(col_idx)] = 1
            row_idx = row_idx + 1

    total_num_attacks_column = attacks_column(solution)

    total_num_attacks_diagonal = attacks_diagonal(solution)

    total_num_attacks = total_num_attacks_column + total_num_attacks_diagonal

    if total_num_attacks == 0:
        # No attacks: assign a fitness above the maximum of 1.0/1.
        total_num_attacks = 1.1 # float("inf")
    else:
        # Fewer attacks give a higher fitness.
        total_num_attacks = 1.0/total_num_attacks

    return total_num_attacks
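
The attacks_column() and attacks_diagonal() helpers are not shown here, so the sketch below is a stand-in written in plain Python rather than the project's code. It assumes a solution is a list where the position is the row and the value is the column, and it counts every pair of queens that share a column or a diagonal. The names count_attacks and queens_fitness are hypothetical, and the project's helpers may count attacks differently.

```python
from itertools import combinations

def count_attacks(queen_columns):
    # queen_columns[row] is the column of the queen placed in that row,
    # so two queens can never share a row.
    attacks = 0
    for (r1, c1), (r2, c2) in combinations(enumerate(queen_columns), 2):
        same_column = c1 == c2
        same_diagonal = abs(r1 - r2) == abs(c1 - c2)
        if same_column or same_diagonal:
            attacks += 1
    return attacks

def queens_fitness(queen_columns):
    attacks = count_attacks(queen_columns)
    # Same convention as the fitness function above: a fixed value of 1.1 for
    # a board without attacks, otherwise the reciprocal of the attack count.
    return 1.1 if attacks == 0 else 1.0 / attacks

valid_board = [0, 4, 7, 5, 2, 6, 1, 3]      # a known valid placement
diagonal_board = [0, 1, 2, 3, 4, 5, 6, 7]   # all queens on one diagonal

print(count_attacks(valid_board), queens_fitness(valid_board))
print(count_attacks(diagonal_board), queens_fitness(diagonal_board))
```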

By pressing the Start GA button, an instance of the pygad.GA class is created and the run() method is called.

ga_instance = pygad.GA(num_generations=500,

Here is a possible solution in which the 8 queens are placed on the board where no queen attacks another.

The complete code for this project can be found on GitHub.

Training Neural Networks

The genetic algorithm can be used to train machine learning models, including neural networks. PyGAD supports training neural networks and convolutional neural networks through the pygad.gann.GANN and pygad.gacnn.GACNN classes, respectively. This section discusses how to use the pygad.gann.GANN class for training a neural network for a classification problem.

Before building the genetic algorithm, the training data is prepared. This example builds a network that simulates the XOR logic gate.

# Preparing the NumPy array of the inputs.
data_inputs = numpy.array([[1, 1],
                           [1, 0],
                           [0, 1],
                           [0, 0]])

# Preparing the NumPy array of the outputs.
data_outputs = numpy.array([0,
                            1,
                            1,
                            0])

The next step is to create an instance of the pygad.gann.GANN class. This class builds a population of neural networks that all have the same architecture.

num_inputs = data_inputs.shape[1]
num_classes = 2

num_solutions = 6
GANN_instance = pygad.gann.GANN(num_solutions=num_solutions,

After creating the instance of the pygad.gann.GANN class, the next step is to create the fitness function. This returns the classification accuracy for the passed solution.

import numpy
import pygad.nn
import pygad.gann

def fitness_func(solution, sol_idx):
    global GANN_instance, data_inputs, data_outputs

    predictions = pygad.nn.predict(last_layer=GANN_instance.population_networks[sol_idx],
                                   data_inputs=data_inputs)
    correct_predictions = numpy.where(predictions == data_outputs)[0].size
    solution_fitness = (correct_predictions/data_outputs.size)*100

    return solution_fitness
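
The accuracy calculation in the last lines of the fitness function can be tried on its own. The sketch below is a plain-Python equivalent with a hypothetical helper name, classification_accuracy, and made-up predictions; it does not call PyGAD.

```python
def classification_accuracy(predictions, expected):
    # Percentage of positions where the prediction matches the expected label.
    correct = sum(1 for p, e in zip(predictions, expected) if p == e)
    return (correct / len(expected)) * 100

xor_outputs = [0, 1, 1, 0]

perfect = classification_accuracy([0, 1, 1, 0], xor_outputs)
one_wrong = classification_accuracy([0, 1, 0, 0], xor_outputs)
print(perfect, one_wrong)
```

With only 4 samples the fitness can take just five values (0, 25, 50, 75, or 100), so many different solutions will share the same fitness.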

Besides the fitness function, the other necessary parameters, which we discussed previously, are prepared.

population_vectors = pygad.gann.population_as_vectors(population_networks=GANN_instance.population_networks)

initial_population = population_vectors.copy()

num_parents_mating = 4

num_generations = 500

mutation_percent_genes = 5

parent_selection_type = "sss"

crossover_type = "single_point"

mutation_type = "random" 

keep_parents = 1

init_range_low = -2
init_range_high = 5

After all parameters are prepared, an instance of the pygad.GA class is created.

ga_instance = pygad.GA(num_generations=num_generations, 

The callback_generation parameter refers to a function that is called after each generation. In this application, this function is used to update the weights of all the neural networks after each generation.

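def callback_generation(ga_instance):
    global GANN_instance

    # Convert the evolved solution vectors back into per-network weight matrices.
    population_matrices = pygad.gann.population_as_matrices(population_networks=GANN_instance.population_networks, population_vectors=ga_instance.population)
    # Load the evolved weights into the networks of the population.
    GANN_instance.update_population_trained_weights(population_trained_weights=population_matrices)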

The next step is to call the run() method.

After the run() method completes, the next figure shows how the fitness value evolved. The figure shows that a classification accuracy of 100% is reached.

The complete code for building and training the neural network can be accessed and run for free on the ML Showcase in the Training Neural Networks notebook.

Training Convolutional Neural Networks

Similar to training multilayer perceptrons, PyGAD supports training convolutional neural networks using the genetic algorithm.

The first step is to prepare the training data. The data can be downloaded from these links:

  1. dataset_inputs.npy: Data inputs.
  2. dataset_outputs.npy: Class labels.

import numpy

train_inputs = numpy.load("dataset_inputs.npy")
train_outputs = numpy.load("dataset_outputs.npy")

The next step is to build the CNN architecture using the pygad.cnn module.

import pygad.cnn

input_layer = pygad.cnn.Input2D(input_shape=(80, 80, 3))
conv_layer = pygad.cnn.Conv2D(num_filters=2,
average_pooling_layer = pygad.cnn.AveragePooling2D(pool_size=5,

flatten_layer = pygad.cnn.Flatten(previous_layer=average_pooling_layer)
dense_layer = pygad.cnn.Dense(num_neurons=4,

After the layers in the network are stacked, a model is created.

model = pygad.cnn.Model(last_layer=dense_layer,

Using the summary() method, a summary of the model architecture is returned.

----------Network Architecture----------
<class 'cnn.Conv2D'>
<class 'cnn.AveragePooling2D'>
<class 'cnn.Flatten'>
<class 'cnn.Dense'>

After the model is prepared, the pygad.gacnn.GACNN class is instantiated to create the initial population. All the networks have the same architecture.

import pygad.gacnn

GACNN_instance = pygad.gacnn.GACNN(model=model,

The next step is to prepare the fitness function. This calculates the classification accuracy for the passed solution.

def fitness_func(solution, sol_idx):
    global GACNN_instance, data_inputs, data_outputs

    predictions = GACNN_instance.population_networks[sol_idx].predict(data_inputs=data_inputs)
    correct_predictions = numpy.where(predictions == data_outputs)[0].size
    solution_fitness = (correct_predictions/data_outputs.size)*100

    return solution_fitness

The other parameters are also prepared.

population_vectors = pygad.gacnn.population_as_vectors(population_networks=GACNN_instance.population_networks)

initial_population = population_vectors.copy()

num_parents_mating = 2

num_generations = 10

mutation_percent_genes = 0.1

parent_selection_type = "sss"

crossover_type = "single_point"

mutation_type = "random"

keep_parents = -1

After all parameters are prepared, an instance of the pygad.GA class is created.

ga_instance = pygad.GA(num_generations=num_generations, 

The callback_generation parameter is used to update the network weights after each generation.

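def callback_generation(ga_instance):
    global GACNN_instance, last_fitness

    # Convert the evolved solution vectors back into per-network weight matrices.
    population_matrices = pygad.gacnn.population_as_matrices(population_networks=GACNN_instance.population_networks, population_vectors=ga_instance.population)
    # Load the evolved weights into the networks of the population.
    GACNN_instance.update_population_trained_weights(population_trained_weights=population_matrices)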


The last step is to call the run() method.

The complete code for building and training the convolutional neural network can be found on the ML Showcase, where you can also run it on a free GPU from your free Gradient account.


This tutorial introduced PyGAD, an open-source Python library for implementing the genetic algorithm. The library supports a number of parameters to customize the genetic algorithm for a number of applications.

In this tutorial we used PyGAD to build 5 different applications: fitting a linear model, reproducing images, solving the 8 queens puzzle, and training neural networks (both conventional and convolutional). I hope you found this tutorial useful, and please feel free to reach out in the comments or check out the docs if you have any questions!

A Review of Popular Deep Learning Architectures: DenseNet, ResNeXt, MnasNet, and ShuffleNet v2

The aim of this three-part series has been to shed light on the landscape and development of deep learning models that have defined the field and improved our ability to solve challenging problems. In Part 1 we covered models developed from 2012-2014, namely AlexNet, VGG16, and GoogleNet. In Part 2 we saw more recent models from 2015-2016: ResNet, InceptionV3, and SqueezeNet. Now that we’ve covered the popular architectures and models of the past, we’ll move on to the state of the art.

The architectures that we’ll discuss here include:

  • DenseNet
  • ResNeXt
  • MnasNet
  • ShuffleNet v2

Let’s get started.

DenseNet (2016)

The name “DenseNet” is short for Densely Connected Convolutional Networks. It was proposed by Gao Huang, Zhuang Liu, and their team; the paper appeared as a preprint in 2016 and was presented at the CVPR conference in 2017. It received the best paper award, and has accrued over 2000 citations.

Traditional convolutional networks with n layers have n connections, one between each layer and its subsequent layer. In DenseNet each layer connects to every other layer in a feed-forward fashion, meaning that DenseNet has n(n+1)/2 connections in total. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers.
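
The two connection counts can be compared with a few lines of Python. This is only a counting sketch with hypothetical function names; it treats the network input as a source, so layer i has i incoming connections in the dense case.

```python
def sequential_connections(n):
    # One connection into each layer (counting the input to the first layer).
    return n

def dense_connections(n):
    # Layer i receives the network input plus the outputs of the i - 1
    # earlier layers, i.e. i incoming connections; the total is n(n+1)/2.
    return sum(range(1, n + 1))

for n in (5, 10, 100):
    print(n, sequential_connections(n), dense_connections(n))
```

The dense count grows quadratically with depth, which is why DenseNet keeps the number of feature maps produced by each layer small.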

Dense Blocks

DenseNet boasts one big advantage over conventional deep CNNs: the information passed through many layers will not be washed-out or vanish by the time it reaches the end of the network. This is achieved by a simple connectivity pattern. To understand this, one must know how layers in a normal CNN are connected.

In a simple CNN, the layers are connected sequentially. In a dense block, however, each layer obtains additional inputs from all preceding layers, and passes its own feature maps to all subsequent layers. Below is an image depicting the dense block.

As the layers in the network receive feature maps from all the preceding layers, the network will be thinner and more compact. Below is a 5-layer dense block with the number of channels set to 4.
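
A small sketch shows why each layer can stay thin. It assumes, as in the DenseNet paper, that every layer adds a fixed number of feature maps (4 here, matching the example above) and that the inputs to a layer are the concatenation of everything before it. The initial channel count of 16 and the function name are arbitrary choices made for illustration.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    # Layer l (1-based) sees the block input plus the growth_rate feature maps
    # produced by each of the l - 1 layers before it.
    return [initial_channels + growth_rate * (layer - 1)
            for layer in range(1, num_layers + 1)]

inputs_per_layer = dense_block_input_channels(initial_channels=16, growth_rate=4, num_layers=5)
block_output_channels = inputs_per_layer[-1] + 4  # after the last layer adds its maps

print(inputs_per_layer)
print(block_output_channels)
```

Each layer only produces 4 new feature maps, yet the last layer still has access to everything computed earlier in the block.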

The DenseNet Architecture

DenseNet has been applied to various different datasets. Based on the dimensionality of the input, different types of dense blocks are used. Below is a brief description of these layers.

  • Basic DenseNet Composition Layer: In this type of dense block, each layer applies batch normalization, a ReLU activation function, and a 3×3 convolution, in that (pre-activation) order. Below is a snapshot.
  • BottleNeck DenseNet (DenseNet-B): Every layer produces k output feature maps, so the number of inputs to later layers grows and their computation becomes more expensive. Hence the authors presented a bottleneck structure where a 1×1 convolution is used before each 3×3 convolution layer, shown below.
  • DenseNet Compression: To improve model compactness, the authors reduce the number of feature maps at the transition layers. If a dense block outputs m feature maps, the transition layer generates ⌊i·m⌋ output feature maps, where 0 < i <= 1 is the compression factor. If i = 1, the number of feature maps across transition layers remains unchanged. If i < 1, the architecture is referred to as DenseNet-C; the authors set i = 0.5 in their experiments. When both bottleneck layers and transition layers with i < 1 are used, the model is referred to as DenseNet-BC.
  • Multiple Dense Blocks with Transition Layers: The dense blocks in the architecture are separated by transition layers, each made of a 1×1 convolution layer and a 2×2 average pooling layer. Because the feature map sizes are the same within a dense block, the outputs of its layers are easy to concatenate. Lastly, at the end of the last dense block, global average pooling is performed, followed by a softmax classifier.
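
The compression step is simple arithmetic. The sketch below assumes, following the DenseNet paper, that a transition layer keeps the floor of i times m of the m feature maps it receives, where i is the compression factor. The function name is hypothetical.

```python
import math

def transition_output_maps(m, compression):
    # The compression factor must lie in (0, 1].
    if not 0 < compression <= 1:
        raise ValueError("compression must satisfy 0 < compression <= 1")
    return math.floor(compression * m)

unchanged = transition_output_maps(256, 1.0)   # no compression
halved = transition_output_maps(256, 0.5)      # DenseNet-C style setting
print(unchanged, halved)
```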

DenseNet Training and Results

The DenseNet architecture defined in the original research paper is applied to three datasets: CIFAR, SVHN, and ImageNet. All the architectures were trained with a stochastic gradient descent optimizer. For CIFAR and SVHN the training batch size was 64, and the models were trained for 300 and 40 epochs, respectively. The initial learning rate was set to 0.1 and was reduced as training progressed. Below are the training settings for DenseNet on ImageNet:

  • Batch size: 256
  • Epochs: 90
  • Learning rate: 0.1, decreased by a factor of 10 at epochs 30 and 60
  • Weight decay and momentum: 0.0001 and 0.9
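
The learning rate schedule in the list above can be written as a small step function. This is a sketch, not the authors' training code: the function name is hypothetical, and it assumes epochs are numbered from 0.

```python
def imagenet_learning_rate(epoch, base_lr=0.1, milestones=(30, 60), factor=10):
    # Divide the learning rate by `factor` once for every milestone already reached.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)

schedule = {epoch: imagenet_learning_rate(epoch) for epoch in (0, 29, 30, 59, 60, 89)}
for epoch, lr in schedule.items():
    print(epoch, lr)
```

The rate stays at 0.1 for the first 30 epochs, then drops by a factor of 10 at epoch 30 and again at epoch 60.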

Below are the detailed results showing how different configurations of DenseNet compare to other networks on the CIFAR and SVHN datasets. The data in blue indicates the best results.

Below are the Top-1 and Top-5 errors for different sizes of DenseNet on ImageNet.

Below are a few relevant links if you want to look into the original paper, its implementation, or how to implement DenseNet yourself:

  1. Most of the images in this review are taken from the original research paper (DenseNet) and the article Review: DenseNet — Dense Convolutional Network (Image Classification) by Sik-Ho Tsang
  2. Code corresponding to the original paper
  3. TensorFlow Implementation of DenseNet
  4. PyTorch Implementation of DenseNet

ResNeXt (2017)

ResNeXt is a homogeneous neural network that reduces the number of hyperparameters required by a conventional ResNet. It achieves this through “cardinality”, an additional dimension on top of the width and depth of ResNet. Cardinality defines the size of the set of transformations.

In this image the leftmost diagram is a conventional ResNet block; the rightmost is the ResNeXt block, which has a cardinality of 32. Transformations with the same topology are applied along 32 parallel paths, and the results are aggregated at the end. This technique was suggested in the 2017 paper Aggregated Residual Transformations for Deep Neural Networks, co-authored by Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, of UC San Diego and Facebook AI Research.

VGG-nets, ResNets, and Inception Networks have gained a lot of momentum in the field of feature engineering. However, despite their strong performance, they still have a handful of limitations. These models are well adapted to a few datasets, but it is unclear how to adapt them to others because of the many hyperparameters and computations involved. To overcome such issues, ResNeXt combines the advantages of both VGG/ResNet (ResNet grew out of VGG) and Inception Networks. In a nutshell, the repetition strategy of ResNet is combined with the split-transform-merge strategy of the Inception Network: a network block splits the input, transforms each part, and merges the results to get the output, and every block follows the same topology.

ResNeXt Architecture

The basic architecture of ResNeXt follows two rules. First, blocks that produce spatial maps of the same size share the same set of hyperparameters. Second, each time the spatial map is downsampled by a factor of 2, the width of the block is multiplied by a factor of 2.
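
The downsampling rule can be traced stage by stage. In the sketch below the starting spatial size of 56 and width of 128 are illustrative values chosen for the example, and the function name is hypothetical.

```python
def stage_widths(initial_size, initial_width, num_stages):
    stages = []
    size, width = initial_size, initial_width
    for _ in range(num_stages):
        stages.append((size, width))
        size //= 2     # the spatial map is downsampled by a factor of 2 ...
        width *= 2     # ... so the block width is multiplied by 2
    return stages

stages = stage_widths(initial_size=56, initial_width=128, num_stages=4)
print(stages)
```

All blocks within one stage share the same (size, width) pair, so only the first stage needs to be specified by hand.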

As seen in the table, ResNeXt-50 (32×4d) has a cardinality of 32 and a bottleneck width of 4 channels per group. The dimensions in [] denote the residual block structures, whereas the numbers written adjacent to them refer to the number of stacked blocks. The 32 denotes that there are 32 groups in the grouped convolution.

The above network structures explain what a grouped convolution is, and how it trumps the other two network structures.

  • (a) denotes the usual ResNeXt block seen previously. It has a cardinality of 32 and follows the split-transform-merge strategy.
  • (b) looks like a leaf taken out of Inception-ResNet. However, Inception and Inception-ResNet don’t have network blocks that follow the same topology.
  • (c) is related to the grouped convolution that was used in the AlexNet architecture. The 32 paths of width 4 seen in (a) and (b) are replaced by a single layer of width 128 (32×4); in short, the splitting is done by a grouped convolutional layer. Similarly, the transformation is done by a grouped convolutional layer that performs 32 groups of convolutions. The outputs are then concatenated.

Among the above three, (c) was preferred because it is the simplest to implement.
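
To see what grouping does to the size of a layer, the sketch below counts the weights of a 3×3 convolution with 128 input and 128 output channels, with and without 32 groups. Biases are ignored, and the function name is hypothetical.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps in_channels/groups inputs to out_channels/groups outputs.
    per_group = (in_channels // groups) * (out_channels // groups) * kernel_size ** 2
    return per_group * groups

ordinary = conv_weight_count(128, 128, 3)             # ordinary 3x3 convolution
grouped = conv_weight_count(128, 128, 3, groups=32)   # 32 groups, 4 channels each
print(ordinary, grouped)
```

With 32 groups, each group only connects 4 input channels to 4 output channels, so the layer needs 32 times fewer weights than the ordinary convolution of the same width.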

ResNeXt Training and Results

ImageNet has been used to show the enhancement in accuracy when cardinality is considered rather than width/depth.

Both ResNeXt-50 and ResNeXt-101 achieve lower error rates when the cardinality is higher. ResNeXt also performs well in comparison to ResNet.

Below are a few important links:

  1. Link to Original Research Paper
  2. PyTorch Implementation of ResNeXt
  3. TensorFlow Implementation of ResNeXt

ShuffleNet v2 (2018)

Besides FLOPs, which is only an indirect metric, ShuffleNet v2 considers direct metrics such as speed and memory access cost to measure a network’s computational complexity. Moreover, the direct metrics are evaluated on the target platform. ShuffleNet v2 was designed by taking these considerations into account. It was introduced in the 2018 paper ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design, co-authored by Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun.

FLOPs is the usual metric for measuring the computational cost of a network. However, a few studies have shown that FLOPs do not tell the whole story: networks with similar FLOPs can differ in speed because of memory access cost, degree of parallelism, target platform, and so on. None of these is captured by FLOPs, so they are ignored. ShuffleNet v2 addresses this by proposing four guidelines for designing a network.

ShuffleNet v2 Architecture

Before looking at the network architecture, the guidelines on which the network is built give a glimpse into how the direct metrics have been taken into account:

  1. Equal channel width minimizes the memory access cost: When the number of input channels and output channels are in the same proportion (1:1), memory access cost becomes low.
  2. Excessive group convolution increases memory access cost: The group number shouldn’t be too high, otherwise the memory access cost tends to increase.
  3. Network fragmentation reduces degree of parallelism: Fragmentation reduces the network’s efficiency in executing parallel computations.
  4. Element-wise operations are non-negligible: Element-wise operations have small FLOPs, but can increase the memory access time.
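
The first guideline can be checked numerically. The sketch below uses the memory access cost of a 1×1 convolution as given in the ShuffleNet v2 paper, MAC = hw(c_in + c_out) + c_in·c_out, and compares channel pairs that have the same product and therefore the same FLOPs. The function names and the 56×56 feature map size are illustrative assumptions.

```python
def pointwise_conv_mac(h, w, c_in, c_out):
    # Reads of the input map, writes of the output map, and the 1x1 kernel weights.
    return h * w * (c_in + c_out) + c_in * c_out

def pointwise_conv_flops(h, w, c_in, c_out):
    return h * w * c_in * c_out

# Channel pairs with the same product, hence the same FLOPs.
pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]
flops = [pointwise_conv_flops(56, 56, a, b) for a, b in pairs]
macs = [pointwise_conv_mac(56, 56, a, b) for a, b in pairs]

for pair, f, m in zip(pairs, flops, macs):
    print(pair, f, m)
```

The FLOPs are identical in every row, while the memory access cost is lowest for the balanced 1:1 pair and grows as the ratio becomes more lopsided.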

All these are integrated in the ShuffleNet v2 architecture to improve the network efficiency.

The channel split operator divides the channels into two groups, where one remains as an identity (3rd guideline). The other branch has an equal number of input and output channels along its three convolutions (1st guideline). The 1×1 convolutions aren’t group-wise (2nd guideline). Element-wise operations like ReLU, Concat, and depth-wise convolutions are confined to a single branch (4th guideline).
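
The data flow of such a unit can be sketched with channel labels instead of tensors. This is an illustration only: the function names are hypothetical, the branch is a placeholder for the three convolutions, and the final channel shuffle step is taken from the ShuffleNet v2 paper rather than from the description above.

```python
def channel_split(channels):
    # Split the channels into two equal halves.
    half = len(channels) // 2
    return channels[:half], channels[half:]

def channel_shuffle(channels, groups):
    # Equivalent to reshaping to (groups, channels_per_group), transposing,
    # and flattening again, so channels from the two branches interleave.
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]

def shuffle_unit(channels, branch):
    identity, to_transform = channel_split(channels)
    merged = identity + branch(to_transform)   # concatenation, not addition
    return channel_shuffle(merged, groups=2)

# Channels are represented by labels; the branch just tags what it processed.
out = shuffle_unit(["c0", "c1", "c2", "c3"], branch=lambda chs: [c + "'" for c in chs])
print(out)
```

Half of the channels pass through untouched, and the shuffle mixes them with the transformed half so that the next unit splits a different set of channels.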

The overall ShuffleNet v2 architecture is tabulated as follows:

The results are with respect to different variations of the output channels.

ShuffleNet v2 Training and Results

ImageNet was used as the dataset to compare results across various models.

Complexity, error rate, GPU speed, and ARM speed have been used to identify the most robust and efficient model among those compared. Although ShuffleNet v2 is not the fastest on GPU, it records the lowest top-1 error rate, which outweighs its other limitations.

Below are a few important links:

  1. Link to Original Research Paper
  2. Tensorflow Implementation of ShuffleNet v2
  3. Implementation of ShuffleNet V2

MnasNet (2019)

MnasNet is an automated mobile neural architecture search approach that uses reinforcement learning to design mobile models. It builds on standard CNN building blocks and balances accuracy against latency, so that the resulting model performs well when deployed on a mobile device. This idea was put forth in the paper MnasNet: Platform-Aware Neural Architecture Search for Mobile, which came out in 2019. It was co-authored by Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le, all from Google.

Conventional mobile CNN models developed so far do not strike the right balance when latency and accuracy are both taken into account; they fall short in one or the other. Latency is often estimated using FLOPs, which is an inaccurate proxy. In MnasNet, by contrast, the model is deployed directly onto a mobile device and its latency is measured; there are no proxies involved. Mobile devices are usually resource-constrained, so factors such as performance, cost, and latency are significant metrics to consider.

MnasNet Architecture

The approach consists of two parts: a search space and a reinforcement learning search algorithm.

  • Factorized hierarchical search space: The search space allows diverse layer structures to be used throughout the network. The CNN model is factorized into blocks, and each block has its own layer architecture. The connections are chosen so that the input and output of adjacent blocks are compatible with each other, which helps maintain a high accuracy. Below is what the search space looks like:

As can be seen, the search space is made up of several blocks. The layers are grouped into blocks based on their dimensions and filter size. Each block has its own set of layers, for which the operations are chosen (shown in blue). The first layer in every block has a stride of 2 if the block's input and output dimensions differ, and the stride is 1 for the remaining layers. The same set of operations is repeated from the second layer to the Nth layer, where N is the number of layers in the block.
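
The block structure described above can be sketched as a small sampler. This is a simplified illustration under stated assumptions: the option lists in `BLOCK_CHOICES`, the function names, and the example resolutions are hypothetical, and the real search space contains more choices than the three shown here.

```python
import random

# Hypothetical per-block options; the real search space has more of them.
BLOCK_CHOICES = {
    "op": ["conv", "depthwise_conv", "inverted_bottleneck"],
    "kernel": [3, 5],
    "num_layers": [1, 2, 3, 4],
}


def sample_block(rng):
    """Pick one option for every per-block decision."""
    return {name: rng.choice(options) for name, options in BLOCK_CHOICES.items()}


def expand_block(block, in_res, out_res):
    """Repeat the chosen layer N times; only the first layer may downsample."""
    layers = []
    for i in range(block["num_layers"]):
        stride = 2 if (i == 0 and in_res != out_res) else 1
        layers.append({"op": block["op"], "kernel": block["kernel"], "stride": stride})
    return layers


def sample_network(resolutions, seed=0):
    """Sample one block per pair of consecutive feature-map resolutions."""
    rng = random.Random(seed)
    return [
        expand_block(sample_block(rng), in_res, out_res)
        for in_res, out_res in zip(resolutions, resolutions[1:])
    ]


network = sample_network([224, 112, 112, 56, 28])
for index, layers in enumerate(network, start=1):
    print(f"block {index}: {layers}")

# Decisions are made per block, so the number of candidates is S ** B
# (S options per block, B blocks) instead of growing with every layer.
choices_per_block = 1
for options in BLOCK_CHOICES.values():
    choices_per_block *= len(options)
search_space_size = choices_per_block ** len(network)
print(search_space_size)
```

Because layers inside a block share their settings, the search makes one decision per block rather than one per layer, which keeps the space small enough to explore while still letting different blocks use different structures.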

  • Reinforcement search algorithm: Since there are two major objectives, latency and accuracy, a reinforcement learning approach is used in which a multi-objective reward is maximized. Each CNN model defined in the search space is mapped to a sequence of actions to be performed by a reinforcement learning agent.

The search algorithm works as follows: the controller is a Recurrent Neural Network (RNN) that samples a model, and the trainer trains the model and outputs its accuracy. The model is also deployed onto a mobile phone to measure its latency. Accuracy and latency are combined into a multi-objective reward. This reward is sent back to the RNN, whose parameters are updated to maximize the expected reward.
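
The multi-objective reward can be written down in a few lines. The sketch below assumes the form given in the MnasNet paper, reward = ACC(m) × (LAT(m) / T)^w, where T is the target latency and w equals α when the latency is within the target and β otherwise, with α = β = −0.07 as the setting reported there. The function name, the 75 ms target, and the candidate numbers are illustrative assumptions, not results.

```python
def mnas_reward(accuracy, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """Combine accuracy and measured latency into a single scalar reward."""
    # alpha applies within the latency target, beta beyond it.
    w = alpha if latency_ms <= target_ms else beta
    return accuracy * (latency_ms / target_ms) ** w


TARGET_MS = 75.0  # illustrative latency target (an assumption)

# Hypothetical candidates: (name, accuracy, measured latency in ms).
candidates = [
    ("on_target", 0.75, 75.0),
    ("slow_but_accurate", 0.76, 150.0),
    ("fast_but_weaker", 0.74, 37.5),
]

for name, accuracy, latency in candidates:
    reward = mnas_reward(accuracy, latency, TARGET_MS)
    print(f"{name:18s} acc={accuracy:.2f} lat={latency:6.1f} ms reward={reward:.4f}")
```

With a negative exponent, a model that exactly meets the target is rewarded with its plain accuracy, slower models are discounted, and faster ones get a small bonus. The controller is then updated to prefer action sequences that lead to higher rewards.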

MnasNet Training and Results

ImageNet has been used to compare the accuracy achieved by a MnasNet model with that of other conventional mobile CNN models. Here’s a table showing the comparison:

Compared with those models, MnasNet achieves lower latency along with improved accuracy.

If you want to check out the original paper or implement MnasNet yourself in PyTorch, check out these links:

  1. Link to Original Research Paper
  2. PyTorch Implementation of MnasNet

That wraps up our three-part series covering popular deep learning architectures that have defined the field. If you haven’t already, feel free to check out Part 1 and Part 2, which cover models like ResNet, Inception v3, AlexNet, and others. I hope you found this series useful.

Gradient Health and Paperspace Team Up to Advance Medical Imaging

Gradient Health, a medical technology company building an open research platform for medical imaging, has selected Paperspace as its MLOps platform to bring reproducibility and determinism to its machine learning pipeline.

Lung segmentations image set made available by Gradient Health

Gradient Health, founded by Duke University students and graduates, uses the latest machine learning technologies to build critical infrastructure for developers of imaging algorithms. These developers bring cutting-edge scientific research into the real world, bringing novel imaging workflows to the field of radiology for the benefit of physicians and scientists.

Misha Kutsovsky, Senior Machine Learning Architect at Paperspace, said: “Gradient Health is on a mission to help the medical imaging community innovate with the latest and greatest ML techniques. We’re excited to support their machine learning efforts and collaborate with them on this critical work.”

With Paperspace’s MLOps platform (which coincidentally shares the name Gradient), Gradient Health supports an interoperable and secure clinical environment for its developers. Having a tight machine learning feedback loop is key to delivering a valuable product.

“Through our partnership with PACS vendors, our goal is to provide all researchers secure access to train and validate models on production-scenario medical datasets. Paperspace provides us a way to better interface with researchers; allowing us to focus on building tooling, and models that are widely applicable to the field.”

Ouwen Huang, Gradient Health CEO

Working together, Gradient Health and Paperspace will make it easier than ever to translate leading scientific research into real-world machine learning algorithms that improve outcomes for patients.

To learn more about Gradient Health, please visit:

Kando and Paperspace Partner to Bring Advanced Machine Learning to Municipal Systems Monitoring

Kando, a technology company providing smart wastewater management solutions to municipal utilities, has integrated Paperspace’s Gradient machine learning platform to bolster its leading Clear Upstream wastewater event monitoring system.

Kando’s end-to-end solution Clear Upstream provides continuous awareness of events in wastewater networks.

With the partnership, Kando brings a state-of-the-art machine learning toolset and MLOps platform to its technology stack.

Machine learning is helping IoT companies leverage automation and analysis for big data solutions. Kando is taking advantage of this trend by deploying ML solutions to identify pollution risks, pinpoint sources of pollution, and evaluate impacts that require a swift emergency response to keep cities and residents safe.

As a result, clients such as El Paso Water and Clean Water Services of Portland are able to gain real-time, continuous monitoring.

Todd Feinroth, VP of Sales for Paperspace, said: “We’re pleased to partner with Kando to bring best-in-class machine learning tools to critical municipal infrastructure monitoring. We look forward to helping Kando build its machine learning capability and deliver leading solutions to wastewater utilities around the world.”

“Our partnership with Paperspace will boost our system’s advanced analytics so that we can better enable cities to remotely and continuously control their wastewater quality. Accordingly, we will begin to see greater wastewater reuse, cleaner environments, and healthier communities.”

Ari Goldfarb, Kando CEO

Working together, Kando and Paperspace will bring advanced machine learning capabilities to the management of environmental risks and public health.

To learn more about Kando and their exciting projects and customers, please visit:

The 11 Best AI & Machine Learning Podcasts to Add to Your Listening Pipeline

AI in Business

Learn what’s possible and what’s working with Artificial Intelligence in Business. Each week you’ll find featured interviews from top AI and machine learning-focused executives and researchers in industries like Financial Services, Pharma, Retail, Defense, and more. Discover trends, learn about what’s working in industry right now, and see how to adapt and thrive in an era of AI disruption.

Host: Daniel Faggella

Posting Schedule: Weekly on Tuesdays and Thursdays

Links: Apple, Soundcloud, Stitcher

The AI Podcast

Artificial intelligence has been described as “Thor’s Hammer” and “the new electricity.” But it’s also a bit of a mystery–even to those who know it best. On the AI Podcast, Noah Kravitz connects with some of the world’s leading experts in AI, deep learning, and machine learning to explain how it works, how it’s evolving, and how it intersects with every facet of human endeavor, from art to science.

Host: Noah Kravitz

Posting Schedule: Bi-Weekly on various days

Links: Apple, Spotify, Google Podcasts

AI Today Podcast

Cognilytica’s AI Today podcast focuses on relevant information about what’s going on today in the world of artificial intelligence. Discussed are pressing topics around artificial intelligence with easy-to-digest content, interviewed guests and experts on various domains within AI, and an attempt to cut through the hype and noise to identify what’s really happening with adoption and implementation of AI.

Host: Kathleen Walch & Ron Schmelzer

Posting Schedule: Weekly on Wednesdays

Links: Apple, Spotify, Google Podcasts

Artificial Intelligence: AI Podcast

Artificial Intelligence: AI Podcast (also known as “AI + Lex”) showcases conversations about the nature of intelligence, science, and technology (at MIT and beyond) from the perspective of deep learning, robotics, AI, AGI, neuroscience, philosophy, psychology, cognitive science, economics, physics, mathematics, and more.

Host: Lex Fridman

Posting Schedule: Weekly on Mondays and Fridays

Links: Apple, Spotify, RSS

Data Skeptic

The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.

Host: Kyle Polich

Posting Schedule: Weekly on Fridays

Links: Apple, Stitcher

Eye on AI

Eye on A.I. is a biweekly podcast, hosted by longtime New York Times correspondent Craig S. Smith. In each episode, Craig will talk with some of the leaders making a difference in this space, putting incremental advances in machine intelligence into a broader context and considering the global implications of developing technology. AI is about to change your world, so pay attention.

Host: Craig Smith

Posting Schedule: Bi-weekly on Wednesdays

Links: Apple, Spotify, Google Play

Linear Digressions

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago. In each episode, your hosts explore machine learning and data science through interesting (and often very unusual) applications.

Host: Katie Malone & Ben Jaffe

Posting Schedule: Weekly on Mondays

Links: Apple, Spotify, Stitcher

Practical AI

The focus of this podcast is making artificial intelligence practical, productive, and accessible to everyone. Practical AI is a show in which technology professionals, business people, students, enthusiasts, and expert guests engage in lively discussions about Artificial Intelligence and related topics (Machine Learning, Deep Learning, Neural Networks, etc). The focus is on productive implementations and real-world scenarios that are accessible to all. If you want to keep up with the latest advances in AI while keeping one foot in the real world, then this is the show for you!

Host: Chris Benson & Daniel Whitenack

Posting Schedule: Weekly on Mondays

Links: Apple, Spotify, RSS

Talking Machines

Talking Machines is your window into the world of machine learning. During each episode the hosts bring you clear conversations with experts in the field, insightful discussions of industry news, and useful answers to your questions. Machine learning is changing the questions we can ask of the world around us. Here, we explore how to ask the best questions and what to do with the answers.

Host: Katherine Gorman & Neil Lawrence

Posting Schedule: Bi-weekly on Fridays

Links: Apple, Spotify, Stitcher

The TWIML AI Podcast

Machine learning and artificial intelligence are dramatically changing the way businesses operate and people live. The TWIML AI Podcast brings the top minds and ideas from the world of ML and AI to a broad and influential community of ML/AI researchers, data scientists, engineers, tech-savvy business and IT leaders.

Host: Sam Charrington

Posting Schedule: Weekly on Mondays and Thursdays

Links: Apple, Spotify, Google Play

Voices in AI

The goal of this show is to capture this unique moment in time, where everything seems like it might be possible, both the good and the bad. Artificial intelligence isn’t over-hyped. The optimists and pessimists believe one thing in common: That AI will be transformative. Voices in AI strives to document that transformation.

Host: Byron Reese

Posting Schedule: Bi-weekly on Thursdays

Links: Apple, Google Play, Stitcher

Generative Art and the Science of Streaming the Collective Consciousness

In collaboration with Bitforms Gallery and Small Data Industries, Paperspace spoke with artist Daniel Canogar, whose work Loom is being presented as part of a cloud-based, 24/7 stream of generative software-based art.

Daniel was born in Madrid and splits his time between the United States and Spain. He graduated from NYU with a degree in Photography and became interested in projected image as a medium. Throughout his career he has created installations for the American Museum of Natural History and the European Union, as well as his solo work that has been presented in Times Square, and at the Sundance Film Festival in Park City.

We were very excited to have the opportunity to speak with Daniel about Loom and his experience as a technology-first artist. We found his insight into the everyday shared spaces of technology and society profound and worthy of deep study.

Paperspace: Is it worth distinguishing between art and technology? What’s the difference in your mind?  

Canogar: We forget how art has always had a technological/technical component. Examples include the development of finer brushes and pigments, refining optics for camera obscuras used for drawing outlines, as well as the development of the perspectival system, which in itself was a scientific invention. I’m not so interested in framing how the intersection of art and technology is a contemporary phenomenon; rather, I am fascinated to learn about the many historical examples of how this intersection happened in the past, and perhaps to realize that in many ways the same old issues re-emerge again and again, but updated with the technology of the moment.

Paperspace: Your first medium was photography. What made you branch out into film and public installations? Does your process change with each medium?

Canogar: I did start as a photographer, but I was technically very clumsy with the medium. My negatives were always full of dust (I started out with pre-digital photography) and never had the patience to refine my craft. After a while, I realized what really captivated me about photography was the experience of the darkroom: light projected by the enlarger and burning an image onto the light-sensitive paper, the smell of chemicals, the red light, the miracle of seeing an image magically appear on a blank piece of paper. These were formative experiences in my late teens that are still present in my present work. It was only natural for me to start thinking about projected images, experiential installation art and using technology as a way of making these creations happen.

Paperspace: So what exactly is generative art?

Canogar: Generative art is basically algorithmic art. An algorithm is coded to establish a set of behavioral rules that process the data. Once these rules are set in motion the artwork takes on a life of its own, always recombining the information in new and surprising ways. In many ways generative art is closer to performance art than to, say, video art: it has a life of its own, is ephemeral, and never repeats itself in exactly the same way. Once I started working with generative art, I never wanted to go back to video.

Paperspace: When creating generative art, do you already have an idea of what statement the piece will make, or do you come to that conclusion after analyzing what was generated?

Canogar: The process of creating a generative artwork is a dialogue. I have an initial concept which I share with my programmers, together with sketches, other visual references, etc. They get to work and a few days or weeks later they show me a first draft of the artwork. This is when the dialogue with the artwork begins. I take a careful look at the draft, notice what works and what doesn’t, and try to listen to what the algorithm is trying to tell me. My programmers pick up on this internal dialogue I have with the developing artwork and continue to tweak. I try to remain open to the process, not imposing my initial concepts and ideas of what I imagined the artwork would look like, but being open to the many surprising things that happen along the way. Needless to say that I also listen to my programmers’ input during our creative journey. After endless tweaks, and literally living with an artwork that is constantly changing, I feel I know the work deeply and become ready to release it to the world.

Paperspace: What about the possibility of having an unknown outcome excites you?

Canogar: Serendipity, synchronicity, the art of the accident. I am also excited about losing control of the outcome. I have seen amazing things happen with generative work that I could never have planned. It taps into endless combinatorial possibilities that take you beyond what can be planned, organized, controlled. For example, as I write this, I have just seen the single word “Protest” in smoldering black slide down my screen on which I have Loom running, capturing the moment we are living right now with a chilling and beautiful precision.

Paperspace: How do you choose the length of an installation? For Loom, I imagine the understanding of the piece would change depending on if it was streamed for a day, week, month or even a year. Do you agree?

Canogar: Time gets warped with artworks like Loom. There is really no past, nor future. By using real-time data, it lives solely in the present. Traditional notions of time related to narrative, of how long a video or a film runs, do not apply. When considering the life-span of the artwork, perhaps we should think about when the work dies: when the software that runs it becomes obsolete and can no longer find a computer to run on.

Paperspace: What do you think is the difference between displaying art in a physical gallery vs in a digital microcosm? Is there novel understanding of your audience available through presenting in this alternative space?

Canogar: Perhaps the audience may not change that much: those that experience Loom online are just as likely to visit an art gallery. I imagine they are generally interested in art and will consume it both online and off. What changes fundamentally is the attitude, mood, and predisposition of an online audience. They are using their screens for multiple activities, including creating Excel sheets, streaming Netflix, and connecting to the outside world via Zoom. It’s a tightly packed and busy platform to share artwork with. Perhaps for this reason I wanted to create an antidote to the business of the screen, by creating a tranquil, almost meditative artwork.

Paperspace: What do you hope people will take away from this installation?

Canogar: I do hope people will spend some time with the work, not necessarily watch it continuously, but have it up on their computer screens and glance at it occasionally. By living with the work, you are able to see current events unfold in front of you, not as sensationalistic breaking news, but as a poetic experience. Loom captures what is on people’s minds right now. Their online searches have been artistically distilled so as to better grasp the collective consciousness of the moment. Above all, I hope people will enjoy this artwork.

Paperspace: Anything else you’d like to share?

Canogar: I’d like to profusely thank Paperspace for making this artwork happen. We couldn’t have done it without you!

Be sure to visit Bitforms Gallery’s first-of-its-kind virtual art gallery, starting with Canogar’s Loom, which is live until June 9, 2020. With the help of a Paperspace Core P6000 virtual machine, Bitforms Gallery (in collaboration with Small Data) will be streaming five additional pieces 24/7 for the next three months.
