
# Knowledge Distillation


## Summary

Deep learning is used in a plethora of applications, ranging from computer vision and digital assistants to healthcare and finance. The popularity of machine learning and deep learning can be attributed to the high accuracy of the results, which is often obtained by averaging the predictions of a large ensemble of models. However, such computationally intensive models cannot be deployed for instant use on mobile devices or FPGAs, which have constraints on resources such as memory and input/output ports.

One way of mitigating this problem is Knowledge Distillation. We first train an ensemble of models or a single complex model (the 'teacher') on the data. We then train a lighter model (the 'student') with the help of the teacher. The less resource-intensive student model can then be deployed on constrained devices such as FPGAs.

## Milestones

1989

Hanson and Pratt propose network pruning using biased weight decay. They call their pruned networks minimal networks. In the early 1990s, other pruning methods such as optimal brain damage and optimal brain surgeon are proposed. These are early approaches to compress a neural network model. Knowledge distillation as an alternative is invented about two decades later.

2006

Buciluǎ et al. publish a paper titled Model Compression. They present a method for “compressing” large, complex ensembles into smaller, faster models, usually without significant loss in performance. They use the ensemble to label large unlabelled datasets. They then use this labelled data to train a single model that performs as well as the ensemble. Because it's not easy to obtain large sets of unlabelled data, they develop an algorithm called MUNGE to generate pseudo data. This work is limited to shallow networks.

2014

Ba and Caruana propose the teacher-student learning method. They show that shallow models can perform as well as deep models. A complex teacher network (either a deep network or an ensemble) is trained first. Instead of the softmax output, the teacher's logits are used to train the shallow student network. Thus, the student benefits from what the teacher has learned without losing information through the softmax layer. Some call this softened softmax.

2014

Hinton et al. introduce the idea of passing on the 'dark' knowledge from an ensemble of models into a lighter, deployable model. In a paper published in March 2015, they explain that they're "distilling knowledge" from the complex model. The core idea is that models should generalize well to new data rather than merely optimize on training data. Instead of using logits directly, they use distillation, in which the softmax is applied at a higher temperature to produce "soft targets". They note that using logits is a special case of distillation.

2015

FitNet aims to produce a student network that's thinner than the teacher network while being of similar depth. In addition to the teacher's distilled knowledge from the final softmax layer, FitNets also make use of intermediate-level hints from the hidden layers. Yim et al. propose a variation of this in 2017, distilling knowledge from the inner product of features of two layers.

2018

Furlanello et al. show that student models parameterized similarly to their teacher models can outperform them. They call these Born-Again Networks (BANs); model compression is not the goal here. Students are trained to predict the correct labels and to match the teacher's output distribution (knowledge distillation).

2019

Researchers at the Indian Institute of Science, Bangalore, propose Zero-Shot Knowledge Distillation (ZSKD), in which neither the teacher's training dataset nor a transfer dataset is used for distillation. Instead, they synthesize pseudo data from the teacher's model parameters, which they call Data Impressions (DI). These are then used as a transfer dataset to perform distillation. Another research group, with the aim of reducing training data, shows that just 1% of the training data can be adequate.

2019

Park et al. look at the mutual relationships among data samples and transfer this knowledge to the student network. Called Relational Knowledge Distillation (RKD), this departs from the conventional approach of looking at individual samples. Liu et al. propose something similar, calling it Instance Relationship Graph (IRG). Attention networks are another approach to distilling relationships.

Sep
2019

Yuan et al. note that conventional KD can be reversed; that is, the teacher can also learn from the student. Another observation is that a poorly-trained teacher can still improve the student. They see KD not just as transferring similarity across categories but also as a regularization through soft targets. With this understanding, they propose Teacher-free Knowledge Distillation (Tf-KD), in which a student model learns from itself.

## Discussion

• What is a teacher-student network?

The best machine learning models are often those that average the predictions of a large ensemble of models. When deploying on hardware devices like FPGAs, however, problems ensue. FPGAs have a limited number of I/O ports, which forces developers to drastically reduce the number of inputs and outputs at each layer of their network.

To alleviate this problem, we use two networks - a teacher and a student. Essentially, we train a bulky ensemble of models (teacher) and use a smaller, lighter model (student) for testing, prediction and deployment. The student is trained to mimic the prediction capabilities of the teacher. How we go about doing this constitutes the crux of Knowledge Distillation.

• What is dark knowledge and softmax temperature?

The inputs to the softmax layer of the network, called logits, are 'softened' by dividing them by a constant value called the temperature. When the temperature is 1, the probabilities obtained are said to be unsoftened. Hinton et al. note that, in general, a suitable temperature depends on the number of units in the hidden layers of the network. For example, when the number of hidden units was 300, temperatures above 8 worked well, whereas with 30 hidden units, temperatures in the range of 2.5-4 worked best. The higher the temperature, the softer the probabilities.

Consider a classification problem with four classes, [cow, dog, cat, car]. If we have an image of a dog, the true labels would be [0, 1, 0, 0]. Consider also a network that computes the probabilities for the four classes as [0.05, 0.3, 0.2, 0.005], respectively.

It can be observed that the probability of the image being classified as a cow is 10 times greater than that of it being classified as a car. It is this 'dark' knowledge, first described by Hinton in his paper, that needs to be distilled from the teacher network to the student.
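
As a rough illustration (the logit values here are hypothetical and not taken from the paper), the snippet below applies a temperature-scaled softmax to a logit vector and shows how raising the temperature softens the distribution, exposing the relative similarities that constitute dark knowledge.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax: higher temperature gives softer probabilities."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                      # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical teacher logits for the classes [cow, dog, cat, car]
logits = [1.0, 4.0, 3.0, -1.0]

print(softmax_with_temperature(logits, temperature=1))  # sharp: almost all mass on 'dog'
print(softmax_with_temperature(logits, temperature=4))  # soft: cow/car differences become visible
```

At temperature 1 the output is dominated by the correct class, whereas at temperature 4 the small but informative differences among the wrong classes become visible to the student.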

• How can I implement Knowledge Distillation?

Buciluǎ et al. designed the first methods of model compression. Later, Hinton et al. showed how to distil the knowledge from an ensemble of models into a single, lighter model.

For example, in image classification, the student would be trained on the class probabilities, or logits, output by the teacher. The logits represent a similarity metric over the classes and help in training good classifiers. Extracting this form of 'dark knowledge' from the teacher network and passing it on to the student is called distillation.
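
One common way to set this up (a minimal sketch, not the exact code from any of the papers) is to train the student with a loss that mixes cross-entropy on the hard labels with cross-entropy on the teacher's temperature-softened outputs. The helper below assumes the student outputs raw logits and that the hard labels and soft targets are concatenated into a single target array; the names NUM_CLASSES, make_distillation_loss and teacher_soft are illustrative.

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 10   # e.g. MNIST digits

def make_distillation_loss(temperature=4.0, alpha=0.1):
    """Weighted mix of hard-label and soft-target cross-entropy (Hinton-style)."""
    def loss(y_true, student_logits):
        hard_labels = y_true[:, :NUM_CLASSES]       # one-hot ground truth
        soft_targets = y_true[:, NUM_CLASSES:]      # teacher's softened predictions
        hard_loss = tf.keras.losses.categorical_crossentropy(
            hard_labels, tf.nn.softmax(student_logits))
        soft_loss = tf.keras.losses.categorical_crossentropy(
            soft_targets, tf.nn.softmax(student_logits / temperature))
        # The T^2 factor keeps the soft-target gradients on a scale comparable to the hard ones
        return alpha * hard_loss + (1 - alpha) * temperature**2 * soft_loss
    return loss

# Usage sketch: 'teacher_soft' are the teacher's temperature-softened predictions on the
# training set, and 'student' is a Keras model whose final layer outputs logits.
# y_combined = np.concatenate([train_labels, teacher_soft], axis=1)
# student.compile(optimizer='sgd', loss=make_distillation_loss())
# student.fit(train_data, y_combined, epochs=10)
```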

Kariya's Medium article provides a simple implementation of Hinton's paper. He touches upon dark knowledge and proceeds to build a simple CNN-based network on the MNIST dataset, showing how the teacher-trained student performs better than a standalone student.

Implementing knowledge distillation can be a resource-intensive task. It requires the training of the student model on the teacher's logits, in addition to training the teacher model.

While training the student, care should be taken to avoid unstable or vanishing gradients; instability can occur if the learning rate of the student is too high.

The objective of distilling the knowledge from an ensemble of models into a single, lightweight model is to ease the processes of deployment and testing. It is of paramount importance that accuracy not be compromised in trying to achieve this objective.

In the original paper by Hinton et al., the performance of the student network after knowledge distillation improved when compared with a standalone student network. Both networks were trained on the MNIST dataset of images, and the accuracies of the various models were compared.

Those comparisons show that the best results are still obtained from the bulky ensemble of models; the student alternatives should be used only when resources are constrained.

• What are the challenges with Knowledge Distillation?

KD is limited to classification tasks that use a softmax layer. Sometimes the assumptions are too strict, as in FitNets, where student models may not suit constrained deployment environments. Other approaches to model compression may therefore be preferred over KD.

However, KD continues to be a promising area of research. In 2017, it was adapted for multi-class object detection. In 2018, KD was applied to construct specialized student models for visual question answering. Also in 2018, Guo et al. improved the robustness of the student network so that it resists perturbations.

In some domains, such as healthcare, DNNs are not preferred. Decision trees are preferred since their predictions can be more easily interpreted. KD has been used to distil a DNN into a decision tree, thereby providing both good performance and interpretability.

## Sample Code

# A minimal Keras sketch of Hinton-style knowledge distillation on MNIST.
# The teacher and student architectures below are illustrative only.
import os
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # select the first GPU; adjust as needed

import tensorflow as tf
import numpy as np
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense, Activation, Lambda
from tensorflow.keras.optimizers import SGD

# Load MNIST and prepare it for a CNN
(train_data, train_labels), (test_data, test_labels) = keras.datasets.mnist.load_data()
train_data = train_data.reshape(-1, 28, 28, 1).astype('float32') / 255
test_data = test_data.reshape(-1, 28, 28, 1).astype('float32') / 255
train_labels = keras.utils.to_categorical(train_labels)
test_labels = keras.utils.to_categorical(test_labels)

def swish(x):
    # Optional scaled swish-like activation (not used below)
    beta = 1.5
    return beta * x * keras.backend.sigmoid(x)

def new_softmax(logits, temperature=1):
    # Temperature-scaled softmax applied row-wise over the class dimension
    logits = logits / temperature
    exp = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp / np.sum(exp, axis=-1, keepdims=True)

print(train_data.shape)

# Teacher: a simple CNN (illustrative architecture; any reasonably strong model will do)
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(10, name='logits'),        # raw logits, named so they can be extracted later
    Activation('softmax')
])
model.summary()

opt = SGD(lr=0.001, momentum=0.9)

model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_data, train_labels, epochs=50)
(loss, accuracy) = model.evaluate(test_data, test_labels)
print(loss, accuracy)

# Extract the teacher's logits and soften them with a higher temperature
model_sans_softmax = keras.models.Model(inputs=model.input, outputs=model.get_layer('logits').output)
new_logits = model_sans_softmax.predict(train_data)
unsoftened_prob = new_softmax(new_logits, 1)
print("Unsoftened probabilities " + str(unsoftened_prob[0]))
temperature = 4
softened_prob = new_softmax(new_logits, temperature)
print("Softened probabilities " + str(softened_prob[0]))

# Student: a much smaller network (illustrative architecture)
model1 = Sequential([
    Flatten(input_shape=(28, 28, 1)),
    Dense(32, activation='relu'),
    Dense(10, name='logits')         # raw logits
])
model1.summary()
logits = model1.get_layer('logits').output
logits = Lambda(lambda x: x / temperature)(logits)   # match the teacher's temperature
out = Activation('softmax', name='soft')(logits)

new_student = keras.models.Model(inputs=model1.input, outputs=out)
new_student.summary()

new_student.compile(optimizer=SGD(lr=0.001, momentum=0.9),
                    loss='categorical_crossentropy', metrics=['accuracy'])

# Train the student on the teacher's softened targets, then evaluate on the hard test labels
new_student.fit(train_data, softened_prob, epochs=100)
(loss, accuracy) = new_student.evaluate(test_data, test_labels)
print(loss, accuracy)

## Cite As

Devopedia. 2019. "Knowledge Distillation." Version 10, November 2. Accessed 2020-07-03. https://devopedia.org/knowledge-distillation