Knowledge Distillation

Article Info

Contributed by
2 authors

Last updated on
2020-07-24 04:47:08

Model Compression in Deep Neural Networks
Artificial Neural Network
Deep Learning
Machine Learning
TensorFlow
Keras

Student network is trained to mimic the output of teacher network. Source: Upadhyay 2018.

Deep learning is used in a wide range of applications, from computer vision and digital assistants to healthcare and finance. Much of its popularity comes from the high accuracy of its results, and the most accurate systems are often large models or ensembles that average the predictions of many models. However, such computationally intensive models cannot be readily deployed on mobile devices or FPGAs. These devices have limited resources, such as memory and input/output ports.

One way of mitigating this problem is to use Knowledge Distillation. We train an ensemble of models or a complex model ('teacher') on the data. We then train a lighter model ('student') with the help of the complex model. The less-intensive student model can then be deployed on FPGAs.

Discussion

What is a teacher-student network?
Some of the most accurate Machine Learning models are those that average the predictions of a large ensemble of models. Deploying them on hardware devices like FPGAs, however, is problematic. FPGAs have a limited number of I/O ports, which forces developers to drastically reduce the number of inputs and outputs at each layer of their network.
To alleviate this problem, we use two networks: a teacher and a student. Essentially, we train a bulky ensemble of models (teacher) and use a smaller, lighter model (student) for testing, prediction and deployment. The student is trained to mimic the prediction capabilities of the teacher. How we go about doing this constitutes the crux of Knowledge Distillation.
In other words, the ensemble is simply a function that maps input to output. Transferring the knowledge in this function to the student network is knowledge distillation.
What is dark knowledge and softmax temperature?
In classification problems, neural networks output logits that are computed for each class. A softmax layer "normalizes" these logits $z_i$ into probabilities $q_i$. For a softer distribution, logits are 'softened' or divided by a constant value, called the temperature $T$:
$$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$
When the temperature is 1, the probabilities obtained are said to be unsoftened. Hinton et al. found that, in general, the best temperature depends on the number of units in the hidden layers of the student network. For example, when the number of units in each hidden layer was 300, temperatures above 8 worked well, whereas when the number of units was 30, temperatures in the range of 2.5-4 worked best. The higher the temperature, the softer the probabilities.
Consider a classification problem with four classes, [cow, dog, cat, car]. If we have an image of a dog, the hard targets would be [0, 1, 0, 0]. This doesn't tell much about what the ensemble has learned. By softening, we may get [0.05, 0.3, 0.2, 0.005] (illustrative values that are not normalized to sum to 1). It's clear that the image is judged 10 times more likely to be a cow than a car. It's this 'dark' knowledge that needs to be distilled from the teacher network to the student.
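The effect of temperature can be checked numerically. The following is a minimal sketch in plain Python, assuming hypothetical teacher logits for the four classes (the values and the function name `softmax_with_temperature` are illustrative and not from the article). It applies the formula above at $T=1$ and $T=4$.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes [cow, dog, cat, car].
classes = ["cow", "dog", "cat", "car"]
teacher_logits = [2.0, 8.0, 5.0, -3.0]

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

for name, p1, p4 in zip(classes, hard, soft):
    print(f"{name}: T=1 -> {p1:.4f}, T=4 -> {p4:.4f}")
```

At the higher temperature the winning class keeps the largest probability, but the remaining mass is spread over the other classes, which makes their relative ranking (cat above cow above car) visible to the student.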
How could I implement this Knowledge distillation?
Distilling the knowledge from a teacher to a student. Source: Neural Network Distiller 2019.
Buciluǎ et al. designed the first methods of model compression. Later, Hinton et al. showed how to distil the knowledge from an ensemble of models into a single, lighter model.
For example, in image classification, the student would be trained on the class probabilities, or logits, output by the teacher. The logits represent a similarity metric over the classes and help in training good classifiers. Extracting this form of 'dark knowledge' from the teacher network and passing it on to the student is called distillation.
Kariya's Medium article provides a simple implementation of Hinton's paper. He touches upon dark knowledge and proceeds to build a simple CNN-based network on the MNIST dataset, showing how the teacher-trained student performed better than a standalone student.
Implementing knowledge distillation can be a resource-intensive task. It requires the training of the student model on the teacher's logits, in addition to training the teacher model.
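While training the student, care should be taken in choosing the learning rate: if it's too high, training can become unstable. Gradient magnitudes also need attention. Hinton et al. note that the gradients produced by soft targets scale as $1/T^2$, so the soft-target loss should be multiplied by $T^2$ when it's combined with a loss on the true labels.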
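Hinton et al. describe training the student on a weighted combination of two objectives: cross-entropy with the teacher's soft targets at temperature $T$, and cross-entropy with the true labels at temperature 1. The sketch below shows one way to write such a loss for a single example in plain Python. The function names, the weight `alpha` and its default, and the example logits are assumptions made for illustration; this is not the article's sample code.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of a soft-target loss and a hard-label loss.

    alpha (an assumed hyperparameter) weights the soft-target term.
    """
    # Soft term: student and teacher are both softened with the same T.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = cross_entropy(teacher_soft, student_soft)

    # Hard term: ordinary cross-entropy with the true label at T = 1.
    one_hot = [1.0 if i == true_label else 0.0
               for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, softmax(student_logits, 1.0))

    # Multiply the soft term by T^2 to offset the 1/T^2 gradient scaling.
    return alpha * temperature ** 2 * soft_loss + (1.0 - alpha) * hard_loss

# Hypothetical logits; the true class has index 1.
teacher = [2.0, 8.0, 5.0, -3.0]
student_close = [1.5, 7.0, 4.5, -2.0]
student_far = [8.0, 2.0, -3.0, 5.0]

loss_close = distillation_loss(student_close, teacher, true_label=1)
loss_far = distillation_loss(student_far, teacher, true_label=1)
print(loss_close, loss_far)
```

A student whose logits resemble the teacher's gets a lower loss than one that disagrees with both the teacher and the label. Setting `alpha=0.0` reduces the function to plain cross-entropy on the true label.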
How about performance?
The objective of distilling the knowledge from an ensemble of models into a single, lightweight model is to ease the processes of deployment and testing. It is of paramount importance that accuracy not be compromised in trying to achieve this objective.
In the original paper authored by Hinton et al., the performance of the student network after knowledge distillation improved when compared with a standalone student network. Both networks were trained on the MNIST dataset of images. The accuracies of the various models have been tabulated.
As is obvious from the table, the best results are obtained from the bulky ensemble of models and their student alternatives must be used only in case of constrained resources.
What are the challenges with Knowledge Distillation?
A framework for visual question answering. Source: Mun et al. 2018, fig. 2.
In its original form, KD is limited to classification tasks that use a softmax layer. Sometimes the assumptions are too strict, such as in FitNets where student models may not suit constrained deployment environments. Other approaches to model compression may therefore be preferred over KD.
However, KD continues to be a promising area of research. In 2017, it was adapted for multiclass object detection. In 2018, KD was applied to construct specialized student models for visual question answering. Also in 2018, Guo et al. improved the robustness of the student network so that it resists perturbations.
In some domains such as healthcare, DNNs are not preferred. Decision trees are preferred since their predictions can be more easily interpreted. KD has been used to distil a DNN into a decision tree and thereby provide both good performance and interpretability.

Milestones

1989

Hanson and Pratt propose network pruning using biased weight decay. They call their pruned networks minimal networks. In the early 1990s, other pruning methods such as optimal brain damage and optimal brain surgeon are proposed. These are early approaches to compress a neural network model. Knowledge distillation as an alternative is invented about two decades later.

2006

Buciluǎ et al. publish a paper titled Model Compression. They present a method for “compressing” large, complex ensembles into smaller, faster models, usually without significant loss in performance. They use the ensemble to label large unlabelled datasets. They then use this labelled data to train a single model that performs as well as the ensemble. Because it's not easy to obtain large sets of unlabelled data, they develop an algorithm called MUNGE to generate pseudo data. This work is limited to shallow networks.

2014

Ba and Caruana propose a teacher-student learning method. They show that shallow models can perform as well as deep models. A complex teacher network (either a deep network or an ensemble) is trained. Instead of using the softmax output, the logits are used to train the shallow student network. Thus, the student network benefits from what the teacher network has learned without losing information via the softmax layer. Some call this softened softmax.

2014

Hinton et al. introduce the idea of passing on the 'dark' knowledge from an ensemble of models into a lighter, deployable model. In a paper published in March 2015, they explain that they're "distilling knowledge" from the complex model. The core idea is that models should generalize well to new data rather than optimize on training data. Instead of matching logits directly, they use distillation, in which the softmax is computed at a higher temperature to produce "soft targets" for the student. They note that matching logits is a special case of distillation.
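The claim that logit matching is a special case can be sketched as follows, along the lines of the argument in Hinton et al. The symbols $v_i$ (teacher logits), $p_i$ (teacher soft targets), $N$ (number of classes) and $C$ (cross-entropy between $p$ and the student's softened output $q$) are introduced here for the derivation; $z_i$, $q_i$ and $T$ are as in the softmax formula in the Discussion. The gradient of $C$ with respect to a student logit is

$$\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)$$

When $T$ is large compared with the logits, $e^{x/T} \approx 1 + x/T$, so

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right)$$

Assuming the logits of each network have zero mean for each example, $\sum_j z_j = \sum_j v_j = 0$, this simplifies to

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{NT^2}(z_i - v_i)$$

Up to the constant factor $1/(NT^2)$, this is the gradient of $\frac{1}{2}(z_i - v_i)^2$. In the high-temperature limit, distillation therefore amounts to minimizing the squared difference between student and teacher logits.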

2015

FitNets aim to produce a student network that's thinner and deeper than the teacher network. In addition to the teacher's distilled knowledge of the final softmax layer, FitNets also make use of intermediate-level hints from the hidden layers. Yim et al. propose a variation of this in 2017 by distilling knowledge from the inner product of features of two layers.

2018

Furlanello et al. show that student models parameterized similarly to teacher models can outperform the latter. They call these Born-Again Networks (BANs), where model compression is not the goal. Students are trained to predict correct labels and to match the teacher's output distribution (knowledge distillation).

2019

Researchers at the Indian Institute of Science, Bangalore, propose Zero-Shot Knowledge Distillation (ZSKD), in which they don't use the teacher's training dataset or a transfer dataset for distillation. Instead, they synthesize pseudo data from the teacher's model parameters. They call these Data Impressions (DI). The pseudo data is then used as a transfer dataset to perform distillation. Another research group, with the aim of reducing training, shows that just 1% of the training data can be adequate.

2019

Park et al. look at the mutual relationships among data samples and transfer this knowledge to the student network. Called Relational Knowledge Distillation (RKD), this departs from the conventional approach of looking at individual samples. Liu et al. propose something similar, calling it Instance Relationship Graph (IRG). Attention networks are another approach to distilling relationships.

Sep 2019

Yuan et al. note that conventional KD can be reversed; that is, the teacher can also learn from the student. Another observation is that a poorly trained teacher can still improve the student. They see KD not just as similarity across categories but also as a regularization of soft targets. With this understanding, they propose Teacher-free Knowledge Distillation (Tf-KD) in which a student model learns from itself.

Sample Code

python

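import os
# Enumerate GPUs in PCI bus order so that device indices are stable.
os.environ['CUDA_DEVICE_ORDER'] = "PCI_BUS_ID"
# Assumption: use the first GPU. Adjust the index for your machine.
os.environ['CUDA_VISIBLE_DEVICES'] = "0"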
 
import tensorflow as tf
import numpy as np
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D,MaxPooling2D,Dropout,Flatten,Dense,Activation,Lambda
from tensorflow.keras.optimizers import SGD
 
(train_data,train_labels),(test_data,test_labels) = keras.datasets.cifar10.load_data()
train_data = train_data.astype('float32')
test_data = test_data.astype('float32')
train_data = train_data/255
test_data = test_data/255
train_labels = keras.utils.to_categorical(train_labels.astype('float32'))
test_labels = keras.utils.to_categorical(test_labels.astype('float32'))
 
def swish(x):
   beta = 1.5 
   return beta * x * keras.backend.sigmoid(x)
 
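def new_softmax(logits, temperature=1):
   # Softmax with temperature, applied independently to each row (sample).
   logits = logits/temperature
   # Subtract the row-wise maximum for numerical stability.
   exp = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
   # Normalize per sample, not over the whole batch.
   return exp/np.sum(exp, axis=-1, keepdims=True)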
 
print(train_data.shape)
 
#teacher
model = Sequential()
model.add(Conv2D(32,(3,3),activation=swish, kernel_initializer='he_uniform', padding='same', input_shape=(32,32,3)))
model.add(Conv2D(32, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model.add(Conv2D(64, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(128, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model.add(Conv2D(128, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation=swish, kernel_initializer='he_uniform'))
model.add(Dropout(0.2))
model.add(Dense(10, name='logits'))
model.add(Activation('softmax'))
model.summary()
 
# 'lr' is a deprecated alias; current Keras uses 'learning_rate'.
opt = SGD(learning_rate=0.001, momentum=0.9)
 
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_data,train_labels,epochs=50)
(loss,accuracy) = model.evaluate(test_data,test_labels)
print(loss, accuracy)
 
model_sans_softmax = keras.models.Model(inputs=model.input, outputs = model.get_layer('logits').output)
new_logits = model_sans_softmax.predict(train_data)
unsoftened_prob = new_softmax(new_logits, 1)
print("Unsoftened probabilities " + str(unsoftened_prob[0]))
temperature = 4
softened_prob = new_softmax(new_logits, temperature)
print("Softened probabilities " + str(softened_prob[0]))
 
#student
model1 = Sequential()
model1.add(Conv2D(32, (3,3), activation=swish, kernel_initializer='he_uniform', padding='same', input_shape=(32,32,3)))
model1.add(Conv2D(32, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model1.add(MaxPooling2D((2, 2)))
model1.add(Dropout(0.2))
model1.add(Conv2D(8,(3,3),activation=swish,kernel_initializer='he_uniform', padding='same'))
model1.add(MaxPooling2D((4,4)))
model1.add(Conv2D(4,(3,3),activation=swish,kernel_initializer='he_uniform', padding='same'))
model1.add(Conv2D(8,(3,3),activation=swish,kernel_initializer='he_uniform', padding='same'))
model1.add(Conv2D(64, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model1.add(MaxPooling2D((2, 2)))
model1.add(Dropout(0.2))
model1.add(Conv2D(128, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model1.add(Conv2D(128, (3, 3), activation=swish, kernel_initializer='he_uniform', padding='same'))
model1.add(MaxPooling2D((2, 2)))
model1.add(Dropout(0.2))
model1.add(Flatten())
model1.add(Dense(128, activation=swish, kernel_initializer='he_uniform'))
model1.add(Dropout(0.2))
model1.add(Dense(10, name='logits'))
model1.add(Activation('softmax'))
model1.summary()
logits = model1.get_layer('logits').output
logits = Lambda(lambda x:x/temperature)(logits)
out = Activation('softmax',name='soft')(logits)
 
new_student = keras.models.Model(inputs=model1.input,outputs=out)
new_student.summary()
 
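# Use a fresh optimizer for the student rather than reusing the teacher's.
new_student.compile(optimizer=SGD(learning_rate=0.001, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the student on the teacher's softened probabilities only.
new_student.fit(train_data,softened_prob,epochs=100)
# Dividing logits by the temperature does not change the argmax,
# so accuracy can still be measured against the true labels.
(loss,accuracy) = new_student.evaluate(test_data,test_labels)
print(loss, accuracy)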

Cite As

Devopedia. 2020. "Knowledge Distillation." Version 11, July 24. Accessed 2023-11-12. https://devopedia.org/knowledge-distillation
