Probabilistic Neural Network
A Probabilistic Neural Network (PNN) is a feedforward neural network, that is, one in which connections between nodes don't form a cycle. It's a classifier that estimates the probability density function (PDF) of each class from a given set of data, and from that the probability of a sample being part of a learned category. Machine learning engineers use PNN for classification and pattern recognition tasks. A PNN solves classification problems with a statistical, memory-based approach that can be supervised or unsupervised.
PNN builds on conventional probability theory, namely Bayesian classification and estimators of probability density functions, to construct a neural network for classification. The widespread use of PNN originated from the use of kernel functions for discriminant analysis and pattern recognition.
Discussion

What is the architecture of a PNN? The PNN architecture has four layers:
 Input Layer: \(p\) neurons represent the input vector and distribute it to the next layer. \(p\) equals the number of input features.
 Pattern Layer: This layer applies the kernel to the input. It organizes the learning set by representing each training vector by a hidden neuron that records the features of this vector. During inference, each neuron calculates the Euclidean distance between the input test vector and the training sample, then applies the radial basis kernel function. In this way, it encodes the PDF centered on each training sample or pattern. For class \(j\), we have \(n_j\) neurons, for \(j\) in \([1,m]\).
 Summation Layer: This layer computes the average of the output of the pattern units for each class. There's one neuron for each class. Each class neuron is connected to all neurons in the pattern layer of that class.
 Output Layer: This layer selects the maximum value from the summation layer, and the associated class label is determined accordingly.
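The four layers above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pnn_classify`, the dictionary layout `patterns_by_class` and the default `sigma` are assumptions made for the example, and the Gaussian form of the radial basis kernel is one common choice.

```python
import math

def pnn_classify(x, patterns_by_class, sigma=0.1):
    """Classify x. patterns_by_class maps a class label to its training vectors."""
    # Input layer: x is passed on unchanged; its length is the number of features p.
    scores = {}
    for label, patterns in patterns_by_class.items():
        # Pattern layer: one Gaussian kernel unit per stored training vector.
        activations = [
            math.exp(-math.dist(x, w) ** 2 / (2.0 * sigma ** 2))
            for w in patterns
        ]
        # Summation layer: one neuron per class averages that class's units.
        scores[label] = sum(activations) / len(activations)
    # Output layer: the class with the largest summation value wins.
    return max(scores, key=scores.get), scores

# Hypothetical toy data: two classes with two training vectors each.
patterns = {"a": [(0.0, 0.0), (0.1, 0.1)], "b": [(1.0, 1.0), (0.9, 1.1)]}
label, scores = pnn_classify((0.2, 0.1), patterns, sigma=0.3)
print(label, scores)
```

Note that nothing is fitted here: the stored vectors themselves act as the weights of the pattern layer.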

Could you explain PNN with a simple example? Consider the task of classifying the letters O, X, and I. The characters can be in uppercase or lowercase. We consider two features: length and area of each character. Consequently, the training set will have 6 letters (O, o, X, x, I, i). Each training data point will be identified with a (length, area) value. For example, O(0.5, 0.7), o(0.2, 0.5), X(0.8, 0.8), x(0.4, 0.5), I(0.6, 0.3) and i(0.3, 0.2).
The input layer of the PNN will have two neurons, one for each feature, that is, one node for length and one for area.
We have three classes. Each class has two patterns in the pattern layer, one for uppercase and one for lowercase. For example, for class O there are two subtypes (O,o). In total, the pattern layer has six neurons.
The summation layer averages the pattern-layer outputs of each class, and the output layer picks the maximum value, thereby selecting one of the classes O, X or I.
An advantage of PNN is that there is no backpropagation training. New pattern units can be added without additional time overhead: a new training vector is simply stored as a new pattern neuron.
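The letter example can be worked through numerically. The sketch below stores the six (length, area) points from the text as pattern units and shows that adding a pattern is only an append. The class name `LetterPNN`, the value `sigma=0.1`, the query points and the extra pattern `(0.62, 0.32)` are assumptions chosen for illustration.

```python
import math

class LetterPNN:
    def __init__(self, sigma=0.1):
        self.sigma = sigma
        self.patterns = {}  # class label -> list of (length, area) vectors

    def add_pattern(self, label, vector):
        # "Training" only stores the vector as a new pattern unit.
        self.patterns.setdefault(label, []).append(vector)

    def scores(self, x):
        out = {}
        for label, vectors in self.patterns.items():
            acts = [math.exp(-math.dist(x, v) ** 2 / (2 * self.sigma ** 2))
                    for v in vectors]
            out[label] = sum(acts) / len(acts)  # summation layer: class average
        return out

    def predict(self, x):
        s = self.scores(x)
        return max(s, key=s.get)  # output layer: largest class average

pnn = LetterPNN(sigma=0.1)
for label, vec in [("O", (0.5, 0.7)), ("O", (0.2, 0.5)),
                   ("X", (0.8, 0.8)), ("X", (0.4, 0.5)),
                   ("I", (0.6, 0.3)), ("I", (0.3, 0.2))]:
    pnn.add_pattern(label, vec)

before = pnn.predict((0.55, 0.65))   # query near the stored O(0.5, 0.7)
pnn.add_pattern("I", (0.62, 0.32))   # hypothetical new sample, no retraining
after = pnn.predict((0.6, 0.3))      # query near the stored I(0.6, 0.3)
print(before, after)
```

With these values the first query falls closest to O(0.5, 0.7) and the second to I(0.6, 0.3), so the class averages favour O and I respectively.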

What are the concepts from which PNN was derived? Parzen window density estimation, also called Kernel Density Estimation (KDE), is a nonparametric density estimation technique. It's used to derive a density function \(f(x)\) from the training samples. When we have a new sample \(x\) and need to compute its likelihood, \(f(x)\) is used: it takes the sample value and returns the density estimate at that point. This doesn't require any knowledge about the underlying distribution and is also used for classification.
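As an illustration of the Parzen window idea, the sketch below estimates \(f(x)\) in one dimension with a Gaussian kernel. The function name `parzen_density`, the window width `h` and the sample values are assumptions made for the example.

```python
import math

def parzen_density(x, samples, h):
    """Gaussian-kernel Parzen estimate of f(x) from 1-D samples; h is the window width."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

# Hypothetical samples forming two clusters, around 1 and around 3.
samples = [1.0, 1.2, 0.8, 3.0, 3.1]
print(parzen_density(1.0, samples, h=0.3))  # inside a cluster
print(parzen_density(2.0, samples, h=0.3))  # in the gap between clusters
```

Each sample contributes one small Gaussian bump, and the estimate is their average. No distribution family is assumed for the data.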
Parzen windows are seen as a generalization of k-Nearest Neighbour (KNN) techniques. Rather than choosing the k nearest neighbours of a test point and labelling the test point with the weighted majority of its neighbours' votes, one can consider all points in the voting scheme and assign their weights using a kernel function.
KNN is a nonparametric algorithm based on supervised learning. It is used for classification and regression. The KNN algorithm assumes that similar things exist in close proximity. It considers the k nearest neighbours (data points) to predict the class or continuous value for a new data point.
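The difference between the two voting schemes can be sketched as follows. The function names `knn_vote` and `parzen_vote` and the toy data are assumptions for illustration. For simplicity the KNN vote here is an unweighted majority.

```python
import math
from collections import Counter

def knn_vote(x, data, k):
    """Majority label among the k nearest (vector, label) pairs."""
    nearest = sorted(data, key=lambda item: math.dist(x, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def parzen_vote(x, data, sigma):
    """Every point votes, weighted by a Gaussian kernel of its distance to x."""
    weights = Counter()
    for vec, label in data:
        weights[label] += math.exp(-math.dist(x, vec) ** 2 / (2 * sigma ** 2))
    return weights.most_common(1)[0][0]

# Hypothetical labelled points.
data = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
        ((1.0, 1.0), "b"), ((1.1, 0.9), "b"), ((0.9, 1.2), "b")]
print(knn_vote((0.1, 0.1), data, k=3))
print(parzen_vote((0.1, 0.1), data, sigma=0.3))
```

In `knn_vote` the hard cut-off at k decides who votes. In `parzen_vote` distant points still vote, but their weight decays smoothly with distance.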

How does PNN work? The input layer transmits the characteristics of the sample to the network, specifically to the pattern layer. The number of input neurons is the same as the number of dimensions of the sample.
In the pattern layer, the Euclidean distance between the input feature vector \(X\) and each radial center \(x_{ij}\) matches the input against the various types in the training set. Here, \(X=[x_1, x_2, \dots, x_d]^T\), \(d\) is the dimension of the feature vector, \(x_{ij}\) is the \(j\)th center of the \(i\)th class, \(l\) is the number of pattern neurons (one per training sample), and σ is a smoothing factor. The pattern layer covers \(m\) different classes. Among the \(l\) neurons, each one belongs to exactly one class.
\(v_i\) is the output for class \(i\) in the summation layer. \(L\) is the number of class \(i\) neurons. The type corresponding to the maximum output in the summation layer is the output type of the output layer, given by \(\mathrm{Type}(v_i) = \arg\max_i(v_i)\).
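In the usual Gaussian-kernel formulation, the three computations described above are commonly written as follows. The symbol \(\Phi_{ij}\) for the output of the \(j\)th pattern neuron of class \(i\) is introduced here for illustration. The normalization constant and the factor 2 in the exponent follow one common convention and are assumptions; some papers absorb them into σ.

\[
\Phi_{ij}(X) = \frac{1}{(2\pi)^{d/2}\,\sigma^{d}} \exp\left(-\frac{\lVert X - x_{ij}\rVert^{2}}{2\sigma^{2}}\right),
\qquad
v_i = \frac{1}{L}\sum_{j=1}^{L}\Phi_{ij}(X),
\qquad
\mathrm{Type}(X) = \arg\max_{i}\, v_i
\]

Because the normalization constant is the same for every class when a single σ is shared, it does not change which class attains the maximum.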

How do I select the right smoothing parameter for a PNN? Particularly when the training dataset is limited, the performance of a PNN depends on the right selection of the smoothing parameter σ. A small σ creates a multimodal distribution. A larger σ leads to interpolation between points. With a very large σ, the estimate approaches a Gaussian PDF. Intuitively, σ should depend on the density of the samples.
The simplest technique is to use the standard deviation of training samples for each dimension or feature. Cross-validation (training vs validation datasets) gives better generalization. Clustering is another technique. In gap-based estimation, Zhong et al. improved on these techniques by modelling the distances between a training sample and its neighbours. They estimated σ per input feature, noting that estimating σ per feature per class is not as good.
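Cross-validation for σ can be sketched with a leave-one-out loop over a small grid of candidates. This is only an illustration under simple assumptions: the helper names `pnn_predict`, `loo_accuracy` and `select_sigma`, the candidate grid and the toy data are all hypothetical, and a single σ is shared by all features.

```python
import math

def pnn_predict(x, train, sigma):
    """Plain PNN decision over (vector, label) pairs with a Gaussian kernel."""
    sums, counts = {}, {}
    for vec, label in train:
        k = math.exp(-math.dist(x, vec) ** 2 / (2 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + k
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda c: sums[c] / counts[c])

def loo_accuracy(data, sigma):
    """Leave-one-out accuracy: classify each sample using all the others."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += pnn_predict(vec, rest, sigma) == label
    return hits / len(data)

def select_sigma(data, candidates):
    """Return the candidate with the best leave-one-out accuracy (first one on ties)."""
    return max(candidates, key=lambda s: loo_accuracy(data, s))

# Hypothetical, well-separated toy data.
data = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
        ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
print(select_sigma(data, [0.05, 0.2, 1.0, 5.0]))
```

On real data the accuracy curve over σ is usually not flat, and a finer grid or a separate validation set may be preferable to leave-one-out.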
Genetic algorithms have been used to estimate σ. In the R language, the pnn package uses a genetic algorithm from the rgenoud package to estimate σ. Kusy and Zajdel studied three techniques from reinforcement learning: Q(0)-learning, Q(λ)-learning, and stateless Q-learning. Results were similar to state-of-the-art performance of alternative approaches.

What are some of the variations of the traditional PNN? Enhanced PNN (EPNN) uses Local Decision Circles (LDCs) that enable incorporation of local information and non-homogeneity existing in the training population. The circle has a radius that limits the contribution of the local decision. The two Bayesian rules used by EPNN are: (a) A global rule that estimates the conditional probability of each class, given an input vector of data considering all training data and using the spread parameter. (b) A local rule that estimates the conditional probability of each class, given an input vector of data existing within a decision circle, considering only the training data.
Competitive PNN (CPNN) adds novel competitive features to EPNN to utilize data most critical to the classification process. A competitive layer ranks kernels for each class and an optimum fraction of kernels is selected to estimate the class-conditional probability.
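The competitive step can be illustrated roughly as below. This is a simplified sketch, not the published CPNN: it applies the ranking idea to plain Gaussian kernels rather than to EPNN, it ranks kernels by their activation for the current input, and the name `cpnn_scores` and the parameter `fraction` are assumptions.

```python
import math

def cpnn_scores(x, patterns_by_class, sigma, fraction):
    """Average only the top-ranked fraction of kernel activations per class."""
    scores = {}
    for label, patterns in patterns_by_class.items():
        # Competitive layer: rank this class's kernels, strongest first.
        acts = sorted(
            (math.exp(-math.dist(x, w) ** 2 / (2 * sigma ** 2)) for w in patterns),
            reverse=True,
        )
        keep = max(1, math.ceil(fraction * len(acts)))  # always keep at least one kernel
        scores[label] = sum(acts[:keep]) / keep
    return scores

# Hypothetical data: class "a" contains one far-away pattern at (3, 3).
patterns = {"a": [(0.0, 0.0), (0.1, 0.1), (3.0, 3.0)],
            "b": [(1.0, 1.0), (1.1, 0.9)]}
full = cpnn_scores((0.05, 0.05), patterns, sigma=0.3, fraction=1.0)  # plain PNN average
half = cpnn_scores((0.05, 0.05), patterns, sigma=0.3, fraction=0.5)
print(full["a"], half["a"])
```

With `fraction=1.0` the function reduces to the ordinary class average. A smaller fraction stops distant patterns of the same class from diluting the score.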
Supervised Learning PNN (SLPNN) has three kinds of network parameters that are adjusted through training: variable weights representing the importance of input variables, reciprocal of kernel radius representing the effective range of data, and data weights representing the data reliability.

What are some specializations of PNN? Researchers often attempt to change the PNN's structure so that the pattern layer becomes more practical to implement even with a large number of training samples. The type of data is also a motivation to specialize. We note a few of these.
Zaknich et al. applied PNN to time series analysis. Their architecture was modified so that the current output depends on the current input, the preceding five inputs and the following five inputs. Thus, their input vector had 11 coefficients. Tested on sinusoidal signals, attenuated and then corrupted with noise, the PNN produced a smoothed output. Performance improved when the model contained more classes.
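The 11-coefficient input vector described above can be built with a sliding window. The sketch shows only the windowing step, not the modified network. The function name `sliding_windows` and the test signal are assumptions.

```python
def sliding_windows(signal, before=5, after=5):
    """Return (center_index, window) pairs; each window has before + 1 + after samples."""
    windows = []
    # Only positions with enough samples on both sides get a window.
    for t in range(before, len(signal) - after):
        windows.append((t, signal[t - before : t + after + 1]))
    return windows

signal = list(range(20))          # hypothetical signal: 0, 1, ..., 19
windows = sliding_windows(signal)
print(len(windows[0][1]))         # 11
```

Because each window needs five future samples, the output at time t is available only after sample t + 5 has arrived.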
Interval PNN is a PNN that classifies interval data. As is common in practical applications, the training dataset may be accurate but test data contains less precise measurements. This imprecision is handled by the model using intervals.

What are the applications based on PNN? PNN's main applications include classifying labelled stationary data patterns or patterns with time-varying PDF. In signal processing, PNN considers waveforms as patterns and thereby recognizes specific events and their severity. In one example, PNN was used to recognize 11 types of disturbances in power quality using waveforms of voltage magnitude, frequency, and phase.
PNN is applied to pattern recognition problems such as character/object/face recognition. PNN brings flexibility, straightforward design and minimal training time.
For text-independent speaker identification, PNN provides good results in matching the speaker for each input vector. To increase the success rate, multiple input vectors from each sample are needed.
PNN has been applied to ship identification. Even when a noisy image is used as input to the network, PNN performs well. The covariance matrix of the discrete wavelet transform of the ship image is used as input.
PNN is used to overcome the computational complexity involved in performing sensor configuration management in a wireless ad-hoc network.

What are the pros and cons of PNN? A PNN avoids the extensive training computation time associated with networks that use backpropagation. Instead, each data pattern is represented with a unit that measures the similarity of the input pattern to the data patterns. PNN learns from the training data instantaneously. With this speed of learning, PNN can adapt its learning in real time, deleting or adding training data as new conditions arise. Additionally, PNNs are relatively insensitive to outliers and are guaranteed to approach the Bayes optimal classifier as the size of the representative training set increases.
However, PNN has its limitations. Because there's one hidden node for each training instance, inference needs more computational resources (storage and time). Additionally, with a very big hidden layer, classification accuracy and speed usually decrease.

Which are the main research approaches to improve PNN performance? To reduce the expensive computation time and storage requirements of a full Parzen window classifier, a weighted-Parzen window classifier is an option. A clustering procedure finds a set of reference vectors and weights that approximate the Parzen window (kernel estimator) classifier. For clustering, even the k-means algorithm can be applied. The rationale is that not all patterns contain original, independent, and discriminating information. Thus, clustering can reduce the number of neurons in the pattern layer.
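A rough sketch of the clustering idea: run k-means separately on each class and keep the cluster centers, with their relative sizes as weights, in place of the original pattern units. The names `kmeans` and `reduce_pattern_layer`, the deterministic seeding with the first k points, and the toy data are assumptions. This is not the specific procedure of any cited paper.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; the first k points seed the centers (deterministic)."""
    centers = [tuple(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    weights = [len(g) / len(points) for g in groups]  # share of the class in each cluster
    return centers, weights

def reduce_pattern_layer(patterns_by_class, k):
    """Replace each class's patterns by at most k reference vectors with weights."""
    return {label: kmeans(pts, min(k, len(pts)))
            for label, pts in patterns_by_class.items()}

# Hypothetical data: 7 pattern units reduced to 4 reference vectors.
patterns = {"a": [(0, 0), (0.1, 0), (1, 1), (1.1, 1)],
            "b": [(5, 5), (5.2, 5), (6, 6)]}
reduced = reduce_pattern_layer(patterns, k=2)
print(reduced["a"])
```

At inference, each reference vector would contribute its kernel value multiplied by its weight, so a center that stands for many original patterns counts for more.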
When input samples have too many features, Principal Component Analysis (PCA) can be used to reduce the number of features.
To optimize and adapt the spread parameters, the heteroscedastic PNN (hetero: different, skedasis: dispersion) allows the kernels of each class to have their own spread parameter matrix.
The issues of data heterogeneity and noisy datasets are addressed by EPNN, which uses Local Decision Circles (LDCs) to modify the spread parameter of each training vector, and bi-level optimization to find the optimal values of the spread parameter and the radius of the LDCs.
Milestones
Parzen discusses the problem of estimating a PDF and the problem of determining the mode of a PDF. He relates the problem of estimating a PDF to the problem of estimating the spectral density function of a stationary time series, and the problem of estimating the mode of a PDF to the problem of maximum likelihood estimation of a parameter.
In an attempt to perform classification for pattern recognition, Specht uses a Bayes strategy to transform the problem into one of estimating PDFs for each of the possible categories on the basis of the training samples available. This is accomplished with an estimator which (a) is shown to be consistent (tends to be identical with the true density in the limit as the number of training samples is increased to infinity) and (b) can be expressed in terms of a polynomial, the coefficients of which can be computed on a one-pattern-at-a-time basis.
Specht introduces the term Probabilistic Neural Network for a neural network that replaces the sigmoid activation function often used in neural networks with an exponential function. A PNN can compute nonlinear decision boundaries which approach the Bayes optimal. Architecturally, the neural network is designed to have four layers that can map any input pattern to any number of classifications. This technique offers a tremendous speed advantage for problems in which the incremental adaptation time of backpropagation is a significant fraction of the total computation time.
To compensate for PNN's lack of robustness to affine transformations of the feature space, which leads to poor performance on certain data, a weighted PNN (WPNN) is derived. This allows anisotropic Gaussians, i.e. Gaussians whose covariance is not a multiple of the identity matrix.
Mao et al. propose two improvements to the PNN: select a suitable smoothing parameter using a genetic algorithm and then select a representative set of pattern layer neurons from the training samples using the Forward Regression Orthogonal Algorithm. Similar research was published independently by Chen et al. in September 1999.
A modified PNN for brain tissue segmentation with MRI is proposed. Here, covariance matrices are used to replace the singular smoothing factor in the PNN's kernel function, and weighting factors are added in the pattern of summation layer. This weighted PNN (WPNN) classifier can account for partial volume effects that exist commonly in MRI, not only in the final result stage, but also in the modelling process.
An enhanced and generalized PNN (EPNN) is proposed using local decision circles (LDCs) to overcome a shortcoming of PNN, namely that it doesn't consider probable local densities or heterogeneity in training data. EPNN also improves PNN's robustness to noise in data.
A Supervised Learning PNN (SLPNN) is proposed with three kinds of network parameters that can be adjusted through training. The SLPNN is slightly more accurate than MLP and much more accurate than PNN.
Considering that spread has a great influence on PNN's performance, a self-adaptive PNN (SaPNN) is proposed. In this, spread is self-adaptively adjusted and selected, and the best selected spread is then used to guide SaPNN training and testing. SaPNN gives more accurate predictions and better generalization performance than the basic PNN.
A modified PNN (MPNN) is introduced as an extension of PNN, with weight coefficients introduced between the pattern and summation layers of the model. These weights are calculated using a sensitivity analysis procedure. MPNN improves the prediction ability of the PNN classifier.
A Competitive PNN (CPNN) is presented wherein a competitive layer ranks kernels for each class and an optimum fraction of kernels is selected to estimate the class-conditional probability. The performance of CPNN is found to be greater than or equivalent to that of traditional PNN.
References
 Ahmadlou, Mehran, and Hojjat Adeli. 2010. "Enhanced Probabilistic Neural Network with local decision circles: A robust classifier." Integrated Computer-Aided Engineering, August. Accessed 2022-01-14.
 Araghi, Leila Fallah, Hamid Khaloozadeh, and M R Arvan. 2009. "Ship Identification Using Probabilistic Neural Networks." Volume 2, Proceedings of the International MultiConference of Engineers and Computer Scientists, March. Accessed 2022-01-14.
 Babich, G A, and O I Camps. 1996. "Weighted Parzen windows for pattern classification." Volume 18, Issue 5, IEEE Transactions on Pattern Analysis and Machine Intelligence, May 01. Accessed 2022-01-13.
 Boutin, Prof. 2008. "Statistical Pattern Recognition and Decision Making Processes." ECE662 Slecture, Project Rhea. Accessed 2022-01-25.
 Brown, Michael. 1999. "Parzen Windows." Compbio UCSC. Accessed 2022-01-14.
 Chasset, Pierre-Olivier. 2019a. "pnn-package: PNN." In pnn: Probabilistic neural networks, R Package Documentation, CRAN, via rdrr.io, May 2. Accessed 2022-02-15.
 Chasset, Pierre-Olivier. 2019b. "smooth: Smooth." In pnn: Probabilistic neural networks, R Package Documentation, CRAN, via rdrr.io, May 2. Accessed 2022-02-15.
 Elmary, Ibrahiem M M El, and Ramakrishnan Srinivasan. 2008. "On the applications of various probabilistic neural networks in solving different pattern classification problems." World Applied Sciences Journal. Accessed 2022-01-26.
 Georgiou, Vasileios L, Philipos D Alevizos, and Michael N Vrahatis. 2008. "Novel Approaches to Probabilistic Neural Networks Through Bagging and Evolutionary Estimating of Prior Probabilities." Neural Processing Letters, Springer, April. Accessed 2022-01-14.
 Gutierrez-Osuna, Ricardo. 2010. "Probabilistic Neural Networks." CPSC 636: Neural Networks, Texas A&M University College of Engineering, Spring. Accessed 2022-02-16.
 Kowalski, Piotr A., and Piotr Kulczycki. 2017. "Interval probabilistic neural network." Neural Comput Appl, vol. 28, no. 4, pp. 817-834. Accessed 2022-02-15.
 Kusy, Maciej, and Piotr A Kowalski. 2016. "Modification of the Probabilistic Neural Network with the use of sensitivity analysis procedure." Volume 8, Proceedings of the Federated Conference on Computer Science and Information Systems. Accessed 2022-01-14.
 Kusy, Maciej, and Roman Zajdel. 2014. "Probabilistic neural network training procedure based on Q(0)-learning algorithm in medical data classification." Applied Intelligence, vol. 41, pp. 837-854. doi: 10.1007/s10489-014-0562-9. Accessed 2022-02-15.
 Kusy, Maciej, and Roman Zajdel. 2015. "Application of Reinforcement Learning Algorithms for the Adaptive Computation of the Smoothing Parameter for Probabilistic Neural Network." IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 9, pp. 2163-2175, September. doi: 10.1109/TNNLS.2014.2376703. Accessed 2022-02-15.
 Lassoued, Hela, Raouf Ketata, and Slim Yacoub. 2018. "ECG Decision Support System based on feedforward neural networks." Issue 1, Volume 11, International Journal on Smart Sensing and Intelligent Systems. Accessed 2022-01-26.
 Lotfi, Abdelhadi, and Abdelkader Benyettou. 2014. "A reduced probabilistic neural network for the classification of large databases." Turkish Journal of Electrical Engineering and Computer Sciences, July. Accessed 2022-01-26.
 Mao, K.Z., K.C. Tan, and W. Ser. 2000. "Probabilistic Neural-Network Structure Determination for Pattern Classification." IEEE Transactions on Neural Networks, vol. 11, no. 4, pp. 1009-1016, July. doi: 10.1109/72.857781. Accessed 2022-02-15.
 Mebane, Walter R. Jr., and Jasjeet S. Sekhon. 2011. "Genetic Optimization Using Derivatives: The rgenoud Package for R." J. of Statistical Software, vol. 42, no. 11. Accessed 2022-02-15.
 Mohebali, Behshad, Amirhessam Tahmassebi, Anke Meyer-Baese, and Amir H Gandomi. 2020. "Probabilistic neural networks: a brief overview of theory, implementation, and application." In: Pijush Samui, Dieu Tien Bui, Subrata Chakraborty, and Ravinesh C. Deo (eds), Handbook of Probabilistic Models, Butterworth-Heinemann, pp. 347-367. doi: 10.1016/B978-0-12-816514-0.00014-X. Accessed 2022-01-14.
 Montana, David. 1991. "A Weighted Probabilistic Neural Network." Advances in Neural Information Processing Systems. Accessed 2022-01-14.
 Mostafa, G M. 2017. "Pattern Recognition." Lecture 7: Non Parametric TechniquesSl, Slideshare, January 25. Accessed 2022-01-24.
 Naik, S.M., R.P.K. Jagannath, and V. Kuppili. 2020. "Estimation of the Smoothing Parameter in Probabilistic Neural Network Using Evolutionary Algorithms." Arab J Sci Eng, vol. 45, pp. 2945-2955. doi: 10.1007/s13369-019-04227-5. Accessed 2022-02-15.
 Parzen, Emanuel. 1962. "On Estimation of a probability density function and mode." Vol. 33, No. 3, The Annals of Mathematical Statistics, September. Accessed 2022-01-24.
 Patwardhan, Sai. 2021. "Simple Understanding and implementation of KNN algorithm." Analytics Vidhya, April 21. Accessed 2022-01-26.
 Sivanandam, S.N., and SN Deepa. 2011. "Principles of Soft Computing." 2nd Edition, Wiley India. Accessed 2022-01-14.
 Song, Tao, Mo. M. Jamshidi, Roland Robert Lee, and Ming-Xiong Huang. 2007. "A Modified Probabilistic Neural Network for Partial Volume Segmentation in Brain MR Image." IEEE Transactions on Neural Networks, October. Accessed 2022-01-13.
 Specht, Donald F. 1966. "Generation of Polynomial Discriminant Function for Pattern Recognition." University Microfilms Inc. Accessed 2022-01-15.
 Specht, Donald F. 1990. "Probabilistic Neural Networks." Volume 3, Issue 1, Pages 109-118, Neural Networks. Accessed 2022-01-15.
 Stevens, Thomas J, and Malur K Sundareshan. 2004. "Probabilistic Neural Network-based sensor configuration management in a wireless ad-hoc network." Citeseerx. Accessed 2022-01-26.
 Vinitha, K V, and G Santhosh Kumar. 2009. "Face Recognition using Probabilistic Neural Networks." World Congress on Nature and Biologically Inspired Computing, December. Accessed 2022-01-14.
 Xoax.net. 2009. "Neural Networks: Probabilistic Neural Networks." Lesson 2, Xoax.net. Accessed 2022-02-02.
 Yeh, I-Cheng, and Kuan-Cheng Lin. 2011. "Supervised Learning Probabilistic Neural Networks." Neural Processing Letters, October. Accessed 2022-01-14.
 Yi, Jiao-Hong, Jian Wang, and Gai-Ge Wang. 2016. "Improved Probabilistic Neural Networks with self-adaptive strategies for transformer fault diagnosis problem." Volume 8, Advances in Mechanical Engineering. Accessed 2022-01-14.
 Zaknich, A., C.J.S. deSilva, and Y. Attikiouzel. 1991. "A modified probabilistic neural network (PNN) for nonlinear time series analysis." Proc. of IEEE International Joint Conference on Neural Networks, vol. 2, pp. 1530-1535. doi: 10.1109/IJCNN.1991.170617. Accessed 2022-02-16.
 Zeinali, Yasha, and Brett Story. 2017. "Competitive Probabilistic Neural Network." DOI:10.3233/ICA-170540, Integrated Computer-Aided Engineering, January. Accessed 2022-01-14.
 Zhang, Wei, Xiaohui Yang, Yeheng Deng, and Anyi Li. 2020. "An Inspired Machine Learning Algorithm with a Hybrid Whale Optimization for Power Transformer PHM." Energies, June 17. Accessed 2022-01-25.
 Zhong, Mingyu, Dave Coggeshall, Ehsan Ghaneie, Thomas Pope, Mark Rivera, Michael Georgiopoulos, Georgios C. Anagnostopoulos, Mansooreh Mollaghasemi, and Samuel Richie. 2007. "Gap-Based Estimation: Choosing the Smoothing Parameters for Probabilistic and General Regression Neural Networks." Neural Computation, MIT, vol. 19, pp. 2840-2864. doi: 10.1162/neco.2007.19.10.2840. Accessed 2022-02-15.
Further Reading
 Zeinali, Yasha, and Brett Story. 2017. "Competitive Probabilistic Neural Network." DOI:10.3233/ICA-170540, Integrated Computer-Aided Engineering, January. Accessed 2022-01-14.
 Ahmadlou, Mehran, and Hojjat Adeli. 2010. "Enhanced Probabilistic Neural Network with local decision circles: A robust classifier." Integrated Computer-Aided Engineering, August. Accessed 2022-01-14.
 Yeh, I-Cheng, and Kuan-Cheng Lin. 2011. "Supervised Learning Probabilistic Neural Networks." Neural Processing Letters, October. Accessed 2022-01-14.
See Also
 Artificial Neural Network
 Probabilistic Graphical Model
 Probability for Data Scientists
 Probability Distributions
 Machine Learning Model
 Supervised vs Unsupervised Learning