# Probabilistic Neural Network

A Probabilistic Neural Network (PNN) is a feed-forward neural network in which connections between nodes don't form a cycle. It's a classifier that can estimate the probability density function of a given set of data. PNN estimates the probability of a sample being part of a learned category. Machine learning engineers use PNN for classification and pattern recognition tasks. A PNN is designed to solve classification problems by using a statistical memory-based approach that can be supervised or unsupervised.

The PNN builds a neural network for classification from conventional probability theory, specifically Bayesian classification and estimators of probability density functions (PDFs). Its widespread use originated from the use of kernel functions for discriminant analysis and pattern recognition.

## Discussion

• What is the architecture of a PNN?

The PNN architecture has four layers:

• Input Layer: $$p$$ neurons represent the input vector and distribute it to the next layer. $$p$$ equals the number of input features.
• Pattern Layer: This layer applies the kernel to the input. It organizes the learning set by representing each training vector by a hidden neuron that records the features of this vector. During inference, each neuron calculates the Euclidean distance between the input test vector and the training sample, then applies the radial basis kernel function. In this way, it encodes the PDF centered on each training sample or pattern. For class $$j$$, we have $$n_j$$ neurons, for $$j$$ in $$[1,m]$$.
• Summation Layer: This layer computes the average of the output of the pattern units for each class. There's one neuron for each class. Each class neuron is connected to all neurons in the pattern layer of that class.
• Output Layer: This layer selects the maximum value from the summation layer, and the associated class label is determined accordingly.

• Could you explain PNN with a simple example?

Consider the task of classifying the letters O, X, and I. The characters can be in uppercase or lowercase. We consider two features: length and area of each character. Consequently, the training set will have 6 letters (O,o,X,x,I,i). Each training data point will be identified with a (length, area) value. For example, O(0.5,0.7), o(0.2,0.5), X(0.8,0.8), x(0.4,0.5), I(0.6,0.3) and i(0.3,0.2).

The input layer of the PNN will have two neurons, one for each feature, that is, one node for length and one for area.

We have three classes. Each class has two patterns in the pattern layer, one for uppercase and one for lowercase. For example, for class O there are two subtypes (O,o). In total, the pattern layer has six neurons.

The summation layer will average the pattern-layer outputs for each class, and the output layer will pick the maximum value, thereby determining the suitable class: O, X or I.

An advantage of PNN is that there is no back-propagation training. New pattern units can be added without additional time overhead, since no training is needed; it is automatic.
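
The letter example can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the Gaussian kernel, the value σ = 0.1, the query point (0.55, 0.65) and all function names are assumptions made here. Only the six training points come from the example above.

```python
import math

# Training patterns from the example: (length, area) per letter, grouped by class.
TRAINING = {
    "O": [(0.5, 0.7), (0.2, 0.5)],  # O, o
    "X": [(0.8, 0.8), (0.4, 0.5)],  # X, x
    "I": [(0.6, 0.3), (0.3, 0.2)],  # I, i
}


def gaussian_kernel(x, center, sigma):
    """Pattern unit: radial basis response to the squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def classify(x, training, sigma=0.1):
    """Return (predicted class, per-class summation-layer outputs)."""
    # Summation layer: average the pattern-unit outputs of each class.
    scores = {
        label: sum(gaussian_kernel(x, c, sigma) for c in centers) / len(centers)
        for label, centers in training.items()
    }
    # Output layer: pick the class with the largest average.
    return max(scores, key=scores.get), scores


label, scores = classify((0.55, 0.65), TRAINING)
print(label, scores)

# "Training" on a new pattern is just storing it as one more pattern unit.
TRAINING["I"].append((0.5, 0.25))
```

The last line shows why no retraining is needed: adding a pattern only extends the stored list that the pattern layer iterates over.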

• What are the concepts from which PNN was derived?

The Parzen window density estimation, also known as Kernel Density Estimation (KDE), is a non-parametric density estimation technique. It's used to derive a density function $$f(x)$$. When we have a new sample $$x$$ and need to compute the likelihoods, $$f(x)$$ is used: it takes the sample value and returns the density estimate at that value. This doesn't require any knowledge about the underlying distribution, and it is also used for classification.
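
As a rough sketch of the idea, the estimate $$f(x)$$ can be computed as the average of identical windows centred on the observed samples. The Gaussian window, the width `h` and the sample values below are assumptions chosen for illustration.

```python
import math


def parzen_density(x, samples, h):
    """Estimate f(x) as the average of Gaussian windows of width h centred on the samples."""
    norm = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / len(samples)


samples = [1.0, 1.2, 0.9, 3.0, 3.1]  # made-up 1-D observations
for x in (1.0, 2.0, 3.0):
    print(x, round(parzen_density(x, samples, h=0.3), 3))
```

The estimate is high near the observed samples and low between the two groups, without any assumption about the shape of the underlying distribution.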

Parzen windows are seen as a generalization of k-Nearest Neighbour (KNN) techniques. Rather than choosing the k nearest neighbours of a test point and labelling the test point with the weighted majority of its neighbours' votes, one can consider all points in the voting scheme and assign their weights using a kernel function.

KNN is a non-parametric algorithm based on supervised learning. This is used for classification and regression. The KNN algorithm assumes that similar things exist in close proximity. It considers k nearest neighbours (data points) to predict the class or continuous value for the new data point.
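
The difference between the two voting schemes can be illustrated with a small sketch. The data points, the Gaussian kernel, σ = 0.5 and the function names are all made up for this illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up labelled 2-D points.
POINTS = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((3.0, 3.0), "b"),
          ((3.2, 2.9), "b"), ((2.9, 3.3), "b")]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_vote(x, points, k=3):
    """Label x by majority vote among its k nearest neighbours."""
    nearest = sorted(points, key=lambda item: sq_dist(x, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def kernel_vote(x, points, sigma=0.5):
    """Let every point vote, weighted by a Gaussian kernel of its distance to x."""
    weights = defaultdict(float)
    for p, label in points:
        weights[label] += math.exp(-sq_dist(x, p) / (2.0 * sigma ** 2))
    return max(weights, key=weights.get)


query = (1.4, 1.3)
print(knn_vote(query, POINTS, k=3), knn_vote(query, POINTS, k=5), kernel_vote(query, POINTS))
```

With k = 5 every point gets an equal vote, so the larger but distant class wins. The kernel-weighted vote also uses every point, yet distant points contribute almost nothing.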

• How does PNN work?

The input layer transmits the characteristics of the sample to the network, specifically the pattern layer. The number of input neurons is the same as the number of dimensions of the sample.

For the pattern layer, the Euclidean distance between the input feature vector $$X$$ and the radial center $$x_{ij}$$ measures how well the input matches each pattern in the training set. Here, $$X=[x_1, x_2, \ldots, x_d]^T$$, $$d$$ is the dimension of the feature vector, $$l$$ is the total number of training samples and hence of pattern neurons, $$x_{ij}$$ is the j-th center (training sample) of class $$i$$, and σ is a smoothing factor. The pattern layer covers $$m$$ different classes. Among the $$l$$ neurons, each one belongs to exactly one class.

$$v_i$$ is the output for class $$i$$ in the summation layer. $$L$$ is the number of class $$i$$ neurons. The type corresponding to the maximum output in the summation layer is the output type of the output layer, given by $$\text{Type}(v_i) = \arg\max_i (v_i)$$.
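
The expressions for the two layers are not reproduced in the text above. As a hedged reconstruction, assuming the Gaussian (radial basis) kernel usually associated with PNNs, they can be written as follows. The name $$\phi_{ij}$$ is introduced here for the output of the j-th pattern neuron of class $$i$$; the remaining symbols are the ones defined above.

```latex
% Pattern layer: Gaussian kernel centred on the j-th stored pattern of class i
\phi_{ij}(X) = \frac{1}{(2\pi)^{d/2}\,\sigma^{d}}
               \exp\!\left(-\frac{\lVert X - x_{ij}\rVert^{2}}{2\sigma^{2}}\right)

% Summation layer: average over the L pattern neurons of class i
v_i = \frac{1}{L}\sum_{j=1}^{L}\phi_{ij}(X)
```

When all classes share the same σ, the normalizing factor in front of the exponential is common to every class and doesn't affect which class attains the maximum.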

• How do I select the right smoothing parameter for a PNN?

Particularly when the training dataset is limited, the performance of a PNN depends on the right selection of the smoothing parameter σ. A small σ creates a multimodal distribution. A larger σ leads to interpolation between points. A very large σ makes the estimate approach a Gaussian PDF. Intuitively, σ should depend on the density of the samples.
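
One way to see this effect is to count the peaks of a one-dimensional kernel density estimate for different values of σ. The samples, the grid and the helper names below are assumptions chosen for illustration.

```python
import math


def density(x, samples, sigma):
    """Average of Gaussian kernels of width sigma centred on the samples."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((x - s) / sigma) ** 2) for s in samples) / len(samples)


def count_modes(samples, sigma, lo=-2.0, hi=8.0, steps=2000):
    """Count strict local maxima of the estimated density on a regular grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [density(x, samples, sigma) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


samples = [1.0, 2.0, 4.0, 5.0]  # made-up 1-D training samples
for sigma in (0.1, 0.6, 3.0):
    print(sigma, count_modes(samples, sigma))
```

For these samples, the smallest σ keeps one peak per training sample, the middle value merges neighbouring samples into two bumps, and the largest value leaves a single smooth bump.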

The simplest technique is to use the standard deviation of training samples for each dimension or feature. Cross-validation (training vs validation datasets) gives better generalization. Clustering is another technique. In gap-based estimation, Zhong et al. improved on these techniques by modelling the distances between a training sample and its neighbours. They estimated σ per input feature, noting that estimating σ per feature per class is not as good.
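
As a minimal sketch of the cross-validation approach, the snippet below scores a few candidate values of σ by leave-one-out accuracy and keeps the best one. The dataset, the candidate list and the function names are made up for illustration; this is not the gap-based method of Zhong et al.

```python
import math

# Made-up 1-D training set: class "a" lies on both sides of class "b".
DATA = [((0.0,), "a"), ((0.3,), "a"), ((4.0,), "a"), ((4.3,), "a"),
        ((1.9,), "b"), ((2.0,), "b"), ((2.2,), "b"), ((2.3,), "b")]


def pnn_predict(x, data, sigma):
    """Average the Gaussian kernel per class, then take the arg max."""
    sums, counts = {}, {}
    for center, label in data:
        sq = sum((a - b) ** 2 for a, b in zip(x, center))
        sums[label] = sums.get(label, 0.0) + math.exp(-sq / (2.0 * sigma ** 2))
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda lab: sums[lab] / counts[lab])


def loo_accuracy(data, sigma):
    """Leave-one-out accuracy: classify each sample using all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += pnn_predict(x, rest, sigma) == label
    return hits / len(data)


candidates = [0.1, 0.3, 1.0, 3.0, 10.0]
best_sigma = max(candidates, key=lambda s: loo_accuracy(DATA, s))
for s in candidates:
    print(s, loo_accuracy(DATA, s))
```

On this toy data the large values of σ blur the two groups of class "a" into the region of class "b", so the validation score drops and a smaller σ is preferred.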

Genetic algorithms have been used to estimate σ. In R, the pnn package uses a genetic algorithm from the rgenoud package to estimate σ.

Kusy and Zajdel studied three techniques from reinforcement learning: Q(0)-learning, Q(λ)-learning, and stateless Q-learning. Results were similar to state-of-the-art performance of alternative approaches.

• What are some of the variations of the traditional PNN?

Enhanced PNN (EPNN) uses Local Decision Circles (LDCs) that enable incorporation of local information and non-homogeneity existing in the training population. The circle has a radius that limits the contribution of the local decision. The two Bayesian rules used by EPNN are: (a) A global rule that estimates the conditional probability of each class, given an input vector of data considering all training data and using the spread parameter. (b) A local rule that estimates the conditional probability of each class, given an input vector of data existing within a decision circle, considering only the training data.

Competitive PNN (CPNN) adds novel competitive features to EPNN to utilize data most critical to the classification process. A competitive layer ranks kernels for each class and an optimum fraction of kernels is selected to estimate the class-conditional probability.

Supervised Learning PNN (SLPNN) has three kinds of network parameters that are adjusted through training: variable weights representing the importance of input variables, reciprocal of kernel radius representing the effective range of data, and data weights representing the data reliability.

• What are some specializations of PNN?

Researchers often change the PNN's structure so that the pattern layer becomes more practical to implement even with a large number of training samples. The type of data is another motivation to specialize. We note a few of these.

Zaknich et al. applied PNN for time series analysis. Their architecture was modified so that current output depends on current input, preceding five inputs and following five inputs. Thus, their input vector had 11 coefficients. Tested on sinusoidal signals, attenuated and then corrupted with noise, the PNN network produced a smoothened output. Performance improved when the model contained more classes.
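
The 11-coefficient input vector can be sketched as a centred sliding window over the series. How the original work treated the ends of the signal is not stated here, so this sketch simply skips positions that lack five samples on either side; the function name and the placeholder series are assumptions.

```python
def centered_windows(signal, half_width=5):
    """Yield (index, window): the current sample plus half_width samples before and after it."""
    size = 2 * half_width + 1
    for start in range(len(signal) - size + 1):
        center = start + half_width
        yield center, signal[start:start + size]


signal = [float(i) for i in range(15)]  # placeholder series
windows = list(centered_windows(signal))
print(len(windows), windows[0])
```

Each window would then be presented to the network as one input vector, with the output associated with the sample at the centre of the window.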

Interval PNN is a PNN that classifies interval data. As is common in practical applications, the training dataset may be accurate but test data contains less precise measurements. This imprecision is handled by the model using intervals.

• What are the applications based on PNN?

PNN's main applications include classifying labelled stationary data patterns or patterns with time-varying PDF. In signal processing, PNN considers waveforms as patterns and thereby recognizes specific events and their severity. In one example, PNN was used to recognize 11 types of disturbances in power quality using waveforms of voltage magnitude, frequency, and phase.

PNN is applied to pattern recognition problems such as character/object/face recognition. PNN brings flexibility, straightforward design and minimal training time.

For text-independent speaker identification, PNN provides good results in matching speaker for each input vector. To increase the success rate, multiple input vectors from each sample are needed.

The figure shows PNN applied to ship identification. PNN performs well even when a noisy image is used as input to the network. The covariance matrix of the discrete wavelet transform of the ship image is used as input.

PNN is used for overcoming the computational complexity involved in performing sensor configuration management in a wireless ad-hoc network.

• What are the pros and cons of PNN?

In a PNN, there's no extensive training computation time associated with networks that use back-propagation. Instead, each data pattern is represented with a unit that measures the similarity of the input pattern to the data patterns. PNN learns from the training data instantaneously. With this speed of learning, PNN has the capability to adapt its learning in real time, deleting or adding training data as new conditions arise. Additionally, PNNs are relatively insensitive to outliers and are guaranteed to approach the Bayes optimal classifier as the size of the representative training set increases.

However, PNN has its limitations. Because there's one hidden node for each training instance, more computational resources (storage and time) are needed during inference. Additionally, with a very big hidden layer, the performance of the system usually decreases in terms of both classification accuracy and speed.

• Which are the main research approaches to improve PNN performance?

To reduce expensive computational times and storage requirements of a full Parzen window classifier, a weighted-Parzen window classifier is an option. A clustering procedure is used to find a set of reference vectors and weights that are used to approximate the Parzen window (kernel estimator) classifier. For clustering, even k-means algorithm can be applied. The basis is that not all the patterns contain original, independent, and discriminating information. Thus, clustering can reduce the number of neurons in the pattern layer.
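
The following sketch illustrates the clustering idea for the patterns of a single class: k-means replaces the stored patterns with a few reference vectors, each weighted by the share of patterns it represents. It is a simplification made for illustration, not the published weighted-Parzen procedure; the data, the deterministic initialization and the function names are assumptions.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic start: the first k points are the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    counts = [assignment.count(c) for c in range(k)]
    return centroids, counts


def weighted_parzen(x, centroids, counts, sigma):
    """Approximate one class's kernel average using centroids weighted by cluster size."""
    total = sum(counts)
    return sum((n / total) * math.exp(-sq_dist(x, c) / (2.0 * sigma ** 2))
               for c, n in zip(centroids, counts))


# Made-up patterns of one class, forming two tight groups.
class_points = [(0.0, 0.0), (2.0, 2.0), (0.1, 0.0), (0.0, 0.1), (2.1, 2.0), (2.0, 2.1)]
centroids, counts = kmeans(class_points, k=2)
print(centroids, counts)
print(weighted_parzen((0.5, 0.5), centroids, counts, sigma=0.5))
```

Here six pattern neurons are replaced by two weighted ones, and the class score at the query point stays close to the full kernel average.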

When input samples have too many features, Principal Component Analysis (PCA) can be used to reduce the number of features.

For optimization and alteration of spread parameters in PNNs, in heteroscedastic PNN (hetero - different, skedasis - dispersion), the kernels of each class are allowed to have their own spread parameter matrix.
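
A simplified sketch of the heteroscedastic idea is shown below. Instead of a full spread matrix per class, it assumes a diagonal one, that is, one σ per feature per class. The kernels are normalized because the classes no longer share a common scale. The data, spread values and function names are made up for illustration.

```python
import math


def diag_gaussian(x, center, sigmas):
    """Normalized Gaussian kernel with a separate spread for each feature."""
    value = 1.0
    for xi, ci, s in zip(x, center, sigmas):
        value *= math.exp(-0.5 * ((xi - ci) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return value


def classify(x, patterns, spreads):
    """patterns: label -> list of centers; spreads: label -> per-feature sigmas."""
    scores = {
        label: sum(diag_gaussian(x, c, spreads[label]) for c in centers) / len(centers)
        for label, centers in patterns.items()
    }
    return max(scores, key=scores.get)


# Made-up data: class "tight" is compact, class "wide" is spread out.
patterns = {"tight": [(0.0, 0.0), (0.1, 0.1)], "wide": [(3.0, 3.0), (-3.0, 3.0)]}
spreads = {"tight": (0.2, 0.2), "wide": (2.0, 2.0)}
print(classify((1.0, 1.0), patterns, spreads))
print(classify((0.05, 0.05), patterns, spreads))
```

The query at (1.0, 1.0) is closer to the compact class in Euclidean terms, but it lies many spreads away from it, so the class with the wider kernels wins.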

The issues of data heterogeneity and noisy datasets are addressed by EPNN, which implements Local Decision Circles (LDCs) to modify the spread parameter of each training vector and uses bi-level optimization to find the optimal values of the spread parameter and the radius of the LDCs.

## Milestones

1962

Parzen discusses the problem of estimating a PDF and the problem of determining the mode of a PDF. He relates the problem of estimating a PDF to that of estimating the spectral density function of a stationary time series, and notes that estimating the mode of a PDF is similar to maximum likelihood estimation of a parameter.

1966

In an attempt to perform classification for pattern recognition, Specht uses a Bayes strategy to transform the problem into one of estimating PDFs for each of the possible categories on the basis of the training samples available. This is accomplished with an estimator which (a) is shown to be consistent (it tends to the true density in the limit as the number of training samples increases to infinity) and (b) can be expressed in terms of a polynomial, the coefficients of which can be computed on a one-pattern-at-a-time basis.

1990

Specht introduces the term Probabilistic Neural Network, for a neural network that replaces the sigmoid activation function often used in neural networks with an exponential function. A PNN can compute nonlinear decision boundaries which approach the Bayes optimal. Architecturally, the neural network is designed to have four layers that can map any input pattern to any number of classifications. This technique offers a tremendous speed advantage for problems in which the incremental adaptation time of back propagation is a significant fraction of the total computation time.

1991

To compensate for the PNN's lack of robustness to affine transformations of feature space, which leads to poor performance on certain data, a weighted PNN (WPNN) is derived. This allows anisotropic Gaussians, that is, Gaussians whose covariance is not a multiple of the identity matrix.

2000

Mao et al. propose two improvements to the PNN: select a suitable smoothing parameter using a genetic algorithm, and then select a representative set of pattern layer neurons from the training samples using the Forward Regression Orthogonal Algorithm. Similar research was published independently by Chen et al. in September 1999.

2007

A modified PNN for brain tissue segmentation with MRI is proposed. Here, covariance matrices are used to replace the singular smoothing factor in the PNN's kernel function, and weighting factors are added in the pattern of summation layer. This weighted PNN (WPNN) classifier can account for partial volume effects that exist commonly in MRI, not only in the final result stage, but also in the modelling process.

2010

An enhanced and generalized PNN (EPNN) is proposed using local decision circles (LDCs) to overcome the shortcoming of PNN wherein it doesn't consider probable local densities or heterogeneity in training data. Also, EPNN improves PNN's robustness to noise in data.

2011

A Supervised Learning PNN (SLPNN) is proposed with three kinds of network parameters that can be adjusted through training. The SLPNN is slightly more accurate than MLP and much more accurate than PNN.

2016

Considering that spread has a great influence on PNN's performance, a self-adaptive PNN (SaPNN) is proposed. Here, the spread is adjusted and selected self-adaptively, and the best selected spread is then used to guide the training and testing of the SaPNN. This SaPNN gives more accurate predictions and better generalization performance than the basic PNN.

2016

A modified PNN (MPNN) is introduced which is an extension of PNN with the weight coefficients introduced between pattern and summation layer of the model. These weights are calculated by using the sensitivity analysis procedure. MPNN improves the prediction ability of the PNN classifier.

2017

A Competitive PNN (CPNN) is presented wherein a competitive layer ranks kernels for each class and an optimum fraction of kernels is selected to estimate the class-conditional probability. The performance of CPNN is found to be greater than or equivalent to that of traditional PNN.

## Sample Code

• # Source: https://rdrr.io/cran/pnn/man/pnn-package.html
# Accessed 2022-02-16

library(pnn)
data(norms)

# The long way
pnn <- learn(norms)
pnn <- smooth(pnn, sigma=0.9)
pnn$sigma
## Not run: pnn <- perf(pnn) # Optional
## Not run: pnn$success_rate # Optional
guess(pnn, c(1,1))
guess(pnn, c(2,1))
guess(pnn, c(1.5,1))

# The short way
guess(smooth(learn(norms), sigma=0.8), c(1,1))
guess(smooth(learn(norms), sigma=0.8), c(2,1))
guess(smooth(learn(norms), sigma=0.8), c(1.5,1))

# Demonstrations
## Not run: demo("norms-trainingset", "pnn")
## Not run: demo("small-trainingset", "pnn")


## Cite As

Devopedia. 2022. "Probabilistic Neural Network." Version 9, February 16. Accessed 2022-10-09. https://devopedia.org/probabilistic-neural-network