Probabilistic Neural Network

Article Info

Contributed by
2 authors

Last updated on
2022-02-16 07:23:51

A Probabilistic Neural Network (PNN) is a feed-forward neural network in which connections between nodes don't form a cycle. It's a classifier that can estimate the probability density function of a given set of data. PNN estimates the probability of a sample being part of a learned category. Machine learning engineers use PNN for classification and pattern recognition tasks. A PNN is designed to solve classification problems by using a statistical memory-based approach that can be supervised or unsupervised.

The probabilistic neural net is based on the idea of conventional probability theory, such as Bayesian classification and other estimators for probability density functions, to construct a neural net for classification. The widespread use of PNN originated from the usage of kernel functions for discriminant analysis and pattern recognition.

Discussion

What is the architecture of a PNN?
PNN Architecture. Source: Mohebali et al. 2020, fig. 14.4.
The PNN architecture has four layers:
- Input Layer: \(p\) neurons represent the input vector and distribute it to the next layer. \(p\) equals the number of input features.
- Pattern Layer: This layer applies the kernel to the input. It organizes the learning set by representing each training vector by a hidden neuron that records the features of this vector. During inference, each neuron calculates the Euclidean distance between the input test vector and the training sample, then applies the radial basis kernel function. In this way, it encodes the PDF centered on each training sample or pattern. For class \(j\), we have \(n_j\) neurons, for \(j\) in \([1,m]\).
- Summation Layer: This layer computes the average of the output of the pattern units for each class. There's one neuron for each class. Each class neuron is connected to all neurons in the pattern layer of that class.
- Output Layer: This layer selects the maximum value from the summation layer, and the associated class label is determined accordingly.
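The four layers above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function name `pnn_classify`, the toy training set, and the value of `sigma` are all assumptions made for the example, and a Gaussian is assumed as the radial basis kernel.

```python
import math
from collections import defaultdict


def pnn_classify(x, patterns, sigma):
    """Classify x with a minimal PNN.

    patterns: list of (feature_vector, class_label) pairs, one per pattern neuron.
    sigma: smoothing parameter of the Gaussian (radial basis) kernel.
    """
    # Input layer: x is simply passed on to every pattern neuron.
    # Pattern layer: squared Euclidean distance, then Gaussian kernel.
    activations = defaultdict(list)
    for center, label in patterns:
        dist_sq = sum((a - b) ** 2 for a, b in zip(x, center))
        activations[label].append(math.exp(-dist_sq / (2 * sigma ** 2)))

    # Summation layer: one neuron per class averages that class's activations.
    class_scores = {label: sum(v) / len(v) for label, v in activations.items()}

    # Output layer: pick the class with the largest score.
    return max(class_scores, key=class_scores.get), class_scores


# Hypothetical two-class training set with p = 2 features.
patterns = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.2), "B"), ((1.1, 0.8), "B"),
]
label, scores = pnn_classify((0.1, 0.2), patterns, sigma=0.3)
print(label, scores)
```

Note that class A has \(n_A = 2\) pattern neurons and class B has \(n_B = 3\); averaging in the summation layer keeps the larger class from winning merely by having more neurons.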
Could you explain PNN with a simple example?
Explaining PNN with an example. Source: Xoax.net 2009.
Consider the task of classifying the letters O, X, and I. The characters can be in uppercase or lowercase, so the training set has six letters (O, o, X, x, I, i). We consider two features: the length and the area of each character. Each training data point is therefore identified by a (length, area) value. For example, O(0.5,0.7), o(0.2,0.5), X(0.8,0.8), x(0.4,0.5), I(0.6,0.3) and i(0.3,0.2).
The input layer of the PNN will have two neurons, one for each feature, that is, one node for length and one for area.
We have three classes. Each class has two patterns in the pattern layer, one for uppercase and one for lowercase. For example, for class O there are two subtypes (O,o). In total, the pattern layer has six neurons.
The summation layer calculates the average of the pattern-layer outputs for each class. The output layer picks the maximum value, thereby selecting one of the classes O, X or I.
An advantage of PNN is that there is no back-propagation training. New pattern units can be added with almost no time overhead, since adding a pattern simply means storing the new training sample as a neuron in the pattern layer.
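The letter example can be tried out with the six (length, area) points given above. The sketch below is illustrative only: the class name `TinyPNN`, the two query points, and the choice `sigma=0.1` are assumptions, as is the Gaussian kernel.

```python
import math


class TinyPNN:
    """Pattern layer stored as a dict: class label -> list of training vectors."""

    def __init__(self, sigma):
        self.sigma = sigma
        self.patterns = {}

    def add_pattern(self, features, label):
        # "Training" is just storing the sample as a new pattern neuron.
        self.patterns.setdefault(label, []).append(features)

    def classify(self, x):
        scores = {}
        for label, centers in self.patterns.items():
            kernels = [
                math.exp(-math.dist(x, c) ** 2 / (2 * self.sigma ** 2))
                for c in centers
            ]
            scores[label] = sum(kernels) / len(kernels)  # summation layer
        return max(scores, key=scores.get)  # output layer


net = TinyPNN(sigma=0.1)  # sigma is an assumed value
letters = {
    "O": [(0.5, 0.7), (0.2, 0.5)],  # O, o
    "X": [(0.8, 0.8), (0.4, 0.5)],  # X, x
    "I": [(0.6, 0.3), (0.3, 0.2)],  # I, i
}
for label, samples in letters.items():
    for s in samples:
        net.add_pattern(s, label)

print(net.classify((0.75, 0.75)))  # query close to uppercase X
print(net.classify((0.3, 0.25)))   # query close to lowercase i
```

Calling `add_pattern` again at any time extends the pattern layer; nothing else in the network has to be recomputed.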
What are the concepts from which PNN was derived?
Parzen window and KNN example. Source: Mostafa 2017, fig. 4.2.
The Parzen window density estimation, or the Kernel Density Estimation (KDE), is a non-parametric density estimation technique. It's used to derive a density function \(f(x)\). When we have a new training sample \(x\) and there's a need to compute the value of the likelihoods, \(f(x)\) is used. \(f(x)\) takes the sample input data value and returns the density estimate of the given data sample. This doesn't require any knowledge about the underlying distribution and is also used for classification.
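As a rough illustration of Parzen window estimation, the sketch below places a Gaussian window on each observation and averages them to obtain \(f(x)\). The sample values, the window width `sigma=0.4`, and the function name `parzen_density` are assumptions chosen for the example.

```python
import math


def parzen_density(x, samples, sigma):
    """Gaussian Parzen-window estimate of f(x) from 1-D samples."""
    norm = 1.0 / (len(samples) * sigma * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-((x - s) ** 2) / (2 * sigma ** 2)) for s in samples)


samples = [1.0, 1.2, 1.9, 2.1, 2.2, 5.0]  # hypothetical observations
for x in (2.0, 3.5, 5.0):
    print(x, round(parzen_density(x, samples, sigma=0.4), 4))
```

The estimate is high where observations are dense (around 2.0) and low in the gap around 3.5, without any assumption about the shape of the underlying distribution.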
Parzen windows are seen as a generalization of k-Nearest Neighbour (KNN) techniques. Rather than choosing k nearest neighbours of a test point and labelling the test point with the weighted majority of its neighbours' votes, one can consider all points in the voting scheme and assign their weights by using kernel function.
KNN is a non-parametric algorithm based on supervised learning. This is used for classification and regression. The KNN algorithm assumes that similar things exist in close proximity. It considers k nearest neighbours (data points) to predict the class or continuous value for the new data point.
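The difference between the two voting schemes can be seen in a small sketch. The data points, the query, and the names `knn_vote` and `parzen_vote` are hypothetical; a Gaussian kernel is assumed for the Parzen weights.

```python
import math
from collections import Counter

# Hypothetical labelled 2-D points.
train = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.8, 1.3), "red"),
    ((3.0, 3.0), "blue"), ((3.2, 2.9), "blue"),
]


def knn_vote(x, train, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def parzen_vote(x, train, sigma):
    """Every training point votes, weighted by a Gaussian kernel."""
    votes = Counter()
    for point, label in train:
        votes[label] += math.exp(-math.dist(x, point) ** 2 / (2 * sigma ** 2))
    return max(votes, key=votes.get)


query = (2.6, 2.5)
print(knn_vote(query, train, k=3))
print(knn_vote(query, train, k=5))
print(parzen_vote(query, train, sigma=0.5))
```

With `k=5` every point gets an equal vote and the three distant red points outvote the two nearby blue ones. The kernel-weighted vote also counts all five points but discounts the distant ones, so the nearby blue points decide.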
How does PNN work?
PNN explained with equations. Source: Adapted from Zhang et al. 2020, fig. 1.
The input layer transmits the features of the sample to the pattern layer. The number of input neurons is the same as the number of dimensions of the sample.
For the pattern layer, the Euclidean distance between the feature vector \(X\) of the sample and the radial center \(x_{ij}\) realizes matching between the input feature vector and the various types in the training set. Here, \(X=[x_1, x_2, \ldots, x_n]^T\), for \(n\) in \([1 .. l]\), and \(l\) represents all types of training, \(d\) is the dimension of the feature vector, \(x_{ij}\) is the j-th center of the i-th type of training sample, and \(\sigma\) is a smoothing factor. The pattern layer also shows \(m\) different classes. Among the \(l\) neurons, each one belongs to exactly one class.
\(v_i\) is the output for class \(i\) in the summation layer. \(L\) is the number of class \(i\) neurons. The type corresponding to the maximum output in the summation layer is the output type of the output layer, given by \(Type(v_i) = \arg\max(v_i)\).
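For reference, the commonly used Gaussian-kernel form of these computations can be written with the symbols defined above. The symbol \(\Phi_{ij}\) for the output of a pattern neuron is introduced here for illustration, and the normalization constant shown is the standard multivariate Gaussian one; the cited figure may use a slightly different constant, which doesn't affect the arg max as long as all classes share the same \(\sigma\).

\[
\Phi_{ij}(X) = \frac{1}{(2\pi)^{d/2}\,\sigma^{d}} \exp\!\left(-\frac{\lVert X - x_{ij}\rVert^{2}}{2\sigma^{2}}\right), \qquad
v_i = \frac{1}{L}\sum_{j=1}^{L}\Phi_{ij}(X), \qquad
Type = \arg\max_{i} v_i
\]

Intuitively, each pattern neuron reports how close the input is to its stored center, the summation neuron averages these reports for its class, and the output layer picks the class whose stored patterns are, on average, closest to the input.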
How do I select the right smoothing parameter for a PNN?
Two datasets with different optimal σ values. Source: Naik et al. 2020, fig. 6.
Particularly when the training dataset is limited, performance of a PNN depends on the right selection of the smoothing parameter σ. Small σ creates a multimodal distribution. Larger σ leads to interpolation between points. Very large σ approaches Gaussian PDF. Intuitively, σ should depend on the density of the samples.
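The effect of σ on the shape of the estimated density can be checked numerically. The sketch below counts the peaks of a one-dimensional Gaussian kernel estimate built from two training samples. The sample positions, the grid, and the helper names `kde` and `count_modes` are assumptions for illustration.

```python
import math


def kde(x, samples, sigma):
    """Unnormalized Gaussian kernel sum; enough to inspect the shape."""
    return sum(math.exp(-((x - s) ** 2) / (2 * sigma ** 2)) for s in samples)


def count_modes(samples, sigma):
    """Count strict local maxima of the estimate on a fixed grid."""
    grid = [i / 100 for i in range(-200, 301)]  # -2.00 .. 3.00
    f = [kde(x, samples, sigma) for x in grid]
    return sum(1 for i in range(1, len(f) - 1) if f[i - 1] < f[i] > f[i + 1])


samples = [0.0, 1.0]  # two hypothetical training samples of one class
for sigma in (0.1, 0.3, 1.0):
    print(sigma, count_modes(samples, sigma))
```

With a small σ each sample keeps its own peak (a multimodal estimate), while a large σ merges them into a single smooth bump.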
The simplest technique is to use the standard deviation of training samples for each dimension or feature. Cross-validation (training vs validation datasets) gives better generalization. Clustering is another technique. In gap-based estimation, Zhong et al. improved on these techniques by modelling the distances between a training sample and its neighbours. They estimated σ per input feature, noting that estimating σ per feature per class is not as good.
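One simple way to apply cross-validation is a leave-one-out search over a few candidate values of σ. The sketch below assumes this particular variant; the XOR-like dataset, the candidate list, and the function names are hypothetical and chosen so that an overly large σ visibly hurts.

```python
import math


def classify(x, patterns, sigma):
    """Plain PNN decision: average Gaussian kernel per class, then arg max."""
    sums, counts = {}, {}
    for center, label in patterns:
        k = math.exp(-math.dist(x, center) ** 2 / (2 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + k
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda c: sums[c] / counts[c])


def loo_accuracy(patterns, sigma):
    """Leave-one-out accuracy: hold out each sample and classify it with the rest."""
    hits = 0
    for i, (x, label) in enumerate(patterns):
        rest = patterns[:i] + patterns[i + 1:]
        hits += classify(x, rest, sigma) == label
    return hits / len(patterns)


# Hypothetical XOR-like data: class A on one diagonal, class B on the other.
data = [
    ((0.0, 0.0), "A"), ((0.1, 0.1), "A"), ((1.0, 1.0), "A"), ((0.9, 0.9), "A"),
    ((0.0, 1.0), "B"), ((0.1, 0.9), "B"), ((1.0, 0.0), "B"), ((0.9, 0.1), "B"),
]
candidates = [0.1, 0.5, 2.0]  # assumed search grid
for sigma in candidates:
    print(sigma, loo_accuracy(data, sigma))
best_sigma = max(candidates, key=lambda s: loo_accuracy(data, s))
print("selected sigma:", best_sigma)
```

Very small candidates should be treated with care in a sketch like this, because all kernel values can underflow to zero and the decision then becomes arbitrary.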
Genetic algorithms have been used to estimate σ. In the R language, the pnn package uses a genetic algorithm from the rgenoud package to estimate σ.
Kusy and Zajdel studied three techniques from reinforcement learning: Q(0)-learning, Q(λ)-learning, and stateless Q-learning. Results were similar to state-of-the-art performance of alternative approaches.
What are some of the variations of the traditional PNN?
Enhanced PNN (EPNN) uses Local Decision Circles (LDCs) that enable incorporation of local information and non-homogeneity existing in the training population. The circle has a radius that limits the contribution of the local decision. The two Bayesian rules used by EPNN are: (a) A global rule that estimates the conditional probability of each class, given an input vector of data considering all training data and using the spread parameter. (b) A local rule that estimates the conditional probability of each class, given an input vector of data existing within a decision circle, considering only the training data.
Competitive PNN (CPNN) adds novel competitive features to EPNN to utilize data most critical to the classification process. A competitive layer ranks kernels for each class and an optimum fraction of kernels is selected to estimate the class-conditional probability.
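The idea of ranking kernels and keeping only a fraction of them can be sketched as follows. This is only a loose illustration of that one idea, not the published CPNN algorithm: the function name `competitive_score`, the way the fraction is applied, and the toy data are assumptions, and the local decision circles of EPNN are left out.

```python
import math


def competitive_score(x, centers, sigma, fraction):
    """Average only the top `fraction` of a class's kernel values."""
    kernels = sorted(
        (math.exp(-math.dist(x, c) ** 2 / (2 * sigma ** 2)) for c in centers),
        reverse=True,
    )
    keep = max(1, math.ceil(fraction * len(kernels)))
    return sum(kernels[:keep]) / keep


# Hypothetical class with two typical patterns and one far-away pattern.
centers = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
x = (0.05, 0.0)
print(competitive_score(x, centers, sigma=0.5, fraction=1.0))  # all kernels
print(competitive_score(x, centers, sigma=0.5, fraction=0.5))  # top-ranked kernels only
```

With `fraction=1.0` the score reduces to the plain PNN average, so the far-away pattern drags the class score down; keeping only the top-ranked kernels removes that effect.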
Supervised Learning PNN (SLPNN) has three kinds of network parameters that are adjusted through training: variable weights representing the importance of input variables, reciprocal of kernel radius representing the effective range of data, and data weights representing the data reliability.
What are some specializations of PNN?
Researchers often attempt to change the PNN's structure so that it becomes more practical to implement the pattern layer even with a large number of training samples. The type of data is also a motivation to specialize. We note a few of these.
Zaknich et al. applied PNN for time series analysis. Their architecture was modified so that current output depends on current input, preceding five inputs and following five inputs. Thus, their input vector had 11 coefficients. Tested on sinusoidal signals, attenuated and then corrupted with noise, the PNN produced a smoothed output. Performance improved when the model contained more classes.
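The 11-coefficient input vector can be built with a sliding window over the signal. The sketch below shows only this windowing step; the function name `windowed_inputs` and the placeholder signal are assumptions, and samples too close to either end of the signal are simply skipped here, which may differ from how the original work handled the edges.

```python
def windowed_inputs(signal, half_width=5):
    """Build one input vector per sample: half_width past, current, half_width future."""
    size = 2 * half_width + 1
    return [signal[i:i + size] for i in range(len(signal) - size + 1)]


signal = [float(n) for n in range(15)]  # placeholder samples
vectors = windowed_inputs(signal)
print(len(vectors), len(vectors[0]))  # number of windows, coefficients per window
print(vectors[0][5])                  # centre (current) element of the first window
```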
Interval PNN is a PNN that classifies interval data. As is common in practical applications, the training dataset may be accurate but test data contains less precise measurements. This imprecision is handled by the model using intervals.
What are the applications based on PNN?
Ship Identification using PNN. Source: Araghi et al. 2009.
PNN's main applications include classifying labelled stationary data patterns or patterns with time-varying PDF. In signal processing, PNN considers waveforms as patterns and thereby recognizes specific events and their severity. In one example, PNN was used to recognize 11 types of disturbances in power quality using waveforms of voltage magnitude, frequency, and phase.
PNN is applied to pattern recognition problems such as character/object/face recognition. PNN brings flexibility, straightforward design and minimal training time.
For text-independent speaker identification, PNN provides good results in matching speaker for each input vector. To increase the success rate, multiple input vectors from each sample are needed.
The figure shows PNN applied to ship identification. Even when a noisy image is used as input to the neural network, PNN performs well. The covariance matrix of the discrete wavelet transform of the ship image is used as input.
PNN is used for overcoming the computational complexity involved in performing sensor configuration management in a wireless ad-hoc network.
What are the pros and cons of PNN?
Comparison of MLP, RBF, and PNN. Source: Lassoued et al. 2018.
A PNN avoids the extensive training computation associated with networks that use back-propagation. Instead, each data pattern is represented by a unit that measures the similarity of the input pattern to that data pattern. PNN learns from the training data instantaneously. With this speed of learning, PNN can adapt its learning in real time, deleting or adding training data as new conditions arise. Additionally, PNNs are relatively insensitive to outliers and are guaranteed to approach the Bayes optimal classifier as the size of the representative training set increases.
However, PNN has its limitations. Because there's one hidden node for each training instance, more computational resources (storage and time) are needed during inference. Additionally, with a very big hidden layer, classification accuracy and speed usually decrease.
Which are the main research approaches to improve PNN performance?
To reduce expensive computational times and storage requirements of a full Parzen window classifier, a weighted-Parzen window classifier is an option. A clustering procedure is used to find a set of reference vectors and weights that are used to approximate the Parzen window (kernel estimator) classifier. For clustering, even k-means algorithm can be applied. The basis is that not all the patterns contain original, independent, and discriminating information. Thus, clustering can reduce the number of neurons in the pattern layer.
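The clustering idea can be sketched for a single class as follows. K-means replaces six stored patterns with two reference vectors, and each reference vector is weighted by the share of patterns it represents. The weighting rule, the naive initialization, the toy data, and the function names are assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm; centres start at the first k points (an arbitrary choice)."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    weights = [len(members) / len(points) for members in clusters]
    return centers, weights


def weighted_parzen_score(x, centers, weights, sigma):
    """Class score from a few weighted reference vectors instead of all patterns."""
    return sum(
        w * math.exp(-math.dist(x, c) ** 2 / (2 * sigma ** 2))
        for c, w in zip(centers, weights)
    )


# Hypothetical patterns of one class, forming two tight groups.
patterns = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0), (2.1, 2.0), (2.0, 2.1)]
centers, weights = kmeans(patterns, k=2)

x, sigma = (0.05, 0.05), 0.5
full = sum(math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in patterns) / len(patterns)
reduced = weighted_parzen_score(x, centers, weights, sigma)
print(centers, weights)
print(full, reduced)
```

In this toy case the two-neuron approximation gives nearly the same class score as the six-neuron pattern layer, which is the redundancy that clustering exploits.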
When input samples have too many features, Principal Component Analysis (PCA) can be used to reduce the number of features.
To optimize the spread parameters, a heteroscedastic PNN (hetero: different, skedasis: dispersion) allows the kernels of each class to have their own spread parameter matrix.
The issues of data heterogeneity and noisy datasets are addressed by EPNN. It implements Local Decision Circles (LDCs) to modify the spread parameter of each training vector, and uses bi-level optimization to find the optimal values of the spread parameter and the radius of LDCs.

Milestones

1962

Parzen discusses the problem of estimating a PDF and the problem of determining the mode of a PDF. He relates the problem of estimating a PDF to the problem of estimating the spectral density function of a stationary time series. He also notes that estimating the mode of a PDF is similar to maximum likelihood estimation of a parameter.

1966

In an attempt to perform classification for pattern recognition, Specht uses a Bayes strategy to transform the problem into one of estimating PDFs for each of the possible categories on the basis of the training samples available. This is accomplished with an estimator which (a) is shown to be consistent (tends to be identical with the true density in the limit as the number of training samples is increased to infinity) and (b) can be expressed in terms of a polynomial, the coefficients of which can be computed on a one-pattern-at-a-time basis.

1990

Specht introduces the term Probabilistic Neural Network, for a neural network that replaces the sigmoid activation function often used in neural networks with an exponential function. A PNN can compute nonlinear decision boundaries which approach the Bayes optimal. Architecturally, the neural network is designed to have four layers that can map any input pattern to any number of classifications. This technique offers a tremendous speed advantage for problems in which the incremental adaptation time of back propagation is a significant fraction of the total computation time.

1991

To compensate for PNN's lack of robustness with respect to affine transformations of feature space, which leads to poor performance on certain data, a weighted PNN (WPNN) is derived. This allows anisotropic Gaussians, that is, Gaussians whose covariance is not a multiple of the identity matrix.

2000

Mao et al. propose two improvements to the PNN: select a suitable smoothing parameter using a genetic algorithm and then select a representative set of pattern layer neurons from the training samples using the Forward Regression Orthogonal Algorithm. Similar research was published independently by Chen et al. in September 1999.

2007

A modified PNN for brain tissue segmentation with MRI is proposed. Here, covariance matrices are used to replace the singular smoothing factor in the PNN's kernel function, and weighting factors are added in the pattern of summation layer. This weighted PNN (WPNN) classifier can account for partial volume effects that exist commonly in MRI, not only in the final result stage, but also in the modelling process.

2010

An enhanced and generalized PNN (EPNN) is proposed using local decision circles (LDCs) to overcome the shortcoming of PNN wherein it doesn't consider probable local densities or heterogeneity in training data. Also, EPNN improves PNN's robustness to noise in data.

2011

A Supervised Learning PNN (SLPNN) is proposed with three kinds of network parameters that can be adjusted through training. The SLPNN is slightly more accurate than MLP and much more accurate than PNN.

2016

Considering that spread has a great influence on PNN's performance, a self-adaptive PNN (SaPNN) is proposed. In this, spread is self-adaptively adjusted and selected, and the best selected spread is then used to guide SaPNN training and testing. SaPNN gives more accurate predictions and better generalization performance compared to the basic PNN.

2016

A modified PNN (MPNN) is introduced which is an extension of PNN with the weight coefficients introduced between pattern and summation layer of the model. These weights are calculated by using the sensitivity analysis procedure. MPNN improves the prediction ability of the PNN classifier.

2017

A Competitive PNN (CPNN) is presented wherein a competitive layer ranks kernels for each class and an optimum fraction of kernels is selected to estimate the class-conditional probability. CPNN's performance is found to be better than or equivalent to that of traditional PNN.

Sample Code

R-PNN

# Source: https://rdrr.io/cran/pnn/man/pnn-package.html
# Accessed 2022-02-16
 
library(pnn)
data(norms)
 
# The long way
pnn <- learn(norms)
pnn <- smooth(pnn, sigma=0.9)
pnn$sigma
## Not run: pnn <- perf(pnn) # Optional
## Not run: pnn$success_rate # Optional
guess(pnn, c(1,1))
guess(pnn, c(2,1))
guess(pnn, c(1.5,1))
 
# The short way
guess(smooth(learn(norms), sigma=0.8), c(1,1))
guess(smooth(learn(norms), sigma=0.8), c(2,1))
guess(smooth(learn(norms), sigma=0.8), c(1.5,1))
 
# Demonstrations
## Not run: demo("norms-trainingset", "pnn")
## Not run: demo("small-trainingset", "pnn")

Cite As

Devopedia. 2022. "Probabilistic Neural Network." Version 9, February 16. Accessed 2023-11-13. https://devopedia.org/probabilistic-neural-network
