Neurocomputing 138 (2014) 106–113


A new online data imputation method based on general regression auto associative neural network

Vadlamani Ravi*, Mannepalli Krishna

Centre of Excellence in CRM and Analytics, Institute for Development and Research in Banking Technology (IDRBT), Castle Hills Road #1, Masab Tank, Hyderabad 500 057, AP, India

* Corresponding author. Tel.: +91 402 329 4042; fax: +91 402 353 5157. E-mail addresses: [email protected] (V. Ravi), [email protected] (M. Krishna).


Article history: Received 4 August 2013; received in revised form 25 January 2014; accepted 7 February 2014; available online 8 April 2014. Communicated by Dr. Swagatam Das.

Abstract

In this paper we propose online, offline and semi-online data imputation models based on four auto associative neural networks. The online model employs mean imputation followed by the general regression auto associative neural network (GRAANN). The offline methods comprise mean imputation followed by a particle swarm optimization based auto associative neural network (PSOAANN) and mean imputation followed by a particle swarm optimization based auto associative wavelet neural network (PSOAAWNN); the semi-online method involves mean imputation followed by a radial basis function auto associative neural network (RBFAANN). We compared the performance of these hybrid models with that of mean imputation and a hybrid imputation method, viz. the K-means and multi-layer perceptron (MLP) hybrid of Ankaiah and Ravi [65]. We tested the effectiveness of these models on four benchmark classification datasets, four benchmark regression datasets, three bankruptcy prediction datasets and one credit scoring dataset under 10-fold cross-validation testing. From the experiments, we observed that GRAANN yielded better imputation of the missing values than the rest of the models. We confirmed this by performing the Wilcoxon signed rank test for statistical significance between the proposed methods; GRAANN outperformed the other models on most of the datasets.

© 2014 Elsevier B.V. All rights reserved.

Keywords: Data imputation; Auto associative neural network; General regression auto associative neural network; Radial basis function auto associative neural network; Particle swarm optimization

1. Introduction

Missing or incomplete data is a very common problem in many real-world datasets. Missing data occur for several reasons: non-response to some fields in the data collection process owing to negligence or privacy concerns, data entry errors, system failures, ambiguity of survey questions, cultural issues in updating the databases, and various other causes. Imputation is defined as the substitution of a missing data point, or a missing component of a data point, by some suitable value. Missing or incomplete values lead to less efficient estimates in classification or regression problems because of sample bias and reduced sample size. Missing value imputation has become mandatory because most data mining algorithms cannot work with incomplete datasets. The completeness and quality of the data play a major role in analyzing the available data, because inferences made from complete data are more accurate than those made from incomplete data [1]. Data imputation has found applications in automatic speech recognition, financial and business applications,

http://dx.doi.org/10.1016/j.neucom.2014.02.037
0925-2312/© 2014 Elsevier B.V. All rights reserved.

traffic monitoring, industrial processes, telecommunications and computer networks, and medical diagnosis, among others [2].

Little and Rubin [3] categorized missing data into (i) missing completely at random (MCAR), (ii) missing at random (MAR), and (iii) not missing at random (NMAR). MCAR occurs if the probability that a value of variable X is missing depends neither on the value of X itself nor on any other variable in the dataset. MAR occurs if the probability of missing data on a particular variable X depends on other variables, but not on X itself. NMAR occurs if the probability of a missing value of a particular variable X depends on X itself. Missing data in the MCAR and MAR categories are recoverable, whereas missing data in the NMAR category are irrecoverable. Several techniques for imputing missing data based on statistical analysis [3] are mean substitution methods [4], hot deck imputation [5,6], regression methods [3], expectation maximization [7], and multiple imputation methods [8]. Methods based on machine learning techniques include the multilayer perceptron [9], K-nearest neighbor [10], fuzzy-neural networks [11], AANN imputation with genetic algorithms (GA) [1], and self-organizing maps [12], among others.
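To make the three missingness mechanisms concrete, the following sketch (our own illustration, not code from the paper) generates each of them on a synthetic two-column dataset; the variable names and the 10%/50% missingness rates are arbitrary choices.

```python
# Illustrative generation of MCAR, MAR and NMAR missingness with NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                      # columns: X0 (always observed), X1

# MCAR: the probability that X1 is missing is constant, unrelated to the data.
mcar = X.copy()
mcar[rng.random(1000) < 0.10, 1] = np.nan

# MAR: the probability that X1 is missing depends only on the observed X0.
mar = X.copy()
mar[(X[:, 0] > 1.0) & (rng.random(1000) < 0.5), 1] = np.nan

# NMAR: the probability that X1 is missing depends on X1 itself.
nmar = X.copy()
nmar[(X[:, 1] > 1.0) & (rng.random(1000) < 0.5), 1] = np.nan
```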


Marseguerra and Zoia [13] were the first to propose the AANN (also known as an auto encoder) for reconstruction of a missing signal in simulated nuclear reactor data. They used a robust AANN (RAANN) for reconstruction of a missing time series when it is linearly or nonlinearly correlated with the other measured data. Abdella and Marwala [1] proposed an AANN trained by the back propagation algorithm for imputation. They employed the AANN to train on the samples with complete values and then used the trained network and a genetic algorithm (GA) to impute the missing values in incomplete samples; the GA approximates the missing input values by optimizing an objective function driven by the trained AANN. Other works involving AANN for imputation are reviewed in the literature review section.

In this paper, we exploit the strength of GRNN in solving multi-input multi-output (MIMO) problems. Accordingly, we propose a variant of GRNN called GRAANN, where the output layer contains the input variables themselves during training, thereby achieving auto association. We propose GRAANN for imputation as follows: we first train GRAANN on the samples with complete values, then use mean imputation to obtain initial estimates of the missing values in the incomplete samples, and finally pass these samples through the trained GRAANN to obtain the final imputed values. Our paper therefore differs from Abdella and Marwala [1] in that we achieve imputation in one go, because GRAANN requires only a single iteration, unlike the AANN reported in [1]. Consequently, our method can be used for imputing missing values in streaming data, i.e., data arriving online. We also propose other variants of AANN, viz. PSOAANN, PSOAAWNN and RBFAANN, for data imputation. In PSOAANN and PSOAAWNN, PSO is used to update the weight values in the training process. In PSOAANN, PSOAAWNN and RBFAANN, we impute the missing values in the test set first with mean values and then input these test records to the trained network for final imputation. However, since PSOAANN and PSOAAWNN require many iterations to be trained, they cannot be used for online applications.

The remainder of this paper is organized as follows: a brief review of the literature on imputation of missing data is presented in Section 2. The proposed methodology is explained in Section 3. The description of the datasets is presented in Section 4. The experimental design is described in Section 5. Results and discussion are presented in Section 6, followed by the conclusions in Section 7.

2. Literature review

There are several methods to handle missing data in numerical attributes. According to Kline [14], the methods for handling missing data can be classified into four categories: (1) deletion, (2) imputation, (3) modeling the distribution of missing data and then estimating the missing values from certain parameters, and (4) machine learning methods.

2.1. Deletion procedures

Deletion techniques simply delete the cases that contain missing data. This is a brute force approach that is too simple to be effective. It takes two forms: (i) listwise deletion, which ignores the cases or records containing missing values; its drawback is that the dataset may lose a large number of observations, which may result in large error [15]; and (ii) pairwise deletion, which considers each feature separately: for each feature, all recorded values are considered and missing data are ignored. This approach is suitable if the overall sample size is small or the number of missing data cases is large [15].

2.2. Imputation procedures

The imputation techniques include regression imputation, hot and cold deck imputation, multiple imputation and mean imputation.
Schafer [16] points out that the disadvantage of these methods is that they ignore the correlations between the various components. If the variables are correlated, data imputation can be performed using regression imputation, where a regression equation is computed each time by taking the attribute containing the incomplete value as the target variable. The regression method preserves the variance and covariance of the missing data with other variables. Hot and cold deck imputation is another type of data imputation, where, for each case with a missing value, the missing values are replaced by the closest components present in both vectors. Mean imputation is the earliest method of imputation, where the missing values of a variable are replaced by the average of all the remaining cases of that variable. In the multiple imputation method, each missing value is replaced with several valid and plausible values, giving M complete datasets; all M datasets are analyzed and the inferences combined [2].

2.3. Model-based procedures

The model-based procedures are the maximum likelihood method and expectation maximization. According to DeSarbo and Rao [17], the maximum likelihood approach to analyzing missing data assumes that the observed data are a sample drawn from a multivariate normal distribution; the parameters are estimated from the available data, and the missing values are then determined from the estimated parameters. According to Laird [18], the expectation maximization algorithm is an iterative process: in the first step it estimates the missing data and the parameters using maximum likelihood, while in the second step it re-estimates the missing data based on the new parameters and then recalculates the parameter estimates based on the actual and re-estimated missing data [2].

2.4. Machine learning methods

Samad and Harp [19] proposed a SOM approach for handling missing data. In the feed-forward neural network approach, an MLP is trained as a nonlinear regression model using the complete cases, choosing one variable as the target each time. Several researchers, such as Sharpe and Solly [20], Nordbotten [21], Gupta and Lam [9], and Yoon and Lee [22], used MLP for missing data imputation. In the K-nearest neighbor (K-nn) approach [23], the missing values are replaced using the nearest neighbors selected from the complete cases by minimizing a distance function. The K-nn approach has found applications in breast cancer prognosis [10,22,24].

We now present the work done using AANN for data imputation. Mohamed and Marwala [25] proposed three techniques based on neural networks to impute the missing data in a medical database. The first technique consists of an AANN combined with a GA; the second extends it to an agent-based system; and the third views the missing data problem from a pattern classification perspective. The agent-based system provides the best accuracy on average. Marwala and Chakraverty [26] used an AANN and GA for fault classification in mechanical systems with missing data. Nelwamondo and Marwala [27] used fuzzy ARTMAP in an ensemble of neural networks to perform both classification and regression with missing data. Marivate et al. [28] used an AANN, a principal component analysis neural network (PCANN) and support vector regression for prediction, combining each of them with a GA to impute missing variables; the use of PCA improves the overall performance of the ANN. Nelwamondo et al. [29] separately employed expectation maximization (EM) and the AANN combined with a GA.

Results show that EM performs better when there is no interdependence between the input variables, whereas the AANN+GA is better when there is an inherent nonlinear relationship between some of the variables. Mohamed et al. [30] proposed a combination of a first MLP, an AANN and a second MLP, in that order, to achieve imputation. Ssali and Marwala [31] introduced a hybrid decision tree + AANN combined with a GA and a decision tree + PCA-neural network (PCANN) combined with a GA; the results indicate that the addition of a decision tree improves both models. Mistry et al. [32] proposed the combination of an AANN with PCA, with imputation performed by the trained network and a GA. Chen [33] proposed the hybrid AANN+GA to impute missing variables for predicting business failure.
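As an aside on the imputation procedures of Section 2.2, hot deck imputation as described there can be sketched in a few lines. This is our own illustration, not the paper's code: the donor is chosen by Euclidean distance on the jointly observed columns, a choice we make for concreteness since no specific distance is prescribed.

```python
# Minimal hot-deck sketch: each incomplete record borrows its missing entries
# from the nearest complete "donor" record.
import numpy as np

def hot_deck(X_complete, record):
    """X_complete: (m, d) array with no NaNs; record: (d,) array with NaNs."""
    obs = ~np.isnan(record)                               # columns observed in the recipient
    d = np.linalg.norm(X_complete[:, obs] - record[obs], axis=1)
    donor = X_complete[np.argmin(d)]                      # closest complete record
    return np.where(obs, record, donor)                   # fill the gaps from the donor
```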

3. Proposed methodology

The common thread in all the methods presented in [29-33] is that they impute missing values in a dataset by considering one variable at a time. Further, some of them invoke a GA in the second stage to complete the job of imputation. Therefore, in this paper, we pursued a different line of research by devising a few new auto associative neural network architectures, which are extended, auto associative versions of MLP, RBF, WNN and GRNN, and which can finish imputation without having to invoke a GA. The reason for choosing these architectures is that they have proved powerful not only in solving nonlinear regression problems arising in the multi-input single-output (MISO) framework but also in solving MIMO problems. Further, we intended to reduce the computational burden by imputing all variables with missing values in one go, instead of imputing the missing values one variable at a time. Of course, this necessitates initial imputed estimates, which are provided by mean imputation, which is extremely simple. In the case of MLP, we trained its auto associative counterpart, the AANN, with PSO, yielding PSOAANN, because of the inherent defects of the back propagation algorithm. Similarly, in the case of WNN, we did not want to be hampered by the defects of gradient-based weight update schemes and hence employed PSO to update the weights and the dilation and translation parameters.

Accordingly, in this paper we propose four new AANN architectures for data imputation: (i) the general regression auto associative neural network (GRAANN); (ii) the particle swarm optimization (PSO) trained AANN (PSOAANN); (iii) the PSO trained auto associative wavelet neural network (PSOAAWNN); and (iv) the semi-online radial basis function auto associative neural network (RBFAANN). All four auto associative neural networks are trained to predict their own input variables, so the output variables in the output layer are approximately equal to the input variables. We now briefly describe them.

3.1. General regression auto associative neural network (GRAANN)

Since GRAANN is a variant of GRNN, we briefly describe GRNN first. GRNN was proposed by Specht [34]. It has the attractive features of quick learning, a simple training algorithm and robustness to infrequent outliers and erroneous observations, and it is capable of approximating any arbitrary function from historical data. Each training sample in GRNN acts as a kernel during the training process, and a Parzen window estimator is used to establish the regression surface; estimation in GRNN is thus based on non-parametric regression, yielding the best fit for the observed data. GRNN consists of four layers: the input layer, the pattern layer, the summation layer and the output layer. The input layer contains the input variables, connected to all the neurons in the pattern layer. The pattern layer contains the pattern nodes, which store the input records; the number of pattern nodes equals the number of input records. The outputs of the pattern units are passed on to the summation units. The summation layer includes a numerator summation unit and a denominator summation unit: the denominator summation unit adds up the weight values coming from each of the hidden neurons, while the numerator summation unit adds up the weight values multiplied by the actual target value of each hidden neuron. The output node divides the value of the numerator summation unit by that of the denominator summation unit and uses the result as the final estimated value. Because GRNN is capable of solving multi-input multi-output (MIMO) problems, we extended GRNN to GRAANN by taking the input variables at the output nodes. Fig. 1 depicts the architecture of GRAANN. After training, the output layer yields the modified or predicted input variables (X11, X12, ..., X1n) for the input variables (X1, X2, ..., Xn).

The process of GRAANN based imputation is as follows:

1. Divide the dataset into two sets: a set of complete records and a set of records containing missing values, called missing records.
2. Train the GRAANN with the complete records, using the same training algorithm as GRNN [34].
3. Impute the missing values first with the mean values of the corresponding variables in the missing records; this completes the first stage of imputation. Then, input the modified records obtained in the first stage to the trained GRAANN.
4. Measure the quality of the imputation using the mean absolute percentage error (MAPE) [35]:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{x_i - \hat{x}_i}{x_i}\right|,$$

where $n$ is the number of missing values in a given dataset, $\hat{x}_i$ is the value predicted by GRAANN for the $i$th missing value, and $x_i$ is the actual value.

The same methodology is followed for the other three AANNs; a brief code sketch of this pipeline is given after Fig. 1.

Fig. 1. Architecture of GRAANN (input layer, pattern layer, summation layer, output layer).
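The following is a minimal NumPy sketch of the two-stage GRAANN imputation and the MAPE computation above. It is our own illustration, not the authors' NeuroShell implementation; the Gaussian kernel with a single smoothing parameter sigma follows Specht's GRNN [34], sigma = 0.3 echoes Table 1, and all function and variable names are ours.

```python
import numpy as np

def graann_predict(X_train, X_query, sigma=0.3):
    """Auto-associative GRNN: kernel regression whose targets are the inputs."""
    # Pattern layer: squared distance from every query to every stored record.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))          # kernel weights
    denom = w.sum(axis=1, keepdims=True)          # denominator summation unit
    numer = w @ X_train                           # numerator summation unit (targets = inputs)
    return numer / np.maximum(denom, 1e-12)       # output layer

def graann_impute(X_complete, X_missing):
    """Stage 1: mean imputation; stage 2: one pass through the trained GRAANN."""
    col_mean = X_complete.mean(axis=0)
    mask = np.isnan(X_missing)                    # True where a value is missing
    X_init = np.where(mask, col_mean, X_missing)  # stage 1: mean fill
    X_pred = graann_predict(X_complete, X_init)   # stage 2: GRAANN output
    return np.where(mask, X_pred, X_missing)      # observed values stay untouched

def mape(x_true, x_pred):
    """MAPE over the imputed entries only."""
    return 100.0 * np.mean(np.abs((x_true - x_pred) / x_true))
```

Note that training here amounts to storing the complete records in the pattern layer, which is why a single pass suffices and why the method lends itself to online use.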


3.2. Particle swarm optimization (PSO) based auto associative neural network (PSOAANN)

The PSO algorithm, introduced by Kennedy and Eberhart [36], is a population-based optimization technique that imitates the behavior of bird flocking and fish schooling. A PSOAANN containing one input layer, one hidden layer and one output layer was proposed by Paramjeet et al. [37] for privacy preservation, employing PSO to train the AANN. In this paper, recognizing the versatility of the network, we propose PSOAANN for data imputation. As in the privacy preservation setting, the output layer contains the input variables. The number of nodes in the hidden layer is a user-defined parameter, and the sigmoid function is used as the activation function in the hidden and output layers. Fig. 2 depicts the architecture of PSOAANN. A three-layered AANN is used for data imputation instead of a five-layered AANN [38-40] because it reduces the computational complexity and is simpler to understand and implement. We imputed the missing values in the test set first with mean values and then input these test records to the trained network for final imputation.

Fig. 2. Architecture of PSOAANN (input layer, hidden layer, output layer).
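The following condensed sketch shows how PSO can train such a three-layer AANN by treating each particle as a flattened weight vector and using the reconstruction error as the fitness. It is our own illustration, not the code of [37]: the 30 particles, c1 = 1, c2 = 3, the [-1, 1] bounds and the 1000 iterations echo Table 1, while the inertia weight w = 0.7 and the assumption that inputs are scaled to [0, 1] are our own additions.

```python
import numpy as np

def unpack(theta, d, h):
    W1 = theta[:d * h].reshape(d, h)
    W2 = theta[d * h:].reshape(h, d)
    return W1, W2

def reconstruction_error(theta, X, d, h):
    W1, W2 = unpack(theta, d, h)
    H = 1.0 / (1.0 + np.exp(-X @ W1))             # sigmoid hidden layer
    Y = 1.0 / (1.0 + np.exp(-H @ W2))             # sigmoid output layer (X scaled to [0, 1])
    return np.mean((X - Y) ** 2)                  # fitness: auto-associative MSE

def pso_train(X, h=5, particles=30, iters=1000, c1=1.0, c2=3.0, w=0.7, seed=0):
    d = X.shape[1]
    dim = 2 * d * h
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1, 1, (particles, dim))    # bounds as in Table 1
    vel = np.zeros((particles, dim))
    pbest = pos.copy()
    pcost = np.array([reconstruction_error(p, X, d, h) for p in pos])
    gbest = pbest[pcost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, -1, 1)
        cost = np.array([reconstruction_error(p, X, d, h) for p in pos])
        improved = cost < pcost
        pbest[improved], pcost[improved] = pos[improved], cost[improved]
        gbest = pbest[pcost.argmin()].copy()      # global best weight vector
    return gbest
```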

3.3. PSO based auto associative wavelet neural network (PSOAAWNN)

The auto associative wavelet neural network (AAWNN) uses particle swarm optimization to optimize the weights of the network in the training phase. In the test phase, the mean-imputed test records are supplied to the trained AAWNN for final imputation. Wavelets [41] are functions used to localize a given function in both space and scale [42]. In analyzing physical situations where the signal contains discontinuities and sharp spikes, wavelets have advantages over traditional Fourier methods. A class of neural networks called WNN, which originates from wavelet decomposition in signal processing, has become popular, building on locally supported basis functions such as those of radial basis function networks (RBFN) [43,44]. A family of wavelets can be constructed from a "mother wavelet" $\varphi(x)$, which is confined to a finite interval. Daughter wavelets $\varphi_{a,b}(x)$ are then formed using the translation ($b$) and dilation ($a$) parameters. An individual wavelet is defined as

$$\varphi_{a,b}(x) = |a|^{-1/2}\,\varphi\!\left(\frac{x-b}{a}\right).$$

WNN was proposed as a universal tool for functional approximation. WNN shows surprising effectiveness in overcoming the poor convergence, or even divergence, encountered in other kinds of neural networks, and it converges faster than other networks [45]. The popularity of WNN can be seen from its applications [46-50].
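For concreteness, a daughter wavelet can be evaluated as below. This is our own sketch; the choice of a real-valued Morlet form as the mother wavelet is our assumption, since the paper does not fix one.

```python
import numpy as np

def morlet(x):
    """A common real-valued Morlet mother wavelet (our choice)."""
    return np.cos(1.75 * x) * np.exp(-x ** 2 / 2.0)

def daughter(x, a, b):
    """phi_{a,b}(x) = |a|^{-1/2} * phi((x - b) / a); a = dilation, b = translation."""
    return np.abs(a) ** -0.5 * morlet((x - b) / a)
```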

3.4. Semi-online radial basis function auto associative neural network (RBFAANN)

RBFAANN is an extension of the semi-online radial basis function neural network (SORBFNN) [51], wherein auto association is introduced by taking the input variables in the output layer during training. The training algorithm for RBFAANN works in two steps. Unsupervised learning takes place in the first step on the input data, where the clusters are determined in just one pass using the evolving clustering method (ECM) of Kasabov and Song [52]. Supervised learning is involved in the second step, where the ordinary least squares technique (LSE) is employed. Ordinary least squares is used instead of iterative gradient-based supervised training so that the training algorithm retains its online character in both phases. Ravi et al. [51] applied this semi-online training algorithm for radial basis function neural networks to predict bankruptcy in banks. The auto associative version of the same architecture is now proposed for data imputation. The architecture of the resulting semi-online RBFAANN is depicted in Fig. 3.

The online ECM is a fast, one-pass algorithm for dynamically estimating the number of clusters in a dataset, and it does not involve any optimization. It is a distance-based clustering method: a threshold value (Dthr) is set as a clustering parameter, and in any cluster the maximum distance between a sample point and the cluster center must be less than Dthr. This parameter affects the number of clusters estimated.

Fig. 3. Architecture of RBFAANN (unsupervised learning by ECM followed by supervised learning by LSE).
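A simplified one-pass clustering sketch in the spirit of ECM follows. It is our illustration; the full ECM of Kasabov and Song [52] also maintains and grows cluster radii, which we omit, keeping only the distance-threshold rule described above.

```python
import numpy as np

def one_pass_clusters(X, dthr=0.2):
    """Single pass over X: open a new cluster whenever the nearest center is
    farther than dthr, otherwise update that center incrementally."""
    centers = [X[0].astype(float).copy()]
    counts = [1]
    for x in X[1:]:
        d = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(d))
        if d[j] > dthr:
            centers.append(x.astype(float).copy())   # open a new cluster
            counts.append(1)
        else:
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]  # incremental mean update
    return np.array(centers)
```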

4. Dataset description

In this paper, we analyzed several regression, classification and banking datasets. The regression datasets are Boston housing, Forest fires, Auto MPG and Body fat; the classification datasets are Wine, Pima Indians, Iris and Spectf. The banking datasets comprise the Spanish, Turkish and UK bankruptcy datasets together with a UK credit dataset. All the benchmark regression and classification datasets are taken from the UCI machine learning repository [53] and the KEEL repository [54]. The Turkish banks' dataset is obtained from Canbas et al. [55] and is available at [56]. The Spanish banks' dataset is obtained from Olmeda and Fernandez [57]. The UK bankruptcy dataset is obtained from Beynon and Peel [58], and the UK credit dataset is obtained from Thomas et al. [59]. The number of records and the number of attributes in these datasets are presented in Table 3.

5. Experimental design

We divided the total records in a dataset into a set of complete records and another set of records containing missing values. The complete records are used in the training process. Since none of the datasets analyzed here contains missing values, we simulated the scenario as follows: missing values are created randomly in some records by deleting some feature values, and these missing records are used in the testing process. In all, 10% of the total records are subjected to this process, and this set of records with missing values is taken as the test set. We trained the network on the set of complete records, imputed the missing values in the test set first with mean values, and then fed these test records to the trained auto associative network for final imputation.

For all datasets and all proposed methods, we performed 10-fold cross-validation (10-FCV) by randomly changing the composition of the training and test sets, and we calculated the average MAPE value over the 10 folds. This average MAPE serves as the measure of accuracy of the imputation process; the lower the average MAPE value, the better the method. While we minimize the MAPE value of the training set during every single run of the algorithms, in each fold, one must recognize that the MAPE value of the test set (where the imputations take place) is influenced by the composition of the training set; carrying out 10-fold cross-validation alleviates this influence. Since MAPE is a dimensionless quantity expressed as a percentage, its average over 10 folds indicates the strength of the proposed imputation algorithms across 10 different test settings. Further, 10-fold cross-validation is also used to fine-tune the parameters of the algorithms. Standard deviations of MAPE for all datasets, presented in Table 3, are computed as an additional summary measure.
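The construction of the test set can be sketched as follows. This is our own code; the cap of two deleted values per record and the seed are arbitrary assumptions, since the text only states that 10% of the records have some feature values deleted.

```python
import numpy as np

def make_missing(X, record_frac=0.10, max_gaps=2, seed=0):
    """Randomly delete 1..max_gaps feature values in record_frac of the records."""
    rng = np.random.default_rng(seed)
    X = X.copy().astype(float)
    n, d = X.shape
    rows = rng.choice(n, size=max(1, int(record_frac * n)), replace=False)
    for r in rows:
        cols = rng.choice(d, size=rng.integers(1, max_gaps + 1), replace=False)
        X[r, cols] = np.nan
    complete = X[~np.isnan(X).any(axis=1)]     # training set
    missing = X[np.isnan(X).any(axis=1)]       # test set to be imputed
    return complete, missing
```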

Table 1. Parameter settings for GRAANN, PSOAANN, PSOAAWNN and SORBFAANN.

GRAANN:
Smoothing parameter: 0.3
Genetic breeding pool size: 20/50/200/300
Calibration method: genetic, adaptive learning
Distance metric: vanilla (Euclidean) or city block

PSOAAWNN/PSOAANN:
Particles: 30
C1 and C2: 1 and 3
Tolerance: 0.0001
Lower and upper bounds on weights and dilation/translation parameters: -1 and 1
Max iterations: 1000
Hidden nodes: 3 to the number of input nodes

SORBFAANN:
Distance threshold of ECM (Dthr): between 0.1 and 0.3
Weight factor: 1

6. Results and discussion

First, we present the tools employed and the parameter settings for the techniques in this study. We used NeuroShell [60] to implement GRAANN. Since GRAANN is not readily available in the tool, we implemented a MIMO model in GRNN and fed the original input variables as the output variables; this trick, in effect, transforms the GRNN into GRAANN. The parameter settings chosen for training GRAANN in NeuroShell are presented in Table 1. Further, we executed the code of PSOAANN developed in [37], while we implemented PSOAAWNN in C. Finally, we extended the SORBFNN [51], written in MATLAB, to its auto associative version, RBFAANN, in the same way as GRAANN. The parameter settings of PSOAANN and PSOAAWNN for all datasets are also presented in Table 1. The parameter settings of the hybrid K-means+MLP for the various datasets and the different folds of the 10-fold cross-validation are presented in Table 2. Thus, we conducted exhaustive experiments, varying the parameter settings of all models, in order to obtain the best possible imputation on all datasets.

The average MAPE values and standard deviations of MAPE obtained from 10-fold cross-validation (10-FCV) by GRAANN and the other methods on the different datasets are presented in Table 3; the lowest average MAPE in each row marks the best method. Across the regression, classification and banking datasets, GRAANN produced the best results except on Pima Indian.

Table 4 presents the computed Wilcoxon signed rank test values for the different datasets. We conducted a non-directional (two-tailed) Wilcoxon signed rank test [61-64] at the 1% level of significance, in pairs, to test whether GRAANN is statistically significantly better than the other methods proposed here and elsewhere. The critical value for the two-tailed Wilcoxon signed rank test at the 1% level of significance is 2.576; if the computed value is less than or equal to the critical value, the difference between the methods under comparison is not statistically significant. From Table 4, we observe that there is no statistically significant difference between GRAANN and PSOAANN for the Forest fires, Body fat, Pima Indian and UK bankruptcy datasets out of the 12 datasets. Likewise, there is no statistically significant difference between GRAANN and PSOAAWNN for the Pima Indian, Turkish and UK bankruptcy datasets. However, there is a statistically significant difference between GRAANN and RBFAANN for all datasets. Further, there is no statistically significant difference between GRAANN and mean imputation for the Forest fires, Pima Indian and UK bankruptcy datasets, and none between GRAANN and K-means+MLP [65] for Iris, UK credit, Spanish, Turkish and UK bankruptcy. From the overall results, we conclude that GRAANN performed the best imputation on most of the datasets. Most importantly, the fact that GRAANN requires just a single iteration to train makes it very attractive compared to all other models; this feature makes GRAANN suitable in applications where data imputation must be performed in real time or online. Therefore, we conclude that GRAANN should be preferred for imputation in the class of AANN architectures.

Table 2. Parameter settings for K-means+MLP: number of clusters, iterations, learning rate, momentum rate, hidden layers, hidden nodes and training epochs for each dataset (Iris, Body fat, Pima Indian, Spanish banks, Turkish banks, UK banks, UK credit, Wine, Forest fires, Boston housing, Auto MPG and Spectf); many of the settings vary across the 10 folds.


Table 3. Mean (standard deviation) of MAPE values (%) over 10 folds.

Dataset | No. of records | No. of attributes | GRAANN | PSOAANN | PSOAAWNN | RBFAANN | Mean imputation | K-means+MLP

Regression datasets
Boston housing | 506 | 13 | 15.38 (2.45) | 24.61 (5.95) | 30.94 (7.54) | 98.87 (32.76) | 37.77 (10.37) | 21.01 (4.16)
Forest fires | 516 | 10 | 18.47 (2.08) | 22.69 (5.98) | 26.62 (5.41) | 59.24 (19.17) | 24.72 (6.84) | 26.61 (5.23)
Auto mpg | 392 | 7 | 15.54 (3.70) | 37.59 (10.13) | 38.16 (13.59) | 62.53 (15.28) | 59.70 (14.36) | 23.75 (4.52)
Body fat | 252 | 14 | 4.61 (2.03) | 7.61 (4.53) | 9.21 (4.01) | 25.40 (14.78) | 11.61 (7.18) | 7.83 (1.64)

Classification datasets
Wine | 178 | 13 | 12.87 (2.47) | 22.16 (4.52) | 23.64 (3.94) | 39.11 (8.38) | 29.99 (4.51) | 21.58 (3.87)
Pima Indian | 768 | 8 | 23.89 (2.86) | 21.72 (3.21) | 23.68 (3.01) | 32.28 (4.66) | 24.02 (3.82) | 29.70 (3.39)
Iris | 150 | 4 | 5.75 (2.41) | 15.84 (9.03) | 12.83 (6.41) | 26.93 (13.88) | 23.57 (14.46) | 9.41 (1.97)
Spectf | 267 | 44 | 8.41 (1.59) | 16.69 (4.35) | 43.30 (4.48) | 21.12 (5.48) | 14.85 (4.74) | 12.14 (2.68)

Banking datasets
UK credit | 1225 | 12 | 20.47 (5.32) | 33.94 (10.35) | 38.64 (9.26) | 45.53 (17.47) | 28.43 (1.83) | 32.17 (11.56)
Spanish | 66 | 9 | 23.28 (11.40) | 60.95 (22.20) | 48.81 (15.25) | 847.02 (1043.3) | 55.53 (45.23) | 39.91 (13.06)
Turkish | 40 | 12 | 17.25 (10.14) | 53.56 (23.05) | 33.45 (7.86) | 188.85 (122.6) | 66.00 (26.01) | 33.01 (21.34)
UK bankruptcy | 60 | 10 | 26.85 (12.36) | 33.47 (9.21) | 31.48 (5.86) | 141.61 (42.68) | 37.07 (11.64) | 30.96 (10.58)

Table 4. Wilcoxon signed rank test values of GRAANN versus the other methods.

Dataset | PSOAANN | PSOAAWNN | RBFAANN | Mean imputation | K-means+MLP

Regression
Boston housing | 2.67 | 2.77 | 2.77 | 2.77 | 2.67
Forest fires | 1.86 | 2.67 | 2.77 | 2.36 | 2.77
Auto mpg | 2.77 | 2.77 | 2.77 | 2.77 | 2.77
Body fat | 1.96 | 2.77 | 2.77 | 2.67 | 2.67

Classification
Wine | 2.77 | 2.77 | 2.77 | 2.77 | 2.77
Pima Indian | 2.31 | 0.73 | 2.77 | 0.02 | 2.77
Iris | 2.77 | 2.77 | 2.77 | 2.77 | 2.26
Spectf | 2.77 | 2.77 | 2.77 | 2.77 | 2.77

Banking
UK credit | 2.77 | 2.77 | 2.77 | 2.77 | 2.57
Spanish | 2.77 | 2.77 | 2.77 | 2.77 | 2.57
Turkish | 2.67 | 2.26 | 2.77 | 2.77 | 2.36
UK bankruptcy | 1.65 | 1.14 | 2.77 | 1.75 | 0.63
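The pairwise comparison behind Table 4 can be reproduced in outline with SciPy's Wilcoxon signed rank test on paired per-fold MAPE values. This is our sketch; the two arrays below are hypothetical fold-level results for illustration, not the paper's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold MAPE values for two methods on one dataset (10 folds).
mape_graann = np.array([15.1, 17.2, 14.8, 16.0, 13.9, 15.5, 16.3, 14.2, 15.9, 14.9])
mape_psoaann = np.array([24.0, 25.3, 22.8, 26.1, 23.5, 24.9, 25.7, 23.1, 24.4, 25.0])

stat, p = wilcoxon(mape_graann, mape_psoaann)   # two-sided (non-directional) by default
print(p < 0.01)   # True -> the difference is significant at the 1% level
```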

7. Conclusions

In this paper, based on the concept of the AANN, we proposed four new algorithms for data imputation, viz. GRAANN, PSOAANN, PSOAAWNN and RBFAANN. The effectiveness of the proposed methods was tested on four benchmark classification datasets, four benchmark regression datasets, three bankruptcy prediction datasets and one credit scoring dataset under 10-fold cross-validation testing. Based on the results, we conclude that GRAANN should be preferred for imputation in the class of AANN architectures, owing to its consistently better imputation on most of the datasets, as evidenced by the Wilcoxon signed rank test. Further, we conclude that GRAANN can be used in online data imputation applications, even though we did not test any online data here, because of its simple architecture, its fast, one-pass training algorithm and its immunity to outliers. Moreover, this study demonstrates that computationally complex evolutionary algorithms are not needed to fine-tune the imputations yielded by an AANN: we achieved highly accurate imputations with just mean imputation followed by GRAANN. This obviates the need to invoke evolutionary algorithms in the whole process and is a significant outcome of the study.

References


[1] M. Abdella, D. Marwala, The use of genetic algorithms and neural networks to approximate missing data in database, in: Proceedings of the IEEE 3rd International Conference on Computational Cybernetics (ICCC), 2005, pp. 207-212.
[2] P.J. García-Laencina, J.L. Sancho-Gómez, A.R. Figueiras-Vidal, Pattern classification with missing data: a review, Neural Comput. Appl. 19 (2010) 263-282.
[3] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, 2nd ed., Wiley-Interscience, Hoboken, NJ, USA, 2002.
[4] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley, New York, 1987.
[5] I.G. Sande, Hot-deck imputation procedures, in: Incomplete Data in Sample Surveys, vol. 3, Academic Press, New York, 1983, pp. 339-349.
[6] B.M. Ford, An overview of hot-deck procedures, in: Incomplete Data in Sample Surveys, vol. 2, Academic Press, New York, 1983, pp. 185-207.
[7] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B 39 (1) (1977) 1-38.
[8] D.B. Rubin, Multiple Imputation for Nonresponse in Surveys, Wiley, New York, 1987.
[9] A. Gupta, M.S. Lam, Estimating missing values using neural networks, J. Oper. Res. Soc. 47 (2) (1996) 229-238.
[10] G. Batista, M.C. Monard, A study of K-nearest neighbor as an imputation method, in: A. Abraham, et al. (Eds.), Hybrid Intelligent Systems, Ser. Front. Artificial Intelligence Applications, IOS Press, 2002, pp. 251-260.

[11] B. Gabrys, Neuro-fuzzy approach to processing inputs with missing values in pattern recognition problems, Int. J. Approx. Reasoning 30 (2002) 149-179.
[12] P. Merlin, A. Sorjamaa, B. Maillet, A. Lendasse, X-SOM and L-SOM: a double classification approach for missing value imputation, Neurocomputing 73 (2010) 1103-1108.
[13] M. Marseguerra, A. Zoia, The auto-associative neural network in signal analysis II: application to on-line monitoring of a simulated BWR component, Ann. Nucl. Energy 32 (11) (2002) 1207-1223.
[14] R.B. Kline, Principles and Practice of Structural Equation Modeling, Guilford Press, New York, 1988.
[15] Q. Song, M. Shepperd, A new imputation method for small software project data sets, J. Syst. Software 80 (1) (2007) 51-62.
[16] J.L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, Florida, USA, 1997.
[17] W.S. DeSarbo, V.R. Rao, A constrained unfolding methodology for product positioning, Market. Sci. 5 (1) (1986) 1-19.
[18] N.M. Laird, Missing data in longitudinal studies, Stat. Med. 7 (1988) 305-315.
[19] T. Samad, S.A. Harp, Self-organization with partial data, Network: Comput. Neural Syst. 3 (1992) 205-212.
[20] P.K. Sharpe, R.J. Solly, Dealing with missing values in neural network based diagnostic systems, Neural Comput. Appl. 3 (2) (1995) 73-77.
[21] S. Nordbotten, Neural network imputation applied to the Norwegian 1990 population census data, J. Off. Stat. 12 (1996) 385-401.
[22] S.Y. Yoon, S.Y. Lee, Training algorithm with incomplete data for feed-forward neural networks, Neural Process. Lett. 10 (1999) 171-179.
[23] G. Batista, M.C. Monard, Experimental Comparison of K-nearest Neighbor and Mean or Mode Imputation Methods with the Internal Strategies Used by C4.5 and CN2 to Treat Missing Data, Technical Report, University of Sao Paulo, 2003.
[24] J. Jerez, I. Molina, J. Subirates, L. Franco, Missing data imputation in breast cancer prognosis, in: Proceedings of the 24th IASTED International Conference on Biomedical Engineering (BioMed'06), Anaheim, CA, USA, 2006.
[25] S. Mohamed, T. Marwala, Neural network based techniques for estimating missing data in databases, in: 16th Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 2005, pp. 27-32.
[26] T. Marwala, S. Chakraverty, Fault classification in structures with incomplete measured data using auto associative neural networks and genetic algorithm, Curr. Sci. India 90 (4) (2006) 542-548.
[27] F.V. Nelwamondo, T. Marwala, Fuzzy ARTMAP and neural network approach to online processing of inputs with missing values, in: Proceedings of the 17th Symposium of the Pattern Recognition Association of South Africa, 2006, pp. 177-182.
[28] V.N. Marivate, F.V. Nelwamondo, T. Marwala, Autoencoder, principal component analysis and support vector regression for data imputation, CoRR, 2007.
[29] F.V. Nelwamondo, S. Mohamed, T. Marwala, Missing data: a comparison of neural network and expectation maximization techniques, Curr. Sci. 93 (12) (2007).
[30] A.K. Mohamed, F.V. Nelwamondo, T. Marwala, Estimating missing data using neural network techniques, principal component analysis and genetic algorithms, in: Proceedings of the 18th Symposium of the Pattern Recognition Association of South Africa, 2007.
[31] G. Ssali, T. Marwala, Computational intelligence and decision trees for missing data estimation, in: IJCNN, 2008, pp. 201-207.
[32] J. Mistry, F.V. Nelwamondo, T. Marwala, Using principal component analysis and auto associative neural networks to estimate missing data in a database, in: Proceedings of the 12th World Multi-Conference on Systemics, Cybernetics and Informatics (WMSCI), 2008.
[33] M.H. Chen, Pattern recognition of business failure by auto associative neural networks in considering the missing values, in: Computer Symposium, 2010, pp. 711-715.
[34] D.F. Specht, A general regression neural network, IEEE Trans. Neural Networks 2 (6) (1991) 568-576.
[35] B.E. Flores, A pragmatic view of accuracy measurement in forecasting, Omega 14 (2) (1986) 93-98.
[36] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, Piscataway, NJ, USA, 1995, pp. 1942-1948.
[37] Paramjeet, V. Ravi, N. Nekuri, C. Raghavendra Rao, Privacy preserving data mining using particle swarm optimization trained auto-associative neural network: an application to bankruptcy prediction in banks, Int. J. Data Min. Model. Manage. 4 (1) (2012) 39-56.
[38] M.A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE J. 37 (2) (1991) 233-243.
[39] C. Pramodh, V. Ravi, Modified great deluge algorithm based auto associative neural network for bankruptcy prediction in banks, Int. J. Comput. Intell. Res. 3 (4) (2007) 363-370.
[40] V. Ravi, C. Pramodh, Non-linear principal component analysis-based hybrid classifiers: an application to bankruptcy prediction in banks, Int. J. Inf. Decis. Sci. 2 (1) (2010) 50-67.
[41] A. Grossmann, J. Morlet, Decomposition of Hardy functions into square integrable wavelets of constant shape, SIAM J. Math. Anal. 15 (1984) 723-736.
[42] 〈http://mathworld.wolfram.com/Wavelet.html〉, last retrieved in 2013.
[43] Q. Zhang, A. Benveniste, Wavelet networks, IEEE Trans. Neural Networks 3 (6) (1992) 889-898.
[44] Q. Zhang, Using wavelet network in nonparametric estimation, IEEE Trans. Neural Networks 8 (2) (1997) 227-236.

[45] X. Zhang, J. Qi, R. Zhang, M. Liu, Z. Hu, H. Xue, Prediction of programmed-temperature retention values of naphthas by wavelet neural networks, Comput. Chem. 25 (2) (2001) 125-133.
[46] E. Avci, An expert system based on wavelet neural network-adaptive norm entropy for scale invariant texture classification, Expert Syst. Appl. 32 (3) (2007) 919-926.
[47] C. Dimoulas, G. Kalliris, G. Papanikolaou, V. Petridis, A. Kalampakas, Bowel-sound pattern analysis using wavelets and neural networks with application to long-term, unsupervised, gastrointestinal motility monitoring, Expert Syst. Appl. 34 (1) (2008) 26-41.
[48] L. Dong, D. Xiao, Y. Liang, Y. Liu, Rough set and fuzzy wavelet neural network integrated with least square weighted fusion algorithm based fault diagnosis research for power transformers, Electric Power Syst. Res. 78 (1) (2008) 129-136.
[49] K. Vinaykumar, V. Ravi, M. Carr, N. Rajkiran, Software cost estimation using wavelet neural networks, J. Syst. Software 81 (11) (2008) 1853-1867.
[50] N. Rajkiran, V. Ravi, Software reliability prediction using wavelet neural networks, in: International Conference on Computational Intelligence and Multimedia Applications, Sivakasi, Tamil Nadu, India, 2007.
[51] V. Ravi, P. Ravikumar, E. Ravisrinivas, N.K. Kasabov, A semi-online training algorithm for the radial basis function neural networks: applications to bankruptcy prediction in banks, in: Advances in Banking Technology and Management: Impacts of ICT and CRM, pp. 243-260.
[52] N.K. Kasabov, Q. Song, DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction, IEEE Trans. Fuzzy Syst. 10 (2) (2002) 144-154.
[53] 〈http://archive.ics.uci.edu/ml/datasets.html〉, last retrieved in 2013.
[54] 〈http://sci2s.ugr.es/keel/datasets.php〉, last retrieved in 2013.
[55] S. Canbas, A. Caubak, S.B. Kilic, Prediction of commercial bank failure via multivariate statistical analysis of financial structures: the Turkish case, Eur. J. Oper. Res. 166 (2005) 528-546.
[56] 〈http://www.tbb.org.tr/english/bulten/yillik/2000/ratios.xls〉, last retrieved in 2013.
[57] I. Olmeda, E. Fernandez, Hybrid classifiers for financial multicriteria decision making: the case of bankruptcy prediction, Comput. Econ. 10 (1997) 317-335.
[58] M.J. Beynon, M.J. Peel, Variable precision rough set theory and data discretisation: an application to corporate failure prediction, Omega 29 (2001) 561-576.
[59] L.C. Thomas, D.B. Edelman, J.N. Crook, Credit Scoring and Its Applications, SIAM, Philadelphia, USA, 2002.
[60] 〈http://www.neuroshell.com〉, last retrieved in 2013.
[61] 〈https://www.msu.edu/user/sw/statrev/strv50.htm?Q〉, last retrieved in 2013.
[62] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (1945) 80-83.
[63] S. Siegel, Non-parametric Statistics for the Behavioral Sciences, McGraw-Hill, New York, 1956, pp. 75-83.
[64] R. Lowry, Concepts and Applications of Inferential Statistics, 2013. Retrieved from 〈http://vassarstats.net/textbook/ch12a.html〉.
[65] N. Ankaiah, V. Ravi, A novel soft computing hybrid for data imputation, in: Proceedings of the 7th International Conference on Data Mining (DMIN), Las Vegas, USA, 2011.

Vadlamani Ravi has been an associate professor at the Institute for Development and Research in Banking Technology (IDRBT), Hyderabad, since February 2010. He obtained his Ph.D. in the area of soft computing from Osmania University, Hyderabad and RWTH Aachen, Germany (2001); an M.S. (science and technology) from BITS, Pilani (1991); and an M.Sc. (statistics and operations research) from IIT Bombay (1987). At IDRBT, he spearheads the CRM Lab, the first of its kind in India, and evangelizes CRM by conducting customized training programmes for bankers on CRM subsuming OCRM and ACRM, on data warehousing and data mining, and by conducting proofs of concept for banks. He has 130 papers to his credit, 70 of them in refereed international journals. His papers have appeared in Applied Soft Computing, Soft Computing, Asia-Pacific Journal of Operational Research, Decision Support Systems, European Journal of Operational Research, Expert Systems with Applications, Fuzzy Sets and Systems, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Reliability, Information Sciences, Journal of Systems and Software, Knowledge Based Systems, IJUFKS, IJCIA, IJAEC, IJDMMM, IJIDS, IJDATS, IJISSS, IJCIR, IJCISIM, IJBIC, Computers and Chemical Engineering, Canadian Geotechnical Journal, Biochemical Engineering Journal, Bioinformation, Journal of Services Research, etc. He also edited the book Advances in Banking Technology and Management: Impacts of ICT and CRM (http://www.igi-global.com/reference/details.asp?id=6995), published by IGI Global, USA, 2007. Some of his research papers are listed among the Top 25 Hottest Articles by Elsevier and World Scientific. He has an h-index of 24 with 1935 citations (http://scholar.google.co.in/). He is recognized as a Ph.D. supervisor at the Department of Computer and Information Sciences, University of Hyderabad, and the Department of Computer Sciences, Berhampur University, Orissa. He was an invited member of Marquis Who's Who in the World, USA, in 2009 and 2010; an invited member of the 2000 Outstanding Intellectuals of the 21st Century 2009/2010, published by the International Biographical Centre, Cambridge, England; and an invited member of the Top 100 Educators in 2009, published by the International Biographical Centre,

Cambridge, England. Three Ph.D. students have graduated under his supervision. So far, he has advised 50 M.Tech./M.C.A./M.Sc. projects and at least a dozen summer interns from various IITs. He currently supervises three Ph.D. students and five M.Tech. students. He is on the steering committee of Canara Bank for their DWH and CRM project, IT advisor to Indian Bank for their DWH and CRM projects, principal consultant to Bank of India for their CRM project, and an expert committee member for IRDA for their business analytics and fraud analytics projects. He is a referee for 36 international journals of repute. Moreover, he is a member of the editorial review boards of the International Journal of Information Systems in the Service Sector (IGI Global, USA), the International Journal of Data Analysis Techniques and Strategies (Inderscience, Switzerland), the International Journal of Information and Decision Sciences (IJIDS, Inderscience, Switzerland), the International Journal of Strategic Decision Sciences (IJSDS, IGI Global, USA) and the International Journal of Information Technology Project Management (IJITPM, IGI Global, USA). He is on the programme committees of several international conferences and has chaired many sessions at international conferences in India and abroad. His research interests include fuzzy computing, neurocomputing, soft computing, data mining, web mining, privacy preserving data mining, global/multi-criteria/combinatorial optimization, bankruptcy prediction, risk measurement, text mining, customer relationship management (CRM), churn prediction in banks and firms, and asset liability management through optimization. In a career spanning 25 years, he has worked in several cross-disciplinary areas such as financial engineering, software engineering, reliability engineering, chemical engineering, environmental engineering, chemistry, medical entomology, bioinformatics and geotechnical engineering. At IDRBT, he has held various administrative positions, including coordinator of IDRBT-industry relations (2005-2006), M.Tech. (IT) coordinator (2006-2009) and convener of the IDRBT working group on CRM (2010-2011). As the convener of the IDRBT working group on CRM, he co-authored a Handbook on Holistic CRM and Analytics (http://www.idrbt.ac.in/PDFs/Holistic%20CRM%20Booklet_Nov2011.pdf), which suggests a new framework for CRM, best practices and new organization structures, apart from HR issues, for the Indian banking industry. He has 25 years of research experience and 12 years of teaching experience. He has designed and developed a number of courses in Singapore and India at the M.Tech. level in soft computing, data warehousing and data mining, fuzzy computing, neurocomputing, quantitative methods in finance, soft computing in finance, etc. Furthermore, he has designed and developed a number of short courses for Executive Development Programmes (EDPs), including two-week-long courses on CRM for senior and junior executives, data mining, big data and its relevance to banking, and fraud analytics. He conducted ACRM proofs of concept for six banks on their real data. He has established excellent research collaborations with the University of Hong Kong, the University of Ghent, Belgium, IISc Bangalore and IIT Kanpur. He coordinated the first international EDP at IDRBT, on ACRM for banking executives, jointly with Prof. Dr. Dirk Van den Poel, University of Ghent, at Ghent, Belgium, in 2011; this programme was successfully repeated in 2012. As part of academic outreach, he is an invited resource person at various national workshops and faculty development programmes on soft computing and data mining funded by AICTE and organized by engineering colleges in India. Prior to joining IDRBT as an assistant professor in April 2005, he worked as a faculty member at the Institute of Systems Science (ISS), National University of Singapore (April 2002-March 2005). At ISS, he was involved in teaching in the M.Tech. (knowledge engineering) programme and in research in the areas of fuzzy systems, neural networks, soft computing systems, data mining and machine learning. Furthermore, he consulted for Seagate Technologies, Singapore, and Knowledge Dynamics Pvt. Ltd., Singapore, on data mining projects. Before leaving for Singapore, he worked as an assistant director (Scientist E1) from 1996 to 2002 and as Scientist C from 1993 to 1996 at the Indian Institute of Chemical Technology (IICT), Hyderabad. He was deputed to RWTH Aachen (Aachen University of Technology), Germany, under the DAAD Long Term Fellowship to carry out advanced research during 1997-1999. He earlier worked as Scientist B and Scientist C at the Central Building Research Institute, Roorkee (1988-1993), and was listed as an expert in soft computing by TIFAC, Government of India.

Mannepalli Krishna holds an M.Tech. (IT) from University of Hyderabad and IDRBT, Hyderabad. His research interests include data imputation and machine learning. He now works as an officer in Andhra Bank, Mumbai, India.
