Comparing Three Data Mining Algorithms for Identifying the Associated Risk Factors of Type 2 Diabetes

Background: The increasing prevalence of type 2 diabetes has created a global health burden and is a concern for health service providers and health administrators. The current study aimed to develop and compare statistical models for identifying the risk factors associated with type 2 diabetes. To this end, artificial neural network (ANN), support vector machine (SVM), and multiple logistic regression (MLR) models were applied, using demographic, anthropometric, and biochemical characteristics, to a sample of 9528 individuals from Mashhad City in Iran. Methods: We randomly selected 6654 (70%) cases for training and reserved the remaining 2874 (30%) cases for testing. The three methods were compared using ROC curves. Results: The prevalence of type 2 diabetes was 14% in our population. The ANN model had 78.7% accuracy, 63.1% sensitivity, and 81.2% specificity. The corresponding values were 76.8%, 64.5%, and 78.9% for SVM and 77.7%, 60.1%, and 80.5% for MLR. The area under the ROC curve was 0.71 for ANN, 0.73 for SVM, and 0.70 for MLR. Conclusion: Our findings showed that ANN performs better than the other two models (SVM and MLR) and can be used effectively to identify the associated risk factors of type 2 diabetes.


INTRODUCTION
Diabetes mellitus (DM) is rapidly growing in prevalence, thus posing a major public health challenge globally. The estimated number of diabetic adults was 382 million in 2013, while the prediction for the year 2035 is 592 million [1,2]. Diabetes is associated with health complications such as cardiovascular, renal, and retinal diseases; therefore, it is important to identify individuals at high risk of diabetes in order to control other health risks.
Different models have been developed during the last two decades for evaluating the risk of DM, but none of them has achieved the desired accuracy, so their clinical utility is limited [3,4]. Studies have also suggested a connection between anthropometric factors and DM [5,6].
Data mining involves selecting, analyzing, and modeling large amounts of information in order to reveal hidden patterns or correlations that yield clear and useful results [7,8]. The field has progressed rapidly in recent years. Several studies have used data mining to investigate unknown variables, and predictive models have been created in medicine [9,10].
Multiple logistic regression (MLR) is a nonlinear regression approach for predicting a categorical dependent variable. This method has been used for identifying the risk factors for various diseases through patients' history, characteristics, and other risk factors. The logistic model expresses the probability of the illness under consideration, y (y = 0 if the participant does not suffer from the illness; otherwise, y = 1), as a function of the predictive risk factor values. The conditional probability that a person suffers from the illness is $P(y = 1 \mid X) = p(X)$, and the logistic model becomes:
$$\log\left[\frac{p(X)}{1 - p(X)}\right] = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_k x_k$$
in which $X = (x_1, x_2, \dots, x_k)$ is the vector of the k risk factors in the logistic regression model [11,12].
Support vector machines (SVMs) create appropriate boundaries between sets of data by solving a quadratic optimization problem. By using various kernel functions, different degrees of flexibility and nonlinearity can be built into the model. Because these properties can be derived from statistical theory and generalization error bounds can be calculated, a great deal of research has been conducted on SVMs in recent years [13,14].
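The logistic model described above can be illustrated with a minimal Python sketch. The function names `logistic_probability` and `log_odds` are hypothetical and the coefficient values in the usage note are arbitrary; they are not estimates from this study.

```python
import math


def logistic_probability(intercept, coefficients, x):
    """Return p(X) for the logistic model log[p / (1 - p)] = B0 + sum(Bi * xi)."""
    if len(coefficients) != len(x):
        raise ValueError("coefficients and x must have the same length")
    # Linear predictor: B0 + B1*x1 + ... + Bk*xk (the log-odds).
    linear_predictor = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    # Inverse of the logit transform.
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def log_odds(p):
    """Return the logit log[p / (1 - p)] of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))
```

For example, with an arbitrary intercept of -1.0 and coefficients (0.5, 0.2), `log_odds(logistic_probability(-1.0, [0.5, 0.2], [2.0, 3.0]))` recovers the linear predictor -1.0 + 0.5·2.0 + 0.2·3.0, which shows that each coefficient is the change in log-odds per unit change in its risk factor.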
Artificial neural networks (ANNs) are data mining tools suited to building nonlinear and complex models. The present study used a common ANN architecture, the multilayer perceptron network with the back-propagation algorithm, which is among the most widely used and most thoroughly studied ANN architectures [15,16]. Multilayer perceptron networks are feed-forward neural networks that are trained with standard back-propagation algorithms and are considered strong function approximators for classification and prediction problems [17].
In public health, it is common to use MLR for identifying factors associated with the disease and for developing predictive models [18,19] . Some data mining techniques have already been applied for medical conditions [9,20,21] and for predicting diabetes [8,22,23] . The present study aims at comparing ANN, SVM, and MLR for identifying associated risk factors of DM.

Experimental sample
A sample of 9528 subjects was enrolled in the Mashhad Stroke and Heart Atherosclerotic Disorders Study at Mashhad University of Medical Sciences (MUMS), Mashhad, Iran [24]. The Ethics Committee of MUMS approved the protocol, and all participants gave written informed consent.

Data collection (anthropometric and biochemical measurements)
Demographic characteristics consisted of age, gender, marital status, education, cigarette smoking habit, physical activity level (PAL), family history of diabetes, and depression score. The depression score was evaluated using Beck's depression inventory II. Anthropometric information included weight, height, and waist and hip circumference. Systolic and diastolic blood pressures were measured as described earlier [12]. Biochemical measurements comprised fasting blood glucose, fasting serum triglycerides (TGs), total cholesterol (TC), HDL- and LDL-cholesterol, and high-sensitivity C-reactive protein (hs-CRP), as previously described [24].

Phenotypic definition of type 2 diabetes mellitus
Type 2 DM was defined phenotypically as a fasting blood glucose level of 126 mg/dl or higher.

Artificial neural network
ANN is a data mining tool used for constructing non-linear prediction models [7]. A multilayer neural network has three layers, namely an input layer, a hidden layer, and an output layer. Each layer has nodes that are connected by links from one layer to the next. Nodes in the input layer represent predictors, while nodes in the output layer represent outcome variables [15]. One of the most common neural network methods is the multilayer back-propagation learning algorithm, which is able to model non-linear systems [25]. Neural networks are harder to interpret than other statistical models; nevertheless, they are used in various medical fields [25,26]. A perceptron network is composed of nodes, each with an activation function, arranged in different layers. Each node uses its weight coefficients to combine the outputs of all nodes in the previous layer and passes the result to the next layer through its activation function. The number of nodes in each layer of the network depends on the structure of the problem under investigation [15,25]. In a perceptron network with one hidden layer, the i-th output is calculated from the following quantities: n, the number of observations; M, the number of hidden layer nodes; p, the number of input layer nodes; w_js, the weight applied to input χ_is as it enters a node; χ_is, the input value for the i-th observation; b_0 and b_j0, the biases of the middle and output layers, respectively; and Ø_2 and Ø_1, the activation functions of the hidden and output layers.
The activation function of the hidden layer is usually non-linear (hyperbolic tangent or sigmoid), and the transfer function of the output layer can be linear or non-linear [15,27].
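For reference, a commonly used form of the output of a perceptron network with one hidden layer is sketched below. This is a textbook expression given under stated assumptions, not necessarily the exact formula of the original article. The symbols n, M, p, w_{js}, and x_{is} (the input written χ_is in the text) follow the text; the labels g (hidden-layer activation), f (output-layer activation), c_j (bias of hidden node j), c_0 (output bias), and v_j (weight from hidden node j to the output) are introduced here only for illustration, because the assignment of the original bias and activation symbols to layers is ambiguous.

```latex
\hat{y}_i = f\left( c_0 + \sum_{j=1}^{M} v_j \, g\left( c_j + \sum_{s=1}^{p} w_{js} \, x_{is} \right) \right),
\qquad i = 1, \dots, n
```

The inner sum combines the p inputs of observation i at hidden node j, g transforms that sum, and the outer sum combines the M hidden-node outputs before f produces the network response.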
The aim of training an ANN is to calculate the proper weights for the network. One way of estimating the weights is the back-propagation algorithm. The back-propagation rule consists of two paths. The first is the forward path, in which the input vector is applied to the perceptron network and its effect propagates to the output layer through the middle layer. The output vector constructed in the output layer forms the actual response of the network; in this path, the parameters of the network are held fixed. The second is the backward path, in which, in contrast to the forward path, the parameters of the network are changed and adjusted according to the error-correction rule. These two paths are repeated until the parameter estimates (biases and weights) have been adjusted. This process of estimating the proper weights is the learning process, in which the weight coefficients are changed to reduce the objective function of the network, here the mean square error [28,29].
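The forward and backward paths described above can be sketched in plain Python for a network with one hidden layer. This is a minimal illustration under assumptions, not the implementation used in the study: it uses a sigmoid hidden layer and a single linear output node, updates the weights after every observation, and all names (`train_mlp`, `n_hidden`, and so on) and default values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_mlp(X, y, n_hidden=3, learning_rate=0.2, epochs=500, seed=0):
    """Train a one-hidden-layer perceptron by back-propagation.

    Returns the fitted parameters and the mean square error of each epoch.
    """
    rng = random.Random(seed)
    n_in = len(X[0])
    # w[j][s]: weight from input s to hidden node j; b[j]: bias of hidden node j.
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    # v[j]: weight from hidden node j to the output; c: output bias.
    v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    c = 0.0
    history = []
    for _ in range(epochs):
        total_sq_err = 0.0
        for xi, target in zip(X, y):
            # Forward path: parameters are fixed, the input propagates to the output.
            h = [sigmoid(b[j] + sum(w[j][s] * xi[s] for s in range(n_in)))
                 for j in range(n_hidden)]
            out = c + sum(v[j] * h[j] for j in range(n_hidden))  # linear output
            err = out - target
            total_sq_err += err * err
            # Backward path: adjust parameters along the gradient of 0.5 * err**2.
            for j in range(n_hidden):
                # Error signal at hidden node j (uses v[j] before it is updated).
                delta_j = err * v[j] * h[j] * (1.0 - h[j])
                v[j] -= learning_rate * err * h[j]
                b[j] -= learning_rate * delta_j
                for s in range(n_in):
                    w[j][s] -= learning_rate * delta_j * xi[s]
            c -= learning_rate * err
        history.append(total_sq_err / len(X))
    return {"w": w, "b": b, "v": v, "c": c}, history
```

Calling `train_mlp` on a small toy data set and inspecting the returned error history shows the mean square error falling as the two paths are repeated, which is the learning process in miniature.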

Support vector machine
SVM is a data mining tool and a supervised classification technique. This method can be used for prediction when the outcome variable is binary. SVM constructs multi-dimensional hyperplanes that separate the two classes while maximizing the margin between them. SVM uses kernel functions and can discriminate between classes that are not linearly separable [30]. The dataset was divided into two sets, with the training dataset comprising 6654 cases (70%) and the testing dataset containing 2874 cases (30%). Each model was developed using the training dataset and tested using the testing dataset [31]. For each model, the presence of type 2 diabetes was predicted, and a confusion matrix was constructed in order to measure the accuracy of the model. Accuracy is the proportion of cases classified correctly. Sensitivity is the proportion of positive cases classified correctly, while specificity is the proportion of negative cases classified correctly [7]. Mathematically, if TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative, then accuracy = (TP + TN)/(TP + FP + TN + FN), sensitivity = TP/(TP + FN), and specificity = TN/(FP + TN).
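The three measures defined above follow directly from the four cells of the confusion matrix, as in this short Python sketch. The function name `classification_metrics` is hypothetical, and any counts passed to it in an example are illustrative rather than taken from the study's tables.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / total,    # all correctly classified cases
        "sensitivity": tp / (tp + fn),    # positives classified correctly
        "specificity": tn / (tn + fp),    # negatives classified correctly
    }
```

One consequence of these definitions is that accuracy equals prevalence × sensitivity + (1 − prevalence) × specificity, where prevalence is the share of positive cases in the evaluated data. With a low prevalence, accuracy is therefore dominated by specificity.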

Statistical analysis
The data were analyzed using R version 3.0.2. Descriptive statistics were generated for all variables. Groups were compared using chi-square tests for categorical variables, independent-sample t-tests for normally distributed variables, and Mann-Whitney tests for non-normally distributed variables. MLR was used to identify factors that are strongly associated with type 2 DM.

Characteristics of the population
Descriptive statistics for anthropometric and biochemical characteristics are reported in Tables 1 and 2, respectively. Of the 9528 subjects, 1361 had type 2 diabetes. The mean age of diabetic individuals was higher than that of non-diabetic individuals (52.01 ± 7.2 vs. 47.70 ± 8.1). Of the 1361 diabetic individuals, 843 (61.9%) were female, 1239 (91%) were married, and 783 (57.5%) were unemployed. Subjects with DM were significantly (p < 0.05) older and had higher BMI, systolic blood pressure, diastolic blood pressure, serum TC, and LDL cholesterol, as well as a lower level of HDL-C, compared to individuals without DM (Table 1).

Results of artificial neural network model
To design the neural network and find the best model, we evaluated candidate networks using the least-squares error function. Initially, the learning rate and the number of neurons were set to 0.05 and 10, respectively. A sigmoid activation function was used in the hidden layer, and a linear activation function was used in the output layer. The learning rate was then gradually increased up to 0.5, and the number of neurons up to 25. All combinations of learning rate and number of neurons were tested.
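The search over learning rates and hidden-layer sizes described above can be sketched as an exhaustive grid search. The step sizes are assumptions, since only the end points (0.05 to 0.5 and 10 to 25) are reported; `grid_search` and `evaluate_mse` are hypothetical names, and `evaluate_mse` stands for whatever routine trains a network with the given settings and returns its mean square error.

```python
from itertools import product


def grid_search(evaluate_mse, learning_rates, hidden_sizes):
    """Try every (learning rate, hidden size) pair and keep the lowest error.

    evaluate_mse(learning_rate, n_hidden) must return a mean square error.
    Returns a tuple (best_mse, best_learning_rate, best_n_hidden).
    """
    best = None
    for lr, n_hidden in product(learning_rates, hidden_sizes):
        mse = evaluate_mse(lr, n_hidden)
        if best is None or mse < best[0]:
            best = (mse, lr, n_hidden)
    return best


# Assumed grids: steps of 0.05 and 1 are illustrative, not reported values.
learning_rates = [round(0.05 * k, 2) for k in range(1, 11)]  # 0.05, 0.10, ..., 0.50
hidden_sizes = list(range(10, 26))                           # 10, 11, ..., 25
```

Under these assumed steps the grid contains 10 × 16 = 160 combinations, each requiring one full training run, which is why a small, fixed grid is a practical choice for this kind of tuning.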
Subsequently, the best model was selected according to the mean square error. After considering different combinations, a three-layer perceptron network with 24 nodes in the input layer, 20 nodes in the hidden layer, and 2 nodes in the output layer, with a learning rate of 0.2, was used because it had the lowest mean square error of all combinations (Table 3). The six most important variables in the ANN model were, in order: family history of diabetes, hs-CRP, TC, BPS, TG, and age (Fig. 1B).
Iran. Biomed. J. 22 (5): 303-311

Results of support vector machines model
The SVM model used the same 17 characteristics as the ANN model, and both sigmoid and polynomial kernel functions were tried in order to identify the best kernel function. Based on the results of the SVM model, family history of diabetes, age, hs-CRP, TG, BMI, and PAL were the most important risk factors related to type 2 diabetes (Fig. 1C). Table 4 summarizes the accuracy, sensitivity, specificity, and area under the ROC curve for the three models. As shown in the Table, the SVM model had the best performance. ROC curves for the three models are displayed in Figure 2.
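The area under the ROC curve reported for each model can be computed without drawing the curve, because it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half). The following Python sketch uses that rank-based identity; the function name `roc_auc` is hypothetical, and labels are assumed to be coded 1 for diabetic and 0 for non-diabetic.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for illustration; a sort-based rank computation gives the same value more efficiently on a test set of several thousand cases.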

DISCUSSION
In the present study, we developed and explored the effectiveness of three data mining models, ANN, MLR, and SVM, for identifying potential factors associated with type 2 diabetes in a large population of 9528 subjects with and without diabetes. In line with previous studies, our results indicated that ANN is better than MLR and SVM for identifying associated risk factors of type 2 diabetes [10,19]. MLR, ANN, and SVM were fitted on the training dataset, and all models were evaluated on the testing dataset. Based on demographic and biochemical markers, our findings showed that the ANN model had a higher predictive accuracy than the MLR and SVM models.
Sensitivity and specificity are two important factors for the validity of a model [32]; therefore, these two criteria were calculated for the three models. The results revealed that the ANN model had higher sensitivity and specificity than the MLR model and lower sensitivity than the SVM model, which agrees with the findings of Meng et al. [8] and Sedehi et al. [29]. Investigations have also suggested that ANN could give a better prediction value than MLR [8,17]. Those studies compared ANN and MLR models using demographic and anthropometric variables, whereas we have also used biochemical parameters. Moreover, we found that in the ANN model, family history of diabetes and hs-CRP had the most important role in identifying individuals with type 2 diabetes, while in the MLR model, family history of diabetes, TG, and hs-CRP, and in the SVM model, family history of diabetes, age, and hs-CRP played a key role in determining diabetes risk. This shows that hs-CRP is an important factor common to all three models for identifying individuals with type 2 diabetes.
A major strength of the present study is that it was performed in a large sample, thus providing new insight into the application of ANN, MLR, and SVM for investigating new potential risk factors associated with type 2 diabetes in a representative sample of the Iranian population. However, future cohort studies are needed to give better estimates of the accuracy, sensitivity, specificity, and area under the ROC curve for ANN.