Dr Alina
Advanced Quantitative Research Methods PG8014
Neural Networks: SPSS Statistical Analysis and Data Mining
MAIN CONTENT
1. What are neural networks?
2. Parameter settings and SPSS example analysis for the neural network model
What are neural networks?
Artificial neural networks (ANNs) consist of layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to other nodes and has an associated weight and threshold. If the output of an individual node is above the specified threshold value, that node is activated and sends data to the next layer of the network. Otherwise, no data is passed along to the next layer.
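The threshold rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SPSS computes activations: the function name `node_output` and the example weights are hypothetical, and a node that is not activated is represented here by an output of 0.0.

```python
def node_output(inputs, weights, threshold):
    """Weighted sum of the inputs; passed on only if it exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    # Activated: send the value to the next layer. Otherwise send nothing (0.0).
    return total if total > threshold else 0.0


# Hypothetical example: two inputs with weights 0.5 and 0.25.
active = node_output([1.0, 2.0], [0.5, 0.25], threshold=0.75)
silent = node_output([1.0, 2.0], [0.5, 0.25], threshold=1.5)
print(active, silent)
```

In practice, trained networks usually replace this hard cutoff with a smooth activation function, but the role of weights and threshold is the same.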
Neural networks rely on training data to learn and improve their accuracy over time. Once these learning algorithms are fine-tuned for accuracy, they become powerful tools in computer science and artificial intelligence, allowing us to classify and cluster data at high speed. Tasks in speech recognition or image recognition can take minutes rather than the hours needed for manual identification by human experts. One of the most well-known neural networks is Google’s search algorithm.
The analysis uses the neural network module in SPSS and a data set that ships with SPSS, bankloan.sav. This data set holds the credit records of bank loan customers. It contains 12 variables, including age, ed, address, and income, and 850 cases. The format of the data set is shown in the accompanying figure. In what follows, 700 cases from this data set are used as the training data to build a multilayer perceptron neural network model, and the model is then applied to the credit records of the remaining 150 customers to judge whether their credit is good or bad.
SPSS Neural Network Model Setting and Example Analysis
First select the menu "Transform" → "Random Number Generators". The dialog box shown in Figure 2 opens. Select "Set Starting Point", choose the "Fixed Value" option, enter 9191972, and then click the "OK" button (Figure 3). Fixing the starting point makes the random partition reproducible. Next select the menu "Transform" → "Compute Variable". The dialog box shown in Figure 4 opens. Enter the variable name partition in the "Target Variable" box, and enter the expression 2*RV.BERNOULLI(0.7)-1 in the "Numeric Expression" box. RV.BERNOULLI(0.7) draws 1 with probability 0.7 and 0 otherwise, so the expression yields 1 for about 70% of the cases and -1 for the rest, and the result is stored in the new variable partition. After setting, click the "OK" button to run the calculation.
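The effect of the expression 2*RV.BERNOULLI(0.7)-1 can be sketched in Python. This is an illustration under stated assumptions, not a reproduction of the SPSS output: the helper `make_partition` is hypothetical, and Python's random number generator differs from the one in SPSS, so reusing the seed 9191972 will not give the same assignment of cases. In the SPSS Multilayer Perceptron procedure, cases with a positive partition value go to the training sample and cases with a negative value go to the holdout sample.

```python
import random


def make_partition(n_cases, p_train=0.7, seed=9191972):
    """Return a list of +1 / -1 values, mimicking 2*RV.BERNOULLI(p_train)-1."""
    rng = random.Random(seed)  # fixed seed, so the split is reproducible
    values = []
    for _ in range(n_cases):
        bernoulli = 1 if rng.random() < p_train else 0
        values.append(2 * bernoulli - 1)  # 1 -> +1 (training), 0 -> -1 (holdout)
    return values


partition = make_partition(850)  # one value per case in bankloan.sav
n_train = sum(1 for v in partition if v > 0)
n_holdout = sum(1 for v in partition if v < 0)
print(n_train, n_holdout)
```

Because the draw is random, the training share is only approximately 70%; the exact counts depend on the generator and the seed.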
After generating the partition variable, select the menu "Analyze" → "Neural Networks" → "Multilayer Perceptron". The dialog box shown in Figure 5 opens. Move the variable Previously defaulted [default] to the "Dependent Variables" box. Move the variable Level of education [ed] to the "Factors" box. Move the variables age, employ, address, income, debtinc, creddebt, and othdebt to the "Covariates" box.
Then click the "Partitions" tab. In the dialog box shown in Figure 5, select the "Use Partitioning Variable to Assign Cases" option, and then move the variable partition to the "Partitioning Variable" box.
Then click the "Output" tab. The dialog box shown in Figure 6 opens. Select the "ROC Curve", "Cumulative Gains Chart", "Lift Chart", and "Predicted by Observed Chart" options, and deselect the "Diagram" option. Finally, select the "Independent Variable Importance Analysis" option, and then click the "OK" button to run the analysis.
Results analysis
After the settings are complete, click the "OK" button in the main dialog to run the analysis. The first output is the case processing summary, shown in Figure 7. Of the 850 cases in the data set, 700 are valid and 150 are excluded.
Next comes the network information, shown in Figure 8, which describes the input layer, hidden layer, and output layer. There are 7 covariates, and the network has 1 hidden layer containing 4 units.
The next output is the model summary, shown in Figure 9, which gives various information about the training data set. Figure 10 is the classification table, which gives the percentage of cases classified correctly in each sample. For example, 375 cases in the training sample have the observed value No. Of these, 347 are predicted as No and 28 are predicted as Yes, so the percentage correct for No is 92.5%. The percentage correct for the observed value Yes is 59.7%, which is not very good. The overall percentage correct is 84.4%. In the holdout sample, the percentage correct for Yes is only 45.8%, which is also not very good.
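The percentage correct for the No row can be checked by hand from the counts quoted above. The short sketch below does the arithmetic in Python. The helper `percent_correct` is a hypothetical name, and only the No row is computed because the counts for the Yes row are not given in the text.

```python
def percent_correct(n_correct, n_total):
    """Share of cases in one observed category that were predicted correctly."""
    return 100.0 * n_correct / n_total


# Training sample, observed value No (counts from the classification table).
no_total = 375
no_predicted_no = 347
no_predicted_yes = no_total - no_predicted_no  # cases wrongly predicted as Yes

print(no_predicted_yes)                                       # 28
print(round(percent_correct(no_predicted_no, no_total), 1))   # 92.5
```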
The results of this first model are not very good, so the settings need to be revised. Select the menu "Transform" → "Compute Variable". The dialog box shown in Figure 11 opens. Enter the variable name partition in the "Target Variable" box, and enter the new expression partition - RV.BERNOULLI(0.2) in the "Numeric Expression" box. This expression subtracts a Bernoulli draw (1 with probability 0.2, otherwise 0) from the current value, and the result is again stored in the variable partition.
Then click the "If" button in Figure 11 to set the condition, as shown in Figure 12. Select the "Include if case satisfies condition" option, enter partition>0, and then click the "Continue" button. With this condition, only cases currently in the training sample are recomputed, so about 20% of them change from 1 to 0 and the holdout cases are left unchanged.
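The conditional recomputation can be sketched as follows. This is a hedged illustration, not SPSS output: the function `add_testing_sample` and its seed are hypothetical, and the random draws will not match those of SPSS. In the SPSS Multilayer Perceptron procedure, cases whose partition value is 0 form the testing sample.

```python
import random


def add_testing_sample(partition, p_test=0.2, seed=0):
    """Mimic partition - RV.BERNOULLI(p_test), applied only where partition > 0."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    updated = []
    for value in partition:
        if value > 0:  # the "If" condition: training cases only
            value -= 1 if rng.random() < p_test else 0  # 1 -> 0 with prob. p_test
        updated.append(value)  # holdout cases (-1) pass through unchanged
    return updated


# Hypothetical starting split: 100 training cases and 50 holdout cases.
old = [1] * 100 + [-1] * 50
new = add_testing_sample(old)
print(new.count(1), new.count(0), new.count(-1))
```

After this step the variable takes three values: 1 for training, 0 for testing, and -1 for holdout.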
After setting, select the "Save" tab in Figure 4. The dialog box shown in Figure 13 opens. Select the "Save Predicted Pseudo-probability for Each Dependent Variable" option, and then click the "OK" button to run the analysis.
The results are as follows. First comes the case processing summary, shown in Figure 14, which reports a total of 499 initial cases.
The next output is the network information shown in Figure 15. Unlike the earlier network, the hidden layer now contains 7 units.
Figure 16 shows the model summary, including the percentage of incorrect predictions: 15.2% for the training sample and 25.9% for the holdout sample. Figure 17 shows the classification results, giving the percentage correct for each category.
The next output is the ROC curve, shown in Figure 18. The ROC curve is used to compare the quality of classification methods in a binary decision. A good classifier keeps the abscissa (1 - specificity) as small as possible and the ordinate (sensitivity) as large as possible, so the ROC curve of the better method should always lie above and to the left of the ROC curve of the poorer method. Figure 19 shows the cumulative gains chart.
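To make the construction of the ROC curve concrete, the sketch below computes its points from predicted pseudo-probabilities by lowering the cutoff one case at a time. It is a minimal illustration, not the SPSS procedure: the function names are hypothetical, the six labels and scores are made-up toy values rather than bankloan.sav results, and tied scores are assumed not to occur.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, from (0, 0) up to (1, 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Visit cases from the highest score to the lowest (cutoff moves downward).
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve; closer to 1 means a better method."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (hypothetical): 1 = Yes, 0 = No, with predicted pseudo-probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
print(round(area_under_curve(points), 3))  # 0.889
```

A curve that hugs the upper left corner encloses more area, which is why the area under the curve is often used as a single-number summary when comparing two methods.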