Correlation Analysis and Nonparametric Statistics for Variables of Different Data Types in Graduation Design Projects

It is graduation season again, and this year's graduation defense is approaching. This year's graduation design projects covered eye movement research, the campus “heads-down tribe”, human-computer interface interaction design, and applications of queuing theory. The queuing theory topic was chosen independently by one of the students; the other topics were offered to students for reference. The human-computer interface interaction design topic was a first attempt to combine my topics with the Information Management and Information System major: two students worked on the same system from different perspectives (back-end database development and front-end interactive interface design). The specific projects are as follows:

Eye movement research project: the effects of visual contrast and intent on attention to advertising keywords

Campus “heads-down tribe” research projects: measurement of upper limb musculoskeletal discomfort in college students with different levels of mobile phone use; influencing factors and prevention strategies for neck and shoulder pain in the campus “heads-down tribe”

Human-computer interface interaction design project: human-computer interface interaction design of a campus C2C second-hand book information system

Queuing theory application project: parking space allocation and management fee standards for a shopping mall underground parking lot based on a queuing theory model

The campus “heads-down tribe” has been my main research direction in recent years, and three of this year's topics were related to it. They applied different research methods to study upper limb musculoskeletal discomfort in this group, and survey tools such as questionnaires and Likert scales were used repeatedly. Questionnaire and scale data are not continuous variables, so parametric statistical methods cannot be applied to them directly. Even the sEMG and eye movement data collected in the ergonomic experiments had to be tested for normality before a parametric test could be used. It is therefore worth summarizing the correlation analysis methods and nonparametric statistical methods that suit different data types.

1. Data types of variables

The most common way to classify data is by level of measurement, which gives categorical (nominal), ordinal, interval (equidistant), and ratio variables. Interval and ratio variables are continuous variables; categorical and ordinal variables are discrete variables. Interval variables have equal units but no absolute zero, so they support addition and subtraction but not multiplication and division. Ratio variables have both equal units and an absolute zero, so they support all four arithmetic operations. Likert scale data are ordinal variables. Questionnaire data and the independent variables in an experimental design are mostly categorical variables, while sEMG data and eye movement data are ratio variables. If an ordinal variable such as Likert scale data can be regarded as an interval variable on the basis of the Mantel-Haenszel trend test (which uses the scores the researcher assigns to the ordinal categories to assess the linear trend between two ordinal variables), it can also be analyzed as a continuous variable.

2. Correlation analysis of variables with different data types

Pearson correlation measures the strength of the linear association between two continuous variables. Both variables must come from populations that are normally or approximately normally distributed.

Spearman correlation is generally used for correlation analysis between two ordinal variables. More broadly, it tests the strength and direction of a monotonic association when at least one variable is ordinal, or when both variables are continuous but the populations they come from are not normally distributed or have an unknown distribution.
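Spearman correlation amounts to a Pearson correlation computed on ranks, with tied values sharing their average rank. The following is a minimal sketch of that idea using only the Python standard library; the helper names (`average_ranks`, `pearson_r`, `spearman_rho`) and the sample data are hypothetical and chosen for illustration. It returns only the coefficient, not a significance test.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: Likert discomfort ratings and daily phone use in hours.
discomfort = [1, 2, 2, 3, 4, 5, 5]
phone_hours = [1.0, 2.5, 2.0, 3.5, 5.0, 6.5, 8.0]
print(round(spearman_rho(discomfort, phone_hours), 3))
```

Because only ranks enter the calculation, any strictly increasing relationship gives a coefficient of 1, whether or not it is linear.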

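Kendall’s tau-b correlation is a nonparametric method that also tests the strength and direction of association when at least one variable is ordinal. Its scope of application is essentially the same as that of Spearman correlation, but it is better suited to data with many tied values (such as contingency tables).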

For correlation analysis between two categorical variables, the Chi-square test of independence can be used. This test only establishes whether the association is statistically significant; it does not reflect the strength of the association, so it is often reported together with Cramer’s V, which indicates that strength.
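The relationship between the two quantities can be sketched as follows: the Chi-square statistic compares observed and expected cell counts, and Cramer’s V rescales it to a value between 0 and 1. The function names and the contingency table are hypothetical, no continuity correction is applied, and the p-value (from a Chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom) is left to a statistics package.

```python
from math import sqrt


def chi_square_statistic(table):
    """Pearson Chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (n * (min(rows, columns) - 1)))."""
    n = sum(sum(row) for row in table)
    smaller_dim = min(len(table), len(table[0]))
    return sqrt(chi_square_statistic(table) / (n * (smaller_dim - 1)))


# Hypothetical counts: rows = heavy / light phone users,
# columns = neck pain reported / not reported.
table = [[30, 10],
         [15, 25]]
print(round(chi_square_statistic(table), 3), round(cramers_v(table), 3))
```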

For correlation analysis between an ordinal variable and a continuous variable, the continuous variable is first treated as an ordinal variable, so that the problem becomes the relationship between two ordinal variables, and Spearman correlation can be used.

For a detailed description of this part, please refer to the document as follows.


3. Normality test of sample data

The one-sample Kolmogorov-Smirnov (K-S) test checks whether a sample comes from a normally distributed population. The Binomial test checks whether the observed distribution of a dichotomous variable agrees with a hypothesized or expected binomial distribution.
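As an illustration of the Binomial test, the sketch below computes an exact two-sided p-value by summing the probabilities of all outcomes that are no more likely than the observed one. This is one common definition of the two-sided p-value, assumed here; the function names and the counts are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_test_two_sided(successes, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes that are
    at most as likely as the observed count under the hypothesized p."""
    observed = binomial_pmf(successes, n, p)
    tolerance = 1 + 1e-7  # guards against floating-point near-ties
    total = sum(binomial_pmf(k, n, p)
                for k in range(n + 1)
                if binomial_pmf(k, n, p) <= observed * tolerance)
    return min(1.0, total)


# Hypothetical question: 16 of 20 students report neck pain.
# Is that consistent with a hypothesized rate of 50%?
print(round(binomial_test_two_sided(16, 20, p=0.5), 4))
```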

4. Nonparametric statistics of variables with different data types

Nonparametric tests are more reliable with large samples. For a single sample, the Chi-square goodness-of-fit test can be used to analyze whether the observed frequencies of a variable's values agree with the theoretical frequencies.
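A goodness-of-fit statistic can be sketched in a few lines. The function name and the frequencies below are hypothetical; the function returns the statistic and its degrees of freedom (number of categories minus one), and the p-value would then be read from the corresponding Chi-square distribution.

```python
def chi_square_goodness_of_fit(observed, expected_proportions):
    """Return (chi2, degrees_of_freedom) for observed category counts
    against hypothesized proportions that sum to 1."""
    n = sum(observed)
    expected = [n * p for p in expected_proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


# Hypothetical counts of answers across four response options,
# compared with the assumption that all options are equally likely.
chi2, df = chi_square_goodness_of_fit([18, 30, 27, 25], [0.25] * 4)
print(round(chi2, 3), df)
```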

To test whether two independent samples come from the same population, or whether their distributions are the same, the Mann-Whitney U test should be used when the data do not meet the normality condition or the variables are ordinal. It is the nonparametric counterpart of the independent-samples t-test. It requires the independent variable to be a categorical variable with two levels and the dependent variable to be measured on at least an ordinal scale (an ordinal or continuous variable).
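The mechanics of the Mann-Whitney U test can be sketched as follows: pool the two samples, rank them, and derive U from the rank sum of one sample. The p-value here uses a normal approximation with a tie correction and no continuity correction, which is an assumption of this sketch and only reasonable for moderately large samples; the function names and data are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b, approximate two-sided p-value)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = list(sample_a) + list(sample_b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    n = n_a + n_b
    # Tie correction for the variance of U; undefined if all values are equal.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values()) / (n * (n - 1))
    sd = sqrt(n_a * n_b / 12 * ((n + 1) - tie_term))
    z = (u_a - n_a * n_b / 2) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, u_b, p_value


# Hypothetical Likert discomfort scores for light and heavy phone users.
light_users = [1, 2, 2, 3, 3, 2, 1, 3]
heavy_users = [3, 4, 4, 5, 3, 5, 4, 2]
print(mann_whitney_u(light_users, heavy_users))
```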

Tests of the difference between two related samples apply to two experimental designs: repeated measures designs and paired sample designs. Four tests are available: the Wilcoxon signed-rank test, the Sign test, the McNemar test, and the Marginal Homogeneity test. They correspond to the paired-samples t-test and the significance test of the correlation coefficient in parametric statistics. The Wilcoxon signed-rank test is the most widely used and is suitable when the data have a continuous, symmetric distribution. The Sign test has slightly lower statistical precision. The McNemar test is only suitable for related dichotomous variables, and the Marginal Homogeneity test extends it to variables with more than two response categories, but only for ordinal variables. Both are especially suitable for pretest-posttest designs.
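For small paired samples, the Wilcoxon signed-rank test can be illustrated with an exact calculation: rank the absolute differences, sum the ranks of the positive differences, and compare that sum with every possible assignment of signs. This enumeration approach is an assumption of the sketch (statistics packages typically use tables or a normal approximation), it drops zero differences, and it becomes slow beyond roughly 20 pairs. The function names and data are hypothetical.

```python
from itertools import product


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus, exact two-sided p-value by enumeration)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zeros
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    centre = sum(ranks) / 2  # expected W_plus when signs are random
    observed_deviation = abs(w_plus - centre)
    extreme = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed_deviation - 1e-9:
            extreme += 1
    return w_plus, w_minus, extreme / 2 ** len(ranks)


# Hypothetical discomfort scores before and after a posture intervention.
pretest = [6, 5, 7, 4, 6, 8, 5, 7]
posttest = [4, 5, 5, 3, 6, 5, 4, 6]
print(wilcoxon_signed_rank(pretest, posttest))
```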

To test the difference among multiple independent samples, the Kruskal-Wallis H test, the Median test, and the Jonckheere-Terpstra test can be used. They correspond to the analysis of variance for a one-way completely randomized design in parametric statistics. They require the independent variable to be a categorical variable with more than two levels and the dependent variable to be measured on at least an ordinal scale (an ordinal or continuous variable). The Kruskal-Wallis H test corresponds directly to one-way ANOVA and is the most frequently used. The Median test is in effect a contingency table analysis and has low precision. The Jonckheere-Terpstra test is similar to the Kruskal-Wallis H test but has higher precision when the grouping variable is ordinal.
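The Kruskal-Wallis H statistic can be sketched as a one-way ANOVA carried out on pooled ranks. The function names and the three groups below are hypothetical. The sketch applies the standard tie correction and returns only H; its p-value comes from a Chi-square distribution with (number of groups - 1) degrees of freedom, which a statistics package would supply.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Tie-corrected Kruskal-Wallis H for two or more independent groups."""
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)
    # Divide by the tie correction; it equals 1 when there are no ties.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - ties / (n_total ** 3 - n_total))


# Hypothetical discomfort scores for low, medium, and high phone use.
low = [1, 2, 2, 3, 1]
medium = [2, 3, 3, 4, 3]
high = [4, 5, 3, 5, 4]
print(round(kruskal_wallis_h(low, medium, high), 3))
```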

To test the difference among multiple related samples, the Friedman test, Cochran's Q test, and Kendall's W test can be used. They correspond to the analysis of variance for a randomized block design in parametric statistics. The Friedman test is the multi-sample counterpart of the Wilcoxon signed-rank test. Cochran's Q test applies only to several related dichotomous variables and is an extension of the McNemar test. Kendall's W test is used to test whether the opinions of different evaluators are consistent. Both the Friedman test and Cochran's Q test are applicable to repeated measures designs and paired sample designs. If the test finds a significant difference, further post-hoc pairwise tests are required, for example with the Wilcoxon signed-rank test.
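The Friedman statistic and Kendall's W are closely related, as the sketch below tries to show: both are built from the rank sums obtained by ranking each subject's (or evaluator's) scores across conditions. The function names and ratings are hypothetical, and no tie correction is applied, so treat the result as approximate when many scores are tied within a row.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square; rows holds one list of scores per subject,
    with one score for each of the k conditions."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        # Rank within the subject, then accumulate per condition.
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)


def kendalls_w(rows):
    """Kendall's W: Friedman statistic rescaled to the range 0..1."""
    n, k = len(rows), len(rows[0])
    return friedman_statistic(rows) / (n * (k - 1))


# Hypothetical ratings: each row is one evaluator scoring three interface designs.
ratings = [[3, 4, 5],
           [2, 4, 5],
           [3, 5, 4],
           [1, 3, 5]]
print(round(friedman_statistic(ratings), 3), round(kendalls_w(ratings), 3))
```

A value of W near 1 means the evaluators rank the conditions almost identically, while a value near 0 means there is no agreement.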

For a detailed description of this part, please refer to the following reference.

Ding Guosheng, Li Tao. SPSS统计教程——从研究设计到数据分析 [SPSS Statistics Tutorial: From Research Design to Data Analysis]. Beijing: China Machine Press, 2014.










