R: Common Statistical Tests
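Statistical testing is the process of comparing a sample result against a sampling distribution in order to reach a decision. It has five main steps:
1. State the hypotheses
2. Determine the sampling distribution
3. Choose the significance level and the rejection region
4. Compute the test statistic
5. Make the decision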
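Hypothesis testing, also called significance testing, is another major part of statistical inference; its purpose is to assess whether population parameters differ. In essence, a hypothesis test asks whether an observed "difference" can be explained by sampling error or reflects a real difference between populations, and it evaluates how strong the evidence is that two treatments produce different effects. The strength of that evidence is measured and expressed by the probability P. Besides the t distribution, other test statistics and distributions are used for other kinds of data, such as the F distribution and the chi-squared (χ²) distribution. The testing procedure is the same in every case; only the test statistic that has to be computed differs.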
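The five steps above can be traced by hand for a one-sample t-test. The sketch below is a minimal illustration, assuming the first group of the built-in `sleep` data as the sample, a null mean `mu0 = 0`, and a significance level `alpha = 0.05`; these choices are for demonstration only and are not taken from the text.

```r
# Step 1: hypotheses. H0: mu = mu0 versus H1: mu != mu0 (mu0 = 0 is an assumed value)
x     <- sleep$extra[sleep$group == 1]
mu0   <- 0
alpha <- 0.05

# Step 2: under H0 the statistic follows a t distribution with n - 1 degrees of freedom
n  <- length(x)
df <- n - 1

# Step 3: two-sided rejection region |t| > t_crit
t_crit <- qt(1 - alpha / 2, df)

# Step 4: test statistic
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 5: decision, plus the equivalent two-sided p-value
reject  <- abs(t_stat) > t_crit
p_value <- 2 * pt(-abs(t_stat), df)

cat("t =", t_stat, " critical value =", t_crit,
    " p =", p_value, " reject H0:", reject, "\n")
```

Comparing `abs(t_stat)` with `t_crit` and comparing `p_value` with `alpha` always lead to the same decision; `t.test(x, mu = mu0)` should report the same statistic and p-value.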
Hypothesis tests for the mean of a normal population
t-test
require(graphics)
## Classic example: Student's sleep data
plot(extra ~ group, data = sleep)
## Traditional interface
with(sleep, t.test(extra[group == 1], extra[group == 2]))
Welch Two Sample t-test
data:  extra[group == 1] and extra[group == 2]
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832  0.2054832
sample estimates:
mean of x mean of y
0.75      2.33
## Formula interface
t.test(extra ~ group, data = sleep)
Welch Two Sample t-test
data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75            2.33
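The fractional degrees of freedom in the output above (df = 17.776) come from the Welch-Satterthwaite approximation, which `t.test` uses by default when `var.equal` is not set. A minimal sketch that recomputes the statistic, the degrees of freedom, and the two-sided p-value by hand (the variable names are illustrative):

```r
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]

# Squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
df     <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
p_val  <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, df = df, p = p_val), 5)
```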
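One population
The lifetime X (in hours) of a certain component follows a normal distribution N(mu, sigma^2), where mu and sigma^2 are both unknown. The lifetimes of 16 components are listed below. Is there reason to believe that the mean lifetime is greater than 225 hours?
X <- c(159, 280, 101, 212, 224, 379, 179, 264,
       222, 362, 168, 250, 149, 260, 485, 170)
t.test(X, alternative = "greater", mu = 225)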
One Sample t-test
data:  X
t = 0.66852, df = 15, p-value = 0.257
alternative hypothesis: true mean is greater than 225
95 percent confidence interval:
198.2321      Inf
sample estimates:
mean of x
241.5
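The printed report is convenient to read, but the individual numbers can also be pulled out of the object that `t.test` returns (a list of class `"htest"`). A minimal sketch for the component-lifetime example; the significance level of 0.05 is an assumption, since the text does not state one for this question:

```r
X <- c(159, 280, 101, 212, 224, 379, 179, 264,
       222, 362, 168, 250, 149, 260, 485, 170)
res <- t.test(X, alternative = "greater", mu = 225)

res$statistic   # t statistic
res$parameter   # degrees of freedom
res$p.value     # one-sided p-value
res$conf.int    # one-sided 95% confidence interval (upper bound is Inf)
res$estimate    # sample mean

# Decision at an assumed significance level of 0.05
alpha <- 0.05
if (res$p.value < alpha) "reject H0" else "do not reject H0"
```

With p = 0.257 the null hypothesis is not rejected, so the sample gives no grounds for claiming that the mean lifetime exceeds 225 hours.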
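Two populations
X is the yield of the old steelmaking furnace and Y is the yield of the new one. Does the new procedure increase the yield?
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)
## Equal-variance two-sample test; H1: mean(X) < mean(Y)
t.test(X, Y, var.equal = TRUE, alternative = "less")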
Two Sample t-test
data:  X and Y
t = -4.2957, df = 18, p-value = 0.0002176
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -1.908255
sample estimates:
mean of x mean of y
76.23    79.43
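Paired t-test
Treating the observations as pairs from the same furnace, run a paired t-test on the differences:
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)
t.test(X - Y, alternative = "less")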
One Sample t-test
data:  X - Y
t = -4.2018, df = 9, p-value = 0.00115
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf -1.803943
sample estimates:
mean of x
-3.2
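A paired t-test is the same computation as a one-sample t-test on the pairwise differences. The sketch below checks this on the furnace data by running both forms; `paired = TRUE` is an argument of the default `t.test` method.

```r
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)

# Paired form and one-sample form on the differences
paired_res <- t.test(X, Y, paired = TRUE, alternative = "less")
diff_res   <- t.test(X - Y, alternative = "less")

# Both rows should be identical
rbind(paired     = c(t = unname(paired_res$statistic), p = paired_res$p.value),
      difference = c(t = unname(diff_res$statistic),   p = diff_res$p.value))
```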
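Hypothesis tests for the variance of a normal population
x <- rnorm(50, mean = 0, sd = 2)
y <- rnorm(30, mean = 1, sd = 1)
var.test(x, y)  # F test: do x and y have the same variance?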
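Twenty boys were sampled from the fifth grade of a primary school and their heights (in cm) were measured as listed below. Question: at the 0.05 significance level, is the mean equal to 149, and is sigma^2 equal to 75?
X <- c(136, 144, 143, 157, 137, 159, 135, 158, 147, 165,
       158, 142, 159, 150, 156, 152, 140, 149, 148, 155)
## The output below is an F test of X against Y, the new-furnace yields defined earlier.
## It compares two sample variances; it does not test sigma^2 = 75 directly.
var.test(X, Y)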
F test to compare two variances
data:  X and Y
F = 34.945, num df = 19, denom df = 9, p-value = 6.721e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
9.487287 100.643093
sample estimates:
ratio of variances
34.94489
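The question about the heights asks for two one-sample tests: mean equal to 149 and variance equal to 75. The sketch below is one way to run them, computing the chi-squared statistic for the variance directly from its definition, (n - 1) s^2 / sigma0^2 with n - 1 degrees of freedom. Doubling the smaller tail probability to get a two-sided p-value is a common convention and is an assumption here.

```r
X <- c(136, 144, 143, 157, 137, 159, 135, 158, 147, 165,
       158, 142, 159, 150, 156, 152, 140, 149, 148, 155)
mu0      <- 149   # hypothesised mean
sigma2_0 <- 75    # hypothesised variance

# One-sample t-test for the mean
p_mean <- t.test(X, mu = mu0)$p.value

# Chi-squared test for the variance
n       <- length(X)
df      <- n - 1
chi2    <- df * var(X) / sigma2_0
p_lower <- pchisq(chi2, df)
p_var   <- 2 * min(p_lower, 1 - p_lower)   # two-sided p-value (doubling convention)

c(chi2 = chi2, df = df, p_variance = p_var, p_mean = p_mean)
```

Both p-values exceed 0.05, so under these assumptions neither hypothesis is rejected at the 0.05 level.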
Analysis of the furnace data
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)
var.test(X, Y)
F test to compare two variances
data:  X and Y
F = 1.4945, num df = 9, denom df = 9, p-value = 0.559
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3712079 6.0167710
sample estimates:
ratio of variances
1.494481
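Tests for a binomial proportion
A batch of vegetable seeds has an average germination rate of P = 0.85. A random sample of 500 seeds was soaked in a seed-coating agent, and 445 of them germinated. Did the treatment have an effect?
binom.test(445, 500, p = 0.85)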
Exact binomial test
data:  445 and 500
number of successes = 445, number of trials = 500, p-value = 0.01207
alternative hypothesis: true probability of success is not equal to 0.85
95 percent confidence interval:
0.8592342 0.9160509
sample estimates:
probability of success
0.89
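From past experience, the rate of chromosomal abnormality in newborns is about 1%. A hospital examined 400 local newborns and found one case. Is the rate in this region lower than the general level?
binom.test(1, 400, p = 0.01, alternative = "less")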
Exact binomial test
data:  1 and 400
number of successes = 1, number of trials = 400, p-value = 0.09048
alternative hypothesis: true probability of success is less than 0.01
95 percent confidence interval:
0.0000000 0.0118043
sample estimates:
probability of success
0.0025
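For a one-sided alternative, the exact binomial p-value is simply a tail probability under the null hypothesis. The sketch below recomputes the p-value of the newborn example as P(X <= 1) for X ~ Binomial(400, 0.01) and compares it with `binom.test`.

```r
n  <- 400    # newborns examined
k  <- 1      # observed abnormal cases
p0 <- 0.01   # rate under H0

# P(X <= k) under H0: the p-value for the alternative "less"
p_manual <- sum(dbinom(0:k, n, p0))
p_test   <- binom.test(k, n, p = p0, alternative = "less")$p.value

c(manual = p_manual, binom.test = p_test)
```

Since the p-value is about 0.09, the data do not show at the 0.05 level that the local rate is below 1%.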
Nonparametric tests
Pearson chi-squared goodness-of-fit test (chisq.test)
The numbers of people who prefer each of five beer brands are as follows:
A 210
B 312
C 170
D 85
E 223
Do the numbers of people preferring each brand differ?
X <- c(210, 312, 170, 85, 223)
chisq.test(X)
Chi-squared test for given probabilities
data:  X
X-squared = 136.49, df = 4, p-value < 2.2e-16
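When no probabilities are supplied, `chisq.test` tests against equal probabilities for all cells. The sketch below recomputes Pearson's statistic for the beer data from its definition, the sum of (observed - expected)^2 / expected, with k - 1 degrees of freedom.

```r
X <- c(210, 312, 170, 85, 223)

# Expected counts under H0: every brand is equally popular
expected <- rep(sum(X) / length(X), length(X))

# Pearson's chi-squared statistic, degrees of freedom, and p-value
chi2  <- sum((X - expected)^2 / expected)
df    <- length(X) - 1
p_val <- pchisq(chi2, df, lower.tail = FALSE)

c(chi2 = chi2, df = df, p = p_val)
```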
Test whether the students' grades follow a normal distribution
X<-scan()
25 45 50 54 55 61 64 68 72 75 75
78 79 81 83 84 84 84 85 86 86 86
87 89 89 89 90 91 91 92 100
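A <- table(cut(X, br = c(0, 69, 79, 89, 100)))
# cut() splits the range of the variable into intervals
# table() counts the observations that fall in each interval
p <- pnorm(c(70, 80, 90, 100), mean(X), sd(X))
p <- c(p[1], p[2] - p[1], p[3] - p[2], 1 - p[3])
chisq.test(A, p = p)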
Chi-squared test for given probabilities
data:  A
X-squared = 8.334, df = 3, p-value = 0.03959
# p < 0.05: at the 0.05 level, reject the hypothesis that the grades are normally distributed
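In the hybrid offspring of barley, the theoretical ratio of awnless : long-awned : short-awned plants is 9:3:4, while the observed counts are 335:125:160. Are the observations consistent with the theoretical ratio?
chisq.test(c(335, 125, 160), p = c(9, 3, 4) / 16)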
Chi-squared test for given probabilities
data:  c(335, 125, 160)
X-squared = 1.362, df = 2, p-value = 0.5061
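There are 42 observations, each giving the number of calls received by a telephone switchboard during a given time interval:
Calls received   0   1   2   3   4   5   6
Frequency        7  10  12   8   3   2   0
Question: does the number of calls received in a time interval follow a Poisson distribution?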
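x <- 0:6
y <- c(7, 10, 12, 8, 3, 2, 0)
lambda <- mean(rep(x, y))      # estimated Poisson mean
q <- ppois(x, lambda)          # cumulative probabilities P(X <= 0), ..., P(X <= 6)
n <- length(y)
p <- numeric(n)
p[1] <- q[1]
p[n] <- 1 - q[n-1]             # the last cell covers X >= 6
for (i in 2:(n-1))
    p[i] <- q[i] - q[i-1]      # P(X = i - 1)
chisq.test(y, p = p)
Chi-squared test for given probabilities
data:  y
X-squared = 1.5057, df = 6, p-value = 0.9591
## Merge the last three cells, whose expected counts are below 5
Z <- c(7, 10, 12, 8, 5)
n <- length(Z); p <- p[1:(n-1)]; p[n] <- 1 - q[n-1]
chisq.test(Z, p = p)
Chi-squared test for given probabilities
data:  Z
X-squared = 0.5389, df = 4, p-value = 0.9696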
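The smaller the P value, the stronger the grounds for rejecting the null hypothesis and the stronger the statistical evidence that the populations differ. Note that failing to reject H0 does not mean that H0 is supported; it only means that the available sample does not carry enough information to reject H0.
By convention, P > 0.05 is called "not significant", 0.01 < P ≤ 0.05 "significant", and P ≤ 0.01 "highly significant".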
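The conventional thresholds can be applied mechanically. The helper below, `significance_label`, is a hypothetical name introduced only for illustration; it maps p-values to the three labels with `cut()`, using p-values that appear in the outputs earlier in this article as inputs.

```r
# Map p-values to the conventional labels:
#   P <= 0.01 -> "highly significant", 0.01 < P <= 0.05 -> "significant",
#   P > 0.05  -> "not significant"
significance_label <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.01, 0.05, 1),
                   labels = c("highly significant", "significant", "not significant"),
                   include.lowest = TRUE, right = TRUE))
}

# p-values from the sleep, furnace, and grades examples
significance_label(c(0.07939, 0.0002176, 0.03959))
```

These labels are only a reporting convention; the P value itself carries more information than the label.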