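1. For undergraduates outside statistics, computer science, and similar majors who have no programming or R background, introduce any one feature of R (a function, a piece of syntax, a package, and so on) of your own choosing. Requirements: (1) explain what the function/syntax/package is for and how to use it; (2) include R code; (3) explain the code in detail; (4) use consistent formatting.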
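The knn() function
K-Nearest Neighbor Classification: one of the simplest and easiest-to-understand supervised learning models in machine learning. It classifies a point by looking at the labeled points closest to it, which can be viewed as estimating the conditional probability of each class from the neighborhood and picking the most probable class (an approximation to the Bayes classifier). It is suited to predicting a categorical (qualitative) variable.
How it works: using Euclidean distance, select the k training points nearest to the target point, then assign the target point the class that occurs most often among those k points.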
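To make the "nearest points, then majority vote" idea concrete, here is a minimal sketch that classifies a single flower by hand using only base R. The helper name `knn_one` is made up for this illustration and is not part of any package; it also breaks ties by taking the first class, which is a simplification.

```r
# Hypothetical helper (not from the class package): classify one new point
# by majority vote among its k nearest training points.
knn_one <- function(train_x, train_y, new_point, k = 5) {
  train_x <- as.matrix(train_x)
  new_point <- as.numeric(unlist(new_point))
  # Euclidean distance from the new point to every training row
  diffs <- sweep(train_x, 2, new_point)   # subtract new_point from each row
  dists <- sqrt(rowSums(diffs^2))
  # Row positions of the k smallest distances
  nearest <- order(dists)[1:k]
  # Count the classes among the k neighbors and return the most frequent one
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

data(iris)
# Classify the first flower, using all the other flowers as training data
knn_one(iris[-1, 1:4], iris[-1, 5], iris[1, 1:4], k = 5)   # "setosa"
```

The first flower really is a setosa, so the vote among its five nearest neighbors gives the right answer here.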
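# Use the iris data set as an example: the first four variables are four measurements of each flower, and the last variable is the species
data(iris)
# Load the data set
iris
# View the contents of the data set
dt = sort(sample(nrow(iris), nrow(iris)*.7))
# Split the data into a training set and a test set at a ratio of 7:3.
# sample() draws row numbers at random, so the split changes from run to run unless set.seed() is called before this line.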
train <- iris[dt,1:4]
test <- iris[-dt,1:4]
train_target <- iris[dt, 5]
test_target <- iris[-dt, 5]
cat("The shape of iris dataset is", dim(iris), "\n")
The shape of iris dataset is 150 5
cat("The shape of train dataset is", dim(train), "\n")
The shape of train dataset is 105 4
cat("The shape of test dataset is", dim(test), "\n")
The shape of test dataset is 45 4
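Because the split is random, two runs of the notebook will usually train and test on different flowers. A minimal sketch of how a seed makes the split repeatable, assuming an arbitrary seed value of 42 and illustrative variable names `idx_a` and `idx_b`:

```r
data(iris)
n <- nrow(iris)

set.seed(42)                                # the seed value itself is arbitrary
idx_a <- sort(sample(n, round(n * 0.7)))    # row numbers for the training set

set.seed(42)                                # reset to the same seed
idx_b <- sort(sample(n, round(n * 0.7)))    # draw the split again

identical(idx_a, idx_b)                     # TRUE: same seed, same split
length(idx_a)                               # 105 training rows
length(setdiff(seq_len(n), idx_a))          # 45 rows left over for testing
```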
library(class)
set.seed(1)
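# Set a random seed so that knn() breaks ties between classes the same way on every run (the train/test split above was drawn before this seed was set, so the seed does not fix the split)
pred = knn(train, test, train_target, k=5)
# knn() takes the training features, the test features, and the training labels, and predicts the class of each test row from its 5 nearest training rows.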
conf_matrix = table(pred, test_target)
conf_matrix
            test_target
pred         setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         16         2
  virginica       0          3        10
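# table() cross-tabulates the predictions against the true classes to produce the confusion matrix, which is used to compute the prediction accuracy. Rows are predicted classes, columns are actual classes, and the diagonal holds the correct predictions.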
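accuracy1 = (14+16+10)/45
cat("accuracy1 is", accuracy1, "\n")
accuracy1 is 0.8888889
# Compute accuracy from the confusion matrix: correct predictions (the diagonal entries 14, 16 and 10) divided by the total number of test rows (45). With only two classes this is the familiar accuracy = (TP + TN)/(P + N).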
accuracy2 = mean(pred==test_target)
cat("accuracy2 is", accuracy2, "\n")
accuracy2 is 0.8888889
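# mean() computes the accuracy directly: pred==test_target is TRUE for each correct prediction, and the mean of those TRUE/FALSE values is the proportion correct. Same principle as above.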
The computed accuracy shows that with k set to 5, about 0.89 of the test set is predicted correctly (40 of the 45 flowers).
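Typing the diagonal entries by hand is easy to get wrong, especially when the split changes between runs. The sketch below rebuilds the confusion matrix shown above as a plain matrix and lets R pick out the diagonal; the name `conf_mat` is chosen for this illustration only.

```r
# The confusion matrix from the output above, entered by hand
# (rows = predicted class, columns = actual class)
classes <- c("setosa", "versicolor", "virginica")
conf_mat <- matrix(c(14,  0,  0,
                      0, 16,  2,
                      0,  3, 10),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(pred = classes, actual = classes))

correct  <- sum(diag(conf_mat))   # 40 correct predictions on the diagonal
total    <- sum(conf_mat)         # 45 test rows in all
accuracy <- correct / total
round(accuracy, 4)                # 0.8889

# Share of each actual class that was predicted correctly
diag(conf_mat) / colSums(conf_mat)
```

In the notebook itself the same idea would apply to the object returned by table(), since diag() and sum() also work on tables.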
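2. For undergraduates outside statistics, computer science, and similar majors who have no programming or R background, introduce any one commonly used statistical formula with an example, implement it in R, and explain it.
Normalizing a skewed variable (so that the data meet the basic assumptions made in the analysis)
a. Read the data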
train <- read.csv("train.csv")
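# Read the data: load a table from the CSV file train.csv
# train is the name given to the data set; at this point its type is data.frame
train
# View the data set that has just been read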
b. Pick one numeric variable and examine how skewed it is in two ways: with a plot and with a skewness value
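# 1. Draw a histogram to see how skewed the variable is
# Load ggplot2 for plotting and scales for the comma axis labels.
# If library() reports "there is no package called 'ggplot2'", run install.packages("ggplot2") once and try again.
library(ggplot2)
library(scales)
ggplot(data = train, aes(x = SalePrice)) +              # the data set and the variable to plot
  geom_histogram(fill = "blue", binwidth = 10000) +     # draw the histogram
  scale_x_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)   # how the x-axis numbers are shown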
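From the plot we can see that the variable is right-skewed: most values sit on the left and a long tail stretches to the right.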
# 2. Check how skewed the variable is using the skewness coefficient
library(moments)
skew_score = skewness(train$SalePrice)
cat("The skewness of this variable is", skew_score) # cat() prints its arguments
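In general, when the data are right-skewed, Sk (the skewness coefficient) > 0, and the larger Sk is, the stronger the right skew; when the data are left-skewed, Sk < 0, and the smaller Sk is, the stronger the left skew. When the data are symmetric (as the normal distribution is), Sk = 0. (From Baidu Baike.)
For this variable the coefficient is positive, so the variable is right-skewed; this is the case that a log transform can correct, because log() compresses large values more than small ones.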
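The sign rule above is easier to trust after computing the coefficient by hand on a few tiny data sets. The sketch below uses the usual moment-based definition, the third central moment divided by the second central moment raised to the power 3/2; the helper `skew_by_hand` and the three small vectors are made up for this illustration.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
skew_by_hand <- function(x) {
  d  <- x - mean(x)   # deviations from the mean
  m2 <- mean(d^2)     # second central moment (variance with divisor n)
  m3 <- mean(d^3)     # third central moment, which keeps the sign of the tail
  m3 / m2^(3/2)
}

right_tail <- c(1, 2, 2, 3, 3, 3, 4, 20)   # one very large value: long right tail
left_tail  <- -right_tail                   # mirror image: long left tail
symmetric  <- c(1, 2, 3, 4, 5)

skew_by_hand(right_tail) > 0   # TRUE: right-skewed
skew_by_hand(left_tail) < 0    # TRUE: left-skewed
skew_by_hand(symmetric)        # 0: perfectly symmetric
```

Cubing the deviations is what gives the coefficient its sign: a few large positive deviations outweigh many small negative ones, and vice versa.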
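c. Normalize the skewed variable
train$SalePrice <- log(train$SalePrice) # apply log() to the variable to pull in the long tail and make it closer to normal
skew_score_nor = skewness(train$SalePrice) # measure the skewness coefficient of the variable again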
cat("The skewness of this variable after normalizing is", skew_score_nor)
ggplot(data = train, aes(x=SalePrice)) + geom_histogram(fill="blue", binwidth = 0.06)
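The changes in the skewness coefficient and in the histogram show that the log transform has brought the data close to a normal distribution.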
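For readers who do not have train.csv at hand, the same before-and-after comparison can be tried on simulated data. This is only a sketch under stated assumptions: the "prices" are random numbers drawn from a log-normal distribution with made-up parameters, and `skew_by_hand` is an illustrative helper, not the skewness() function from the moments package.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
skew_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3/2)
}

set.seed(123)                                        # arbitrary seed, for repeatability
prices <- rlnorm(1000, meanlog = 12, sdlog = 0.4)    # made-up right-skewed "prices"

skew_before <- skew_by_hand(prices)        # skewness of the raw values
skew_after  <- skew_by_hand(log(prices))   # skewness after the log transform

c(before = skew_before, after = skew_after)
abs(skew_after) < abs(skew_before)         # the transformed values are less skewed
```

A log transform only helps when the long tail is on the right and all values are positive; for a left-skewed variable it would make the skew worse.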