2018-09-03

RWeka

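1 Introduction to Weka

The word "Weka" has two meanings: it is the name of a flightless bird, and it is the abbreviation of an open-source machine learning project (Waikato Environment for Knowledge Analysis, http://www.cs.waikato.ac.nz/~ml/weka/ ). This article is about the second. The Weka project started in 1992 with support from the New Zealand government and is now widely known in the machine learning community.
Weka ships a very broad collection of machine learning algorithms, covering data preprocessing, classification, regression, clustering, and association rules. Its graphical interface is convenient for people who do not program, and its "KnowledgeFlow" feature lets you chain several steps into a workflow. Weka can also be run from the command line.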

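2 Using the package

The RWeka package comes with extensive online resources, including the reference manual, the function source code, and worked examples. See https://cran.r-project.org/web/packages/RWeka/index.html , which links to all of them.

# install.packages('RWeka')   # install the RWeka package (run once; a Java runtime is required)
library(RWeka)  # load the RWeka package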

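3 Core functions

3.1 Data input and output

WOW(): lists the options accepted by a Weka function.

Weka_control(): sets the options passed to a Weka function.

read.arff(): reads data in Weka's Attribute-Relation File Format (ARFF).

write.arff(): writes data to a file in Weka's Attribute-Relation File Format (ARFF).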
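As a minimal sketch of the two I/O functions, the block below writes the built-in iris data to a temporary ARFF file and reads it back. The temporary file and the variable names are chosen for illustration only.

```r
library(RWeka)

# Write iris to a temporary ARFF file, then read it back.
arff_path <- tempfile(fileext = ".arff")
write.arff(iris, file = arff_path)
iris_back <- read.arff(arff_path)

dim(iris_back)   # should match dim(iris)
str(iris_back)   # nominal ARFF attributes come back as factors

unlink(arff_path)  # remove the temporary file
```

A round trip like this is a quick way to check that column types survive the conversion before handing a file to the Weka GUI.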

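3.2 Data preprocessing

Normalize(): unsupervised normalization of continuous (numeric) data.

Discretize(): supervised discretization of continuous numeric data using the MDL (Minimum Description Length) method.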
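The two filters can be tried on iris as follows. This is a sketch that relies on the filters' default settings (with Weka's defaults, Normalize rescales each numeric column to the range [0, 1]); the variable names are illustrative.

```r
library(RWeka)

# Unsupervised: no class variable on the left-hand side of the formula.
iris_norm <- Normalize(~ ., data = iris)
range(iris_norm$Sepal.Length)

# Supervised: the class variable (Species) guides where the cut points go.
iris_disc <- Discretize(Species ~ ., data = iris)
table(iris_disc$Petal.Width)   # numeric column replaced by interval labels
```

Both functions return a data frame, so the result can be passed directly to any of the modelling functions listed below.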

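3.3 Classification and regression

IBk(): k-nearest-neighbor classification.

LBR(): Lazy Bayesian Rules, a lazy-learning variant of naive Bayes classification.

J48(): the C4.5 decision tree algorithm (when choosing a split, the tree evaluates each attribute independently of the others).

LMT(): logistic model trees, which combine a tree structure with logistic regression. Each leaf is a logistic regression model, and accuracy is generally better than that of a decision tree or a logistic regression used alone.

M5P(): the M5 model tree algorithm, which combines a tree structure with linear regression. Each leaf is a linear regression model, so it can be used for regression on continuous targets.

DecisionStump(): a one-level decision tree, often used as the base learner for boosting.

SMO(): support vector machine classification.

AdaBoostM1(): the AdaBoost M1 method. The -W option selects the weak learner.

Bagging(): builds multiple models from samples drawn from the original data with replacement.

LogitBoost(): boosting in which the weak learners are regression models that output real values, fitted in a logistic regression framework.

MultiBoostAB(): a refinement of AdaBoost that can be viewed as a combination of AdaBoost and "wagging".

Stacking(): an ensemble method that combines different base classifiers.

LinearRegression(): fits a linear regression model.

Logistic(): fits a logistic regression model.

JRip(): a rule learner (Weka's implementation of RIPPER).

M5Rules(): generates decision rules for regression problems using the M5 method.

OneR(): the simple 1-R classifier.

PART(): generates PART decision rules.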
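To illustrate how the -W option of a meta-learner is set from R, here is a minimal sketch that boosts decision stumps on iris. The variable name `boosted` is illustrative, and accuracy on the training data is shown only to confirm that the model runs.

```r
library(RWeka)

# AdaBoost M1 with DecisionStump as the weak learner (Weka option -W).
boosted <- AdaBoostM1(Species ~ ., data = iris,
                      control = Weka_control(W = "DecisionStump"))

# Predictions on the training data, cross-tabulated against the true labels.
pred <- predict(boosted)
table(predicted = pred, actual = iris$Species)
```

The same pattern, a `control = Weka_control(...)` argument carrying Weka's command-line options, applies to the other classifiers in this list.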

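3.4 Clustering

Cobweb(): a model-based method that assumes a model for each cluster and finds the data that best fit it. It is not suited to clustering large databases.

FarthestFirst(): a fast, approximate variant of k-means clustering.

SimpleKMeans(): k-means clustering.

XMeans(): an improved k-means that determines the number of clusters automatically.

DBScan(): a density-based clustering method that grows clusters according to the density around each object. It can find clusters of arbitrary shape in spatial databases that contain noise, and it defines a cluster as a set of "density-connected" points.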
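A short sketch of the clustering interface, using SimpleKMeans on the four numeric iris columns. The choice of three clusters (option N) is an assumption made because iris has three species; the species column is used only afterwards, to inspect the result.

```r
library(RWeka)

# Cluster the numeric columns only; Species (column 5) is left out.
km <- SimpleKMeans(iris[, -5], Weka_control(N = 3))

# Cluster assignments for the training data, compared with the true species.
clusters <- predict(km)
table(cluster = clusters, species = iris$Species)
```

Unlike the classifiers, the clustering functions take a data frame rather than a formula as their first argument.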

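3.5 Association rules

Apriori(): Apriori is the most influential foundational algorithm in association rule mining. It is a breadth-first algorithm that scans the database repeatedly to find the frequent itemsets, that is, those whose support exceeds the minimum support. It rests on two monotonicity properties of frequent itemsets: every subset of a frequent itemset is frequent, and every superset of an infrequent itemset is infrequent. On very large datasets, Apriori is expensive in both time and memory.

Tertius(): the Tertius algorithm.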
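The monotonicity property behind Apriori can be checked by hand in base R. The sketch below uses a small made-up set of transactions (hypothetical data, not from this article) and a helper function `support()` defined here for illustration; it does not call Weka.

```r
# Toy transactions (hypothetical data for illustration).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter")
)

# Support = fraction of transactions that contain every item of the itemset.
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

s_bread      <- support("bread", transactions)              # 3 of 4 transactions
s_bread_milk <- support(c("bread", "milk"), transactions)   # 2 of 4 transactions

# A superset can never have higher support than any of its subsets.
s_bread_milk <= s_bread
```

This is why Apriori can discard every superset of an itemset as soon as that itemset falls below the minimum support.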

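3.6 Prediction and evaluation

predict(): predicts the class or cluster of new data from a fitted classifier or clusterer.

table(): cross-tabulates two factors, for example predicted against actual labels.

evaluate_Weka_classifier(): evaluates model performance, reporting statistics such as TP Rate, FP Rate, Precision, Recall, and F-Measure.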
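The three functions fit together as in the following sketch, which holds out part of iris as a test set. The 100/50 split and the seed are arbitrary choices for illustration, so the exact counts will depend on them.

```r
library(RWeka)

set.seed(42)                              # arbitrary seed, for reproducibility
train_idx <- sample(nrow(iris), 100)      # 100 rows for training, 50 for testing
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

fit  <- J48(Species ~ ., data = train)
pred <- predict(fit, newdata = test)

# Confusion table and overall accuracy on the held-out rows.
confusion <- table(predicted = pred, actual = test$Species)
confusion
sum(diag(confusion)) / sum(confusion)

# The same evaluation done by Weka, with per-class statistics.
evaluate_Weka_classifier(fit, newdata = test, class = TRUE)
```

Evaluating on rows the model has not seen gives a more honest estimate than scoring the training data.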

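4 Case study (decision tree)

4.1 Loading the dataset

In this section we use the iris dataset from the datasets package. We start with a quick look at it.

data(iris)   # load the iris dataset
head(iris)   # show the first six rows of iris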

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(iris)    # show the structure of iris

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

dim(iris)    # show the dimensions of iris

## [1] 150   5

summary(iris)  # show basic summary statistics for iris

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

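The output above gives the basic facts about iris. It contains 150 samples with 4 features each. The label has three classes, each with 50 samples, so the classes are perfectly balanced. The summary also shows the minimum, first quartile, median, mean, third quartile, and maximum of every feature.

In the output, setosa, versicolor, and virginica are the three iris species. The dataset records four measurements for each flower: sepal length, sepal width, petal length, and petal width.

4.2 Building the model

iris_j48 <- J48(Species ~ ., data = iris)   # fit a C4.5 decision tree to the iris dataset
iris_j48   # print the fitted tree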

## J48 pruned tree
## ------------------
## 
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7
## |   |   Petal.Length <= 4.9: versicolor (48.0/1.0)
## |   |   Petal.Length > 4.9
## |   |   |   Petal.Width <= 1.5: virginica (3.0)
## |   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
## 
## Number of Leaves  :  5
## 
## Size of the tree :   9

4.3 Model summary

summary(iris_j48)  # summary of the decision tree model

## 
## === Summary ===
## 
## Correctly Classified Instances         147               98      %
## Incorrectly Classified Instances         3                2      %
## Kappa statistic                          0.97  
## Mean absolute error                      0.0233
## Root mean squared error                  0.108 
## Relative absolute error                  5.2482 %
## Root relative squared error             22.9089 %
## Total Number of Instances              150     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  50  0  0 |  a = setosa
##   0 49  1 |  b = versicolor
##   0  2 48 |  c = virginica

4.4 Visualizing the model

library(partykit)  # load the partykit package, used for plotting

## Warning: package 'partykit' was built under R version 3.4.4

## Loading required package: grid

## Loading required package: libcoin

## Warning: package 'libcoin' was built under R version 3.4.4

## Loading required package: mvtnorm

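plot(iris_j48)     # plot the decision tree

4.5 Cross-validation

So far the model has been trained and evaluated on the whole dataset. In practice we would want to cross-validate it.

eval_j48 <- evaluate_Weka_classifier(iris_j48, numFolds = 10, complexity = FALSE, seed = 1, class = TRUE)  # 10-fold cross-validation of the tree trained on all of iris
eval_j48       # print the cross-validation results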

## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances         144               96      %
## Incorrectly Classified Instances         6                4      %
## Kappa statistic                          0.94  
## Mean absolute error                      0.035 
## Root mean squared error                  0.1586
## Relative absolute error                  7.8705 %
## Root relative squared error             33.6353 %
## Total Number of Instances              150     
## 
## === Detailed Accuracy By Class ===
## 
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     setosa
##                  0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     versicolor
##                  0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     virginica
## Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   0 47  3 |  b = versicolor
##   0  2 48 |  c = virginica

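Under cross-validation the accuracy drops to 96% (144 of 150), below the 98% measured on the training data, so the earlier figure was somewhat optimistic.

4.6 Understanding Weka_control

So far we have classified iris with the default settings, but RWeka exposes many more options. The WOW() function lists them.

WOW("J48")  # list the options J48 accepts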

## -U      Use unpruned tree.
## -O      Do not collapse tree.
## -C <pruning confidence>
##         Set confidence threshold for pruning.  (default 0.25)
##  Number of arguments: 1.
## -M <minimum number of instances>
##         Set minimum number of instances per leaf.  (default 2)
##  Number of arguments: 1.
## -R      Use reduced error pruning.
## -N <number of folds>
##         Set number of folds for reduced error pruning. One fold is
##         used as pruning set.  (default 3)
##  Number of arguments: 1.
## -B      Use binary splits only.
## -S      Do not perform subtree raising.
## -L      Do not clean up after the tree has been built.
## -A      Laplace smoothing for predicted probabilities.
## -J      Do not use MDL correction for info gain on numeric
##         attributes.
## -Q <seed>
##         Seed for random data shuffling (default 1).
##  Number of arguments: 1.
## -doNotMakeSplitPointActualValue
##         Do not make split point actual value.
## -output-debug-info
##         If set, classifier is run in debug mode and may output
##         additional info to the console
## -do-not-check-capabilities
##         If set, classifier capabilities are not checked before
##         classifier is built (use with caution).
## -num-decimal-places
##         The number of decimal places for the output of numbers in
##         the model (default 2).
##  Number of arguments: 1.
## -batch-size
##         The desired batch size for batch prediction (default 100).
##  Number of arguments: 1.
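Each option listed by WOW() maps to a named argument of Weka_control(), with the leading dash dropped. As a sketch, the call below turns on reduced error pruning (-R) and raises the minimum number of instances per leaf to 5 (-M 5); these particular values are chosen only for illustration.

```r
library(RWeka)

# -R: use reduced error pruning; -M 5: at least 5 instances per leaf.
iris_j48_rep <- J48(Species ~ ., data = iris,
                    control = Weka_control(R = TRUE, M = 5))
iris_j48_rep

# Compare training-set predictions with the true labels.
pred_rep <- predict(iris_j48_rep)
table(predicted = pred_rep, actual = iris$Species)
```

Flags without an argument (such as -R or -U) are passed as `TRUE`, while options that take a value are passed as that value.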

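4.7 Building a cost-sensitive decision tree classifier

Suppose that misclassifying versicolor is particularly harmful and you want the model to reflect this. All you need is a different classifier, Weka's "Cost-sensitive classifier":

csc <- CostSensitiveClassifier(Species ~ ., data = iris,
                               control = Weka_control(
                                 `cost-matrix` = matrix(c(0, 10, 0, 0, 0, 0, 0, 10, 0), ncol = 3),  # cost of 10 in row 2 (versicolor)
                                 W = "weka.classifiers.trees.J48",  # base classifier
                                 M = TRUE))                         # minimize expected misclassification cost

You have to tell the cost-sensitive classifier which cost matrix to use. The call above uses a matrix with the layout shown below, with a cost of 10 in place of each 1:

matrix(c(0, 1, 0, 0, 0, 0, 0, 1, 0), ncol = 3)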

##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    1    0    1
## [3,]    0    0    0
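Because matrix() fills column by column, it is easy to misread which cells carry the cost. The base R sketch below rebuilds the cost matrix used in the CostSensitiveClassifier call with class names attached. It assumes Weka's convention that rows are the actual class and columns the predicted class; the dimnames are added here for readability and are not part of the original call.

```r
classes <- levels(iris$Species)

# Same values as in the CostSensitiveClassifier call, filled column-wise.
cost <- matrix(c(0, 10, 0, 0, 0, 0, 0, 10, 0), ncol = 3,
               dimnames = list(actual = classes, predicted = classes))
cost

# Only the versicolor row is non-zero: predicting setosa or virginica
# for a true versicolor costs 10, every other error costs 0.
which(cost != 0, arr.ind = TRUE)
```

With every other error priced at zero, the classifier has no incentive to get setosa or virginica right, which is worth keeping in mind when reading the evaluation that follows.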

We evaluate the model again:

eval_csc <- evaluate_Weka_classifier(csc, numFolds = 10, complexity = FALSE, seed = 1, class = TRUE)
eval_csc

## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances          98               65.3333 %
## Incorrectly Classified Instances        52               34.6667 %
## Kappa statistic                          0.48  
## Mean absolute error                      0.2311
## Root mean squared error                  0.4807
## Relative absolute error                 52      %
## Root relative squared error            101.9804 %
## Total Number of Instances              150     
## 
## === Detailed Accuracy By Class ===
## 
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.070    0.875      0.980    0.925      0.887    0.955     0.864     setosa
##                  0.980    0.450    0.521      0.980    0.681      0.517    0.765     0.518     versicolor
##                  0.000    0.000    ?          0.000    ?          ?        0.500     0.333     virginica
## Weighted Avg.    0.653    0.173    ?          0.653    ?          ?        0.740     0.572     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   1 49  0 |  b = versicolor
##   6 44  0 |  c = virginica

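Versicolor is now predicted better: only one versicolor sample is misclassified, compared with three under the plain J48 model. This comes at the expense of virginica, however. In the output above, none of the 50 virginica samples is classified correctly (6 are labelled setosa and 44 versicolor, against 2 errors before), and overall accuracy falls to 65.3%.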
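The per-class figures can be recomputed directly from the confusion matrix printed above. The base R sketch below copies those counts by hand (rows are the actual class, columns the predicted class) and derives recall and overall accuracy from them.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the cost-sensitive confusion matrix above.
cm <- matrix(c(49,  1, 0,
                1, 49, 0,
                6, 44, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

recall   <- diag(cm) / rowSums(cm)    # per-class TP rate
accuracy <- sum(diag(cm)) / sum(cm)   # overall fraction correct

recall
accuracy
```

The recall values match the TP Rate column of the Weka report (0.98, 0.98, 0.00), and the accuracy matches the 65.3333 % in its summary.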

