Weka is a collection of machine learning algorithms that can either be applied directly to a dataset or called from your own Java code. There is an article called “Use WEKA in your Java code” which, as its title suggests, explains how to use Weka from your Java code. That is not a surprising thing to do, since Weka is implemented in Java. As the title of this post suggests, I will describe how to use Weka from your Python code instead.
If you have built an entire software system in Python, you might be reluctant to look at libraries in other languages. After all, there are a huge number of excellent Python libraries, and many good machine-learning libraries written in Python, or in C and C++ with Python bindings. However, as far as I am concerned, it would be a pity not to make use of Weka just because it is written in Java. It is one of the best-known machine-learning libraries around, with a large number of implemented algorithms. What’s more, there are very few data stream mining libraries around, and MOA, which is related to Weka and also written in Java, is the best I have seen.
I use JPype (http://jpype.sourceforge.net/) to access the Weka class libraries. Once you have it installed, download the latest Weka and MOA versions and copy moa.jar, sizeofag.jar and weka.jar into your working directory. Below you can see the full Python listing of the test application. The code initializes the JVM, imports some Weka packages and classes, reads a data set, standardizes and shuffles it, splits it into a training set and a test set, trains a J48 tree classifier and then tests it. If you are familiar with Weka, this will all be very easy.
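Before the JVM can load any Weka class, it has to be told where the jars are. Here is a minimal sketch of how the startup options could be assembled. The helper name build_jvm_options and its parameters are my own, hypothetical and not part of JPype; the sketch only assumes the three jar files sit in the working directory as described above.

```python
import os


def build_jvm_options(jars, max_heap="4G", agent_jar=None):
    """Return a list of JVM option strings for the given jar files.

    Hypothetical helper for illustration; it is not a JPype function.
    """
    options = ["-Xmx" + max_heap]
    # All jars go into one class path entry, joined by the platform
    # separator (":" on Linux and macOS, ";" on Windows).
    options.append("-Djava.class.path=" + os.pathsep.join(jars))
    if agent_jar is not None:
        # Java agents are passed with -javaagent:, not as a -D property.
        options.append("-javaagent:" + agent_jar)
    return options


options = build_jvm_options(["./moa.jar", "./weka.jar"], agent_jar="sizeofag.jar")
print(options)
```

The resulting list can be unpacked into startJVM(getDefaultJVMPath(), *options). Keeping all jars in a single class path option matters because repeating -Djava.class.path simply overwrites the earlier value.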
In a separate post, I will explore how easy it is to use MOA in the same way.
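# Initialize the JVM
import os
from jpype import *

options = [
    "-Xmx4G",
    # One class path entry holding both jars; a second
    # -Djava.class.path option would override the first
    "-Djava.class.path=" + os.pathsep.join(["./moa.jar", "./weka.jar"]),
    # Java agents are passed with -javaagent:, not as a -D system property
    "-javaagent:sizeofag.jar",
]
startJVM(getDefaultJVMPath(), *options)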
# Import java/weka packages and classes
Trees = JPackage("weka.classifiers.trees")
Filter = JClass("weka.filters.Filter")
Attribute = JPackage("weka.filters.unsupervised.attribute")
Instance = JPackage("weka.filters.unsupervised.instance")
RemovePercentage = JClass("weka.filters.unsupervised.instance.RemovePercentage")
Remove = JClass("weka.filters.unsupervised.attribute.Remove")
Classifier = JClass("weka.classifiers.Classifier")
NaiveBayes = JClass("weka.classifiers.bayes.NaiveBayes")
Evaluation = JClass("weka.classifiers.Evaluation")
FilteredClassifier = JClass("weka.classifiers.meta.FilteredClassifier")
Instances = JClass("weka.core.Instances")
BufferedReader = JClass("java.io.BufferedReader")
FileReader = JClass("java.io.FileReader")
Random = JClass("java.util.Random")
# Reading from an ARFF file
reader = BufferedReader(FileReader("./iris.arff"))
data = Instances(reader)
reader.close()
data.setClassIndex(data.numAttributes() - 1) # setting class attribute
# Standardizes all numeric attributes in the given dataset to have zero mean and unit variance, apart from the class attribute.
standardizeFilter = Attribute.Standardize()
standardizeFilter.setInputFormat(data)
data = Filter.useFilter(data, standardizeFilter)
# Randomly shuffles the order of instances passed through it.
randomizeFilter = Instance.Randomize()
randomizeFilter.setInputFormat(data)
data = Filter.useFilter(data, randomizeFilter)
# Creating train set: drop 30% of the instances, keep the remaining 70%
removeFilter = RemovePercentage()
removeFilter.setPercentage(30.0)
removeFilter.setInvertSelection(False)
removeFilter.setInputFormat(data)  # set the input format after all options
trainData = Filter.useFilter(data, removeFilter)
# Creating test set: invert the selection to keep only the 30% dropped above
removeFilter.setPercentage(30.0)
removeFilter.setInvertSelection(True)
removeFilter.setInputFormat(data)
testData = Filter.useFilter(data, removeFilter)
# Create classifier
j48 = Trees.J48()
j48.setUnpruned(True) # using an unpruned J48
j48.buildClassifier(trainData)
print("Number Training Data", trainData.numInstances(), data.numInstances())
print("Number Test Data", testData.numInstances())
# Test classifier
for i in range(testData.numInstances()):
    instance = testData.instance(i)
    pred = j48.classifyInstance(instance)
    print(
        "ID:", instance.value(0),  # first attribute value, used as a rough identifier
        "actual:", testData.classAttribute().value(int(instance.classValue())),
        "predicted:", testData.classAttribute().value(int(pred)),
    )
shutdownJVM()
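The test loop in the listing prints one line per instance, which gets hard to read on anything bigger than iris. Here is a minimal sketch of how the same calls could be tallied into an accuracy figure and a confusion count. The function name summarize_predictions is my own and hypothetical; it only relies on the Weka methods already used in the listing (numInstances, instance, classValue, classAttribute and classifyInstance).

```python
from collections import Counter


def summarize_predictions(classifier, test_data):
    """Classify every test instance and tally the results.

    Returns (accuracy, confusion), where confusion maps
    (actual_label, predicted_label) pairs to counts.
    Hypothetical helper for illustration; it is not part of Weka or JPype.
    """
    class_attribute = test_data.classAttribute()
    confusion = Counter()
    total = test_data.numInstances()
    for i in range(total):
        instance = test_data.instance(i)
        # Weka returns class values as doubles that index the class labels.
        actual = class_attribute.value(int(instance.classValue()))
        predicted = class_attribute.value(int(classifier.classifyInstance(instance)))
        confusion[(str(actual), str(predicted))] += 1
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    accuracy = correct / float(total) if total else 0.0
    return accuracy, confusion
```

It would be called as summarize_predictions(j48, testData), and that call has to happen before shutdownJVM(), since the Weka objects are no longer usable once the JVM is gone. Weka also ships its own Evaluation class, which the listing imports but does not use; the sketch stays in plain Python so the result is easy to work with on the Python side.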