Twitter Mood Predicts Stock Market Movement ~ Software Ideas

Tuesday, October 23, 2012

Twitter Mood Predicts Stock Market Movement

In this post, I try to predict the daily up and down movement of stock prices using Twitter mood data and machine learning algorithms. Some time ago, I read a paper called “Twitter mood predicts the stock market.” Its authors claimed to predict stock market movement with an accuracy of over 86%. They used 9,853,498 tweets posted by 2.7 million English-speaking users in 2008 and showed that general Twitter mood could be used to predict the DJIA.

My initial intention was to reproduce their results. However, their method would have required access to large-scale historical Twitter data, which is not free and probably not cheap. Instead, I found two companies that publish daily mood sentiment for individual stocks, with historical data going back a couple of months. The anonymized dataset I used for IBM and AGN is available for download. I anonymized the dataset for two reasons: I don’t know whether it is legal to share this data, and I might decide to use this method for my own financial gain. The results are impressive!

For the first company that publishes mood data, I combine the mood data for the previous 2, 5, and 8 days to predict whether a stock price goes up, goes down, or stays flat. The accuracy is around 75% for AGN and 80% for IBM using 8 days of mood data. For the second company that publishes mood data, the results were very impressive: around 90%-100% accuracy using just the previous day's mood data for all the stocks I tested. The results are summarized below as confusion matrices.

I used a decision tree and 5-fold cross-validation for all tests.
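As a rough sketch of the setup described above, the snippet below builds one row per trading day from the previous days' mood counts and then splits the rows into five folds. It is not the code used for this post: the function names (`label_move`, `build_rows`, `k_fold_indices`), the sample numbers, and the choice to shuffle before splitting are all assumptions made for illustration. The feature order (positive count, then negative count, oldest day first) follows the column description given in the comments.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[List[float], str]


def label_move(prev_close: float, close: float) -> str:
    """Label a day 'up', 'down' or 'flat' relative to the previous close."""
    if close > prev_close:
        return "up"
    if close < prev_close:
        return "down"
    return "flat"


def build_rows(pos: Sequence[float], neg: Sequence[float],
               close: Sequence[float], lag: int) -> List[Row]:
    """Build one (features, label) row per day t.

    Features are the positive and negative mention counts for days
    t-lag .. t-1, oldest first; the label is the price move on day t.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    rows: List[Row] = []
    for t in range(lag, len(close)):
        features: List[float] = []
        for d in range(t - lag, t):
            features += [float(pos[d]), float(neg[d])]
        rows.append((features, label_move(close[t - 1], close[t])))
    return rows


def k_fold_indices(n: int, k: int = 5, seed: int = 0):
    """Shuffle row indices and split them into k (train, test) pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


# Illustrative inputs only: the closing prices and most counts are made up.
pos = [18, 25, 20, 30]
neg = [5, 4, 6, 2]
close = [100.0, 101.0, 102.5, 101.0]

rows = build_rows(pos, neg, close, lag=2)
for features, label in rows:
    print(features, label)
```

Each (train, test) pair returned by `k_fold_indices` would then be used to fit a decision tree on the training rows and score it on the held-out rows; any decision tree implementation can be plugged in at that point.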

Company (1)

$AGN Confusion Matrix with 2 days' worth of mood data

	down	flat	up	total
down	55.1 %	0.0 %	44.9 %	205
flat	33.3 %	0.0 %	66.7 %	3
up	35.9 %	0.0 %	64.1 %	234
total	198	0	244	442

Note: columns represent predictions; rows represent true classes.
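To make the layout concrete, here is a small sketch of how a row-normalized confusion matrix like the one above can be computed from lists of true and predicted labels. The function names are hypothetical, and the example counts are back-calculated from the percentages and totals in the 2-day AGN table, so they are an inference rather than the original data.

```python
from collections import Counter

LABELS = ["down", "flat", "up"]


def confusion_counts(true_labels, predicted_labels, labels=LABELS):
    """Count matrix: rows are true classes, columns are predictions."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in labels] for t in labels]


def row_percentages(matrix):
    """Convert each row to percentages of its total (None for empty rows)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * c / total if total else None for c in row])
    return result


def accuracy(matrix):
    """Share of all instances that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# (true, predicted) pairs; counts inferred from the 2-day AGN table.
pairs = (
    [("down", "down")] * 113 + [("down", "up")] * 92
    + [("flat", "down")] * 1 + [("flat", "up")] * 2
    + [("up", "down")] * 84 + [("up", "up")] * 150
)
true_labels = [t for t, _ in pairs]
predicted_labels = [p for _, p in pairs]

counts = confusion_counts(true_labels, predicted_labels)
percentages = row_percentages(counts)
for label, row in zip(LABELS, percentages):
    print(label, [None if v is None else round(v, 1) for v in row])
print("accuracy:", round(accuracy(counts), 3))
```

Returning `None` for a class with no true instances corresponds to the N/A row in the IBM table, where no flat days occur.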

$AGN Confusion Matrix with 5 days' worth of mood data

	down	flat	up	total
down	71.7 %	0.0 %	28.3 %	205
flat	66.7 %	0.0 %	33.3 %	3
up	23.7 %	0.0 %	76.3 %	232
total	204	0	236	440

Note: columns represent predictions; rows represent true classes.

$AGN Confusion Matrix with 8 days' worth of mood data

	down	flat	up	total
down	74.9 %	0.0 %	25.1 %	203
flat	0.0 %	0.0 %	100.0 %	3
up	23.5 %	0.4 %	76.1 %	230
total	206	1	229	436

Note: columns represent predictions; rows represent true classes.

$IBM Confusion Matrix with 8 days' worth of mood data

	down	flat	up	total
down	82.8 %	0.0 %	17.2 %	215
flat	N/A	N/A	N/A	0
up	20.1 %	0.0 %	79.9 %	249
total	228	0	236	464

Note: columns represent predictions; rows represent true classes.

Company (2)

$AGN Confusion Matrix with 1 day's worth of mood data

	down	flat	up	total
down	100.0 %	0.0 %	0.0 %	460
flat	0.0 %	100.0 %	0.0 %	46
up	0.0 %	0.0 %	100.0 %	436
total	460	46	436	942

Note: columns represent predictions; rows represent true classes.


8 comments:

julien, 21 January 2013 at 22:19
Interesting,
Would you mind detailing your approach? Did you use a neural net as they did? Would you privately share the names of the companies providing the data?
Julien

julien, 21 January 2013 at 22:20
I read too fast, you said you used a tree, sorry.

Anonymous, 5 April 2013 at 06:45
Could you please explain what each column in the mood dataset means, specifically s0, s1, s2, etc.? I am trying to understand the dataset in terms of its columns/fields, but little is said about them.

Thanks a lot
--
Mamoun

Dimitri, 1 May 2013 at 01:00
For example, using mood data (company 1) for the previous 2 days:

s0	s1	s2	s3	adj_close
c	c	c	c	d
				class
18.0	5.0	25.0	4.0	up

The "s" attributes in the dataset refer to the number of positive or negative mentions in social media:

s0 = 18: positive mentions 2 days before closing price
s1 = 5: negative mentions 2 days before closing price
s2 = 25: positive mentions 1 day before closing price
s3 = 4: negative mentions 1 day before closing price

Anonymous, 2 June 2013 at 00:25
Dear Dimitri,
Thank you for your explanation of the attributes. There is only one question remaining: what does each data point (row) represent? How did you get, for example, more than 400 data points (rows) in the dataset when you use a 9-day history?

Mamoun

Dimitri, 3 June 2013 at 02:30
Each row represents the data for a particular day of the year. The data includes the closing stock price movement for that day and the sentiment data for the last x days. Each day has a date associated with it (even though it is not shown in the dataset). I could have added an extra column to the dataset to show the date. For example,

date	s0	s1	s2	s3	adj_close
d	c	c	c	c	d
					class
01/03/2010	18.0	5.0	25.0	4.0	up
02/03/2010	17.0	3.0	22.0	3.0	up
03/03/2010	19.0	5.0	29.0	4.0	down
04/03/2010	13.0	4.0	25.0	4.0	up
....
....

Anonymous, 30 January 2014 at 16:50
Dear Dimitri,

Thanks a lot for this introduction to using Weka from Python. Do you know whether it can fully create classifiers and nested classifiers using methods like weka.core.Utils.splitOptions? It supports a command like:
weka.classifiers.meta.MultiScheme -X 0 -S 1 -B "weka.classifiers.rules.ZeroR " -B "weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 20 -W weka.classifiers.trees.DecisionStump" -B "weka.classifiers.trees.RandomForest -I 200 -K 30 -S 1 -num-slots 8" -B "weka.classifiers.meta.CostSensitiveClassifier -cost-matrix \"[0.0 1.0; 10.0 0.0]\" -S 1 -W weka.classifiers.trees.RandomForest -- -I 200 -K 0 -S 1 -num-slots 8" -B "weka.classifiers.rules.JRip -F 3 -N 3.0 -O 2 -S 1"

Thank you,
Xavier