After installing real-time Linux on both of my Ubuntu laptops, my goal was to get a feel for how well latency peaks are eliminated compared to the standard Linux kernel. I was specifically interested in network port latencies. Before looking at the network-specific latencies, I experimented with the internal worst-case interrupt latency of the kernel. The worst-case latency differs for each hardware device: interrupts from devices connected directly to the CPU (e.g. the local APIC) have lower latencies than interrupts from devices connected to the CPU through a PCI bus. The interrupt latency of the APIC timer can be measured with cyclictest and should provide a lower bound; interrupts generated by other devices, including the network card, will most likely exceed this value. The goal of running an RT kernel is to make the response time more consistent, even under load, so I used hackbench to load the CPUs. You can see the effect on each processor by running htop:
hackbench -l 10000
htop
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.1%]
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||98.7%]
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
5 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%]
6 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.5%]
7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||99.4%]
8 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
Mem[||||||||||||||||||||||||||||||||||||||||| 1036/7905MB]
Swp[ 0/16210MB]
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
2843 dimitri 20 0 545M 75140 22716 S 4.0 0.9 1:21.71 /usr/bin/python /usr/bin/deluge-gtk
20586 dimitri 20 0 29500 2272 1344 R 3.0 0.0 0:00.91 htop
20896 root 20 0 6332 116 0 S 3.0 0.0 0:00.19 hackbench -l 10000
20884 root 20 0 6332 116 0 S 3.0 0.0 0:00.19 hackbench -l 10000
20969 root 20 0 6332 116 0 S 3.0 0.0 0:00.17 hackbench -l 10000
20885 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20895 root 20 0 6332 116 0 R 2.0 0.0 0:00.19 hackbench -l 10000
20883 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20891 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20682 root 20 0 6332 112 0 S 2.0 0.0 0:00.21 hackbench -l 10000
20715 root 20 0 6332 112 0 D 2.0 0.0 0:00.19 hackbench -l 10000
20887 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20911 root 20 0 6332 116 0 D 2.0 0.0 0:00.18 hackbench -l 10000
20880 root 20 0 6332 116 0 D 2.0 0.0 0:00.18 hackbench -l 10000
20881 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20882 root 20 0 6332 116 0 S 2.0 0.0 0:00.18 hackbench -l 10000
20888 root 20 0 6332 116 0 R 2.0 0.0 0:00.19 hackbench -l 10000
20889 root 20 0 6332 116 0 R 2.0 0.0 0:00.19 hackbench -l 10000
20890 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20892 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20894 root 20 0 6332 116 0 S 2.0 0.0 0:00.18 hackbench -l 10000
20897 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20898 root 20 0 6332 116 0 S 2.0 0.0 0:00.19 hackbench -l 10000
20912 root 20 0 6332 116 0 R 2.0 0.0 0:00.18 hackbench -l 10000
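For context, hackbench stresses the scheduler by forking groups of sender and receiver processes that pass small messages to each other over socketpairs (or pipes). A stripped-down sketch of that idea (not the actual hackbench source) looks roughly like this:

/* Sketch of the hackbench idea: pairs of processes ping-ponging small
 * messages over a socketpair, keeping the scheduler busy. Illustration
 * only, not the real hackbench source. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

#define PAIRS 8          /* number of sender/receiver pairs */
#define LOOPS 10000      /* messages per pair, like "hackbench -l 10000" */

int main(void)
{
    for (int i = 0; i < PAIRS; i++) {
        int sv[2];
        char buf[64] = "x";

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            exit(1);
        }
        if (fork() == 0) {               /* child: echo everything back */
            for (int n = 0; n < LOOPS; n++) {
                if (read(sv[1], buf, sizeof(buf)) <= 0) break;
                write(sv[1], buf, strlen(buf));
            }
            _exit(0);
        }
        if (fork() == 0) {               /* child: send and wait for echo */
            for (int n = 0; n < LOOPS; n++) {
                write(sv[0], buf, strlen(buf));
                if (read(sv[0], buf, sizeof(buf)) <= 0) break;
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)               /* reap all children */
        ;
    return 0;
}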
Hackbench ran all eight CPUs at near 100% and also caused lots of rescheduling interrupts. The scheduler tries to spread processor activity across as many cores as possible; when it decides to offload work from one core to another, a rescheduling interrupt occurs. I also attempted to increase other device interrupts by running the Deluge BitTorrent client and an RTSP/RTP internet radio stream, which generated both sound and wifi (ath9k) interrupts. Below, you can see a snapshot of the interrupt count for each device. The wifi card (ath9k) is on IRQ 17 and eth0 is on IRQ 56.
watch -n 1 cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 144 0 0 0 0 0 0 0 IO-APIC-edge timer
1: 11 0 0 0 0 0 0 0 IO-APIC-edge i8042
8: 1 0 0 0 0 0 0 0 IO-APIC-edge rtc0
9: 399 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
12: 181 0 0 0 0 0 0 0 IO-APIC-edge i8042
16: 114 0 221 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, mei
17: 238921 0 0 0 0 0 0 0 IO-APIC-fasteoi ath9k
23: 113 0 10894 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb2
40: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
41: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
42: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
43: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
44: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
45: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
46: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
47: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
48: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
49: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
50: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
51: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
52: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
53: 32389 0 0 0 0 0 0 0 PCI-MSI-edge ahci
54: 195410 0 0 0 0 0 0 0 PCI-MSI-edge i915
55: 273 6 0 0 0 0 0 0 PCI-MSI-edge hda_intel
56: 2 0 0 0 0 0 0 0 PCI-MSI-edge eth0
NMI: 0 0 0 0 0 0 0 0 Non-maskable interrupts
LOC: 2592013 3074746 2470454 2448349 2525416 2454510 2440296 2424395 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 0 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 0 0 0 0 0 IRQ work interrupts
RES: 357199 449954 390871 399211 536214 606334 493824 554138 Rescheduling interrupts
CAL: 300 467 500 505 480 484 477 476 Function call interrupts
TLB: 2876 647 582 632 1079 663 432 485 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 0 0 0 0 Machine check exceptions
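Instead of eyeballing the watch output, the growth rate of a single counter can also be computed by diffing two successive reads of /proc/interrupts. A rough sketch of such a helper (the IRQ line prefix, e.g. "17:" for ath9k, is passed as an argument):

/* Rough sketch: print how many interrupts per second a given IRQ line in
 * /proc/interrupts accumulates, by summing its per-CPU columns and diffing
 * two reads taken one second apart. Usage: ./irqrate "17:" */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static unsigned long irq_total(const char *prefix)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[1024];
    unsigned long total = 0;

    if (!f) { perror("fopen"); exit(1); }
    while (fgets(line, sizeof(line), f)) {
        char *p = line;
        while (*p == ' ') p++;                /* skip leading spaces */
        if (strncmp(p, prefix, strlen(prefix)) != 0)
            continue;
        p += strlen(prefix);
        /* sum the numeric per-CPU columns that follow the "NN:" prefix */
        while (1) {
            char *end;
            unsigned long v = strtoul(p, &end, 10);
            if (end == p) break;
            total += v;
            p = end;
        }
        break;
    }
    fclose(f);
    return total;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <irq-line-prefix, e.g. \"17:\">\n", argv[0]);
        return 1;
    }
    for (;;) {
        unsigned long before = irq_total(argv[1]);
        sleep(1);
        printf("%lu interrupts/s\n", irq_total(argv[1]) - before);
    }
}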
After loading the system, I ran cyclictest at a very high real-time priority of 99.
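For reference, the core of what cyclictest measures can be sketched as follows: a SCHED_FIFO thread asks to sleep until an absolute deadline and then checks how late it actually woke up. This is only a simplified sketch of that idea, not the cyclictest source:

/* Simplified sketch of the cyclictest idea: a SCHED_FIFO thread sleeps
 * until an absolute deadline and measures how late it wakes up. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>
#include <sys/mman.h>

#define NSEC_PER_SEC 1000000000L
#define INTERVAL_NS  1000000L      /* 1000 us wake-up interval */

int main(void)
{
    struct sched_param sp = { .sched_priority = 99 };
    struct timespec next, now;
    long max_us = 0;

    /* run at the highest real-time priority and avoid page faults */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) { perror("sched_setscheduler"); exit(1); }
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)      { perror("mlockall"); exit(1); }

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 100000; i++) {
        /* advance the absolute deadline by one interval */
        next.tv_nsec += INTERVAL_NS;
        while (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);

        /* latency = how far past the deadline we actually woke up */
        long lat_us = (now.tv_sec - next.tv_sec) * 1000000L
                    + (now.tv_nsec - next.tv_nsec) / 1000L;
        if (lat_us > max_us)
            max_us = lat_us;
    }
    printf("Max wake-up latency: %ld us\n", max_us);
    return 0;
}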
On the Preempt-RT Linux kernel:
sudo cyclictest -a 0 -t -n -p99
T: 0 ( 3005) P:99 I:1000 C: 140173 Min: 1 Act: 2 Avg: 9 Max: 173
T: 1 ( 3006) P:98 I:1500 C: 93449 Min: 1 Act: 4 Avg: 16 Max: 172
T: 2 ( 3007) P:97 I:2000 C: 70087 Min: 1 Act: 4 Avg: 17 Max: 182
T: 3 ( 3008) P:96 I:2500 C: 56069 Min: 2 Act: 13 Avg: 17 Max: 166
T: 4 ( 3009) P:95 I:3000 C: 46725 Min: 2 Act: 3 Avg: 17 Max: 174
T: 5 ( 3010) P:94 I:3500 C: 40050 Min: 2 Act: 10 Avg: 15 Max: 163
T: 6 ( 3011) P:93 I:4000 C: 35044 Min: 2 Act: 4 Avg: 20 Max: 169
T: 7 ( 3012) P:92 I:4500 C: 31150 Min: 2 Act: 13 Avg: 22 Max: 164
On a standard Linux kernel:
sudo cyclictest -a 0 -t -n -p99
T: 0 ( 4264) P:99 I:1000 C: 76400 Min: 3 Act: 5 Avg: 10 Max: 6079
T: 1 ( 4265) P:98 I:1500 C: 50934 Min: 2 Act: 6 Avg: 13 Max: 15501
T: 2 ( 4266) P:97 I:2000 C: 38201 Min: 3 Act: 6 Avg: 6 Max: 4685
T: 3 ( 4267) P:96 I:2500 C: 30561 Min: 3 Act: 5 Avg: 6 Max: 1735
T: 4 ( 4268) P:95 I:3000 C: 25467 Min: 3 Act: 5 Avg: 6 Max: 1288
T: 5 ( 4269) P:94 I:3500 C: 21829 Min: 3 Act: 7 Avg: 8 Max: 13301
T: 6 ( 4270) P:93 I:4000 C: 19101 Min: 3 Act: 6 Avg: 6 Max: 2192
T: 7 ( 4271) P:92 I:4500 C: 16978 Min: 4 Act: 5 Avg: 6 Max: 85
The maximum latency for the standard Linux kernel is as high as 15501 microseconds and depends on load, while the maximum timer latency for the Preempt-RT kernel stays between 150 and 185 microseconds irrespective of load. The average latency, however, is better on the standard Linux kernel. This is to be expected: the main goal of the real-time kernel is determinism, and average performance may suffer for it. I then connected my two laptops directly via a cross-over network cable and used a modified version of the ZeroMQ performance tests to measure the round-trip latency on both the real-time and the standard kernel. Both the sender and receiver test applications were run at a real-time priority of 85.
Sends 50000 packets (1 byte) and measures round-trip time:
sudo chrt -f 85 ./local_lat tcp://eth0:5555 1 50000
Receives packets and returns them to sender:
sudo chrt -f 85 ./remote_lat tcp://192.168.2.24:5555 1 50000
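The ZeroMQ test programs hide the socket details, but the measurement itself is simple: timestamp, send one byte, wait for it to come back, timestamp again. A bare-bones sketch of the same round-trip measurement over a plain TCP socket (the address and port mirror the setup above; the echo side would simply read a byte and write it back):

/* Bare-bones round-trip latency sketch over plain TCP (client side).
 * The address 192.168.2.24:5555 mirrors the setup above but is only
 * an example. Not the ZeroMQ test source. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(5555) };
    int one = 1, sock = socket(AF_INET, SOCK_STREAM, 0);
    char byte = 'x';
    long max_us = 0;

    inet_pton(AF_INET, "192.168.2.24", &addr.sin_addr);
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        exit(1);
    }
    /* send each byte immediately instead of letting Nagle batch it */
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    for (int i = 0; i < 50000; i++) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(sock, &byte, 1) != 1 || read(sock, &byte, 1) != 1)
            break;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long us = (t1.tv_sec - t0.tv_sec) * 1000000L
                + (t1.tv_nsec - t0.tv_nsec) / 1000L;
        if (us > max_us)
            max_us = us;
    }
    printf("Max round-trip latency: %ld us\n", max_us);
    close(sock);
    return 0;
}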
In the graphs below, you can see that the real-time kernel's maximum round-trip packet latencies never exceed 900 microseconds, even under high load. The standard kernel, however, suffered several peaks, some as high as 3500 microseconds.
[Figure: Preempt-RT Round-trip Latency]
[Figure: Standard Linux Kernel Round-trip Latency]
I then used the ku-latency application to measure the amount of time it takes the Linux kernel to hand a received network packet off to user space. The real-time kernel never exceeds 50 microseconds, with an average of around 20 microseconds. The standard kernel, on the other hand, suffered some extreme peaks.
[Figure: Preempt-RT Receive Latency]
[Figure: Standard Linux Receive Latency]
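The idea behind this kind of receive-latency measurement can be reproduced with standard socket options: ask the kernel to timestamp each incoming packet with SO_TIMESTAMP, then compare that timestamp with the wall-clock time at which recvmsg() hands the packet to user space. The following is only a sketch of that general technique, not the ku-latency source (port 5555 is an arbitrary example):

/* Sketch of measuring kernel-to-user-space receive latency on a UDP port:
 * the kernel timestamps each packet on arrival (SO_TIMESTAMP) and we compare
 * that with the wall-clock time when recvmsg() hands it to us. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_addr.s_addr = htonl(INADDR_ANY),
                                .sin_port = htons(5555) };
    int on = 1, sock = socket(AF_INET, SOCK_DGRAM, 0);

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }
    /* ask the kernel to attach a receive timestamp to every packet */
    setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));

    for (;;) {
        char data[2048], ctrl[512];
        struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };

        if (recvmsg(sock, &msg, 0) < 0) { perror("recvmsg"); break; }

        struct timeval user_tv, *kern_tv = NULL;
        gettimeofday(&user_tv, NULL);              /* time we saw the packet */

        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMP)
                kern_tv = (struct timeval *)CMSG_DATA(c);  /* kernel receive time */

        if (kern_tv) {
            long us = (user_tv.tv_sec - kern_tv->tv_sec) * 1000000L
                    + (user_tv.tv_usec - kern_tv->tv_usec);
            printf("kernel-to-user latency: %ld us\n", us);
        }
    }
    return 0;
}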
For these experiments, I did not take into account CPU affinity. Different results could be achieved with CPU shielding -- something I might leave for another blog post.
References:
- Myths and Realities of Real-Time Linux Software Systems
- Red Hat Enterprise MRG 1.3 Realtime Tuning Guide
- Best practices for tuning system latency
- https://github.com/koppi/renoise-refcards/wiki/HOWTO-fine-tune-realtime-audio-settings-on-Ubuntu-11.10
- http://sickbits.networklabs.org/configuring-a-network-monitoring-system-sensor-w-pf_ring-on-ubuntu-server-11-04-part-1-interface-configuration/
- http://vilimpoc.org/research/ku-latency/