Resolved
Data corruption. I've been investigating this one and it is very hard to reproduce, but I see that there is no need to keep saving the data structure after every page unload. We have the extension running in the background to do all of that work. As of this update, the script injected into the page sends its data to the extension page running in the background, which does the training and classification.
Separated some of the operations. Content scripts now do the data extraction and send the data to the extension. The extension hosts the Naive Bayes object, which holds what it has learned about what you want!
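To make that split concrete, here is a rough sketch of how the two halves could talk to each other. It is not the extension's actual code: the message shape and the function names `buildPageMessage`, `sendPageData` and `registerBackground` are made up for illustration, and `runtime` is assumed to be something shaped like `chrome.runtime` (with `sendMessage` and `onMessage.addListener`), passed in as a parameter so the logic can be tried outside the browser.

```javascript
// Hypothetical message shape; the extension's real format may differ.
function buildPageMessage(linkTexts) {
  const titles = linkTexts
    .map((text) => text.trim())
    .filter((text) => text.length > 0);
  return { type: "page-data", titles };
}

// Content-script side: extract the data, then hand it to the extension.
// In a real extension `runtime` would be chrome.runtime (an assumption here).
function sendPageData(runtime, linkTexts, onReply) {
  runtime.sendMessage(buildPageMessage(linkTexts), onReply);
}

// Background side: the long-lived page owns the state, so nothing
// has to be saved when an individual page unloads.
function registerBackground(runtime, state) {
  runtime.onMessage.addListener((message, sender, sendResponse) => {
    if (message.type !== "page-data") return; // ignore unrelated messages
    state.pages.push(message.titles);
    sendResponse({ received: message.titles.length });
  });
}
```

In the extension itself, the content script would call something like `sendPageData(chrome.runtime, titles, callback)` and the background page would call `registerBackground(chrome.runtime, state)` once at startup.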
Partial Online Learning. Each click trains the classifier, but clicking 'more' to go to the next page will completely train the classifier for that page. Basically, whatever you click is treated as something you are interested in; when you go to the next page, everything else on that page is treated as not interesting to you.
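Here is a minimal sketch of that click / 'more' flow, assuming a simple count-based Naive Bayes with add-one smoothing. The class, the label names `interesting` and `boring`, the tokenizer and the `page` object are all hypothetical and only meant to show the idea, not how the extension actually stores things.

```javascript
// Minimal count-based Naive Bayes sketch (hypothetical names throughout).
class NaiveBayes {
  constructor() {
    this.wordCounts = { interesting: new Map(), boring: new Map() };
    this.wordTotals = { interesting: 0, boring: 0 }; // word occurrences per label
    this.docCounts = { interesting: 0, boring: 0 };  // titles seen per label
    this.vocabulary = new Set();
  }

  // Online update: one title only touches the counts of its own words.
  train(words, label) {
    this.docCounts[label] += 1;
    const counts = this.wordCounts[label];
    for (const word of words) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.wordTotals[label] += 1;
      this.vocabulary.add(word);
    }
  }

  // Log-probability score with add-one (Laplace) smoothing.
  score(words, label) {
    const totalDocs = this.docCounts.interesting + this.docCounts.boring;
    // The prior is smoothed too, so an unseen label never gives log(0).
    let logProb = Math.log((this.docCounts[label] + 1) / (totalDocs + 2));
    // The extra +1 reserves one slot for words never seen in training.
    const denom = this.wordTotals[label] + this.vocabulary.size + 1;
    for (const word of words) {
      const count = this.wordCounts[label].get(word) || 0;
      logProb += Math.log((count + 1) / denom);
    }
    return logProb;
  }

  classify(words) {
    return this.score(words, "interesting") >= this.score(words, "boring")
      ? "interesting"
      : "boring";
  }
}

function tokenize(title) {
  return title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// A click trains right away: the clicked title counts as "interesting".
function onStoryClicked(classifier, page, title) {
  classifier.train(tokenize(title), "interesting");
  page.clicked.add(title);
}

// Clicking 'more' finishes the page: every unclicked title counts as "boring".
function onMoreClicked(classifier, page) {
  for (const title of page.titles) {
    if (!page.clicked.has(title)) {
      classifier.train(tokenize(title), "boring");
    }
  }
}
```

A quick run through one page:

```javascript
const classifier = new NaiveBayes();
const page = {
  titles: ["Rust compiler internals", "Celebrity gossip roundup", "Rust async explained"],
  clicked: new Set(),
};
onStoryClicked(classifier, page, "Rust compiler internals");
onStoryClicked(classifier, page, "Rust async explained");
onMoreClicked(classifier, page); // "Celebrity gossip roundup" becomes a negative example
console.log(classifier.classify(tokenize("New Rust release"))); // interesting
```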
Bugfixes. Some people have said they are not getting recommendations and it happened to me as well, but it appears that has been fixed. Linux users were not getting recommendations due to case-sensitivity, unfortunately.
Current Issues
- Fixed Weights. These are the numbers I sort of fudged, haha! My fix is to add three options: Casual Reader (one page), Average Reader (a few pages), and Addict (many pages). These options will be presented at first run and in the options page.
- Feature Elimination. One way to do it is to store time stamps of the words so as to eliminate words that you don't care about.
- Bugfixes? I'll find 'em.
- Explore the possibility of online learning and other statistical algorithms.
- Explore the possibility of ranking down recommendations.
- Figure out how to use notifications more effectively.
- Explore the idea of using recency. Basically, words with the least recency will be shaved off the feature set. (I believe this will help with shifting interests.) Using probabilities instead would require recalculating the probabilities for all words, so that is a bad idea.
- Look into other implementations to see what works and what doesn't.
- Suggested by Roger Braun: Use the content of linked sites to further help. viewtext.org could possibly help.
- There are words that don't add information, so I'll have to look into finding those words. (Removing outlier words may address this, since words like 'a', 'this' and 'blah' don't really help and add further noise.)
- Switch from localStorage to IndexedDB due to space limitations, but that should not be a problem right now.
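The timestamp and recency items in the list above could look roughly like this. It is only a sketch under my own assumptions: the feature store is a plain `Map` from word to `{ count, lastSeen }`, and `touchWords`, `pruneStaleWords` and the 30-day cutoff are invented for the example.

```javascript
// Hypothetical feature store: word -> { count, lastSeen },
// where lastSeen is a timestamp in milliseconds.
function touchWords(features, words, now) {
  for (const word of words) {
    const entry = features.get(word) || { count: 0, lastSeen: now };
    entry.count += 1;
    entry.lastSeen = now; // every sighting refreshes the timestamp
    features.set(word, entry);
  }
}

// Drop every word that has not been seen within maxAgeMs of `now`.
// Returns the removed words so the caller can log or inspect them.
function pruneStaleWords(features, now, maxAgeMs) {
  const removed = [];
  for (const [word, entry] of features) {
    if (now - entry.lastSeen > maxAgeMs) {
      features.delete(word); // deleting during Map iteration is safe
      removed.push(word);
    }
  }
  return removed;
}

const DAY = 24 * 60 * 60 * 1000;
const features = new Map();
touchWords(features, ["flash", "rust"], 0);
touchWords(features, ["rust", "wasm"], 40 * DAY);
console.log(pruneStaleWords(features, 45 * DAY, 30 * DAY)); // removes "flash" only
```

Pruning this way only compares timestamps, so nothing has to be recalculated for the words that stay, which is the reason for preferring recency over probabilities.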
I'm really glad about the response this extension got! There's a chance I'll post it to the Chrome Web Store, and there's potential in making a Firefox version. If you have any other issues or suggestions, please feel free to contact me.