Intro
The HN News Recommender grew out of my interest in recommendation systems, mostly content-based ones. I would really like to know more about what the industry is doing, but I thought I'd start by figuring out how simple classification works, since I'm not learning much about it in college. College does help with the theoretical parts, though, and with explaining why you believe something is reasonable.
Anyway, where did the idea come from? I've always thought that putting recommendations in a sidebar is a crude thing to do when other content is already being displayed (I like StackOverflow, because recommendations live on their own page). Why isn't content being classified on the server as it is being pumped out? If you assume that people won't look at many posts, and you absolutely believe there is content well beyond five pages that is worth looking at, then I say it's fine to classify/recommend that content. On a site like Hacker News, where popularity dictates what goes to the top, not all of it is interesting to me, so figuring out what I'd like based on past articles seemed like the best idea. [Keep in mind this is an idea I thought of last year, but didn't bother with until this year.] I figured the articles that I like, or at least the articles that I click, would have interesting words in the title, not necessarily interesting content (I leave that up to the popularity feature of the site). Naive Bayes is something I learned about from all over the internet, especially for spam/not-spam classification. Simply replace spam/not-spam with interesting/not interesting and there you go. The labeling idea? That one was on the fly; I hadn't thought about how to label the dataset at the time (short-sighted, I know), so I figured I'd try that as well.
Now that you know the background, let's go through the concepts of the current implementation.
Naive Bayes
Naive Bayes is on Wikipedia. You might want to read that, but basically we assume our language model is built on unigrams (single words), where the unigrams are the features of the sentence, and we use Naive Bayes' simplifying assumption that every unigram is independent of the others. This makes the chain rule of probability with Bayes' rule really easy, because all we have to do is compute the independent probabilities and take their product. If you are in school or have learned about maximum likelihood, then you know you can use log-likelihoods instead, since log is a one-to-one transformation and all the probabilities we compute are between 0 and 1. ZERO? You might say the log of that blows up, and you'd be right! That's why I use Laplace smoothing (Ai-Class) to help with that problem. For those of you who say the classes are not what you expect, it's probably because you hate math, which is fine, because I have a love-hate relationship with it. The hating part comes from not understanding the things I want to.
Forgive the digression. Anyway, we take the maximum of the log-likelihoods. You might ask what we are maximizing exactly. Good question: for each title we pick whichever class (interesting/not interesting) gives that title's words the highest log-likelihood. This is my basic explanation of the core classifier behind this extension.
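To make that concrete, here is a minimal sketch of a unigram Naive Bayes classifier with Laplace smoothing and log-likelihoods. It is not the extension's actual code (the extension is JavaScript, since it runs in Chrome); the class names, counts, and example titles are just illustrative.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Unigram Naive Bayes with Laplace (add-one) smoothing over title words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.class_totals = Counter()            # class -> total words seen
        self.class_docs = Counter()              # class -> number of titles
        self.vocab = set()

    def train(self, title, label):
        words = title.lower().split()
        self.word_counts[label].update(words)
        self.class_totals[label] += len(words)
        self.class_docs[label] += 1
        self.vocab.update(words)

    def _log_likelihood(self, words, label):
        # log prior: fraction of titles seen with this label
        score = math.log(self.class_docs[label] / sum(self.class_docs.values()))
        v = len(self.vocab)
        for w in words:
            # Laplace smoothing: add 1 to every count so log never sees zero
            count = self.word_counts[label][w] + 1
            score += math.log(count / (self.class_totals[label] + v))
        return score

    def classify(self, title):
        words = title.lower().split()
        # pick the class whose log-likelihood is largest
        return max(self.class_docs, key=lambda label: self._log_likelihood(words, label))

# toy usage with made-up titles
nb = NaiveBayes()
nb.train("show hn a tiny lisp interpreter", "interesting")
nb.train("celebrity gossip roundup", "not interesting")
print(nb.classify("a lisp interpreter written over a weekend"))
```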
Here's a pretty good article. If you care more for theory, go read The Elements of Statistical Learning. There are enough articles on Naive Bayes that I don't need to keep repeating what's already out there. If you don't understand any of those, I honestly suggest reading, and doing the exercises of, an intro statistics book.
Feature Elimination
I figured that very common words and words that rarely appear are irrelevant and should be discarded (though penalizing them rather than dropping them might be better; that depends on what you are doing). I certainly didn't want to maintain a hand-built list of words to remove, so I decided that HN has its own world of words. I use z-scores from the normal distribution to measure how far a word's count is from the mean. What are the mean and variance computed from? The average and variance of the word counts within that class. I believe this is a reasonable statistic given enough data. The words that matter sit close enough to the average word count, whereas words like "a" and "the", or words that barely appear at all, end up far from it and are probably irrelevant.
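A rough sketch of that filtering idea follows. The cutoff of two standard deviations is just a number picked for illustration, not necessarily what the extension uses.

```python
import statistics

def filter_by_zscore(word_counts, max_abs_z=2.0):
    """Keep words whose count is within max_abs_z standard deviations
    of the mean word count for the class."""
    counts = list(word_counts.values())
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts) or 1.0  # guard against a zero divisor
    kept = {}
    for word, count in word_counts.items():
        z = (count - mean) / stdev
        if abs(z) <= max_abs_z:
            kept[word] = count
    return kept
```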
Weights
Weights are more of a penalty than anything else. If you go to the options page of the recommender, you can choose what kind of a reader you are. If you don't read very often, there shouldn't be much of a penalty on the not-interested class, because whatever you do click is something that really captured your attention. For a heavy reader, the not-interested class gets a bigger penalty, because what you didn't look at probably doesn't matter as much as it would for a casual reader. The actual values? Sort of fudged on my part, since I don't have an evaluation metric for this recommender. I think the empirical results matter more, and if people like what they are getting, that's good enough for me.
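Something along these lines is what I mean, with made-up weight values (the real numbers were fudged, as I said) and a guess at how the penalty could be applied on top of the log-likelihoods from earlier:

```python
# Hypothetical per-class penalty factors keyed by reader type.
# Log-likelihoods are negative, so a factor above 1 drags that class's
# score further down, i.e. penalizes it.
READER_WEIGHTS = {
    "casual":  {"interesting": 1.0, "not interesting": 1.0},   # barely penalize
    "average": {"interesting": 1.0, "not interesting": 1.25},
    "heavy":   {"interesting": 1.0, "not interesting": 1.5},   # penalize heavily
}

def weighted_class(log_likelihoods, reader_type):
    """Scale each class's log-likelihood by its penalty factor, then take the argmax."""
    weights = READER_WEIGHTS[reader_type]
    return max(log_likelihoods, key=lambda label: log_likelihoods[label] * weights[label])
```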
Summary
I left out word tokenization/normalization and so on because those are just details. I think the take-aways are: Chrome extensions are easy as heck, do build your ideas, the Naive Bayes classifier can be made amazing (even though it's the idiot's Bayes classifier), do find some interest in mathematical statistics and use it, and code your intuitions, because the results can be amazing too.
My interest in recommendations grew a lot from ml-class. Taking that and Ai-class really helped. Taking those classes during the Christmas break was great, since taking them during the school year would have been hell (which is sort of why I hate the fact that the NLP and PGM classes started now). Also, using the Win key + S shortcut to clip lecture slides from the videos into my OneNote notebooks helps as well, because I can refer to them whenever I want. Keeping your notes in SkyDrive is amazing too, as it syncs from my Lenovo convertible tablet to my desktop. I'm half-heartedly taking nlp-class and pgm-class since I already have classes. I'll do the homework/quizzes for those during the summer, but listening to the lectures and taking notes is all I'm doing right now.
Future Steps
Rebrand the recommender and add Reddit. Probably create a Firefox extension, if people want that. My main browser is Chrome, so... yeah.
Leverage stop word lists to completely ignore blatantly irrelevant words.
Use Porter's stemming algorithm (nlp-class).
Look into Levenshtein distance (nlp-class) for figuring out word similarities, in an attempt to cut down on sentences/words that say the same thing in different ways (a small sketch of the distance computation follows below).
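For reference, the Levenshtein distance is just the classic dynamic-programming edit distance; a minimal version looks like this (this is future work, so none of it is in the extension yet):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert, delete, substitute)
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```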