Of course, the users' actual voting history would have been much better data, but this is all I could get...
Calculation
The trust values were calculated as follows:
- For each story, the submitter gets +1 from every commentor. I know this is naive, but because there is no voting history for the story, I had to go with this assumption.
- For each comment on that story, its author gets ( votes_on_that_comment / total_votes_on_all_comments ) from all other commentors. (Again, I know this is naive.)
- Trust values are added up for (object, subject) pairs across all stories.
- The votes on the story were recorded as votes from a virtual user 'HNCROWD' for the submitter. After adding up, the trust value from HNCROWD for a user reflects the 'Karma' of the user on the website.
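The rules above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the script that produced the CSV: the input layout (a list of dicts with "submitter", "votes" and "comments"), the (truster, trusted) key order, skipping self-trust, and skipping stories whose comment votes do not sum to a positive number are all guesses.

```python
from collections import defaultdict

def trust_values(stories):
    """Accumulate trust per (truster, trusted) pair across all stories.

    Assumed (hypothetical) input: each story is a dict with keys
    'submitter', 'votes', and 'comments' (a list of (author, votes)).
    """
    trust = defaultdict(float)
    for story in stories:
        submitter = story["submitter"]
        comments = story["comments"]
        commentors = {author for author, _ in comments}

        # Story votes are recorded as coming from the virtual user HNCROWD.
        trust[("HNCROWD", submitter)] += story["votes"]

        # Every commentor gives the submitter +1 (self-trust skipped: assumption).
        for commentor in commentors:
            if commentor != submitter:
                trust[(commentor, submitter)] += 1.0

        # Each comment earns its author a vote share from every other commentor.
        total = sum(votes for _, votes in comments)
        if total <= 0:
            continue  # assumption: no usable vote share for this story
        for author, votes in comments:
            share = votes / total
            for other in commentors:
                if other != author:
                    trust[(other, author)] += share
    return dict(trust)
```

For a single story submitted by "alice" with 10 votes, where "bob" has a comment with 3 votes and "carol" one with 1 vote, bob receives 3/4 from carol and carol receives 1/4 from bob, while alice gets +1 from each of them and 10 from HNCROWD.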
The resulting file is downloadable in CSV format here (http://www.sendspace.com/file/mw59f7).
So with these values I tried running some experiments:
1. Clustering:
An interesting experiment would be to see if there are clusters of users among commentors. I used the Markov Clustering Algorithm (http://micans.org/mcl/) for clustering graphs, as it does not need the number of clusters as initial input.
Unsurprisingly enough, most of the Hacker News community belongs to a single cluster. This makes sense as Hacker News is quite a focussed community interested in practical hacking related to the web, entrepreneurship and startups.
Another explanation is that users who comment are themselves quite interested in the stories and the community, and are hence closely connected and similar. Users who are disappointed with the website might not be commenting at all... Again, using voting statistics would have been better.
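For readers unfamiliar with Markov Clustering, here is a toy pure-Python sketch of its core loop (expansion, then inflation, repeated until the matrix stops changing). It is not the mcl program linked above, which is what the experiment actually used. The function names, the default inflation of 2.0 and the dense list-of-lists matrix are my own choices for illustration, and it would be far too slow for the real dataset.

```python
def normalize_columns(m):
    """Scale each column of a square matrix in place so that it sums to 1."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m

def mcl(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Toy Markov Clustering on a dense, non-negative weight matrix."""
    n = len(weights)
    # Add self-loops, then turn the weights into column-stochastic form.
    m = [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two steps of a random walk).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power and renormalize the columns.
        inflated = normalize_columns([[x ** inflation for x in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < eps:
            break
    # Each row that still has mass is an attractor; its non-zero columns
    # are the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

On a graph made of two disconnected triangles, this returns the two triangles as separate clusters. On a densely connected graph like the Hacker News comment graph, most nodes end up under the same attractors, which matches the single large cluster described above.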
2. Trust-Rank:
Second, I tried applying a variation of the TrustRank algorithm ( http://www.vldb.org/conf/2004/RS15P3.PDF ) to the trust values data.
The result here was also unsurprising. The ordering of users was very similar to the ordering by Karma on the Hacker News website.
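As a rough idea of what such a computation looks like, here is a minimal sketch of the basic TrustRank iteration from the paper, t = alpha * T * t + (1 - alpha) * d, where d puts all its mass on a set of trusted seed users. It is not the exact variation I ran. The function name, the damping factor of 0.85, the fixed iteration count, the choice of seeds and the requirement that weights be non-negative are all assumptions made for this example.

```python
from collections import defaultdict

def trust_rank(edges, seeds, alpha=0.85, iterations=50):
    """Biased PageRank over weighted trust edges.

    edges: {(truster, trusted): non-negative weight}
    seeds: users assumed trustworthy up front (hypothetical choice)
    """
    seeds = set(seeds)
    nodes = {user for pair in edges for user in pair} | seeds

    # Total outgoing weight per truster, used to normalize their edges.
    out_total = defaultdict(float)
    for (src, _), weight in edges.items():
        out_total[src] += weight

    # Seed distribution d: uniform over the seeds, zero elsewhere.
    d = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    score = dict(d)

    for _ in range(iterations):
        # Teleport back to the seeds with probability (1 - alpha).
        nxt = {u: (1.0 - alpha) * d[u] for u in nodes}
        # Otherwise pass trust along outgoing edges, in proportion to weight.
        for (src, dst), weight in edges.items():
            if out_total[src] > 0:
                nxt[dst] += alpha * score[src] * weight / out_total[src]
        score = nxt
    return score
```

If the virtual user 'HNCROWD' is used as the only seed (again, an assumption for this sketch), trust flows first to users with many story votes and from there along the comment-based edges, so it is not surprising that the final ordering tracks Karma closely.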
Further work:
1. The method of calculating trust values (based on comments) is very basic and needs to be improved (for example, by taking into account threads and opposing opinions).
2. I want to see if this information is actually useful in tasks like News-Story Recommendation.
Conclusions:
Without voting data, it is hard to say if users on a focussed site like Hacker News have diverging interests. I am sure that, as the community grows, people with different interests are bound to join. But the whole idea of a democratic voting site only allows stories that are interesting to the most active users to be selected. And so other users will find the website boring, stop contributing and maybe leave. This might be an example of a community maintaining itself...
Giving highly trusted users down-modding power will strengthen this emergent behaviour, and the community will become more focussed (towards these users) than it is now. This might be good or bad, depending on whether you are in this majority...
P.S. Thanks to Xirium for sharing the dataset.