so one of my current projects is drudgington post. you might be able to imagine what it’s like from the name: it combines huffington post with drudge report. kinda. it’s a place to compare the two sites. since they are the flagship news sites of their respective ends of the political spectrum, it’s interesting to see the differences. so i’m saving off their headlines and seeing what i can do to compare them. this post is about word clouds.
this weekend i made some decent progress. i added word clouds built from the headlines of each site. i used an existing php function i found online and made a bunch of modifications, such as how it limits the words it uses and how it calculates sizes. i also made it compare the words between sites. so if a word is unique to its site (not a major word in the other site) then it’s colored (blue or red). common words are left as black (or gray). one thing i noticed was that the word “obama” is used so much that it dwarfs all the other words. i had to essentially have the function ignore the word so it wouldn’t skew all the results. that worked out pretty well. better at least. i am still in the process of creating the word block list. it’ll probably get built along the way as i see useless words pop up. i don’t want to prematurely apply a predefined list. i’d rather see the words in context to decide whether to remove them or not.
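to make the steps above concrete, here’s a rough sketch of the idea in python. this is not the php function the site uses, just an illustration. the names (`BLOCKED`, `top_words`, `font_sizes`, `colors`), the pixel sizes, the word limit and the linear scaling are all assumptions made up for the example.

```python
import re
from collections import Counter

# hypothetical block list; the real one is still being built by hand
BLOCKED = {"obama", "the", "a", "to", "of", "in", "for", "and", "on"}


def top_words(headlines, limit=50):
    """count words across headlines, skip blocked ones, keep the top `limit`."""
    counts = Counter()
    for headline in headlines:
        for word in re.findall(r"[a-z']+", headline.lower()):
            if word not in BLOCKED:
                counts[word] += 1
    return dict(counts.most_common(limit))


def font_sizes(counts, min_px=12, max_px=48):
    """scale each count linearly between min_px and max_px (an assumed formula)."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when every count is equal
    return {
        word: round(min_px + (count - lo) * (max_px - min_px) / span)
        for word, count in counts.items()
    }


def colors(own, other, unique_color):
    """color a word only if it is missing from the other site's top words."""
    return {
        word: (unique_color if word not in other else "black")
        for word in own
    }
```

you’d call `top_words` once per site, then pass both results to `colors` with whatever color belongs to that site. keeping the block list as a plain set makes it easy to add a word whenever a useless one shows up in the cloud.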
some issues. right now it looks like some words sometimes don’t make it into the word cloud. like the word “obama” will occasionally disappear. i have to look into that. another thing i would like to look into is combining words that are the same but in a different form, like combining “vote” and “votes”. and what about popular two-word combinations? like “white” and “house”. can i make it know it’s “white house”?
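one possible way to approach the last two ideas, sketched in python. nothing here is implemented on the site; the function names, the “strip a trailing s” rule and the `min_count` threshold are all assumptions. the plural rule is deliberately naive and will mangle words like “news”, so a real stemmer would probably be a better fit.

```python
import re
from collections import Counter


def normalize(word):
    """very naive plural folding: drop one trailing 's' (not real stemming)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokens(headline):
    """lowercase a headline, split it into words, fold simple plurals."""
    return [normalize(w) for w in re.findall(r"[a-z']+", headline.lower())]


def common_pairs(headlines, min_count=2):
    """count adjacent word pairs, keep those seen at least `min_count` times."""
    pairs = Counter()
    for headline in headlines:
        words = tokens(headline)
        pairs.update(zip(words, words[1:]))
    return {" ".join(pair): n for pair, n in pairs.items() if n >= min_count}
```

pairs that come back from `common_pairs` could then be treated as single entries in the cloud, so “white house” shows up once instead of as two separate words.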
well that’s it for word clouds. i’ll make a different post describing the general background of drudgington post soon.