October 22, 2007

Remove Spam from Google Blog Search

Even though Google Blog Search doesn't have many interesting features, I still use it more often than Technorati because it's faster, it's not down for hours, it's much more comprehensive and it has features not available in any other important blog search engine. I still use Technorati for finding backlinks, because Google does a poor job in this area (compare Technorati with Google Blog Search). Unfortunately, Google Blog Search indexes a lot of spam posts that steal content and use it to make money.

Google has two features that reduce the number of splogs (spam blogs) in search results. As in web search, there's a duplicate filter that removes some of the posts that are almost identical. But it doesn't exclude all of them, and it doesn't catch posts that duplicate articles from news sites like Business Week.


The second feature is the option to sort results by relevancy, which is enabled by default. It may seem counterintuitive to sort blog search results by relevancy and not chronologically, but that's a great way to filter splogs or at least move them to the bottom of Google's search results. Google uses a lot of signals to rank blog posts, including PageRank, the number of feed subscriptions and the amount of duplicate content. But if you sort the results by relevancy, you'll get a mix of recent and old posts, which isn't always what you want. A better way is to keep the relevancy sort and restrict the results to a recent period of time in the sidebar (the last day or the last hour, depending on the volume of posts).


If you see a "References" link after the snippet, that's an indication that Google found (a significant number of) backlinks, so the result should be a little more reliable.

Many blogs use Google Alerts to pollute the web and make money, so you could also add [-"google alert"] to your query (a search for "google alert" returns more than 200,000 results). A lot of spam blogs are hosted by Google's Blog*Spot, so removing the posts from blogspot.com could increase the quality of your results, but it would also remove non-spammy blogs like this one or Google's official blogs. I also noticed that many spam blogs use the .info TLD. A recent study showed that, when searching for commercial keywords, 75% of the results from blogspot.com and 68% of the results from .info sites are spam.

It's also a great idea to restrict the results to English (or another language) in "Advanced blog search".

So here's a summary:

1. sort the results by relevancy
2. restrict the results to a recent period (last day)
3. restrict the results to English (or another language)
4. if you really have to sort the results by date, remove the posts that follow a spammy pattern (for example, add -"google alert" -site:blogspot.com -site:.info to your query), but make sure you don't remove important results
5. check the posts that contain "References"
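
The negative query from step 4 can be assembled programmatically. Here is a minimal Python sketch; the `build_query` helper and the sample search terms are hypothetical, and it only builds the query string using the `-"phrase"` and `-site:` operators mentioned above. It doesn't call any Google API.

```python
def build_query(terms, exclude_phrases=(), exclude_sites=()):
    """Append negative operators to a search query string.

    Hypothetical helper: it only formats text, it does not run a search.
    """
    parts = [terms]
    # -"phrase" drops results that contain the exact phrase
    parts += [f'-"{phrase}"' for phrase in exclude_phrases]
    # -site:domain drops results hosted on that domain or TLD
    parts += [f"-site:{site}" for site in exclude_sites]
    return " ".join(parts)


query = build_query(
    "web 2.0",  # sample search terms, chosen only for illustration
    exclude_phrases=["google alert"],
    exclude_sites=["blogspot.com", ".info"],
)
print(query)  # web 2.0 -"google alert" -site:blogspot.com -site:.info
```

Keeping the exclusions in separate lists makes it easy to drop one (for example blogspot.com) when it removes too many legitimate results.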

Google should do a better job of detecting spam in Blog Search results and of identifying results from sites that happen to have feeds but aren't blogs. It should also make it more difficult for spammers to use sites like Blogger or Google Alerts to pollute the search results.

10 comments:

  1. Ok but what are your views on this query :
    "remove spam from google blog search" -"google alert" -site:blogspot.com -site:.info

    No results are found :-(

  2. Obviously. There's a single result for ["remove spam from google blog search"]: this post, which happens to be on a blog from blogspot.com. I suggest negative queries that remove a lot of spam, but that doesn't mean they remove only spam.

    These are only useful if you sort the results by date: sorting them by relevancy filters most of the spam.

  3. i think the irony is that your example for filtering spam out of search results filters your own blog out too... that would seem to make it a bad example...

  4. Just because this blog is not spammy doesn't change a simple fact: a lot of Blogspot blogs are spammy. Here's a recent example.

  5. Unfortunately, for some weeks now Google Blog Search has been kind of broken, at least when it comes to finding backlinks. I had to remove the "Comments elsewhere" feature on Google Blogoscoped for this reason, though in theory it might make a good, lazy alternative to trackbacks.

  6. Spam and the lack of meta-tags in the blogsearch are exactly the reason why we started our own feed archive and web service. And you can see it as a "visual feed reader" too! Here are the top sites:

    http://www.gadgetfriends.net
    http://www.gossipfriends.net
    http://www.stylefriends.net
    http://www.web20friends.net

  7. @ionut alex chitu:
    "Just because this blog is not spammy doesn't change a simple fact: a lot of Blogspot blogs are spammy."

    let's turn that logic around - a lot of spam uses the internet, so let's filter out that...

    on the other hand, if the argument is just about proportions then i would point to email spam where the vast majority of email traffic is spam and yet we still use email... the major difference is that we've got good tools for combating email spam but not much for combating splogs... blocking the entire blogspot domain would be a little like blocking the entire hotmail domain for email - it does what it's supposed to but the cost (missing out on a lot of great legitimate content) is too high for it to be an effective control overall...

  8. I've just released my own blog search engine with less splogs. Check http://blogoat.com thanks

  9. Dude your engine is call blo goat. thats not the best name for anything that is supposed to work well or not sound disgusting.

  10. I encourage everyone to use the "Flag" button to flag spammy blogspot blogs.....

    If the spammer has removed the navbar from their blogspot page, you can also fill in the appropriate URL(s) over at this simple form, just a simple Copy/paste :)

    Though, I don't know how long it takes google to remove these blogs....


    Most of these splogs are generated by hijacked (Storm Worm) systems that got infected and happen to use Windows "Address Book". A common payload of viruses is to send itself to every address within the address book, and most blogspot users put their blogspot post2 address in their address book....
