The first is recognizing spam in the form of spam blogs (splogs) and spam comments.
Two forms of spam are common in blogs: spam blogs (also known as splogs), where the entire blog and its hosted posts are machine generated, and spam comments, where authentic posts feature machine-generated comments.
Results reported in the rest of this section are based on a seed data set of 700 positive (splogs) and 700 negative (authentic blog) labeled examples containing the entire HTML content of each blog home page.
For instance, text like "comments-off" (comments are usually turned off in splogs), "new-york" (a high-paying advertising term), and "in-uncategorized" (spammers do not bother to specify categories for blog posts) are features common to splogs, whereas text like "2-comments," "1-comment," "i-have," and "to-my" are some features common to authentic blogs.
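The hyphenated features quoted above look like pairs of adjacent words, so one plausible reading is that they are word bigrams taken from the page text. The following is a minimal sketch under that assumption, not the original feature extractor; the function name `bigram_features` and the tokenization rule (lowercase runs of letters and digits) are illustrative choices that do not come from the source.

```python
import re
from collections import Counter


def bigram_features(text):
    """Count hyphen-joined word bigrams such as "comments-off".

    Assumption: tokens are lowercase runs of letters and digits; the
    original work may have tokenized the page differently.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Pair each token with its successor and join the pair with a hyphen.
    return Counter(f"{a}-{b}" for a, b in zip(tokens, tokens[1:]))


splog_like = bigram_features("Posted in Uncategorized. Comments Off")
authentic_like = bigram_features("I have 2 comments on this post")
```

Here `splog_like` contains the keys "in-uncategorized" and "comments-off", while `authentic_like` contains "i-have" and "2-comments". Counts of this kind could then be passed to any standard text classifier.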
We have investigated the use of link distributions to see if splogs can be identified once they place themselves on the blog (web) hyperlink graph.
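The text does not define how link distributions are computed. As one hedged illustration, the sketch below collects the host names that a blog home page links to and normalizes their counts into a distribution. The names `LinkCollector` and `outlink_distribution`, and the choice of out-link hosts as the unit, are assumptions made for this example only.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the host name of every absolute <a href="..."> link."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            host = urlparse(href).netloc.lower()
            if host:  # relative links have no host and are skipped
                self.hosts.append(host)


def outlink_distribution(html):
    """Return the fraction of out-links that point to each host."""
    collector = LinkCollector()
    collector.feed(html)
    counts = Counter(collector.hosts)
    total = sum(counts.values())
    return {host: n / total for host, n in counts.items()} if total else {}
```

A page whose out-links concentrate on a handful of hosts would yield a sharply peaked distribution, which is one signal a classifier could use alongside the text features.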
To be sure, Google and its search engine peers are rushing to fight off the splogs, teaching their search engines to distinguish between legitimate blogs and spam.
A plague of splogs could strangle this possibility.