Project idea: identifying influence and subgroup identity with TF/IDF

I wonder if you can pick out what writing influences someone, and what subcultures they identify with, based on what terms and phrases they use.


I have noticed that some terms and phrases occur only within a subset of the blogs that I read. Generally, there’s an originator, who puts ordinary words together in a new way to single out a particular idea. Then, other blogs that I follow write responses to that post. They’re not explicit about responding, but when they use the same turn of phrase it seems likely to me that they are reading the same source that I read. So I group them together mentally by their shared use of a linguistic meme and coincident discussion of the same idea.

This idea hinges on the proposition that there’s a natural bias towards adopting the language of people you tend to agree with (or at least whose writing you appreciate) and away from adopting language used by people you disagree with (or whose writing you find sloppy). Then, the adoption of a linguistic meme would be an indicator that the adopter reads & appreciates the thoughts of the originator. If the meme is unusual enough, its use would also function as a social signal: the adopter wants others to know that they read a certain writer. In that way, use of a meme would indicate that the adopter identifies with a certain subgroup.

I’m also interested to see if you can chart the movement of an idea through one of these subgroups over time and draw a directed graph of influencers. Then there are some interesting graph-theoretic conclusions to draw (e.g. identifying bottlenecks - if this small subset of blogs is DDoS’d, thought stops moving through this subgroup).
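
Here is a rough sketch of what that graph might look like in code. It rests on a crude assumption of mine: if blog X used a meme before blog Y did, X may have influenced Y, so it draws an edge from X to Y. The sighting tuples, blog names, and function names are all made up for illustration. A "bottleneck" here just means a blog whose removal cuts some other blog off from the originator.

```python
from collections import defaultdict, deque
from itertools import combinations

def influence_edges(sightings):
    """Build {earlier_blog: {later_blogs}} from (meme, blog, first_use) tuples.

    Assumes one first-use record per (meme, blog) pair. Ties in first_use
    are broken by blog name, which is arbitrary.
    """
    by_meme = defaultdict(list)
    for meme, blog, first_use in sightings:
        by_meme[meme].append((first_use, blog))
    graph = defaultdict(set)
    for uses in by_meme.values():
        ordered = [blog for _, blog in sorted(uses)]
        # Every earlier adopter of a meme points at every later adopter.
        for earlier, later in combinations(ordered, 2):
            graph[earlier].add(later)
    return graph

def reachable(graph, source, removed=None):
    """Breadth-first search from source, optionally skipping one node."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {source}

def bottlenecks(graph, source):
    """Blogs whose removal cuts at least one other blog off from source."""
    baseline = reachable(graph, source)
    return {
        node for node in baseline
        if reachable(graph, source, removed=node) != baseline - {node}
    }

# Invented toy data: (meme, blog, first-use day number).
sightings = [
    ("meme_one", "blog_a", 1),
    ("meme_one", "blog_b", 2),
    ("meme_two", "blog_b", 3),
    ("meme_two", "blog_c", 4),
    ("meme_two", "blog_d", 5),
]
graph = influence_edges(sightings)
print(bottlenecks(graph, "blog_a"))
```

In this toy data everything blog_a originates reaches blog_c and blog_d only by way of blog_b, so blog_b is the one that comes back as a bottleneck. Real influence is obviously messier than "whoever wrote it first", so I'd treat the edges as candidates rather than facts.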


Primarily, I am thinking of running this analysis on relatively long-form blogs. Longer documents give common, uninteresting words more chances to show up across the whole corpus, which should make rare terms stand out better and reduce the false positive rate.

However, it occurs to me it might also be interesting to apply to:

  • Tweets
  • Facebook posts
  • Wikipedia edits
  • Books (not to relate them to each other, but as sources to relate present-day writers to)

I plan to gather whatever data I can and put it into a simple JSON document format.
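
To make that a bit more concrete, here is one possible shape for a document. The field names (source, author, url, published, text) and the example values are placeholders I made up, not a settled schema.

```python
import json

def make_document(source, author, url, published, text):
    """Build one record in a flat, JSON-serializable shape (hypothetical schema)."""
    return {
        "source": source,        # e.g. "blog", "tweet", "wikipedia_edit"
        "author": author,
        "url": url,
        "published": published,  # ISO 8601 timestamp string
        "text": text,            # the raw body that gets tokenized later
    }

doc = make_document(
    source="blog",
    author="example-author",
    url="https://example.com/posts/1",
    published="2014-01-01T00:00:00Z",
    text="Some long-form post body goes here.",
)
encoded = json.dumps(doc, sort_keys=True)
print(encoded)
```

Keeping every source in the same flat shape should mean the later steps don't need to care whether a document started life as a blog post or a tweet.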


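Right now, I’m only aware of the TF/IDF method for identifying the terms that characterize a subset of documents. Basically, TF/IDF weighs how often a term occurs in a document (or set of documents), the Term Frequency, against how widespread the term is across the whole corpus. The second part is the Inverse Document Frequency: typically the logarithm of the total number of documents divided by the number of documents that contain the term. The score is the product of the two, so a term scores high when it is frequent in one document but rare everywhere else. This is typically used to pick out topics. If a document has a higher TF/IDF for a particular term, it’s more likely to (at least in part) be about that term. I think I’ll do the following: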

  • Pull all the terms out of all the documents
  • Throw out stopwords (a, the, and…)
  • For each term and document pair, calculate TF/IDF
  • Perform clustering on the results to identify groups of arbitrary sizes. I’ve done k-means before; it will probably work here since I only need two groups per term (in-subgroup and out-of-subgroup)
  • The interesting results will probably be: the groups for the rarest terms, the largest and smallest groups, and the out-of-subgroups for the most common terms. Let me know if you can think of any others.
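
The steps above can be sketched in plain Python. This is a minimal sketch, not the final pipeline: the stopword list is a tiny stand-in, the documents are invented toy text, tf is taken as count divided by document length, idf as log(N / df), and the clustering is a hand-rolled two-cluster k-means on one term's scores, seeded at the minimum and maximum. The function names are my own.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a much fuller one.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase the text, pull out word-like tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

def tf_idf(docs):
    """Return {doc_id: {term: tf * idf}} for a {doc_id: text} mapping."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # one count per document
    n_docs = len(docs)
    scores = {}
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values()) or 1
        scores[doc_id] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in term_counts.items()
        }
    return scores

def in_subgroup(scores, term, max_iter=50):
    """Split documents into two groups by their score for one term.

    One-dimensional k-means with k=2; returns the ids in the high group.
    """
    values = {doc_id: s.get(term, 0.0) for doc_id, s in scores.items()}
    lo, hi = min(values.values()), max(values.values())
    if lo == hi:
        return set()  # every document scores the same, nothing to split
    high = set()
    for _ in range(max_iter):
        high = {d for d, v in values.items() if abs(v - hi) < abs(v - lo)}
        low = set(values) - high
        new_lo = sum(values[d] for d in low) / len(low)
        new_hi = sum(values[d] for d in high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return high

# Invented toy corpus.
docs = {
    "blog_a": "I steelman the argument and then steelman the reply again",
    "blog_b": "You should steelman a view and then steelman the reply too",
    "blog_c": "Bread needs flour water and salt",
    "blog_d": "My garden needs water and compost",
}
scores = tf_idf(docs)
print(sorted(in_subgroup(scores, "steelman")))
```

On this toy corpus the two blogs that use the term end up in the in-subgroup and the other two fall out of it. With real data I'd expect the split to be much less clean, and a library clustering routine would probably replace the hand-rolled one.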


I intend to use the following tools (mostly because I’ve already got experience with them):

  • MongoDB (primary datastore - probably? I might just use PostgreSQL’s JSON column since I’ve already got PG up & running)
  • Scrapy (for fetching unstructured web content)
  • Elasticsearch (for its excellent text document analysis capability; it’s got TF-IDF built in)
  • Python (it’s easy to get a prototype going & I have a lot of experience with it)


I’ll play with this over the coming week & post a progress update next week. First step is to get some datasets, then design a target data format and load the datasets into it, which should all be easy enough. The more interesting stuff comes later. Let me know in the comments if you have ideas on how to expand this, or better ways of doing it.