Culturomics – 500 billion words start a trend

My brother-in-law sent me this one from the New York Times (thanks Ade!) and it blew me away. I’m guessing that people already know about the controversial project by Google to digitise every book in the world. If you don’t, it’s easy to find out a bit about it. Just Google it. *sigh*

Now, from that effort, a huge, and I mean monstrously, giganto-huge, database has been made from nearly 5.2 million digitised books. That database is now available to the public for free downloads and online searches. Before you panic that every book ever written is now available for free (which is what a lot of people fear), take a moment to understand the nature of the database. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian. That word-mine doesn't hold the books themselves: it comprises words and short phrases, along with a year-by-year count of how often they appear. The potential use for this in cultural studies and the humanities is mind-boggling.
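For the curious, here's roughly what a "year-by-year count" lets you do. This is only a toy sketch in Python, not how Google's tool actually works: the numbers are made up for illustration (they are not real figures from the database), and the names `relative_frequency`, `toy_counts` and `toy_totals` are my own inventions.

```python
def relative_frequency(counts, totals):
    """Return {year: share of all words printed that year} for one word.

    counts: {year: how many times the word appeared that year}
    totals: {year: how many words were printed that year in total}
    """
    return {
        year: counts.get(year, 0) / total
        for year, total in sorted(totals.items())
        if total > 0  # skip years with no words at all
    }


# Made-up numbers, purely illustrative (NOT real data from the database).
toy_counts = {1950: 2, 1960: 10, 1970: 45}
toy_totals = {1950: 1_000_000, 1960: 2_000_000, 1970: 3_000_000}

trend = relative_frequency(toy_counts, toy_totals)
for year, share in trend.items():
    print(year, f"{share * 1_000_000:.1f} per million words")
```

The division is the important bit, I think: if more books get printed in later years, a raw count would climb even when a word isn't getting any more popular, so you'd want its share of all the words printed that year.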

“The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books,” said Erez Lieberman Aiden, a junior fellow at the Society of Fellows at Harvard. He calls this method of mass, high-speed analysis “culturomics”: the application of high-throughput data collection and analysis to the study of human culture.

There are those who have reservations about the efficacy of the project and those who question the team involved, suggesting that not all the right kinds of experts are represented. But you always get that among academics. They can be a bitchy bunch.

The New York Times article closes with this gem:

The warehouse of words makes it possible to analyze cultural influences statistically in a way previously not possible. Cultural references tend to appear in print much less frequently than everyday words, said Mr. Michel, whose expertise is in applied math and systems biology. An accurate picture needs a huge sample. Checking if “sasquatch” has infiltrated the culture requires a supply of at least a billion words a year, he said.

Read the whole article for a much clearer idea of what’s happening. There are links in the article to the full Science journal paper (available free to everyone, although you have to register) and an online tool to search the Google database for the use of any particular word or phrase over time. I can see myself wasting a lot of time with this.
