Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    big data analytics in transporation
    Turning Data Into Decisions: How Analytics Improves Transportation Strategy
    3 Min Read
    sales and data analytics
    How Data Analytics Improves Lead Management and Sales Results
    9 Min Read
    data analytics and truck accident claims
    How Data Analytics Reduces Truck Accidents and Speeds Up Claims
    7 Min Read
    predictive analytics for interior designers
    Interior Designers Boost Profits with Predictive Analytics
    8 Min Read
    image fx (67)
    Improving LinkedIn Ad Strategies with Data Analytics
    9 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”
Big Data

The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”

hstevens
hstevens
7 Min Read
SHARE

Between 2007 and 2010, graduate students Erez Aiden, Jean-Baptiste Michel and Yuan Shen developed a tool for analyzing Google’s digitized books. Known as the N-gram viewer, their tool essentially just counted the number of times a word or phrase appeared in all the digitized publications for a given year.

Between 2007 and 2010, graduate students Erez Aiden, Jean-Baptiste Michel and Yuan Shen developed a tool for analyzing Google’s digitized books. Known as the N-gram viewer, their tool essentially just counted the number of times a word or phrase appeared in all the digitized publications for a given year. By plotting the year-by-year count, the software produced an N-gram: a chart showing the frequency of word use over time. For example, the chart below shows the frequency of the words “data,” “fact,” and “information” plotted from 1800 to 2000.

This is a powerful tool and there is lots that we can discern from even this basic query (that took all of 5 seconds): “facts” became more and more important through the 19th century, reaching a plateau by about 1920, and then declining sharply from about 1970 onward; “data” and “information” have grown rapidly in popularity in the 20th century, tracking each other closely until about the mid-1980s.

More Read

big data analytics for trading data
How Online Stock Trading is Being Impacted by Big Data
Search and the social graph
5 Great Tips for Using Data Analytics for Website UX
Five Data Mining Techniques That Help Create Business Value
Miss the Right Connections at Your Own Peril

Aiden and Michel have written about their N-gram viewer and how they came to develop it in a book: Uncharted: Big Data as a Lens on Human Culture (2013). The book gives lots of interesting example of how N-grams might be used to track fame, understand nationalism, explore the birth and death of words and concepts, and follow the influence of inventions. Aiden and Michel see their tool as part of a big data revolution that will transform the way we understand human culture in general and history in particular. “Big data,” they argue, “is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower” (p. 8). They see the N-gram as an instrument, like the microscope, that will provide us with a new quantitative insight into historical change. They call this approach “culturomics” (in analogy with genomics).  

But there are some good reasons to be skeptical, or perhaps even a little worried, about such claims. There are two issues in particular that are suggestive of more general problems for big data approaches.

First, there is the issue of completeness. One of the vaunted features of big data approaches are their ability to utilize very wide data sets. Victor Mayer-Schonberger and Kenneth Cukier have argued that one of the unique features of big data approaches is the use of “all available data.” Of course, Google Books has a very wide (and long) data set – 30 million or so books. And for this reason it is easy to see something like N-gram as working with “all available data” or some close approximation of it. But even this huge amount of text is only 20-25% of all material ever published. This is a large fraction, and it is no doubt valuable for all sorts of purposes. But it raises questions about what is left out. Do we have a representative sample? Are some sorts of books being systematically excluded? How does copyright influence which books are available?

Even leaving aside the question of what included and excluded, we have another problem. Google Books only (at the moment at least) takes account of published materials. Humans have, over the last centuries, created and left behind huge troves of unpublished material – letters, diaries, manuscripts, memos, files, etc. that were never published. Paying attention only to published sources can lead to unbalanced accounts. Publication is expensive and often those who publish written work tend to be those in positions of power and influence (men not women, lords not peasants, owners not workers, colonizers not colonized, etc.). Historians have grappled with this problem for many years – writing balanced history often means finding a way to account for inherently unbalanced sources. In dealing with societies where only a small fraction of the population is literate, one must be even more careful not to take any written material (published or unpublished) as representative of the overall population.

Second, there is the issue of quantitation. Historians attempt to make up for the inadequacy of their sources by using their intuition, judgment, empathy, and imagination. This sort of work may become increasingly marginalized in a world where the humanities and social sciences are subjected to quantitation. Aiden and Michel imagine and celebrate a “scrambling of the boundaries between science and the humanities” (p. 207). Surely, both have a lot to learn from one another. But it is important that quantitation does not dominate history, or other humanities too much. Counting words is important, but it cannot tell us everything. N-gram viewer tells us nothing about the context in which words appear for example. Context is exactly the kind of thing that historians are concerned with. Ultimately, N-gram viewer is a one-dimensional measure of a highly-multi-dimensional space (culture, society, history). 

To be fair, Aiden and Michel are aware of many of the shortcomings of their tool and their book contains some intelligent discussions of its possible limitations. However, big data tools of this type – powerful and yet easy to use – are proving immensely popular. They are seductive. However, here and with big data in general, we should be careful not to confuse having very large amounts of data with having “all data” and we should be careful to temper the drive to quantification with careful judgment and intuition.

Share This Article
Facebook Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

financial data
Engineering Trust into Enterprise Data with Smart MDM Automation
Big Data Exclusive
christina wocintechchat com 6dv3pe jnsg unsplash
How CIS Credentials Can Launch Your AI Development Career
Exclusive News
big data analytics in transporation
Turning Data Into Decisions: How Analytics Improves Transportation Strategy
Analytics Big Data Exclusive
AI and fund manager software
AI And The Acceleration Of Information Flows From Fund Managers To Investors
Artificial Intelligence Exclusive

Stay Connected

1.2kFollowersLike
33.7kFollowersFollow
222FollowersPin

You Might also Like

“Synthetic Biology is A) the design and construction of new biological parts, devices, and systems,…”

1 Min Read
python best language for big data
Artificial IntelligenceBig DataBlockchainBusiness IntelligenceData ManagementData MiningData QualityData ScienceData VisualizationHadoopITMachine LearningUnstructured Data

Here’s Why Python Is The Top Programming Language For Big Data

6 Min Read
Image
Big DataHadoopSoftware

Why Your Choice of Hadoop Infrastructure Is Important

4 Min Read

How Big Data is Creating the Future of Science Fiction

4 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

giveaway chatbots
How To Get An Award Winning Giveaway Bot
Big Data Chatbots Exclusive
ai is improving the safety of cars
From Bolts to Bots: How AI Is Fortifying the Automotive Industry
Artificial Intelligence

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?