Training students on mega-scale data

In a New York Times article (sub. req.) published over the weekend, IBM and Google expressed doubts that students graduating from US universities today have the chops to deal with the multi-terabyte data sets that are becoming commonplace online and in domains like bioscience and astronomy. From the article:

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

The article reveals how Google and IBM are promoting internet-scale research at places like the University of Washington and Purdue. But a curious omission from the article is any mention of the open-source technologies that are spurring innovation in processing and analyzing these data sets. Tools like Hadoop, for processing internet-scale data sets, and R, for analyzing the processed data (most likely in some parallelized form), and other …
