Google Engrams: Data in the Clouds

By Sam Celarek

"How might we use distributed computation to determine the frequency of the word 'data' in Google's corpus of books spanning the last two centuries?"

🎯 Project Overview

This project uses distributed computing tools such as Hadoop, AWS, and PySpark to explore the historical frequency of the word ‘data’ in Google’s vast corpus of books. Using cloud-based computation, we processed more than 260 million entries spanning half a millennium to trace how the word’s presence in literature has evolved.

📊 Dataset

The dataset was sourced from Google’s Ngram Viewer, an online search engine that charts the frequencies of any set of search strings using yearly counts of n-grams. The counts are drawn from sources printed between 1500 and 2019 in Google’s text corpora, which cover multiple languages.
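
To make the shape of such data concrete, here is a minimal sketch of parsing a single record. It assumes the tab-separated layout `ngram<TAB>year<TAB>match_count<TAB>volume_count`, which is one published layout of the Google Books Ngram exports; the files used in this project may be laid out differently. `NgramRecord` and `parse_line` are hypothetical names, and the sample values are invented for illustration.

```python
from typing import NamedTuple, Optional


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int   # occurrences of the n-gram in that year
    volume_count: int  # number of books containing it in that year


def parse_line(line: str) -> Optional[NgramRecord]:
    """Parse one tab-separated line; return None if it is malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return None
    ngram, year, match_count, volume_count = parts
    try:
        return NgramRecord(ngram, int(year), int(match_count), int(volume_count))
    except ValueError:
        # A non-numeric year or count marks the line as unusable.
        return None


# Invented sample values, for illustration only.
record = parse_line("data\t1950\t1200\t300\n")
print(record)
```

Returning `None` for malformed lines lets a later step drop them with a simple filter instead of stopping a long-running job on the first bad record.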

🧹 Data Wrangling

The data was loaded and processed on cloud computing resources to handle its immense volume efficiently. PySpark was used for the data wrangling phase, in particular for cleaning and structuring the dataset for analysis.

🛠️ Feature Engineering

Given the size of the dataset and the specific focus on the word ‘data’, the main feature of interest was derived by mapping and reducing the dataset down to the frequency of the word ‘data’ in each year.
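
The map-and-reduce idea described here can be sketched in plain Python. This is a small stand-in that runs on one machine, not the project's actual PySpark job; the `(token, year, count)` row layout, the helper names, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

# Invented sample rows of (token, year, count), for illustration only.
rows = [
    ("data", 1900, 5),
    ("Data", 1900, 2),
    ("database", 1900, 9),
    ("data", 2000, 40),
    ("cloud", 2000, 7),
]


def to_year_count(row):
    """Map step: turn one matching row into a one-entry {year: count} counter."""
    _, year, count = row
    return Counter({year: count})


def merge(left, right):
    """Reduce step: add the per-year counts of two partial results."""
    return left + right


def yearly_counts(rows, word="data"):
    # Keep exact matches only; folding "Data" into "data" by lowercasing
    # is an assumption about how the cleaning step treats case.
    matching = (row for row in rows if row[0].lower() == word)
    return dict(reduce(merge, map(to_year_count, matching), Counter()))


print(yearly_counts(rows))
```

In PySpark the same shape would typically be expressed as a `filter` on the token column followed by `groupBy` on the year column and a sum aggregation. Because the merge step is associative, partial results from different workers can be combined in any order.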

📶 Exploratory Data Analysis (EDA)

The EDA phase, built on PySpark, involved plotting the frequency trend of the word ‘data’ over the past five hundred years. The trend shows the word’s long history in print and its growing importance in recent times.

[Image: Google Engrams]

📈 Analysis

The primary observation was exponential growth in the frequency of the word ‘data’ over the last century, which reflects the influence of the digital age and the increasing emphasis on data-driven decision-making across many fields.
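
One way to check whether a yearly series really grows exponentially is to fit a straight line to the logarithm of the counts: if the counts follow `a * exp(r * year)`, the slope of that line is the growth rate `r`. The sketch below illustrates that check on a synthetic series. It is not necessarily how this project reached its conclusion, and `log_linear_growth_rate` is a hypothetical helper name.

```python
import math


def log_linear_growth_rate(years, counts):
    """Least-squares slope of ln(count) against year.

    Assumes every count is positive. If the counts follow
    count = a * exp(r * year), the returned slope equals r.
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    return sxy / sxx


# Synthetic series that doubles every decade, for illustration only.
years = [1900, 1910, 1920, 1930, 1940]
counts = [100 * 2 ** ((y - 1900) / 10) for y in years]

rate = log_linear_growth_rate(years, counts)
doubling_time = math.log(2) / rate  # years needed for the count to double
print(round(doubling_time, 6))
```

On real data the points will not sit exactly on a line, so it helps to look at how far the logged counts stray from the fit before calling the growth exponential.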

Thank you for your interest in Google Engrams: Data in the Clouds. For more details or insights, please explore the GitHub repository or get in touch at scelarek@gmail.com.

Warm Regards,
Sam Celarek