SwiftKey data scientist Sebastian Spiegler is an expert on open data and volunteers for the Common Crawl, an organization that produces and maintains an open repository of large parts of the internet. He recently published an in-depth analysis of the data. Here, Sebastian explains why open data matters:
Who are you and what do you do?
My name is Sebastian, I lead the data team at SwiftKey. As part of our daily work, we build infrastructure for storing, processing and analysing large datasets, such as language data and usage statistics.
What types of data sources do you typically use to power SwiftKey’s language predictions and suggestions?
SwiftKey keyboard’s core features of text prediction, completion and correction are based on custom language models. These language models are built in our language pipeline from large datasets consisting of crawled web data or conversational data.
Where does this data come from?
A portion of our static web data was produced in collaboration with a research project at the University of Cambridge. Nevertheless, we are constantly re-crawling specific languages to extend our coverage and update current models.
Tasks include, for instance, identifying good seed URLs, setting up and monitoring crawlers, and verifying the quality of data, all of which can be time-consuming.
What is Common Crawl and how did you first hear about it?
The Common Crawl produces and maintains an open repository of large parts of the internet. This allows anyone to cost-effectively access terabytes of crawled data using Amazon Web Services. I first came across the Common Crawl corpus during a search for publicly available datasets in spring 2012.
Although the Common Crawl corpus is only a fraction of what some search engines store, it is open and large enough for most applications.
I felt that carrying out this analysis and sharing the results, as well as the tools used, would help more individuals and businesses to understand the value of the corpus. It also helps promote the work of the foundation.
You explored and analyzed the Common Crawl corpus. How did you go about doing this research?
The research I have carried out constitutes a scan of the entire 2012 corpus. I was especially interested in the distribution of top-level domains, media types and encodings of documents, and I wanted to understand more about the structure of the corpus and its representativeness with respect to the entire internet.
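To make the idea of a top-level domain distribution concrete, here is a minimal Python sketch of how such a tally could be computed from a list of URLs. It is an illustration only, not the tooling used for the analysis: the function names `top_level_domain` and `tld_distribution` and the sample URLs are hypothetical, and a real scan of the corpus would read URLs from the crawl files rather than from an in-memory list.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the URL's hostname, or None if there is none.

    Simplification: an IP-address host such as 192.168.0.1 would yield "1";
    a real analysis would filter those out.
    """
    try:
        host = urlsplit(url).hostname  # already lowercased by urlsplit
    except ValueError:  # raised for some malformed URLs
        return None
    if not host:
        return None
    host = host.rstrip(".")  # tolerate fully qualified names like "example.com."
    if "." not in host:
        return None
    return host.rsplit(".", 1)[1]


def tld_distribution(urls):
    """Map each top-level domain to its share of the URLs that have one."""
    counts = Counter(tld for tld in map(top_level_domain, urls) if tld)
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    sample = [
        "http://example.com/a",
        "https://blog.example.org/",
        "http://shop.example.com/x",
    ]
    for tld, share in sorted(tld_distribution(sample).items()):
        print(f".{tld}: {share:.1%}")
```

The same counting pattern would apply to media types or character encodings by swapping the key function.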
What did you learn about the breadth and quality of Common Crawl data?
The 2012 corpus is a representative snapshot of the internet, based on the distribution of top-level domains. It consists of 210 terabytes of raw data, which correspond to 3.83 billion documents and 41.4 million distinct second-level domains.
More than 55% of sites are from the .com top-level domain, but this does not necessarily mean that their content is in English.
What did you find in your research that surprised you?
The corpus contains a large number of sites from YouTube, Blogspot, WordPress and Amazon. It turned out that the crawler was fine-tuned for these domains to collect as many reviews, comments and blog entries as possible. Although these sites are registered under the .com top-level domain, one can expect a lot of pages in non-English languages.
This investigation showed that with little effort in terms of code and processing costs, you can easily run analytics across several billion web pages without the hurdle of crawling and storing the data yourself. The index generated even allows you to specifically target subparts of the corpus.
How do you imagine that SwiftKey will put the Common Crawl corpus to use?
As we extend our list of supported languages – which currently stands at 60 – we are dealing with languages that are less represented on the internet. The Common Crawl corpus allows us to process websites containing target languages without the need for compiling lists of seed URLs and crawling data ourselves.
Why is openly accessible data important?
Having open access to data is vital for research and innovation. It weakens the monopoly on infrastructure and data that only a few large companies hold today.
Open data lowers barriers to market entry for new startups and allows existing businesses to cost-effectively compete.
Where can people find your research?