Thousands of research papers mention, use, or cite Common Crawl data, which makes it difficult to get a meaningful overview from traditional academic search engines like Google Scholar. To make exploration easier, we built an interactive visualization, available as a Space on Hugging Face.
The visualization is a map-like web application that displays more than 10,000 research papers as markers. Users can explore the papers visually by dragging and zooming, and a search bar lets them find papers by title.
The position of each paper on the map is determined by its semantic similarity to other papers. Specifically, we use SciNCL paper embeddings based on paper titles and abstracts, in combination with UMAP dimensionality reduction. Paper topics are distinguished by color. For topic detection, we use LDA (latent Dirichlet allocation) in combination with Anthropic’s Claude to come up with human-readable topics.
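To make the idea of semantic similarity between embeddings more concrete, here is a minimal sketch in plain Python. It is not the pipeline behind the visualization: the toy vectors are hypothetical stand-ins for SciNCL embeddings (real embeddings have far more dimensions), cosine similarity is assumed here only as one common way to compare embedding vectors, and the UMAP and LDA steps are not shown. The names `cosine_similarity`, `most_similar`, and `toy_embeddings` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "mt_paper_a": [0.9, 0.1, 0.0, 0.2],
    "mt_paper_b": [0.8, 0.2, 0.1, 0.1],
    "security_paper": [0.0, 0.1, 0.9, 0.3],
}


def most_similar(query, embeddings):
    """Return the key whose vector is most similar to embeddings[query]."""
    return max(
        (key for key in embeddings if key != query),
        key=lambda key: cosine_similarity(embeddings[query], embeddings[key]),
    )


if __name__ == "__main__":
    print(most_similar("mt_paper_a", toy_embeddings))
```

With these toy vectors, the two machine translation papers point in nearly the same direction, so they would be expected to land close together on a map, while the security paper would sit further away.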
Clusters of Research Papers
The visualization provides a clear overview of the research areas that directly or indirectly use Common Crawl data. Some topic clusters dominate, but many others are clearly visible too. A few examples are listed below:
Security & Attack Detection
The topic of security and attack detection appears in several areas and clusters in the visualization (displayed in red). Below are a few examples of papers from this topic:
- Propaganda in Press: Challenges of Automatic Detection (Pušelj and Skalec, 2020)
- A State-of-the-Art Review on Phishing Website Detection Techniques (Li et al., 2024)
- From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories (Tian et al., 2025)

Machine Translation
One isolated topic cluster in the top-right corner is about machine translation research and related topics (displayed in green). The cluster contains papers like:
- Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet (Kocmi et al., 2024)
- On the Language Coverage Bias for Neural Machine Translation (Wang et al., 2021)
- Nearest Neighbor Machine Translation (Khandelwal et al., 2021)

Ethics & Governance
At the very center of the map, there is a cluster about ethics and governance. The cluster contains papers such as Multidimensional tie strength and economic development (Aiello et al., 2022).

Other Topics
All research papers that cannot be assigned to a specific topic cluster are labeled “Other” and shown in grey. For example, Tracking and Identifying International Propaganda and Influence Networks Online (Hanley, 2025) and Determining How Citations Are Used in Citation Contexts (Färber and Sampath, 2019) are part of this cluster.

There are many more interesting research papers hidden here. Feel free to explore the interactive visualization on your own.
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
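As a rough illustration of the thresholds described above, the following sketch maps a crawl ID to the fetch size limit that applied to it. It assumes crawl IDs of the form `CC-MAIN-YYYY-NN` that sort chronologically by year and trailing number; the function name `truncation_limit_bytes` and the constants are hypothetical and not part of any Common Crawl tooling.

```python
import re

MIB = 1024 * 1024
# First crawl with the larger limit: CC-MAIN-2025-13 (March 2025).
NEW_LIMIT_START = (2025, 13)


def truncation_limit_bytes(crawl_id):
    """Return the fetch size limit in bytes for an ID like 'CC-MAIN-2025-13'."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, seq = int(match.group(1)), int(match.group(2))
    # Assumption: (year, seq) tuples order crawls chronologically.
    return 5 * MIB if (year, seq) >= NEW_LIMIT_START else 1 * MIB


if __name__ == "__main__":
    for cid in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(cid, truncation_limit_bytes(cid))
```

A payload whose stored length equals the limit for its crawl is a candidate for having been truncated, although confirming this requires inspecting the archived record itself.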

