Wikipedia Explorer — The Sum of All Human Knowledge, Indexed
"Discover the sum of all human knowledge. Search millions of articles across every field of science, history, art, and culture."
Wikipedia Explorer is the flagship showcase for the Dumont Web Crawler Connector. It politely crawls Wikipedia starting from the Main Page, extracts article content, filters out the noise (Talk pages, User pages, File namespaces…), and surfaces everything through an elegant encyclopedia-grade search interface.
The idea
The open web is the world's largest, messiest knowledge graph. Most of it lives behind paywalls or deep inside CMSes — but some of the best parts are freely linked HTML pages just waiting to be discovered. The Dumont Web Crawler turns any public site into a searchable corpus, and Wikipedia Explorer proves the point by pointing the crawler at the single most impressive example on the internet: Wikipedia itself.
What you get
- 📚 Clean encyclopedia UI — sober typography, smart facets, and just the right amount of color.
- 🔦 Instant search with hints like "Einstein, quantum, rainforest…" — start typing and the suggestions flow.
- 🗂️ Faceted browsing — slice by category, language, topic, or any custom attribute your crawler captures.
- 🧭 Deep-link navigation — URL-synchronized state means every search is shareable.
- 📄 Article detail view with rendered HTML, metadata and backlinks.
- 🌙 Dark and light modes, because deep reading deserves both.
The crawler behind the curtain
The sample configuration (scripts/sample/export/wikipedia.json) is a masterclass in polite crawling:
{
"startingPoints": ["https://en.wikipedia.org/wiki/Main_Page"],
"allowUrls": ["https://en.wikipedia.org/wiki/*"],
"notAllowUrls": [
"https://en.wikipedia.org/wiki/Special:*",
"https://en.wikipedia.org/wiki/Talk:*",
"https://en.wikipedia.org/wiki/User:*",
"https://en.wikipedia.org/wiki/Wikipedia:*",
"https://en.wikipedia.org/wiki/Template:*",
"https://en.wikipedia.org/wiki/Category:*",
"https://en.wikipedia.org/wiki/File:*",
"https://en.wikipedia.org/wiki/Module:*"
],
"notAllowExtensions": [".pdf", ".zip", ".jpg", ".png", ".mp4", "…"]
}
Every rule is there for a reason — it keeps you on article pages, skips metadata namespaces, and avoids downloading binaries you don't need. Swap the starting point for any other site (your docs portal, a news site, a competitor catalog) and you have a brand-new indexed corpus.
Architecture
Any Website ──► WC Plugin (Crawler4j-style) ──► Dumont Connector ──► Turing ES ──► Wikipedia Explorer
(HTTP/HTTPS, │
robots-aware) ▼
You, falling down
a rabbit hole
of curiosity
Fire it up
- Download the zip from the Dumont Marketplace.
- Upload it to a Turing SN Site named
wikipedia. - Create a Web Crawler source using the provided
wikipedia.jsonand launch the crawl. - Pour a coffee — the encyclopedia is vast.
Features worth calling out
| Feature | How it shows up |
|---|---|
| URL include/exclude patterns | Keeps the crawler on topic |
| Extension blacklist | Skips PDFs, images, archives |
| Starting-point configuration | Crawl a single section or the whole site |
| Pagination + history | Never lose your place |
| Empty-state messaging | "No articles match your search. Try different keywords?" |
Tech Stack
React 19 · TypeScript · Vite · Tailwind CSS v4 · shadcn/ui · Tabler Icons · Turing React SDK
Lost in a reference loop? Report it at openviglet/dumont — we'll help you climb back out.