CrawlIQ: Building a Mini Search Engine

May 24, 2026 · 9 min read

Search looks simple from the outside: type a query and get relevant pages instantly. CrawlIQ is my attempt to rebuild the core retrieval pipeline behind that experience for technical documentation sites: crawl pages, extract content, build an inverted index, rank results with BM25, and use a graph layer to understand how pages relate.

The goal was not to clone Google but rather to build the full pipeline end-to-end, including the unglamorous parts: queues, workers, durable progress, retries, indexing, ranking, failure recovery, observability, and tests.

At a high level, the system looks like this:

Crawl (worker):  seed -> queue -> fetch -> parse -> store -> index (per page)
After crawl:     graph enrichment
At query time:   BM25 search (+ optional graph rerank / related pages)

Current scale

On local crawls, CrawlIQ has handled:

23K+ pages crawled
210K+ unique terms
6.4M+ index postings
~100% index coverage for fetched pages
p95 search latency under 200ms
100+ crawl jobs tracked through the dashboard

Architecture

CrawlIQ runs as a small Docker Compose system:

Component	Role
`web`	Next.js dashboard for running crawls, monitoring jobs, searching, and exploring graph results
`api`	FastAPI backend for HTTP endpoints, search, graph reads, and job creation
`worker`	Python/RQ worker that runs the crawl pipeline
`redis`	Queue for background jobs
`postgres`	Stores crawl jobs, pages, links, index tables, graph tables, and query logs
`demo-site`	Static site for deterministic crawl demos

Unlike a normal CRUD app, crawling is not a single request-response action. A crawl can take seconds or minutes, so it has to run in the background. It can fail halfway, discover more work, or produce partial progress. That made it a good way to think about state, retries, visibility, and recovery.

Starting a Crawl

The crawl flow starts with a request:

POST /crawl-jobs

The API creates a crawl_jobs row with the seed URL, limits, and initial status, then enqueues an RQ job by ID. The worker picks up that job and calls process_crawl_job(crawl_job_id).

The worker then runs a bounded BFS crawl:

start with seed URL
  -> fetch page
  -> parse text and links
  -> store page
  -> index page
  -> add eligible links to frontier
  -> repeat until max pages/depth/frontier limit
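Here is a minimal sketch of that loop in plain Python. It is not the actual worker code: the function name, the `fetch_links` callback, and the default limits are assumptions for illustration, and the fetch, parse, store, and index steps are collapsed into a single callback that returns a page's outgoing links.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlsplit


def bounded_bfs_crawl(
    seed: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
    max_depth: int = 3,
    max_frontier: int = 1000,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first and return (url, depth) in visit order."""
    seed_host = urlsplit(seed).netloc
    frontier = deque([(seed, 0)])
    seen = {seed}
    visited: list[tuple[str, int]] = []

    while frontier and len(visited) < max_pages:
        url, depth = frontier.popleft()
        # In the real pipeline, fetch -> parse -> store -> index and the
        # per-page commit would happen here.
        links = fetch_links(url)
        visited.append((url, depth))

        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in links:
            same_domain = urlsplit(link).netloc == seed_host
            if same_domain and link not in seen and len(frontier) < max_frontier:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited
```

The `seen` set is what keeps the crawl from revisiting pages when two pages link to each other, and the three limits guarantee the loop terminates even on a site with an unbounded link graph.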

One implementation detail I cared about was committing after each page. That means progress is durable and visible while the crawl is still running. If the worker has fetched 40 pages out of 100, the UI can show those 40 pages instead of waiting until the entire job finishes.

It also makes failures less destructive. If the worker crashes halfway through, the database still contains the pages already fetched, the links already discovered, and the index entries already written.

CrawlIQ is also not designed to crawl the entire web. It's designed to be a local, bounded crawler mostly for learning and development, but I still wanted it to behave responsibly. This includes:

same-domain only mode
max crawl depths and page counts
request timeouts
redirect limits
per-domain delay
basic robots.txt compliance
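A few of these guardrails can be sketched with the standard library alone. The `PolitenessGate` class below is a hypothetical helper, not part of CrawlIQ: the class name, the `CrawlIQBot` user agent, and the one-second default delay are all assumptions. It uses `urllib.robotparser` for the robots.txt check and expects the robots.txt text to have been fetched already.

```python
import time
from typing import Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: same-domain check, robots.txt, per-domain delay."""

    def __init__(self, seed_url: str, delay_seconds: float = 1.0,
                 user_agent: str = "CrawlIQBot") -> None:
        self.seed_host = urlsplit(seed_url).netloc
        self.delay_seconds = delay_seconds
        self.user_agent = user_agent  # assumed name, not from the project
        self.robots: dict[str, RobotFileParser] = {}
        self.last_fetch: dict[str, float] = {}

    def load_robots(self, host: str, robots_txt: str) -> None:
        """Parse robots.txt text that was already fetched for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def is_allowed(self, url: str) -> bool:
        host = urlsplit(url).netloc
        if host != self.seed_host:
            return False  # same-domain only mode
        parser = self.robots.get(host)
        # No rules loaded for this host means nothing is disallowed.
        return parser is None or parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url: str, now: Optional[float] = None) -> float:
        """How long to sleep before fetching so the per-domain delay holds."""
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.delay_seconds - (now - last))

    def record_fetch(self, url: str, now: Optional[float] = None) -> None:
        """Remember when this host was last fetched."""
        self.last_fetch[urlsplit(url).netloc] = time.monotonic() if now is None else now
```

Passing `now` explicitly is only there to make the delay logic testable without sleeping. A crawl loop would call `is_allowed`, sleep for `seconds_to_wait`, fetch, then call `record_fetch`.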

The fetch layer uses httpx. After fetching a page, CrawlIQ checks whether the content type is indexable, such as HTML, XHTML, plain text, or markdown. Then the parser extracts the title, body text, and links.
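The content-type gate can be as small as the sketch below. The function name and the exact set of MIME types are assumptions based on the formats listed above. The one detail worth getting right is stripping parameters like `charset` before comparing.

```python
from typing import Optional

# Assumed MIME types for "HTML, XHTML, plain text, or markdown".
INDEXABLE_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/plain",
    "text/markdown",
}


def is_indexable(content_type_header: Optional[str]) -> bool:
    """Return True if a Content-Type header value names an indexable type."""
    if not content_type_header:
        return False
    # "text/html; charset=utf-8" -> "text/html"
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in INDEXABLE_TYPES
```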

The parser also strips obvious noise like scripts and navigation-heavy content. It isn't perfect, but it's good enough for our goal.

Turning pages into indexable terms

Once a page is parsed, CrawlIQ indexes it immediately. It's a classic inverted index:

term -> page_id -> term_frequency

Instead of scanning every page every time a user searches, the system stores which pages contain which terms, allowing for efficient search.

The tokenizer normalizes text, removes stopwords, and produces terms for both indexing and querying. Title terms are weighted more heavily than body terms, which works well for technical documentation, where titles tend to name the page's topic.
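To make the term -> page_id -> term_frequency shape concrete, here is a minimal in-memory sketch. The stopword list, the regex tokenizer, and the title weight of 3 are assumptions for illustration. The real index lives in Postgres tables, and the actual weight is not something this post pins down.

```python
import re
from collections import Counter, defaultdict

# Illustrative subset; a real stopword list would be longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}
TITLE_WEIGHT = 3  # assumed value: each title occurrence counts three times


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def index_page(index: dict, page_id: int, title: str, body: str) -> None:
    """Add one page's postings to index, shaped term -> {page_id: tf}."""
    counts = Counter(tokenize(body))
    for term in tokenize(title):
        counts[term] += TITLE_WEIGHT
    for term, tf in counts.items():
        index[term][page_id] = tf


index = defaultdict(dict)
index_page(index, 1, "Installing the CLI", "Install the CLI with pip. The CLI is small.")
```

Using the same `tokenize` function for pages and queries matters: if the two sides normalize differently, a query term can miss postings that should have matched.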

A page stores metadata like:

URL
normalized URL
title
extracted text
content hash
raw HTML hash
crawl depth
token count
fetch duration
indexed timestamp

BM25 search

For the first version of search, I used BM25.

BM25 is a classic ranking algorithm used in information retrieval. The idea is to score pages based on query term matches, while accounting for how common a term is across the corpus and how long each document is.

The search flow is:

user query
  -> tokenize query
  -> look up postings
  -> compute BM25 scores
  -> rank pages
  -> generate snippets
  -> return results

BM25 allows us to balance three things:

Whether a page contains the query terms
How rare each term is across the corpus
Whether the page matches strongly or is just long

The first part of the formula is inverse document frequency, or IDF:

\mathrm{IDF}(t) = \ln\left( 1 + \frac{N - df(t) + 0.5}{df(t) + 0.5} \right)

Where:

N = \text{total number of indexed pages (optionally scoped to one crawl job id)}

df(t) = \text{number of documents containing term } t

In plain English, this gives more weight to rare terms. If a word appears in almost every page, it's probably not very useful for ranking. If it appears in only a few pages, it's a stronger signal.

Then each query term contributes a BM25 score for a document:

\mathrm{BM25}(t,d) = \mathrm{IDF}(t) \cdot \frac{ tf(t,d)(k_1 + 1) }{ tf(t,d) + k_1\left(1 - b + b \cdot \frac{|d|}{avgdl}\right) }

Where:

tf(t,d) = \text{number of times term } t \text{ appears in document } d \text{ (the term frequency stored in the inverted index)}

|d| = \text{document length (the page's stored token count)}

avgdl = \text{average document length (over pages with an indexed timestamp and token count > 0)}

k_1 = 1.5,\qquad b = 0.75

The k1 value controls term-frequency saturation. In other words, seeing a word 10 times should help more than seeing it once, but not 10 times more. The b value controls document-length normalization, so long pages don't automatically win just because they contain more words.

The final page score is the sum of each query term’s BM25 contribution:

\mathrm{score}(d,q) = \sum_{t \in q} \mathrm{BM25}(t,d)

In CrawlIQ, repeated query terms also act as multipliers:

\mathrm{score}(d,q) = \sum_{t \in q} count(t,q) \cdot \mathrm{BM25}(t,d)
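The formulas above translate almost line for line into Python. This is a sketch over in-memory dictionaries rather than the Postgres-backed implementation. The function names and data shapes are assumptions, while k1, b, the IDF form, and the repeated-term multiplier follow the definitions above.

```python
import math
from collections import Counter, defaultdict

K1 = 1.5
B = 0.75


def idf(n_docs: int, df: int) -> float:
    """IDF(t) = ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


def bm25_scores(query_terms: list[str],
                postings: dict[str, dict[int, int]],
                doc_lengths: dict[int, int]) -> dict[int, float]:
    """Score pages for a tokenized query.

    postings:    term -> {page_id: tf}
    doc_lengths: page_id -> token count
    """
    lengths = [n for n in doc_lengths.values() if n > 0]
    if not lengths:
        return {}
    n_docs = len(doc_lengths)
    avgdl = sum(lengths) / len(lengths)

    scores: dict[int, float] = defaultdict(float)
    # Counter gives count(t, q), so repeated query terms act as multipliers.
    for term, q_count in Counter(query_terms).items():
        docs = postings.get(term, {})
        if not docs:
            continue
        term_idf = idf(n_docs, len(docs))
        for page_id, tf in docs.items():
            length_norm = 1 - B + B * doc_lengths[page_id] / avgdl
            scores[page_id] += q_count * term_idf * tf * (K1 + 1) / (tf + K1 * length_norm)
    return dict(scores)
```

Only pages that appear in at least one posting list are ever touched, which is why the inverted index keeps scoring cheap even as the corpus grows.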

After scoring, CrawlIQ returns the highest-ranked pages and generates snippets with matched-term highlighting. That makes the search results easier to trust because the user can see why each page matched instead of only seeing a title, URL, and a random number labeled score.
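Snippet generation can be sketched as "find the first match, cut a window around it, wrap the matches." The function name, the 80-character window, and the `<mark>` wrapper below are assumptions, not CrawlIQ's actual snippet logic, which may choose windows differently.

```python
import re


def make_snippet(text: str, query_terms: list[str], width: int = 80) -> str:
    """Return a window of text around the first match, with matches marked."""
    if not query_terms:
        return text[:width]
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in query_terms) + r")\b",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return text[:width]  # no literal match: fall back to the page start
    start = max(0, match.start() - width // 2)
    window = text[start:start + width]
    # Highlight every whole-word match that falls inside the window.
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", window)
```

For example, `make_snippet("Workers pull jobs from a Redis queue.", ["redis"])` keeps the original casing of the page text while still matching the lowercased query term.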

Adding a graph

After the basic crawler and search engine worked, I started noticing that the crawled data had more structure than just “pages containing words.”

Pages are connected.

A page can link to another page. Two pages can share a parent URL path. Two documents can discuss similar concepts. One page can be central because many other pages point to it. Some pages can be near-duplicates.

The graph layer turns pages into nodes and relationships into typed edges.

Once the graph existed, I added an optional graph-enhanced search mode.

The default search is still BM25. That matters because BM25 is a strong baseline and easy to reason about.

When graph_enhanced=true, CrawlIQ starts with a BM25 candidate pool, expands one hop into page_graph_edges, then reranks. A page connected to other strong hits can move up. A near-duplicate of something already ranked can move down. High PageRank inside the crawl is a small extra signal.
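For reference, PageRank over a crawl's link graph fits in a few lines of power iteration. This is a textbook sketch, not CrawlIQ's implementation: the function name, the damping factor of 0.85, and the fixed 50 iterations are assumptions.

```python
def pagerank(links: dict[int, list[int]], damping: float = 0.85,
             iterations: int = 50) -> dict[int, float]:
    """Power-iteration PageRank. links maps page_id -> outgoing page_ids."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src, targets in links.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

Because the ranks always sum to 1, they are easy to rescale before blending with other signals.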

Graph edges are built after the crawl, not inside the worker. The crawler stays focused on fetching, parsing, storing, and indexing pages. Then I can run graph-building routes like POST /crawl-jobs/{id}/graph/link-edges on the pages that already exist.

That made the graph layer easier to experiment with. I could add new signals without recrawling the site.

The main edge types are:

link - one page links to another
url_hierarchy - pages are related by URL structure
content_similarity - pages share important indexed terms
near_duplicate - pages are exact or near copies
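Two of these edge types are simple enough to sketch from stored page metadata alone. The `Edge` dataclass and both function names are hypothetical. The sketch also assumes that url_hierarchy means a parent-path to child-path edge, and it only catches exact duplicates by comparing content hashes; real near-duplicate detection needs a fuzzier comparison.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Edge:
    source: int
    target: int
    edge_type: str


def url_hierarchy_edges(pages: dict[int, str]) -> list[Edge]:
    """Connect each page to the crawled page at its parent URL path, if any."""
    by_path = {urlsplit(url).path.rstrip("/") or "/": pid for pid, url in pages.items()}
    edges = []
    for pid, url in pages.items():
        path = urlsplit(url).path.rstrip("/")
        if not path:
            continue  # the site root has no parent
        parent_path = path.rsplit("/", 1)[0] or "/"
        parent_id = by_path.get(parent_path)
        if parent_id is not None and parent_id != pid:
            edges.append(Edge(parent_id, pid, "url_hierarchy"))
    return edges


def exact_duplicate_edges(content_hashes: dict[int, str]) -> list[Edge]:
    """Connect pages whose content hashes are identical."""
    groups = defaultdict(list)
    for pid, content_hash in content_hashes.items():
        groups[content_hash].append(pid)
    return [
        Edge(a, b, "near_duplicate")
        for ids in groups.values()
        for a, b in combinations(sorted(ids), 2)
    ]
```

Both functions read only from data the crawler already stored, which is what makes it possible to build or rebuild edges without recrawling.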

BM25, graph rerank, and embeddings

CrawlIQ has two ranking modes today.

BM25 is the default. It searches the inverted index and works best when the query words appear directly on the page. It's fast, simple, and easy to explain. The downside is that it only understands tokens. It doesn't know that “login” and “authentication” may be related unless the text connects them.

Graph rerank is optional with graph_enhanced=true. It starts with BM25 results, pulls in nearby graph pages, then reranks the candidate pool.

For example, if several strong BM25 hits belong to the same documentation cluster, graph reranking can pull in a linked overview page that may not contain every exact query term but is central to the topic. Conversely, near-duplicate pages are penalized so the top results are less repetitive.

The scoring blend is:

\text{final}(d) = 1.0 \cdot \text{bm25}_{\text{norm}} + 0.12 \cdot \text{PR}_{\text{norm}} + 0.18 \cdot \text{nbr}_{\text{norm}} - 0.25 \cdot \text{dup}_{\text{norm}}

BM25 is still the main signal. PageRank gives a small boost to central pages. Neighbor boost helps pages connected to strong BM25 hits. Duplicate penalty pushes down pages that look like copies.
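As code, the blend is a weighted sum over signals that have already been normalized. The `Signals` dataclass and function names below are assumptions, and the sketch takes each signal as already scaled to [0, 1]; how CrawlIQ normalizes them is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Per-page signals, each assumed to be pre-normalized to [0, 1]."""
    bm25: float
    pagerank: float = 0.0
    neighbor: float = 0.0
    duplicate: float = 0.0


def final_score(s: Signals) -> float:
    """Weights taken from the blend formula above."""
    return 1.0 * s.bm25 + 0.12 * s.pagerank + 0.18 * s.neighbor - 0.25 * s.duplicate


def rerank(candidates: dict[int, Signals]) -> list[tuple[int, float]]:
    """Return (page_id, score) pairs, highest score first."""
    scored = [(page_id, final_score(s)) for page_id, s in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


candidates = {
    1: Signals(bm25=1.0),
    2: Signals(bm25=0.95, duplicate=1.0),             # near-copy of a better hit
    3: Signals(bm25=0.9, pagerank=1.0, neighbor=1.0),  # central, well connected
}
ranking = rerank(candidates)
```

In this example page 3 overtakes page 1 despite a lower BM25 score, and the duplicate drops to the bottom, which is the behavior the weights are meant to produce. It also shows why the graph weights need to stay small: the maximum graph boost is 0.30, so a page cannot climb on graph signals alone without a reasonable lexical score.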

I intentionally kept the first version lexical and graph-based so the ranking path stayed inspectable before introducing vector search. Embeddings would help with semantic matches like “auth” and “login,” but they also add chunking, vector storage, model dependencies, and harder-to-debug ranking behavior.

What I would add next

The next step is an evaluation harness: a fixed query set, expected relevant pages, and retrieval metrics like Precision@5, Precision@10, MRR, duplicate rate, and zero-result rate.
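A first version of that harness needs very little code. The function names and the shape of a "run" below are assumptions. Each run pairs a ranked result list with the set of pages judged relevant for that query.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for page in ranked[:k] if page in relevant) / k


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked_results, relevant_pages) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def zero_result_rate(runs: list[tuple[list[str], set[str]]]) -> float:
    """Share of queries that returned no results at all."""
    if not runs:
        return 0.0
    return sum(1 for ranked, _ in runs if not ranked) / len(runs)
```

Running the same fixed query set through plain BM25 and through graph reranking, then comparing these numbers side by side, is what would turn "graph rerank feels better" into something measurable.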

Only after measuring BM25 against graph reranking would I add more complex retrieval methods:

hybrid BM25 + embeddings
cross-encoder reranking on top results
learned ranking weights from query logs
incremental index updates without full recrawls

Reliability and testing

Some of the most valuable parts of CrawlIQ were not the visible features.

For example, long-running background jobs create weird states. What if a worker dies while a job is marked running? What if a job is marked queued in Postgres but never actually exists in Redis?

CrawlIQ has worker startup recovery logic for that:

stale running jobs can be marked failed
queued jobs without an RQ job can be re-enqueued
a Redis lock prevents multiple workers from running recovery at the same time
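The decision logic behind the first two rules can be sketched without touching Redis or Postgres. Everything named here is hypothetical: the `Job` fields, the `plan_recovery` function, and the ten-minute staleness threshold. In the real worker, this would run while holding the Redis lock, and the returned IDs would drive database updates and RQ enqueues.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    id: int
    status: str          # "queued", "running", "completed", or "failed"
    updated_at: datetime


def plan_recovery(
    jobs: list[Job],
    live_queue_ids: set[int],
    now: datetime,
    stale_after: timedelta = timedelta(minutes=10),  # assumed threshold
) -> tuple[list[int], list[int]]:
    """Return (job ids to mark failed, job ids to re-enqueue)."""
    to_fail, to_requeue = [], []
    for job in jobs:
        if job.status == "running" and now - job.updated_at > stale_after:
            # No progress for too long: the worker most likely died.
            to_fail.append(job.id)
        elif job.status == "queued" and job.id not in live_queue_ids:
            # The database says queued, but the queue has no matching job.
            to_requeue.append(job.id)
    return to_fail, to_requeue
```

Keeping the decision separate from the side effects makes it easy to unit test the weird states without spinning up a queue.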

These details are easy to skip, but they forced me to think beyond the happy path and consider the edge cases that show up in more production-like environments.

I also added a pytest suite covering the API, crawler behavior, BM25 search, graph edge generation, graph APIs, graph reranking, robots behavior, and URL normalization.

Some tests mock HTTP responses. Others run as integration tests against a real Postgres database. The whole suite also runs in CI on GitHub Actions.

Final thoughts

Before CrawlIQ, I understood search at a high level, but mostly as a hand-wave: crawl some pages, index some words, rank the results. Building it made the details a lot less abstract.

The useful part was not just BM25 or the graph layer. It was seeing how many small pieces have to work together: the crawler needs limits, the worker needs recovery, the database needs durable progress, the index needs fast lookup, and the UI needs to show more than “something happened.”

It also made me appreciate how much of search is tradeoffs. BM25 is simple and explainable, but limited to tokens. Graph reranking can help surface related pages, but it needs guardrails. Embeddings would make semantic search better, but they would also add another layer of complexity.

CrawlIQ is not a production search engine, but it did what I wanted it to do. It made search feel less like magic and more like a system I could take apart, build, and reason about.