SolvedSeek Documentation

Everything you need to know about SolvedSeek, our crawler, and how your site fits into our index.

What is SolvedSeek?

SolvedSeek is an independent, specialised search engine and directory for Shopify stores. We discover storefronts from the public Common Crawl dataset, build our own index of their homepages, and rank results with our own algorithms. We do not license results from Google, Bing, or any other provider.

The search engine is designed to be transparent about how it works. This page explains our crawling, indexing, and ranking in full detail so that users and webmasters can understand exactly what happens when a query is submitted or a page is crawled.

Why we built this

The modern web search landscape is dominated by a small number of companies. Most "alternative" search engines are skins on top of Bing or Google APIs. They cannot control what gets indexed, how results are ranked, or what gets filtered. They are, at best, a different interface to someone else's index.

We wanted something different: a search engine that owns its entire stack. One where the ranking algorithm is not a black box optimised for ad revenue. One where privacy is not a marketing claim but a structural guarantee, because there is no tracking infrastructure to begin with.

SolvedSeek isn't a general web search engine — it's a focused directory of Shopify stores. We index storefront homepages discovered from public crawl data, one clean entry per shop, ranked on relevance and authority rather than advertising spend.

How search works

When you type a query, the following happens:

  1. Cache check: We check if this exact query has been answered recently. Cached results are served instantly.
  2. FULLTEXT search: Your query is matched against page titles, descriptions, and body text using MySQL FULLTEXT indexing in natural language mode. Scores are LOG-damped (BM25-style) to prevent keyword-stuffed pages from dominating.
  3. Embedding candidates: In parallel, your query is converted to a vector embedding and compared against thousands of page embeddings to find semantically similar pages that keyword matching might miss.
  4. Scoring: Each matching store receives a text-relevance score (see Ranking signals).
  5. Twiddler adjustments: Admin-defined ranking rules can pin, promote, demote, block, or otherwise adjust specific results.
  6. Re-ranking: Candidates are re-ordered by a transparent blend of text relevance (50%), semantic similarity (30%), domain authority (15%), and completeness (5%).
  7. Diversity & filtering: Results are capped at 2 per root domain (subdomains grouped together), language-filtered to match the query, and paginated.
  8. Results returned: The final ranked list is returned with titles, URLs, description snippets, and entity topic labels.
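
As an illustration of step 7, the per-domain cap can be sketched in a few lines of Python. This is a minimal sketch, not the production code: the function names are hypothetical, and the root-domain rule (keep the last two labels of the host) is a simplifying assumption. A real implementation would more likely consult the Public Suffix List.

```python
from collections import defaultdict


def root_domain(host: str) -> str:
    """Naive root-domain guess: keep the last two labels (assumption).

    This mis-groups hosts under multi-part suffixes such as .co.uk;
    a Public Suffix List lookup would handle those correctly.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def cap_per_domain(ranked_hosts: list[str], limit: int = 2) -> list[str]:
    """Keep at most `limit` results per root domain, preserving rank order."""
    seen: dict[str, int] = defaultdict(int)
    kept = []
    for host in ranked_hosts:
        root = root_domain(host)
        if seen[root] < limit:
            seen[root] += 1
            kept.append(host)
    return kept
```

Because the input is already ranked, the cap keeps the two best entries for each root domain and drops the rest, so subdomains of one store cannot crowd out other stores.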

Search operators

SolvedSeek supports search operators that give you fine-grained control over your results. You can combine multiple operators in a single query.

Operator | Example | What it does
site: | site:github.com react | Only show results from a specific domain (includes subdomains).
"quotes" | "search engine" | Match the exact phrase. Only pages containing those words in that order will appear.
-word | python -snake | Exclude results containing a specific word.
intitle: | intitle:tutorial javascript | Only show results where the word appears in the page title.
inurl: | inurl:blog marketing | Only show results where the word appears in the URL path.
filetype: | filetype:pdf machine learning | Only show results with a specific file extension in the URL.
after: | after:2026-01-01 news | Only show pages crawled on or after a specific date (YYYY-MM-DD).
before: | before:2026-01-01 archive | Only show pages crawled on or before a specific date (YYYY-MM-DD).
trust: | trust:80 banking | Only show results from domains with a trust score at or above the specified value (0-100).
lang: | lang:de berlin | Override language detection. Show results in a specific language (ISO 639-1 code).

Combining operators

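You can use multiple operators together. For example, a query such as site:github.com intitle:tutorial -python would return only results from github.com whose title contains "tutorial" and that do not contain the word "python".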

If you use an operator by itself (e.g., just site:wikipedia.org), results are ordered by domain authority (Ahrefs Domain Rating) instead of keyword relevance.
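
To make the operator syntax concrete, here is a minimal Python sketch of how a query string could be split into operators, exact phrases, exclusions, and plain terms. It is an illustration only: the function names, the returned dictionary layout, and the tokenising regular expression are assumptions, not SolvedSeek's actual parser.

```python
import re

# Operator names taken from the table above.
OPERATORS = {"site", "intitle", "inurl", "filetype", "after", "before", "trust", "lang"}

# Either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query: str) -> dict:
    """Split a query into operators, phrases, excluded words and plain terms."""
    parsed = {"operators": {}, "phrases": [], "excluded": [], "terms": []}
    for phrase, token in TOKEN.findall(query):
        if phrase:
            parsed["phrases"].append(phrase)
            continue
        name, sep, value = token.partition(":")
        if sep and value and name.lower() in OPERATORS:
            parsed["operators"][name.lower()] = value
        elif token.startswith("-") and len(token) > 1:
            parsed["excluded"].append(token[1:])
        else:
            parsed["terms"].append(token)
    return parsed


def is_operator_only(parsed: dict) -> bool:
    """True when the query has operators but no search terms or phrases."""
    return bool(parsed["operators"]) and not parsed["terms"] and not parsed["phrases"]
```

A caller could use is_operator_only to decide whether to order results by Domain Rating rather than keyword relevance, as described above.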

Ranking signals

Ranking is a two-stage process. A keyword (FULLTEXT) search first retrieves the store homepages most relevant to your query; those candidates are then re-ordered by a single transparent score whose weights add up to 100%. There are no hidden penalties and ranking cannot be bought.

Signal | Weight | How it works
Text relevance | 50% | FULLTEXT match of your query against the store’s title and homepage text, with the title weighted higher. LOG-damped (BM25-style) so keyword stuffing has diminishing returns.
Semantic similarity | 30% | Cosine similarity between your query’s embedding and the store’s homepage embedding, so results match on meaning, not just exact keywords.
Domain authority | 15% | The store’s Ahrefs Domain Rating (0–100), normalised. A light boost: a highly relevant small store still outranks an irrelevant high-authority one. Domain Rating by Ahrefs.
Listing completeness | 5% | A small bonus for stores with a full, useful description.

Relevance to your query (text plus semantic similarity) therefore makes up 80% of the score; authority and completeness are light tie-breakers. If you browse with an operator only (no search terms), results are ordered by Domain Rating instead. Admin-defined twiddlers (pin, promote, demote, block, domain diversity) can adjust specific results for spam control and curation, and are applied transparently on top of the score.
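
The weighted blend can be written out as a short Python sketch. The weights come from the table above; everything else is an assumption made for illustration: the function name, the convention that text, semantic, and completeness inputs are already scaled to the range 0 to 1, and the choice to normalise Domain Rating by dividing by 100.

```python
from typing import Optional

# Weights from the ranking table; they sum to 1.0.
WEIGHTS = {"text": 0.50, "semantic": 0.30, "authority": 0.15, "completeness": 0.05}


def blended_score(
    text: float,
    semantic: float,
    domain_rating: Optional[float],
    completeness: float,
) -> float:
    """Combine the four signals into one score in the range 0..1.

    `text`, `semantic` and `completeness` are assumed to be pre-scaled to 0..1.
    `domain_rating` is the 0-100 rating, or None when the store has no rating,
    in which case it simply contributes nothing (no penalty is applied).
    """
    authority = (domain_rating or 0.0) / 100.0
    return (
        WEIGHTS["text"] * text
        + WEIGHTS["semantic"] * semantic
        + WEIGHTS["authority"] * authority
        + WEIGHTS["completeness"] * completeness
    )
```

Under these assumptions, a highly relevant store with no rating (for instance text 0.9, semantic 0.8) still scores well above a barely relevant store with a rating of 95, which matches the intent that authority is only a light boost.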

SolvedSeek blends keyword matching with AI semantic understanding. Every store’s homepage has a vector embedding generated by a local model (all-MiniLM-L6-v2, 384-dimension vectors). At search time your query is embedded the same way and compared to the candidate stores’ embeddings; that similarity contributes 30% of the final score, so a search for “eco-friendly trainers” can surface a sustainable footwear brand even when the exact words don’t appear. All embedding generation runs on our own hardware — nothing is sent to external AI services. Search spans all languages by default.
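
For readers unfamiliar with the similarity measure, the following is a plain-Python sketch of cosine similarity between two embedding vectors. It is illustrative only: the function name is hypothetical, and a production system would more likely use a vectorised numeric library. The vectors would be 384-dimensional for the model named above, but the function works for any two vectors of equal length.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero length, to avoid dividing by zero.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, and unrelated (orthogonal) vectors score 0, regardless of vector length.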

Domain authority

Rather than inventing our own trust number, we use an external, independent measure: each store’s Ahrefs Domain Rating (0–100), a widely-recognised gauge of backlink authority. It contributes a modest 15% to ranking and is displayed on every listing. There is no “unknown domain” penalty — a new store with no rating simply doesn’t receive the authority boost; it is never demoted. Domain Rating by Ahrefs.

Our crawler

User-Agent: SolvedSeekBot/1.0
Full string: SolvedSeekBot/1.0 (+https://solvedseek.com/docs)
Respects: robots.txt, meta robots, X-Robots-Tag, canonical URLs, crawl-delay
Concurrency: 10 URLs per tick (parallel), one page per domain per tick, 3 crawl workers
Timeout: 15 seconds per page request

SolvedSeekBot is a Node.js-based crawler that discovers pages through link following, XML sitemaps, and manually seeded URLs. It uses a priority-based queue: seed URLs get the highest priority, followed by homepage children, sitemap URLs, and then deeper links.

The crawler processes pages in a breadth-first pattern, prioritising homepage and top-level pages before diving deeper into a site. New external domains discovered through links are automatically added to the index with an initial 20-page allowance.

Each crawled page goes through the following pipeline:

  1. Check domain page limits and robots.txt permissions
  2. Fetch the page via HTTPS (with HTTP fallback)
  3. Detect CDN challenge/bot protection pages (Cloudflare, Sucuri, Akamai, etc.) and skip them
  4. Check X-Robots-Tag headers and meta robots tags for noindex/nofollow directives
  5. Skip pages with empty titles (low-quality/non-content pages)
  6. Resolve canonical URLs to avoid duplicate content
  7. Compute content hash for deduplication
  8. Calculate quality score (including gambling/pharma spam detection) and store the page in the index
  9. Detect page language using trigram analysis (franc-min) and store the ISO 639-1 code
  10. Extract the primary named entity (company, place, person) using NLP (compromise)
  11. Generate a vector embedding for semantic search (non-blocking)
  12. Discover and queue internal links, external links, and sitemap URLs
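
Step 7 of the pipeline (content hashing for deduplication) can be sketched as follows. This is a minimal illustration under stated assumptions: the hash algorithm (SHA-256), the normalisation (collapsing whitespace and lowercasing), and the class and function names are all hypothetical choices, since the algorithm the crawler actually uses is not specified here.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Hash page text after light normalisation (assumed: whitespace + case)."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers which URL first produced each content hash."""

    def __init__(self) -> None:
        self._first_url_by_hash: dict[str, str] = {}

    def is_duplicate(self, url: str, text: str) -> bool:
        """True if the same content was already seen under a different URL."""
        digest = content_hash(text)
        first_url = self._first_url_by_hash.setdefault(digest, url)
        return first_url != url
```

Re-crawling the same URL with unchanged content is not flagged, because the stored URL matches; only a second URL serving identical content counts as a duplicate.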

Robots.txt and meta tags

SolvedSeekBot fully respects robots.txt. We support Disallow, Allow, and Crawl-Delay directives. When both Allow and Disallow match a path, the most specific (longest) rule wins. If they are the same length, Allow takes precedence.
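
The precedence rule can be expressed as a short Python sketch. It assumes the rules for the matching user-agent group have already been parsed into (kind, path prefix) pairs, and it handles plain prefix matching only; wildcard (*) and end-of-path ($) patterns are deliberately left out. The function name and rule representation are hypothetical.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence to Allow/Disallow rules.

    `rules` is a list of ("allow" | "disallow", path_prefix) pairs.
    The longest matching prefix wins; on a tie, Allow wins.
    A path that matches no rule is allowed.
    """
    best_length = -1
    allowed = True
    for kind, prefix in rules:
        # An empty prefix (e.g. "Disallow:") matches nothing.
        if not prefix or not path.startswith(prefix):
            continue
        length = len(prefix)
        if length > best_length or (length == best_length and kind == "allow"):
            best_length = length
            allowed = kind == "allow"
    return allowed
```

For example, with Disallow: /private/ and Allow: /private/public/, the path /private/public/page is allowed because the Allow rule is longer, while /private/secret stays blocked.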

We check for our specific user agent first (SolvedSeekBot), then fall back to the wildcard (*) rules.

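We also respect meta robots tags, X-Robots-Tag response headers (including noindex and nofollow directives), and canonical URLs.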

To block SolvedSeekBot from your site entirely, add this to your robots.txt:

User-agent: SolvedSeekBot
Disallow: /

XML Sitemaps

SolvedSeekBot reads Sitemap: directives from your robots.txt file. When crawling a domain for the first time (via a seed URL), the crawler checks for declared sitemaps and queues URLs found in them.

We support both standard sitemap XML files and sitemap index files. For sitemap indexes, we process the first child sitemap. Each sitemap is limited to 200 URLs to keep the queue manageable.
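
The sitemap handling described above might look roughly like this in Python. It is a sketch under assumptions: the function name and return shape are hypothetical, and it uses the standard library XML parser, which should not be pointed at untrusted input without further hardening. The 200-URL limit and the "first child only" rule for sitemap indexes come from the text above.

```python
import xml.etree.ElementTree as ET

MAX_URLS_PER_SITEMAP = 200
# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index", [first child sitemap]) or ("urlset", [page URLs]).

    Page URLs are truncated to MAX_URLS_PER_SITEMAP entries.
    """
    root = ET.fromstring(xml_text)
    locations = [el.text.strip() for el in root.iter(f"{NS}loc") if el.text]
    if root.tag == f"{NS}sitemapindex":
        # Only the first child sitemap is followed.
        return "index", locations[:1]
    return "urlset", locations[:MAX_URLS_PER_SITEMAP]
```

When the result kind is "index", the caller would fetch the single returned sitemap URL and pass its contents through the same function to obtain page URLs.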

Sitemap URLs are queued at a medium priority, below homepage children but above deep links discovered through crawling.

Canonical URLs

SolvedSeekBot respects the <link rel="canonical"> tag. If a page declares a canonical URL that differs from the fetched URL but points to the same domain, we store the page under the canonical URL instead. This prevents duplicate entries for pages with query parameters, session IDs, or tracking parameters.

If the canonical URL points to a different domain, we skip indexing the page (it is a cross-domain canonical, indicating the content belongs to another site), but we still follow links on the page to discover new content.
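
The two canonical cases can be summarised in a small decision function. This Python sketch is illustrative: the function name and the returned action labels are hypothetical, and "same domain" is approximated by an exact hostname comparison, which is an assumption (the crawler may treat subdomains differently).

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def canonical_decision(
    fetched_url: str, canonical_href: Optional[str]
) -> tuple[str, Optional[str]]:
    """Decide where (or whether) to store a page given its canonical link.

    Returns ("index", url_to_store_under) or ("skip_index_follow_links", None).
    """
    if not canonical_href:
        return "index", fetched_url
    # Canonical links may be relative, so resolve against the fetched URL.
    canonical = urljoin(fetched_url, canonical_href.strip())
    if canonical == fetched_url:
        return "index", fetched_url
    same_host = urlsplit(canonical).hostname == urlsplit(fetched_url).hostname
    if same_host:
        return "index", canonical
    # Cross-domain canonical: the content belongs to another site.
    return "skip_index_follow_links", None
```

A page fetched as https://shop.example/p?utm=1 that declares https://shop.example/p as canonical would be stored under the clean URL, while a canonical pointing at another host results in no index entry but links are still followed.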

What we index

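For each page, we extract and store the title, description, and body text used for matching, together with the detected language code, the primary named entity, a content hash, a quality score, and a vector embedding.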

We only index HTML pages that return HTTP 200. Non-HTML content types (PDFs, images, JSON, etc.) are skipped. Pages behind CDN bot protection (Cloudflare challenges, CAPTCHA walls) are also skipped, as we cannot access their real content.

Privacy

SolvedSeek does not track users. There are no cookies, no analytics scripts, no fingerprinting, and no IP address logging. The search engine has no advertising, so there is no incentive to build user profiles.

Search queries may be cached temporarily (configurable, typically 5 minutes) to improve performance. Cached queries are stored as SHA-256 hashes and are automatically purged when they expire. No query is ever linked to a user, session, or IP address.
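
A hashed, expiring query cache of this kind can be sketched in Python as follows. This is an in-memory illustration only: the class name, the method names, and the injectable clock are hypothetical, and the real cache may live in the database rather than in process memory. The 5-minute default and the SHA-256 keys follow the description above.

```python
import hashlib
import time
from typing import Callable, Optional


class QueryCache:
    """Stores results under the SHA-256 hash of the exact query string."""

    def __init__(self, ttl_seconds: float = 300, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, list]] = {}

    @staticmethod
    def key(query: str) -> str:
        """Hash the query so the raw text is never used as the cache key."""
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def put(self, query: str, results: list) -> None:
        self._entries[self.key(query)] = (self._clock() + self._ttl, results)

    def get(self, query: str) -> Optional[list]:
        """Return cached results, or None if missing or expired (and purge)."""
        cache_key = self.key(query)
        entry = self._entries.get(cache_key)
        if entry is None:
            return None
        expires_at, results = entry
        if self._clock() >= expires_at:
            del self._entries[cache_key]
            return None
        return results
```

Nothing about the requester is part of the key or the stored value, so a cache entry cannot be tied back to a user, session, or IP address.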

Query embeddings (used for semantic search) are cached in the database to avoid regenerating them. These contain only the mathematical vector representation of the query text, with no user-identifying information.

For full details, see our Privacy Policy.

For webmasters

Getting your site indexed

The easiest way to get your site into SolvedSeek is to submit it directly. Enter your homepage URL on the submit page and we will crawl it during the next crawl cycle. You can also wait for natural discovery: SolvedSeekBot finds new sites through links on already-indexed pages. New domains start with an initial 20-page allowance, which can expand to 500 pages as trust is established.

Helping SolvedSeekBot crawl your site effectively

Removing your site from the index

Add a Disallow rule for SolvedSeekBot in your robots.txt (see example above). Alternatively, add <meta name="robots" content="noindex"> to individual pages you want excluded. Changes will take effect the next time the crawler visits your site.

Technical details

Search engine: PHP 8.2+ with MySQL FULLTEXT indexing
Crawler: Node.js with native fetch API
HTML parsing: cheerio (server-side DOM)
Embedding model: Xenova/all-MiniLM-L6-v2 (384 dimensions, runs locally via ONNX)
Language detection: franc-min (trigram-based, 82 languages, ~150KB)
Entity extraction: compromise (NLP, extracts organizations, places, people)
Database: MySQL 8.x / MariaDB with InnoDB
Concurrency: 10 simultaneous page fetches, 1 page per domain per crawl tick, 3 crawl workers
Request timeout: 15 seconds per page
robots.txt caching: 24 hours (1 hour for server errors)
CDN detection: Cloudflare, Sucuri, DDoS-Guard, Akamai, StackPath, Imperva/Incapsula
Frameworks: None. Custom router, template engine, and database abstraction. No Laravel, Symfony, Express, or similar.

Crawl priority system

URLs in the crawl queue are processed by priority. Higher priority URLs are crawled first:

Priority | Source
10 | Manually seeded URLs and new domain homepages
8 | Links found on seed/homepage pages (top-level children)
5 | URLs discovered from XML sitemaps
3 | Links found on deeper pages
2 | Links to already-known approved domains
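
The priority table, combined with the limits of 10 URLs per tick and one page per domain per tick, can be modelled with a heap. The Python sketch below is an approximation for illustration: the class name, the source labels, and the use of hostname as the "domain" are assumptions, and the real crawler is written in Node.js.

```python
import heapq
from itertools import count
from urllib.parse import urlsplit

# Priorities from the table above; the label names are hypothetical.
PRIORITY = {
    "seed": 10,
    "homepage_child": 8,
    "sitemap": 5,
    "deep_link": 3,
    "known_domain_link": 2,
}


class CrawlQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._order = count()  # tie-breaker: earlier insertions first

    def add(self, url: str, source: str) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-PRIORITY[source], next(self._order), url))

    def next_batch(self, size: int = 10) -> list[str]:
        """Pop up to `size` URLs, highest priority first, one per domain."""
        batch: list[str] = []
        deferred = []
        domains: set = set()
        while self._heap and len(batch) < size:
            item = heapq.heappop(self._heap)
            domain = urlsplit(item[2]).hostname
            if domain in domains:
                deferred.append(item)  # wait for a later tick
            else:
                domains.add(domain)
                batch.append(item[2])
        for item in deferred:
            heapq.heappush(self._heap, item)
        return batch
```

URLs skipped because their domain is already in the current batch keep their original priority and insertion order, so they are picked up on the next tick rather than lost.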

Questions?

Reach us at hello@solvedseek.com