SolvedSeek Documentation

Everything you need to know about SolvedSeek, our crawler, and how your site fits into our index.

What is SolvedSeek?

SolvedSeek is an independent, specialised search engine and directory for Shopify stores. We discover storefronts from the public Common Crawl dataset, build our own index of their homepages, and rank results with our own algorithms. We do not license results from Google, Bing, or any other provider.

The search engine is designed to be transparent about how it works. This page explains our crawling, indexing, and ranking in full detail so that users and webmasters can understand exactly what happens when a query is submitted or a page is crawled.

Why we built this

The modern web search landscape is dominated by a small number of companies. Most "alternative" search engines are skins on top of Bing or Google APIs. They cannot control what gets indexed, how results are ranked, or what gets filtered. They are, at best, a different interface to someone else's index.

We wanted something different: a search engine that owns its entire stack. One where the ranking algorithm is not a black box optimised for ad revenue. One where privacy is not a marketing claim but a structural guarantee, because there is no tracking infrastructure to begin with.

SolvedSeek isn't a general web search engine — it's a focused directory of Shopify stores. We index storefront homepages discovered from public crawl data, one clean entry per shop, ranked on relevance and authority rather than advertising spend.

How search works

When you type a query, the following happens:

Cache check: We check if this exact query has been answered recently. Cached results are served instantly.
FULLTEXT search: Your query is matched against page titles, descriptions, and body text using MySQL FULLTEXT indexing in natural language mode. Scores are LOG-damped (BM25-style) to prevent keyword-stuffed pages from dominating.
Embedding candidates: In parallel, your query is converted to a vector embedding and compared against thousands of page embeddings to find semantically similar pages that keyword matching might miss.
Scoring: Each matching store receives a text-relevance score (detailed below).
Twiddler adjustments: Admin-defined ranking rules can pin, promote, demote, block, or otherwise adjust specific results.
Re-ranking: Candidates are re-ordered by a transparent blend of text relevance (50%), semantic similarity (30%), domain authority (15%), and completeness (5%).
Diversity & filtering: Results are capped at 2 per root domain (subdomains grouped together), language-filtered to match the query, and paginated.
Results returned: The final ranked list is returned with titles, URLs, description snippets, and entity topic labels.
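
The per-domain cap in the diversity step can be sketched as follows. This is a minimal illustration in Python, not SolvedSeek's actual code: `root_domain` and `cap_per_domain` are hypothetical names, and the root domain is guessed naively from the last two host labels, which a real implementation would likely replace with a public-suffix lookup.

```python
from collections import defaultdict


def root_domain(host: str) -> str:
    """Naive root-domain guess: the last two labels of the host name (assumption)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def cap_per_domain(ranked_hosts, limit=2):
    """Keep at most `limit` results per root domain, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for host in ranked_hosts:
        root = root_domain(host)
        if seen[root] < limit:
            seen[root] += 1
            kept.append(host)
    return kept


# Subdomains are grouped under the same root domain.
results = ["shop.example.com", "example.com", "blog.example.com", "other.store"]
capped = cap_per_domain(results)
```

Here the third `example.com` host is dropped because two results from that root domain are already in the list.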

Search operators

SolvedSeek supports search operators that give you fine-grained control over your results. You can combine multiple operators in a single query.

Operator	Example	What it does
`site:`	`site:github.com react`	Only show results from a specific domain (includes subdomains).
`"quotes"`	`"search engine"`	Match the exact phrase. Only pages containing those words in that order will appear.
`-word`	`python -snake`	Exclude results containing a specific word.
`intitle:`	`intitle:tutorial javascript`	Only show results where the word appears in the page title.
`inurl:`	`inurl:blog marketing`	Only show results where the word appears in the URL path.
`filetype:`	`filetype:pdf machine learning`	Only show results with a specific file extension in the URL.
`after:`	`after:2026-01-01 news`	Only show pages crawled on or after a specific date (YYYY-MM-DD).
`before:`	`before:2026-01-01 archive`	Only show pages crawled on or before a specific date (YYYY-MM-DD).
`trust:`	`trust:80 banking`	Only show results from domains with a trust score at or above the specified value (0-100).
`lang:`	`lang:de berlin`	Override language detection. Show results in a specific language (ISO 639-1 code).

Combining operators

You can use multiple operators together. For example:

`site:github.com intitle:readme "getting started"` finds README pages on GitHub containing the phrase "getting started".
`python tutorial -django after:2026-01-01` finds recent Python tutorials that are not about Django.
`trust:80 inurl:blog machine learning` finds machine learning content on high-trust blogs.

If you use an operator by itself (e.g., just site:wikipedia.org), results are ordered by domain authority (Ahrefs Domain Rating) instead of keyword relevance.
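
As a rough picture of how a combined query might be split apart, here is a small Python sketch. It is an assumption-laden illustration, not the production parser: `parse_query`, `TOKEN_RE`, and the returned dictionary layout are hypothetical, and a repeated operator simply overwrites the earlier value.

```python
import re

# Operator names taken from the table above.
OPERATORS = {"site", "intitle", "inurl", "filetype", "after", "before", "trust", "lang"}

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')


def parse_query(query: str) -> dict:
    """Split a query into operators, exact phrases, excluded words and plain terms."""
    parsed = {"operators": {}, "phrases": [], "excluded": [], "terms": []}
    for match in TOKEN_RE.finditer(query):
        phrase, token = match.group(1), match.group(2)
        if phrase is not None:
            parsed["phrases"].append(phrase)
            continue
        name, sep, value = token.partition(":")
        if sep and value and name.lower() in OPERATORS:
            parsed["operators"][name.lower()] = value
        elif token.startswith("-") and len(token) > 1:
            parsed["excluded"].append(token[1:])
        else:
            parsed["terms"].append(token)
    # With no terms or phrases, the query is operator-only.
    parsed["operator_only"] = bool(parsed["operators"]) and not (
        parsed["terms"] or parsed["phrases"]
    )
    return parsed


example = parse_query('site:github.com intitle:readme "getting started"')
```

The `operator_only` flag marks the case described above, where results would be ordered by Domain Rating rather than keyword relevance.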

Ranking signals

Ranking is a two-stage process. A keyword (FULLTEXT) search first retrieves the store homepages most relevant to your query; those candidates are then re-ordered by a single transparent score whose weights add up to 100%. There are no hidden penalties and ranking cannot be bought.

Signal	Weight	How it works
Text relevance	50%	FULLTEXT match of your query against the store’s title and homepage text, title weighted higher. LOG-damped (BM25-style) so keyword stuffing has diminishing returns.
Semantic similarity	30%	Cosine similarity between your query’s embedding and the store’s homepage embedding — results match on meaning, not just exact keywords.
Domain authority	15%	The store’s Ahrefs Domain Rating (0–100), normalised. A light boost — a highly relevant small store still outranks an irrelevant high-authority one. Domain Rating by Ahrefs.
Listing completeness	5%	A small bonus for stores with a full, useful description.

Relevance to your query (text + semantic) therefore makes up 80% of the score; authority and completeness are light tie-breakers. If you browse with an operator only (no search terms), results are ordered by Domain Rating instead. Admin-defined twiddlers (pin, promote, demote, block, domain diversity) can adjust specific results for spam control and curation, applied transparently on top of the score.
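
The weighted blend can be written out as a short Python sketch. Only the four weights come from the table above; the damping formula (`log1p` scaled by the best candidate) and the division of Domain Rating by 100 are assumptions made for illustration, and `damp` and `blended_score` are hypothetical names.

```python
import math

# Weights from the ranking-signals table; they sum to 1.0.
WEIGHTS = {"text": 0.50, "semantic": 0.30, "authority": 0.15, "completeness": 0.05}


def damp(raw_fulltext: float, max_raw: float) -> float:
    """LOG-damp a raw FULLTEXT score into 0..1 relative to the best candidate (assumed form)."""
    if max_raw <= 0:
        return 0.0
    return math.log1p(max(raw_fulltext, 0.0)) / math.log1p(max_raw)


def blended_score(text, semantic, domain_rating, completeness):
    """text, semantic and completeness are in 0..1; domain_rating is 0..100 or None."""
    # A store with no rating gets no authority boost, but is not penalised further.
    authority = (domain_rating or 0) / 100.0
    return (
        WEIGHTS["text"] * text
        + WEIGHTS["semantic"] * semantic
        + WEIGHTS["authority"] * authority
        + WEIGHTS["completeness"] * completeness
    )


relevant_small_store = blended_score(0.9, 0.8, None, 1.0)
irrelevant_big_store = blended_score(0.1, 0.1, 100, 1.0)
```

Under these assumptions the relevant unrated store scores well above the irrelevant high-authority one, which matches the intent that authority is only a light boost.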

Semantic search

SolvedSeek blends keyword matching with AI semantic understanding. Every store’s homepage has a vector embedding generated by a local model (all-MiniLM-L6-v2, 384-dimension vectors). At search time your query is embedded the same way and compared to the candidate stores’ embeddings; that similarity contributes 30% of the final score, so a search for “eco-friendly trainers” can surface a sustainable footwear brand even when the exact words don’t appear. All embedding generation runs on our own hardware — nothing is sent to external AI services. Search spans all languages by default.
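
For readers unfamiliar with the comparison step, cosine similarity between two vectors can be computed as below. This is a plain-Python sketch with toy 2-dimension vectors standing in for the 384-dimension embeddings; `cosine_similarity` is a hypothetical helper name, and returning 0.0 for a zero vector is a choice made here, not documented behaviour.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in the range -1..1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as unrelated
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing the same way score 1.0 regardless of length, which is why the measure suits comparing a short query with a long homepage.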

Domain authority

Rather than inventing our own trust number, we use an external, independent measure: each store’s Ahrefs Domain Rating (0–100), a widely-recognised gauge of backlink authority. It contributes a modest 15% to ranking and is displayed on every listing. There is no “unknown domain” penalty — a new store with no rating simply doesn’t receive the authority boost; it is never demoted. Domain Rating by Ahrefs.

Our crawler

User-Agent	SolvedSeekBot/1.0
Full string	SolvedSeekBot/1.0 (+https://solvedseek.com/docs)
Respects	robots.txt, meta robots, X-Robots-Tag, canonical URLs, crawl-delay
Concurrency	10 URLs per tick (parallel), one page per domain per tick, 3 crawl workers
Timeout	15 seconds per page request

SolvedSeekBot is a Node.js-based crawler that discovers pages through link following, XML sitemaps, and manually seeded URLs. It uses a priority-based queue: seed URLs get the highest priority, followed by homepage children, sitemap URLs, and then deeper links.

The crawler processes pages in a breadth-first pattern, prioritising homepage and top-level pages before diving deeper into a site. New external domains discovered through links are automatically added to the index with an initial 20-page allowance.

Each crawled page goes through the following pipeline:

Check domain page limits and robots.txt permissions
Fetch the page via HTTPS (with HTTP fallback)
Detect CDN challenge/bot protection pages (Cloudflare, Sucuri, Akamai, etc.) and skip them
Check X-Robots-Tag headers and meta robots tags for noindex/nofollow directives
Skip pages with empty titles (low-quality/non-content pages)
Resolve canonical URLs to avoid duplicate content
Compute content hash for deduplication
Calculate quality score (including gambling/pharma spam detection) and store the page in the index
Detect page language using trigram analysis (franc-min) and store the ISO 639-1 code
Extract the primary named entity (company, place, person) using NLP (compromise)
Generate a vector embedding for semantic search (non-blocking)
Discover and queue internal links, external links, and sitemap URLs
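
The content-hash step in the pipeline above could look something like this Python sketch. The hash algorithm and the normalisation are not specified in this document, so SHA-256 over lowercased, whitespace-collapsed text is purely an assumption, and `content_hash` is a hypothetical name.

```python
import hashlib
import re


def content_hash(body_text: str) -> str:
    """Hash of the extracted body text, ignoring case and whitespace differences."""
    # Assumed normalisation: collapse whitespace runs, trim, lowercase.
    normalised = re.sub(r"\s+", " ", body_text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


first = content_hash("Handmade  leather bags\n")
second = content_hash("handmade leather bags")
is_duplicate = first == second
```

Two URLs whose extracted text produces the same digest can then be treated as one page.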

Robots.txt and meta tags

SolvedSeekBot fully respects robots.txt. We support Disallow, Allow, and Crawl-Delay directives. When both Allow and Disallow match a path, the most specific (longest) rule wins. If they are the same length, Allow takes precedence.

We check for our specific user agent first (SolvedSeekBot), then fall back to the wildcard (*) rules.
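
The precedence rules above can be sketched in Python as follows. This is an illustration only: it uses plain prefix matching and does not handle `*` or `$` wildcards, and `is_allowed`, `select_group`, and the `(directive, prefix)` rule format are hypothetical.

```python
def is_allowed(path, rules):
    """rules is a list of (directive, path_prefix) pairs; directive is 'allow' or 'disallow'."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        # An empty prefix (e.g. "Disallow:") matches nothing.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # Longest matching rule wins; on a tie, Allow takes precedence.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


def select_group(groups, bot="solvedseekbot"):
    """Prefer the bot-specific group, then fall back to the wildcard group."""
    lowered = {agent.lower(): rules for agent, rules in groups.items()}
    return lowered.get(bot, lowered.get("*", []))


rules = [("disallow", "/private/"), ("allow", "/private/public/")]
```

With these rules, `/private/public/page` is crawlable because the Allow rule is longer, while `/private/secret` is not.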

We also respect:

<meta name="robots" content="noindex">: Page will not be stored in the index
<meta name="robots" content="nofollow">: Links on the page will not be followed
<meta name="solvedseekbot" content="noindex">: Bot-specific directive
X-Robots-Tag: noindex: HTTP header directive
none: Equivalent to noindex, nofollow
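
One way to combine these directives is to merge every value found (generic meta tag, bot-specific meta tag, and header) into a single index/follow decision. The Python sketch below assumes exactly that merging behaviour, which this document does not spell out; `parse_robots_directives` is a hypothetical name.

```python
def parse_robots_directives(*values):
    """Merge meta robots / X-Robots-Tag values into index and follow flags."""
    tokens = set()
    for value in values:
        # Directive values are comma-separated and case-insensitive.
        tokens.update(t.strip().lower() for t in value.split(",") if t.strip())
    noindex = "noindex" in tokens or "none" in tokens
    nofollow = "nofollow" in tokens or "none" in tokens
    return {"index": not noindex, "follow": not nofollow}


decision = parse_robots_directives("index, follow", "NOFOLLOW")
```

In this sketch the most restrictive directive wins, so a single `nofollow` from any source stops link following.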

To block SolvedSeekBot from your site entirely, add this to your robots.txt:

User-agent: SolvedSeekBot
Disallow: /
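
To sanity-check the rule before deploying it, one option is Python's standard `urllib.robotparser`. This is only a local approximation: the standard-library parser has its own matching logic, which may differ in edge cases from SolvedSeekBot's longest-match rule, and `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: SolvedSeekBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The bot named in the group is blocked everywhere.
blocked = not parser.can_fetch("SolvedSeekBot/1.0", "https://example.com/products")
# Other crawlers are unaffected because no group applies to them.
others_ok = parser.can_fetch("SomeOtherBot", "https://example.com/products")
```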

XML Sitemaps

SolvedSeekBot reads Sitemap: directives from your robots.txt file. When crawling a domain for the first time (via a seed URL), the crawler checks for declared sitemaps and queues URLs found in them.

We support both standard sitemap XML files and sitemap index files. For sitemap indexes, we process only the first child sitemap. Each sitemap is limited to 200 URLs to keep the queue manageable.

Sitemap URLs are queued at a medium priority, below homepage children but above deep links discovered through crawling.
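
The sitemap handling described in this section can be approximated with Python's standard XML parser. This is a sketch under assumptions: `parse_sitemap` and its return shape are hypothetical, and fetching the child sitemap of an index is left out.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 200  # per-sitemap cap stated above


def _local(tag: str) -> str:
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str, limit: int = MAX_URLS):
    """Return ('urlset', page URLs) or ('index', [first child sitemap URL])."""
    root = ET.fromstring(xml_text)
    kind = "index" if _local(root.tag) == "sitemapindex" else "urlset"
    locs = []
    for entry in root:
        if _local(entry.tag) not in ("url", "sitemap"):
            continue
        # Only the entry's own <loc> counts, not nested extension elements.
        for child in entry:
            if _local(child.tag) == "loc" and child.text and child.text.strip():
                locs.append(child.text.strip())
                break
    if kind == "index":
        return kind, locs[:1]  # only the first child sitemap is processed
    return kind, locs[:limit]
```

A caller would queue the returned page URLs at sitemap priority, or fetch the single child sitemap and parse it the same way.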

Canonical URLs

SolvedSeekBot respects the <link rel="canonical"> tag. If a page declares a canonical URL that differs from the fetched URL but points to the same domain, we store the page under the canonical URL instead. This prevents duplicate entries for pages with query parameters, session IDs, or tracking parameters.

If the canonical URL points to a different domain, we skip indexing the page (it is a cross-domain canonical, indicating the content belongs to another site), but we still follow links on the page to discover new content.
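
The two canonical cases can be expressed as a small decision function. In this Python sketch, `canonical_decision` and its action strings are hypothetical, and "same domain" is approximated by comparing host names exactly, which may be stricter than the crawler's real check.

```python
from urllib.parse import urljoin, urlsplit


def canonical_decision(fetched_url, canonical_href):
    """Return (action, url_to_store) for a fetched page and its canonical href."""
    if not canonical_href:
        return ("index", fetched_url)
    # Resolve relative canonical hrefs against the fetched URL.
    canonical = urljoin(fetched_url, canonical_href.strip())
    same_host = urlsplit(canonical).hostname == urlsplit(fetched_url).hostname
    if not same_host:
        # Cross-domain canonical: do not index, but still follow the page's links.
        return ("skip_index_follow_links", fetched_url)
    return ("index", canonical)


tracked = canonical_decision("https://shop.example/p?utm_source=x", "/p")
```

Here the tracking parameter is dropped because the page is stored under its canonical URL.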

What we index

For each page, we extract and store:

Title from the <title> tag (up to 512 characters). Pages without a title are not indexed.
Description from the meta description tag (up to 1024 characters).
Body text extracted after removing scripts, styles, navigation, footers, headers, forms, and other non-content elements.
Internal and external links for link graph analysis and further crawling.
Content hash for detecting duplicate pages across different URLs.
Language detected via trigram-based analysis (franc-min), stored as ISO 639-1 code (e.g., "en", "de", "fr"). Used for language-filtered search results.
Entity extracted via NLP (compromise) — the primary named entity (company, place, or person) the page is about. Displayed on results as a topic label.
Vector embedding (384-dimension) for semantic search capability.

We only index HTML pages that return HTTP 200. Non-HTML content types (PDFs, images, JSON, etc.) are skipped. Pages behind CDN bot protection (Cloudflare challenges, CAPTCHA walls) are also skipped, as we cannot access their real content.

Privacy

SolvedSeek does not track users. There are no cookies, no analytics scripts, no fingerprinting, and no IP address logging. The search engine has no advertising, so there is no incentive to build user profiles.

Search queries may be cached temporarily (configurable, typically 5 minutes) to improve performance. Cached queries are stored as SHA-256 hashes and are automatically purged when they expire. No query is ever linked to a user, session, or IP address.
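
A hash-keyed cache with expiry might look like the Python sketch below. It is an in-memory stand-in written for illustration: `QueryCache` is a hypothetical class, the 300-second default mirrors the typical 5 minutes mentioned above, and the real storage backend is not described here.

```python
import hashlib
import time


class QueryCache:
    """Stores results under the SHA-256 hash of the exact query text, with a TTL."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # hash -> (expiry time, results)

    @staticmethod
    def key(query: str) -> str:
        # Only the hash is kept; no user, session or IP data is involved.
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def put(self, query, results):
        self._entries[self.key(query)] = (self._clock() + self._ttl, results)

    def get(self, query):
        k = self.key(query)
        entry = self._entries.get(k)
        if entry is None:
            return None
        expires_at, results = entry
        if self._clock() >= expires_at:
            del self._entries[k]  # purge on expiry
            return None
        return results
```

Passing a custom `clock` makes the expiry behaviour easy to test without waiting for real time to pass.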

Query embeddings (used for semantic search) are cached in the database to avoid regenerating them. These contain only the mathematical vector representation of the query text, with no user-identifying information.

For full details, see our Privacy Policy.

For webmasters

Getting your site indexed

The easiest way to get your site into SolvedSeek is to submit it directly. Enter your homepage URL on the submit page and we will crawl it within the next crawl cycle. You can also wait for natural discovery: SolvedSeekBot finds new sites through links on already-indexed pages. New domains are created with an initial 20-page allowance and can expand to 500 pages as trust is established.

Helping SolvedSeekBot crawl your site effectively

Provide a complete robots.txt with a Sitemap: directive pointing to your XML sitemap
Use descriptive <title> tags on every page (pages without titles are not indexed)
Include meta description tags (a complete listing ranks slightly better)
Use <link rel="canonical"> tags to indicate preferred URLs
Use clean, semantic HTML with proper heading structure
Ensure your server responds within 15 seconds (our request timeout)

Removing your site from the index

Add a Disallow rule for SolvedSeekBot in your robots.txt (see example above). Alternatively, add <meta name="robots" content="noindex"> to individual pages you want excluded. Changes will take effect the next time the crawler visits your site.

Technical details

Search engine	PHP 8.2+ with MySQL FULLTEXT indexing
Crawler	Node.js with native fetch API
HTML parsing	cheerio (server-side DOM)
Embedding model	Xenova/all-MiniLM-L6-v2 (384 dimensions, runs locally via ONNX)
Language detection	franc-min (trigram-based, 82 languages, ~150KB)
Entity extraction	compromise (NLP, extracts organizations, places, people)
Database	MySQL 8.x / MariaDB with InnoDB
Concurrency	10 simultaneous page fetches, 1 page per domain per crawl tick, 3 crawl workers
Request timeout	15 seconds per page
robots.txt caching	24 hours (1 hour for server errors)
CDN detection	Cloudflare, Sucuri, DDoS-Guard, Akamai, StackPath, Imperva/Incapsula
Frameworks	None. Custom router, template engine, and database abstraction. No Laravel, Symfony, Express, or similar.

Crawl priority system

URLs in the crawl queue are processed by priority. Higher priority URLs are crawled first:

Priority	Source
10	Manually seeded URLs and new domain homepages
8	Links found on seed/homepage pages (top-level children)
5	URLs discovered from XML sitemaps
3	Links found on deeper pages
2	Links to already-known approved domains
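
A priority queue of this kind can be sketched with Python's `heapq`. The priority numbers come from the table above; the `CrawlQueue` class, the source labels, and the first-in-first-out tie-break within a priority level are assumptions made for the example.

```python
import heapq
import itertools

# Priority values from the table above, keyed by hypothetical source labels.
PRIORITY = {
    "seed": 10,
    "homepage_child": 8,
    "sitemap": 5,
    "deep_link": 3,
    "known_domain": 2,
}


class CrawlQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps insertion order within a priority

    def push(self, url, source):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-PRIORITY[source], next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = CrawlQueue()
queue.push("https://a.example/deep/page", "deep_link")
queue.push("https://a.example/from-sitemap", "sitemap")
queue.push("https://a.example/", "seed")
queue.push("https://a.example/collections", "homepage_child")
order = [queue.pop() for _ in range(len(queue))]
```

The seed URL comes out first and the deep link last, regardless of the order in which they were queued.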

Questions?

Reach us at hello@solvedseek.com