
During a previous internship in technical sales, I built client quotes that pulled products from four separate billing systems. The same service existed under different names, different SKUs, and different prices depending on which system you were looking at. A 100 Mbps internet plan might be "Internet Standard 100" in one platform and "Turbo Internet" in another, priced exactly the same. Every quote required manually cross-referencing spreadsheets to make sure I was pulling the right product. It also meant multiple teams managing multiple platforms: an enormous amount of effort for a problem that unification seemed like it could solve.
I later learned this wasn't unique to that company. Large-scale organizations that manage flanker brands, own subsidiaries, or acquire new companies deal with an issue called SKU proliferation (the rapid, often uncontrolled expansion of a product catalog). It happens when they inherit overlapping product catalogs sitting in separate billing systems.
The industry solution is hiring consultants to manually build spreadsheet crosswalks: mapping columns, matching products by eye, and proposing consolidation plans. This process often takes months and can cost anywhere from 1% to 7% of the original deal's price, according to this article: https://www.ciodive.com/news/merger-acquisition-technology-crm/567088/
The Wealthsimple AI Builders Program asked candidates to design and prototype an AI system that “meaningfully expands what a human can do”. The SKU proliferation problem was something I knew I wanted to solve. The core challenge is that catalog reconciliation requires two cognitive skills that don't scale manually: understanding messy schemas well enough to map them into a common format, and recognizing semantic similarity across products described in completely different language. Both are exactly the kind of pattern recognition AI handles well.
SKUai is a catalog standardization platform. First, it establishes a canonical product standard, either from an existing schema or by building one from two or more uploaded catalogs. It then converts any catalog to that standard and uses a two-stage matching system to surface duplicates and overlaps across the entire product corpus.
The key architectural insight, learned the hard way through a complete v1 iteration, is that matching and standardization are fundamentally different problems that need different AI tools:
| V1 - LLM Only | V2 - Vector + LLM |
|---|---|
| → All products sent to Claude in one prompt | → Embeddings stored in pgvector |
| → Context window is the bottleneck | → Cosine similarity finds potential overlaps in milliseconds |
| → Quality degrades past ~50 products | → Only potential matches are sent to Claude |
| → Cost scales linearly with catalog size | → Scales to thousands of SKU items |
<aside> <img src="/icons/light-bulb_blue.svg" alt="/icons/light-bulb_blue.svg" width="40px" />
Design Principle
Use the right AI tool for each job. Embedding models are fast, cheap, and excellent at measuring similarity, but they can't explain why two products match. LLMs are slow, expensive, and brilliant at reasoning, but they can't efficiently search across thousands of items. The two-stage architecture gives you both: vector similarity as the fast pre-filter, Claude as the semantic verifier.
</aside>
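The two-stage idea can be sketched in a few lines. This is an illustrative toy, not the production code: the function names, the 0.85 threshold, and the hand-written 3-dimensional "embeddings" are all assumptions standing in for real OpenAI vectors and a pgvector query. Stage 1 is a cheap cosine-similarity pre-filter; only the surviving pairs would be sent to Claude in stage 2.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def prefilter_candidates(products, threshold=0.85):
    """Stage 1: return product pairs similar enough to be worth verifying."""
    candidates = []
    for i in range(len(products)):
        for j in range(i + 1, len(products)):
            score = cosine_similarity(products[i]["embedding"],
                                      products[j]["embedding"])
            if score >= threshold:
                candidates.append((products[i]["name"],
                                   products[j]["name"],
                                   round(score, 3)))
    return candidates

# Toy vectors stand in for real embedding output.
catalog = [
    {"name": "Internet Standard 100", "embedding": [0.9, 0.1, 0.0]},
    {"name": "Turbo Internet",        "embedding": [0.88, 0.12, 0.05]},
    {"name": "Business Phone Line",   "embedding": [0.1, 0.9, 0.3]},
]

pairs = prefilter_candidates(catalog)
# Only the two internet plans survive; in the real system, each surviving
# pair goes to Claude for semantic verification (stage 2).
```

The pre-filter turns an all-pairs LLM problem into a handful of cheap verification calls, which is what lets the v2 design scale past the ~50-product ceiling of v1.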
Every catalog goes through a nine-step pipeline. AI handles the cognitive work: schema analysis, normalization, and semantic matching. Humans make the business decisions at two explicit confirmation gates.
| Stage | Action | Component |
|---|---|---|
| 1 | Upload & Parse → Human uploads 2+ catalogs & the system parses them | System |
| 2 | AI Schema Detection → From the datasets, Claude recommends a data schema | Claude |
| 3 | Human Schema Confirmation → Human reviews, updates & confirms the schema | Human Gate |
| 4 | Normalize → The system normalizes the messy data into the new schema | System |
| 5 | Generate Embeddings → Converts normalized products into “fingerprints” | OpenAI |
| 6 | Vector Similarity Search → Queries the database for products with similar “fingerprints” | pgvector |
| 7 | Semantic Verification → Claude takes the similar-fingerprint candidates and deeply analyzes their similarity | Claude |
| 8 | Human Match Review → Human reviews Claude's match findings | Human Gate |
| 9 | Export & Report → Export a formatted Excel file for analysis | System |
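Step 6 is where pgvector does its work. The sketch below assembles the kind of nearest-neighbor query that step would issue; the table and column names (`normalized_products`, `embedding`) and the distance cutoff are assumptions, though `<=>` really is pgvector's cosine-distance operator.

```python
def build_similarity_query(limit=10, max_distance=0.15):
    """Return parameterized SQL that finds the nearest normalized products.

    The caller binds %(query_vec)s to the embedding of the product being
    matched; rows come back ordered from most to least similar.
    """
    return f"""
        SELECT id, name, embedding <=> %(query_vec)s AS distance
        FROM normalized_products
        WHERE embedding <=> %(query_vec)s < {max_distance}
        ORDER BY distance
        LIMIT {limit};
    """

query = build_similarity_query()
```

Because the database does the search, this step runs in milliseconds regardless of catalog size; only the rows it returns ever reach Claude in step 7.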
Claude analyzes column headers and sample rows from all uploaded catalogs and proposes a unified target schema: what fields should exist, what categories are needed, and how each catalog's columns map to the standard. The schema is fully dynamic — no hardcoded fields. The AI proposes whatever structure the data demands.
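A minimal sketch of how that schema-detection prompt might be assembled, assuming the system sends Claude each catalog's column headers plus a few sample rows. The prompt wording, the helper name, and the input shape are all my assumptions, not the production prompt.

```python
import json

def build_schema_prompt(catalogs):
    """catalogs: {catalog_name: list of row dicts}. Returns a prompt string.

    Headers come from the first row's keys; three sample rows per catalog
    give the model enough context to infer types and categories.
    """
    sections = []
    for name, rows in catalogs.items():
        headers = list(rows[0].keys())
        sample = rows[:3]
        sections.append(
            f"Catalog: {name}\nColumns: {headers}\n"
            f"Sample rows: {json.dumps(sample)}"
        )
    return (
        "Propose a unified target schema for these product catalogs. "
        "List the fields, the categories needed, and how each catalog's "
        "columns map to the standard. Return JSON.\n\n"
        + "\n\n".join(sections)
    )

prompt = build_schema_prompt({
    "billing_a": [{"sku": "INT100", "plan": "Internet Standard 100", "price": 79.99}],
    "billing_b": [{"product_id": "TI-1", "title": "Turbo Internet", "monthly": 79.99}],
})
```

Because the prompt carries the raw headers rather than a fixed field list, the response stays fully dynamic: the schema Claude proposes is whatever structure those particular catalogs demand.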