# Data Ingestion
The Curate-Me platform ingests product data from real retail sources. All product data must come from verified sources — synthetic or fictional data is never acceptable.
## Data Integrity Rules
| Data Type | Allowed Source | Not Allowed |
|---|---|---|
| Products | Scraper service, affiliate feeds, manual curation from real sites | Invented product names, fake URLs, made-up prices |
| User reviews | Real user submissions, verified imports | Synthetic reviews, fake testimonials |
| Brand info | Official brand websites, verified databases | Made-up brand descriptions |
| Images | Real product images from retailers | Placeholder or AI-generated product images |
Fake product data breaks user trust, produces 404 errors from synthetic URLs, misleads users with incorrect prices, and degrades the recommendation engine that depends on real product attributes.
## ProductScraperService
The ProductScraperService is the primary tool for ingesting product data from Shopify-based retail stores. It handles pagination, rate limiting, data extraction, and deduplication.
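As a rough illustration of the pagination and rate-limiting behavior described above (not the service's actual implementation), Shopify stores commonly expose a public `products.json` feed that can be paged through with a delay between requests. The function and parameter names below are assumptions, and the HTTP call is stubbed out:

```python
import asyncio

async def scrape_shopify_products(fetch_page, max_products=100, delay=0.0):
    """Page through a Shopify-style products feed until it is exhausted.

    fetch_page(page) -> list of product dicts (an empty list ends pagination).
    A real implementation would issue HTTP GETs to something like
    https://{store}/products.json?limit=250&page={page}.
    """
    products = []
    page = 1
    while len(products) < max_products:
        batch = await fetch_page(page)
        if not batch:
            break  # no more pages
        products.extend(batch)
        page += 1
        await asyncio.sleep(delay)  # crude rate limiting between requests
    return products[:max_products]

# Demo with a fake fetcher standing in for the HTTP call
async def fake_fetch(page):
    data = {1: [{"title": "Tee"}, {"title": "Dress"}], 2: [{"title": "Coat"}]}
    return data.get(page, [])

result = asyncio.run(scrape_shopify_products(fake_fetch, max_products=5))
```

Deduplication and field extraction would then run on the accumulated batches before anything is persisted.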
### Using the Scraper Script
The quickest way to scrape a new retailer:
```bash
cd services/backend
poetry run python scripts/scrape_new_retailers.py
```

This interactive script prompts for the retailer name and configuration, then runs the scraper and saves the output.
### Programmatic Usage
For more control, use the ProductScraperService directly:
```python
import asyncio

from src.services.product_scraper_service import ProductScraperService

async def main():
    scraper = ProductScraperService()

    # Scrape up to 100 products from a retailer
    products = await scraper.scrape_retailer("everlane", max_products=100)

    # Save to a data file
    scraper.save_products_file("everlane", products, "data")

    # Always close the scraper when done
    await scraper.close()

asyncio.run(main())
```

### Scraper Output
Each scraped product includes the following fields:
| Field | Type | Description |
|---|---|---|
| `title` | string | Product name from the retailer |
| `brand` | string | Brand name |
| `price` | float | Current price in USD |
| `original_price` | float | Original price before any discount |
| `url` | string | Direct link to the product page |
| `image_url` | string | Primary product image URL |
| `category` | string | Product category (e.g., “tops”, “dresses”) |
| `colors` | array | Available color options |
| `sizes` | array | Available sizes |
| `description` | string | Product description text |
| `scraped_at` | datetime | Timestamp of when the data was collected |
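The field list above can be captured as a typed record. The following dataclass is an illustrative sketch built from the table (the class name and defaults are assumptions, not the platform's actual model; the sample values are placeholders):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ScrapedProduct:
    """Sketch of the scraper output schema from the field table."""
    title: str
    brand: str
    price: float
    original_price: float
    url: str
    image_url: str
    category: str
    colors: list[str] = field(default_factory=list)
    sizes: list[str] = field(default_factory=list)
    description: str = ""
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Placeholder values for illustration only
p = ScrapedProduct(
    title="Organic Cotton Tee",
    brand="Everlane",
    price=30.0,
    original_price=38.0,
    url="https://www.everlane.com/products/example",
    image_url="https://cdn.example.com/tee.jpg",
    category="tops",
)
```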
## COS API Integration
For brands that provide structured product feeds, the platform integrates with the COS API for real-time product data:
```python
from src.services.cos_api_client import CosApiClient

async def fetch_cos_products():
    client = CosApiClient()
    products = await client.get_products(
        category="women",
        limit=50,
        sort="newest",
    )
    return products
```

The COS API client handles authentication, pagination, and data normalization to match the platform’s internal product schema.
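The normalization step can be sketched as a mapping from feed fields onto the internal schema. The feed field names below (`name`, `priceUsd`, and so on) are invented for illustration; the real client's mapping may differ:

```python
from datetime import datetime, timezone

def normalize_cos_product(raw: dict) -> dict:
    """Map a hypothetical COS feed record onto the internal product schema.

    Field names on the input side are assumptions for illustration only.
    """
    return {
        "title": raw["name"],
        "brand": "COS",
        "price": float(raw["priceUsd"]),
        "original_price": float(raw.get("originalPriceUsd", raw["priceUsd"])),
        "url": raw["productUrl"],
        "image_url": raw["imageUrl"],
        "category": raw.get("category", ""),
        "colors": raw.get("colors", []),
        "sizes": raw.get("sizes", []),
        "description": raw.get("description", ""),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

# Placeholder record for illustration only
normalized = normalize_cos_product({
    "name": "Relaxed Wool Coat",
    "priceUsd": "250",
    "productUrl": "https://www.cos.com/en_usd/example",
    "imageUrl": "https://cdn.example.com/coat.jpg",
})
```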
## Data Validation and Deduplication
Before products are stored, the pipeline applies validation and deduplication:
- URL validation — Verifies that product URLs return a 200 status code.
- Price validation — Rejects products with zero or negative prices.
- Image validation — Confirms that image URLs point to accessible images.
- Deduplication — Products are matched by URL. If a product with the same URL already exists, the record is updated rather than duplicated.
- Schema validation — All required fields must be present and correctly typed.
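As an offline illustration, the price check, basic schema check, and dedup-by-URL upsert described above might look like the following sketch (function names are hypothetical, and the live URL and image reachability checks are omitted here):

```python
def validate_product(p: dict) -> bool:
    """Basic schema and price checks; live 200-status checks are omitted."""
    if not all(isinstance(p.get(k), str) and p[k] for k in ("title", "brand", "url")):
        return False  # required string fields must be present and non-empty
    price = p.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        return False  # reject missing, zero, or negative prices
    return p["url"].startswith("http")  # cheap URL shape check only

def upsert_products(existing: dict, incoming: list) -> dict:
    """Deduplicate by URL: update existing records, insert new ones."""
    for p in incoming:
        if validate_product(p):
            existing[p["url"]] = p  # same URL -> record updated, not duplicated
    return existing

# Placeholder records for illustration only
store = {}
batch = [
    {"title": "Tee", "brand": "Everlane", "price": 30.0, "url": "https://example.com/tee"},
    {"title": "Bad", "brand": "X", "price": 0.0, "url": "https://example.com/bad"},
    {"title": "Tee v2", "brand": "Everlane", "price": 28.0, "url": "https://example.com/tee"},
]
store = upsert_products(store, batch)
```

The zero-priced record is rejected, and the two records sharing a URL collapse into one updated entry.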
## Adding a New Retailer
To add support for a new Shopify-based retailer:
- Identify the retailer’s Shopify store URL (typically `{store}.myshopify.com` or the public domain).
- Add the retailer configuration to the scraper service.
- Run the scraper and verify the output data quality.
- Save the products file and commit it to the `data/` directory.
```bash
cd services/backend
poetry run python -c "
import asyncio
from src.services.product_scraper_service import ProductScraperService

async def main():
    scraper = ProductScraperService()
    products = await scraper.scrape_retailer('new-retailer', max_products=50)
    print(f'Scraped {len(products)} products')
    scraper.save_products_file('new-retailer', products, 'data')
    await scraper.close()

asyncio.run(main())
"
```

For non-Shopify retailers, implement a custom scraper by extending the base scraper class and registering it in the service.
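The extend-and-register pattern might look like the sketch below. The base class, registry, and function names here are assumptions for illustration; the real ones live in the scraper service module and may differ:

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Hypothetical base class a non-Shopify scraper would extend."""

    @abstractmethod
    async def scrape(self, max_products: int) -> list[dict]:
        ...

# Hypothetical registry mapping retailer names to scraper instances
SCRAPER_REGISTRY: dict[str, BaseScraper] = {}

def register_scraper(name: str, scraper: BaseScraper) -> None:
    SCRAPER_REGISTRY[name] = scraper

class ExampleJsonFeedScraper(BaseScraper):
    """Sketch of a custom scraper for a retailer with a bespoke JSON feed."""

    async def scrape(self, max_products: int) -> list[dict]:
        # A real implementation would fetch and parse the retailer's feed here.
        return []

register_scraper("example-retailer", ExampleJsonFeedScraper())
```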