
Data Ingestion

The Curate-Me platform ingests product data from real retail sources. All product data must come from verified sources — synthetic or fictional data is never acceptable.

Data Integrity Rules

| Data Type | Allowed Source | Not Allowed |
| --- | --- | --- |
| Products | Scraper service, affiliate feeds, manual curation from real sites | Invented product names, fake URLs, made-up prices |
| User reviews | Real user submissions, verified imports | Synthetic reviews, fake testimonials |
| Brand info | Official brand websites, verified databases | Made-up brand descriptions |
| Images | Real product images from retailers | Placeholder or AI-generated product images |

Fake product data breaks user trust, produces 404 errors from synthetic URLs, misleads users with incorrect prices, and degrades the recommendation engine that depends on real product attributes.

ProductScraperService

The ProductScraperService is the primary tool for ingesting product data from Shopify-based retail stores. It handles pagination, rate limiting, data extraction, and deduplication.
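The internals of ProductScraperService are not shown in this guide, but the pagination and rate-limiting behavior it describes can be sketched roughly as follows. Shopify stores commonly expose a paginated `products.json` endpoint; the function names below (`page_url`, `scrape_all_pages`) and the injected `fetch_page` callable are illustrative assumptions, not the service's actual API.

```python
import asyncio


def page_url(store_domain: str, page: int, limit: int = 250) -> str:
    """Build a paginated products.json URL for a Shopify store."""
    return f"https://{store_domain}/products.json?limit={limit}&page={page}"


async def scrape_all_pages(fetch_page, store_domain: str, delay_s: float = 1.0) -> list[dict]:
    """Walk pages until an empty one comes back.

    `fetch_page(url)` is injected so the HTTP layer can be swapped out
    (e.g. for tests); a real implementation would use an async HTTP client.
    """
    products: list[dict] = []
    page = 1
    while True:
        batch = await fetch_page(page_url(store_domain, page))
        if not batch:
            break
        products.extend(batch)
        page += 1
        await asyncio.sleep(delay_s)  # crude rate limiting between page requests
    return products
```

Injecting the fetch function keeps the pagination loop testable without hitting a live store.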

Using the Scraper Script

The quickest way to scrape a new retailer:

```shell
cd services/backend
poetry run python scripts/scrape_new_retailers.py
```

This interactive script prompts for the retailer name and configuration, then runs the scraper and saves the output.

Programmatic Usage

For more control, use the ProductScraperService directly:

```python
import asyncio

from src.services.product_scraper_service import ProductScraperService


async def main():
    scraper = ProductScraperService()

    # Scrape up to 100 products from a retailer
    products = await scraper.scrape_retailer("everlane", max_products=100)

    # Save to a data file
    scraper.save_products_file("everlane", products, "data")

    # Always close the scraper when done
    await scraper.close()

asyncio.run(main())
```

Scraper Output

Each scraped product includes the following fields:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Product name from the retailer |
| brand | string | Brand name |
| price | float | Current price in USD |
| original_price | float | Original price before any discount |
| url | string | Direct link to the product page |
| image_url | string | Primary product image URL |
| category | string | Product category (e.g., “tops”, “dresses”) |
| colors | array | Available color options |
| sizes | array | Available sizes |
| description | string | Product description text |
| scraped_at | datetime | Timestamp of when the data was collected |
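The schema validation step mentioned later in this guide is not shown in code, but a minimal check over the fields listed above might look like the sketch below. The names `REQUIRED_FIELDS` and `validate_schema` are assumptions for illustration, not the pipeline's actual API.

```python
from datetime import datetime

# Field names and types taken from the table above; this mapping and the
# validate_schema helper are hypothetical, not the platform's real code.
REQUIRED_FIELDS = {
    "title": str,
    "brand": str,
    "price": float,
    "original_price": float,
    "url": str,
    "image_url": str,
    "category": str,
    "colors": list,
    "sizes": list,
    "description": str,
    "scraped_at": datetime,
}


def validate_schema(product: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in product:
            problems.append(f"missing field: {field}")
        elif not isinstance(product[field], expected):
            problems.append(f"wrong type for {field}: {type(product[field]).__name__}")
    return problems
```

Returning a list of problems rather than raising on the first failure makes it easier to log every defect in a bad record at once.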

COS API Integration

For brands that provide structured product feeds, the platform integrates with the COS API for real-time product data:

```python
from src.services.cos_api_client import CosApiClient


async def fetch_cos_products():
    client = CosApiClient()
    products = await client.get_products(
        category="women",
        limit=50,
        sort="newest",
    )
    return products
```

The COS API client handles authentication, pagination, and data normalization to match the platform’s internal product schema.
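The normalization step could look something like the sketch below. The raw feed field names (`name`, `price.value`, `pdpUrl`, `image`) are assumptions made for illustration; the actual COS feed format is not documented in this guide.

```python
def normalize_cos_product(raw: dict) -> dict:
    """Map one raw COS feed item onto the platform's internal product schema.

    The source keys here are hypothetical stand-ins for whatever fields the
    real COS feed returns; only the target keys match the schema table above.
    """
    return {
        "title": raw.get("name", ""),
        "brand": "COS",
        "price": float(raw.get("price", {}).get("value", 0.0)),
        "url": raw.get("pdpUrl", ""),
        "image_url": raw.get("image", ""),
        "category": raw.get("category", ""),
    }
```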

Data Validation and Deduplication

Before products are stored, the pipeline applies validation and deduplication:

  1. URL validation — Verifies that product URLs return a 200 status code.
  2. Price validation — Rejects products with zero or negative prices.
  3. Image validation — Confirms that image URLs point to accessible images.
  4. Deduplication — Products are matched by URL. If a product with the same URL already exists, the record is updated rather than duplicated.
  5. Schema validation — All required fields must be present and correctly typed.
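Steps 2 and 4 above can be sketched as follows. The function names (`validate_price`, `upsert_by_url`) are illustrative assumptions; the real pipeline's code is not shown in this guide.

```python
def validate_price(product: dict) -> bool:
    """Step 2: reject products with zero or negative prices."""
    price = product.get("price")
    return isinstance(price, (int, float)) and price > 0


def upsert_by_url(existing: dict[str, dict], incoming: list[dict]) -> dict[str, dict]:
    """Step 4: deduplicate by URL.

    A product whose URL already exists updates the stored record rather
    than creating a duplicate; invalid prices are dropped on the way in.
    """
    merged = dict(existing)
    for product in incoming:
        if validate_price(product):
            merged[product["url"]] = product  # same URL -> update, not duplicate
    return merged
```

Keying the store by URL makes re-running a scrape idempotent: repeated ingestion of the same catalog converges to one record per product.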

Adding a New Retailer

To add support for a new Shopify-based retailer:

  1. Identify the retailer’s Shopify store URL (typically {store}.myshopify.com or the public domain).
  2. Add the retailer configuration to the scraper service.
  3. Run the scraper and verify the output data quality.
  4. Save the products file and commit it to the data/ directory.
```shell
cd services/backend
poetry run python -c "
import asyncio
from src.services.product_scraper_service import ProductScraperService

async def main():
    scraper = ProductScraperService()
    products = await scraper.scrape_retailer('new-retailer', max_products=50)
    print(f'Scraped {len(products)} products')
    scraper.save_products_file('new-retailer', products, 'data')
    await scraper.close()

asyncio.run(main())
"
```

For non-Shopify retailers, implement a custom scraper by extending the base scraper class and registering it in the service.