
Data Ingestion

The Curate-Me platform ingests product data from real retail sources. All product data must come from verified sources — synthetic or fictional data is never acceptable.

Data Integrity Rules

| Data Type | Allowed Source | Not Allowed |
| --- | --- | --- |
| Products | Scraper service, affiliate feeds, manual curation from real sites | Invented product names, fake URLs, made-up prices |
| User reviews | Real user submissions, verified imports | Synthetic reviews, fake testimonials |
| Brand info | Official brand websites, verified databases | Made-up brand descriptions |
| Images | Real product images from retailers | Placeholder or AI-generated product images |

Fake product data breaks user trust, produces 404 errors from synthetic URLs, misleads users with incorrect prices, and degrades the recommendation engine that depends on real product attributes.

ProductScraperService

The ProductScraperService is the primary tool for ingesting product data from Shopify-based retail stores. It handles pagination, rate limiting, data extraction, and deduplication.
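The internals of ProductScraperService are not shown in this guide, but the pagination and rate-limiting behavior it describes can be sketched roughly as follows. Shopify stores commonly expose a paginated `products.json` endpoint; the function names below (`page_url`, `scrape_all_pages`) and the injected `fetch_page` callable are illustrative assumptions, not the service's actual API.

```python
import asyncio


def page_url(store_domain: str, page: int, limit: int = 250) -> str:
    """Build a paginated products.json URL for a Shopify store."""
    return f"https://{store_domain}/products.json?limit={limit}&page={page}"


async def scrape_all_pages(fetch_page, store_domain: str, delay_s: float = 1.0) -> list[dict]:
    """Walk pages until an empty one comes back.

    `fetch_page(url)` is injected so the HTTP layer can be swapped out
    (e.g. for tests); a real implementation would use an async HTTP client.
    """
    products: list[dict] = []
    page = 1
    while True:
        batch = await fetch_page(page_url(store_domain, page))
        if not batch:
            break
        products.extend(batch)
        page += 1
        await asyncio.sleep(delay_s)  # crude rate limiting between page requests
    return products
```

Injecting the fetch function keeps the pagination loop testable without hitting a live store.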

Using the Scraper Script

The quickest way to scrape a new retailer:

```shell
cd services/backend
poetry run python scripts/scrape_new_retailers.py
```

This interactive script prompts for the retailer name and configuration, then runs the scraper and saves the output.

Programmatic Usage

For more control, use the ProductScraperService directly:

```python
import asyncio

from src.services.product_scraper_service import ProductScraperService


async def main():
    scraper = ProductScraperService()

    # Scrape up to 100 products from a retailer
    products = await scraper.scrape_retailer("everlane", max_products=100)

    # Save to a data file
    scraper.save_products_file("everlane", products, "data")

    # Always close the scraper when done
    await scraper.close()

asyncio.run(main())
```

Scraper Output

Each scraped product includes the following fields:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Product name from the retailer |
| brand | string | Brand name |
| price | float | Current price in USD |
| original_price | float | Original price before any discount |
| url | string | Direct link to the product page |
| image_url | string | Primary product image URL |
| category | string | Product category (e.g., “tops”, “dresses”) |
| colors | array | Available color options |
| sizes | array | Available sizes |
| description | string | Product description text |
| scraped_at | datetime | Timestamp of when the data was collected |
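The schema validation step mentioned later in this guide is not shown in code, but a minimal check over the fields listed above might look like the sketch below. The names `REQUIRED_FIELDS` and `validate_schema` are assumptions for illustration, not the pipeline's actual API.

```python
from datetime import datetime

# Field names and types taken from the table above; this mapping and the
# validate_schema helper are hypothetical, not the platform's real code.
REQUIRED_FIELDS = {
    "title": str,
    "brand": str,
    "price": float,
    "original_price": float,
    "url": str,
    "image_url": str,
    "category": str,
    "colors": list,
    "sizes": list,
    "description": str,
    "scraped_at": datetime,
}


def validate_schema(product: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in product:
            problems.append(f"missing field: {field}")
        elif not isinstance(product[field], expected):
            problems.append(f"wrong type for {field}: {type(product[field]).__name__}")
    return problems
```

Returning a list of problems rather than raising on the first failure makes it easier to log every defect in a bad record at once.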

COS API Integration

For brands that provide structured product feeds, the platform integrates with the COS API for real-time product data:

```python
from src.services.cos_api_client import CosApiClient


async def fetch_cos_products():
    client = CosApiClient()
    products = await client.get_products(
        category="women",
        limit=50,
        sort="newest",
    )
    return products
```

The COS API client handles authentication, pagination, and data normalization to match the platform’s internal product schema.
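The normalization step could look something like the sketch below. The raw feed field names (`name`, `price.value`, `pdpUrl`, `image`) are assumptions made for illustration; the actual COS feed format is not documented in this guide.

```python
def normalize_cos_product(raw: dict) -> dict:
    """Map one raw COS feed item onto the platform's internal product schema.

    The source keys here are hypothetical stand-ins for whatever fields the
    real COS feed returns; only the target keys match the schema table above.
    """
    return {
        "title": raw.get("name", ""),
        "brand": "COS",
        "price": float(raw.get("price", {}).get("value", 0.0)),
        "url": raw.get("pdpUrl", ""),
        "image_url": raw.get("image", ""),
        "category": raw.get("category", ""),
    }
```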

Data Validation and Deduplication

Before products are stored, the pipeline applies validation and deduplication:

  1. URL validation — Verifies that product URLs return a 200 status code.
  2. Price validation — Rejects products with zero or negative prices.
  3. Image validation — Confirms that image URLs point to accessible images.
  4. Deduplication — Products are matched by URL. If a product with the same URL already exists, the record is updated rather than duplicated.
  5. Schema validation — All required fields must be present and correctly typed.
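Steps 2 and 4 above can be sketched as follows. The function names (`validate_price`, `upsert_by_url`) are illustrative assumptions; the real pipeline's code is not shown in this guide.

```python
def validate_price(product: dict) -> bool:
    """Step 2: reject products with zero or negative prices."""
    price = product.get("price")
    return isinstance(price, (int, float)) and price > 0


def upsert_by_url(existing: dict[str, dict], incoming: list[dict]) -> dict[str, dict]:
    """Step 4: deduplicate by URL.

    A product whose URL already exists updates the stored record rather
    than creating a duplicate; invalid prices are dropped on the way in.
    """
    merged = dict(existing)
    for product in incoming:
        if validate_price(product):
            merged[product["url"]] = product  # same URL -> update, not duplicate
    return merged
```

Keying the store by URL makes re-running a scrape idempotent: repeated ingestion of the same catalog converges to one record per product.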

Adding a New Retailer

To add support for a new Shopify-based retailer:

  1. Identify the retailer’s Shopify store URL (typically {store}.myshopify.com or the public domain).
  2. Add the retailer configuration to the scraper service.
  3. Run the scraper and verify the output data quality.
  4. Save the products file and commit it to the data/ directory.
```shell
cd services/backend
poetry run python -c "
import asyncio
from src.services.product_scraper_service import ProductScraperService

async def main():
    scraper = ProductScraperService()
    products = await scraper.scrape_retailer('new-retailer', max_products=50)
    print(f'Scraped {len(products)} products')
    scraper.save_products_file('new-retailer', products, 'data')
    await scraper.close()

asyncio.run(main())
"
```

For non-Shopify retailers, implement a custom scraper by extending the base scraper class and registering it in the service.