Python scraping engineer - production data pipeline

Upwork · GB · Budget not specified · Expert level

Skills: Python · Scrapy · ETL Pipeline · Selenium · Playwright · BeautifulSoup · Celery · FastAPI · PostgreSQL · Apify
We are building a commercial intelligence platform that collects public signals from review sites, job boards, company websites, and news sources, normalises them into structured evidence records, and scores them to produce intelligence outputs for sales, procurement, and investment use cases. We need an engineer to own the scraping and data pipeline layer from day one.

This is not a web app build. It is a data infrastructure build. The scraping and pipeline layer is the critical path for the entire product.

What you will build in weeks 1-2:

The output of weeks 1-2 will be packaged as a standalone Python module and handed to a second team. It must be clean, documented, and independently deployable.

- BaseAdapter class and adapter contract: can_handle, fetch, health_check, estimate_cost, normalize
- source_registry.yaml: config-driven source definitions, each with a primary method, a fallback adapter, and a staleness threshold
- Postgres schema: raw_records, evidence_records, jobs, entities, scraper_health_daily
- Four working adapters across major B2B review, jobs, and company data sources (HTML fetch primary, commercial adapter fallbacks)
- Cost routing logic: cheap-method-first enforcement (static HTML fetch → Firecrawl → commercial adapter), with automatic escalation when the block rate exceeds a threshold; cost is estimated before each job and capped per job
- Evidence normaliser: raw payload → structured evidence records with topic, sentiment, recency, confidence, and an observed/inferred flag
- Evidence desert handler: count threshold check, low-confidence suppression, degraded output mode
- Per-job worker isolation: one Celery worker per company per source; a block on one job must never affect others
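For orientation, the adapter contract above could be sketched roughly as follows. The five method names come from the listing; the `EvidenceRecord` fields mirror the normaliser spec, and everything else (field types, the stub subclass) is an illustrative assumption, not the actual codebase.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class EvidenceRecord:
    # Hypothetical shape mirroring the normaliser spec: topic, sentiment,
    # recency, confidence, observed/inferred flag. The real schema is in Postgres.
    topic: str
    sentiment: float
    recency_days: int
    confidence: float
    observed: bool  # True = directly observed, False = inferred


class BaseAdapter(ABC):
    """Contract every source adapter must satisfy."""

    @abstractmethod
    def can_handle(self, url: str) -> bool:
        """Return True if this adapter knows how to fetch the given URL."""

    @abstractmethod
    def fetch(self, url: str) -> str:
        """Return the raw payload (HTML/JSON) for a URL."""

    @abstractmethod
    def health_check(self) -> bool:
        """Cheap liveness probe: is the source or provider reachable?"""

    @abstractmethod
    def estimate_cost(self, url: str) -> float:
        """Estimated cost (USD) of fetching this URL, checked before each job."""

    @abstractmethod
    def normalize(self, raw: str) -> list[EvidenceRecord]:
        """Turn a raw payload into structured evidence records."""
```

A concrete adapter (say, a static HTML fetcher) would subclass `BaseAdapter` and implement all five methods; the cost router and health monitor then only ever talk to the contract, never to a specific source.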
- Scraper health monitor: block rate, success rate, average cost per source, auto-fallback when the block rate exceeds 30%
- Adapter addition guide and README for the second team

What you will build in weeks 3-11:

- Additional adapters across review platforms, news/press, company filings, and app stores
- Cache layer with staleness thresholds per source type
- Change detection before re-scraping: only re-scrape pages that have actually changed
- Monthly snapshot scheduler for tracked entities
- Performance hardening under concurrent load
- Cost monitoring per source, with alerts when cost per brief exceeds a threshold

The experience we need:

- Production Python scraping: you have built scrapers that run reliably against sites that actively block, with fallback routing and health monitoring in place
- Experience integrating commercial scraping services such as Apify, Bright Data, or Firecrawl, including when to route to them versus a direct fetch, how to manage cost, and how to switch between providers when one fails
- Celery or an equivalent job queue for async pipeline work
- Postgres schema design for data pipelines, not just web app CRUD
- FastAPI or equivalent for the internal API layer
- You have hit the wall of a scraper getting blocked in production and know how to architect around it

What we do not need from you:

- Frontend or UI work: a second engineer owns that
- LLM integration: a second engineer owns that
- DevOps beyond Docker and basic deployment

Stack: Python 3.12 · FastAPI · Celery · Postgres · SQLAlchemy · Playwright · Firecrawl · Apify · Bright Data

Engagement: 11 weeks, part-time. Roughly 2-3 focused days per week to start, increasing from week 3. The weeks 1-2 deliverable has a hard handover deadline; this is non-negotiable. Remote. UK or EU timezone strongly preferred.