Python scraping engineer - production data pipeline
Upwork · GB · Not specified · Expert
Python · Scrapy · ETL Pipeline · Selenium · Playwright · BeautifulSoup · Celery · FastAPI · PostgreSQL · Apify
We are building a commercial intelligence platform that collects public signals from review sites, job boards, company websites, and news sources, normalises them into structured evidence records, and scores them to produce intelligence outputs for sales, procurement, and investment use cases.
We need an engineer to own the scraping and data pipeline layer from day one. This is not a web app build. It is a data infrastructure build. The scraping and pipeline layer is the critical path for the entire product.
What you will build in weeks 1–2:
The output of weeks 1–2 will be packaged as a standalone Python module and handed to a second team. It must be clean, documented, and independently deployable.
BaseAdapter class and adapter contract: can_handle, fetch, health_check, estimate_cost, normalize (contract sketched after this list)
source_registry.yaml - config-driven source definitions, each with primary method, fallback adapter, and staleness threshold (example shape after this list)
Postgres schema: raw_records, evidence_records, jobs, entities, scraper_health_daily
Four working adapters across major B2B review, jobs, and company data sources - HTML fetch primary, commercial adapter fallbacks
Cost routing logic: cheap-method-first enforcement - static HTML fetch → Firecrawl → commercial adapter - with automatic escalation when block rate exceeds threshold. Cost estimated before each job, capped per job (routing sketched after this list)
Evidence normaliser: raw payload → structured evidence records with topic, sentiment, recency, confidence, observed/inferred flag
Evidence desert handler: count threshold check, low-confidence suppression, degraded output mode
Per-job worker isolation: one Celery worker per company per source. A block on one job must never affect others (queue-per-job pattern sketched after this list)
Scraper health monitor: block rate, success rate, avg cost per source, auto-fallback when block rate exceeds 30%
Adapter addition guide and README for the second team
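To make the adapter contract concrete, here is a minimal sketch of what the BaseAdapter interface and the normalised evidence record could look like. Class names, field names, and types are illustrative assumptions, not the final spec.

```python
# Sketch of the adapter contract and normalised evidence record described above.
# All names and types here are illustrative, not the final spec.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class EvidenceRecord:
    entity_id: str
    topic: str
    sentiment: float            # e.g. -1.0 .. 1.0
    observed_at: datetime       # drives the recency score
    confidence: float           # 0.0 .. 1.0
    observed: bool              # True = directly observed, False = inferred
    payload: dict[str, Any]     # source-specific normalised fields


class BaseAdapter(ABC):
    """One adapter per source type; the registry selects one via can_handle()."""

    @abstractmethod
    def can_handle(self, source_url: str) -> bool: ...

    @abstractmethod
    def fetch(self, source_url: str) -> dict[str, Any]:
        """Return the raw payload to be stored in raw_records."""

    @abstractmethod
    def health_check(self) -> bool:
        """Cheap liveness/block probe that feeds scraper_health_daily."""

    @abstractmethod
    def estimate_cost(self, source_url: str) -> float:
        """Estimated cost in USD for one fetch, checked against the per-job cap."""

    @abstractmethod
    def normalize(self, raw: dict[str, Any]) -> list[EvidenceRecord]: ...
```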
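A possible shape for source_registry.yaml and a loader for it, assuming PyYAML; the example source name and exact field names are placeholders.

```python
# Possible shape of source_registry.yaml and a loader for it.
# Requires PyYAML; the source name and field names below are placeholders.
#
# sources:
#   example_review_site:
#     primary_method: static_html
#     fallback_adapter: apify
#     staleness_threshold_hours: 168
from dataclasses import dataclass

import yaml


@dataclass
class SourceConfig:
    name: str
    primary_method: str          # e.g. "static_html"
    fallback_adapter: str        # e.g. "apify", "bright_data", "firecrawl"
    staleness_threshold_hours: int


def load_registry(path: str = "source_registry.yaml") -> dict[str, SourceConfig]:
    with open(path) as fh:
        raw = yaml.safe_load(fh)
    return {name: SourceConfig(name=name, **cfg) for name, cfg in raw["sources"].items()}
```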
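A rough sketch of the cheap-method-first routing with the 30% block-rate fallback and the per-job cost cap. The ladder order and threshold come from the brief; the cap value, dict shapes, and function names are assumptions.

```python
# Cheap-method-first routing with block-rate escalation and a per-job cost cap.
# The ladder order and the 30% threshold come from the brief; the cap value,
# dict shapes, and function names are assumptions.
JOB_COST_CAP_USD = 0.50          # example cap, not a real figure
BLOCK_RATE_THRESHOLD = 0.30      # auto-fallback above this block rate

ESCALATION_LADDER = ["static_html", "firecrawl", "commercial"]  # cheapest first


def choose_method(block_rates: dict[str, float]) -> str:
    """Pick the cheapest method whose recent block rate is under the threshold."""
    for method in ESCALATION_LADDER:
        if block_rates.get(method, 0.0) <= BLOCK_RATE_THRESHOLD:
            return method
    return ESCALATION_LADDER[-1]  # everything is blocking: use the last resort


def run_job(adapters: dict[str, "BaseAdapter"], url: str, block_rates: dict[str, float]):
    """Route one job; adapters maps method name -> adapter (see contract sketch above)."""
    method = choose_method(block_rates)
    adapter = adapters[method]
    # Estimate before fetching so the job can be rejected instead of overspending.
    if adapter.estimate_cost(url) > JOB_COST_CAP_USD:
        raise RuntimeError(f"estimated cost exceeds per-job cap for {url}")
    return adapter.fetch(url)
```

The same block-rate numbers can feed the health monitor item above, since auto-fallback and health reporting share one per-source counter.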
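One way to get the per-job isolation described above is a dedicated Celery queue per company per source, with a worker pinned to each queue. The broker URL, queue naming scheme, and task signature below are assumptions.

```python
# Per-job isolation sketch: one Celery queue (and dedicated worker) per company
# per source, so a blocked fetch only backs up its own queue. Broker URL,
# queue naming, and the task signature are assumptions.
from celery import Celery

app = Celery("pipeline", broker="redis://localhost:6379/0")


@app.task(bind=True)
def scrape_job(self, company_id: str, source: str, url: str) -> dict:
    # ...route to the chosen adapter and store the raw payload here...
    return {"company_id": company_id, "source": source, "url": url}


def enqueue(company_id: str, source: str, url: str):
    # A worker started with `celery -A pipeline worker -Q scrape.<company>.<source>`
    # consumes only this queue, isolating it from every other job.
    return scrape_job.apply_async(
        args=[company_id, source, url],
        queue=f"scrape.{company_id}.{source}",
    )
```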
What you will build in weeks 3–11:
Additional adapters across review platforms, news/press, company filings, and app stores
Cache layer with staleness thresholds per source type
Change detection before re-scraping - only re-scrape pages that have actually changed (sketched after this list)
Monthly snapshot scheduler for tracked entities
Performance hardening under concurrent load
Cost monitoring per source with alerts when cost per brief exceeds threshold
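A minimal sketch of the staleness check and content-hash change detection described above; threshold values and the stored-fingerprint handling are placeholders.

```python
# Staleness and change-detection sketch: skip the fetch when the cache is still
# fresh, and skip re-processing when the content hash is unchanged.
# Threshold values and the stored-fingerprint handling are placeholders.
import hashlib
from datetime import datetime, timedelta, timezone


def is_stale(last_fetched_at: datetime, threshold: timedelta) -> bool:
    """True when the cached copy is older than the per-source staleness threshold."""
    return datetime.now(timezone.utc) - last_fetched_at > threshold


def fingerprint(html: str) -> str:
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(html: str, last_fingerprint: str | None) -> bool:
    """True when the page differs from the last stored snapshot (or none exists)."""
    return last_fingerprint is None or fingerprint(html) != last_fingerprint
```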
The experience we need:
Production Python scraping - you have built scrapers that run reliably against sites that actively block, with fallback routing and health monitoring in place
Experience integrating commercial scraping services such as Apify, Bright Data, or Firecrawl - including when to route to them versus direct fetch, how to manage cost, and how to switch between providers when one fails
Celery or equivalent job queue for async pipeline work
Postgres schema design for data pipelines - not just web app CRUD
FastAPI or equivalent for the internal API layer
You have hit the wall of a scraper getting blocked in production and know how to architect around it
What we do not need from you:
Frontend or UI work - a second engineer owns that
LLM integration - a second engineer owns that
DevOps beyond Docker and basic deployment
Stack:
Python 3.12 · FastAPI · Celery · Postgres · SQLAlchemy · Playwright · Firecrawl · Apify · Bright Data
Engagement:
11 weeks, part-time. Roughly 2-3 focused days per week to start, increasing from week 3. The weeks 1-2 deliverable has a hard handover deadline - this is non-negotiable. Remote. UK or EU timezone strongly preferred.