Data Augmentation
Upwork | US | Not specified | intermediate | Score: 54
Data Mining, Data Scraping, Data Analysis, Data Entry, Data Extraction
# Upwork Posting: CA C-57 Well Driller Data Verification
## Job Title
Data Research Specialist: Verify 700+ California C-57 Well Driller Records (Websites + Service Counties)
## Project Summary
We run a public directory for California well drilling contractors. We already have an exported dataset from official CSLB-derived records and automated enrichment, but we need a meticulous human verification pass before launch.
Your job is to review each row and make the data more trustworthy:
- confirm the correct company website (or mark no website),
- verify whether the entity is truly a well drilling contractor (not only consulting/testing),
- extract and normalize California service counties from credible evidence.
Accuracy matters more than speed.
## Scope
- Total rows: ~717 license records
- Dataset includes:
  - license and business identity fields,
  - website candidate fields (sometimes blank or low confidence),
  - extracted service areas (sometimes blank),
  - review queue hints and metadata.
## Source of Truth Rules
1. The contractor license list comes from California CSLB C-57 data.
2. Website/service-area verification must be based on:
   - official company website,
   - official branch/location pages,
   - clear service-area map/text on website.
3. Do not use third-party directory guesses as final truth unless corroborated by company-owned pages.
## Required Deliverables
Provide **three CSV files**:
1. `verified_websites.csv`
   - `license_number`
   - `business_name`
   - `final_website_url`
   - `website_status` (`verified_website` | `no_public_website` | `needs_escalation`)
   - `verification_confidence` (`high` | `medium` | `low`)
   - `evidence_url`
   - `notes`
2. `verified_service_areas.csv`
   - `license_number`
   - `business_name`
   - `area_type` (`county` or `city`)
   - `normalized_area` (e.g., `Fresno County`, `Bakersfield`)
   - `state` (`CA`)
   - `source_page_url`
   - `evidence_snippet` (short quote from page)
   - `verification_confidence` (`high` | `medium` | `low`)
   - `notes`
3. `exceptions_escalation.csv`
   - `license_number`
   - `business_name`
   - `issue_type` (`ambiguous_website`, `not_well_driller`, `conflicting_service_claims`, `insufficient_evidence`, etc.)
   - `details`
   - `recommended_next_step`
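
As a rough illustration of the file layout above, a contractor could pre-check each CSV before submitting it. The sketch below is only one possible approach: the function name `validate_csv` is hypothetical, the column order is assumed to match the order listed above, and `issue_type` is not checked because its value list is open-ended ("etc.").

```python
import csv
import io

# Column lists and allowed values copied from the deliverables above.
SCHEMAS = {
    "verified_websites.csv": [
        "license_number", "business_name", "final_website_url",
        "website_status", "verification_confidence", "evidence_url", "notes",
    ],
    "verified_service_areas.csv": [
        "license_number", "business_name", "area_type", "normalized_area",
        "state", "source_page_url", "evidence_snippet",
        "verification_confidence", "notes",
    ],
    "exceptions_escalation.csv": [
        "license_number", "business_name", "issue_type", "details",
        "recommended_next_step",
    ],
}
ALLOWED = {
    "website_status": {"verified_website", "no_public_website", "needs_escalation"},
    "verification_confidence": {"high", "medium", "low"},
    "area_type": {"county", "city"},
    "state": {"CA"},
}


def validate_csv(filename, text):
    """Return a list of human-readable problems found in one deliverable."""
    reader = csv.DictReader(io.StringIO(text))
    expected = SCHEMAS[filename]
    # Assumption: headers must match the listed columns exactly, in order.
    if reader.fieldnames != expected:
        return [f"{filename}: header mismatch, got {reader.fieldnames}"]
    problems = []
    for row_no, row in enumerate(reader, start=1):  # counts data rows only
        for column, allowed in ALLOWED.items():
            if column in row and row[column] not in allowed:
                problems.append(
                    f"{filename} row {row_no}: bad {column}={row[column]!r}"
                )
    return problems
```

An empty list means the header and the enumerated columns look consistent; it says nothing about whether the evidence itself is correct.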
## Detailed Instructions
### 1) Website Verification
For each license:
- Confirm the website domain matches the business brand/name.
- Reject mismatched domains (different company with similar name).
- If multiple locations or phone numbers exist under the same brand, treat them as one business unless they are clearly unrelated.
- If no reliable public website exists, set `website_status = no_public_website`.
### 2) Well Drilling Relevance Check
We only want actual well drilling contractors.
- Accept if evidence clearly indicates drilling/well drilling services.
- Flag and escalate if the company appears to be only:
  - environmental consulting,
  - lab/testing only,
  - geotech consulting only,
  - unrelated trades.
### 3) Service Area Extraction
Priority order:
1. Explicit county list on website.
2. Service map that clearly corresponds to counties.
3. Explicit city list that can be mapped to counties.
Always prefer county-level coverage when possible.
### 4) County Normalization Rules (Critical)
- Normalize to official California county names with suffix `County`.
- Example: `fresno` → `Fresno County`.
- If website says “Bay Area,” do **not** guess all Bay counties; only include counties directly supported by page evidence.
- If map is vague, include only clearly inferable counties and set confidence to `medium` or `low`.
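
A minimal sketch of the normalization rule, assuming a plain lookup against the 58 California county names. The helper `normalize_county` is hypothetical and deliberately returns `None` for regional labels such as "Bay Area" instead of expanding them, in line with the rule against guessing.

```python
import re

# The 58 California counties, without the "County" suffix.
CA_COUNTIES = [
    "Alameda", "Alpine", "Amador", "Butte", "Calaveras", "Colusa",
    "Contra Costa", "Del Norte", "El Dorado", "Fresno", "Glenn", "Humboldt",
    "Imperial", "Inyo", "Kern", "Kings", "Lake", "Lassen", "Los Angeles",
    "Madera", "Marin", "Mariposa", "Mendocino", "Merced", "Modoc", "Mono",
    "Monterey", "Napa", "Nevada", "Orange", "Placer", "Plumas", "Riverside",
    "Sacramento", "San Benito", "San Bernardino", "San Diego",
    "San Francisco", "San Joaquin", "San Luis Obispo", "San Mateo",
    "Santa Barbara", "Santa Clara", "Santa Cruz", "Shasta", "Sierra",
    "Siskiyou", "Solano", "Sonoma", "Stanislaus", "Sutter", "Tehama",
    "Trinity", "Tulare", "Tuolumne", "Ventura", "Yolo", "Yuba",
]
_LOOKUP = {name.lower(): f"{name} County" for name in CA_COUNTIES}


def normalize_county(raw):
    """Return 'X County' for a recognized California county, else None."""
    cleaned = re.sub(r"[^a-z ]", " ", raw.lower())      # drop punctuation
    cleaned = re.sub(r"\b(county|co)\b", " ", cleaned)  # drop suffix words
    cleaned = " ".join(cleaned.split())                 # collapse spaces
    return _LOOKUP.get(cleaned)


print(normalize_county("fresno"))     # Fresno County
print(normalize_county("Kern Co."))   # Kern County
print(normalize_county("Bay Area"))   # None
```

Several county names are also city names (Fresno, Sacramento, San Diego), so a match here only standardizes spelling. Whether the page actually means the county still has to be confirmed from the evidence.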
### 5) Evidence Requirement
Every verified website/service-area row must include:
- `source_page_url` (named `evidence_url` in `verified_websites.csv`)
- short `evidence_snippet` from that page (or explicit note why unavailable)
No evidence = no acceptance.
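
The "no evidence, no acceptance" rule could be applied as a simple gate before a row is counted as verified. This is a sketch under assumptions: the helper names are hypothetical, the URL is read from either `source_page_url` or `evidence_url`, and the "explicit note why unavailable" is assumed to live in the `notes` column.

```python
def has_evidence(row):
    """True if a row carries a source URL plus a snippet or an explaining note."""
    url = (row.get("source_page_url") or row.get("evidence_url") or "").strip()
    snippet = (row.get("evidence_snippet") or "").strip()
    note = (row.get("notes") or "").strip()  # assumed home of the explanation
    return url.startswith(("http://", "https://")) and bool(snippet or note)


def split_by_evidence(rows):
    """Separate rows into (accepted, rejected) according to has_evidence."""
    accepted = [row for row in rows if has_evidence(row)]
    rejected = [row for row in rows if not has_evidence(row)]
    return accepted, rejected
```

Rejected rows are natural candidates for `exceptions_escalation.csv` with `issue_type = insufficient_evidence`.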
## Quality Bar (Acceptance Criteria)
Work is accepted only if:
1. CSV format is valid and consistent (no broken delimiters/headers).
2. At least 98% of rows contain every field required for their chosen status.
3. A random QA sample of 75 rows has:
   - at most 2 minor issues,
   - 0 critical mismatches (wrong company site, wrong county assignment).
4. All uncertain records are cleanly routed into `exceptions_escalation.csv`.
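
The numeric thresholds above can be sketched as a small self-check. All names here are hypothetical, and the mapping from status to required fields is not defined in this posting, so the caller supplies it through `required_for`. Issue counts still come from a human review of the sample.

```python
import random


def completeness_ratio(rows, required_for):
    """Share of rows whose required fields (per required_for(row)) are non-blank."""
    if not rows:
        return 0.0
    complete = sum(
        all((row.get(col) or "").strip() for col in required_for(row))
        for row in rows
    )
    return complete / len(rows)


def draw_qa_sample(rows, size=75, seed=0):
    """Reproducible random sample of up to `size` rows for manual QA."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


def passes_quality_bar(ratio, minor_issues, critical_mismatches):
    """Apply the stated thresholds: >= 98% complete, <= 2 minor, 0 critical."""
    return ratio >= 0.98 and minor_issues <= 2 and critical_mismatches == 0
```

For a file with 717 rows, the 98% threshold works out to at least 703 complete rows, since 702/717 is about 97.9%.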
## Suggested Workflow for Contractor
1. Submit a 25-row pilot first for calibration.
2. Incorporate feedback.
3. Work through the rest of the dataset in batches of 100 rows.
4. Deliver each batch as a checkpoint until the full dataset is complete.
## What We Provide
- Base CSV export from our current database.
- Definitions and examples.
- Clarification on edge cases during project.
## Proposal Requirements
When applying, include:
1. Similar data-cleaning/lead-enrichment project examples.
2. Your QA method to prevent wrong website associations.
3. Expected turnaround time for 717 rows.
4. Confirmation you can deliver strict evidence-backed county normalization.
## Nice-to-Have Skills
- Experience with contractor/service directories
- Strong web research and entity resolution skills
- Ability to reason about geography and county boundaries
Client: Spent $2,953.23 | Rating: 5.0 | Verified