Python Data Engineer – Event Data Aggregation Pipeline
Upwork | US | Not specified | Expert
Python | ETL Pipeline | Data Analysis
Project Overview
We are building a global event discovery platform for the social dance community.
The platform aggregates events such as milongas, classes, festivals, workshops, and dance gatherings from publicly available event listings and organizes them into a structured database.
We are looking for a Python Data Engineer to build a reliable pipeline that collects, normalizes, and deduplicates event data from multiple online sources.
This is not a simple scripting task: we need someone who can design a robust, scalable data pipeline that processes large datasets from multiple sources.
Data Sources
Event information will come from a variety of publicly available sources, including:
Event listing websites
Community calendars
Organizer websites
Dance community portals
Other public event listings
Initially we will provide a list of approximately 15–30 source websites.
Each source may contain hundreds to several thousand event listings, including recurring events, classes, festivals, and workshops.
Because these sources have different structures, the pipeline should support source-specific extraction logic while maintaining a unified output schema.
The architecture should allow additional sources to be added easily in the future.
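One way to keep source-specific logic separate from the shared schema is a small extractor registry, sketched below. The names `EXTRACTORS`, `register`, `run`, the `example_calendar` source, and its pipe-delimited input format are all hypothetical and chosen only for illustration; the posting does not prescribe a design.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical registry: maps a source name to its extraction function.
Extractor = Callable[[str], Iterable[dict]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(source: str) -> Callable[[Extractor], Extractor]:
    """Decorator that adds a source-specific extractor to the registry."""
    def wrap(func: Extractor) -> Extractor:
        EXTRACTORS[source] = func
        return func
    return wrap


@register("example_calendar")  # hypothetical source name
def extract_example_calendar(raw: str) -> Iterator[dict]:
    # Assumed input for illustration: one event per line, "name | date | city".
    for line in raw.strip().splitlines():
        name, date, city = (part.strip() for part in line.split("|"))
        yield {"event_name": name, "start_date": date, "city": city}


def run(source: str, raw: str) -> List[dict]:
    """Run one source's extractor and tag each record with its source."""
    return [dict(record, source=source) for record in EXTRACTORS[source](raw)]
```

Under this sketch, adding a source means writing one decorated function; the rest of the pipeline only sees dictionaries with the shared field names.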
Required Output Fields
Each event record should contain structured fields such as:
Event Name
Event Type (Festival, Marathon, Encuentro, Milonga, Class, etc.)
Start Date
End Date
Start Time
End Time
Venue Name
Venue Address
City
Country
Organizer Name
Organizer Contact
Organizer Website
Source URL
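The fields above could be captured in a single record type, as in the sketch below. The class name `EventRecord`, the snake_case field names, the choice of which fields are mandatory, and the string formats noted in comments are assumptions, not requirements from this posting.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EventRecord:
    # Assumed to be mandatory in this sketch; everything else may be missing.
    event_name: str
    source_url: str
    event_type: Optional[str] = None      # e.g. "Festival", "Milonga", "Class"
    start_date: Optional[str] = None      # assumed ISO 8601, e.g. "2030-05-01"
    end_date: Optional[str] = None
    start_time: Optional[str] = None      # assumed 24-hour "HH:MM"
    end_time: Optional[str] = None
    venue_name: Optional[str] = None
    venue_address: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None
    organizer_name: Optional[str] = None
    organizer_contact: Optional[str] = None
    organizer_website: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of fields that still have no value."""
        return [name for name, value in asdict(self).items() if value is None]
```

A helper like `missing_fields` gives later stages a simple way to decide whether a record needs a second look at related pages.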
If some information is missing from the primary page, the system should attempt to extract it from related pages when available.
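Filling gaps from related pages can be treated as a merge in which the primary page always wins. The sketch below assumes each page has already been parsed into a dictionary of field values; the function name `fill_missing` and the precedence rule (earlier related pages before later ones) are illustrative choices.

```python
from typing import Dict, Iterable, Optional

Record = Dict[str, Optional[str]]


def fill_missing(primary: Record, related: Iterable[Record]) -> Record:
    """Return a copy of primary with empty fields filled from related records.

    Values already present on the primary page are never overwritten, and
    earlier related pages take precedence over later ones.
    """
    merged = dict(primary)
    for page in related:
        for key, value in page.items():
            if value and not merged.get(key):
                merged[key] = value
    return merged
```

For example, a venue address found on an organizer's contact page would be copied in only if the event page itself had none.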
Data Processing Requirements
The system should include the following components.
Data Collection Layer
Ability to extract event data from multiple websites with different page structures.
Preferred technologies:
Python
Scrapy
Playwright or Selenium when needed
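As an illustration of the extraction step, the sketch below pulls schema.org `Event` objects out of JSON-LD script tags using only the Python standard library. Whether any of the provided sources embed JSON-LD is an assumption; sites without it would need selector-based parsing instead. In a Scrapy project the same logic would sit in a spider callback, and pages that need JavaScript rendering would be fetched with Playwright or Selenium first.

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonLdEventParser(HTMLParser):
    """Collects schema.org Event objects from JSON-LD script tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.events: List[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                payload = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks rather than fail the whole page
            items = payload if isinstance(payload, list) else [payload]
            self.events += [
                item for item in items
                if isinstance(item, dict) and item.get("@type") == "Event"
            ]


def extract_events(html: str) -> List[dict]:
    """Return all JSON-LD Event objects found in an HTML document."""
    parser = JsonLdEventParser()
    parser.feed(html)
    parser.close()
    return parser.events
```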
Data Normalization
Events must be standardized into a consistent schema.
Examples include:
Normalizing date formats
Standardizing event categories
Cleaning venue names
Standardizing city and country names
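The normalization steps listed above might look like the following minimal sketch. The accepted date formats, the category alias table, and the fallback category `"Other"` are assumptions; in particular, `"%d/%m/%Y"` assumes day-first dates, so the format list would likely need to be configured per source.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats; month names are parsed using the current locale.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%B %d, %Y", "%d %B %Y")

# Hypothetical alias table mapping raw labels to canonical event types.
CATEGORY_ALIASES = {
    "festival": "Festival",
    "marathon": "Marathon",
    "encuentro": "Encuentro",
    "milonga": "Milonga",
    "class": "Class",
    "classes": "Class",
    "workshop": "Workshop",
}


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = " ".join(raw.split())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_category(raw: str) -> str:
    """Map a raw category label to a canonical one, defaulting to "Other"."""
    return CATEGORY_ALIASES.get(raw.strip().lower(), "Other")


def clean_name(raw: str) -> str:
    """Collapse runs of whitespace and trim the ends of a venue or city name."""
    return " ".join(raw.split())
```

Returning `None` for an unrecognized date, rather than guessing, lets the validation step decide what to do with the record.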
Deduplication
The system must detect duplicate events coming from different sources.
Examples:
The same event listed on multiple websites
Recurring events listed multiple times
Signals for duplicate detection may include:
event name
location
date
organizer
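One simple way to combine these signals is to require an exact match on date and city and a fuzzy match on the event name, as sketched below with the standard library's `difflib`. The 0.85 threshold, the field names, and the decision to ignore the organizer are assumptions that would need tuning against real data.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import Iterable, List


def canonical(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(re.sub(r"[^a-z0-9]+", " ", ascii_text.lower()).split())


def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Same date and city, and event names that are nearly identical."""
    if a["start_date"] != b["start_date"]:
        return False
    if canonical(a["city"]) != canonical(b["city"]):
        return False
    name_a, name_b = canonical(a["event_name"]), canonical(b["event_name"])
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold


def deduplicate(records: Iterable[dict]) -> List[dict]:
    """Keep the first record of each duplicate group (quadratic; fine for small sets)."""
    kept: List[dict] = []
    for record in records:
        if not any(is_duplicate(record, other) for other in kept):
            kept.append(record)
    return kept
```

Because the date must match, occurrences of a recurring event on different dates are kept as separate records in this sketch; collapsing them into a single series would be a separate design decision.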
Data Validation
The pipeline should filter out:
past events
incomplete records
clearly invalid data
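These filters can be expressed as a single check that returns a rejection reason, which also makes it easy to log why a record was dropped. The required-field set below is an assumed minimum, and dates are assumed to be ISO 8601 strings by this stage.

```python
from datetime import date
from typing import Optional

# Assumed minimum for a usable record; the real list may differ.
REQUIRED_FIELDS = ("event_name", "start_date", "city", "source_url")


def validation_error(record: dict, today: date) -> Optional[str]:
    """Return a reason to reject the record, or None if it is valid."""
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            return f"missing {field}"
    try:
        start = date.fromisoformat(record["start_date"])
        end = date.fromisoformat(record.get("end_date") or record["start_date"])
    except ValueError:
        return "unparseable date"
    if end < start:
        return "end date before start date"
    if end < today:
        return "past event"
    return None
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.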
Structured Output
The final output should be clean and ready for database ingestion.
Example formats:
CSV
JSON
direct database insertion
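The three output options can share one record format, as in the sketch below. The field list is abbreviated for brevity, SQLite stands in for whatever database is actually used, and treating `source_url` as a unique key is an assumption that would break for listing pages that hold several events under one URL.

```python
import csv
import io
import json
import sqlite3
from typing import Iterable, List

FIELDS = ["event_name", "start_date", "city", "source_url"]  # abbreviated field list


def to_csv(records: Iterable[dict]) -> str:
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json_lines(records: Iterable[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False, sort_keys=True) for record in records
    )


def to_sqlite(records: List[dict], conn: sqlite3.Connection) -> int:
    """Insert records, skipping rows whose source_url already exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_name TEXT, start_date TEXT, city TEXT, source_url TEXT UNIQUE)"
    )
    rows = [tuple(record.get(field) for field in FIELDS) for record in records]
    conn.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

`INSERT OR IGNORE` makes repeated runs idempotent, which matters when the same sources are collected on a schedule.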
Data Volume Expectations
Each source may contain hundreds or thousands of event listings, so the system should be designed to process large datasets efficiently.
Once multiple sources are combined, the dataset may contain tens of thousands of events.
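At tens of thousands of events, comparing every record with every other record for deduplication becomes expensive. A common mitigation is blocking: only records that share a cheap key are compared. The sketch below uses start date plus city as the block key, which is an assumption; it would miss duplicates whose city is spelled differently unless names are standardized first.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterator, List, Tuple


def block_key(record: dict) -> Tuple[str, str]:
    """Cheap grouping key: records in different blocks are never compared."""
    return (record["start_date"], record["city"].strip().lower())


def candidate_pairs(records: List[dict]) -> Iterator[Tuple[int, int]]:
    """Yield index pairs of records that share a block key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[block_key(record)].append(index)
    for indices in blocks.values():
        yield from combinations(indices, 2)
```

Only the pairs yielded here need the more expensive fuzzy comparison, so the cost grows with the size of each block instead of the size of the whole dataset.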
Skills
Required
Python
data extraction from web sources
data cleaning and normalization
deduplication strategies
building data pipelines
Strongly Preferred
Scrapy
Playwright
experience with multi-source data aggregation
experience building scalable ETL pipelines
Architecture Question (Required in Proposal)
In your proposal, please briefly explain how you would design the data pipeline for this project.
Your answer should address:
how you would structure the pipeline to support multiple source websites
how you would normalize data from different site structures
how you would detect duplicate events across multiple sources
how the system would allow new sources to be added easily
A short explanation (5–10 sentences) is sufficient.
Technical Validation
Short-listed candidates may be asked to demonstrate their approach by extracting event data from one sample source website and delivering a structured dataset.
If we request this task, it will be paid.
The goal is to demonstrate data quality and pipeline design, not to complete production work.
Application Requirement
To confirm you read the full description, please begin your proposal with the word:
MILONGA
Proposals that omit this word or do not answer the architecture question may not be considered.