Python Data Engineer – Event Data Aggregation Pipeline

Upwork · US · Not specified · Expert
Python, ETL Pipeline, Data Analysis
Project Overview

We are building a global event discovery platform for the social dance community. The platform aggregates events such as milongas, classes, festivals, workshops, and dance gatherings from publicly available event listings and organizes them into a structured database.

We are looking for a Python Data Engineer to build a reliable pipeline that collects, normalizes, and deduplicates event data from multiple online sources. This is not a simple script. We are looking for someone who can design a robust and scalable data pipeline capable of processing large datasets from multiple sources.

Data Sources

Event information will come from a variety of publicly available sources, including:
- Event listing websites
- Community calendars
- Organizer websites
- Dance community portals
- Other public event listings

Initially we will provide a list of approximately 15–30 source websites. Each source may contain hundreds to several thousand event listings, including recurring events, classes, festivals, and workshops.

Because these sources have different structures, the pipeline should support source-specific extraction logic while maintaining a unified output schema. The architecture should allow additional sources to be added easily in the future.

Required Output Fields

Each event record should contain structured fields such as:
- Event Name
- Event Type (Festival, Marathon, Encuentro, Milonga, Class, etc.)
- Start Date
- End Date
- Start Time
- End Time
- Venue Name
- Venue Address
- City
- Country
- Organizer Name
- Organizer Contact
- Organizer Website
- Source URL

If some information is missing from the primary page, the system should attempt to extract it from related pages when available.

Data Processing Requirements

The system should include the following components.

Data Collection Layer

Ability to extract event data from multiple websites with different page structures.
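As an illustration of source-specific extraction feeding a unified output schema, a candidate's design might look roughly like the sketch below. It is only one possible approach, not a prescribed design: the `Event` fields are a subset of the Required Output Fields, and the names `SourceAdapter`, `register`, `ExampleCalendarAdapter`, `ExamplePortalAdapter`, and the raw record keys are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Type


@dataclass
class Event:
    """Unified output schema (a subset of the Required Output Fields)."""
    name: str
    source_url: str
    event_type: Optional[str] = None
    start_date: Optional[str] = None  # raw text here; normalized in a later stage
    city: Optional[str] = None
    country: Optional[str] = None


class SourceAdapter(ABC):
    """Hypothetical base class: one subclass per source website."""
    source_id: str

    @abstractmethod
    def parse(self, raw: dict) -> Event:
        """Map one raw, source-specific record onto the unified schema."""


# Registry of adapters keyed by source id (assumed design, not a requirement).
ADAPTERS: Dict[str, Type[SourceAdapter]] = {}


def register(cls: Type[SourceAdapter]) -> Type[SourceAdapter]:
    """Class decorator: adding a new source only means adding a new class."""
    ADAPTERS[cls.source_id] = cls
    return cls


@register
class ExampleCalendarAdapter(SourceAdapter):
    source_id = "example_calendar"  # hypothetical source

    def parse(self, raw: dict) -> Event:
        return Event(
            name=raw["title"].strip(),
            source_url=raw["url"],
            start_date=raw.get("when"),
            city=raw.get("town"),
        )


@register
class ExamplePortalAdapter(SourceAdapter):
    source_id = "example_portal"  # hypothetical source with different field names

    def parse(self, raw: dict) -> Event:
        return Event(
            name=raw["event_name"].strip(),
            source_url=raw["link"],
            event_type=raw.get("category"),
            start_date=raw.get("date"),
            city=raw.get("location", {}).get("city"),
        )


def run(source_id: str, raw_records: Iterable[dict]) -> Iterator[Event]:
    """Look up the adapter for a source and yield unified Event records."""
    adapter = ADAPTERS[source_id]()
    for raw in raw_records:
        yield adapter.parse(raw)
```

In this sketch the fetching step (for example a Scrapy spider or a Playwright session) is left out on purpose; the adapters only receive already-extracted raw records, so the schema mapping can be tested without network access.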
Preferred technologies:
- Python
- Scrapy
- Playwright or Selenium when needed

Data Normalization

Events must be standardized into a consistent schema. Examples include:
- Normalizing date formats
- Standardizing event categories
- Cleaning venue names
- Standardizing city and country names

Deduplication

The system must detect duplicate events coming from different sources. Examples:
- The same event listed on multiple websites
- Recurring events listed multiple times

Possible signals for duplicate detection may include:
- event name
- location
- date
- organizer

Data Validation

The pipeline should filter out:
- past events
- incomplete records
- clearly invalid data

Structured Output

The final output should be clean and ready for database ingestion. Example formats:
- CSV
- JSON
- direct database insertion

Data Volume Expectations

Each source may contain hundreds or thousands of event listings, so the system should be designed to process large datasets efficiently. Once multiple sources are combined, the dataset may contain tens of thousands of events.

Preferred Skills

Required
- Python
- data extraction from web sources
- data cleaning and normalization
- deduplication strategies
- building data pipelines

Strongly Preferred
- Scrapy
- Playwright
- experience with multi-source data aggregation
- experience building scalable ETL pipelines

Architecture Question (Required in Proposal)

In your proposal, please briefly explain how you would design the data pipeline for this project. Your answer should address:
- how you would structure the pipeline to support multiple source websites
- how you would normalize data from different site structures
- how you would detect duplicate events across multiple sources
- how the system would allow new sources to be added easily

A short explanation (5–10 sentences) is sufficient.

Technical Validation

Short-listed candidates may be asked to demonstrate their approach by extracting event data from one sample source website and delivering a structured dataset. This task will be paid if requested.
The goal is to demonstrate data quality and pipeline design, not to complete production work.

Application Requirement

To confirm you read the full description, please begin your proposal with the word: MILONGA

Proposals that do not include this word and do not answer the architecture question may not be considered.
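For reference, the normalization, deduplication, and validation requirements listed under Data Processing Requirements could be combined along the lines of the minimal sketch below. Everything in it is an assumption made for illustration: the accepted date formats (day-first for slash-separated dates), the dictionary-based record shape, the choice of name, city, and start date as the duplicate signal, and the function names `normalize_date`, `normalize_text`, `dedup_key`, and `clean`.

```python
import re
import unicodedata
from datetime import date, datetime
from typing import Dict, Iterable, List, Optional

# Assumed input formats; real sources would likely need more (and locale handling).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_date(text: str) -> Optional[date]:
    """Try each assumed format in turn; return None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def normalize_text(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def dedup_key(event: dict) -> tuple:
    """Exact-match key built from name, city, and start date (assumed signals)."""
    return (
        normalize_text(event["name"]),
        normalize_text(event.get("city") or ""),
        event["start_date"],
    )


def clean(events: Iterable[dict], today: date) -> List[dict]:
    """Drop incomplete, invalid, and past events, then keep one record per key."""
    seen: Dict[tuple, dict] = {}
    for raw in events:
        start = normalize_date(raw.get("start_date") or "")
        if not raw.get("name") or start is None:
            continue  # incomplete record or unparseable date
        if start < today:
            continue  # past event
        event = {**raw, "start_date": start}
        seen.setdefault(dedup_key(event), event)  # first occurrence wins
    return list(seen.values())
```

A usage example under the same assumptions:

```python
records = [
    {"name": "Milonga del Corazón", "city": "Lisbon", "start_date": "2030-02-01"},
    {"name": "MILONGA DEL CORAZON!", "city": "lisbon", "start_date": "01/02/2030"},
]
unique = clean(records, today=date(2030, 1, 1))  # the two listings collapse into one
```

Exact-key matching only catches duplicates whose names differ by case, accents, or punctuation. Listings with differently worded titles would need a fuzzier comparison, possibly using the organizer as an additional signal.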