AI + Data Automation Project: Build Enriched School Directory Dataset (2,400 Schools)
Upwork | US | Not specified | Intermediate
Data Scraping, Python, Web Scraping, Data Extraction, Automation, Artificial Intelligence, Data Processing
Project Overview
We are building a comprehensive online directory of schools in Indiana and are seeking a developer to help automate the enrichment of a base dataset.
We currently have a spreadsheet of approximately 2,400 schools sourced from the Indiana Department of Education that includes:
• School name
• Address
• City / State / ZIP
• Website URL (present for ~95% of schools)
• Basic identifying information
The goal of this project is to programmatically enrich the dataset, generate standardized content for each school, and collect associated images so the data can be imported into a website directory.
This is a fixed-scope automation project, not a manual research task.
⸻
Technical Context
To help clarify scope:
• A base spreadsheet of ~2,400 schools will be provided
• ~95% already include school website URLs
• Many school logos already exist on another website and can likely be reused
• The objective is to build an automated pipeline, not manual data entry
• Public datasets may be used to populate structured fields before scraping school websites
Developers may use any combination of:
• public education datasets
• web scraping
• LLM extraction
• automation pipelines
⸻
Scope of Work
1. Dataset Enrichment
Using school websites and publicly available education datasets, populate additional structured fields where available.
Examples include:
• Grade levels served
• Enrollment size
• School type (public, charter, private)
• Academic highlights or indicators
• Student demographics (if available)
• Extracurricular programs
• Updated contact information
Public datasets (DOE, NCES, etc.) may be used to populate factual fields where possible.
School websites should then be used to extract descriptive information.
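As a rough illustration of the first step (filling factual fields from a public dataset before any scraping), the sketch below joins two lists of rows on a normalized school name plus 5-digit ZIP. The column names (`name`, `zip`, `grades`) and both helper functions are hypothetical; real DOE or NCES exports use their own column names and would ideally be joined on an official school ID when one is present.

```python
import re


def normalize_key(name: str, zip_code: str) -> tuple[str, str]:
    """Build a join key from a school name and ZIP code."""
    # Lowercase, drop punctuation, and collapse whitespace so minor
    # formatting differences between sources do not block a match.
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Keep only the 5-digit ZIP so "46201-1234" matches "46201".
    return cleaned, zip_code.strip()[:5]


def enrich(base_rows, public_rows, fields):
    """Copy `fields` from public_rows into the matching base_rows."""
    lookup = {normalize_key(r["name"], r["zip"]): r for r in public_rows}
    enriched = []
    for row in base_rows:
        match = lookup.get(normalize_key(row["name"], row["zip"]), {})
        merged = dict(row)
        for field in fields:
            # Leave the field empty when no public record matched.
            merged[field] = match.get(field, "")
        enriched.append(merged)
    return enriched
```

Rows that come back with empty fields are the ones that would fall through to website scraping or LLM extraction. Exact-key matching of this kind will miss variants such as "St." versus "Saint", so a fuzzy-matching pass may still be needed.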
⸻
2. AI Content Generation
Generate four original content sections (~180–220 words each) for each school using the collected information.
Sections include:
1. School Overview
2. Academics & Performance
3. Student Life & Activities
4. Community & Student Body
Requirements:
• Content must be original and not copied from school websites
• Informational and neutral tone suitable for a directory
• Consistent formatting across schools
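The length and consistency requirements lend themselves to an automated check after generation. The following is a minimal sketch, assuming each school's generated text is held in a dict keyed by section name; `check_sections` and the strict 180/220 bounds are illustrative choices (the posting gives the range only as approximate), and the check does not test originality or tone.

```python
SECTION_NAMES = (
    "School Overview",
    "Academics & Performance",
    "Student Life & Activities",
    "Community & Student Body",
)


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def check_sections(sections: dict, low: int = 180, high: int = 220) -> list:
    """Return a list of problems; an empty list means all four sections pass."""
    problems = []
    for name in SECTION_NAMES:
        text = sections.get(name, "")
        if not text.strip():
            problems.append(f"{name}: missing")
            continue
        count = word_count(text)
        if not low <= count <= high:
            problems.append(f"{name}: {count} words")
    return problems
```

Schools whose problem list is non-empty could be queued for regeneration rather than fixed by hand, which keeps the pipeline automated.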
⸻
3. Image Collection
School Logos
Collect a square logo image for each school.
Note: An existing website has already compiled a large number of school logos and can likely be used as the primary source.
Banner Images
Collect a landscape banner image representing the school (campus, building, students, etc.) where available.
Images should be web quality and reasonably cropped.
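One way to screen collected images against the "square logo" and "landscape banner" requirements is to classify them by pixel dimensions. The sketch below works on a width and height pair only, so it assumes the dimensions have already been read with an image library of your choice; the function names and the 5% squareness tolerance are assumptions, not requirements from this posting.

```python
def classify_shape(width: int, height: int, tolerance: float = 0.05) -> str:
    """Label an image as 'square', 'landscape', or 'portrait'."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    # Treat near-square images as square so a few pixels of padding pass.
    if abs(ratio - 1.0) <= tolerance:
        return "square"
    return "landscape" if ratio > 1.0 else "portrait"


def center_square_box(width: int, height: int) -> tuple:
    """Return (left, top, right, bottom) for a centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side
```

A logo candidate that is not square could be cropped with the box from `center_square_box`, while a banner candidate classified as "portrait" would likely be rejected and replaced.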
⸻
4. Data Formatting
Final dataset should be structured and ready for import into a CMS directory.
Preferred formats:
• Excel (.xlsx)
• CSV
• Structured JSON
Each school record should include:
• Original dataset fields
• Enriched data fields
• Four generated content sections
• Logo image or URL
• Banner image or URL
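To make the record layout concrete, here is a minimal sketch of a CSV export with one row per school. The column names in `COLUMNS` are hypothetical placeholders; the actual names would follow the base spreadsheet and whatever the target CMS importer expects.

```python
import csv
import io

# Hypothetical column order: original fields, enriched fields,
# the four content sections, then the two image references.
COLUMNS = [
    "name", "address", "city", "state", "zip", "website",
    "grade_levels", "enrollment", "school_type",
    "overview", "academics", "student_life", "community",
    "logo_url", "banner_url",
]


def to_csv(records) -> str:
    """Serialize school records (dicts) to CSV text with a fixed header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="",              # missing fields become empty cells
        extrasaction="ignore",   # silently drop keys not in COLUMNS
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Fixing the column order up front keeps every row aligned even when a school is missing some enriched fields, which tends to matter for CMS importers that map columns by position or exact header name.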
⸻
Dataset Size
Approximately 2,400 schools
⸻
Deliverables
1. Enriched dataset for all schools
2. Four generated content sections per school
3. Logo images or URLs
4. Banner images or URLs
5. Clean spreadsheet ready for CMS import
⸻
Timeline
Preferred completion within 2–4 weeks.
⸻
Ideal Experience
We are looking for developers experienced with:
• data scraping and automation
• dataset enrichment pipelines
• AI content generation workflows
• Python-based data processing
Experience with education datasets or directory websites is a plus.
⸻
Proposal Requirements
Please include:
1. A brief description of your technical approach
2. Estimated timeline
3. Fixed project cost
4. Example of a similar scraping or data automation project
⸻
Important
To confirm you have read the project description, please begin your proposal by answering this question:
Given that we already have website URLs for ~95% of schools, what approach would you use to efficiently extract structured information from those sites?
Generic proposals that do not answer this question will not be considered.