ML Engineer for Synthetic Data Generation
Upwork | HR | Not specified | Intermediate | Score: 56
SAS | Machine Learning | Python | Data Science
We have a real survey dataset (~10,000 respondents, ~100 variables). Variables are primarily Likert/ordinal items plus some categorical/demographic fields. We want an MVP that generates synthetic respondents (new rows) using tabular synthetic data methods (e.g., SDV, CTGAN, TVAE, Gaussian copula / copula models). This is not an LLM/chatbot project.
Must-have deliverables (MVP)
Data ingestion + schema
Read CSV + schema (variable types: ordinal/categorical/continuous + allowed ranges)
Basic missing value handling (documented)
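A minimal sketch of the ingestion step, assuming the schema is supplied as a plain dictionary. The column names (`q1`, `q2`, `region`), the `SCHEMA` layout, and the helper names are hypothetical and only illustrate the idea: out-of-range Likert values and unknown categories are turned into missing values, and the share of missing values per column is reported so the handling can be documented.

```python
import io
import pandas as pd

# Hypothetical schema: column name -> variable type and allowed values.
SCHEMA = {
    "q1": {"type": "ordinal", "min": 1, "max": 5},
    "q2": {"type": "ordinal", "min": 1, "max": 7},
    "region": {"type": "categorical", "categories": ["north", "south"]},
}

def load_survey(source, schema):
    """Read a CSV and mark values that violate the schema as missing."""
    df = pd.read_csv(source)
    missing_cols = set(schema) - set(df.columns)
    if missing_cols:
        raise ValueError(f"columns missing from CSV: {sorted(missing_cols)}")
    for col, spec in schema.items():
        if spec["type"] == "ordinal":
            values = pd.to_numeric(df[col], errors="coerce")
            # Keep only values inside the allowed range; the rest become NaN.
            df[col] = values.where(values.between(spec["min"], spec["max"]))
        elif spec["type"] == "categorical":
            # Keep only known categories; the rest become NaN.
            df[col] = df[col].where(df[col].isin(spec["categories"]))
    return df

def missing_report(df):
    """Share of missing values per column, rounded for reporting."""
    return df.isna().mean().round(3)

# Tiny in-memory example standing in for the real CSV file.
csv_text = "q1,q2,region\n3,7,north\n9,2,south\n4,,west\n"
survey = load_survey(io.StringIO(csv_text), SCHEMA)
report = missing_report(survey)
```

Whether violating values should be dropped, imputed, or kept as an explicit "missing" category is a project decision; the sketch only makes the violations visible.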
Synthetic data generator
Train one model (contractor to recommend which: SDV/CTGAN/TVAE/copula)
Generate N synthetic rows
Enforce constraints:
Likert ranges (e.g., 1–5 / 1–7 only)
categorical values limited to known categories
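One possible way to enforce these constraints is a post-processing pass over the generated rows, sketched below with pandas and NumPy only. This is an assumption about the approach, not a description of how SDV or CTGAN handle constraints internally; the function name, argument layout, and example columns are hypothetical. Likert values are rounded and clipped into the scale, and any category not seen in the real data is resampled from the real column's frequencies.

```python
import numpy as np
import pandas as pd

def enforce_constraints(synth, real, likert_ranges, categories, seed=0):
    """Return a copy of `synth` in which every value is legal.

    likert_ranges: {column: (low, high)}
    categories:    {column: [allowed values]}
    """
    rng = np.random.default_rng(seed)
    out = synth.copy()
    for col, (low, high) in likert_ranges.items():
        # Round to the nearest integer, then clip into the scale.
        # The nullable "Int64" dtype keeps missing values as missing.
        out[col] = out[col].round().clip(low, high).astype("Int64")
    for col, allowed in categories.items():
        bad = ~out[col].isin(allowed)
        if bad.any():
            # Resample illegal values using the real category frequencies.
            freqs = real.loc[real[col].isin(allowed), col].value_counts(normalize=True)
            out.loc[bad, col] = rng.choice(
                freqs.index.to_numpy(), size=int(bad.sum()), p=freqs.to_numpy()
            )
    return out

# Illustrative data only.
real = pd.DataFrame({"q1": [1, 3, 5], "region": ["north", "north", "south"]})
raw_synth = pd.DataFrame({"q1": [0.2, 3.6, 7.9], "region": ["north", "mars", "south"]})
fixed = enforce_constraints(
    raw_synth, real,
    likert_ranges={"q1": (1, 5)},
    categories={"region": ["north", "south"]},
)
```

Clipping piles out-of-range mass onto the end points of the scale, so it is worth checking afterwards whether the extreme categories are over-represented relative to the real data.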
Validation / utility checks
Provide a Jupyter notebook (or script) that outputs:
per-variable distribution comparison (real vs synthetic)
correlation comparison (real vs synthetic)
(If we provide item→scale mapping) Cronbach’s alpha per scale for real vs synthetic
basic leakage checks: duplicates + nearest-neighbor distance sanity check
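The checks listed above could be sketched roughly as follows, assuming all columns have already been encoded as numbers. The function names and the toy data are hypothetical. Total variation distance is used here as one simple choice for the per-variable comparison, and Cronbach's alpha follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).

```python
import numpy as np
import pandas as pd

def total_variation(real_col, synth_col):
    """Total variation distance between two discrete columns (0 = identical)."""
    p = real_col.value_counts(normalize=True)
    q = synth_col.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())

def correlation_gap(real, synth):
    """Largest absolute difference between the two Pearson correlation matrices."""
    return float((real.corr() - synth.corr()).abs().max().max())

def cronbach_alpha(items):
    """Cronbach's alpha for one scale; rows are respondents, columns are items."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1 - item_var / total_var))

def leakage_check(real, synth):
    """Count synthetic rows that copy a real row and summarise nearest-neighbor distance."""
    r = real.to_numpy(dtype=float)
    s = synth.to_numpy(dtype=float)
    # Brute-force pairwise Euclidean distances, shape (n_synth, n_real).
    # Fine for a toy example; at ~10,000 x ~100 this would need batching
    # or a dedicated nearest-neighbor index to stay within memory.
    dists = np.sqrt(((s[:, None, :] - r[None, :, :]) ** 2).sum(axis=2))
    nearest = dists.min(axis=1)
    return {
        "exact_copies": int((nearest == 0).sum()),
        "median_nn_distance": float(np.median(nearest)),
    }

# Illustrative data only.
real = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 2, 3, 5, 5], "c": [1, 3, 3, 4, 5]})
synth = pd.DataFrame({"a": [1, 2, 4, 5], "b": [2, 3, 4, 5], "c": [1, 2, 4, 4]})

tv_a = total_variation(real["a"], synth["a"])
gap = correlation_gap(real, synth)
alpha_real = cronbach_alpha(real)
alpha_synth = cronbach_alpha(synth)
leak = leakage_check(real, synth)
```

Pearson correlation on Likert items is a convenient approximation; Spearman or polychoric correlations may be more appropriate for ordinal data, and that choice is left open here.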
Packaging
Clean Python repo (requirements.txt / poetry / conda)
Reproducible run instructions
Scripts or CLI: train + generate + validate
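A possible shape for the command-line entry point, sketched with the standard library's `argparse`. The program name `synthsurvey`, the flag names, and the default file names are all hypothetical; the actual training, generation, and validation logic is left out.

```python
import argparse

def build_parser():
    """Build a parser with train / generate / validate subcommands."""
    parser = argparse.ArgumentParser(
        prog="synthsurvey", description="Synthetic survey data MVP (sketch)."
    )
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="fit a model on the real survey CSV")
    train.add_argument("--data", required=True, help="path to the real CSV")
    train.add_argument("--schema", required=True, help="path to the schema file")
    train.add_argument("--model-out", default="model.pkl")

    gen = sub.add_parser("generate", help="sample synthetic respondents")
    gen.add_argument("--model", default="model.pkl")
    gen.add_argument("--rows", type=int, default=1000, help="number of rows N")
    gen.add_argument("--out", default="synthetic.csv")
    gen.add_argument("--seed", type=int, default=0, help="seed for reproducible runs")

    val = sub.add_parser("validate", help="compare real and synthetic data")
    val.add_argument("--real", required=True)
    val.add_argument("--synthetic", required=True)
    val.add_argument("--report", default="report.json")
    return parser

# Parsing an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["generate", "--rows", "500"])
```

Exposing a `--seed` flag is one simple way to support the reproducible-run requirement, provided the underlying model honours the seed.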
Nice-to-have (only if time permits)
Conditional generation by demographics (e.g., generate rows conditioned on age group/region)
Export report as HTML
Skills we’re looking for
Python, Pandas, NumPy, scikit-learn, SDV, CTGAN/TVAE, copulas, synthetic tabular data, Jupyter.
Bonus: psychometrics, Likert/ordinal modeling, Cronbach’s alpha, factor analysis.
Proposal requirements
Please include:
Relevant examples (synthetic tabular data / SDV / CTGAN)
Your recommended approach (CTGAN vs copula etc.) and why
Budget: Fixed price preferred; MVP only (no UI required).
Client
Spent: $13,028.88 | Rating: 5.0 | Verified