ML Engineer for Synthetic Data Generation

Upwork · HR · Not specified · intermediate · Score: 56
SAS · Machine Learning · Python · Data Science
We have a real survey dataset (~10,000 respondents, ~100 variables). Variables are primarily Likert/ordinal items plus some categorical/demographic fields. We want an MVP that generates synthetic respondents (new rows) using tabular synthetic data methods (e.g., SDV, CTGAN, TVAE, Gaussian copula / copula models). This is not an LLM/chatbot project.

Must-have deliverables (MVP)

Data ingestion + schema
- Read CSV + schema (variable types: ordinal/categorical/continuous + allowed ranges)
- Basic missing value handling (documented)

Synthetic data generator
- Train one model (contractor recommends: SDV/CTGAN/TVAE/copula)
- Generate N synthetic rows
- Enforce constraints:
  - Likert ranges (e.g., 1–5 / 1–7 only)
  - categorical values limited to known categories

Validation / utility checks
Provide a Jupyter notebook (or script) that outputs:
- per-variable distribution comparison (real vs synthetic)
- correlation comparison (real vs synthetic)
- (If we provide item→scale mapping) Cronbach’s alpha per scale for real vs synthetic
- basic leakage checks: duplicates + nearest-neighbor distance sanity check

Packaging
- Clean Python repo (requirements.txt / poetry / conda)
- Reproducible run instructions
- Scripts or CLI: train + generate + validate

Nice-to-have (only if time permits)
- Conditional generation by demographics (e.g., generate rows conditioned on age group/region)
- Export report as HTML

Skills we’re looking for
Python, Pandas, NumPy, scikit-learn, SDV, CTGAN/TVAE, copulas, synthetic tabular data, Jupyter. Bonus: psychometrics, Likert/ordinal modeling, Cronbach’s alpha, factor analysis.

Proposal requirements
Please include:
- Relevant examples (synthetic tabular data / SDV / CTGAN)
- Your recommended approach (CTGAN vs copula etc.) and why

Budget: Fixed price preferred; MVP only (no UI required).
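
For illustration only, the constraint and validation items in the deliverables above (Likert range enforcement, Cronbach's alpha, duplicate and nearest-neighbor leakage checks) could look roughly like the sketch below. It is not part of the client's specification: the function names are hypothetical, the toy arrays are made up, and it uses plain NumPy rather than any particular synthetic data library.

```python
import numpy as np

def clip_likert(values, low=1, high=5):
    """Round to the nearest integer (halves go to even) and clip into [low, high]."""
    return np.clip(np.rint(values), low, high).astype(int)

def cronbach_alpha(items):
    """Cronbach's alpha for one scale, given a (respondents x items) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                               # number of items in the scale
    item_var = items.var(axis=0, ddof=1).sum()       # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of the summed score
    return k / (k - 1) * (1.0 - item_var / total_var)

def exact_duplicate_count(real, synth):
    """Count synthetic rows that exactly match some real row."""
    real_rows = {tuple(r) for r in np.asarray(real).tolist()}
    return sum(tuple(s) in real_rows for s in np.asarray(synth).tolist())

def nearest_real_distance(real, synth):
    """Euclidean distance from each synthetic row to its closest real row.

    Brute force: builds an (n_synth, n_real, n_vars) array, so it only
    suits small samples.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    diffs = synth[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

# Toy data (made up): 4 real respondents, 3 raw synthetic rows, 3 Likert items.
real = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 5]])
synth_raw = np.array([[0.6, 2.2, 3.1], [2.0, 3.0, 4.0], [6.3, 4.8, 5.2]])
synth = clip_likert(synth_raw, low=1, high=5)

print("alpha real:", cronbach_alpha(real))
print("alpha synthetic:", cronbach_alpha(synth))
print("exact duplicates:", exact_duplicate_count(real, synth))
print("nearest-real distances:", nearest_real_distance(real, synth))
```

At the full dataset size (~10,000 rows by ~100 variables) the brute-force distance would need to be chunked or replaced by an index-based search such as scikit-learn's `NearestNeighbors`. Euclidean distance on ordinal codes is itself a simplifying assumption.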

Client

Spent: $13,028.88 · Rating: 5.0 · Verified