ML Engineer for Synthetic Data Generation

Upwork | HR | Not specified | intermediate | Score: 56

SAS, Machine Learning, Python, Data Science

We have a real survey dataset (~10,000 respondents, ~100 variables). Variables are primarily Likert/ordinal items plus some categorical/demographic fields. We want an MVP that generates synthetic respondents (new rows) using tabular synthetic data methods (e.g., SDV, CTGAN, TVAE, Gaussian copula / copula models). This is not an LLM/chatbot project.

Must-have deliverables (MVP)

Data ingestion + schema

Read CSV + schema (variable types: ordinal/categorical/continuous + allowed ranges)

Basic missing value handling (documented)
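
As a rough illustration of the ingestion step above, the sketch below reads a CSV against a small schema, treats out-of-range or unknown values as missing, and then applies one simple, documented imputation policy. The schema layout, the column names (q1, region, age), the function name load_and_clean, and the median/mode imputation rule are all assumptions made for this example, not requirements stated in the listing.

```python
import io

import pandas as pd

# Hypothetical schema for illustration; real column names, types, and ranges
# would come from the client's codebook.
SCHEMA = {
    "q1": {"type": "ordinal", "min": 1, "max": 5},
    "region": {"type": "categorical", "values": ["north", "south"]},
    "age": {"type": "continuous", "min": 18, "max": 99},
}


def load_and_clean(csv_source, schema):
    """Read the CSV, null out invalid values, then impute (assumed policy)."""
    # usecols raises if a schema column is absent from the file.
    df = pd.read_csv(csv_source, usecols=list(schema))
    n_missing = {}
    for col, spec in schema.items():
        if spec["type"] == "categorical":
            invalid = ~df[col].isin(spec["values"])
        else:
            df[col] = pd.to_numeric(df[col], errors="coerce")
            invalid = (df[col] < spec["min"]) | (df[col] > spec["max"])
        # Invalid entries become missing so they are handled by one policy.
        df[col] = df[col].where(~invalid)
        n_missing[col] = int(df[col].isna().sum())

        # Assumed policy: mode for categorical, rounded median for ordinal,
        # median for continuous. Counts are returned so the choice is auditable.
        if spec["type"] == "categorical":
            df[col] = df[col].fillna(df[col].mode().iloc[0])
        elif spec["type"] == "ordinal":
            df[col] = df[col].fillna(round(df[col].median())).astype(int)
        else:
            df[col] = df[col].fillna(df[col].median())
    return df, n_missing


csv_text = "q1,region,age\n1,north,30\n7,south,41\n3,east,\n5,north,52\n"
clean, n_missing = load_and_clean(io.StringIO(csv_text), SCHEMA)
print(clean)
print(n_missing)
```

Returning the per-column missing counts alongside the cleaned frame is one way to keep the missing-value handling documented; a different policy (for example, dropping rows) could be swapped in without changing the interface.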

Synthetic data generator

Train one model (contractor to recommend which: SDV, CTGAN, TVAE, or a copula model)

Generate N synthetic rows

Enforce constraints:

Likert ranges (e.g., 1–5 / 1–7 only)

categorical values limited to known categories
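
To make the train / generate / constrain steps above concrete, here is a minimal Gaussian copula sketch written with NumPy and the standard library only. It is a stand-in for illustration, not the SDV, CTGAN, or TVAE API. It assumes integer-coded ordinal columns with no missing values and no constant columns; the function names fit_copula and sample_copula and the toy data are hypothetical. Because each synthetic value is drawn from the values observed in the real column, the Likert range constraint holds by construction in this particular sketch.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd

_norm = NormalDist()
_to_z = np.vectorize(_norm.inv_cdf)  # uniform in (0, 1) -> standard normal score
_to_u = np.vectorize(_norm.cdf)      # standard normal score -> uniform


def fit_copula(real):
    """Estimate the latent correlation matrix from rank-based normal scores."""
    n = len(real)
    # Average ranks handle the heavy ties of Likert items; dividing by n + 1
    # keeps every value strictly inside (0, 1).
    u = real.rank(method="average").to_numpy() / (n + 1)
    return np.corrcoef(_to_z(u), rowvar=False)


def sample_copula(real, corr, n_rows, seed=0):
    """Draw correlated normals, then map each column back to real values."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_rows)
    u = _to_u(z)
    out = {}
    for j, col in enumerate(real.columns):
        sorted_vals = np.sort(real[col].to_numpy())
        # Empirical inverse CDF: index into the sorted real values.
        idx = np.minimum((u[:, j] * len(sorted_vals)).astype(int),
                         len(sorted_vals) - 1)
        out[col] = sorted_vals[idx]
    return pd.DataFrame(out)


# Toy stand-in for the real survey: three correlated 1-5 Likert items.
rng = np.random.default_rng(1)
factor = rng.normal(size=1000)
real = pd.DataFrame({
    f"q{i}": np.clip(np.round(3 + factor + 0.3 * rng.normal(size=1000)), 1, 5).astype(int)
    for i in range(1, 4)
})

corr = fit_copula(real)
synthetic = sample_copula(real, corr, n_rows=500, seed=2)
print(synthetic.describe())
```

With a library model such as CTGAN or TVAE, the output is not guaranteed to stay inside the allowed ranges, so a separate post-generation check against the schema would still be needed there.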

Validation / utility checks

Provide a Jupyter notebook (or script) that outputs:

per-variable distribution comparison (real vs synthetic)

correlation comparison (real vs synthetic)

(If we provide item→scale mapping) Cronbach’s alpha per scale for real vs synthetic

basic leakage checks: duplicates + nearest-neighbor distance sanity check
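
The checks listed above could be sketched roughly as follows. The function names are hypothetical, the distribution comparison uses total variation distance as one possible metric, and the correlation comparison uses Spearman correlation (computed as Pearson on ranks) because the items are ordinal. The sketch assumes numeric-coded columns that are identical in name and order between the real and synthetic frames.

```python
import numpy as np
import pandas as pd


def tvd(real_col, synth_col):
    """Total variation distance between two empirical distributions (0 = identical)."""
    p = real_col.value_counts(normalize=True)
    q = synth_col.value_counts(normalize=True)
    return 0.5 * float(p.subtract(q, fill_value=0.0).abs().sum())


def max_corr_gap(real, synth):
    """Largest absolute difference between the two Spearman correlation matrices."""
    gap = real.rank().corr() - synth.rank().corr()
    return float(gap.abs().to_numpy().max())


def cronbach_alpha(items):
    """Cronbach's alpha for one scale; `items` holds that scale's item columns."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1 - item_var / total_var))


def exact_copies(real, synth):
    """Number of synthetic rows that exactly duplicate some real row."""
    real_rows = set(map(tuple, real.to_numpy().tolist()))
    return sum(tuple(row) in real_rows for row in synth.to_numpy().tolist())


def nearest_real_distance(real, synth, chunk=1000):
    """Euclidean distance from each synthetic row to its closest real row."""
    r = real.to_numpy(dtype=float)
    s = synth.to_numpy(dtype=float)
    r_sq = (r ** 2).sum(axis=1)
    out = np.empty(len(s))
    # Process synthetic rows in chunks so the distance matrix stays small.
    for start in range(0, len(s), chunk):
        block = s[start:start + chunk]
        d2 = (block ** 2).sum(axis=1)[:, None] + r_sq[None, :] - 2.0 * block @ r.T
        out[start:start + chunk] = np.sqrt(np.clip(d2, 0.0, None).min(axis=1))
    return out
```

For the nearest-neighbor check, a common sanity test is to compare these distances against the distances between real rows and a held-out set of other real rows: synthetic rows that sit much closer to the training data than real rows do to each other may indicate memorization. With short Likert scales some exact duplicates can occur by chance, so the duplicate count is best read alongside the number of variables.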

Packaging

Clean Python repo (requirements.txt / poetry / conda)

Reproducible run instructions

Scripts or CLI: train + generate + validate
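
One possible shape for the train + generate + validate entry point is a single script with three subcommands, sketched below with argparse. The program name synthsurvey, every flag name, and the default file names are assumptions for illustration; the branches only echo the parsed arguments where the project's own train, generate, and validate code would be called.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="synthsurvey",  # hypothetical program name
        description="Train, generate, and validate synthetic survey data.",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="fit a model on the real survey CSV")
    train.add_argument("--data", required=True, help="path to the real CSV")
    train.add_argument("--schema", required=True, help="path to the schema file")
    train.add_argument("--model-out", default="model.pkl")

    gen = sub.add_parser("generate", help="sample N synthetic rows")
    gen.add_argument("--model", default="model.pkl")
    gen.add_argument("--rows", type=int, required=True)
    gen.add_argument("--seed", type=int, default=0, help="fixed seed for reproducible runs")
    gen.add_argument("--out", default="synthetic.csv")

    val = sub.add_parser("validate", help="compare real vs synthetic data")
    val.add_argument("--real", required=True)
    val.add_argument("--synthetic", required=True)
    val.add_argument("--report", default="report.html")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder: dispatch to the project's own functions here.
    print(f"{args.command}: {vars(args)}")
    return args
```

In a repo this would typically be called from the usual `if __name__ == "__main__":` guard or registered as a console script, giving run instructions such as `synthsurvey generate --rows 10000 --seed 42`.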

Nice-to-have (only if time permits)

Conditional generation by demographics (e.g., generate rows conditioned on age group/region)

Export report as HTML
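
For the conditional generation item above, one model-agnostic option is rejection sampling: keep drawing batches from any trained generator and retain only the rows that match the requested demographics. The sketch below assumes a sampler callable that returns a DataFrame of n rows; sample_conditional, toy_sampler, and the column names are hypothetical. Some libraries offer built-in conditional sampling, which would usually be more efficient for rare groups.

```python
import numpy as np
import pandas as pd


def sample_conditional(sampler, conditions, n_rows, batch_size=1000, max_batches=50):
    """Draw batches from `sampler` and keep rows matching every condition."""
    kept, total = [], 0
    for _ in range(max_batches):
        batch = sampler(batch_size)
        mask = np.ones(len(batch), dtype=bool)
        for col, value in conditions.items():
            mask &= (batch[col] == value).to_numpy()
        kept.append(batch[mask])
        total += int(mask.sum())
        if total >= n_rows:
            break
    if total < n_rows:
        # Rare conditions may need more batches or a model-native method.
        raise RuntimeError(f"only {total} matching rows after {max_batches} batches")
    return pd.concat(kept, ignore_index=True).head(n_rows)


# Toy stand-in for a trained generator (assumption for this example).
_rng = np.random.default_rng(0)


def toy_sampler(n):
    return pd.DataFrame({
        "region": _rng.choice(["north", "south"], size=n),
        "q1": _rng.integers(1, 6, size=n),  # Likert 1-5
    })


north_only = sample_conditional(toy_sampler, {"region": "north"}, n_rows=200)
print(north_only["region"].value_counts())
```

Rejection sampling leaves the trained model untouched, which keeps the MVP simple, but its cost grows as the conditioned group gets rarer.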

Skills we’re looking for

Python, Pandas, NumPy, scikit-learn, SDV, CTGAN/TVAE, copulas, synthetic tabular data, Jupyter.
Bonus: psychometrics, Likert/ordinal modeling, Cronbach’s alpha, factor analysis.

Proposal requirements

Please include:

Relevant examples (synthetic tabular data / SDV / CTGAN)

Your recommended approach (CTGAN vs copula etc.) and why

Budget: Fixed price preferred; MVP only (no UI required).

Client

Spent: $13,028.88 | Rating: 5.0 | Verified