Analytics Engineer / Data Architect — AI Agency Intelligence System — BigQuery + GCP
Upwork · AU · Budget not specified · Intermediate level
Skills: BigQuery, Python, Machine Learning, Data Modeling, Data Science, Claude
We are a Meta ads performance marketing agency building
an AI intelligence system that connects 5 years of agency data: ad performance,
client communications, creative briefs, financial records, and document history.
We have a detailed architecture plan and an in-house developer building the
pipelines. We need an experienced analytics engineer to review the architecture,
identify problems before we build on them, and advise during the build.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE CORE CHALLENGE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Our data refers to the same entities (clients, ads, people) inconsistently
across sources:
- The same client appears as "Brand X Pty Ltd" in our accounting software, "Brand X AU"
in Meta Ads, "BRANDX" in our spreadsheets, and only by first name in emails.
- The same ad appears as "CLIENT-089 | Product Name | Static | Variant | 1350"
in our tracking sheet and "Client Product Variant V2 Static" in Meta Ads Manager.
- Client contacts are referenced by first name only in email threads.
We need this solved before any data gets entered — not cleaned up after.
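The usual pre-entry fix is a canonical-ID lookup that every pipeline passes raw names through before inserting rows. A minimal Python sketch — the alias entries and the `client_089` ID are invented for illustration; in production the mapping would live in a BigQuery lookup table, not a dict:

```python
import re

# Hypothetical alias table: every known variant maps to one canonical client ID.
CLIENT_ALIASES = {
    "brand x pty ltd": "client_089",   # accounting software
    "brand x au": "client_089",        # Meta Ads
    "brandx": "client_089",            # spreadsheets
}

def normalise(name: str) -> str:
    """Strip punctuation, collapse whitespace, lowercase — so matching is
    insensitive to cosmetic differences between source systems."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name)).strip().lower()

def resolve_client(raw_name: str):
    """Return the canonical client ID, or None so unmatched names can be
    queued for human review instead of silently creating duplicates."""
    return CLIENT_ALIASES.get(normalise(raw_name))
```

Returning None for unknown names (rather than guessing) is the point: unresolved entities surface immediately instead of fragmenting across tables.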
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT WE NEED YOU TO DO
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Architecture review (Week 1)
Review our 13-table BigQuery schema and pipeline plan.
Identify structural problems: entity resolution gaps, join key risks,
schema decisions that will be painful to change at scale.
Deliver: written review with specific recommendations.
2. Entity resolution design (Weeks 1-2)
Design the strategy for resolving the same client, ad, and person
across different source systems.
Build: client aliases reference table, contact directory structure,
ad name canonicalisation rules.
Deliver: documented matching rules + BigQuery lookup tables.
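To make "canonicalisation rules" concrete — using the pipe-delimited tracking-sheet convention from the example above, with the field names themselves being an assumption — a parser might emit one structured record per ad:

```python
from dataclasses import dataclass

@dataclass
class AdRecord:
    client_code: str   # e.g. "CLIENT-089" — the join key back to the client table
    product: str
    format: str        # "Static", "Video", ...
    variant: str
    internal_id: str

def parse_tracking_name(name: str) -> AdRecord:
    """Parse the 'CLIENT-089 | Product Name | Static | Variant | 1350'
    tracking-sheet convention into a structured record."""
    parts = [p.strip() for p in name.split("|")]
    if len(parts) != 5:
        # Fail loudly: a malformed name should go to review, not into BigQuery.
        raise ValueError(f"unexpected ad name format: {name!r}")
    return AdRecord(*parts)
```

The matching rules for the looser Meta Ads Manager names would then map back onto the same `client_code` + `internal_id` keys.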
3. Claude API optimisation (Week 2)
Advise on BigQuery schema decisions that affect LLM query performance:
partitioning, clustering, authorised views for LLM-safe data surfaces.
Deliver: optimisation recommendations + authorised view specifications.
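For context on the kind of recommendation meant here — dataset, table, and column names below are hypothetical — a daily ad-performance table is typically partitioned on the report date and clustered on the entity keys the LLM filters by, with an authorised view exposing only the LLM-safe columns:

```python
# Illustrative BigQuery DDL held as Python strings (as the pipeline code would).
DDL_FACT_TABLE = """
CREATE TABLE ads.meta_daily_performance (
  report_date DATE,
  client_id   STRING,
  ad_id       STRING,
  spend       NUMERIC,
  roas        NUMERIC
)
PARTITION BY report_date        -- prunes scans to the dates queried
CLUSTER BY client_id, ad_id     -- co-locates rows the LLM filters on
"""

DDL_LLM_SAFE_VIEW = """
CREATE VIEW llm_safe.meta_daily_performance AS
SELECT report_date, client_id, ad_id, spend, roas
FROM ads.meta_daily_performance
"""
# The llm_safe dataset's views would then be authorised against the ads
# dataset, so the LLM surface never needs direct table access.
```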
4. Pipeline review during build (Weeks 2-4)
Review each pipeline before it runs on real data.
Available for questions from our developer (a few hours/week).
Flag issues before they become data quality problems.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE DATA SOURCES — WHAT WE ARE CONNECTING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
This is a multi-source architecture with 14+ source systems spanning:
PAID ADVERTISING PERFORMANCE
- Meta / Facebook Ads (API via Windsor): ad-level daily performance —
spend, ROAS, CPA, impressions, frequency, hook rate, hold rate
- Google Ads (API via Windsor): same structure, cross-channel attribution
AD CREATIVE — STATICS AND VIDEO
- Creative intelligence spreadsheet (Google Sheets sync): every ad ever made,
tagged with creative taxonomy (hook type, format, funnel stage, emotion)
- Static images: Claude Vision analysis (text overlay %, faces, UGC aesthetic,
emotional register, visual complexity)
- Video ads: Whisper transcription + FFmpeg frame extraction + Claude Vision
on key frames. Hook line, narrative structure, timestamp-level key moments.
- Asset storage: Google Cloud Storage, linked to creative intelligence records
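For the frame-extraction step, a hedged sketch of the FFmpeg invocation — the scene-change threshold of 0.3 and the example paths are assumptions, not settled parameters:

```python
def scene_frame_command(video_path: str, out_pattern: str,
                        threshold: float = 0.3) -> list:
    """Build an ffmpeg command that writes a frame each time the scene-change
    score exceeds `threshold`; run it with subprocess.run(cmd, check=True)."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})'",  # keep only scene-change frames
        "-vsync", "vfr",                           # one output frame per selected frame
        out_pattern,                               # e.g. "frames/%04d.png"
    ]
```

The extracted frames are what get handed to Claude Vision, so the threshold directly trades API cost against coverage of key moments.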
RECORDED KNOWLEDGE — LOOMS AND ZOOM
- Loom recordings: strategy sessions, training videos, client walkthroughs,
screen-share recordings of account management in action
- Zoom recordings: team meetings, client calls, training sessions
- Both: Whisper transcription + scene-change frame extraction + Claude API
summary and training value scoring. Highest complexity — videos have multiple
linked layers (ad unit, asset, transcript, frames, training moments).
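To make "multiple linked layers" concrete — a sketch of how the recording layers might key into each other, with all field names invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    recording_id: str     # joins back to the parent recording
    text: str

@dataclass
class Frame:
    recording_id: str
    timestamp_s: float
    gcs_uri: str          # pointer into Cloud Storage

@dataclass
class Recording:
    recording_id: str     # primary key every layer joins back to
    source: str           # "loom" | "zoom"
    transcript: Transcript = None
    frames: list = field(default_factory=list)
    training_value: float = None   # Claude-scored; scale is an assumption
```

Every layer carries the same `recording_id`, which is what makes a single ad or training moment traceable across transcript, frames, and scores.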
COMMUNICATIONS
- Gmail (5 team accounts, 5 years): Gmail API backfill + Push Notifications
for real-time. Claude API enrichment: client classification, sentiment,
topic category, summary. LLM-safe authorised view filters noise.
- Slack: historical workspace export + Events API real-time
- WhatsApp: manual export (no API available for personal accounts)
- Future: Google Chat if used
DOCUMENTS AND IP
- Google Drive (all accounts): every document ever created — strategy docs,
SOPs, QBRs, proposals, reports, briefs, meeting notes, brand guidelines,
onboarding documents, research, contracts
- Google Slides: exported per-slide as PNG images, Claude Vision reads each slide
- Intellectual property and proprietary methodology: separate BigQuery dataset
with stricter access controls — NOT exposed to the same LLM surfaces as
operational data
PROJECT MANAGEMENT AND BRIEFS
- Basecamp (API): 5 years of project history, todos, messages, documents
- Creative brief database: Drive/Notion briefs → Claude extraction →
structured brief records linked to the ads they produced
- Motion (creative project management app): API for production time per task,
enabling true cost-per-ad calculation
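The cost-per-ad arithmetic the Motion data enables is simple once task time is captured — roles, rates, and hours below are invented for illustration:

```python
def cost_per_ad(task_hours: dict, hourly_rates: dict, ads_produced: int) -> float:
    """Total production labour cost divided by ads shipped in the period.
    `task_hours` maps a role (e.g. "designer") to hours logged in Motion."""
    total = sum(hours * hourly_rates[role] for role, hours in task_hours.items())
    return total / ads_produced
```

For example, 10 designer hours at $80 plus 5 editor hours at $60 across 10 ads gives a true production cost of $110 per ad.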
FINANCIAL
- Xero (API): full invoice and payment history, retainer revenue by client,
contractor costs, profitability by client
SALES PIPELINE
- 5 years of sales conversations reconstructed from email archive and
founder memory: every prospect, outcome, reason won/lost
CUSTOMER DATA
- Shopify (direct Admin API per client store): full order history, customer LTV,
UTM attribution linking ads to acquired customers
- Klaviyo (API): email marketing performance, list health, revenue per send
The entity resolution challenge spans ALL of these. The same client, person,
and ad appear differently in almost every source. This is the core problem
we need you to solve before data entry begins.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TECH STACK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
BigQuery (primary data warehouse)
Google Cloud Platform (Cloud Functions, Cloud Run, Pub/Sub, Secret Manager)
Python / Node.js (pipeline code — Claude API generates most of it)
Anthropic Claude API (enrichment and intelligence layer)
Google Cloud Storage (creative asset storage)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT WE PROVIDE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
- Full architecture plan and 13-table schema (you review, not design from zero)
- Access to our in-house developer for questions
- Detailed documentation of all data sources and their inconsistencies
- Fast responses (same business day, AEST timezone)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT WE NEED FROM YOU
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
- Production BigQuery experience (not just familiar — have built and run
BigQuery pipelines that other people depend on)
- Experience with entity resolution or data matching across messy real-world
sources (not clean, well-structured datasets)
- Understand the tradeoffs in schema design: partitioning, clustering,
normalisation vs denormalisation for LLM workloads
- Can explain WHY you make architecture decisions, not just implement them
- Available for at least 15 hours in Week 1 for the architecture review
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
DURATION AND RATE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Duration: 3-5 weeks intensive, then ~5 hours/week available for questions
Estimated hours: 40-60 hours total
Rate: negotiable based on experience and location
Start: as soon as possible — we have a sprint beginning shortly
Client: $292,509.93 total spent · 4.8 rating · Verified