
Analytics Engineer / Data Architect — AI Agency Intelligence System — BigQuery + GCP

Upwork · AU · Not specified · Intermediate
Skills: BigQuery, Python, Machine Learning, Data Modeling, Data Science, Claude
We are a Meta ads performance marketing agency building an AI intelligence system that connects 5 years of agency data: ad performance, client communications, creative briefs, financial records, and document history. We have a detailed architecture plan and an in-house developer building the pipelines. We need an experienced analytics engineer to review the architecture, identify problems before we build on them, and advise during the build.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE CORE CHALLENGE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Our data refers to the same entities (clients, ads, people) inconsistently across sources:

- The same client appears as "Brand X Pty Ltd" in our accounting software, "Brand X AU" in Meta Ads, "BRANDX" in our spreadsheets, and only by first name in emails.
- The same ad appears as "CLIENT-089 | Product Name | Static | Variant | 1350" in our tracking sheet and "Client Product Variant V2 Static" in Meta Ads Manager.
- Client contacts are referenced by first name only in email threads.

We need this solved before any data gets entered — not cleaned up after.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT WE NEED YOU TO DO
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Architecture review (Week 1)
   Review our 13-table BigQuery schema and pipeline plan. Identify structural problems: entity resolution gaps, join key risks, schema decisions that will be painful to change at scale.
   Deliver: written review with specific recommendations.

2. Entity resolution design (Weeks 1-2)
   Design the strategy for resolving the same client, ad, and person across different source systems. Build: client aliases reference table, contact directory structure, ad name canonicalisation rules.
   Deliver: documented matching rules + BigQuery lookup tables.

3. Claude API optimisation (Week 2)
   Advise on BigQuery schema decisions that affect LLM query performance: partitioning, clustering, authorised views for LLM-safe data surfaces.
   Deliver: optimisation recommendations + authorised view specifications.

4. Pipeline review during build (Weeks 2-4)
   Review each pipeline before it runs on real data. Be available for questions from our developer (a few hours/week). Flag issues before they become data quality problems.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE DATA SOURCES — WHAT WE ARE CONNECTING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

This is a multi-source architecture with 14+ databases spanning:

PAID ADVERTISING PERFORMANCE
- Meta / Facebook Ads (API via Windsor): ad-level daily performance — spend, ROAS, CPA, impressions, frequency, hook rate, hold rate
- Google Ads (API via Windsor): same structure, cross-channel attribution

AD CREATIVE — STATICS AND VIDEO
- Creative intelligence spreadsheet (Google Sheets sync): every ad ever made, tagged with a creative taxonomy (hook type, format, funnel stage, emotion)
- Static images: Claude Vision analysis (text overlay %, faces, UGC aesthetic, emotional register, visual complexity)
- Video ads: Whisper transcription + FFmpeg frame extraction + Claude Vision on key frames. Hook line, narrative structure, timestamp-level key moments.
- Asset storage: Google Cloud Storage, linked to creative intelligence records

RECORDED KNOWLEDGE — LOOMS AND ZOOM
- Loom recordings: strategy sessions, training videos, client walkthroughs, screen-share recordings of account management in action
- Zoom recordings: team meetings, client calls, training sessions
- Both: Whisper transcription + scene-change frame extraction + Claude API summarisation and training-value scoring. Highest complexity — each video has multiple linked layers (ad unit, asset, transcript, frames, training moments).

COMMUNICATIONS
- Gmail (5 team accounts, 5 years): Gmail API backfill + Push Notifications for real time. Claude API enrichment: client classification, sentiment, topic category, summary. An LLM-safe authorised view filters out noise.
- Slack: historical workspace export + Events API for real time
- WhatsApp: manual export (no API available for personal accounts)
- Future: Google Chat if adopted

DOCUMENTS AND IP
- Google Drive (all accounts): every document ever created — strategy docs, SOPs, QBRs, proposals, reports, briefs, meeting notes, brand guidelines, onboarding documents, research, contracts
- Google Slides: exported per slide as PNG images; Claude Vision reads each slide
- Intellectual property and proprietary methodology: separate BigQuery dataset with stricter access controls — NOT exposed to the same LLM surfaces as operational data

PROJECT MANAGEMENT AND BRIEFS
- Basecamp (API): 5 years of project history, todos, messages, documents
- Creative brief database: Drive/Notion briefs → Claude extraction → structured brief records linked to the ads they produced
- Motion (creative project management app): API for production time per task, enabling true cost-per-ad calculation

FINANCIAL
- Xero (API): full invoice and payment history, retainer revenue by client, contractor costs, profitability by client

SALES PIPELINE
- 5 years of sales conversations reconstructed from the email archive and founder memory: every prospect, outcome, and reason won/lost

CUSTOMER DATA
- Shopify (direct Admin API per client store): full order history, customer LTV, UTM attribution linking ads to acquired customers
- Klaviyo (API): email marketing performance, list health, revenue per send

The entity resolution challenge spans ALL of these sources. The same client, person, and ad appear differently in almost every one. This is the core problem we need you to solve before data entry begins.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TECH STACK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

- BigQuery (primary data warehouse)
- Google Cloud Platform (Cloud Functions, Cloud Run, Pub/Sub, Secret Manager)
- Python / Node.js (pipeline code — Claude API generates most of it)
- Anthropic Claude API (enrichment and intelligence layer)
- Google Cloud Storage (creative asset storage)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT WE PROVIDE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

- Full architecture plan and 13-table schema (you review, not design from zero)
- Access to our in-house developer for questions
- Detailed documentation of all data sources and their inconsistencies
- Fast responses (same business day, AEST timezone)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT WE NEED FROM YOU
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

- Production BigQuery experience — not just familiarity; you have built and run BigQuery pipelines that other people depend on
- Experience with entity resolution or data matching across messy real-world sources, not clean, well-structured datasets
- An understanding of the tradeoffs in schema design: partitioning, clustering, normalisation vs denormalisation for LLM workloads
- The ability to explain WHY you make architecture decisions, not just implement them
- Availability for at least 15 hours in Week 1 for the architecture review

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
DURATION AND RATE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Duration: 3-5 weeks intensive, then ~5 hours/week available for questions
Estimated hours: 40-60 hours total
Rate: negotiable based on experience and location
Start: as soon as possible — we have a sprint beginning shortly

Client: Spent $292,509.93 · Rating 4.8 · Verified