Data Source Setup ← All Guides

Data Source Setup

Per-source guides for ingesting marketing data into BigQuery — costs, setup paths, gotchas, and key SQL/CLI commands. Research conducted 2026-05-07.

Overview

Covers every data source in the AimClear multi-channel pipeline and the best path for getting each into BigQuery. Sources fall into three buckets: organic / content (analytics, search, social, SERPs, LLM coverage), paid channels (Google, Meta, Bing, LinkedIn, YouTube, Google Ad Manager), and aggregators (vendor tools that already roll several sources together).

Default approach: use Google's free BigQuery Data Transfer Service (BDTS) connectors wherever they exist, and pull anything else with the lightest-weight option that fits — Airbyte Cloud for production third-party sources, custom Python scripts on Cloud Run Jobs for one-off / niche feeds, and DataForSEO for SERP / AI Overview data.

Decision Matrix

Source Native BDTS Best Path Approx Monthly Cost
GA4✅ FreeBDTS$0 + BQ compute
Search Console✅ Free (own mechanism)GSC Bulk Export$0 + BQ compute
Google Ads✅ FreeBDTS$0 + BQ compute
YouTube Ads✅ Free (via Google Ads)BDTS (Google Ads connector)$0 + BQ compute
Facebook/Meta Ads✅ Free (limited columns)BDTS → OWOX if more detail needed$0
Google Ad Manager✅ Free (if DT access enabled)BDTS (verify eligibility first)$0 + BQ compute
Microsoft/Bing AdsAirbyte Cloud or Python (bingads SDK)$20–60 or $0
LinkedIn AdsAirbyte Cloud (after API approval — apply now)$20–50
Reddit Ads (paid)Airbyte Cloud$15–30
Reddit Organic (threads)Python/PRAW script$0–10 API costs
LinkedIn Organic❌ (ToS risk)Official Pages API only (own pages)$0
WikipediaPython script (Wikimedia REST API)$0
SERP APIDataForSEO pay-as-you-go$5–50/day depending on volume
LLMs / AI OverviewsPython async script + DataForSEO for Google AIO$10–50
OptimizerUnknownCheck vendor API/export docsTBD

GA4 — Google Analytics 4

Path: Native BDTS connector — free. Setup in GA4 Admin → BigQuery Link or via BDTS in GCP Console.

GA4's native BigQuery export pushes event-level data daily into your project. This is the most granular marketing data you can get from Google Analytics — every session, event, and conversion with all associated parameters — and it's the foundation for cross-channel attribution in BigQuery.

At a Glance

Setup Steps

  1. Open GA4 Admin → go to the property you want to export → Admin → Product links → BigQuery links.
  2. Click "Link" → choose your BigQuery project (marty-search-console-ga4-api).
  3. Select datasets to export: at minimum choose "Events" (daily). Streaming export is available but accrues BQ streaming insert costs — start with daily.
  4. Choose export frequency: Daily (free). Streaming optional (costs extra).
  5. Save the link. GA4 will create a dataset in BigQuery named analytics_{property_id}.
  6. Verify the link in GA4 Admin and in BigQuery console — a new dataset should appear within 24 hours.
  7. Grant any analyst accounts BigQuery Data Viewer role on the dataset if needed.
Auth note: No service account setup required — GA4 handles the export internally using Google's own service account. You just need Editor or above on the GCP project and Admin access on the GA4 property.

Key Gotchas

Options Compared

MethodCostNotes
Native GA4 BDTS (recommended)FreeDaily export, no third-party dependency, official support
FivetranFree tier available; paid tiers well below $12K/year enterprise floorAdds managed schema normalization; overkill if native export meets your needs
AirbyteFree (open-source) / Cloud pricing variesMore setup, same data available natively for free
Custom Python (GA4 Data API)GCP compute costs onlyUse only if you need custom dimensions/metrics not in the export; rate limits apply

Recommendation: Use the native BDTS export. It's free, official, and the most complete representation of GA4 data available.

Search Console

Path: GSC Bulk Data Export (not BDTS — its own mechanism). Setup in GSC Settings → Bulk Data Export. Free.

GSC Bulk Data Export pushes daily impression and click data — site-level and URL-level — into BigQuery. This is the primary source for organic search performance: queries, pages, countries, devices, and positions. It feeds SEO analysis, content performance, and keyword coverage in the marketer pipeline.

At a Glance

Setup Steps

  1. Verify GSC access level. You must be an Owner (not just Full User) on the GSC property. Delegated ownership works. Check at: Search Console → Settings → Users and permissions.
  2. Grant BQ permissions to the GSC export service account.
    The service account is: search-console-data-export@system.gserviceaccount.com
    In BigQuery console → your project → IAM, grant it:
    • BigQuery Job User (project level)
    • BigQuery Data Editor (dataset level, or project level for simplicity)
  3. Open GSC Bulk Data Export. Go to: search.google.com/search-console → Settings (gear icon) → Bulk Data Export.
  4. Click "Set up Bulk Data Export" and select:
    • BigQuery project: marty-search-console-ga4-api
    • Dataset name: choose something like search_console_export
    • Property: select the GSC property to export
  5. Save. GSC will validate permissions and begin export within 24 hours.
  6. Verify in BigQuery that the dataset and tables appear the next day.
Per-property setup: Each GSC property needs its own Bulk Export setup. If you manage multiple client properties, repeat these steps for each one — they can all write to the same BQ project but will need separate datasets or you'll need to use table prefixes.

Key Gotchas

Options Compared

MethodCostNotes
GSC Bulk Data Export (recommended)FreeOfficial, daily, three tables, no API rate limits
FivetranFree tier availableUseful if you need managed schema or multi-property consolidation
AirbyteFree (open-source) / Cloud variesPulls via Search Console API; subject to API quotas (25K rows/request)
Custom Python (Search Console API)Compute costs onlyAPI has row limits per request; pagination required; viable for small properties

Recommendation: Use GSC Bulk Data Export. It's free, quota-free, and higher fidelity than the API at scale.

Reddit Organic

Path: Custom Python script using PRAW library → BigQuery client.

Reddit organic tracking monitors keyword-relevant threads and comments across subreddits — brand mentions, competitor discussions, industry topics, and sentiment signals. This is a custom pull, not a managed connector. The Reddit API (via PRAW) is the correct method; HTML scraping violates Reddit's ToS. As of 2025, Reddit added a pre-approval requirement for commercial API access.

At a Glance

Step 1: Apply for Reddit API access (commercial tier)

Reddit added a pre-approval step for commercial use in 2025.

  1. Go to reddit.com/prefs/apps → Create App.
  2. Select type: script (for server-side use) or web app (if OAuth flow needed).
  3. For commercial use (storing data in BQ, using it for client reporting), review Reddit's Data API Terms and apply for the appropriate access tier.
  4. Note your Client ID and Client Secret from the created app.

Step 2: Install PRAW and write the pull script

pip install praw google-cloud-bigquery
import praw
from google.cloud import bigquery
from datetime import datetime, timezone

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="aimclear-pipeline/1.0 (by u/YOUR_REDDIT_USERNAME)"
)

bq = bigquery.Client(project="marty-search-console-ga4-api")

KEYWORDS = ["your keyword", "brand name", "competitor"]
SUBREDDITS = ["marketing", "ppc", "digitalmarketing"]  # customize

def search_and_load(keywords, subreddits, lookback_hours=25):
    rows = []
    for sub in subreddits:
        subreddit = reddit.subreddit(sub)
        for kw in keywords:
            for submission in subreddit.search(kw, time_filter="day", limit=100):
                rows.append({
                    "pulled_at": datetime.now(timezone.utc).isoformat(),
                    "subreddit": sub,
                    "keyword": kw,
                    "submission_id": submission.id,
                    "title": submission.title,
                    "url": submission.url,
                    "score": submission.score,
                    "num_comments": submission.num_comments,
                    "created_utc": datetime.fromtimestamp(submission.created_utc, tz=timezone.utc).isoformat(),
                    "author": str(submission.author),
                    "selftext": submission.selftext[:2000],  # truncate for BQ
                })
    if rows:
        table_id = "marty-search-console-ga4-api.reddit_organic.submissions"
        errors = bq.insert_rows_json(table_id, rows)
        if errors:
            raise RuntimeError(f"BQ insert errors: {errors}")
    print(f"Loaded {len(rows)} rows")

if __name__ == "__main__":
    search_and_load(KEYWORDS, SUBREDDITS)

Step 3: Create the BigQuery table

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.reddit_organic.submissions` (
  pulled_at TIMESTAMP,
  subreddit STRING,
  keyword STRING,
  submission_id STRING,
  title STRING,
  url STRING,
  score INT64,
  num_comments INT64,
  created_utc TIMESTAMP,
  author STRING,
  selftext STRING
)
PARTITION BY DATE(created_utc)
OPTIONS (require_partition_filter = false);

Step 4: Deploy as a Cloud Run Job (preferred over PythonAnywhere for production)

# Build and push container
gcloud builds submit --tag gcr.io/marty-search-console-ga4-api/reddit-organic

# Deploy as Cloud Run Job
gcloud run jobs create reddit-organic \
  --image gcr.io/marty-search-console-ga4-api/reddit-organic \
  --region us-central1 \
  --project marty-search-console-ga4-api

# Schedule via Cloud Scheduler (daily at 6 AM UTC)
gcloud scheduler jobs create http reddit-organic-daily \
  --schedule="0 6 * * *" \
  --uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/marty-search-console-ga4-api/jobs/reddit-organic:run" \
  --oauth-service-account-email=YOUR_SA@marty-search-console-ga4-api.iam.gserviceaccount.com

Key Gotchas

Options Compared

MethodCostNotes
PRAW + Cloud Run Job (recommended)~$0.24/1K API calls + minimal computeProduction-grade, GCP-native, incremental design
PRAW + PythonAnywhereSame API costs + PythonAnywhere subscriptionFine for prototyping; less reliable for unattended production
Pushshift / third-party archivesVariesPushshift has had significant reliability and ToS issues; verify current status before relying on it

LinkedIn Organic

Path: Official LinkedIn Pages API (own pages only). Abandon personal token / session cookie scraping.

LinkedIn's official Pages API provides analytics for your own company pages: post impressions, engagement, follower counts, and demographic breakdowns of your audience. This covers organic reach on pages where AimClear or its clients are admins. Competitive organic intelligence (seeing other companies' organic post performance) is not available via official API and attempts to obtain it via unofficial means carry serious account risk.

At a Glance

Critical Constraint: Own Pages Only

LinkedIn's official API grants access to:

It does not grant access to:

Do not use personal token/session cookie automation to access LinkedIn data outside the official API. Apollo.io and Seamless.AI had company Pages deleted by LinkedIn in 2025 for API policy violations. Proxycurl, which offered LinkedIn data scraping, shut down in July 2025 following LinkedIn enforcement actions. The risk is account termination, not just a rate limit error.

Step 1: Ensure LinkedIn app access

If you already have a LinkedIn app approved for Marketing API access, check whether the Pages product is also approved on that app. If not, request it:

  1. Go to developer.linkedin.com → your app → Products.
  2. Request "Share on LinkedIn" and/or "Community Management API" as applicable.
  3. These products generally have faster approval than the Marketing Developer Platform.

Step 2: Get admin access to the Company Page

The OAuth user authorizing the API calls must be a Super Admin or Content Admin on the LinkedIn Company Page. Verify page roles at: LinkedIn → Company Page → Admin tools → Manage admins.

Step 3: Write the pull script

pip install requests google-cloud-bigquery
import requests
from google.cloud import bigquery
from datetime import datetime, timezone

# After OAuth flow, you'll have an access token with r_organization_social scope
ACCESS_TOKEN = "YOUR_OAUTH_ACCESS_TOKEN"
ORG_ID = "YOUR_LINKEDIN_ORGANIZATION_ID"  # numeric org URN
BQ_PROJECT = "marty-search-console-ga4-api"
BQ_DATASET = "linkedin_organic"

headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def get_page_analytics(org_id, start_date, end_date):
    # Organization Follower Statistics
    url = f"https://api.linkedin.com/v2/organizationalEntityFollowerStatistics"
    params = {
        "q": "organizationalEntity",
        "organizationalEntity": f"urn:li:organization:{org_id}",
    }
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json()

def get_post_statistics(org_id):
    # Share Statistics (post-level engagement)
    url = f"https://api.linkedin.com/v2/organizationalEntityShareStatistics"
    params = {
        "q": "organizationalEntity",
        "organizationalEntity": f"urn:li:organization:{org_id}",
    }
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json()

# Load results into BigQuery
# bq_client = bigquery.Client(project=BQ_PROJECT)
# ... insert_rows_json into your target table

Refer to the LinkedIn API documentation for the current endpoint paths and OAuth scopes — LinkedIn deprecates and renames endpoints periodically. Required scopes typically include r_organization_social and r_organization_admin.

Step 4: Deploy as a Cloud Run Job

# Schedule daily, similar to other custom scripts
gcloud run jobs create linkedin-organic \
  --image gcr.io/marty-search-console-ga4-api/linkedin-organic \
  --region us-central1

Key Gotchas

Options Compared

MethodCostData ScopeNotes
LinkedIn Pages API (official)FreeOwn pages onlyThe only safe, compliant option
Fivetran LinkedIn Pages connectorFree tier availableOwn pages onlyManaged version of the same Pages API
Third-party scraping toolsVariesOwn + competitorDo not use — account termination risk, enforcement active as of 2025

Wikipedia

Path: Custom Python script (Wikimedia REST API) → BigQuery. Trivial.

Wikipedia data in the pipeline serves two distinct purposes: (1) pulling article content and metadata for specific topics or brand/competitor pages, and (2) tracking page view statistics over time. The Wikimedia REST API is free, requires no authentication, and handles both. A BigQuery public dataset (bigquery-public-data.wikipedia) also provides page view stats without any API calls.

At a Glance

Option A: Wikimedia REST API — article content and metadata (~20 lines of Python)

pip install requests google-cloud-bigquery
import requests
from google.cloud import bigquery
from datetime import datetime, timezone

BQ_PROJECT = "marty-search-console-ga4-api"

def get_article_summary(title: str) -> dict:
    """Fetch article summary from Wikimedia REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    response = requests.get(url, headers={"User-Agent": "aimclear-pipeline/1.0 (timothy@aimclear.com)"})
    response.raise_for_status()
    data = response.json()
    return {
        "pulled_at": datetime.now(timezone.utc).isoformat(),
        "title": data.get("title"),
        "display_title": data.get("displaytitle"),
        "description": data.get("description"),
        "extract": data.get("extract"),
        "page_id": data.get("pageid"),
        "revision": data.get("revision"),
        "last_modified": data.get("timestamp"),
        "url": data.get("content_urls", {}).get("desktop", {}).get("page"),
    }

def get_page_views(title: str, start: str, end: str, granularity: str = "daily") -> list:
    """Fetch page view counts for a Wikipedia article."""
    # Format: YYYYMMDD for start/end
    url = (
        f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
        f"/en.wikipedia/all-access/all-agents/{title}/{granularity}/{start}/{end}"
    )
    response = requests.get(url, headers={"User-Agent": "aimclear-pipeline/1.0 (timothy@aimclear.com)"})
    response.raise_for_status()
    return response.json().get("items", [])

# Example usage
ARTICLES = ["AimClear", "Digital_marketing", "Search_engine_optimization"]

bq = bigquery.Client(project=BQ_PROJECT)
rows = [get_article_summary(title) for title in ARTICLES]
# bq.insert_rows_json("marty-search-console-ga4-api.wikipedia.article_summaries", rows)

Deploy as a Cloud Run Job, scheduled weekly (article content doesn't change daily).

Option B: BigQuery public dataset — page views at scale

The bigquery-public-data.wikipedia dataset contains monthly page view statistics across all Wikipedia articles. No API calls, no pipeline — just query directly.

-- Top Wikipedia articles by views for a given month
SELECT
  title,
  SUM(views) AS total_views,
  datehour
FROM `bigquery-public-data.wikipedia.pageviews_2026`
WHERE
  wiki = 'en'
  AND title IN ('AimClear', 'Digital_marketing', 'Search_engine_optimization')
  AND DATE(datehour) BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY title, datehour
ORDER BY total_views DESC;

This approach works well for trend analysis and requires no ingestion pipeline at all.

Key Gotchas

Options Compared

MethodCostUse Case
Wikimedia REST API + PythonFreeArticle summaries, metadata, recent page views for specific articles
BigQuery public datasetFree (BQ query costs)Historical page view trends at scale, no pipeline needed
Wikipedia database dumpsFreeBulk content ingestion (avoid for most marketing use cases)

Recommendation: Use the BigQuery public dataset for page view trend analysis. Use the REST API for pulling article summaries and metadata for specific pages. Both can coexist — they answer different questions.

SERP API

Path: DataForSEO (pay-as-you-go, no minimums). Do not depend on SerpApi.

SERP API pulls real-time Google search results for target keywords — organic rankings, featured snippets, People Also Ask boxes, local packs, and Google AI Overviews. This feeds keyword rank tracking, SERP feature monitoring, and AI Overview coverage detection in the pipeline. DataForSEO is the recommended provider: no minimum spend, competitive pricing, and their API returns AI Overview content in responses.

At a Glance

Vendor Selection

DataForSEO (recommended):

Serper (alternative):

Do not use SerpApi for new implementations:

Do not use Google Custom Search JSON API:

Step 1: Create DataForSEO account

  1. Go to app.dataforseo.com → Sign up.
  2. Add a payment method. No minimum deposit required.
  3. Note your API credentials (login email and password are used as HTTP Basic Auth).

Step 2: Write the pull script

pip install requests google-cloud-bigquery
import requests
import json
from google.cloud import bigquery
from datetime import datetime, timezone
from base64 import b64encode

BQ_PROJECT = "marty-search-console-ga4-api"
DATAFORSEO_LOGIN = "your-email@example.com"
DATAFORSEO_PASSWORD = "your-api-password"

def get_auth_header():
    creds = b64encode(f"{DATAFORSEO_LOGIN}:{DATAFORSEO_PASSWORD}".encode()).decode()
    return {"Authorization": f"Basic {creds}", "Content-Type": "application/json"}

def fetch_serp(keywords: list, location_code: int = 2840, language_code: str = "en") -> list:
    """
    Fetch SERP results for a list of keywords.
    location_code 2840 = United States
    Use /serp/google/organic/live/regular for live results
    Use /serp/google/organic/task_post + task_get for standard (cheaper)
    """
    url = "https://api.dataforseo.com/v3/serp/google/organic/live/regular"
    tasks = [
        {
            "keyword": kw,
            "location_code": location_code,
            "language_code": language_code,
            "device": "desktop",
            "os": "windows",
        }
        for kw in keywords
    ]
    response = requests.post(url, headers=get_auth_header(), json=tasks)
    response.raise_for_status()
    return response.json().get("tasks", [])

def parse_serp_to_rows(tasks: list, pulled_at: str) -> list:
    rows = []
    for task in tasks:
        keyword = task.get("data", {}).get("keyword")
        results = task.get("result", []) or []
        for result in results:
            items = result.get("items", []) or []
            for item in items:
                rows.append({
                    "pulled_at": pulled_at,
                    "keyword": keyword,
                    "rank_group": item.get("rank_group"),
                    "rank_absolute": item.get("rank_absolute"),
                    "type": item.get("type"),  # organic, featured_snippet, ai_overview, etc.
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "description": item.get("description"),
                    "domain": item.get("domain"),
                })
    return rows

KEYWORDS = [
    "digital marketing agency",
    "ppc management",
    "your target keyword here",
]

def main():
    pulled_at = datetime.now(timezone.utc).isoformat()
    tasks = fetch_serp(KEYWORDS)
    rows = parse_serp_to_rows(tasks, pulled_at)

    bq = bigquery.Client(project=BQ_PROJECT)
    table_id = f"{BQ_PROJECT}.serp_tracking.organic_results"
    errors = bq.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BQ errors: {errors}")
    print(f"Loaded {len(rows)} rows for {len(KEYWORDS)} keywords")

if __name__ == "__main__":
    main()

Step 3: Create BigQuery table

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.serp_tracking.organic_results` (
  pulled_at TIMESTAMP,
  keyword STRING,
  rank_group INT64,
  rank_absolute INT64,
  type STRING,
  url STRING,
  title STRING,
  description STRING,
  domain STRING
)
PARTITION BY DATE(pulled_at)
CLUSTER BY keyword;

Step 4: Deploy as a Cloud Run Job

Schedule daily (or more frequently for high-value keywords). Keep cost in check by batching keywords per request (DataForSEO accepts up to 100 tasks per API call).

gcloud run jobs create serp-tracking \
  --image gcr.io/marty-search-console-ga4-api/serp-tracking \
  --region us-central1

Key Gotchas

Options Compared

ProviderCostAI OverviewNotes
DataForSEO (recommended)$0.60/1K standard, $2.00/1K liveYesNo minimum, pay-as-you-go, extensive coverage
SerperCheck serper.devYesSimpler API; competitive pricing
SerpApiVariesYesActive Google lawsuit (Dec 2025) — avoid for new builds
Google Custom Search JSON APILimited free, deprecatedNoNo new customers; deprecated Jan 2027
Custom scrapingCompute onlyN/AViolates Google ToS — do not use

LLM Monitoring

Path: Two-track:

  1. Google AI Overviews → DataForSEO SERP API (already returning AIO content in results)
  2. ChatGPT / Claude / Gemini / Perplexity direct responses → Custom Python async script using official APIs

Official API pricing (mid-2025):

Specialized tools: Otterly.ai, thruuu LLM API (explicit BigQuery integration), DemandSphere (BigQuery export available). Worth evaluating if managing this at scale.

LLM API responses are non-deterministic — track trends across runs, not single-point data.

LLM monitoring tracks whether brand, competitor, or topic keywords appear in AI-generated responses across two distinct channels:

  1. Google AI Overviews — AI-generated answer boxes appearing in Google SERP
  2. Conversational AI responses — What ChatGPT, Claude, Gemini, and Perplexity say when asked about your brand, competitors, or industry topics

These are separate data sources with separate methods. Track both for a complete picture of "AI answer engine" visibility.

At a Glance

Track 1: Google AI Overviews (via DataForSEO SERP API)

Google AI Overviews appear in standard Google search results. DataForSEO returns them as SERP items with type: "ai_overview". This track reuses the same infrastructure as the SERP API section above.

Cost: DataForSEO: $0.60/1K standard results — an AI Overview trigger is a byproduct of the SERP pull, no extra cost. Run this in the same Cloud Run Job as general SERP tracking to avoid duplicate API calls.

What to capture from DataForSEO AI Overview items: When a SERP result contains an AI Overview, the items array will include an entry with type: "ai_overview". Extract:

Store in serp_tracking.ai_overview_appearances. Join with organic_results to see which keywords trigger AI Overviews vs. traditional results.

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.serp_tracking.ai_overview_appearances` (
  pulled_at TIMESTAMP,
  keyword STRING,
  has_ai_overview BOOL,
  ai_overview_text STRING,
  cited_urls ARRAY<STRING>,
  cited_domains ARRAY<STRING>
)
PARTITION BY DATE(pulled_at)
CLUSTER BY keyword;

Track 2: Conversational AI Monitoring (ChatGPT, Claude, Gemini, Perplexity)

This track queries LLMs directly via their APIs and asks brand/topic questions, then stores the responses for analysis. Use Python async for efficiency across multiple providers.

Cost per provider

ProviderModelApproximate Cost
OpenAIGPT-4o$0.0025/1K input tokens, $0.010/1K output tokens
AnthropicClaude Sonnet$3.00/1M input tokens, $15.00/1M output tokens
GoogleGemini 1.5 FlashFree at low volume (check current quota limits)
PerplexitySonar~$1.00/1K queries

At typical monitoring volumes (50–200 queries/day across all providers), expect $1–$10/day total depending on query complexity and provider mix.

Setup Steps

pip install openai anthropic google-generativeai requests google-cloud-bigquery asyncio
import asyncio
import json
from datetime import datetime, timezone
from google.cloud import bigquery

import openai
import anthropic
import google.generativeai as genai

BQ_PROJECT = "marty-search-console-ga4-api"
BQ_TABLE = f"{BQ_PROJECT}.llm_monitoring.responses"

PROMPTS = [
    "What is AimClear?",
    "Who are the top digital marketing agencies for B2B paid search?",
    "What companies specialize in audience targeting for paid social?",
    # Add brand, competitor, and category prompts
]

async def query_openai(prompt: str, pulled_at: str) -> dict:
    client = openai.AsyncOpenAI()  # uses OPENAI_API_KEY env var
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    return {
        "pulled_at": pulled_at,
        "provider": "openai",
        "model": "gpt-4o",
        "prompt": prompt,
        "response": response.choices[0].message.content,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
    }

async def query_anthropic(prompt: str, pulled_at: str) -> dict:
    client = anthropic.AsyncAnthropic()  # uses ANTHROPIC_API_KEY env var
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "pulled_at": pulled_at,
        "provider": "anthropic",
        "model": "claude-sonnet-4-5",
        "prompt": prompt,
        "response": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }

async def query_perplexity(prompt: str, pulled_at: str) -> dict:
    import httpx
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.perplexity.ai/chat/completions",
            headers={"Authorization": f"Bearer {PERPLEXITY_API_KEY}"},
            json={
                "model": "sonar",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 500,
            },
        )
        response.raise_for_status()
        data = response.json()
    return {
        "pulled_at": pulled_at,
        "provider": "perplexity",
        "model": "sonar",
        "prompt": prompt,
        "response": data["choices"][0]["message"]["content"],
        "input_tokens": data.get("usage", {}).get("prompt_tokens"),
        "output_tokens": data.get("usage", {}).get("completion_tokens"),
    }

async def run_all_prompts():
    pulled_at = datetime.now(timezone.utc).isoformat()
    tasks = []
    for prompt in PROMPTS:
        tasks.append(query_openai(prompt, pulled_at))
        tasks.append(query_anthropic(prompt, pulled_at))
        tasks.append(query_perplexity(prompt, pulled_at))
        # Add Gemini similarly using google.generativeai
    results = await asyncio.gather(*tasks, return_exceptions=True)

    rows = [r for r in results if isinstance(r, dict)]
    errors = [r for r in results if isinstance(r, Exception)]
    if errors:
        print(f"Errors: {errors}")

    bq = bigquery.Client(project=BQ_PROJECT)
    bq.insert_rows_json(BQ_TABLE, rows)
    print(f"Loaded {len(rows)} LLM responses")

if __name__ == "__main__":
    asyncio.run(run_all_prompts())

BigQuery table for LLM responses

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.llm_monitoring.responses` (
  pulled_at TIMESTAMP,
  provider STRING,
  model STRING,
  prompt STRING,
  response STRING,
  input_tokens INT64,
  output_tokens INT64
)
PARTITION BY DATE(pulled_at)
CLUSTER BY provider, prompt;

Post-processing: brand mention detection

After loading responses, run a scheduled BQ query or dbt model to detect brand/competitor mentions:

SELECT
  pulled_at,
  provider,
  prompt,
  LOWER(response) LIKE '%aimclear%' AS mentions_aimclear,
  LOWER(response) LIKE '%competitor_a%' AS mentions_competitor_a,
  response
FROM `marty-search-console-ga4-api.llm_monitoring.responses`
WHERE DATE(pulled_at) = CURRENT_DATE()
ORDER BY pulled_at DESC;

Specialized Tools (evaluate vs. build)

ToolNotes
Otterly.aiSaaS specifically for AI mention/citation monitoring; BQ integration varies — check current offering
thruuu LLM APIHas native BigQuery integration; purpose-built for LLM answer tracking
DemandSphereEnterprise SEO/SERP platform with LLM visibility tracking; check pricing

If AimClear plans to offer LLM monitoring as a client service at scale, evaluate these tools against the custom script approach. For internal use or a handful of clients, the custom script is sufficient and cheaper.

Key Gotchas

  • Method: Native BigQuery Data Transfer Service (BDTS)
  • Cost: Free (standard BQ storage and query costs apply)
  • Setup time: ~20 minutes
  • Status: Not started

Path: Native BDTS connector — free. ~30+ standard tables covering campaigns, ad groups, ads, keywords, audiences, conversions, shopping, video.

The Google Ads BDTS connector exports ~30+ tables covering campaigns, ad groups, ads, keywords, audiences, conversions, and performance metrics. It's the canonical source for paid search data in the pipeline — clicks, impressions, costs, and ROAS at every level of the account hierarchy.

Setup Steps

  1. Open BigQuery console → go to project marty-search-console-ga4-api → BigQuery Data Transfers (left nav).
  2. Click "+ Create Transfer" → search for "Google Ads" → select it.
  3. Configure the transfer:
    • Display name: something like google-ads-{client}
    • Schedule: Daily (default, recommended)
    • Destination dataset: create a new dataset or use an existing one (e.g., google_ads_export)
    • Google Ads customer ID: enter the 10-digit customer ID (no dashes). For MCC accounts, use the top-level MCC ID to pull all child accounts.
  4. Authorize: BDTS will prompt for Google account authorization. Use an account that has at least read access on the Google Ads account.
  5. Save and run. The first run may take 30–60 minutes.
  6. Verify in BigQuery that tables are appearing in your dataset.

Enabling custom GAQL reports (optional)

In the transfer configuration, you can add custom Google Ads Query Language (GAQL) reports beyond the default tables. This is useful for pulling data not included in the standard ~30 tables (e.g., campaign-level audience performance).

Key Gotchas

Options Compared

MethodCostNotes
Native BDTS (recommended)Free~30 tables, daily, official support, GAQL custom reports available
FivetranFree tier available; paid tiers well below $12K/year enterprise floorManaged, normalized schema; useful for multi-client consolidation
AirbyteFree (open-source) / Cloud pricing variesOpen-source connector available; more setup overhead
Custom Python (Google Ads API)Compute costs onlyFull API flexibility but you own pagination, retry logic, schema management

Recommendation: Use native BDTS unless you need cross-client normalization that Fivetran provides.

YouTube Ads

  • Method: Google Ads BDTS connector (YouTube paid campaigns flow through Google Ads)
  • Cost: Free (standard BQ storage and query costs apply)
  • Setup time: ~5 minutes additional if Google Ads BDTS is already configured
  • Status: Not started

Path: Covered by the Google Ads BDTS connector — YouTube campaigns are managed in Google Ads. No separate connector needed.

YouTube paid advertising — TrueView, bumpers, discovery ads, and other video campaign formats — is managed inside Google Ads and exported through the Google Ads BDTS connector. There is no separate BDTS connector for YouTube Ads. A separate BDTS connector exists for organic YouTube channel analytics (views, subscribers, engagement on owned channels).

Note: There is a separate YouTube Channel BDTS connector for organic channel analytics (views, subscribers, watch time) — different from ad performance.

For YouTube Paid Campaigns (ad performance data)

YouTube campaigns are managed in Google Ads. If you have the Google Ads BDTS connector configured, YouTube ad performance is already included. See the Google Ads section for setup instructions.

To confirm YouTube campaign data is present:

  1. In BigQuery, query the Campaign_YYYYMMDD table from your Google Ads export.
  2. Filter by advertising_channel_type = 'VIDEO' (or check the campaign_type field — exact field name varies by API version).
  3. Video ad-specific metrics (views, view rate, CPV) appear in the VideoPerformance or similar tables depending on your BDTS configuration.

For Organic YouTube Channel Analytics (owned channel data)

A separate BDTS connector exports organic channel data (views, watch time, subscribers, engagement) from the YouTube Data API / YouTube Analytics.

  1. Open BigQuery console → Data Transfers → "+ Create Transfer" → search "YouTube".
  2. Select "YouTube Channel Reports" connector.
  3. Authorize with the Google account that owns or has Analyst access to the YouTube channel.
  4. Set destination dataset (e.g., youtube_organic).
  5. Save and run.

Key Gotchas

Options Compared

ChannelMethodNotes
YouTube PaidGoogle Ads BDTS (free)Covered automatically if Google Ads export is live
YouTube Paid customGoogle Ads API + PythonFor video-specific metrics not in standard tables
YouTube OrganicYouTube Channel Reports BDTS (free)Owned channel only; views, watch time, subs
YouTube Organic via FivetranFree tier availableAlternative with managed schema normalization

Meta Ads

  • Method: Fivetran (free tier) or OWOX BI open-source connector; native BDTS is limited
  • Cost: Free (Fivetran free tier or OWOX open-source) to paid (Fivetran consumption tiers)
  • Setup time: ~30 minutes (Fivetran) / ~1–2 hours (OWOX or custom Python)
  • Status: Not started

Path: Native BDTS Facebook Ads connector (free, limited columns). Upgrade to OWOX open-source connector if custom breakdowns needed.

Meta Ads data covers Facebook and Instagram paid campaigns — spend, impressions, clicks, conversions, and audience breakdowns. No native Google BigQuery connector exists with full fidelity; the best paths are Fivetran (managed) or OWOX BI's open-source connector (free, more flexible). A limited native BDTS connector exists but lacks custom breakdowns.

Option A: Fivetran (recommended for managed, reliable delivery)

  1. Log in to Fivetran → Connectors → "+ Add Connector" → search "Facebook Ad Insights".
  2. Authorize with a Meta Business account that has access to the ad account(s) you want to export.
  3. Select ad account(s). You can pull multiple accounts.
  4. Set destination: BigQuery project marty-search-console-ga4-api, choose or create a dataset (e.g., meta_ads_fivetran).
  5. Configure sync frequency (daily recommended on free tier; more frequent on paid).
  6. Select tables/reports. Fivetran pulls core ad performance by default; configure breakdowns (age, gender, placement, device) as needed.
  7. Save and run initial sync. First sync may backfill available history.

Option B: OWOX BI open-source connector (free, more breakdown flexibility)

OWOX offers a free, open-source Google Sheets → BigQuery pipeline for Meta Ads that gives more control over breakdowns than the native BDTS option.

  1. Visit owox.com or search GitHub for "OWOX BigQuery pipeline".
  2. Follow their setup guide — it typically involves authorizing a Google Workspace add-on or deploying a script.
  3. Connect your Meta Ads account and configure the breakdown dimensions you need.
  4. Set a BigQuery destination and schedule.

Option C: Native BDTS (limited — standard metrics only)

A Meta Ads BDTS connector exists but only exports standard metrics with limited breakdown support. No custom audience segments or placement-level detail.

  1. BigQuery console → Data Transfers → "+ Create Transfer" → search "Facebook".
  2. Authorize with Meta Business account.
  3. Select destination dataset and ad account.
  4. Note the breakdown limitations before committing to this path.

Key Gotchas

Options Compared

MethodCostBreakdown FlexibilityNotes
Fivetran (recommended)Free tier available; paid tiers well below $12K/year enterprise floorHighManaged, reliable, handles retries and schema changes
OWOX BI open-sourceFreeHighMore flexible breakdowns; requires some DIY setup
Native BDTSFreeLowQuick to set up; standard metrics only, no custom breakdowns
AirbyteFree (open-source) / Cloud pricing variesMediumOpen-source connector available
Custom Python (Meta Marketing API)Compute costs onlyFullYou own everything: pagination, rate limits, schema

Bing Ads

  • Method: Fivetran (recommended) or custom Python script via bingads SDK
  • Cost: Fivetran free tier available; Python: compute costs only
  • Setup time: ~30 minutes (Fivetran) / ~2–4 hours (custom Python)
  • Status: Not started

Path: Airbyte Cloud or custom Python script using Microsoft bingads SDK. No native BDTS connector.

Microsoft Advertising (Bing Ads) data covers paid search and audience campaigns on the Microsoft network — Bing, Yahoo, MSN, and partner sites. There is no native BigQuery Data Transfer Service connector for Bing Ads. Fivetran has a managed connector (free tier available), and Microsoft publishes a Python SDK (bingads on PyPI) for custom scripts.

Option A: Fivetran (recommended)

  1. Log in to Fivetran → Connectors → "+ Add Connector" → search "Microsoft Advertising".
  2. Click "Authorize" → you'll be redirected to Microsoft login. Use an account with access to the Microsoft Advertising account(s).
  3. Select account(s). Fivetran can pull from one or multiple accounts.
  4. Set destination: BigQuery project marty-search-console-ga4-api, dataset (e.g., bing_ads_fivetran).
  5. Configure sync frequency (daily recommended on free tier).
  6. Save and run initial sync. Fivetran will backfill available history on first run (check how far back — typically 30–90 days depending on plan).
  7. Verify tables are appearing in BigQuery.

Option B: Custom Python via bingads SDK

pip install bingads

Key steps:

  1. Register an app in the Microsoft Azure portal (Azure AD → App registrations → New registration).
  2. Get API credentials: Client ID, Client Secret, Developer Token (from Microsoft Advertising UI → Tools → API Center).
  3. OAuth flow to get refresh token — Microsoft provides a sample script in the SDK docs.
  4. Write a report request using the SDK's ReportingServiceManager. Choose report type (e.g., CampaignPerformanceReportRequest).
  5. Download report and parse (TSV format).
  6. Load into BigQuery using google-cloud-bigquery Python client or bq load.
# Minimal pattern — fill in auth details from your registered app
from bingads.service_client import ServiceClient
from bingads.authorization import OAuthDesktopMobileAuthCodeGrant, AuthorizationData

# See Microsoft Advertising Python SDK docs for full auth + reporting examples
# SDK repo: https://github.com/BingAds/BingAds-Python-SDK

Deploy as a Cloud Run Job on a daily schedule (preferred over PythonAnywhere for production).

Key Gotchas

Options Compared

MethodCostNotes
Fivetran (recommended)Free tier available; paid tiers well below $12K/year enterprise floorManaged, handles auth/token refresh, reliable schema
AirbyteFree (open-source self-hosted) / ~$15/million rows (Cloud)Open-source connector; self-hosted requires infra
Custom Python (bingads SDK)Compute costs onlyFull control; you own auth, retry, schema — most setup effort

LinkedIn Ads

  • Method: Fivetran (recommended) or Airbyte
  • Cost: Fivetran free tier available; Airbyte Cloud pricing varies
  • Setup time: ~30 minutes (connector setup) + 2–4 weeks (LinkedIn app approval — start immediately)
  • Status: Not started — apply for LinkedIn Marketing API access now

Path: Airbyte Cloud after obtaining LinkedIn Marketing API access. Apply for API access immediately — 2–4 week approval wait.

LinkedIn Ads data covers Sponsored Content, Message Ads, Dynamic Ads, and Text Ads — impressions, clicks, spend, conversions, and demographic breakdowns (job title, company size, industry, seniority). There is no native BDTS connector. Fivetran and Airbyte both use the LinkedIn Marketing API, which requires explicit app approval from LinkedIn — a process that can take 2–4 weeks.

Step 0: Apply for LinkedIn Marketing API access (do this first)

This is the longest part of the process. Do not wait until you're ready to configure a connector.

  1. Go to developer.linkedin.com → My Apps → Create App.
  2. Fill in the app details. You'll need:
    • App name (e.g., "AimClear BigQuery Pipeline")
    • A LinkedIn Company Page (you must have admin access to the page)
    • App logo
  3. Under the app, go to Products → request "Marketing Developer Platform" access.
  4. Submit and wait. LinkedIn reviews applications manually. Typical timeline: 2–4 weeks.
  5. Once approved, you'll have access to the r_ads and r_ads_reporting OAuth scopes.
Note: Fivetran and Airbyte both use your own LinkedIn app credentials under the hood — you can't skip the API approval step by using a managed connector.

Step 1: Configure Fivetran connector (after API approval)

  1. Log in to Fivetran → Connectors → "+ Add Connector" → search "LinkedIn Ad Analytics".
  2. Enter OAuth credentials from your approved LinkedIn app (Client ID, Client Secret).
  3. Authorize with a LinkedIn account that has Campaign Manager access to the ad account(s).
  4. Select ad accounts.
  5. Set destination: BigQuery project marty-search-console-ga4-api, dataset (e.g., linkedin_ads_fivetran).
  6. Configure sync frequency and save.
  7. Run initial sync and verify tables in BigQuery.

Step 2: Verify data

Check that tables include campaign performance, creative performance, and demographic breakdowns (if available on your LinkedIn account tier).

Key Gotchas

Options Compared

MethodCostNotes
Fivetran (recommended)Free tier available; paid tiers well below $12K/year enterprise floorManaged, handles API auth, schema normalized; requires your LinkedIn app approval
AirbyteFree (open-source) / Cloud pricing variesOpen-source connector available; same LinkedIn app approval required
Custom Python (LinkedIn Marketing API)Compute costs onlyFull control; same app approval required; handle rate limits and pagination yourself
  • Method: Native BDTS (Google Ad Manager Data Transfer) — but requires Ad Manager 360
  • Cost: Free data transfer, but Ad Manager 360 is a premium/enterprise product
  • Setup time: ~1–2 hours + Google support ticket time (potentially days/weeks)
  • Status: Not started — verify Ad Manager 360 eligibility first

Path: Native BDTS connector (free) — but requires Data Transfer file access, which is not self-serve.

Google Ad Manager's BigQuery Data Transfer exports impression-level ad serving data — every impression, click, and viewability event from the ad server. This is powerful for publisher-side analysis, revenue attribution, and cross-channel delivery reconciliation. However, it is only available to Ad Manager 360 (GAM 360) accounts, not standard Ad Manager accounts. Self-serve setup is not possible; it requires a Google support ticket.

Step 0: Verify Ad Manager 360 access

Before investing any setup time, confirm the account type:

If the account is standard Ad Manager (not 360), skip this connector and use the Ad Manager API or Reporting API instead (see Options Compared below).

Step 1: Enable Data Transfer file access (requires Google support ticket)

  1. Contact Google Ad Manager support and request Data Transfer file access for your network.
  2. Provide: Ad Manager Network Code, GCP project ID (marty-search-console-ga4-api), and the GCS bucket where you want files delivered.
  3. Wait for Google to provision access. Timeline varies — days to weeks.

Step 2: Set up Google Cloud Storage bucket for Data Transfer files

# Create a GCS bucket for Ad Manager Data Transfer files
gsutil mb -p marty-search-console-ga4-api gs://aimclear-gam-data-transfer

# Grant Ad Manager's service account access to the bucket
# (Google support will provide the exact service account email)
gsutil iam ch serviceAccount:{gam-service-account}@gserviceaccount.com:objectAdmin gs://aimclear-gam-data-transfer

Step 3: Configure BDTS to load GCS files into BigQuery

Once Data Transfer files are landing in GCS:

  1. BigQuery console → Data Transfers → "+ Create Transfer" → "Google Ad Manager".
  2. Point to the GCS bucket and set destination dataset (e.g., google_ad_manager).
  3. Set schedule and save.

Step 4: Validate data

Key Gotchas

Options Compared

MethodCostAccess RequirementNotes
Native BDTS (Data Transfer)Free transfer + BQ storageAd Manager 360 onlyImpression-level; requires support ticket
Ad Manager Reporting APIFreeStandard or 360Aggregated reports only (no impression-level); self-serve
FivetranFree tier availableStandard or 360Pulls via Reporting API; aggregated, not impression-level
Custom Python (Reporting API)Compute costs onlyStandard or 360Full API access to aggregated reports; impression-level not available

If you have Ad Manager 360: Use native BDTS for impression-level data.
If you have standard Ad Manager: Use the Reporting API (via Fivetran or custom Python) for aggregated reports.

Other Programmatic / DSPs

Depends on specific platforms. Notable:

Clarify which specific programmatic platforms are actually in scope before building pipelines.

Optimizer

  • Method: TBD — depends on vendor capabilities; see decision tree below
  • Cost: Depends on method chosen
  • Setup time: ~30 minutes (if native export exists) to ~4 hours (custom script)
  • Status: Not started — start by verifying vendor export options

No public documentation found for a product specifically named "Optimizer." Optimizer is a paid channel aggregator that consolidates performance data across multiple ad platforms. Before building a pipeline, the key question is: what does Optimizer offer for data export, and does pulling from Optimizer provide more value than pulling from source channels (Google Ads, Meta, Bing, LinkedIn) directly?

This section is a decision template. Fill in specifics after reviewing Optimizer's documentation or contacting their support team. Check with the vendor:

If Optimizer already aggregates paid channels and the data fidelity is acceptable, pulling from it instead of each channel separately reduces pipeline complexity.

Decision Tree: How to Connect

Step 1: Check what Optimizer offers natively

Contact Optimizer support or review their API/export documentation:

Step 2: Assess whether pulling from Optimizer is worth it

Optimizer aggregates data that may already be flowing into BigQuery from source channels:

Data in OptimizerAlready in BQ?Pull from Optimizer?
Google Ads performanceYes (BDTS)Only if Optimizer normalizes/enriches it
Meta Ads performanceYes (Fivetran)Only if Optimizer adds cross-channel normalization
Bing Ads performanceYes (Fivetran)Only if Optimizer's schema is cleaner
Cross-channel aggregated metricsMaybe notYes — this is Optimizer's core value
Optimizer-specific attributionUnique to OptimizerYes

Rule of thumb: If Optimizer provides normalization or attribution logic on top of raw channel data, that added layer is worth pulling. If it's just re-exporting the same numbers from Google Ads and Meta without transformation, pull from source channels for higher fidelity.

Step 3: REST API — custom Python script template

If Optimizer has a REST API, adapt this pattern:

pip install requests google-cloud-bigquery
import requests
from google.cloud import bigquery
from datetime import datetime, timezone, timedelta

BQ_PROJECT = "marty-search-console-ga4-api"
OPTIMIZER_BASE_URL = "https://api.optimizer-vendor.com/v1"  # replace with actual URL
OPTIMIZER_API_KEY = "YOUR_API_KEY"  # store in Secret Manager

headers = {
    "Authorization": f"Bearer {OPTIMIZER_API_KEY}",
    "Content-Type": "application/json",
}

def fetch_performance(start_date: str, end_date: str) -> list:
    """
    Pull performance data from Optimizer API.
    start_date/end_date: YYYY-MM-DD format
    Adjust endpoint path, params, and pagination to match actual API spec.
    """
    url = f"{OPTIMIZER_BASE_URL}/reports/performance"
    params = {
        "start_date": start_date,
        "end_date": end_date,
        "dimensions": "date,channel,campaign",  # adjust to API spec
        "metrics": "impressions,clicks,spend,conversions",
    }
    rows = []
    page = 1
    while True:
        response = requests.get(url, headers=headers, params={**params, "page": page})
        response.raise_for_status()
        data = response.json()
        rows.extend(data.get("results", []))
        if not data.get("has_next_page"):
            break
        page += 1
    return rows

def load_to_bq(rows: list, table_id: str):
    bq = bigquery.Client(project=BQ_PROJECT)
    pulled_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        row["pulled_at"] = pulled_at
    errors = bq.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BQ errors: {errors}")
    print(f"Loaded {len(rows)} rows to {table_id}")

if __name__ == "__main__":
    yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
    rows = fetch_performance(yesterday, yesterday)
    load_to_bq(rows, f"{BQ_PROJECT}.optimizer.performance_daily")

Step 4: Deploy as a Cloud Run Job

gcloud run jobs create optimizer-pipeline \
  --image gcr.io/marty-search-console-ga4-api/optimizer-pipeline \
  --region us-central1

# Schedule daily
gcloud scheduler jobs create http optimizer-daily \
  --schedule="0 8 * * *" \
  --uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/marty-search-console-ga4-api/jobs/optimizer-pipeline:run" \
  --oauth-service-account-email=YOUR_SA@marty-search-console-ga4-api.iam.gserviceaccount.com

Key Gotchas

Questions to Answer Before Building

Before writing any code, get answers to these from Optimizer:

  1. Is there a native BigQuery connector or GCS export?
  2. Is there a REST API? What's the authentication method (API key, OAuth, HMAC)?
  3. What are the available dimensions and metrics?
  4. What is the data freshness / lag from platform to Optimizer?
  5. How far back can you pull historical data?
  6. What is the rate limit on API calls?
  7. Does Optimizer's schema overlap with your source channel tables, and if so, how?

Architectural Recommendations

Python Scripts (Reddit organic, Wikipedia, LLMs, Bing Ads if no Airbyte)

Use Cloud Run Jobs over PythonAnywhere for production:

Use google-cloud-bigquery Python client with Workload Identity on Cloud Run. Write via streaming inserts (small volumes) or batch LOAD jobs (daily bulk pulls).

ETL Strategy (Bing Ads, LinkedIn Ads, Reddit Ads)

If adding multiple non-native sources, consider Airbyte self-hosted on a small GCE VM — one management interface covers all three connectors and more. Free (you run the infrastructure, costs ~$20–40/month for a small e2-micro).

Airbyte Cloud at $15/million rows is the lower-ops alternative if self-hosting is too much friction.

Do not use Fivetran — minimum $12K/year annual contract is not justified for this use case.

BDTS Pricing Note

Starting September 25, 2025, BigQuery DTS switches third-party marketplace connectors to consumption-based pricing (slot-hours). First-party Google connectors (GA4, Google Ads, Search Console, YouTube Ads, Meta Ads, Ad Manager) remain free. Only third-party Fivetran/partner connectors sold through GCP Marketplace are affected.

Immediate Action Items

  1. Apply for LinkedIn Marketing API access now — 2–4 week wait, blocks LinkedIn Ads pipeline
  2. Switch SERP API provider from SerpApi to DataForSEO (lawsuit risk)
  3. Verify Google Ad Manager account type — DT file access required for BDTS connector
  4. Set up Airbyte (self-hosted or Cloud) to cover Bing Ads + LinkedIn Ads (once approved) + Reddit Ads
  5. Prototype Reddit/Wikipedia/LLM Python scripts on PythonAnywhere; migrate to Cloud Run Jobs for production
  6. Do not build LinkedIn personal-token pipeline — account termination risk is real after 2025 enforcement actions