Data Source Setup

Per-source guides for ingesting marketing data into BigQuery — costs, setup paths, gotchas, and key SQL/CLI commands. Research conducted 2026-05-07.

Overview

Covers every data source in the AimClear multi-channel pipeline and the best path for getting each into BigQuery. Sources fall into three buckets: organic / content (analytics, search, social, SERPs, LLM coverage), paid channels (Google, Meta, Bing, LinkedIn, YouTube, Google Ad Manager), and aggregators (vendor tools that already roll several sources together).

Default approach: use Google's free BigQuery Data Transfer Service (BDTS) connectors wherever they exist, and pull anything else with the lightest-weight option that fits — Airbyte Cloud for production third-party sources, custom Python scripts on Cloud Run Jobs for one-off / niche feeds, and DataForSEO for SERP / AI Overview data.

Decision Matrix

Source	Native BDTS	Best Path	Approx Monthly Cost
GA4	✅ Free	BDTS	$0 + BQ compute
Search Console	✅ Free (own mechanism)	GSC Bulk Export	$0 + BQ compute
Google Ads	✅ Free	BDTS	$0 + BQ compute
YouTube Ads	✅ Free (via Google Ads)	BDTS (Google Ads connector)	$0 + BQ compute
Facebook/Meta Ads	✅ Free (limited columns)	BDTS → OWOX if more detail needed	$0
Google Ad Manager	✅ Free (if DT access enabled)	BDTS (verify eligibility first)	$0 + BQ compute
Microsoft/Bing Ads	❌	Airbyte Cloud or Python (`bingads` SDK)	$20–60 or $0
LinkedIn Ads	❌	Airbyte Cloud (after API approval — apply now)	$20–50
Reddit Ads (paid)	❌	Airbyte Cloud	$15–30
Reddit Organic (threads)	❌	Python/PRAW script	$0–10 API costs
LinkedIn Organic	❌ (ToS risk)	Official Pages API only (own pages)	$0
Wikipedia	❌	Python script (Wikimedia REST API)	$0
SERP API	❌	DataForSEO pay-as-you-go	$5–50/day depending on volume
LLMs / AI Overviews	❌	Python async script + DataForSEO for Google AIO	$10–50
Optimizer	Unknown	Check vendor API/export docs	TBD

GA4 — Google Analytics 4

Path: Native BDTS connector — free. Setup in GA4 Admin → BigQuery Link or via BDTS in GCP Console.

Produces events_YYYYMMDD and events_intraday_YYYYMMDD tables (daily export, 24–48h lag)
Schema is deeply nested (event_params requires UNNEST + COALESCE across 4 value types) — build flattened materialized views for marketer-facing queries
September 2025 schema update: conversions renamed to keyEvents, revenue fields changed from INTEGER to FLOAT
GA4 360 (paid) adds intraday streaming; free tier is daily only
Data is not retroactive — link must be established before data starts

GA4's native BigQuery export pushes event-level data daily into your project. This is the most granular marketing data you can get from Google Analytics — every session, event, and conversion with all associated parameters — and it's the foundation for cross-channel attribution in BigQuery.

At a Glance

Method: Native BigQuery Data Transfer Service (BDTS) — built into GA4
Cost: Free (standard BQ storage and query costs apply)
Setup time: ~15 minutes

Setup Steps

Open GA4 Admin → go to the property you want to export → Admin → Product links → BigQuery links.
Click "Link" → choose your BigQuery project (marty-search-console-ga4-api).
Select datasets to export: at minimum choose "Events" (daily). Streaming export is available but accrues BQ streaming insert costs — start with daily.
Choose export frequency: Daily (free). Streaming optional (costs extra).
Save the link. GA4 will create a dataset in BigQuery named analytics_{property_id}.
Verify the link in GA4 Admin and in BigQuery console — a new dataset should appear within 24 hours.
Grant any analyst accounts BigQuery Data Viewer role on the dataset if needed.

Auth note: No service account setup required — GA4 handles the export internally using Google's own service account. You just need Editor or above on the GCP project and Admin access on the GA4 property.

Key Gotchas

No backfill. The export starts from the day you enable it. Historical data before linking is not exported. Enable ASAP.
24–48 hour lag. Daily exports typically land the following day; some properties see up to 48h.
Schema change (September 2025): The conversions_* event naming convention changed — conversions events are now tracked as keyEvents. Any queries referencing conversions by the old convention need updating.
Nested schema. event_params is a REPEATED RECORD — you must UNNEST(event_params) to access parameter values. Same applies to user_properties and items. This surprises many SQL users coming from flattened UA exports.
Sampled vs. unsampled. The BQ export is always unsampled regardless of your GA4 tier. This is one of the main reasons to use it.
One link per property. Each GA4 property links to one BQ project. If you have multiple properties, set up multiple links — each gets its own dataset.
Table naming: Daily tables land as events_YYYYMMDD. Intraday tables (if streaming) land as events_intraday_YYYYMMDD and are overwritten throughout the day.
Quota: GA4 BQ export counts against your BigQuery quotas. Unlikely to be an issue at typical agency scale.

Options Compared

Method	Cost	Notes
Native GA4 BDTS (recommended)	Free	Daily export, no third-party dependency, official support
Fivetran	Free tier available; paid tiers well below $12K/year enterprise floor	Adds managed schema normalization; overkill if native export meets your needs
Airbyte	Free (open-source) / Cloud pricing varies	More setup, same data available natively for free
Custom Python (GA4 Data API)	GCP compute costs only	Use only if you need custom dimensions/metrics not in the export; rate limits apply

Recommendation: Use the native BDTS export. It's free, official, and the most complete representation of GA4 data available.

Search Console

Path: GSC Bulk Data Export (not BDTS — its own mechanism). Setup in GSC Settings → Bulk Data Export. Free.

Three tables: searchdata_site_impression, searchdata_url_impression, ExportLog
Requires Owner-level GSC permission
Must grant search-console-data-export@system.gserviceaccount.com BigQuery Job User + Data Editor roles
No historical backfill — data flows from activation date only
Anonymized queries (low-impression) are excluded from URL-level table

GSC Bulk Data Export pushes daily impression and click data — site-level and URL-level — into BigQuery. This is the primary source for organic search performance: queries, pages, countries, devices, and positions. It feeds SEO analysis, content performance, and keyword coverage in the marketer pipeline.

At a Glance

Method: GSC Bulk Data Export (native Google feature — not BDTS)
Cost: Free (standard BQ storage and query costs apply)
Setup time: ~20 minutes

Setup Steps

Verify GSC access level. You must be an Owner (not just Full User) on the GSC property. Delegated ownership works. Check at: Search Console → Settings → Users and permissions.
Grant BQ permissions to the GSC export service account.
The service account is: search-console-data-export@system.gserviceaccount.com
In BigQuery console → your project → IAM, grant it:
- BigQuery Job User (project level)
- BigQuery Data Editor (dataset level, or project level for simplicity)
Open GSC Bulk Data Export. Go to: search.google.com/search-console → Settings (gear icon) → Bulk Data Export.
Click "Set up Bulk Data Export" and select:
- BigQuery project: marty-search-console-ga4-api
- Dataset name: choose something like search_console_export
- Property: select the GSC property to export
Save. GSC will validate permissions and begin export within 24 hours.
Verify in BigQuery that the dataset and tables appear the next day.

Per-property setup: Each GSC property needs its own Bulk Export setup. If you manage multiple client properties, repeat these steps for each one — they can all write to the same BQ project but will need separate datasets or you'll need to use table prefixes.

Key Gotchas

No backfill. Export begins from the day you configure it. Historical GSC data is available in the UI for 16 months but is NOT exported retroactively.
Owner-level access required. Full Users cannot configure Bulk Export. If a client's GSC is set up with restricted access, you'll need them to either grant ownership or set it up themselves.
Three tables are created:
- searchdata_site_impression — site-level aggregated data (query + device + country + date)
- searchdata_url_impression — URL-level data (query + URL + device + country + date)
- ExportLog — metadata about each export run (useful for debugging gaps)
Data lag: Typically 2–3 days behind real-time. GSC itself processes data with delay.
Sampling at URL level. The URL-impression table is more aggressively sampled than site-level. Low-volume queries/pages may not appear.
Schema quirk: Some fields use integer codes for device type (MOBILE, DESKTOP, TABLET) — check GSC documentation for the enum mapping before querying.
Service account requires BQ Job User at project level. Data Editor on the dataset alone is not enough — the export creates new tables, which requires job-level permissions.

Options Compared

Method	Cost	Notes
GSC Bulk Data Export (recommended)	Free	Official, daily, three tables, no API rate limits
Fivetran	Free tier available	Useful if you need managed schema or multi-property consolidation
Airbyte	Free (open-source) / Cloud varies	Pulls via Search Console API; subject to API quotas (25K rows/request)
Custom Python (Search Console API)	Compute costs only	API has row limits per request; pagination required; viable for small properties

Recommendation: Use GSC Bulk Data Export. It's free, quota-free, and higher fidelity than the API at scale.

Reddit Organic

Path: Custom Python script using PRAW library → BigQuery client.

Reddit API: ~100 req/min free; $0.24/1K calls commercial beyond free tier
Reddit introduced pre-approval requirement in 2025 for new commercial API use cases — register app use case as "marketing research / brand monitoring" and apply early
Do not scrape without the API (HTML scraping triggers blocks and violates ToS)
Design as incremental pulls (new posts since last run) — full corpus scrapes of high-volume subreddits take hours
Cloud Run Jobs preferred over PythonAnywhere for production (GCP-native, IAM-integrated, scales to zero)

Reddit organic tracking monitors keyword-relevant threads and comments across subreddits — brand mentions, competitor discussions, industry topics, and sentiment signals. This is a custom pull, not a managed connector. The Reddit API (via PRAW) is the correct method; HTML scraping violates Reddit's ToS. As of 2025, Reddit added a pre-approval requirement for commercial API access.

At a Glance

Method: Custom Python script using PRAW (Python Reddit API Wrapper)
Cost: Reddit API: ~$0.24/1K calls at commercial tier; Cloud Run Job: ~$0 at low volume
Setup time: ~2–4 hours (script + deployment)

Step 1: Apply for Reddit API access (commercial tier)

Reddit added a pre-approval step for commercial use in 2025.

Go to reddit.com/prefs/apps → Create App.
Select type: script (for server-side use) or web app (if OAuth flow needed).
For commercial use (storing data in BQ, using it for client reporting), review Reddit's Data API Terms and apply for the appropriate access tier.
Note your Client ID and Client Secret from the created app.

Step 2: Install PRAW and write the pull script

pip install praw google-cloud-bigquery

import praw
from google.cloud import bigquery
from datetime import datetime, timezone

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="aimclear-pipeline/1.0 (by u/YOUR_REDDIT_USERNAME)"
)

bq = bigquery.Client(project="marty-search-console-ga4-api")

KEYWORDS = ["your keyword", "brand name", "competitor"]
SUBREDDITS = ["marketing", "ppc", "digitalmarketing"]  # customize

def search_and_load(keywords, subreddits, lookback_hours=25):
    rows = []
    for sub in subreddits:
        subreddit = reddit.subreddit(sub)
        for kw in keywords:
            for submission in subreddit.search(kw, time_filter="day", limit=100):
                rows.append({
                    "pulled_at": datetime.now(timezone.utc).isoformat(),
                    "subreddit": sub,
                    "keyword": kw,
                    "submission_id": submission.id,
                    "title": submission.title,
                    "url": submission.url,
                    "score": submission.score,
                    "num_comments": submission.num_comments,
                    "created_utc": datetime.fromtimestamp(submission.created_utc, tz=timezone.utc).isoformat(),
                    "author": str(submission.author),
                    "selftext": submission.selftext[:2000],  # truncate for BQ
                })
    if rows:
        table_id = "marty-search-console-ga4-api.reddit_organic.submissions"
        errors = bq.insert_rows_json(table_id, rows)
        if errors:
            raise RuntimeError(f"BQ insert errors: {errors}")
    print(f"Loaded {len(rows)} rows")

if __name__ == "__main__":
    search_and_load(KEYWORDS, SUBREDDITS)

Step 3: Create the BigQuery table

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.reddit_organic.submissions` (
  pulled_at TIMESTAMP,
  subreddit STRING,
  keyword STRING,
  submission_id STRING,
  title STRING,
  url STRING,
  score INT64,
  num_comments INT64,
  created_utc TIMESTAMP,
  author STRING,
  selftext STRING
)
PARTITION BY DATE(created_utc)
OPTIONS (require_partition_filter = false);

Step 4: Deploy as a Cloud Run Job (preferred over PythonAnywhere for production)

# Build and push container
gcloud builds submit --tag gcr.io/marty-search-console-ga4-api/reddit-organic

# Deploy as Cloud Run Job
gcloud run jobs create reddit-organic \
  --image gcr.io/marty-search-console-ga4-api/reddit-organic \
  --region us-central1 \
  --project marty-search-console-ga4-api

# Schedule via Cloud Scheduler (daily at 6 AM UTC)
gcloud scheduler jobs create http reddit-organic-daily \
  --schedule="0 6 * * *" \
  --uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/marty-search-console-ga4-api/jobs/reddit-organic:run" \
  --oauth-service-account-email=YOUR_SA@marty-search-console-ga4-api.iam.gserviceaccount.com

Key Gotchas

Do not HTML scrape Reddit. Reddit's ToS explicitly prohibits scraping. Use the API only.
Pre-approval requirement (2025). Commercial use of the Reddit API now requires applying for the appropriate tier. Review the Data API terms at redditinc.com/policies/data-api-terms before proceeding.
Rate limits: ~100 requests/minute on standard tier. PRAW automatically handles rate limit headers and backs off. Don't bypass PRAW's built-in rate limiting.
Commercial API pricing: ~$0.24/1K API calls at commercial rates. At typical monitoring volumes (a few hundred calls/day), costs are minimal — under $5/month for most use cases.
Design as incremental pulls. Don't pull full history on every run. Use time_filter="day" and partition your BQ table by date. Track what you've already pulled to avoid duplicate rows.
Deduplication. Reddit submissions can appear in multiple searches (multiple keywords). Use submission_id as a dedup key in your BQ table or downstream queries.
Search result limits. Reddit's search API returns a max of ~1,000 results per query (with pagination). For very high-volume keywords, you'll miss older results.
Deleted/removed content. Authors and body text may be [deleted] or [removed] by the time you pull. Handle these gracefully.
Cloud Run vs. PythonAnywhere. PythonAnywhere is viable for prototyping, but Cloud Run Jobs are the production target — they're more reliable, auditable via Cloud Logging, and natively authenticated to GCP via service accounts.

Options Compared

Method	Cost	Notes
PRAW + Cloud Run Job (recommended)	~$0.24/1K API calls + minimal compute	Production-grade, GCP-native, incremental design
PRAW + PythonAnywhere	Same API costs + PythonAnywhere subscription	Fine for prototyping; less reliable for unattended production
Pushshift / third-party archives	Varies	Pushshift has had significant reliability and ToS issues; verify current status before relying on it

LinkedIn Organic

Path: Official LinkedIn Pages API (own pages only). Abandon personal token / session cookie scraping.

2025 enforcement: LinkedIn deleted company Pages of Apollo.io and Seamless.AI for scraping; Proxycurl shut down July 2025
Personal login token automation = account termination risk, not just ToS violation
Official Pages Data Portability API and Community Management API cover your own company page data (posts, followers, engagement analytics) — requires app approval (2–4 weeks)
Competitive LinkedIn page data is not obtainable via official API; Bright Data / Coresignal exist as enterprise-priced data products ($1K+/month)
Decision: limit to own Pages data via official API. Do not build personal-token pipeline.

LinkedIn's official Pages API provides analytics for your own company pages: post impressions, engagement, follower counts, and demographic breakdowns of your audience. This covers organic reach on pages where AimClear or its clients are admins. Competitive organic intelligence (seeing other companies' organic post performance) is not available via official API and attempts to obtain it via unofficial means carry serious account risk.

At a Glance

Method: LinkedIn Pages API (official) via custom Python script
Cost: Free API (compute costs only)
Setup time: ~2–4 hours + 2–4 weeks LinkedIn app approval (if not already approved for Ads API)
Scope: Own company pages only; no competitive organic data available

Critical Constraint: Own Pages Only

LinkedIn's official API grants access to:

Your own company page analytics (Pages Data Portability API)
Content you publish (Community Management API)
Your own follower/audience demographics

It does not grant access to:

Competitor page performance data
Third-party organic post analytics
Employee post performance (unless connected to your page)

Do not use personal token/session cookie automation to access LinkedIn data outside the official API. Apollo.io and Seamless.AI had company Pages deleted by LinkedIn in 2025 for API policy violations. Proxycurl, which offered LinkedIn data scraping, shut down in July 2025 following LinkedIn enforcement actions. The risk is account termination, not just a rate limit error.

Step 1: Ensure LinkedIn app access

If you already have a LinkedIn app approved for Marketing API access, check whether the Pages product is also approved on that app. If not, request it:

Go to developer.linkedin.com → your app → Products.
Request "Share on LinkedIn" and/or "Community Management API" as applicable.
These products generally have faster approval than the Marketing Developer Platform.

Step 2: Get admin access to the Company Page

The OAuth user authorizing the API calls must be a Super Admin or Content Admin on the LinkedIn Company Page. Verify page roles at: LinkedIn → Company Page → Admin tools → Manage admins.

Step 3: Write the pull script

pip install requests google-cloud-bigquery

import requests
from google.cloud import bigquery
from datetime import datetime, timezone

# After OAuth flow, you'll have an access token with r_organization_social scope
ACCESS_TOKEN = "YOUR_OAUTH_ACCESS_TOKEN"
ORG_ID = "YOUR_LINKEDIN_ORGANIZATION_ID"  # numeric org URN
BQ_PROJECT = "marty-search-console-ga4-api"
BQ_DATASET = "linkedin_organic"

headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def get_page_analytics(org_id, start_date, end_date):
    # Organization Follower Statistics
    url = f"https://api.linkedin.com/v2/organizationalEntityFollowerStatistics"
    params = {
        "q": "organizationalEntity",
        "organizationalEntity": f"urn:li:organization:{org_id}",
    }
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json()

def get_post_statistics(org_id):
    # Share Statistics (post-level engagement)
    url = f"https://api.linkedin.com/v2/organizationalEntityShareStatistics"
    params = {
        "q": "organizationalEntity",
        "organizationalEntity": f"urn:li:organization:{org_id}",
    }
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json()

# Load results into BigQuery
# bq_client = bigquery.Client(project=BQ_PROJECT)
# ... insert_rows_json into your target table

Refer to the LinkedIn API documentation for the current endpoint paths and OAuth scopes — LinkedIn deprecates and renames endpoints periodically. Required scopes typically include r_organization_social and r_organization_admin.

Step 4: Deploy as a Cloud Run Job

# Schedule daily, similar to other custom scripts
gcloud run jobs create linkedin-organic \
  --image gcr.io/marty-search-console-ga4-api/linkedin-organic \
  --region us-central1

Key Gotchas

Own pages only. Competitor intelligence via LinkedIn's official API is not possible. If a client asks for competitor organic benchmarks, LinkedIn is the wrong source — use industry reports, LinkedIn's own Marketing Solutions benchmarks, or published case studies.
OAuth token management. LinkedIn access tokens expire. Set up token refresh and store credentials securely (Secret Manager, not environment variables in code).
Proxycurl is gone (July 2025). If any existing workflow relied on Proxycurl or similar scraping services for LinkedIn data, those are now defunct. There is no drop-in replacement via official channels for what those services provided.
Apollo.io / Seamless.AI precedent. These companies had LinkedIn company pages deleted in 2025 for violations. The risk of unofficial access is real and company-level, not just API-key level.
Rate limits are enforced. LinkedIn throttles API calls. For a daily pull of page stats, you're unlikely to hit limits. For bulk historical pulls of post-level data, space out requests.
Data portability API vs. Community Management API. LinkedIn has been expanding official data access incrementally. Check the current API product catalog — some analytics that required workarounds in 2024 may have official endpoints now.
Demographic data requires additional scope. Audience demographic breakdowns (industry, job function, seniority of followers) require the r_organization_social scope at minimum; some breakdowns may require the full admin scope.

Options Compared

Method	Cost	Data Scope	Notes
LinkedIn Pages API (official)	Free	Own pages only	The only safe, compliant option
Fivetran LinkedIn Pages connector	Free tier available	Own pages only	Managed version of the same Pages API
Third-party scraping tools	Varies	Own + competitor	Do not use — account termination risk, enforcement active as of 2025

Wikipedia

Path: Custom Python script (Wikimedia REST API) → BigQuery. Trivial.

Free REST API, no auth required, ~200 req/sec rate limit — massively over-provisioned for keyword research volumes
en.wikipedia.org/api/rest_v1/page/summary/{title} is the main endpoint
BigQuery public dataset bigquery-public-data.wikipedia includes page view stats (free to query under 1TB/month free tier)
~20 lines of Python total

Wikipedia data in the pipeline serves two distinct purposes: (1) pulling article content and metadata for specific topics or brand/competitor pages, and (2) tracking page view statistics over time. The Wikimedia REST API is free, requires no authentication, and handles both. A BigQuery public dataset (bigquery-public-data.wikipedia) also provides page view stats without any API calls.

At a Glance

Method: Wikimedia REST API (free, no auth) or BigQuery public dataset
Cost: Free
Setup time: ~1–2 hours

Option A: Wikimedia REST API — article content and metadata (~20 lines of Python)

pip install requests google-cloud-bigquery

import requests
from google.cloud import bigquery
from datetime import datetime, timezone

BQ_PROJECT = "marty-search-console-ga4-api"

def get_article_summary(title: str) -> dict:
    """Fetch article summary from Wikimedia REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    response = requests.get(url, headers={"User-Agent": "aimclear-pipeline/1.0 (timothy@aimclear.com)"})
    response.raise_for_status()
    data = response.json()
    return {
        "pulled_at": datetime.now(timezone.utc).isoformat(),
        "title": data.get("title"),
        "display_title": data.get("displaytitle"),
        "description": data.get("description"),
        "extract": data.get("extract"),
        "page_id": data.get("pageid"),
        "revision": data.get("revision"),
        "last_modified": data.get("timestamp"),
        "url": data.get("content_urls", {}).get("desktop", {}).get("page"),
    }

def get_page_views(title: str, start: str, end: str, granularity: str = "daily") -> list:
    """Fetch page view counts for a Wikipedia article."""
    # Format: YYYYMMDD for start/end
    url = (
        f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
        f"/en.wikipedia/all-access/all-agents/{title}/{granularity}/{start}/{end}"
    )
    response = requests.get(url, headers={"User-Agent": "aimclear-pipeline/1.0 (timothy@aimclear.com)"})
    response.raise_for_status()
    return response.json().get("items", [])

# Example usage
ARTICLES = ["AimClear", "Digital_marketing", "Search_engine_optimization"]

bq = bigquery.Client(project=BQ_PROJECT)
rows = [get_article_summary(title) for title in ARTICLES]
# bq.insert_rows_json("marty-search-console-ga4-api.wikipedia.article_summaries", rows)

Deploy as a Cloud Run Job, scheduled weekly (article content doesn't change daily).

Option B: BigQuery public dataset — page views at scale

The bigquery-public-data.wikipedia dataset contains monthly page view statistics across all Wikipedia articles. No API calls, no pipeline — just query directly.

-- Top Wikipedia articles by views for a given month
SELECT
  title,
  SUM(views) AS total_views,
  datehour
FROM `bigquery-public-data.wikipedia.pageviews_2026`
WHERE
  wiki = 'en'
  AND title IN ('AimClear', 'Digital_marketing', 'Search_engine_optimization')
  AND DATE(datehour) BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY title, datehour
ORDER BY total_views DESC;

This approach works well for trend analysis and requires no ingestion pipeline at all.

Key Gotchas

Set a descriptive User-Agent header. The Wikimedia API terms require identifying your tool and contact info in the User-Agent string (e.g., aimclear-pipeline/1.0 (timothy@aimclear.com)). Requests without proper User-Agent may be throttled or blocked.
Rate limit: ~200 requests/second. At typical monitoring volumes (dozens of articles), you'll never approach this. No API key required.
Article title formatting. Wikipedia titles use underscores instead of spaces in URLs (Search_engine_optimization, not Search engine optimization). Handle URL encoding carefully.
Disambiguation pages. Some titles resolve to disambiguation pages, not articles. Check the type field in the REST API response — standard is a regular article, disambiguation means you need a more specific title.
Public dataset update lag. The bigquery-public-data.wikipedia dataset is updated monthly, not daily. Use the REST API for more current page view data.
No auth needed, but respect ToS. The Wikimedia API is completely open, but mass downloading of article content violates the spirit of the API. For bulk content needs, use Wikipedia database dumps instead.
International editions. To pull from non-English Wikipedia, change en.wikipedia to the appropriate language code (e.g., es.wikipedia, de.wikipedia).

Options Compared

Method	Cost	Use Case
Wikimedia REST API + Python	Free	Article summaries, metadata, recent page views for specific articles
BigQuery public dataset	Free (BQ query costs)	Historical page view trends at scale, no pipeline needed
Wikipedia database dumps	Free	Bulk content ingestion (avoid for most marketing use cases)

Recommendation: Use the BigQuery public dataset for page view trend analysis. Use the REST API for pulling article summaries and metadata for specific pages. Both can coexist — they answer different questions.

SERP API

Path: DataForSEO (pay-as-you-go, no minimums). Do not depend on SerpApi.

SerpApi is under active Google lawsuit (filed Dec 2025, motion to dismiss filed Feb 2026 — case ongoing, ruling likely mid-to-late 2026). Continued viability uncertain.
DataForSEO: $2.00/1K queries (live) or $0.60/1K (5-min delay). No monthly minimums. Best cost at volume.
Serper.dev: $50 for 50K queries (tiered down to $0.30/1K at volume). Fast, less legally exposed than SerpApi.
Google Custom Search JSON API: Closed to new customers. Existing customers deprecated by Jan 1, 2027. Not an option if you don't already have it.
DataForSEO and Serper both return Google AI Overview content in SERP responses — use this for AI Overview monitoring instead of separate LLM calls against Google.

SERP API pulls real-time Google search results for target keywords — organic rankings, featured snippets, People Also Ask boxes, local packs, and Google AI Overviews. This feeds keyword rank tracking, SERP feature monitoring, and AI Overview coverage detection in the pipeline. DataForSEO is the recommended provider: no minimum spend, competitive pricing, and their API returns AI Overview content in responses.

At a Glance

Method: DataForSEO API + custom Python script → Cloud Run Job
Cost: DataForSEO: $2.00/1K live results, $0.60/1K standard (5-minute delay); no minimums
Setup time: ~2–3 hours

Vendor Selection

DataForSEO (recommended):

$2.00/1K requests for live results
$0.60/1K for standard results (5-minute delivery delay, lower cost for rank tracking)
No minimums or contracts — pay-as-you-go
Returns Google AI Overview content in SERP responses
Extensive API coverage: Google, Bing, YouTube, local, image SERPs

Serper (alternative):

Also returns AI Overview content in SERP responses
Competitive pricing — check current rates at serper.dev
Simpler API structure; good for teams new to SERP APIs

Do not use SerpApi for new implementations:

Google filed a lawsuit against SerpApi in December 2025
Legal uncertainty creates business continuity risk

Do not use Google Custom Search JSON API:

No longer accepting new customers
Deprecated January 2027 — any existing integrations should be migrated

Step 1: Create DataForSEO account

Go to app.dataforseo.com → Sign up.
Add a payment method. No minimum deposit required.
Note your API credentials (login email and password are used as HTTP Basic Auth).

Step 2: Write the pull script

pip install requests google-cloud-bigquery

import requests
import json
from google.cloud import bigquery
from datetime import datetime, timezone
from base64 import b64encode

BQ_PROJECT = "marty-search-console-ga4-api"
DATAFORSEO_LOGIN = "your-email@example.com"
DATAFORSEO_PASSWORD = "your-api-password"

def get_auth_header():
    creds = b64encode(f"{DATAFORSEO_LOGIN}:{DATAFORSEO_PASSWORD}".encode()).decode()
    return {"Authorization": f"Basic {creds}", "Content-Type": "application/json"}

def fetch_serp(keywords: list, location_code: int = 2840, language_code: str = "en") -> list:
    """
    Fetch SERP results for a list of keywords.
    location_code 2840 = United States
    Use /serp/google/organic/live/regular for live results
    Use /serp/google/organic/task_post + task_get for standard (cheaper)
    """
    url = "https://api.dataforseo.com/v3/serp/google/organic/live/regular"
    tasks = [
        {
            "keyword": kw,
            "location_code": location_code,
            "language_code": language_code,
            "device": "desktop",
            "os": "windows",
        }
        for kw in keywords
    ]
    response = requests.post(url, headers=get_auth_header(), json=tasks)
    response.raise_for_status()
    return response.json().get("tasks", [])

def parse_serp_to_rows(tasks: list, pulled_at: str) -> list:
    rows = []
    for task in tasks:
        keyword = task.get("data", {}).get("keyword")
        results = task.get("result", []) or []
        for result in results:
            items = result.get("items", []) or []
            for item in items:
                rows.append({
                    "pulled_at": pulled_at,
                    "keyword": keyword,
                    "rank_group": item.get("rank_group"),
                    "rank_absolute": item.get("rank_absolute"),
                    "type": item.get("type"),  # organic, featured_snippet, ai_overview, etc.
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "description": item.get("description"),
                    "domain": item.get("domain"),
                })
    return rows

KEYWORDS = [
    "digital marketing agency",
    "ppc management",
    "your target keyword here",
]

def main():
    pulled_at = datetime.now(timezone.utc).isoformat()
    tasks = fetch_serp(KEYWORDS)
    rows = parse_serp_to_rows(tasks, pulled_at)

    bq = bigquery.Client(project=BQ_PROJECT)
    table_id = f"{BQ_PROJECT}.serp_tracking.organic_results"
    errors = bq.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BQ errors: {errors}")
    print(f"Loaded {len(rows)} rows for {len(KEYWORDS)} keywords")

if __name__ == "__main__":
    main()

Step 3: Create BigQuery table

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.serp_tracking.organic_results` (
  pulled_at TIMESTAMP,
  keyword STRING,
  rank_group INT64,
  rank_absolute INT64,
  type STRING,
  url STRING,
  title STRING,
  description STRING,
  domain STRING
)
PARTITION BY DATE(pulled_at)
CLUSTER BY keyword;

Step 4: Deploy as a Cloud Run Job

Schedule daily (or more frequently for high-value keywords). Keep cost in check by batching keywords per request (DataForSEO accepts up to 100 tasks per API call).

gcloud run jobs create serp-tracking \
  --image gcr.io/marty-search-console-ga4-api/serp-tracking \
  --region us-central1

Key Gotchas

Live vs. standard results: Live is fresher but 3.3x the cost. For daily rank tracking, standard results (5-min delay) are fine and much cheaper. Use live for on-demand queries or AI Overview monitoring.
AI Overview detection: DataForSEO returns AI Overviews as an item_type in the SERP results. Parse the type field and look for ai_overview to capture which keywords trigger AI Overviews and what content appears. See the LLM Monitoring section for the dedicated LLM monitoring pipeline.
Google search ToS. Both DataForSEO and Serper operate as API intermediaries that comply with their own agreements with Google. Direct scraping of Google search results is against Google's ToS — use a compliant API provider.
Keyword volume costs. At $0.60/1K standard requests, 1,000 keywords pulled daily costs ~$0.60/day ($18/month). 10,000 keywords/day = ~$180/month. Size your keyword list deliberately.
Location and language matter. SERP results vary significantly by location. Set location_code to match your target market. Use country-level codes for national tracking, city-level for local.
SerpApi legal risk. Google's December 2025 lawsuit against SerpApi creates uncertainty about that service's continuity. Avoid starting new integrations there.
Google Custom Search JSON API is closing. Deprecated January 2027. Do not build new pipelines on it.

Options Compared

Provider	Cost	AI Overview	Notes
DataForSEO (recommended)	$0.60/1K standard, $2.00/1K live	Yes	No minimum, pay-as-you-go, extensive coverage
Serper	Check serper.dev	Yes	Simpler API; competitive pricing
SerpApi	Varies	Yes	Active Google lawsuit (Dec 2025) — avoid for new builds
Google Custom Search JSON API	Limited free, deprecated	No	No new customers; deprecated Jan 2027
Custom scraping	Compute only	N/A	Violates Google ToS — do not use

LLM Monitoring

Path: Two-track:

Google AI Overviews → DataForSEO SERP API (already returning AIO content in results)
ChatGPT / Claude / Gemini / Perplexity direct responses → Custom Python async script using official APIs

Official API pricing (mid-2025):

OpenAI GPT-4o: $0.0025/1K input tokens, $0.01/1K output tokens
Anthropic Claude Sonnet: $3/1M input, $15/1M output
Google Gemini 1.5 Flash: effectively free at low volume on free tier
Perplexity Sonar API: ~$1/1K queries

Specialized tools: Otterly.ai, thruuu LLM API (explicit BigQuery integration), DemandSphere (BigQuery export available). Worth evaluating if managing this at scale.

LLM API responses are non-deterministic — track trends across runs, not single-point data.

LLM monitoring tracks whether brand, competitor, or topic keywords appear in AI-generated responses across two distinct channels:

Google AI Overviews — AI-generated answer boxes appearing in Google SERP
Conversational AI responses — What ChatGPT, Claude, Gemini, and Perplexity say when asked about your brand, competitors, or industry topics

These are separate data sources with separate methods. Track both for a complete picture of "AI answer engine" visibility.

At a Glance

Method: Two tracks — (1) DataForSEO SERP API for Google AI Overviews; (2) Direct LLM API calls (OpenAI, Anthropic, Google, Perplexity) via Python async script
Cost: Varies by track and volume — see breakdown below
Setup time: ~3–5 hours (both tracks)

Track 1: Google AI Overviews (via DataForSEO SERP API)

Google AI Overviews appear in standard Google search results. DataForSEO returns them as SERP items with type: "ai_overview". This track reuses the same infrastructure as the SERP API section above.

Cost: DataForSEO: $0.60/1K standard results — an AI Overview trigger is a byproduct of the SERP pull, no extra cost. Run this in the same Cloud Run Job as general SERP tracking to avoid duplicate API calls.

What to capture from DataForSEO AI Overview items: When a SERP result contains an AI Overview, the items array will include an entry with type: "ai_overview". Extract:

keyword — the query that triggered it
ai_overview_text or equivalent field — the generated answer text
sources — URLs cited in the AI Overview (these are the "winners" worth tracking)
pulled_at — timestamp of the SERP pull

Store in serp_tracking.ai_overview_appearances. Join with organic_results to see which keywords trigger AI Overviews vs. traditional results.

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.serp_tracking.ai_overview_appearances` (
  pulled_at TIMESTAMP,
  keyword STRING,
  has_ai_overview BOOL,
  ai_overview_text STRING,
  cited_urls ARRAY<STRING>,
  cited_domains ARRAY<STRING>
)
PARTITION BY DATE(pulled_at)
CLUSTER BY keyword;

Track 2: Conversational AI Monitoring (ChatGPT, Claude, Gemini, Perplexity)

This track queries LLMs directly via their APIs and asks brand/topic questions, then stores the responses for analysis. Use Python async for efficiency across multiple providers.

Cost per provider

Provider	Model	Approximate Cost
OpenAI	GPT-4o	$0.0025/1K input tokens, $0.010/1K output tokens
Anthropic	Claude Sonnet	$3.00/1M input tokens, $15.00/1M output tokens
Google	Gemini 1.5 Flash	Free at low volume (check current quota limits)
Perplexity	Sonar	~$1.00/1K queries

At typical monitoring volumes (50–200 queries/day across all providers), expect $1–$10/day total depending on query complexity and provider mix.

Setup Steps

pip install openai anthropic google-generativeai requests google-cloud-bigquery asyncio

import asyncio
import json
from datetime import datetime, timezone
from google.cloud import bigquery

import openai
import anthropic
import google.generativeai as genai

BQ_PROJECT = "marty-search-console-ga4-api"
BQ_TABLE = f"{BQ_PROJECT}.llm_monitoring.responses"

PROMPTS = [
    "What is AimClear?",
    "Who are the top digital marketing agencies for B2B paid search?",
    "What companies specialize in audience targeting for paid social?",
    # Add brand, competitor, and category prompts
]

async def query_openai(prompt: str, pulled_at: str) -> dict:
    client = openai.AsyncOpenAI()  # uses OPENAI_API_KEY env var
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    return {
        "pulled_at": pulled_at,
        "provider": "openai",
        "model": "gpt-4o",
        "prompt": prompt,
        "response": response.choices[0].message.content,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
    }

async def query_anthropic(prompt: str, pulled_at: str) -> dict:
    client = anthropic.AsyncAnthropic()  # uses ANTHROPIC_API_KEY env var
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "pulled_at": pulled_at,
        "provider": "anthropic",
        "model": "claude-sonnet-4-5",
        "prompt": prompt,
        "response": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }

async def query_perplexity(prompt: str, pulled_at: str) -> dict:
    import httpx
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.perplexity.ai/chat/completions",
            headers={"Authorization": f"Bearer {PERPLEXITY_API_KEY}"},
            json={
                "model": "sonar",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 500,
            },
        )
        response.raise_for_status()
        data = response.json()
    return {
        "pulled_at": pulled_at,
        "provider": "perplexity",
        "model": "sonar",
        "prompt": prompt,
        "response": data["choices"][0]["message"]["content"],
        "input_tokens": data.get("usage", {}).get("prompt_tokens"),
        "output_tokens": data.get("usage", {}).get("completion_tokens"),
    }

async def run_all_prompts():
    pulled_at = datetime.now(timezone.utc).isoformat()
    tasks = []
    for prompt in PROMPTS:
        tasks.append(query_openai(prompt, pulled_at))
        tasks.append(query_anthropic(prompt, pulled_at))
        tasks.append(query_perplexity(prompt, pulled_at))
        # Add Gemini similarly using google.generativeai
    results = await asyncio.gather(*tasks, return_exceptions=True)

    rows = [r for r in results if isinstance(r, dict)]
    errors = [r for r in results if isinstance(r, Exception)]
    if errors:
        print(f"Errors: {errors}")

    bq = bigquery.Client(project=BQ_PROJECT)
    bq.insert_rows_json(BQ_TABLE, rows)
    print(f"Loaded {len(rows)} LLM responses")

if __name__ == "__main__":
    asyncio.run(run_all_prompts())

BigQuery table for LLM responses

CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.llm_monitoring.responses` (
  pulled_at TIMESTAMP,
  provider STRING,
  model STRING,
  prompt STRING,
  response STRING,
  input_tokens INT64,
  output_tokens INT64
)
PARTITION BY DATE(pulled_at)
CLUSTER BY provider, prompt;

Post-processing: brand mention detection

After loading responses, run a scheduled BQ query or dbt model to detect brand/competitor mentions:

SELECT
  pulled_at,
  provider,
  prompt,
  LOWER(response) LIKE '%aimclear%' AS mentions_aimclear,
  LOWER(response) LIKE '%competitor_a%' AS mentions_competitor_a,
  response
FROM `marty-search-console-ga4-api.llm_monitoring.responses`
WHERE DATE(pulled_at) = CURRENT_DATE()
ORDER BY pulled_at DESC;

Specialized Tools (evaluate vs. build)

Tool	Notes
Otterly.ai	SaaS specifically for AI mention/citation monitoring; BQ integration varies — check current offering
thruuu LLM API	Has native BigQuery integration; purpose-built for LLM answer tracking
DemandSphere	Enterprise SEO/SERP platform with LLM visibility tracking; check pricing

If AimClear plans to offer LLM monitoring as a client service at scale, evaluate these tools against the custom script approach. For internal use or a handful of clients, the custom script is sufficient and cheaper.

Key Gotchas

LLM responses are non-deterministic. The same prompt returns different text on each call. This is by design — use daily snapshots, not single-run spot checks, to build a statistically meaningful signal.
Gemini free tier has rate limits. At low volume, Gemini 1.5 Flash is free, but sustained daily pulls may exhaust the free quota. Monitor usage in Google AI Studio.
Perplexity Sonar includes citations. Unlike other LLMs, Perplexity's responses include source URLs. Parse the citations array in the response to see which pages Perplexity is citing — that's actionable SEO signal.
Model version drift. LLM providers change model versions. Pin specific model versions in your API calls and set up alerts if a model is deprecated.
API key security. Store all API keys in Google Secret Manager, not in environment variables embedded in Docker images. Use Cloud Run's secret injection at runtime.
Cost control. Set spending limits on each LLM provider account. A misconfigured loop could generate unexpected charges. OpenAI and Anthropic both support spending caps.
Google AI Overviews are separate from Gemini API. The Gemini API is Google's LLM product; Google AI Overviews are a SERP feature. They may use similar underlying technology, but querying Gemini via API does not tell you what appears in AI Overviews on google.com.

Google Ads

Method: Native BigQuery Data Transfer Service (BDTS)
Cost: Free (standard BQ storage and query costs apply)
Setup time: ~20 minutes
Status: Not started

Path: Native BDTS connector — free. ~30+ standard tables covering campaigns, ad groups, ads, keywords, audiences, conversions, shopping, video.

The Google Ads BDTS connector exports ~30+ tables covering campaigns, ad groups, ads, keywords, audiences, conversions, and performance metrics. It's the canonical source for paid search data in the pipeline — clicks, impressions, costs, and ROAS at every level of the account hierarchy.

January–February 2025: switched to Google Ads API v16 (some column name changes)
Now supports custom reports via Google Ads Query Language (GAQL)

Setup Steps

Open BigQuery console → go to project marty-search-console-ga4-api → BigQuery Data Transfers (left nav).
Click "+ Create Transfer" → search for "Google Ads" → select it.
Configure the transfer:
- Display name: something like google-ads-{client}
- Schedule: Daily (default, recommended)
- Destination dataset: create a new dataset or use an existing one (e.g., google_ads_export)
- Google Ads customer ID: enter the 10-digit customer ID (no dashes). For MCC accounts, use the top-level MCC ID to pull all child accounts.
Authorize: BDTS will prompt for Google account authorization. Use an account that has at least read access on the Google Ads account.
Save and run. The first run may take 30–60 minutes.
Verify in BigQuery that tables are appearing in your dataset.

Enabling custom GAQL reports (optional)

In the transfer configuration, you can add custom Google Ads Query Language (GAQL) reports beyond the default tables. This is useful for pulling data not included in the standard ~30 tables (e.g., campaign-level audience performance).

Key Gotchas

API version cutover (January–February 2025): Google Ads BDTS switched to Ads API v16. If you had older transfers configured before that window, they may have broken or had schema changes. Re-verify any pre-existing transfers.
No backfill on first run. The connector pulls the last ~30 days on initial setup; earlier history is not available.
MCC vs. individual accounts. Using an MCC ID exports all child accounts into separate table shards — one shard per customer ID per date. Query patterns need to account for this (filter by customer_id).
~30+ tables, not all useful. The default export includes tables for ad groups, ads, keywords, audiences, campaigns, conversions, click performance, and more. Start by understanding which tables map to your reporting needs before querying everything.
Table naming: Tables follow {resource}_YYYYMMDD (e.g., Campaign_20260507). Current-day data lands as *_current tables.
YouTube campaigns are included here. YouTube video campaigns managed in Google Ads appear in this export. See the YouTube Ads section for organic YouTube data.
Conversion actions: Make sure conversion actions are configured correctly in the Google Ads account before relying on BQ conversion data.
MCC authorization: If using an MCC, the authorizing account must have at least read access at the MCC level, not just the child account level.

Options Compared

Method	Cost	Notes
Native BDTS (recommended)	Free	~30 tables, daily, official support, GAQL custom reports available
Fivetran	Free tier available; paid tiers well below $12K/year enterprise floor	Managed, normalized schema; useful for multi-client consolidation
Airbyte	Free (open-source) / Cloud pricing varies	Open-source connector available; more setup overhead
Custom Python (Google Ads API)	Compute costs only	Full API flexibility but you own pagination, retry logic, schema management

Recommendation: Use native BDTS unless you need cross-client normalization that Fivetran provides.

YouTube Ads

Method: Google Ads BDTS connector (YouTube paid campaigns flow through Google Ads)
Cost: Free (standard BQ storage and query costs apply)
Setup time: ~5 minutes additional if Google Ads BDTS is already configured
Status: Not started

Path: Covered by the Google Ads BDTS connector — YouTube campaigns are managed in Google Ads. No separate connector needed.

YouTube paid advertising — TrueView, bumpers, discovery ads, and other video campaign formats — is managed inside Google Ads and exported through the Google Ads BDTS connector. There is no separate BDTS connector for YouTube Ads. A separate BDTS connector exists for organic YouTube channel analytics (views, subscribers, engagement on owned channels).

Note: There is a separate YouTube Channel BDTS connector for organic channel analytics (views, subscribers, watch time) — different from ad performance.

For YouTube Paid Campaigns (ad performance data)

YouTube campaigns are managed in Google Ads. If you have the Google Ads BDTS connector configured, YouTube ad performance is already included. See the Google Ads section for setup instructions.

To confirm YouTube campaign data is present:

In BigQuery, query the Campaign_YYYYMMDD table from your Google Ads export.
Filter by advertising_channel_type = 'VIDEO' (or check the campaign_type field — exact field name varies by API version).
Video ad-specific metrics (views, view rate, CPV) appear in the VideoPerformance or similar tables depending on your BDTS configuration.

For Organic YouTube Channel Analytics (owned channel data)

A separate BDTS connector exports organic channel data (views, watch time, subscribers, engagement) from the YouTube Data API / YouTube Analytics.

Open BigQuery console → Data Transfers → "+ Create Transfer" → search "YouTube".
Select "YouTube Channel Reports" connector.
Authorize with the Google account that owns or has Analyst access to the YouTube channel.
Set destination dataset (e.g., youtube_organic).
Save and run.

Key Gotchas

Paid vs. organic are completely separate data sources. Don't confuse the Google Ads export (paid performance) with the YouTube Channel Reports connector (organic channel analytics). They require different setup and land in different datasets.
YouTube campaign tables in Google Ads export. Video campaign metrics are spread across standard campaign/ad group tables plus video-specific tables. You'll need to join them to get a complete picture.
Organic connector data lag. YouTube Analytics data typically has a 2–3 day lag. Historical data may be available for backfill — check the connector settings during setup.
YouTube channel analytics require Analytics access. The account authorizing the organic connector must have at minimum Analyst access on the YouTube channel (not just viewer).
Paid reach metrics (unique viewers, frequency) are not in BDTS. These are available in the Google Ads UI (Reach Planner, brand lift studies) but not exported to BQ. Pull via Google Ads API custom GAQL if needed.
Content ownership data (for networks with Content ID) has a separate YouTube Reporting API and is not covered by either connector above.

Options Compared

Channel	Method	Notes
YouTube Paid	Google Ads BDTS (free)	Covered automatically if Google Ads export is live
YouTube Paid custom	Google Ads API + Python	For video-specific metrics not in standard tables
YouTube Organic	YouTube Channel Reports BDTS (free)	Owned channel only; views, watch time, subs
YouTube Organic via Fivetran	Free tier available	Alternative with managed schema normalization

Meta Ads

Method: Fivetran (free tier) or OWOX BI open-source connector; native BDTS is limited
Cost: Free (Fivetran free tier or OWOX open-source) to paid (Fivetran consumption tiers)
Setup time: ~30 minutes (Fivetran) / ~1–2 hours (OWOX or custom Python)
Status: Not started

Path: Native BDTS Facebook Ads connector (free, limited columns). Upgrade to OWOX open-source connector if custom breakdowns needed.

Meta Ads data covers Facebook and Instagram paid campaigns — spend, impressions, clicks, conversions, and audience breakdowns. No native Google BigQuery connector exists with full fidelity; the best paths are Fivetran (managed) or OWOX BI's open-source connector (free, more flexible). A limited native BDTS connector exists but lacks custom breakdowns.

Native BDTS covers Campaign, Ad Set, Ad, Account level performance. Standard metrics only.
No custom breakdowns (age/gender/placement splits), no creative-level video metrics, no custom conversions in non-standard dimensions via native connector
OWOX offers a free open-source Meta Ads → BigQuery connector with more flexibility than BDTS
Third-party paid options (Fivetran, Airbyte Cloud, Windsor.ai, Supermetrics Enterprise) exist for richer data models

Option A: Fivetran (recommended for managed, reliable delivery)

Log in to Fivetran → Connectors → "+ Add Connector" → search "Facebook Ad Insights".
Authorize with a Meta Business account that has access to the ad account(s) you want to export.
Select ad account(s). You can pull multiple accounts.
Set destination: BigQuery project marty-search-console-ga4-api, choose or create a dataset (e.g., meta_ads_fivetran).
Configure sync frequency (daily recommended on free tier; more frequent on paid).
Select tables/reports. Fivetran pulls core ad performance by default; configure breakdowns (age, gender, placement, device) as needed.
Save and run initial sync. First sync may backfill available history.

Option B: OWOX BI open-source connector (free, more breakdown flexibility)

OWOX offers a free, open-source Google Sheets → BigQuery pipeline for Meta Ads that gives more control over breakdowns than the native BDTS option.

Visit owox.com or search GitHub for "OWOX BigQuery pipeline".
Follow their setup guide — it typically involves authorizing a Google Workspace add-on or deploying a script.
Connect your Meta Ads account and configure the breakdown dimensions you need.
Set a BigQuery destination and schedule.

Option C: Native BDTS (limited — standard metrics only)

A Meta Ads BDTS connector exists but only exports standard metrics with limited breakdown support. No custom audience segments or placement-level detail.

BigQuery console → Data Transfers → "+ Create Transfer" → search "Facebook".
Authorize with Meta Business account.
Select destination dataset and ad account.
Note the breakdown limitations before committing to this path.

Key Gotchas

Native BDTS is limited. Standard metrics only, no custom breakdowns (age, gender, placement, device-level splits). If you need detailed breakdowns for optimization, use Fivetran or OWOX.
Meta Marketing API rate limits. All connectors use the Meta Marketing API underneath. Heavy historical backfills can hit rate limits — Fivetran manages retries automatically; custom scripts need explicit backoff logic.
Attribution windows matter. Meta reports data using click and view attribution windows (1-day, 7-day, 28-day). Make sure the connector you choose is pulling with the attribution window that matches your reporting convention — changing it mid-stream causes comparability issues.
Conversion API vs. pixel. If the client uses Meta's Conversions API (CAPI) alongside or instead of the pixel, verify that CAPI events are appearing in the ad account's reported conversions before assuming BQ data is complete.
Ad account access. The authorizing user must have Advertiser or above access to the Meta Ad Account, not just the Business Manager.
Data retention in Meta API. Meta limits how far back you can pull via the API (typically 37 months). Backfill beyond that isn't possible.
Currency. If you manage multi-currency ad accounts, verify the connector exports a currency field and how it handles FX for aggregated reporting.

Options Compared

Method	Cost	Breakdown Flexibility	Notes
Fivetran (recommended)	Free tier available; paid tiers well below $12K/year enterprise floor	High	Managed, reliable, handles retries and schema changes
OWOX BI open-source	Free	High	More flexible breakdowns; requires some DIY setup
Native BDTS	Free	Low	Quick to set up; standard metrics only, no custom breakdowns
Airbyte	Free (open-source) / Cloud pricing varies	Medium	Open-source connector available
Custom Python (Meta Marketing API)	Compute costs only	Full	You own everything: pagination, rate limits, schema

Bing Ads

Method: Fivetran (recommended) or custom Python script via bingads SDK
Cost: Fivetran free tier available; Python: compute costs only
Setup time: ~30 minutes (Fivetran) / ~2–4 hours (custom Python)
Status: Not started

Path: Airbyte Cloud or custom Python script using Microsoft bingads SDK. No native BDTS connector.

Microsoft Advertising (Bing Ads) data covers paid search and audience campaigns on the Microsoft network — Bing, Yahoo, MSN, and partner sites. There is no native BigQuery Data Transfer Service connector for Bing Ads. Fivetran has a managed connector (free tier available), and Microsoft publishes a Python SDK (bingads on PyPI) for custom scripts.

Fivetran has a mature connector but minimum $12K/year annual contract — overkill for one source
Airbyte Cloud: ~$15/million rows, starting $10/month. Likely $20–60/month at agency ad reporting volumes. Open-source connector actively maintained.
Custom Python script: Microsoft Advertising Python SDK (bingads on PyPI, maintained by Microsoft), standard OAuth, free API usage. Moderate complexity.
If already running Airbyte self-hosted for other sources, add Bing Ads there for $0 incremental cost.

Option A: Fivetran (recommended)

Log in to Fivetran → Connectors → "+ Add Connector" → search "Microsoft Advertising".
Click "Authorize" → you'll be redirected to Microsoft login. Use an account with access to the Microsoft Advertising account(s).
Select account(s). Fivetran can pull from one or multiple accounts.
Set destination: BigQuery project marty-search-console-ga4-api, dataset (e.g., bing_ads_fivetran).
Configure sync frequency (daily recommended on free tier).
Save and run initial sync. Fivetran will backfill available history on first run (check how far back — typically 30–90 days depending on plan).
Verify tables are appearing in BigQuery.

Option B: Custom Python via `bingads` SDK

pip install bingads

Key steps:

Register an app in the Microsoft Azure portal (Azure AD → App registrations → New registration).
Get API credentials: Client ID, Client Secret, Developer Token (from Microsoft Advertising UI → Tools → API Center).
OAuth flow to get refresh token — Microsoft provides a sample script in the SDK docs.
Write a report request using the SDK's ReportingServiceManager. Choose report type (e.g., CampaignPerformanceReportRequest).
Download report and parse (TSV format).
Load into BigQuery using google-cloud-bigquery Python client or bq load.

# Minimal pattern — fill in auth details from your registered app
from bingads.service_client import ServiceClient
from bingads.authorization import OAuthDesktopMobileAuthCodeGrant, AuthorizationData

# See Microsoft Advertising Python SDK docs for full auth + reporting examples
# SDK repo: https://github.com/BingAds/BingAds-Python-SDK

Deploy as a Cloud Run Job on a daily schedule (preferred over PythonAnywhere for production).

Key Gotchas

No native BDTS connector. There's no built-in BigQuery connector from Microsoft — you must use a third-party tool or custom code.
Developer Token required. Even for read-only API access, you need a Developer Token from the Microsoft Advertising UI (Tools → API Center → Request Developer Token). Standard tokens are available to any advertiser; production/approved tokens may require a short application review.
OAuth refresh tokens expire. Microsoft refresh tokens have a finite lifetime (90 days without use). Any automated script needs to handle token refresh and persist updated tokens — a common failure point.
Report format is TSV. Downloaded reports are tab-separated, not JSON. Parse carefully and handle missing/null values.
Report availability lag. Performance data is typically available with a 2–3 hour lag; some conversion data up to 24h.
Multi-account/MCC structure. If managing multiple accounts under a Manager Account, handle the account hierarchy in your API calls to avoid pulling only the parent account.
Airbyte option: Airbyte Cloud has a Microsoft Advertising connector (~$15/million rows on Cloud, free on open-source self-hosted). Worth considering if you already have Airbyte set up.

Options Compared

Method	Cost	Notes
Fivetran (recommended)	Free tier available; paid tiers well below $12K/year enterprise floor	Managed, handles auth/token refresh, reliable schema
Airbyte	Free (open-source self-hosted) / ~$15/million rows (Cloud)	Open-source connector; self-hosted requires infra
Custom Python (`bingads` SDK)	Compute costs only	Full control; you own auth, retry, schema — most setup effort

LinkedIn Ads

Method: Fivetran (recommended) or Airbyte
Cost: Fivetran free tier available; Airbyte Cloud pricing varies
Setup time: ~30 minutes (connector setup) + 2–4 weeks (LinkedIn app approval — start immediately)
Status: Not started — apply for LinkedIn Marketing API access now

Path: Airbyte Cloud after obtaining LinkedIn Marketing API access. Apply for API access immediately — 2–4 week approval wait.

LinkedIn Ads data covers Sponsored Content, Message Ads, Dynamic Ads, and Text Ads — impressions, clicks, spend, conversions, and demographic breakdowns (job title, company size, industry, seniority). There is no native BDTS connector. Fivetran and Airbyte both use the LinkedIn Marketing API, which requires explicit app approval from LinkedIn — a process that can take 2–4 weeks.

No native BDTS connector
LinkedIn Marketing API requires app creation at developer.linkedin.com + Marketing Developer Platform access application. Approval is case-by-case.
Fivetran has a mature LinkedIn Ad Analytics connector but is expensive ($12K/year minimum)
Airbyte Cloud: ~$15/million rows (~$20–50/month at typical agency volumes). Open-source connector available.
Custom Python script is viable once API is approved (LinkedIn has a Python SDK); moderate complexity due to OAuth + LinkedIn's pagination model
Rate limits are not precisely documented; LinkedIn throttles aggressively on bulk requests
Note: Marketing API covers paid data only. Organic LinkedIn Page analytics require the separate Community Management API.

Step 0: Apply for LinkedIn Marketing API access (do this first)

This is the longest part of the process. Do not wait until you're ready to configure a connector.

Go to developer.linkedin.com → My Apps → Create App.
Fill in the app details. You'll need:
- App name (e.g., "AimClear BigQuery Pipeline")
- A LinkedIn Company Page (you must have admin access to the page)
- App logo
Under the app, go to Products → request "Marketing Developer Platform" access.
Submit and wait. LinkedIn reviews applications manually. Typical timeline: 2–4 weeks.
Once approved, you'll have access to the r_ads and r_ads_reporting OAuth scopes.

Note: Fivetran and Airbyte both use your own LinkedIn app credentials under the hood — you can't skip the API approval step by using a managed connector.

Step 1: Configure Fivetran connector (after API approval)

Log in to Fivetran → Connectors → "+ Add Connector" → search "LinkedIn Ad Analytics".
Enter OAuth credentials from your approved LinkedIn app (Client ID, Client Secret).
Authorize with a LinkedIn account that has Campaign Manager access to the ad account(s).
Select ad accounts.
Set destination: BigQuery project marty-search-console-ga4-api, dataset (e.g., linkedin_ads_fivetran).
Configure sync frequency and save.
Run initial sync and verify tables in BigQuery.

Step 2: Verify data

Check that tables include campaign performance, creative performance, and demographic breakdowns (if available on your LinkedIn account tier).

Key Gotchas

2–4 week API approval wait. This is the single most important thing to plan around. Apply the moment this pipeline project starts. If you miss this step, you'll be waiting before you can connect anything.
LinkedIn throttles aggressively. The Marketing API has rate limits, but they are not precisely documented. LinkedIn silently throttles heavy bulk requests. Fivetran and Airbyte handle retry logic, but custom scripts need exponential backoff.
Data retention via API. LinkedIn's API limits historical data pull depth — verify the exact lookback window during connector setup (typically 2 years).
Demographic breakdowns require Campaign Manager access. The authorizing LinkedIn account must have Campaign Manager access on the specific ad account, not just general LinkedIn Business Manager access.
App review can be rejected. LinkedIn may request clarification or additional information about your use case. Prepare a clear description of why you need reporting API access (data warehouse for client reporting is a valid use case).
Do not confuse LinkedIn Ads with LinkedIn Organic. The Marketing API for ads and the Pages API for organic are separate. See the LinkedIn Organic section for organic data.
One app per organization is recommended. If you're setting up for multiple clients, a single AimClear app with multiple ad account authorizations is cleaner than one app per client.

Options Compared

Method	Cost	Notes
Fivetran (recommended)	Free tier available; paid tiers well below $12K/year enterprise floor	Managed, handles API auth, schema normalized; requires your LinkedIn app approval
Airbyte	Free (open-source) / Cloud pricing varies	Open-source connector available; same LinkedIn app approval required
Custom Python (LinkedIn Marketing API)	Compute costs only	Full control; same app approval required; handle rate limits and pagination yourself

Google Ad Manager

Method: Native BDTS (Google Ad Manager Data Transfer) — but requires Ad Manager 360
Cost: Free data transfer, but Ad Manager 360 is a premium/enterprise product
Setup time: ~1–2 hours + Google support ticket time (potentially days/weeks)
Status: Not started — verify Ad Manager 360 eligibility first

Path: Native BDTS connector (free) — but requires Data Transfer file access, which is not self-serve.

Google Ad Manager's BigQuery Data Transfer exports impression-level ad serving data — every impression, click, and viewability event from the ad server. This is powerful for publisher-side analysis, revenue attribution, and cross-channel delivery reconciliation. However, it is only available to Ad Manager 360 (GAM 360) accounts, not standard Ad Manager accounts. Self-serve setup is not possible; it requires a Google support ticket.

BDTS connector reads from a GCS bucket that Google Ad Manager populates with DT files
DT file access must be enabled by contacting Google Ad Manager support — typically requires Ad Manager 360 (premium/enterprise). Free Ad Manager tier likely ineligible.
Verify account type before planning this pipeline. Contact Google rep if needed.
Impression-level data at scale can be very high volume (terabytes). Monitor BigQuery storage costs.

Step 0: Verify Ad Manager 360 access

Before investing any setup time, confirm the account type:

In Google Ad Manager → Admin → Network settings → look for "Ad Manager 360" in the account tier/type.
Or contact your Google account team — standard Ad Manager accounts cannot enable Data Transfer without upgrading.

If the account is standard Ad Manager (not 360), skip this connector and use the Ad Manager API or Reporting API instead (see Options Compared below).

Step 1: Enable Data Transfer file access (requires Google support ticket)

Contact Google Ad Manager support and request Data Transfer file access for your network.
Provide: Ad Manager Network Code, GCP project ID (marty-search-console-ga4-api), and the GCS bucket where you want files delivered.
Wait for Google to provision access. Timeline varies — days to weeks.

Step 2: Set up Google Cloud Storage bucket for Data Transfer files

# Create a GCS bucket for Ad Manager Data Transfer files
gsutil mb -p marty-search-console-ga4-api gs://aimclear-gam-data-transfer

# Grant Ad Manager's service account access to the bucket
# (Google support will provide the exact service account email)
gsutil iam ch serviceAccount:{gam-service-account}@gserviceaccount.com:objectAdmin gs://aimclear-gam-data-transfer

Step 3: Configure BDTS to load GCS files into BigQuery

Once Data Transfer files are landing in GCS:

BigQuery console → Data Transfers → "+ Create Transfer" → "Google Ad Manager".
Point to the GCS bucket and set destination dataset (e.g., google_ad_manager).
Set schedule and save.

Step 4: Validate data

Run a query against the resulting tables to verify impression counts match Ad Manager UI reports.
Data Transfer files are large — impression-level data for active networks generates substantial BQ storage costs.

Key Gotchas

Ad Manager 360 only. This is the single largest blocker. Standard GAM accounts cannot use Data Transfer. Verify this before anything else.
Requires a Google support ticket. There is no self-serve activation. Factor in ticket wait time (days to weeks) when planning timelines.
Very high data volume. Impression-level data for an active ad network generates massive BQ tables. Partition tables by date, consider clustering by order/line item, and set up lifecycle policies on GCS. Query costs can be significant.
Data lag. Data Transfer files are typically available 24–48 hours after the serving date.
File format complexity. Data Transfer files use a specific format (CSV-like with headers). The BDTS connector handles parsing, but if you're processing files directly, review the Ad Manager Data Transfer documentation carefully.
No backfill on initial setup. Data Transfer starts from the date files begin landing in GCS. Google may be able to re-provision historical files — ask during the support ticket process.
Viewability data is separate. Active View (viewability) data may be in separate files/tables from impression data.

Options Compared

Method	Cost	Access Requirement	Notes
Native BDTS (Data Transfer)	Free transfer + BQ storage	Ad Manager 360 only	Impression-level; requires support ticket
Ad Manager Reporting API	Free	Standard or 360	Aggregated reports only (no impression-level); self-serve
Fivetran	Free tier available	Standard or 360	Pulls via Reporting API; aggregated, not impression-level
Custom Python (Reporting API)	Compute costs only	Standard or 360	Full API access to aggregated reports; impression-level not available

If you have Ad Manager 360: Use native BDTS for impression-level data.
If you have standard Ad Manager: Use the Reporting API (via Fivetran or custom Python) for aggregated reports.

Other Programmatic / DSPs

Depends on specific platforms. Notable:

DV360 / Display & Video 360: Native BDTS connector available (free)
The Trade Desk: Fivetran or Airbyte connector; no native BDTS
Amazon DSP: Airbyte connector; no native BDTS

Clarify which specific programmatic platforms are actually in scope before building pipelines.

Optimizer

Method: TBD — depends on vendor capabilities; see decision tree below
Cost: Depends on method chosen
Setup time: ~30 minutes (if native export exists) to ~4 hours (custom script)
Status: Not started — start by verifying vendor export options

No public documentation found for a product specifically named "Optimizer." Optimizer is a paid channel aggregator that consolidates performance data across multiple ad platforms. Before building a pipeline, the key question is: what does Optimizer offer for data export, and does pulling from Optimizer provide more value than pulling from source channels (Google Ads, Meta, Bing, LinkedIn) directly?

This section is a decision template. Fill in specifics after reviewing Optimizer's documentation or contacting their support team. Check with the vendor:

Does it expose a REST API or export endpoints?
Does it write to cloud storage (GCS/S3)? BigQuery can ingest directly from GCS.
Does it have a direct BigQuery export feature?
What dimensions/metrics does it expose? Aggregators sometimes summarize in ways that lose granularity needed for deep analysis.

If Optimizer already aggregates paid channels and the data fidelity is acceptable, pulling from it instead of each channel separately reduces pipeline complexity.

Decision Tree: How to Connect

Step 1: Check what Optimizer offers natively

Contact Optimizer support or review their API/export documentation:

Does Optimizer offer a direct BigQuery export or scheduled GCS dump?
- Yes → use it. This is the simplest path. Ask for format documentation (schema, partition strategy, null handling).
- No → continue to Step 2.
Does Optimizer have a REST API?
- Yes → write a custom Python script using requests/httpx. See the script template below.
- No → continue to Step 3.
Does Optimizer offer CSV/TSV export to email or SFTP?
- Yes → set up automated export to GCS and load with BQ load jobs.
- No → escalate to your Optimizer account manager — this is a significant limitation.

Step 2: Assess whether pulling from Optimizer is worth it

Optimizer aggregates data that may already be flowing into BigQuery from source channels:

Data in Optimizer	Already in BQ?	Pull from Optimizer?
Google Ads performance	Yes (BDTS)	Only if Optimizer normalizes/enriches it
Meta Ads performance	Yes (Fivetran)	Only if Optimizer adds cross-channel normalization
Bing Ads performance	Yes (Fivetran)	Only if Optimizer's schema is cleaner
Cross-channel aggregated metrics	Maybe not	Yes — this is Optimizer's core value
Optimizer-specific attribution	Unique to Optimizer	Yes

Rule of thumb: If Optimizer provides normalization or attribution logic on top of raw channel data, that added layer is worth pulling. If it's just re-exporting the same numbers from Google Ads and Meta without transformation, pull from source channels for higher fidelity.

Step 3: REST API — custom Python script template

If Optimizer has a REST API, adapt this pattern:

pip install requests google-cloud-bigquery

import requests
from google.cloud import bigquery
from datetime import datetime, timezone, timedelta

BQ_PROJECT = "marty-search-console-ga4-api"
OPTIMIZER_BASE_URL = "https://api.optimizer-vendor.com/v1"  # replace with actual URL
OPTIMIZER_API_KEY = "YOUR_API_KEY"  # store in Secret Manager

headers = {
    "Authorization": f"Bearer {OPTIMIZER_API_KEY}",
    "Content-Type": "application/json",
}

def fetch_performance(start_date: str, end_date: str) -> list:
    """
    Pull performance data from Optimizer API.
    start_date/end_date: YYYY-MM-DD format
    Adjust endpoint path, params, and pagination to match actual API spec.
    """
    url = f"{OPTIMIZER_BASE_URL}/reports/performance"
    params = {
        "start_date": start_date,
        "end_date": end_date,
        "dimensions": "date,channel,campaign",  # adjust to API spec
        "metrics": "impressions,clicks,spend,conversions",
    }
    rows = []
    page = 1
    while True:
        response = requests.get(url, headers=headers, params={**params, "page": page})
        response.raise_for_status()
        data = response.json()
        rows.extend(data.get("results", []))
        if not data.get("has_next_page"):
            break
        page += 1
    return rows

def load_to_bq(rows: list, table_id: str):
    bq = bigquery.Client(project=BQ_PROJECT)
    pulled_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        row["pulled_at"] = pulled_at
    errors = bq.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BQ errors: {errors}")
    print(f"Loaded {len(rows)} rows to {table_id}")

if __name__ == "__main__":
    yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
    rows = fetch_performance(yesterday, yesterday)
    load_to_bq(rows, f"{BQ_PROJECT}.optimizer.performance_daily")

Step 4: Deploy as a Cloud Run Job

gcloud run jobs create optimizer-pipeline \
  --image gcr.io/marty-search-console-ga4-api/optimizer-pipeline \
  --region us-central1

# Schedule daily
gcloud scheduler jobs create http optimizer-daily \
  --schedule="0 8 * * *" \
  --uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/marty-search-console-ga4-api/jobs/optimizer-pipeline:run" \
  --oauth-service-account-email=YOUR_SA@marty-search-console-ga4-api.iam.gserviceaccount.com

Key Gotchas

Unknown vendor specifics. This document is a template. The actual API endpoint paths, authentication method, data schema, and pagination behavior need to be confirmed with Optimizer's documentation or support team before writing production code.
Fidelity trade-off. Aggregators like Optimizer may apply their own attribution or normalization rules that differ from source platform numbers. If exact platform spend figures are required for billing reconciliation, pull from source channels. If cross-channel normalized metrics for optimization decisions are the goal, Optimizer may be authoritative.
Avoid double-counting. If Optimizer data lands in BQ alongside raw channel data (Google Ads, Meta, Bing), be explicit in your data model about which table is source of truth for which metrics. Don't aggregate both without a dedup strategy.
Data freshness. Aggregators often have a lag on top of the platform's own lag. Verify Optimizer's data freshness SLA before relying on it for day-over-day reporting.
API key rotation. Store Optimizer credentials in Google Secret Manager and inject at Cloud Run Job runtime. Never embed keys in Docker images or commit them to source control.

Questions to Answer Before Building

Before writing any code, get answers to these from Optimizer:

Is there a native BigQuery connector or GCS export?
Is there a REST API? What's the authentication method (API key, OAuth, HMAC)?
What are the available dimensions and metrics?
What is the data freshness / lag from platform to Optimizer?
How far back can you pull historical data?
What is the rate limit on API calls?
Does Optimizer's schema overlap with your source channel tables, and if so, how?

Architectural Recommendations

Python Scripts (Reddit organic, Wikipedia, LLMs, Bing Ads if no Airbyte)

Use Cloud Run Jobs over PythonAnywhere for production:

Native GCP IAM — no key file management, uses Workload Identity
Container-based — reproducible environments
Built-in Cloud Scheduler integration for daily/hourly runs
Scales to zero — costs pennies per invocation
PythonAnywhere is fine for prototyping

Use google-cloud-bigquery Python client with Workload Identity on Cloud Run. Write via streaming inserts (small volumes) or batch LOAD jobs (daily bulk pulls).

ETL Strategy (Bing Ads, LinkedIn Ads, Reddit Ads)

If adding multiple non-native sources, consider Airbyte self-hosted on a small GCE VM — one management interface covers all three connectors and more. Free (you run the infrastructure, costs ~$20–40/month for a small e2-micro).

Airbyte Cloud at $15/million rows is the lower-ops alternative if self-hosting is too much friction.

Do not use Fivetran — minimum $12K/year annual contract is not justified for this use case.

BDTS Pricing Note

Starting September 25, 2025, BigQuery DTS switches third-party marketplace connectors to consumption-based pricing (slot-hours). First-party Google connectors (GA4, Google Ads, Search Console, YouTube Ads, Meta Ads, Ad Manager) remain free. Only third-party Fivetran/partner connectors sold through GCP Marketplace are affected.

Immediate Action Items

Apply for LinkedIn Marketing API access now — 2–4 week wait, blocks LinkedIn Ads pipeline
Switch SERP API provider from SerpApi to DataForSEO (lawsuit risk)
Verify Google Ad Manager account type — DT file access required for BDTS connector
Set up Airbyte (self-hosted or Cloud) to cover Bing Ads + LinkedIn Ads (once approved) + Reddit Ads
Prototype Reddit/Wikipedia/LLM Python scripts on PythonAnywhere; migrate to Cloud Run Jobs for production
Do not build LinkedIn personal-token pipeline — account termination risk is real after 2025 enforcement actions

MartyMCP Guides

Data Source Setup

Overview

Decision Matrix

GA4 — Google Analytics 4

At a Glance

Setup Steps

Key Gotchas

Options Compared

Search Console

At a Glance

Setup Steps

Key Gotchas

Options Compared

Reddit Organic

At a Glance

Step 1: Apply for Reddit API access (commercial tier)

Step 2: Install PRAW and write the pull script

Step 3: Create the BigQuery table

Step 4: Deploy as a Cloud Run Job (preferred over PythonAnywhere for production)

Key Gotchas

Options Compared

LinkedIn Organic

At a Glance

Critical Constraint: Own Pages Only

Step 1: Ensure LinkedIn app access

Step 2: Get admin access to the Company Page

Step 3: Write the pull script

Step 4: Deploy as a Cloud Run Job

Key Gotchas

Options Compared

Wikipedia

At a Glance

Option A: Wikimedia REST API — article content and metadata (~20 lines of Python)

Option B: BigQuery public dataset — page views at scale

Key Gotchas

Options Compared

SERP API

At a Glance

Vendor Selection

Step 1: Create DataForSEO account

Step 2: Write the pull script

Step 3: Create BigQuery table

Step 4: Deploy as a Cloud Run Job

Key Gotchas

Options Compared

LLM Monitoring

At a Glance

Track 1: Google AI Overviews (via DataForSEO SERP API)

Track 2: Conversational AI Monitoring (ChatGPT, Claude, Gemini, Perplexity)

Cost per provider

Setup Steps

BigQuery table for LLM responses

Post-processing: brand mention detection

Specialized Tools (evaluate vs. build)

Key Gotchas

Google Ads

Setup Steps

Enabling custom GAQL reports (optional)

Key Gotchas

Options Compared

YouTube Ads

For YouTube Paid Campaigns (ad performance data)

For Organic YouTube Channel Analytics (owned channel data)

Key Gotchas

Options Compared

Meta Ads

Option A: Fivetran (recommended for managed, reliable delivery)

Option B: OWOX BI open-source connector (free, more breakdown flexibility)

Option C: Native BDTS (limited — standard metrics only)

Key Gotchas

Options Compared

Bing Ads

Option A: Fivetran (recommended)

Option B: Custom Python via bingads SDK

Key Gotchas

Options Compared

LinkedIn Ads

Step 0: Apply for LinkedIn Marketing API access (do this first)

Step 1: Configure Fivetran connector (after API approval)

Option B: Custom Python via `bingads` SDK