Data Source Setup
Per-source guides for ingesting marketing data into BigQuery — costs, setup paths, gotchas, and key SQL/CLI commands. Research conducted 2026-05-07.
Overview
Covers every data source in the AimClear multi-channel pipeline and the best path for getting each into BigQuery. Sources fall into three buckets: organic / content (analytics, search, social, SERPs, LLM coverage), paid channels (Google, Meta, Bing, LinkedIn, YouTube, Google Ad Manager), and aggregators (vendor tools that already roll several sources together).
Default approach: use Google's free BigQuery Data Transfer Service (BDTS) connectors wherever they exist, and pull anything else with the lightest-weight option that fits — Airbyte Cloud for production third-party sources, custom Python scripts on Cloud Run Jobs for one-off / niche feeds, and DataForSEO for SERP / AI Overview data.
Decision Matrix
| Source | Native BDTS | Best Path | Approx Monthly Cost |
|---|---|---|---|
| GA4 | ✅ Free | BDTS | $0 + BQ compute |
| Search Console | ✅ Free (own mechanism) | GSC Bulk Export | $0 + BQ compute |
| Google Ads | ✅ Free | BDTS | $0 + BQ compute |
| YouTube Ads | ✅ Free (via Google Ads) | BDTS (Google Ads connector) | $0 + BQ compute |
| Facebook/Meta Ads | ✅ Free (limited columns) | BDTS → OWOX if more detail needed | $0 |
| Google Ad Manager | ✅ Free (if DT access enabled) | BDTS (verify eligibility first) | $0 + BQ compute |
| Microsoft/Bing Ads | ❌ | Airbyte Cloud or Python (bingads SDK) | $20–60 or $0 |
| LinkedIn Ads | ❌ | Airbyte Cloud (after API approval — apply now) | $20–50 |
| Reddit Ads (paid) | ❌ | Airbyte Cloud | $15–30 |
| Reddit Organic (threads) | ❌ | Python/PRAW script | $0–10 API costs |
| LinkedIn Organic | ❌ (ToS risk) | Official Pages API only (own pages) | $0 |
| Wikipedia | ❌ | Python script (Wikimedia REST API) | $0 |
| SERP API | ❌ | DataForSEO pay-as-you-go | $5–50/day depending on volume |
| LLMs / AI Overviews | ❌ | Python async script + DataForSEO for Google AIO | $10–50 |
| Optimizer | Unknown | Check vendor API/export docs | TBD |
GA4 — Google Analytics 4
Path: Native BDTS connector — free. Setup in GA4 Admin → BigQuery Link or via BDTS in GCP Console.
- Produces
events_YYYYMMDDandevents_intraday_YYYYMMDDtables (daily export, 24–48h lag) - Schema is deeply nested (
event_paramsrequires UNNEST + COALESCE across 4 value types) — build flattened materialized views for marketer-facing queries - September 2025 schema update:
conversionsrenamed tokeyEvents, revenue fields changed from INTEGER to FLOAT - GA4 360 (paid) adds intraday streaming; free tier is daily only
- Data is not retroactive — link must be established before data starts
GA4's native BigQuery export pushes event-level data daily into your project. This is the most granular marketing data you can get from Google Analytics — every session, event, and conversion with all associated parameters — and it's the foundation for cross-channel attribution in BigQuery.
At a Glance
- Method: Native BigQuery Data Transfer Service (BDTS) — built into GA4
- Cost: Free (standard BQ storage and query costs apply)
- Setup time: ~15 minutes
Setup Steps
- Open GA4 Admin → go to the property you want to export → Admin → Product links → BigQuery links.
- Click "Link" → choose your BigQuery project (
marty-search-console-ga4-api). - Select datasets to export: at minimum choose "Events" (daily). Streaming export is available but accrues BQ streaming insert costs — start with daily.
- Choose export frequency: Daily (free). Streaming optional (costs extra).
- Save the link. GA4 will create a dataset in BigQuery named
analytics_{property_id}. - Verify the link in GA4 Admin and in BigQuery console — a new dataset should appear within 24 hours.
- Grant any analyst accounts BigQuery Data Viewer role on the dataset if needed.
Key Gotchas
- No backfill. The export starts from the day you enable it. Historical data before linking is not exported. Enable ASAP.
- 24–48 hour lag. Daily exports typically land the following day; some properties see up to 48h.
- Schema change (September 2025): The
conversions_*event naming convention changed —conversionsevents are now tracked askeyEvents. Any queries referencingconversionsby the old convention need updating. - Nested schema.
event_paramsis aREPEATED RECORD— you mustUNNEST(event_params)to access parameter values. Same applies touser_propertiesanditems. This surprises many SQL users coming from flattened UA exports. - Sampled vs. unsampled. The BQ export is always unsampled regardless of your GA4 tier. This is one of the main reasons to use it.
- One link per property. Each GA4 property links to one BQ project. If you have multiple properties, set up multiple links — each gets its own dataset.
- Table naming: Daily tables land as
events_YYYYMMDD. Intraday tables (if streaming) land asevents_intraday_YYYYMMDDand are overwritten throughout the day. - Quota: GA4 BQ export counts against your BigQuery quotas. Unlikely to be an issue at typical agency scale.
Options Compared
| Method | Cost | Notes |
|---|---|---|
| Native GA4 BDTS (recommended) | Free | Daily export, no third-party dependency, official support |
| Fivetran | Free tier available; paid tiers well below $12K/year enterprise floor | Adds managed schema normalization; overkill if native export meets your needs |
| Airbyte | Free (open-source) / Cloud pricing varies | More setup, same data available natively for free |
| Custom Python (GA4 Data API) | GCP compute costs only | Use only if you need custom dimensions/metrics not in the export; rate limits apply |
Recommendation: Use the native BDTS export. It's free, official, and the most complete representation of GA4 data available.
Search Console
Path: GSC Bulk Data Export (not BDTS — its own mechanism). Setup in GSC Settings → Bulk Data Export. Free.
- Three tables:
searchdata_site_impression,searchdata_url_impression,ExportLog - Requires Owner-level GSC permission
- Must grant
search-console-data-export@system.gserviceaccount.comBigQuery Job User + Data Editor roles - No historical backfill — data flows from activation date only
- Anonymized queries (low-impression) are excluded from URL-level table
GSC Bulk Data Export pushes daily impression and click data — site-level and URL-level — into BigQuery. This is the primary source for organic search performance: queries, pages, countries, devices, and positions. It feeds SEO analysis, content performance, and keyword coverage in the marketer pipeline.
At a Glance
- Method: GSC Bulk Data Export (native Google feature — not BDTS)
- Cost: Free (standard BQ storage and query costs apply)
- Setup time: ~20 minutes
Setup Steps
- Verify GSC access level. You must be an Owner (not just Full User) on the GSC property. Delegated ownership works. Check at: Search Console → Settings → Users and permissions.
- Grant BQ permissions to the GSC export service account.
The service account is:search-console-data-export@system.gserviceaccount.com
In BigQuery console → your project → IAM, grant it:BigQuery Job User(project level)BigQuery Data Editor(dataset level, or project level for simplicity)
- Open GSC Bulk Data Export. Go to: search.google.com/search-console → Settings (gear icon) → Bulk Data Export.
- Click "Set up Bulk Data Export" and select:
- BigQuery project:
marty-search-console-ga4-api - Dataset name: choose something like
search_console_export - Property: select the GSC property to export
- BigQuery project:
- Save. GSC will validate permissions and begin export within 24 hours.
- Verify in BigQuery that the dataset and tables appear the next day.
Key Gotchas
- No backfill. Export begins from the day you configure it. Historical GSC data is available in the UI for 16 months but is NOT exported retroactively.
- Owner-level access required. Full Users cannot configure Bulk Export. If a client's GSC is set up with restricted access, you'll need them to either grant ownership or set it up themselves.
- Three tables are created:
searchdata_site_impression— site-level aggregated data (query + device + country + date)searchdata_url_impression— URL-level data (query + URL + device + country + date)ExportLog— metadata about each export run (useful for debugging gaps)
- Data lag: Typically 2–3 days behind real-time. GSC itself processes data with delay.
- Sampling at URL level. The URL-impression table is more aggressively sampled than site-level. Low-volume queries/pages may not appear.
- Schema quirk: Some fields use integer codes for device type (MOBILE, DESKTOP, TABLET) — check GSC documentation for the enum mapping before querying.
- Service account requires BQ Job User at project level. Data Editor on the dataset alone is not enough — the export creates new tables, which requires job-level permissions.
Options Compared
| Method | Cost | Notes |
|---|---|---|
| GSC Bulk Data Export (recommended) | Free | Official, daily, three tables, no API rate limits |
| Fivetran | Free tier available | Useful if you need managed schema or multi-property consolidation |
| Airbyte | Free (open-source) / Cloud varies | Pulls via Search Console API; subject to API quotas (25K rows/request) |
| Custom Python (Search Console API) | Compute costs only | API has row limits per request; pagination required; viable for small properties |
Recommendation: Use GSC Bulk Data Export. It's free, quota-free, and higher fidelity than the API at scale.
Reddit Organic
Path: Custom Python script using PRAW library → BigQuery client.
- Reddit API: ~100 req/min free; $0.24/1K calls commercial beyond free tier
- Reddit introduced pre-approval requirement in 2025 for new commercial API use cases — register app use case as "marketing research / brand monitoring" and apply early
- Do not scrape without the API (HTML scraping triggers blocks and violates ToS)
- Design as incremental pulls (new posts since last run) — full corpus scrapes of high-volume subreddits take hours
- Cloud Run Jobs preferred over PythonAnywhere for production (GCP-native, IAM-integrated, scales to zero)
Reddit organic tracking monitors keyword-relevant threads and comments across subreddits — brand mentions, competitor discussions, industry topics, and sentiment signals. This is a custom pull, not a managed connector. The Reddit API (via PRAW) is the correct method; HTML scraping violates Reddit's ToS. As of 2025, Reddit added a pre-approval requirement for commercial API access.
At a Glance
- Method: Custom Python script using PRAW (Python Reddit API Wrapper)
- Cost: Reddit API: ~$0.24/1K calls at commercial tier; Cloud Run Job: ~$0 at low volume
- Setup time: ~2–4 hours (script + deployment)
Step 1: Apply for Reddit API access (commercial tier)
Reddit added a pre-approval step for commercial use in 2025.
- Go to reddit.com/prefs/apps → Create App.
- Select type: script (for server-side use) or web app (if OAuth flow needed).
- For commercial use (storing data in BQ, using it for client reporting), review Reddit's Data API Terms and apply for the appropriate access tier.
- Note your Client ID and Client Secret from the created app.
Step 2: Install PRAW and write the pull script
pip install praw google-cloud-bigquery
import praw
from google.cloud import bigquery
from datetime import datetime, timezone
reddit = praw.Reddit(
client_id="YOUR_CLIENT_ID",
client_secret="YOUR_CLIENT_SECRET",
user_agent="aimclear-pipeline/1.0 (by u/YOUR_REDDIT_USERNAME)"
)
bq = bigquery.Client(project="marty-search-console-ga4-api")
KEYWORDS = ["your keyword", "brand name", "competitor"]
SUBREDDITS = ["marketing", "ppc", "digitalmarketing"] # customize
def search_and_load(keywords, subreddits, lookback_hours=25):
rows = []
for sub in subreddits:
subreddit = reddit.subreddit(sub)
for kw in keywords:
for submission in subreddit.search(kw, time_filter="day", limit=100):
rows.append({
"pulled_at": datetime.now(timezone.utc).isoformat(),
"subreddit": sub,
"keyword": kw,
"submission_id": submission.id,
"title": submission.title,
"url": submission.url,
"score": submission.score,
"num_comments": submission.num_comments,
"created_utc": datetime.fromtimestamp(submission.created_utc, tz=timezone.utc).isoformat(),
"author": str(submission.author),
"selftext": submission.selftext[:2000], # truncate for BQ
})
if rows:
table_id = "marty-search-console-ga4-api.reddit_organic.submissions"
errors = bq.insert_rows_json(table_id, rows)
if errors:
raise RuntimeError(f"BQ insert errors: {errors}")
print(f"Loaded {len(rows)} rows")
if __name__ == "__main__":
search_and_load(KEYWORDS, SUBREDDITS)
Step 3: Create the BigQuery table
CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.reddit_organic.submissions` (
pulled_at TIMESTAMP,
subreddit STRING,
keyword STRING,
submission_id STRING,
title STRING,
url STRING,
score INT64,
num_comments INT64,
created_utc TIMESTAMP,
author STRING,
selftext STRING
)
PARTITION BY DATE(created_utc)
OPTIONS (require_partition_filter = false);
Step 4: Deploy as a Cloud Run Job (preferred over PythonAnywhere for production)
# Build and push container
gcloud builds submit --tag gcr.io/marty-search-console-ga4-api/reddit-organic
# Deploy as Cloud Run Job
gcloud run jobs create reddit-organic \
--image gcr.io/marty-search-console-ga4-api/reddit-organic \
--region us-central1 \
--project marty-search-console-ga4-api
# Schedule via Cloud Scheduler (daily at 6 AM UTC)
gcloud scheduler jobs create http reddit-organic-daily \
--schedule="0 6 * * *" \
--uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/marty-search-console-ga4-api/jobs/reddit-organic:run" \
--oauth-service-account-email=YOUR_SA@marty-search-console-ga4-api.iam.gserviceaccount.com
Key Gotchas
- Do not HTML scrape Reddit. Reddit's ToS explicitly prohibits scraping. Use the API only.
- Pre-approval requirement (2025). Commercial use of the Reddit API now requires applying for the appropriate tier. Review the Data API terms at redditinc.com/policies/data-api-terms before proceeding.
- Rate limits: ~100 requests/minute on standard tier. PRAW automatically handles rate limit headers and backs off. Don't bypass PRAW's built-in rate limiting.
- Commercial API pricing: ~$0.24/1K API calls at commercial rates. At typical monitoring volumes (a few hundred calls/day), costs are minimal — under $5/month for most use cases.
- Design as incremental pulls. Don't pull full history on every run. Use
time_filter="day"and partition your BQ table by date. Track what you've already pulled to avoid duplicate rows. - Deduplication. Reddit submissions can appear in multiple searches (multiple keywords). Use
submission_idas a dedup key in your BQ table or downstream queries. - Search result limits. Reddit's search API returns a max of ~1,000 results per query (with pagination). For very high-volume keywords, you'll miss older results.
- Deleted/removed content. Authors and body text may be
[deleted]or[removed]by the time you pull. Handle these gracefully. - Cloud Run vs. PythonAnywhere. PythonAnywhere is viable for prototyping, but Cloud Run Jobs are the production target — they're more reliable, auditable via Cloud Logging, and natively authenticated to GCP via service accounts.
Options Compared
| Method | Cost | Notes |
|---|---|---|
| PRAW + Cloud Run Job (recommended) | ~$0.24/1K API calls + minimal compute | Production-grade, GCP-native, incremental design |
| PRAW + PythonAnywhere | Same API costs + PythonAnywhere subscription | Fine for prototyping; less reliable for unattended production |
| Pushshift / third-party archives | Varies | Pushshift has had significant reliability and ToS issues; verify current status before relying on it |
LinkedIn Organic
Path: Official LinkedIn Pages API (own pages only). Abandon personal token / session cookie scraping.
- 2025 enforcement: LinkedIn deleted company Pages of Apollo.io and Seamless.AI for scraping; Proxycurl shut down July 2025
- Personal login token automation = account termination risk, not just ToS violation
- Official Pages Data Portability API and Community Management API cover your own company page data (posts, followers, engagement analytics) — requires app approval (2–4 weeks)
- Competitive LinkedIn page data is not obtainable via official API; Bright Data / Coresignal exist as enterprise-priced data products ($1K+/month)
- Decision: limit to own Pages data via official API. Do not build personal-token pipeline.
LinkedIn's official Pages API provides analytics for your own company pages: post impressions, engagement, follower counts, and demographic breakdowns of your audience. This covers organic reach on pages where AimClear or its clients are admins. Competitive organic intelligence (seeing other companies' organic post performance) is not available via official API and attempts to obtain it via unofficial means carry serious account risk.
At a Glance
- Method: LinkedIn Pages API (official) via custom Python script
- Cost: Free API (compute costs only)
- Setup time: ~2–4 hours + 2–4 weeks LinkedIn app approval (if not already approved for Ads API)
- Scope: Own company pages only; no competitive organic data available
Critical Constraint: Own Pages Only
LinkedIn's official API grants access to:
- Your own company page analytics (Pages Data Portability API)
- Content you publish (Community Management API)
- Your own follower/audience demographics
It does not grant access to:
- Competitor page performance data
- Third-party organic post analytics
- Employee post performance (unless connected to your page)
Do not use personal token/session cookie automation to access LinkedIn data outside the official API. Apollo.io and Seamless.AI had company Pages deleted by LinkedIn in 2025 for API policy violations. Proxycurl, which offered LinkedIn data scraping, shut down in July 2025 following LinkedIn enforcement actions. The risk is account termination, not just a rate limit error.
Step 1: Ensure LinkedIn app access
If you already have a LinkedIn app approved for Marketing API access, check whether the Pages product is also approved on that app. If not, request it:
- Go to developer.linkedin.com → your app → Products.
- Request "Share on LinkedIn" and/or "Community Management API" as applicable.
- These products generally have faster approval than the Marketing Developer Platform.
Step 2: Get admin access to the Company Page
The OAuth user authorizing the API calls must be a Super Admin or Content Admin on the LinkedIn Company Page. Verify page roles at: LinkedIn → Company Page → Admin tools → Manage admins.
Step 3: Write the pull script
pip install requests google-cloud-bigquery
import requests
from google.cloud import bigquery
from datetime import datetime, timezone
# After OAuth flow, you'll have an access token with r_organization_social scope
ACCESS_TOKEN = "YOUR_OAUTH_ACCESS_TOKEN"
ORG_ID = "YOUR_LINKEDIN_ORGANIZATION_ID" # numeric org URN
BQ_PROJECT = "marty-search-console-ga4-api"
BQ_DATASET = "linkedin_organic"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
def get_page_analytics(org_id, start_date, end_date):
# Organization Follower Statistics
url = f"https://api.linkedin.com/v2/organizationalEntityFollowerStatistics"
params = {
"q": "organizationalEntity",
"organizationalEntity": f"urn:li:organization:{org_id}",
}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
return response.json()
def get_post_statistics(org_id):
# Share Statistics (post-level engagement)
url = f"https://api.linkedin.com/v2/organizationalEntityShareStatistics"
params = {
"q": "organizationalEntity",
"organizationalEntity": f"urn:li:organization:{org_id}",
}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
return response.json()
# Load results into BigQuery
# bq_client = bigquery.Client(project=BQ_PROJECT)
# ... insert_rows_json into your target table
Refer to the LinkedIn API documentation for the current endpoint paths and OAuth scopes — LinkedIn deprecates and renames endpoints periodically. Required scopes typically include r_organization_social and r_organization_admin.
Step 4: Deploy as a Cloud Run Job
# Schedule daily, similar to other custom scripts
gcloud run jobs create linkedin-organic \
--image gcr.io/marty-search-console-ga4-api/linkedin-organic \
--region us-central1
Key Gotchas
- Own pages only. Competitor intelligence via LinkedIn's official API is not possible. If a client asks for competitor organic benchmarks, LinkedIn is the wrong source — use industry reports, LinkedIn's own Marketing Solutions benchmarks, or published case studies.
- OAuth token management. LinkedIn access tokens expire. Set up token refresh and store credentials securely (Secret Manager, not environment variables in code).
- Proxycurl is gone (July 2025). If any existing workflow relied on Proxycurl or similar scraping services for LinkedIn data, those are now defunct. There is no drop-in replacement via official channels for what those services provided.
- Apollo.io / Seamless.AI precedent. These companies had LinkedIn company pages deleted in 2025 for violations. The risk of unofficial access is real and company-level, not just API-key level.
- Rate limits are enforced. LinkedIn throttles API calls. For a daily pull of page stats, you're unlikely to hit limits. For bulk historical pulls of post-level data, space out requests.
- Data portability API vs. Community Management API. LinkedIn has been expanding official data access incrementally. Check the current API product catalog — some analytics that required workarounds in 2024 may have official endpoints now.
- Demographic data requires additional scope. Audience demographic breakdowns (industry, job function, seniority of followers) require the
r_organization_socialscope at minimum; some breakdowns may require the full admin scope.
Options Compared
| Method | Cost | Data Scope | Notes |
|---|---|---|---|
| LinkedIn Pages API (official) | Free | Own pages only | The only safe, compliant option |
| Fivetran LinkedIn Pages connector | Free tier available | Own pages only | Managed version of the same Pages API |
| Third-party scraping tools | Varies | Own + competitor | Do not use — account termination risk, enforcement active as of 2025 |
Wikipedia
Path: Custom Python script (Wikimedia REST API) → BigQuery. Trivial.
- Free REST API, no auth required, ~200 req/sec rate limit — massively over-provisioned for keyword research volumes
en.wikipedia.org/api/rest_v1/page/summary/{title}is the main endpoint- BigQuery public dataset
bigquery-public-data.wikipediaincludes page view stats (free to query under 1TB/month free tier) - ~20 lines of Python total
Wikipedia data in the pipeline serves two distinct purposes: (1) pulling article content and metadata for specific topics or brand/competitor pages, and (2) tracking page view statistics over time. The Wikimedia REST API is free, requires no authentication, and handles both. A BigQuery public dataset (bigquery-public-data.wikipedia) also provides page view stats without any API calls.
At a Glance
- Method: Wikimedia REST API (free, no auth) or BigQuery public dataset
- Cost: Free
- Setup time: ~1–2 hours
Option A: Wikimedia REST API — article content and metadata (~20 lines of Python)
pip install requests google-cloud-bigquery
import requests
from google.cloud import bigquery
from datetime import datetime, timezone
BQ_PROJECT = "marty-search-console-ga4-api"
def get_article_summary(title: str) -> dict:
"""Fetch article summary from Wikimedia REST API."""
url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
response = requests.get(url, headers={"User-Agent": "aimclear-pipeline/1.0 (timothy@aimclear.com)"})
response.raise_for_status()
data = response.json()
return {
"pulled_at": datetime.now(timezone.utc).isoformat(),
"title": data.get("title"),
"display_title": data.get("displaytitle"),
"description": data.get("description"),
"extract": data.get("extract"),
"page_id": data.get("pageid"),
"revision": data.get("revision"),
"last_modified": data.get("timestamp"),
"url": data.get("content_urls", {}).get("desktop", {}).get("page"),
}
def get_page_views(title: str, start: str, end: str, granularity: str = "daily") -> list:
"""Fetch page view counts for a Wikipedia article."""
# Format: YYYYMMDD for start/end
url = (
f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
f"/en.wikipedia/all-access/all-agents/{title}/{granularity}/{start}/{end}"
)
response = requests.get(url, headers={"User-Agent": "aimclear-pipeline/1.0 (timothy@aimclear.com)"})
response.raise_for_status()
return response.json().get("items", [])
# Example usage
ARTICLES = ["AimClear", "Digital_marketing", "Search_engine_optimization"]
bq = bigquery.Client(project=BQ_PROJECT)
rows = [get_article_summary(title) for title in ARTICLES]
# bq.insert_rows_json("marty-search-console-ga4-api.wikipedia.article_summaries", rows)
Deploy as a Cloud Run Job, scheduled weekly (article content doesn't change daily).
Option B: BigQuery public dataset — page views at scale
The bigquery-public-data.wikipedia dataset contains monthly page view statistics across all Wikipedia articles. No API calls, no pipeline — just query directly.
-- Top Wikipedia articles by views for a given month
SELECT
title,
SUM(views) AS total_views,
datehour
FROM `bigquery-public-data.wikipedia.pageviews_2026`
WHERE
wiki = 'en'
AND title IN ('AimClear', 'Digital_marketing', 'Search_engine_optimization')
AND DATE(datehour) BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY title, datehour
ORDER BY total_views DESC;
This approach works well for trend analysis and requires no ingestion pipeline at all.
Key Gotchas
- Set a descriptive User-Agent header. The Wikimedia API terms require identifying your tool and contact info in the User-Agent string (e.g.,
aimclear-pipeline/1.0 (timothy@aimclear.com)). Requests without proper User-Agent may be throttled or blocked. - Rate limit: ~200 requests/second. At typical monitoring volumes (dozens of articles), you'll never approach this. No API key required.
- Article title formatting. Wikipedia titles use underscores instead of spaces in URLs (
Search_engine_optimization, notSearch engine optimization). Handle URL encoding carefully. - Disambiguation pages. Some titles resolve to disambiguation pages, not articles. Check the
typefield in the REST API response —standardis a regular article,disambiguationmeans you need a more specific title. - Public dataset update lag. The
bigquery-public-data.wikipediadataset is updated monthly, not daily. Use the REST API for more current page view data. - No auth needed, but respect ToS. The Wikimedia API is completely open, but mass downloading of article content violates the spirit of the API. For bulk content needs, use Wikipedia database dumps instead.
- International editions. To pull from non-English Wikipedia, change
en.wikipediato the appropriate language code (e.g.,es.wikipedia,de.wikipedia).
Options Compared
| Method | Cost | Use Case |
|---|---|---|
| Wikimedia REST API + Python | Free | Article summaries, metadata, recent page views for specific articles |
| BigQuery public dataset | Free (BQ query costs) | Historical page view trends at scale, no pipeline needed |
| Wikipedia database dumps | Free | Bulk content ingestion (avoid for most marketing use cases) |
Recommendation: Use the BigQuery public dataset for page view trend analysis. Use the REST API for pulling article summaries and metadata for specific pages. Both can coexist — they answer different questions.
SERP API
Path: DataForSEO (pay-as-you-go, no minimums). Do not depend on SerpApi.
- SerpApi is under active Google lawsuit (filed Dec 2025, motion to dismiss filed Feb 2026 — case ongoing, ruling likely mid-to-late 2026). Continued viability uncertain.
- DataForSEO: $2.00/1K queries (live) or $0.60/1K (5-min delay). No monthly minimums. Best cost at volume.
- Serper.dev: $50 for 50K queries (tiered down to $0.30/1K at volume). Fast, less legally exposed than SerpApi.
- Google Custom Search JSON API: Closed to new customers. Existing customers deprecated by Jan 1, 2027. Not an option if you don't already have it.
- DataForSEO and Serper both return Google AI Overview content in SERP responses — use this for AI Overview monitoring instead of separate LLM calls against Google.
SERP API pulls real-time Google search results for target keywords — organic rankings, featured snippets, People Also Ask boxes, local packs, and Google AI Overviews. This feeds keyword rank tracking, SERP feature monitoring, and AI Overview coverage detection in the pipeline. DataForSEO is the recommended provider: no minimum spend, competitive pricing, and their API returns AI Overview content in responses.
At a Glance
- Method: DataForSEO API + custom Python script → Cloud Run Job
- Cost: DataForSEO: $2.00/1K live results, $0.60/1K standard (5-minute delay); no minimums
- Setup time: ~2–3 hours
Vendor Selection
DataForSEO (recommended):
- $2.00/1K requests for live results
- $0.60/1K for standard results (5-minute delivery delay, lower cost for rank tracking)
- No minimums or contracts — pay-as-you-go
- Returns Google AI Overview content in SERP responses
- Extensive API coverage: Google, Bing, YouTube, local, image SERPs
Serper (alternative):
- Also returns AI Overview content in SERP responses
- Competitive pricing — check current rates at serper.dev
- Simpler API structure; good for teams new to SERP APIs
Do not use SerpApi for new implementations:
- Google filed a lawsuit against SerpApi in December 2025
- Legal uncertainty creates business continuity risk
Do not use Google Custom Search JSON API:
- No longer accepting new customers
- Deprecated January 2027 — any existing integrations should be migrated
Step 1: Create DataForSEO account
- Go to app.dataforseo.com → Sign up.
- Add a payment method. No minimum deposit required.
- Note your API credentials (login email and password are used as HTTP Basic Auth).
Step 2: Write the pull script
pip install requests google-cloud-bigquery
import requests
import json
from google.cloud import bigquery
from datetime import datetime, timezone
from base64 import b64encode
BQ_PROJECT = "marty-search-console-ga4-api"
DATAFORSEO_LOGIN = "your-email@example.com"
DATAFORSEO_PASSWORD = "your-api-password"
def get_auth_header():
creds = b64encode(f"{DATAFORSEO_LOGIN}:{DATAFORSEO_PASSWORD}".encode()).decode()
return {"Authorization": f"Basic {creds}", "Content-Type": "application/json"}
def fetch_serp(keywords: list, location_code: int = 2840, language_code: str = "en") -> list:
"""
Fetch SERP results for a list of keywords.
location_code 2840 = United States
Use /serp/google/organic/live/regular for live results
Use /serp/google/organic/task_post + task_get for standard (cheaper)
"""
url = "https://api.dataforseo.com/v3/serp/google/organic/live/regular"
tasks = [
{
"keyword": kw,
"location_code": location_code,
"language_code": language_code,
"device": "desktop",
"os": "windows",
}
for kw in keywords
]
response = requests.post(url, headers=get_auth_header(), json=tasks)
response.raise_for_status()
return response.json().get("tasks", [])
def parse_serp_to_rows(tasks: list, pulled_at: str) -> list:
rows = []
for task in tasks:
keyword = task.get("data", {}).get("keyword")
results = task.get("result", []) or []
for result in results:
items = result.get("items", []) or []
for item in items:
rows.append({
"pulled_at": pulled_at,
"keyword": keyword,
"rank_group": item.get("rank_group"),
"rank_absolute": item.get("rank_absolute"),
"type": item.get("type"), # organic, featured_snippet, ai_overview, etc.
"url": item.get("url"),
"title": item.get("title"),
"description": item.get("description"),
"domain": item.get("domain"),
})
return rows
KEYWORDS = [
"digital marketing agency",
"ppc management",
"your target keyword here",
]
def main():
pulled_at = datetime.now(timezone.utc).isoformat()
tasks = fetch_serp(KEYWORDS)
rows = parse_serp_to_rows(tasks, pulled_at)
bq = bigquery.Client(project=BQ_PROJECT)
table_id = f"{BQ_PROJECT}.serp_tracking.organic_results"
errors = bq.insert_rows_json(table_id, rows)
if errors:
raise RuntimeError(f"BQ errors: {errors}")
print(f"Loaded {len(rows)} rows for {len(KEYWORDS)} keywords")
if __name__ == "__main__":
main()
Step 3: Create BigQuery table
CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.serp_tracking.organic_results` (
pulled_at TIMESTAMP,
keyword STRING,
rank_group INT64,
rank_absolute INT64,
type STRING,
url STRING,
title STRING,
description STRING,
domain STRING
)
PARTITION BY DATE(pulled_at)
CLUSTER BY keyword;
Step 4: Deploy as a Cloud Run Job
Schedule daily (or more frequently for high-value keywords). Keep cost in check by batching keywords per request (DataForSEO accepts up to 100 tasks per API call).
gcloud run jobs create serp-tracking \
--image gcr.io/marty-search-console-ga4-api/serp-tracking \
--region us-central1
Key Gotchas
- Live vs. standard results: Live is fresher but 3.3x the cost. For daily rank tracking, standard results (5-min delay) are fine and much cheaper. Use live for on-demand queries or AI Overview monitoring.
- AI Overview detection: DataForSEO returns AI Overviews as an
item_typein the SERP results. Parse thetypefield and look forai_overviewto capture which keywords trigger AI Overviews and what content appears. See the LLM Monitoring section for the dedicated LLM monitoring pipeline. - Google search ToS. Both DataForSEO and Serper operate as API intermediaries that comply with their own agreements with Google. Direct scraping of Google search results is against Google's ToS — use a compliant API provider.
- Keyword volume costs. At $0.60/1K standard requests, 1,000 keywords pulled daily costs ~$0.60/day ($18/month). 10,000 keywords/day = ~$180/month. Size your keyword list deliberately.
- Location and language matter. SERP results vary significantly by location. Set
location_codeto match your target market. Use country-level codes for national tracking, city-level for local. - SerpApi legal risk. Google's December 2025 lawsuit against SerpApi creates uncertainty about that service's continuity. Avoid starting new integrations there.
- Google Custom Search JSON API is closing. Deprecated January 2027. Do not build new pipelines on it.
Options Compared
| Provider | Cost | AI Overview | Notes |
|---|---|---|---|
| DataForSEO (recommended) | $0.60/1K standard, $2.00/1K live | Yes | No minimum, pay-as-you-go, extensive coverage |
| Serper | Check serper.dev | Yes | Simpler API; competitive pricing |
| SerpApi | Varies | Yes | Active Google lawsuit (Dec 2025) — avoid for new builds |
| Google Custom Search JSON API | Limited free, deprecated | No | No new customers; deprecated Jan 2027 |
| Custom scraping | Compute only | N/A | Violates Google ToS — do not use |
LLM Monitoring
Path: Two-track:
- Google AI Overviews → DataForSEO SERP API (already returning AIO content in results)
- ChatGPT / Claude / Gemini / Perplexity direct responses → Custom Python async script using official APIs
Official API pricing (mid-2025):
- OpenAI GPT-4o: $0.0025/1K input tokens, $0.01/1K output tokens
- Anthropic Claude Sonnet: $3/1M input, $15/1M output
- Google Gemini 1.5 Flash: effectively free at low volume on free tier
- Perplexity Sonar API: ~$1/1K queries
Specialized tools: Otterly.ai, thruuu LLM API (explicit BigQuery integration), DemandSphere (BigQuery export available). Worth evaluating if managing this at scale.
LLM API responses are non-deterministic — track trends across runs, not single-point data.
LLM monitoring tracks whether brand, competitor, or topic keywords appear in AI-generated responses across two distinct channels:
- Google AI Overviews — AI-generated answer boxes appearing in Google SERP
- Conversational AI responses — What ChatGPT, Claude, Gemini, and Perplexity say when asked about your brand, competitors, or industry topics
These are separate data sources with separate methods. Track both for a complete picture of "AI answer engine" visibility.
At a Glance
- Method: Two tracks — (1) DataForSEO SERP API for Google AI Overviews; (2) Direct LLM API calls (OpenAI, Anthropic, Google, Perplexity) via Python async script
- Cost: Varies by track and volume — see breakdown below
- Setup time: ~3–5 hours (both tracks)
Track 1: Google AI Overviews (via DataForSEO SERP API)
Google AI Overviews appear in standard Google search results. DataForSEO returns them as SERP items with type: "ai_overview". This track reuses the same infrastructure as the SERP API section above.
Cost: DataForSEO: $0.60/1K standard results — an AI Overview trigger is a byproduct of the SERP pull, no extra cost. Run this in the same Cloud Run Job as general SERP tracking to avoid duplicate API calls.
What to capture from DataForSEO AI Overview items: When a SERP result contains an AI Overview, the items array will include an entry with type: "ai_overview". Extract:
keyword— the query that triggered itai_overview_textor equivalent field — the generated answer textsources— URLs cited in the AI Overview (these are the "winners" worth tracking)pulled_at— timestamp of the SERP pull
Store in serp_tracking.ai_overview_appearances. Join with organic_results to see which keywords trigger AI Overviews vs. traditional results.
CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.serp_tracking.ai_overview_appearances` (
pulled_at TIMESTAMP,
keyword STRING,
has_ai_overview BOOL,
ai_overview_text STRING,
cited_urls ARRAY<STRING>,
cited_domains ARRAY<STRING>
)
PARTITION BY DATE(pulled_at)
CLUSTER BY keyword;
Track 2: Conversational AI Monitoring (ChatGPT, Claude, Gemini, Perplexity)
This track queries LLMs directly via their APIs and asks brand/topic questions, then stores the responses for analysis. Use Python async for efficiency across multiple providers.
Cost per provider
| Provider | Model | Approximate Cost |
|---|---|---|
| OpenAI | GPT-4o | $0.0025/1K input tokens, $0.010/1K output tokens |
| Anthropic | Claude Sonnet | $3.00/1M input tokens, $15.00/1M output tokens |
| Gemini 1.5 Flash | Free at low volume (check current quota limits) | |
| Perplexity | Sonar | ~$1.00/1K queries |
At typical monitoring volumes (50–200 queries/day across all providers), expect $1–$10/day total depending on query complexity and provider mix.
Setup Steps
pip install openai anthropic google-generativeai requests google-cloud-bigquery asyncio
import asyncio
import json
from datetime import datetime, timezone
from google.cloud import bigquery
import openai
import anthropic
import google.generativeai as genai
BQ_PROJECT = "marty-search-console-ga4-api"
BQ_TABLE = f"{BQ_PROJECT}.llm_monitoring.responses"
PROMPTS = [
"What is AimClear?",
"Who are the top digital marketing agencies for B2B paid search?",
"What companies specialize in audience targeting for paid social?",
# Add brand, competitor, and category prompts
]
async def query_openai(prompt: str, pulled_at: str) -> dict:
client = openai.AsyncOpenAI() # uses OPENAI_API_KEY env var
response = await client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
max_tokens=500,
)
return {
"pulled_at": pulled_at,
"provider": "openai",
"model": "gpt-4o",
"prompt": prompt,
"response": response.choices[0].message.content,
"input_tokens": response.usage.prompt_tokens,
"output_tokens": response.usage.completion_tokens,
}
async def query_anthropic(prompt: str, pulled_at: str) -> dict:
client = anthropic.AsyncAnthropic() # uses ANTHROPIC_API_KEY env var
response = await client.messages.create(
model="claude-sonnet-4-5",
max_tokens=500,
messages=[{"role": "user", "content": prompt}],
)
return {
"pulled_at": pulled_at,
"provider": "anthropic",
"model": "claude-sonnet-4-5",
"prompt": prompt,
"response": response.content[0].text,
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
}
async def query_perplexity(prompt: str, pulled_at: str) -> dict:
import httpx
async with httpx.AsyncClient() as client:
response = await client.post(
"https://api.perplexity.ai/chat/completions",
headers={"Authorization": f"Bearer {PERPLEXITY_API_KEY}"},
json={
"model": "sonar",
"messages": [{"role": "user", "content": prompt}],
"max_tokens": 500,
},
)
response.raise_for_status()
data = response.json()
return {
"pulled_at": pulled_at,
"provider": "perplexity",
"model": "sonar",
"prompt": prompt,
"response": data["choices"][0]["message"]["content"],
"input_tokens": data.get("usage", {}).get("prompt_tokens"),
"output_tokens": data.get("usage", {}).get("completion_tokens"),
}
async def run_all_prompts():
pulled_at = datetime.now(timezone.utc).isoformat()
tasks = []
for prompt in PROMPTS:
tasks.append(query_openai(prompt, pulled_at))
tasks.append(query_anthropic(prompt, pulled_at))
tasks.append(query_perplexity(prompt, pulled_at))
# Add Gemini similarly using google.generativeai
results = await asyncio.gather(*tasks, return_exceptions=True)
rows = [r for r in results if isinstance(r, dict)]
errors = [r for r in results if isinstance(r, Exception)]
if errors:
print(f"Errors: {errors}")
bq = bigquery.Client(project=BQ_PROJECT)
bq.insert_rows_json(BQ_TABLE, rows)
print(f"Loaded {len(rows)} LLM responses")
if __name__ == "__main__":
asyncio.run(run_all_prompts())
BigQuery table for LLM responses
CREATE TABLE IF NOT EXISTS `marty-search-console-ga4-api.llm_monitoring.responses` (
pulled_at TIMESTAMP,
provider STRING,
model STRING,
prompt STRING,
response STRING,
input_tokens INT64,
output_tokens INT64
)
PARTITION BY DATE(pulled_at)
CLUSTER BY provider, prompt;
Post-processing: brand mention detection
After loading responses, run a scheduled BQ query or dbt model to detect brand/competitor mentions:
SELECT
pulled_at,
provider,
prompt,
LOWER(response) LIKE '%aimclear%' AS mentions_aimclear,
LOWER(response) LIKE '%competitor_a%' AS mentions_competitor_a,
response
FROM `marty-search-console-ga4-api.llm_monitoring.responses`
WHERE DATE(pulled_at) = CURRENT_DATE()
ORDER BY pulled_at DESC;
Specialized Tools (evaluate vs. build)
| Tool | Notes |
|---|---|
| Otterly.ai | SaaS specifically for AI mention/citation monitoring; BQ integration varies — check current offering |
| thruuu LLM API | Has native BigQuery integration; purpose-built for LLM answer tracking |
| DemandSphere | Enterprise SEO/SERP platform with LLM visibility tracking; check pricing |
If AimClear plans to offer LLM monitoring as a client service at scale, evaluate these tools against the custom script approach. For internal use or a handful of clients, the custom script is sufficient and cheaper.
Key Gotchas
- LLM responses are non-deterministic. The same prompt returns different text on each call. This is by design — use daily snapshots, not single-run spot checks, to build a statistically meaningful signal.
- Gemini free tier has rate limits. At low volume, Gemini 1.5 Flash is free, but sustained daily pulls may exhaust the free quota. Monitor usage in Google AI Studio.
- Perplexity Sonar includes citations. Unlike other LLMs, Perplexity's responses include source URLs. Parse the
citationsarray in the response to see which pages Perplexity is citing — that's actionable SEO signal. - Model version drift. LLM providers change model versions. Pin specific model versions in your API calls and set up alerts if a model is deprecated.
- API key security. Store all API keys in Google Secret Manager, not in environment variables embedded in Docker images. Use Cloud Run's secret injection at runtime.
- Cost control. Set spending limits on each LLM provider account. A misconfigured loop could generate unexpected charges. OpenAI and Anthropic both support spending caps.
- Google AI Overviews are separate from Gemini API. The Gemini API is Google's LLM product; Google AI Overviews are a SERP feature. They may use similar underlying technology, but querying Gemini via API does not tell you what appears in AI Overviews on google.com.
Google Ads
- Method: Native BigQuery Data Transfer Service (BDTS)
- Cost: Free (standard BQ storage and query costs apply)
- Setup time: ~20 minutes
- Status: Not started
Path: Native BDTS connector — free. ~30+ standard tables covering campaigns, ad groups, ads, keywords, audiences, conversions, shopping, video.
The Google Ads BDTS connector exports ~30+ tables covering campaigns, ad groups, ads, keywords, audiences, conversions, and performance metrics. It's the canonical source for paid search data in the pipeline — clicks, impressions, costs, and ROAS at every level of the account hierarchy.
- January–February 2025: switched to Google Ads API v16 (some column name changes)
- Now supports custom reports via Google Ads Query Language (GAQL)
Setup Steps
- Open BigQuery console → go to project
marty-search-console-ga4-api→ BigQuery Data Transfers (left nav). - Click "+ Create Transfer" → search for "Google Ads" → select it.
- Configure the transfer:
- Display name: something like
google-ads-{client} - Schedule: Daily (default, recommended)
- Destination dataset: create a new dataset or use an existing one (e.g.,
google_ads_export) - Google Ads customer ID: enter the 10-digit customer ID (no dashes). For MCC accounts, use the top-level MCC ID to pull all child accounts.
- Display name: something like
- Authorize: BDTS will prompt for Google account authorization. Use an account that has at least read access on the Google Ads account.
- Save and run. The first run may take 30–60 minutes.
- Verify in BigQuery that tables are appearing in your dataset.
Enabling custom GAQL reports (optional)
In the transfer configuration, you can add custom Google Ads Query Language (GAQL) reports beyond the default tables. This is useful for pulling data not included in the standard ~30 tables (e.g., campaign-level audience performance).
Key Gotchas
- API version cutover (January–February 2025): Google Ads BDTS switched to Ads API v16. If you had older transfers configured before that window, they may have broken or had schema changes. Re-verify any pre-existing transfers.
- No backfill on first run. The connector pulls the last ~30 days on initial setup; earlier history is not available.
- MCC vs. individual accounts. Using an MCC ID exports all child accounts into separate table shards — one shard per customer ID per date. Query patterns need to account for this (filter by
customer_id). - ~30+ tables, not all useful. The default export includes tables for ad groups, ads, keywords, audiences, campaigns, conversions, click performance, and more. Start by understanding which tables map to your reporting needs before querying everything.
- Table naming: Tables follow
{resource}_YYYYMMDD(e.g.,Campaign_20260507). Current-day data lands as*_currenttables. - YouTube campaigns are included here. YouTube video campaigns managed in Google Ads appear in this export. See the YouTube Ads section for organic YouTube data.
- Conversion actions: Make sure conversion actions are configured correctly in the Google Ads account before relying on BQ conversion data.
- MCC authorization: If using an MCC, the authorizing account must have at least read access at the MCC level, not just the child account level.
Options Compared
| Method | Cost | Notes |
|---|---|---|
| Native BDTS (recommended) | Free | ~30 tables, daily, official support, GAQL custom reports available |
| Fivetran | Free tier available; paid tiers well below $12K/year enterprise floor | Managed, normalized schema; useful for multi-client consolidation |
| Airbyte | Free (open-source) / Cloud pricing varies | Open-source connector available; more setup overhead |
| Custom Python (Google Ads API) | Compute costs only | Full API flexibility but you own pagination, retry logic, schema management |
Recommendation: Use native BDTS unless you need cross-client normalization that Fivetran provides.
YouTube Ads
- Method: Google Ads BDTS connector (YouTube paid campaigns flow through Google Ads)
- Cost: Free (standard BQ storage and query costs apply)
- Setup time: ~5 minutes additional if Google Ads BDTS is already configured
- Status: Not started
Path: Covered by the Google Ads BDTS connector — YouTube campaigns are managed in Google Ads. No separate connector needed.
YouTube paid advertising — TrueView, bumpers, discovery ads, and other video campaign formats — is managed inside Google Ads and exported through the Google Ads BDTS connector. There is no separate BDTS connector for YouTube Ads. A separate BDTS connector exists for organic YouTube channel analytics (views, subscribers, engagement on owned channels).
Note: There is a separate YouTube Channel BDTS connector for organic channel analytics (views, subscribers, watch time) — different from ad performance.
For YouTube Paid Campaigns (ad performance data)
YouTube campaigns are managed in Google Ads. If you have the Google Ads BDTS connector configured, YouTube ad performance is already included. See the Google Ads section for setup instructions.
To confirm YouTube campaign data is present:
- In BigQuery, query the
Campaign_YYYYMMDDtable from your Google Ads export. - Filter by
advertising_channel_type = 'VIDEO'(or check thecampaign_typefield — exact field name varies by API version). - Video ad-specific metrics (views, view rate, CPV) appear in the
VideoPerformanceor similar tables depending on your BDTS configuration.
For Organic YouTube Channel Analytics (owned channel data)
A separate BDTS connector exports organic channel data (views, watch time, subscribers, engagement) from the YouTube Data API / YouTube Analytics.
- Open BigQuery console → Data Transfers → "+ Create Transfer" → search "YouTube".
- Select "YouTube Channel Reports" connector.
- Authorize with the Google account that owns or has Analyst access to the YouTube channel.
- Set destination dataset (e.g.,
youtube_organic). - Save and run.
Key Gotchas
- Paid vs. organic are completely separate data sources. Don't confuse the Google Ads export (paid performance) with the YouTube Channel Reports connector (organic channel analytics). They require different setup and land in different datasets.
- YouTube campaign tables in Google Ads export. Video campaign metrics are spread across standard campaign/ad group tables plus video-specific tables. You'll need to join them to get a complete picture.
- Organic connector data lag. YouTube Analytics data typically has a 2–3 day lag. Historical data may be available for backfill — check the connector settings during setup.
- YouTube channel analytics require Analytics access. The account authorizing the organic connector must have at minimum Analyst access on the YouTube channel (not just viewer).
- Paid reach metrics (unique viewers, frequency) are not in BDTS. These are available in the Google Ads UI (Reach Planner, brand lift studies) but not exported to BQ. Pull via Google Ads API custom GAQL if needed.
- Content ownership data (for networks with Content ID) has a separate YouTube Reporting API and is not covered by either connector above.
Options Compared
| Channel | Method | Notes |
|---|---|---|
| YouTube Paid | Google Ads BDTS (free) | Covered automatically if Google Ads export is live |
| YouTube Paid custom | Google Ads API + Python | For video-specific metrics not in standard tables |
| YouTube Organic | YouTube Channel Reports BDTS (free) | Owned channel only; views, watch time, subs |
| YouTube Organic via Fivetran | Free tier available | Alternative with managed schema normalization |
Meta Ads
- Method: Fivetran (free tier) or OWOX BI open-source connector; native BDTS is limited
- Cost: Free (Fivetran free tier or OWOX open-source) to paid (Fivetran consumption tiers)
- Setup time: ~30 minutes (Fivetran) / ~1–2 hours (OWOX or custom Python)
- Status: Not started
Path: Native BDTS Facebook Ads connector (free, limited columns). Upgrade to OWOX open-source connector if custom breakdowns needed.
Meta Ads data covers Facebook and Instagram paid campaigns — spend, impressions, clicks, conversions, and audience breakdowns. No native Google BigQuery connector exists with full fidelity; the best paths are Fivetran (managed) or OWOX BI's open-source connector (free, more flexible). A limited native BDTS connector exists but lacks custom breakdowns.
- Native BDTS covers Campaign, Ad Set, Ad, Account level performance. Standard metrics only.
- No custom breakdowns (age/gender/placement splits), no creative-level video metrics, no custom conversions in non-standard dimensions via native connector
- OWOX offers a free open-source Meta Ads → BigQuery connector with more flexibility than BDTS
- Third-party paid options (Fivetran, Airbyte Cloud, Windsor.ai, Supermetrics Enterprise) exist for richer data models
Option A: Fivetran (recommended for managed, reliable delivery)
- Log in to Fivetran → Connectors → "+ Add Connector" → search "Facebook Ad Insights".
- Authorize with a Meta Business account that has access to the ad account(s) you want to export.
- Select ad account(s). You can pull multiple accounts.
- Set destination: BigQuery project
marty-search-console-ga4-api, choose or create a dataset (e.g.,meta_ads_fivetran). - Configure sync frequency (daily recommended on free tier; more frequent on paid).
- Select tables/reports. Fivetran pulls core ad performance by default; configure breakdowns (age, gender, placement, device) as needed.
- Save and run initial sync. First sync may backfill available history.
Option B: OWOX BI open-source connector (free, more breakdown flexibility)
OWOX offers a free, open-source Google Sheets → BigQuery pipeline for Meta Ads that gives more control over breakdowns than the native BDTS option.
- Visit owox.com or search GitHub for "OWOX BigQuery pipeline".
- Follow their setup guide — it typically involves authorizing a Google Workspace add-on or deploying a script.
- Connect your Meta Ads account and configure the breakdown dimensions you need.
- Set a BigQuery destination and schedule.
Option C: Native BDTS (limited — standard metrics only)
A Meta Ads BDTS connector exists but only exports standard metrics with limited breakdown support. No custom audience segments or placement-level detail.
- BigQuery console → Data Transfers → "+ Create Transfer" → search "Facebook".
- Authorize with Meta Business account.
- Select destination dataset and ad account.
- Note the breakdown limitations before committing to this path.
Key Gotchas
- Native BDTS is limited. Standard metrics only, no custom breakdowns (age, gender, placement, device-level splits). If you need detailed breakdowns for optimization, use Fivetran or OWOX.
- Meta Marketing API rate limits. All connectors use the Meta Marketing API underneath. Heavy historical backfills can hit rate limits — Fivetran manages retries automatically; custom scripts need explicit backoff logic.
- Attribution windows matter. Meta reports data using click and view attribution windows (1-day, 7-day, 28-day). Make sure the connector you choose is pulling with the attribution window that matches your reporting convention — changing it mid-stream causes comparability issues.
- Conversion API vs. pixel. If the client uses Meta's Conversions API (CAPI) alongside or instead of the pixel, verify that CAPI events are appearing in the ad account's reported conversions before assuming BQ data is complete.
- Ad account access. The authorizing user must have Advertiser or above access to the Meta Ad Account, not just the Business Manager.
- Data retention in Meta API. Meta limits how far back you can pull via the API (typically 37 months). Backfill beyond that isn't possible.
- Currency. If you manage multi-currency ad accounts, verify the connector exports a currency field and how it handles FX for aggregated reporting.
Options Compared
| Method | Cost | Breakdown Flexibility | Notes |
|---|---|---|---|
| Fivetran (recommended) | Free tier available; paid tiers well below $12K/year enterprise floor | High | Managed, reliable, handles retries and schema changes |
| OWOX BI open-source | Free | High | More flexible breakdowns; requires some DIY setup |
| Native BDTS | Free | Low | Quick to set up; standard metrics only, no custom breakdowns |
| Airbyte | Free (open-source) / Cloud pricing varies | Medium | Open-source connector available |
| Custom Python (Meta Marketing API) | Compute costs only | Full | You own everything: pagination, rate limits, schema |
Bing Ads
- Method: Fivetran (recommended) or custom Python script via
bingadsSDK - Cost: Fivetran free tier available; Python: compute costs only
- Setup time: ~30 minutes (Fivetran) / ~2–4 hours (custom Python)
- Status: Not started
Path: Airbyte Cloud or custom Python script using Microsoft bingads SDK. No native BDTS connector.
Microsoft Advertising (Bing Ads) data covers paid search and audience campaigns on the Microsoft network — Bing, Yahoo, MSN, and partner sites. There is no native BigQuery Data Transfer Service connector for Bing Ads. Fivetran has a managed connector (free tier available), and Microsoft publishes a Python SDK (bingads on PyPI) for custom scripts.
- Fivetran has a mature connector but minimum $12K/year annual contract — overkill for one source
- Airbyte Cloud: ~$15/million rows, starting $10/month. Likely $20–60/month at agency ad reporting volumes. Open-source connector actively maintained.
- Custom Python script: Microsoft Advertising Python SDK (
bingadson PyPI, maintained by Microsoft), standard OAuth, free API usage. Moderate complexity. - If already running Airbyte self-hosted for other sources, add Bing Ads there for $0 incremental cost.
Option A: Fivetran (recommended)
- Log in to Fivetran → Connectors → "+ Add Connector" → search "Microsoft Advertising".
- Click "Authorize" → you'll be redirected to Microsoft login. Use an account with access to the Microsoft Advertising account(s).
- Select account(s). Fivetran can pull from one or multiple accounts.
- Set destination: BigQuery project
marty-search-console-ga4-api, dataset (e.g.,bing_ads_fivetran). - Configure sync frequency (daily recommended on free tier).
- Save and run initial sync. Fivetran will backfill available history on first run (check how far back — typically 30–90 days depending on plan).
- Verify tables are appearing in BigQuery.
Option B: Custom Python via bingads SDK
pip install bingads
Key steps:
- Register an app in the Microsoft Azure portal (Azure AD → App registrations → New registration).
- Get API credentials: Client ID, Client Secret, Developer Token (from Microsoft Advertising UI → Tools → API Center).
- OAuth flow to get refresh token — Microsoft provides a sample script in the SDK docs.
- Write a report request using the SDK's
ReportingServiceManager. Choose report type (e.g.,CampaignPerformanceReportRequest). - Download report and parse (TSV format).
- Load into BigQuery using
google-cloud-bigqueryPython client orbq load.
# Minimal pattern — fill in auth details from your registered app
from bingads.service_client import ServiceClient
from bingads.authorization import OAuthDesktopMobileAuthCodeGrant, AuthorizationData
# See Microsoft Advertising Python SDK docs for full auth + reporting examples
# SDK repo: https://github.com/BingAds/BingAds-Python-SDK
Deploy as a Cloud Run Job on a daily schedule (preferred over PythonAnywhere for production).
Key Gotchas
- No native BDTS connector. There's no built-in BigQuery connector from Microsoft — you must use a third-party tool or custom code.
- Developer Token required. Even for read-only API access, you need a Developer Token from the Microsoft Advertising UI (Tools → API Center → Request Developer Token). Standard tokens are available to any advertiser; production/approved tokens may require a short application review.
- OAuth refresh tokens expire. Microsoft refresh tokens have a finite lifetime (90 days without use). Any automated script needs to handle token refresh and persist updated tokens — a common failure point.
- Report format is TSV. Downloaded reports are tab-separated, not JSON. Parse carefully and handle missing/null values.
- Report availability lag. Performance data is typically available with a 2–3 hour lag; some conversion data up to 24h.
- Multi-account/MCC structure. If managing multiple accounts under a Manager Account, handle the account hierarchy in your API calls to avoid pulling only the parent account.
- Airbyte option: Airbyte Cloud has a Microsoft Advertising connector (~$15/million rows on Cloud, free on open-source self-hosted). Worth considering if you already have Airbyte set up.
Options Compared
| Method | Cost | Notes |
|---|---|---|
| Fivetran (recommended) | Free tier available; paid tiers well below $12K/year enterprise floor | Managed, handles auth/token refresh, reliable schema |
| Airbyte | Free (open-source self-hosted) / ~$15/million rows (Cloud) | Open-source connector; self-hosted requires infra |
Custom Python (bingads SDK) | Compute costs only | Full control; you own auth, retry, schema — most setup effort |
LinkedIn Ads
- Method: Fivetran (recommended) or Airbyte
- Cost: Fivetran free tier available; Airbyte Cloud pricing varies
- Setup time: ~30 minutes (connector setup) + 2–4 weeks (LinkedIn app approval — start immediately)
- Status: Not started — apply for LinkedIn Marketing API access now
Path: Airbyte Cloud after obtaining LinkedIn Marketing API access. Apply for API access immediately — 2–4 week approval wait.
LinkedIn Ads data covers Sponsored Content, Message Ads, Dynamic Ads, and Text Ads — impressions, clicks, spend, conversions, and demographic breakdowns (job title, company size, industry, seniority). There is no native BDTS connector. Fivetran and Airbyte both use the LinkedIn Marketing API, which requires explicit app approval from LinkedIn — a process that can take 2–4 weeks.
- No native BDTS connector
- LinkedIn Marketing API requires app creation at developer.linkedin.com + Marketing Developer Platform access application. Approval is case-by-case.
- Fivetran has a mature LinkedIn Ad Analytics connector but is expensive ($12K/year minimum)
- Airbyte Cloud: ~$15/million rows (~$20–50/month at typical agency volumes). Open-source connector available.
- Custom Python script is viable once API is approved (LinkedIn has a Python SDK); moderate complexity due to OAuth + LinkedIn's pagination model
- Rate limits are not precisely documented; LinkedIn throttles aggressively on bulk requests
- Note: Marketing API covers paid data only. Organic LinkedIn Page analytics require the separate Community Management API.
Step 0: Apply for LinkedIn Marketing API access (do this first)
This is the longest part of the process. Do not wait until you're ready to configure a connector.
- Go to developer.linkedin.com → My Apps → Create App.
- Fill in the app details. You'll need:
- App name (e.g., "AimClear BigQuery Pipeline")
- A LinkedIn Company Page (you must have admin access to the page)
- App logo
- Under the app, go to Products → request "Marketing Developer Platform" access.
- Submit and wait. LinkedIn reviews applications manually. Typical timeline: 2–4 weeks.
- Once approved, you'll have access to the
r_adsandr_ads_reportingOAuth scopes.
Step 1: Configure Fivetran connector (after API approval)
- Log in to Fivetran → Connectors → "+ Add Connector" → search "LinkedIn Ad Analytics".
- Enter OAuth credentials from your approved LinkedIn app (Client ID, Client Secret).
- Authorize with a LinkedIn account that has Campaign Manager access to the ad account(s).
- Select ad accounts.
- Set destination: BigQuery project
marty-search-console-ga4-api, dataset (e.g.,linkedin_ads_fivetran). - Configure sync frequency and save.
- Run initial sync and verify tables in BigQuery.
Step 2: Verify data
Check that tables include campaign performance, creative performance, and demographic breakdowns (if available on your LinkedIn account tier).
Key Gotchas
- 2–4 week API approval wait. This is the single most important thing to plan around. Apply the moment this pipeline project starts. If you miss this step, you'll be waiting before you can connect anything.
- LinkedIn throttles aggressively. The Marketing API has rate limits, but they are not precisely documented. LinkedIn silently throttles heavy bulk requests. Fivetran and Airbyte handle retry logic, but custom scripts need exponential backoff.
- Data retention via API. LinkedIn's API limits historical data pull depth — verify the exact lookback window during connector setup (typically 2 years).
- Demographic breakdowns require Campaign Manager access. The authorizing LinkedIn account must have Campaign Manager access on the specific ad account, not just general LinkedIn Business Manager access.
- App review can be rejected. LinkedIn may request clarification or additional information about your use case. Prepare a clear description of why you need reporting API access (data warehouse for client reporting is a valid use case).
- Do not confuse LinkedIn Ads with LinkedIn Organic. The Marketing API for ads and the Pages API for organic are separate. See the LinkedIn Organic section for organic data.
- One app per organization is recommended. If you're setting up for multiple clients, a single AimClear app with multiple ad account authorizations is cleaner than one app per client.
Options Compared
| Method | Cost | Notes |
|---|---|---|
| Fivetran (recommended) | Free tier available; paid tiers well below $12K/year enterprise floor | Managed, handles API auth, schema normalized; requires your LinkedIn app approval |
| Airbyte | Free (open-source) / Cloud pricing varies | Open-source connector available; same LinkedIn app approval required |
| Custom Python (LinkedIn Marketing API) | Compute costs only | Full control; same app approval required; handle rate limits and pagination yourself |
Google Ad Manager
- Method: Native BDTS (Google Ad Manager Data Transfer) — but requires Ad Manager 360
- Cost: Free data transfer, but Ad Manager 360 is a premium/enterprise product
- Setup time: ~1–2 hours + Google support ticket time (potentially days/weeks)
- Status: Not started — verify Ad Manager 360 eligibility first
Path: Native BDTS connector (free) — but requires Data Transfer file access, which is not self-serve.
Google Ad Manager's BigQuery Data Transfer exports impression-level ad serving data — every impression, click, and viewability event from the ad server. This is powerful for publisher-side analysis, revenue attribution, and cross-channel delivery reconciliation. However, it is only available to Ad Manager 360 (GAM 360) accounts, not standard Ad Manager accounts. Self-serve setup is not possible; it requires a Google support ticket.
- BDTS connector reads from a GCS bucket that Google Ad Manager populates with DT files
- DT file access must be enabled by contacting Google Ad Manager support — typically requires Ad Manager 360 (premium/enterprise). Free Ad Manager tier likely ineligible.
- Verify account type before planning this pipeline. Contact Google rep if needed.
- Impression-level data at scale can be very high volume (terabytes). Monitor BigQuery storage costs.
Step 0: Verify Ad Manager 360 access
Before investing any setup time, confirm the account type:
- In Google Ad Manager → Admin → Network settings → look for "Ad Manager 360" in the account tier/type.
- Or contact your Google account team — standard Ad Manager accounts cannot enable Data Transfer without upgrading.
Step 1: Enable Data Transfer file access (requires Google support ticket)
- Contact Google Ad Manager support and request Data Transfer file access for your network.
- Provide: Ad Manager Network Code, GCP project ID (
marty-search-console-ga4-api), and the GCS bucket where you want files delivered. - Wait for Google to provision access. Timeline varies — days to weeks.
Step 2: Set up Google Cloud Storage bucket for Data Transfer files
# Create a GCS bucket for Ad Manager Data Transfer files
gsutil mb -p marty-search-console-ga4-api gs://aimclear-gam-data-transfer
# Grant Ad Manager's service account access to the bucket
# (Google support will provide the exact service account email)
gsutil iam ch serviceAccount:{gam-service-account}@gserviceaccount.com:objectAdmin gs://aimclear-gam-data-transfer
Step 3: Configure BDTS to load GCS files into BigQuery
Once Data Transfer files are landing in GCS:
- BigQuery console → Data Transfers → "+ Create Transfer" → "Google Ad Manager".
- Point to the GCS bucket and set destination dataset (e.g.,
google_ad_manager). - Set schedule and save.
Step 4: Validate data
- Run a query against the resulting tables to verify impression counts match Ad Manager UI reports.
- Data Transfer files are large — impression-level data for active networks generates substantial BQ storage costs.
Key Gotchas
- Ad Manager 360 only. This is the single largest blocker. Standard GAM accounts cannot use Data Transfer. Verify this before anything else.
- Requires a Google support ticket. There is no self-serve activation. Factor in ticket wait time (days to weeks) when planning timelines.
- Very high data volume. Impression-level data for an active ad network generates massive BQ tables. Partition tables by date, consider clustering by order/line item, and set up lifecycle policies on GCS. Query costs can be significant.
- Data lag. Data Transfer files are typically available 24–48 hours after the serving date.
- File format complexity. Data Transfer files use a specific format (CSV-like with headers). The BDTS connector handles parsing, but if you're processing files directly, review the Ad Manager Data Transfer documentation carefully.
- No backfill on initial setup. Data Transfer starts from the date files begin landing in GCS. Google may be able to re-provision historical files — ask during the support ticket process.
- Viewability data is separate. Active View (viewability) data may be in separate files/tables from impression data.
Options Compared
| Method | Cost | Access Requirement | Notes |
|---|---|---|---|
| Native BDTS (Data Transfer) | Free transfer + BQ storage | Ad Manager 360 only | Impression-level; requires support ticket |
| Ad Manager Reporting API | Free | Standard or 360 | Aggregated reports only (no impression-level); self-serve |
| Fivetran | Free tier available | Standard or 360 | Pulls via Reporting API; aggregated, not impression-level |
| Custom Python (Reporting API) | Compute costs only | Standard or 360 | Full API access to aggregated reports; impression-level not available |
If you have Ad Manager 360: Use native BDTS for impression-level data.
If you have standard Ad Manager: Use the Reporting API (via Fivetran or custom Python) for aggregated reports.
Other Programmatic / DSPs
Depends on specific platforms. Notable:
- DV360 / Display & Video 360: Native BDTS connector available (free)
- The Trade Desk: Fivetran or Airbyte connector; no native BDTS
- Amazon DSP: Airbyte connector; no native BDTS
Clarify which specific programmatic platforms are actually in scope before building pipelines.
Optimizer
- Method: TBD — depends on vendor capabilities; see decision tree below
- Cost: Depends on method chosen
- Setup time: ~30 minutes (if native export exists) to ~4 hours (custom script)
- Status: Not started — start by verifying vendor export options
No public documentation found for a product specifically named "Optimizer." Optimizer is a paid channel aggregator that consolidates performance data across multiple ad platforms. Before building a pipeline, the key question is: what does Optimizer offer for data export, and does pulling from Optimizer provide more value than pulling from source channels (Google Ads, Meta, Bing, LinkedIn) directly?
This section is a decision template. Fill in specifics after reviewing Optimizer's documentation or contacting their support team. Check with the vendor:
- Does it expose a REST API or export endpoints?
- Does it write to cloud storage (GCS/S3)? BigQuery can ingest directly from GCS.
- Does it have a direct BigQuery export feature?
- What dimensions/metrics does it expose? Aggregators sometimes summarize in ways that lose granularity needed for deep analysis.
If Optimizer already aggregates paid channels and the data fidelity is acceptable, pulling from it instead of each channel separately reduces pipeline complexity.
Decision Tree: How to Connect
Step 1: Check what Optimizer offers natively
Contact Optimizer support or review their API/export documentation:
- Does Optimizer offer a direct BigQuery export or scheduled GCS dump?
- Yes → use it. This is the simplest path. Ask for format documentation (schema, partition strategy, null handling).
- No → continue to Step 2.
- Does Optimizer have a REST API?
- Yes → write a custom Python script using requests/httpx. See the script template below.
- No → continue to Step 3.
- Does Optimizer offer CSV/TSV export to email or SFTP?
- Yes → set up automated export to GCS and load with BQ load jobs.
- No → escalate to your Optimizer account manager — this is a significant limitation.
Step 2: Assess whether pulling from Optimizer is worth it
Optimizer aggregates data that may already be flowing into BigQuery from source channels:
| Data in Optimizer | Already in BQ? | Pull from Optimizer? |
|---|---|---|
| Google Ads performance | Yes (BDTS) | Only if Optimizer normalizes/enriches it |
| Meta Ads performance | Yes (Fivetran) | Only if Optimizer adds cross-channel normalization |
| Bing Ads performance | Yes (Fivetran) | Only if Optimizer's schema is cleaner |
| Cross-channel aggregated metrics | Maybe not | Yes — this is Optimizer's core value |
| Optimizer-specific attribution | Unique to Optimizer | Yes |
Rule of thumb: If Optimizer provides normalization or attribution logic on top of raw channel data, that added layer is worth pulling. If it's just re-exporting the same numbers from Google Ads and Meta without transformation, pull from source channels for higher fidelity.
Step 3: REST API — custom Python script template
If Optimizer has a REST API, adapt this pattern:
pip install requests google-cloud-bigquery
import requests
from google.cloud import bigquery
from datetime import datetime, timezone, timedelta
BQ_PROJECT = "marty-search-console-ga4-api"
OPTIMIZER_BASE_URL = "https://api.optimizer-vendor.com/v1" # replace with actual URL
OPTIMIZER_API_KEY = "YOUR_API_KEY" # store in Secret Manager
headers = {
"Authorization": f"Bearer {OPTIMIZER_API_KEY}",
"Content-Type": "application/json",
}
def fetch_performance(start_date: str, end_date: str) -> list:
"""
Pull performance data from Optimizer API.
start_date/end_date: YYYY-MM-DD format
Adjust endpoint path, params, and pagination to match actual API spec.
"""
url = f"{OPTIMIZER_BASE_URL}/reports/performance"
params = {
"start_date": start_date,
"end_date": end_date,
"dimensions": "date,channel,campaign", # adjust to API spec
"metrics": "impressions,clicks,spend,conversions",
}
rows = []
page = 1
while True:
response = requests.get(url, headers=headers, params={**params, "page": page})
response.raise_for_status()
data = response.json()
rows.extend(data.get("results", []))
if not data.get("has_next_page"):
break
page += 1
return rows
def load_to_bq(rows: list, table_id: str):
bq = bigquery.Client(project=BQ_PROJECT)
pulled_at = datetime.now(timezone.utc).isoformat()
for row in rows:
row["pulled_at"] = pulled_at
errors = bq.insert_rows_json(table_id, rows)
if errors:
raise RuntimeError(f"BQ errors: {errors}")
print(f"Loaded {len(rows)} rows to {table_id}")
if __name__ == "__main__":
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
rows = fetch_performance(yesterday, yesterday)
load_to_bq(rows, f"{BQ_PROJECT}.optimizer.performance_daily")
Step 4: Deploy as a Cloud Run Job
gcloud run jobs create optimizer-pipeline \
--image gcr.io/marty-search-console-ga4-api/optimizer-pipeline \
--region us-central1
# Schedule daily
gcloud scheduler jobs create http optimizer-daily \
--schedule="0 8 * * *" \
--uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/marty-search-console-ga4-api/jobs/optimizer-pipeline:run" \
--oauth-service-account-email=YOUR_SA@marty-search-console-ga4-api.iam.gserviceaccount.com
Key Gotchas
- Unknown vendor specifics. This document is a template. The actual API endpoint paths, authentication method, data schema, and pagination behavior need to be confirmed with Optimizer's documentation or support team before writing production code.
- Fidelity trade-off. Aggregators like Optimizer may apply their own attribution or normalization rules that differ from source platform numbers. If exact platform spend figures are required for billing reconciliation, pull from source channels. If cross-channel normalized metrics for optimization decisions are the goal, Optimizer may be authoritative.
- Avoid double-counting. If Optimizer data lands in BQ alongside raw channel data (Google Ads, Meta, Bing), be explicit in your data model about which table is source of truth for which metrics. Don't aggregate both without a dedup strategy.
- Data freshness. Aggregators often have a lag on top of the platform's own lag. Verify Optimizer's data freshness SLA before relying on it for day-over-day reporting.
- API key rotation. Store Optimizer credentials in Google Secret Manager and inject at Cloud Run Job runtime. Never embed keys in Docker images or commit them to source control.
Questions to Answer Before Building
Before writing any code, get answers to these from Optimizer:
- Is there a native BigQuery connector or GCS export?
- Is there a REST API? What's the authentication method (API key, OAuth, HMAC)?
- What are the available dimensions and metrics?
- What is the data freshness / lag from platform to Optimizer?
- How far back can you pull historical data?
- What is the rate limit on API calls?
- Does Optimizer's schema overlap with your source channel tables, and if so, how?
Architectural Recommendations
Python Scripts (Reddit organic, Wikipedia, LLMs, Bing Ads if no Airbyte)
Use Cloud Run Jobs over PythonAnywhere for production:
- Native GCP IAM — no key file management, uses Workload Identity
- Container-based — reproducible environments
- Built-in Cloud Scheduler integration for daily/hourly runs
- Scales to zero — costs pennies per invocation
- PythonAnywhere is fine for prototyping
Use google-cloud-bigquery Python client with Workload Identity on Cloud Run. Write via streaming inserts (small volumes) or batch LOAD jobs (daily bulk pulls).
ETL Strategy (Bing Ads, LinkedIn Ads, Reddit Ads)
If adding multiple non-native sources, consider Airbyte self-hosted on a small GCE VM — one management interface covers all three connectors and more. Free (you run the infrastructure, costs ~$20–40/month for a small e2-micro).
Airbyte Cloud at $15/million rows is the lower-ops alternative if self-hosting is too much friction.
Do not use Fivetran — minimum $12K/year annual contract is not justified for this use case.
BDTS Pricing Note
Starting September 25, 2025, BigQuery DTS switches third-party marketplace connectors to consumption-based pricing (slot-hours). First-party Google connectors (GA4, Google Ads, Search Console, YouTube Ads, Meta Ads, Ad Manager) remain free. Only third-party Fivetran/partner connectors sold through GCP Marketplace are affected.
Immediate Action Items
- Apply for LinkedIn Marketing API access now — 2–4 week wait, blocks LinkedIn Ads pipeline
- Switch SERP API provider from SerpApi to DataForSEO (lawsuit risk)
- Verify Google Ad Manager account type — DT file access required for BDTS connector
- Set up Airbyte (self-hosted or Cloud) to cover Bing Ads + LinkedIn Ads (once approved) + Reddit Ads
- Prototype Reddit/Wikipedia/LLM Python scripts on PythonAnywhere; migrate to Cloud Run Jobs for production
- Do not build LinkedIn personal-token pipeline — account termination risk is real after 2025 enforcement actions