How we grade financial influencers.
GuruScope is an accountability engine, not a brokerage ledger. Every score on this site follows the rules below — published in full so you, journalists, and compliance teams can audit them. Methodology updates are dated in the change log at the bottom of this page.
Data collection
We ingest content from public-facing channels using each platform's standard public API or syndicated RSS feed. As of v1.0 we cover:
- YouTube — channel uploads via RSS + yt-dlp resolution. Transcripts via the official transcript API with a Webshare proxy fallback.
- Twitter / X — Apify `apidojo~tweet-scraper` when the operator has a paid Apify plan with residential proxies. Otherwise stubbed.
- Blog / RSS — feed parsing via `feedparser`.
- TikTok & podcasts — adapters exist but are not enabled in v1 because audio transcription cost is not yet justified by the data quality.
Each guru's content is re-scanned weekly via a cron job (Sundays 02:00 UTC). New content is appended; existing predictions are re-verified against fresh market data.
Prediction extraction
Transcripts and posts are passed to a Gemini 2.5 Flash extractor (or Claude Sonnet 4.6 when configured). The model returns a strict JSON array of structured claims with:
- `claim`, `asset`
- `exact_quote` — verbatim, used for citation
- `recommendation_type` — buy / sell / hold / sector_rotation / avoid / general_macro
- `time_horizon` — intraday / weeks / months / 1-2_years / 5+_years / unspecified
- `confidence_level` — high / medium / low (drives calibration scoring)
- `disclosures[]` — array of `{type, quote}` tagging holds_position, paid_promo, disclaimer, conflict, or no_disclosure
We require exact_quote on every prediction so any reader can verify our paraphrase against the source. Predictions without a verifiable quote are rejected.
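For illustration, a single extracted claim might look like the sketch below. The field names are the ones listed above; the quote, asset, and disclosure values are invented for the example.

```python
# Illustrative only -- one extracted claim in the schema above; values are invented.
claim = {
    "claim": "Bitcoin will reach $100k within a year",
    "asset": "BTC-USD",
    "exact_quote": "I think Bitcoin hits one hundred thousand by next summer.",
    "recommendation_type": "buy",
    "time_horizon": "1-2_years",
    "confidence_level": "high",
    "disclosures": [{"type": "holds_position", "quote": "Full disclosure, I own Bitcoin."}],
}

# Predictions without a verifiable quote are rejected.
assert claim.get("exact_quote"), "rejected: no verifiable exact_quote"
```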
Entry price rule
Entry price for any traded asset is the closing price on the publish date of the source content. This is the same convention used by academic studies of newsletter performance and by competing trackers (TrueAlpha, CXO Advisory, etc.). No look-ahead.
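A minimal sketch of the entry-price convention, assuming yfinance as the price source (not necessarily the provider used in production) and falling back to the most recent prior close when the publish date isn't a trading day:

```python
import pandas as pd
import yfinance as yf

def entry_price(ticker: str, publish_date: str) -> float:
    """Closing price on the publish date of the source content (no look-ahead)."""
    pub = pd.Timestamp(publish_date)
    # yfinance treats `end` as exclusive, so extend by one day to include the
    # publish date; start a week earlier so weekends/holidays still resolve.
    hist = yf.Ticker(ticker).history(start=pub - pd.Timedelta(days=7),
                                     end=pub + pd.Timedelta(days=1))
    # Last close on or before the publish date (prior trading day if markets were
    # shut -- an assumption here; the rule above only names the publish date).
    return float(hist["Close"].iloc[-1])
```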
Evaluation horizons
Every prediction is scored at four horizons in parallel:
- Stated — the creator's own time horizon (default for credibility scoring; charitable)
- 90 / 180 / 365 days from publish — fixed windows that prevent the "my 10-year thesis hasn't played out yet" dodge
Leaderboard alpha defaults to the stated horizon; users can switch via the horizon dropdown to compare. Horizons that haven't elapsed (e.g. 365d for a prediction made 30 days ago) return null rather than partial values.
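A sketch of the null-for-unelapsed rule, assuming the stated horizon has already been normalised to a day count (the helper name is ours, not the production scheduler):

```python
from datetime import date, timedelta
from typing import Optional

FIXED_HORIZONS = {"90d": 90, "180d": 180, "365d": 365}

def horizon_end_dates(publish_date: date, stated_days: Optional[int],
                      today: date) -> dict:
    """End date per horizon; None when the window hasn't elapsed yet."""
    windows = dict(FIXED_HORIZONS)
    if stated_days is not None:             # creator's own horizon, when stated
        windows["stated"] = stated_days
    ends = {}
    for name, days in windows.items():
        end = publish_date + timedelta(days=days)
        ends[name] = end if end <= today else None   # null rather than partial
    return ends
```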
Benchmark selection
Default benchmark is the S&P 500 (^GSPC). Sector-specific ETFs are used when the underlying asset matches a known sector taxonomy:
- Energy → `XLE`, Mining/Gold → `GDX`, Tech → `XLK`, Financials → `XLF`, Healthcare → `XLV`, Consumer → `XLY`, Real estate → `XLRE`
- Crypto → `BTC-USD` (S&P 500 used as secondary cross-check)
- Commodities → `GSG` or futures contract directly
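The taxonomy above boils down to a lookup table. A hypothetical sketch — the key names are ours, the tickers come from the list:

```python
SECTOR_BENCHMARKS = {
    "energy": "XLE",
    "mining_gold": "GDX",
    "tech": "XLK",
    "financials": "XLF",
    "healthcare": "XLV",
    "consumer": "XLY",
    "real_estate": "XLRE",
    "crypto": "BTC-USD",    # S&P 500 kept as a secondary cross-check
    "commodities": "GSG",   # or the relevant futures contract directly
}

def benchmark_for(sector: str | None) -> str:
    """Sector-matched ETF when the asset maps to a known sector, else S&P 500."""
    return SECTOR_BENCHMARKS.get(sector, "^GSPC")
```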
Alpha calculation
Alpha is the simple difference between asset return and benchmark return over the same window:
alpha_pct = ((asset_close_T - asset_close_0) / asset_close_0
- (bench_close_T - bench_close_0) / bench_close_0) * 100
Example: BTC bought at $42,000 on 2025-01-01, $48,000 on 2025-04-01 = +14.3%
S&P 500 over same period = +6.1%
alpha = +14.3% - 6.1% = +8.2%

We do not annualise. Reported alpha is the realised excess return at each horizon.
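The same calculation as a runnable function. The S&P 500 closes below are invented so the benchmark return matches the quoted +6.1%:

```python
def alpha_pct(asset_close_0: float, asset_close_T: float,
              bench_close_0: float, bench_close_T: float) -> float:
    """Realised excess return over the window, in percentage points (not annualised)."""
    asset_ret = (asset_close_T - asset_close_0) / asset_close_0
    bench_ret = (bench_close_T - bench_close_0) / bench_close_0
    return (asset_ret - bench_ret) * 100

# Reproduces the worked example: BTC +14.3% vs benchmark +6.1% -> alpha of roughly +8.2
print(round(alpha_pct(42_000, 48_000, 5_000, 5_305), 1))
```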
Directional accuracy
Predictions resolve to one of four outcomes:
- Correct — directional call (buy/sell) confirmed by >5% move in the predicted direction within the horizon
- Wrong — directional call contradicted by >5% move in the opposite direction
- Partial — small move (within ±5%), or right direction but missed magnitude target. Counts as 0.5 in accuracy
- Pending — horizon hasn't elapsed yet
Qualitative claims (e.g. "the Fed will pause") that aren't directly tied to a tradeable asset are verified via Tavily web search rather than market data — same correct/wrong/pending buckets.
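A simplified sketch of the directional buckets for buy/sell calls. Magnitude targets and qualitative claims are out of scope here, and the function name and weighting map are ours:

```python
ACCURACY_WEIGHT = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}

def resolve_direction(recommendation_type: str, asset_return_pct):
    """Bucket a buy/sell call by its realised return over the horizon."""
    if asset_return_pct is None:
        return "pending"                          # horizon hasn't elapsed yet
    predicted_up = recommendation_type == "buy"   # "sell" predicts the downside
    signed_move = asset_return_pct if predicted_up else -asset_return_pct
    if signed_move > 5:
        return "correct"
    if signed_move < -5:
        return "wrong"
    return "partial"                              # within ±5%, counted as 0.5
```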
Risk adjustment
Three risk metrics per prediction, aggregated per guru:
- Max drawdown — worst peak-to-trough decline of the recommended asset during the holding period
- Volatility — standard deviation of returns across the guru's predictions
- Approximate Sharpe — `avg_alpha / volatility` (zero risk-free rate assumed for simplicity)
The `risk_adjusted_score` shown on profiles is the guru's risk_adjusted_alpha (`avg_alpha / |max_drawdown|`) scaled to 0-100. A guru who picks volatile penny stocks scores lower than one who picks steady compounders, even at the same alpha.
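A sketch of the drawdown and ratio pieces; the exact 0-100 scaling applied on profiles isn't reproduced here:

```python
import numpy as np

def max_drawdown(prices: np.ndarray) -> float:
    """Worst peak-to-trough decline over the holding period, as a negative fraction."""
    running_peak = np.maximum.accumulate(prices)
    return float((prices / running_peak - 1.0).min())

def risk_adjusted_alpha(avg_alpha_pct: float, worst_drawdown: float) -> float:
    """avg_alpha / |max_drawdown|; the 0-100 profile scaling is applied downstream."""
    return avg_alpha_pct / abs(worst_drawdown) if worst_drawdown else float("inf")
```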
Statistical significance
A guru with 3 predictions at 100% accuracy is statistically indistinguishable from a coin flip. Without a sample-size gate, the leaderboard would be dominated by lucky outliers. We run two tests:
Primary — one-sample t-test on alpha
from scipy import stats

t_stat, p_value = stats.ttest_1samp(alpha_values, popmean=0.0)
significant = p_value < 0.05
Secondary — bootstrap 95% CI on mean alpha (handles fat-tailed return distributions where the t-test can mislabel during tail events)
import numpy as np

# 5000 resamples with replacement
bootstrap_means = [np.random.choice(alpha_values, len(alpha_values), replace=True).mean() for _ in range(5000)]
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
positive_alpha_confirmed = ci_lower > 0
Sample-size tiers
- N/A — fewer than 5 verified predictions. Shown on profile pages only.
- PRELIM — 5-19 verified. Eligible for leaderboard but tagged as preliminary.
- SIG — passes both t-test (p<0.05) AND bootstrap CI lower bound > 0.
The default leaderboard view is exploratory and shows all gurus. The "Sig only" toggle and "N≥20 (evaluable)" filter let users gate by methodology rigour.
Confidence calibration
Confidence calibration measures whether a guru's stated confidence (high/medium/low) tracks their actual hit rate. We use a Brier-style decomposition: bucket predictions by stated confidence, compute realised accuracy per bucket, then penalise the gap between expected accuracy (e.g. 80% for "high confidence") and observed.
A guru who's right 80% of the time on "high confidence" calls and 50% on "low" scores 100. A guru who uses "high confidence" on every call regardless of outcome scores 0.
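A simplified sketch of the bucketing step. Only the 80% anchor for "high" is stated above; the medium/low anchors and the penalty curve are assumptions, so this won't reproduce the production extremes exactly:

```python
EXPECTED = {"high": 0.80, "medium": 0.65, "low": 0.50}   # medium/low are assumed anchors

def calibration_score(outcomes: dict) -> float:
    """outcomes maps confidence bucket -> list of per-prediction accuracies (1, 0.5, 0)."""
    gaps, weights = [], []
    for bucket, results in outcomes.items():
        if not results:
            continue
        observed = sum(results) / len(results)
        gaps.append(abs(EXPECTED[bucket] - observed))
        weights.append(len(results))
    if not gaps:
        return 0.0
    mean_gap = sum(g * w for g, w in zip(gaps, weights)) / sum(weights)
    # Map the weighted mean gap onto 0-100; a gap of 0.5 or more bottoms out at zero.
    return max(0.0, 1.0 - mean_gap / 0.5) * 100
```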
Disclosure detection
For every recommendation, the extractor scans the source content for explicit disclosures:
- `holds_position` — creator says they own the asset
- `paid_promo` — sponsored content, affiliate link, brand deal
- `disclaimer` — generic "not financial advice" or similar
- `conflict` — financial relationship with the asset issuer
- `no_disclosure` — recommendation made with no disclosure of any kind
The aggregate `disclosure_quality` score is the percentage of recommendations accompanied by at least one disclosure of any type. We treat undisclosed buy/sell calls as a structural red flag, not just stylistic preference — this aligns with FINRA's 2024 finfluencer enforcement actions (M1 Finance $850K, TradeZero $250K, Moomoo $750K).
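As a formula, the aggregate is just a share of recommendations. A minimal sketch, assuming a no_disclosure tag does not itself count as a disclosure:

```python
def disclosure_quality(predictions: list) -> float:
    """Percentage of recommendations carrying at least one real disclosure."""
    if not predictions:
        return 0.0
    disclosed = sum(
        1 for p in predictions
        if any(d["type"] != "no_disclosure" for d in p.get("disclosures", []))
    )
    return disclosed / len(predictions) * 100
```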
Deletion policy
Each weekly re-scan compares live content IDs against our cache. Missing items are marked deleted_at = NOW() in our DB but their cached extracted predictions are preserved. Profile pages show deleted predictions with a strikethrough + accent "deleted" tag, citing the original URL even though it returns 404.
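The comparison itself is a set difference over content IDs; a minimal sketch, with a hypothetical persistence call noted in comments:

```python
def detect_deletions(live_ids: set, cached_ids: set) -> set:
    """Content IDs present in our cache but missing from the live weekly scan."""
    return cached_ids - live_ids

# For each missing ID we mark deleted_at = NOW() but keep the cached predictions.
# A hypothetical persistence call might look like:
#   db.execute("UPDATE content SET deleted_at = NOW() WHERE content_id = %s", (cid,))
```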
We do not host the original video, transcript, or screenshot. Only our structured prediction record persists. This is consistent with fair-use research/criticism.
Known limitations
- Transcript imperfection — auto-captioned YouTube transcripts can mishear numbers and ticker symbols. We mitigate by requiring the model to quote `exact_quote`; visibly garbled quotes are flagged for review.
- Context loss — extraction can miss qualifiers ("if the Fed pauses, then BTC to $100k"). We're improving by passing larger windows of surrounding text.
- Survivorship in the source — if a creator deletes wrong calls before our first scan, we never see them. Deletion tracking only works on content we've previously cached.
- Benchmark mismatch — sector ETF assignment is heuristic. A creator recommending a small-cap miner gets benchmarked against `GDX`, which may understate or overstate their stock-picking edge.
- We are not a brokerage ledger — predictions are statements made publicly, not actual trades. We measure what they said, not what they did. A creator could outperform privately while looking bad here, and vice versa.
Change log
Initial public methodology. v3 scoring (9 dimensions) + v4 additions (statistical significance via t-test + bootstrap, dual-horizon scoring at 90/180/365 + stated, sample-size tiers, leaderboard sig gates).
Cross-platform deduplication via TF-IDF cosine similarity. Conviction signal surfaces predictions repeated across multiple platforms or videos.
Soft-anchored qualitative prompts (5-point scale for risk_awareness and survivorship_honesty). Removed unfair zero-anchoring penalty.
Questions and correction requests: methodology@guruscope.com. Disputes are reviewed manually; we do not auto-approve creator edits.