Methodology
Alex Hines — Quantitative Research Engineer
Five paper-trading sleeves have been running live since 2026-06-01. One sleeve trades at $250K paper notional through Interactive Brokers (account DUP***83); four are shadow books for hypothesis comparison. The signal-spec fence has not been touched since inception. This page is the project's methodology surface; the dashboard tiles are the live process record; the SSRN paper is the formal writeup.
SSRN paper — Q3 2026 · GitHub repository pending · LinkedIn pending · Substack pending
Intro
I am a quantitative-research engineer building a single-operator equity research and paper-trading stack. The codebase runs a 457-signal point-in-time research panel, a five-sleeve paper-trading pipeline against IBKR, and a methodology fence that hashes every active signal's metadata into a single SHA-256 lock. The stack's headline outputs are the daily NAV and risk numbers shown elsewhere on this dashboard. The stack's headline finding is that, of seventeen famous published equity-anomaly signals, zero survive retail-cost-honest replication. That negative result, and the multi-agent adversarial process that produced it, is what this page is about.
Why this strategy
A long-only, top-N, monthly-rebalanced US equity sleeve was not chosen because it has the highest expected Sharpe; it does not. It was chosen because its cost-and-friction profile is the one a retail-scale operator actually faces, and because the operational stack (PIT panel, fence, walk-forward eval, IBKR submission, T+1 reconciliation) is the artifact that transfers to an institutional context.
Three design constraints drove the strategy shape:
- Honest costs over headline Sharpe. Net-of-cost evaluation uses commissions, half-spread, market-impact (Almgren-Chriss style on each rebalance leg), short borrow where applicable, dividend tax, and short-term-gains income tax. Strategies that look strong gross-of-cost and collapse net-of-cost are the rule in published anomaly research; building the cost model first means the headline only ever moves down.
- Single operator, no co-location. All execution is end-of-day MOC, all rebalance cadences are weekly or slower, and no signal depends on intraday microstructure. This is a deliberate choice to keep the strategy reproducible by someone with retail infrastructure.
- Process auditability over alpha mystique. Every active signal is registered, hashed, and pinned. Every paper trade is reconstructable from disk. The bet is that hiring managers prefer an honest replication framework over a black-box backtest.
The 4-method ensemble (sleeve_5) is the headline sleeve only because, on the honest evaluation window, it has the highest measured net-of-cost Sharpe, not because the underlying signals are novel.
4-method ensemble architecture
Sleeve_5 — the LIVE $250K paper sleeve — combines four independently trained ranking models. Each method is trained walk-forward against the same 457-signal PIT panel; outputs are per-name conformalised ranks that are then averaged with equal weights. Equal-weight averaging was chosen over learned blending because (a) blending introduces another best-of-K selection surface; (b) on the honest 12-fold window, equal-weight beat optimised-weight after Bonferroni correction.
| # | Method | Loss / objective | Why it's in the ensemble |
|---|---|---|---|
| 1 | LightGBM quantile trio (q10 / q50 / q90) | Multi-quantile pinball loss | Captures the shape of the return distribution per name, not just the mean. The q90 head is the dominant driver on long-only top-N. |
| 2 | XGBoost monotone | Pairwise rank loss with monotonic constraints on canonical signs (e.g. momentum +, volatility −) | Hard-constrains the model to respect known financial priors — prevents the booster from learning a spurious sign-flip in a quiet regime. |
| 3 | V1 LightGBM (frozen) | Standard pairwise rank | The pre-V2 baseline, frozen verbatim. Acts as a diversifier and a non-negotiable sanity anchor — if V1 drifts away from V2, that is itself a regime-flag. |
| 4 | CatBoost YetiRank | Listwise rank loss | Trained on a longer-horizon (63-day) panel slice; provides regime-diversification against the three 21-day-horizon methods. |
The blended score is then mapped to a long-only top-100 portfolio with equal weights and a 21-day rebalance. Sleeve_2, sleeve_3, and sleeve_4 strip out methods or change the cardinality rule (hold-N versus top-N) so we can attribute the contribution of each design choice. Sleeve_1 is V1-solo and is the apples-to-apples baseline.
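The blend-then-select step can be illustrated with a minimal sketch. It assumes each method emits a per-name rank score where higher is better, and that only names scored by every method are blended; the function names, method keys, and toy tickers are hypothetical and are not taken from the project's code.

```python
from statistics import mean


def blend_equal_weight(method_ranks):
    """Average per-name rank scores across methods with equal weights.

    method_ranks maps method name -> {ticker: rank score, higher is better}.
    Assumption: only names scored by every method are blended.
    """
    if not method_ranks:
        return {}
    common = set.intersection(*(set(r) for r in method_ranks.values()))
    return {t: mean(r[t] for r in method_ranks.values()) for t in common}


def top_n_equal_weight(scores, n=100):
    """Long-only top-n portfolio with equal weights."""
    # Sort by score descending, then ticker, so ties break deterministically.
    picked = sorted(scores, key=lambda t: (-scores[t], t))[:n]
    return {t: 1.0 / len(picked) for t in picked}


# Toy example: two methods, three names, top-2 portfolio.
ranks = {
    "method_a": {"AAA": 0.9, "BBB": 0.2, "CCC": 0.5},
    "method_b": {"AAA": 0.7, "BBB": 0.4, "CCC": 0.6},
}
blended = blend_equal_weight(ranks)
portfolio = top_n_equal_weight(blended, n=2)
```

With no learned weights there is nothing to tune at the blending stage, which is the property the equal-weight choice is meant to preserve.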
PIT bitemporal data discipline
Every signal in the panel is evaluated as it was known on its anchor date. The data layer is bitemporal — each row carries both a period_end_date (the as-of of the underlying economic fact) and a knowledge_date (the earliest moment that fact was publishable). All joins use knowledge_date <= anchor_date - 1 trading day, which means a signal can never see information that was not yet in the public record on the trading day it was used.
The bitemporal discipline applies across:
- Compustat / CRSP via WRDS. Fundamentals are joined on the WRDS rdq (report date of quarterly earnings) plus a one-trading-day publication lag; restatements after rdq are not back-applied to old anchors.
- IBES analyst estimates. Each estimate carries an anndate and revdate; consensus is the cross-section as of anchor_date - 1, not the latest revision known today.
- EODHD price + corporate actions. Splits and dividends are applied PIT: a 2:1 split announced on day T does not change the adjusted close on day T-1 in our panel.
- OptionMetrics / RavenPack (where licensed). Same knowledge_date discipline; missing data is null, never carried forward from a future observation.
The PIT layer is covered by ~1,800 unit tests, including dedicated leakage tests for each of the 457 specs that assert no future-dated row contributes to any historical anchor. The leak-test template is tests/signals/compute/test_<name>.py; CI fails the build if any compute function regresses to use unsanitised data.
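A minimal sketch of the as-of rule, knowledge_date <= anchor_date - 1 trading day, is shown here. The `Fact` record, the helper names, and the weekday-only calendar (which ignores exchange holidays) are all assumptions for illustration; the project's actual join code is not reproduced.

```python
from bisect import bisect_left
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Fact:
    symbol: str
    period_end_date: date  # as-of date of the underlying economic fact
    knowledge_date: date   # earliest date the fact was publishable
    value: float


def previous_trading_day(anchor, trading_days):
    """Last trading day strictly before anchor (trading_days must be sorted)."""
    i = bisect_left(trading_days, anchor)
    if i == 0:
        raise ValueError("no trading day before anchor")
    return trading_days[i - 1]


def as_of(facts, symbol, anchor, trading_days) -> Optional[Fact]:
    """Newest fact for symbol that was knowable one trading day before anchor."""
    cutoff = previous_trading_day(anchor, trading_days)
    visible = [f for f in facts
               if f.symbol == symbol and f.knowledge_date <= cutoff]
    if not visible:
        return None  # missing stays null; nothing is pulled from the future
    return max(visible, key=lambda f: (f.period_end_date, f.knowledge_date))


# Toy calendar: weekdays of 2024 only (an assumption; holidays are ignored).
start = date(2024, 1, 1)
trading_days = [d for d in (start + timedelta(days=i) for i in range(366))
                if d.weekday() < 5]

facts = [
    Fact("AAA", date(2024, 3, 31), date(2024, 5, 2), 1.00),   # original print
    Fact("AAA", date(2024, 3, 31), date(2024, 8, 12), 0.80),  # later restatement
]
june = as_of(facts, "AAA", date(2024, 6, 3), trading_days)
same_day = as_of(facts, "AAA", date(2024, 5, 2), trading_days)
```

In this toy case the June anchor sees only the original 1.00 print, and an anchor on the publication date itself sees nothing, because the one-trading-day lag has not yet elapsed.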
Methodology fence + cryptographic SHA verification
Every signal in the research panel is registered as a Pydantic SignalSpec carrying its name, version, polarity, required input families, lookback windows, and period type. The active registry is hashed by compute_signal_spec_sha256() into a single SHA-256 lock. As of 2026-06-01 the active values are:
- Spec count: 457
- Registry SHA-256: 39955c9172899e6246b46f5ebf3b9e7cbb8af65ce22710281a6f32608f660988
- Asserted in: tests/signals/test_spec.py (count and SHA gates)
- No amendments since inception of paper trading (2026-06-01).
The hash binds Strategy C v1 paper-trading to its research artifact: the trade history on this dashboard is the deterministic output of the registry whose SHA matches the one above. Compute implementations may evolve behind the fence — a faster join, a Polars rewrite, a vectorised helper — as long as the numerical contract (polarity, scale, fill semantics) is preserved. Adding, removing, or modifying specs changes the hash and constitutes a fence epoch bump, which would mint a new strategy version. That has not happened.
The fence is continuously verified: every CI run, every nightly orchestrator run, and every dashboard publish recomputes the SHA and aborts if it has drifted from the pinned value in strategies/paper_trading_v1/spec.yaml. There is no "soft" version of the fence — a mismatched hash is a build failure, not a warning.
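One way a registry hash of this kind can be built is sketched below. This is not the implementation of compute_signal_spec_sha256(); the canonicalisation (sorted specs, sorted keys, compact JSON), the function names, and the example spec fields are assumptions chosen so that the digest depends only on spec content and not on registration order.

```python
import hashlib
import json


def registry_sha256_sketch(specs):
    """Hash a list of spec-metadata dicts into one SHA-256 hex digest.

    Hypothetical canonical form: specs sorted by (name, version), keys sorted,
    compact separators, UTF-8 bytes.
    """
    ordered = sorted(specs, key=lambda s: (s["name"], s["version"]))
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_fence(specs, pinned_sha, pinned_count):
    """Hard gate: any count or hash mismatch raises, never warns."""
    if len(specs) != pinned_count:
        raise RuntimeError(f"spec count {len(specs)} != pinned {pinned_count}")
    actual = registry_sha256_sketch(specs)
    if actual != pinned_sha:
        raise RuntimeError(f"registry SHA drifted: {actual} != {pinned_sha}")


# Toy registry with two hypothetical specs.
specs = [
    {"name": "mom_12_1", "version": 1, "polarity": 1, "lookback": 252},
    {"name": "vol_63d", "version": 1, "polarity": -1, "lookback": 63},
]
pinned = registry_sha256_sketch(specs)
assert_fence(list(reversed(specs)), pinned, 2)  # order does not matter
```

Changing any field of any spec, for example flipping a polarity, produces a different digest and makes `assert_fence` raise, which mirrors the "build failure, not a warning" behaviour described above.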
Walk-forward training + 21-day rebalance cadence
All four ensemble methods are trained walk-forward, never on a held-out future block. The training schedule is:
- Anchor cadence: 21 trading days (≈ one calendar month). Anchors are pre-scheduled in strategies/paper_trading_v1/spec.yaml for the full pre-registered window; the publisher refuses to score an off-schedule date.
- Training window: Expanding, from 2008-01-01 up to anchor_date - 21d - embargo.
- Embargo: A 21-trading-day embargo separates the training tail from the prediction date, so a label that ends on day T cannot be in the training set of a model that predicts day T+1.
- Holdout: Pre-registered for the full forward window (inception 2026-06-01 → indefinite). The fence prevents retraining on holdout data without minting a new strategy version.
- Refit frequency: Each anchor is a fresh fit on the expanding window; there is no online-update / streaming-fit shortcut.
The 21-day cadence was chosen because (a) it is roughly the cost-minimising rebalance frequency given the measured turnover and bid-ask, (b) it keeps the per-anchor trade list small enough to fit cleanly in IBKR MOC submission, and (c) it gives the post-trade reconciliation 21 days to detect drift before the next book change.
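The schedule arithmetic can be sketched over integer trading-day indices. This reads "anchor_date - 21d - embargo" as a 21-day label horizon plus a 21-day embargo, both in trading days, which is an interpretation rather than something the spec file is quoted as saying; the function and field names are hypothetical.

```python
def walk_forward_folds(n_days, first_anchor, cadence=21, horizon=21, embargo=21):
    """Build expanding-window folds over trading-day indices 0..n_days-1.

    train_end is the last feature-date index allowed in training (inclusive).
    A sample at index t has a label ending at t + horizon, so the newest
    training label ends at least `embargo` trading days before the anchor.
    """
    folds = []
    for anchor in range(first_anchor, n_days, cadence):
        train_end = anchor - horizon - embargo
        if train_end < 0:
            continue  # not enough history for this anchor yet
        folds.append({"anchor": anchor, "train_start": 0, "train_end": train_end})
    return folds


# Toy example: 300 trading days of history, first anchor at index 100.
folds = walk_forward_folds(n_days=300, first_anchor=100)
```

Each fold starts at index 0, so the window expands by one cadence step per anchor, and every anchor implies a fresh fit rather than an incremental update.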
Cost model + execution realism
Every backtest, sleeve, and live anchor is evaluated net-of-cost from the same cost model. There is no separate "frictionless" report.
The cost model has six components:
- Commission. IBKR-tiered ($0.0035/share, $0.35 min, ~$3.50 max per leg, plus exchange + clearing pass-throughs).
- Half-spread. Stock-specific from EODHD intraday quotes, capped at 50 bps for thin names.
- Market impact. Almgren-Chriss with a per-name temporary-impact coefficient calibrated against ADV; full liquidation of the rebalance trade is assumed at MOC.
- Short borrow (for the symmetric / shadow research; sleeves 1–5 are long-only).
- Dividend tax. Federal qualified-dividend rate on long holds, ordinary rate on short holds.
- Short-term-gains income tax. Federal + (configurable) state ordinary rate on positions held < 366 days.
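The first three components can be combined per rebalance leg roughly as follows. The commission parameters and the 50 bps spread cap come from the list above (the $3.50 cap is the approximate figure quoted there, and pass-through fees are left out); the linear impact form and the `eta_bps` coefficient are placeholders, not the project's calibrated Almgren-Chriss terms. Borrow and tax components are omitted.

```python
def commission_usd(shares):
    """Tiered-style commission: $0.0035/share, $0.35 floor, ~$3.50 cap per leg."""
    return min(max(0.0035 * shares, 0.35), 3.50)


def leg_cost_bps(shares, price, half_spread_bps, adv_shares, eta_bps=10.0):
    """Approximate one-way cost of a single leg, in basis points of notional.

    eta_bps is a hypothetical temporary-impact coefficient: the impact term
    is assumed linear in participation (shares / ADV).
    """
    notional = shares * price
    commission_bps = commission_usd(shares) / notional * 1e4
    spread_bps = min(half_spread_bps, 50.0)  # cap for thin names
    impact_bps = eta_bps * shares / adv_shares
    return commission_bps + spread_bps + impact_bps


# Toy leg: 500 shares at $50 in a thin name quoted 80 bps half-spread.
cost = leg_cost_bps(shares=500, price=50.0, half_spread_bps=80.0,
                    adv_shares=1_000_000)
```

In this toy leg the capped half-spread dominates the total, which is the typical pattern for small orders in thin names.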
The most important finding from the live run so far is that the honest inception cost was substantially higher than the cost-model midpoint. Day-1 fills on inception (2026-06-01) showed ~90 bps of slippage versus the close, well above the cost-model's expected ~22 bps. The cost surplus is dominantly attributable to fast-pace MOC imbalance in a handful of mid-cap names plus a fill-quality penalty that the half-spread model under-prices in the auction. This is the kind of empirical finding that only shows up when you actually trade — and it is the single biggest reason the project's headline conclusion is "retail-cost-honest equity alpha is approximately zero."
The cost model is documented in docs/COST-MODEL.md and is itself version-pinned (any change to coefficients is a strategy-version bump).
Reliability infrastructure
The paper-trading pipeline runs daily, unattended, against live vendor data. Three pieces of infrastructure exist specifically to fail loudly instead of silently mis-trading:
- Coverage drift guard. Each daily refresh recomputes per-vendor row coverage by (symbol, anchor_date) and compares against the previous anchor's distribution. If any signal's coverage shifts more than the per-spec tolerance (typically ±2% of names), the orchestrator aborts the panel build and emits a coverage_drift event. This catches vendor schema changes, mid-day data revisions, and silent ticker delistings before they corrupt a live rebalance.
- Silent-failure allowlist. Each signal compute function carries an explicit allowlist of expected null-shapes (e.g. "earnings-surprise null for names with no IBES coverage in the last 90d"). Any null pattern outside the allowlist fails the build. This is the inverse of the typical "catch-and-log" pattern: every unexpected null is a methodology event, not a warning, until it is explicitly added to the allowlist with a code-review rationale.
- Bitemporal archive. Every vendor pull is archived under data/_archive/<vendor>/year=YYYY/month=MM/ with the original payload, headers, and pull timestamp. This means any historical anchor can be re-built from the data that was actually visible on that date, rather than data revised after the fact. The archive is the empirical ground truth for the PIT layer.
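The coverage drift guard can be sketched as a comparison of per-signal coverage fractions between consecutive anchors. The exception class, function names, and single global tolerance are assumptions; the text describes per-spec tolerances and an emitted event, which are simplified here to one default and a raised error.

```python
from typing import Dict, Optional


class CoverageDrift(RuntimeError):
    """Raised to abort a panel build when coverage moves too much."""


def coverage(values: Dict[str, Optional[float]]) -> float:
    """Fraction of names with a non-null value for one signal."""
    if not values:
        return 0.0
    return sum(v is not None for v in values.values()) / len(values)


def check_coverage_drift(prev: Dict[str, float], curr: Dict[str, float],
                         tolerance: float = 0.02) -> None:
    """Compare per-signal coverage against the previous anchor.

    tolerance=0.02 mirrors the "typically ±2% of names" figure; signals
    with no previous reading are skipped since there is no baseline.
    """
    drifted = {}
    for signal, cur_cov in curr.items():
        if signal not in prev:
            continue
        shift = cur_cov - prev[signal]
        if abs(shift) > tolerance:
            drifted[signal] = round(shift, 4)
    if drifted:
        raise CoverageDrift(f"coverage_drift: {drifted}")


# Toy example: one signal loses half of its names between anchors.
prev_cov = {"sue": coverage({"AAA": 1.2, "BBB": 0.4, "CCC": -0.1, "DDD": 0.0})}
curr_cov = {"sue": coverage({"AAA": 1.1, "BBB": None, "CCC": None, "DDD": 0.2})}
try:
    check_coverage_drift(prev_cov, curr_cov)
    aborted = False
except CoverageDrift:
    aborted = True
```

Raising instead of logging is the point of the design: the build stops before a degraded panel can reach a rebalance.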
Plus the standard hygiene: typed code (mypy strict on the methodology fence and signal-compute modules), structured logging, deterministic seeds, and atomic file writes for every promote-to-canonical step.
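For the atomic-write step, a common pattern is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The helper name is hypothetical, and this sketches the general technique rather than the project's own promote-to-canonical code.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text):
    """Write text so readers see either the old file or the new one, never a partial."""
    path = Path(path)
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the swap
        os.replace(tmp, path)     # atomic swap into the canonical name
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


# Toy usage inside a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    target = Path(scratch) / "canonical.json"
    atomic_write_text(target, '{"nav": 250000}')
    atomic_write_text(target, '{"nav": 250100}')
    final_text = target.read_text(encoding="utf-8")
    leftovers = [p.name for p in Path(scratch).iterdir()]
```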
The 21× Standard-Error Bug-Catch
In May 2026 the project's V2 ensemble sweep reported a headline Sharpe of 1.76 (CAGR ~44%) for a "best-of-14" configuration. The number looked too clean. Before promoting the ensemble to a paper-trading sleeve, I designed a 20-agent adversarial audit framework: ten agents tasked with finding technical leakage, ten with finding statistical inflation, each running independently against the same codebase and prompt — try to break this result.
The audit found zero technical leakage. The PIT bitemporal joins, LambdaRank group construction, conformal calibration ordering, ridge OOF refit, and embargo were all correct. But three structural inflation sources stacked, and each had passed unflagged through prior single-agent reviews:
- Inner-join window censoring. The ensemble runner intersected per-method anchor sets with how="inner". One model (catboost_yetirank) had only six folds, ending 2020-12-01. The intersection silently truncated the entire 2021–2026 evaluation tail, including the 2022 bear, the 2023 regional-bank stress, and the rate-regime change. On the un-censored window, the same ensemble's Sharpe collapses to 1.19.
- Best-of-K configuration selection. The runner enumerated 14 ensemble configurations across 25 cycles (~350 logged tests) and reported the maximum. The standard error of monthly Sharpe over the available window is ~0.58; the expected max of fourteen N(0,1) draws is ~1.7. The selection premium alone explains 0.3–0.5 Sharpe units. Bonferroni-corrected, zero of the nine candidate methods cleared the V1 baseline.
- Baseline drift. The "V1 Sharpe 1.20" comparison was itself the winner of a 3,672-evaluation grid sweep. Raw V1 LightGBM, apples-to-apples, is 1.02 on the full nine-year window.
Net adjustment: the headline 1.76 was an honest 1.20 wearing a mask stitched from three independent structural biases — a composite ~21× inflation in effective standard-error terms once selection premium, window truncation, and baseline drift compose. Had the ensemble shipped to a live sleeve, capital would have been routed to a strategy whose honest out-of-sample expectation was approximately zero, with an audit-reconstructed worst-case 2022 path near Sharpe -0.56 that the truncated window had hidden entirely.
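The "expected max of fourteen N(0,1) draws" figure can be checked numerically. For K independent standard-normal draws, the maximum has density K·φ(x)·Φ(x)^(K−1), so its mean is the integral of x times that density. The sketch below evaluates the integral with a trapezoid rule; it assumes independent draws, which real ensemble configurations are not, and it does not attempt to convert the result into Sharpe units.

```python
from statistics import NormalDist


def expected_max_std_normal(k, lo=-10.0, hi=10.0, steps=20_000):
    """E[max of k iid N(0,1)] by trapezoid integration of x * k * pdf * cdf**(k-1)."""
    nd = NormalDist()
    h = (hi - lo) / steps

    def integrand(x):
        return x * k * nd.pdf(x) * nd.cdf(x) ** (k - 1)

    total = 0.5 * (integrand(lo) + integrand(hi))
    for i in range(1, steps):
        total += integrand(lo + i * h)
    return total * h


e_max_14 = expected_max_std_normal(14)
```

The value for K=14 comes out close to 1.70, in line with the ~1.7 quoted above. Correlation between configurations lowers the effective K and therefore the premium.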
The multi-agent process catches what single-pass review misses because each agent re-derives the result independently and is incentivised to find a defect rather than confirm the headline. Composition of three subtle biases is the failure mode of any one-shot review; adversarial fan-out is the practical defense. The full audit transcript is under docs/research/v2-ml/leakage_audit/.
17 Failed Replications
The project's most useful output is a negative one. Seventeen published equity-anomaly signals were re-implemented against the repository's PIT-clean Compustat / CRSP / EODHD panel, evaluated at retail cost levels (commissions, bid-ask, market impact, borrow fees, dividend tax, and state plus federal income tax on short-duration gains), and compared against a low-cost SPY-and-cash reference. Of the seventeen, zero survived the retail-cost gate. Net-of-friction strategy returns were below SPY-and-cash in every case. The table below summarises; the SSRN paper has full per-signal methodology and numeric results.
| # | Signal family | Source citation | Verdict |
|---|---|---|---|
| 1 | Accruals | Sloan (1996) | Did not survive |
| 2 | Asset growth | Cooper / Gulen / Schill (2008) | Did not survive |
| 3 | Investment-to-assets | Lyandres / Sun / Zhang (2008) | Did not survive |
| 4 | Net external financing | Bradshaw / Richardson / Sloan (2006) | Did not survive |
| 5 | Net share issuance | Pontiff / Woodgate (2008) | Did not survive |
| 6 | Composite equity issuance | Daniel / Titman (2006) | Did not survive |
| 7 | Gross profitability | Novy-Marx (2013) | Did not survive |
| 8 | Quality minus junk | Asness / Frazzini / Pedersen (2019) | Did not survive |
| 9 | Earnings yield (E/P) | Basu (1977) — re-test | Did not survive |
| 10 | Book-to-market | Fama / French (1992) — re-test | Did not survive |
| 11 | 12-1 momentum | Jegadeesh / Titman (1993) | Did not survive |
| 12 | Short-term reversal (1m) | Jegadeesh (1990) | Did not survive |
| 13 | Idiosyncratic volatility | Ang / Hodrick / Xing / Zhang (2006) | Did not survive |
| 14 | Maximum daily return (MAX) | Bali / Cakici / Whitelaw (2011) | Did not survive |
| 15 | Post-earnings drift (SUE) | Bernard / Thomas (1989) | Did not survive |
| 16 | Analyst revision | Chan / Jegadeesh / Lakonishok (1996) | Did not survive |
| 17 | Short interest | Asquith / Pathak / Ritter (2005) | Did not survive |
Gross IC, gross Sharpe, and net-of-retail-cost Sharpe per row populate from data/research/baseline_ic.json at payload-build time. Until the first IC publication run lands, only the verdict column is rendered. The exact list of seventeen tracks the SSRN paper's Table 1; row count and citations may shift by ±1 before publication.
Pre-registered Hypotheses
Four hypotheses were pre-registered at inception (2026-06-01). Each is tied to a specific sleeve, an effect-size threshold, and a power calculation; none can be resolved with n=1 anchor. Status as of today is uniformly INCONCLUSIVE.
| ID | Hypothesis | Sleeve(s) | Status |
|---|---|---|---|
| H1 | The 4-method ensemble outperforms any single-method sleeve net of retail cost over n≥6 anchors. | sleeve_5 vs 1–4 | INCONCLUSIVE (n=1) |
| H2 | The hold-rule (N=50, top-100 entry) reduces turnover ≥40% with ≤0.15 Sharpe penalty. | sleeve_3, sleeve_4 | INCONCLUSIVE (n=1) |
| H3 | Live execution drift between virtual and IBKR fills is ≤15 bps RMSE per rebalance. | sleeve_5 | INCONCLUSIVE (n=1) |
| H4 | None of the five sleeves clears SPY-and-cash net of retail cost across the full pre-registered window. | All five | INCONCLUSIVE (n=1) |
Status ladder: INCONCLUSIVE → PROVISIONAL+/- → CONFIRMED / REJECTED / ABANDONED. The thresholds are set in strategies/paper_trading_v1/spec.yaml: promotion to PROVISIONAL requires three anchors with a consistent sign, and a full CONFIRMED / REJECTED call requires n≥6 anchors. First promotion opportunity is approximately 2026-12-01.
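The anchor-count gates on the ladder can be sketched as a small function. It encodes only what is stated above (three consistent-sign anchors for PROVISIONAL, six anchors before a final call) and assumes that "consistent sign" refers to the three most recent anchors. The effect-size and power tests that decide CONFIRMED versus REJECTED are not modelled, and the function name is hypothetical.

```python
def ladder_status(anchor_signs, provisional_n=3, final_n=6):
    """Map a chronological list of per-anchor signs to a ladder status.

    anchor_signs: +1 if the anchor favours the hypothesis, -1 if against,
    0 if neither. Assumption: consistency is judged on the latest
    `provisional_n` anchors only.
    """
    n = len(anchor_signs)
    status = "INCONCLUSIVE"
    if n >= provisional_n:
        recent = anchor_signs[-provisional_n:]
        if all(s > 0 for s in recent):
            status = "PROVISIONAL+"
        elif all(s < 0 for s in recent):
            status = "PROVISIONAL-"
    # A final call additionally needs the pre-registered effect-size test,
    # which is outside this sketch; only the eligibility gate is reported.
    return {"n": n, "status": status, "final_call_eligible": n >= final_n}


today = ladder_status([+1])  # one anchor so far
```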
What I learned
Three honest reflections from running this stack end-to-end.
- Retail equity costs eat alpha. Not "reduce" — eat. The 17 published anomalies all show meaningful gross IC; none survive a retail-cost net evaluation. The first live anchor on 2026-06-01 then showed ~90 bps of MOC slippage, ~4× the cost-model midpoint, on a small-cap-tilted top-100 book. The combination — gross alpha that is real but small, plus net friction that is larger than the gross alpha — is the structural reason retail-scale single-operator equity strategies do not beat SPY-and-cash on a Sharpe-adjusted basis. The methodologically honest course is to publish that finding, not to keep tuning until the backtest looks better.
- The V2 ensemble does not beat V1 once you remove the inflation sources. The honest Sharpe of the 4-method ensemble on the un-censored, un-selected, baseline-stable window is ~1.20 — within noise of the V1 baseline (~1.02–1.20 depending on exact window). The 1.76 headline was wrong, and the project is calibrated to the 1.20 honest number. Sleeve_5 is the LIVE sleeve because it is the least-worst ensemble after audit, not because it has a real edge over V1.
- The process is the asset. A reproducible, hashed, walk-forward, retail-cost-honest research stack — with adversarial audit transcripts on disk — is more transferable than any individual alpha. The same fence-and-archive discipline that catches a 21× SE inflation in equities also catches one in credit, in options, or in any other PIT-joinable cross-section. The bet of this project is that that is what generalises.
The strategic implication, which is reflected in the project's roadmap, is that trading deployment is paused until either (a) the cost model gets meaningfully better at predicting MOC fills, or (b) the operator has the capital to access the institutional-cost regime where the gross-alpha-minus-friction equation looks different. Until then, the dashboard is a process record, the SSRN paper is the public output, and the next research lane is the credit and options panels where the per-trade friction is structurally lower.
What This Dashboard Is / Is Not
What this is: A live record of five paper-trading sleeves running against a pre-registered, retail-cost-honest configuration. The signal panel is point-in-time. The strategy version is bound to a hashed signal spec. The IBKR sleeve is paper, not real money.
What this is not: An alpha pitch. The 17-test result above implies that the structural alpha available at retail scale and retail cost is approximately zero. The reason the sleeves still exist is not because they are expected to beat SPY — they are not — but because the operational discipline of running them is the research artifact. The point is the process, not the P&L.
If you are evaluating this project as a hiring signal: the relevant artifacts are docs/SIGNAL-FENCE.md and the audit transcripts under docs/research/v2-ml/leakage_audit/. The P&L tile is downstream.
Links
| Code & artifacts | Writing & contact |
|---|---|
| GitHub repository pending | SSRN paper — Q3 2026 |
| Methodology fence doc | Substack pending |
| 20-agent audit transcripts | LinkedIn pending |
| Data catalog | alexhines2017@gmail.com |
External links are populated when the corresponding artifact is posted. The SSRN paper, the GitHub mirror, and the Substack series are placeholders until live; broken links are replaced with omissions, never with "coming soon" ghost rows.