Paper Trading v1

Methodology

Alex Hines — Quantitative Research Engineer

Five paper-trading sleeves have been running live since 2026-06-01. One sleeve trades at $250K paper notional through Interactive Brokers (account DUP***83); four are shadow books for hypothesis comparison. The signal-spec fence has not been touched since inception. This page is the project's methodology surface; the dashboard tiles are the live process record; the SSRN paper is the formal writeup.


Intro

I am a quantitative-research engineer building a single-operator equity research and paper-trading stack. The codebase runs a 457-signal point-in-time research panel, a five-sleeve paper-trading pipeline against IBKR, and a methodology fence that hashes every active signal's metadata into a single SHA-256 lock. The stack's headline outputs are the daily NAV and risk numbers shown elsewhere on this dashboard. The stack's headline finding is that, of seventeen famous published equity-anomaly signals, zero survive retail-cost-honest replication. That negative result, and the multi-agent adversarial process that produced it, is what this page is about.


Why this strategy

The long-only, top-N, monthly-rebalanced US equity sleeve was not chosen because it has the highest expected Sharpe; it does not. It was chosen because its cost-and-friction profile is the one a retail-scale operator actually faces, and because the operational stack (PIT panel, fence, walk-forward eval, IBKR submission, T+1 reconciliation) is the artifact that transfers to an institutional context.

Three design constraints drove the strategy shape:

  1. Honest costs over headline Sharpe. Net-of-cost evaluation uses commissions, half-spread, market-impact (Almgren-Chriss style on each rebalance leg), short borrow where applicable, dividend tax, and short-term-gains income tax. Strategies that look strong gross-of-cost and collapse net-of-cost are the rule in published anomaly research; building the cost model first means the headline only ever moves down.
  2. Single operator, no co-location. All execution is end-of-day MOC, all rebalance cadences are weekly or slower, and no signal depends on intraday microstructure. This is a deliberate choice to keep the strategy reproducible by someone with retail infrastructure.
  3. Process auditability over alpha mystique. Every active signal is registered, hashed, and pinned. Every paper trade is reconstructable from disk. The bet is that hiring managers prefer an honest replication framework over a black-box backtest.

The 4-method ensemble (sleeve_5) is the headline sleeve only because, on the honest evaluation window, it has the highest measured net-of-cost Sharpe, not because the underlying signals are novel.


4-method ensemble architecture

Sleeve_5 — the LIVE $250K paper sleeve — combines four independently trained ranking models. Each method is trained walk-forward against the same 457-signal PIT panel; outputs are per-name conformalised ranks that are then averaged with equal weights. Equal-weight averaging was chosen over learned blending because (a) blending introduces another best-of-K selection surface; (b) on the honest 12-fold window, equal-weight beat optimised-weight after Bonferroni correction.

# | Method | Loss / objective | Why it's in the ensemble
1 | LightGBM quantile trio (q10 / q50 / q90) | Multi-quantile pinball loss | Captures the shape of the return distribution per name, not just the mean. The q90 head is the dominant driver on long-only top-N.
2 | XGBoost monotone | Pairwise rank loss with monotonic constraints on canonical signs (e.g. momentum +, volatility −) | Hard-constrains the model to respect known financial priors, which prevents the booster from learning a spurious sign-flip in a quiet regime.
3 | V1 LightGBM (frozen) | Standard pairwise rank | The pre-V2 baseline, frozen verbatim. Acts as a diversifier and a non-negotiable sanity anchor: if V1 drifts away from V2, that is itself a regime-flag.
4 | CatBoost YetiRank | Listwise rank loss | Trained on a longer-horizon (63-day) panel slice; provides regime-diversification against the three 21-day-horizon methods.

The blended score is then mapped to a long-only top-100 portfolio with equal weights and a 21-day rebalance. Sleeve_2, sleeve_3, and sleeve_4 strip out methods or change the cardinality rule (hold-N versus top-N) so we can attribute the contribution of each design choice. Sleeve_1 is V1-solo and is the apples-to-apples baseline.
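
The blend-then-select step described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `blend_and_select`, the input layout, and the rule that a name must be scored by every method are all assumptions made for the example.

```python
from statistics import mean


def blend_and_select(method_ranks, top_n=100):
    """Equal-weight average of per-name ranks, then an equal-weight top-N book.

    method_ranks: dict of method name -> dict of ticker -> rank score,
    where a higher score means a more attractive name (hypothetical layout).
    """
    # Assumption of this sketch: only names scored by every method are eligible.
    common = set.intersection(*(set(r) for r in method_ranks.values()))
    blended = {t: mean(r[t] for r in method_ranks.values()) for t in common}
    # Highest blended score first; ties broken alphabetically for determinism.
    ordered = sorted(blended, key=lambda t: (-blended[t], t))
    picks = ordered[:top_n]
    weight = 1.0 / len(picks) if picks else 0.0
    return {t: weight for t in picks}
```

Because every method gets the same weight, there is no blending parameter to tune, which is the property the text cites for avoiding another best-of-K selection surface.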


PIT bitemporal data discipline

Every signal in the panel is evaluated as it was known on its anchor date. The data layer is bitemporal — each row carries both a period_end_date (the as-of of the underlying economic fact) and a knowledge_date (the earliest moment that fact was publishable). All joins use knowledge_date <= anchor_date - 1 trading day, which means a signal can never see information that was not yet in the public record on the trading day it was used.
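
As a rough sketch of the join rule above, the filter below keeps only rows whose knowledge_date is on or before the trading day preceding the anchor, then keeps the most recently known version of each fact. The `ticker` key, the helper names, and the weekend-only trading calendar are assumptions for illustration; a real implementation would use an exchange holiday calendar.

```python
from datetime import date, timedelta


def previous_trading_day(d):
    """Step back one weekday. Assumption: weekends are the only non-trading
    days; a real calendar would also skip exchange holidays."""
    d -= timedelta(days=1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d


def visible_rows(rows, anchor_date):
    """Apply knowledge_date <= anchor_date - 1 trading day, then keep the
    latest-known row per (ticker, period_end_date)."""
    cutoff = previous_trading_day(anchor_date)
    latest = {}
    for row in rows:
        if row["knowledge_date"] > cutoff:
            continue  # not yet public at the cutoff: using it would be look-ahead
        key = (row["ticker"], row["period_end_date"])
        if key not in latest or row["knowledge_date"] > latest[key]["knowledge_date"]:
            latest[key] = row
    return list(latest.values())
```

Keeping the latest-known row per fact means a restatement published before the cutoff supersedes the original filing, while a restatement published after it stays invisible to that anchor.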

The bitemporal discipline applies across:

The PIT layer is covered by ~1,800 unit tests, including dedicated leakage tests for each of the 457 specs that assert no future-dated row contributes to any historical anchor. The leak-test template is tests/signals/compute/test_<name>.py; CI fails the build if any compute function regresses to use unsanitised data.


Methodology fence + cryptographic SHA verification

Every signal in the research panel is registered as a Pydantic SignalSpec carrying its name, version, polarity, required input families, lookback windows, and period type. The active registry is hashed by compute_signal_spec_sha256() into a single SHA-256 lock. As of 2026-06-01 the active values are:

The hash binds Strategy C v1 paper-trading to its research artifact: the trade history on this dashboard is the deterministic output of the registry whose SHA matches the one above. Compute implementations may evolve behind the fence — a faster join, a Polars rewrite, a vectorised helper — as long as the numerical contract (polarity, scale, fill semantics) is preserved. Adding, removing, or modifying specs changes the hash and constitutes a fence epoch bump, which would mint a new strategy version. That has not happened.

The fence is continuously verified: every CI run, every nightly orchestrator run, and every dashboard publish recomputes the SHA and aborts if it has drifted from the pinned value in strategies/paper_trading_v1/spec.yaml. There is no "soft" version of the fence — a mismatched hash is a build failure, not a warning.
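
One way such a lock could be computed and enforced is sketched below. It is not the project's compute_signal_spec_sha256(): the `SpecSketch` dataclass stands in for the Pydantic SignalSpec, its field names are illustrative, and the canonical-JSON serialisation is an assumption chosen so the digest does not depend on registration order.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SpecSketch:
    # Hypothetical stand-in for SignalSpec; field names are illustrative only.
    name: str
    version: int
    polarity: int          # +1 or -1
    input_families: tuple
    lookback_days: tuple
    period_type: str


def registry_sha256(specs):
    """Hash the registry in canonical form: sorted specs, sorted keys,
    fixed separators, UTF-8 bytes."""
    payload = [asdict(s) for s in sorted(specs, key=lambda s: (s.name, s.version))]
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def verify_fence(specs, pinned_sha):
    """Raise on drift: a mismatch is a hard failure, not a warning."""
    actual = registry_sha256(specs)
    if actual != pinned_sha:
        raise RuntimeError(f"fence drift: expected {pinned_sha}, got {actual}")
    return actual
```

Only spec metadata enters the hash, so a compute function can be rewritten behind the fence without changing the digest, while any change to polarity, lookbacks, or membership does change it.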


Walk-forward training + 21-day rebalance cadence

All four ensemble methods are trained walk-forward, never on a held-out future block. The training schedule is:

The 21-day cadence was chosen because (a) it is roughly the cost-minimising rebalance frequency given the measured turnover and bid-ask, (b) it keeps the per-anchor trade list small enough to fit cleanly in IBKR MOC submission, and (c) it gives the post-trade reconciliation 21 days to detect drift before the next book change.
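
For readers unfamiliar with walk-forward training, the sketch below generates a generic expanding-window schedule on a 21-day step. It is not the project's actual schedule: the function name, the integer day indexing, the 21-day label horizon, and the 5-day embargo are assumptions picked for illustration.

```python
def walk_forward_folds(n_days, first_anchor, step=21, horizon=21, embargo=5):
    """Return (train_end, anchor) index pairs for an expanding-window schedule.

    Days are integer trading-day indices 0..n_days-1. A training sample at
    day t has a label that uses returns through day t + horizon, so it is
    usable at an anchor only if t + horizon + embargo <= anchor.
    """
    folds = []
    anchor = first_anchor
    while anchor < n_days:
        train_end = anchor - horizon - embargo  # last usable feature day, inclusive
        if train_end >= 0:
            folds.append((train_end, anchor))
        anchor += step
    return folds
```

For example, `walk_forward_folds(100, 40)` gives anchors at days 40, 61, and 82, each trained only on samples whose labels closed at least five days before the anchor.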


Cost model + execution realism

Every backtest, sleeve, and live anchor is evaluated net-of-cost from the same cost model. There is no separate "frictionless" report.

The cost model has six components:

  1. Commission. IBKR-tiered ($0.0035/share, $0.35 min, ~$3.50 max per leg, plus exchange + clearing pass-throughs).
  2. Half-spread. Stock-specific from EODHD intraday quotes, capped at 50 bps for thin names.
  3. Market impact. Almgren-Chriss with a per-name temporary-impact coefficient calibrated against ADV; full liquidation of the rebalance trade is assumed at MOC.
  4. Short borrow (for the symmetric / shadow research; sleeves 1–5 are long-only).
  5. Dividend tax. Federal qualified-dividend rate on long holds, ordinary rate on short holds.
  6. Short-term-gains income tax. Federal + (configurable) state ordinary rate on positions held < 366 days.
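
The first two components can be sketched as follows, using the rate, minimum, and cap quoted in the list above. The function names are hypothetical, the per-leg commission maximum and pass-through fees are deliberately left out, and market impact, borrow, and taxes are not modelled here.

```python
def commission_usd(shares, per_share=0.0035, minimum=0.35):
    """Per-leg commission from the per-share rate and the per-order minimum.
    The per-leg maximum and exchange/clearing pass-throughs are omitted."""
    return max(abs(shares) * per_share, minimum)


def half_spread_bps(bid, ask, cap_bps=50.0):
    """Half the quoted spread relative to the mid, in basis points, capped."""
    mid = (bid + ask) / 2.0
    return min((ask - bid) / 2.0 / mid * 1e4, cap_bps)


def leg_cost_bps(shares, price, bid, ask):
    """Commission plus half-spread for one rebalance leg, in bps of notional."""
    notional = abs(shares) * price
    return commission_usd(shares) / notional * 1e4 + half_spread_bps(bid, ask)
```

On a 1,000-share leg at $100 with a 99.95 / 100.05 quote, the commission is about 0.35 bps and the half-spread about 5 bps, so at this size the spread term dominates the commission term.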

The most important finding from the live run so far is that the honest inception cost was substantially higher than the cost-model midpoint. Day-1 fills on inception (2026-06-01) showed ~90 bps of slippage versus the close, well above the cost-model's expected ~22 bps. The cost surplus is dominantly attributable to fast-pace MOC imbalance in a handful of mid-cap names plus a fill-quality penalty that the half-spread model under-prices in the auction. This is the kind of empirical finding that only shows up when you actually trade — and it is the single biggest reason the project's headline conclusion is "retail-cost-honest equity alpha is approximately zero."

The cost model is documented in docs/COST-MODEL.md and is itself version-pinned (any change to coefficients is a strategy-version bump).


Reliability infrastructure

The paper-trading pipeline runs daily, unattended, against live vendor data. Three pieces of infrastructure exist specifically to fail loudly instead of silently mis-trading:

Plus the standard hygiene: typed code (mypy strict on the methodology fence and signal-compute modules), structured logging, deterministic seeds, and atomic file writes for every promote-to-canonical step.
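
An atomic promote-to-canonical write is commonly built from a temp file plus a rename. The sketch below shows that pattern under stated assumptions: the function name `atomic_write_json` and the JSON payload are illustrative, and the rename is only atomic when the temp file and the destination sit on the same filesystem.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write JSON to a temp file in the target directory, fsync it, then
    rename it over the destination so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the previous canonical file untouched and clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies mid-write, the previous canonical file is still intact, which is the fail-loudly behaviour the pipeline is built around.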


The 21× Standard-Error Bug-Catch

In May 2026 the project's V2 ensemble sweep reported a headline Sharpe of 1.76 (CAGR ~44%) for a "best-of-14" configuration. The number looked too clean. Before promoting the ensemble to a paper-trading sleeve, I designed a 20-agent adversarial audit framework: ten agents tasked with finding technical leakage and ten with finding statistical inflation, each running independently against the same codebase with the same prompt, "try to break this result."

The audit found zero technical leakage. The PIT bitemporal joins, LambdaRank group construction, conformal calibration ordering, ridge OOF refit, and embargo were all correct. But three structural inflation sources stacked, and each had passed unflagged through prior single-agent reviews:

  1. Inner-join window censoring. The ensemble runner intersected per-method anchor sets with how="inner". One model (catboost_yetirank) had only six folds, ending 2020-12-01. The intersection silently truncated the entire 2021–2026 evaluation tail — including the 2022 bear, the 2023 regional-bank stress, and the rate-regime change. On the un-censored window, the same ensemble's Sharpe collapses to 1.19.
  2. Best-of-K configuration selection. The runner enumerated 14 ensemble configurations across 25 cycles (~350 logged tests) and reported the maximum. The standard error of monthly Sharpe over the available window is ~0.58; the expected max of fourteen N(0,1) draws is ~1.7. The selection premium alone explains 0.3–0.5 Sharpe units. Bonferroni-corrected, zero of the nine candidate methods cleared the V1 baseline.
  3. Baseline drift. The "V1 Sharpe 1.20" comparison was itself the winner of a 3,672-evaluation grid sweep. Raw V1 LightGBM, apples-to-apples, is 1.02 on the full nine-year window.
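
The first two inflation sources above can be illustrated with a short sketch. The first function shows how intersecting per-method anchor sets silently shortens the evaluation window to the shortest method; the second estimates the expected maximum of K independent standard-normal draws by simulation. Function names, the example anchors, and the trial count are assumptions for illustration, not the audit's code.

```python
import random
from datetime import date


def common_window(anchor_sets):
    """Intersect per-method anchor dates (the how="inner" behaviour) and
    return the last common date next to each method's own last date."""
    common = set.intersection(*(set(a) for a in anchor_sets.values()))
    last_by_method = {m: max(a) for m, a in anchor_sets.items()}
    return max(common), last_by_method


def expected_max_std_normal(k, n_trials=20000, seed=0):
    """Monte Carlo estimate of E[max of k independent N(0,1) draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(k))
    return total / n_trials


if __name__ == "__main__":
    anchors = {
        "v1_lightgbm": [date(y, 12, 1) for y in range(2018, 2026)],
        "catboost_yetirank": [date(y, 12, 1) for y in range(2018, 2021)],
    }
    last_common, last_each = common_window(anchors)
    print("last common anchor:", last_common)  # bounded by the shortest method
    print("last anchor per method:", last_each)
    print("E[max of 14 N(0,1)] ~", round(expected_max_std_normal(14), 2))
```

Comparing the last common anchor with each method's own last anchor before reporting any statistic is a cheap guard against this kind of truncation.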

Net adjustment: the headline 1.76 was an honest 1.20 wearing a mask stitched from three independent structural biases — a composite ~21× inflation in effective standard-error terms once selection premium, window truncation, and baseline drift compose. Had the ensemble shipped to a live sleeve, capital would have been routed to a strategy whose honest out-of-sample expectation was approximately zero, with an audit-reconstructed worst-case 2022 path near Sharpe -0.56 that the truncated window had hidden entirely.

The multi-agent process catches what single-pass review misses because each agent re-derives the result independently and is incentivised to find a defect rather than confirm the headline. Composition of three subtle biases is the failure mode of any one-shot review; adversarial fan-out is the practical defense. The full audit transcript is under docs/research/v2-ml/leakage_audit/.


17 Failed Replications

The project's most useful output is a negative one. Seventeen published equity-anomaly signals were re-implemented against the repository's PIT-clean Compustat / CRSP / EODHD panel, evaluated at retail cost levels (commissions, bid-ask, market impact, borrow fees, dividend tax, and state plus federal income tax on short-duration gains), and compared against a low-cost SPY-and-cash reference. Of the seventeen, zero survived the retail-cost gate. Net-of-friction strategy returns were below SPY-and-cash in every case. The table below summarises; the SSRN paper has full per-signal methodology and numeric results.

# | Signal family | Source citation | Verdict
1 | Accruals | Sloan (1996) | Did not survive
2 | Asset growth | Cooper / Gulen / Schill (2008) | Did not survive
3 | Investment-to-assets | Lyandres / Sun / Zhang (2008) | Did not survive
4 | Net external financing | Bradshaw / Richardson / Sloan (2006) | Did not survive
5 | Net share issuance | Pontiff / Woodgate (2008) | Did not survive
6 | Composite equity issuance | Daniel / Titman (2006) | Did not survive
7 | Gross profitability | Novy-Marx (2013) | Did not survive
8 | Quality minus junk | Asness / Frazzini / Pedersen (2019) | Did not survive
9 | Earnings yield (E/P) | Basu (1977), re-test | Did not survive
10 | Book-to-market | Fama / French (1992), re-test | Did not survive
11 | 12-1 momentum | Jegadeesh / Titman (1993) | Did not survive
12 | Short-term reversal (1m) | Jegadeesh (1990) | Did not survive
13 | Idiosyncratic volatility | Ang / Hodrick / Xing / Zhang (2006) | Did not survive
14 | Maximum daily return (MAX) | Bali / Cakici / Whitelaw (2011) | Did not survive
15 | Post-earnings drift (SUE) | Bernard / Thomas (1989) | Did not survive
16 | Analyst revision | Chan / Jegadeesh / Lakonishok (1996) | Did not survive
17 | Short interest | Asquith / Pathak / Ritter (2005) | Did not survive

Gross IC, gross Sharpe, and net-of-retail-cost Sharpe per row populate from data/research/baseline_ic.json at payload-build time. Until the first IC publication run lands, only the verdict column is rendered. The exact list of seventeen tracks the SSRN paper's Table 1; row count and citations may shift by ±1 before publication.


Pre-registered Hypotheses

Four hypotheses were pre-registered at inception (2026-06-01). Each is tied to a specific sleeve, an effect-size threshold, and a power calculation; none can be decided from a single anchor (n=1). Status as of today is uniformly INCONCLUSIVE.

ID | Hypothesis | Sleeve(s) | Status
H1 | The 4-method ensemble outperforms any single-method sleeve net of retail cost over n≥6 anchors. | sleeve_5 vs 1–4 | INCONCLUSIVE (n=1)
H2 | The hold-rule (N=50, top-100 entry) reduces turnover ≥40% with ≤0.15 Sharpe penalty. | sleeve_3, sleeve_4 | INCONCLUSIVE (n=1)
H3 | Live execution drift between virtual and IBKR fills is ≤15 bps RMSE per rebalance. | sleeve_5 | INCONCLUSIVE (n=1)
H4 | None of the five sleeves clears SPY-and-cash net of retail cost across the full pre-registered window. | All five | INCONCLUSIVE (n=1)

Status ladder: INCONCLUSIVE → PROVISIONAL+/- → CONFIRMED / REJECTED / ABANDONED. The ladder is gated on n≥6 anchors per strategies/paper_trading_v1/spec.yaml; promotion to PROVISIONAL requires three anchors with a consistent sign, and full CONFIRMED / REJECTED calls require six. First promotion opportunity is approximately 2026-12-01.
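
The lower rungs of the ladder can be expressed as a small rule. The sketch below covers only the INCONCLUSIVE and PROVISIONAL rungs; the function name, the encoding of each anchor's effect as +1 or -1, and the reading of "consistent sign" as "every observed anchor agrees" are assumptions. CONFIRMED / REJECTED calls also depend on the pre-registered effect-size thresholds, which are not modelled here.

```python
def provisional_status(effect_signs, provisional_n=3):
    """Map signed per-anchor effects to INCONCLUSIVE or PROVISIONAL+/-.

    effect_signs: list of +1 / -1, one per completed anchor, oldest first.
    """
    n = len(effect_signs)
    consistent = n > 0 and len(set(effect_signs)) == 1
    if n >= provisional_n and consistent:
        return "PROVISIONAL+" if effect_signs[0] > 0 else "PROVISIONAL-"
    return "INCONCLUSIVE"
```

With a single anchor the function always returns INCONCLUSIVE, which matches the status column in the table above.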


What I learned

Three honest reflections from running this stack end-to-end.

  1. Retail equity costs eat alpha. Not "reduce" — eat. The 17 published anomalies all show meaningful gross IC; none survive a retail-cost net evaluation. The first live anchor on 2026-06-01 then showed ~90 bps of MOC slippage, ~4× the cost-model midpoint, on a small-cap-tilted top-100 book. The combination — gross alpha that is real but small, plus net friction that is larger than the gross alpha — is the structural reason retail-scale single-operator equity strategies do not beat SPY-and-cash on a Sharpe-adjusted basis. The methodologically honest course is to publish that finding, not to keep tuning until the backtest looks better.
  2. The V2 ensemble does not beat V1 once you remove the inflation sources. The honest Sharpe of the 4-method ensemble on the un-censored, un-selected, baseline-stable window is ~1.20 — within noise of the V1 baseline (~1.02–1.20 depending on exact window). The 1.76 headline was wrong, and the project is calibrated to the 1.20 honest number. Sleeve_5 is the LIVE sleeve because it is the least-worst ensemble after audit, not because it has a real edge over V1.
  3. The process is the asset. A reproducible, hashed, walk-forward, retail-cost-honest research stack — with adversarial audit transcripts on disk — is more transferable than any individual alpha. The same fence-and-archive discipline that catches a 21× SE inflation in equities also catches one in credit, in options, or in any other PIT-joinable cross-section. The bet of this project is that that is what generalises.

The strategic implication, which is reflected in the project's roadmap, is that trading deployment is paused until either (a) the cost model gets meaningfully better at predicting MOC fills, or (b) the operator has the capital to access the institutional-cost regime where the gross-alpha-minus-friction equation looks different. Until then, the dashboard is a process record, the SSRN paper is the public output, and the next research lane is the credit and options panels where the per-trade friction is structurally lower.


What This Dashboard Is / Is Not

What this is: A live record of five paper-trading sleeves running against a pre-registered, retail-cost-honest configuration. The signal panel is point-in-time. The strategy version is bound to a hashed signal spec. The IBKR sleeve is paper, not real money.

What this is not: An alpha pitch. The 17-test result above implies that the structural alpha available at retail scale and retail cost is approximately zero. The reason the sleeves still exist is not because they are expected to beat SPY — they are not — but because the operational discipline of running them is the research artifact. The point is the process, not the P&L.

If you are evaluating this project as a hiring signal: the relevant artifacts are docs/SIGNAL-FENCE.md and the audit transcripts under docs/research/v2-ml/leakage_audit/. The P&L tile is downstream.


Code & artifacts | Writing & contact
GitHub repository pending | SSRN paper, Q3 2026
Methodology fence doc | Substack pending
20-agent audit transcripts | LinkedIn pending
Data catalog | alexhines2017@gmail.com

External links are populated when the corresponding artifact is posted. The SSRN paper, the GitHub mirror, and the Substack series are placeholders until live; broken links are replaced with omissions, never with "coming soon" ghost rows.