<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: akash</title>
    <description>The latest articles on DEV Community by akash (@laakash).</description>
    <link>https://dev.to/laakash</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F530666%2F8e7dcd40-0c1c-4923-8ca9-5979c9fc70d0.jpg</url>
      <title>DEV Community: akash</title>
      <link>https://dev.to/laakash</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/laakash"/>
    <language>en</language>
    <item>
      <title>Why AI Detectors Produce False Positives: A Technical Analysis</title>
      <dc:creator>akash</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:26:35 +0000</pubDate>
      <link>https://dev.to/laakash/why-ai-detectors-produce-false-positives-a-technical-analysis-2gpc</link>
      <guid>https://dev.to/laakash/why-ai-detectors-produce-false-positives-a-technical-analysis-2gpc</guid>
      <description>&lt;p&gt;An AI detector claims 95% accuracy. A student's essay gets flagged as "98% likely AI-generated." Open-and-shut case, right?&lt;/p&gt;

&lt;p&gt;Not even close. The math tells a very different story. This article breaks down exactly why AI detector confidence scores are misleading, using probability theory that any developer can follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Base Rate Fallacy
&lt;/h2&gt;

&lt;p&gt;The base rate fallacy is the single most important concept for understanding AI detection errors. It is the reason a "95% accurate" detector can still be wrong a third of the time.&lt;/p&gt;

&lt;p&gt;Here is the setup. A university uses an AI detector with these published metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True positive rate (sensitivity):&lt;/strong&gt; 95%. If text is AI-generated, the detector correctly flags it 95% of the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False positive rate:&lt;/strong&gt; 5%. If text is human-written, the detector incorrectly flags it 5% of the time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds great. Now apply it to a real population.&lt;/p&gt;

&lt;p&gt;In a class of 200 students, suppose 20 actually used AI (10% base rate). What happens when every essay goes through the detector?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Base rate calculation
&lt;/span&gt;&lt;span class="n"&gt;total_students&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="n"&gt;ai_users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;        &lt;span class="c1"&gt;# 10% base rate
&lt;/span&gt;&lt;span class="n"&gt;human_writers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;

&lt;span class="c1"&gt;# Detector results
&lt;/span&gt;&lt;span class="n"&gt;true_positives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_users&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;           &lt;span class="c1"&gt;# 19 correctly flagged
&lt;/span&gt;&lt;span class="n"&gt;false_positives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;human_writers&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;     &lt;span class="c1"&gt;# 9 incorrectly flagged
&lt;/span&gt;&lt;span class="n"&gt;total_flagged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true_positives&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;false_positives&lt;/span&gt;  &lt;span class="c1"&gt;# 28 total flags
&lt;/span&gt;
&lt;span class="c1"&gt;# The critical number
&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true_positives&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_flagged&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P(actually AI | flagged) = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: P(actually AI | flagged) = 67.9%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Out of 28 flagged essays, 9 are false positives.&lt;/strong&gt; Nearly one in three flagged students wrote their essay themselves. The detector's "95% accuracy" translates to a 32% error rate on flagged results in this scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bayes' Theorem: The Formal Version
&lt;/h2&gt;

&lt;p&gt;What we just computed informally is Bayes' theorem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(AI | flagged) = P(flagged | AI) * P(AI) / P(flagged)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;P(flagged | AI) = 0.95&lt;/code&gt; (true positive rate)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;P(AI) = 0.10&lt;/code&gt; (base rate of AI usage)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;P(flagged) = P(flagged | AI) * P(AI) + P(flagged | human) * P(human)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;P(flagged) = 0.95 * 0.10 + 0.05 * 0.90 = 0.14&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(AI | flagged) = 0.95 * 0.10 / 0.14 = 0.679
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The posterior probability (67.9%) is drastically lower than the detector's confidence output. This is not a flaw in the math; this is the math working correctly. The detector simply does not report this number.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Base Rate Changes Everything
&lt;/h2&gt;

&lt;p&gt;The same detector produces wildly different reliability depending on the population:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;precision_at_base_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensitivity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate precision given base rate of AI text.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sensitivity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;
    &lt;span class="n"&gt;fp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;base_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;br&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;precision_at_base_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;br&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Base rate &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;br&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: P(AI | flagged) = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Base rate (% actually using AI)&lt;/th&gt;
&lt;th&gt;P(actually AI given flagged)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;16.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;67.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;82.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;95.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At a 1% base rate, &lt;strong&gt;84% of flags are false positives.&lt;/strong&gt; The detector is wrong five out of six times it fires. At a 5% base rate, it is a coin flip.&lt;/p&gt;

&lt;p&gt;The detector only matches its advertised accuracy when the base rate is 50%, meaning half the population used AI. In most real-world contexts (professional writing, journalism, established authors), the base rate is far lower.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Confidence Scores Mislead
&lt;/h2&gt;

&lt;p&gt;When a detector reports "98% confidence this is AI-generated," it is reporting the model's internal softmax output, not the posterior probability accounting for the base rate. These are fundamentally different numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What the detector reports:
&lt;/span&gt;&lt;span class="n"&gt;model_confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.98&lt;/span&gt;  &lt;span class="c1"&gt;# softmax output
&lt;/span&gt;
&lt;span class="c1"&gt;# What you actually want to know:
# P(AI | text, base_rate) -- requires Bayesian adjustment
&lt;/span&gt;
&lt;span class="c1"&gt;# Rough calibration: even with 0.98 model confidence,
# if the base rate in your context is 5%:
&lt;/span&gt;&lt;span class="n"&gt;adjusted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Adjusted probability: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;adjusted&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: Adjusted probability: 72.1%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A "98% confidence" flag, after base rate adjustment, might mean 72% actual likelihood. That is a meaningful difference when someone's grade, job, or reputation is on the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Overlapping Distributions Problem
&lt;/h2&gt;

&lt;p&gt;Beyond the base rate issue, there is a fundamental signal problem. The statistical features detectors measure (perplexity, burstiness, vocabulary distribution) are not cleanly separated between human and AI text.&lt;/p&gt;

&lt;p&gt;Visualize two bell curves on a number line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Human text              AI text
          distribution            distribution

    |         *****                 *****
    |       **     **             **     **
    |     **         **         **         **
    |   **             ** *** **             **
    | **                 *****                 **
    +---|--------|--------|----|--------|--------|---&amp;gt;
       High    Medium     ↑   Low    Very Low
       perplexity         |   perplexity
                     OVERLAP ZONE
                  (unreliable region)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any threshold you draw through the overlap zone creates two types of errors simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;False positives&lt;/strong&gt;: Human text to the right of the threshold (lower perplexity than typical humans)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False negatives&lt;/strong&gt;: AI text to the left of the threshold (higher perplexity than typical AI)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can tune the threshold to reduce one error type, but only at the cost of increasing the other. This is the ROC curve tradeoff. No threshold eliminates both errors.&lt;/p&gt;
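&lt;p&gt;The tradeoff is easy to see in a toy model: treat human and AI per-document perplexity as two overlapping normal distributions and sweep the decision threshold. The distribution parameters below are illustrative assumptions, not measurements from any real detector:&lt;/p&gt;

```python
# Toy ROC tradeoff: human and AI per-document perplexity modeled as
# overlapping normal distributions. Parameters are illustrative only.
from statistics import NormalDist

human = NormalDist(mu=55, sigma=15)  # assumed human perplexity profile
ai = NormalDist(mu=12, sigma=4)      # assumed AI perplexity profile

# Flag a document as AI when its perplexity falls below the threshold.
for threshold in (15, 20, 25, 30, 35):
    fpr = human.cdf(threshold)       # humans landing below the cut
    fnr = 1 - ai.cdf(threshold)      # AI text landing above the cut
    print(f"threshold {threshold}: FPR {fpr:.1%}, FNR {fnr:.1%}")
```

&lt;p&gt;Raising the threshold catches more AI text (FNR falls) but flags more humans (FPR rises); no setting drives both error rates to zero.&lt;/p&gt;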

&lt;h2&gt;
  
  
  Who Gets Caught in the Overlap Zone?
&lt;/h2&gt;

&lt;p&gt;The overlap zone is not random. Specific groups of human writers consistently fall into it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-native English speakers.&lt;/strong&gt; Simpler vocabulary, more regular grammar, fewer idiomatic expressions. A 2023 Stanford study found that detectors misclassified non-native English writing as AI over 60% of the time, while achieving near-zero false positives on native English text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal and academic writers.&lt;/strong&gt; Hedging language ("it could be argued that"), structured argumentation, and domain-specific terminology all reduce perplexity. The writing conventions that make academic text rigorous are the same patterns detectors associate with AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical writers.&lt;/strong&gt; Programming tutorials, API documentation, medical summaries. When explaining well-documented concepts, human writers naturally converge on standard phrasing. The text reads as "predictable" not because AI wrote it, but because there are limited natural ways to explain how a hash map works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writers on well-covered topics.&lt;/strong&gt; The more a topic has been written about, the more constrained the natural phrasing becomes. An article about "how to center a div in CSS" will read similarly whether a human or AI wrote it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Implications for Developers
&lt;/h2&gt;

&lt;p&gt;If you are building systems that consume AI detection output, here is what the math demands:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Never treat scores as binary.&lt;/strong&gt; A detection score is a probability estimate with wide confidence intervals. Threshold-based decisions ("flagged if &amp;gt; 70%") create brittle systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Account for base rates.&lt;/strong&gt; If your application context has a low base rate of AI text (e.g., screening submissions from established authors), most flags will be false positives regardless of detector accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Require corroborating evidence.&lt;/strong&gt; A detector score should be one input among many, not a verdict. Combine with metadata (writing history, edit patterns, timing data) for more reliable decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Communicate uncertainty.&lt;/strong&gt; If you surface detection results to users, show ranges and caveats, not confident-sounding percentages. "This text has statistical properties common in AI-generated text" is more honest than "98% AI."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Test on your population.&lt;/strong&gt; Published accuracy numbers are measured on benchmark datasets. Your actual population (domain, language proficiency, writing style distribution) will produce different error rates. Measure them.&lt;/p&gt;
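&lt;p&gt;Points 2 and 4 can be combined in a small helper: adjust a flag for the local base rate, then surface a hedged label instead of a raw percentage. The label names and band boundaries here are made-up illustrations, not calibrated values:&lt;/p&gt;

```python
# Sketch: base-rate-adjusted posterior plus a hedged label.
# TPR/FPR inputs, band boundaries, and labels are illustrative.
import bisect

def posterior_given_flag(tpr, fpr, base_rate):
    """P(actually AI | flagged), via Bayes' theorem."""
    p_flagged = tpr * base_rate + fpr * (1 - base_rate)
    return tpr * base_rate / p_flagged

LABELS = ["probably human", "inconclusive", "leaning AI", "likely AI"]
BANDS = [0.4, 0.6, 0.85]  # posterior boundaries between the labels

def hedged_label(tpr, fpr, base_rate):
    p = posterior_given_flag(tpr, fpr, base_rate)
    return LABELS[bisect.bisect(BANDS, p)], p

label, p = hedged_label(tpr=0.95, fpr=0.05, base_rate=0.05)
print(f"{label} (posterior {p:.1%})")  # inconclusive (posterior 50.0%)
```

&lt;p&gt;At a 5% base rate, even a detector with the advertised 95%/5% metrics yields an "inconclusive" flag, which is exactly what a user should be told.&lt;/p&gt;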

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Want to see how different detectors score the same text? &lt;a href="https://metric37.com/detect" rel="noopener noreferrer"&gt;Metric37's free AI detector&lt;/a&gt; lets you paste any text and get a breakdown of the detection signals. For batch analysis or integration into your own tools, the &lt;a href="https://metric37.com/api" rel="noopener noreferrer"&gt;Metric37 API&lt;/a&gt; provides programmatic access to detection and humanization scoring.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a technical series on AI detection. Part 1 covers &lt;a href="https://dev.to/metric37/how-ai-text-detection-works-under-the-hood-perplexity-burstiness-and-classifiers"&gt;how detection works under the hood&lt;/a&gt;, including perplexity math and classifier architectures.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>statistics</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How AI Text Detection Works Under the Hood: Perplexity, Burstiness, and Classifiers</title>
      <dc:creator>akash</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:26:03 +0000</pubDate>
      <link>https://dev.to/laakash/how-ai-text-detection-works-under-the-hood-perplexity-burstiness-and-classifiers-2o6m</link>
      <guid>https://dev.to/laakash/how-ai-text-detection-works-under-the-hood-perplexity-burstiness-and-classifiers-2o6m</guid>
      <description>&lt;p&gt;AI text detectors are not magic. They are statistical models measuring how predictable your text is. If you have ever wondered what GPTZero, Originality.ai, or Turnitin are actually computing when they flag text as "AI-generated," this post breaks down the math and the models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Intuition
&lt;/h2&gt;

&lt;p&gt;Language models generate text by repeatedly predicting the next token. At each step, the model assigns a probability distribution over its entire vocabulary, then samples from it. The result is text where nearly every word is a high-probability choice given the preceding context.&lt;/p&gt;

&lt;p&gt;Human writers do not work this way. We make unexpected word choices, write sentence fragments, insert tangents, and vary our rhythm. Our text is statistically messier.&lt;/p&gt;

&lt;p&gt;AI detectors exploit this difference using two primary signals: &lt;strong&gt;perplexity&lt;/strong&gt; and &lt;strong&gt;burstiness&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perplexity: Measuring Surprise
&lt;/h2&gt;

&lt;p&gt;Perplexity quantifies how "surprised" a language model is by a sequence of tokens. Formally, for a sequence of N tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_log_probs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    token_log_probs: list of log P(token_i | token_1..token_i-1)
    from a reference language model (e.g., GPT-2, RoBERTa)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_log_probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;avg_neg_log_prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_log_probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_neg_log_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A low perplexity score means the model easily predicted every token. A high score means the text contained surprises.&lt;/p&gt;
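&lt;p&gt;A quick sanity check builds intuition: perplexity behaves like an effective branching factor, the number of equally likely options the model is "hesitating" between at each step. The function from above is redefined here so the snippet runs standalone:&lt;/p&gt;

```python
# Sanity check: perplexity as an effective branching factor.
import math

def perplexity(token_log_probs):
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Every token assigned probability 0.5: the model is effectively
# choosing between 2 equally likely options at each step.
print(f"{perplexity([math.log(0.5)] * 10):.3f}")  # 2.000

# Probability 0.1 per token: effectively 10-way uncertainty.
print(f"{perplexity([math.log(0.1)] * 10):.1f}")  # 10.0
```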

&lt;p&gt;In practice, you run the suspect text through a reference model (often GPT-2 or a similar openly available LM), compute the log-probability of each token conditioned on its prefix, and aggregate. Typical ranges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Text type&lt;/th&gt;
&lt;th&gt;Perplexity range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw GPT-4 output&lt;/td&gt;
&lt;td&gt;5-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human blog post&lt;/td&gt;
&lt;td&gt;30-80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative fiction&lt;/td&gt;
&lt;td&gt;60-150+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-native English&lt;/td&gt;
&lt;td&gt;15-40&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overlap between "non-native English" and "AI output" is immediately visible, and it foreshadows the false positive problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Burstiness: Measuring Rhythm Variation
&lt;/h2&gt;

&lt;p&gt;Perplexity alone is not enough. Burstiness measures how much the perplexity varies across a text. Think of it as the standard deviation of per-sentence perplexity scores, normalized by their mean (a coefficient of variation).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;burstiness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    sentence_perplexities: list of perplexity scores,
    one per sentence in the document
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;mean_ppl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;std_ppl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stdev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Normalized burstiness coefficient
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;std_ppl&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;mean_ppl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Human writing is bursty. You might write a straightforward factual sentence (low perplexity), followed by a creative metaphor (high perplexity), followed by a one-word interjection (wildcard). The per-sentence perplexity jumps around.&lt;/p&gt;

&lt;p&gt;AI text has low burstiness. The model maintains a consistent "temperature" of word choice throughout. Every sentence sits in roughly the same predictability band.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two-dimensional classification&lt;/strong&gt;: Low perplexity + low burstiness = strong AI signal. High perplexity + high burstiness = strong human signal. Mixed signals land in the gray zone where detectors are unreliable.&lt;/p&gt;
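&lt;p&gt;The two-signal grid can be sketched as a lookup table. The cut points below are invented for illustration; real detectors learn these boundaries from training data rather than hard-coding them:&lt;/p&gt;

```python
# Toy two-signal grid: bucket a document by perplexity and burstiness.
# Cut points are illustrative assumptions, not values any real detector uses.
import bisect

PPL_CUTS = [25.0]    # at or below: "low" perplexity; above: "high"
BURST_CUTS = [0.3]   # at or below: "flat" rhythm;   above: "bursty"
GRID = {
    (0, 0): "strong AI signal",     # low perplexity, low burstiness
    (1, 1): "strong human signal",  # high perplexity, high burstiness
    (0, 1): "gray zone",
    (1, 0): "gray zone",
}

def signal(ppl, burst):
    cell = (bisect.bisect(PPL_CUTS, ppl), bisect.bisect(BURST_CUTS, burst))
    return GRID[cell]

print(signal(10.0, 0.1))  # strong AI signal
print(signal(70.0, 0.8))  # strong human signal
print(signal(12.0, 0.9))  # gray zone
```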

&lt;h2&gt;
  
  
  Classifier Models: Learning the Difference
&lt;/h2&gt;

&lt;p&gt;Statistical thresholds on perplexity and burstiness only get you so far. Modern commercial detectors (GPTZero, Originality.ai, Turnitin, Copyleaks) use trained classifier models, typically fine-tuned transformers.&lt;/p&gt;

&lt;p&gt;The architecture usually looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Base model&lt;/strong&gt;: A pre-trained transformer, commonly RoBERTa-base (125M params) or DeBERTa-v3 (300M+ params). These models already encode deep understanding of language patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Classification head&lt;/strong&gt;: A linear layer (or small MLP) on top of the &lt;code&gt;[CLS]&lt;/code&gt; token representation that outputs a probability: &lt;code&gt;P(AI-generated | text)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training data&lt;/strong&gt;: Millions of paired samples. Human text from diverse sources (academic papers, Reddit posts, news articles, fiction). AI text generated by GPT-3.5, GPT-4, Claude, Llama, Gemini, and others across varied prompts and temperatures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;: Standard cross-entropy loss. The model learns subtle distributional features beyond perplexity and burstiness, including things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ratio of content words to function words&lt;/li&gt;
&lt;li&gt;Distribution of rare vs. common vocabulary&lt;/li&gt;
&lt;li&gt;Paragraph-level structural patterns&lt;/li&gt;
&lt;li&gt;Positional patterns (AI intros and conclusions follow recognizable templates)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified classifier architecture (PyTorch-style pseudo-code)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AIDetector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RoBERTa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;roberta-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# hidden_size -&amp;gt; binary
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Tokenize and encode
&lt;/span&gt;        &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;cls_repr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# [CLS] token
&lt;/span&gt;        &lt;span class="n"&gt;logit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls_repr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# P(AI-generated)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage of classifiers over raw perplexity scoring: they capture patterns that are hard to express as a single metric. The disadvantage: they inherit every bias in their training data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistical Watermarking
&lt;/h2&gt;

&lt;p&gt;Some AI providers embed invisible statistical watermarks during generation. The approach works by partitioning the vocabulary into "green" and "red" lists at each token position (using a hash of the preceding token as a seed), then biasing generation toward green-list tokens.&lt;/p&gt;

&lt;p&gt;A detector checks whether the proportion of green-list tokens is statistically improbable under random chance. If so, the text was likely generated by that specific model.&lt;/p&gt;

&lt;p&gt;Watermarking is the most reliable detection method when present, but it only works for models that implement it, breaks under paraphrasing or editing, and requires provider cooperation.&lt;/p&gt;
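&lt;p&gt;The statistical check itself is a one-sided z-test. The sketch below follows the green-list scheme described above; the green-list fraction (gamma) and token counts are made-up example values:&lt;/p&gt;

```python
# Sketch of green-list watermark detection. Under the null hypothesis
# (unwatermarked text), each token lands on the green list with
# probability gamma, so the green count is approximately binomial.
import math

def watermark_z_score(green_count, total_tokens, gamma=0.5):
    """z-score of the observed green-token count vs. random chance."""
    expected = gamma * total_tokens
    variance = total_tokens * gamma * (1 - gamma)
    return (green_count - expected) / math.sqrt(variance)

# Example: 135 green tokens out of 200 with a half-green vocabulary.
z = watermark_z_score(135, 200, gamma=0.5)
print(f"z = {z:.2f}")  # z = 4.95; a z-score this high is strong evidence
```

&lt;p&gt;A z-score around 2 could easily arise by chance; a score near 5 corresponds to a false positive rate well below one in a million, which is why watermark detection, when available, is far more trustworthy than perplexity-based scoring.&lt;/p&gt;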

&lt;h2&gt;
  
  
  Where Detection Breaks Down
&lt;/h2&gt;

&lt;p&gt;Every detection method has systematic failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short text&lt;/strong&gt; (under 250 words): Not enough tokens to establish reliable statistical patterns. Detectors on short text are essentially guessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edited AI text&lt;/strong&gt;: Even moderate human editing disrupts the statistical fingerprint. Change 15-20% of the words and most detectors lose confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific writing&lt;/strong&gt;: Technical documentation, legal writing, and medical text naturally use predictable vocabulary and structure. Detectors conflate "domain-constrained" with "AI-generated."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-native English&lt;/strong&gt;: Simpler vocabulary and more regular grammar produce lower perplexity, overlapping with AI output distributions. Studies have found false positive rates above 60% for non-native writing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature and sampling&lt;/strong&gt;: AI text generated with high temperature or nucleus sampling can have perplexity profiles that look human.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Confidence Score Trap
&lt;/h2&gt;

&lt;p&gt;When a detector reports "94% likely AI-generated," most people read that as "94% chance this is AI." That is not what it means. The score is the model's internal confidence, not the posterior probability of AI authorship given the base rate of AI text in the population being tested.&lt;/p&gt;

&lt;p&gt;This matters enormously. We will cover the math behind this (Bayes' theorem and the base rate fallacy) in the next article in this series.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Takeaways for Developers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Do not trust a single score.&lt;/strong&gt; Cross-reference multiple detectors. If they disagree, the text is in the gray zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand the input constraints.&lt;/strong&gt; Anything under 250 words is unreliable. Longer is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know what is being measured.&lt;/strong&gt; Perplexity and burstiness are proxies, not ground truth. They measure statistical properties that correlate with AI authorship but do not define it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build defensively.&lt;/strong&gt; If you are building tools that incorporate AI detection, expose confidence intervals, not point estimates. Communicate uncertainty honestly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want to test your own text, &lt;a href="https://metric37.com/detect" rel="noopener noreferrer"&gt;Metric37's free AI detector&lt;/a&gt; scores any text and breaks down the result. For programmatic access, the &lt;a href="https://metric37.com/api" rel="noopener noreferrer"&gt;Metric37 API&lt;/a&gt; provides detection scores alongside humanization in a single endpoint.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 1 of a technical series on AI detection. Part 2 covers why false positives happen, with the probability math to prove it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>security</category>
    </item>
  </channel>
</rss>
