AI Text Detector False Positive Checker — GPTZero, Originality, Turnitin, Winston

What burstiness, perplexity, and repetition really mean — and why we refuse to ship a humanizer.

1. What "false positive" means here

A false positive in AI-text detection is the moment a tool tells your professor, editor or employer that a piece you wrote yourself was generated by ChatGPT, Claude or Gemini. The published industry numbers are unkind: GPTZero reports its own false-positive rate at about 0.24%, independent benchmarks of Originality.ai land closer to 15% on edited human prose, and Turnitin has been the subject of multiple r/college threads describing students forced to re-defend essays they wrote at three in the morning over coffee. Even at 1% the volume is huge: a 200-student class produces, on average, two false accusations per assignment.
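The class-size arithmetic can be checked directly. The sketch below assumes every essay is flagged independently at one fixed false-positive rate, which is a simplification, and the function names are ours rather than part of any detector.

```python
def expected_false_positives(num_essays: int, fp_rate: float) -> float:
    """Expected number of human-written essays that get wrongly flagged."""
    return num_essays * fp_rate


def prob_at_least_one(num_essays: int, fp_rate: float) -> float:
    """Chance that at least one essay is wrongly flagged, assuming independence."""
    return 1.0 - (1.0 - fp_rate) ** num_essays


if __name__ == "__main__":
    # Rates quoted in the text: GPTZero's own figure, the 1% example, Originality benchmarks.
    for label, rate in [("0.24%", 0.0024), ("1%", 0.01), ("15%", 0.15)]:
        exp = expected_false_positives(200, rate)
        p = prob_at_least_one(200, rate)
        print(f"{label}: expected {exp:.2f} flags, P(at least one) = {p:.2f}")
```

The second function is the more useful one in an appeal: it answers "how likely is it that somebody in this class was wrongly flagged?" rather than "how many on average?".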

The problem is structural. Every consumer detector treats AI-vs-human as a binary classification problem and learns it from examples: millions of GPT-4 paragraphs versus a smaller corpus of human writing. Anything in your draft that looks like the average GPT-4 paragraph (smooth pacing, vocabulary in the middle of the frequency distribution, paragraphs of similar length) tilts the classifier toward "AI". This is exactly the prose that high-school and university curricula reward. The students who studied the hardest are statistically the most exposed.

2. The four detectors this tool models

We do not run any detector model in your browser. Instead, we approximate the published behaviour of each system using three universally-available signals: sentence-length variance (burstiness), word-frequency rarity (a perplexity proxy), and bigram repetition. Each detector weights those signals differently, and our scoring mimics that weighting.

GPTZero publicly leans on burstiness and log-perplexity. Its founder, Edward Tian, has called those two metrics "the heart of the model" in multiple interviews. Our GPTZero estimator therefore puts ~45% of its weight on uniform sentence pacing and ~45% on a perplexity proxy built from bigram entropy plus word-rarity. The remainder picks up type-token ratio, which captures over-narrow vocabulary.
Originality.ai is the most aggressive of the four. Independent tests at Fritz.ai, EyeSift and the University of Maryland have repeatedly found it flagging 12–18% of human-written control essays. Originality appears to use a GPT-2-style perplexity model plus a separate "low-rarity" penalty. We mirror that with a 40% weight on low-rarity vocabulary, 30% on uniformity, plus a small baseline offset because Originality is known to skew higher than its peers across every input.
Turnitin AI is what universities actually run, and it differs from the consumer tools in that it uses paragraph-level sliding windows. That window picks up repeated phrasings even after a student has paraphrased line by line — a known weakness of paraphrase-only "humanizers". Our Turnitin estimator therefore weights repetition signals at 60% and uniformity at 30%.
Winston AI ships an ensemble classifier that, per its own marketing pages, weights syntactic regularity strongly. In practice it flags well-edited essays, whose authors are exactly the writers who least deserve it. Our Winston estimator gives 45% to sentence uniformity and 30% to low-rarity vocabulary.
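The per-detector weighting described above can be sketched as a weighted sum. This is only an illustration under stated assumptions: the signal names are ours, every signal is taken to be a 0..1 value where higher means "more AI-like", the GPTZero remainder is read as roughly 10%, and the offset value is hypothetical because the text only calls it "small". Where the stated weights do not add up to 100%, the remainder is not specified and is left out.

```python
# Weights as stated in the text. Signal names are illustrative, not the site's own.
DETECTOR_WEIGHTS = {
    "gptzero": {"uniformity": 0.45, "low_perplexity": 0.45, "narrow_vocab": 0.10},
    "originality": {"low_rarity": 0.40, "uniformity": 0.30},
    "turnitin": {"repetition": 0.60, "uniformity": 0.30},
    "winston": {"uniformity": 0.45, "low_rarity": 0.30},
}

# Hypothetical value: the text gives no number for Originality's baseline offset.
BASELINE_OFFSET = {"originality": 0.05}


def detector_score(detector: str, signals: dict[str, float]) -> float:
    """Weighted sum of 0..1 signals plus any baseline offset, clamped to 0..1."""
    weights = DETECTOR_WEIGHTS[detector]
    raw = sum(w * signals.get(name, 0.0) for name, w in weights.items())
    raw += BASELINE_OFFSET.get(detector, 0.0)
    return max(0.0, min(1.0, raw))


if __name__ == "__main__":
    sample = {"uniformity": 0.8, "low_perplexity": 0.6, "low_rarity": 0.7,
              "repetition": 0.2, "narrow_vocab": 0.5}
    for name in DETECTOR_WEIGHTS:
        print(name, round(detector_score(name, sample), 3))
```

Keeping the weights in one table makes it easy to see why the same draft can score very differently across detectors: a uniform but non-repetitive essay moves Winston far more than Turnitin.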

3. The three signals, in plain English

Burstiness

Burstiness is a fancy word for "do your sentence lengths jump around?" Human writers follow long, complex sentences with short, punchy fragments. Then a single word. Like that. AI writing, especially from instruction-tuned models, settles into 18-to-25-word sentences, paragraph after paragraph. We measure burstiness as the standard deviation of per-sentence token counts divided by their mean (the coefficient of variation). Anything below 0.45 on our 0–1 scale is a red flag for every detector in this list.
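A minimal sketch of that measurement follows. It makes several assumptions the text does not settle: sentences are split naively on end punctuation, tokens are whitespace-separated words, the population standard deviation is used, and the result is clamped at 1 to fit a 0–1 scale.

```python
import re
import statistics


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def burstiness(text: str) -> float:
    """Std dev of sentence word counts divided by their mean, clamped to 0..1."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    mean = statistics.fmean(lengths)
    if mean == 0:
        return 0.0
    cv = statistics.pstdev(lengths) / mean
    return min(1.0, cv)


if __name__ == "__main__":
    uniform = "One two three four. Five six seven eight. Nine ten eleven twelve."
    varied = "Stop. This sentence is quite a bit longer than the first one. Like that."
    print(burstiness(uniform), burstiness(varied))
```

Three sentences of identical length score 0, while the varied sample lands well above the 0.45 threshold. A real splitter would also need to handle abbreviations and quoted speech, which this one does not.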

Perplexity proxy

Real perplexity requires a language model. We approximate it with two browser-cheap signals: bigram entropy (how surprising your token transitions are) and word-frequency rarity (how often you reach past the top-600 English words). A high perplexity proxy means your prose makes surprising local choices, exactly the thing GPT-4 was trained to round off. Polished writing rounds off the surprising choices too, which is why excellent students get flagged.
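The two proxy signals can be sketched as follows. Everything specific here is an assumption: the tiny COMMON_WORDS set stands in for a real top-600 table, "bigram entropy" is read as the Shannon entropy of the bigram distribution normalised by its maximum, and the equal blend of the two signals is our choice because the text does not state the mix.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a top-600 frequency table.
COMMON_WORDS = {"the", "a", "and", "of", "to", "in", "is", "it", "that", "was", "on", "for"}


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; apostrophes are kept inside words."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_entropy(tokens: list[str]) -> float:
    """Shannon entropy of the bigram distribution, normalised to 0..1."""
    bigrams = list(zip(tokens, tokens[1:]))
    if len(bigrams) < 2:
        return 0.0
    total = len(bigrams)
    counts = Counter(bigrams)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(total)  # maximum is reached when every bigram is unique


def rarity(tokens: list[str]) -> float:
    """Share of tokens that fall outside the common-word table."""
    if not tokens:
        return 0.0
    return sum(t not in COMMON_WORDS for t in tokens) / len(tokens)


def perplexity_proxy(text: str) -> float:
    """Equal blend of the two signals (the 50/50 split is an assumption)."""
    tokens = tokenize(text)
    return 0.5 * bigram_entropy(tokens) + 0.5 * rarity(tokens)
```

Normalising by the maximum entropy keeps the value comparable across drafts of different lengths, although very short inputs still give unstable readings.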

Repetition

Detectors notice when the same bigram or trigram dominates your text. AI prose repeats discourse markers ("It is important to note", "Furthermore", "In conclusion") and reuses noun phrases. We divide the count of the most frequent bigram by the total number of bigrams and scale the result up. High repetition makes Turnitin in particular very nervous, because its sliding window finds the repeats even after you reshuffle sentences.
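The repetition ratio can be sketched in a few lines. The tokenizer and the scale factor of 10 are hypothetical choices for illustration; the text only says the ratio is scaled up.

```python
import re
from collections import Counter


def repetition_ratio(text: str) -> float:
    """Count of the most frequent bigram divided by the total number of bigrams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    top_count = Counter(bigrams).most_common(1)[0][1]
    return top_count / len(bigrams)


def repetition_signal(text: str, scale: float = 10.0) -> float:
    """Scaled ratio clamped to 0..1. The default scale is a hypothetical value."""
    return min(1.0, repetition_ratio(text) * scale)


if __name__ == "__main__":
    print(repetition_ratio("it is important it is important it is"))
```

Because the denominator is the bigram count, very short texts produce a high ratio even with no real repetition, which is one reason short inputs are unreliable.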

4. Why we refuse to ship a "humanizer"

The most common follow-up question from students is: "Great, can you just rewrite my essay so it slips past the detector?" The answer is no, and the refusal is deliberate, for three reasons. First, every major school policy treats automated AI-text obfuscation as the same offence as AI generation itself. Running your real essay through a humanizer puts you in the exact category you were trying to escape. Second, humanizers leave their own fingerprint: Originality and Winston already train on humanizer output and recognise it. Third, the more thoughtful response to a false positive is human evidence: the working notes, the version history, the chat where you discussed the topic with a friend. Those are receipts no detector score can override. Saving the receipts changes the conversation from "your tool says I cheated" to "here is my drafting trail".

This site's positioning is therefore: diagnose, do not deceive. Our tips are heuristic suggestions you, as the human author, can act on consciously — mix in a fragment, choose a more specific verb, drop a date or a place name only you would know. Those are real editorial moves. They make your prose more interesting on the page and they incidentally pull your detector scores down. They do not invite a school-policy violation.

5. How to use this site without lying to yourself

The most common mistake we see is paste-edit-paste-edit until the score is green, then submit a draft that no longer sounds like you. Detector scores are a noisy proxy. Editorial quality is the actual goal. Use this workflow instead:

Paste the draft. Read the composite score. If it is below 30, stop tweaking and ship the draft. The signals are already on your side.
If the score is between 30 and 60, scan the highlighted sentences. Are the flagged ones the ones you also feel are weakest? Usually yes. Rewrite those for clarity, not for the score. The score will follow.
If the score is above 60 and you genuinely wrote the draft yourself, save your version history immediately. Take screenshots of your in-progress Google Doc with timestamps. That paper trail beats any detector score in front of a professor or HR panel.
Resist the urge to rewrite a whole paragraph just to game one number. Real editing fixes one sentence at a time. The composite score should creep down a few points per real edit — if it crashes 20 points after one paragraph change you probably also broke your own voice.
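The workflow above reduces to three score bands and one sanity check. The sketch below assumes a 0–100 composite score and treats 30 and 60 as belonging to the middle band; the function names and return strings are ours.

```python
def next_step(composite_score: float) -> str:
    """Map a 0..100 composite score to the workflow step described above."""
    if composite_score < 30:
        return "stop tweaking and submit"
    if composite_score <= 60:
        return "review the highlighted sentences and rewrite the weak ones for clarity"
    return "save version history and timestamped screenshots"


def voice_warning(previous: float, current: float, max_drop: float = 20.0) -> bool:
    """True when one edit dropped the score by 20 points or more."""
    return previous - current >= max_drop


if __name__ == "__main__":
    print(next_step(45))
    print(voice_warning(previous=58, current=35))
```

The warning is deliberately one-sided: a large drop after a single paragraph change is the cue to reread the paragraph and check that it still sounds like you.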

6. Korean prose: even higher uncertainty

All four detectors were primarily trained on English. For Korean prose, independent tests at EyeSift and Korean university IR departments measure false-positive rates anywhere from 8% to 30%, far higher than for English. Our Korean scoring intentionally dampens scores by 15% because the underlying signals (bigram entropy, word rarity) are even less reliable when the corpus the detector was trained on barely contained Hangul. Treat Korean scores from this site, or from any commercial detector, as an early-warning indicator, not a verdict.
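One way to apply that dampening is sketched below. It assumes "dampens by 15%" means multiplying the score by 0.85, and it decides whether a text is Korean by the share of letters in the precomposed Hangul syllable range (U+AC00 to U+D7A3); the 0.5 threshold is a hypothetical value.

```python
def hangul_share(text: str) -> float:
    """Fraction of alphabetic characters that are precomposed Hangul syllables."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hangul = sum(1 for ch in letters if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(letters)


def dampen_if_korean(score: float, text: str, threshold: float = 0.5) -> float:
    """Reduce the score by 15% when at least `threshold` of the letters are Hangul."""
    if hangul_share(text) >= threshold:
        return score * 0.85
    return score


if __name__ == "__main__":
    print(dampen_if_korean(80.0, "안녕하세요 반갑습니다"))
    print(dampen_if_korean(80.0, "Hello and welcome"))
```

Mixed Korean and English drafts are the awkward case: a hard threshold applies either all or none of the dampening, so a proportional factor may be a better fit there.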

7. Limits and known failure modes

We are wrong in the following predictable ways. (i) Very short inputs under 80 words are noisy, because the perplexity proxy cannot stabilise. (ii) Technical prose with long URLs, code blocks or formulas appears artificially high-rarity because those tokens look like rare words to our rarity table. (iii) Heavily edited drafts that had a real human polish pass often score worse than messier first drafts, because polish creates uniformity. (iv) Korean text is, as above, in a different uncertainty class entirely. Read the score the way you would read a weather forecast: useful as a probability, not as a guarantee.

8. If you have already been falsely accused

Do not panic and do not delete drafts. The strongest evidence is your editing history: Google Docs version history (File → Version history), Word track-changes, the chat log where you talked the topic over with a friend. Most schools have an internal appeals process: invoke it, attach the receipts, and ask explicitly which detector was used and at what threshold. Quote the published false-positive numbers: GPTZero ~0.24% (their own), Originality.ai 12–18% (independent), Turnitin AI under 1% (their own). Even at the lowest of those rates, a 200-essay class should expect false positives over a semester of assignments.

Lastly, if you are a teacher reading this: the responsible workflow is to use detector scores as a conversation starter, not as a verdict. Ask the student to walk you through their draft. False accusations are corrosive to the student-teacher relationship and to the integrity of the institution.
