We Tested Popular Humanizer Agent Skills' Techniques Against GPTZero. Spoiler: They Don't Work
We ran 120 GPTZero tests on the techniques recommended by popular open-source humanizer agent skills. Vocabulary bans and specificity instructions drop bypass rates by up to 43 percentage points. Here's what actually works.
We humanized 30 AI-generated texts with 4 different strategies and scored all 120 outputs with GPTZero, measuring the bypass rate for each strategy. The results contradict most of the advice you'll find online about bypassing AI detection.
The short version: the simple baseline outperforms the most heavily instructed variant by 43 percentage points. And vocabulary bans, one of the most commonly recommended techniques, actively hurt performance.
Here are the full results.
The Experiment
We built this test after noticing that popular humanizer agent skills, such as openclaw/humanizer and blader/humanizer, draw on the same sources: Wikipedia's "Signs of AI Writing" and the WikiProject AI Cleanup. The logic is straightforward: if Wikipedia editors flag these patterns as AI markers, blocking them should produce cleaner output.
Both skills recommend two concrete techniques based on this: vocabulary bans (blocklisting words like delve, tapestry, holistic) and specificity instructions (replacing vague phrases with concrete details like dates and place names).
To be clear: these skills are genuinely useful. They run locally, require no API, and do make AI-generated text sound more natural and human to a reader. The problem is that sounding human to a person and scoring as human on GPTZero are two very different things. In our tests, text processed by these techniques remained highly detectable.
We decided to measure exactly how much.
Setup:
- 30 AI-generated texts (100–500 words each) drawn from our benchmark dataset
- Each text humanized with 4 different approaches using a large language model
- Each humanized version scored with the GPTZero API
- Threshold: score below 50 = bypass (passes as human-written)
The 4 variants were:
| Variant | Description |
|---|---|
| A: Baseline | HumanizerAI's current humanization approach |
| B: +Specificity | Baseline + instruction to replace vague words with concrete details |
| C: +Vocab ban | Baseline + explicit blocklist of 20 AI-associated words |
| D: Both | Baseline + specificity + vocab ban |
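The harness itself is simple to reconstruct. The sketch below shows the shape of it; the endpoint URL, header, and response field follow GPTZero's public API docs as we understand them, and the variant suffixes are illustrative stand-ins (the real baseline prompt and full 20-word blocklist aren't reproduced in this post):

```python
import json
import urllib.request

# Assumption: endpoint, header, and response fields per GPTZero's public API docs.
GPTZERO_URL = "https://api.gptzero.me/v2/predict/text"
BYPASS_THRESHOLD = 50  # below 50 = passes as human-written

# Illustrative stand-ins for the 4 prompt variants.
VARIANTS = {
    "A": "",  # baseline humanization prompt only
    "B": "\nReplace vague phrases with concrete details (dates, place names).",
    "C": "\nNever use these words: delve, tapestry, holistic, ...",  # 20-word blocklist
}
VARIANTS["D"] = VARIANTS["B"] + VARIANTS["C"]

def gptzero_score(text: str, api_key: str) -> float:
    """Score one text with the GPTZero API, scaled to 0-100."""
    req = urllib.request.Request(
        GPTZERO_URL,
        data=json.dumps({"document": text}).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # completely_generated_prob is 0-1 in the v2 response; scale to 0-100
    return body["documents"][0]["completely_generated_prob"] * 100

def bypass_rate(scores: list[float]) -> float:
    """Percentage of texts scoring under the bypass threshold."""
    return 100 * sum(s < BYPASS_THRESHOLD for s in scores) / len(scores)
```

Run each of the 30 texts through all 4 variants, score the outputs, and `bypass_rate` per variant gives the table below.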
Results
| Variant | Bypass Rate | Avg GPTZero Score |
|---|---|---|
| A: Baseline (current) | 66.7% | 35.1 |
| B: + Specificity | 46.7% | 49.0 |
| C: + Vocab ban | 30.0% | 62.3 |
| D: + Both | 23.3% | 73.8 |
HumanizerAI's baseline approach outperformed every enhanced variant. Adding a vocabulary ban alone dropped the bypass rate by 36.7 percentage points. Combining both additions produced the worst result of all: 23.3%, versus 66.7% for the unchanged approach.
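The percentage-point figures fall straight out of the table; a quick sanity check:

```python
# Bypass rates from the results table (percent).
bypass = {
    "A: Baseline": 66.7,
    "B: +Specificity": 46.7,
    "C: +Vocab ban": 30.0,
    "D: Both": 23.3,
}

baseline = bypass["A: Baseline"]
# Drop relative to the baseline, in percentage points.
drops = {name: round(baseline - rate, 1)
         for name, rate in bypass.items() if name != "A: Baseline"}
print(drops)  # {'B: +Specificity': 20.0, 'C: +Vocab ban': 36.7, 'D: Both': 43.4}
```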
Why Vocabulary Bans Backfire
The intuition behind vocabulary bans is sound: GPTZero flags words like delve, tapestry, and leverage as AI signals. Blocking them should produce cleaner text.
In practice, three things go wrong.
First, the LLM finds worse alternatives. When you ban holistic, the model substitutes words that may be equally AI-typical or formally stiff. Banning the symptom doesn't fix the underlying writing pattern.
Second, longer instructions shift the model's behavior. Adding a 20-word blocklist dilutes the humanization signal: the model tries to satisfy multiple constraints and ends up producing longer, more carefully constructed output, which is exactly what GPTZero is trained to detect.
Third, GPTZero doesn't primarily flag vocabulary. It measures burstiness (sentence length variation) and perplexity (how predictable each word choice is given the context). A text with zero banned words but uniform sentence lengths and low perplexity will still score high. Our tests confirmed this: the strongest predictor of GPTZero score is sentence length variation, not vocabulary.
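GPTZero's exact formula is proprietary, but the sentence-length-variation signal can be approximated with a simple coefficient-of-variation measure. This is an illustrative sketch (the splitting regex and function name are ours), not GPTZero's or HumanizerAI's implementation:

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.
    Higher means more human-like variation; 0 means perfectly
    uniform sentence lengths."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

# Similar everyday vocabulary, very different structure:
uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = "The cat sat. Meanwhile, the dog ran off across the yard. Gone."
```

On the uniform text the measure is 0.0; on the varied text it is about 0.9. No banned words in either, yet the statistical profiles diverge sharply, which is the kind of difference GPTZero actually scores.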
Why Specificity Instructions Backfire
The specificity technique sounds compelling in theory: replace vague AI phrases like "a major city" with concrete terms like "Portland" or "last Tuesday." Human writing is specific; AI writing is general.
But in practice, GPTZero is trained on a vast amount of AI-generated content, including AI text that has been prompted to include specific details. A text full of precise-but-invented dates and place names reads to GPTZero as AI trying to sound human. The overspecificity itself becomes a signal.
Our data backs this up. The specificity variant averaged a GPTZero score of 49.0 compared to 35.1 for the baseline, a 14-point increase in the wrong direction.
What Actually Works: Targeting Statistical Properties
HumanizerAI's baseline achieved 66.7% bypass not because it avoids specific words, but because it produces text whose underlying statistical structure looks more like human writing.
GPTZero measures two things above all else: burstiness (how much sentence lengths vary) and perplexity (how predictable each word is given the context around it). Human writing scores high on both: it's varied and unpredictable. AI writing tends toward uniform sentence patterns and predictable word choices regardless of vocabulary.
The winning approach targets these statistical properties directly. Adding vocabulary rules on top introduces noise that actually makes the output more uniform and more careful, the opposite of what you want.
The core insight: it's not about which words you use. It's about how those words are arranged and how varied the structure is.
What We Learned From the Open-Source Community
The blader/humanizer skill is well-researched, and its recommendations make intuitive sense. The issue isn't the quality of the analysis; it's that the patterns were calibrated on a different type of AI text.
Chat-interface AI (copy-pasted responses from ChatGPT, Claude.ai, Gemini) has a very different fingerprint from direct API output used in document generation. Chat responses include assistant artifacts: feel free to ask, I hope this helps, let me break this down step by step. Those patterns are strong signals in chat-generated text.
When we tested for those same patterns on our benchmark dataset (100 samples, 50 human / 50 AI), we found:
| Signal | AI Presence | Human Presence |
|---|---|---|
| Chatbot artifacts (feel free to, happy to help...) | 0.0% | 10.0% |
| Curly quotes | 36.0% | 52.0% |
| Tier-1 vocab (delve, tapestry, holistic...) | 4.0% | 2.0% |
Chatbot artifacts appear in human texts (they're common in dialogue in classic literature) and not at all in our AI samples. Curly quotes are actually more common in human texts than AI texts, because our human corpus includes typographically correct published books.
These patterns are real signals for chat-style content. They don't transfer to API-generated document content. This distinction matters a lot for building an accurate detector.
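A check like ours is easy to reproduce on your own corpus. Here's a minimal sketch; the phrase lists are abbreviated illustrations (the full lists used in the benchmark aren't reproduced here), and the function name is ours:

```python
import re

# Abbreviated illustrations of the signal lists, not the full benchmark lists.
CHAT_ARTIFACTS = ["feel free to", "happy to help", "i hope this helps",
                  "let me break this down"]
TIER1_VOCAB = ["delve", "tapestry", "holistic"]

def signal_presence(texts: list[str]) -> dict[str, float]:
    """Percentage of texts containing each signal at least once."""
    def pct(pred) -> float:
        return 100 * sum(pred(t) for t in texts) / len(texts)
    return {
        "chatbot_artifacts": pct(
            lambda t: any(p in t.lower() for p in CHAT_ARTIFACTS)),
        # Curly (typographic) double and single quotes.
        "curly_quotes": pct(
            lambda t: bool(re.search("[\u201c\u201d\u2018\u2019]", t))),
        "tier1_vocab": pct(
            lambda t: any(re.search(rf"\b{w}\b", t.lower()) for w in TIER1_VOCAB)),
    }
```

Run it separately over your human and AI samples and compare the two result dicts; if a "signal" fires as often on human text as on AI text, it has no discriminative value for your corpus.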
The Takeaway for Humanization
If you're humanizing AI-generated text to reduce GPTZero scores, our data suggests:
- Structural variation beats vocabulary fixes. GPTZero cares about how text is arranged, not which words it uses. Varied, unpredictable structure is the goal.
- Don't add vocabulary blocklists. They consistently hurt performance in our tests; the model compensates in ways that make the output more AI-typical.
- Don't add specificity instructions. LLM-invented concrete details (fake dates, places, names) are detectable and increase scores.
- Context matters. Techniques that work on chat-style AI output may not work on document-style AI output, and vice versa. Verify with real GPTZero data, not assumptions.
Use the HumanizerAI Agent Skill
The skills above are a good starting point if you want something local and free. But as our tests show, they don't move the needle on GPTZero scores. The reason is structural: they fix surface vocabulary, not the statistical properties that detectors actually measure.
The HumanizerAI agent skill works differently. Instead of swapping words, it rewrites text to increase burstiness and perplexity, the two signals GPTZero weights most heavily. That's what produced the 66.7% bypass rate in our baseline, compared to 23.3% when vocabulary bans were added on top.
Install it from Claude Code, Cursor, or any MCP-compatible agent:
```
/learn humanizerai/agent-skills
```
It exposes two tools: `humanize` (rewrites AI text to bypass detection) and `detect-ai` (scores text against GPTZero-calibrated signals). Full documentation at humanizerai.com/docs/api.
Want to test it yourself? Try Humanizer AI and see how your text scores before and after.
Read Next
GPTZero vs Originality.ai: Which AI Detector Is More Accurate?
Compare GPTZero and Originality.ai head-to-head. Real accuracy tests, false positive rates, pricing, and which tool is best for your needs.
Undetectable AI vs Humanizer AI: Which AI Humanizer Actually Works?
Compare Undetectable AI and Humanizer AI head-to-head. Real test results, pricing comparison, and which tool actually bypasses AI detection in 2026.
Can Employers Detect AI in Cover Letters and Resumes?
65% of Fortune 500 companies use AI detection on applications. Learn what employers can spot, which industries auto-reject AI, and how to use AI assistance safely.