We Tested Popular Humanizer Agent Skills' Techniques Against GPTZero. Spoiler: They Don't Work
We ran 120 GPTZero tests on the techniques recommended by popular open-source humanizer agent skills. Vocabulary bans and specificity instructions drop bypass rates by up to 43 percentage points. Here's what actually works.
We humanized 30 AI-generated texts with 4 different strategies and scored all 120 outputs with GPTZero, measuring the bypass rate for each strategy. The results contradict most of the advice you'll find online about bypassing AI detection.
The short version: the simple baseline outperforms the most heavily instructed variant by 43 percentage points. And vocabulary bans, one of the most commonly recommended techniques, actively hurt performance.
Here are the full results.
The Experiment
We built this test after noticing that popular humanizer agent skills, such as openclaw/humanizer and blader/humanizer, draw on the same sources: Wikipedia's "Signs of AI Writing" and the WikiProject AI Cleanup. The logic is straightforward: if Wikipedia editors flag these patterns as AI markers, blocking them should produce cleaner output.
Both skills recommend two concrete techniques based on this: vocabulary bans (blocklisting words like delve, tapestry, holistic) and specificity instructions (replacing vague phrases with concrete details like dates and place names).
To be clear: these skills are genuinely useful. They run locally, require no API, and do make AI-generated text sound more natural and human to a reader. The problem is that sounding human to a person and scoring as human on GPTZero are two very different things. In our tests, text processed by these techniques remained highly detectable.
We decided to measure exactly how much.
Setup:
- 30 AI-generated texts (100–500 words each) drawn from our benchmark dataset
- Each text humanized with 4 different approaches using a large language model
- Each humanized version scored with the GPTZero API
- Threshold: score below 50 = bypass (passes as human-written)
The 4 variants were:
| Variant | Description |
|---|---|
| A: Baseline | HumanizerAI's current humanization approach |
| B: +Specificity | Baseline + instruction to replace vague words with concrete details |
| C: +Vocab ban | Baseline + explicit blocklist of 20 AI-associated words |
| D: Both | Baseline + specificity + vocab ban |
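The harness itself is simple to reconstruct. The sketch below shows the shape of it; the endpoint URL, header, and response field follow GPTZero's public API docs as we understand them, and the variant suffixes are illustrative stand-ins (the real baseline prompt and full 20-word blocklist aren't reproduced in this post):

```python
import json
import urllib.request

# Assumption: endpoint, header, and response fields per GPTZero's public API docs.
GPTZERO_URL = "https://api.gptzero.me/v2/predict/text"
BYPASS_THRESHOLD = 50  # below 50 = passes as human-written

# Illustrative stand-ins for the 4 prompt variants.
VARIANTS = {
    "A": "",  # baseline humanization prompt only
    "B": "\nReplace vague phrases with concrete details (dates, place names).",
    "C": "\nNever use these words: delve, tapestry, holistic, ...",  # 20-word blocklist
}
VARIANTS["D"] = VARIANTS["B"] + VARIANTS["C"]

def gptzero_score(text: str, api_key: str) -> float:
    """Score one text with the GPTZero API, scaled to 0-100."""
    req = urllib.request.Request(
        GPTZERO_URL,
        data=json.dumps({"document": text}).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # completely_generated_prob is 0-1 in the v2 response; scale to 0-100
    return body["documents"][0]["completely_generated_prob"] * 100

def bypass_rate(scores: list[float]) -> float:
    """Percentage of texts scoring under the bypass threshold."""
    return 100 * sum(s < BYPASS_THRESHOLD for s in scores) / len(scores)
```

Run each of the 30 texts through all 4 variants, score the outputs, and `bypass_rate` per variant gives the table below.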
Results
| Variant | Bypass Rate | Avg GPTZero Score |
|---|---|---|
| A: Baseline (current) | 66.7% | 35.1 |
| B: + Specificity | 46.7% | 49.0 |
| C: + Vocab ban | 30.0% | 62.3 |
| D: + Both | 23.3% | 73.8 |
HumanizerAI's baseline approach outperformed every enhanced variant. Adding a vocabulary ban alone dropped the bypass rate by 36.7 percentage points. Combining both additions produced the worst result of all: 23.3%, versus 66.7% for the unchanged approach.
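The percentage-point figures fall straight out of the table; a quick sanity check:

```python
# Bypass rates from the results table (percent).
bypass = {
    "A: Baseline": 66.7,
    "B: +Specificity": 46.7,
    "C: +Vocab ban": 30.0,
    "D: Both": 23.3,
}

baseline = bypass["A: Baseline"]
# Drop relative to the baseline, in percentage points.
drops = {name: round(baseline - rate, 1)
         for name, rate in bypass.items() if name != "A: Baseline"}
print(drops)  # {'B: +Specificity': 20.0, 'C: +Vocab ban': 36.7, 'D: Both': 43.4}
```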
Why Vocabulary Bans Backfire
The intuition behind vocabulary bans is sound: GPTZero flags words like delve, tapestry, and leverage as AI signals. Blocking them should produce cleaner text.
In practice, three things go wrong.
First, the LLM finds worse alternatives. When you ban holistic, the model substitutes words that may be equally AI-typical or formally stiff. Banning the symptom doesn't fix the underlying writing pattern.
Second, longer instructions shift the model's behavior. Adding a 20-word blocklist dilutes the humanization signal: the model tries to satisfy multiple constraints and ends up producing longer, more carefully constructed output, which is exactly what GPTZero is trained to detect.
Third, GPTZero doesn't primarily flag vocabulary. It measures burstiness (sentence length variation) and perplexity (how predictable each word choice is given the context). A text with zero banned words but uniform sentence lengths and low perplexity will still score high. Our tests confirmed this: the strongest predictor of GPTZero score is sentence length variation, not vocabulary.
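GPTZero's exact formula is proprietary, but the sentence-length-variation signal can be approximated with a simple coefficient-of-variation measure. This is an illustrative sketch (the splitting regex and function name are ours), not GPTZero's or HumanizerAI's implementation:

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.
    Higher means more human-like variation; 0 means perfectly
    uniform sentence lengths."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

# Similar everyday vocabulary, very different structure:
uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = "The cat sat. Meanwhile, the dog ran off across the yard. Gone."
```

On the uniform text the measure is 0.0; on the varied text it is about 0.9. No banned words in either, yet the statistical profiles diverge sharply, which is the kind of difference GPTZero actually scores.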
Why Specificity Instructions Backfire
The specificity technique sounds compelling in theory: replace vague AI phrases like "a major city" with concrete terms like "Portland" or "last Tuesday." Human writing is specific; AI writing is general.
But in practice, GPTZero is trained on a vast amount of AI-generated content, including AI text that has been prompted to include specific details. A text full of precise-but-invented dates and place names reads to GPTZero as AI trying to sound human. The overspecificity itself becomes a signal.
Our data backs this up. The specificity variant averaged a GPTZero score of 49.0 compared to 35.1 for the baseline, a 14-point increase in the wrong direction.
What Actually Works: Targeting Statistical Properties
HumanizerAI's baseline achieved 66.7% bypass not because it avoids specific words, but because it produces text whose underlying statistical structure looks more like human writing.
GPTZero measures two things above all else: burstiness (how much sentence lengths vary) and perplexity (how predictable each word is given the context around it). Human writing scores high on both: it's varied and unpredictable. AI writing tends toward uniform sentence patterns and predictable word choices regardless of vocabulary.
The winning approach targets these statistical properties directly. Adding vocabulary rules on top introduces noise that actually makes the output more uniform and more careful, the opposite of what you want.
The core insight: it's not about which words you use. It's about how those words are arranged and how varied the structure is.
What We Learned From the Open-Source Community
The blader/humanizer skill is well-researched, and its recommendations make intuitive sense. The issue isn't the quality of the analysis; it's that the patterns were calibrated on a different type of AI text.
Chat-interface AI (copy-pasted responses from ChatGPT, Claude.ai, Gemini) has a very different fingerprint from direct API output used in document generation. Chat responses include assistant artifacts: feel free to ask, I hope this helps, let me break this down step by step. Those patterns are strong signals in chat-generated text.
When we tested for those same patterns on our benchmark dataset (100 samples, 50 human / 50 AI), we found:
| Signal | AI Presence | Human Presence |
|---|---|---|
| Chatbot artifacts (feel free to, happy to help...) | 0.0% | 10.0% |
| Curly quotes | 36.0% | 52.0% |
| Tier-1 vocab (delve, tapestry, holistic...) | 4.0% | 2.0% |
Chatbot artifacts appear in human texts (they're common in dialogue in classic literature) and not at all in our AI samples. Curly quotes are actually more common in human texts than AI texts, because our human corpus includes typographically correct published books.
These patterns are real signals for chat-style content. They don't transfer to API-generated document content. This distinction matters a lot for building an accurate detector.
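A check like ours is easy to reproduce on your own corpus. Here's a minimal sketch; the phrase lists are abbreviated illustrations (the full lists used in the benchmark aren't reproduced here), and the function name is ours:

```python
import re

# Abbreviated illustrations of the signal lists, not the full benchmark lists.
CHAT_ARTIFACTS = ["feel free to", "happy to help", "i hope this helps",
                  "let me break this down"]
TIER1_VOCAB = ["delve", "tapestry", "holistic"]

def signal_presence(texts: list[str]) -> dict[str, float]:
    """Percentage of texts containing each signal at least once."""
    def pct(pred) -> float:
        return 100 * sum(pred(t) for t in texts) / len(texts)
    return {
        "chatbot_artifacts": pct(
            lambda t: any(p in t.lower() for p in CHAT_ARTIFACTS)),
        # Curly (typographic) double and single quotes.
        "curly_quotes": pct(
            lambda t: bool(re.search("[\u201c\u201d\u2018\u2019]", t))),
        "tier1_vocab": pct(
            lambda t: any(re.search(rf"\b{w}\b", t.lower()) for w in TIER1_VOCAB)),
    }
```

Run it separately over your human and AI samples and compare the two result dicts; if a "signal" fires as often on human text as on AI text, it has no discriminative value for your corpus.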
The Takeaway for Humanization
If you're humanizing AI-generated text to reduce GPTZero scores, our data suggests:
- Structural variation beats vocabulary fixes. GPTZero cares about how text is arranged, not which words it uses. Varied, unpredictable structure is the goal.
- Don't add vocabulary blocklists. They consistently hurt performance in our tests; the model compensates in ways that make the output more AI-typical.
- Don't add specificity instructions. LLM-invented concrete details (fake dates, places, names) are detectable and increase scores.
- Context matters. Techniques that work on chat-style AI output may not work on document-style AI output, and vice versa. Verify with real GPTZero data, not assumptions.
Use the HumanizerAI Agent Skill
The skills above are a good starting point if you want something local and free. But as our tests show, they don't move the needle on GPTZero scores. The reason is structural: they fix surface vocabulary, not the statistical properties that detectors actually measure.
The HumanizerAI agent skill works differently. Instead of swapping words, it rewrites text to increase burstiness and perplexity, the two signals GPTZero weights most heavily. That's what produced the 66.7% bypass rate in our baseline, compared to 23.3% when vocabulary bans were added on top.
Install it from Claude Code, Cursor, or any MCP-compatible agent:
```
/learn humanizerai/agent-skills
```
It exposes two tools: `humanize` (rewrites AI text to bypass detection) and `detect-ai` (scores text against GPTZero-calibrated signals). Full documentation at humanizerai.com/docs/api.
Want to test it yourself? Try Humanizer AI and see how your text scores before and after.
Read Next
GPTZero vs Originality.ai: Which AI Detector Is More Accurate?
Compare GPTZero and Originality.ai head-to-head. Real accuracy tests, false positive rates, pricing, and which tool is best for your needs.
Undetectable AI vs Humanizer AI: Which AI Humanizer Actually Works?
Compare Undetectable AI and Humanizer AI head-to-head. Real test results, pricing comparison, and which tool actually bypasses AI detection in 2026.
Can Employers Detect AI in Cover Letters and Resumes?
65% of Fortune 500 companies use AI detection on applications. Learn what employers can spot, which industries auto-reject AI, and how to use AI assistance safely.