Smellcheck: Detecting AI Smell in Text
hackernews
🔬 Research
#ai-detection
#chatgpt
#gpt-4
#review
#no-machine-learning
#offline
#text-analysis
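Source: hackernews · Summarized and analyzed by Genesis Park
Summary
Smellcheck is a tool that flags text as possibly AI-generated by analyzing special characters, emoji, and AI jargon in the text, without using machine learning or any API. It is in an early alpha stage and currently supports English text only; its results should be treated as signals that a text may have been written by AI, not as a verdict. Through the CLI, users can check plain text files as well as text extracted from PDFs or web pages, or hook the tool into a CI pipeline for automated review.
Body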
Detect suspicious AI-text fingerprints in user submissions: fast, offline, no ML required.

Smellcheck scans text for patterns that frequently appear in AI-generated writing: unusual punctuation characters, overused AI buzzwords, and vocabulary that people recognize but almost never type themselves.

Important caveat: smellcheck can tell you that a text looks suspicious; it cannot reliably tell you that a text was written by AI. A flagged text might have been written by a human who just loves em dashes. A clean text could still be AI-generated. Use the results as a signal to guide human review, not as a verdict.

Smellcheck is in an early alpha stage; use it with caution. Currently, it only works for English texts.

Smellcheck uses static analysis only: no machine learning, no API calls, no network latency, no cost. It checks for:

- Typography characters that AI models produce naturally but humans rarely type (em dashes, curly quotes, ellipsis …)
- Unicode symbols and emoji clusters common in LLM output
- AI cliché phrases (delve into, it's worth noting, tapestry of)
- Formal or legalistic vocabulary humans recognize but almost never reach for (aforementioned, heretofore, whilst)

The package is not yet published to npm. Install it directly from GitHub using npm (Node.js required):

```sh
# npm
npm install github:fbuchinger/smellcheck
```

```
> echo "…and there are many — of this paradigm shift 🌟." | smellcheck
"…and there are many — of this paradigm shift 🌟."
```
```
── Smellcheck Report ──────────────────────────────────
⚠ AI fingerprints detected

TYPO 2 match(es)
  → "…" at position 1
    Horizontal ellipsis (…) — distinct from three dots
  → "—" at position 21
    Em dash (—) — rarely typed manually
UNICODE 1 match(es)
  → "🌟" at position 46
    Suspicious Unicode character (Miscellaneous symbols and pictographs): U+1F31F
BUZZ 1 match(es)
  → "paradigm" at position 31
    AI buzzword/cliché: "paradigm"
────────────────────────────────────────────────────────
```

Note: On Windows, switch the cmd.exe code page to UTF-8 by running `chcp 65001` first; otherwise Unicode detection will not work. On Linux/Unix, `echo` works as shown; use `cat` to pipe the contents of a file instead.

```sh
# Analyze a plain text file
smellcheck report.txt

# Pipe from stdin
cat submission.txt | smellcheck

# Output raw JSON (for piping to other tools)
smellcheck --json report.txt

# Disable specific plugins
smellcheck --no-unicode --no-buzzwords report.txt

# Exit code: 0 = clean, 1 = flagged; useful in CI / git hooks
smellcheck report.txt && echo "Clean!"
```

Smellcheck reads plain text only. To analyze PDFs or web pages, use a third-party tool to extract the text first, then pipe it in:

```sh
# Using pdftotext (part of poppler-utils, available on Linux/macOS/WSL)
pdftotext submission.pdf - | smellcheck

# Using pdftotext with a specific page range
pdftotext -f 1 -l 3 submission.pdf - | smellcheck

# Using pdf-to-text (Node.js, cross-platform)
npx pdf-to-text submission.pdf | smellcheck

# Save extracted text first, then analyze
pdftotext submission.pdf submission.txt && smellcheck submission.txt

# Using curl + html2text to strip markup
curl -s https://example.com/article | html2text | smellcheck

# Using lynx
lynx -dump https://example.com/article | smellcheck

# Fail a pull request if a generated file looks AI-written
smellcheck docs/release-notes.md || { echo "AI smell detected — please review"; exit 1; }
```

All plugins are enabled by default and can be toggled individually.
| Plugin | What it detects | Why it matters |
|---|---|---|
| `typography` | Em dashes (—), en dashes (–), non-breaking spaces, zero-width characters, curly quotes (“ ”), soft hyphens, ellipsis (…) | These characters are standard output for LLMs because training data is full of typeset documents, but on a keyboard they require special key combinations most people never bother with. A 2023 analysis of GPT-4 output found em dashes present in ~73% of long-form samples vs. ~12% of human-written equivalents. |
| `unicode` | Emoji, pictograms, decorative symbols from Unicode blocks rarely found in plain text | LLMs frequently insert decorative Unicode when producing structured or list-heavy content, a pattern identified in Guo et al., 2023 – "How Close is ChatGPT to Human Experts?". |
| `buzzwords` | AI clichés: delve, tapestry, nuanced, holistic, robust, leverage, cutting-edge, it's worth noting … | These phrases are statistically overrepresented in LLM output compared to human writing. The word delve, for instance, appears roughly 7× more often in ChatGPT responses than in human-written text of similar length. |
| `unnatural` | Vocabulary humans recognize but rarely type spontaneously: aforementioned, heretofore, whilst, elucidate, notwithstanding … | LLMs are trained on formal written corpora (legal documents, academic papers, Wikipedia) and tend to reproduce a formal register even in casual contexts. Human writers almost never spontaneously choose aforementioned over "the above" or whilst over "while", which makes these words useful soft signals. See Kobak et al., 2025 – Delving into LLM-assisted writing in biomedical publications through excess vocabular
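To make the plugin idea concrete, here is a minimal Python sketch of this kind of static scan. It is not Smellcheck's actual implementation (the tool is a Node.js package whose source is not shown here): the rule tables, the `Match` type, and the `scan` and `exit_code` functions are hypothetical names chosen for illustration, and the word lists contain only a few terms taken from the text above. The "unicode" check uses the Unicode general category `So` (Symbol, other) as a rough stand-in for "decorative symbol", which is an assumption.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import List

# Hypothetical rule tables; the real plugin lists are not given in the source.
TYPOGRAPHY = {
    "\u2014": "em dash",
    "\u2013": "en dash",
    "\u2026": "horizontal ellipsis",
    "\u201c": "left curly quote",
    "\u201d": "right curly quote",
    "\u00a0": "non-breaking space",
    "\u200b": "zero-width space",
    "\u00ad": "soft hyphen",
}
BUZZWORDS = ("delve", "tapestry", "paradigm", "it's worth noting")
UNNATURAL = ("aforementioned", "heretofore", "whilst")


@dataclass(frozen=True)
class Match:
    plugin: str    # which rule group fired
    text: str      # the matched character or phrase
    position: int  # code-point index into the scanned text


def scan(text: str) -> List[Match]:
    """Return all matches, sorted by position. No ML, no network calls."""
    matches = []
    # Character-level checks: known typography first, then symbol-like code points.
    for i, ch in enumerate(text):
        if ch in TYPOGRAPHY:
            matches.append(Match("typography", ch, i))
        elif unicodedata.category(ch) == "So":  # assumption: "So" approximates emoji/pictographs
            matches.append(Match("unicode", ch, i))
    # Phrase-level checks: whole-word, case-insensitive matches on the original text.
    for plugin, phrases in (("buzzwords", BUZZWORDS), ("unnatural", UNNATURAL)):
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                matches.append(Match(plugin, phrase, m.start()))
    return sorted(matches, key=lambda m: m.position)


def exit_code(matches: List[Match]) -> int:
    """Mirror the CLI convention described above: 0 = clean, 1 = flagged."""
    return 1 if matches else 0


if __name__ == "__main__":
    sample = '"…and there are many — of this paradigm shift 🌟."'
    for match in scan(sample):
        print(match.plugin, repr(match.text), "at position", match.position)
```

Run on the sample sentence (including its surrounding quotes), the sketch reports the ellipsis at position 1, the em dash at 21, "paradigm" at 31, and the star at 46, the same positions as in the report shown earlier. As with the tool itself, a match is only a prompt for human review.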
This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.