Show HN: Lmscan – Detect AI text and fingerprint which LLM wrote it (zero deps)
📦 Open source
#ai models
#ai text detection
#anthropic
#chatgpt
#claude
#gemini
#gpt-4
#gptzero alternative
#llama
#llm fingerprinting
#mistral
#perplexity
#open source
#ai-generated text
#llm
#ai model classification
Source: hackernews · Summarized and analyzed by Genesis Park
Summary
lmscan (v0.1.0) is an open-source AI text detection tool that not only statistically analyzes whether text was AI-generated, but also identifies which model wrote it (GPT-4, Claude, etc.) based on distinctive vocabulary patterns. It analyzes 12 computational-linguistics features, and because it runs offline with no external API keys or internet connection, it can serve as a free alternative to existing paid services. Note, however, that detection accuracy degrades on manually edited text, non-English text, and short passages under 50 words.
Full text
Detect AI-generated text. Fingerprint which LLM wrote it. An open-source GPTZero alternative.

GPTZero charges $15/month. Originality.ai charges per scan. Turnitin locks you into institutional contracts. lmscan is free, open-source, works offline, and tells you which model wrote the text.

```
$ lmscan "In today's rapidly evolving digital landscape, it's important to note that artificial intelligence has become a pivotal force in transforming how we navigate the complexities of modern life..."

🔍 lmscan v0.1.0 — AI Text Forensics
══════════════════════════════════════════════════
Verdict: 🤖 Likely AI (77% confidence)
Words: 184   Sentences: 10   Scanned in 0.01s

┌────────────────────────────┬──────────┬────────────────────┐
│ Feature                    │ Value    │ Signal             │
├────────────────────────────┼──────────┼────────────────────┤
│ Burstiness                 │ 0.07     │ 🔴 Very low (AI)   │
│ Sentence length variance   │ 0.27     │ 🟡 Below average   │
│ Slop word density          │ 20.7%    │ 🔴 High (AI)       │
│ Transition word ratio      │ 2.2%     │ 🟡 Elevated        │
│ Readability consistency    │ 0.00     │ 🔴 Very low (AI)   │
│ ...                        │          │                    │
└────────────────────────────┴──────────┴────────────────────┘

🔎 Model Attribution
1. GPT-4 / ChatGPT    62% — "delve", "tapestry", "beacon", "landscape" (×2), +19 more
2. Claude (Anthropic) 13% — "robust", "nuanced", "comprehensive"
3. Gemini (Google)     9% — "furthermore", "additionally"

⚠️ Flags
• Very low burstiness (0.07) — AI text is more uniform in complexity
• High slop word density (20.7%) — contains known AI vocabulary markers
```

Install:

```
pip install lmscan
```

Zero dependencies. Works with Python 3.9+. No API keys. No internet. No GPU.

```
# Scan text directly
lmscan "Your text here..."
```
```
# Scan a file
lmscan document.txt

# Pipe from stdin
cat essay.txt | lmscan -

# JSON output (for scripts and CI)
lmscan document.txt --format json

# Per-sentence breakdown
lmscan document.txt --sentences

# CI gate: fail if AI probability > 50%
lmscan submission.txt --threshold 0.5
```

Python API:

```python
from lmscan import scan

result = scan("Text to analyze...")
print(f"AI probability: {result.ai_probability:.0%}")
print(f"Verdict: {result.verdict}")
print(f"Confidence: {result.confidence}")

# Which model wrote it?
for model in result.model_attribution:
    print(f"  {model.model}: {model.confidence:.0%}")
    for evidence in model.evidence[:3]:
        print(f"    → {evidence}")

# Per-sentence analysis
for sentence in result.sentence_scores:
    if sentence.ai_probability > 0.7:
        print(f"  🤖 {sentence.text[:60]}... ({sentence.ai_probability:.0%})")
```

Batch scanning:

```python
from lmscan import scan_file
import glob

for path in glob.glob("submissions/*.txt"):
    result = scan_file(path)
    print(f"{path}: {result.verdict} ({result.ai_probability:.0%})")
```

lmscan uses 12 statistical features derived from computational linguistics research to distinguish AI-generated text from human writing:

| Feature | What it measures | AI signal |
|---|---|---|
| Burstiness | Variance in sentence complexity | AI text is unusually uniform |
| Sentence length variance | How much sentence lengths vary | AI produces uniform lengths |
| Vocabulary richness | Type-token ratio (Yule's K corrected) | AI reuses words more |
| Hapax legomena ratio | Fraction of words appearing once | AI has fewer unique words |
| Zipf deviation | How word frequencies follow Zipf's law | AI deviates from natural distribution |
| Readability consistency | Flesch-Kincaid variance across paragraphs | AI maintains constant readability |
| Bigram/trigram repetition | Repeated word pairs and triples | AI repeats phrase structures |
| Transition word ratio | "however", "moreover", "furthermore"... | AI overuses transitions |
| Slop word density | Known AI vocabulary markers | "delve", "tapestry", "beacon"... |
| Punctuation entropy | Diversity of punctuation usage | AI is more predictable |

Each feature produces a signal via sigmoid transformation. The weighted combination produces the final AI probability.

lmscan includes vocabulary fingerprints for 5 major LLM families:

| Model | Distinctive markers |
|---|---|
| GPT-4 / ChatGPT | "delve", "tapestry", "landscape", "leverage", "multifaceted", "it's important to note" |
| Claude (Anthropic) | "certainly", "I'd be happy to", "straightforward", "I should note" |
| Gemini (Google) | "crucial", "here's a breakdown", "keep in mind" |
| Llama / Meta | "awesome", "fantastic", "hope this helps" |
| Mistral / Mixtral | "indeed", "moreover", "hence", "noteworthy" |

Attribution uses weighted vocabulary matching, phrase detection, and hedging pattern analysis.

What lmscan is good at:

- Detecting text with strong AI stylistic patterns
- Identifying which model family generated text
- Scanning at scale (thousands of documents) with zero cost
- Providing explainable evidence (not a black box)

What lmscan cannot do:

- Detect AI text that has been manually edited or paraphrased
- Work reliably on very short text (<50 words)
- Detect AI text in non-English languages (English-only for now)
- Replace human judgment — use as a signal, not a verdict

This is statistical analysis, not a neural classifier. It detects stylistic patterns, not watermarks. It works best on unedited LLM output and degrades gracefully on edited text.
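A minimal sketch of how such a sigmoid-and-weights combination might work, using burstiness and slop-word density as example features. The sentence splitter, centering constants, and weights below are illustrative assumptions, not lmscan's actual internals:

```python
import math
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (a rough proxy
    for 'variance in sentence complexity'; assumed definition)."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ai_probability(features: dict[str, float]) -> float:
    """Weighted combination of per-feature sigmoid signals.
    Centering points and weights are made-up illustrations."""
    # Each raw feature is centered and scaled, then squashed into (0, 1).
    # Low burstiness is an AI signal, hence the negative slope.
    signals = {
        "burstiness": sigmoid(-(features["burstiness"] - 0.5) * 10),
        "slop_density": sigmoid((features["slop_density"] - 0.05) * 40),
    }
    weights = {"burstiness": 0.6, "slop_density": 0.4}
    return sum(weights[k] * signals[k] for k in signals)
```

Plugging in values like the sample output above (burstiness 0.07, slop density 20.7%) drives both signals toward 1, which is why uniform, slop-heavy text scores as likely AI.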
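The weighted vocabulary matching behind model attribution can be sketched roughly as follows. The fingerprint lists are abbreviated from the table above, and the hit-counting and normalization scheme is an assumption, not lmscan's published algorithm:

```python
import re
from collections import Counter

# Abbreviated fingerprints from the table above; coverage is illustrative.
FINGERPRINTS = {
    "GPT-4 / ChatGPT": ["delve", "tapestry", "landscape", "multifaceted"],
    "Claude (Anthropic)": ["certainly", "i'd be happy to", "straightforward"],
    "Gemini (Google)": ["crucial", "here's a breakdown", "keep in mind"],
}

def attribute(text: str) -> list[tuple[str, float]]:
    """Score each model family by how many of its markers appear,
    then normalize the scores into a probability-like ranking."""
    lowered = text.lower()
    words = Counter(re.findall(r"[a-z']+", lowered))
    scores = {}
    for model, markers in FINGERPRINTS.items():
        hits = 0
        for marker in markers:
            if " " in marker:   # multi-word phrase: substring match
                hits += lowered.count(marker)
            else:               # single word: exact token count
                hits += words[marker]
        scores[model] = hits
    total = sum(scores.values()) or 1
    ranking = [(model, s / total) for model, s in scores.items()]
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)
```

For example, `attribute("Let's delve into the rich tapestry of today's digital landscape.")` ranks GPT-4 / ChatGPT first, mirroring the attribution section of the sample output.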
GitHub Actions:

```yaml
- name: AI Content Check
  run: |
    pip install lmscan
    lmscan submission.txt --threshold 0.7 --format json
```

pre-commit:

```yaml
repos:
  - repo: https://github.com/stef41/lmscan
    rev: v0.1.0
    hooks:
      - id: lmscan
        args: ["--threshold", "0.7"]
```

lmscan's approach is informed by published research on AI text detection:

- DetectGPT (Mitchell et al., 2023) — perturbation-based detection using log probability curvature
- GLTR (Gehrmann et al., 2019) — statistical visualization of token predictions
- Binoculars (Hans et al., 2024) — cross-model perplexity comparison
- Zipf's law in NLP — word frequency distributions differ between human and AI text
- Stylometry — decades of authorship attribution research applied to AI forensics

lmscan takes the statistical intuitions from these papers and implements them as lightweight, dependency-free heuristics that work without requiring a reference language model.

FAQ

Q: Is this as accurate as GPTZero?
A: GPTZero uses neural classifiers trained on labeled data. lmscan uses statistical heuristics. GPTZero is more accurate on edge cases; lmscan is free, offline, and explainable. Use both if accuracy matters.

Q: Can students use this to evade AI detection?
A: lmscan shows which features trigger detection, which could help someone understand why text reads as AI-generated. This is by design — understanding AI writing patterns makes everyone a better writer. The same information is available in published research papers.

Q: Does it work on non-English text?
A: Currently English-only. The slop word lists and transition word lists are English-specific. Statistical features (entropy, burstiness) work across languages but haven't been calibrated.

Q: Does it phone home?
A: No. Zero network requests. No telemetry. No API keys. Everything runs locally.

Q: How is model attribution possible without running the model?
A: Each LLM family has characteristic vocabulary biases. GPT-4 loves "delve" and "tapestry". Claude says "I'd be happy to". These are statistical fingerprints — not guaranteed attribution, but strong signals.

Related projects:

- reverse-SynthID — Reverse-engineering Google's image watermarking
- vibesafe — AI code safety scanner
- injectionguard — Prompt injection detection
- vibescore — Grade your vibe-coded project

License: Apache-2.0
This analysis was written by the Genesis Park editorial team with the help of AI. The original post is available via the source link.