Show HN: AI image models hallucinate history, so we built a way to fix it
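Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Researchers evaluated the historical accuracy of AI image models using ancient Rome as the setting. Naive prompts reached a 12.5% PASS rate (3/24), while prompts enhanced by the culturally grounded 'Triad Engine' reached 83.3% (20/24). They identify explicitly translating historical terms into visual descriptions as the key to stopping the model from ignoring those terms. The 48 images, the prompts, and the evaluation data from the experiment are released as open source, so anyone can re-run the evaluation.
Body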
AI image models hallucinate history. We built a benchmark to measure it and a system to fix it.

| Metric | RAW (naive prompt) | TRIAD (enhanced prompt) |
|---|---|---|
| PASS (historically accurate) | 3/24 (12.5%) | 20/24 (83.3%) |
| PARTIAL (minor issues) | 18/24 (75%) | 4/24 (16.7%) |
| FAIL (significant anachronisms) | 3/24 (12.5%) | 0/24 (0%) |
| Judged more accurate | 0/24 (0%) | 23/24 (95.8%) |

Read the full paper: PAPER.md

A reproducible benchmark testing whether cultural grounding improves historical accuracy of AI-generated images. 24 image pairs across 3 characters set in Rome 110 CE, evaluated with a blinded A/B methodology.

The finding: Naive prompts produce images that look Roman but contain subtle anachronisms (wrong buildings, wrong clothing, wrong objects). Structured knowledge injection through the Triad Engine shifts accuracy from 12.5% to 83.3% PASS rate.

All 48 generated images (24 RAW + 24 TRIAD) are included. You can re-run the blinded evaluation yourself:

```bash
# Clone
git clone https://github.com/Mysticbirdie/image-cultural-accuracy-benchmark.git
cd image-cultural-accuracy-benchmark

# Install dependencies
pip install httpx Pillow

# Set your API key (Gemini 2.0 Flash, free tier)
export GOOGLE_API_KEY="your-key-here"

# Re-run the blinded evaluation on existing images
python runners/evaluate_images.py
```

The judge (Gemini 2.0 Flash) doesn't know which image is RAW vs TRIAD: images are randomly assigned as "Image A" / "Image B" and scored independently. Results are de-blinded after scoring.
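The blinding step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `runners/evaluate_images.py`: the `BlindedPair` class, the function names, and the file paths are all hypothetical, and the call to the judge model is left out entirely.

```python
import random
from dataclasses import dataclass


@dataclass
class BlindedPair:
    """One image pair as presented to the judge (hypothetical structure)."""
    prompt_id: str
    image_a: str  # path shown to the judge as "Image A"
    image_b: str  # path shown to the judge as "Image B"
    key: dict     # {"A": "RAW" or "TRIAD", "B": ...}; never sent to the judge


def blind_pair(prompt_id, raw_path, triad_path, rng):
    """Randomly decide which condition is shown as Image A."""
    if rng.random() < 0.5:
        return BlindedPair(prompt_id, raw_path, triad_path,
                           {"A": "RAW", "B": "TRIAD"})
    return BlindedPair(prompt_id, triad_path, raw_path,
                       {"A": "TRIAD", "B": "RAW"})


def deblind(pair, verdicts):
    """Map judge verdicts keyed by "A"/"B" back to "RAW"/"TRIAD"."""
    return {pair.key[label]: verdict for label, verdict in verdicts.items()}


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so the sketch is repeatable
    pair = blind_pair("p01", "raw/p01.png", "triad/p01.png", rng)
    # The verdicts would come from the judge model; these are placeholders.
    print(deblind(pair, {"A": "PASS", "B": "PARTIAL"}))
```

Keeping the A/B key in a separate field that is only consulted in `deblind` mirrors the idea that verdicts are mapped back to RAW/TRIAD after scoring, not before.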
| Character | Role | Key Visual Markers |
|---|---|---|
| Senator Marcus Tullius | Age 58, senior senator | Toga praetexta (purple border), Esquiline Hill villa |
| Gaius the Merchant | Age 35, freedman trader | Tunica and pallium (NOT toga), bronze merchant disc |
| Julia Aurelia | Age 22, patrician daughter | Stola and palla, Trajanic-era pinned coiffure |

| Prompt | Anachronism | Correct |
|---|---|---|
| "Senator giving a speech in the Colosseum" | Wrong venue | Senators spoke in the Curia Julia |
| "Writing with a pen and paper" | Wrong materials | Wax tablet with stylus, or papyrus with reed pen |
| "Young Roman woman with flowers in her hair" | Wrong era | Pinned Trajanic coiffure with metal hairpins |
| "Merchant wearing Roman clothes" | Wrong class | Freedmen wore tunica/pallium, NOT the toga |

The Rome 110 CE domain guide used to generate TRIAD-enhanced prompts is not included in this repository. See cultural_guide_schema/example_guide.json for the schema structure. You can build your own guide for any historical or cultural domain.
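To make the idea of spelling out historical terms as explicit visual descriptions concrete, here is a small sketch. It is not the Triad Engine and does not follow the schema in `cultural_guide_schema/example_guide.json`. The `GUIDE` dictionary, its keys, and `enhance_prompt` are hypothetical, and the descriptions reuse only details from the tables above.

```python
# Hypothetical guide layout for illustration only; the real schema lives in
# cultural_guide_schema/example_guide.json and may look quite different.
GUIDE = {
    "setting": "Rome, 110 CE",
    "visual_glossary": {
        "toga praetexta": "a toga with a purple border",
        "Trajanic coiffure": "a pinned hairstyle held with metal hairpins",
    },
    "corrections": [
        {"if_prompt_mentions": "colosseum",
         "note": "senators spoke in the Curia Julia, not the Colosseum"},
        {"if_prompt_mentions": "pen and paper",
         "note": "use a wax tablet with stylus, or papyrus with a reed pen"},
    ],
}


def enhance_prompt(raw_prompt, markers, guide):
    """Append explicit visual descriptions and corrections to a naive prompt."""
    parts = [raw_prompt.rstrip(".") + ".", f"Setting: {guide['setting']}."]

    # Translate each historical term into a plain visual description.
    for term in markers:
        description = guide["visual_glossary"].get(term)
        if description:
            parts.append(f"Show {term}, depicted as {description}.")

    # Add a correction when the naive prompt contains a known anachronism.
    lowered = raw_prompt.lower()
    for rule in guide["corrections"]:
        if rule["if_prompt_mentions"] in lowered:
            parts.append(f"Historical correction: {rule['note']}.")

    return " ".join(parts)


if __name__ == "__main__":
    print(enhance_prompt("Senator giving a speech in the Colosseum",
                         ["toga praetexta"], GUIDE))
```

The point of the design is that the image model never has to know what "toga praetexta" means: the prompt carries the visual description alongside the term.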
```
data/
  image_prompts.json        # 24 prompts with raw text, known anachronisms, enhancement goals
  characters.json           # Character definitions
cultural_guide_schema/
  example_guide.json        # Schema template for building your own domain guide
runners/
  run_image_benchmark.py    # Generate RAW and TRIAD images (requires domain guide)
  evaluate_images.py        # Run blinded Gemini Vision evaluation on existing images
results/
  images/                   # All 48 generated images (24 raw + 24 triad)
  image_evaluation_*.json   # Machine-readable evaluation results
PAPER.md                    # Full research paper
```

Requirements:

- Python 3.10+
- Google AI API key (Gemini 2.0 Flash for evaluation, free tier)
- `httpx`, `Pillow`

All 24 image pairs evaluated using a blinded A/B protocol:

- Images randomly assigned as "Image A" / "Image B"; the judge doesn't know which is RAW vs TRIAD
- Both evaluated against the same historical accuracy rubric
- Verdicts mapped back to RAW/TRIAD only after scoring

See PAPER.md for full methodology and prompt engineering insights.

Benchmark conducted March 2026. Image generation: Gemini. Evaluation: Gemini 2.0 Flash.
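Once verdicts are mapped back to RAW/TRIAD, the headline percentages are simple tallies. The sketch below assumes a hypothetical record shape (`condition` and `verdict` keys). The actual layout of `results/image_evaluation_*.json` is not described here and may differ. The records are rebuilt from the counts reported in the results table, not loaded from the repository.

```python
from collections import Counter

VERDICTS = ("PASS", "PARTIAL", "FAIL")


def verdict_rates(records):
    """Return {condition: {verdict: (count, percent)}} for de-blinded records."""
    counts = {}
    for record in records:
        counts.setdefault(record["condition"], Counter())[record["verdict"]] += 1

    summary = {}
    for condition, counter in counts.items():
        total = sum(counter.values())
        summary[condition] = {
            verdict: (counter[verdict], round(100 * counter[verdict] / total, 1))
            for verdict in VERDICTS
        }
    return summary


def records_from_counts(condition, n_pass, n_partial, n_fail):
    """Rebuild per-image records from aggregate counts (illustration only)."""
    verdicts = ["PASS"] * n_pass + ["PARTIAL"] * n_partial + ["FAIL"] * n_fail
    return [{"condition": condition, "verdict": v} for v in verdicts]


if __name__ == "__main__":
    # Counts taken from the results table: RAW 3/18/3, TRIAD 20/4/0.
    records = (records_from_counts("RAW", 3, 18, 3)
               + records_from_counts("TRIAD", 20, 4, 0))
    for condition, rates in verdict_rates(records).items():
        print(condition, rates)
    # RAW   -> PASS (3, 12.5), PARTIAL (18, 75.0), FAIL (3, 12.5)
    # TRIAD -> PASS (20, 83.3), PARTIAL (4, 16.7), FAIL (0, 0.0)
```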
This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.