Show HN: Emotion Probes for Gemma 4 – Replicating Anthropic's Emotion Research
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Building on the methodology of Anthropic's emotion research on large language models (LLMs), this project builds tools to detect emotions that are expressed or suppressed within the Gemma 4 E4B model. It generates a synthetic dataset covering 171 emotions, comprising about 200,000 emotional stories and 239,000 emotion-deflection dialogues, then extracts the model's activations (roughly 6 hours on an A100 GPU) to compute linear emotion-detection probes. The resulting probes are served through a Flask web server that visualises emotion per token, and could be useful for monitoring AI alignment.

Full text
Extract emotion and emotion-deflection probes from large language models. Built from the methodology described in Anthropic's "Emotion Concepts and their Function in a Large Language Model" (Sofroniew et al., April 2026).

This project provides tools to:

- Generate synthetic datasets for 171 emotion concepts — emotional stories, neutral baselines, and emotion-deflection dialogues
- Extract residual stream activations from these datasets through a target model (Gemma 4 E4B)
- Compute emotion probes — linear directions in activation space that detect expressed and suppressed emotions
- Visualise emotion probe activations on arbitrary text

The generated datasets are available on HuggingFace: `ryancodrai/emotion-probes`

```python
from datasets import load_dataset

ds = load_dataset("ryancodrai/emotion-probes", data_files="expression/stories.parquet")
```

Repository layout:

```
agents/          Dataset generation
  story/             205k emotional stories (171 emotions × 100 topics × 12)
  neutral_story/     1.2k emotionally neutral stories (PCA baseline)
  neutral_dialogue/  1.2k neutral Person/AI dialogues (PCA baseline)
  deflection_story/  239k emotion-deflection dialogues
extraction/      Activation extraction & vector computation
  extract_story_activations.py
  extract_neutral_story_activations.py
  extract_neutral_dialogue_activations.py
  extract_deflection_activations.py
  compute_expression_vectors.py
  compute_deflection_vectors.py
visualise.py     Flask visualiser with expression/deflection modes
agent.py         Base agent class (pydantic-ai)
```

Install dependencies:

```shell
pip install pydantic-ai tenacity tqdm pandas torch transformers flask
```

From the repo root:

```shell
# Emotional stories (requires Gemini API key)
python -m agents.story.agent
# Neutral stories
python -m agents.neutral_story.agent
# Neutral dialogues
python -m agents.neutral_dialogue.agent
# Deflection pair selection (requires pre-computed expression vectors)
python agents/deflection_story/select_pairs.py
# Deflection dialogues
python -m agents.deflection_story.agent
```

Each agent skips existing files, so re-running is safe.
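A probe here is just a direction in residual-stream activation space, and scoring text with it is a per-token dot product. The following is a toy sketch of that operation, not code from the repository; the function name, array shapes, and random stand-in data are all illustrative assumptions.

```python
import numpy as np

def probe_scores(activations: np.ndarray, probe: np.ndarray) -> np.ndarray:
    """Per-token probe activations.

    activations: (n_tokens, d_model) residual-stream states at one layer.
    probe:       (d_model,) unit-normalised emotion direction.
    Returns a (n_tokens,) array of scores; larger means a stronger signal.
    """
    return activations @ probe

# Toy data standing in for real model activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))        # 5 tokens, 16-dim toy residual stream
probe = rng.normal(size=16)
probe /= np.linalg.norm(probe)         # probes are unit-normalised
scores = probe_scores(acts, probe)     # one score per token
```

Because the score is linear in the activation, doubling an activation doubles its score, which is what makes these probes cheap enough to evaluate per token in a live visualiser.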
Run on a GPU machine with the model loaded:

```shell
# Story activations (83GB output, ~6 hours on A100)
python extraction/extract_story_activations.py
# Neutral story activations (~30 seconds)
python extraction/extract_neutral_story_activations.py
# Neutral dialogue activations (~30 seconds)
python extraction/extract_neutral_dialogue_activations.py
# Deflection activations (~6 hours, supports checkpointing)
python extraction/extract_deflection_activations.py
```

Compute the probe vectors:

```shell
# Expression vectors (reads 83GB activations, applies PCA confound removal)
python extraction/compute_expression_vectors.py
# Deflection vectors (applies neutral dialogue PCA + expression-space orthogonalisation)
python extraction/compute_deflection_vectors.py
```

Launch the visualiser:

```shell
python visualise.py
```

Opens a Flask server on port 8080. Paste any text to see per-token emotion probe activations with:

- Expression and deflection probe modes
- Emotion groups (Fear, Anger, Sadness, Disgust, Surprise, Joy, Guilt, Shame + alignment-relevant groups)
- Checkbox multi-select for custom emotion combinations
- Drag-to-select token spans with ranked emotion analysis
- Layer selection (0–41)

For each of 171 emotions, we generate stories where a character experiences that emotion (without naming it). We extract residual stream activations from a target model, average across stories per emotion, subtract the global mean, project out confound directions from neutral text (PCA, 50% variance), and unit-normalise. The resulting vectors detect when an emotion is being openly expressed.

We generate dialogues where a character masks one emotion with another. We extract activations on the masking speaker's tokens, apply the same difference-of-means recipe, then additionally orthogonalise against the expression vector space (99% variance). The resulting vectors detect when an emotion is contextually present but being suppressed — a distinct signal from expression, and potentially useful for alignment monitoring.
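The expression-vector recipe above (per-emotion mean, global-mean subtraction, PCA confound removal from neutral text, unit-normalisation) can be sketched as follows. This is an illustrative reconstruction of the described method, not the repository's actual code; the function signature and array shapes are assumptions, and only the 50% variance threshold is taken from the text.

```python
import numpy as np

def expression_vector(emotion_acts, all_acts, neutral_acts, var_threshold=0.5):
    """Sketch of one emotion's expression probe.

    emotion_acts: (n_stories, d) activations for this emotion's stories.
    all_acts:     (n_total, d) activations across all emotions.
    neutral_acts: (n_neutral, d) activations on neutral baseline text.
    """
    # Difference of means: this emotion versus the global average.
    v = emotion_acts.mean(axis=0) - all_acts.mean(axis=0)

    # PCA on neutral activations (via SVD of the centred matrix); keep the
    # top components explaining var_threshold of the variance as confounds.
    centred = neutral_acts - neutral_acts.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    confounds = vt[:k]                       # (k, d), orthonormal rows

    # Project the confound subspace out of v, then unit-normalise.
    v = v - confounds.T @ (confounds @ v)
    return v / np.linalg.norm(v)

# Toy data standing in for real extracted activations.
rng = np.random.default_rng(1)
v = expression_vector(rng.normal(size=(10, 8)),
                      rng.normal(size=(30, 8)),
                      rng.normal(size=(12, 8)))
```

The deflection vectors would add one further step on top of this: an orthogonalisation of the result against the span of the expression vectors, which is what separates "suppressing an emotion" from "expressing it" as signals.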
Sofroniew, N., Kauvar, I., Saunders, W., Chen, R., et al. (2026). Emotion Concepts and their Function in a Large Language Model. Transformer Circuits Thread. https://transformer-circuits.pub/2026/emotions/index.html
This analysis was produced by the Genesis Park editorial team with the help of AI. The original post is available via the source link.