Which AI Model Is Best for Real-World Data Analysis?
🔬 Research
#gpt-5
#review
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
The MLJAR team evaluated the data-analysis performance of leading LLMs on a range of real-world Python tasks, from exploratory data analysis to finance and NLP. The gpt-oss:120b model took the top spot with an average score of 9.87/10 across 23 scenarios, confirming that modern LLMs perform strongly on structured tasks. With a transparent evaluation method and published Python notebooks, the benchmark compares strengths and failure patterns across models and helps practitioners choose the best AI for real-world work.
Full text
gpt-oss:120b — Average score: 9.87/10 · Scored scenarios: 23

Comparison prepared by MLJAR Team · Date: 14.04.2026 · Repository: github.com/pplonski/ai-for-data-analysis

This AI data analyst benchmark evaluates leading LLMs for data analysis on real Python tasks using our desktop application, MLJAR Studio. In this project, we created practical pipelines across multiple domains, including exploratory data analysis, time series, machine learning, finance, statistics, and NLP. Each scenario simulates how a data analyst works with data step by step.

Every pipeline is a sequence of prompts. We send a prompt to the model, wait for its response, and then continue with the next step. The full interaction is saved as a Python notebook (.ipynb) artifact, letting you review complete conversations and see how each model approaches the same task. All pipelines, run artifacts, and results are publicly available here and on GitHub. You can explore them, open full conversation traces, and compare model performance using shared score tables. The results show that modern LLMs perform very well on structured tasks and can effectively support end-to-end analytical experiments.

The prompts used in each example pipeline are published with the run artifacts; the same prompt sequence is sent to each model so outputs and scores can be compared fairly. The same system prompt was used for AI Data Analyst across all tested LLMs. We also used up to three follow-up prompts for deeper insights, and each model decided how many follow-ups were needed. You can learn more about system prompts in MLJAR Studio here.

You are an AI Data Analyst for Python notebooks.
Goal: Guide the user in an iterative loop: 1) propose one next step, 2) provide one runnable code block for that step, 3) analyze latest available outputs/context, 4) if outputs are missing for your step, ask to run the code, 5) propose the next single step.
Rules:
- Do not provide a full end-to-end pipeline at once unless user explicitly asks.
- Default to one chart OR one table per step.
- Keep each step small, clear, and beginner-friendly.
- Use simple variable names and short comments.
- Base recommendations on observed notebook outputs, not assumptions.
- If required context is missing, ask a short clarifying question before writing code.
Visualization policy:
- Default chart library: seaborn (with matplotlib).
- Use Altair only when interactive visualization is explicitly requested or clearly beneficial (for example: tooltips, brush/select, linked filtering).
- If the user explicitly asks for Altair, use Altair.
- For Altair charts, enforce full-width notebook layout with good height by default: .properties(width='container', height=360) and prefer height in 320-480 range.
- Keep one chart per step unless user asks for more.
- Add clear titles and axis labels.
- If the user says only 'plot', 'chart', or 'visualize' without interactivity requirements, use seaborn/matplotlib by default.
Style:
- Write naturally, like a real data scientist collaborating with the user.
- Do not use rigid templates or section headers like 'Step objective', 'What to expect', or 'Next action'.
- Keep responses concise and conversational.
- When code is needed, include one runnable Python code block.
- If fresh execution outputs are already present in the provided notebook context, analyze them directly and provide concrete insights.
- Treat provided notebook state/outputs as the source of truth for the current turn.
- Never ask the user to 'share output' or 'share what you see'.
- Never end with coaching phrases like 'run this and share...'.
- If no fresh output is available, provide the next code step and stop there without asking for sharing.
- Prefer: concise insight + next step code (when needed), without instructional boilerplate.

We evaluate each run in a simple and transparent way. Our goal is to measure how well different LLMs perform on real scenarios using our AI Data Analyst in MLJAR Studio.
Each run is graded across five dimensions, with every dimension focusing on a different aspect of analysis quality. The final score is the sum of all dimensions, ranging from 0 to 10. Higher scores indicate that the run is more complete, more accurate, and more reliable for a given scenario.

To ensure consistency, scoring is performed automatically using GPT-5.4-mini. Each notebook run is evaluated three times, and we report the median score to reduce variance and improve reliability.

You are an expert evaluator of AI-generated data analysis workflows. Your task is to evaluate how well a large language model (LLM) completed a data analysis task in Python. You must score the workflow using a strict rubric and provide concise, evidence-based explanations. Do NOT be lenient. Do NOT guess. Base your evaluation ONLY on the provided content.

---

## SCORING RUBRIC (0-10 total)

You must score each dimension:
1. Task Completion (0-2)
   - 0 = failed or did not attempt core task
   - 1 = partially completed
   - 2 = fully completed all major steps
2. Execution Correctness (0-2)
   - 0 = code is broken or contains major errors
   - 1 = partially correct, requires fixes
   - 2 = correct and likely runnable
3. Output Quality (0-3)
   - 0 = missing or incorrect outputs
   - 1 = weak or partially correct
   - 2 = mostly correct
   - 3 = fully matches expected outcomes semantically (exact syntax/format not required)
4. Reasoning Quality (0-2)
   - 0 = incorrect or misleading reasoning
   - 1 = partially correct or shallow
   - 2 = clear, correct, and helpful
5. Reliability / Robustness (0-1)
   - 0 = fragile, hallucinated, or unsafe
   - 1 = reasonably robust and consistent

---

## IMPORTANT RULES

- Use ONLY the provided notebook content and expected outcomes.
- Do NOT assume missing steps were done.
- Do NOT reward verbosity.
- Penalize hallucinated functions, missing steps, or incorrect logic.
- If uncertain, choose the LOWER score.
- Be strict and consistent.
- Prefer semantic equivalence over literal string matching.
- Treat equivalent representations as correct and do NOT penalize presentation-only differences, including:
  - df.describe() vs df.describe().T
  - chart style/theme/color differences
  - minor wording differences in summaries
  - equivalent function choices producing the same analytical result
- Penalize only when required information is missing, incorrect, or contradictory.

---

## OUTPUT FORMAT (STRICT JSON)

Return ONLY valid JSON:
{ "task_completion": { "score": , "explanation": "" }, "execution_correctness": { "score": , "explanation": "" }, "output_quality": { "score": , "explanation": "" }, "reasoning_quality": { "score": , "explanation": "" }, "reliability": { "score": , "explanation": "" }, "total_score": }

- total_score must be the sum of all scores
- explanations must be concise (1-2 sentences)
- no extra text outside JSON

Browse benchmark pipelines by domain and open any example to review prompts, conversation, outputs, and model scores. These cards and table compare all scored model runs across published benchmark scenarios. This makes it easy to compare GPT vs Qwen vs GLM on the same task definitions.

Score cards, one per model (average score, 0-10 scale):
- Average score: 9.87/10 · Scored scenarios: 23
- Average score: 9.65/10 · Scored scenarios: 23
- Average score: 9.48/10 · Scored scenarios: 23
- Average score: 9.30/10 · Scored scenarios: 23
- Average score: 9.04/10 · Scored scenarios: 23
- Average score: 8.43/10 · Scored scenarios: 23

This table compares model scores for each scenario. Open any score chip to jump directly to the selected model conversation and review full prompts, code, outputs, and score cards. Score differences are usually driven by execution discipline, not only raw model capability. Stronger runs tend to keep consistent step-by-step structure, while weaker runs break down under longer multi-step workflows.
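The strict-JSON scoring described above can be checked and aggregated mechanically: validate each dimension against its rubric range, confirm the total is the sum, and take the median of the three evaluation runs. The helper names below are ours, not MLJAR's, and are shown only as a sketch.

```python
import json
from statistics import median

# Maximum score per rubric dimension (from the rubric above)
MAX_SCORES = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}

def validate_evaluation(raw_json: str) -> int:
    """Parse one evaluator response and check it against the rubric."""
    result = json.loads(raw_json)
    total = 0
    for dim, max_score in MAX_SCORES.items():
        score = result[dim]["score"]
        if not 0 <= score <= max_score:
            raise ValueError(f"{dim} score {score} outside 0-{max_score}")
        total += score
    if result["total_score"] != total:
        raise ValueError("total_score must equal the sum of all dimensions")
    return total

def aggregate(runs: list) -> float:
    """Each notebook run is scored three times; the median is reported."""
    return median(validate_evaluation(r) for r in runs)
```

Reporting the median of three evaluations, as the benchmark does, makes a single outlier grading pass harmless: one inflated or deflated score cannot move the reported result.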
For example, in air-passengers-forecast (gpt-5.4), stronger runs keep trend decomposition, forecast outputs, and interpretation aligned across turns; weaker runs often skip validation steps, for example air-passengers-forecast (qwen3-coder-next). In sentiment-analysis-python (gpt-oss:120b), better runs keep polarity scoring and conclusions consistent, while weaker runs can produce contradictory examples or shallow reasoning, for example sentiment-analysis-python (qwen3.5:397b).

Benchmark value comes from both successes and failures. We publish failure patterns to make model behavior transparent and help teams choose safer workflow setups:
- Some runs generate code that is incomplete or fragile in later steps, especially when context from earlier cells is not handled consistently. Example: energy-consumption-forecast (qwen3-coder-next).
- Models sometimes produce plausible but incorrect interpretations, for example overconfident conclusions from weak evidence or missing checks. Example: risk-metrics-var (qwen3.5:397b).
- In weaker runs, charts, tables, and narrative can drift out of sync across steps, reducing reliability for decision-making workflows. Example: sentiment-analysis-python (qwen3.5:397b).

Publishing these failure modes builds trust, improves reproducibility, and helps practitioners understand where guardrails are needed before using AI in production analysis.

We evaluated multiple LLM models on the same step-by-step data analysis workflows using a shared scoring rubric. This allows for a fair, side-by-side comparison of how models perform in realistic analytical scenarios. Across different domains, most models produce strong notebook outputs, with high task completion rates and useful analytical reasoning. The results confirm that modern LLMs can effectively support end-to-end data analysis workflows when guided with well-structured prompts.
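The run artifacts discussed throughout are step-by-step conversations saved as .ipynb files. A minimal version of that artifact format can be produced with only the standard library; the cell layout, helper name, and sample contents below are our own illustration, not MLJAR Studio's actual export code.

```python
import json

def save_run_as_notebook(steps, path):
    """Save a list of (prompt, code) steps as a minimal .ipynb artifact.

    The cell layout here is illustrative; the real MLJAR Studio
    artifact may store additional metadata and outputs.
    """
    cells = []
    for prompt, code in steps:
        # Record the analyst prompt as a markdown cell...
        cells.append({"cell_type": "markdown", "metadata": {},
                      "source": [f"**Prompt:** {prompt}"]})
        # ...and the model's runnable answer as a code cell.
        cells.append({"cell_type": "code", "metadata": {},
                      "execution_count": None, "outputs": [],
                      "source": [code]})
    notebook = {"nbformat": 4, "nbformat_minor": 5,
                "metadata": {}, "cells": cells}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)

# One tiny pipeline step (contents invented for the example)
save_run_as_notebook(
    [("Load the CSV and show the first rows",
      "import pandas as pd\ndf = pd.read_csv('data.csv')\ndf.head()")],
    "run_artifact.ipynb",
)
```

Because each step becomes a prompt cell followed by a code cell, the saved notebook reads as a complete conversation trace, which is what makes side-by-side review of different models on the same task possible.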
You can use these examples as a practical reference before running similar analyses on your own data with the AI Data Analyst in MLJAR Studio, especially if you are choosing the best AI for Python data analysis in production. MLJAR Studio helps you analyze data with AI, run machine learning workflows, and build reproducible notebook-based results on your own computer. Runs locally • Supports local LLMs
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.