# Show HN: Forge-Core released on GitHub, Parse JSON in your data warehouse
Automatically decompose nested JSON in your data warehouse into normalized dbt models.

Forge Core is a deterministic BFS engine that reads a single JSON column (or multi-column table), discovers all nested structures, and generates:

- **dbt SQL models** — one per nested object/array
- **Rollup view** — reassembles the full document from the normalized tables
- **schema.yml** — structural column inventory
- **JSON Schema** — standard draft-07 schema of the discovered structure
- **Mermaid ER diagram** — table relationship visualization
- **dbt docs** — browsable documentation site

## Supported warehouses

| Warehouse | Install Extra | Status |
|---|---|---|
| BigQuery | `foxtrotcommunications-forge-core[bigquery]` | ✅ Production |
| Snowflake | `foxtrotcommunications-forge-core[snowflake]` | ✅ Production |
| Databricks | `foxtrotcommunications-forge-core[databricks]` | ✅ Production |
| Redshift | `foxtrotcommunications-forge-core[redshift]` | 🚧 Beta |

## Quick start

```bash
pip install foxtrotcommunications-forge-core[bigquery]

forge-core build \
  --source-type bigquery \
  --source-project my-gcp-project \
  --source-database my_dataset \
  --source-table my_json_table \
  --target-dataset my_target
```

Or use the Python API:

```python
from forge_core import build_core

result = build_core(
    source_type="bigquery",
    source_project="my-gcp-project",
    source_database="my_dataset",
    source_table_name="my_json_table",
    target_dataset="my_target",
)

print(f"Created {result.total_models_created} models")
print(f"Processed {result.total_rows_processed} rows")
```

## Logging

Forge Core uses Python's standard `logging` module. By default nothing is printed — add this before your `build_core()` call to stream progress to the console:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(message)s",
    datefmt="%H:%M:%S",
)
logging.getLogger("forge_core").setLevel(logging.INFO)
```

This works in Jupyter notebooks, plain scripts, Airflow (which routes it through its own handler automatically), and any CI/CD environment that captures stdout.

## How it works

```text
┌─────────────────────────────┐
│ Source Table (JSON column)  │
└─────────────┬───────────────┘
              │
              ▼
┌─────────────────────────────┐
│ 1. Root Model (frg)         │  Parse JSON → root SELECT
└─────────────┬───────────────┘
              │
              ▼
┌─────────────────────────────┐
│ 2. BFS Discovery Loop       │  For each level:
│    - Discover keys          │  • get_keys() → field names
│    - Infer types            │  • get_types() → STRUCT/ARRAY/scalar
│    - Generate SQL model     │  • create_file_in_models()
│    - dbt build              │  • run_dbt_command()
│    - Tag as excluded        │  • tag_models_as_excluded()
│    - Queue children         │  • next_batch.extend()
└─────────────┬───────────────┘
              │
              ▼
┌─────────────────────────────┐
│ 3. Rollup View (frg__rollup)│  JOIN all tables back into
└─────────────┬───────────────┘  nested STRUCT/ARRAY form
              │
              ▼
┌─────────────────────────────┐
│ 4. Artifacts                │  schema.yml, JSON Schema,
└─────────────────────────────┘  Mermaid diagram, dbt docs
```
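To make the BFS pass concrete, here is a minimal sketch of the level-by-level traversal over an in-memory document. It is illustrative only, not the engine's actual internals: the real tool runs SQL against the warehouse at each step, and `bfs_discover` is a hypothetical name.

```python
from collections import deque

def bfs_discover(root_fields: dict) -> list[str]:
    """Walk a nested JSON document level by level, emitting one 'model'
    path per nested object/array, as the diagram above describes."""
    queue = deque([("root", root_fields)])
    models = []
    while queue:
        next_batch = []  # children discovered at this level
        for path, fields in queue:
            models.append(path)  # stand-in for generating + building a dbt model
            for name, value in fields.items():
                if isinstance(value, dict):
                    # Nested object → child model at the next level.
                    next_batch.append((f"{path}__{name}", value))
                elif isinstance(value, list) and value and isinstance(value[0], dict):
                    # Array of objects → child model, one row per element.
                    next_batch.append((f"{path}__{name}", value[0]))
        queue = deque(next_batch)
    return models

doc = {"patient_id": 1, "experiments": [{"name": "a", "team": [{"role": "pi"}]}]}
print(bfs_discover(doc))  # ['root', 'root__experiments', 'root__experiments__team']
```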
## Authentication

Forge Core uses standard warehouse authentication:

- **BigQuery**: Application Default Credentials (`gcloud auth application-default login`) or `GOOGLE_APPLICATION_CREDENTIALS`
- **Snowflake**: `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PRIVATE_KEY_PATH`, etc.
- **Databricks**: `DATABRICKS_SERVER_HOSTNAME`, `DATABRICKS_HTTP_PATH`, `DATABRICKS_ACCESS_TOKEN`
- **Redshift**: `REDSHIFT_HOST`, `REDSHIFT_USER`, `REDSHIFT_PASSWORD`, `REDSHIFT_DATABASE`

## Project layout

After a build, your project directory looks like:

```text
forge_project/
├── dbt_project.yml
├── profiles.yml                          # Auto-generated
├── macros/
│   └── incremental_tmp_table_dropper.sql
├── models/
│   ├── frg.sql                           # Root model
│   ├── frg__root__....sql                # Unnested models (one per level)
│   ├── frg__rollup.sql                   # Rollup view
│   └── schema.yml                        # Column inventory
└── target/
    ├── schema.json                       # JSON Schema
    ├── schema.mmd                        # Mermaid diagram
    └── index.html                        # dbt docs
```

## Airflow

```python
# Airflow PythonOperator
from forge_core import build_core

def forge_task(**context):
    result = build_core(
        source_type="bigquery",
        source_project="my-project",
        source_database="raw",
        source_table_name="api_responses",
        target_dataset="normalized",
        project_dir="/tmp/forge_project",
    )
    return result.total_models_created
```

## System columns

Every table generated by Forge Core contains these system columns:

| Column | Type | Description |
|---|---|---|
| `ingestion_hash` | STRING | Hash of the source row. Groups all decomposed tables that came from the same original JSON document. |
| `idx` | STRING | Composite positional key. Encodes the exact path through nested arrays to reach this row. |
| `ingestion_timestamp` | TIMESTAMP | When the row was ingested. |
| `table_path` | STRING | Hierarchical path describing the nesting lineage (e.g., `root__experiments__team`). |

## The `idx` column

The `idx` column is a `_`-delimited string that grows one segment per nesting level:

```text
Depth 0 (root):        idx = "1"
Depth 1 (child):       idx = "1_2"      ← root row 1, child element 2
Depth 2 (grandchild):  idx = "1_2_3"    ← root row 1, child 2, grandchild 3
Depth 3 (great-grand): idx = "1_2_3_1"  ← root row 1, child 2, grandchild 3, great-grandchild 1
```

Each segment represents the array position at that nesting level. This means:

- Every child row carries its full ancestry in `idx`.
- To find a child's parent, strip the last segment.
- To join a parent to its children, match on every segment position of the parent's `idx`.

The rule: for each segment in the parent's `idx`, add one equality condition comparing that segment position in parent and child. A parent at depth d has d + 1 segments, so a parent whose `idx` has N segments expands to N index conditions.
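Because the rule is mechanical, the predicates can be generated rather than hand-written. A small illustration follows; these helpers are not part of the package, the names are hypothetical, and the emitted SQL syntax assumes BigQuery:

```python
def parent_idx(idx: str) -> str:
    """Strip the last segment to recover the parent's idx: '1_2_3' -> '1_2'."""
    return idx.rsplit("_", 1)[0]

def join_conditions(parent: str, child: str, n_segments: int) -> str:
    """Emit the hash condition plus one index condition per parent idx segment."""
    conds = [f"{parent}.ingestion_hash = {child}.ingestion_hash"]
    for i in range(n_segments):
        conds.append(
            f"SPLIT({parent}.idx, '_')[OFFSET({i})] = "
            f"SPLIT({child}.idx, '_')[OFFSET({i})]"
        )
    return "\n  AND ".join(conds)

print(parent_idx("1_2_3"))           # 1_2
print(join_conditions("e", "t", 2))  # matches the depth 1 → 2 example below
```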
## Joining parent and child tables

### BigQuery

```sql
-- Depth 0 → 1: root (idx = "1") → experiments (idx = "1_2")
-- Parent idx has 1 segment → 1 index condition
SELECT r.*, e.experiment_name, e.experiment_status
FROM `project.dataset.frg__root` r
JOIN `project.dataset.frg__root__expe1` e
  ON r.ingestion_hash = e.ingestion_hash
 AND SPLIT(r.idx, '_')[OFFSET(0)] = SPLIT(e.idx, '_')[OFFSET(0)]

-- Depth 1 → 2: experiments (idx = "1_2") → team (idx = "1_2_3")
-- Parent idx has 2 segments → 2 index conditions
SELECT e.*, t.team_name, t.team_role
FROM `project.dataset.frg__root__expe1` e
JOIN `project.dataset.frg__root__expe1__team1` t
  ON e.ingestion_hash = t.ingestion_hash
 AND SPLIT(e.idx, '_')[OFFSET(0)] = SPLIT(t.idx, '_')[OFFSET(0)]
 AND SPLIT(e.idx, '_')[OFFSET(1)] = SPLIT(t.idx, '_')[OFFSET(1)]

-- Depth 2 → 3: team (idx = "1_2_3") → lab_results (idx = "1_2_3_1")
-- Parent idx has 3 segments → 3 index conditions
SELECT t.*, l.lab_name, l.result_value
FROM `project.dataset.frg__root__expe1__team1` t
JOIN `project.dataset.frg__root__expe1__team1__lab_1` l
  ON t.ingestion_hash = l.ingestion_hash
 AND SPLIT(t.idx, '_')[OFFSET(0)] = SPLIT(l.idx, '_')[OFFSET(0)]
 AND SPLIT(t.idx, '_')[OFFSET(1)] = SPLIT(l.idx, '_')[OFFSET(1)]
 AND SPLIT(t.idx, '_')[OFFSET(2)] = SPLIT(l.idx, '_')[OFFSET(2)]

-- Three-level join: root → experiments → team
SELECT r.patient_id, e.experiment_name, t.team_name
FROM `project.dataset.frg__root` r
JOIN `project.dataset.frg__root__expe1` e
  ON r.ingestion_hash = e.ingestion_hash
 AND SPLIT(r.idx, '_')[OFFSET(0)] = SPLIT(e.idx, '_')[OFFSET(0)]
JOIN `project.dataset.frg__root__expe1__team1` t
  ON e.ingestion_hash = t.ingestion_hash
 AND SPLIT(e.idx, '_')[OFFSET(0)] = SPLIT(t.idx, '_')[OFFSET(0)]
 AND SPLIT(e.idx, '_')[OFFSET(1)] = SPLIT(t.idx, '_')[OFFSET(1)]
```

### Snowflake

```sql
-- Depth 0 → 1: root → experiments (1 condition)
SELECT r.*, e."experiment_name"
FROM "DATASET"."FRG__ROOT" r
JOIN "DATASET"."FRG__ROOT__EXPE1" e
  ON r."ingestion_hash" = e."ingestion_hash"
 AND SPLIT_PART(r."idx", '_', 1) = SPLIT_PART(e."idx", '_', 1)

-- Depth 1 → 2: experiments → team (2 conditions)
SELECT e.*, t."team_name"
FROM "DATASET"."FRG__ROOT__EXPE1" e
JOIN "DATASET"."FRG__ROOT__EXPE1__TEAM1" t
  ON e."ingestion_hash" = t."ingestion_hash"
 AND SPLIT_PART(e."idx", '_', 1) = SPLIT_PART(t."idx", '_', 1)
 AND SPLIT_PART(e."idx", '_', 2) = SPLIT_PART(t."idx", '_', 2)
```

### The general pattern

For a parent whose `idx` has N segments, joining to a child one level deeper, expand N index conditions — one per segment of the parent's `idx`:

```text
parent.ingestion_hash = child.ingestion_hash
AND SPLIT(parent.idx, '_')[OFFSET(0)] = SPLIT(child.idx, '_')[OFFSET(0)]
AND SPLIT(parent.idx, '_')[OFFSET(1)] = SPLIT(child.idx, '_')[OFFSET(1)]
...
AND SPLIT(parent.idx, '_')[OFFSET(N-1)] = SPLIT(child.idx, '_')[OFFSET(N-1)]
```

The child always has one more segment than the parent — that final segment is the child's own position within the parent array.

## Table naming

Table names encode the nesting path with truncated field names:

```text
frg__root                        ← root extraction
frg__root__expe1                 ← root.experiments (truncated to 4 chars + counter)
frg__root__expe1__team1          ← root.experiments[].team
frg__root__expe1__team1__lab_1   ← root.experiments[].team[].lab_results
frg__root__hosp1__staf1__nurs1   ← root.hospital[].staff[].nurses
```

## Rollup view

The `frg__rollup` view automatically reassembles all normalized tables back into nested STRUCT/ARRAY form — reconstructing the original JSON shape as queryable warehouse-native types. Use it when you want the full document without manual joins.
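Once built, the rollup view is queried like any other relation. A minimal sketch, assuming BigQuery, the placeholder project/dataset names used above, and the standard `google-cloud-bigquery` client:

```python
# Illustrative only: read the reassembled documents back with the standard
# BigQuery client; nested ARRAY/STRUCT values arrive as Python lists/dicts.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
rows = client.query(
    "SELECT * FROM `my-gcp-project.my_target.frg__rollup` LIMIT 10"
).result()

for row in rows:
    print(dict(row))
```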
## License

Apache 2.0