Full algorithm chain I used to implement a sigma rules engine in Linux Kernel

hackernews | 2026년 4월 19일 21:15 | 📦 오픈소스

#tip

원문 출처: hackernews · Genesis Park에서 요약 및 분석

요약

owLSM는 Sigma 스타일의 YAML 규칙을 커널 공간에서 평가하기 위해 JSON으로 변환하고 BPF 맵을 활용합니다. 사용자 공간은 규칙을 토큰화하고 효율적인 평가를 위해 필요한 DFA를 사전 계산하며, 커널은 이벤트 유형별 맵과 캐시를 활용해 우선순위에 따라 규칙을 평가합니다. 이러한 과정은 eBPF 환경의 제약 조건에서 빠른 규칙 처리를 최적화하는 데 중점을 둡니다.

본문

This page explains how owLSM rules flow from human-readable YAML files to kernel-space evaluation. The rule lifecycle consists of three stages: ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ YAML Rules │ │ JSON Config │ │ BPF Maps │ │ (Sigma-like) │ ───▶ | (Serialized) │ ───▶ │ (Kernel) │ └─────────────────┘ └─────────────────┘ └─────────────────┘ - Rules Generator (Python) — Compiles YAML rules to JSON - Userspace (C++) — Loads JSON and populates BPF maps - Kernel (eBPF) — Evaluates rules against runtime events The Rules Generator (Rules/RulesGenerator/ ) converts Sigma-like YAML rules into a serialized JSON format. ┌──────────────────────────────────────────────────────────────────────────┐ │ Rules Generator Pipeline │ ├──────────────────────────────────────────────────────────────────────────┤ │ ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌───────────┐ │ │ │ YAML │───▶ │ Validate │───▶ │ Parse │───▶ │ Convert to │ | │ │ Files │ │ Schema │ │ AST │ │ Postfix │ │ │ └─────────┘ └──────────┘ └─────────┘ └───────────┘ │ │ │ │ │ │ ▼ ▼ │ │ ┌────────────────────────────┐ │ │ │ Output Tables │ │ │ │ • id_to_string │ │ │ │ • id_to_ip │ │ │ │ • id_to_predicate │ │ │ │ • rules[] │ │ │ └────────────────────────────┘ │ └──────────────────────────────────────────────────────────────────────────┘ Each YAML file is validated against the owLSM rule schema. Part of the things that are checked: Required fields, Optional fields, Selection Field, field values, modifiers, etc. Using pySigma, the detection logic is walked as an Abstract Syntax Tree (AST). During this traversal, three tables are built: Every string value in the rule is assigned a unique ID and stored: id_to_string = { 0: { value: "/etc/passwd", string_type: DEFAULT }, 1: { value: ".ssh", string_type: CONTAINS }, 2: { value: "curl", string_type: DEFAULT }, 3: { value: "[a-z]+\\.conf", string_type: REGEX } } The string_type field indicates how the string is used: DEFAULT for exact/startswith/endswith comparisons, CONTAINS for substring matching (requires a KMP DFA later), or REGEX for regex pattern matching (requires a regex DFA later). IP addresses and CIDR ranges are stored separately: id_to_ip = { 0: { ip: "10.0.0.0", cidr: 8, ip_type: "ipv4" }, 1: { ip: "2001:db8::", cidr: 32, ip_type: "ipv6" } } Each comparison in the rule becomes a predicate: id_to_predicate = { 0: { field: "TARGET_FILE_PATH", comparison_type: "CONTAINS", string_idx: 1 }, 1: { field: "PROCESS_FILENAME", comparison_type: "EXACT_MATCH", string_idx: 2 }, 2: { field: "NETWORK_DST_PORT", comparison_type: "EQUAL", numerical_value: 443 } } A predicate has three main members: | Member | Description | |---|---| field | The event field to compare (e.g., TARGET_FILE_PATH , PROCESS_EUID ) | comparison_type | How to compare: EXACT_MATCH , CONTAINS , STARTS_WITH , GT , CIDR , etc. | string_idx / numerical_value | For strings/IPs: index into lookup table. For numbers: the actual value | Each rule’s detection logic is converted to postfix notation (Reverse Polish Notation) for efficient stack-based evaluation. Postfix eliminates the need for parentheses and operator precedence rules. It can be evaluated with a simple stack machine, which is ideal for the constrained eBPF environment. Each element in the postfix array is a token: | Member | Description | |---|---| operator_type | One of: PREDICATE , AND , OR , NOT | predicate_idx | Index into id_to_predicate (only set when operator_type is PREDICATE ) | Rule: target.file.path|contains: ".ssh" AND process.file.filename: "curl" AND / \ PRED(0) PRED(1) Postfix tokens: [ PRED(0), PRED(1), AND ] Where: PRED(0) → id_to_predicate[0] → { field: TARGET_FILE_PATH, contains, string_idx: 1 } PRED(1) → id_to_predicate[1] → { field: PROCESS_FILENAME, exact, string_idx: 2 } Keywords are field-less searches that match against all string fields for an event type. keywords: - "malware" This expands to multiple predicates — one for each string field the event type has: For WRITE event: keywords: "malware" Expands to: target.file.path|contains: "malware" OR target.file.filename|contains: "malware" OR process.cmd|contains: "malware" OR process.file.path|contains: "malware" OR ... Important: When a rule uses keywords AND specifies multiple event types, the rule is split into separate rules (one per event type). This is because different event types have different string fields. All the rules have the same id, but may have different string fields, thus different tokens. After processing all rules, the generator outputs: { "id_to_string": { "0": { "value": "/etc/passwd", "string_type": 0 }, ... }, "id_to_ip": { "0": { "ip": 8.8.8.0, "cidr": 8 }, ... }, "id_to_predicate": { "0": { "field": TARGET_FILE_PATH, "comparison_type": startswith, "string_idx": 1, "numerical_value": -1 }, "1": { "field": PROCESS_UID, "comparison_type": EQUAL, "string_idx": -1, "numerical_value": 1000 }, ... }, "rules": [ { "id": 1, "description": "Block curl from reading SSH keys", "action": "BLOCK_EVENT", "applied_events": ["READ"], "min_version": "1.0.0", "max_version": "2.0.0", "tokens": [ { "operator_type": PREDICATE, "predicate_idx": 0 }, { "operator_type": PREDICATE, "predicate_idx": 1 }, { "operator_type": AND } ] }, ... ] } This JSON is embedded into the config.json file passed to owLSM at runtime. The userspace component (src/Userspace/ ) parses the config and populates kernel BPF maps. | Map | Description | |---|---| {event}_rules_map | Per-event-type rule arrays. CHMOD events only check chmod_rules_map , and EXEC events only check exec_rules_map | predicates_map | Kernel representation of id_to_predicate | rules_strings_map | Kernel representation of id_to_string. Strings are structs with char arrays, length and DFA index | rules_ips_map | Kernel representation of id_to_ip | idx_to_DFA_map | Pre-computed DFAs for contains (KMP) and regex strings | idx_to_accepting_states_map | Accepting state bitmasks for regex DFAs | predicates_results_cache | Caches predicate results within an event evaluation | Userspace pre-computes DFAs (Deterministic Finite Automata) for strings that require pattern matching: contains strings use a KMP (Knuth-Morris-Pratt) DFA for O(n) substring matching.regex strings go through a full regex-to-DFA pipeline: regex → AST → NFA (Thompson’s construction) → unanchored skip-loop → DFA (subset construction) → minimization → normalization. The resulting DFA is limited to 32 states. Accepting states are stored as a bitmask inidx_to_accepting_states_map . Both DFA types are stored in idx_to_DFA_map and referenced by entries in rules_strings_map . Userspace organizes rules into per-event-type maps: // A rule with applied_events: [CHMOD, READ, WRITE] // is inserted into all three maps: chmod_rules_map → rule read_rules_map → rule write_rules_map → rule This ensures that when a CHMOD event occurs, only CHMOD-relevant rules are evaluated. When an event occurs, the eBPF program evaluates rules using the populated maps. ───────────────────────────────────────────────────────────────────────── Kernel Rule Evaluation ───────────────────────────────────────────────────────────────────────── Event Occurs (e.g., READ syscall) │ ▼ ┌─────────────────────────────────────┐ │ 1. Populate event_t struct │ │ - process info │ │ - target file info │ │ - event-specific data │ └──────────────────┬──────────────────┘ ▼ ┌─────────────────────────────────────┐ │ 2. Iterate read_rules_map │ │ using bpf_for_each_map_elem │ └──────────────────┬──────────────────┘ ▼ ┌─────────────────────────────────────┐ │ 3. For each rule: │ │ - Push tokens to stack │ │ - Evaluate postfix expression │ │ - If match → return action │ └──────────────────┬──────────────────┘ ▼ ┌─────────────────────────────────────┐ │ 4. Apply action │ │ - ALLOW_EVENT │ │ - BLOCK_EVENT │ │ - BLOCK_KILL_PROCESS │ │ - etc. │ └─────────────────────────────────────┘ The kernel uses a 2 stacks-based algorithm to evaluate postfix expressions. When evaluating a predicate token: - Lookup predicate from predicates_map[token.predicate_idx] - Check cache — If this predicate was already evaluated for this event, return cached result from predicates_results_cache . Its very likely that different rules (or even the same rule) will have identical predicates. For example, many rules will have:target.file.path: "/etc/hosts" . So using the cache we don’t evaluate this predicate twice for the same event. - Compare field — Use the predicate’s field and comparison_type to compare against the event: - String fields: exact match, contains (via KMP DFA), starts_with, ends_with, regex (via regex DFA) - Numeric fields: equal, greater than, less than, etc. - IP fields: CIDR matching - Cache result — Store in predicates_results_cache for potential reuse Rules are evaluated in order of their id (lowest first). Evaluation stops at the first matching rule: Rules (sorted by id): 1: Block SSH key access 5: Allow all reads from /usr 10: Block all reads If rule 1 matches → BLOCK_EVENT, rules 5 and 10 are NOT evaluated This differs from traditional SIEM systems that process all rules. The first-match approach is more efficient. Crucial for inline syscall evaluation. The design optimizes for kernel performance: - Postfix notation enables simple stack evaluation - Lookup tables minimize data duplication - Per-event-type maps reduce unnecessary rule checks - Predicate caching avoids redundant evaluations - Pre-computed DFAs enable O(n) string matching (substring & regex)

원문 보기 (hackernews)

Genesis Park 편집팀이 AI를 활용하여 작성한 분석입니다. 원문은 출처 링크를 통해 확인할 수 있습니다.

요약

본문

관련 저널 읽기