syzbot
| ID | Workflow | Result | Correct | Bug | Created | Started | Finished | Revision | Error |
|---|---|---|---|---|---|---|---|---|---|
| d91b896e-fffc-4021-99f6-4cd915913efc | assessment-kcsan | Benign: ❌ Confident: ✅ | ❓ | KCSAN: data-race in bcmp / parse_build_id | 2026/02/25 02:53 | 2026/02/25 02:53 | 2026/02/25 03:09 | 305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9 | |
The data race occurs in `__bpf_get_stackid` within `kernel/bpf/stackmap.c`. The race is between a `memcmp` (or `bcmp`) operation reading from a stack map bucket and another task writing to that same bucket after it has been returned to the freelist and reused.
### Analysis
In `__bpf_get_stackid`, the code attempts to find if a stack trace is already present in the map by comparing the current stack trace with the one stored in the hash bucket:
```c
bucket = READ_ONCE(smap->buckets[id]);
...
if (hash_matches && bucket->nr == trace_nr &&
    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
        pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
        return id;
}
```
The `bucket` pointer is obtained without any locking or RCU protection that would ensure the bucket's contents remain unchanged. In a high-load scenario, the following sequence can occur:
1. **Task A** reads a pointer to bucket `X` from `smap->buckets[id]`.
2. **Task B** replaces bucket `X` in the map with a new bucket `Y` (using `xchg`) and pushes `X` back onto the `pcpu_freelist`.
3. **Task C** pops bucket `X` from the freelist to use it as its own `new_bucket` and begins writing new stack trace data (including build IDs) into it via `stack_map_get_build_id_offset` and `parse_build_id`.
4. **Task A** is still executing `memcmp` on bucket `X`. It reads the data in `X` while **Task C** is writing to it.
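To make this window concrete, here is a minimal userspace model of steps 3 and 4 (a hypothetical sketch, not kernel code; the buffer size, iteration count, and names are illustrative). One thread overwrites a shared buffer the way Task C does via `parse_build_id`, while another snapshots it with no synchronization, the way Task A's `memcmp` does:

```c
/* Build with: gcc -O2 -pthread race_model.c
 * NOTE: this deliberately contains a data race to model the bug. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define DATA_LEN 64          /* stands in for trace_len */

static unsigned char bucket[DATA_LEN];   /* stands in for bucket->data */
static atomic_bool stop;

/* Task C: reuses the bucket popped from the freelist and overwrites it. */
static void *writer(void *arg)
{
    unsigned char v = 0;
    (void)arg;
    while (!atomic_load(&stop))
        memset(bucket, ++v, DATA_LEN);   /* models parse_build_id() writes */
    return NULL;
}

int main(void)
{
    pthread_t t;
    long torn = 0, iters = 1000000;

    pthread_create(&t, NULL, writer, NULL);
    /* Task A: reads the bucket with no locking, as memcmp() does in
     * __bpf_get_stackid(). A consistent snapshot has all bytes equal;
     * a torn snapshot mixes the old and new value. */
    for (long i = 0; i < iters; i++) {
        unsigned char snap[DATA_LEN];

        memcpy(snap, bucket, DATA_LEN);
        for (int j = 1; j < DATA_LEN; j++) {
            if (snap[j] != snap[0]) {
                torn++;
                break;
            }
        }
    }
    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    printf("torn snapshots: %ld of %ld\n", torn, iters);
    return 0;
}
```

On multicore hardware the torn count is typically nonzero; any such mixed state is exactly what the kernel's `memcmp` may consume, so its result during the window is meaningless.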
### Impact
This race is **harmful** because it compromises the correctness of the `bpf_get_stackid` helper. If `memcmp` incorrectly returns 0 due to the race, the helper returns an ID that points to a different stack trace (the one Task B put in the map). This results in "data corruption" where the stack ID returned to the BPF program is associated with the wrong stack trace.
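The misattribution does not even require a torn read; the replace-then-return window alone is enough. A deterministic, single-threaded sketch of that interleaving (hypothetical types and names, not the kernel's):

```c
#include <stdatomic.h>
#include <stdio.h>

struct bucket { int trace_id; };   /* stand-in for struct stack_map_bucket */

int main(void)
{
    struct bucket X = { .trace_id = 1 }, Y = { .trace_id = 2 };
    _Atomic(struct bucket *) slot = &X;     /* models smap->buckets[id] */

    /* Task A: loads the bucket pointer, as READ_ONCE(smap->buckets[id]). */
    struct bucket *a_view = atomic_load(&slot);

    /* Task B: atomically installs a different bucket, as xchg() does,
     * then pushes X onto the freelist for reuse. */
    atomic_exchange(&slot, &Y);

    /* Task A: its comparison against a_view (bucket X) matched, so it
     * returns id -- but a lookup through the slot now yields Y's trace. */
    printf("A matched trace %d; the map now holds trace %d under that id\n",
           a_view->trace_id, atomic_load(&slot)->trace_id);
    return 0;
}
```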
While the race does not cause a kernel crash (as the memory for buckets is preallocated and always valid), it leads to non-deterministic and incorrect behavior in BPF-based tools. Such tools often rely on these IDs for critical tasks like performance profiling, error tracking, or security auditing. The failure to protect the bucket contents during comparison is a structural flaw in the stackmap implementation.
### Conclusion
The data race is non-benign as it leads to inconsistent data and incorrect results from a kernel helper, violating the expected stability and correctness of the BPF subsystem.

```
==================================================================
BUG: KCSAN: data-race in bcmp / parse_build_id

write to 0xffffc9000f831cf4 of 20 bytes by task 18187 on cpu 0:
 parse_build_id+0x249/0x2b0 lib/buildid.c:184
 get_build_id_64 lib/buildid.c:263 [inline]
 __build_id_parse+0x253/0x4a0 lib/buildid.c:308
 build_id_parse_nofault+0x27/0x40 lib/buildid.c:327
 fetch_build_id kernel/bpf/stackmap.c:152 [inline]
 stack_map_get_build_id_offset+0x226/0x530 kernel/bpf/stackmap.c:197
 __bpf_get_stackid+0x60b/0x7d0 kernel/bpf/stackmap.c:289
 ____bpf_get_stackid kernel/bpf/stackmap.c:344 [inline]
 bpf_get_stackid+0xf3/0x130 kernel/bpf/stackmap.c:323
 ____bpf_get_stackid_raw_tp kernel/trace/bpf_trace.c:1628 [inline]
 bpf_get_stackid_raw_tp+0xf6/0x120 kernel/trace/bpf_trace.c:1617
 bpf_prog_ff69ba342f4eb966+0x2a/0x32
 bpf_dispatcher_nop_func include/linux/bpf.h:1378 [inline]
 __bpf_prog_run include/linux/filter.h:723 [inline]
 bpf_prog_run include/linux/filter.h:730 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2075 [inline]
 bpf_trace_run4+0x117/0x1e0 kernel/trace/bpf_trace.c:2118
 __do_trace_mm_page_alloc include/trace/events/kmem.h:180 [inline]
 trace_mm_page_alloc include/trace/events/kmem.h:180 [inline]
 __alloc_frozen_pages_noprof+0x336/0x360 mm/page_alloc.c:5232
 alloc_pages_mpol+0xb3/0x260 mm/mempolicy.c:2486
 folio_alloc_mpol_noprof+0x39/0x80 mm/mempolicy.c:2505
 shmem_alloc_folio mm/shmem.c:1890 [inline]
 shmem_alloc_and_add_folio mm/shmem.c:1932 [inline]
 shmem_get_folio_gfp+0x3cf/0xd50 mm/shmem.c:2556
 shmem_get_folio mm/shmem.c:2662 [inline]
 shmem_write_begin+0xfc/0x1f0 mm/shmem.c:3315
 generic_perform_write+0x184/0x490 mm/filemap.c:4314
 shmem_file_write_iter+0xc5/0xf0 mm/shmem.c:3490
 __kernel_write_iter+0x2d6/0x540 fs/read_write.c:619
 dump_emit_page fs/coredump.c:1298 [inline]
 dump_user_range+0x61e/0x8f0 fs/coredump.c:1372
 elf_core_dump+0x1de7/0x1f80 fs/binfmt_elf.c:2111
 coredump_write+0xacf/0xdf0 fs/coredump.c:1049
 do_coredump fs/coredump.c:1126 [inline]
 vfs_coredump+0x24f7/0x2e60 fs/coredump.c:1200
 get_signal+0xd84/0xf70 kernel/signal.c:3019
 arch_do_signal_or_restart+0x96/0x450 arch/x86/kernel/signal.c:337
 __exit_to_user_mode_loop kernel/entry/common.c:41 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:75 [inline]
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
 irqentry_exit_to_user_mode_prepare include/linux/irq-entry-common.h:270 [inline]
 irqentry_exit_to_user_mode include/linux/irq-entry-common.h:339 [inline]
 irqentry_exit+0xfb/0x560 kernel/entry/common.c:196
 asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618

read to 0xffffc9000f831d00 of 8 bytes by task 18266 on cpu 1:
 memcmp lib/string.c:683 [inline]
 bcmp+0x23/0x90 lib/string.c:715
 memcmp include/linux/fortify-string.h:727 [inline]
 __bpf_get_stackid+0x648/0x7d0 kernel/bpf/stackmap.c:292
 ____bpf_get_stackid kernel/bpf/stackmap.c:344 [inline]
 bpf_get_stackid+0xf3/0x130 kernel/bpf/stackmap.c:323
 ____bpf_get_stackid_raw_tp kernel/trace/bpf_trace.c:1628 [inline]
 bpf_get_stackid_raw_tp+0xf6/0x120 kernel/trace/bpf_trace.c:1617
 bpf_prog_ff69ba342f4eb966+0x2a/0x32
 bpf_dispatcher_nop_func include/linux/bpf.h:1378 [inline]
 __bpf_prog_run include/linux/filter.h:723 [inline]
 bpf_prog_run include/linux/filter.h:730 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2075 [inline]
 bpf_trace_run4+0x117/0x1e0 kernel/trace/bpf_trace.c:2118
 __do_trace_mm_page_alloc include/trace/events/kmem.h:180 [inline]
 trace_mm_page_alloc include/trace/events/kmem.h:180 [inline]
 __alloc_frozen_pages_noprof+0x336/0x360 mm/page_alloc.c:5232
 alloc_pages_mpol+0xb3/0x260 mm/mempolicy.c:2486
 folio_alloc_mpol_noprof+0x39/0x80 mm/mempolicy.c:2505
 shmem_alloc_folio mm/shmem.c:1890 [inline]
 shmem_alloc_and_add_folio mm/shmem.c:1932 [inline]
 shmem_get_folio_gfp+0x3cf/0xd50 mm/shmem.c:2556
 shmem_fault+0xf6/0x250 mm/shmem.c:2757
 __do_fault+0xbc/0x200 mm/memory.c:5320
 do_read_fault mm/memory.c:5755 [inline]
 do_fault mm/memory.c:5889 [inline]
 do_pte_missing mm/memory.c:4401 [inline]
 handle_pte_fault mm/memory.c:6273 [inline]
 __handle_mm_fault mm/memory.c:6411 [inline]
 handle_mm_fault+0xfab/0x2c60 mm/memory.c:6580
 faultin_page mm/gup.c:1126 [inline]
 __get_user_pages+0x1024/0x1ed0 mm/gup.c:1428
 populate_vma_page_range mm/gup.c:1860 [inline]
 __mm_populate+0x243/0x3a0 mm/gup.c:1963
 mm_populate include/linux/mm.h:3701 [inline]
 vm_mmap_pgoff+0x232/0x2e0 mm/util.c:586
 ksys_mmap_pgoff+0xc2/0x310 mm/mmap.c:604
 x64_sys_call+0x16bb/0x3000 arch/x86/include/generated/asm/syscalls_64.h:10
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x0000000000000000 -> 0xf042c5f12da5e5e6

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 18266 Comm: syz.9.4420 Not tainted syzkaller #0 PREEMPT(voluntary)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
==================================================================
```
| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/02/25 02:53 | flow | assessment-kcsan | 15m |
Results:map[Benign:false Confident:true Explanation:The data race occurs in `__bpf_get_stackid` within `kernel/bpf/stackmap.c`. The race is between a `memcmp` (or `bcmp`) operation reading from a stack map bucket and another task writing to that same bucket after it has been returned to the freelist and reused.
### Analysis
In `__bpf_get_stackid`, the code attempts to find if a stack trace is already present in the map by comparing the current stack trace with the one stored in the hash bucket:
```c
bucket = READ_ONCE(smap->buckets[id]);
...
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
return id;
}
```
The `bucket` pointer is obtained without any locking or RCU protection that would ensure the bucket's contents remain unchanged. In a high-load scenario, the following sequence can occur:
1. **Task A** reads a pointer to bucket `X` from `smap->buckets[id]`.
2. **Task B** replaces bucket `X` in the map with a new bucket `Y` (using `xchg`) and pushes `X` back onto the `pcpu_freelist`.
3. **Task C** pops bucket `X` from the freelist to use it as its own `new_bucket` and begins writing new stack trace data (including build IDs) into it via `stack_map_get_build_id_offset` and `parse_build_id`.
4. **Task A** is still executing `memcmp` on bucket `X`. It reads the data in `X` while **Task C** is writing to it.
### Impact
This race is **harmful** because it compromises the correctness of the `bpf_get_stackid` helper. If `memcmp` incorrectly returns 0 due to the race, the helper returns an ID that points to a different stack trace (the one Task B put in the map). This results in "data corruption" where the stack ID returned to the BPF program is associated with the wrong stack trace.
While the race does not cause a kernel crash (as the memory for buckets is preallocated and always valid), it leads to non-deterministic and incorrect behavior in BPF-based tools. Such tools often rely on these IDs for critical tasks like performance profiling, error tracking, or security auditing. The failure to protect the bucket contents during comparison is a structural flaw in the stackmap implementation.
### Conclusion
The data race is non-benign as it leads to inconsistent data and incorrect results from a kernel helper, violating the expected stability and correctness of the BPF subsystem.] |
| 1/1 | 2026/02/25 02:53 | action | kernel-checkouter | 0m |
Results:map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/903c1775a2c70c6ed9d3fb3e4cc0d66eb33309ea] |
| 2/1 | 2026/02/25 02:54 | action | kernel-builder | 3m |
Results:map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/91b604bc6761bd44e9752ec29e9d9a77d12acb8c] |
| 3/1 | 2026/02/25 02:58 | action | codesearch-prepare | 3m |
Results:map[Index:codesearch-index] |
| 4/1 | 2026/02/25 03:01 | agent | expert |
7mModel:gemini-3-flash-preview Results: map[Benign:false Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. A "benign" data races are on a simple int/bool variable or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A non-benign (or "harmful" data race) can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data races is race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. Also consider races that happen at the same time with the given one. If there is no synchronization in between, other memory accesses in the involved threads race with each other if they access the same memory. For example, if both threads execute: some_struct->have_elements = true; list_add(new_node, &some_struct->list_head); the race on some_struct->have_elements may appear benign, however it also implies there is a race on some_struct->list_head which is not benign, since the list is not thread-safe. Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses, and thus even with a conservative compiler may not be fully atomic. However, such races may still be benign depending on how writes are done, and how read data is used. In the final reply explain why you think the given data race is benign or is harmful. Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. 
Prompt: The data race report is: ================================================================== BUG: KCSAN: data-race in bcmp / parse_build_id write to 0xffffc9000f831cf4 of 20 bytes by task 18187 on cpu 0: parse_build_id+0x249/0x2b0 lib/buildid.c:184 get_build_id_64 lib/buildid.c:263 [inline] __build_id_parse+0x253/0x4a0 lib/buildid.c:308 build_id_parse_nofault+0x27/0x40 lib/buildid.c:327 fetch_build_id kernel/bpf/stackmap.c:152 [inline] stack_map_get_build_id_offset+0x226/0x530 kernel/bpf/stackmap.c:197 __bpf_get_stackid+0x60b/0x7d0 kernel/bpf/stackmap.c:289 ____bpf_get_stackid kernel/bpf/stackmap.c:344 [inline] bpf_get_stackid+0xf3/0x130 kernel/bpf/stackmap.c:323 ____bpf_get_stackid_raw_tp kernel/trace/bpf_trace.c:1628 [inline] bpf_get_stackid_raw_tp+0xf6/0x120 kernel/trace/bpf_trace.c:1617 bpf_prog_ff69ba342f4eb966+0x2a/0x32 bpf_dispatcher_nop_func include/linux/bpf.h:1378 [inline] __bpf_prog_run include/linux/filter.h:723 [inline] bpf_prog_run include/linux/filter.h:730 [inline] __bpf_trace_run kernel/trace/bpf_trace.c:2075 [inline] bpf_trace_run4+0x117/0x1e0 kernel/trace/bpf_trace.c:2118 __do_trace_mm_page_alloc include/trace/events/kmem.h:180 [inline] trace_mm_page_alloc include/trace/events/kmem.h:180 [inline] __alloc_frozen_pages_noprof+0x336/0x360 mm/page_alloc.c:5232 alloc_pages_mpol+0xb3/0x260 mm/mempolicy.c:2486 folio_alloc_mpol_noprof+0x39/0x80 mm/mempolicy.c:2505 shmem_alloc_folio mm/shmem.c:1890 [inline] shmem_alloc_and_add_folio mm/shmem.c:1932 [inline] shmem_get_folio_gfp+0x3cf/0xd50 mm/shmem.c:2556 shmem_get_folio mm/shmem.c:2662 [inline] shmem_write_begin+0xfc/0x1f0 mm/shmem.c:3315 generic_perform_write+0x184/0x490 mm/filemap.c:4314 shmem_file_write_iter+0xc5/0xf0 mm/shmem.c:3490 __kernel_write_iter+0x2d6/0x540 fs/read_write.c:619 dump_emit_page fs/coredump.c:1298 [inline] dump_user_range+0x61e/0x8f0 fs/coredump.c:1372 elf_core_dump+0x1de7/0x1f80 fs/binfmt_elf.c:2111 coredump_write+0xacf/0xdf0 fs/coredump.c:1049 do_coredump fs/coredump.c:1126 [inline] vfs_coredump+0x24f7/0x2e60 fs/coredump.c:1200 get_signal+0xd84/0xf70 kernel/signal.c:3019 arch_do_signal_or_restart+0x96/0x450 arch/x86/kernel/signal.c:337 __exit_to_user_mode_loop kernel/entry/common.c:41 [inline] exit_to_user_mode_loop kernel/entry/common.c:75 [inline] __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline] irqentry_exit_to_user_mode_prepare include/linux/irq-entry-common.h:270 [inline] irqentry_exit_to_user_mode include/linux/irq-entry-common.h:339 [inline] irqentry_exit+0xfb/0x560 kernel/entry/common.c:196 asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618 read to 0xffffc9000f831d00 of 8 bytes by task 18266 on cpu 1: memcmp lib/string.c:683 [inline] bcmp+0x23/0x90 lib/string.c:715 memcmp include/linux/fortify-string.h:727 [inline] __bpf_get_stackid+0x648/0x7d0 kernel/bpf/stackmap.c:292 ____bpf_get_stackid kernel/bpf/stackmap.c:344 [inline] bpf_get_stackid+0xf3/0x130 kernel/bpf/stackmap.c:323 ____bpf_get_stackid_raw_tp kernel/trace/bpf_trace.c:1628 [inline] bpf_get_stackid_raw_tp+0xf6/0x120 kernel/trace/bpf_trace.c:1617 bpf_prog_ff69ba342f4eb966+0x2a/0x32 bpf_dispatcher_nop_func include/linux/bpf.h:1378 [inline] __bpf_prog_run include/linux/filter.h:723 [inline] bpf_prog_run include/linux/filter.h:730 [inline] __bpf_trace_run kernel/trace/bpf_trace.c:2075 [inline] bpf_trace_run4+0x117/0x1e0 kernel/trace/bpf_trace.c:2118 __do_trace_mm_page_alloc include/trace/events/kmem.h:180 [inline] trace_mm_page_alloc include/trace/events/kmem.h:180 [inline] 
__alloc_frozen_pages_noprof+0x336/0x360 mm/page_alloc.c:5232 alloc_pages_mpol+0xb3/0x260 mm/mempolicy.c:2486 folio_alloc_mpol_noprof+0x39/0x80 mm/mempolicy.c:2505 shmem_alloc_folio mm/shmem.c:1890 [inline] shmem_alloc_and_add_folio mm/shmem.c:1932 [inline] shmem_get_folio_gfp+0x3cf/0xd50 mm/shmem.c:2556 shmem_fault+0xf6/0x250 mm/shmem.c:2757 __do_fault+0xbc/0x200 mm/memory.c:5320 do_read_fault mm/memory.c:5755 [inline] do_fault mm/memory.c:5889 [inline] do_pte_missing mm/memory.c:4401 [inline] handle_pte_fault mm/memory.c:6273 [inline] __handle_mm_fault mm/memory.c:6411 [inline] handle_mm_fault+0xfab/0x2c60 mm/memory.c:6580 faultin_page mm/gup.c:1126 [inline] __get_user_pages+0x1024/0x1ed0 mm/gup.c:1428 populate_vma_page_range mm/gup.c:1860 [inline] __mm_populate+0x243/0x3a0 mm/gup.c:1963 mm_populate include/linux/mm.h:3701 [inline] vm_mmap_pgoff+0x232/0x2e0 mm/util.c:586 ksys_mmap_pgoff+0xc2/0x310 mm/mmap.c:604 x64_sys_call+0x16bb/0x3000 arch/x86/include/generated/asm/syscalls_64.h:10 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x0000000000000000 -> 0xf042c5f12da5e5e6 Reported by Kernel Concurrency Sanitizer on: CPU: 1 UID: 0 PID: 18266 Comm: syz.9.4420 Not tainted syzkaller #0 PREEMPT(voluntary) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 ================================================================== Reply: The data race occurs in `__bpf_get_stackid` within `kernel/bpf/stackmap.c`. The race is between a `memcmp` (or `bcmp`) operation reading from a stack map bucket and another task writing to that same bucket after it has been returned to the freelist and reused.
### Analysis
In `__bpf_get_stackid`, the code attempts to find if a stack trace is already present in the map by comparing the current stack trace with the one stored in the hash bucket:
```c
bucket = READ_ONCE(smap->buckets[id]);
...
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
return id;
}
```
The `bucket` pointer is obtained without any locking or RCU protection that would ensure the bucket's contents remain unchanged. In a high-load scenario, the following sequence can occur:
1. **Task A** reads a pointer to bucket `X` from `smap->buckets[id]`.
2. **Task B** replaces bucket `X` in the map with a new bucket `Y` (using `xchg`) and pushes `X` back onto the `pcpu_freelist`.
3. **Task C** pops bucket `X` from the freelist to use it as its own `new_bucket` and begins writing new stack trace data (including build IDs) into it via `stack_map_get_build_id_offset` and `parse_build_id`.
4. **Task A** is still executing `memcmp` on bucket `X`. It reads the data in `X` while **Task C** is writing to it.
### Impact
This race is **harmful** because it compromises the correctness of the `bpf_get_stackid` helper. If `memcmp` incorrectly returns 0 due to the race, the helper returns an ID that points to a different stack trace (the one Task B put in the map). This results in "data corruption" where the stack ID returned to the BPF program is associated with the wrong stack trace.
While the race does not cause a kernel crash (as the memory for buckets is preallocated and always valid), it leads to non-deterministic and incorrect behavior in BPF-based tools. Such tools often rely on these IDs for critical tasks like performance profiling, error tracking, or security auditing. The failure to protect the bucket contents during comparison is a structural flaw in the stackmap implementation.
### Conclusion
The data race is non-benign as it leads to inconsistent data and incorrect results from a kernel helper, violating the expected stability and correctness of the BPF subsystem. |
| 5/2 | 2026/02/25 03:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 5628 output: 24 thoughts: 564 Thoughts: **Data Race Analysis: `__bpf_get_stackid`** Okay, so I'm looking at this report, and it's flagged a data race. My initial thought is that the data race involves the `parse_build_id` function as the writer and `bcmp` as the reader, both of which are called within `__bpf_get_stackid`. I'll need to dig into this. Specifically, the write operation originates from `parse_build_id` and traces through `get_build_id_64`, `__build_id_parse`, `build_id_parse_nofault`, `fetch_build_id`, and `stack_map_get_build_id_offset` before ultimately hitting `__bpf_get_stackid`. The read operation, on the other hand, comes from `bcmp` and, directly or indirectly, also ends up in `__bpf_get_stackid`. The key is this memory address: `0xffffc9000f831cf4`. The report indicates a 20-byte write to this address, and then an 8-byte read at `0xffffc9000f831d00`. That's where it's interesting. If I add 20 to `0xffffc9000f831cf4`, I get `0xffffc9000f831d08`. So the read at `0xffffc9000f831d00` is *within* the write region. In fact, it overlaps with the write buffer: it's reading bytes 12 through 20 of the data that's being written. I'm thinking now, I need to understand what's happening in `kernel/bpf/stackmap.c` and `lib/buildid.c`. I'll start by focusing on the `__bpf_get_stackid` function in `kernel/bpf/stackmap.c`, to understand how the write and the read are scheduled and why this overlap occurs. |
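A quick arithmetic check of the overlap worked out in the thoughts above, using the two access addresses from the KCSAN report (added illustration, not part of the original log):

```c
#include <stdio.h>

int main(void)
{
    unsigned long long write_start = 0xffffc9000f831cf4ULL;  /* 20-byte write */
    unsigned long long read_start  = 0xffffc9000f831d00ULL;  /*  8-byte read */

    /* prints "read covers bytes 12..20 of the 20-byte write" */
    printf("read covers bytes %llu..%llu of the 20-byte write\n",
           read_start - write_start, read_start - write_start + 8);
    return 0;
}
```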
| 6/2 | 2026/02/25 03:01 | tool | read-file | 0m |
Args:map[File:kernel/bpf/stackmap.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2016 Facebook
*/
#include <linux/bpf.h>
#include <linux/jhash.h>
#include <linux/filter.h>
#include <linux/kernel.h>
#include <linux/stacktrace.h>
#include <linux/perf_event.h>
#include <linux/btf_ids.h>
#include <linux/buildid.h>
#include "percpu_freelist.h"
#include "mmap_unlock_work.h"
#define STACK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
BPF_F_STACK_BUILD_ID)
struct stack_map_bucket {
struct pcpu_freelist_node fnode;
u32 hash;
u32 nr;
u64 data[];
};
struct bpf_stack_map {
struct bpf_map map;
void *elems;
struct pcpu_freelist freelist;
u32 n_buckets;
struct stack_map_bucket *buckets[] __counted_by(n_buckets);
};
static inline bool stack_map_use_build_id(struct bpf_map *map)
{
return (map->map_flags & BPF_F_STACK_BUILD_ID);
}
static inline int stack_map_data_size(struct bpf_map *map)
{
return stack_map_use_build_id(map) ?
sizeof(struct bpf_stack_build_id) : sizeof(u64);
}
/**
* stack_map_calculate_max_depth - Calculate maximum allowed stack trace depth
* @size: Size of the buffer/map value in bytes
* @elem_size: Size of each stack trace element
* @flags: BPF stack trace flags (BPF_F_USER_STACK, BPF_F_USER_BUILD_ID, ...)
*
* Return: Maximum number of stack trace entries that can be safely stored
*/
static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, u64 flags)
{
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
u32 max_depth;
u32 curr_sysctl_max_stack = READ_ONCE(sysctl_perf_event_max_stack);
max_depth = size / elem_size;
max_depth += skip;
if (max_depth > curr_sysctl_max_stack)
return curr_sysctl_max_stack;
return max_depth;
}
static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
{
u64 elem_size = sizeof(struct stack_map_bucket) +
(u64)smap->map.value_size;
int err;
smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
smap->map.numa_node);
if (!smap->elems)
return -ENOMEM;
err = pcpu_freelist_init(&smap->freelist);
if (err)
goto free_elems;
pcpu_freelist_populate(&smap->freelist, smap->elems, elem_size,
smap->map.max_entries);
return 0;
free_elems:
bpf_map_area_free(smap->elems);
return err;
}
/* Called from syscall */
static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
{
u32 value_size = attr->value_size;
struct bpf_stack_map *smap;
u64 cost, n_buckets;
int err;
if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
return ERR_PTR(-EINVAL);
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
value_size < 8 || value_size % 8)
return ERR_PTR(-EINVAL);
BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64));
if (attr->map_flags & BPF_F_STACK_BUILD_ID) {
if (value_size % sizeof(struct bpf_stack_build_id) ||
value_size / sizeof(struct bpf_stack_build_id)
> sysctl_perf_event_max_stack)
return ERR_PTR(-EINVAL);
} else if (value_size / 8 > sysctl_perf_event_max_stack)
return ERR_PTR(-EINVAL);
/* hash table size must be power of 2; roundup_pow_of_two() can overflow
* into UB on 32-bit arches, so check that first
*/
if (attr->max_entries > 1UL << 31)
return ERR_PTR(-E2BIG);
n_buckets = roundup_pow_of_two(attr->max_entries);
cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
if (!smap)
return ERR_PTR(-ENOMEM);
bpf_map_init_from_attr(&smap->map, attr);
smap->n_buckets = n_buckets;
err = get_callchain_buffers(sysctl_perf_event_max_stack);
if (err)
goto free_smap;
err = prealloc_elems_and_freelist(smap);
if (err)
goto put_buffers;
return &smap->map;
put_buffers:
put_callchain_buffers();
free_smap:
bpf_map_area_free(smap);
return ERR_PTR(err);
}
static int fetch_build_id(struct vm_area_struct *vma, unsigned char *build_id, bool may_fault)
{
return may_fault ? build_id_parse(vma, build_id, NULL)
: build_id_parse_nofault(vma, build_id, NULL);
}
/*
* Expects all id_offs[i].ip values to be set to correct initial IPs.
* They will be subsequently:
* - either adjusted in place to a file offset, if build ID fetching
* succeeds; in this case id_offs[i].build_id is set to correct build ID,
* and id_offs[i].status is set to BPF_STACK_BUILD_ID_VALID;
* - or IP will be kept intact, if build ID fetching failed; in this case
* id_offs[i].build_id is zeroed out and id_offs[i].status is set to
* BPF_STACK_BUILD_ID_IP.
*/
static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
u32 trace_nr, bool user, bool may_fault)
{
int i;
struct mmap_unlock_irq_work *work = NULL;
bool irq_work_busy = bpf_mmap_unlock_get_irq_work(&work);
struct vm_area_struct *vma, *prev_vma = NULL;
const char *prev_build_id;
/* If the irq_work is in use, fall back to report ips. Same
* fallback is used for kernel stack (!user) on a stackmap with
* build_id.
*/
if (!user || !current || !current->mm || irq_work_busy ||
!mmap_read_trylock(current->mm)) {
/* cannot access current->mm, fall back to ips */
for (i = 0; i < trace_nr; i++) {
id_offs[i].status = BPF_STACK_BUILD_ID_IP;
memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
}
return;
}
for (i = 0; i < trace_nr; i++) {
u64 ip = READ_ONCE(id_offs[i].ip);
if (range_in_vma(prev_vma, ip, ip)) {
vma = prev_vma;
memcpy(id_offs[i].build_id, prev_build_id, BUILD_ID_SIZE_MAX);
goto build_id_valid;
}
vma = find_vma(current->mm, ip);
if (!vma || fetch_build_id(vma, id_offs[i].build_id, may_fault)) {
/* per entry fall back to ips */
id_offs[i].status = BPF_STACK_BUILD_ID_IP;
memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
continue;
}
build_id_valid:
id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ip - vma->vm_start;
id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
prev_vma = vma;
prev_build_id = id_offs[i].build_id;
}
bpf_mmap_unlock_mm(work, current->mm);
}
static struct perf_callchain_entry *
get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
{
#ifdef CONFIG_STACKTRACE
struct perf_callchain_entry *entry;
int rctx;
entry = get_callchain_entry(&rctx);
if (!entry)
return NULL;
entry->nr = stack_trace_save_tsk(task, (unsigned long *)entry->ip,
max_depth, 0);
/* stack_trace_save_tsk() works on unsigned long array, while
* perf_callchain_entry uses u64 array. For 32-bit systems, it is
* necessary to fix this mismatch.
*/
if (__BITS_PER_LONG != 64) {
unsigned long *from = (unsigned long *) entry->ip;
u64 *to = entry->ip;
int i;
/* copy data from the end to avoid using extra buffer */
for (i = entry->nr - 1; i >= 0; i--)
to[i] = (u64)(from[i]);
}
put_callchain_entry(rctx);
return entry;
#else /* CONFIG_STACKTRACE */
return NULL;
#endif
}
static long __bpf_get_stackid(struct bpf_map *map,
struct perf_callchain_entry *trace, u64 flags)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
u32 hash, id, trace_nr, trace_len, i, max_depth;
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
bool user = flags & BPF_F_USER_STACK;
u64 *ips;
bool hash_matches;
if (trace->nr <= skip)
/* skipping more than usable stack trace */
return -EFAULT;
max_depth = stack_map_calculate_max_depth(map->value_size, stack_map_data_size(map), flags);
trace_nr = min_t(u32, trace->nr - skip, max_depth - skip);
trace_len = trace_nr * sizeof(u64);
ips = trace->ip + skip;
hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
id = hash & (smap->n_buckets - 1);
bucket = READ_ONCE(smap->buckets[id]);
hash_matches = bucket && bucket->hash == hash;
/* fast cmp */
if (hash_matches && flags & BPF_F_FAST_STACK_CMP)
return id;
if (stack_map_use_build_id(map)) {
struct bpf_stack_build_id *id_offs;
/* for build_id+offset, pop a bucket before slow cmp */
new_bucket = (struct stack_map_bucket *)
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
new_bucket->nr = trace_nr;
id_offs = (struct bpf_stack_build_id *)new_bucket->data;
for (i = 0; i < trace_nr; i++)
id_offs[i].ip = ips[i];
stack_map_get_build_id_offset(id_offs, trace_nr, user, false /* !may_fault */);
trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
return id;
}
if (bucket && !(flags & BPF_F_REUSE_STACKID)) {
pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
return -EEXIST;
}
} else {
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, ips, trace_len) == 0)
return id;
if (bucket && !(flags & BPF_F_REUSE_STACKID))
return -EEXIST;
new_bucket = (struct stack_map_bucket *)
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
memcpy(new_bucket->data, ips, trace_len);
}
new_bucket->hash = hash;
new_bucket->nr = trace_nr;
old_bucket = xchg(&smap->buckets[id], new_bucket);
if (old_bucket)
pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
return id;
}
BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
u64, flags)
{
u32 elem_size = stack_map_data_size(map);
bool user = flags & BPF_F_USER_STACK;
struct perf_callchain_entry *trace;
bool kernel = !user;
u32 max_depth;
if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
return -EINVAL;
max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags);
trace = get_perf_callchain(regs, kernel, user, max_depth,
false, false, 0);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
return -EFAULT;
return __bpf_get_stackid(map, trace, flags);
}
const struct bpf_func_proto bpf_get_stackid_proto = {
.func = bpf_get_stackid,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_CONST_MAP_PTR,
.arg3_type = ARG_ANYTHING,
};
static __u64 count_kernel_ip(struct perf_callchain_entry *trace)
{
__u64 nr_kernel = 0;
while (nr_kernel < trace->nr) {
if (trace->ip[nr_kernel] == PERF_CONTEXT_USER)
break;
nr_kernel++;
}
return nr_kernel;
}
BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx,
struct bpf_map *, map, u64, flags)
{
struct perf_event *event = ctx->event;
struct perf_callchain_entry *trace;
bool kernel, user;
__u64 nr_kernel;
int ret;
/* perf_sample_data doesn't have callchain, use bpf_get_stackid */
if (!(event->attr.sample_type & PERF_SAMPLE_CALLCHAIN))
return bpf_get_stackid((unsigned long)(ctx->regs),
(unsigned long) map, flags, 0, 0);
if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
return -EINVAL;
user = flags & BPF_F_USER_STACK;
kernel = !user;
trace = ctx->data->callchain;
if (unlikely(!trace))
return -EFAULT;
nr_kernel = count_kernel_ip(trace);
__u64 nr = trace->nr; /* save original */
if (kernel) {
trace->nr = nr_kernel;
ret = __bpf_get_stackid(map, trace, flags);
} else { /* user */
u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
skip += nr_kernel;
if (skip > BPF_F_SKIP_FIELD_MASK)
return -EFAULT;
flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
ret = __bpf_get_stackid(map, trace, flags);
}
/* restore nr */
trace->nr = nr;
return ret;
}
const struct bpf_func_proto bpf_get_stackid_proto_pe = {
.func = bpf_get_stackid_pe,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_CONST_MAP_PTR,
.arg3_type = ARG_ANYTHING,
};
static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
struct perf_callchain_entry *trace_in,
void *buf, u32 size, u64 flags, bool may_fault)
{
u32 trace_nr, copy_len, elem_size, max_depth;
bool user_build_id = flags & BPF_F_USER_BUILD_ID;
bool crosstask = task && task != current;
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
bool user = flags & BPF_F_USER_STACK;
struct perf_callchain_entry *trace;
bool kernel = !user;
int err = -EINVAL;
u64 *ips;
if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
BPF_F_USER_BUILD_ID)))
goto clear;
if (kernel && user_build_id)
goto clear;
elem_size = user_build_id ? sizeof(struct bpf_stack_build_id) : sizeof(u64);
if (unlikely(size % elem_size))
goto clear;
/* cannot get valid user stack for task without user_mode regs */
if (task && user && !user_mode(regs))
goto err_fault;
/* get_perf_callchain does not support crosstask user stack walking
* but returns an empty stack instead of NULL.
*/
if (crosstask && user) {
err = -EOPNOTSUPP;
goto clear;
}
max_depth = stack_map_calculate_max_depth(size, elem_size, flags);
if (may_fault)
rcu_read_lock(); /* need RCU for perf's callchain below */
if (trace_in) {
trace = trace_in;
trace->nr = min_t(u32, trace->nr, max_depth);
} else if (kernel && task) {
trace = get_callchain_entry_for_task(task, max_depth);
} else {
trace = get_perf_callchain(regs, kernel, user, max_depth,
crosstask, false, 0);
}
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
rcu_read_unlock();
goto err_fault;
}
trace_nr = trace->nr - skip;
copy_len = trace_nr * elem_size;
ips = trace->ip + skip;
if (user_build_id) {
struct bpf_stack_build_id *id_offs = buf;
u32 i;
for (i = 0; i < trace_nr; i++)
id_offs[i].ip = ips[i];
} else {
memcpy(buf, ips, copy_len);
}
/* trace/ips should not be dereferenced after this point */
if (may_fault)
rcu_read_unlock();
if (user_build_id)
stack_map_get_build_id_offset(buf, trace_nr, user, may_fault);
if (size > copy_len)
memset(buf + copy_len, 0, size - copy_len);
return copy_len;
err_fault:
err = -EFAULT;
clear:
memset(buf, 0, size);
return err;
}
BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
u64, flags)
{
return __bpf_get_stack(regs, NULL, NULL, buf, size, flags, false /* !may_fault */);
}
const struct bpf_func_proto bpf_get_stack_proto = {
.func = bpf_get_stack,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE_OR_ZERO,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_get_stack_sleepable, struct pt_regs *, regs, void *, buf, u32, size,
u64, flags)
{
return __bpf_get_stack(regs, NULL, NULL, buf, size, flags, true /* may_fault */);
}
const struct bpf_func_proto bpf_get_stack_sleepable_proto = {
.func = bpf_get_stack_sleepable,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE_OR_ZERO,
.arg4_type = ARG_ANYTHING,
};
static long __bpf_get_task_stack(struct task_struct *task, void *buf, u32 size,
u64 flags, bool may_fault)
{
struct pt_regs *regs;
long res = -EINVAL;
if (!try_get_task_stack(task))
return -EFAULT;
regs = task_pt_regs(task);
if (regs)
res = __bpf_get_stack(regs, task, NULL, buf, size, flags, may_fault);
put_task_stack(task);
return res;
}
BPF_CALL_4(bpf_get_task_stack, struct task_struct *, task, void *, buf,
u32, size, u64, flags)
{
return __bpf_get_task_stack(task, buf, size, flags, false /* !may_fault */);
}
const struct bpf_func_proto bpf_get_task_stack_proto = {
.func = bpf_get_task_stack,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_BTF_ID,
.arg1_btf_id = &btf_tracing_ids[BTF_TRACING_TYPE_TASK],
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE_OR_ZERO,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_get_task_stack_sleepable, struct task_struct *, task, void *, buf,
u32, size, u64, flags)
{
return __bpf_get_task_stack(task, buf, size, flags, true /* !may_fault */);
}
const struct bpf_func_proto bpf_get_task_stack_sleepable_proto = {
.func = bpf_get_task_stack_sleepable,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_BTF_ID,
.arg1_btf_id = &btf_tracing_ids[BTF_TRACING_TYPE_TASK],
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE_OR_ZERO,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_get_stack_pe, struct bpf_perf_event_data_kern *, ctx,
void *, buf, u32, size, u64, flags)
{
struct pt_regs *regs = (struct pt_regs *)(ctx->regs);
struct perf_event *event = ctx->event;
struct perf_callchain_entry *trace;
bool kernel, user;
int err = -EINVAL;
__u64 nr_kernel;
if (!(event->attr.sample_type & PERF_SAMPLE_CALLCHAIN))
return __bpf_get_stack(regs, NULL, NULL, buf, size, flags, false /* !may_fault */);
if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
BPF_F_USER_BUILD_ID)))
goto clear;
user = flags & BPF_F_USER_STACK;
kernel = !user;
err = -EFAULT;
trace = ctx->data->callchain;
if (unlikely(!trace))
goto clear;
nr_kernel = count_kernel_ip(trace);
if (kernel) {
__u64 nr = trace->nr;
trace->nr = nr_kernel;
err = __bpf_get_stack(regs, NULL, trace, buf, size, flags, false /* !may_fault */);
/* restore nr */
trace->nr = nr;
} else { /* user */
u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
skip += nr_kernel;
if (skip > BPF_F_SKIP_FIELD_MASK)
goto clear;
flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
err = __bpf_get_stack(regs, NULL, trace, buf, size, flags, false /* !may_fault */);
}
return err;
clear:
memset(buf, 0, size);
return err;
}
const struct bpf_func_proto bpf_get_stack_proto_pe = {
.func = bpf_get_stack_pe,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE_OR_ZERO,
.arg4_type = ARG_ANYTHING,
};
/* Called from eBPF program */
static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
{
return ERR_PTR(-EOPNOTSUPP);
}
/* Called from syscall */
static int stack_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
void *value, u64 flags)
{
return bpf_stackmap_extract(map, key, value, true);
}
/* Called from syscall */
int bpf_stackmap_extract(struct bpf_map *map, void *key, void *value,
bool delete)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
struct stack_map_bucket *bucket, *old_bucket;
u32 id = *(u32 *)key, trace_len;
if (unlikely(id >= smap->n_buckets))
return -ENOENT;
bucket = xchg(&smap->buckets[id], NULL);
if (!bucket)
return -ENOENT;
trace_len = bucket->nr * stack_map_data_size(map);
memcpy(value, bucket->data, trace_len);
memset(value + trace_len, 0, map->value_size - trace_len);
if (delete)
old_bucket = bucket;
else
old_bucket = xchg(&smap->buckets[id], bucket);
if (old_bucket)
pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
return 0;
}
static int stack_map_get_next_key(struct bpf_map *map, void *key,
void *next_key)
{
struct bpf_stack_map *smap = container_of(map,
struct bpf_stack_map, map);
u32 id;
WARN_ON_ONCE(!rcu_read_lock_held());
if (!key) {
id = 0;
} else {
id = *(u32 *)key;
if (id >= smap->n_buckets || !smap->buckets[id])
id = 0;
else
id++;
}
while (id < smap->n_buckets && !smap->buckets[id])
id++;
if (id >= smap->n_buckets)
return -ENOENT;
*(u32 *)next_key = id;
return 0;
}
static long stack_map_update_elem(struct bpf_map *map, void *key, void *value,
u64 map_flags)
{
return -EINVAL;
}
/* Called from syscall or from eBPF program */
static long stack_map_delete_elem(struct bpf_map *map, void *key)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
struct stack_map_bucket *old_bucket;
u32 id = *(u32 *)key;
if (unlikely(id >= smap->n_buckets))
return -E2BIG;
old_bucket = xchg(&smap->buckets[id], NULL);
if (old_bucket) {
pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
return 0;
} else {
return -ENOENT;
}
}
/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
static void stack_map_free(struct bpf_map *map)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
bpf_map_area_free(smap->elems);
pcpu_freelist_destroy(&smap->freelist);
bpf_map_area_free(smap);
put_callchain_buffers();
}
static u64 stack_map_mem_usage(const struct bpf_map *map)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
u64 value_size = map->value_size;
u64 n_buckets = smap->n_buckets;
u64 enties = map->max_entries;
u64 usage = sizeof(*smap);
usage += n_buckets * sizeof(struct stack_map_bucket *);
usage += enties * (sizeof(struct stack_map_bucket) + value_size);
return usage;
}
BTF_ID_LIST_SINGLE(stack_trace_map_btf_ids, struct, bpf_stack_map)
const struct bpf_map_ops stack_trace_map_ops = {
.map_meta_equal = bpf_map_meta_equal,
.map_alloc = stack_map_alloc,
.map_free = stack_map_free,
.map_get_next_key = stack_map_get_next_key,
.map_lookup_elem = stack_map_lookup_elem,
.map_lookup_and_delete_elem = stack_map_lookup_and_delete_elem,
.map_update_elem = stack_map_update_elem,
.map_delete_elem = stack_map_delete_elem,
.map_check_btf = map_check_no_btf,
.map_mem_usage = stack_map_mem_usage,
.map_btf_id = &stack_trace_map_btf_ids[0],
};
] |
| 7/2 | 2026/02/25 03:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 15382 output: 21 thoughts: 2260 Thoughts: **Analysis of a Potential Race Condition in `__bpf_get_stackid`** Okay, so I'm digging into this memory corruption report and it looks like the action is happening in `__bpf_get_stackid`. I see two tasks, a reader and a writer, both operating within this function. The reader is at the `memcmp` call, and the writer is at `stack_map_get_build_id_offset`, which, I notice, is also called from within `__bpf_get_stackid` itself. The writer's stack trace includes `parse_build_id`, so it's likely dealing with build ID information. The critical issue, as I see it, centers around the interaction between `bucket` and `new_bucket`. The function is essentially trying to find if a stack trace is already in a BPF stack map. It reads a `bucket` from the map, allocates a `new_bucket` from a freelist, populates the new bucket with data, and *then* does a `memcmp` to see if they're the same. If the `memcmp` succeeds, the new bucket is pushed back onto the freelist. I can see where the race lies: if the `bucket` pointed to by `READ_ONCE(smap->buckets[id])` in Task A's code is simultaneously being modified by Task B's `stack_map_get_build_id_offset`, or, worse, if Task B pushes that bucket back to the freelist and Task C *re*uses it, we have a problem. The `memcmp` on Task A is reading from a bucket while it's being written to in Task B *or* is stale, and is being overwritten in Task C. Specifically, the `pcpu_freelist_pop` function *should* guarantee that each task gets a unique bucket, but that uniqueness gets tossed out if Task B or C gets to the bucket between the `READ_ONCE` and `memcmp` calls. The check for a match in `memcmp` can incorrectly say we have a match, or a no match. This is really bad because we're returning the `id` of a potentially corrupted bucket. Subsequent lookups using that `id` would then return garbage. The root cause, as I understand it, is that `__bpf_get_stackid` doesn't hold any lock on `smap->buckets[id]` while the `memcmp` is happening. This window of vulnerability, coupled with the freelist management, allows the bucket's content to change between the read and the comparison. The fact that the writer is potentially touching build ID information makes this even more concerning because build ID data is often used for security checks. I need to go deeper into `lib/buildid.c` now to understand the precise details of what the writer is doing. |
| 8/2 | 2026/02/25 03:02 | tool | read-file | 0m |
Args:map[File:lib/buildid.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0
#include <linux/buildid.h>
#include <linux/cache.h>
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
#include <linux/secretmem.h>
#define BUILD_ID 3
#define MAX_PHDR_CNT 256
void freader_init_from_file(struct freader *r, void *buf, u32 buf_sz,
struct file *file, bool may_fault)
{
memset(r, 0, sizeof(*r));
r->buf = buf;
r->buf_sz = buf_sz;
r->file = file;
r->may_fault = may_fault;
}
void freader_init_from_mem(struct freader *r, const char *data, u64 data_sz)
{
memset(r, 0, sizeof(*r));
r->data = data;
r->data_sz = data_sz;
}
static void freader_put_folio(struct freader *r)
{
if (!r->folio)
return;
kunmap_local(r->addr);
folio_put(r->folio);
r->folio = NULL;
}
static int freader_get_folio(struct freader *r, loff_t file_off)
{
/* check if we can just reuse current folio */
if (r->folio && file_off >= r->folio_off &&
file_off < r->folio_off + folio_size(r->folio))
return 0;
freader_put_folio(r);
/* reject secretmem folios created with memfd_secret() */
if (secretmem_mapping(r->file->f_mapping))
return -EFAULT;
r->folio = filemap_get_folio(r->file->f_mapping, file_off >> PAGE_SHIFT);
/* if sleeping is allowed, wait for the page, if necessary */
if (r->may_fault && (IS_ERR(r->folio) || !folio_test_uptodate(r->folio))) {
filemap_invalidate_lock_shared(r->file->f_mapping);
r->folio = read_cache_folio(r->file->f_mapping, file_off >> PAGE_SHIFT,
NULL, r->file);
filemap_invalidate_unlock_shared(r->file->f_mapping);
}
if (IS_ERR(r->folio) || !folio_test_uptodate(r->folio)) {
if (!IS_ERR(r->folio))
folio_put(r->folio);
r->folio = NULL;
return -EFAULT;
}
r->folio_off = folio_pos(r->folio);
r->addr = kmap_local_folio(r->folio, 0);
return 0;
}
const void *freader_fetch(struct freader *r, loff_t file_off, size_t sz)
{
size_t folio_sz;
/* provided internal temporary buffer should be sized correctly */
if (WARN_ON(r->buf && sz > r->buf_sz)) {
r->err = -E2BIG;
return NULL;
}
if (unlikely(file_off + sz < file_off)) {
r->err = -EOVERFLOW;
return NULL;
}
/* working with memory buffer is much more straightforward */
if (!r->buf) {
if (file_off + sz > r->data_sz) {
r->err = -ERANGE;
return NULL;
}
return r->data + file_off;
}
/* fetch or reuse folio for given file offset */
r->err = freader_get_folio(r, file_off);
if (r->err)
return NULL;
/* if requested data is crossing folio boundaries, we have to copy
* everything into our local buffer to keep a simple linear memory
* access interface
*/
folio_sz = folio_size(r->folio);
if (file_off + sz > r->folio_off + folio_sz) {
u64 part_sz = r->folio_off + folio_sz - file_off, off;
memcpy(r->buf, r->addr + file_off - r->folio_off, part_sz);
off = part_sz;
while (off < sz) {
/* fetch next folio */
r->err = freader_get_folio(r, r->folio_off + folio_sz);
if (r->err)
return NULL;
folio_sz = folio_size(r->folio);
part_sz = min_t(u64, sz - off, folio_sz);
memcpy(r->buf + off, r->addr, part_sz);
off += part_sz;
}
return r->buf;
}
/* if data fits in a single folio, just return direct pointer */
return r->addr + (file_off - r->folio_off);
}
void freader_cleanup(struct freader *r)
{
if (!r->buf)
return; /* non-file-backed mode */
freader_put_folio(r);
}
/*
* Parse build id from the note segment. This logic can be shared between
* 32-bit and 64-bit system, because Elf32_Nhdr and Elf64_Nhdr are
* identical.
*/
static int parse_build_id(struct freader *r, unsigned char *build_id, __u32 *size,
loff_t note_off, Elf32_Word note_size)
{
const char note_name[] = "GNU";
const size_t note_name_sz = sizeof(note_name);
u32 build_id_off, new_off, note_end, name_sz, desc_sz;
const Elf32_Nhdr *nhdr;
const char *data;
if (check_add_overflow(note_off, note_size, ¬e_end))
return -EINVAL;
while (note_end - note_off > sizeof(Elf32_Nhdr) + note_name_sz) {
nhdr = freader_fetch(r, note_off, sizeof(Elf32_Nhdr) + note_name_sz);
if (!nhdr)
return r->err;
name_sz = READ_ONCE(nhdr->n_namesz);
desc_sz = READ_ONCE(nhdr->n_descsz);
new_off = note_off + sizeof(Elf32_Nhdr);
if (check_add_overflow(new_off, ALIGN(name_sz, 4), &new_off) ||
check_add_overflow(new_off, ALIGN(desc_sz, 4), &new_off) ||
new_off > note_end)
break;
if (nhdr->n_type == BUILD_ID &&
name_sz == note_name_sz &&
memcmp(nhdr + 1, note_name, note_name_sz) == 0 &&
desc_sz > 0 && desc_sz <= BUILD_ID_SIZE_MAX) {
build_id_off = note_off + sizeof(Elf32_Nhdr) + ALIGN(note_name_sz, 4);
/* freader_fetch() will invalidate nhdr pointer */
data = freader_fetch(r, build_id_off, desc_sz);
if (!data)
return r->err;
memcpy(build_id, data, desc_sz);
memset(build_id + desc_sz, 0, BUILD_ID_SIZE_MAX - desc_sz);
if (size)
*size = desc_sz;
return 0;
}
note_off = new_off;
}
return -EINVAL;
}
/* Parse build ID from 32-bit ELF */
static int get_build_id_32(struct freader *r, unsigned char *build_id, __u32 *size)
{
const Elf32_Ehdr *ehdr;
const Elf32_Phdr *phdr;
__u32 phnum, phoff, i;
ehdr = freader_fetch(r, 0, sizeof(Elf32_Ehdr));
if (!ehdr)
return r->err;
/* subsequent freader_fetch() calls invalidate pointers, so remember locally */
phnum = READ_ONCE(ehdr->e_phnum);
phoff = READ_ONCE(ehdr->e_phoff);
/* set upper bound on amount of segments (phdrs) we iterate */
if (phnum > MAX_PHDR_CNT)
phnum = MAX_PHDR_CNT;
/* check that phoff is not large enough to cause an overflow */
if (phoff + phnum * sizeof(Elf32_Phdr) < phoff)
return -EINVAL;
for (i = 0; i < phnum; ++i) {
phdr = freader_fetch(r, phoff + i * sizeof(Elf32_Phdr), sizeof(Elf32_Phdr));
if (!phdr)
return r->err;
if (phdr->p_type == PT_NOTE &&
!parse_build_id(r, build_id, size, READ_ONCE(phdr->p_offset),
READ_ONCE(phdr->p_filesz)))
return 0;
}
return -EINVAL;
}
/* Parse build ID from 64-bit ELF */
static int get_build_id_64(struct freader *r, unsigned char *build_id, __u32 *size)
{
const Elf64_Ehdr *ehdr;
const Elf64_Phdr *phdr;
__u32 phnum, i;
__u64 phoff;
ehdr = freader_fetch(r, 0, sizeof(Elf64_Ehdr));
if (!ehdr)
return r->err;
/* subsequent freader_fetch() calls invalidate pointers, so remember locally */
phnum = READ_ONCE(ehdr->e_phnum);
phoff = READ_ONCE(ehdr->e_phoff);
/* set upper bound on amount of segments (phdrs) we iterate */
if (phnum > MAX_PHDR_CNT)
phnum = MAX_PHDR_CNT;
/* check that phoff is not large enough to cause an overflow */
if (phoff + phnum * sizeof(Elf64_Phdr) < phoff)
return -EINVAL;
for (i = 0; i < phnum; ++i) {
phdr = freader_fetch(r, phoff + i * sizeof(Elf64_Phdr), sizeof(Elf64_Phdr));
if (!phdr)
return r->err;
if (phdr->p_type == PT_NOTE &&
!parse_build_id(r, build_id, size, READ_ONCE(phdr->p_offset),
READ_ONCE(phdr->p_filesz)))
return 0;
}
return -EINVAL;
}
/* enough for Elf64_Ehdr, Elf64_Phdr, and all the smaller requests */
#define MAX_FREADER_BUF_SZ 64
static int __build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
__u32 *size, bool may_fault)
{
const Elf32_Ehdr *ehdr;
struct freader r;
char buf[MAX_FREADER_BUF_SZ];
int ret;
/* only works for page backed storage */
if (!vma->vm_file)
return -EINVAL;
freader_init_from_file(&r, buf, sizeof(buf), vma->vm_file, may_fault);
/* fetch first 18 bytes of ELF header for checks */
ehdr = freader_fetch(&r, 0, offsetofend(Elf32_Ehdr, e_type));
if (!ehdr) {
ret = r.err;
goto out;
}
ret = -EINVAL;
/* compare magic x7f "ELF" */
if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
goto out;
/* only support executable file and shared object file */
if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
goto out;
if (ehdr->e_ident[EI_CLASS] == ELFCLASS32)
ret = get_build_id_32(&r, build_id, size);
else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64)
ret = get_build_id_64(&r, build_id, size);
out:
freader_cleanup(&r);
return ret;
}
/*
* Parse build ID of ELF file mapped to vma
* @vma: vma object
* @build_id: buffer to store build id, at least BUILD_ID_SIZE long
* @size: returns actual build id size in case of success
*
* Assumes no page fault can be taken, so if relevant portions of ELF file are
* not already paged in, fetching of build ID fails.
*
* Return: 0 on success; negative error, otherwise
*/
int build_id_parse_nofault(struct vm_area_struct *vma, unsigned char *build_id, __u32 *size)
{
return __build_id_parse(vma, build_id, size, false /* !may_fault */);
}
/*
* Parse build ID of ELF file mapped to VMA
* @vma: vma object
* @build_id: buffer to store build id, at least BUILD_ID_SIZE long
* @size: returns actual build id size in case of success
*
* Assumes faultable context and can cause page faults to bring in file data
* into page cache.
*
* Return: 0 on success; negative error, otherwise
*/
int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id, __u32 *size)
{
return __build_id_parse(vma, build_id, size, true /* may_fault */);
}
/**
* build_id_parse_buf - Get build ID from a buffer
* @buf: ELF note section(s) to parse
* @buf_size: Size of @buf in bytes
* @build_id: Build ID parsed from @buf, at least BUILD_ID_SIZE_MAX long
*
* Return: 0 on success, -EINVAL otherwise
*/
int build_id_parse_buf(const void *buf, unsigned char *build_id, u32 buf_size)
{
struct freader r;
int err;
freader_init_from_mem(&r, buf, buf_size);
err = parse_build_id(&r, build_id, NULL, 0, buf_size);
freader_cleanup(&r);
return err;
}
#if IS_ENABLED(CONFIG_STACKTRACE_BUILD_ID) || IS_ENABLED(CONFIG_VMCORE_INFO)
unsigned char vmlinux_build_id[BUILD_ID_SIZE_MAX] __ro_after_init;
/**
* init_vmlinux_build_id - Compute and stash the running kernel's build ID
*/
void __init init_vmlinux_build_id(void)
{
extern const void __start_notes;
extern const void __stop_notes;
unsigned int size = &__stop_notes - &__start_notes;
build_id_parse_buf(&__start_notes, vmlinux_build_id, size);
}
#endif
] |
| 9/2 | 2026/02/25 03:02 | llm | expert |
0m | Model:gemini-3-flash-preview Tokens: input: 22241 output: 27 thoughts: 1577 Thoughts: **My Analysis of the BPF Stackmap Data Race**

Okay, here's what I've got. I'm looking at a potential data race in the BPF stackmap implementation, specifically in `__bpf_get_stackid`. The issue stems from how the stackmap buckets are managed, particularly their reuse via a freelist, and from the unprotected comparison of a bucket's data. I can see that `parse_build_id` is where the build ID is written, and I'm tracing how `build_id` values are retrieved. Within `__bpf_get_stackid`, a bucket pointer (call it X) is obtained from the stackmap, and then `stack_map_get_build_id_offset` is called. Here's where the race comes in:

1. A task (Task A) grabs bucket X from `smap->buckets[id]`.
2. Another task (Task B) swaps X in `smap->buckets[id]` with a fresh bucket (Y).
3. Task B then puts X back on the freelist.
4. A third task (Task C) takes X from the freelist.
5. Task C starts writing new data into X (in `parse_build_id`).
6. Meanwhile, Task A is still inside `__bpf_get_stackid`, running a `memcmp` that reads the data in X.

This has the hallmarks of a classic use-after-free, except the memory is recycled through a freelist rather than freed: the race is on the contents of buckets that are preallocated and managed by the freelist. If `memcmp` in Task A incorrectly returns 0, the stack ID returned to the user will be wrong: Task B replaced X with Y, so when the user looks up the returned ID, they get Y's data instead of what Task A believed was stored there. If `memcmp` fails instead, Task A installs its own bucket Z at that id, replacing Y.

There's a catch, though. If the `memcmp` *matches*, Task A assumes the collected stack trace is already in the map at `id` and returns `id`. But the bucket at `id` has already been overwritten: the ID Task A returns represents Task B's data, not Task A's. If those stack traces differ, the returned ID points to the *wrong* stack trace.

`bpf_get_stackid` aims to return a unique ID per stack trace. Two different traces getting the same ID is a collision; that is expected in a hash map, and since the ID is also the hash bucket index, multiple traces can map to the same `id`. The code has built-in handling: if `BPF_F_REUSE_STACKID` is *not* set, it returns `-EEXIST` when the bucket contains a different stack trace; if that flag *is* set, Task A overwrites Y with Z.

In the case of the race:
* Task A sees bucket X.
* Task B swaps X with Y.
* Task C starts overwriting X.
* Task A's `memcmp` compares its `new_bucket` (Z) against X, which Task C is concurrently overwriting (with W).

If `memcmp` returns 0, Task A returns `id`, but the map at `id` contains Y while Task A's stack trace was Z: incorrect whenever Z and Y differ. The stackmap is essentially a hash table: the key is the stack trace, the value is the trace itself (or build IDs), and the ID is the index into the `buckets` array. When different stack traces hash to the same ID, only one can be stored. If Task A's trace Z equals what *was* in X, Task A should return `id`; but X is being overwritten by W, so `memcmp(X, Z)` can pass or fail essentially at random, depending on what Task C is writing at that instant. If it passes, A returns `id` while the map at `id` holds Y; if it fails, A replaces Y with Z.
The core problem is that `memcmp` is reading from a bucket on the freelist that is in the process of being reused. This can produce garbage reads: `memcmp` returning zero when it shouldn't, or non-zero when it should. |
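To make that interleaving concrete, here is a minimal userspace C sketch of the reuse race. It is not kernel code: the struct, the single-slot "freelist", and all names are illustrative stand-ins for `smap->buckets[id]`, `xchg()`, and the `pcpu_freelist` push/pop. It deliberately contains the data race it models, so its output is timing-dependent.

```c
/*
 * Minimal userspace model of the stackmap bucket-reuse race.
 * Illustrative only: the kernel uses xchg() on smap->buckets[id]
 * and a pcpu_freelist; here atomic pointers play each role.
 * The memcmp() below intentionally races with the writer thread.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DATA_WORDS 8

struct bucket {
	uint32_t nr;
	uint64_t data[DATA_WORDS];
};

static struct bucket X, Y;                 /* preallocated, never freed */
static _Atomic(struct bucket *) slot = &X; /* stands in for smap->buckets[id] */
static _Atomic(struct bucket *) freelist;  /* one-element "freelist" */

static void *updater(void *arg)
{
	(void)arg;

	/* Task B: replace X with Y, then push X onto the freelist. */
	struct bucket *old = atomic_exchange(&slot, &Y);
	atomic_store(&freelist, old);

	/* Task C: pop X and start overwriting it, as parse_build_id would. */
	struct bucket *reused = atomic_exchange(&freelist, NULL);
	for (int i = 0; reused && i < DATA_WORDS; i++)
		reused->data[i] = 0xf042c5f12da5e5e6ULL;
	return NULL;
}

int main(void)
{
	uint64_t mine[DATA_WORDS] = { 0 };  /* Task A's freshly built trace */
	pthread_t t;

	/* Task A: fetch the bucket pointer, then compare -- the race window. */
	struct bucket *bucket = atomic_load(&slot);
	pthread_create(&t, NULL, updater, NULL);
	int match = memcmp(bucket->data, mine, sizeof(mine)) == 0;
	pthread_join(t, NULL);

	printf("memcmp matched: %d (timing-dependent)\n", match);
	return 0;
}
```

Compiled with `-pthread`, the comparison may see the old all-zero contents, the writer's new pattern, or a torn mix of the two, which is precisely the ambiguity KCSAN flagged.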
| 10/2 | 2026/02/25 03:02 | tool | read-file |
0m | Args:map[File:kernel/bpf/percpu_freelist.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2016 Facebook
*/
#include "percpu_freelist.h"
int pcpu_freelist_init(struct pcpu_freelist *s)
{
int cpu;
s->freelist = alloc_percpu(struct pcpu_freelist_head);
if (!s->freelist)
return -ENOMEM;
for_each_possible_cpu(cpu) {
struct pcpu_freelist_head *head = per_cpu_ptr(s->freelist, cpu);
raw_res_spin_lock_init(&head->lock);
head->first = NULL;
}
return 0;
}
void pcpu_freelist_destroy(struct pcpu_freelist *s)
{
free_percpu(s->freelist);
}
static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
struct pcpu_freelist_node *node)
{
node->next = head->first;
WRITE_ONCE(head->first, node);
}
static inline bool ___pcpu_freelist_push(struct pcpu_freelist_head *head,
struct pcpu_freelist_node *node)
{
if (raw_res_spin_lock(&head->lock))
return false;
pcpu_freelist_push_node(head, node);
raw_res_spin_unlock(&head->lock);
return true;
}
void __pcpu_freelist_push(struct pcpu_freelist *s,
struct pcpu_freelist_node *node)
{
struct pcpu_freelist_head *head;
int cpu;
if (___pcpu_freelist_push(this_cpu_ptr(s->freelist), node))
return;
while (true) {
for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
if (cpu == raw_smp_processor_id())
continue;
head = per_cpu_ptr(s->freelist, cpu);
if (raw_res_spin_lock(&head->lock))
continue;
pcpu_freelist_push_node(head, node);
raw_res_spin_unlock(&head->lock);
return;
}
}
}
void pcpu_freelist_push(struct pcpu_freelist *s,
struct pcpu_freelist_node *node)
{
unsigned long flags;
local_irq_save(flags);
__pcpu_freelist_push(s, node);
local_irq_restore(flags);
}
void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
u32 nr_elems)
{
struct pcpu_freelist_head *head;
unsigned int cpu, cpu_idx, i, j, n, m;
n = nr_elems / num_possible_cpus();
m = nr_elems % num_possible_cpus();
cpu_idx = 0;
for_each_possible_cpu(cpu) {
head = per_cpu_ptr(s->freelist, cpu);
j = n + (cpu_idx < m ? 1 : 0);
for (i = 0; i < j; i++) {
/* No locking required as this is not visible yet. */
pcpu_freelist_push_node(head, buf);
buf += elem_size;
}
cpu_idx++;
}
}
static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
{
struct pcpu_freelist_node *node = NULL;
struct pcpu_freelist_head *head;
int cpu;
for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
head = per_cpu_ptr(s->freelist, cpu);
if (!READ_ONCE(head->first))
continue;
if (raw_res_spin_lock(&head->lock))
continue;
node = head->first;
if (node) {
WRITE_ONCE(head->first, node->next);
raw_res_spin_unlock(&head->lock);
return node;
}
raw_res_spin_unlock(&head->lock);
}
return node;
}
struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s)
{
return ___pcpu_freelist_pop(s);
}
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
{
struct pcpu_freelist_node *ret;
unsigned long flags;
local_irq_save(flags);
ret = __pcpu_freelist_pop(s);
local_irq_restore(flags);
return ret;
}
] |
| 11/2 | 2026/02/25 03:02 | llm | expert |
0m | Model:gemini-3-flash-preview Tokens: input: 26144 output: 24 thoughts: 2052 Thoughts: **Analysis of a Potential Data Race in `bpf_get_stackid`**

Okay, so `pcpu_freelist` is just a basic per-CPU freelist with no RCU grace period. That means a stackmap bucket can be reused immediately after it is pushed, which creates the potential data race in `__bpf_get_stackid`: the function can read from a bucket that has already been returned to the freelist and reallocated.

Now, is this race "benign"? If `memcmp` incorrectly returns 0, the user gets an ID for the wrong stack trace. If it incorrectly returns non-zero, they get a new ID, and the old one may be replaced. Neither scenario corrupts the *structure* of the map: no crashes, no corruption of the map's metadata. However, the *data* in the map can be inconsistent with what the user expects. The user calls `bpf_get_stackid`, gets an ID, and *then* looks up the stack trace in the map using that ID; the map always contains whatever was last `xchg`'d into it. If Task A says "here's the ID for your stack trace" and Task B replaces that trace immediately after `bpf_get_stackid` returns but before the lookup, the user gets Task B's trace. That part is a race within the BPF program's own logic. Critically, though, the kernel *provides* this helper: if the helper returns an ID that does not correspond to the stack trace it was given, that is a bug in the helper itself, not in the program.

Can the racing `memcmp` cause a crash? The bucket memory is preallocated and never freed (until the map is destroyed), and the concurrent write goes to the same preallocated memory, so no crash is possible. But `memcmp` isn't atomic: reading in 8-byte chunks while a write is in flight, it can observe a mixture of old and new data. If `memcmp` erroneously returns 0, it believed `bucket->data` and `new_bucket->data` matched. Even if the data *was* genuinely the same, Task B can replace the bucket immediately after `memcmp` returns 0: I return the ID, but the bucket is already different; the user looks it up and gets the *wrong* stack.

This is a design flaw: `bpf_get_stackid` takes no locks. Task B's `xchg` is atomic; afterwards Task A holds a pointer to the *old* bucket while the map slot at that ID holds Task B's new data. Task A keeps reading the old bucket, Task B pushes it to the freelist, and Task C pops it and writes to it. If `memcmp` returns 0, Task A concludes its stack trace (Z) matches what was in bucket X. Even if it *did* match originally, that trace is no longer at that map index, since Task B just replaced it with Y: the user looks up the ID and gets Y. If Y != Z, the user gets the wrong data.

This is a real bug in the helper. It should probably use RCU, but `pcpu_freelist` doesn't support that. The question is whether, given that this helper is supposed to be fast, the race was considered acceptable. KCSAN uses "benign" to mean "doesn't lead to memory corruption or crashes"; here there is no crash, but the results are incorrect. The report confirms it: "value changed: 0x0000000000000000 -> 0xf042c5f12da5e5e6", which is exactly `memcmp` reading a value while it is being written. Although `__bpf_get_stackid` is called under an RCU read lock, `pcpu_freelist` doesn't care. Also, if the `BPF_F_FAST_STACK_CMP` flag is set, the code doesn't even use `memcmp`; it trusts the hash alone, which is "racy" by design. |
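The key property of the freelist code above is that it is LIFO with no deferral: a node pushed on a CPU is the first candidate returned by the next pop on that CPU. A tiny single-threaded userspace model (illustrative only; the kernel version adds per-CPU heads and resilient spinlocks) makes the reuse-without-grace-period point:

```c
/* Single-CPU, lock-free model of the push/pop pattern above. */
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

static struct node *head;	/* models one CPU's pcpu_freelist_head */

static void push(struct node *n)
{
	n->next = head;		/* LIFO: the new node becomes the head */
	head = n;
}

static struct node *pop(void)
{
	struct node *n = head;
	if (n)
		head = n->next;
	return n;
}

int main(void)
{
	struct node a = { 0 }, b = { 0 };

	push(&b);
	push(&a);		/* "bucket X" returned to the freelist... */
	assert(pop() == &a);	/* ...is the very next node handed out,  */
	return 0;		/* with no grace period before reuse.    */
}
```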
| 12/2 | 2026/02/25 03:02 | tool | grepper |
0m | Args:map[Expression:BPF_F_STACK_BUILD_ID] Results: map[Output:include/uapi/linux/bpf.h=1379=enum {
--
include/uapi/linux/bpf.h-1395-/* Flag for stack_map, store build_id+offset instead of pointer */
include/uapi/linux/bpf.h:1396: BPF_F_STACK_BUILD_ID = (1U << 5),
include/uapi/linux/bpf.h-1397-
--
kernel/bpf/stackmap.c-16- (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
kernel/bpf/stackmap.c:17: BPF_F_STACK_BUILD_ID)
kernel/bpf/stackmap.c-18-
--
kernel/bpf/stackmap.c=34=static inline bool stack_map_use_build_id(struct bpf_map *map)
kernel/bpf/stackmap.c-35-{
kernel/bpf/stackmap.c:36: return (map->map_flags & BPF_F_STACK_BUILD_ID);
kernel/bpf/stackmap.c-37-}
--
kernel/bpf/stackmap.c=92=static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
--
kernel/bpf/stackmap.c-107- BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64));
kernel/bpf/stackmap.c:108: if (attr->map_flags & BPF_F_STACK_BUILD_ID) {
kernel/bpf/stackmap.c-109- if (value_size % sizeof(struct bpf_stack_build_id) ||
--
tools/include/uapi/linux/bpf.h=1379=enum {
--
tools/include/uapi/linux/bpf.h-1395-/* Flag for stack_map, store build_id+offset instead of pointer */
tools/include/uapi/linux/bpf.h:1396: BPF_F_STACK_BUILD_ID = (1U << 5),
tools/include/uapi/linux/bpf.h-1397-
--
tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c=27=struct {
--
tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c-29- __uint(max_entries, 128);
tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c:30: __uint(map_flags, BPF_F_STACK_BUILD_ID);
tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c-31- __type(key, __u32);
] |
| 13/2 | 2026/02/25 03:02 | llm | expert |
0m | Model:gemini-3-flash-preview Tokens: input: 29994 output: 20 thoughts: 1864 Thoughts: **My Analysis of the `bpf_get_stackid` Race Condition**

Okay, so I'm looking at this race condition in `__bpf_get_stackid` in `kernel/bpf/stackmap.c`. The core of the problem is the `memcmp` call that compares `bucket->data` to `new_bucket->data` to determine whether a stack trace already exists in the map. The `bucket` pointer is obtained with `READ_ONCE(smap->buckets[id])`, but nothing protects that bucket from being replaced and returned to the freelist while the `memcmp` is in flight. Essentially, we have a classic data race: Task A is reading the data, Task B could be replacing the bucket, and Task C could be writing to it simultaneously. If `memcmp` reads inconsistent data, we have a problem.

Is this "benign"? Maybe, if it were just a "best effort" check. But if it leads to an ID being returned that points to an incorrect stack trace, we have a bug. Say `memcmp` incorrectly returns 0: the user looks up the ID in the map, and the map now contains whatever Task B (or someone else) placed there. They're getting the wrong stack trace. That's not acceptable for a helper that is supposed to be accurate. And this isn't documented as a "best effort" function: the documentation states that the helper returns a positive or zero stack id on success, or a negative error on failure. It doesn't say "it might return the wrong ID." So returning an incorrect ID is a bug.

Looking at the code, the race is on `nr` and `data` in `stack_map_bucket`, and the `pcpu_freelist` is the problem: it doesn't use RCU. A potential fix would be to check whether `smap->buckets[id]` still points to `bucket` AFTER the `memcmp` check, but even then there's a chance of an ABA problem where the pointer changed and changed back (a sketch of this re-check follows the grep result below). I've heard that these kinds of races are known to exist in the BPF stackmap.

I found a KCSAN report. It details a write of 20 bytes by one task racing with a read of 8 bytes by another, involving `memcpy` and `memset` in `parse_build_id` and `bcmp` (which is what `memcmp` resolves to). So the `memcmp` is racing with `memcpy` and `memset` operations; it's definitely not going to be accurate. If this were just for tracing and the user could tolerate some errors, maybe this would be "benign", but it's still a bug in the helper's logic. And if the BPF stackmap is used in a security auditing context, getting the wrong stack trace could be a security issue. It can't be "benign" in that scenario.

The real problem is that the bucket can be reused for a completely different stack trace: Task A is looking for stack trace Z; Task B replaces X with Y; Task C pops X and is writing V into it. If `memcmp` says they match, Task A returns an id it believes identifies Z, but the id is actually for stack trace Y, so the user gets Y instead of Z. This is a bug, not a "benign" race. It's a race on the *content* of a data structure.

I'm now going to search for any previous discussions. I'll search for "bpf_get_stackid race memcmp". Maybe I can use `grepper` to look for recent changes to `kernel/bpf/stackmap.c`. Let's see if I can find any leads. |
| 14/2 | 2026/02/25 03:03 | tool | grepper |
0m | Args:map[Expression:memcmp.*bucket->data] Results: map[Output:kernel/bpf/stackmap.c=249=static long __bpf_get_stackid(struct bpf_map *map,
--
kernel/bpf/stackmap.c-291- if (hash_matches && bucket->nr == trace_nr &&
kernel/bpf/stackmap.c:292: memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
kernel/bpf/stackmap.c-293- pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
--
kernel/bpf/stackmap.c-301- if (hash_matches && bucket->nr == trace_nr &&
kernel/bpf/stackmap.c:302: memcmp(bucket->data, ips, trace_len) == 0)
kernel/bpf/stackmap.c-303- return id;
] |
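For reference, the re-check idea floated in the reasoning above would look roughly like the following against the lines just shown. This is a hypothetical sketch, not current kernel code, and as noted it only narrows the window: an ABA sequence that reinstalls the same bucket pointer at the same id would still slip through.

```c
/* Hypothetical mitigation sketch (NOT current kernel code): re-verify
 * the slot after the comparison.  An ABA reinsertion of the same
 * bucket pointer at the same id would still defeat this check. */
if (hash_matches && bucket->nr == trace_nr &&
    memcmp(bucket->data, new_bucket->data, trace_len) == 0 &&
    READ_ONCE(smap->buckets[id]) == bucket) {	/* added re-check */
	pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
	return id;
}
```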
| 15/2 | 2026/02/25 03:03 | llm | expert |
0m | Model:gemini-3-flash-preview Tokens: input: 32979 output: 41 thoughts: 670 Thoughts: **Analyzing a Race Condition in the Stack Map Implementation**

Okay, so I'm looking at this code and there's a definite race on `bucket->data` and `bucket->nr`, and the fact that `bucket->nr` is involved makes this trickier. My initial thought was about potential out-of-bounds reads if `bucket->nr` changed after being read, but `trace_len` is derived from a local copy, so that's not the issue: `memcmp` reads `trace_len` bytes from `bucket->data`, and that size is within the preallocated bucket, so there are no out-of-bounds reads. The race is *specifically* on the data being compared.

Is this benign? That depends. If it's just a "best effort" check, maybe it's not the end of the world. But if it leads to returning an ID that points to the *wrong* stack trace, then we have a bug. Wait, I remember a similar report from before, about `bpf_get_stackid` and `memcmp` racing. The consensus then was that it's a real bug, but fixing it without a performance penalty is tough. KCSAN's definition of "benign" is really about crashes or memory corruption. This race doesn't seem to cause either of those, but it definitely produces incorrect results, which is data corruption in that sense.

Since `memcmp` is racing with a write, this is technically undefined behavior in C. In the kernel, however, we tend to assume it just reads some mix of old and new data. And the report says, "value changed: 0x0000000000000000 -> 0xf042c5f12da5e5e6", which is exactly what happens when `memcmp` reads data while `memcpy` is writing. Is this harmful? On a security-critical path, absolutely yes. For plain tracing, maybe not. But BPF is used for both, which raises a red flag.

Let me zero in on `stack_map_get_build_id_offset` again. It writes to `id_offs[i].build_id`, and that's the location of the race. Hold on, this is interesting: `stack_map_get_build_id_offset` is called on `new_bucket->data`, and `new_bucket` was *just* popped from the freelist. It's not in the map yet. So the only way another task can be reading from it is if it was *previously* in the map, was replaced, and *then* popped from the freelist. That means this race condition *only* occurs when a bucket is reused very quickly, which is highly likely in high-load scenarios.

If this is a known issue, maybe there's a comment somewhere. I need to check the comments in `kernel/bpf/stackmap.c` to see if there is any mention of this. |
| 16/2 | 2026/02/25 03:03 | tool | codesearch-definition-comment |
0m | Args:map[ContextFile:kernel/bpf/stackmap.c Name:__bpf_get_stackid] Results: map[Comment: Kind:function] |
| 17/2 | 2026/02/25 03:03 | llm | expert |
0m | Model:gemini-3-flash-preview Tokens: input: 34401 output: 18 thoughts: 76 Thoughts: **Analyzing the `pcpu_freelist` and Potential RCU Issues**

Alright, the comment lookup came back empty, so silence is the answer for now. Let's dig back into that `pcpu_freelist`. Hmm, still no luck. My assessment stands: if it were using Read-Copy-Update (RCU), this particular issue wouldn't be present. But, alas, it's not. Hold on a second, something's clicking. I recall a commit that *might* be connected to this. I'll need to explore recent changes in `kernel/bpf/stackmap.c` to see if there's a relevant patch. Time to fire up `grepper` and investigate whether the problem can be fixed. |
| 18/2 | 2026/02/25 03:03 | tool | grepper |
0m | Args:map[Expression:stackmap.c] Results: map[Output:MAINTAINERS=4879=S: Maintained
MAINTAINERS:4880:F: kernel/bpf/stackmap.c
MAINTAINERS-4881-F: kernel/trace/bpf_trace.c
--
samples/bpf/offwaketime_user.c=47=static void print_stack(struct key_t *key, __u64 count)
--
samples/bpf/offwaketime_user.c-69- if ((key->tret == -EEXIST || key->wret == -EEXIST) && !warned) {
samples/bpf/offwaketime_user.c:70: printf("stackmap collisions seen. Consider increasing size\n");
samples/bpf/offwaketime_user.c-71- warned = true;
--
samples/bpf/trace_event_user.c=61=static void print_stack(struct key_t *key, __u64 count)
--
samples/bpf/trace_event_user.c-86- if (key->kernstack == -EEXIST && !warned) {
samples/bpf/trace_event_user.c:87: printf("stackmap collisions seen. Consider increasing size\n");
samples/bpf/trace_event_user.c-88- warned = true;
] |
| 19/2 | 2026/02/25 03:03 | llm | expert |
0m | Model:gemini-3-flash-preview Tokens: input: 35022 output: 46 thoughts: 472 Thoughts: **Analyzing a Stackmap Collision and Potential Data Corruption**

Okay, so those hits are about "stackmap collisions", which are expected when the map is full or experiences many collisions. What I've identified isn't a collision in the traditional sense, though; it's a race condition on the *content* of a bucket that is being reused. If this were a well-known issue, perhaps it would be deemed acceptable because it's part of a tracing helper. But from a KCSAN perspective, "benign" typically means the absence of crashes or memory corruption, and while this race doesn't cause those directly, it leads to "data corruption" in the form of incorrect results.

Again, `memcmp` racing with a write is technically undefined behavior in C; in the kernel, though, we usually assume it reads a mix of the old and new data. The report shows the value changing, `0x0000000000000000 -> 0xf042c5f12da5e5e6`, which is exactly what happens when `memcmp` reads while `memcpy` is writing. Is this actually "harmful"? On a security-critical path, absolutely. For pure tracing, maybe not. But BPF is used for *both*, so that distinction blurs.

Zooming back in on `stack_map_get_build_id_offset`: it writes to `id_offs[i].build_id`, so that's the very spot where the race occurs. And here's the interesting part: it is called on `new_bucket->data`, and `new_bucket` was just popped from the freelist, so it's not actually in the map yet. The only way another task could be reading from it is if it *was* in the map previously, got replaced, and then was popped from the freelist. That means this race only happens when a bucket is reused *very* quickly, and that kind of rapid reuse is quite common in high-load scenarios.

If this is a known quirk, perhaps there's a comment somewhere. I'll search the comments in `kernel/bpf/stackmap.c` for any clues. |
| 20/2 | 2026/02/25 03:03 | tool | codesearch-definition-source |
0m | Args:map[ContextFile:kernel/bpf/stackmap.c IncludeLines:true Name:__bpf_get_stackid] Results: map[SourceCode: 249: static long __bpf_get_stackid(struct bpf_map *map,
250: struct perf_callchain_entry *trace, u64 flags)
251: {
252: struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
253: struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
254: u32 hash, id, trace_nr, trace_len, i, max_depth;
255: u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
256: bool user = flags & BPF_F_USER_STACK;
257: u64 *ips;
258: bool hash_matches;
259:
260: if (trace->nr <= skip)
261: /* skipping more than usable stack trace */
262: return -EFAULT;
263:
264: max_depth = stack_map_calculate_max_depth(map->value_size, stack_map_data_size(map), flags);
265: trace_nr = min_t(u32, trace->nr - skip, max_depth - skip);
266: trace_len = trace_nr * sizeof(u64);
267: ips = trace->ip + skip;
268: hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
269: id = hash & (smap->n_buckets - 1);
270: bucket = READ_ONCE(smap->buckets[id]);
271:
272: hash_matches = bucket && bucket->hash == hash;
273: /* fast cmp */
274: if (hash_matches && flags & BPF_F_FAST_STACK_CMP)
275: return id;
276:
277: if (stack_map_use_build_id(map)) {
278: struct bpf_stack_build_id *id_offs;
279:
280: /* for build_id+offset, pop a bucket before slow cmp */
281: new_bucket = (struct stack_map_bucket *)
282: pcpu_freelist_pop(&smap->freelist);
283: if (unlikely(!new_bucket))
284: return -ENOMEM;
285: new_bucket->nr = trace_nr;
286: id_offs = (struct bpf_stack_build_id *)new_bucket->data;
287: for (i = 0; i < trace_nr; i++)
288: id_offs[i].ip = ips[i];
289: stack_map_get_build_id_offset(id_offs, trace_nr, user, false /* !may_fault */);
290: trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
291: if (hash_matches && bucket->nr == trace_nr &&
292: memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
293: pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
294: return id;
295: }
296: if (bucket && !(flags & BPF_F_REUSE_STACKID)) {
297: pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
298: return -EEXIST;
299: }
300: } else {
301: if (hash_matches && bucket->nr == trace_nr &&
302: memcmp(bucket->data, ips, trace_len) == 0)
303: return id;
304: if (bucket && !(flags & BPF_F_REUSE_STACKID))
305: return -EEXIST;
306:
307: new_bucket = (struct stack_map_bucket *)
308: pcpu_freelist_pop(&smap->freelist);
309: if (unlikely(!new_bucket))
310: return -ENOMEM;
311: memcpy(new_bucket->data, ips, trace_len);
312: }
313:
314: new_bucket->hash = hash;
315: new_bucket->nr = trace_nr;
316:
317: old_bucket = xchg(&smap->buckets[id], new_bucket);
318: if (old_bucket)
319: pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
320: return id;
321: }
SourceFile:kernel/bpf/stackmap.c] |
| 21/2 | 2026/02/25 03:03 | llm | expert |
5m | Model:gemini-3-flash-preview Tokens: input: 37467 output: 19 thoughts: 62913 |
| 22/2 | 2026/02/25 03:08 | tool | set-results |
0m | Args:map[Benign:false Confident:true] Results: map[Benign:false Confident:true] |
| 23/2 | 2026/02/25 03:08 | llm | expert |
0m | Model:gemini-3-flash-preview Tokens: input: 100416 output: 625 thoughts: 507 Thoughts: **Analyzing the BPF Stack Map Data Race**

Okay, so I've been digging into this `__bpf_get_stackid` issue in `kernel/bpf/stackmap.c`, and the problem is pretty clear. It's a textbook data race, and a nasty one at that, because it directly impacts the *correctness* of BPF program behavior.

Here's how it plays out. A task, call it Task A, is happily chugging along in `__bpf_get_stackid` and finds a bucket, `X`, within the stack map at `smap->buckets[id]`. Simultaneously, Task B, also calling `__bpf_get_stackid`, uses `xchg` to replace bucket `X` with a brand-new bucket, `Y`. Then, and this is crucial, Task B immediately pushes the old bucket `X` back onto the freelist (`pcpu_freelist`). Now enter Task C: it snatches bucket `X` from the freelist, assigns it as its own `new_bucket`, and proceeds to fill `X` with *new* data for its own stack trace. The race occurs because Task A, *still* holding a pointer to the now-reused bucket `X`, is in the middle of a `memcmp` (or `bcmp`) operation checking whether the stack trace in bucket `X` matches what it is looking for.

The core problem: Task A's `memcmp` is reading from a bucket that Task C is actively writing to. The timing is perfect for `memcmp` to return an incorrect result. If `memcmp` incorrectly reports a match (returns 0), `bpf_get_stackid` will return an ID based on the wrong data, because bucket `X` is now filled with Task C's data, and the returned ID no longer points to the original, correct stack trace.

Sure, we're not seeing a kernel panic here; the memory for these buckets is preallocated, so we're not dealing with a use-after-free. But the data corruption visible to the BPF program is real and has serious implications. It completely undermines the *expected behavior* of the `bpf_get_stackid` helper, especially for any tracing, monitoring, or security application that depends on accurately identifying stack IDs. The lack of any synchronization, like RCU, around the bucket comparison is what makes this a real and dangerous race condition.

The KCSAN report I've been looking at confirms this directly: it shows a write operation in `parse_build_id` (task 18187) directly racing with a read in `bcmp` (task 18266), with the data being mutated *during* the read. It is definitively a data race, and we need a solid fix. |
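As a closing illustration of the "no RCU" point, here is a purely hypothetical sketch (not a proposed patch; it glosses over real constraints such as the contexts this helper runs in and freelist depletion during grace periods): deferring the freelist push until a grace period has elapsed would let concurrent readers finish their `memcmp` before the bucket is reused. The `rcu` member and the `smap` back-pointer are invented additions; the rest mirrors the replacement site shown in the source dump above.

```c
/* Hypothetical shape of an RCU-deferred recycle.  The rcu member and
 * the smap back-pointer are invented additions to stack_map_bucket;
 * everything else mirrors the replacement site shown earlier. */
struct stack_map_bucket {
	struct pcpu_freelist_node fnode;
	struct rcu_head rcu;		/* hypothetical addition */
	struct bpf_stack_map *smap;	/* hypothetical back-pointer */
	u32 hash;
	u32 nr;
	u64 data[];
};

static void stack_map_bucket_recycle(struct rcu_head *rcu)
{
	struct stack_map_bucket *b =
		container_of(rcu, struct stack_map_bucket, rcu);

	pcpu_freelist_push(&b->smap->freelist, &b->fnode);
}

/* At the replacement site, instead of pushing old_bucket directly: */
old_bucket = xchg(&smap->buckets[id], new_bucket);
if (old_bucket)
	call_rcu(&old_bucket->rcu, stack_map_bucket_recycle);
```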