syzbot |
| ID | Workflow | Result | Correct | Bug | Created | Started | Finished | Revision | Error |
|---|---|---|---|---|---|---|---|---|---|
| ab25b7bb-7500-47f8-be94-16130332e3ea | moderation | Actionable: ✅ Confident: ✅ | ❓ | KASAN: slab-use-after-free Read in clone_mnt | 2026/02/06 12:22 | 2026/02/06 12:35 | 2026/02/06 12:36 | 97745f52f5b9361548048f53f0b3b51b9313fedf | |
The bug report describes a use-after-free (UAF) in the mount namespace cloning path. Specifically, a `struct mount` object (allocated in `mnt_cache`) is accessed after being freed.

### Analysis of the Stacks:

1. **Access Stack**: The UAF occurs in `clone_mnt` (called via `copy_tree` and `copy_mnt_ns` during an `unshare(CLONE_NEWNS)` syscall). The access is within `list_add(&mnt->mnt_share, &old->mnt_share)`, where `old` is the root mount of the namespace being cloned (`ns->root`).
2. **Allocation Stack**: The object was allocated via `vfs_create_mount` during a regular `mount` syscall. This is consistent with the creation of a `struct mount`.
3. **Free Stack**: The object was freed via RCU. The free was initiated by `mntput` called from `namespace_unlock` in `__detach_mounts`.
4. **Auxiliary Stacks**: The "potentially related work creation" stacks are highly informative. They show that `__detach_mounts` was triggered by `proc_invalidate_siblings_dcache` during `release_task`. This happens when a process exits, causing its `/proc/<pid>` entries to be invalidated. If a mount was attached to one of these entries (e.g., a detached mount created via the new mount API and then bind-mounted into a procfs directory), it gets detached and unmounted.

### Root Cause:

The issue is that a `mnt_namespace` holds a pointer to its root mount (`ns->root`), but this pointer is not necessarily a pinning reference that prevents the mount from being unmounted. If the root mount of a namespace is unmounted (e.g., via `umount -l /` or via `__detach_mounts` if it was mounted elsewhere), the `struct mount` object can be freed while the `mnt_namespace` still exists and points to it. When a subsequent operation (like `unshare` or `setns`) attempts to clone or use this namespace, it accesses the dangling `ns->root` pointer, leading to a UAF.

### Conclusion:

The report is self-consistent and provides all the necessary information to identify the root cause. The allocation, free, and access stacks all involve the same object type (`struct mount`) and the auxiliary stacks provide the exact mechanism of the race/leak. The report is **actionable** and **consistent**.

Final results:

- Actionable: Yes
- Confident: Yes
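
The distinction the root-cause paragraph draws between a plain pointer and a pinning reference can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_mount`, `toy_ns`, and every `toy_*` helper are hypothetical names invented for illustration, not kernel APIs, and the model sets a `freed` flag instead of releasing memory so that it stays well-defined.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as struct mount. */
struct toy_mount {
	int refcount;	/* number of counted references */
	bool freed;	/* set instead of calling free(), so the model has no UB */
};

/* Hypothetical stand-in for a namespace that remembers its root. */
struct toy_ns {
	struct toy_mount *root;
};

/* Take a counted reference. */
static void toy_get(struct toy_mount *m)
{
	m->refcount++;
}

/* Drop a counted reference; the last drop "frees" the object. */
static void toy_put(struct toy_mount *m)
{
	if (--m->refcount == 0)
		m->freed = true;	/* real code would release the memory here */
}

/* Store the root WITHOUT taking a reference: the pointer does not pin. */
static void toy_ns_set_root_unpinned(struct toy_ns *ns, struct toy_mount *m)
{
	ns->root = m;
}

/* Store the root AND take a reference: the object outlives other holders. */
static void toy_ns_set_root_pinned(struct toy_ns *ns, struct toy_mount *m)
{
	toy_get(m);
	ns->root = m;
}

/* True if dereferencing ns->root would touch an already released object. */
static bool toy_ns_root_is_dangling(const struct toy_ns *ns)
{
	return ns->root->freed;
}
```

In this model, an object created with `refcount = 1` and stored via `toy_ns_set_root_unpinned()` is reported as dangling as soon as the original holder calls `toy_put()`, whereas the pinned variant keeps it alive. Whether the real `ns->root` behaves like the unpinned case is the claim made by the analysis above, not something this sketch verifies.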
| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/02/06 12:35 | flow | moderation |
1mResults:map[Actionable:true Confident:true Explanation:The bug report describes a use-after-free (UAF) in the mount namespace cloning path. Specifically, a `struct mount` object (allocated in `mnt_cache`) is accessed after being freed. ### Analysis of the Stacks: 1. **Access Stack**: The UAF occurs in `clone_mnt` (called via `copy_tree` and `copy_mnt_ns` during an `unshare(CLONE_NEWNS)` syscall). The access is within `list_add(&mnt->mnt_share, &old->mnt_share)`, where `old` is the root mount of the namespace being cloned (`ns->root`). 2. **Allocation Stack**: The object was allocated via `vfs_create_mount` during a regular `mount` syscall. This is consistent with the creation of a `struct mount`. 3. **Free Stack**: The object was freed via RCU. The free was initiated by `mntput` called from `namespace_unlock` in `__detach_mounts`. 4. **Auxiliary Stacks**: The "potentially related work creation" stacks are highly informative. They show that `__detach_mounts` was triggered by `proc_invalidate_siblings_dcache` during `release_task`. This happens when a process exits, causing its `/proc/<pid>` entries to be invalidated. If a mount was attached to one of these entries (e.g., a detached mount created via the new mount API and then bind-mounted into a procfs directory), it gets detached and unmounted. ### Root Cause: The issue is that a `mnt_namespace` holds a pointer to its root mount (`ns->root`), but this pointer is not necessarily a pinning reference that prevents the mount from being unmounted. If the root mount of a namespace is unmounted (e.g., via `umount -l /` or via `__detach_mounts` if it was mounted elsewhere), the `struct mount` object can be freed while the `mnt_namespace` still exists and points to it. When a subsequent operation (like `unshare` or `setns`) attempts to clone or use this namespace, it accesses the dangling `ns->root` pointer, leading to a UAF. 
### Conclusion: The report is self-consistent and provides all the necessary information to identify the root cause. The allocation, free, and access stacks all involve the same object type (`struct mount`) and the auxiliary stacks provide the exact mechanism of the race/leak. The report is **actionable** and **consistent**. Final results: - Actionable: Yes - Confident: Yes] |
| 1/1 | 2026/02/06 12:35 | action | kernel-checkouter |
0mResults:map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/1ac8d545521fe90cdf63b4ce73cd1b1fc65d4b3e] |
| 2/1 | 2026/02/06 12:35 | action | kernel-builder |
0mResults:map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/a15d451146e4f58bcf546cfb408e28529a75d0d3] |
| 3/1 | 2026/02/06 12:35 | action | codesearch-prepare |
0mResults:map[Index:codesearch-index] |
| 4/1 | 2026/02/06 12:35 | agent | expert |
1mModel:gemini-3-flash-preview Results: map[Actionable:true Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel bug report is actionable or not. Actionable means that it contains enough info to root cause the underlying bug, and that the report is self-consistent and makes sense, rather than e.g. a one-off nonsensical crash induced by a previous memory corruption. The bug report is about a use-after-free bug generated by KASAN tool. It should contain 3 stack traces: the bad memory access stack, the heap block allocation stack, and the heap block free stack. If the report does not contain 3 stacks, it's not actionable. All 3 stack traces should be related to the same object type, and usually be in the same kernel subsystem (at least leaf stack frames). An example of an actionable and consistent report would be: first access stack relates to an access to a field of struct Foo, allocation/free stacks relate to allocation/free of the struct Foo. In inconsistent/nonsensical reports an access may be to a struct Foo, but allocation stack allocates a different structure in a different subsystem. Look for other suspicious signals/inconsistencies that can make this report hard to debug/understand. In the final reply explain why you think the report is self-consistent and actionable, or why it's inconsistent and/or not actionable. Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. Prompt:
The bug report is:
==================================================================
BUG: KASAN: slab-use-after-free in __list_add_valid_or_report+0x4e/0x130 lib/list_debug.c:29
Read of size 8 at addr ffff88806189d400 by task syz.1.1921/14961
CPU: 1 UID: 0 PID: 14961 Comm: syz.1.1921 Tainted: G L syzkaller #0 PREEMPT(full)
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xba/0x230 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
__list_add_valid_or_report+0x4e/0x130 lib/list_debug.c:29
__list_add_valid include/linux/list.h:96 [inline]
__list_add include/linux/list.h:158 [inline]
list_add include/linux/list.h:177 [inline]
clone_mnt+0x447/0x9a0 fs/namespace.c:1275
copy_tree+0xde/0x930 fs/namespace.c:2159
copy_mnt_ns+0x24d/0x990 fs/namespace.c:4246
create_new_namespaces+0xcf/0x6a0 kernel/nsproxy.c:98
unshare_nsproxy_namespaces+0x11a/0x160 kernel/nsproxy.c:226
ksys_unshare+0x4f4/0x900 kernel/fork.c:3173
__do_sys_unshare kernel/fork.c:3244 [inline]
__se_sys_unshare kernel/fork.c:3242 [inline]
__x64_sys_unshare+0x38/0x50 kernel/fork.c:3242
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f3e5c39aeb9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3e5d1ce028 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f3e5c615fa0 RCX: 00007f3e5c39aeb9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000022020600
RBP: 00007f3e5c408c1f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f3e5c616038 R14: 00007f3e5c615fa0 R15: 00007ffe1e964e38
</TASK>
Allocated by task 14798:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
unpoison_slab_object mm/kasan/common.c:340 [inline]
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366
kasan_slab_alloc include/linux/kasan.h:253 [inline]
slab_post_alloc_hook mm/slub.c:4456 [inline]
slab_alloc_node mm/slub.c:4785 [inline]
kmem_cache_alloc_noprof+0x2bc/0x650 mm/slub.c:4792
alloc_vfsmnt+0x23/0x420 fs/namespace.c:287
vfs_create_mount+0x69/0x320 fs/namespace.c:1178
fc_mount fs/namespace.c:1196 [inline]
do_new_mount_fc fs/namespace.c:3760 [inline]
do_new_mount+0x3bf/0xd30 fs/namespace.c:3836
do_mount fs/namespace.c:4159 [inline]
__do_sys_mount fs/namespace.c:4348 [inline]
__se_sys_mount+0x31d/0x420 fs/namespace.c:4325
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 15:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:253 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:2667 [inline]
slab_free mm/slub.c:6079 [inline]
kmem_cache_free+0x187/0x630 mm/slub.c:6205
rcu_do_batch kernel/rcu/tree.c:2617 [inline]
rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869
handle_softirqs+0x22a/0x870 kernel/softirq.c:626
run_ksoftirqd+0x36/0x60 kernel/softirq.c:1067
smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
kthread+0x388/0x470 kernel/kthread.c:467
ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Last potentially related work creation:
kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
__call_rcu_common kernel/rcu/tree.c:3131 [inline]
call_rcu+0xee/0x890 kernel/rcu/tree.c:3251
task_work_run+0x1d9/0x270 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0xed/0x480 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x32d/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Second to last potentially related work creation:
kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
task_work_add+0xb6/0x440 kernel/task_work.c:70
mntput_no_expire_slowpath+0x70c/0xbd0 fs/namespace.c:1379
mntput fs/namespace.c:1416 [inline]
namespace_unlock+0x704/0x7a0 fs/namespace.c:1713
class_namespace_excl_destructor fs/namespace.c:90 [inline]
__detach_mounts+0x504/0x590 fs/namespace.c:1987
detach_mounts fs/mount.h:146 [inline]
d_invalidate+0x14d/0x210 fs/dcache.c:1718
proc_invalidate_siblings_dcache+0x3d3/0x6c0 fs/proc/inode.c:142
release_task+0x1204/0x16f0 kernel/exit.c:291
wait_task_zombie kernel/exit.c:1274 [inline]
wait_consider_task+0x1966/0x2e30 kernel/exit.c:1501
do_wait_thread kernel/exit.c:1564 [inline]
__do_wait+0x152/0x740 kernel/exit.c:1682
do_wait+0x1e7/0x540 kernel/exit.c:1716
kernel_wait4+0x232/0x2b0 kernel/exit.c:1875
__do_sys_wait4 kernel/exit.c:1903 [inline]
__se_sys_wait4 kernel/exit.c:1899 [inline]
__x64_sys_wait4+0x166/0x240 kernel/exit.c:1899
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff88806189d340
which belongs to the cache mnt_cache of size 352
The buggy address is located 192 bytes inside of
freed 352-byte region [ffff88806189d340, ffff88806189d4a0)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88806189ce00 pfn:0x6189c
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff888068d98d01
flags: 0xfff00000000240(workingset|head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000240 ffff88801ca93640 ffffea0001862610 ffffea0001bfbe10
raw: ffff88806189ce00 000000080012000c 00000000f5000000 ffff888068d98d01
head: 00fff00000000240 ffff88801ca93640 ffffea0001862610 ffffea0001bfbe10
head: ffff88806189ce00 000000080012000c 00000000f5000000 ffff888068d98d01
head: 00fff00000000001 ffffea0001862701 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5819, tgid 5819 (syz-executor), ts 71566008237, free_ts 21136902514
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x231/0x280 mm/page_alloc.c:1883
prep_new_page mm/page_alloc.c:1891 [inline]
get_page_from_freelist+0x24dc/0x2580 mm/page_alloc.c:3956
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5244
alloc_slab_page mm/slub.c:3235 [inline]
allocate_slab+0x7b/0x660 mm/slub.c:3408
new_slab mm/slub.c:3466 [inline]
refill_objects+0x331/0x3c0 mm/slub.c:7084
refill_sheaf mm/slub.c:2784 [inline]
__pcs_replace_empty_main+0x2b9/0x620 mm/slub.c:4533
alloc_from_pcs mm/slub.c:4636 [inline]
slab_alloc_node mm/slub.c:4770 [inline]
kmem_cache_alloc_noprof+0x37d/0x650 mm/slub.c:4792
alloc_vfsmnt+0x23/0x420 fs/namespace.c:287
clone_mnt+0x4b/0x9a0 fs/namespace.c:1246
copy_tree+0x3d4/0x930 fs/namespace.c:2194
copy_mnt_ns+0x24d/0x990 fs/namespace.c:4246
create_new_namespaces+0xcf/0x6a0 kernel/nsproxy.c:98
unshare_nsproxy_namespaces+0x11a/0x160 kernel/nsproxy.c:226
ksys_unshare+0x4f4/0x900 kernel/fork.c:3173
__do_sys_unshare kernel/fork.c:3244 [inline]
__se_sys_unshare kernel/fork.c:3242 [inline]
__x64_sys_unshare+0x38/0x50 kernel/fork.c:3242
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
page last free pid 1 tgid 1 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1432 [inline]
__free_frozen_pages+0xc01/0xd80 mm/page_alloc.c:2972
__free_pages mm/page_alloc.c:5363 [inline]
free_contig_range+0xbb/0x170 mm/page_alloc.c:7367
destroy_args+0x4e5/0x570 mm/debug_vm_pgtable.c:993
debug_vm_pgtable+0x3f8/0x410 mm/debug_vm_pgtable.c:1368
do_one_initcall+0x250/0x8d0 init/main.c:1382
do_initcall_level+0x104/0x190 init/main.c:1444
do_initcalls+0x59/0xa0 init/main.c:1460
kernel_init_freeable+0x2a6/0x3e0 init/main.c:1692
kernel_init+0x1d/0x1d0 init/main.c:1582
ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Memory state around the buggy address:
ffff88806189d300: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff88806189d380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88806189d400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88806189d480: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff88806189d500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Reply: The bug report describes a use-after-free (UAF) in the mount namespace cloning path. Specifically, a `struct mount` object (allocated in `mnt_cache`) is accessed after being freed. ### Analysis of the Stacks: 1. **Access Stack**: The UAF occurs in `clone_mnt` (called via `copy_tree` and `copy_mnt_ns` during an `unshare(CLONE_NEWNS)` syscall). The access is within `list_add(&mnt->mnt_share, &old->mnt_share)`, where `old` is the root mount of the namespace being cloned (`ns->root`). 2. **Allocation Stack**: The object was allocated via `vfs_create_mount` during a regular `mount` syscall. This is consistent with the creation of a `struct mount`. 3. **Free Stack**: The object was freed via RCU. The free was initiated by `mntput` called from `namespace_unlock` in `__detach_mounts`. 4. **Auxiliary Stacks**: The "potentially related work creation" stacks are highly informative. They show that `__detach_mounts` was triggered by `proc_invalidate_siblings_dcache` during `release_task`. This happens when a process exits, causing its `/proc/<pid>` entries to be invalidated. If a mount was attached to one of these entries (e.g., a detached mount created via the new mount API and then bind-mounted into a procfs directory), it gets detached and unmounted. ### Root Cause: The issue is that a `mnt_namespace` holds a pointer to its root mount (`ns->root`), but this pointer is not necessarily a pinning reference that prevents the mount from being unmounted. If the root mount of a namespace is unmounted (e.g., via `umount -l /` or via `__detach_mounts` if it was mounted elsewhere), the `struct mount` object can be freed while the `mnt_namespace` still exists and points to it. When a subsequent operation (like `unshare` or `setns`) attempts to clone or use this namespace, it accesses the dangling `ns->root` pointer, leading to a UAF. ### Conclusion: The report is self-consistent and provides all the necessary information to identify the root cause. 
The allocation, free, and access stacks all involve the same object type (`struct mount`) and the auxiliary stacks provide the exact mechanism of the race/leak. The report is **actionable** and **consistent**. Final results: - Actionable: Yes - Confident: Yes |
| 5/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 8131 output: 57 thoughts: 266 Thoughts: **Analyzing a KASAN Slab-Use-After-Free Bug** Okay, so I'm looking at a KASAN report, and it's a slab-use-after-free issue. The buggy address is `ffff88806189d400`, which, interesting, is offset within the object `ffff88806189d340`. The cache is `mnt_cache`, which is 352 bytes in size, and the access is 192 bytes into the object. Got it. First, I need to look at the stack traces. There's an access stack, an allocation stack, and a free stack. Let's start with the access stack. It looks like `__list_add_valid_or_report` is being called from `clone_mnt` at `fs/namespace.c:1275`. That’s my primary point of interest right now. `clone_mnt` is a critical function for mount propagation, so a bug here is serious. Alright, time to dive into `fs/namespace.c` around line 1275. Let's see what's happening there. |
| 6/2 | 2026/02/06 12:35 | tool | codesearch-dir-index |
0mArgs:map[Dir:fs/] Results: map[Files:[aio.c anon_inodes.c attr.c backing-file.c bad_inode.c binfmt_elf.c binfmt_elf_fdpic.c binfmt_flat.c binfmt_misc.c binfmt_script.c bpf_fs_kfuncs.c buffer.c char_dev.c compat_binfmt_elf.c coredump.c d_path.c dax.c dcache.c direct-io.c drop_caches.c eventfd.c eventpoll.c exec.c fcntl.c fhandle.c file.c file_attr.c file_table.c filesystems.c fs-writeback.c fs_context.c fs_dirent.c fs_parser.c fs_pin.c fs_struct.c fserror.c fsopen.c init.c inode.c internal.h ioctl.c kernel_read_file.c libfs.c locks.c mbcache.c mnt_idmapping.c mount.h mpage.c namei.c namespace.c nsfs.c nullfs.c open.c pidfs.c pipe.c pnode.c pnode.h posix_acl.c proc_namespace.c read_write.c readdir.c remap_range.c select.c seq_file.c signalfd.c splice.c stack.c stat.c statfs.c super.c sync.c sysctls.c timerfd.c userfaultfd.c utimes.c xattr.c] Subdirs:[9p adfs affs afs autofs befs bfs btrfs cachefiles ceph coda configfs cramfs crypto debugfs devpts dlm ecryptfs efivarfs efs erofs exfat exportfs ext2 ext4 f2fs fat freevxfs fuse gfs2 hfs hfsplus hostfs hpfs hugetlbfs iomap isofs jbd2 jffs2 jfs kernfs lockd minix netfs nfs nfs_common nfsd nilfs2 nls notify ntfs3 ocfs2 omfs openpromfs orangefs overlayfs proc pstore qnx4 qnx6 quota ramfs resctrl romfs smb squashfs sysfs tests tracefs ubifs udf ufs unicode vboxsf verity xfs zonefs]] |
| 7/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/namespace.c IncludeLines:true Name:clone_mnt] Results: map[SourceCode:1240: static struct mount *clone_mnt(struct mount *old, struct dentry *root,
1241: 					int flag)
1242: {
1243: 	struct mount *mnt;
1244: 	int err;
1245:
1246: 	mnt = alloc_vfsmnt(old->mnt_devname);
1247: 	if (!mnt)
1248: 		return ERR_PTR(-ENOMEM);
1249:
1250: 	mnt->mnt.mnt_flags = READ_ONCE(old->mnt.mnt_flags) &
1251: 			     ~MNT_INTERNAL_FLAGS;
1252:
1253: 	if (flag & (CL_SLAVE | CL_PRIVATE))
1254: 		mnt->mnt_group_id = 0; /* not a peer of original */
1255: 	else
1256: 		mnt->mnt_group_id = old->mnt_group_id;
1257:
1258: 	if ((flag & CL_MAKE_SHARED) && !mnt->mnt_group_id) {
1259: 		err = mnt_alloc_group_id(mnt);
1260: 		if (err)
1261: 			goto out_free;
1262: 	}
1263:
1264: 	if (mnt->mnt_group_id)
1265: 		set_mnt_shared(mnt);
1266:
1267: 	mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt));
1268:
1269: 	setup_mnt(mnt, root);
1270:
1271: 	if (flag & CL_PRIVATE) // we are done with it
1272: 		return mnt;
1273:
1274: 	if (peers(mnt, old))
1275: 		list_add(&mnt->mnt_share, &old->mnt_share);
1276:
1277: 	if ((flag & CL_SLAVE) && old->mnt_group_id) {
1278: 		hlist_add_head(&mnt->mnt_slave, &old->mnt_slave_list);
1279: 		mnt->mnt_master = old;
1280: 	} else if (IS_MNT_SLAVE(old)) {
1281: 		hlist_add_behind(&mnt->mnt_slave, &old->mnt_slave);
1282: 		mnt->mnt_master = old->mnt_master;
1283: 	}
1284: 	return mnt;
1285:
1286: out_free:
1287: 	mnt_free_id(mnt);
1288: 	free_vfsmnt(mnt);
1289: 	return ERR_PTR(err);
1290: }
SourceFile:fs/namespace.c] |
| 8/2 | 2026/02/06 12:35 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 10482 output: 30 thoughts: 242 Thoughts: **Analyzing a Kernel Address Sanitizer (KASAN) Report** Okay, so I'm looking at this KASAN report, and it looks like a use-after-free issue. The access is pinpointed at line 1275, specifically this `list_add` call: `list_add(&mnt->mnt_share, &old->mnt_share);`. The KASAN report helpfully tells me the issue is happening in `__list_add_valid_or_report`, which means we're dealing with list manipulation: specifically a call to `__list_add`, which in turn calls `__list_add_valid`. Now, the report mentions the UAF is happening on a `mnt_cache` object. Given my experience, it's pretty likely that a `struct mount` is the object residing in this `mnt_cache`. I'll want to verify this assumption by checking the `struct mount` definition, which, as I recall, should be in `fs/mount.h`. The report also gives me the buggy address, `ffff88806189d400`, and the object's starting address, `ffff88806189d340`. A quick calculation gives me an offset of 192 bytes (0xc0). That's critical information. My next step is to locate the `mnt_share` member within `struct mount`. Knowing its offset will provide valuable context in understanding the root cause of this UAF, and also whether we are dereferencing a dangling pointer. |
| 9/2 | 2026/02/06 12:35 | tool | codesearch-struct-layout |
0mArgs:map[ContextFile:fs/namespace.c Name:mount] Results: map[Fields:[map[Name:mnt_hash OffsetBits:0 SizeBits:128] map[Name:mnt_parent OffsetBits:128 SizeBits:64] map[Name:mnt_mountpoint OffsetBits:192 SizeBits:64] map[Name:mnt OffsetBits:256 SizeBits:256] map[Name: OffsetBits:512 SizeBits:192] map[Name:mnt_pcp OffsetBits:704 SizeBits:64] map[Name:mnt_mounts OffsetBits:768 SizeBits:128] map[Name:mnt_child OffsetBits:896 SizeBits:128] map[Name:mnt_next_for_sb OffsetBits:1024 SizeBits:64] map[Name:mnt_pprev_for_sb OffsetBits:1088 SizeBits:64] map[Name:mnt_devname OffsetBits:1152 SizeBits:64] map[Name:mnt_list OffsetBits:1216 SizeBits:128] map[Name:mnt_expire OffsetBits:1344 SizeBits:128] map[Name:mnt_share OffsetBits:1472 SizeBits:128] map[Name:mnt_slave_list OffsetBits:1600 SizeBits:64] map[Name:mnt_slave OffsetBits:1664 SizeBits:128] map[Name:mnt_master OffsetBits:1792 SizeBits:64] map[Name:mnt_ns OffsetBits:1856 SizeBits:64] map[Name:mnt_mp OffsetBits:1920 SizeBits:64] map[Name: OffsetBits:1984 SizeBits:128] map[Name:mnt_fsnotify_marks OffsetBits:2112 SizeBits:64] map[Name:mnt_fsnotify_mask OffsetBits:2176 SizeBits:32] map[Name:to_notify OffsetBits:2240 SizeBits:128] map[Name:prev_ns OffsetBits:2368 SizeBits:64] map[Name:mnt_t_flags OffsetBits:2432 SizeBits:32] map[Name:mnt_id OffsetBits:2464 SizeBits:32] map[Name:mnt_id_unique OffsetBits:2496 SizeBits:64] map[Name:mnt_group_id OffsetBits:2560 SizeBits:32] map[Name:mnt_expiry_mark OffsetBits:2592 SizeBits:32] map[Name:mnt_pins OffsetBits:2624 SizeBits:64] map[Name:mnt_stuck_children OffsetBits:2688 SizeBits:64] map[Name:overmount OffsetBits:2752 SizeBits:64]]] |
| 10/2 | 2026/02/06 12:35 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 11836 output: 43 thoughts: 1600 Thoughts: **UAF Analysis in Mount Namespace Cloning** Okay, let's break this down. It looks like we have a use-after-free vulnerability. The faulting address is 192 bytes into the `struct mount` object. The layout puts `mnt_share` at bit offset 1472, which is byte 184, and `mnt_share` is a `list_head`, so byte 192 is its second pointer, `prev`. It looks like the access is to `old->mnt_share.prev`. The key is `list_add(&mnt->mnt_share, &old->mnt_share)`. This operation uses `__list_add`, which reads and updates the `prev` and `next` pointers. If `old` has been freed, accessing `old->mnt_share.next` or `old->mnt_share.prev` is a UAF. The report says "Read of size 8 at addr ffff88806189d400", which matches an 8-byte read of the `prev` member. Looking at the allocation stack, it's a standard mount allocation, so nothing unusual there. The free stack indicates that the `struct mount` object is freed via RCU, which makes sense because `mntput` often ends in `delayed_free_vfsmnt`. This means that `old` has been freed via RCU. The stack trace of the access shows the UAF occurring within `clone_mnt`, called from `copy_tree`, which is in turn called by `copy_mnt_ns`. Line 2159 in `copy_tree` shows that the `mnt` passed to `copy_tree` is the one being cloned, that is, the `old` one that is used after free. In `copy_mnt_ns`, `ns->root` is being cloned. The function takes `namespace_lock()`, so there should be no concurrent modification. So how can the root mount be freed? The trace looks related to RCU freeing, so the key seems to be `mntput_no_expire_slowpath`. The RCU mechanism means that `ns->root` could have been freed, but the access happens *after* `namespace_lock()` is taken. This seems odd. `mntput_no_expire_slowpath` queues the cleanup with `task_work_add`, and the final free is then deferred through RCU. So it is likely that `mntput` causes an RCU-delayed free of `ns->root`. |
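The offset arithmetic in the two preceding thoughts can be checked with a small userspace sketch. It assumes a 64-bit build and uses a local stand-in for the kernel's `struct list_head`; the helper names (`fault_offset`, `bits_to_bytes`, `mnt_share_field_at`) are hypothetical and exist only for this illustration. The constants 1472 and the two addresses are taken from the report and the struct-layout output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head: next first, then prev. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

enum share_field { FIELD_NONE, FIELD_NEXT, FIELD_PREV };

/* Byte offset of the faulting address inside the slab object. */
static uint64_t fault_offset(uint64_t fault_addr, uint64_t object_start)
{
    assert(fault_addr >= object_start);
    return fault_addr - object_start;
}

/* Convert an OffsetBits value from the struct-layout tool into bytes. */
static uint64_t bits_to_bytes(uint64_t offset_bits)
{
    return offset_bits / 8;
}

/*
 * Map a byte offset inside struct mount to a member of mnt_share,
 * given mnt_share's OffsetBits (1472 in the layout shown above).
 */
static enum share_field mnt_share_field_at(uint64_t byte_in_mount,
                                           uint64_t mnt_share_bits)
{
    uint64_t base = bits_to_bytes(mnt_share_bits);

    if (byte_in_mount == base + offsetof(struct list_head, next))
        return FIELD_NEXT;
    if (byte_in_mount == base + offsetof(struct list_head, prev))
        return FIELD_PREV;
    return FIELD_NONE;
}
```

Calling `mnt_share_field_at(fault_offset(0xffff88806189d400, 0xffff88806189d340), 1472)` identifies the `prev` pointer, which agrees with the trace's reading that the faulting load touches `mnt_share.prev` of a freed `struct mount`.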
| 11/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/namespace.c IncludeLines:true Name:mntput_no_expire_slowpath] Results: map[SourceCode:1333: static void noinline mntput_no_expire_slowpath(struct mount *mnt)
1334: {
1335: LIST_HEAD(list);
1336: int count;
1337:
1338: VFS_BUG_ON(mnt->mnt_ns);
1339: lock_mount_hash();
1340: /*
1341: * make sure that if __legitimize_mnt() has not seen us grab
1342: * mount_lock, we'll see their refcount increment here.
1343: */
1344: smp_mb();
1345: mnt_add_count(mnt, -1);
1346: count = mnt_get_count(mnt);
1347: if (count != 0) {
1348: WARN_ON(count < 0);
1349: rcu_read_unlock();
1350: unlock_mount_hash();
1351: return;
1352: }
1353: if (unlikely(mnt->mnt.mnt_flags & MNT_DOOMED)) {
1354: rcu_read_unlock();
1355: unlock_mount_hash();
1356: return;
1357: }
1358: mnt->mnt.mnt_flags |= MNT_DOOMED;
1359: rcu_read_unlock();
1360:
1361: mnt_del_instance(mnt);
1362: if (unlikely(!list_empty(&mnt->mnt_expire)))
1363: list_del(&mnt->mnt_expire);
1364:
1365: if (unlikely(!list_empty(&mnt->mnt_mounts))) {
1366: struct mount *p, *tmp;
1367: list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts, mnt_child) {
1368: __umount_mnt(p, &list);
1369: hlist_add_head(&p->mnt_umount, &mnt->mnt_stuck_children);
1370: }
1371: }
1372: unlock_mount_hash();
1373: shrink_dentry_list(&list);
1374:
1375: if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
1376: struct task_struct *task = current;
1377: if (likely(!(task->flags & PF_KTHREAD))) {
1378: init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
1379: if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
1380: return;
1381: }
1382: if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
1383: schedule_delayed_work(&delayed_mntput_work, 1);
1384: return;
1385: }
1386: cleanup_mnt(mnt);
1387: }
SourceFile:fs/namespace.c] |
| 12/2 | 2026/02/06 12:35 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 14888 output: 37 thoughts: 810 Thoughts: **Analyzing a Potential Use-After-Free Vulnerability in the Linux Kernel's Mount Namespace Implementation** Okay, so I'm looking at this kernel report, specifically focusing on a potential UAF. The trace points me to `task_work_add` at line 1379, and it seems to correspond to the "Second to last potentially related work creation" stack. This uses `__cleanup_mnt` as its callback. My initial thought is to trace the execution path. From `__cleanup_mnt`, we go to `cleanup_mnt`. That calls `call_rcu` on `mnt->mnt_rcu`, which matches the "Last potentially related work creation" stack. `delayed_free_vfsmnt` is the RCU callback, which in turn calls `free_vfsmnt`. This eventually leads to `kmem_cache_free`. So, that confirms the freed object was indeed a `struct mount`, and it's being freed via RCU, as expected. Now, the core of the problem. The UAF seems to originate within `clone_mnt`, which is called from `copy_tree`, which in turn is called from `copy_mnt_ns`. `copy_mnt_ns` is the entry point for handling `unshare(CLONE_NEWNS)` calls. It effectively clones the current mount namespace. Specifically, `copy_tree` is called with `ns->root`, which is the root of the namespace being copied. The critical point is that the root of the namespace (`ns->root`) could be freed *while* `copy_mnt_ns` is running. That would cause a UAF. But here's the rub: `copy_mnt_ns` is called with the namespace (`ns`) of the current process (`tsk->nsproxy->mnt_ns`). Since the current process should hold a reference to its own namespace, and the namespace should hold a reference to its root, this seems unlikely. This should prevent the root from being freed out from under the cloning process. I need to understand what's really happening. Let me examine the code flow a bit closer. Specifically, let me go over how `copy_mnt_ns` is invoked. It is called while new namespaces are being created for a task. The key line is `new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);`. Then, at lines 4244-4246, I notice `namespace_lock();`, `dentry = ns->root->mnt.mnt_root;` and `new_mnt = copy_tree(ns->root, dentry, CL_COPY_ALL | CL_EXPIRE);`. I *think* I'm starting to see something here. The code locks the namespace and then reads `ns->root`. The `copy_tree` call can still result in a use-after-free if `ns->root` was changed or freed beforehand. Let's check the definition of `struct mnt_namespace` to see how the root is referenced. |
| 13/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/namespace.c IncludeLines:true Name:mnt_namespace] Results: map[SourceCode: 11: struct mnt_namespace {
12: struct ns_common ns;
13: struct mount * root;
14: struct {
15: struct rb_root mounts; /* Protected by namespace_sem */
16: struct rb_node *mnt_last_node; /* last (rightmost) mount in the rbtree */
17: struct rb_node *mnt_first_node; /* first (leftmost) mount in the rbtree */
18: };
19: struct user_namespace *user_ns;
20: struct ucounts *ucounts;
21: wait_queue_head_t poll;
22: u64 seq_origin; /* Sequence number of origin mount namespace */
23: u64 event;
24: #ifdef CONFIG_FSNOTIFY
25: __u32 n_fsnotify_mask;
26: struct fsnotify_mark_connector __rcu *n_fsnotify_marks;
27: #endif
28: unsigned int nr_mounts; /* # of mounts in the namespace */
29: unsigned int pending_mounts;
30: refcount_t passive; /* number references not pinning @mounts */
31: bool is_anon;
32: } __randomize_layout;
SourceFile:fs/mount.h] |
| 14/2 | 2026/02/06 12:35 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 16819 output: 41 thoughts: 1296 Thoughts: **Analyzing Mount Namespace and Root Mount Lifecycle** Okay, let me walk through this. I'm focusing on `ns->root` in the context of `copy_mnt_ns`, and I'm a bit concerned about potential dangling pointers. First, I see that `ns->root` is used in `copy_mnt_ns`. If `ns->root` is freed *before* this, that's a problem. That implies its refcount hit zero. The question is, does the `mnt_namespace` keep a reference to `root` to prevent this? Let's look at `alloc_mnt_ns`. It initializes `new_ns->root` to `NULL`. Okay, so it starts empty. Then, `create_mnt_ns` is where `ns->root` is actually set. I see `new_ns->root = mnt;` in that function. It *looks* like `mnt` (a `struct mount`) becomes the root. It seems that `mnt` is added to the namespace with `mnt_add_to_ns`. Now, if a namespace is freed, `umount_tree` is called on the root, and that will eventually trigger `mntput`. The potential issue is that `copy_mnt_ns` uses `ns->root`, but if it's freed before the copy, we have a problem. If `ns->root` is freed, it *must* have been unmounted somehow. And since `ns->root` is the root of the namespace, that's a significant unmount, but is it possible? I'm now tracing the relevant code path: `release_task` calls `proc_invalidate_siblings_dcache`, which eventually calls `__detach_mounts`, which is interesting. `__detach_mounts` detaches mounts from a dentry, and this is happening *before* `copy_mnt_ns`. If the dentry being invalidated is the *root* of a namespace, that sounds like it could be the problem! Here's the sequence: `proc_invalidate_siblings_dcache` -> `d_invalidate` -> `detach_mounts` -> `__detach_mounts`. Alright, back to the problem: `ns->root` in `copy_mnt_ns` is a `struct mount`. If that mount gets detached *before* the copy, and its refcount goes to zero, it gets freed. Boom. That would explain everything. In `copy_mnt_ns`, I see the critical line: `dentry = ns->root->mnt.mnt_root;`. This is where `ns->root` is dereferenced. If `ns->root` is freed before this, we have a dangling pointer. But wait, `ns` is derived from `current->nsproxy->mnt_ns`. *That* seems like it should keep things alive. The `mnt_namespace` should prevent the root from going away too early, right? The issue might be that `create_mnt_ns` *doesn't* call `mntget(mnt)` when setting `new_ns->root = mnt;`. I'd expect it to. The `mnt->mnt_ns` field is set to point to the namespace. The mount *is* added to the namespace's tree, via `mnt_add_to_ns`. Usually the namespace "pins" the mounts inside it, but do the mounts pin the namespace through that pointer? Actually, the namespace doesn't necessarily pin the mounts; each mount only contains a pointer to the namespace. Aha! `struct mount` *does* have a `struct mnt_namespace *mnt_ns`. When that pointer is set, does that imply the namespace is taking a reference? No, it doesn't. Okay, I'm getting closer. I need to dig deeper into the refcounting around the mount and namespace to understand the life cycle, but my gut tells me this could be the source of the problem. If `__detach_mounts` unmounts the root mount *before* the namespace copy, then the `ns->root` pointer is invalid. I need to be sure the namespace has a reference. |
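The hypothesis in the thought above (a namespace that stores its root as a plain pointer without taking a reference) can be illustrated with a toy userspace model. This is only a sketch of the suspected pattern, not kernel code: every `toy_` name is hypothetical, and the "free" is represented by a flag so the example itself never touches released memory. Whether the real `mnt_namespace` behaves this way is exactly what the trace is still trying to establish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mount: a refcount plus a "was freed" marker. */
struct toy_mount {
    int  refcount;
    bool freed;   /* set instead of calling free(), so the sketch has no real UAF */
};

/* Toy stand-in for struct mnt_namespace: root is a plain pointer. */
struct toy_ns {
    struct toy_mount *root;
};

static void toy_mntget(struct toy_mount *m)
{
    m->refcount++;
}

static void toy_mntput(struct toy_mount *m)
{
    assert(m->refcount > 0);
    if (--m->refcount == 0)
        m->freed = true;   /* the real kernel would defer the free via RCU */
}

/* Suspected pattern: store the pointer without taking a reference. */
static void toy_set_root_unpinned(struct toy_ns *ns, struct toy_mount *m)
{
    ns->root = m;
}

/* Contrast: the namespace takes its own reference on the root. */
static void toy_set_root_pinned(struct toy_ns *ns, struct toy_mount *m)
{
    toy_mntget(m);
    ns->root = m;
}

/* True when ns->root still points at a mount whose last reference is gone. */
static bool toy_root_is_dangling(const struct toy_ns *ns)
{
    return ns->root != NULL && ns->root->freed;
}
```

With `toy_set_root_unpinned`, a single `toy_mntput` by whoever held the last reference (the detach path, in the trace's theory) leaves `ns->root` dangling; with `toy_set_root_pinned`, the same `toy_mntput` leaves the mount alive because the namespace's own reference remains.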
| 15/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/namespace.c IncludeLines:true Name:mnt_add_to_ns] Results: map[SourceCode:1075: static void mnt_add_to_ns(struct mnt_namespace *ns, struct mount *mnt)
1076: {
1077: struct rb_node **link = &ns->mounts.rb_node;
1078: struct rb_node *parent = NULL;
1079: bool mnt_first_node = true, mnt_last_node = true;
1080:
1081: WARN_ON(mnt_ns_attached(mnt));
1082: mnt->mnt_ns = ns;
1083: while (*link) {
1084: parent = *link;
1085: if (mnt->mnt_id_unique < node_to_mount(parent)->mnt_id_unique) {
1086: link = &parent->rb_left;
1087: mnt_last_node = false;
1088: } else {
1089: link = &parent->rb_right;
1090: mnt_first_node = false;
1091: }
1092: }
1093:
1094: if (mnt_last_node)
1095: ns->mnt_last_node = &mnt->mnt_node;
1096: if (mnt_first_node)
1097: ns->mnt_first_node = &mnt->mnt_node;
1098: rb_link_node(&mnt->mnt_node, parent, link);
1099: rb_insert_color(&mnt->mnt_node, &ns->mounts);
1100:
1101: mnt_notify_add(mnt);
1102: }
SourceFile:fs/namespace.c] |
| 16/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 19543 output: 44 thoughts: 293 Thoughts: **My Analysis of a Potential Mount Namespace Issue** Okay, so I'm looking at this code and something feels off. It appears that a reference isn't being held where I'd expect it. My immediate concern is this: if `ns->root` isn't pinned by the namespace itself, *who* is keeping it alive? Traditionally, the root mount of a namespace is inherently pinned as long as the namespace is active. But if there's a mechanism to unmount that root... that's a problem. I'm thinking about `copy_mnt_ns` now. We're within the `namespace_lock()`, so that provides *some* protection. However, a nasty race condition comes to mind: if `ns->root` were freed *before* we entered `copy_mnt_ns`, we're in deep trouble. But, realistically, `ns->root` should only be freed during namespace teardown or if the root is explicitly unmounted, right? That should be a controlled process. Wait, here's a potential clue. I'm looking at a stack trace, specifically the "Second to last potentially related work creation" and it shows a sequence: `release_task` -> `proc_invalidate_siblings_dcache` -> `d_invalidate` -> `detach_mounts`. This is happening during `release_task`, which is when a process is being reaped. And `proc_invalidate_siblings_dcache`… that sounds awfully related to those `/proc/<pid>/` dentries, doesn't it? The implication here is that if a process has a mount namespace, and that namespace is *shared* by another process... and that other process is doing a `unshare`, as the report of `syz.1.1921` details... things might be getting hairy. It makes sense, the unshare would result in an invalidation of the dentries related to the dying process, which then would try to detach the mounts. Time to dive deeper into `proc_invalidate_siblings_dcache` and see what it's really doing, specifically in the context of shared mount namespaces and dying processes. 
This feels like a good starting point to track down the root cause. |
| 17/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/proc/inode.c IncludeLines:true Name:proc_invalidate_siblings_dcache] Results: map[SourceCode: 108: void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
109: {
110: struct hlist_node *node;
111: struct super_block *old_sb = NULL;
112:
113: rcu_read_lock();
114: while ((node = hlist_first_rcu(inodes))) {
115: struct proc_inode *ei = hlist_entry(node, struct proc_inode, sibling_inodes);
116: struct super_block *sb;
117: struct inode *inode;
118:
119: spin_lock(lock);
120: hlist_del_init_rcu(&ei->sibling_inodes);
121: spin_unlock(lock);
122:
123: inode = &ei->vfs_inode;
124: sb = inode->i_sb;
125: if ((sb != old_sb) && !atomic_inc_not_zero(&sb->s_active))
126: continue;
127: inode = igrab(inode);
128: rcu_read_unlock();
129: if (sb != old_sb) {
130: if (old_sb)
131: deactivate_super(old_sb);
132: old_sb = sb;
133: }
134: if (unlikely(!inode)) {
135: rcu_read_lock();
136: continue;
137: }
138:
139: if (S_ISDIR(inode->i_mode)) {
140: struct dentry *dir = d_find_any_alias(inode);
141: if (dir) {
142: d_invalidate(dir);
143: dput(dir);
144: }
145: } else {
146: struct dentry *dentry;
147: while ((dentry = d_find_alias(inode))) {
148: d_invalidate(dentry);
149: dput(dentry);
150: }
151: }
152: iput(inode);
153:
154: rcu_read_lock();
155: }
156: rcu_read_unlock();
157: if (old_sb)
158: deactivate_super(old_sb);
159: }
SourceFile:fs/proc/inode.c] |
| 18/2 | 2026/02/06 12:35 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 21108 output: 39 thoughts: 203 Thoughts: **Analyzing Dentry Invalidation During Process Exit** Right, so I'm looking at `d_invalidate`, which is called as part of the process exit sequence. My understanding is that it cleans up dentries associated with a process that's shutting down. The crucial point here is that if any of those dentries happen to be mount points, `d_invalidate` calls `detach_mounts`. Now I need to see what `detach_mounts` actually *does*. It should be in `fs/mount.h` or `fs/namespace.c`, and it looks like `detach_mounts` is a wrapper around `__detach_mounts`. According to the free stack, `__detach_mounts` ends in `namespace_unlock()`. It *unlocks* the namespace? That implies it was *locked* beforehand, so I'm going to bet it takes `namespace_lock()` first. Let me check the implementation of `__detach_mounts` to confirm that assumption. |
| 19/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/namespace.c IncludeLines:true Name:__detach_mounts] Results: map[SourceCode:1956: /*
1957: * __detach_mounts - lazily unmount all mounts on the specified dentry
1958: *
1959: * During unlink, rmdir, and d_drop it is possible to loose the path
1960: * to an existing mountpoint, and wind up leaking the mount.
1961: * detach_mounts allows lazily unmounting those mounts instead of
1962: * leaking them.
1963: *
1964: * The caller may hold dentry->d_inode->i_rwsem.
1965: */
1966: void __detach_mounts(struct dentry *dentry)
1967: {
1968: struct pinned_mountpoint mp = {};
1969: struct mount *mnt;
1970:
1971: guard(namespace_excl)();
1972: guard(mount_writer)();
1973:
1974: if (!lookup_mountpoint(dentry, &mp))
1975: return;
1976:
1977: event++;
1978: while (mp.node.next) {
1979: mnt = hlist_entry(mp.node.next, struct mount, mnt_mp_list);
1980: if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
1981: umount_mnt(mnt);
1982: hlist_add_head(&mnt->mnt_umount, &unmounted);
1983: }
1984: else umount_tree(mnt, UMOUNT_CONNECTED);
1985: }
1986: unpin_mountpoint(&mp);
1987: }
SourceFile:fs/namespace.c] |
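The `guard(namespace_excl)()` line in `__detach_mounts` explains why `namespace_unlock` shows up in the free stack even though the function body never calls it: the kernel's `guard()` helpers are built on the compiler's `cleanup` variable attribute, so the unlock runs when the guard variable goes out of scope. A minimal userspace sketch of that mechanism follows, assuming GCC or Clang; all `toy_`/`TOY_` names are hypothetical and the counters stand in for a real lock.

```c
#include <assert.h>

static int toy_lock_depth;    /* >0 while the toy lock is held */
static int toy_unlock_calls;  /* how many times the scope-exit handler ran */

struct toy_guard {
    int held;
};

/* Constructor: take the toy lock and hand back a guard object. */
static struct toy_guard toy_guard_enter(void)
{
    toy_lock_depth++;
    return (struct toy_guard){ .held = 1 };
}

/* Destructor: invoked automatically when the guard variable leaves scope. */
static void toy_guard_exit(struct toy_guard *g)
{
    if (g->held) {
        assert(toy_lock_depth > 0);
        toy_lock_depth--;
        toy_unlock_calls++;
    }
}

/* Declares a scoped guard, loosely modelled on the kernel's guard() usage. */
#define TOY_GUARD() \
    struct toy_guard toy_guard_obj \
        __attribute__((cleanup(toy_guard_exit))) = toy_guard_enter()

/*
 * Shaped like __detach_mounts: take the guard, bail out early if there is
 * nothing to do, otherwise work under the lock. Both paths unlock on return.
 */
static int toy_detach(int found_mountpoint)
{
    TOY_GUARD();

    if (!found_mountpoint)
        return 0;             /* early return still runs toy_guard_exit */
    return toy_lock_depth;    /* evaluated while the lock is still held */
}
```

Both the early-return path and the normal path release the toy lock, which mirrors how any work performed in the real unlock path is attributed to the function that declared the guard.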
| 20/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 22103 output: 23 thoughts: 10037 Thoughts: **UAF in Mount Namespace Creation** Alright, I think I've got a handle on this. The core issue revolves around a Use-After-Free (UAF) vulnerability during the creation of a new mount namespace, stemming from the premature freeing of the root mount. Here's my thought process: The stack trace points to `namespace_unlock` and, specifically, the destructor of `guard(namespace_excl)()`. This hints at a race condition or incorrect locking around namespace operations. The critical detail is that a dentry being invalidated, specifically under circumstances involving proc files, is key to the mount being freed. If a mount namespace is kept alive solely by a bind mount of `/proc/<pid>/ns/mnt`, when the process `<pid>` exits, the `proc_invalidate_siblings_dcache` function is called. This call triggers `d_invalidate` on the dentries in `/proc/<pid>/`, including the `ns/mnt` dentry. `d_invalidate` then calls `detach_mounts`. If there's a mount *on top* of `/proc/<pid>/ns/mnt`, that mount is detached. Here's where things get interesting: if a mount is detached, its `mnt_ns` field is set to NULL, and `mntput` is called on it. So, if the reference count goes to zero, the mount is freed via RCU. The problem is that the mount's parent mount will still have a pointer to this now freed `struct mount` object. The syzkaller report's KASAN output confirms a read from freed memory. `clone_mnt` is the one accessing the freed memory. The important part is that the freed object is *ns->root*. The allocation stack reveals that the freed object is a mount, and it was freed via RCU. It looks like `mntput_no_expire_slowpath` is called, which then schedules `delayed_free_vfsmnt`, which does the RCU free. Now, `copy_mnt_ns` is where the UAF occurs. It's called during `unshare(CLONE_NEWNS)`. `copy_mnt_ns` creates a copy of the current mount namespace. 
The important operation in it is `new_mnt = copy_tree(ns->root, ...)`. If the *old* `ns->root` has already been freed, this is a clear UAF. The allocation stack shows it's a new mount. The free stack says it's released via RCU. The "Second to last potentially related work creation" indicates that this was caused by `proc_invalidate_siblings_dcache`, triggered by a task exiting and invalidating a dentry with a bind mount. This bind mount's unmount causes the root mount to be freed, but the namespace still has a pointer to the now freed memory. Here's how I think it breaks down: 1. Task A exits, which triggers `proc_invalidate_siblings_dcache`, resulting in `__detach_mounts` being called. 2. `__detach_mounts` calls `umount_tree` on a mount (let's say `M`) if it is attached to some invalidated dentry. 3. `umount_tree` sets `M->mnt_ns = NULL` and calls `mntput`. 4. If `M`'s refcount drops to 0, it is scheduled to be freed. 5. Meanwhile, Task B (syzkaller) calls `unshare(CLONE_NEWNS)`. 6. `unshare` calls `copy_mnt_ns`, which calls `copy_tree` that accesses `ns->root`. 7. If Task A's action resulted in freeing *ns->root* then the UAF happens. The key is that the root of the mount namespace can be mounted *on* something in another namespace. The mount will be detached, and the refcount might drop to 0, triggering the free. The solution is to find a way to prevent `ns->root` from being freed prematurely. This can be achieved by making sure the namespace owns a proper reference. The problem lies in cases where `ns->root` is the *only* thing keeping the root mount alive. If you trigger `umount -l /` in this scenario, then the reference is removed. When the root is unmounted, its `mnt_ns` pointer is nulled, and if no other references exist, it is freed. But `ns->root` continues to point to it, which causes a UAF. |
| 21/2 | 2026/02/06 12:35 | tool | codesearch-find-references |
Error:requested entity does not exist Args: map[Name:create_mnt_ns] Results: map[References:<nil> TruncatedOutput:false] |
| 22/2 | 2026/02/06 12:35 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 33173 output: 19 thoughts: 46 Thoughts: **Debugging a Potential Namespace Issue** Hmm, something's not quite right. Could be a simple typo, or perhaps the function is static and the index does not expose it. I'll double-check for errors, but I need a more reliable approach. Wait a second… I believe I spotted something related to this in `fs/namespace.c`. I'm pretty sure there's a function there related to namespace creation… yes! It's `create_mnt_ns`. Now, let me use `grep` to confirm my suspicion, locate the function, and see how it is being called. That should give me the answer. |
| 23/2 | 2026/02/06 12:35 | tool | grepper |
Error:no matches Args: map[Expression:create_mnt_ns] Results: map[Output:] |
| 24/2 | 2026/02/06 12:35 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 33426 output: 24 thoughts: 26 Thoughts: **Debugging a Code Discrepancy** Hmm, that's peculiar. I could swear I saw that particular function somewhere earlier in the code. Let me double-check `fs/namespace.c`, the likely home for the namespace logic, and make sure I am not overlooking something; perhaps I misread it earlier. I'll list the entities defined in that file and look for the function that sets the namespace root. |
| 25/2 | 2026/02/06 12:35 | tool | codesearch-file-index |
0mArgs:map[SourceFile:fs/namespace.c] Results: map[Entities:[map[Kind:function Name:__cleanup_mnt] map[Kind:function Name:__detach_mounts] map[Kind:function Name:__do_loopback] map[Kind:function Name:__do_sys_fsmount] map[Kind:function Name:__do_sys_listmount] map[Kind:function Name:__do_sys_mount] map[Kind:function Name:__do_sys_mount_setattr] map[Kind:function Name:__do_sys_move_mount] map[Kind:function Name:__do_sys_oldumount] map[Kind:function Name:__do_sys_open_tree] map[Kind:function Name:__do_sys_open_tree_attr] map[Kind:function Name:__do_sys_pivot_root] map[Kind:function Name:__do_sys_statmount] map[Kind:function Name:__do_sys_umount] map[Kind:function Name:__free_klistmount_free] map[Kind:function Name:__free_mnt_ns_release] map[Kind:function Name:__free_mntput] map[Kind:function Name:__free_put_empty_mnt_ns] map[Kind:function Name:__has_locked_children] map[Kind:function Name:__ia32_sys_fsmount] map[Kind:function Name:__ia32_sys_listmount] map[Kind:function Name:__ia32_sys_mount] map[Kind:function Name:__ia32_sys_mount_setattr] map[Kind:function Name:__ia32_sys_move_mount] map[Kind:function Name:__ia32_sys_oldumount] map[Kind:function Name:__ia32_sys_open_tree] map[Kind:function Name:__ia32_sys_open_tree_attr] map[Kind:function Name:__ia32_sys_pivot_root] map[Kind:function Name:__ia32_sys_statmount] map[Kind:function Name:__ia32_sys_umount] map[Kind:function Name:__is_local_mountpoint] map[Kind:function Name:__legitimize_mnt] map[Kind:function Name:__lookup_mnt] map[Kind:function Name:__mnt_is_readonly] map[Kind:function Name:__se_sys_fsmount] map[Kind:function Name:__se_sys_listmount] map[Kind:function Name:__se_sys_mount] map[Kind:function Name:__se_sys_mount_setattr] map[Kind:function Name:__se_sys_move_mount] map[Kind:function Name:__se_sys_oldumount] map[Kind:function Name:__se_sys_open_tree] map[Kind:function Name:__se_sys_open_tree_attr] map[Kind:function Name:__se_sys_pivot_root] map[Kind:function Name:__se_sys_statmount] map[Kind:function 
Name:__se_sys_umount] map[Kind:function Name:__touch_mnt_namespace] map[Kind:function Name:__umount_mnt] map[Kind:function Name:__unlock_mount] map[Kind:function Name:__x64_sys_fsmount] map[Kind:function Name:__x64_sys_listmount] map[Kind:function Name:__x64_sys_mount] map[Kind:function Name:__x64_sys_mount_setattr] map[Kind:function Name:__x64_sys_move_mount] map[Kind:function Name:__x64_sys_oldumount] map[Kind:function Name:__x64_sys_open_tree] map[Kind:function Name:__x64_sys_open_tree_attr] map[Kind:function Name:__x64_sys_pivot_root] map[Kind:function Name:__x64_sys_statmount] map[Kind:function Name:__x64_sys_umount] map[Kind:function Name:alloc_mnt_ns] map[Kind:function Name:alloc_vfsmnt] map[Kind:function Name:attach_mnt] map[Kind:function Name:attach_recursive_mnt] map[Kind:function Name:attr_flags_to_mnt_flags] map[Kind:function Name:build_mount_idmapped] map[Kind:function Name:build_mount_kattr] map[Kind:function Name:can_change_locked_flags] map[Kind:function Name:can_idmap_mount] map[Kind:function Name:can_move_mount_beneath] map[Kind:function Name:can_umount] map[Kind:function Name:change_mount_ro_state] map[Kind:function Name:check_anonymous_mnt] map[Kind:function Name:check_for_nsfs_mounts] map[Kind:function Name:check_mnt] map[Kind:function Name:class_namespace_excl_constructor] map[Kind:function Name:class_namespace_excl_destructor] map[Kind:function Name:class_namespace_excl_lock_err] map[Kind:function Name:class_namespace_excl_lock_ptr] map[Kind:function Name:class_namespace_shared_constructor] map[Kind:function Name:class_namespace_shared_destructor] map[Kind:function Name:class_namespace_shared_lock_err] map[Kind:function Name:class_namespace_shared_lock_ptr] map[Kind:function Name:cleanup_group_ids] map[Kind:function Name:cleanup_mnt] map[Kind:function Name:clone_mnt] map[Kind:function Name:clone_private_mount] map[Kind:function Name:collect_paths] map[Kind:function Name:commit_tree] map[Kind:function Name:copy_mnt_id_req] map[Kind:function 
Name:copy_mnt_ns] map[Kind:function Name:copy_mount_options] map[Kind:function Name:copy_mount_string] map[Kind:function Name:copy_statmount_to_user] map[Kind:function Name:copy_tree] map[Kind:function Name:count_mounts] map[Kind:function Name:create_new_namespace] map[Kind:function Name:current_chrooted] map[Kind:function Name:dec_mnt_namespaces] map[Kind:function Name:delayed_free_vfsmnt] map[Kind:function Name:delayed_mntput] map[Kind:function Name:disconnect_mount] map[Kind:function Name:dissolve_on_fput] map[Kind:function Name:do_add_mount] map[Kind:function Name:do_change_type] map[Kind:function Name:do_idmap_mount] map[Kind:function Name:do_listmount] map[Kind:function Name:do_lock_mount] map[Kind:function Name:do_loopback] map[Kind:function Name:do_mount] map[Kind:function Name:do_mount_setattr] map[Kind:function Name:do_move_mount] map[Kind:function Name:do_move_mount_old] map[Kind:function Name:do_new_mount] map[Kind:function Name:do_new_mount_fc] map[Kind:function Name:do_reconfigure_mnt] map[Kind:function Name:do_remount] map[Kind:function Name:do_set_group] map[Kind:function Name:do_statmount] map[Kind:function Name:do_umount] map[Kind:function Name:do_umount_root] map[Kind:function Name:drop_collected_paths] map[Kind:function Name:extend_array] map[Kind:function Name:fc_mount] map[Kind:function Name:fc_mount_longterm] map[Kind:function Name:finish_automount] map[Kind:function Name:finish_mount_kattr] map[Kind:function Name:flags_to_propagation_type] map[Kind:function Name:free_mnt_ns] map[Kind:function Name:free_vfsmnt] map[Kind:function Name:from_mnt_ns] map[Kind:function Name:get_detached_copy] map[Kind:function Name:get_mountpoint] map[Kind:function Name:get_sequential_mnt_ns] map[Kind:function Name:grab_requested_mnt_ns] map[Kind:function Name:grab_requested_root] map[Kind:function Name:graft_tree] map[Kind:function Name:has_locked_children] map[Kind:function Name:inc_mnt_namespaces] map[Kind:function Name:init_fs_namespace_sysctls] 
map[Kind:function Name:init_mount_tree] map[Kind:function Name:initramfs_options_setup] map[Kind:function Name:invent_group_ids] map[Kind:function Name:is_mnt_ns_file] map[Kind:function Name:is_path_reachable] map[Kind:function Name:kern_mount] map[Kind:function Name:kern_unmount] map[Kind:function Name:kern_unmount_array] map[Kind:function Name:ksys_umount] map[Kind:function Name:legitimize_mnt] map[Kind:function Name:listmnt_next] map[Kind:function Name:lock_mnt_tree] map[Kind:function Name:lock_mount_exact] map[Kind:function Name:lock_mount_hash] map[Kind:function Name:lookup_mnt] map[Kind:function Name:lookup_mnt_in_ns] map[Kind:function Name:lookup_mnt_ns] map[Kind:function Name:lookup_mountpoint] map[Kind:function Name:m_hash] map[Kind:function Name:m_next] map[Kind:function Name:m_show] map[Kind:function Name:m_start] map[Kind:function Name:m_stop] map[Kind:function Name:make_visible] map[Kind:function Name:mark_mounts_for_expiry] map[Kind:function Name:may_change_propagation] map[Kind:function Name:may_copy_tree] map[Kind:function Name:may_mount] map[Kind:function Name:may_umount] map[Kind:function Name:may_umount_tree] map[Kind:function Name:may_use_mount] map[Kind:function Name:maybe_free_mountpoint] map[Kind:function Name:mnt_add_count] map[Kind:function Name:mnt_add_instance] map[Kind:function Name:mnt_add_to_ns] map[Kind:function Name:mnt_alloc_group_id] map[Kind:function Name:mnt_alloc_id] map[Kind:function Name:mnt_allow_writers] map[Kind:function Name:mnt_already_visible] map[Kind:function Name:mnt_change_mountpoint] map[Kind:function Name:mnt_clone_internal] map[Kind:function Name:mnt_dec_writers] map[Kind:function Name:mnt_del_instance] map[Kind:function Name:mnt_drop_write] map[Kind:function Name:mnt_drop_write_file] map[Kind:function Name:mnt_find_id_at] map[Kind:function Name:mnt_find_id_at_reverse] map[Kind:function Name:mnt_free_id] map[Kind:function Name:mnt_get_count] map[Kind:function Name:mnt_get_write_access] map[Kind:function 
Name:mnt_get_write_access_file] map[Kind:function Name:mnt_get_writers] map[Kind:function Name:mnt_hold_writers] map[Kind:function Name:mnt_inc_writers] map[Kind:function Name:mnt_init] map[Kind:function Name:mnt_is_readonly] map[Kind:function Name:mnt_make_readonly] map[Kind:function Name:mnt_make_shortterm] map[Kind:function Name:mnt_may_suid] map[Kind:function Name:mnt_notify] map[Kind:function Name:mnt_ns_from_dentry] map[Kind:function Name:mnt_ns_loop] map[Kind:function Name:mnt_ns_release] map[Kind:function Name:mnt_ns_release_rcu] map[Kind:function Name:mnt_ns_tree_remove] map[Kind:function Name:mnt_put_write_access] map[Kind:function Name:mnt_put_write_access_file] map[Kind:function Name:mnt_release_group_id] map[Kind:function Name:mnt_set_expiry] map[Kind:function Name:mnt_set_mountpoint] map[Kind:function Name:mnt_to_attr_flags] map[Kind:function Name:mnt_to_propagation_flags] map[Kind:function Name:mnt_unhold_writers] map[Kind:function Name:mnt_want_write] map[Kind:function Name:mnt_want_write_file] map[Kind:function Name:mnt_warn_timestamp_expiry] map[Kind:function Name:mntget] map[Kind:function Name:mntns_get] map[Kind:function Name:mntns_install] map[Kind:function Name:mntns_owner] map[Kind:function Name:mntns_put] map[Kind:function Name:mntput] map[Kind:function Name:mntput_no_expire] map[Kind:function Name:mntput_no_expire_slowpath] map[Kind:function Name:mount_is_ancestor] map[Kind:function Name:mount_setattr_commit] map[Kind:function Name:mount_setattr_prepare] map[Kind:function Name:mount_subtree] map[Kind:function Name:mount_too_revealing] map[Kind:function Name:mp_hash] map[Kind:function Name:namespace_lock] map[Kind:function Name:namespace_unlock] map[Kind:function Name:need_notify_mnt_list] map[Kind:function Name:next_mnt] map[Kind:function Name:node_to_mount] map[Kind:function Name:notify_mnt_list] map[Kind:function Name:open_detached_copy] map[Kind:function Name:open_new_namespace] map[Kind:function Name:our_mnt] map[Kind:function 
Name:path_is_mountpoint] map[Kind:function Name:path_is_under] map[Kind:function Name:path_mount] map[Kind:function Name:path_overmounted] map[Kind:function Name:path_pivot_root] map[Kind:function Name:path_umount] map[Kind:function Name:prepare_klistmount] map[Kind:function Name:prepare_kstatmount] map[Kind:function Name:put_mnt_ns] map[Kind:function Name:recalc_flags] map[Kind:function Name:retry_statmount] map[Kind:function Name:sb_prepare_remount_readonly] map[Kind:function Name:select_submounts] map[Kind:function Name:set_mhash_entries] map[Kind:function Name:set_mount_attributes] map[Kind:function Name:set_mphash_entries] map[Kind:function Name:setup_mnt] map[Kind:function Name:show_path] map[Kind:function Name:shrink_submounts] map[Kind:function Name:skip_mnt_tree] map[Kind:function Name:statmount_fs_subtype] map[Kind:function Name:statmount_fs_type] map[Kind:function Name:statmount_mnt_basic] map[Kind:function Name:statmount_mnt_gidmap] map[Kind:function Name:statmount_mnt_ns_id] map[Kind:function Name:statmount_mnt_opts] map[Kind:function Name:statmount_mnt_point] map[Kind:function Name:statmount_mnt_root] map[Kind:function Name:statmount_mnt_uidmap] map[Kind:function Name:statmount_opt_array] map[Kind:function Name:statmount_opt_process] map[Kind:function Name:statmount_opt_sec_array] map[Kind:function Name:statmount_propagate_from] map[Kind:function Name:statmount_sb_basic] map[Kind:function Name:statmount_sb_source] map[Kind:function Name:statmount_string] map[Kind:function Name:touch_mnt_namespace] map[Kind:function Name:tree_contains_unbindable] map[Kind:function Name:umount_mnt] map[Kind:function Name:umount_tree] map[Kind:function Name:unlock_mount] map[Kind:function Name:unlock_mount_hash] map[Kind:function Name:unpin_mountpoint] map[Kind:function Name:vfs_create_mount] map[Kind:function Name:vfs_kern_mount] map[Kind:function Name:vfs_move_mount] map[Kind:function Name:vfs_open_tree] map[Kind:function Name:vfsmount_to_propagation_flags] 
map[Kind:function Name:wants_mount_setattr] map[Kind:function Name:warn_mandlock] map[Kind:function Name:where_to_mount] map[Kind:struct Name:klistmount] map[Kind:struct Name:kstatmount] map[Kind:struct Name:mount_kattr] map[Kind:struct Name:pinned_mountpoint] map[Kind:enum Name:mnt_tree_flags_t] map[Kind:enum Name:mount_kattr_flags_t] map[Kind:enum Name:umount_tree_flags] map[Kind:typedef Name:class_namespace_excl_t] map[Kind:typedef Name:class_namespace_shared_t] map[Kind:typedef Name:lock_namespace_excl_t] map[Kind:typedef Name:lock_namespace_shared_t]]] |
| 26/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 38764 output: 39 thoughts: 46 Thoughts: Okay, here's my interpretation of that thought process, tailored for an expert audience: **Examining Namespace Creation Functions** Hmm, I'm reviewing this code and I don't see `create_mnt_ns` in the function list. That's unexpected. Wait a minute… Ah, yes, of course! It looks like the naming convention is slightly different here. I see `open_new_namespace` and `create_new_namespace` listed instead. Right, that makes sense given the potential for different namespace initialization strategies. I'll need to confirm that `create_new_namespace` is the functional equivalent or, at least, the crucial component in this context. Let me dive into the implementation of `create_new_namespace` and ensure its behavior aligns with my expectations. |
| 27/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/namespace.c IncludeLines:true Name:create_new_namespace] Results: map[SourceCode:3079: static struct mnt_namespace *create_new_namespace(struct path *path, unsigned int flags)
3080: {
3081: struct mnt_namespace *new_ns __free(put_empty_mnt_ns) = NULL;
3082: struct path to_path __free(path_put) = {};
3083: struct mnt_namespace *ns = current->nsproxy->mnt_ns;
3084: struct user_namespace *user_ns = current_user_ns();
3085: struct mount *new_ns_root;
3086: struct mount *mnt;
3087: unsigned int copy_flags = 0;
3088: bool locked = false;
3089:
3090: if (user_ns != ns->user_ns)
3091: copy_flags |= CL_SLAVE;
3092:
3093: new_ns = alloc_mnt_ns(user_ns, false);
3094: if (IS_ERR(new_ns))
3095: return ERR_CAST(new_ns);
3096:
3097: scoped_guard(namespace_excl) {
3098: new_ns_root = clone_mnt(ns->root, ns->root->mnt.mnt_root, copy_flags);
3099: if (IS_ERR(new_ns_root))
3100: return ERR_CAST(new_ns_root);
3101:
3102: /*
3103: * If the real rootfs had a locked mount on top of it somewhere
3104: * in the stack, lock the new mount tree as well so it can't be
3105: * exposed.
3106: */
3107: mnt = ns->root;
3108: while (mnt->overmount) {
3109: mnt = mnt->overmount;
3110: if (mnt->mnt.mnt_flags & MNT_LOCKED)
3111: locked = true;
3112: }
3113: }
3114:
3115: /*
3116: * We dropped the namespace semaphore so we can actually lock
3117: * the copy for mounting. The copied mount isn't attached to any
3118: * mount namespace and it is thus excluded from any propagation.
3119: * So realistically we're isolated and the mount can't be
3120: * overmounted.
3121: */
3122:
3123: /* Borrow the reference from clone_mnt(). */
3124: to_path.mnt = &new_ns_root->mnt;
3125: to_path.dentry = dget(new_ns_root->mnt.mnt_root);
3126:
3127: /* Now lock for actual mounting. */
3128: LOCK_MOUNT_EXACT(mp, &to_path);
3129: if (unlikely(IS_ERR(mp.parent)))
3130: return ERR_CAST(mp.parent);
3131:
3132: /*
3133: * We don't emulate unshare()ing a mount namespace. We stick to the
3134: * restrictions of creating detached bind-mounts. It has a lot
3135: * saner and simpler semantics.
3136: */
3137: mnt = __do_loopback(path, flags, copy_flags);
3138: if (IS_ERR(mnt))
3139: return ERR_CAST(mnt);
3140:
3141: scoped_guard(mount_writer) {
3142: if (locked)
3143: mnt->mnt.mnt_flags |= MNT_LOCKED;
3144: /*
3145: * Now mount the detached tree on top of the copy of the
3146: * real rootfs we created.
3147: */
3148: attach_mnt(mnt, new_ns_root, mp.mp);
3149: if (user_ns != ns->user_ns)
3150: lock_mnt_tree(new_ns_root);
3151: }
3152:
3153: /* Add all mounts to the new namespace. */
3154: for (struct mount *p = new_ns_root; p; p = next_mnt(p, new_ns_root)) {
3155: mnt_add_to_ns(new_ns, p);
3156: new_ns->nr_mounts++;
3157: }
3158:
3159: new_ns->root = real_mount(no_free_ptr(to_path.mnt));
3160: ns_tree_add_raw(new_ns);
3161: return no_free_ptr(new_ns);
3162: }
SourceFile:fs/namespace.c] |
| 28/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 40375 output: 39 thoughts: 113 Thoughts: **Analyzing a Potential Namespace Bug** Okay, hold on a second. Line 3159... there it is: `new_ns->root = real_mount(no_free_ptr(to_path.mnt));`. I recognize that. It's used within `open_tree` with the `OPEN_TREE_NEW_NS` flag. But... the syzkaller report is specifically about `unshare`. And `unshare`, as I recall, calls `copy_mnt_ns`. So, there's a disconnect. I need to be sure. Let's trace that. Time to go back and thoroughly re-examine the `copy_mnt_ns` function. I need to be sure I am not missing something that links `open_tree` and `unshare`. I need the full picture of `copy_mnt_ns`'s behavior to understand how this is happening. Let's see what is really going on. |
| 29/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/namespace.c IncludeLines:true Name:copy_mnt_ns] Results: map[SourceCode:4217: struct mnt_namespace *copy_mnt_ns(u64 flags, struct mnt_namespace *ns,
4218: struct user_namespace *user_ns, struct fs_struct *new_fs)
4219: {
4220: struct mnt_namespace *new_ns;
4221: struct vfsmount *rootmnt __free(mntput) = NULL;
4222: struct vfsmount *pwdmnt __free(mntput) = NULL;
4223: struct mount *p, *q;
4224: struct mount *old;
4225: struct mount *new;
4226: int copy_flags;
4227:
4228: BUG_ON(!ns);
4229:
4230: if (likely(!(flags & CLONE_NEWNS))) {
4231: get_mnt_ns(ns);
4232: return ns;
4233: }
4234:
4235: old = ns->root;
4236:
4237: new_ns = alloc_mnt_ns(user_ns, false);
4238: if (IS_ERR(new_ns))
4239: return new_ns;
4240:
4241: guard(namespace_excl)();
4242: /* First pass: copy the tree topology */
4243: copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE;
4244: if (user_ns != ns->user_ns)
4245: copy_flags |= CL_SLAVE;
4246: new = copy_tree(old, old->mnt.mnt_root, copy_flags);
4247: if (IS_ERR(new)) {
4248: emptied_ns = new_ns;
4249: return ERR_CAST(new);
4250: }
4251: if (user_ns != ns->user_ns) {
4252: guard(mount_writer)();
4253: lock_mnt_tree(new);
4254: }
4255: new_ns->root = new;
4256:
4257: /*
4258: * Second pass: switch the tsk->fs->* elements and mark new vfsmounts
4259: * as belonging to new namespace. We have already acquired a private
4260: * fs_struct, so tsk->fs->lock is not needed.
4261: */
4262: p = old;
4263: q = new;
4264: while (p) {
4265: mnt_add_to_ns(new_ns, q);
4266: new_ns->nr_mounts++;
4267: if (new_fs) {
4268: if (&p->mnt == new_fs->root.mnt) {
4269: new_fs->root.mnt = mntget(&q->mnt);
4270: rootmnt = &p->mnt;
4271: }
4272: if (&p->mnt == new_fs->pwd.mnt) {
4273: new_fs->pwd.mnt = mntget(&q->mnt);
4274: pwdmnt = &p->mnt;
4275: }
4276: }
4277: p = next_mnt(p, old);
4278: q = next_mnt(q, new);
4279: if (!q)
4280: break;
4281: // an mntns binding we'd skipped?
4282: while (p->mnt.mnt_root != q->mnt.mnt_root)
4283: p = next_mnt(skip_mnt_tree(p), old);
4284: }
4285: ns_tree_add_raw(new_ns);
4286: return new_ns;
4287: }
SourceFile:fs/namespace.c] |
| 30/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 41837 output: 39 thoughts: 1123 Thoughts:

**Analyzing a Potential Use-After-Free Vulnerability in Namespace Management**

Okay, so I'm looking at this code, and the first thing that jumps out is this potential use-after-free scenario. Line 4235, `old = ns->root;`, and then on line 4246, `new = copy_tree(old, old->mnt.mnt_root, copy_flags);`. The issue is, if `old` gets freed between those lines, we have a classic UAF.

The big question is, how can `old`, which is `ns->root`, be freed? `ns` is tied to `current->nsproxy->mnt_ns`, and as long as `current` is running `copy_mnt_ns`, it should be holding a reference to `ns`, and consequently, `ns` should be holding a reference to `old`. I initially thought the reference counting would prevent this; `alloc_vfsmnt` starts with a count of 1, and `free_mnt_ns` calls `umount_tree`, which in turn calls `mntput`. But then I realized, what if `umount_tree` is called on `old` *before* `free_mnt_ns`? If `umount_tree` is called on `old`, and `old->mnt_count` is 1, it gets freed, and `ns->root` is left dangling.

So, can `umount_tree` be called on `ns->root`? Yes, that's possible if someone calls `umount` on it, or if `__detach_mounts` is called on the dentry it's mounted on. Hmm, but `ns->root` is the *root* of the namespace, and it's mounted on `old->mnt.mnt_root`. It *shouldn't* be mounted on any dentry in another namespace... Wait, that's it! `struct mount` isn't shared. No, wait, this isn't right. What if `ns->root` is a mount created but not fully owned by the namespace? No, `copy_mnt_ns` is called on an existing namespace.

Okay, let's go back and look at those "potentially related work creation" events. It's `__detach_mounts` called from `d_invalidate` on a proc dentry. Is it `/proc/<pid>/ns/mnt`? Mounting on that doesn't help. No, the mount itself is not a proc dentry. Aha! Wait, I think I've got something with `mntns_install` from `setns`! It's like this: if a process does `setns` to a namespace, and then `unshare`.

Let's revisit that "potentially related work creation" point again. It's `release_task` -> `proc_invalidate_siblings_dcache`. This happens when a task exits and it invalidates `/proc/<pid>/` dentries. If a mount was attached to `/proc/<pid>/ns/mnt`... Wait, if I do `mount --bind /some/where /proc/self/ns/mnt`, and then I exit, `/proc/self/ns/mnt` gets invalidated, and the mount is detached. If that mount was `ns->root`... But how can `ns->root` be mounted there? Oh wait, that can't work that way either because `detach_mounts` is called on the dentry in the *parent* procfs.

Wait a second. I think I see what's going on. This is about **detached mounts**. With the new mount API, you can create a mount that isn't attached to any namespace initially. `fsmount` returns a detached mount, and then you can create a new namespace from it, like `open_new_namespace` does. Wait, looking at `create_new_namespace` again... that's used by `open_tree(..., OPEN_TREE_NEW_NS)`. It creates a new namespace and sets `new_ns->root`. But the syzkaller report is about `unshare`, not `open_tree`.

Okay, let's look at that allocation stack from the report again. It's `do_new_mount`, and that's on line 3836 in `fs/namespace.c`. I need to review what is happening there. |
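The refcount hypothesis in the step above (a namespace root pointer that does not pin the mount it points to) can be pictured with a small userspace model. This is only a sketch under stated assumptions: `toy_mount`, `toy_ns`, and every `toy_*` function are hypothetical names invented for illustration, not kernel APIs, and the model says nothing about whether the real `ns->root` holds a reference. The object is flagged as freed instead of being released, so the model itself never touches freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's struct mount or struct mnt_namespace. */
struct toy_mount {
	int refcount; /* loosely models a mount reference count */
	bool freed;   /* set instead of calling free(), so the model stays safe */
};

struct toy_ns {
	struct toy_mount *root; /* plain pointer; pinning depends on the setter used */
};

static void toy_get(struct toy_mount *m)
{
	m->refcount++;
}

static void toy_put(struct toy_mount *m)
{
	/* Dropping the last reference "frees" the object. */
	if (--m->refcount == 0)
		m->freed = true;
}

/* Assumed buggy pattern: store the pointer without taking a reference. */
static void toy_ns_set_root_unpinned(struct toy_ns *ns, struct toy_mount *m)
{
	ns->root = m;
}

/* Contrasting pattern: the namespace takes its own reference. */
static void toy_ns_set_root_pinned(struct toy_ns *ns, struct toy_mount *m)
{
	toy_get(m);
	ns->root = m;
}

/* Models an unmount path dropping the reference held by the mount tree. */
static void toy_umount(struct toy_mount *m)
{
	toy_put(m);
}

/* Returns true if ns->root points at an object already marked freed. */
static bool toy_root_dangles_after_umount(bool pinned)
{
	struct toy_mount m = { .refcount = 1, .freed = false };
	struct toy_ns ns = { .root = NULL };

	if (pinned)
		toy_ns_set_root_pinned(&ns, &m);
	else
		toy_ns_set_root_unpinned(&ns, &m);

	toy_umount(&m);
	return ns.root != NULL && ns.root->freed;
}
```

Calling `toy_root_dangles_after_umount(false)` models the suspected sequence (unmount first, then a later clone reads the root pointer), while passing `true` shows how an extra reference would keep the object alive in this toy setting.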
| 31/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/namespace.c IncludeLines:true Name:do_new_mount] Results: map[SourceCode:3785: /*
3786: * create a new mount for userspace and request it to be added into the
3787: * namespace's tree
3788: */
3789: static int do_new_mount(const struct path *path, const char *fstype,
3790: int sb_flags, int mnt_flags,
3791: const char *name, void *data)
3792: {
3793: struct file_system_type *type;
3794: struct fs_context *fc;
3795: const char *subtype = NULL;
3796: int err = 0;
3797:
3798: if (!fstype)
3799: return -EINVAL;
3800:
3801: type = get_fs_type(fstype);
3802: if (!type)
3803: return -ENODEV;
3804:
3805: if (type->fs_flags & FS_HAS_SUBTYPE) {
3806: subtype = strchr(fstype, '.');
3807: if (subtype) {
3808: subtype++;
3809: if (!*subtype) {
3810: put_filesystem(type);
3811: return -EINVAL;
3812: }
3813: }
3814: }
3815:
3816: fc = fs_context_for_mount(type, sb_flags);
3817: put_filesystem(type);
3818: if (IS_ERR(fc))
3819: return PTR_ERR(fc);
3820:
3821: /*
3822: * Indicate to the filesystem that the mount request is coming
3823: * from the legacy mount system call.
3824: */
3825: fc->oldapi = true;
3826:
3827: if (subtype)
3828: err = vfs_parse_fs_string(fc, "subtype", subtype);
3829: if (!err && name)
3830: err = vfs_parse_fs_string(fc, "source", name);
3831: if (!err)
3832: err = parse_monolithic_mount_data(fc, data);
3833: if (!err && !mount_capable(fc))
3834: err = -EPERM;
3835: if (!err)
3836: err = do_new_mount_fc(fc, path, mnt_flags);
3837:
3838: put_fs_context(fc);
3839: return err;
3840: }
SourceFile:fs/namespace.c] |
| 32/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 44744 output: 39 thoughts: 1648 Thoughts:

**Vulnerability Analysis: Dangling Pointer in Namespace Root**

Okay, here's what I've got. It seems like the core issue lies with how namespace roots are handled, specifically, how mounts become roots and, crucially, how they might be unmounted from under the namespace's nose. It starts with a standard `mount` syscall, which calls `vfs_create_mount`, nothing out of the ordinary so far. The key question is how this mount ends up as `ns->root` and, more importantly, how it gets freed.

My initial focus was on `copy_mnt_ns` during namespace cloning (`unshare`), where a NULL `ns->root` would definitely cause a crash, and a freed `ns->root` is obviously a use-after-free scenario. But is `ns->root` always what I think it is? The code seems to suggest so.

Aha, that's it! Looking at "Second to last potentially related work creation" shows that `__detach_mounts` is called, specifically when a mountpoint is invalidated. If that mountpoint happens to be `ns->root`, we're in trouble.

Okay, let's craft a scenario. If I have:

1. Namespace `NS1`.
2. Within `NS1`, a mount `M1` on `/foo`.
3. I somehow create `NS2` using `M1` as its root... wait how do I do that? `unshare` doesn't support that. Nor does `setns`.

Aha! `open_tree(..., OPEN_TREE_NEW_NS)` does. I read `create_new_namespace` earlier.

```c
3159: new_ns->root = real_mount(no_free_ptr(to_path.mnt));
```

It sets `new_ns->root` to a mount. Then that mount is cloned.

```c
3137: mnt = __do_loopback(path, flags, copy_flags);
...
3148: attach_mnt(mnt, new_ns_root, mp.mp);
```

So, what does this actually mean? If I understand correctly, it looks like `new_ns->root` will be a clone of a mount (`mnt`) that's attached to the *real* root of the namespace (`new_ns_root`). But `new_ns->root` points to a *submount*. So the namespace's root isn't the real root, it's a mount!

Okay, `OPEN_TREE_NEW_NS` basically says: "Here, take this mount as the root for your new namespace." If that mount (`mnt`) is unmounted from within the new namespace, `new_ns->root` becomes a dangling pointer. But wait, can I unmount the root of a namespace? If `new_ns->root` is actually a *submount* of the "true" root (`new_ns_root`), then yes, I *can* unmount it! That removes the mount and the refcount drops! But isn't `mnt_add_to_ns` called for it?

```c
3155: mnt_add_to_ns(new_ns, p);
```

Yes, it is, so it's in the namespace's list. That means if I unmount `new_ns->root`, `umount_tree` is called which clears `mnt->mnt_ns` and calls `mntput`. This would decrement the refcount, which would be 0 if the `mnt` was only pinned by `ns->root` and its parent, thus freeing it. However, the `ns->root` still contains the dangling pointer. So, if I attempt to clone `new_ns` using `unshare(CLONE_NEWNS)`, `copy_mnt_ns` would dereference `new_ns->root`. BOOM! UAF.

Let me check the syzkaller allocations, `do_new_mount`, not `open_tree`... Can a regular mount become `ns->root`? Only if it's the *root* of the namespace. During `unshare -m`, the new namespace's root is a clone of the old one's.

Wait a second. There is another attack surface! `pivot_root`. `pivot_root` lets me change the root of the namespace directly!

```c
3626: SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
3627: const char __user *, put_old)
```

It calls ` |
| 33/2 | 2026/02/06 12:35 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/namespace.c IncludeLines:true Name:path_pivot_root] Results: map[SourceCode:4617: int path_pivot_root(struct path *new, struct path *old)
4618: {
4619: struct path root __free(path_put) = {};
4620: struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent;
4621: int error;
4622:
4623: if (!may_mount())
4624: return -EPERM;
4625:
4626: error = security_sb_pivotroot(old, new);
4627: if (error)
4628: return error;
4629:
4630: get_fs_root(current->fs, &root);
4631:
4632: LOCK_MOUNT(old_mp, old);
4633: old_mnt = old_mp.parent;
4634: if (IS_ERR(old_mnt))
4635: return PTR_ERR(old_mnt);
4636:
4637: new_mnt = real_mount(new->mnt);
4638: root_mnt = real_mount(root.mnt);
4639: ex_parent = new_mnt->mnt_parent;
4640: root_parent = root_mnt->mnt_parent;
4641: if (IS_MNT_SHARED(old_mnt) ||
4642: IS_MNT_SHARED(ex_parent) ||
4643: IS_MNT_SHARED(root_parent))
4644: return -EINVAL;
4645: if (!check_mnt(root_mnt) || !check_mnt(new_mnt))
4646: return -EINVAL;
4647: if (new_mnt->mnt.mnt_flags & MNT_LOCKED)
4648: return -EINVAL;
4649: if (d_unlinked(new->dentry))
4650: return -ENOENT;
4651: if (new_mnt == root_mnt || old_mnt == root_mnt)
4652: return -EBUSY; /* loop, on the same file system */
4653: if (!path_mounted(&root))
4654: return -EINVAL; /* not a mountpoint */
4655: if (!mnt_has_parent(root_mnt))
4656: return -EINVAL; /* absolute root */
4657: if (!path_mounted(new))
4658: return -EINVAL; /* not a mountpoint */
4659: if (!mnt_has_parent(new_mnt))
4660: return -EINVAL; /* absolute root */
4661: /* make sure we can reach put_old from new_root */
4662: if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, new))
4663: return -EINVAL;
4664: /* make certain new is below the root */
4665: if (!is_path_reachable(new_mnt, new->dentry, &root))
4666: return -EINVAL;
4667: lock_mount_hash();
4668: umount_mnt(new_mnt);
4669: if (root_mnt->mnt.mnt_flags & MNT_LOCKED) {
4670: new_mnt->mnt.mnt_flags |= MNT_LOCKED;
4671: root_mnt->mnt.mnt_flags &= ~MNT_LOCKED;
4672: }
4673: /* mount new_root on / */
4674: attach_mnt(new_mnt, root_parent, root_mnt->mnt_mp);
4675: umount_mnt(root_mnt);
4676: /* mount old root on put_old */
4677: attach_mnt(root_mnt, old_mnt, old_mp.mp);
4678: touch_mnt_namespace(current->nsproxy->mnt_ns);
4679: /* A moved mount should not expire automatically */
4680: list_del_init(&new_mnt->mnt_expire);
4681: unlock_mount_hash();
4682: mnt_notify_add(root_mnt);
4683: mnt_notify_add(new_mnt);
4684: chroot_fs_refs(&root, new);
4685: return 0;
4686: }
SourceFile:fs/namespace.c] |
| 34/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 48679 output: 19 thoughts: 148 Thoughts: **Tracing the Filesystem Root in Namespace Creation** Okay, I'm diving into this code, and the first thing that jumps out is the missing assignment. Where's the `ns->root = ...` initialization? I might be looking at an older kernel version, or perhaps I'm just missing it in this particular snippet. I need to make sure I'm following the correct logic. Hold on, I see it! It's not `ns->root = ...` directly, it's `chroot_fs_refs(&root, new);`. This makes more sense, updating `current->fs->root` and `current->fs->pwd`. Right, that's the current process's filesystem, not necessarily the namespace's. But where is the *namespace's* root updated? That's the core question. In some kernel versions, this might not exist or be handled differently. I recall seeing the `struct mnt_namespace` definition earlier; it had `struct mount *root;` in it, which is good. Alright, let's take a step back. I need to get the definitive answer. I'm going to scan the entire file for all occurrences of `ns->root` to see how and where that root is assigned. I need to trace this initialization to be sure. |
| 35/2 | 2026/02/06 12:35 | tool | grepper |
0mError:no matches Args: map[Expression:->root = 1] Results: map[Output:] |
| 36/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 49141 output: 18 thoughts: 9 Thoughts: Okay, here's my attempt at summarizing the thought process, tailored for an expert user and in the first person: **Initial Investigation into a Root Cause** Alright, let's get down to brass tacks. I'm faced with this issue, and the obvious first move is to quickly identify if there's any mention of "root =" in the logs or configuration files. My gut tells me this could be a configuration variable, an environment setting, or even part of a debugging message. It's a key starting point to narrow down where the "root" might be defined or referenced. So, my initial thought is, "Okay, let's fire up `grep` and see what we can find. Scan for instances of 'root ='." This initial `grep` should provide some quick wins and a base level of understanding to begin with. |
| 37/2 | 2026/02/06 12:35 | tool | grepper |
0mArgs:map[Expression:root = 1] Results: map[Output:arch/um/kernel/um_arch.c=158=static int __init uml_root_setup(char *line, int *add)
arch/um/kernel/um_arch.c-159-{
arch/um/kernel/um_arch.c:160: have_root = 1;
arch/um/kernel/um_arch.c-161- return 0;
--
drivers/pci/pcie/ptm.c=41=void pci_ptm_init(struct pci_dev *dev)
--
drivers/pci/pcie/ptm.c-72- } else if (cap & PCI_PTM_CAP_ROOT) {
drivers/pci/pcie/ptm.c:73: dev->ptm_root = 1;
drivers/pci/pcie/ptm.c-74- } else if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END) {
--
fs/btrfs/backref.c=3176=static int handle_direct_tree_backref(struct btrfs_backref_cache *cache,
--
fs/btrfs/backref.c-3189-
fs/btrfs/backref.c:3190: cur->is_reloc_root = 1;
fs/btrfs/backref.c-3191- /* Only reloc backref cache cares about a specific root */
--
fs/btrfs/inode.c=3683=int btrfs_orphan_cleanup(struct btrfs_root *root)
--
fs/btrfs/inode.c-3795- if (dead_root && btrfs_root_refs(&dead_root->root_item) == 0)
fs/btrfs/inode.c:3796: is_dead_root = 1;
fs/btrfs/inode.c-3797- spin_unlock(&fs_info->fs_roots_radix_lock);
--
fs/ocfs2/dir.c=4191=static int ocfs2_prepare_dx_dir_for_insert(struct inode *dir,
--
fs/ocfs2/dir.c-4196-{
fs/ocfs2/dir.c:4197: int ret, free_dx_root = 1;
fs/ocfs2/dir.c-4198- struct ocfs2_super *osb = OCFS2_SB(dir->i_sb);
--
net/ipv4/fib_trie.c=291=static const int inflate_threshold = 50;
net/ipv4/fib_trie.c:292:static const int halve_threshold_root = 15;
net/ipv4/fib_trie.c-293-static const int inflate_threshold_root = 30;
] |
| 38/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 50079 output: 19 thoughts: 14 Thoughts: Here's your summary: **Grep Frustration and a Better Path** Okay, my initial `grep` attempt was clearly not cutting it. I was expecting a certain pattern, but it wasn't delivering. The specific issue is likely related to how I was trying to find a specific assignment or connection within the codebase. I need to be more precise in my search. My intuition is that the correct pattern I should have been looking for is `->root =`. That should narrow down the results to what I'm actually interested in, which I'm pretty certain is the relevant assignment related to `root`. This should give me a much more focused view of where this variable is being manipulated. Time to try that and see what it turns up. |
| 39/2 | 2026/02/06 12:35 | tool | grepper |
0mError:no matches Args: map[Expression:->root = 1] Results: map[Output:] |
| 40/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 50282 output: 24 thoughts: 17 Thoughts: Okay, here's my take on those thoughts, tailored for an expert audience: **Targeted Code Search: Root in Namespace.c** Alright, let's get down to it. Given the context, my initial thought is to zero in on the core of the matter. The problem likely revolves around how the "root" directory is handled within the filesystem namespace, and the `fs/namespace.c` file feels like the logical starting point. I'll initiate a search directly within that file for the term "root". This is a focused approach. I'm not going to start with broad file searches across the entire kernel source tree. This is about precision and rapid prototyping a solution. By directly targeting `namespace.c` and searching for "root", I anticipate quickly identifying the key data structures, functions, or macros involved in the root directory's management. This should provide insights into the relevant code, facilitating a deeper understanding of the issue and how to resolve it. I'll need to look at how 'root' is defined, initialized, and referenced within that specific context, in order to rapidly diagnose and resolve the issue. |
| 41/2 | 2026/02/06 12:35 | tool | codesearch-file-index |
0mArgs:map[SourceFile:fs/namespace.c] Results: map[Entities:[map[Kind:function Name:__cleanup_mnt] map[Kind:function Name:__detach_mounts] map[Kind:function Name:__do_loopback] map[Kind:function Name:__do_sys_fsmount] map[Kind:function Name:__do_sys_listmount] map[Kind:function Name:__do_sys_mount] map[Kind:function Name:__do_sys_mount_setattr] map[Kind:function Name:__do_sys_move_mount] map[Kind:function Name:__do_sys_oldumount] map[Kind:function Name:__do_sys_open_tree] map[Kind:function Name:__do_sys_open_tree_attr] map[Kind:function Name:__do_sys_pivot_root] map[Kind:function Name:__do_sys_statmount] map[Kind:function Name:__do_sys_umount] map[Kind:function Name:__free_klistmount_free] map[Kind:function Name:__free_mnt_ns_release] map[Kind:function Name:__free_mntput] map[Kind:function Name:__free_put_empty_mnt_ns] map[Kind:function Name:__has_locked_children] map[Kind:function Name:__ia32_sys_fsmount] map[Kind:function Name:__ia32_sys_listmount] map[Kind:function Name:__ia32_sys_mount] map[Kind:function Name:__ia32_sys_mount_setattr] map[Kind:function Name:__ia32_sys_move_mount] map[Kind:function Name:__ia32_sys_oldumount] map[Kind:function Name:__ia32_sys_open_tree] map[Kind:function Name:__ia32_sys_open_tree_attr] map[Kind:function Name:__ia32_sys_pivot_root] map[Kind:function Name:__ia32_sys_statmount] map[Kind:function Name:__ia32_sys_umount] map[Kind:function Name:__is_local_mountpoint] map[Kind:function Name:__legitimize_mnt] map[Kind:function Name:__lookup_mnt] map[Kind:function Name:__mnt_is_readonly] map[Kind:function Name:__se_sys_fsmount] map[Kind:function Name:__se_sys_listmount] map[Kind:function Name:__se_sys_mount] map[Kind:function Name:__se_sys_mount_setattr] map[Kind:function Name:__se_sys_move_mount] map[Kind:function Name:__se_sys_oldumount] map[Kind:function Name:__se_sys_open_tree] map[Kind:function Name:__se_sys_open_tree_attr] map[Kind:function Name:__se_sys_pivot_root] map[Kind:function Name:__se_sys_statmount] map[Kind:function 
Name:__se_sys_umount] map[Kind:function Name:__touch_mnt_namespace] map[Kind:function Name:__umount_mnt] map[Kind:function Name:__unlock_mount] map[Kind:function Name:__x64_sys_fsmount] map[Kind:function Name:__x64_sys_listmount] map[Kind:function Name:__x64_sys_mount] map[Kind:function Name:__x64_sys_mount_setattr] map[Kind:function Name:__x64_sys_move_mount] map[Kind:function Name:__x64_sys_oldumount] map[Kind:function Name:__x64_sys_open_tree] map[Kind:function Name:__x64_sys_open_tree_attr] map[Kind:function Name:__x64_sys_pivot_root] map[Kind:function Name:__x64_sys_statmount] map[Kind:function Name:__x64_sys_umount] map[Kind:function Name:alloc_mnt_ns] map[Kind:function Name:alloc_vfsmnt] map[Kind:function Name:attach_mnt] map[Kind:function Name:attach_recursive_mnt] map[Kind:function Name:attr_flags_to_mnt_flags] map[Kind:function Name:build_mount_idmapped] map[Kind:function Name:build_mount_kattr] map[Kind:function Name:can_change_locked_flags] map[Kind:function Name:can_idmap_mount] map[Kind:function Name:can_move_mount_beneath] map[Kind:function Name:can_umount] map[Kind:function Name:change_mount_ro_state] map[Kind:function Name:check_anonymous_mnt] map[Kind:function Name:check_for_nsfs_mounts] map[Kind:function Name:check_mnt] map[Kind:function Name:class_namespace_excl_constructor] map[Kind:function Name:class_namespace_excl_destructor] map[Kind:function Name:class_namespace_excl_lock_err] map[Kind:function Name:class_namespace_excl_lock_ptr] map[Kind:function Name:class_namespace_shared_constructor] map[Kind:function Name:class_namespace_shared_destructor] map[Kind:function Name:class_namespace_shared_lock_err] map[Kind:function Name:class_namespace_shared_lock_ptr] map[Kind:function Name:cleanup_group_ids] map[Kind:function Name:cleanup_mnt] map[Kind:function Name:clone_mnt] map[Kind:function Name:clone_private_mount] map[Kind:function Name:collect_paths] map[Kind:function Name:commit_tree] map[Kind:function Name:copy_mnt_id_req] map[Kind:function 
Name:copy_mnt_ns] map[Kind:function Name:copy_mount_options] map[Kind:function Name:copy_mount_string] map[Kind:function Name:copy_statmount_to_user] map[Kind:function Name:copy_tree] map[Kind:function Name:count_mounts] map[Kind:function Name:create_new_namespace] map[Kind:function Name:current_chrooted] map[Kind:function Name:dec_mnt_namespaces] map[Kind:function Name:delayed_free_vfsmnt] map[Kind:function Name:delayed_mntput] map[Kind:function Name:disconnect_mount] map[Kind:function Name:dissolve_on_fput] map[Kind:function Name:do_add_mount] map[Kind:function Name:do_change_type] map[Kind:function Name:do_idmap_mount] map[Kind:function Name:do_listmount] map[Kind:function Name:do_lock_mount] map[Kind:function Name:do_loopback] map[Kind:function Name:do_mount] map[Kind:function Name:do_mount_setattr] map[Kind:function Name:do_move_mount] map[Kind:function Name:do_move_mount_old] map[Kind:function Name:do_new_mount] map[Kind:function Name:do_new_mount_fc] map[Kind:function Name:do_reconfigure_mnt] map[Kind:function Name:do_remount] map[Kind:function Name:do_set_group] map[Kind:function Name:do_statmount] map[Kind:function Name:do_umount] map[Kind:function Name:do_umount_root] map[Kind:function Name:drop_collected_paths] map[Kind:function Name:extend_array] map[Kind:function Name:fc_mount] map[Kind:function Name:fc_mount_longterm] map[Kind:function Name:finish_automount] map[Kind:function Name:finish_mount_kattr] map[Kind:function Name:flags_to_propagation_type] map[Kind:function Name:free_mnt_ns] map[Kind:function Name:free_vfsmnt] map[Kind:function Name:from_mnt_ns] map[Kind:function Name:get_detached_copy] map[Kind:function Name:get_mountpoint] map[Kind:function Name:get_sequential_mnt_ns] map[Kind:function Name:grab_requested_mnt_ns] map[Kind:function Name:grab_requested_root] map[Kind:function Name:graft_tree] map[Kind:function Name:has_locked_children] map[Kind:function Name:inc_mnt_namespaces] map[Kind:function Name:init_fs_namespace_sysctls] 
map[Kind:function Name:init_mount_tree] map[Kind:function Name:initramfs_options_setup] map[Kind:function Name:invent_group_ids] map[Kind:function Name:is_mnt_ns_file] map[Kind:function Name:is_path_reachable] map[Kind:function Name:kern_mount] map[Kind:function Name:kern_unmount] map[Kind:function Name:kern_unmount_array] map[Kind:function Name:ksys_umount] map[Kind:function Name:legitimize_mnt] map[Kind:function Name:listmnt_next] map[Kind:function Name:lock_mnt_tree] map[Kind:function Name:lock_mount_exact] map[Kind:function Name:lock_mount_hash] map[Kind:function Name:lookup_mnt] map[Kind:function Name:lookup_mnt_in_ns] map[Kind:function Name:lookup_mnt_ns] map[Kind:function Name:lookup_mountpoint] map[Kind:function Name:m_hash] map[Kind:function Name:m_next] map[Kind:function Name:m_show] map[Kind:function Name:m_start] map[Kind:function Name:m_stop] map[Kind:function Name:make_visible] map[Kind:function Name:mark_mounts_for_expiry] map[Kind:function Name:may_change_propagation] map[Kind:function Name:may_copy_tree] map[Kind:function Name:may_mount] map[Kind:function Name:may_umount] map[Kind:function Name:may_umount_tree] map[Kind:function Name:may_use_mount] map[Kind:function Name:maybe_free_mountpoint] map[Kind:function Name:mnt_add_count] map[Kind:function Name:mnt_add_instance] map[Kind:function Name:mnt_add_to_ns] map[Kind:function Name:mnt_alloc_group_id] map[Kind:function Name:mnt_alloc_id] map[Kind:function Name:mnt_allow_writers] map[Kind:function Name:mnt_already_visible] map[Kind:function Name:mnt_change_mountpoint] map[Kind:function Name:mnt_clone_internal] map[Kind:function Name:mnt_dec_writers] map[Kind:function Name:mnt_del_instance] map[Kind:function Name:mnt_drop_write] map[Kind:function Name:mnt_drop_write_file] map[Kind:function Name:mnt_find_id_at] map[Kind:function Name:mnt_find_id_at_reverse] map[Kind:function Name:mnt_free_id] map[Kind:function Name:mnt_get_count] map[Kind:function Name:mnt_get_write_access] map[Kind:function 
Name:mnt_get_write_access_file] map[Kind:function Name:mnt_get_writers] map[Kind:function Name:mnt_hold_writers] map[Kind:function Name:mnt_inc_writers] map[Kind:function Name:mnt_init] map[Kind:function Name:mnt_is_readonly] map[Kind:function Name:mnt_make_readonly] map[Kind:function Name:mnt_make_shortterm] map[Kind:function Name:mnt_may_suid] map[Kind:function Name:mnt_notify] map[Kind:function Name:mnt_ns_from_dentry] map[Kind:function Name:mnt_ns_loop] map[Kind:function Name:mnt_ns_release] map[Kind:function Name:mnt_ns_release_rcu] map[Kind:function Name:mnt_ns_tree_remove] map[Kind:function Name:mnt_put_write_access] map[Kind:function Name:mnt_put_write_access_file] map[Kind:function Name:mnt_release_group_id] map[Kind:function Name:mnt_set_expiry] map[Kind:function Name:mnt_set_mountpoint] map[Kind:function Name:mnt_to_attr_flags] map[Kind:function Name:mnt_to_propagation_flags] map[Kind:function Name:mnt_unhold_writers] map[Kind:function Name:mnt_want_write] map[Kind:function Name:mnt_want_write_file] map[Kind:function Name:mnt_warn_timestamp_expiry] map[Kind:function Name:mntget] map[Kind:function Name:mntns_get] map[Kind:function Name:mntns_install] map[Kind:function Name:mntns_owner] map[Kind:function Name:mntns_put] map[Kind:function Name:mntput] map[Kind:function Name:mntput_no_expire] map[Kind:function Name:mntput_no_expire_slowpath] map[Kind:function Name:mount_is_ancestor] map[Kind:function Name:mount_setattr_commit] map[Kind:function Name:mount_setattr_prepare] map[Kind:function Name:mount_subtree] map[Kind:function Name:mount_too_revealing] map[Kind:function Name:mp_hash] map[Kind:function Name:namespace_lock] map[Kind:function Name:namespace_unlock] map[Kind:function Name:need_notify_mnt_list] map[Kind:function Name:next_mnt] map[Kind:function Name:node_to_mount] map[Kind:function Name:notify_mnt_list] map[Kind:function Name:open_detached_copy] map[Kind:function Name:open_new_namespace] map[Kind:function Name:our_mnt] map[Kind:function 
Name:path_is_mountpoint] map[Kind:function Name:path_is_under] map[Kind:function Name:path_mount] map[Kind:function Name:path_overmounted] map[Kind:function Name:path_pivot_root] map[Kind:function Name:path_umount] map[Kind:function Name:prepare_klistmount] map[Kind:function Name:prepare_kstatmount] map[Kind:function Name:put_mnt_ns] map[Kind:function Name:recalc_flags] map[Kind:function Name:retry_statmount] map[Kind:function Name:sb_prepare_remount_readonly] map[Kind:function Name:select_submounts] map[Kind:function Name:set_mhash_entries] map[Kind:function Name:set_mount_attributes] map[Kind:function Name:set_mphash_entries] map[Kind:function Name:setup_mnt] map[Kind:function Name:show_path] map[Kind:function Name:shrink_submounts] map[Kind:function Name:skip_mnt_tree] map[Kind:function Name:statmount_fs_subtype] map[Kind:function Name:statmount_fs_type] map[Kind:function Name:statmount_mnt_basic] map[Kind:function Name:statmount_mnt_gidmap] map[Kind:function Name:statmount_mnt_ns_id] map[Kind:function Name:statmount_mnt_opts] map[Kind:function Name:statmount_mnt_point] map[Kind:function Name:statmount_mnt_root] map[Kind:function Name:statmount_mnt_uidmap] map[Kind:function Name:statmount_opt_array] map[Kind:function Name:statmount_opt_process] map[Kind:function Name:statmount_opt_sec_array] map[Kind:function Name:statmount_propagate_from] map[Kind:function Name:statmount_sb_basic] map[Kind:function Name:statmount_sb_source] map[Kind:function Name:statmount_string] map[Kind:function Name:touch_mnt_namespace] map[Kind:function Name:tree_contains_unbindable] map[Kind:function Name:umount_mnt] map[Kind:function Name:umount_tree] map[Kind:function Name:unlock_mount] map[Kind:function Name:unlock_mount_hash] map[Kind:function Name:unpin_mountpoint] map[Kind:function Name:vfs_create_mount] map[Kind:function Name:vfs_kern_mount] map[Kind:function Name:vfs_move_mount] map[Kind:function Name:vfs_open_tree] map[Kind:function Name:vfsmount_to_propagation_flags] 
map[Kind:function Name:wants_mount_setattr] map[Kind:function Name:warn_mandlock] map[Kind:function Name:where_to_mount] map[Kind:struct Name:klistmount] map[Kind:struct Name:kstatmount] map[Kind:struct Name:mount_kattr] map[Kind:struct Name:pinned_mountpoint] map[Kind:enum Name:mnt_tree_flags_t] map[Kind:enum Name:mount_kattr_flags_t] map[Kind:enum Name:umount_tree_flags] map[Kind:typedef Name:class_namespace_excl_t] map[Kind:typedef Name:class_namespace_shared_t] map[Kind:typedef Name:lock_namespace_excl_t] map[Kind:typedef Name:lock_namespace_shared_t]]] |
| 42/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 55653 output: 20 thoughts: 57 Thoughts: **Navigating the Namespace Functions** I'm trying to locate the `create_mnt_ns` function. I expected to find it in `fs/namespace.c`, but a quick scan of that file didn't turn it up. Either I missed it on my first read-through, or it lives somewhere else. My next step is to examine `fs/mount.h`, which may hold the declaration or an inline definition, if not the implementation itself. Let's see what `mount.h` reveals. |
| 43/2 | 2026/02/06 12:35 | tool | read-file |
0mArgs:map[File:fs/mount.h] Results: map[Contents:/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/mount.h>
#include <linux/seq_file.h>
#include <linux/poll.h>
#include <linux/ns_common.h>
#include <linux/fs_pin.h>
extern struct file_system_type nullfs_fs_type;
extern struct list_head notify_list;
struct mnt_namespace {
struct ns_common ns;
struct mount * root;
struct {
struct rb_root mounts; /* Protected by namespace_sem */
struct rb_node *mnt_last_node; /* last (rightmost) mount in the rbtree */
struct rb_node *mnt_first_node; /* first (leftmost) mount in the rbtree */
};
struct user_namespace *user_ns;
struct ucounts *ucounts;
wait_queue_head_t poll;
u64 seq_origin; /* Sequence number of origin mount namespace */
u64 event;
#ifdef CONFIG_FSNOTIFY
__u32 n_fsnotify_mask;
struct fsnotify_mark_connector __rcu *n_fsnotify_marks;
#endif
unsigned int nr_mounts; /* # of mounts in the namespace */
unsigned int pending_mounts;
refcount_t passive; /* number references not pinning @mounts */
bool is_anon;
} __randomize_layout;
struct mnt_pcp {
int mnt_count;
int mnt_writers;
};
struct mountpoint {
struct hlist_node m_hash;
struct dentry *m_dentry;
struct hlist_head m_list;
};
struct mount {
struct hlist_node mnt_hash;
struct mount *mnt_parent;
struct dentry *mnt_mountpoint;
struct vfsmount mnt;
union {
struct rb_node mnt_node; /* node in the ns->mounts rbtree */
struct rcu_head mnt_rcu;
struct llist_node mnt_llist;
};
#ifdef CONFIG_SMP
struct mnt_pcp __percpu *mnt_pcp;
#else
int mnt_count;
int mnt_writers;
#endif
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */
struct mount * __aligned(1) *mnt_pprev_for_sb;
/* except that LSB of pprev is stolen */
#define WRITE_HOLD 1 /* ... for use by mnt_hold_writers() */
const char *mnt_devname; /* Name of device e.g. /dev/dsk/hda1 */
struct list_head mnt_list;
struct list_head mnt_expire; /* link in fs-specific expiry list */
struct list_head mnt_share; /* circular list of shared mounts */
struct hlist_head mnt_slave_list;/* list of slave mounts */
struct hlist_node mnt_slave; /* slave list entry */
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
union {
struct hlist_node mnt_mp_list; /* list mounts with the same mountpoint */
struct hlist_node mnt_umount;
};
#ifdef CONFIG_FSNOTIFY
struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
struct list_head to_notify; /* need to queue notification */
struct mnt_namespace *prev_ns; /* previous namespace (NULL if none) */
#endif
int mnt_t_flags; /* namespace_sem-protected flags */
int mnt_id; /* mount identifier, reused */
u64 mnt_id_unique; /* mount ID unique until reboot */
int mnt_group_id; /* peer group identifier */
int mnt_expiry_mark; /* true if marked for expiry */
struct hlist_head mnt_pins;
struct hlist_head mnt_stuck_children;
struct mount *overmount; /* mounted on ->mnt_root */
} __randomize_layout;
enum {
T_SHARED = 1, /* mount is shared */
T_UNBINDABLE = 2, /* mount is unbindable */
T_MARKED = 4, /* internal mark for propagate_... */
T_UMOUNT_CANDIDATE = 8, /* for propagate_umount */
/*
* T_SHARED_MASK is the set of flags that should be cleared when a
* mount becomes shared. Currently, this is only the flag that says a
* mount cannot be bind mounted, since this is how we create a mount
* that shares events with another mount. If you add a new T_*
* flag, consider how it interacts with shared mounts.
*/
T_SHARED_MASK = T_UNBINDABLE,
};
#define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
static inline struct mount *real_mount(struct vfsmount *mnt)
{
return container_of(mnt, struct mount, mnt);
}
static inline int mnt_has_parent(const struct mount *mnt)
{
return mnt != mnt->mnt_parent;
}
static inline int is_mounted(struct vfsmount *mnt)
{
/* neither detached nor internal? */
return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
}
extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
extern int __legitimize_mnt(struct vfsmount *, unsigned);
static inline bool __path_is_mountpoint(const struct path *path)
{
struct mount *m = __lookup_mnt(path->mnt, path->dentry);
return m && likely(!(m->mnt.mnt_flags & MNT_SYNC_UMOUNT));
}
extern void __detach_mounts(struct dentry *dentry);
static inline void detach_mounts(struct dentry *dentry)
{
if (!d_mountpoint(dentry))
return;
__detach_mounts(dentry);
}
static inline void get_mnt_ns(struct mnt_namespace *ns)
{
ns_ref_inc(ns);
}
extern seqlock_t mount_lock;
DEFINE_LOCK_GUARD_0(mount_writer, write_seqlock(&mount_lock),
write_sequnlock(&mount_lock))
DEFINE_LOCK_GUARD_0(mount_locked_reader, read_seqlock_excl(&mount_lock),
read_sequnlock_excl(&mount_lock))
struct proc_mounts {
struct mnt_namespace *ns;
struct path root;
int (*show)(struct seq_file *, struct vfsmount *);
};
extern const struct seq_operations mounts_op;
extern bool __is_local_mountpoint(const struct dentry *dentry);
static inline bool is_local_mountpoint(const struct dentry *dentry)
{
if (!d_mountpoint(dentry))
return false;
return __is_local_mountpoint(dentry);
}
static inline bool is_anon_ns(struct mnt_namespace *ns)
{
return ns->is_anon;
}
static inline bool anon_ns_root(const struct mount *m)
{
struct mnt_namespace *ns = READ_ONCE(m->mnt_ns);
return !IS_ERR_OR_NULL(ns) && is_anon_ns(ns) && m == ns->root;
}
static inline bool mnt_ns_attached(const struct mount *mnt)
{
return !RB_EMPTY_NODE(&mnt->mnt_node);
}
static inline bool mnt_ns_empty(const struct mnt_namespace *ns)
{
return RB_EMPTY_ROOT(&ns->mounts);
}
static inline void move_from_ns(struct mount *mnt)
{
struct mnt_namespace *ns = mnt->mnt_ns;
WARN_ON(!mnt_ns_attached(mnt));
if (ns->mnt_last_node == &mnt->mnt_node)
ns->mnt_last_node = rb_prev(&mnt->mnt_node);
if (ns->mnt_first_node == &mnt->mnt_node)
ns->mnt_first_node = rb_next(&mnt->mnt_node);
rb_erase(&mnt->mnt_node, &ns->mounts);
RB_CLEAR_NODE(&mnt->mnt_node);
}
bool has_locked_children(struct mount *mnt, struct dentry *dentry);
struct mnt_namespace *get_sequential_mnt_ns(struct mnt_namespace *mnt_ns,
bool previous);
static inline struct mnt_namespace *to_mnt_ns(struct ns_common *ns)
{
return container_of(ns, struct mnt_namespace, ns);
}
#ifdef CONFIG_FSNOTIFY
static inline void mnt_notify_add(struct mount *m)
{
/* Optimize the case where there are no watches */
if ((m->mnt_ns && m->mnt_ns->n_fsnotify_marks) ||
(m->prev_ns && m->prev_ns->n_fsnotify_marks))
list_add_tail(&m->to_notify, &notify_list);
else
m->prev_ns = m->mnt_ns;
}
#else
static inline void mnt_notify_add(struct mount *m)
{
}
#endif
static inline struct mount *topmost_overmount(struct mount *m)
{
while (m->overmount)
m = m->overmount;
return m;
}
static inline bool __test_write_hold(struct mount * __aligned(1) *val)
{
return (unsigned long)val & WRITE_HOLD;
}
static inline bool test_write_hold(const struct mount *m)
{
return __test_write_hold(m->mnt_pprev_for_sb);
}
static inline void set_write_hold(struct mount *m)
{
m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb
| WRITE_HOLD);
}
static inline void clear_write_hold(struct mount *m)
{
m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb
& ~WRITE_HOLD);
}
struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry);
] |
| 44/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 58620 output: 20 thoughts: 311 Thoughts: **Analyzing `anon_ns_root` and Navigating the Filesystem Code**
Okay, I've got it. I was looking at line 178 and saw `anon_ns_root`. Right, there it is:
```c
178: static inline bool anon_ns_root(const struct mount *m)
179: {
180: struct mnt_namespace *ns = READ_ONCE(m->mnt_ns);
181:
182: return !IS_ERR_OR_NULL(ns) && is_anon_ns(ns) && m == ns->root;
183: }
```
This confirms that `ns->root` is indeed a `struct mount *`. Good, that solidifies my understanding of the relationship.
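The predicate above can be modeled in user space. The following is a minimal sketch, not the kernel's code: the `model_*` names and the simplified `ERR_PTR`-style helpers are my own assumptions, introduced only to show how the three checks in `anon_ns_root` combine (a NULL or error-pointer `mnt_ns` fails first, then a non-anonymous namespace, then a mount that is not the namespace root).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: simplified user-space stand-ins for the kernel's ERR_PTR helpers. */
#define MODEL_MAX_ERRNO 4095
#define MODEL_EINVAL    22

/* Encode a negative errno value as a pointer, in the style of ERR_PTR(). */
static inline void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True for NULL or for a pointer in the top MODEL_MAX_ERRNO addresses. */
static inline bool model_is_err_or_null(const void *p)
{
	return p == NULL || (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct model_mount;

/* Hypothetical, stripped-down namespace: only the fields the predicate reads. */
struct model_ns {
	struct model_mount *root;	/* plain pointer, no reference counting here */
	bool is_anon;
};

/* Hypothetical, stripped-down mount: only the back-pointer to its namespace. */
struct model_mount {
	struct model_ns *mnt_ns;	/* NULL, an error pointer, or a real namespace */
};

/* Same three-part test as anon_ns_root(), applied to the model types. */
static inline bool model_anon_ns_root(const struct model_mount *m)
{
	const struct model_ns *ns = m->mnt_ns;

	return !model_is_err_or_null(ns) && ns->is_anon && m == ns->root;
}
```

With this model, a mount whose `mnt_ns` is NULL or an error pointer, a mount in a non-anonymous namespace, and a non-root mount in an anonymous namespace all yield `false`. The check only compares pointers, so by itself it says nothing about whether the object behind `ns->root` is still alive.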
Now, back to `fs/namespace.c`: I need to track how `ns->root` is set. I already ran `codesearch-file-index` on that file, so rather than repeat it, I should go back over those results and read the functions that look relevant.
Let's see, what about `mnt_ns_release`? I'll check what that function does. It *should* be in this file.
```c
4198: void mnt_ns_release(struct mnt_namespace *ns)
```
Hmm, it's not showing up in my usual search tools, and the line number 4198 is only a guess, so I may be looking at the wrong part of the file.
Let's look around line 4200 in `fs/namespace.c` to get my bearings. I need to find out what that function does and, more importantly, whether it has anything to do with the namespace root.
|
| 45/2 | 2026/02/06 12:35 | tool | codesearch-dir-index |
0mArgs:map[Dir:fs/] Results: map[Files:[aio.c anon_inodes.c attr.c backing-file.c bad_inode.c binfmt_elf.c binfmt_elf_fdpic.c binfmt_flat.c binfmt_misc.c binfmt_script.c bpf_fs_kfuncs.c buffer.c char_dev.c compat_binfmt_elf.c coredump.c d_path.c dax.c dcache.c direct-io.c drop_caches.c eventfd.c eventpoll.c exec.c fcntl.c fhandle.c file.c file_attr.c file_table.c filesystems.c fs-writeback.c fs_context.c fs_dirent.c fs_parser.c fs_pin.c fs_struct.c fserror.c fsopen.c init.c inode.c internal.h ioctl.c kernel_read_file.c libfs.c locks.c mbcache.c mnt_idmapping.c mount.h mpage.c namei.c namespace.c nsfs.c nullfs.c open.c pidfs.c pipe.c pnode.c pnode.h posix_acl.c proc_namespace.c read_write.c readdir.c remap_range.c select.c seq_file.c signalfd.c splice.c stack.c stat.c statfs.c super.c sync.c sysctls.c timerfd.c userfaultfd.c utimes.c xattr.c] Subdirs:[9p adfs affs afs autofs befs bfs btrfs cachefiles ceph coda configfs cramfs crypto debugfs devpts dlm ecryptfs efivarfs efs erofs exfat exportfs ext2 ext4 f2fs fat freevxfs fuse gfs2 hfs hfsplus hostfs hpfs hugetlbfs iomap isofs jbd2 jffs2 jfs kernfs lockd minix netfs nfs nfs_common nfsd nilfs2 nls notify ntfs3 ocfs2 omfs openpromfs orangefs overlayfs proc pstore qnx4 qnx6 quota ramfs resctrl romfs smb squashfs sysfs tests tracefs ubifs udf ufs unicode vboxsf verity xfs zonefs]] |
| 46/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 60368 output: 20 thoughts: 14 Thoughts: **Diving into `fs/namespace.c`'s Conclusion** I'm going to read `fs/namespace.c` and skip ahead to the end of the file. The top of a file like this is mostly includes and groundwork, while the core logic tends to sit further down. I already know the general concepts (mount points, root directories, process isolation); what I want to see is how this implementation handles them, which edge cases it considers, and how it resolves them. |
| 47/2 | 2026/02/06 12:35 | tool | read-file |
0mArgs:map[File:fs/namespace.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
* linux/fs/namespace.c
*
* (C) Copyright Al Viro 2000, 2001
*
* Based on code from fs/super.c, copyright Linus Torvalds and others.
* Heavily rewritten.
*/
#include <linux/syscalls.h>
#include <linux/export.h>
#include <linux/capability.h>
#include <linux/mnt_namespace.h>
#include <linux/user_namespace.h>
#include <linux/namei.h>
#include <linux/security.h>
#include <linux/cred.h>
#include <linux/idr.h>
#include <linux/init.h> /* init_rootfs */
#include <linux/fs_struct.h> /* get_fs_root et.al. */
#include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
#include <linux/file.h>
#include <linux/uaccess.h>
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/memblock.h>
#include <linux/proc_fs.h>
#include <linux/task_work.h>
#include <linux/sched/task.h>
#include <uapi/linux/mount.h>
#include <linux/fs_context.h>
#include <linux/shmem_fs.h>
#include <linux/mnt_idmapping.h>
#include <linux/pidfs.h>
#include <linux/nstree.h>
#include "pnode.h"
#include "internal.h"
/* Maximum number of mounts in a mount namespace */
static unsigned int sysctl_mount_max __read_mostly = 100000;
static unsigned int m_hash_mask __ro_after_init;
static unsigned int m_hash_shift __ro_after_init;
static unsigned int mp_hash_mask __ro_after_init;
static unsigned int mp_hash_shift __ro_after_init;
static __initdata unsigned long mhash_entries;
static int __init set_mhash_entries(char *str)
{
return kstrtoul(str, 0, &mhash_entries) == 0;
}
__setup("mhash_entries=", set_mhash_entries);
static __initdata unsigned long mphash_entries;
static int __init set_mphash_entries(char *str)
{
return kstrtoul(str, 0, &mphash_entries) == 0;
}
__setup("mphash_entries=", set_mphash_entries);
static char * __initdata initramfs_options;
static int __init initramfs_options_setup(char *str)
{
initramfs_options = str;
return 1;
}
__setup("initramfs_options=", initramfs_options_setup);
static u64 event;
static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
static DEFINE_IDA(mnt_group_ida);
/* Don't allow confusion with old 32bit mount ID */
#define MNT_UNIQUE_ID_OFFSET (1ULL << 31)
static u64 mnt_id_ctr = MNT_UNIQUE_ID_OFFSET;
static struct hlist_head *mount_hashtable __ro_after_init;
static struct hlist_head *mountpoint_hashtable __ro_after_init;
static struct kmem_cache *mnt_cache __ro_after_init;
static DECLARE_RWSEM(namespace_sem);
static HLIST_HEAD(unmounted); /* protected by namespace_sem */
static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
static struct mnt_namespace *emptied_ns; /* protected by namespace_sem */
static inline void namespace_lock(void);
static void namespace_unlock(void);
DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock())
DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem),
up_read(&namespace_sem))
DEFINE_FREE(mntput, struct vfsmount *, if (!IS_ERR(_T)) mntput(_T))
#ifdef CONFIG_FSNOTIFY
LIST_HEAD(notify_list); /* protected by namespace_sem */
#endif
enum mount_kattr_flags_t {
MOUNT_KATTR_RECURSE = (1 << 0),
MOUNT_KATTR_IDMAP_REPLACE = (1 << 1),
};
struct mount_kattr {
unsigned int attr_set;
unsigned int attr_clr;
unsigned int propagation;
unsigned int lookup_flags;
enum mount_kattr_flags_t kflags;
struct user_namespace *mnt_userns;
struct mnt_idmap *mnt_idmap;
};
/* /sys/fs */
struct kobject *fs_kobj __ro_after_init;
EXPORT_SYMBOL_GPL(fs_kobj);
/*
* vfsmount lock may be taken for read to prevent changes to the
* vfsmount hash, ie. during mountpoint lookups or walking back
* up the tree.
*
* It should be taken for write in all cases where the vfsmount
* tree or hash is modified or when a vfsmount structure is modified.
*/
__cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
static void mnt_ns_release(struct mnt_namespace *ns)
{
/* keep alive for {list,stat}mount() */
if (ns && refcount_dec_and_test(&ns->passive)) {
fsnotify_mntns_delete(ns);
put_user_ns(ns->user_ns);
kfree(ns);
}
}
DEFINE_FREE(mnt_ns_release, struct mnt_namespace *,
if (!IS_ERR(_T)) mnt_ns_release(_T))
static void mnt_ns_release_rcu(struct rcu_head *rcu)
{
mnt_ns_release(container_of(rcu, struct mnt_namespace, ns.ns_rcu));
}
static void mnt_ns_tree_remove(struct mnt_namespace *ns)
{
/* remove from global mount namespace list */
if (ns_tree_active(ns))
ns_tree_remove(ns);
call_rcu(&ns->ns.ns_rcu, mnt_ns_release_rcu);
}
/*
* Lookup a mount namespace by id and take a passive reference count. Taking a
* passive reference means the mount namespace can be emptied if e.g., the last
* task holding an active reference exits. To access the mounts of the
* namespace the @namespace_sem must first be acquired. If the namespace has
* already shut down before acquiring @namespace_sem, {list,stat}mount() will
* see that the mount rbtree of the namespace is empty.
*
* Note the lookup is lockless protected by a sequence counter. We only
* need to guard against false negatives as false positives aren't
* possible. So if we didn't find a mount namespace and the sequence
* counter has changed we need to retry. If the sequence counter is
* still the same we know the search actually failed.
*/
static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
{
struct mnt_namespace *mnt_ns;
struct ns_common *ns;
guard(rcu)();
ns = ns_tree_lookup_rcu(mnt_ns_id, CLONE_NEWNS);
if (!ns)
return NULL;
/*
* The last reference count is put with RCU delay so we can
* unconditionally acquire a reference here.
*/
mnt_ns = container_of(ns, struct mnt_namespace, ns);
refcount_inc(&mnt_ns->passive);
return mnt_ns;
}
static inline void lock_mount_hash(void)
{
write_seqlock(&mount_lock);
}
static inline void unlock_mount_hash(void)
{
write_sequnlock(&mount_lock);
}
static inline struct hlist_head *m_hash(struct vfsmount *mnt, struct dentry *dentry)
{
unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
tmp += ((unsigned long)dentry / L1_CACHE_BYTES);
tmp = tmp + (tmp >> m_hash_shift);
return &mount_hashtable[tmp & m_hash_mask];
}
static inline struct hlist_head *mp_hash(struct dentry *dentry)
{
unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
tmp = tmp + (tmp >> mp_hash_shift);
return &mountpoint_hashtable[tmp & mp_hash_mask];
}
static int mnt_alloc_id(struct mount *mnt)
{
int res;
xa_lock(&mnt_id_xa);
res = __xa_alloc(&mnt_id_xa, &mnt->mnt_id, mnt, xa_limit_31b, GFP_KERNEL);
if (!res)
mnt->mnt_id_unique = ++mnt_id_ctr;
xa_unlock(&mnt_id_xa);
return res;
}
static void mnt_free_id(struct mount *mnt)
{
xa_erase(&mnt_id_xa, mnt->mnt_id);
}
/*
* Allocate a new peer group ID
*/
static int mnt_alloc_group_id(struct mount *mnt)
{
int res = ida_alloc_min(&mnt_group_ida, 1, GFP_KERNEL);
if (res < 0)
return res;
mnt->mnt_group_id = res;
return 0;
}
/*
* Release a peer group ID
*/
void mnt_release_group_id(struct mount *mnt)
{
ida_free(&mnt_group_ida, mnt->mnt_group_id);
mnt->mnt_group_id = 0;
}
/*
* vfsmount lock must be held for read
*/
static inline void mnt_add_count(struct mount *mnt, int n)
{
#ifdef CONFIG_SMP
this_cpu_add(mnt->mnt_pcp->mnt_count, n);
#else
preempt_disable();
mnt->mnt_count += n;
preempt_enable();
#endif
}
/*
* vfsmount lock must be held for write
*/
int mnt_get_count(struct mount *mnt)
{
#ifdef CONFIG_SMP
int count = 0;
int cpu;
for_each_possible_cpu(cpu) {
count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
}
return count;
#else
return mnt->mnt_count;
#endif
}
static struct mount *alloc_vfsmnt(const char *name)
{
struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
if (mnt) {
int err;
err = mnt_alloc_id(mnt);
if (err)
goto out_free_cache;
if (name)
mnt->mnt_devname = kstrdup_const(name,
GFP_KERNEL_ACCOUNT);
else
mnt->mnt_devname = "none";
if (!mnt->mnt_devname)
goto out_free_id;
#ifdef CONFIG_SMP
mnt->mnt_pcp = alloc_percpu(struct mnt_pcp);
if (!mnt->mnt_pcp)
goto out_free_devname;
this_cpu_add(mnt->mnt_pcp->mnt_count, 1);
#else
mnt->mnt_count = 1;
mnt->mnt_writers = 0;
#endif
INIT_HLIST_NODE(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
INIT_LIST_HEAD(&mnt->mnt_mounts);
INIT_LIST_HEAD(&mnt->mnt_list);
INIT_LIST_HEAD(&mnt->mnt_expire);
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_HLIST_HEAD(&mnt->mnt_slave_list);
INIT_HLIST_NODE(&mnt->mnt_slave);
INIT_HLIST_NODE(&mnt->mnt_mp_list);
INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
RB_CLEAR_NODE(&mnt->mnt_node);
mnt->mnt.mnt_idmap = &nop_mnt_idmap;
}
return mnt;
#ifdef CONFIG_SMP
out_free_devname:
kfree_const(mnt->mnt_devname);
#endif
out_free_id:
mnt_free_id(mnt);
out_free_cache:
kmem_cache_free(mnt_cache, mnt);
return NULL;
}
/*
* Most r/o checks on a fs are for operations that take
* discrete amounts of time, like a write() or unlink().
* We must keep track of when those operations start
* (for permission checks) and when they end, so that
* we can determine when writes are able to occur to
* a filesystem.
*/
/*
* __mnt_is_readonly: check whether a mount is read-only
* @mnt: the mount to check for its write status
*
* This shouldn't be used directly outside of the VFS.
* It does not guarantee that the filesystem will stay
* r/w, just that it is right *now*. This can not and
* should not be used in place of IS_RDONLY(inode).
* mnt_want/drop_write() will _keep_ the filesystem
* r/w.
*/
bool __mnt_is_readonly(const struct vfsmount *mnt)
{
return (mnt->mnt_flags & MNT_READONLY) || sb_rdonly(mnt->mnt_sb);
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);
static inline void mnt_inc_writers(struct mount *mnt)
{
#ifdef CONFIG_SMP
this_cpu_inc(mnt->mnt_pcp->mnt_writers);
#else
mnt->mnt_writers++;
#endif
}
static inline void mnt_dec_writers(struct mount *mnt)
{
#ifdef CONFIG_SMP
this_cpu_dec(mnt->mnt_pcp->mnt_writers);
#else
mnt->mnt_writers--;
#endif
}
static unsigned int mnt_get_writers(struct mount *mnt)
{
#ifdef CONFIG_SMP
unsigned int count = 0;
int cpu;
for_each_possible_cpu(cpu) {
count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_writers;
}
return count;
#else
return mnt->mnt_writers;
#endif
}
static int mnt_is_readonly(const struct vfsmount *mnt)
{
if (READ_ONCE(mnt->mnt_sb->s_readonly_remount))
return 1;
/*
* The barrier pairs with the barrier in sb_start_ro_state_change()
* making sure if we don't see s_readonly_remount set yet, we also will
* not see any superblock / mount flag changes done by remount.
* It also pairs with the barrier in sb_end_ro_state_change()
* assuring that if we see s_readonly_remount already cleared, we will
* see the values of superblock / mount flags updated by remount.
*/
smp_rmb();
return __mnt_is_readonly(mnt);
}
/*
* Most r/o & frozen checks on a fs are for operations that take discrete
* amounts of time, like a write() or unlink(). We must keep track of when
* those operations start (for permission checks) and when they end, so that we
* can determine when writes are able to occur to a filesystem.
*/
/**
* mnt_get_write_access - get write access to a mount without freeze protection
* @m: the mount on which to take a write
*
* This tells the low-level filesystem that a write is about to be performed to
* it, and makes sure that writes are allowed (mnt is read-write) before
* returning success. This operation does not protect against filesystem being
* frozen. When the write operation is finished, mnt_put_write_access() must be
* called. This is effectively a refcount.
*/
int mnt_get_write_access(struct vfsmount *m)
{
struct mount *mnt = real_mount(m);
int ret = 0;
preempt_disable();
mnt_inc_writers(mnt);
/*
* The store done by mnt_inc_writers() must be visible before we pass
* the WRITE_HOLD loop below, so that the slowpath can see our
* incremented count after it has set WRITE_HOLD.
*/
smp_mb();
might_lock(&mount_lock.lock);
while (__test_write_hold(READ_ONCE(mnt->mnt_pprev_for_sb))) {
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
cpu_relax();
} else {
/*
* This prevents priority inversion, if the task
* setting WRITE_HOLD got preempted on a remote
* CPU, and it prevents livelock if the task setting
* WRITE_HOLD has a lower priority and is bound to
* the same CPU as the task that is spinning here.
*/
preempt_enable();
read_seqlock_excl(&mount_lock);
read_sequnlock_excl(&mount_lock);
preempt_disable();
}
}
/*
* The barrier pairs with the barrier in sb_start_ro_state_change() making
* sure that if we see WRITE_HOLD cleared, we will also see
* s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in
* mnt_is_readonly() and bail in case we are racing with remount
* read-only.
*/
smp_rmb();
if (mnt_is_readonly(m)) {
mnt_dec_writers(mnt);
ret = -EROFS;
}
preempt_enable();
return ret;
}
EXPORT_SYMBOL_GPL(mnt_get_write_access);
/**
* mnt_want_write - get write access to a mount
* @m: the mount on which to take a write
*
* This tells the low-level filesystem that a write is about to be performed to
* it, and makes sure that writes are allowed (mount is read-write, filesystem
* is not frozen) before returning success. When the write operation is
* finished, mnt_drop_write() must be called. This is effectively a refcount.
*/
int mnt_want_write(struct vfsmount *m)
{
int ret;
sb_start_write(m->mnt_sb);
ret = mnt_get_write_access(m);
if (ret)
sb_end_write(m->mnt_sb);
return ret;
}
EXPORT_SYMBOL_GPL(mnt_want_write);
/**
* mnt_get_write_access_file - get write access to a file's mount
* @file: the file on whose mount to take a write
*
* This is like mnt_get_write_access, but if @file is already open for write it
* skips incrementing mnt_writers (since the open file already has a reference)
* and instead only does the check for emergency r/o remounts. This must be
* paired with mnt_put_write_access_file.
*/
int mnt_get_write_access_file(struct file *file)
{
if (file->f_mode & FMODE_WRITER) {
/*
* Superblock may have become readonly while there are still
* writable fd's, e.g. due to a fs error with errors=remount-ro
*/
if (__mnt_is_readonly(file->f_path.mnt))
return -EROFS;
return 0;
}
return mnt_get_write_access(file->f_path.mnt);
}
/**
* mnt_want_write_file - get write access to a file's mount
* @file: the file on whose mount to take a write
*
* This is like mnt_want_write, but if the file is already open for writing it
* skips incrementing mnt_writers (since the open file already has a reference)
* and instead only does the freeze protection and the check for emergency r/o
* remounts. This must be paired with mnt_drop_write_file.
*/
int mnt_want_write_file(struct file *file)
{
int ret;
sb_start_write(file_inode(file)->i_sb);
ret = mnt_get_write_access_file(file);
if (ret)
sb_end_write(file_inode(file)->i_sb);
return ret;
}
EXPORT_SYMBOL_GPL(mnt_want_write_file);
/**
* mnt_put_write_access - give up write access to a mount
* @mnt: the mount on which to give up write access
*
* Tells the low-level filesystem that we are done
* performing writes to it. Must be matched with
* mnt_get_write_access() call above.
*/
void mnt_put_write_access(struct vfsmount *mnt)
{
preempt_disable();
mnt_dec_writers(real_mount(mnt));
preempt_enable();
}
EXPORT_SYMBOL_GPL(mnt_put_write_access);
/**
* mnt_drop_write - give up write access to a mount
* @mnt: the mount on which to give up write access
*
* Tells the low-level filesystem that we are done performing writes to it and
* also allows filesystem to be frozen again. Must be matched with
* mnt_want_write() call above.
*/
void mnt_drop_write(struct vfsmount *mnt)
{
mnt_put_write_access(mnt);
sb_end_write(mnt->mnt_sb);
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
void mnt_put_write_access_file(struct file *file)
{
if (!(file->f_mode & FMODE_WRITER))
mnt_put_write_access(file->f_path.mnt);
}
void mnt_drop_write_file(struct file *file)
{
mnt_put_write_access_file(file);
sb_end_write(file_inode(file)->i_sb);
}
EXPORT_SYMBOL(mnt_drop_write_file);
/**
* mnt_hold_writers - prevent write access to the given mount
* @mnt: mnt to prevent write access to
*
* Prevents write access to @mnt if there are no active writers for @mnt.
* This function needs to be called and return successfully before changing
* properties of @mnt that need to remain stable for callers with write access
* to @mnt.
*
* After this function has been called successfully, callers must pair it with
* a call to mnt_unhold_writers() in order to stop preventing write access to
* @mnt.
*
* Context: This function expects to be in mount_locked_reader scope serializing
* setting WRITE_HOLD.
* Return: On success 0 is returned.
* On error, -EBUSY is returned.
*/
static inline int mnt_hold_writers(struct mount *mnt)
{
set_write_hold(mnt);
/*
* After storing WRITE_HOLD, we'll read the counters. This store
* should be visible before we do.
*/
smp_mb();
/*
* With writers on hold, if this value is zero, then there are
* definitely no active writers (although held writers may subsequently
* increment the count, they'll have to wait, and decrement it after
* seeing MNT_READONLY).
*
* It is OK to have counter incremented on one CPU and decremented on
* another: the sum will add up correctly. The danger would be when we
* sum up each counter, if we read a counter before it is incremented,
* but then read another CPU's count which it has been subsequently
* decremented from -- we would see more decrements than we should.
* WRITE_HOLD protects against this scenario, because
* mnt_want_write first increments count, then smp_mb, then spins on
* WRITE_HOLD, so it can't be decremented by another CPU while
* we're counting up here.
*/
if (mnt_get_writers(mnt) > 0)
return -EBUSY;
return 0;
}
/**
* mnt_unhold_writers - stop preventing write access to the given mount
* @mnt: mnt to stop preventing write access to
*
* Stop preventing write access to @mnt allowing callers to gain write access
* to @mnt again.
*
* This function can only be called after a call to mnt_hold_writers().
*
* Context: This function expects to be in the same mount_locked_reader scope
* as the matching mnt_hold_writers().
*/
static inline void mnt_unhold_writers(struct mount *mnt)
{
if (!test_write_hold(mnt))
return;
/*
* MNT_READONLY must become visible before ~WRITE_HOLD, so writers
* that become unheld will see MNT_READONLY.
*/
smp_wmb();
clear_write_hold(mnt);
}
static inline void mnt_del_instance(struct mount *m)
{
struct mount **p = m->mnt_pprev_for_sb;
struct mount *next = m->mnt_next_for_sb;
if (next)
next->mnt_pprev_for_sb = p;
*p = next;
}
static inline void mnt_add_instance(struct mount *m, struct super_block *s)
{
struct mount *first = s->s_mounts;
if (first)
first->mnt_pprev_for_sb = &m->mnt_next_for_sb;
m->mnt_next_for_sb = first;
m->mnt_pprev_for_sb = &s->s_mounts;
s->s_mounts = m;
}
static int mnt_make_readonly(struct mount *mnt)
{
int ret;
ret = mnt_hold_writers(mnt);
if (!ret)
mnt->mnt.mnt_flags |= MNT_READONLY;
mnt_unhold_writers(mnt);
return ret;
}
int sb_prepare_remount_readonly(struct super_block *sb)
{
int err = 0;
/* Racy optimization. Recheck the counter under WRITE_HOLD */
if (atomic_long_read(&sb->s_remove_count))
return -EBUSY;
guard(mount_locked_reader)();
for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) {
if (!(m->mnt.mnt_flags & MNT_READONLY)) {
err = mnt_hold_writers(m);
if (err)
break;
}
}
if (!err && atomic_long_read(&sb->s_remove_count))
err = -EBUSY;
if (!err)
sb_start_ro_state_change(sb);
for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) {
if (test_write_hold(m))
clear_write_hold(m);
}
return err;
}
static void free_vfsmnt(struct mount *mnt)
{
mnt_idmap_put(mnt_idmap(&mnt->mnt));
kfree_const(mnt->mnt_devname);
#ifdef CONFIG_SMP
free_percpu(mnt->mnt_pcp);
#endif
kmem_cache_free(mnt_cache, mnt);
}
static void delayed_free_vfsmnt(struct rcu_head *head)
{
free_vfsmnt(container_of(head, struct mount, mnt_rcu));
}
/* call under rcu_read_lock */
int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
{
struct mount *mnt;
if (read_seqretry(&mount_lock, seq))
return 1;
if (bastard == NULL)
return 0;
mnt = real_mount(bastard);
mnt_add_count(mnt, 1);
smp_mb(); // see mntput_no_expire() and do_umount()
if (likely(!read_seqretry(&mount_lock, seq)))
return 0;
lock_mount_hash();
if (unlikely(bastard->mnt_flags & (MNT_SYNC_UMOUNT | MNT_DOOMED))) {
mnt_add_count(mnt, -1);
unlock_mount_hash();
return 1;
}
unlock_mount_hash();
/* caller will mntput() */
return -1;
}
/* call under rcu_read_lock */
static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
{
int res = __legitimize_mnt(bastard, seq);
if (likely(!res))
return true;
if (unlikely(res < 0)) {
rcu_read_unlock();
mntput(bastard);
rcu_read_lock();
}
return false;
}
/**
* __lookup_mnt - mount hash lookup
* @mnt: parent mount
* @dentry: dentry of mountpoint
*
* If @mnt has a child mount @c mounted on @dentry find and return it.
* Caller must either hold the spinlock component of @mount_lock or
* hold rcu_read_lock(), sample the seqcount component before the call
* and recheck it afterwards.
*
* Return: The child of @mnt mounted on @dentry or %NULL.
*/
struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
{
struct hlist_head *head = m_hash(mnt, dentry);
struct mount *p;
hlist_for_each_entry_rcu(p, head, mnt_hash)
if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
return p;
return NULL;
}
/**
* lookup_mnt - Return the child mount mounted at given location
* @path: location in the namespace
*
* Acquires and returns a new reference to mount at given location
* or %NULL if nothing is mounted there.
*/
struct vfsmount *lookup_mnt(const struct path *path)
{
struct mount *child_mnt;
struct vfsmount *m;
unsigned seq;
rcu_read_lock();
do {
seq = read_seqbegin(&mount_lock);
child_mnt = __lookup_mnt(path->mnt, path->dentry);
m = child_mnt ? &child_mnt->mnt : NULL;
} while (!legitimize_mnt(m, seq));
rcu_read_unlock();
return m;
}
/*
* __is_local_mountpoint - Test to see if dentry is a mountpoint in the
* current mount namespace.
*
* The common case is dentries are not mountpoints at all and that
* test is handled inline. For the slow case when we are actually
* dealing with a mountpoint of some kind, walk through all of the
* mounts in the current mount namespace and test to see if the dentry
* is a mountpoint.
*
* The mount_hashtable is not usable in the context because we
* need to identify all mounts that may be in the current mount
* namespace not just a mount that happens to have some specified
* parent mount.
*/
bool __is_local_mountpoint(const struct dentry *dentry)
{
struct mnt_namespace *ns = current->nsproxy->mnt_ns;
struct mount *mnt, *n;
guard(namespace_shared)();
rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node)
if (mnt->mnt_mountpoint == dentry)
return true;
return false;
}
struct pinned_mountpoint {
struct hlist_node node;
struct mountpoint *mp;
struct mount *parent;
};
static bool lookup_mountpoint(struct dentry *dentry, struct pinned_mountpoint *m)
{
struct hlist_head *chain = mp_hash(dentry);
struct mountpoint *mp;
hlist_for_each_entry(mp, chain, m_hash) {
if (mp->m_dentry == dentry) {
hlist_add_head(&m->node, &mp->m_list);
m->mp = mp;
return true;
}
}
return false;
}
static int get_mountpoint(struct dentry *dentry, struct pinned_mountpoint *m)
{
struct mountpoint *mp __free(kfree) = NULL;
bool found;
int ret;
if (d_mountpoint(dentry)) {
/* might be worth a WARN_ON() */
if (d_unlinked(dentry))
return -ENOENT;
mountpoint:
read_seqlock_excl(&mount_lock);
found = lookup_mountpoint(dentry, m);
read_sequnlock_excl(&mount_lock);
if (found)
return 0;
}
if (!mp)
mp = kmalloc(sizeof(struct mountpoint), GFP_KERNEL);
if (!mp)
return -ENOMEM;
/* Exactly one process may set d_mounted */
ret = d_set_mounted(dentry);
/* Someone else set d_mounted? */
if (ret == -EBUSY)
goto mountpoint;
/* The dentry is not available as a mountpoint? */
if (ret)
return ret;
/* Add the new mountpoint to the hash table */
read_seqlock_excl(&mount_lock);
mp->m_dentry = dget(dentry);
hlist_add_head(&mp->m_hash, mp_hash(dentry));
INIT_HLIST_HEAD(&mp->m_list);
hlist_add_head(&m->node, &mp->m_list);
m->mp = no_free_ptr(mp);
read_sequnlock_excl(&mount_lock);
return 0;
}
/*
* vfsmount lock must be held. Additionally, the caller is responsible
* for serializing calls for given disposal list.
*/
static void maybe_free_mountpoint(struct mountpoint *mp, struct list_head *list)
{
if (hlist_empty(&mp->m_list)) {
struct dentry *dentry = mp->m_dentry;
spin_lock(&dentry->d_lock);
dentry->d_flags &= ~DCACHE_MOUNTED;
spin_unlock(&dentry->d_lock);
dput_to_list(dentry, list);
hlist_del(&mp->m_hash);
kfree(mp);
}
}
/*
* locks: mount_lock [read_seqlock_excl], namespace_sem [excl]
*/
static void unpin_mountpoint(struct pinned_mountpoint *m)
{
if (m->mp) {
hlist_del(&m->node);
maybe_free_mountpoint(m->mp, &ex_mountpoints);
}
}
static inline int check_mnt(const struct mount *mnt)
{
return mnt->mnt_ns == current->nsproxy->mnt_ns;
}
static inline bool check_anonymous_mnt(struct mount *mnt)
{
u64 seq;
if (!is_anon_ns(mnt->mnt_ns))
return false;
seq = mnt->mnt_ns->seq_origin;
return !seq || (seq == current->nsproxy->mnt_ns->ns.ns_id);
}
/*
* vfsmount lock must be held for write
*/
static void touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns) {
ns->event = ++event;
wake_up_interruptible(&ns->poll);
}
}
/*
* vfsmount lock must be held for write
*/
static void __touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns && ns->event != event) {
ns->event = event;
wake_up_interruptible(&ns->poll);
}
}
/*
* locks: mount_lock[write_seqlock]
*/
static void __umount_mnt(struct mount *mnt, struct list_head *shrink_list)
{
struct mountpoint *mp;
struct mount *parent = mnt->mnt_parent;
if (unlikely(parent->overmount == mnt))
parent->overmount = NULL;
mnt->mnt_parent = mnt;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
hlist_del_init_rcu(&mnt->mnt_hash);
hlist_del_init(&mnt->mnt_mp_list);
mp = mnt->mnt_mp;
mnt->mnt_mp = NULL;
maybe_free_mountpoint(mp, shrink_list);
}
/*
* locks: mount_lock[write_seqlock], namespace_sem[excl] (for ex_mountpoints)
*/
static void umount_mnt(struct mount *mnt)
{
__umount_mnt(mnt, &ex_mountpoints);
}
/*
* vfsmount lock must be held for write
*/
void mnt_set_mountpoint(struct mount *mnt,
struct mountpoint *mp,
struct mount *child_mnt)
{
child_mnt->mnt_mountpoint = mp->m_dentry;
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
}
static void make_visible(struct mount *mnt)
{
struct mount *parent = mnt->mnt_parent;
if (unlikely(mnt->mnt_mountpoint == parent->mnt.mnt_root))
parent->overmount = mnt;
hlist_add_head_rcu(&mnt->mnt_hash,
m_hash(&parent->mnt, mnt->mnt_mountpoint));
list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
}
/**
* attach_mnt - mount a mount, attach to @mount_hashtable and parent's
* list of child mounts
* @parent: the parent
* @mnt: the new mount
* @mp: the new mountpoint
*
* Mount @mnt at @mp on @parent. Then attach @mnt
* to @parent's child mount list and to @mount_hashtable.
*
* Note, when make_visible() is called @mnt->mnt_parent already points
* to the correct parent.
*
* Context: This function expects namespace_lock() and lock_mount_hash()
* to have been acquired in that order.
*/
static void attach_mnt(struct mount *mnt, struct mount *parent,
struct mountpoint *mp)
{
mnt_set_mountpoint(parent, mp, mnt);
make_visible(mnt);
}
void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
{
struct mountpoint *old_mp = mnt->mnt_mp;
list_del_init(&mnt->mnt_child);
hlist_del_init(&mnt->mnt_mp_list);
hlist_del_init_rcu(&mnt->mnt_hash);
attach_mnt(mnt, parent, mp);
maybe_free_mountpoint(old_mp, &ex_mountpoints);
}
static inline struct mount *node_to_mount(struct rb_node *node)
{
return node ? rb_entry(node, struct mount, mnt_node) : NULL;
}
static void mnt_add_to_ns(struct mnt_namespace *ns, struct mount *mnt)
{
struct rb_node **link = &ns->mounts.rb_node;
struct rb_node *parent = NULL;
bool mnt_first_node = true, mnt_last_node = true;
WARN_ON(mnt_ns_attached(mnt));
mnt->mnt_ns = ns;
while (*link) {
parent = *link;
if (mnt->mnt_id_unique < node_to_mount(parent)->mnt_id_unique) {
link = &parent->rb_left;
mnt_last_node = false;
} else {
link = &parent->rb_right;
mnt_first_node = false;
}
}
if (mnt_last_node)
ns->mnt_last_node = &mnt->mnt_node;
if (mnt_first_node)
ns->mnt_first_node = &mnt->mnt_node;
rb_link_node(&mnt->mnt_node, parent, link);
rb_insert_color(&mnt->mnt_node, &ns->mounts);
mnt_notify_add(mnt);
}
static struct mount *next_mnt(struct mount *p, struct mount *root)
{
struct list_head *next = p->mnt_mounts.next;
if (next == &p->mnt_mounts) {
while (1) {
if (p == root)
return NULL;
next = p->mnt_child.next;
if (next != &p->mnt_parent->mnt_mounts)
break;
p = p->mnt_parent;
}
}
return list_entry(next, struct mount, mnt_child);
}
static struct mount *skip_mnt_tree(struct mount *p)
{
struct list_head *prev = p->mnt_mounts.prev;
while (prev != &p->mnt_mounts) {
p = list_entry(prev, struct mount, mnt_child);
prev = p->mnt_mounts.prev;
}
return p;
}
/*
* vfsmount lock must be held for write
*/
static void commit_tree(struct mount *mnt)
{
struct mnt_namespace *n = mnt->mnt_parent->mnt_ns;
if (!mnt_ns_attached(mnt)) {
for (struct mount *m = mnt; m; m = next_mnt(m, mnt))
mnt_add_to_ns(n, m);
n->nr_mounts += n->pending_mounts;
n->pending_mounts = 0;
}
make_visible(mnt);
touch_mnt_namespace(n);
}
static void setup_mnt(struct mount *m, struct dentry *root)
{
struct super_block *s = root->d_sb;
atomic_inc(&s->s_active);
m->mnt.mnt_sb = s;
m->mnt.mnt_root = dget(root);
m->mnt_mountpoint = m->mnt.mnt_root;
m->mnt_parent = m;
guard(mount_locked_reader)();
mnt_add_instance(m, s);
}
/**
* vfs_create_mount - Create a mount for a configured superblock
* @fc: The configuration context with the superblock attached
*
* Create a mount to an already configured superblock. If necessary, the
* caller should invoke vfs_get_tree() before calling this.
*
* Note that this does not attach the mount to anything.
*/
struct vfsmount *vfs_create_mount(struct fs_context *fc)
{
struct mount *mnt;
if (!fc->root)
return ERR_PTR(-EINVAL);
mnt = alloc_vfsmnt(fc->source);
if (!mnt)
return ERR_PTR(-ENOMEM);
if (fc->sb_flags & SB_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
setup_mnt(mnt, fc->root);
return &mnt->mnt;
}
EXPORT_SYMBOL(vfs_create_mount);
struct vfsmount *fc_mount(struct fs_context *fc)
{
int err = vfs_get_tree(fc);
if (!err) {
up_write(&fc->root->d_sb->s_umount);
return vfs_create_mount(fc);
}
return ERR_PTR(err);
}
EXPORT_SYMBOL(fc_mount);
struct vfsmount *fc_mount_longterm(struct fs_context *fc)
{
struct vfsmount *mnt = fc_mount(fc);
if (!IS_ERR(mnt))
real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
return mnt;
}
EXPORT_SYMBOL(fc_mount_longterm);
struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
void *data)
{
struct fs_context *fc;
struct vfsmount *mnt;
int ret = 0;
if (!type)
return ERR_PTR(-EINVAL);
fc = fs_context_for_mount(type, flags);
if (IS_ERR(fc))
return ERR_CAST(fc);
if (name)
ret = vfs_parse_fs_string(fc, "source", name);
if (!ret)
ret = parse_monolithic_mount_data(fc, data);
if (!ret)
mnt = fc_mount(fc);
else
mnt = ERR_PTR(ret);
put_fs_context(fc);
return mnt;
}
EXPORT_SYMBOL_GPL(vfs_kern_mount);
static struct mount *clone_mnt(struct mount *old, struct dentry *root,
int flag)
{
struct mount *mnt;
int err;
mnt = alloc_vfsmnt(old->mnt_devname);
if (!mnt)
return ERR_PTR(-ENOMEM);
mnt->mnt.mnt_flags = READ_ONCE(old->mnt.mnt_flags) &
~MNT_INTERNAL_FLAGS;
if (flag & (CL_SLAVE | CL_PRIVATE))
mnt->mnt_group_id = 0; /* not a peer of original */
else
mnt->mnt_group_id = old->mnt_group_id;
if ((flag & CL_MAKE_SHARED) && !mnt->mnt_group_id) {
err = mnt_alloc_group_id(mnt);
if (err)
goto out_free;
}
if (mnt->mnt_group_id)
set_mnt_shared(mnt);
mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt));
setup_mnt(mnt, root);
if (flag & CL_PRIVATE) // we are done with it
return mnt;
if (peers(mnt, old))
list_add(&mnt->mnt_share, &old->mnt_share);
if ((flag & CL_SLAVE) && old->mnt_group_id) {
hlist_add_head(&mnt->mnt_slave, &old->mnt_slave_list);
mnt->mnt_master = old;
} else if (IS_MNT_SLAVE(old)) {
hlist_add_behind(&mnt->mnt_slave, &old->mnt_slave);
mnt->mnt_master = old->mnt_master;
}
return mnt;
out_free:
mnt_free_id(mnt);
free_vfsmnt(mnt);
return ERR_PTR(err);
}
static void cleanup_mnt(struct mount *mnt)
{
struct hlist_node *p;
struct mount *m;
/*
* The warning here probably indicates that somebody messed
* up a mnt_want/drop_write() pair. If this happens, the
* filesystem was probably unable to make r/w->r/o transitions.
* The locking used to deal with mnt_count decrement provides barriers,
* so mnt_get_writers() below is safe.
*/
WARN_ON(mnt_get_writers(mnt));
if (unlikely(mnt->mnt_pins.first))
mnt_pin_kill(mnt);
hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
hlist_del(&m->mnt_umount);
mntput(&m->mnt);
}
fsnotify_vfsmount_delete(&mnt->mnt);
dput(mnt->mnt.mnt_root);
deactivate_super(mnt->mnt.mnt_sb);
mnt_free_id(mnt);
call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
}
static void __cleanup_mnt(struct rcu_head *head)
{
cleanup_mnt(container_of(head, struct mount, mnt_rcu));
}
static LLIST_HEAD(delayed_mntput_list);
static void delayed_mntput(struct work_struct *unused)
{
struct llist_node *node = llist_del_all(&delayed_mntput_list);
struct mount *m, *t;
llist_for_each_entry_safe(m, t, node, mnt_llist)
cleanup_mnt(m);
}
static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
static void noinline mntput_no_expire_slowpath(struct mount *mnt)
{
LIST_HEAD(list);
int count;
VFS_BUG_ON(mnt->mnt_ns);
lock_mount_hash();
/*
* make sure that if __legitimize_mnt() has not seen us grab
* mount_lock, we'll see their refcount increment here.
*/
smp_mb();
mnt_add_count(mnt, -1);
count = mnt_get_count(mnt);
if (count != 0) {
WARN_ON(count < 0);
rcu_read_unlock();
unlock_mount_hash();
return;
}
if (unlikely(mnt->mnt.mnt_flags & MNT_DOOMED)) {
rcu_read_unlock();
unlock_mount_hash();
return;
}
mnt->mnt.mnt_flags |= MNT_DOOMED;
rcu_read_unlock();
mnt_del_instance(mnt);
if (unlikely(!list_empty(&mnt->mnt_expire)))
list_del(&mnt->mnt_expire);
if (unlikely(!list_empty(&mnt->mnt_mounts))) {
struct mount *p, *tmp;
list_for_each_entry_safe(p, tmp, &mnt->mnt_mounts, mnt_child) {
__umount_mnt(p, &list);
hlist_add_head(&p->mnt_umount, &mnt->mnt_stuck_children);
}
}
unlock_mount_hash();
shrink_dentry_list(&list);
if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
struct task_struct *task = current;
if (likely(!(task->flags & PF_KTHREAD))) {
init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
return;
}
if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
schedule_delayed_work(&delayed_mntput_work, 1);
return;
}
cleanup_mnt(mnt);
}
static void mntput_no_expire(struct mount *mnt)
{
rcu_read_lock();
if (likely(READ_ONCE(mnt->mnt_ns))) {
/*
* Since we don't do lock_mount_hash() here,
* ->mnt_ns can change under us. However, if it's
* non-NULL, then there's a reference that won't
* be dropped until after an RCU delay done after
* turning ->mnt_ns NULL. So if we observe it
* non-NULL under rcu_read_lock(), the reference
* we are dropping is not the final one.
*/
mnt_add_count(mnt, -1);
rcu_read_unlock();
return;
}
mntput_no_expire_slowpath(mnt);
}
void mntput(struct vfsmount *mnt)
{
if (mnt) {
struct mount *m = real_mount(mnt);
/* avoid cacheline pingpong */
if (unlikely(m->mnt_expiry_mark))
WRITE_ONCE(m->mnt_expiry_mark, 0);
mntput_no_expire(m);
}
}
EXPORT_SYMBOL(mntput);
struct vfsmount *mntget(struct vfsmount *mnt)
{
if (mnt)
mnt_add_count(real_mount(mnt), 1);
return mnt;
}
EXPORT_SYMBOL(mntget);
/*
* Make a mount point inaccessible to new lookups.
* Because there may still be current users, the caller MUST WAIT
* for an RCU grace period before destroying the mount point.
*/
void mnt_make_shortterm(struct vfsmount *mnt)
{
if (mnt)
real_mount(mnt)->mnt_ns = NULL;
}
/**
* path_is_mountpoint() - Check if path is a mount in the current namespace.
* @path: path to check
*
* d_mountpoint() can only be used reliably to establish if a dentry is
* not mounted in any namespace and that common case is handled inline.
* d_mountpoint() isn't aware of the possibility there may be multiple
* mounts using a given dentry in a different namespace. This function
* checks if the passed in path is a mountpoint rather than the dentry
* alone.
*/
bool path_is_mountpoint(const struct path *path)
{
unsigned seq;
bool res;
if (!d_mountpoint(path->dentry))
return false;
rcu_read_lock();
do {
seq = read_seqbegin(&mount_lock);
res = __path_is_mountpoint(path);
} while (read_seqretry(&mount_lock, seq));
rcu_read_unlock();
return res;
}
EXPORT_SYMBOL(path_is_mountpoint);
struct vfsmount *mnt_clone_internal(const struct path *path)
{
struct mount *p;
p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE);
if (IS_ERR(p))
return ERR_CAST(p);
p->mnt.mnt_flags |= MNT_INTERNAL;
return &p->mnt;
}
/*
* Returns the mount which either has the specified mnt_id, or has the
* smallest id greater than the specified one.
*/
static struct mount *mnt_find_id_at(struct mnt_namespace *ns, u64 mnt_id)
{
struct rb_node *node = ns->mounts.rb_node;
struct mount *ret = NULL;
while (node) {
struct mount *m = node_to_mount(node);
if (mnt_id <= m->mnt_id_unique) {
ret = node_to_mount(node);
if (mnt_id == m->mnt_id_unique)
break;
node = node->rb_left;
} else {
node = node->rb_right;
}
}
return ret;
}
/*
* Returns the mount which either has the specified mnt_id, or has the
* largest id smaller than the specified one.
*/
static struct mount *mnt_find_id_at_reverse(struct mnt_namespace *ns, u64 mnt_id)
{
struct rb_node *node = ns->mounts.rb_node;
struct mount *ret = NULL;
while (node) {
struct mount *m = node_to_mount(node);
if (mnt_id >= m->mnt_id_unique) {
ret = node_to_mount(node);
if (mnt_id == m->mnt_id_unique)
break;
node = node->rb_right;
} else {
node = node->rb_left;
}
}
return ret;
}
#ifdef CONFIG_PROC_FS
/* iterator; we want it to have access to namespace_sem, thus here... */
static void *m_start(struct seq_file *m, loff_t *pos)
{
struct proc_mounts *p = m->private;
down_read(&namespace_sem);
return mnt_find_id_at(p->ns, *pos);
}
static void *m_next(struct seq_file *m, void *v, loff_t *pos)
{
struct mount *next = NULL, *mnt = v;
struct rb_node *node = rb_next(&mnt->mnt_node);
++*pos;
if (node) {
next = node_to_mount(node);
*pos = next->mnt_id_unique;
}
return next;
}
static void m_stop(struct seq_file *m, void *v)
{
up_read(&namespace_sem);
}
static int m_show(struct seq_file *m, void *v)
{
struct proc_mounts *p = m->private;
struct mount *r = v;
return p->show(m, &r->mnt);
}
const struct seq_operations mounts_op = {
.start = m_start,
.next = m_next,
.stop = m_stop,
.show = m_show,
};
#endif /* CONFIG_PROC_FS */
/**
* may_umount_tree - check if a mount tree is busy
* @m: root of mount tree
*
* This is called to check if a tree of mounts has any
* open files, pwds, chroots or sub mounts that are
* busy.
*/
int may_umount_tree(struct vfsmount *m)
{
struct mount *mnt = real_mount(m);
bool busy = false;
/* write lock needed for mnt_get_count */
lock_mount_hash();
for (struct mount *p = mnt; p; p = next_mnt(p, mnt)) {
if (mnt_get_count(p) > (p == mnt ? 2 : 1)) {
busy = true;
break;
}
}
unlock_mount_hash();
return !busy;
}
EXPORT_SYMBOL(may_umount_tree);
/**
* may_umount - check if a mount point is busy
* @mnt: root of mount
*
* This is called to check if a mount point has any
* open files, pwds, chroots or sub mounts. If the
* mount has sub mounts this will return busy
* regardless of whether the sub mounts are busy.
*
* Doesn't take quota and stuff into account. IOW, in some cases it will
* give false negatives. The main reason why it's here is that we need
* a non-destructive way to look for easily umountable filesystems.
*/
int may_umount(struct vfsmount *mnt)
{
int ret = 1;
down_read(&namespace_sem);
lock_mount_hash();
if (propagate_mount_busy(real_mount(mnt), 2))
ret = 0;
unlock_mount_hash();
up_read(&namespace_sem);
return ret;
}
EXPORT_SYMBOL(may_umount);
#ifdef CONFIG_FSNOTIFY
static void mnt_notify(struct mount *p)
{
if (!p->prev_ns && p->mnt_ns) {
fsnotify_mnt_attach(p->mnt_ns, &p->mnt);
} else if (p->prev_ns && !p->mnt_ns) {
fsnotify_mnt_detach(p->prev_ns, &p->mnt);
} else if (p->prev_ns == p->mnt_ns) {
fsnotify_mnt_move(p->mnt_ns, &p->mnt);
} else {
fsnotify_mnt_detach(p->prev_ns, &p->mnt);
fsnotify_mnt_attach(p->mnt_ns, &p->mnt);
}
p->prev_ns = p->mnt_ns;
}
static void notify_mnt_list(void)
{
struct mount *m, *tmp;
/*
* Notify about mounts that were added/reparented/detached/remain
* connected after unmount.
*/
list_for_each_entry_safe(m, tmp, &notify_list, to_notify) {
mnt_notify(m);
list_del_init(&m->to_notify);
}
}
static bool need_notify_mnt_list(void)
{
return !list_empty(&notify_list);
}
#else
static void notify_mnt_list(void)
{
}
static bool need_notify_mnt_list(void)
{
return false;
}
#endif
static void free_mnt_ns(struct mnt_namespace *);
static void namespace_unlock(void)
{
struct hlist_head head;
struct hlist_node *p;
struct mount *m;
struct mnt_namespace *ns = emptied_ns;
LIST_HEAD(list);
hlist_move_list(&unmounted, &head);
list_splice_init(&ex_mountpoints, &list);
emptied_ns = NULL;
if (need_notify_mnt_list()) {
/*
* No point blocking out concurrent readers while notifications
* are sent. This will also allow statmount()/listmount() to run
* concurrently.
*/
downgrade_write(&namespace_sem);
notify_mnt_list();
up_read(&namespace_sem);
} else {
up_write(&namespace_sem);
}
if (unlikely(ns)) {
/* Make sure we notice when we leak mounts. */
VFS_WARN_ON_ONCE(!mnt_ns_empty(ns));
free_mnt_ns(ns);
}
shrink_dentry_list(&list);
if (likely(hlist_empty(&head)))
return;
synchronize_rcu_expedited();
hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
hlist_del(&m->mnt_umount);
mntput(&m->mnt);
}
}
static inline void namespace_lock(void)
{
down_write(&namespace_sem);
}
enum umount_tree_flags {
UMOUNT_SYNC = 1,
UMOUNT_PROPAGATE = 2,
UMOUNT_CONNECTED = 4,
};
static bool disconnect_mount(struct mount *mnt, enum umount_tree_flags how)
{
/* Leaving mounts connected is only valid for lazy umounts */
if (how & UMOUNT_SYNC)
return true;
/* A mount without a parent has nothing to be connected to */
if (!mnt_has_parent(mnt))
return true;
/* Because the reference counting rules change when mounts are
* unmounted and connected, umounted mounts may not be
* connected to mounted mounts.
*/
if (!(mnt->mnt_parent->mnt.mnt_flags & MNT_UMOUNT))
return true;
/* Has it been requested that the mount remain connected? */
if (how & UMOUNT_CONNECTED)
return false;
/* Is the mount locked such that it needs to remain connected? */
if (IS_MNT_LOCKED(mnt))
return false;
/* By default disconnect the mount */
return true;
}
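/*
 * Illustration only, not part of this file: a minimal userspace sketch of
 * the decision order in disconnect_mount() above, with the mount state
 * reduced to three booleans. struct sk_mount, sk_disconnect() and the
 * SK_UMOUNT_* names are hypothetical; the flag values copy the enum above.
 */

```c
#include <assert.h>
#include <stdbool.h>

enum { SK_UMOUNT_SYNC = 1, SK_UMOUNT_PROPAGATE = 2, SK_UMOUNT_CONNECTED = 4 };

/* Simplified stand-in for the few struct mount properties consulted. */
struct sk_mount {
	bool has_parent;	/* mnt_has_parent() */
	bool parent_umounted;	/* parent carries MNT_UMOUNT */
	bool locked;		/* IS_MNT_LOCKED() */
};

/* Return true if the mount should be disconnected from its parent. */
static bool sk_disconnect(struct sk_mount m, int how)
{
	if (how & SK_UMOUNT_SYNC)
		return true;	/* only lazy umounts may stay connected */
	if (!m.has_parent)
		return true;	/* nothing to stay connected to */
	if (!m.parent_umounted)
		return true;	/* parent is still mounted */
	if (how & SK_UMOUNT_CONNECTED)
		return false;	/* caller asked to keep it connected */
	if (m.locked)
		return false;	/* locked mounts must stay connected */
	return true;		/* default: disconnect */
}
```

/*
 * In this model a mount stays connected only when the umount is lazy, the
 * parent is itself being unmounted, and either UMOUNT_CONNECTED was passed
 * or the mount is locked.
 */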
/*
* mount_lock must be held
* namespace_sem must be held for write
*/
static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
{
LIST_HEAD(tmp_list);
struct mount *p;
if (how & UMOUNT_PROPAGATE)
propagate_mount_unlock(mnt);
/* Gather the mounts to umount */
for (p = mnt; p; p = next_mnt(p, mnt)) {
p->mnt.mnt_flags |= MNT_UMOUNT;
if (mnt_ns_attached(p))
move_from_ns(p);
list_add_tail(&p->mnt_list, &tmp_list);
}
/* Hide the mounts from mnt_mounts */
list_for_each_entry(p, &tmp_list, mnt_list) {
list_del_init(&p->mnt_child);
}
/* Add propagated mounts to the tmp_list */
if (how & UMOUNT_PROPAGATE)
propagate_umount(&tmp_list);
bulk_make_private(&tmp_list);
while (!list_empty(&tmp_list)) {
struct mnt_namespace *ns;
bool disconnect;
p = list_first_entry(&tmp_list, struct mount, mnt_list);
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
ns = p->mnt_ns;
if (ns) {
ns->nr_mounts--;
__touch_mnt_namespace(ns);
}
p->mnt_ns = NULL;
if (how & UMOUNT_SYNC)
p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
disconnect = disconnect_mount(p, how);
if (mnt_has_parent(p)) {
if (!disconnect) {
/* Don't forget about p */
list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
} else {
umount_mnt(p);
}
}
if (disconnect)
hlist_add_head(&p->mnt_umount, &unmounted);
/*
* At this point p->mnt_ns is NULL, notification will be queued
* only if
*
* - p->prev_ns is non-NULL *and*
* - p->prev_ns->n_fsnotify_marks is non-NULL
*
* This will preclude queuing the mount if this is a cleanup
* after a failed copy_tree() or destruction of an anonymous
* namespace, etc.
*/
mnt_notify_add(p);
}
}
static void shrink_submounts(struct mount *mnt);
static int do_umount_root(struct super_block *sb)
{
int ret = 0;
down_write(&sb->s_umount);
if (!sb_rdonly(sb)) {
struct fs_context *fc;
fc = fs_context_for_reconfigure(sb->s_root, SB_RDONLY,
SB_RDONLY);
if (IS_ERR(fc)) {
ret = PTR_ERR(fc);
} else {
ret = parse_monolithic_mount_data(fc, NULL);
if (!ret)
ret = reconfigure_super(fc);
put_fs_context(fc);
}
}
up_write(&sb->s_umount);
return ret;
}
static int do_umount(struct mount *mnt, int flags)
{
struct super_block *sb = mnt->mnt.mnt_sb;
int retval;
retval = security_sb_umount(&mnt->mnt, flags);
if (retval)
return retval;
/*
* Allow userspace to request a mountpoint be expired rather than
* unmounting unconditionally. Unmount only happens if:
* (1) the mark is already set (the mark is cleared by mntput())
* (2) the usage count == 1 [parent vfsmount] + 1 [sys_umount]
*/
if (flags & MNT_EXPIRE) {
if (&mnt->mnt == current->fs->root.mnt ||
flags & (MNT_FORCE | MNT_DETACH))
return -EINVAL;
/*
* probably don't strictly need the lock here if we examined
* all race cases, but it's a slowpath.
*/
lock_mount_hash();
if (!list_empty(&mnt->mnt_mounts) || mnt_get_count(mnt) != 2) {
unlock_mount_hash();
return -EBUSY;
}
unlock_mount_hash();
if (!xchg(&mnt->mnt_expiry_mark, 1))
return -EAGAIN;
}
/*
* If we may have to abort operations to get out of this
* mount, and they will themselves hold resources we must
* allow the fs to do things. In the Unix tradition of
* 'Gee, that's tricky, let's do it in userspace', the umount_begin
* might fail to complete on the first run through as other tasks
* must return, and the like. That's for the mount program to worry
* about for the moment.
*/
if (flags & MNT_FORCE && sb->s_op->umount_begin) {
sb->s_op->umount_begin(sb);
}
/*
* No sense to grab the lock for this test, but test itself looks
* somewhat bogus. Suggestions for better replacement?
* Ho-hum... In principle, we might treat that as umount + switch
* to rootfs. GC would eventually take care of the old vfsmount.
* Actually it makes sense, especially if rootfs would contain a
* /reboot - static binary that would close all descriptors and
* call reboot(9). Then init(8) could umount root and exec /reboot.
*/
if (&mnt->mnt == current->fs->root.mnt && !(flags & MNT_DETACH)) {
/*
* Special case for "unmounting" root ...
* we just try to remount it readonly.
*/
if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
return do_umount_root(sb);
}
namespace_lock();
lock_mount_hash();
/* Repeat the earlier racy checks, now that we are holding the locks */
retval = -EINVAL;
if (!check_mnt(mnt))
goto out;
if (mnt->mnt.mnt_flags & MNT_LOCKED)
goto out;
if (!mnt_has_parent(mnt)) /* not the absolute root */
goto out;
event++;
if (flags & MNT_DETACH) {
umount_tree(mnt, UMOUNT_PROPAGATE);
retval = 0;
} else {
smp_mb(); // paired with __legitimize_mnt()
shrink_submounts(mnt);
retval = -EBUSY;
if (!propagate_mount_busy(mnt, 2)) {
umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
retval = 0;
}
}
out:
unlock_mount_hash();
namespace_unlock();
return retval;
}
/*
* __detach_mounts - lazily unmount all mounts on the specified dentry
*
* During unlink, rmdir, and d_drop it is possible to lose the path
* to an existing mountpoint, and wind up leaking the mount.
* detach_mounts allows lazily unmounting those mounts instead of
* leaking them.
*
* The caller may hold dentry->d_inode->i_rwsem.
*/
void __detach_mounts(struct dentry *dentry)
{
struct pinned_mountpoint mp = {};
struct mount *mnt;
guard(namespace_excl)();
guard(mount_writer)();
if (!lookup_mountpoint(dentry, &mp))
return;
event++;
while (mp.node.next) {
mnt = hlist_entry(mp.node.next, struct mount, mnt_mp_list);
if (mnt->mnt.mnt_flags & MNT_UMOUNT) {
umount_mnt(mnt);
hlist_add_head(&mnt->mnt_umount, &unmounted);
}
else umount_tree(mnt, UMOUNT_CONNECTED);
}
unpin_mountpoint(&mp);
}
/*
* Is the caller allowed to modify his namespace?
*/
bool may_mount(void)
{
return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
}
static void warn_mandlock(void)
{
pr_warn_once("=======================================================\n"
"WARNING: The mand mount option has been deprecated and\n"
"         is ignored by this kernel. Remove the mand\n"
"         option from the mount to silence this warning.\n"
"=======================================================\n");
}
static int can_umount(const struct path *path, int flags)
{
struct mount *mnt = real_mount(path->mnt);
struct super_block *sb = path->dentry->d_sb;
if (!may_mount())
return -EPERM;
if (!path_mounted(path))
return -EINVAL;
if (!check_mnt(mnt))
return -EINVAL;
if (mnt->mnt.mnt_flags & MNT_LOCKED) /* Check optimistically */
return -EINVAL;
if (flags & MNT_FORCE && !ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
// caller is responsible for flags being sane
int path_umount(const struct path *path, int flags)
{
struct mount *mnt = real_mount(path->mnt);
int ret;
ret = can_umount(path, flags);
if (!ret)
ret = do_umount(mnt, flags);
/* we mustn't call path_put() as that would clear mnt_expiry_mark */
dput(path->dentry);
mntput_no_expire(mnt);
return ret;
}
static int ksys_umount(char __user *name, int flags)
{
int lookup_flags = LOOKUP_MOUNTPOINT;
struct path path;
int ret;
// basic validity checks done first
if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
return -EINVAL;
if (!(flags & UMOUNT_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
ret = user_path_at(AT_FDCWD, name, lookup_flags, &path);
if (ret)
return ret;
return path_umount(&path, flags);
}
SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
{
return ksys_umount(name, flags);
}
#ifdef __ARCH_WANT_SYS_OLDUMOUNT
/*
* The 2.0 compatible umount. No flags.
*/
SYSCALL_DEFINE1(oldumount, char __user *, name)
{
return ksys_umount(name, 0);
}
#endif
static bool is_mnt_ns_file(struct dentry *dentry)
{
struct ns_common *ns;
/* Is this a proxy for a mount namespace? */
if (dentry->d_op != &ns_dentry_operations)
return false;
ns = d_inode(dentry)->i_private;
return ns->ops == &mntns_operations;
}
struct ns_common *from_mnt_ns(struct mnt_namespace *mnt)
{
return &mnt->ns;
}
struct mnt_namespace *get_sequential_mnt_ns(struct mnt_namespace *mntns, bool previous)
{
struct ns_common *ns;
guard(rcu)();
for (;;) {
ns = ns_tree_adjoined_rcu(mntns, previous);
if (IS_ERR(ns))
return ERR_CAST(ns);
mntns = to_mnt_ns(ns);
/*
* The last passive reference count is put with RCU
* delay so accessing the mount namespace is not just
* safe but all relevant members are still valid.
*/
if (!ns_capable_noaudit(mntns->user_ns, CAP_SYS_ADMIN))
continue;
/*
* We need an active reference count as we're persisting
* the mount namespace and it might already be on its
* deathbed.
*/
if (!ns_ref_get(mntns))
continue;
return mntns;
}
}
struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry)
{
if (!is_mnt_ns_file(dentry))
return NULL;
return to_mnt_ns(get_proc_ns(dentry->d_inode));
}
static bool mnt_ns_loop(struct dentry *dentry)
{
/* Could bind mounting the mount namespace inode cause a
* mount namespace loop?
*/
struct mnt_namespace *mnt_ns = mnt_ns_from_dentry(dentry);
if (!mnt_ns)
return false;
return current->nsproxy->mnt_ns->ns.ns_id >= mnt_ns->ns.ns_id;
}
struct mount *copy_tree(struct mount *src_root, struct dentry *dentry,
int flag)
{
struct mount *res, *src_parent, *src_root_child, *src_mnt,
*dst_parent, *dst_mnt;
if (!(flag & CL_COPY_UNBINDABLE) && IS_MNT_UNBINDABLE(src_root))
return ERR_PTR(-EINVAL);
if (!(flag & CL_COPY_MNT_NS_FILE) && is_mnt_ns_file(dentry))
return ERR_PTR(-EINVAL);
res = dst_mnt = clone_mnt(src_root, dentry, flag);
if (IS_ERR(dst_mnt))
return dst_mnt;
src_parent = src_root;
list_for_each_entry(src_root_child, &src_root->mnt_mounts, mnt_child) {
if (!is_subdir(src_root_child->mnt_mountpoint, dentry))
continue;
for (src_mnt = src_root_child; src_mnt;
src_mnt = next_mnt(src_mnt, src_root_child)) {
if (!(flag & CL_COPY_UNBINDABLE) &&
IS_MNT_UNBINDABLE(src_mnt)) {
if (src_mnt->mnt.mnt_flags & MNT_LOCKED) {
/* Both unbindable and locked. */
dst_mnt = ERR_PTR(-EPERM);
goto out;
} else {
src_mnt = skip_mnt_tree(src_mnt);
continue;
}
}
if (!(flag & CL_COPY_MNT_NS_FILE) &&
is_mnt_ns_file(src_mnt->mnt.mnt_root)) {
src_mnt = skip_mnt_tree(src_mnt);
continue;
}
while (src_parent != src_mnt->mnt_parent) {
src_parent = src_parent->mnt_parent;
dst_mnt = dst_mnt->mnt_parent;
}
src_parent = src_mnt;
dst_parent = dst_mnt;
dst_mnt = clone_mnt(src_mnt, src_mnt->mnt.mnt_root, flag);
if (IS_ERR(dst_mnt))
goto out;
lock_mount_hash();
if (src_mnt->mnt.mnt_flags & MNT_LOCKED)
dst_mnt->mnt.mnt_flags |= MNT_LOCKED;
if (unlikely(flag & CL_EXPIRE)) {
/* stick the duplicate mount on the same expiry
* list as the original if that was on one */
if (!list_empty(&src_mnt->mnt_expire))
list_add(&dst_mnt->mnt_expire,
&src_mnt->mnt_expire);
}
attach_mnt(dst_mnt, dst_parent, src_parent->mnt_mp);
unlock_mount_hash();
}
}
return res;
out:
if (res) {
lock_mount_hash();
umount_tree(res, UMOUNT_SYNC);
unlock_mount_hash();
}
return dst_mnt;
}
static inline bool extend_array(struct path **res, struct path **to_free,
unsigned n, unsigned *count, unsigned new_count)
{
struct path *p;
if (likely(n < *count))
return true;
p = kmalloc_array(new_count, sizeof(struct path), GFP_KERNEL);
if (p && *count)
memcpy(p, *res, *count * sizeof(struct path));
*count = new_count;
kfree(*to_free);
*to_free = *res = p;
return p;
}
const struct path *collect_paths(const struct path *path,
struct path *prealloc, unsigned count)
{
struct mount *root = real_mount(path->mnt);
struct mount *child;
struct path *res = prealloc, *to_free = NULL;
unsigned n = 0;
guard(namespace_shared)();
if (!check_mnt(root))
return ERR_PTR(-EINVAL);
if (!extend_array(&res, &to_free, 0, &count, 32))
return ERR_PTR(-ENOMEM);
res[n++] = *path;
list_for_each_entry(child, &root->mnt_mounts, mnt_child) {
if (!is_subdir(child->mnt_mountpoint, path->dentry))
continue;
for (struct mount *m = child; m; m = next_mnt(m, child)) {
if (!extend_array(&res, &to_free, n, &count, 2 * count))
return ERR_PTR(-ENOMEM);
res[n].mnt = &m->mnt;
res[n].dentry = m->mnt.mnt_root;
n++;
}
}
if (!extend_array(&res, &to_free, n, &count, count + 1))
return ERR_PTR(-ENOMEM);
memset(res + n, 0, (count - n) * sizeof(struct path));
for (struct path *p = res; p->mnt; p++)
path_get(p);
return res;
}
void drop_collected_paths(const struct path *paths, const struct path *prealloc)
{
for (const struct path *p = paths; p->mnt; p++)
path_put(p);
if (paths != prealloc)
kfree(paths);
}
static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *, bool);
void dissolve_on_fput(struct vfsmount *mnt)
{
struct mount *m = real_mount(mnt);
/*
* m used to be the root of anon namespace; if it still is one,
* we need to dissolve the mount tree and free that namespace.
* Let's try to avoid taking namespace_sem if we can determine
* that there's nothing to do without it - rcu_read_lock() is
* enough to make anon_ns_root() memory-safe and once m has
* left its namespace, it's no longer our concern, since it will
* never become a root of anon ns again.
*/
scoped_guard(rcu) {
if (!anon_ns_root(m))
return;
}
scoped_guard(namespace_excl) {
if (!anon_ns_root(m))
return;
emptied_ns = m->mnt_ns;
lock_mount_hash();
umount_tree(m, UMOUNT_CONNECTED);
unlock_mount_hash();
}
}
/* locks: namespace_shared && pinned(mnt) || mount_locked_reader */
static bool __has_locked_children(struct mount *mnt, struct dentry *dentry)
{
struct mount *child;
list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
if (!is_subdir(child->mnt_mountpoint, dentry))
continue;
if (child->mnt.mnt_flags & MNT_LOCKED)
return true;
}
return false;
}
bool has_locked_children(struct mount *mnt, struct dentry *dentry)
{
guard(mount_locked_reader)();
return __has_locked_children(mnt, dentry);
}
/*
* Check that there aren't references to earlier/same mount namespaces in the
* specified subtree. Such references can act as pins for mount namespaces
* that aren't checked by the mount-cycle checking code, thereby allowing
* cycles to be made.
*
* locks: mount_locked_reader || namespace_shared && pinned(subtree)
*/
static bool check_for_nsfs_mounts(struct mount *subtree)
{
for (struct mount *p = subtree; p; p = next_mnt(p, subtree))
if (mnt_ns_loop(p->mnt.mnt_root))
return false;
return true;
}
/**
* clone_private_mount - create a private clone of a path
* @path: path to clone
*
* This creates a new vfsmount, which will be the clone of @path. The new mount
* will not be attached anywhere in the namespace and will be private (i.e.
* changes to the originating mount won't be propagated into this).
*
* This assumes caller has called or done the equivalent of may_mount().
*
* Release with mntput().
*/
struct vfsmount *clone_private_mount(const struct path *path)
{
struct mount *old_mnt = real_mount(path->mnt);
struct mount *new_mnt;
guard(namespace_shared)();
if (IS_MNT_UNBINDABLE(old_mnt))
return ERR_PTR(-EINVAL);
/*
* Make sure the source mount is acceptable.
* Anything mounted in our mount namespace is allowed.
* Otherwise, it must be the root of an anonymous mount
* namespace, and we need to make sure no namespace
* loops get created.
*/
if (!check_mnt(old_mnt)) {
if (!anon_ns_root(old_mnt))
return ERR_PTR(-EINVAL);
if (!check_for_nsfs_mounts(old_mnt))
return ERR_PTR(-EINVAL);
}
if (!ns_capable(old_mnt->mnt_ns->user_ns, CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
if (__has_locked_children(old_mnt, path->dentry))
return ERR_PTR(-EINVAL);
new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
if (IS_ERR(new_mnt))
return ERR_PTR(-EINVAL);
/* Longterm mount to be removed by kern_unmount*() */
new_mnt->mnt_ns = MNT_NS_INTERNAL;
return &new_mnt->mnt;
}
EXPORT_SYMBOL_GPL(clone_private_mount);
static void lock_mnt_tree(struct mount *mnt)
{
struct mount *p;
for (p = mnt; p; p = next_mnt(p, mnt)) {
int flags = p->mnt.mnt_flags;
/* Don't allow unprivileged users to change mount flags */
flags |= MNT_LOCK_ATIME;
if (flags & MNT_READONLY)
flags |= MNT_LOCK_READONLY;
if (flags & MNT_NODEV)
flags |= MNT_LOCK_NODEV;
if (flags & MNT_NOSUID)
flags |= MNT_LOCK_NOSUID;
if (flags & MNT_NOEXEC)
flags |= MNT_LOCK_NOEXEC;
/* Don't allow unprivileged users to reveal what is under a mount */
if (list_empty(&p->mnt_expire) && p != mnt)
flags |= MNT_LOCKED;
p->mnt.mnt_flags = flags;
}
}
static void cleanup_group_ids(struct mount *mnt, struct mount *end)
{
struct mount *p;
for (p = mnt; p != end; p = next_mnt(p, mnt)) {
if (p->mnt_group_id && !IS_MNT_SHARED(p))
mnt_release_group_id(p);
}
}
static int invent_group_ids(struct mount *mnt, bool recurse)
{
struct mount *p;
for (p = mnt; p; p = recurse ? next_mnt(p, mnt) : NULL) {
if (!p->mnt_group_id) {
int err = mnt_alloc_group_id(p);
if (err) {
cleanup_group_ids(mnt, p);
return err;
}
}
}
return 0;
}
int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
{
unsigned int max = READ_ONCE(sysctl_mount_max);
unsigned int mounts = 0;
struct mount *p;
if (ns->nr_mounts >= max)
return -ENOSPC;
max -= ns->nr_mounts;
if (ns->pending_mounts >= max)
return -ENOSPC;
max -= ns->pending_mounts;
for (p = mnt; p; p = next_mnt(p, mnt))
mounts++;
if (mounts > max)
return -ENOSPC;
ns->pending_mounts += mounts;
return 0;
}
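/*
 * Illustration only, not part of this file: a minimal userspace sketch of
 * the budget arithmetic in count_mounts() above. sk_mount_budget() and its
 * parameters are hypothetical; max plays the role of sysctl_mount_max and
 * mounts is the size of the tree being added. Unlike the real function,
 * the sketch does not update pending_mounts on success.
 */

```c
#include <assert.h>
#include <errno.h>

/*
 * Return 0 if a tree of `mounts` mounts still fits, -ENOSPC otherwise.
 * Each ">=" test runs before the matching subtraction, so the unsigned
 * `max` can never wrap around.
 */
static int sk_mount_budget(unsigned int nr_mounts, unsigned int pending_mounts,
			   unsigned int max, unsigned int mounts)
{
	if (nr_mounts >= max)
		return -ENOSPC;
	max -= nr_mounts;		/* room left after existing mounts */
	if (pending_mounts >= max)
		return -ENOSPC;
	max -= pending_mounts;		/* room left after pending mounts */
	if (mounts > max)
		return -ENOSPC;
	return 0;
}
```

/*
 * Because both checks use ">=", a namespace whose existing plus pending
 * mounts already equal the limit is rejected even for an empty request.
 */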
enum mnt_tree_flags_t {
MNT_TREE_BENEATH = BIT(0),
MNT_TREE_PROPAGATION = BIT(1),
};
/**
* attach_recursive_mnt - attach a source mount tree
* @source_mnt: mount tree to be attached
* @dest: the context for mounting at the place where the tree should go
*
* NOTE: the table below explains the semantics when a source mount
* of a given type is attached to a destination mount of a given type.
* ---------------------------------------------------------------------------
* |         BIND MOUNT OPERATION                                            |
* |**************************************************************************
* | source-->| shared        | private        | slave          | unbindable |
* | dest     |               |                |                |            |
* |   |      |               |                |                |            |
* |   v      |               |                |                |            |
* |**************************************************************************
* |  shared  | shared (++)   | shared (+)     | shared(+++)    | invalid    |
* |          |               |                |                |            |
* |non-shared| shared (+)    | private        | slave (*)      | invalid    |
* ***************************************************************************
* A bind operation clones the source mount and mounts the clone on the
* destination mount.
*
* (++) the cloned mount is propagated to all the mounts in the propagation
* tree of the destination mount and the cloned mount is added to
* the peer group of the source mount.
* (+) the cloned mount is created under the destination mount and is marked
* as shared. The cloned mount is added to the peer group of the source
* mount.
* (+++) the mount is propagated to all the mounts in the propagation tree
* of the destination mount and the cloned mount is made slave
* of the same master as that of the source mount. The cloned mount
* is marked as 'shared and slave'.
* (*) the cloned mount is made a slave of the same master as that of the
* source mount.
*
* ---------------------------------------------------------------------------
* |         MOVE MOUNT OPERATION                                            |
* |**************************************************************************
* | source-->| shared        | private        | slave          | unbindable |
* | dest     |               |                |                |            |
* |   |      |               |                |                |            |
* |   v      |               |                |                |            |
* |**************************************************************************
* |  shared  | shared (+)    | shared (+)     | shared(+++)    | invalid    |
* |          |               |                |                |            |
* |non-shared| shared (+*)   | private        | slave (*)      | unbindable |
* ***************************************************************************
*
* (+) the mount is moved to the destination. And is then propagated to
* all the mounts in the propagation tree of the destination mount.
* (+*) the mount is moved to the destination.
* (+++) the mount is moved to the destination and is then propagated to
* all the mounts belonging to the destination mount's propagation tree.
* the mount is marked as 'shared and slave'.
* (*) the mount continues to be a slave at the new location.
*
* If the source mount is a tree, the operations explained above are
* applied to each mount in the tree.
* Must be called without spinlocks held, since this function can sleep
* in allocations.
*
* Context: The function expects namespace_lock() to be held.
* Return: If @source_mnt was successfully attached 0 is returned.
* Otherwise a negative error code is returned.
*/
static int attach_recursive_mnt(struct mount *source_mnt,
const struct pinned_mountpoint *dest)
{
struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
struct mount *dest_mnt = dest->parent;
struct mountpoint *dest_mp = dest->mp;
HLIST_HEAD(tree_list);
struct mnt_namespace *ns = dest_mnt->mnt_ns;
struct pinned_mountpoint root = {};
struct mountpoint *shorter = NULL;
struct mount *child, *p;
struct mount *top;
struct hlist_node *n;
int err = 0;
bool moving = mnt_has_parent(source_mnt);
/*
* Preallocate a mountpoint in case the new mounts need to be
* mounted beneath mounts on the same mountpoint.
*/
for (top = source_mnt; unlikely(top->overmount); top = top->overmount) {
if (!shorter && is_mnt_ns_file(top->mnt.mnt_root))
shorter = top->mnt_mp;
}
err = get_mountpoint(top->mnt.mnt_root, &root);
if (err)
return err;
/* Is there space to add these mounts to the mount namespace? */
if (!moving) {
err = count_mounts(ns, source_mnt);
if (err)
goto out;
}
if (IS_MNT_SHARED(dest_mnt)) {
err = invent_group_ids(source_mnt, true);
if (err)
goto out;
err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
}
lock_mount_hash();
if (err)
goto out_cleanup_ids;
if (IS_MNT_SHARED(dest_mnt)) {
for (p = source_mnt; p; p = next_mnt(p, source_mnt))
set_mnt_shared(p);
}
if (moving) {
umount_mnt(source_mnt);
mnt_notify_add(source_mnt);
/* if the mount is moved, it should no longer be expired
* automatically */
list_del_init(&source_mnt->mnt_expire);
} else {
if (source_mnt->mnt_ns) {
/* move from anon - the caller will destroy */
emptied_ns = source_mnt->mnt_ns;
for (p = source_mnt; p; p = next_mnt(p, source_mnt))
move_from_ns(p);
}
}
mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
/*
* Now the original copy is in the same state as the secondaries -
* its root attached to mountpoint, but not hashed and all mounts
* in it are either in our namespace or in no namespace at all.
* Add the original to the list of copies and deal with the
* rest of work for all of them uniformly.
*/
hlist_add_head(&source_mnt->mnt_hash, &tree_list);
hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
struct mount *q;
hlist_del_init(&child->mnt_hash);
/* Notice when we are propagating across user namespaces */
if (child->mnt_parent->mnt_ns->user_ns != user_ns)
lock_mnt_tree(child);
q = __lookup_mnt(&child->mnt_parent->mnt,
child->mnt_mountpoint);
commit_tree(child);
if (q) {
struct mount *r = topmost_overmount(child);
struct mountpoint *mp = root.mp;
if (unlikely(shorter) && child != source_mnt)
mp = shorter;
mnt_change_mountpoint(r, mp, q);
}
}
unpin_mountpoint(&root);
unlock_mount_hash();
return 0;
out_cleanup_ids:
while (!hlist_empty(&tree_list)) {
child = hlist_entry(tree_list.first, struct mount, mnt_hash);
child->mnt_parent->mnt_ns->pending_mounts = 0;
umount_tree(child, UMOUNT_SYNC);
}
unlock_mount_hash();
cleanup_group_ids(source_mnt, NULL);
out:
ns->pending_mounts = 0;
read_seqlock_excl(&mount_lock);
unpin_mountpoint(&root);
read_sequnlock_excl(&mount_lock);
return err;
}
static inline struct mount *where_to_mount(const struct path *path,
struct dentry **dentry,
bool beneath)
{
struct mount *m;
if (unlikely(beneath)) {
m = topmost_overmount(real_mount(path->mnt));
*dentry = m->mnt_mountpoint;
return m->mnt_parent;
}
m = __lookup_mnt(path->mnt, path->dentry);
if (unlikely(m)) {
m = topmost_overmount(m);
*dentry = m->mnt.mnt_root;
return m;
}
*dentry = path->dentry;
return real_mount(path->mnt);
}
/**
* do_lock_mount - acquire environment for mounting
* @path: target path
* @res: context to set up
* @beneath: whether the intention is to mount beneath @path
*
* To mount something at given location, we need
* namespace_sem locked exclusive
* inode of dentry we are mounting on locked exclusive
* struct mountpoint for that dentry
* struct mount we are mounting on
*
* Results are stored in caller-supplied context (pinned_mountpoint);
* on success we have res->parent and res->mp pointing to parent and
* mountpoint respectively and res->node inserted into the ->m_list
* of the mountpoint, making sure the mountpoint won't disappear.
* On failure we have res->parent set to ERR_PTR(-E...), res->mp
* left NULL, res->node - empty.
* In case of success do_lock_mount returns with locks acquired (in
* proper order - inode lock nests outside of namespace_sem).
*
* Request to mount on overmounted location is treated as "mount on
* top of whatever's overmounting it"; request to mount beneath
* a location - "mount immediately beneath the topmost mount at that
* place".
*
* In all cases the location must not have been unmounted and the
* chosen mountpoint must be allowed to be mounted on. For "beneath"
* case we also require the location to be at the root of a mount
* that has a parent (i.e. is not a root of some namespace).
*/
static void do_lock_mount(const struct path *path,
struct pinned_mountpoint *res,
bool beneath)
{
int err;
if (unlikely(beneath) && !path_mounted(path)) {
res->parent = ERR_PTR(-EINVAL);
return;
}
do {
struct dentry *dentry, *d;
struct mount *m, *n;
scoped_guard(mount_locked_reader) {
m = where_to_mount(path, &dentry, beneath);
if (&m->mnt != path->mnt) {
mntget(&m->mnt);
dget(dentry);
}
}
inode_lock(dentry->d_inode);
namespace_lock();
// check if the chain of mounts (if any) has changed.
scoped_guard(mount_locked_reader)
n = where_to_mount(path, &d, beneath);
if (unlikely(n != m || dentry != d))
err = -EAGAIN; // something moved, retry
else if (unlikely(cant_mount(dentry) || !is_mounted(path->mnt)))
err = -ENOENT; // not to be mounted on
else if (beneath && &m->mnt == path->mnt && !m->overmount)
err = -EINVAL;
else
err = get_mountpoint(dentry, res);
if (unlikely(err)) {
res->parent = ERR_PTR(err);
namespace_unlock();
inode_unlock(dentry->d_inode);
} else {
res->parent = m;
}
/*
* Drop the temporary references. This is subtle - on success
* we are doing that under namespace_sem, which would normally
* be forbidden. However, in that case we are guaranteed that
* refcounts won't reach zero, since we know that path->mnt
* is mounted and thus all mounts reachable from it are pinned
* and stable, along with their mountpoints and roots.
*/
if (&m->mnt != path->mnt) {
dput(dentry);
mntput(&m->mnt);
}
} while (err == -EAGAIN);
}
static void __unlock_mount(struct pinned_mountpoint *m)
{
inode_unlock(m->mp->m_dentry->d_inode);
read_seqlock_excl(&mount_lock);
unpin_mountpoint(m);
read_sequnlock_excl(&mount_lock);
namespace_unlock();
}
static inline void unlock_mount(struct pinned_mountpoint *m)
{
if (!IS_ERR(m->parent))
__unlock_mount(m);
}
static void lock_mount_exact(const struct path *path,
struct pinned_mountpoint *mp);
#define LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) \
struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \
do_lock_mount((path), &mp, (beneath))
#define LOCK_MOUNT(mp, path) LOCK_MOUNT_MAYBE_BENEATH(mp, (path), false)
#define LOCK_MOUNT_EXACT(mp, path) \
struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \
lock_mount_exact((path), &mp)
static int graft_tree(struct mount *mnt, const struct pinned_mountpoint *mp)
{
if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
return -EINVAL;
if (d_is_dir(mp->mp->m_dentry) !=
d_is_dir(mnt->mnt.mnt_root))
return -ENOTDIR;
return attach_recursive_mnt(mnt, mp);
}
static int may_change_propagation(const struct mount *m)
{
struct mnt_namespace *ns = m->mnt_ns;
// it must be mounted in some namespace
if (IS_ERR_OR_NULL(ns)) // is_mounted()
return -EINVAL;
// and the caller must be admin in userns of that namespace
if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
/*
* Sanity check the flags to change_mnt_propagation.
*/
static int flags_to_propagation_type(int ms_flags)
{
int type = ms_flags & ~(MS_REC | MS_SILENT);
/* Fail if any non-propagation flags are set */
if (type & ~(MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
return 0;
/* Only one propagation flag should be set */
if (!is_power_of_2(type))
return 0;
return type;
}
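/*
 * Illustration only, not part of this file: a minimal userspace sketch of
 * the validation done by flags_to_propagation_type() above. The SK_MS_*
 * constants are hypothetical names intended to mirror the MS_* values in
 * the uapi mount header, and sk_is_power_of_2() stands in for the kernel's
 * is_power_of_2() helper.
 */

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MS_REC		16384
#define SK_MS_SILENT		32768
#define SK_MS_UNBINDABLE	(1 << 17)
#define SK_MS_PRIVATE		(1 << 18)
#define SK_MS_SLAVE		(1 << 19)
#define SK_MS_SHARED		(1 << 20)

/* True only for values with exactly one bit set; zero is rejected. */
static bool sk_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Return the single propagation flag requested, or 0 if invalid. */
static int sk_propagation_type(int ms_flags)
{
	/* MS_REC and MS_SILENT are modifiers, not propagation types. */
	int type = ms_flags & ~(SK_MS_REC | SK_MS_SILENT);

	/* Fail if any non-propagation flags are set. */
	if (type & ~(SK_MS_SHARED | SK_MS_PRIVATE | SK_MS_SLAVE | SK_MS_UNBINDABLE))
		return 0;
	/* Exactly one propagation flag must remain. */
	if (!sk_is_power_of_2((unsigned long)type))
		return 0;
	return type;
}
```

/*
 * A request that carries only MS_REC, or that combines two propagation
 * types, yields 0, which do_change_type() turns into -EINVAL.
 */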
/*
* recursively change the type of the mountpoint.
*/
static int do_change_type(const struct path *path, int ms_flags)
{
struct mount *m;
struct mount *mnt = real_mount(path->mnt);
int recurse = ms_flags & MS_REC;
int type;
int err;
if (!path_mounted(path))
return -EINVAL;
type = flags_to_propagation_type(ms_flags);
if (!type)
return -EINVAL;
guard(namespace_excl)();
err = may_change_propagation(mnt);
if (err)
return err;
if (type == MS_SHARED) {
err = invent_group_ids(mnt, recurse);
if (err)
return err;
}
for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
change_mnt_propagation(m, type);
return 0;
}
/* may_copy_tree() - check if a mount tree can be copied
* @path: path to the mount tree to be copied
*
* This helper checks if the caller may copy the mount tree starting
* from @path->mnt. The caller may copy the mount tree under the
* following circumstances:
*
* (1) The caller is located in the mount namespace of the mount tree.
* This also implies that the mount does not belong to an anonymous
* mount namespace.
* (2) The caller tries to copy an nsfs mount referring to a mount
* namespace, i.e., the caller is trying to copy a mount namespace
* entry from nsfs.
* (3) The caller tries to copy a pidfs mount referring to a pidfd.
* (4) The caller is trying to copy a mount tree that belongs to an
* anonymous mount namespace.
*
* For that to be safe, this helper enforces that the origin mount
* namespace the anonymous mount namespace was created from is the
* same as the caller's mount namespace by comparing the sequence
* numbers.
*
* This is not strictly necessary. The current semantics of the new
* mount api enforce that the caller must be located in the same
* mount namespace as the mount tree it interacts with. Using the
* origin sequence number preserves these semantics even for
* anonymous mount namespaces. However, one could envision extending
* the api to directly operate across mount namespace if needed.
*
* The ownership of a non-anonymous mount namespace such as the
* caller's cannot change.
* => We know that the caller's mount namespace is stable.
*
* If the origin sequence number of the anonymous mount namespace is
* the same as the sequence number of the caller's mount namespace.
* => The owning namespaces are the same.
*
* ==> The earlier capability check on the owning namespace of the
* caller's mount namespace ensures that the caller has the
* ability to copy the mount tree.
*
* Returns true if the mount tree can be copied, false otherwise.
*/
static inline bool may_copy_tree(const struct path *path)
{
struct mount *mnt = real_mount(path->mnt);
const struct dentry_operations *d_op;
if (check_mnt(mnt))
return true;
d_op = path->dentry->d_op;
if (d_op == &ns_dentry_operations)
return true;
if (d_op == &pidfs_dentry_operations)
return true;
if (!is_mounted(path->mnt))
return false;
return check_anonymous_mnt(mnt);
}
static struct mount *__do_loopback(const struct path *old_path,
unsigned int flags, unsigned int copy_flags)
{
struct mount *old = real_mount(old_path->mnt);
bool recurse = flags & AT_RECURSIVE;
if (IS_MNT_UNBINDABLE(old))
return ERR_PTR(-EINVAL);
if (!may_copy_tree(old_path))
return ERR_PTR(-EINVAL);
if (!recurse && __has_locked_children(old, old_path->dentry))
return ERR_PTR(-EINVAL);
/*
* When creating a new mount namespace we don't want to copy over
* mounts of mount namespaces to avoid the risk of cycles and also to
* minimize the default complex interdependencies between mount
* namespaces.
*
* We could of course just check whether all mount namespace files aren't
* creating cycles but really let's keep this simple.
*/
if (!(flags & OPEN_TREE_NAMESPACE))
copy_flags |= CL_COPY_MNT_NS_FILE;
if (recurse)
return copy_tree(old, old_path->dentry, copy_flags);
return clone_mnt(old, old_path->dentry, copy_flags);
}
/*
* do loopback mount.
*/
static int do_loopback(const struct path *path, const char *old_name,
int recurse)
{
struct path old_path __free(path_put) = {};
struct mount *mnt = NULL;
unsigned int flags = recurse ? AT_RECURSIVE : 0;
int err;
if (!old_name || !*old_name)
return -EINVAL;
err = kern_path(old_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path);
if (err)
return err;
if (mnt_ns_loop(old_path.dentry))
return -EINVAL;
LOCK_MOUNT(mp, path);
if (IS_ERR(mp.parent))
return PTR_ERR(mp.parent);
if (!check_mnt(mp.parent))
return -EINVAL;
mnt = __do_loopback(&old_path, flags, 0);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
err = graft_tree(mnt, &mp);
if (err) {
lock_mount_hash();
umount_tree(mnt, UMOUNT_SYNC);
unlock_mount_hash();
}
return err;
}
static struct mnt_namespace *get_detached_copy(const struct path *path, unsigned int flags)
{
struct mnt_namespace *ns, *mnt_ns = current->nsproxy->mnt_ns, *src_mnt_ns;
struct user_namespace *user_ns = mnt_ns->user_ns;
struct mount *mnt, *p;
ns = alloc_mnt_ns(user_ns, true);
if (IS_ERR(ns))
return ns;
guard(namespace_excl)();
/*
* Record the sequence number of the source mount namespace.
* This needs to hold namespace_sem to ensure that the mount
* doesn't get attached.
*/
if (is_mounted(path->mnt)) {
src_mnt_ns = real_mount(path->mnt)->mnt_ns;
if (is_anon_ns(src_mnt_ns))
ns->seq_origin = src_mnt_ns->seq_origin;
else
ns->seq_origin = src_mnt_ns->ns.ns_id;
}
mnt = __do_loopback(path, flags, 0);
if (IS_ERR(mnt)) {
emptied_ns = ns;
return ERR_CAST(mnt);
}
for (p = mnt; p; p = next_mnt(p, mnt)) {
mnt_add_to_ns(ns, p);
ns->nr_mounts++;
}
ns->root = mnt;
return ns;
}
static struct file *open_detached_copy(struct path *path, unsigned int flags)
{
struct mnt_namespace *ns = get_detached_copy(path, flags);
struct file *file;
if (IS_ERR(ns))
return ERR_CAST(ns);
mntput(path->mnt);
path->mnt = mntget(&ns->root->mnt);
file = dentry_open(path, O_PATH, current_cred());
if (IS_ERR(file))
dissolve_on_fput(path->mnt);
else
file->f_mode |= FMODE_NEED_UNMOUNT;
return file;
}
DEFINE_FREE(put_empty_mnt_ns, struct mnt_namespace *,
if (!IS_ERR_OR_NULL(_T)) free_mnt_ns(_T))
static struct mnt_namespace *create_new_namespace(struct path *path, unsigned int flags)
{
struct mnt_namespace *new_ns __free(put_empty_mnt_ns) = NULL;
struct path to_path __free(path_put) = {};
struct mnt_namespace *ns = current->nsproxy->mnt_ns;
struct user_namespace *user_ns = current_user_ns();
struct mount *new_ns_root;
struct mount *mnt;
unsigned int copy_flags = 0;
bool locked = false;
if (user_ns != ns->user_ns)
copy_flags |= CL_SLAVE;
new_ns = alloc_mnt_ns(user_ns, false);
if (IS_ERR(new_ns))
return ERR_CAST(new_ns);
scoped_guard(namespace_excl) {
new_ns_root = clone_mnt(ns->root, ns->root->mnt.mnt_root, copy_flags);
if (IS_ERR(new_ns_root))
return ERR_CAST(new_ns_root);
/*
* If the real rootfs had a locked mount on top of it somewhere
* in the stack, lock the new mount tree as well so it can't be
* exposed.
*/
mnt = ns->root;
while (mnt->overmount) {
mnt = mnt->overmount;
if (mnt->mnt.mnt_flags & MNT_LOCKED)
locked = true;
}
}
/*
* We dropped the namespace semaphore so we can actually lock
* the copy for mounting. The copied mount isn't attached to any
* mount namespace and it is thus excluded from any propagation.
* So realistically we're isolated and the mount can't be
* overmounted.
*/
/* Borrow the reference from clone_mnt(). */
to_path.mnt = &new_ns_root->mnt;
to_path.dentry = dget(new_ns_root->mnt.mnt_root);
/* Now lock for actual mounting. */
LOCK_MOUNT_EXACT(mp, &to_path);
if (unlikely(IS_ERR(mp.parent)))
return ERR_CAST(mp.parent);
/*
* We don't emulate unshare()ing a mount namespace. We stick to the
* restrictions of creating detached bind-mounts, which have much
* saner and simpler semantics.
*/
mnt = __do_loopback(path, flags, copy_flags);
if (IS_ERR(mnt))
return ERR_CAST(mnt);
scoped_guard(mount_writer) {
if (locked)
mnt->mnt.mnt_flags |= MNT_LOCKED;
/*
* Now mount the detached tree on top of the copy of the
* real rootfs we created.
*/
attach_mnt(mnt, new_ns_root, mp.mp);
if (user_ns != ns->user_ns)
lock_mnt_tree(new_ns_root);
}
/* Add all mounts to the new namespace. */
for (struct mount *p = new_ns_root; p; p = next_mnt(p, new_ns_root)) {
mnt_add_to_ns(new_ns, p);
new_ns->nr_mounts++;
}
new_ns->root = real_mount(no_free_ptr(to_path.mnt));
ns_tree_add_raw(new_ns);
return no_free_ptr(new_ns);
}
static struct file *open_new_namespace(struct path *path, unsigned int flags)
{
struct mnt_namespace *new_ns;
new_ns = create_new_namespace(path, flags);
if (IS_ERR(new_ns))
return ERR_CAST(new_ns);
return open_namespace_file(to_ns_common(new_ns));
}
static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned int flags)
{
int ret;
struct path path __free(path_put) = {};
int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
OPEN_TREE_CLOEXEC | OPEN_TREE_NAMESPACE))
return ERR_PTR(-EINVAL);
if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) ==
AT_RECURSIVE)
return ERR_PTR(-EINVAL);
if (hweight32(flags & (OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) > 1)
return ERR_PTR(-EINVAL);
if (flags & AT_NO_AUTOMOUNT)
lookup_flags &= ~LOOKUP_AUTOMOUNT;
if (flags & AT_SYMLINK_NOFOLLOW)
lookup_flags &= ~LOOKUP_FOLLOW;
/*
* If we create a new mount namespace with the cloned mount tree we
* just care about being privileged over our current user namespace.
* The new mount namespace will be owned by it.
*/
if ((flags & OPEN_TREE_NAMESPACE) &&
!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
if ((flags & OPEN_TREE_CLONE) && !may_mount())
return ERR_PTR(-EPERM);
CLASS(filename_uflags, name)(filename, flags);
ret = filename_lookup(dfd, name, lookup_flags, &path, NULL);
if (unlikely(ret))
return ERR_PTR(ret);
if (flags & OPEN_TREE_NAMESPACE)
return open_new_namespace(&path, flags);
if (flags & OPEN_TREE_CLONE)
return open_detached_copy(&path, flags);
return dentry_open(&path, O_PATH, current_cred());
}
SYSCALL_DEFINE3(open_tree, int, dfd, const char __user *, filename, unsigned, flags)
{
return FD_ADD(flags, vfs_open_tree(dfd, filename, flags));
}
/*
* Don't allow locked mount flags to be cleared.
*
* No locks need to be held here while testing the various MNT_LOCK
* flags because those flags can never be cleared once they are set.
*/
static bool can_change_locked_flags(struct mount *mnt, unsigned int mnt_flags)
{
unsigned int fl = mnt->mnt.mnt_flags;
if ((fl & MNT_LOCK_READONLY) &&
!(mnt_flags & MNT_READONLY))
return false;
if ((fl & MNT_LOCK_NODEV) &&
!(mnt_flags & MNT_NODEV))
return false;
if ((fl & MNT_LOCK_NOSUID) &&
!(mnt_flags & MNT_NOSUID))
return false;
if ((fl & MNT_LOCK_NOEXEC) &&
!(mnt_flags & MNT_NOEXEC))
return false;
if ((fl & MNT_LOCK_ATIME) &&
((fl & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK)))
return false;
return true;
}
static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
{
bool readonly_request = (mnt_flags & MNT_READONLY);
if (readonly_request == __mnt_is_readonly(&mnt->mnt))
return 0;
if (readonly_request)
return mnt_make_readonly(mnt);
mnt->mnt.mnt_flags &= ~MNT_READONLY;
return 0;
}
static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
{
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
mnt->mnt.mnt_flags = mnt_flags;
touch_mnt_namespace(mnt->mnt_ns);
}
static void mnt_warn_timestamp_expiry(const struct path *mountpoint,
struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
if (!__mnt_is_readonly(mnt) &&
(!(sb->s_iflags & SB_I_TS_EXPIRY_WARNED)) &&
(ktime_get_real_seconds() + TIME_UPTIME_SEC_MAX > sb->s_time_max)) {
char *buf, *mntpath;
buf = (char *)__get_free_page(GFP_KERNEL);
if (buf)
mntpath = d_path(mountpoint, buf, PAGE_SIZE);
else
mntpath = ERR_PTR(-ENOMEM);
if (IS_ERR(mntpath))
mntpath = "(unknown)";
pr_warn("%s filesystem being %s at %s supports timestamps until %ptTd (0x%llx)\n",
sb->s_type->name,
is_mounted(mnt) ? "remounted" : "mounted",
mntpath, &sb->s_time_max,
(unsigned long long)sb->s_time_max);
sb->s_iflags |= SB_I_TS_EXPIRY_WARNED;
if (buf)
free_page((unsigned long)buf);
}
}
/*
* Handle reconfiguration of the mountpoint only without alteration of the
* superblock it refers to. This is triggered by specifying MS_REMOUNT|MS_BIND
* to mount(2).
*/
static int do_reconfigure_mnt(const struct path *path, unsigned int mnt_flags)
{
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);
int ret;
if (!check_mnt(mnt))
return -EINVAL;
if (!path_mounted(path))
return -EINVAL;
if (!can_change_locked_flags(mnt, mnt_flags))
return -EPERM;
/*
* We're only checking whether the superblock is read-only not
* changing it, so only take down_read(&sb->s_umount).
*/
down_read(&sb->s_umount);
lock_mount_hash();
ret = change_mount_ro_state(mnt, mnt_flags);
if (ret == 0)
set_mount_attributes(mnt, mnt_flags);
unlock_mount_hash();
up_read(&sb->s_umount);
mnt_warn_timestamp_expiry(path, &mnt->mnt);
return ret;
}
/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
* on it - tough luck.
*/
static int do_remount(const struct path *path, int sb_flags,
int mnt_flags, void *data)
{
int err;
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);
struct fs_context *fc;
if (!check_mnt(mnt))
return -EINVAL;
if (!path_mounted(path))
return -EINVAL;
if (!can_change_locked_flags(mnt, mnt_flags))
return -EPERM;
fc = fs_context_for_reconfigure(path->dentry, sb_flags, MS_RMT_MASK);
if (IS_ERR(fc))
return PTR_ERR(fc);
/*
* Indicate to the filesystem that the remount request is coming
* from the legacy mount system call.
*/
fc->oldapi = true;
err = parse_monolithic_mount_data(fc, data);
if (!err) {
down_write(&sb->s_umount);
err = -EPERM;
if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
err = reconfigure_super(fc);
if (!err) {
lock_mount_hash();
set_mount_attributes(mnt, mnt_flags);
unlock_mount_hash();
}
}
up_write(&sb->s_umount);
}
mnt_warn_timestamp_expiry(path, &mnt->mnt);
put_fs_context(fc);
return err;
}
static inline int tree_contains_unbindable(struct mount *mnt)
{
struct mount *p;
for (p = mnt; p; p = next_mnt(p, mnt)) {
if (IS_MNT_UNBINDABLE(p))
return 1;
}
return 0;
}
static int do_set_group(const struct path *from_path, const struct path *to_path)
{
struct mount *from = real_mount(from_path->mnt);
struct mount *to = real_mount(to_path->mnt);
int err;
guard(namespace_excl)();
err = may_change_propagation(from);
if (err)
return err;
err = may_change_propagation(to);
if (err)
return err;
/* To and From paths should be mount roots */
if (!path_mounted(from_path))
return -EINVAL;
if (!path_mounted(to_path))
return -EINVAL;
/* Setting sharing groups is only allowed across same superblock */
if (from->mnt.mnt_sb != to->mnt.mnt_sb)
return -EINVAL;
/* From mount root should be wider than To mount root */
if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
return -EINVAL;
/* From mount should not have locked children in place of To's root */
if (__has_locked_children(from, to->mnt.mnt_root))
return -EINVAL;
/* Setting sharing groups is only allowed on private mounts */
if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
return -EINVAL;
/* From should not be private */
if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
return -EINVAL;
if (IS_MNT_SLAVE(from)) {
hlist_add_behind(&to->mnt_slave, &from->mnt_slave);
to->mnt_master = from->mnt_master;
}
if (IS_MNT_SHARED(from)) {
to->mnt_group_id = from->mnt_group_id;
list_add(&to->mnt_share, &from->mnt_share);
set_mnt_shared(to);
}
return 0;
}
/**
* path_overmounted - check if path is overmounted
* @path: path to check
*
* Check if path is overmounted, i.e., if there's a mount on top of
* @path->mnt with @path->dentry as mountpoint.
*
* Context: namespace_sem must be held at least shared.
* MUST NOT be called under lock_mount_hash() (there one should just
* call __lookup_mnt() and check if it returns NULL).
* Return: If path is overmounted true is returned, false if not.
*/
static inline bool path_overmounted(const struct path *path)
{
unsigned seq = read_seqbegin(&mount_lock);
bool no_child;
rcu_read_lock();
no_child = !__lookup_mnt(path->mnt, path->dentry);
rcu_read_unlock();
if (need_seqretry(&mount_lock, seq)) {
read_seqlock_excl(&mount_lock);
no_child = !__lookup_mnt(path->mnt, path->dentry);
read_sequnlock_excl(&mount_lock);
}
return unlikely(!no_child);
}
/*
* Check if there is a possibly empty chain of descent from p1 to p2.
* Locks: namespace_sem (shared) or mount_lock (read_seqlock_excl).
*/
static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
{
while (p2 != p1 && mnt_has_parent(p2))
p2 = p2->mnt_parent;
return p2 == p1;
}
/**
* can_move_mount_beneath - check that we can mount beneath the top mount
* @mnt_from: mount we are trying to move
* @mnt_to: mount under which to mount
* @mp: mountpoint of @mnt_to
*
* - Make sure that nothing can be mounted beneath the caller's current
* root or the rootfs of the namespace.
* - Make sure that the caller can unmount the topmost mount ensuring
* that the caller could reveal the underlying mountpoint.
* - Ensure that nothing has been mounted on top of @mnt_from before we
* grabbed @namespace_sem to avoid creating pointless shadow mounts.
* - Prevent mounting beneath a mount if the propagation relationship
* between the source mount, parent mount, and top mount would lead to
* nonsensical mount trees.
*
* Context: This function expects namespace_lock() to be held.
* Return: On success 0, and on error a negative error code is returned.
*/
static int can_move_mount_beneath(const struct mount *mnt_from,
const struct mount *mnt_to,
const struct mountpoint *mp)
{
struct mount *parent_mnt_to = mnt_to->mnt_parent;
if (IS_MNT_LOCKED(mnt_to))
return -EINVAL;
/* Avoid creating shadow mounts during mount propagation. */
if (mnt_from->overmount)
return -EINVAL;
/*
* Mounting beneath the rootfs only makes sense when the
* semantics of pivot_root(".", ".") are used.
*/
if (&mnt_to->mnt == current->fs->root.mnt)
return -EINVAL;
if (parent_mnt_to == current->nsproxy->mnt_ns->root)
return -EINVAL;
if (mount_is_ancestor(mnt_to, mnt_from))
return -EINVAL;
/*
* If the parent mount propagates to the child mount this would
* mean mounting @mnt_from on @mnt_to->mnt_parent and then
* propagating a copy @c of @mnt_from on top of @mnt_to. This
* defeats the whole purpose of mounting beneath another mount.
*/
if (propagation_would_overmount(parent_mnt_to, mnt_to, mp))
return -EINVAL;
/*
* If @mnt_to->mnt_parent propagates to @mnt_from this would
* mean propagating a copy @c of @mnt_from on top of @mnt_from.
* Afterwards @mnt_from would be mounted on top of
* @mnt_to->mnt_parent and @mnt_to would be unmounted from
* @mnt_to->mnt_parent and remounted on @mnt_from. But since @c is
* already mounted on @mnt_from, @mnt_to would ultimately be
* remounted on top of @c. Afterwards, @mnt_from would be
* covered by a copy @c of @mnt_from and @c would be covered by
* @mnt_from itself. This defeats the whole purpose of mounting
* @mnt_from beneath @mnt_to.
*/
if (check_mnt(mnt_from) &&
propagation_would_overmount(parent_mnt_to, mnt_from, mp))
return -EINVAL;
return 0;
}
/* may_use_mount() - check if a mount tree can be used
* @mnt: mount to be used
*
* This helper checks if the caller may use the mount tree starting
* from @mnt. The caller may use the mount tree under the
* following circumstances:
*
* (1) The caller is located in the mount namespace of the mount tree.
* This also implies that the mount does not belong to an anonymous
* mount namespace.
* (2) The caller is trying to use a mount tree that belongs to an
* anonymous mount namespace.
*
* For that to be safe, this helper enforces that the origin mount
* namespace the anonymous mount namespace was created from is the
* same as the caller's mount namespace by comparing the sequence
* numbers.
*
* The ownership of a non-anonymous mount namespace such as the
* caller's cannot change.
* => We know that the caller's mount namespace is stable.
*
* If the origin sequence number of the anonymous mount namespace is
* the same as the sequence number of the caller's mount namespace:
* => The owning namespaces are the same.
*
* ==> The earlier capability check on the owning namespace of the
* caller's mount namespace ensures that the caller has the
* ability to use the mount tree.
*
* Returns true if the mount tree can be used, false otherwise.
*/
static inline bool may_use_mount(struct mount *mnt)
{
if (check_mnt(mnt))
return true;
/*
* Make sure that no one unmounted the target path or somehow
* managed to get their hands on something purely kernel
* internal.
*/
if (!is_mounted(&mnt->mnt))
return false;
return check_anonymous_mnt(mnt);
}
static int do_move_mount(const struct path *old_path,
const struct path *new_path,
enum mnt_tree_flags_t flags)
{
struct mount *old = real_mount(old_path->mnt);
int err;
bool beneath = flags & MNT_TREE_BENEATH;
if (!path_mounted(old_path))
return -EINVAL;
if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry))
return -EINVAL;
LOCK_MOUNT_MAYBE_BENEATH(mp, new_path, beneath);
if (IS_ERR(mp.parent))
return PTR_ERR(mp.parent);
if (check_mnt(old)) {
/* if the source is in our namespace... */
/* ... it should be detachable from parent */
if (!mnt_has_parent(old) || IS_MNT_LOCKED(old))
return -EINVAL;
/* ... which should not be shared */
if (IS_MNT_SHARED(old->mnt_parent))
return -EINVAL;
/* ... and the target should be in our namespace */
if (!check_mnt(mp.parent))
return -EINVAL;
} else {
/*
* otherwise the source must be the root of some anon namespace.
*/
if (!anon_ns_root(old))
return -EINVAL;
/*
* Bail out early if the target is within the same namespace -
* subsequent checks would've rejected that, but they lose
* some corner cases if we check it early.
*/
if (old->mnt_ns == mp.parent->mnt_ns)
return -EINVAL;
/*
* Target should be either in our namespace or in an acceptable
* anon namespace, sensu check_anonymous_mnt().
*/
if (!may_use_mount(mp.parent))
return -EINVAL;
}
if (beneath) {
struct mount *over = real_mount(new_path->mnt);
if (mp.parent != over->mnt_parent)
over = mp.parent->overmount;
err = can_move_mount_beneath(old, over, mp.mp);
if (err)
return err;
}
/*
* Don't move a mount tree containing unbindable mounts to a destination
* mount which is shared.
*/
if (IS_MNT_SHARED(mp.parent) && tree_contains_unbindable(old))
return -EINVAL;
if (!check_for_nsfs_mounts(old))
return -ELOOP;
if (mount_is_ancestor(old, mp.parent))
return -ELOOP;
return attach_recursive_mnt(old, &mp);
}
static int do_move_mount_old(const struct path *path, const char *old_name)
{
struct path old_path __free(path_put) = {};
int err;
if (!old_name || !*old_name)
return -EINVAL;
err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
if (err)
return err;
return do_move_mount(&old_path, path, 0);
}
/*
* add a mount into a namespace's mount tree
*/
static int do_add_mount(struct mount *newmnt, const struct pinned_mountpoint *mp,
int mnt_flags)
{
struct mount *parent = mp->parent;
if (IS_ERR(parent))
return PTR_ERR(parent);
mnt_flags &= ~MNT_INTERNAL_FLAGS;
if (unlikely(!check_mnt(parent))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
return -EINVAL;
/* ... and for those we'd better have mountpoint still alive */
if (!parent->mnt_ns)
return -EINVAL;
}
/* Refuse the same filesystem on the same mount point */
if (parent->mnt.mnt_sb == newmnt->mnt.mnt_sb &&
parent->mnt.mnt_root == mp->mp->m_dentry)
return -EBUSY;
if (d_is_symlink(newmnt->mnt.mnt_root))
return -EINVAL;
newmnt->mnt.mnt_flags = mnt_flags;
return graft_tree(newmnt, mp);
}
static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
/*
* Create a new mount using a superblock configuration and request it
* be added to the namespace tree.
*/
static int do_new_mount_fc(struct fs_context *fc, const struct path *mountpoint,
unsigned int mnt_flags)
{
struct super_block *sb;
struct vfsmount *mnt __free(mntput) = fc_mount(fc);
int error;
if (IS_ERR(mnt))
return PTR_ERR(mnt);
sb = fc->root->d_sb;
error = security_sb_kern_mount(sb);
if (unlikely(error))
return error;
if (unlikely(mount_too_revealing(sb, &mnt_flags))) {
errorfcp(fc, "VFS", "Mount too revealing");
return -EPERM;
}
mnt_warn_timestamp_expiry(mountpoint, mnt);
LOCK_MOUNT(mp, mountpoint);
error = do_add_mount(real_mount(mnt), &mp, mnt_flags);
if (!error)
retain_and_null_ptr(mnt); // consumed on success
return error;
}
/*
* create a new mount for userspace and request it to be added into the
* namespace's tree
*/
static int do_new_mount(const struct path *path, const char *fstype,
int sb_flags, int mnt_flags,
const char *name, void *data)
{
struct file_system_type *type;
struct fs_context *fc;
const char *subtype = NULL;
int err = 0;
if (!fstype)
return -EINVAL;
type = get_fs_type(fstype);
if (!type)
return -ENODEV;
if (type->fs_flags & FS_HAS_SUBTYPE) {
subtype = strchr(fstype, '.');
if (subtype) {
subtype++;
if (!*subtype) {
put_filesystem(type);
return -EINVAL;
}
}
}
fc = fs_context_for_mount(type, sb_flags);
put_filesystem(type);
if (IS_ERR(fc))
return PTR_ERR(fc);
/*
* Indicate to the filesystem that the mount request is coming
* from the legacy mount system call.
*/
fc->oldapi = true;
if (subtype)
err = vfs_parse_fs_string(fc, "subtype", subtype);
if (!err && name)
err = vfs_parse_fs_string(fc, "source", name);
if (!err)
err = parse_monolithic_mount_data(fc, data);
if (!err && !mount_capable(fc))
err = -EPERM;
if (!err)
err = do_new_mount_fc(fc, path, mnt_flags);
put_fs_context(fc);
return err;
}
static void lock_mount_exact(const struct path *path,
struct pinned_mountpoint *mp)
{
struct dentry *dentry = path->dentry;
int err;
inode_lock(dentry->d_inode);
namespace_lock();
if (unlikely(cant_mount(dentry)))
err = -ENOENT;
else if (path_overmounted(path))
err = -EBUSY;
else
err = get_mountpoint(dentry, mp);
if (unlikely(err)) {
namespace_unlock();
inode_unlock(dentry->d_inode);
mp->parent = ERR_PTR(err);
} else {
mp->parent = real_mount(path->mnt);
}
}
int finish_automount(struct vfsmount *__m, const struct path *path)
{
struct vfsmount *m __free(mntput) = __m;
struct mount *mnt;
int err;
if (!m)
return 0;
if (IS_ERR(m))
return PTR_ERR(m);
mnt = real_mount(m);
if (m->mnt_root == path->dentry)
return -ELOOP;
/*
* we don't want to use LOCK_MOUNT() - in this case finding something
* that overmounts our mountpoint means "quietly drop what we've
* got", not "try to mount it on top".
*/
LOCK_MOUNT_EXACT(mp, path);
if (mp.parent == ERR_PTR(-EBUSY))
return 0;
err = do_add_mount(mnt, &mp, path->mnt->mnt_flags | MNT_SHRINKABLE);
if (likely(!err))
retain_and_null_ptr(m);
return err;
}
/**
* mnt_set_expiry - Put a mount on an expiration list
* @mnt: The mount to list.
* @expiry_list: The list to add the mount to.
*/
void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list)
{
guard(mount_locked_reader)();
list_add_tail(&real_mount(mnt)->mnt_expire, expiry_list);
}
EXPORT_SYMBOL(mnt_set_expiry);
/*
* process a list of expirable mountpoints with the intent of discarding any
* mountpoints that aren't in use and haven't been touched since last we came
* here
*/
void mark_mounts_for_expiry(struct list_head *mounts)
{
struct mount *mnt, *next;
LIST_HEAD(graveyard);
if (list_empty(mounts))
return;
guard(namespace_excl)();
guard(mount_writer)();
/* extract from the expiration list every vfsmount that matches the
* following criteria:
* - already mounted
* - only referenced by its parent vfsmount
* - still marked for expiry (marked on the last call here; marks are
* cleared by mntput())
*/
list_for_each_entry_safe(mnt, next, mounts, mnt_expire) {
if (!is_mounted(&mnt->mnt))
continue;
if (!xchg(&mnt->mnt_expiry_mark, 1) ||
propagate_mount_busy(mnt, 1))
continue;
list_move(&mnt->mnt_expire, &graveyard);
}
while (!list_empty(&graveyard)) {
mnt = list_first_entry(&graveyard, struct mount, mnt_expire);
touch_mnt_namespace(mnt->mnt_ns);
umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
}
}
EXPORT_SYMBOL_GPL(mark_mounts_for_expiry);
/*
* Ripoff of 'select_parent()'
*
* search the list of submounts for a given mountpoint, and move any
* shrinkable submounts to the 'graveyard' list.
*/
static int select_submounts(struct mount *parent, struct list_head *graveyard)
{
struct mount *this_parent = parent;
struct list_head *next;
int found = 0;
repeat:
next = this_parent->mnt_mounts.next;
resume:
while (next != &this_parent->mnt_mounts) {
struct list_head *tmp = next;
struct mount *mnt = list_entry(tmp, struct mount, mnt_child);
next = tmp->next;
if (!(mnt->mnt.mnt_flags & MNT_SHRINKABLE))
continue;
/*
* Descend a level if the mnt_mounts list is non-empty.
*/
if (!list_empty(&mnt->mnt_mounts)) {
this_parent = mnt;
goto repeat;
}
if (!propagate_mount_busy(mnt, 1)) {
list_move_tail(&mnt->mnt_expire, graveyard);
found++;
}
}
/*
* All done at this level ... ascend and resume the search
*/
if (this_parent != parent) {
next = this_parent->mnt_child.next;
this_parent = this_parent->mnt_parent;
goto resume;
}
return found;
}
/*
* process a list of expirable mountpoints with the intent of discarding any
* submounts of a specific parent mountpoint
*
* mount_lock must be held for write
*/
static void shrink_submounts(struct mount *mnt)
{
LIST_HEAD(graveyard);
struct mount *m;
/* extract submounts of 'mnt' from the expiration list */
while (select_submounts(mnt, &graveyard)) {
while (!list_empty(&graveyard)) {
m = list_first_entry(&graveyard, struct mount,
mnt_expire);
touch_mnt_namespace(m->mnt_ns);
umount_tree(m, UMOUNT_PROPAGATE|UMOUNT_SYNC);
}
}
}
static void *copy_mount_options(const void __user * data)
{
char *copy;
unsigned left, offset;
if (!data)
return NULL;
copy = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!copy)
return ERR_PTR(-ENOMEM);
left = copy_from_user(copy, data, PAGE_SIZE);
/*
* Not all architectures have an exact copy_from_user(). Resort to
* byte at a time.
*/
offset = PAGE_SIZE - left;
while (left) {
char c;
if (get_user(c, (const char __user *)data + offset))
break;
copy[offset] = c;
left--;
offset++;
}
if (left == PAGE_SIZE) {
kfree(copy);
return ERR_PTR(-EFAULT);
}
return copy;
}
static char *copy_mount_string(const void __user *data)
{
return data ? strndup_user(data, PATH_MAX) : NULL;
}
/*
* Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
* be given to the mount() call (e.g. read-only, no-dev, no-suid etc).
*
* data is a (void *) that can point to any structure up to
* PAGE_SIZE-1 bytes, which can contain arbitrary fs-dependent
* information (or be NULL).
*
* Pre-0.97 versions of mount() didn't have a flags word.
* When the flags word was introduced its top half was required
* to have the magic value 0xC0ED, and this remained so until 2.4.0-test9.
* Therefore, if this magic number is present, it carries no information
* and must be discarded.
*/
int path_mount(const char *dev_name, const struct path *path,
const char *type_page, unsigned long flags, void *data_page)
{
unsigned int mnt_flags = 0, sb_flags;
int ret;
/* Discard magic */
if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
flags &= ~MS_MGC_MSK;
/* Basic sanity checks */
if (data_page)
((char *)data_page)[PAGE_SIZE - 1] = 0;
if (flags & MS_NOUSER)
return -EINVAL;
ret = security_sb_mount(dev_name, path, type_page, flags, data_page);
if (ret)
return ret;
if (!may_mount())
return -EPERM;
if (flags & SB_MANDLOCK)
warn_mandlock();
/* Default to relatime unless overridden */
if (!(flags & MS_NOATIME))
mnt_flags |= MNT_RELATIME;
/* Separate the per-mountpoint flags */
if (flags & MS_NOSUID)
mnt_flags |= MNT_NOSUID;
if (flags & MS_NODEV)
mnt_flags |= MNT_NODEV;
if (flags & MS_NOEXEC)
mnt_flags |= MNT_NOEXEC;
if (flags & MS_NOATIME)
mnt_flags |= MNT_NOATIME;
if (flags & MS_NODIRATIME)
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_STRICTATIME)
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
if (flags & MS_NOSYMFOLLOW)
mnt_flags |= MNT_NOSYMFOLLOW;
/* The default atime for remount is preservation */
if ((flags & MS_REMOUNT) &&
((flags & (MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
MS_STRICTATIME)) == 0)) {
mnt_flags &= ~MNT_ATIME_MASK;
mnt_flags |= path->mnt->mnt_flags & MNT_ATIME_MASK;
}
sb_flags = flags & (SB_RDONLY |
SB_SYNCHRONOUS |
SB_MANDLOCK |
SB_DIRSYNC |
SB_SILENT |
SB_POSIXACL |
SB_LAZYTIME |
SB_I_VERSION);
if ((flags & (MS_REMOUNT | MS_BIND)) == (MS_REMOUNT | MS_BIND))
return do_reconfigure_mnt(path, mnt_flags);
if (flags & MS_REMOUNT)
return do_remount(path, sb_flags, mnt_flags, data_page);
if (flags & MS_BIND)
return do_loopback(path, dev_name, flags & MS_REC);
if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
return do_change_type(path, flags);
if (flags & MS_MOVE)
return do_move_mount_old(path, dev_name);
return do_new_mount(path, type_page, sb_flags, mnt_flags, dev_name,
data_page);
}
int do_mount(const char *dev_name, const char __user *dir_name,
const char *type_page, unsigned long flags, void *data_page)
{
struct path path __free(path_put) = {};
int ret;
ret = user_path_at(AT_FDCWD, dir_name, LOOKUP_FOLLOW, &path);
if (ret)
return ret;
return path_mount(dev_name, &path, type_page, flags, data_page);
}
static struct ucounts *inc_mnt_namespaces(struct user_namespace *ns)
{
return inc_ucount(ns, current_euid(), UCOUNT_MNT_NAMESPACES);
}
static void dec_mnt_namespaces(struct ucounts *ucounts)
{
dec_ucount(ucounts, UCOUNT_MNT_NAMESPACES);
}
static void free_mnt_ns(struct mnt_namespace *ns)
{
if (!is_anon_ns(ns))
ns_common_free(ns);
dec_mnt_namespaces(ns->ucounts);
mnt_ns_tree_remove(ns);
}
static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool anon)
{
struct mnt_namespace *new_ns;
struct ucounts *ucounts;
int ret;
ucounts = inc_mnt_namespaces(user_ns);
if (!ucounts)
return ERR_PTR(-ENOSPC);
new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
if (!new_ns) {
dec_mnt_namespaces(ucounts);
return ERR_PTR(-ENOMEM);
}
if (anon)
ret = ns_common_init_inum(new_ns, MNT_NS_ANON_INO);
else
ret = ns_common_init(new_ns);
if (ret) {
kfree(new_ns);
dec_mnt_namespaces(ucounts);
return ERR_PTR(ret);
}
ns_tree_gen_id(new_ns);
new_ns->is_anon = anon;
refcount_set(&new_ns->passive, 1);
new_ns->mounts = RB_ROOT;
init_waitqueue_head(&new_ns->poll);
new_ns->user_ns = get_user_ns(user_ns);
new_ns->ucounts = ucounts;
return new_ns;
}
__latent_entropy
struct mnt_namespace *copy_mnt_ns(u64 flags, struct mnt_namespace *ns,
struct user_namespace *user_ns, struct fs_struct *new_fs)
{
struct mnt_namespace *new_ns;
struct vfsmount *rootmnt __free(mntput) = NULL;
struct vfsmount *pwdmnt __free(mntput) = NULL;
struct mount *p, *q;
struct mount *old;
struct mount *new;
int copy_flags;
BUG_ON(!ns);
if (likely(!(flags & CLONE_NEWNS))) {
get_mnt_ns(ns);
return ns;
}
old = ns->root;
new_ns = alloc_mnt_ns(user_ns, false);
if (IS_ERR(new_ns))
return new_ns;
guard(namespace_excl)();
/* First pass: copy the tree topology */
copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE;
if (user_ns != ns->user_ns)
copy_flags |= CL_SLAVE;
new = copy_tree(old, old->mnt.mnt_root, copy_flags);
if (IS_ERR(new)) {
emptied_ns = new_ns;
return ERR_CAST(new);
}
if (user_ns != ns->user_ns) {
guard(mount_writer)();
lock_mnt_tree(new);
}
new_ns->root = new;
/*
* Second pass: switch the tsk->fs->* elements and mark new vfsmounts
* as belonging to new namespace. We have already acquired a private
* fs_struct, so tsk->fs->lock is not needed.
*/
p = old;
q = new;
while (p) {
mnt_add_to_ns(new_ns, q);
new_ns->nr_mounts++;
if (new_fs) {
if (&p->mnt == new_fs->root.mnt) {
new_fs->root.mnt = mntget(&q->mnt);
rootmnt = &p->mnt;
}
if (&p->mnt == new_fs->pwd.mnt) {
new_fs->pwd.mnt = mntget(&q->mnt);
pwdmnt = &p->mnt;
}
}
p = next_mnt(p, old);
q = next_mnt(q, new);
if (!q)
break;
// an mntns binding we'd skipped?
while (p->mnt.mnt_root != q->mnt.mnt_root)
p = next_mnt(skip_mnt_tree(p), old);
}
ns_tree_add_raw(new_ns);
return new_ns;
}
struct dentry *mount_subtree(struct vfsmount *m, const char *name)
{
struct mount *mnt = real_mount(m);
struct mnt_namespace *ns;
struct super_block *s;
struct path path;
int err;
ns = alloc_mnt_ns(&init_user_ns, true);
if (IS_ERR(ns)) {
mntput(m);
return ERR_CAST(ns);
}
ns->root = mnt;
ns->nr_mounts++;
mnt_add_to_ns(ns, mnt);
err = vfs_path_lookup(m->mnt_root, m,
name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &path);
put_mnt_ns(ns);
if (err)
return ERR_PTR(err);
/* trade a vfsmount reference for active sb one */
s = path.mnt->mnt_sb;
atomic_inc(&s->s_active);
mntput(path.mnt);
/* lock the sucker */
down_write(&s->s_umount);
/* ... and return the root of (sub)tree on it */
return path.dentry;
}
EXPORT_SYMBOL(mount_subtree);
SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
char __user *, type, unsigned long, flags, void __user *, data)
{
int ret;
char *kernel_type;
char *kernel_dev;
void *options;
kernel_type = copy_mount_string(type);
ret = PTR_ERR(kernel_type);
if (IS_ERR(kernel_type))
goto out_type;
kernel_dev = copy_mount_string(dev_name);
ret = PTR_ERR(kernel_dev);
if (IS_ERR(kernel_dev))
goto out_dev;
options = copy_mount_options(data);
ret = PTR_ERR(options);
if (IS_ERR(options))
goto out_data;
ret = do_mount(kernel_dev, dir_name, kernel_type, flags, options);
kfree(options);
out_data:
kfree(kernel_dev);
out_dev:
kfree(kernel_type);
out_type:
return ret;
}
#define FSMOUNT_VALID_FLAGS \
(MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV | \
MOUNT_ATTR_NOEXEC | MOUNT_ATTR__ATIME | MOUNT_ATTR_NODIRATIME | \
MOUNT_ATTR_NOSYMFOLLOW)
#define MOUNT_SETATTR_VALID_FLAGS (FSMOUNT_VALID_FLAGS | MOUNT_ATTR_IDMAP)
#define MOUNT_SETATTR_PROPAGATION_FLAGS \
(MS_UNBINDABLE | MS_PRIVATE | MS_SLAVE | MS_SHARED)
static unsigned int attr_flags_to_mnt_flags(u64 attr_flags)
{
unsigned int mnt_flags = 0;
if (attr_flags & MOUNT_ATTR_RDONLY)
mnt_flags |= MNT_READONLY;
if (attr_flags & MOUNT_ATTR_NOSUID)
mnt_flags |= MNT_NOSUID;
if (attr_flags & MOUNT_ATTR_NODEV)
mnt_flags |= MNT_NODEV;
if (attr_flags & MOUNT_ATTR_NOEXEC)
mnt_flags |= MNT_NOEXEC;
if (attr_flags & MOUNT_ATTR_NODIRATIME)
mnt_flags |= MNT_NODIRATIME;
if (attr_flags & MOUNT_ATTR_NOSYMFOLLOW)
mnt_flags |= MNT_NOSYMFOLLOW;
return mnt_flags;
}
/*
* Create a kernel mount representation for a new, prepared superblock
* (specified by fs_fd) and attach to an open_tree-like file descriptor.
*/
SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
unsigned int, attr_flags)
{
struct path new_path __free(path_put) = {};
struct mnt_namespace *ns;
struct fs_context *fc;
struct vfsmount *new_mnt;
struct mount *mnt;
unsigned int mnt_flags = 0;
long ret;
if (!may_mount())
return -EPERM;
if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
return -EINVAL;
if (attr_flags & ~FSMOUNT_VALID_FLAGS)
return -EINVAL;
mnt_flags = attr_flags_to_mnt_flags(attr_flags);
switch (attr_flags & MOUNT_ATTR__ATIME) {
case MOUNT_ATTR_STRICTATIME:
break;
case MOUNT_ATTR_NOATIME:
mnt_flags |= MNT_NOATIME;
break;
case MOUNT_ATTR_RELATIME:
mnt_flags |= MNT_RELATIME;
break;
default:
return -EINVAL;
}
CLASS(fd, f)(fs_fd);
if (fd_empty(f))
return -EBADF;
if (fd_file(f)->f_op != &fscontext_fops)
return -EINVAL;
fc = fd_file(f)->private_data;
ACQUIRE(mutex_intr, uapi_mutex)(&fc->uapi_mutex);
ret = ACQUIRE_ERR(mutex_intr, &uapi_mutex);
if (ret)
return ret;
/* There must be a valid superblock or we can't mount it */
ret = -EINVAL;
if (!fc->root)
return ret;
ret = -EPERM;
if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
errorfcp(fc, "VFS", "Mount too revealing");
return ret;
}
ret = -EBUSY;
if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
return ret;
if (fc->sb_flags & SB_MANDLOCK)
warn_mandlock();
new_mnt = vfs_create_mount(fc);
if (IS_ERR(new_mnt))
return PTR_ERR(new_mnt);
new_mnt->mnt_flags = mnt_flags;
new_path.dentry = dget(fc->root);
new_path.mnt = new_mnt;
/* We've done the mount bit - now move the file context into more or
* less the same state as if we'd done an fspick(). We don't want to
* do any memory allocation or anything like that at this point as we
* don't want to have to handle any errors incurred.
*/
vfs_clean_context(fc);
ns = alloc_mnt_ns(current->nsproxy->mnt_ns->user_ns, true);
if (IS_ERR(ns))
return PTR_ERR(ns);
mnt = real_mount(new_path.mnt);
ns->root = mnt;
ns->nr_mounts = 1;
mnt_add_to_ns(ns, mnt);
mntget(new_path.mnt);
FD_PREPARE(fdf, (flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0,
dentry_open(&new_path, O_PATH, fc->cred));
if (fdf.err) {
dissolve_on_fput(new_path.mnt);
return fdf.err;
}
/*
* Attach to an apparent O_PATH fd with a note that we
* need to unmount it, not just simply put it.
*/
fd_prepare_file(fdf)->f_mode |= FMODE_NEED_UNMOUNT;
return fd_publish(fdf);
}
static inline int vfs_move_mount(const struct path *from_path,
const struct path *to_path,
enum mnt_tree_flags_t mflags)
{
int ret;
ret = security_move_mount(from_path, to_path);
if (ret)
return ret;
if (mflags & MNT_TREE_PROPAGATION)
return do_set_group(from_path, to_path);
return do_move_mount(from_path, to_path, mflags);
}
/*
* Move a mount from one place to another. In combination with
* fsopen()/fsmount() this is used to install a new mount and in combination
* with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be used to copy
* a mount subtree.
*
* Note the flags value is a combination of MOVE_MOUNT_* flags.
*/
SYSCALL_DEFINE5(move_mount,
int, from_dfd, const char __user *, from_pathname,
int, to_dfd, const char __user *, to_pathname,
unsigned int, flags)
{
struct path to_path __free(path_put) = {};
struct path from_path __free(path_put) = {};
unsigned int lflags, uflags;
enum mnt_tree_flags_t mflags = 0;
int ret = 0;
if (!may_mount())
return -EPERM;
if (flags & ~MOVE_MOUNT__MASK)
return -EINVAL;
if ((flags & (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP)) ==
(MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP))
return -EINVAL;
if (flags & MOVE_MOUNT_SET_GROUP) mflags |= MNT_TREE_PROPAGATION;
if (flags & MOVE_MOUNT_BENEATH) mflags |= MNT_TREE_BENEATH;
uflags = 0;
if (flags & MOVE_MOUNT_T_EMPTY_PATH)
uflags = AT_EMPTY_PATH;
CLASS(filename_maybe_null,to_name)(to_pathname, uflags);
if (!to_name && to_dfd >= 0) {
CLASS(fd_raw, f_to)(to_dfd);
if (fd_empty(f_to))
return -EBADF;
to_path = fd_file(f_to)->f_path;
path_get(&to_path);
} else {
lflags = 0;
if (flags & MOVE_MOUNT_T_SYMLINKS)
lflags |= LOOKUP_FOLLOW;
if (flags & MOVE_MOUNT_T_AUTOMOUNTS)
lflags |= LOOKUP_AUTOMOUNT;
ret = filename_lookup(to_dfd, to_name, lflags, &to_path, NULL);
if (ret)
return ret;
}
uflags = 0;
if (flags & MOVE_MOUNT_F_EMPTY_PATH)
uflags = AT_EMPTY_PATH;
CLASS(filename_maybe_null,from_name)(from_pathname, uflags);
if (!from_name && from_dfd >= 0) {
CLASS(fd_raw, f_from)(from_dfd);
if (fd_empty(f_from))
return -EBADF;
return vfs_move_mount(&fd_file(f_from)->f_path, &to_path, mflags);
}
lflags = 0;
if (flags & MOVE_MOUNT_F_SYMLINKS)
lflags |= LOOKUP_FOLLOW;
if (flags & MOVE_MOUNT_F_AUTOMOUNTS)
lflags |= LOOKUP_AUTOMOUNT;
ret = filename_lookup(from_dfd, from_name, lflags, &from_path, NULL);
if (ret)
return ret;
return vfs_move_mount(&from_path, &to_path, mflags);
}
/*
* Return true if path is reachable from root
*
* locks: mount_locked_reader || namespace_shared && is_mounted(mnt)
*/
bool is_path_reachable(struct mount *mnt, struct dentry *dentry,
const struct path *root)
{
while (&mnt->mnt != root->mnt && mnt_has_parent(mnt)) {
dentry = mnt->mnt_mountpoint;
mnt = mnt->mnt_parent;
}
return &mnt->mnt == root->mnt && is_subdir(dentry, root->dentry);
}
bool path_is_under(const struct path *path1, const struct path *path2)
{
guard(mount_locked_reader)();
return is_path_reachable(real_mount(path1->mnt), path1->dentry, path2);
}
EXPORT_SYMBOL(path_is_under);
int path_pivot_root(struct path *new, struct path *old)
{
struct path root __free(path_put) = {};
struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent;
int error;
if (!may_mount())
return -EPERM;
error = security_sb_pivotroot(old, new);
if (error)
return error;
get_fs_root(current->fs, &root);
LOCK_MOUNT(old_mp, old);
old_mnt = old_mp.parent;
if (IS_ERR(old_mnt))
return PTR_ERR(old_mnt);
new_mnt = real_mount(new->mnt);
root_mnt = real_mount(root.mnt);
ex_parent = new_mnt->mnt_parent;
root_parent = root_mnt->mnt_parent;
if (IS_MNT_SHARED(old_mnt) ||
IS_MNT_SHARED(ex_parent) ||
IS_MNT_SHARED(root_parent))
return -EINVAL;
if (!check_mnt(root_mnt) || !check_mnt(new_mnt))
return -EINVAL;
if (new_mnt->mnt.mnt_flags & MNT_LOCKED)
return -EINVAL;
if (d_unlinked(new->dentry))
return -ENOENT;
if (new_mnt == root_mnt || old_mnt == root_mnt)
return -EBUSY; /* loop, on the same file system */
if (!path_mounted(&root))
return -EINVAL; /* not a mountpoint */
if (!mnt_has_parent(root_mnt))
return -EINVAL; /* absolute root */
if (!path_mounted(new))
return -EINVAL; /* not a mountpoint */
if (!mnt_has_parent(new_mnt))
return -EINVAL; /* absolute root */
/* make sure we can reach put_old from new_root */
if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, new))
return -EINVAL;
/* make certain new is below the root */
if (!is_path_reachable(new_mnt, new->dentry, &root))
return -EINVAL;
lock_mount_hash();
umount_mnt(new_mnt);
if (root_mnt->mnt.mnt_flags & MNT_LOCKED) {
new_mnt->mnt.mnt_flags |= MNT_LOCKED;
root_mnt->mnt.mnt_flags &= ~MNT_LOCKED;
}
/* mount new_root on / */
attach_mnt(new_mnt, root_parent, root_mnt->mnt_mp);
umount_mnt(root_mnt);
/* mount old root on put_old */
attach_mnt(root_mnt, old_mnt, old_mp.mp);
touch_mnt_namespace(current->nsproxy->mnt_ns);
/* A moved mount should not expire automatically */
list_del_init(&new_mnt->mnt_expire);
unlock_mount_hash();
mnt_notify_add(root_mnt);
mnt_notify_add(new_mnt);
chroot_fs_refs(&root, new);
return 0;
}
/*
* pivot_root Semantics:
* Moves the root file system of the current process to the directory put_old,
* makes new_root the new root file system of the current process, and sets
* root/cwd of all processes which had them on the current root to new_root.
*
* Restrictions:
* The new_root and put_old must be directories, and must not be on the
* same file system as the current process root. The put_old must be
* underneath new_root, i.e. adding a non-zero number of /.. to the string
* pointed to by put_old must yield the same directory as new_root. No other
* file system may be mounted on put_old. After all, new_root is a mountpoint.
*
* The immutable nullfs filesystem is mounted as the true root of the VFS
* hierarchy. The mutable rootfs (tmpfs/ramfs) is layered on top of this,
* allowing pivot_root() to work normally from initramfs.
*
* Notes:
* - we don't move root/cwd if they are not at the root (reason: if something
* cared enough to change them, it's probably wrong to force them elsewhere)
* - it's okay to pick a root that isn't the root of a file system, e.g.
* /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
* though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
* first.
*/
SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
const char __user *, put_old)
{
struct path new __free(path_put) = {};
struct path old __free(path_put) = {};
int error;
error = user_path_at(AT_FDCWD, new_root,
LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
if (error)
return error;
error = user_path_at(AT_FDCWD, put_old,
LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old);
if (error)
return error;
return path_pivot_root(&new, &old);
}
static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt)
{
unsigned int flags = mnt->mnt.mnt_flags;
/* flags to clear */
flags &= ~kattr->attr_clr;
/* flags to raise */
flags |= kattr->attr_set;
return flags;
}
static int can_idmap_mount(const struct mount_kattr *kattr, struct mount *mnt)
{
struct vfsmount *m = &mnt->mnt;
struct user_namespace *fs_userns = m->mnt_sb->s_user_ns;
if (!kattr->mnt_idmap)
return 0;
/*
* Creating an idmapped mount with the filesystem-wide idmapping
* doesn't make sense so block that. We don't allow mushy semantics.
*/
if (kattr->mnt_userns == m->mnt_sb->s_user_ns)
return -EINVAL;
/*
* We only allow a mount to change its idmapping if it has
* never been accessible to userspace.
*/
if (!(kattr->kflags & MOUNT_KATTR_IDMAP_REPLACE) && is_idmapped_mnt(m))
return -EPERM;
/* The underlying filesystem doesn't support idmapped mounts yet. */
if (!(m->mnt_sb->s_type->fs_flags & FS_ALLOW_IDMAP))
return -EINVAL;
/* The filesystem has turned off idmapped mounts. */
if (m->mnt_sb->s_iflags & SB_I_NOIDMAP)
return -EINVAL;
/* We're not controlling the superblock. */
if (!ns_capable(fs_userns, CAP_SYS_ADMIN))
return -EPERM;
/* Mount has already been visible in the filesystem hierarchy. */
if (!is_anon_ns(mnt->mnt_ns))
return -EINVAL;
return 0;
}
/**
* mnt_allow_writers() - check whether the attribute change allows writers
* @kattr: the new mount attributes
* @mnt: the mount to which @kattr will be applied
*
* Check whether the new mount attributes in @kattr allow concurrent writers.
*
* Return: true if concurrent writers are allowed, false if writers need to be held
*/
static inline bool mnt_allow_writers(const struct mount_kattr *kattr,
const struct mount *mnt)
{
return (!(kattr->attr_set & MNT_READONLY) ||
(mnt->mnt.mnt_flags & MNT_READONLY)) &&
!kattr->mnt_idmap;
}
static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt)
{
struct mount *m;
int err;
for (m = mnt; m; m = next_mnt(m, mnt)) {
if (!can_change_locked_flags(m, recalc_flags(kattr, m))) {
err = -EPERM;
break;
}
err = can_idmap_mount(kattr, m);
if (err)
break;
if (!mnt_allow_writers(kattr, m)) {
err = mnt_hold_writers(m);
if (err) {
m = next_mnt(m, mnt);
break;
}
}
if (!(kattr->kflags & MOUNT_KATTR_RECURSE))
return 0;
}
if (err) {
/* undo all mnt_hold_writers() we'd done */
for (struct mount *p = mnt; p != m; p = next_mnt(p, mnt))
mnt_unhold_writers(p);
}
return err;
}
static void do_idmap_mount(const struct mount_kattr *kattr, struct mount *mnt)
{
struct mnt_idmap *old_idmap;
if (!kattr->mnt_idmap)
return;
old_idmap = mnt_idmap(&mnt->mnt);
/* Pairs with smp_load_acquire() in mnt_idmap(). */
smp_store_release(&mnt->mnt.mnt_idmap, mnt_idmap_get(kattr->mnt_idmap));
mnt_idmap_put(old_idmap);
}
static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt)
{
struct mount *m;
for (m = mnt; m; m = next_mnt(m, mnt)) {
unsigned int flags;
do_idmap_mount(kattr, m);
flags = recalc_flags(kattr, m);
WRITE_ONCE(m->mnt.mnt_flags, flags);
/* If we had to hold writers unblock them. */
mnt_unhold_writers(m);
if (kattr->propagation)
change_mnt_propagation(m, kattr->propagation);
if (!(kattr->kflags & MOUNT_KATTR_RECURSE))
break;
}
touch_mnt_namespace(mnt->mnt_ns);
}
static int do_mount_setattr(const struct path *path, struct mount_kattr *kattr)
{
struct mount *mnt = real_mount(path->mnt);
int err = 0;
if (!path_mounted(path))
return -EINVAL;
if (kattr->mnt_userns) {
struct mnt_idmap *mnt_idmap;
mnt_idmap = alloc_mnt_idmap(kattr->mnt_userns);
if (IS_ERR(mnt_idmap))
return PTR_ERR(mnt_idmap);
kattr->mnt_idmap = mnt_idmap;
}
if (kattr->propagation) {
/*
* Only take namespace_lock() if we're actually changing
* propagation.
*/
namespace_lock();
if (kattr->propagation == MS_SHARED) {
err = invent_group_ids(mnt, kattr->kflags & MOUNT_KATTR_RECURSE);
if (err) {
namespace_unlock();
return err;
}
}
}
err = -EINVAL;
lock_mount_hash();
if (!anon_ns_root(mnt) && !check_mnt(mnt))
goto out;
/*
* First, we get the mount tree in a shape where we can change mount
* properties without failure. If that succeeds we commit all
* changes; if it fails we clean up.
*/
err = mount_setattr_prepare(kattr, mnt);
if (!err)
mount_setattr_commit(kattr, mnt);
out:
unlock_mount_hash();
if (kattr->propagation) {
if (err)
cleanup_group_ids(mnt, NULL);
namespace_unlock();
}
return err;
}
static int build_mount_idmapped(const struct mount_attr *attr, size_t usize,
struct mount_kattr *kattr)
{
struct ns_common *ns;
struct user_namespace *mnt_userns;
if (!((attr->attr_set | attr->attr_clr) & MOUNT_ATTR_IDMAP))
return 0;
if (attr->attr_clr & MOUNT_ATTR_IDMAP) {
/*
* We can only remove an idmapping if it's never been
* exposed to userspace.
*/
if (!(kattr->kflags & MOUNT_KATTR_IDMAP_REPLACE))
return -EINVAL;
/*
* Removal of idmappings is equivalent to setting
* nop_mnt_idmap.
*/
if (!(attr->attr_set & MOUNT_ATTR_IDMAP)) {
kattr->mnt_idmap = &nop_mnt_idmap;
return 0;
}
}
if (attr->userns_fd > INT_MAX)
return -EINVAL;
CLASS(fd, f)(attr->userns_fd);
if (fd_empty(f))
return -EBADF;
if (!proc_ns_file(fd_file(f)))
return -EINVAL;
ns = get_proc_ns(file_inode(fd_file(f)));
if (ns->ns_type != CLONE_NEWUSER)
return -EINVAL;
/*
* The initial idmapping cannot be used to create an idmapped
* mount. We use the initial idmapping as an indicator of a mount
* that is not idmapped. It can simply be passed into helpers that
* are aware of idmapped mounts as a convenient shortcut. A user
* can just create a dedicated identity mapping to achieve the same
* result.
*/
mnt_userns = container_of(ns, struct user_namespace, ns);
if (mnt_userns == &init_user_ns)
return -EPERM;
/* We're not controlling the target namespace. */
if (!ns_capable(mnt_userns, CAP_SYS_ADMIN))
return -EPERM;
kattr->mnt_userns = get_user_ns(mnt_userns);
return 0;
}
static int build_mount_kattr(const struct mount_attr *attr, size_t usize,
struct mount_kattr *kattr)
{
if (attr->propagation & ~MOUNT_SETATTR_PROPAGATION_FLAGS)
return -EINVAL;
if (hweight32(attr->propagation & MOUNT_SETATTR_PROPAGATION_FLAGS) > 1)
return -EINVAL;
kattr->propagation = attr->propagation;
if ((attr->attr_set | attr->attr_clr) & ~MOUNT_SETATTR_VALID_FLAGS)
return -EINVAL;
kattr->attr_set = attr_flags_to_mnt_flags(attr->attr_set);
kattr->attr_clr = attr_flags_to_mnt_flags(attr->attr_clr);
/*
* Since the MOUNT_ATTR_<atime> values are an enum, not a bitmap,
* users wanting to transition to a different atime setting cannot
* simply specify the atime setting in @attr_set, but must also
* specify MOUNT_ATTR__ATIME in the @attr_clr field. For example,
* switching to noatime requires @attr_set = MOUNT_ATTR_NOATIME
* together with @attr_clr = MOUNT_ATTR__ATIME.
* So ensure that MOUNT_ATTR__ATIME can't be partially set in
* @attr_clr and that @attr_set can't have any atime bits set if
* MOUNT_ATTR__ATIME isn't set in @attr_clr.
*/
if (attr->attr_clr & MOUNT_ATTR__ATIME) {
if ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME)
return -EINVAL;
/*
* Clear all previous time settings as they are mutually
* exclusive.
*/
kattr->attr_clr |= MNT_RELATIME | MNT_NOATIME;
switch (attr->attr_set & MOUNT_ATTR__ATIME) {
case MOUNT_ATTR_RELATIME:
kattr->attr_set |= MNT_RELATIME;
break;
case MOUNT_ATTR_NOATIME:
kattr->attr_set |= MNT_NOATIME;
break;
case MOUNT_ATTR_STRICTATIME:
break;
default:
return -EINVAL;
}
} else {
if (attr->attr_set & MOUNT_ATTR__ATIME)
return -EINVAL;
}
return build_mount_idmapped(attr, usize, kattr);
}
static void finish_mount_kattr(struct mount_kattr *kattr)
{
if (kattr->mnt_userns) {
put_user_ns(kattr->mnt_userns);
kattr->mnt_userns = NULL;
}
if (kattr->mnt_idmap)
mnt_idmap_put(kattr->mnt_idmap);
}
static int wants_mount_setattr(struct mount_attr __user *uattr, size_t usize,
struct mount_kattr *kattr)
{
int ret;
struct mount_attr attr;
BUILD_BUG_ON(sizeof(struct mount_attr) != MOUNT_ATTR_SIZE_VER0);
if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
if (unlikely(usize < MOUNT_ATTR_SIZE_VER0))
return -EINVAL;
if (!may_mount())
return -EPERM;
ret = copy_struct_from_user(&attr, sizeof(attr), uattr, usize);
if (ret)
return ret;
/* Don't bother walking through the mounts if this is a nop. */
if (attr.attr_set == 0 &&
attr.attr_clr == 0 &&
attr.propagation == 0)
return 0; /* Tell caller to not bother. */
ret = build_mount_kattr(&attr, usize, kattr);
if (ret < 0)
return ret;
return 1;
}
SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path,
unsigned int, flags, struct mount_attr __user *, uattr,
size_t, usize)
{
int err;
struct path target;
struct mount_kattr kattr;
unsigned int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
if (flags & ~(AT_EMPTY_PATH |
AT_RECURSIVE |
AT_SYMLINK_NOFOLLOW |
AT_NO_AUTOMOUNT))
return -EINVAL;
if (flags & AT_NO_AUTOMOUNT)
lookup_flags &= ~LOOKUP_AUTOMOUNT;
if (flags & AT_SYMLINK_NOFOLLOW)
lookup_flags &= ~LOOKUP_FOLLOW;
kattr = (struct mount_kattr) {
.lookup_flags = lookup_flags,
};
if (flags & AT_RECURSIVE)
kattr.kflags |= MOUNT_KATTR_RECURSE;
err = wants_mount_setattr(uattr, usize, &kattr);
if (err <= 0)
return err;
CLASS(filename_uflags, name)(path, flags);
err = filename_lookup(dfd, name, kattr.lookup_flags, &target, NULL);
if (!err) {
err = do_mount_setattr(&target, &kattr);
path_put(&target);
}
finish_mount_kattr(&kattr);
return err;
}
SYSCALL_DEFINE5(open_tree_attr, int, dfd, const char __user *, filename,
unsigned, flags, struct mount_attr __user *, uattr,
size_t, usize)
{
if (!uattr && usize)
return -EINVAL;
FD_PREPARE(fdf, flags, vfs_open_tree(dfd, filename, flags));
if (fdf.err)
return fdf.err;
if (uattr) {
struct mount_kattr kattr = {};
struct file *file = fd_prepare_file(fdf);
int ret;
if (flags & OPEN_TREE_CLONE)
kattr.kflags = MOUNT_KATTR_IDMAP_REPLACE;
if (flags & AT_RECURSIVE)
kattr.kflags |= MOUNT_KATTR_RECURSE;
ret = wants_mount_setattr(uattr, usize, &kattr);
if (ret > 0) {
ret = do_mount_setattr(&file->f_path, &kattr);
finish_mount_kattr(&kattr);
}
if (ret)
return ret;
}
return fd_publish(fdf);
}
int show_path(struct seq_file *m, struct dentry *root)
{
if (root->d_sb->s_op->show_path)
return root->d_sb->s_op->show_path(m, root);
seq_dentry(m, root, " \t\n\\");
return 0;
}
static struct vfsmount *lookup_mnt_in_ns(u64 id, struct mnt_namespace *ns)
{
struct mount *mnt = mnt_find_id_at(ns, id);
if (!mnt || mnt->mnt_id_unique != id)
return NULL;
return &mnt->mnt;
}
struct kstatmount {
struct statmount __user *buf;
size_t bufsize;
struct vfsmount *mnt;
struct mnt_idmap *idmap;
u64 mask;
struct path root;
struct seq_file seq;
/* Must be last: ends in a flexible-array member. */
struct statmount sm;
};
static u64 mnt_to_attr_flags(struct vfsmount *mnt)
{
unsigned int mnt_flags = READ_ONCE(mnt->mnt_flags);
u64 attr_flags = 0;
if (mnt_flags & MNT_READONLY)
attr_flags |= MOUNT_ATTR_RDONLY;
if (mnt_flags & MNT_NOSUID)
attr_flags |= MOUNT_ATTR_NOSUID;
if (mnt_flags & MNT_NODEV)
attr_flags |= MOUNT_ATTR_NODEV;
if (mnt_flags & MNT_NOEXEC)
attr_flags |= MOUNT_ATTR_NOEXEC;
if (mnt_flags & MNT_NODIRATIME)
attr_flags |= MOUNT_ATTR_NODIRATIME;
if (mnt_flags & MNT_NOSYMFOLLOW)
attr_flags |= MOUNT_ATTR_NOSYMFOLLOW;
if (mnt_flags & MNT_NOATIME)
attr_flags |= MOUNT_ATTR_NOATIME;
else if (mnt_flags & MNT_RELATIME)
attr_flags |= MOUNT_ATTR_RELATIME;
else
attr_flags |= MOUNT_ATTR_STRICTATIME;
if (is_idmapped_mnt(mnt))
attr_flags |= MOUNT_ATTR_IDMAP;
return attr_flags;
}
static u64 mnt_to_propagation_flags(struct mount *m)
{
u64 propagation = 0;
if (IS_MNT_SHARED(m))
propagation |= MS_SHARED;
if (IS_MNT_SLAVE(m))
propagation |= MS_SLAVE;
if (IS_MNT_UNBINDABLE(m))
propagation |= MS_UNBINDABLE;
if (!propagation)
propagation |= MS_PRIVATE;
return propagation;
}
u64 vfsmount_to_propagation_flags(struct vfsmount *mnt)
{
return mnt_to_propagation_flags(real_mount(mnt));
}
EXPORT_SYMBOL_GPL(vfsmount_to_propagation_flags);
static void statmount_sb_basic(struct kstatmount *s)
{
struct super_block *sb = s->mnt->mnt_sb;
s->sm.mask |= STATMOUNT_SB_BASIC;
s->sm.sb_dev_major = MAJOR(sb->s_dev);
s->sm.sb_dev_minor = MINOR(sb->s_dev);
s->sm.sb_magic = sb->s_magic;
s->sm.sb_flags = sb->s_flags & (SB_RDONLY|SB_SYNCHRONOUS|SB_DIRSYNC|SB_LAZYTIME);
}
static void statmount_mnt_basic(struct kstatmount *s)
{
struct mount *m = real_mount(s->mnt);
s->sm.mask |= STATMOUNT_MNT_BASIC;
s->sm.mnt_id = m->mnt_id_unique;
s->sm.mnt_parent_id = m->mnt_parent->mnt_id_unique;
s->sm.mnt_id_old = m->mnt_id;
s->sm.mnt_parent_id_old = m->mnt_parent->mnt_id;
s->sm.mnt_attr = mnt_to_attr_flags(&m->mnt);
s->sm.mnt_propagation = mnt_to_propagation_flags(m);
s->sm.mnt_peer_group = m->mnt_group_id;
s->sm.mnt_master = IS_MNT_SLAVE(m) ? m->mnt_master->mnt_group_id : 0;
}
static void statmount_propagate_from(struct kstatmount *s)
{
struct mount *m = real_mount(s->mnt);
s->sm.mask |= STATMOUNT_PROPAGATE_FROM;
if (IS_MNT_SLAVE(m))
s->sm.propagate_from = get_dominating_id(m, &current->fs->root);
}
static int statmount_mnt_root(struct kstatmount *s, struct seq_file *seq)
{
int ret;
size_t start = seq->count;
ret = show_path(seq, s->mnt->mnt_root);
if (ret)
return ret;
if (unlikely(seq_has_overflowed(seq)))
return -EAGAIN;
/*
* Unescape the result. It would be better if the supplied string were not
* escaped in the first place, but that's a pretty invasive change.
*/
seq->buf[seq->count] = '\0';
seq->count = start;
seq_commit(seq, string_unescape_inplace(seq->buf + start, UNESCAPE_OCTAL));
return 0;
}
static int statmount_mnt_point(struct kstatmount *s, struct seq_file *seq)
{
struct vfsmount *mnt = s->mnt;
struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
int err;
err = seq_path_root(seq, &mnt_path, &s->root, "");
return err == SEQ_SKIP ? 0 : err;
}
static int statmount_fs_type(struct kstatmount *s, struct seq_file *seq)
{
struct super_block *sb = s->mnt->mnt_sb;
seq_puts(seq, sb->s_type->name);
return 0;
}
static void statmount_fs_subtype(struct kstatmount *s, struct seq_file *seq)
{
struct super_block *sb = s->mnt->mnt_sb;
if (sb->s_subtype)
seq_puts(seq, sb->s_subtype);
}
static int statmount_sb_source(struct kstatmount *s, struct seq_file *seq)
{
struct super_block *sb = s->mnt->mnt_sb;
struct mount *r = real_mount(s->mnt);
if (sb->s_op->show_devname) {
size_t start = seq->count;
int ret;
ret = sb->s_op->show_devname(seq, s->mnt->mnt_root);
if (ret)
return ret;
if (unlikely(seq_has_overflowed(seq)))
return -EAGAIN;
/* Unescape the result */
seq->buf[seq->count] = '\0';
seq->count = start;
seq_commit(seq, string_unescape_inplace(seq->buf + start, UNESCAPE_OCTAL));
} else {
seq_puts(seq, r->mnt_devname);
}
return 0;
}
static void statmount_mnt_ns_id(struct kstatmount *s, struct mnt_namespace *ns)
{
s->sm.mask |= STATMOUNT_MNT_NS_ID;
s->sm.mnt_ns_id = ns->ns.ns_id;
}
static int statmount_mnt_opts(struct kstatmount *s, struct seq_file *seq)
{
struct vfsmount *mnt = s->mnt;
struct super_block *sb = mnt->mnt_sb;
size_t start = seq->count;
int err;
err = security_sb_show_options(seq, sb);
if (err)
return err;
if (sb->s_op->show_options) {
err = sb->s_op->show_options(seq, mnt->mnt_root);
if (err)
return err;
}
if (unlikely(seq_has_overflowed(seq)))
return -EAGAIN;
if (seq->count == start)
return 0;
/* skip leading comma */
memmove(seq->buf + start, seq->buf + start + 1,
seq->count - start - 1);
seq->count--;
return 0;
}
static inline int statmount_opt_process(struct seq_file *seq, size_t start)
{
char *buf_end, *opt_end, *src, *dst;
int count = 0;
if (unlikely(seq_has_overflowed(seq)))
return -EAGAIN;
buf_end = seq->buf + seq->count;
dst = seq->buf + start;
src = dst + 1; /* skip initial comma */
if (src >= buf_end) {
seq->count = start;
return 0;
}
*buf_end = '\0';
for (; src < buf_end; src = opt_end + 1) {
opt_end = strchrnul(src, ',');
*opt_end = '\0';
dst += string_unescape(src, dst, 0, UNESCAPE_OCTAL) + 1;
if (WARN_ON_ONCE(++count == INT_MAX))
return -EOVERFLOW;
}
seq->count = dst - 1 - seq->buf;
return count;
}
static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq)
{
struct vfsmount *mnt = s->mnt;
struct super_block *sb = mnt->mnt_sb;
size_t start = seq->count;
int err;
if (!sb->s_op->show_options)
return 0;
err = sb->s_op->show_options(seq, mnt->mnt_root);
if (err)
return err;
err = statmount_opt_process(seq, start);
if (err < 0)
return err;
s->sm.opt_num = err;
return 0;
}
static int statmount_opt_sec_array(struct kstatmount *s, struct seq_file *seq)
{
struct vfsmount *mnt = s->mnt;
struct super_block *sb = mnt->mnt_sb;
size_t start = seq->count;
int err;
err = security_sb_show_options(seq, sb);
if (err)
return err;
err = statmount_opt_process(seq, start);
if (err < 0)
return err;
s->sm.opt_sec_num = err;
return 0;
}
static inline int statmount_mnt_uidmap(struct kstatmount *s, struct seq_file *seq)
{
int ret;
ret = statmount_mnt_idmap(s->idmap, seq, true);
if (ret < 0)
return ret;
s->sm.mnt_uidmap_num = ret;
/*
* Raise STATMOUNT_MNT_UIDMAP for every idmapped mount, even if there are no valid
* mappings. This allows userspace to distinguish between a
* non-idmapped mount and an idmapped mount where none of the
* individual mappings are valid in the caller's idmapping.
*/
if (is_valid_mnt_idmap(s->idmap))
s->sm.mask |= STATMOUNT_MNT_UIDMAP;
return 0;
}
static inline int statmount_mnt_gidmap(struct kstatmount *s, struct seq_file *seq)
{
int ret;
ret = statmount_mnt_idmap(s->idmap, seq, false);
if (ret < 0)
return ret;
s->sm.mnt_gidmap_num = ret;
/*
* Raise STATMOUNT_MNT_GIDMAP for every idmapped mount, even if there are no valid
* mappings. This allows userspace to distinguish between a
* non-idmapped mount and an idmapped mount where none of the
* individual mappings are valid in the caller's idmapping.
*/
if (is_valid_mnt_idmap(s->idmap))
s->sm.mask |= STATMOUNT_MNT_GIDMAP;
return 0;
}
static int statmount_string(struct kstatmount *s, u64 flag)
{
int ret = 0;
size_t kbufsize;
struct seq_file *seq = &s->seq;
struct statmount *sm = &s->sm;
u32 start, *offp;
/* Reserve an empty string at the beginning for any unset offsets */
if (!seq->count)
seq_putc(seq, 0);
start = seq->count;
switch (flag) {
case STATMOUNT_FS_TYPE:
offp = &sm->fs_type;
ret = statmount_fs_type(s, seq);
break;
case STATMOUNT_MNT_ROOT:
offp = &sm->mnt_root;
ret = statmount_mnt_root(s, seq);
break;
case STATMOUNT_MNT_POINT:
offp = &sm->mnt_point;
ret = statmount_mnt_point(s, seq);
break;
case STATMOUNT_MNT_OPTS:
offp = &sm->mnt_opts;
ret = statmount_mnt_opts(s, seq);
break;
case STATMOUNT_OPT_ARRAY:
offp = &sm->opt_array;
ret = statmount_opt_array(s, seq);
break;
case STATMOUNT_OPT_SEC_ARRAY:
offp = &sm->opt_sec_array;
ret = statmount_opt_sec_array(s, seq);
break;
case STATMOUNT_FS_SUBTYPE:
offp = &sm->fs_subtype;
statmount_fs_subtype(s, seq);
break;
case STATMOUNT_SB_SOURCE:
offp = &sm->sb_source;
ret = statmount_sb_source(s, seq);
break;
case STATMOUNT_MNT_UIDMAP:
offp = &sm->mnt_uidmap;
ret = statmount_mnt_uidmap(s, seq);
break;
case STATMOUNT_MNT_GIDMAP:
offp = &sm->mnt_gidmap;
ret = statmount_mnt_gidmap(s, seq);
break;
default:
WARN_ON_ONCE(true);
return -EINVAL;
}
/*
* If nothing was emitted, return to avoid setting the flag
* and terminating the buffer.
*/
if (seq->count == start)
return ret;
if (unlikely(check_add_overflow(sizeof(*sm), seq->count, &kbufsize)))
return -EOVERFLOW;
if (kbufsize >= s->bufsize)
return -EOVERFLOW;
/* signal a retry */
if (unlikely(seq_has_overflowed(seq)))
return -EAGAIN;
if (ret)
return ret;
seq->buf[seq->count++] = '\0';
sm->mask |= flag;
*offp = start;
return 0;
}
static int copy_statmount_to_user(struct kstatmount *s)
{
struct statmount *sm = &s->sm;
struct seq_file *seq = &s->seq;
char __user *str = ((char __user *)s->buf) + sizeof(*sm);
size_t copysize = min_t(size_t, s->bufsize, sizeof(*sm));
if (seq->count && copy_to_user(str, seq->buf, seq->count))
return -EFAULT;
/* Return the number of bytes copied to the buffer */
sm->size = copysize + seq->count;
if (copy_to_user(s->buf, sm, copysize))
return -EFAULT;
return 0;
}
static struct mount *listmnt_next(struct mount *curr, bool reverse)
{
struct rb_node *node;
if (reverse)
node = rb_prev(&curr->mnt_node);
else
node = rb_next(&curr->mnt_node);
return node_to_mount(node);
}
static int grab_requested_root(struct mnt_namespace *ns, struct path *root)
{
struct mount *first, *child;
rwsem_assert_held(&namespace_sem);
/* We're looking at our own ns, just use get_fs_root. */
if (ns == current->nsproxy->mnt_ns) {
get_fs_root(current->fs, root);
return 0;
}
/*
* We have to find the first mount in our ns and use that, however it
* may not exist, so handle that properly.
*/
if (mnt_ns_empty(ns))
return -ENOENT;
first = child = ns->root;
for (;;) {
child = listmnt_next(child, false);
if (!child)
return -ENOENT;
if (child->mnt_parent == first)
break;
}
root->mnt = mntget(&child->mnt);
root->dentry = dget(root->mnt->mnt_root);
return 0;
}
/* This must be updated whenever a new flag is added */
#define STATMOUNT_SUPPORTED (STATMOUNT_SB_BASIC | \
STATMOUNT_MNT_BASIC | \
STATMOUNT_PROPAGATE_FROM | \
STATMOUNT_MNT_ROOT | \
STATMOUNT_MNT_POINT | \
STATMOUNT_FS_TYPE | \
STATMOUNT_MNT_NS_ID | \
STATMOUNT_MNT_OPTS | \
STATMOUNT_FS_SUBTYPE | \
STATMOUNT_SB_SOURCE | \
STATMOUNT_OPT_ARRAY | \
STATMOUNT_OPT_SEC_ARRAY | \
STATMOUNT_SUPPORTED_MASK | \
STATMOUNT_MNT_UIDMAP | \
STATMOUNT_MNT_GIDMAP)
/* locks: namespace_shared */
static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id,
struct file *mnt_file, struct mnt_namespace *ns)
{
int err;
if (mnt_file) {
WARN_ON_ONCE(ns != NULL);
s->mnt = mnt_file->f_path.mnt;
ns = real_mount(s->mnt)->mnt_ns;
if (!ns)
/*
* We can't set mount point and mnt_ns_id since we don't have a
* ns for the mount. This can happen if the mount is unmounted
* with MNT_DETACH.
*/
s->mask &= ~(STATMOUNT_MNT_POINT | STATMOUNT_MNT_NS_ID);
} else {
/* Has the namespace already been emptied? */
if (mnt_ns_id && mnt_ns_empty(ns))
return -ENOENT;
s->mnt = lookup_mnt_in_ns(mnt_id, ns);
if (!s->mnt)
return -ENOENT;
}
if (ns) {
err = grab_requested_root(ns, &s->root);
if (err)
return err;
if (!mnt_file) {
struct mount *m;
/*
* Don't trigger audit denials. We just want to determine what
* mounts to show users.
*/
m = real_mount(s->mnt);
if (!is_path_reachable(m, m->mnt.mnt_root, &s->root) &&
!ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
}
}
err = security_sb_statfs(s->mnt->mnt_root);
if (err)
return err;
/*
* Note that mount properties in mnt->mnt_flags, mnt->mnt_idmap
* can change concurrently as we only hold the read-side of the
* namespace semaphore and mount properties may change with only
* the mount lock held.
*
* We could sample the mount lock sequence counter to detect
* those changes and retry. But it's not worth it. Worst that
* happens is that the mnt->mnt_idmap pointer is already changed
* while mnt->mnt_flags isn't or vice versa. So what.
*
* Both mnt->mnt_flags and mnt->mnt_idmap are set and retrieved
* via READ_ONCE()/WRITE_ONCE() and guard against theoretical
* torn read/write. That's all we care about right now.
*/
s->idmap = mnt_idmap(s->mnt);
if (s->mask & STATMOUNT_MNT_BASIC)
statmount_mnt_basic(s);
if (s->mask & STATMOUNT_SB_BASIC)
statmount_sb_basic(s);
if (s->mask & STATMOUNT_PROPAGATE_FROM)
statmount_propagate_from(s);
if (s->mask & STATMOUNT_FS_TYPE)
err = statmount_string(s, STATMOUNT_FS_TYPE);
if (!err && s->mask & STATMOUNT_MNT_ROOT)
err = statmount_string(s, STATMOUNT_MNT_ROOT);
if (!err && s->mask & STATMOUNT_MNT_POINT)
err = statmount_string(s, STATMOUNT_MNT_POINT);
if (!err && s->mask & STATMOUNT_MNT_OPTS)
err = statmount_string(s, STATMOUNT_MNT_OPTS);
if (!err && s->mask & STATMOUNT_OPT_ARRAY)
err = statmount_string(s, STATMOUNT_OPT_ARRAY);
if (!err && s->mask & STATMOUNT_OPT_SEC_ARRAY)
err = statmount_string(s, STATMOUNT_OPT_SEC_ARRAY);
if (!err && s->mask & STATMOUNT_FS_SUBTYPE)
err = statmount_string(s, STATMOUNT_FS_SUBTYPE);
if (!err && s->mask & STATMOUNT_SB_SOURCE)
err = statmount_string(s, STATMOUNT_SB_SOURCE);
if (!err && s->mask & STATMOUNT_MNT_UIDMAP)
err = statmount_string(s, STATMOUNT_MNT_UIDMAP);
if (!err && s->mask & STATMOUNT_MNT_GIDMAP)
err = statmount_string(s, STATMOUNT_MNT_GIDMAP);
if (!err && s->mask & STATMOUNT_MNT_NS_ID)
statmount_mnt_ns_id(s, ns);
if (!err && s->mask & STATMOUNT_SUPPORTED_MASK) {
s->sm.mask |= STATMOUNT_SUPPORTED_MASK;
s->sm.supported_mask = STATMOUNT_SUPPORTED;
}
if (err)
return err;
/* Are there bits in the return mask not present in STATMOUNT_SUPPORTED? */
WARN_ON_ONCE(~STATMOUNT_SUPPORTED & s->sm.mask);
return 0;
}
static inline bool retry_statmount(const long ret, size_t *seq_size)
{
if (likely(ret != -EAGAIN))
return false;
if (unlikely(check_mul_overflow(*seq_size, 2, seq_size)))
return false;
if (unlikely(*seq_size > MAX_RW_COUNT))
return false;
return true;
}
#define STATMOUNT_STRING_REQ (STATMOUNT_MNT_ROOT | STATMOUNT_MNT_POINT | \
STATMOUNT_FS_TYPE | STATMOUNT_MNT_OPTS | \
STATMOUNT_FS_SUBTYPE | STATMOUNT_SB_SOURCE | \
STATMOUNT_OPT_ARRAY | STATMOUNT_OPT_SEC_ARRAY | \
STATMOUNT_MNT_UIDMAP | STATMOUNT_MNT_GIDMAP)
static int prepare_kstatmount(struct kstatmount *ks, struct mnt_id_req *kreq,
struct statmount __user *buf, size_t bufsize,
size_t seq_size)
{
if (!access_ok(buf, bufsize))
return -EFAULT;
memset(ks, 0, sizeof(*ks));
ks->mask = kreq->param;
ks->buf = buf;
ks->bufsize = bufsize;
if (ks->mask & STATMOUNT_STRING_REQ) {
if (bufsize == sizeof(ks->sm))
return -EOVERFLOW;
ks->seq.buf = kvmalloc(seq_size, GFP_KERNEL_ACCOUNT);
if (!ks->seq.buf)
return -ENOMEM;
ks->seq.size = seq_size;
}
return 0;
}
static int copy_mnt_id_req(const struct mnt_id_req __user *req,
struct mnt_id_req *kreq, unsigned int flags)
{
int ret;
size_t usize;
BUILD_BUG_ON(sizeof(struct mnt_id_req) != MNT_ID_REQ_SIZE_VER1);
ret = get_user(usize, &req->size);
if (ret)
return -EFAULT;
if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
if (unlikely(usize < MNT_ID_REQ_SIZE_VER0))
return -EINVAL;
memset(kreq, 0, sizeof(*kreq));
ret = copy_struct_from_user(kreq, sizeof(*kreq), req, usize);
if (ret)
return ret;
if (flags & STATMOUNT_BY_FD) {
if (kreq->mnt_id || kreq->mnt_ns_id)
return -EINVAL;
} else {
if (kreq->mnt_ns_fd != 0 && kreq->mnt_ns_id)
return -EINVAL;
/* The first valid unique mount id is MNT_UNIQUE_ID_OFFSET + 1. */
if (kreq->mnt_id <= MNT_UNIQUE_ID_OFFSET)
return -EINVAL;
}
return 0;
}
/*
* If the user requested a specific mount namespace id, look that up and return
* that, or if not simply grab a passive reference on our mount namespace and
* return that.
*/
static struct mnt_namespace *grab_requested_mnt_ns(const struct mnt_id_req *kreq)
{
struct mnt_namespace *mnt_ns;
if (kreq->mnt_ns_id) {
mnt_ns = lookup_mnt_ns(kreq->mnt_ns_id);
if (!mnt_ns)
return ERR_PTR(-ENOENT);
} else if (kreq->mnt_ns_fd) {
struct ns_common *ns;
CLASS(fd, f)(kreq->mnt_ns_fd);
if (fd_empty(f))
return ERR_PTR(-EBADF);
if (!proc_ns_file(fd_file(f)))
return ERR_PTR(-EINVAL);
ns = get_proc_ns(file_inode(fd_file(f)));
if (ns->ns_type != CLONE_NEWNS)
return ERR_PTR(-EINVAL);
mnt_ns = to_mnt_ns(ns);
refcount_inc(&mnt_ns->passive);
} else {
mnt_ns = current->nsproxy->mnt_ns;
refcount_inc(&mnt_ns->passive);
}
return mnt_ns;
}
SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req,
struct statmount __user *, buf, size_t, bufsize,
unsigned int, flags)
{
struct mnt_namespace *ns __free(mnt_ns_release) = NULL;
struct kstatmount *ks __free(kfree) = NULL;
struct file *mnt_file __free(fput) = NULL;
struct mnt_id_req kreq;
/* We currently support retrieval of 3 strings. */
size_t seq_size = 3 * PATH_MAX;
int ret;
if (flags & ~STATMOUNT_BY_FD)
return -EINVAL;
ret = copy_mnt_id_req(req, &kreq, flags);
if (ret)
return ret;
if (flags & STATMOUNT_BY_FD) {
mnt_file = fget_raw(kreq.mnt_fd);
if (!mnt_file)
return -EBADF;
/* do_statmount sets ns in case of STATMOUNT_BY_FD */
} else {
ns = grab_requested_mnt_ns(&kreq);
if (IS_ERR(ns))
return PTR_ERR(ns);
if (kreq.mnt_ns_id && (ns != current->nsproxy->mnt_ns) &&
!ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
}
ks = kmalloc(sizeof(*ks), GFP_KERNEL_ACCOUNT);
if (!ks)
return -ENOMEM;
retry:
ret = prepare_kstatmount(ks, &kreq, buf, bufsize, seq_size);
if (ret)
return ret;
scoped_guard(namespace_shared)
ret = do_statmount(ks, kreq.mnt_id, kreq.mnt_ns_id, mnt_file, ns);
if (!ret)
ret = copy_statmount_to_user(ks);
kvfree(ks->seq.buf);
path_put(&ks->root);
if (retry_statmount(ret, &seq_size))
goto retry;
return ret;
}
struct klistmount {
u64 last_mnt_id;
u64 mnt_parent_id;
u64 *kmnt_ids;
u32 nr_mnt_ids;
struct mnt_namespace *ns;
struct path root;
};
/* locks: namespace_shared */
static ssize_t do_listmount(struct klistmount *kls, bool reverse)
{
struct mnt_namespace *ns = kls->ns;
u64 mnt_parent_id = kls->mnt_parent_id;
u64 last_mnt_id = kls->last_mnt_id;
u64 *mnt_ids = kls->kmnt_ids;
size_t nr_mnt_ids = kls->nr_mnt_ids;
struct path orig;
struct mount *r, *first;
ssize_t ret;
rwsem_assert_held(&namespace_sem);
ret = grab_requested_root(ns, &kls->root);
if (ret)
return ret;
if (mnt_parent_id == LSMT_ROOT) {
orig = kls->root;
} else {
orig.mnt = lookup_mnt_in_ns(mnt_parent_id, ns);
if (!orig.mnt)
return -ENOENT;
orig.dentry = orig.mnt->mnt_root;
}
/*
* Don't trigger audit denials. We just want to determine what
* mounts to show users.
*/
if (!is_path_reachable(real_mount(orig.mnt), orig.dentry, &kls->root) &&
!ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
ret = security_sb_statfs(orig.dentry);
if (ret)
return ret;
if (!last_mnt_id) {
if (reverse)
first = node_to_mount(ns->mnt_last_node);
else
first = node_to_mount(ns->mnt_first_node);
} else {
if (reverse)
first = mnt_find_id_at_reverse(ns, last_mnt_id - 1);
else
first = mnt_find_id_at(ns, last_mnt_id + 1);
}
for (ret = 0, r = first; r && nr_mnt_ids; r = listmnt_next(r, reverse)) {
if (r->mnt_id_unique == mnt_parent_id)
continue;
if (!is_path_reachable(r, r->mnt.mnt_root, &orig))
continue;
*mnt_ids = r->mnt_id_unique;
mnt_ids++;
nr_mnt_ids--;
ret++;
}
return ret;
}
static void __free_klistmount_free(const struct klistmount *kls)
{
path_put(&kls->root);
kvfree(kls->kmnt_ids);
mnt_ns_release(kls->ns);
}
static inline int prepare_klistmount(struct klistmount *kls, struct mnt_id_req *kreq,
size_t nr_mnt_ids)
{
u64 last_mnt_id = kreq->param;
struct mnt_namespace *ns;
/* The first valid unique mount id is MNT_UNIQUE_ID_OFFSET + 1. */
if (last_mnt_id != 0 && last_mnt_id <= MNT_UNIQUE_ID_OFFSET)
return -EINVAL;
kls->last_mnt_id = last_mnt_id;
kls->nr_mnt_ids = nr_mnt_ids;
kls->kmnt_ids = kvmalloc_array(nr_mnt_ids, sizeof(*kls->kmnt_ids),
GFP_KERNEL_ACCOUNT);
if (!kls->kmnt_ids)
return -ENOMEM;
ns = grab_requested_mnt_ns(kreq);
if (IS_ERR(ns))
return PTR_ERR(ns);
kls->ns = ns;
kls->mnt_parent_id = kreq->mnt_id;
return 0;
}
SYSCALL_DEFINE4(listmount, const struct mnt_id_req __user *, req,
u64 __user *, mnt_ids, size_t, nr_mnt_ids, unsigned int, flags)
{
struct klistmount kls __free(klistmount_free) = {};
const size_t maxcount = 1000000;
struct mnt_id_req kreq;
ssize_t ret;
if (flags & ~LISTMOUNT_REVERSE)
return -EINVAL;
/*
* If the mount namespace really has more than 1 million mounts the
* caller must iterate over the mount namespace (and reconsider their
* system design...).
*/
if (unlikely(nr_mnt_ids > maxcount))
return -EOVERFLOW;
if (!access_ok(mnt_ids, nr_mnt_ids * sizeof(*mnt_ids)))
return -EFAULT;
ret = copy_mnt_id_req(req, &kreq, 0);
if (ret)
return ret;
ret = prepare_klistmount(&kls, &kreq, nr_mnt_ids);
if (ret)
return ret;
if (kreq.mnt_ns_id && (kls.ns != current->nsproxy->mnt_ns) &&
!ns_capable_noaudit(kls.ns->user_ns, CAP_SYS_ADMIN))
return -ENOENT;
/*
* We only need to guard against mount topology changes as
* listmount() doesn't care about any mount properties.
*/
scoped_guard(namespace_shared)
ret = do_listmount(&kls, (flags & LISTMOUNT_REVERSE));
if (ret <= 0)
return ret;
if (copy_to_user(mnt_ids, kls.kmnt_ids, ret * sizeof(*mnt_ids)))
return -EFAULT;
return ret;
}
struct mnt_namespace init_mnt_ns = {
.ns = NS_COMMON_INIT(init_mnt_ns),
.user_ns = &init_user_ns,
.passive = REFCOUNT_INIT(1),
.mounts = RB_ROOT,
.poll = __WAIT_QUEUE_HEAD_INITIALIZER(init_mnt_ns.poll),
};
static void __init init_mount_tree(void)
{
struct vfsmount *mnt, *nullfs_mnt;
struct mount *mnt_root;
struct path root;
/*
* We create two mounts:
*
* (1) nullfs with mount id 1
* (2) mutable rootfs with mount id 2
*
* with (2) mounted on top of (1).
*/
nullfs_mnt = vfs_kern_mount(&nullfs_fs_type, 0, "nullfs", NULL);
if (IS_ERR(nullfs_mnt))
panic("VFS: Failed to create nullfs");
mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", initramfs_options);
if (IS_ERR(mnt))
panic("Can't create rootfs");
VFS_WARN_ON_ONCE(real_mount(nullfs_mnt)->mnt_id != 1);
VFS_WARN_ON_ONCE(real_mount(mnt)->mnt_id != 2);
/* The namespace root is the nullfs mnt. */
mnt_root = real_mount(nullfs_mnt);
init_mnt_ns.root = mnt_root;
/* Mount mutable rootfs on top of nullfs. */
root.mnt = nullfs_mnt;
root.dentry = nullfs_mnt->mnt_root;
LOCK_MOUNT_EXACT(mp, &root);
if (unlikely(IS_ERR(mp.parent)))
panic("VFS: Failed to mount rootfs on nullfs");
scoped_guard(mount_writer)
attach_mnt(real_mount(mnt), mp.parent, mp.mp);
pr_info("VFS: Finished mounting rootfs on nullfs\n");
/*
* We've dropped all locks here but that's fine. Not just are we
* the only task that's running, there's no other mount
* namespace in existence and the initial mount namespace is
* completely empty until we add the mounts we just created.
*/
for (struct mount *p = mnt_root; p; p = next_mnt(p, mnt_root)) {
mnt_add_to_ns(&init_mnt_ns, p);
init_mnt_ns.nr_mounts++;
}
init_task.nsproxy->mnt_ns = &init_mnt_ns;
get_mnt_ns(&init_mnt_ns);
/* The root and pwd always point to the mutable rootfs. */
root.mnt = mnt;
root.dentry = mnt->mnt_root;
set_fs_pwd(current->fs, &root);
set_fs_root(current->fs, &root);
ns_tree_add(&init_mnt_ns);
}
void __init mnt_init(void)
{
int err;
mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
mount_hashtable = alloc_large_system_hash("Mount-cache",
sizeof(struct hlist_head),
mhash_entries, 19,
HASH_ZERO,
&m_hash_shift, &m_hash_mask, 0, 0);
mountpoint_hashtable = alloc_large_system_hash("Mountpoint-cache",
sizeof(struct hlist_head),
mphash_entries, 19,
HASH_ZERO,
&mp_hash_shift, &mp_hash_mask, 0, 0);
if (!mount_hashtable || !mountpoint_hashtable)
panic("Failed to allocate mount hash table\n");
kernfs_init();
err = sysfs_init();
if (err)
printk(KERN_WARNING "%s: sysfs_init error: %d\n",
__func__, err);
fs_kobj = kobject_create_and_add("fs", NULL);
if (!fs_kobj)
printk(KERN_WARNING "%s: kobj create error\n", __func__);
shmem_init();
init_rootfs();
init_mount_tree();
}
void put_mnt_ns(struct mnt_namespace *ns)
{
if (!ns_ref_put(ns))
return;
guard(namespace_excl)();
emptied_ns = ns;
guard(mount_writer)();
umount_tree(ns->root, 0);
}
struct vfsmount *kern_mount(struct file_system_type *type)
{
struct vfsmount *mnt;
mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
if (!IS_ERR(mnt)) {
/*
* it is a longterm mount, don't release mnt until
* we unmount before file sys is unregistered
*/
real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
}
return mnt;
}
EXPORT_SYMBOL_GPL(kern_mount);
void kern_unmount(struct vfsmount *mnt)
{
/* release long term mount so mount point can be released */
if (!IS_ERR(mnt)) {
mnt_make_shortterm(mnt);
synchronize_rcu(); /* yecchhh... */
mntput(mnt);
}
}
EXPORT_SYMBOL(kern_unmount);
void kern_unmount_array(struct vfsmount *mnt[], unsigned int num)
{
unsigned int i;
for (i = 0; i < num; i++)
mnt_make_shortterm(mnt[i]);
synchronize_rcu_expedited();
for (i = 0; i < num; i++)
mntput(mnt[i]);
}
EXPORT_SYMBOL(kern_unmount_array);
bool our_mnt(struct vfsmount *mnt)
{
return check_mnt(real_mount(mnt));
}
bool current_chrooted(void)
{
/* Does the current process have a non-standard root */
struct path fs_root __free(path_put) = {};
struct mount *root;
get_fs_root(current->fs, &fs_root);
/* Find the namespace root */
guard(mount_locked_reader)();
root = topmost_overmount(current->nsproxy->mnt_ns->root);
return fs_root.mnt != &root->mnt || !path_mounted(&fs_root);
}
static bool mnt_already_visible(struct mnt_namespace *ns,
const struct super_block *sb,
int *new_mnt_flags)
{
int new_flags = *new_mnt_flags;
struct mount *mnt, *n;
guard(namespace_shared)();
rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) {
struct mount *child;
int mnt_flags;
if (mnt->mnt.mnt_sb->s_type != sb->s_type)
continue;
/* This mount is not fully visible if its root directory
* is not the root directory of the filesystem.
*/
if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
continue;
/* A local view of the mount flags */
mnt_flags = mnt->mnt.mnt_flags;
/* Don't miss readonly hidden in the superblock flags */
if (sb_rdonly(mnt->mnt.mnt_sb))
mnt_flags |= MNT_LOCK_READONLY;
/* Verify the mount flags are equal to or more permissive
* than the proposed new mount.
*/
if ((mnt_flags & MNT_LOCK_READONLY) &&
!(new_flags & MNT_READONLY))
continue;
if ((mnt_flags & MNT_LOCK_ATIME) &&
((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
continue;
/* This mount is not fully visible if there are any
* locked child mounts that cover anything except for
* empty directories.
*/
list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
struct inode *inode = child->mnt_mountpoint->d_inode;
/* Only worry about locked mounts */
if (!(child->mnt.mnt_flags & MNT_LOCKED))
continue;
/* Is the directory permanently empty? */
if (!is_empty_dir_inode(inode))
goto next;
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
MNT_LOCK_ATIME);
return true;
next: ;
}
return false;
}
static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
{
const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
struct mnt_namespace *ns = current->nsproxy->mnt_ns;
unsigned long s_iflags;
if (ns->user_ns == &init_user_ns)
return false;
/* Can this filesystem be too revealing? */
s_iflags = sb->s_iflags;
if (!(s_iflags & SB_I_USERNS_VISIBLE))
return false;
if ((s_iflags & required_iflags) != required_iflags) {
WARN_ONCE(1, "Expected s_iflags to contain 0x%lx\n",
required_iflags);
return true;
}
return !mnt_already_visible(ns, sb, new_mnt_flags);
}
bool mnt_may_suid(struct vfsmount *mnt)
{
/*
* Foreign mounts (accessed via fchdir or through /proc
* symlinks) are always treated as if they are nosuid. This
* prevents namespaces from trusting potentially unsafe
* suid/sgid bits, file caps, or security labels that originate
* in other namespaces.
*/
return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) &&
current_in_userns(mnt->mnt_sb->s_user_ns);
}
static struct ns_common *mntns_get(struct task_struct *task)
{
struct ns_common *ns = NULL;
struct nsproxy *nsproxy;
task_lock(task);
nsproxy = task->nsproxy;
if (nsproxy) {
ns = &nsproxy->mnt_ns->ns;
get_mnt_ns(to_mnt_ns(ns));
}
task_unlock(task);
return ns;
}
static void mntns_put(struct ns_common *ns)
{
put_mnt_ns(to_mnt_ns(ns));
}
static int mntns_install(struct nsset *nsset, struct ns_common *ns)
{
struct nsproxy *nsproxy = nsset->nsproxy;
struct fs_struct *fs = nsset->fs;
struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns;
struct user_namespace *user_ns = nsset->cred->user_ns;
struct path root;
int err;
if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
!ns_capable(user_ns, CAP_SYS_CHROOT) ||
!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (is_anon_ns(mnt_ns))
return -EINVAL;
if (fs->users != 1)
return -EINVAL;
get_mnt_ns(mnt_ns);
old_mnt_ns = nsproxy->mnt_ns;
nsproxy->mnt_ns = mnt_ns;
/* Find the root */
err = vfs_path_lookup(mnt_ns->root->mnt.mnt_root, &mnt_ns->root->mnt,
"/", LOOKUP_DOWN, &root);
if (err) {
/* revert to old namespace */
nsproxy->mnt_ns = old_mnt_ns;
put_mnt_ns(mnt_ns);
return err;
}
put_mnt_ns(old_mnt_ns);
/* Update the pwd and root */
set_fs_pwd(fs, &root);
set_fs_root(fs, &root);
path_put(&root);
return 0;
}
static struct user_namespace *mntns_owner(struct ns_common *ns)
{
return to_mnt_ns(ns)->user_ns;
}
const struct proc_ns_operations mntns_operations = {
.name = "mnt",
.get = mntns_get,
.put = mntns_put,
.install = mntns_install,
.owner = mntns_owner,
};
#ifdef CONFIG_SYSCTL
static const struct ctl_table fs_namespace_sysctls[] = {
{
.procname = "mount-max",
.data = &sysctl_mount_max,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
};
static int __init init_fs_namespace_sysctls(void)
{
register_sysctl_init("fs", fs_namespace_sysctls);
return 0;
}
fs_initcall(init_fs_namespace_sysctls);
#endif /* CONFIG_SYSCTL */
] |
| 48/2 | 2026/02/06 12:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 122244 output: 19 thoughts: 5411

Thoughts:

**Actionable UAF Report: Namespace Root Dangling Pointer**

Okay, I've thoroughly reviewed the file and the provided information. Here's my assessment: There's a critical Use-After-Free (UAF) vulnerability in the Linux kernel's mount namespace handling. It stems from a race condition where a mount point, designated as the root of a namespace, gets unmounted before the namespace itself is destroyed, leaving a dangling pointer.

Specifically, I see that the `put_mnt_ns` function, responsible for destroying a mount namespace, calls `umount_tree` on `ns->root`. The problem lies in the fact that there's no code preventing the unmounting of `ns->root` *before* the namespace is destroyed. While there's a check in `do_umount` that prevents unmounting the root *unless* `MNT_DETACH` is set, `umount -l /` with appropriate privileges *will* trigger this detached unmount.

The scenario that triggers the bug is this: If `umount -l /` is executed on the mount that is currently set as `ns->root`, `umount_tree` is called, which clears `ns->root->mnt_ns` and calls `mntput`. This means the `struct mount` can be freed. However, the `ns->root` pointer in the `mnt_namespace` structure *is not* cleared. Then, when `copy_mnt_ns` is subsequently called during `unshare(CLONE_NEWNS)`, it attempts to use this dangling `ns->root` pointer.

I've traced the places where `ns->root` is set, and it seems like the root of a mount namespace is not properly tracked. Further investigation revealed a critical detail involving the `__detach_mounts` function, which is called when a dentry is invalidated. This invalidation can happen during task exit, for example, when `/proc/<pid>` is removed. This means that if a mount is attached to `/proc/<pid>/ns/mnt` *and* used as the root of a namespace, the mount can be unexpectedly unmounted upon the task's exit via `__detach_mounts`.

The key is that `__detach_mounts` will unmount mounts *attached to* a dentry, and a namespace's root *could* be one. Specifically, the report perfectly matches a situation where:

1. A task (Task A) is running.
2. Another task (Task B) creates a new mount `M1`.
3. Task B *bind-mounts* `M1` onto a dentry in `/proc/A` (e.g. `/proc/A/ns/mnt`). This is a critical step because now `M1` is "attached" to a dentry.
4. Task B then creates a new namespace `NS1`, using `M1` as its root, for example by calling `open_tree` using `/proc/A/ns/mnt` (or another path within Task A).
5. Task B then gets the new namespace, `NS1`, assigned to it via e.g. `setns`.
6. Task A exits.
7. `release_task` on A calls `proc_invalidate_siblings_dcache`.
8. This calls `d_invalidate` on `/proc/A/ns/mnt`.
9. `d_invalidate` calls `__detach_mounts`.
10. `__detach_mounts` unmounts `M1`.
11. `M1->mnt_ns` is cleared and `mntput(M1)` is called, eventually freeing `M1`.
12. `NS1->root` *still* points to the now-freed `M1`.
13. Task B calls `unshare(CLONE_NEWNS)`.
14. `copy_mnt_ns` accesses the dangling `NS1->root`. UAF!

This UAF scenario matches all the clues in the report, including RCU-based freeing, `__detach_mounts` in the free path, and `copy_mnt_ns` and `unshare` in the access path. The key issue is `__detach_mounts` unintentionally unmounting a mount point that has become the root of a namespace.

Specifically, `copy_mnt_ns` is called from `create_new_namespaces` from `unshare`. This means it's cloning the *current* mount namespace. If the current mount namespace has a root that is dangling, then `unshare` triggers the UAF.

I've examined the stack traces and confirm that the allocation, free, and access paths all point |
| 49/2 | 2026/02/06 12:36 | tool | set-results |
0mArgs:map[Actionable:true Confident:true] Results: map[Actionable:true Confident:true] |
| 50/2 | 2026/02/06 12:36 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 128716 output: 562 thoughts: 159

Thoughts:

**My Assessment of the Report**

Alright, after reviewing this, I'm confident in saying the report presents a valid and actionable finding. It all hangs together: it's self-consistent. The core of the issue seems to be a classic use-after-free vulnerability, specifically targeting the `struct mount` object. That object acts as the root of a mount namespace, pointed to by `ns->root`. The triggering event seems to be the `unshare(CLONE_NEWNS)` system call, which attempts to clone the current mount namespace. The crux is that the root mount can be unmounted and freed, likely through a call to `__detach_mounts` as part of a task exit or something similar, while the namespace itself is still active. This leaves `ns->root` as a dangling pointer. Pretty straightforward, and a classic concurrency pitfall.

The allocation, freeing, and subsequent access stack traces all align perfectly with this description, which helps to cement my confidence in the analysis. Furthermore, the traces related to "potentially related work creation" are particularly insightful, giving a precise roadmap of how the problematic mount was unmounted and scheduled for freeing. That makes this significantly more diagnosable. |
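The difference between a pinning and a non-pinning `ns->root` pointer, as described in the assessment above, can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name in it (`toy_mount`, `toy_ns`, `toy_mntget`, `toy_mntput`, `toy_detach`, `toy_scenario`) is hypothetical and not a kernel API, the real kernel frees the mount through RCU rather than by setting a flag, and the "pinned" variant is shown purely for contrast, not as the fix adopted upstream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mount: a refcount plus a "freed" flag,
 * so the model can detect a stale pointer without touching freed memory. */
struct toy_mount {
	int refcount;
	bool freed;
};

/* Hypothetical stand-in for struct mnt_namespace. */
struct toy_ns {
	struct toy_mount *root;	/* may or may not own a reference */
};

static void toy_mntget(struct toy_mount *m)
{
	m->refcount++;
}

static void toy_mntput(struct toy_mount *m)
{
	if (--m->refcount == 0)
		m->freed = true;	/* the real kernel would free via RCU here */
}

/* Record the namespace root, optionally taking a pinning reference. */
static void toy_ns_set_root(struct toy_ns *ns, struct toy_mount *m, bool pin)
{
	ns->root = m;
	if (pin)
		toy_mntget(m);
}

/* Models the detach path: drops the mount tree's reference but leaves
 * ns->root untouched, as in the scenario described in the report. */
static void toy_detach(struct toy_mount *m)
{
	toy_mntput(m);
}

/* True if ns->root points at a mount whose last reference is gone. */
static bool toy_root_is_dangling(const struct toy_ns *ns)
{
	return ns->root != NULL && ns->root->freed;
}

/* Runs set-root followed by detach; returns whether the root dangles. */
static bool toy_scenario(bool pin)
{
	struct toy_mount m = { .refcount = 1, .freed = false };
	struct toy_ns ns = { .root = NULL };

	toy_ns_set_root(&ns, &m, pin);
	toy_detach(&m);
	return toy_root_is_dangling(&ns);
}
```

In this model, `toy_scenario(false)` reports a dangling root because the detach drops the only reference, while `toy_scenario(true)` does not because the namespace's own reference keeps the count above zero.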