syzbot


ID: 982b89e5-2f52-49a5-a022-9678d450ee0d
Workflow: assessment-kcsan
Result: Benign: āœ…  Confident: āœ…
Correct: ā“
Bug: KCSAN: data-race in __dentry_kill / __dentry_kill (7)
Created: 2026/02/02 12:41
Started: 2026/02/02 12:41
Finished: 2026/02/02 12:55
Revision: 2186dcabcd743737b50c1a9f99a8bf0d3a5a7914

Explanation:
The KCSAN report shows a data race on the `d_flags` field of a `struct dentry` between two tasks, both executing the `__dentry_kill` function.
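
For context, `d_flags` is a plain `unsigned int` near the top of `struct dentry`, which matches the 4-byte accesses in the report. A minimal sketch of the field (paraphrased from include/linux/dcache.h, not taken from this trajectory's listing):

```c
struct dentry {
	unsigned int d_flags;		/* protected by d_lock */
	/* ... remaining fields elided ... */
};
```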

### Analysis of the Race

1.  **Task 3002** is executing `d_lru_del` (inlined in `__dentry_kill`), which performs the following operation:
    ```c
    dentry->d_flags &= ~DCACHE_LRU_LIST;
    ```
    This is a write to `d_flags` to clear the `DCACHE_LRU_LIST` bit.

2.  **Task 3304** is executing `dentry_unlist` (also inlined in `__dentry_kill`), which performs:
    ```c
    dentry->d_flags |= DCACHE_DENTRY_KILLED;
    ```
    This is a read-modify-write operation to set the `DCACHE_DENTRY_KILLED` bit.

The race occurs because `d_flags` is updated non-atomically. If Task 3304 reads the flags, then Task 3002 writes the flags (clearing `DCACHE_LRU_LIST`), and then Task 3304 writes back its modified value (setting `DCACHE_DENTRY_KILLED`), the change made by Task 3002 (clearing `DCACHE_LRU_LIST`) could be lost.
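
To make the lost-update window concrete, here is a minimal user-space sketch (with placeholder bit values, not the real `dcache.h` constants) that replays this interleaving step by step:

```c
#include <stdio.h>

#define DCACHE_LRU_LIST      0x1	/* placeholder values; the real */
#define DCACHE_DENTRY_KILLED 0x2	/* constants live in dcache.h   */

int main(void)
{
	unsigned int d_flags = DCACHE_LRU_LIST;

	/* Task 3304 begins "d_flags |= DCACHE_DENTRY_KILLED": it loads d_flags. */
	unsigned int tmp_3304 = d_flags;

	/* Task 3002 runs "d_flags &= ~DCACHE_LRU_LIST" in the window. */
	d_flags &= ~DCACHE_LRU_LIST;

	/* Task 3304 stores its stale modified value: task 3002's clear is lost
	 * and DCACHE_LRU_LIST reappears. */
	d_flags = tmp_3304 | DCACHE_DENTRY_KILLED;

	printf("final d_flags = %#x (DCACHE_LRU_LIST %s)\n", d_flags,
	       d_flags & DCACHE_LRU_LIST ? "resurrected" : "cleared");
	return 0;
}
```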

### Why the Race Occurs

Both tasks are in `__dentry_kill` for the same dentry. This typically happens during a "double-kill" scenario or when a dentry is being invalidated while another task is already killing it. In `__dentry_kill`, the dentry's lock (`d_lock`) is dropped and re-acquired around a `cond_resched()` call. Specifically, `d_lru_del` is called before the lock is dropped, and `dentry_unlist` is called after it is re-acquired.

If Task A is in `__dentry_kill`, it clears `DCACHE_LRU_LIST`, drops the lock, and reaches `cond_resched()`. If Task B then enters `__dentry_kill` for the same dentry (which can happen if reference counting or invalidation logic allows it), it will also try to clear `DCACHE_LRU_LIST`. When Task A resumes and enters `dentry_unlist`, it will set `DCACHE_DENTRY_KILLED`.
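
For reference, here is a trimmed excerpt of `__dentry_kill()` (condensed from the full fs/dcache.c listing fetched later in this trajectory) showing where each racing access sits relative to the lock drop:

```c
static struct dentry *__dentry_kill(struct dentry *dentry)
{
	/* ... */
	if (dentry->d_flags & DCACHE_LRU_LIST) {
		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
			d_lru_del(dentry);	/* write: clears DCACHE_LRU_LIST */
	}
	__d_drop(dentry);			/* unhash */
	/* ... d_lock dropped (via dentry_unlink_inode() or spin_unlock()) ... */
	cond_resched();
	/* ... parent->d_lock taken for non-root dentries ... */
	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
	dentry_unlist(dentry);			/* RMW: sets DCACHE_DENTRY_KILLED */
	/* ... */
}
```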

### Why the Race is Benign

This data race is considered benign for the following reasons:

1.  **Dentry Destruction State**: The race occurs while the dentry is already being destroyed (`__dentry_kill`). At this stage, the dentry is no longer reachable via normal lookups (it has been unhashed) and its reference count is effectively zero (marked dead).
2.  **Non-Overlapping Bits**: The bits being modified (`DCACHE_LRU_LIST` and `DCACHE_DENTRY_KILLED`) are distinct. Even if one update overwrites the other, the resulting state of the flags is for a dentry that is about to be freed.
3.  **No Functional Impact**:
    *   If `DCACHE_LRU_LIST` remains set due to the race, it doesn't matter because the dentry is being removed from the LRU anyway and will be freed shortly.
    *   If `DCACHE_DENTRY_KILLED` is not set correctly, it might affect `d_walk()`, but `d_walk()` also checks if the dentry is unhashed (`d_unhashed`), which happens early in `__dentry_kill`.
4.  **List Integrity**: The actual list operations (`list_lru_del_obj` and `__hlist_del`) are protected by checks (like `hlist_unhashed`) or return values that prevent corruption even if called multiple times; see the trimmed excerpts after this list.
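
Both of the guards referenced in points 3 and 4 appear in the fs/dcache.c listing fetched later in this trajectory; trimmed excerpts:

```c
/* dentry_unlist(): a second unlisting is a no-op thanks to this check. */
dentry->d_flags |= DCACHE_DENTRY_KILLED;
if (unlikely(hlist_unhashed(&dentry->d_sib)))
	return;
__hlist_del(&dentry->d_sib);

/* d_walk(), while ascending: killed siblings are skipped. */
hlist_for_each_entry_continue(dentry, d_sib) {
	if (likely(!(dentry->d_flags & DCACHE_DENTRY_KILLED))) {
		rcu_read_unlock();
		goto resume;
	}
}
```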

In summary, while the race indicates that two tasks are concurrently tearing down the same dentry (which might be a separate logic issue), the resulting data race on the `d_flags` field itself does not lead to memory corruption or kernel instability.

Conclusion: Benign.

Crash report:
==================================================================
BUG: KCSAN: data-race in __dentry_kill / __dentry_kill

write to 0xffff88811a9159c0 of 4 bytes by task 3002 on cpu 1:
 d_lru_del fs/dcache.c:504 [inline]
 __dentry_kill+0xa8/0x460 fs/dcache.c:665
 finish_dput+0x2b/0x200 fs/dcache.c:879
 dput+0x52/0x60 fs/dcache.c:928
 lookup_fast+0x29e/0x320 fs/namei.c:1832
 walk_component fs/namei.c:2225 [inline]
 lookup_last fs/namei.c:2730 [inline]
 path_lookupat+0x15c/0x500 fs/namei.c:2754
 filename_lookup+0x190/0x390 fs/namei.c:2783
 do_readlinkat+0x7d/0x340 fs/stat.c:576
 __do_sys_readlink fs/stat.c:613 [inline]
 __se_sys_readlink fs/stat.c:610 [inline]
 __x64_sys_readlink+0x47/0x60 fs/stat.c:610
 x64_sys_call+0x2af1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:90
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88811a9159c0 of 4 bytes by task 3304 on cpu 0:
 dentry_unlist fs/dcache.c:640 [inline]
 __dentry_kill+0x25c/0x460 fs/dcache.c:684
 finish_dput+0x2b/0x200 fs/dcache.c:879
 dput+0x52/0x60 fs/dcache.c:928
 lookup_fast+0x29e/0x320 fs/namei.c:1832
 walk_component fs/namei.c:2225 [inline]
 lookup_last fs/namei.c:2730 [inline]
 path_lookupat+0x15c/0x500 fs/namei.c:2754
 filename_lookup+0x190/0x390 fs/namei.c:2783
 user_path_at+0x3e/0x130 fs/namei.c:3576
 do_faccessat+0x391/0x830 fs/open.c:490
 __do_sys_faccessat2 fs/open.c:544 [inline]
 __se_sys_faccessat2 fs/open.c:541 [inline]
 __x64_sys_faccessat2+0x51/0x60 fs/open.c:541
 x64_sys_call+0x2c1a/0x3000 arch/x86/include/generated/asm/syscalls_64.h:440
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00140004 -> 0x00000004

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 3304 Comm: udevd Tainted: G        W           syzkaller #0 PREEMPT(voluntary) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
==================================================================

Trajectory:
Seq Timestamp Type Name Duration
0/0 2026/02/02 12:41 flow assessment-kcsan 13m
Results:
map[Benign:true Confident:true Explanation:The KCSAN report shows a data race on the `d_flags` field of a `struct dentry` between two tasks, both executing the `__dentry_kill` function.

### Analysis of the Race

1.  **Task 3002** is executing `d_lru_del` (inlined in `__dentry_kill`), which performs the following operation:
    ```c
    dentry->d_flags &= ~DCACHE_LRU_LIST;
    ```
    This is a write to `d_flags` to clear the `DCACHE_LRU_LIST` bit.

2.  **Task 3304** is executing `dentry_unlist` (also inlined in `__dentry_kill`), which performs:
    ```c
    dentry->d_flags |= DCACHE_DENTRY_KILLED;
    ```
    This is a read-modify-write operation to set the `DCACHE_DENTRY_KILLED` bit.

The race occurs because `d_flags` is updated non-atomically. If Task 3304 reads the flags, then Task 3002 writes the flags (clearing `DCACHE_LRU_LIST`), and then Task 3304 writes back its modified value (setting `DCACHE_DENTRY_KILLED`), the change made by Task 3002 (clearing `DCACHE_LRU_LIST`) could be lost.

### Why the Race Occurs

Both tasks are in `__dentry_kill` for the same dentry. This typically happens during a "double-kill" scenario or when a dentry is being invalidated while another task is already killing it. In `__dentry_kill`, the dentry's lock (`d_lock`) is dropped and re-acquired around a `cond_resched()` call. Specifically, `d_lru_del` is called before the lock is dropped, and `dentry_unlist` is called after it is re-acquired.

If Task A is in `__dentry_kill`, it clears `DCACHE_LRU_LIST`, drops the lock, and reaches `cond_resched()`. If Task B then enters `__dentry_kill` for the same dentry (which can happen if reference counting or invalidation logic allows it), it will also try to clear `DCACHE_LRU_LIST`. When Task A resumes and enters `dentry_unlist`, it will set `DCACHE_DENTRY_KILLED`.

### Why the Race is Benign

This data race is considered benign for the following reasons:

1.  **Dentry Destruction State**: The race occurs while the dentry is already being destroyed (`__dentry_kill`). At this stage, the dentry is no longer reachable via normal lookups (it has been unhashed) and its reference count is effectively zero (marked dead).
2.  **Non-Overlapping Bits**: The bits being modified (`DCACHE_LRU_LIST` and `DCACHE_DENTRY_KILLED`) are distinct. Even if one update overwrites the other, the resulting state of the flags is for a dentry that is about to be freed.
3.  **No Functional Impact**:
    *   If `DCACHE_LRU_LIST` remains set due to the race, it doesn't matter because the dentry is being removed from the LRU anyway and will be freed shortly.
    *   If `DCACHE_DENTRY_KILLED` is not set correctly, it might affect `d_walk()`, but `d_walk()` also checks if the dentry is unhashed (`d_unhashed`), which happens early in `__dentry_kill`.
4.  **List Integrity**: The actual list operations (`list_lru_del_obj` and `__hlist_del`) are protected by checks (like `hlist_unhashed`) or return values that prevent corruption even if called multiple times.

In summary, while the race indicates that two tasks are concurrently tearing down the same dentry (which might be a separate logic issue), the resulting data race on the `d_flags` field itself does not lead to memory corruption or kernel instability.

Conclusion: Benign.]

1/1 2026/02/02 12:41 action kernel-checkouter 0m
Results:
map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/442656ac8be766ac08d507e9d55d7e172d7ca2ec]

2/1 2026/02/02 12:42 action kernel-builder 4m
Results:
map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/d5e7fe031ef17fcd9d75f07422941eb5162b2fe8]

3/1 2026/02/02 12:46 action codesearch-prepare 2m
Results:
map[Index:codesearch-index]

4/1 2026/02/02 12:49 agent expert 6m
Model: gemini-3-flash-preview

Results:
map[Benign:true Confident:true]

Instruction:
You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

A "benign" data races are on a simple int/bool variable or similar field,
and the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flags fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful" data race) can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data races is race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

In the final reply, explain why you think the given data race is benign or harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.


Use the set-results tool to provide the results of the analysis.
It must be called exactly once before the final reply.
Ignore the results of this tool.

Prompt:
The data race report is:

==================================================================
BUG: KCSAN: data-race in __dentry_kill / __dentry_kill

write to 0xffff88811a9159c0 of 4 bytes by task 3002 on cpu 1:
 d_lru_del fs/dcache.c:504 [inline]
 __dentry_kill+0xa8/0x460 fs/dcache.c:665
 finish_dput+0x2b/0x200 fs/dcache.c:879
 dput+0x52/0x60 fs/dcache.c:928
 lookup_fast+0x29e/0x320 fs/namei.c:1832
 walk_component fs/namei.c:2225 [inline]
 lookup_last fs/namei.c:2730 [inline]
 path_lookupat+0x15c/0x500 fs/namei.c:2754
 filename_lookup+0x190/0x390 fs/namei.c:2783
 do_readlinkat+0x7d/0x340 fs/stat.c:576
 __do_sys_readlink fs/stat.c:613 [inline]
 __se_sys_readlink fs/stat.c:610 [inline]
 __x64_sys_readlink+0x47/0x60 fs/stat.c:610
 x64_sys_call+0x2af1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:90
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88811a9159c0 of 4 bytes by task 3304 on cpu 0:
 dentry_unlist fs/dcache.c:640 [inline]
 __dentry_kill+0x25c/0x460 fs/dcache.c:684
 finish_dput+0x2b/0x200 fs/dcache.c:879
 dput+0x52/0x60 fs/dcache.c:928
 lookup_fast+0x29e/0x320 fs/namei.c:1832
 walk_component fs/namei.c:2225 [inline]
 lookup_last fs/namei.c:2730 [inline]
 path_lookupat+0x15c/0x500 fs/namei.c:2754
 filename_lookup+0x190/0x390 fs/namei.c:2783
 user_path_at+0x3e/0x130 fs/namei.c:3576
 do_faccessat+0x391/0x830 fs/open.c:490
 __do_sys_faccessat2 fs/open.c:544 [inline]
 __se_sys_faccessat2 fs/open.c:541 [inline]
 __x64_sys_faccessat2+0x51/0x60 fs/open.c:541
 x64_sys_call+0x2c1a/0x3000 arch/x86/include/generated/asm/syscalls_64.h:440
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00140004 -> 0x00000004

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 3304 Comm: udevd Tainted: G        W           syzkaller #0 PREEMPT(voluntary) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
==================================================================


Reply:
The KCSAN report shows a data race on the `d_flags` field of a `struct dentry` between two tasks, both executing the `__dentry_kill` function.

### Analysis of the Race

1.  **Task 3002** is executing `d_lru_del` (inlined in `__dentry_kill`), which performs the following operation:
    ```c
    dentry->d_flags &= ~DCACHE_LRU_LIST;
    ```
    This is a write to `d_flags` to clear the `DCACHE_LRU_LIST` bit.

2.  **Task 3304** is executing `dentry_unlist` (also inlined in `__dentry_kill`), which performs:
    ```c
    dentry->d_flags |= DCACHE_DENTRY_KILLED;
    ```
    This is a read-modify-write operation to set the `DCACHE_DENTRY_KILLED` bit.

The race occurs because `d_flags` is updated non-atomically. If Task 3304 reads the flags, then Task 3002 writes the flags (clearing `DCACHE_LRU_LIST`), and then Task 3304 writes back its modified value (setting `DCACHE_DENTRY_KILLED`), the change made by Task 3002 (clearing `DCACHE_LRU_LIST`) could be lost.

### Why the Race Occurs

Both tasks are in `__dentry_kill` for the same dentry. This typically happens during a "double-kill" scenario or when a dentry is being invalidated while another task is already killing it. In `__dentry_kill`, the dentry's lock (`d_lock`) is dropped and re-acquired around a `cond_resched()` call. Specifically, `d_lru_del` is called before the lock is dropped, and `dentry_unlist` is called after it is re-acquired.

If Task A is in `__dentry_kill`, it clears `DCACHE_LRU_LIST`, drops the lock, and reaches `cond_resched()`. If Task B then enters `__dentry_kill` for the same dentry (which can happen if reference counting or invalidation logic allows it), it will also try to clear `DCACHE_LRU_LIST`. When Task A resumes and enters `dentry_unlist`, it will set `DCACHE_DENTRY_KILLED`.

### Why the Race is Benign

This data race is considered benign for the following reasons:

1.  **Dentry Destruction State**: The race occurs while the dentry is already being destroyed (`__dentry_kill`). At this stage, the dentry is no longer reachable via normal lookups (it has been unhashed) and its reference count is effectively zero (marked dead).
2.  **Non-Overlapping Bits**: The bits being modified (`DCACHE_LRU_LIST` and `DCACHE_DENTRY_KILLED`) are distinct. Even if one update overwrites the other, the resulting state of the flags is for a dentry that is about to be freed.
3.  **No Functional Impact**:
    *   If `DCACHE_LRU_LIST` remains set due to the race, it doesn't matter because the dentry is being removed from the LRU anyway and will be freed shortly.
    *   If `DCACHE_DENTRY_KILLED` is not set correctly, it might affect `d_walk()`, but `d_walk()` also checks if the dentry is unhashed (`d_unhashed`), which happens early in `__dentry_kill`.
4.  **List Integrity**: The actual list operations (`list_lru_del_obj` and `__hlist_del`) are protected by checks (like `hlist_unhashed`) or return values that prevent corruption even if called multiple times.

In summary, while the race indicates that two tasks are concurrently tearing down the same dentry (which might be a separate logic issue), the resulting data race on the `d_flags` field itself does not lead to memory corruption or kernel instability.

Conclusion: Benign.

5/2 2026/02/02 12:49 llm expert 0m
Model: gemini-3-flash-preview

Tokens: input: 3961, output: 21, thoughts: 285

Thoughts:
**Analyzing a Data Race in the dcache**

Okay, so the report indicates a data race within `__dentry_kill`. Specifically, it pinpoints a conflict between a write operation in `d_lru_del` and a read operation in `dentry_unlist`, both of which are inlined in `__dentry_kill`. The memory address involved is `0xffff88811a9159c0`, and the value transition from `0x00140004` to `0x00000004` is definitely significant. It looks like we're dealing with a flags field or bitmask.

My next step is to dive into the `fs/dcache.c` code to understand this. The report has been kind enough to provide the relevant line numbers, which is a great time saver. Let's see...

*   `d_lru_del`: `fs/dcache.c:504` (inlined)
*   `__dentry_kill`: `fs/dcache.c:665` (write)
*   `dentry_unlist`: `fs/dcache.c:640` (inlined)
*   `__dentry_kill`: `fs/dcache.c:684` (read)

Right, now I'll start with the file and examine the code around these lines, paying close attention to the data structures accessed and how they interact in `__dentry_kill` to see how these inlined functions play their role. I'll need to understand the bitmask and flags operations to get a handle on what might be causing the race.



6/2 2026/02/02 12:49 tool read-file 0m
Args: map[File:fs/dcache.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
 * fs/dcache.c
 *
 * Complete reimplementation
 * (C) 1997 Thomas Schoebel-Theuer,
 * with heavy changes by Linus Torvalds
 */

/*
 * Notes on the allocation strategy:
 *
 * The dcache is a master of the icache - whenever a dcache entry
 * exists, the inode will always exist. "iput()" is done either when
 * the dcache entry is deleted or garbage collected.
 */

#include <linux/ratelimit.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/fscrypt.h>
#include <linux/fsnotify.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/cache.h>
#include <linux/export.h>
#include <linux/security.h>
#include <linux/seqlock.h>
#include <linux/memblock.h>
#include <linux/bit_spinlock.h>
#include <linux/rculist_bl.h>
#include <linux/list_lru.h>
#include "internal.h"
#include "mount.h"

#include <asm/runtime-const.h>

/*
 * Usage:
 * dcache->d_inode->i_lock protects:
 *   - i_dentry, d_u.d_alias, d_inode of aliases
 * dcache_hash_bucket lock protects:
 *   - the dcache hash table
 * s_roots bl list spinlock protects:
 *   - the s_roots list (see __d_drop)
 * dentry->d_sb->s_dentry_lru_lock protects:
 *   - the dcache lru lists and counters
 * d_lock protects:
 *   - d_flags
 *   - d_name
 *   - d_lru
 *   - d_count
 *   - d_unhashed()
 *   - d_parent and d_chilren
 *   - childrens' d_sib and d_parent
 *   - d_u.d_alias, d_inode
 *
 * Ordering:
 * dentry->d_inode->i_lock
 *   dentry->d_lock
 *     dentry->d_sb->s_dentry_lru_lock
 *     dcache_hash_bucket lock
 *     s_roots lock
 *
 * If there is an ancestor relationship:
 * dentry->d_parent->...->d_parent->d_lock
 *   ...
 *     dentry->d_parent->d_lock
 *       dentry->d_lock
 *
 * If no ancestor relationship:
 * arbitrary, since it's serialized on rename_lock
 */
static int sysctl_vfs_cache_pressure __read_mostly = 100;
static int sysctl_vfs_cache_pressure_denom __read_mostly = 100;

unsigned long vfs_pressure_ratio(unsigned long val)
{
	return mult_frac(val, sysctl_vfs_cache_pressure, sysctl_vfs_cache_pressure_denom);
}
EXPORT_SYMBOL_GPL(vfs_pressure_ratio);

__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);

static struct kmem_cache *__dentry_cache __ro_after_init;
#define dentry_cache runtime_const_ptr(__dentry_cache)

const struct qstr empty_name = QSTR_INIT("", 0);
EXPORT_SYMBOL(empty_name);
const struct qstr slash_name = QSTR_INIT("/", 1);
EXPORT_SYMBOL(slash_name);
const struct qstr dotdot_name = QSTR_INIT("..", 2);
EXPORT_SYMBOL(dotdot_name);

/*
 * This is the single most critical data structure when it comes
 * to the dcache: the hashtable for lookups. Somebody should try
 * to make this good - I've just made it work.
 *
 * This hash-function tries to avoid losing too many bits of hash
 * information, yet avoid using a prime hash-size or similar.
 *
 * Marking the variables "used" ensures that the compiler doesn't
 * optimize them away completely on architectures with runtime
 * constant infrastructure, this allows debuggers to see their
 * values. But updating these values has no effect on those arches.
 */

static unsigned int d_hash_shift __ro_after_init __used;

static struct hlist_bl_head *dentry_hashtable __ro_after_init __used;

static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
{
	return runtime_const_ptr(dentry_hashtable) +
		runtime_const_shift_right_32(hashlen, d_hash_shift);
}

#define IN_LOOKUP_SHIFT 10
static struct hlist_bl_head in_lookup_hashtable[1 << IN_LOOKUP_SHIFT];

static inline struct hlist_bl_head *in_lookup_hash(const struct dentry *parent,
					unsigned int hash)
{
	hash += (unsigned long) parent / L1_CACHE_BYTES;
	return in_lookup_hashtable + hash_32(hash, IN_LOOKUP_SHIFT);
}

struct dentry_stat_t {
	long nr_dentry;
	long nr_unused;
	long age_limit;		/* age in seconds */
	long want_pages;	/* pages requested by system */
	long nr_negative;	/* # of unused negative dentries */
	long dummy;		/* Reserved for future use */
};

static DEFINE_PER_CPU(long, nr_dentry);
static DEFINE_PER_CPU(long, nr_dentry_unused);
static DEFINE_PER_CPU(long, nr_dentry_negative);
static int dentry_negative_policy;

#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
/* Statistics gathering. */
static struct dentry_stat_t dentry_stat = {
	.age_limit = 45,
};

/*
 * Here we resort to our own counters instead of using generic per-cpu counters
 * for consistency with what the vfs inode code does. We are expected to harvest
 * better code and performance by having our own specialized counters.
 *
 * Please note that the loop is done over all possible CPUs, not over all online
 * CPUs. The reason for this is that we don't want to play games with CPUs going
 * on and off. If one of them goes off, we will just keep their counters.
 *
 * glommer: See cffbc8a for details, and if you ever intend to change this,
 * please update all vfs counters to match.
 */
static long get_nr_dentry(void)
{
	int i;
	long sum = 0;
	for_each_possible_cpu(i)
		sum += per_cpu(nr_dentry, i);
	return sum < 0 ? 0 : sum;
}

static long get_nr_dentry_unused(void)
{
	int i;
	long sum = 0;
	for_each_possible_cpu(i)
		sum += per_cpu(nr_dentry_unused, i);
	return sum < 0 ? 0 : sum;
}

static long get_nr_dentry_negative(void)
{
	int i;
	long sum = 0;

	for_each_possible_cpu(i)
		sum += per_cpu(nr_dentry_negative, i);
	return sum < 0 ? 0 : sum;
}

static int proc_nr_dentry(const struct ctl_table *table, int write, void *buffer,
			  size_t *lenp, loff_t *ppos)
{
	dentry_stat.nr_dentry = get_nr_dentry();
	dentry_stat.nr_unused = get_nr_dentry_unused();
	dentry_stat.nr_negative = get_nr_dentry_negative();
	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}

static const struct ctl_table fs_dcache_sysctls[] = {
	{
		.procname	= "dentry-state",
		.data		= &dentry_stat,
		.maxlen		= 6*sizeof(long),
		.mode		= 0444,
		.proc_handler	= proc_nr_dentry,
	},
	{
		.procname	= "dentry-negative",
		.data		= &dentry_negative_policy,
		.maxlen		= sizeof(dentry_negative_policy),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
};

static const struct ctl_table vm_dcache_sysctls[] = {
	{
		.procname	= "vfs_cache_pressure",
		.data		= &sysctl_vfs_cache_pressure,
		.maxlen		= sizeof(sysctl_vfs_cache_pressure),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
	},
	{
		.procname	= "vfs_cache_pressure_denom",
		.data		= &sysctl_vfs_cache_pressure_denom,
		.maxlen		= sizeof(sysctl_vfs_cache_pressure_denom),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ONE_HUNDRED,
	},
};

static int __init init_fs_dcache_sysctls(void)
{
	register_sysctl_init("vm", vm_dcache_sysctls);
	register_sysctl_init("fs", fs_dcache_sysctls);
	return 0;
}
fs_initcall(init_fs_dcache_sysctls);
#endif

/*
 * Compare 2 name strings, return 0 if they match, otherwise non-zero.
 * The strings are both count bytes long, and count is non-zero.
 */
#ifdef CONFIG_DCACHE_WORD_ACCESS

#include <asm/word-at-a-time.h>
/*
 * NOTE! 'cs' and 'scount' come from a dentry, so it has a
 * aligned allocation for this particular component. We don't
 * strictly need the load_unaligned_zeropad() safety, but it
 * doesn't hurt either.
 *
 * In contrast, 'ct' and 'tcount' can be from a pathname, and do
 * need the careful unaligned handling.
 */
static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char *ct, unsigned tcount)
{
	unsigned long a,b,mask;

	for (;;) {
		a = read_word_at_a_time(cs);
		b = load_unaligned_zeropad(ct);
		if (tcount < sizeof(unsigned long))
			break;
		if (unlikely(a != b))
			return 1;
		cs += sizeof(unsigned long);
		ct += sizeof(unsigned long);
		tcount -= sizeof(unsigned long);
		if (!tcount)
			return 0;
	}
	mask = bytemask_from_count(tcount);
	return unlikely(!!((a ^ b) & mask));
}

#else

static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char *ct, unsigned tcount)
{
	do {
		if (*cs != *ct)
			return 1;
		cs++;
		ct++;
		tcount--;
	} while (tcount);
	return 0;
}

#endif

static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *ct, unsigned tcount)
{
	/*
	 * Be careful about RCU walk racing with rename:
	 * use 'READ_ONCE' to fetch the name pointer.
	 *
	 * NOTE! Even if a rename will mean that the length
	 * was not loaded atomically, we don't care. The
	 * RCU walk will check the sequence count eventually,
	 * and catch it. And we won't overrun the buffer,
	 * because we're reading the name pointer atomically,
	 * and a dentry name is guaranteed to be properly
	 * terminated with a NUL byte.
	 *
	 * End result: even if 'len' is wrong, we'll exit
	 * early because the data cannot match (there can
	 * be no NUL in the ct/tcount data)
	 */
	const unsigned char *cs = READ_ONCE(dentry->d_name.name);

	return dentry_string_cmp(cs, ct, tcount);
}

/*
 * long names are allocated separately from dentry and never modified.
 * Refcounted, freeing is RCU-delayed.  See take_dentry_name_snapshot()
 * for the reason why ->count and ->head can't be combined into a union.
 * dentry_string_cmp() relies upon ->name[] being word-aligned.
 */
struct external_name {
	atomic_t count;
	struct rcu_head head;
	unsigned char name[] __aligned(sizeof(unsigned long));
};

static inline struct external_name *external_name(struct dentry *dentry)
{
	return container_of(dentry->d_name.name, struct external_name, name[0]);
}

static void __d_free(struct rcu_head *head)
{
	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);

	kmem_cache_free(dentry_cache, dentry); 
}

static void __d_free_external(struct rcu_head *head)
{
	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
	kfree(external_name(dentry));
	kmem_cache_free(dentry_cache, dentry);
}

static inline int dname_external(const struct dentry *dentry)
{
	return dentry->d_name.name != dentry->d_shortname.string;
}

void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
{
	unsigned seq;
	const unsigned char *s;

	rcu_read_lock();
retry:
	seq = read_seqcount_begin(&dentry->d_seq);
	s = READ_ONCE(dentry->d_name.name);
	name->name.hash_len = dentry->d_name.hash_len;
	name->name.name = name->inline_name.string;
	if (likely(s == dentry->d_shortname.string)) {
		name->inline_name = dentry->d_shortname;
	} else {
		struct external_name *p;
		p = container_of(s, struct external_name, name[0]);
		// get a valid reference
		if (unlikely(!atomic_inc_not_zero(&p->count)))
			goto retry;
		name->name.name = s;
	}
	if (read_seqcount_retry(&dentry->d_seq, seq)) {
		release_dentry_name_snapshot(name);
		goto retry;
	}
	rcu_read_unlock();
}
EXPORT_SYMBOL(take_dentry_name_snapshot);

void release_dentry_name_snapshot(struct name_snapshot *name)
{
	if (unlikely(name->name.name != name->inline_name.string)) {
		struct external_name *p;
		p = container_of(name->name.name, struct external_name, name[0]);
		if (unlikely(atomic_dec_and_test(&p->count)))
			kfree_rcu(p, head);
	}
}
EXPORT_SYMBOL(release_dentry_name_snapshot);

static inline void __d_set_inode_and_type(struct dentry *dentry,
					  struct inode *inode,
					  unsigned type_flags)
{
	unsigned flags;

	dentry->d_inode = inode;
	flags = READ_ONCE(dentry->d_flags);
	flags &= ~DCACHE_ENTRY_TYPE;
	flags |= type_flags;
	smp_store_release(&dentry->d_flags, flags);
}

static inline void __d_clear_type_and_inode(struct dentry *dentry)
{
	unsigned flags = READ_ONCE(dentry->d_flags);

	flags &= ~DCACHE_ENTRY_TYPE;
	WRITE_ONCE(dentry->d_flags, flags);
	dentry->d_inode = NULL;
	/*
	 * The negative counter only tracks dentries on the LRU. Don't inc if
	 * d_lru is on another list.
	 */
	if ((flags & (DCACHE_LRU_LIST|DCACHE_SHRINK_LIST)) == DCACHE_LRU_LIST)
		this_cpu_inc(nr_dentry_negative);
}

static void dentry_free(struct dentry *dentry)
{
	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
	if (unlikely(dname_external(dentry))) {
		struct external_name *p = external_name(dentry);
		if (likely(atomic_dec_and_test(&p->count))) {
			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
			return;
		}
	}
	/* if dentry was never visible to RCU, immediate free is OK */
	if (dentry->d_flags & DCACHE_NORCU)
		__d_free(&dentry->d_u.d_rcu);
	else
		call_rcu(&dentry->d_u.d_rcu, __d_free);
}

/*
 * Release the dentry's inode, using the filesystem
 * d_iput() operation if defined.
 */
static void dentry_unlink_inode(struct dentry * dentry)
	__releases(dentry->d_lock)
	__releases(dentry->d_inode->i_lock)
{
	struct inode *inode = dentry->d_inode;

	raw_write_seqcount_begin(&dentry->d_seq);
	__d_clear_type_and_inode(dentry);
	hlist_del_init(&dentry->d_u.d_alias);
	raw_write_seqcount_end(&dentry->d_seq);
	spin_unlock(&dentry->d_lock);
	spin_unlock(&inode->i_lock);
	if (!inode->i_nlink)
		fsnotify_inoderemove(inode);
	if (dentry->d_op && dentry->d_op->d_iput)
		dentry->d_op->d_iput(dentry, inode);
	else
		iput(inode);
}

/*
 * The DCACHE_LRU_LIST bit is set whenever the 'd_lru' entry
 * is in use - which includes both the "real" per-superblock
 * LRU list _and_ the DCACHE_SHRINK_LIST use.
 *
 * The DCACHE_SHRINK_LIST bit is set whenever the dentry is
 * on the shrink list (ie not on the superblock LRU list).
 *
 * The per-cpu "nr_dentry_unused" counters are updated with
 * the DCACHE_LRU_LIST bit.
 *
 * The per-cpu "nr_dentry_negative" counters are only updated
 * when deleted from or added to the per-superblock LRU list, not
 * from/to the shrink list. That is to avoid an unneeded dec/inc
 * pair when moving from LRU to shrink list in select_collect().
 *
 * These helper functions make sure we always follow the
 * rules. d_lock must be held by the caller.
 */
#define D_FLAG_VERIFY(dentry,x) WARN_ON_ONCE(((dentry)->d_flags & (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) != (x))
static void d_lru_add(struct dentry *dentry)
{
	D_FLAG_VERIFY(dentry, 0);
	dentry->d_flags |= DCACHE_LRU_LIST;
	this_cpu_inc(nr_dentry_unused);
	if (d_is_negative(dentry))
		this_cpu_inc(nr_dentry_negative);
	WARN_ON_ONCE(!list_lru_add_obj(
			&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
}

static void d_lru_del(struct dentry *dentry)
{
	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
	dentry->d_flags &= ~DCACHE_LRU_LIST;
	this_cpu_dec(nr_dentry_unused);
	if (d_is_negative(dentry))
		this_cpu_dec(nr_dentry_negative);
	WARN_ON_ONCE(!list_lru_del_obj(
			&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
}

static void d_shrink_del(struct dentry *dentry)
{
	D_FLAG_VERIFY(dentry, DCACHE_SHRINK_LIST | DCACHE_LRU_LIST);
	list_del_init(&dentry->d_lru);
	dentry->d_flags &= ~(DCACHE_SHRINK_LIST | DCACHE_LRU_LIST);
	this_cpu_dec(nr_dentry_unused);
}

static void d_shrink_add(struct dentry *dentry, struct list_head *list)
{
	D_FLAG_VERIFY(dentry, 0);
	list_add(&dentry->d_lru, list);
	dentry->d_flags |= DCACHE_SHRINK_LIST | DCACHE_LRU_LIST;
	this_cpu_inc(nr_dentry_unused);
}

/*
 * These can only be called under the global LRU lock, ie during the
 * callback for freeing the LRU list. "isolate" removes it from the
 * LRU lists entirely, while shrink_move moves it to the indicated
 * private list.
 */
static void d_lru_isolate(struct list_lru_one *lru, struct dentry *dentry)
{
	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
	dentry->d_flags &= ~DCACHE_LRU_LIST;
	this_cpu_dec(nr_dentry_unused);
	if (d_is_negative(dentry))
		this_cpu_dec(nr_dentry_negative);
	list_lru_isolate(lru, &dentry->d_lru);
}

static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
			      struct list_head *list)
{
	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
	dentry->d_flags |= DCACHE_SHRINK_LIST;
	if (d_is_negative(dentry))
		this_cpu_dec(nr_dentry_negative);
	list_lru_isolate_move(lru, &dentry->d_lru, list);
}

static void ___d_drop(struct dentry *dentry)
{
	struct hlist_bl_head *b;
	/*
	 * Hashed dentries are normally on the dentry hashtable,
	 * with the exception of those newly allocated by
	 * d_obtain_root, which are always IS_ROOT:
	 */
	if (unlikely(IS_ROOT(dentry)))
		b = &dentry->d_sb->s_roots;
	else
		b = d_hash(dentry->d_name.hash);

	hlist_bl_lock(b);
	__hlist_bl_del(&dentry->d_hash);
	hlist_bl_unlock(b);
}

void __d_drop(struct dentry *dentry)
{
	if (!d_unhashed(dentry)) {
		___d_drop(dentry);
		dentry->d_hash.pprev = NULL;
		write_seqcount_invalidate(&dentry->d_seq);
	}
}
EXPORT_SYMBOL(__d_drop);

/**
 * d_drop - drop a dentry
 * @dentry: dentry to drop
 *
 * d_drop() unhashes the entry from the parent dentry hashes, so that it won't
 * be found through a VFS lookup any more. Note that this is different from
 * deleting the dentry - d_delete will try to mark the dentry negative if
 * possible, giving a successful _negative_ lookup, while d_drop will
 * just make the cache lookup fail.
 *
 * d_drop() is used mainly for stuff that wants to invalidate a dentry for some
 * reason (NFS timeouts or autofs deletes).
 *
 * __d_drop requires dentry->d_lock
 *
 * ___d_drop doesn't mark dentry as "unhashed"
 * (dentry->d_hash.pprev will be LIST_POISON2, not NULL).
 */
void d_drop(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	__d_drop(dentry);
	spin_unlock(&dentry->d_lock);
}
EXPORT_SYMBOL(d_drop);

static inline void dentry_unlist(struct dentry *dentry)
{
	struct dentry *next;
	/*
	 * Inform d_walk() and shrink_dentry_list() that we are no longer
	 * attached to the dentry tree
	 */
	dentry->d_flags |= DCACHE_DENTRY_KILLED;
	if (unlikely(hlist_unhashed(&dentry->d_sib)))
		return;
	__hlist_del(&dentry->d_sib);
	/*
	 * Cursors can move around the list of children.  While we'd been
	 * a normal list member, it didn't matter - ->d_sib.next would've
	 * been updated.  However, from now on it won't be and for the
	 * things like d_walk() it might end up with a nasty surprise.
	 * Normally d_walk() doesn't care about cursors moving around -
	 * ->d_lock on parent prevents that and since a cursor has no children
	 * of its own, we get through it without ever unlocking the parent.
	 * There is one exception, though - if we ascend from a child that
	 * gets killed as soon as we unlock it, the next sibling is found
	 * using the value left in its ->d_sib.next.  And if _that_
	 * pointed to a cursor, and cursor got moved (e.g. by lseek())
	 * before d_walk() regains parent->d_lock, we'll end up skipping
	 * everything the cursor had been moved past.
	 *
	 * Solution: make sure that the pointer left behind in ->d_sib.next
	 * points to something that won't be moving around.  I.e. skip the
	 * cursors.
	 */
	while (dentry->d_sib.next) {
		next = hlist_entry(dentry->d_sib.next, struct dentry, d_sib);
		if (likely(!(next->d_flags & DCACHE_DENTRY_CURSOR)))
			break;
		dentry->d_sib.next = next->d_sib.next;
	}
}

static struct dentry *__dentry_kill(struct dentry *dentry)
{
	struct dentry *parent = NULL;
	bool can_free = true;

	/*
	 * The dentry is now unrecoverably dead to the world.
	 */
	lockref_mark_dead(&dentry->d_lockref);

	/*
	 * inform the fs via d_prune that this dentry is about to be
	 * unhashed and destroyed.
	 */
	if (dentry->d_flags & DCACHE_OP_PRUNE)
		dentry->d_op->d_prune(dentry);

	if (dentry->d_flags & DCACHE_LRU_LIST) {
		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
			d_lru_del(dentry);
	}
	/* if it was on the hash then remove it */
	__d_drop(dentry);
	if (dentry->d_inode)
		dentry_unlink_inode(dentry);
	else
		spin_unlock(&dentry->d_lock);
	this_cpu_dec(nr_dentry);
	if (dentry->d_op && dentry->d_op->d_release)
		dentry->d_op->d_release(dentry);

	cond_resched();
	/* now that it's negative, ->d_parent is stable */
	if (!IS_ROOT(dentry)) {
		parent = dentry->d_parent;
		spin_lock(&parent->d_lock);
	}
	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
	dentry_unlist(dentry);
	if (dentry->d_flags & DCACHE_SHRINK_LIST)
		can_free = false;
	spin_unlock(&dentry->d_lock);
	if (likely(can_free))
		dentry_free(dentry);
	if (parent && --parent->d_lockref.count) {
		spin_unlock(&parent->d_lock);
		return NULL;
	}
	return parent;
}

/*
 * Lock a dentry for feeding it to __dentry_kill().
 * Called under rcu_read_lock() and dentry->d_lock; the former
 * guarantees that nothing we access will be freed under us.
 * Note that dentry is *not* protected from concurrent dentry_kill(),
 * d_delete(), etc.
 *
 * Return false if dentry is busy.  Otherwise, return true and have
 * that dentry's inode locked.
 */

static bool lock_for_kill(struct dentry *dentry)
{
	struct inode *inode = dentry->d_inode;

	if (unlikely(dentry->d_lockref.count))
		return false;

	if (!inode || likely(spin_trylock(&inode->i_lock)))
		return true;

	do {
		spin_unlock(&dentry->d_lock);
		spin_lock(&inode->i_lock);
		spin_lock(&dentry->d_lock);
		if (likely(inode == dentry->d_inode))
			break;
		spin_unlock(&inode->i_lock);
		inode = dentry->d_inode;
	} while (inode);
	if (likely(!dentry->d_lockref.count))
		return true;
	if (inode)
		spin_unlock(&inode->i_lock);
	return false;
}

/*
 * Decide if dentry is worth retaining.  Usually this is called with dentry
 * locked; if not locked, we are more limited and might not be able to tell
 * without a lock.  False in this case means "punt to locked path and recheck".
 *
 * In case we aren't locked, these predicates are not "stable". However, it is
 * sufficient that at some point after we dropped the reference the dentry was
 * hashed and the flags had the proper value. Other dentry users may have
 * re-gotten a reference to the dentry and change that, but our work is done -
 * we can leave the dentry around with a zero refcount.
 */
static inline bool retain_dentry(struct dentry *dentry, bool locked)
{
	unsigned int d_flags;

	smp_rmb();
	d_flags = READ_ONCE(dentry->d_flags);

	// Unreachable? Nobody would be able to look it up, no point retaining
	if (unlikely(d_unhashed(dentry)))
		return false;

	// Same if it's disconnected
	if (unlikely(d_flags & DCACHE_DISCONNECTED))
		return false;

	// ->d_delete() might tell us not to bother, but that requires
	// ->d_lock; can't decide without it
	if (unlikely(d_flags & DCACHE_OP_DELETE)) {
		if (!locked || dentry->d_op->d_delete(dentry))
			return false;
	}

	// Explicitly told not to bother
	if (unlikely(d_flags & DCACHE_DONTCACHE))
		return false;

	// At this point it looks like we ought to keep it.  We also might
	// need to do something - put it on LRU if it wasn't there already
	// and mark it referenced if it was on LRU, but not marked yet.
	// Unfortunately, both actions require ->d_lock, so in lockless
	// case we'd have to punt rather than doing those.
	if (unlikely(!(d_flags & DCACHE_LRU_LIST))) {
		if (!locked)
			return false;
		d_lru_add(dentry);
	} else if (unlikely(!(d_flags & DCACHE_REFERENCED))) {
		if (!locked)
			return false;
		dentry->d_flags |= DCACHE_REFERENCED;
	}
	return true;
}

void d_mark_dontcache(struct inode *inode)
{
	struct dentry *de;

	spin_lock(&inode->i_lock);
	hlist_for_each_entry(de, &inode->i_dentry, d_u.d_alias) {
		spin_lock(&de->d_lock);
		de->d_flags |= DCACHE_DONTCACHE;
		spin_unlock(&de->d_lock);
	}
	inode_state_set(inode, I_DONTCACHE);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_mark_dontcache);

/*
 * Try to do a lockless dput(), and return whether that was successful.
 *
 * If unsuccessful, we return false, having already taken the dentry lock.
 * In that case refcount is guaranteed to be zero and we have already
 * decided that it's not worth keeping around.
 *
 * The caller needs to hold the RCU read lock, so that the dentry is
 * guaranteed to stay around even if the refcount goes down to zero!
 */
static inline bool fast_dput(struct dentry *dentry)
{
	int ret;

	/*
	 * try to decrement the lockref optimistically.
	 */
	ret = lockref_put_return(&dentry->d_lockref);

	/*
	 * If the lockref_put_return() failed due to the lock being held
	 * by somebody else, the fast path has failed. We will need to
	 * get the lock, and then check the count again.
	 */
	if (unlikely(ret < 0)) {
		spin_lock(&dentry->d_lock);
		if (WARN_ON_ONCE(dentry->d_lockref.count <= 0)) {
			spin_unlock(&dentry->d_lock);
			return true;
		}
		dentry->d_lockref.count--;
		goto locked;
	}

	/*
	 * If we weren't the last ref, we're done.
	 */
	if (ret)
		return true;

	/*
	 * Can we decide that decrement of refcount is all we needed without
	 * taking the lock?  There's a very common case when it's all we need -
	 * dentry looks like it ought to be retained and there's nothing else
	 * to do.
	 */
	if (retain_dentry(dentry, false))
		return true;

	/*
	 * Either not worth retaining or we can't tell without the lock.
	 * Get the lock, then.  We've already decremented the refcount to 0,
	 * but we'll need to re-check the situation after getting the lock.
	 */
	spin_lock(&dentry->d_lock);

	/*
	 * Did somebody else grab a reference to it in the meantime, and
	 * we're no longer the last user after all? Alternatively, somebody
	 * else could have killed it and marked it dead. Either way, we
	 * don't need to do anything else.
	 */
locked:
	if (dentry->d_lockref.count || retain_dentry(dentry, true)) {
		spin_unlock(&dentry->d_lock);
		return true;
	}
	return false;
}

static void finish_dput(struct dentry *dentry)
	__releases(dentry->d_lock)
	__releases(RCU)
{
	while (lock_for_kill(dentry)) {
		rcu_read_unlock();
		dentry = __dentry_kill(dentry);
		if (!dentry)
			return;
		if (retain_dentry(dentry, true)) {
			spin_unlock(&dentry->d_lock);
			return;
		}
		rcu_read_lock();
	}
	rcu_read_unlock();
	spin_unlock(&dentry->d_lock);
}

/* 
 * This is dput
 *
 * This is complicated by the fact that we do not want to put
 * dentries that are no longer on any hash chain on the unused
 * list: we'd much rather just get rid of them immediately.
 *
 * However, that implies that we have to traverse the dentry
 * tree upwards to the parents which might _also_ now be
 * scheduled for deletion (it may have been only waiting for
 * its last child to go away).
 *
 * This tail recursion is done by hand as we don't want to depend
 * on the compiler to always get this right (gcc generally doesn't).
 * Real recursion would eat up our stack space.
 */

/*
 * dput - release a dentry
 * @dentry: dentry to release 
 *
 * Release a dentry. This will drop the usage count and if appropriate
 * call the dentry unlink method as well as removing it from the queues and
 * releasing its resources. If the parent dentries were scheduled for release
 * they too may now get deleted.
 */
void dput(struct dentry *dentry)
{
	if (!dentry)
		return;
	might_sleep();
	rcu_read_lock();
	if (likely(fast_dput(dentry))) {
		rcu_read_unlock();
		return;
	}
	finish_dput(dentry);
}
EXPORT_SYMBOL(dput);

void d_make_discardable(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	WARN_ON(!(dentry->d_flags & DCACHE_PERSISTENT));
	dentry->d_flags &= ~DCACHE_PERSISTENT;
	dentry->d_lockref.count--;
	rcu_read_lock();
	finish_dput(dentry);
}
EXPORT_SYMBOL(d_make_discardable);

static void to_shrink_list(struct dentry *dentry, struct list_head *list)
__must_hold(&dentry->d_lock)
{
	if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
		if (dentry->d_flags & DCACHE_LRU_LIST)
			d_lru_del(dentry);
		d_shrink_add(dentry, list);
	}
}

void dput_to_list(struct dentry *dentry, struct list_head *list)
{
	rcu_read_lock();
	if (likely(fast_dput(dentry))) {
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();
	to_shrink_list(dentry, list);
	spin_unlock(&dentry->d_lock);
}

struct dentry *dget_parent(struct dentry *dentry)
{
	int gotref;
	struct dentry *ret;
	unsigned seq;

	/*
	 * Do optimistic parent lookup without any
	 * locking.
	 */
	rcu_read_lock();
	seq = raw_seqcount_begin(&dentry->d_seq);
	ret = READ_ONCE(dentry->d_parent);
	gotref = lockref_get_not_zero(&ret->d_lockref);
	rcu_read_unlock();
	if (likely(gotref)) {
		if (!read_seqcount_retry(&dentry->d_seq, seq))
			return ret;
		dput(ret);
	}

repeat:
	/*
	 * Don't need rcu_dereference because we re-check it was correct under
	 * the lock.
	 */
	rcu_read_lock();
	ret = dentry->d_parent;
	spin_lock(&ret->d_lock);
	if (unlikely(ret != dentry->d_parent)) {
		spin_unlock(&ret->d_lock);
		rcu_read_unlock();
		goto repeat;
	}
	rcu_read_unlock();
	BUG_ON(!ret->d_lockref.count);
	ret->d_lockref.count++;
	spin_unlock(&ret->d_lock);
	return ret;
}
EXPORT_SYMBOL(dget_parent);

static struct dentry * __d_find_any_alias(struct inode *inode)
{
	struct dentry *alias;

	if (hlist_empty(&inode->i_dentry))
		return NULL;
	alias = hlist_entry(inode->i_dentry.first, struct dentry, d_u.d_alias);
	lockref_get(&alias->d_lockref);
	return alias;
}

/**
 * d_find_any_alias - find any alias for a given inode
 * @inode: inode to find an alias for
 *
 * If any aliases exist for the given inode, take and return a
 * reference for one of them.  If no aliases exist, return %NULL.
 */
struct dentry *d_find_any_alias(struct inode *inode)
{
	struct dentry *de;

	spin_lock(&inode->i_lock);
	de = __d_find_any_alias(inode);
	spin_unlock(&inode->i_lock);
	return de;
}
EXPORT_SYMBOL(d_find_any_alias);

static struct dentry *__d_find_alias(struct inode *inode)
{
	struct dentry *alias;

	if (S_ISDIR(inode->i_mode))
		return __d_find_any_alias(inode);

	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
		spin_lock(&alias->d_lock);
 		if (!d_unhashed(alias)) {
			dget_dlock(alias);
			spin_unlock(&alias->d_lock);
			return alias;
		}
		spin_unlock(&alias->d_lock);
	}
	return NULL;
}

/**
 * d_find_alias - grab a hashed alias of inode
 * @inode: inode in question
 *
 * If inode has a hashed alias, or is a directory and has any alias,
 * acquire the reference to alias and return it. Otherwise return NULL.
 * Notice that if inode is a directory there can be only one alias and
 * it can be unhashed only if it has no children, or if it is the root
 * of a filesystem, or if the directory was renamed and d_revalidate
 * was the first vfs operation to notice.
 *
 * If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
 * any other hashed alias over that one.
 */
struct dentry *d_find_alias(struct inode *inode)
{
	struct dentry *de = NULL;

	if (!hlist_empty(&inode->i_dentry)) {
		spin_lock(&inode->i_lock);
		de = __d_find_alias(inode);
		spin_unlock(&inode->i_lock);
	}
	return de;
}
EXPORT_SYMBOL(d_find_alias);

/*
 *  Caller MUST be holding rcu_read_lock() and be guaranteed
 *  that inode won't get freed until rcu_read_unlock().
 */
struct dentry *d_find_alias_rcu(struct inode *inode)
{
	struct hlist_head *l = &inode->i_dentry;
	struct dentry *de = NULL;

	spin_lock(&inode->i_lock);
	// ->i_dentry and ->i_rcu are colocated, but the latter won't be
	// used without having I_FREEING set, which means no aliases left
	if (likely(!(inode_state_read(inode) & I_FREEING) && !hlist_empty(l))) {
		if (S_ISDIR(inode->i_mode)) {
			de = hlist_entry(l->first, struct dentry, d_u.d_alias);
		} else {
			hlist_for_each_entry(de, l, d_u.d_alias)
				if (!d_unhashed(de))
					break;
		}
	}
	spin_unlock(&inode->i_lock);
	return de;
}

/**
 * d_dispose_if_unused - move unreferenced dentries to shrink list
 * @dentry: dentry in question
 * @dispose: head of shrink list
 *
 * If dentry has no external references, move it to shrink list.
 *
 * NOTE!!! The caller is responsible for preventing eviction of the dentry by
 * holding dentry->d_inode->i_lock or equivalent.
 */
void d_dispose_if_unused(struct dentry *dentry, struct list_head *dispose)
{
	spin_lock(&dentry->d_lock);
	if (!dentry->d_lockref.count)
		to_shrink_list(dentry, dispose);
	spin_unlock(&dentry->d_lock);
}
EXPORT_SYMBOL(d_dispose_if_unused);

/*
 *	Try to kill dentries associated with this inode.
 * WARNING: you must own a reference to inode.
 */
void d_prune_aliases(struct inode *inode)
{
	LIST_HEAD(dispose);
	struct dentry *dentry;

	spin_lock(&inode->i_lock);
	hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias)
		d_dispose_if_unused(dentry, &dispose);
	spin_unlock(&inode->i_lock);
	shrink_dentry_list(&dispose);
}
EXPORT_SYMBOL(d_prune_aliases);

static inline void shrink_kill(struct dentry *victim)
{
	do {
		rcu_read_unlock();
		victim = __dentry_kill(victim);
		rcu_read_lock();
	} while (victim && lock_for_kill(victim));
	rcu_read_unlock();
	if (victim)
		spin_unlock(&victim->d_lock);
}

void shrink_dentry_list(struct list_head *list)
{
	while (!list_empty(list)) {
		struct dentry *dentry;

		dentry = list_entry(list->prev, struct dentry, d_lru);
		spin_lock(&dentry->d_lock);
		rcu_read_lock();
		if (!lock_for_kill(dentry)) {
			bool can_free;
			rcu_read_unlock();
			d_shrink_del(dentry);
			can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
			spin_unlock(&dentry->d_lock);
			if (can_free)
				dentry_free(dentry);
			continue;
		}
		d_shrink_del(dentry);
		shrink_kill(dentry);
	}
}
EXPORT_SYMBOL(shrink_dentry_list);

static enum lru_status dentry_lru_isolate(struct list_head *item,
		struct list_lru_one *lru, void *arg)
{
	struct list_head *freeable = arg;
	struct dentry	*dentry = container_of(item, struct dentry, d_lru);


	/*
	 * we are inverting the lru lock/dentry->d_lock here,
	 * so use a trylock. If we fail to get the lock, just skip
	 * it
	 */
	if (!spin_trylock(&dentry->d_lock))
		return LRU_SKIP;

	/*
	 * Referenced dentries are still in use. If they have active
	 * counts, just remove them from the LRU. Otherwise give them
	 * another pass through the LRU.
	 */
	if (dentry->d_lockref.count) {
		d_lru_isolate(lru, dentry);
		spin_unlock(&dentry->d_lock);
		return LRU_REMOVED;
	}

	if (dentry->d_flags & DCACHE_REFERENCED) {
		dentry->d_flags &= ~DCACHE_REFERENCED;
		spin_unlock(&dentry->d_lock);

		/*
		 * The list move itself will be made by the common LRU code. At
		 * this point, we've dropped the dentry->d_lock but keep the
		 * lru lock. This is safe to do, since every list movement is
		 * protected by the lru lock even if both locks are held.
		 *
		 * This is guaranteed by the fact that all LRU management
		 * functions are intermediated by the LRU API calls like
		 * list_lru_add_obj and list_lru_del_obj. List movement in this file
		 * only ever occur through this functions or through callbacks
		 * like this one, that are called from the LRU API.
		 *
		 * The only exceptions to this are functions like
		 * shrink_dentry_list, and code that first checks for the
		 * DCACHE_SHRINK_LIST flag.  Those are guaranteed to be
		 * operating only with stack provided lists after they are
		 * properly isolated from the main list.  It is thus, always a
		 * local access.
		 */
		return LRU_ROTATE;
	}

	d_lru_shrink_move(lru, dentry, freeable);
	spin_unlock(&dentry->d_lock);

	return LRU_REMOVED;
}

/**
 * prune_dcache_sb - shrink the dcache
 * @sb: superblock
 * @sc: shrink control, passed to list_lru_shrink_walk()
 *
 * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This
 * is done when we need more memory and called from the superblock shrinker
 * function.
 *
 * This function may fail to free any resources if all the dentries are in
 * use.
 */
long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
{
	LIST_HEAD(dispose);
	long freed;

	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
				     dentry_lru_isolate, &dispose);
	shrink_dentry_list(&dispose);
	return freed;
}

static enum lru_status dentry_lru_isolate_shrink(struct list_head *item,
		struct list_lru_one *lru, void *arg)
{
	struct list_head *freeable = arg;
	struct dentry	*dentry = container_of(item, struct dentry, d_lru);

	/*
	 * we are inverting the lru lock/dentry->d_lock here,
	 * so use a trylock. If we fail to get the lock, just skip
	 * it
	 */
	if (!spin_trylock(&dentry->d_lock))
		return LRU_SKIP;

	d_lru_shrink_move(lru, dentry, freeable);
	spin_unlock(&dentry->d_lock);

	return LRU_REMOVED;
}


/**
 * shrink_dcache_sb - shrink dcache for a superblock
 * @sb: superblock
 *
 * Shrink the dcache for the specified super block. This is used to free
 * the dcache before unmounting a file system.
 */
void shrink_dcache_sb(struct super_block *sb)
{
	do {
		LIST_HEAD(dispose);

		list_lru_walk(&sb->s_dentry_lru,
			dentry_lru_isolate_shrink, &dispose, 1024);
		shrink_dentry_list(&dispose);
	} while (list_lru_count(&sb->s_dentry_lru) > 0);
}
EXPORT_SYMBOL(shrink_dcache_sb);

/**
 * enum d_walk_ret - action to talke during tree walk
 * @D_WALK_CONTINUE:	contrinue walk
 * @D_WALK_QUIT:	quit walk
 * @D_WALK_NORETRY:	quit when retry is needed
 * @D_WALK_SKIP:	skip this dentry and its children
 */
enum d_walk_ret {
	D_WALK_CONTINUE,
	D_WALK_QUIT,
	D_WALK_NORETRY,
	D_WALK_SKIP,
};

/**
 * d_walk - walk the dentry tree
 * @parent:	start of walk
 * @data:	data passed to @enter() and @finish()
 * @enter:	callback when first entering the dentry
 *
 * The @enter() callbacks are called with d_lock held.
 */
static void d_walk(struct dentry *parent, void *data,
		   enum d_walk_ret (*enter)(void *, struct dentry *))
{
	struct dentry *this_parent, *dentry;
	unsigned seq = 0;
	enum d_walk_ret ret;
	bool retry = true;

again:
	read_seqbegin_or_lock(&rename_lock, &seq);
	this_parent = parent;
	spin_lock(&this_parent->d_lock);

	ret = enter(data, this_parent);
	switch (ret) {
	case D_WALK_CONTINUE:
		break;
	case D_WALK_QUIT:
	case D_WALK_SKIP:
		goto out_unlock;
	case D_WALK_NORETRY:
		retry = false;
		break;
	}
repeat:
	dentry = d_first_child(this_parent);
resume:
	hlist_for_each_entry_from(dentry, d_sib) {
		if (unlikely(dentry->d_flags & DCACHE_DENTRY_CURSOR))
			continue;

		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);

		ret = enter(data, dentry);
		switch (ret) {
		case D_WALK_CONTINUE:
			break;
		case D_WALK_QUIT:
			spin_unlock(&dentry->d_lock);
			goto out_unlock;
		case D_WALK_NORETRY:
			retry = false;
			break;
		case D_WALK_SKIP:
			spin_unlock(&dentry->d_lock);
			continue;
		}

		if (!hlist_empty(&dentry->d_children)) {
			spin_unlock(&this_parent->d_lock);
			spin_release(&dentry->d_lock.dep_map, _RET_IP_);
			this_parent = dentry;
			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
			goto repeat;
		}
		spin_unlock(&dentry->d_lock);
	}
	/*
	 * All done at this level ... ascend and resume the search.
	 */
	rcu_read_lock();
ascend:
	if (this_parent != parent) {
		dentry = this_parent;
		this_parent = dentry->d_parent;

		spin_unlock(&dentry->d_lock);
		spin_lock(&this_parent->d_lock);

		/* might go back up the wrong parent if we have had a rename. */
		if (need_seqretry(&rename_lock, seq))
			goto rename_retry;
		/* go into the first sibling still alive */
		hlist_for_each_entry_continue(dentry, d_sib) {
			if (likely(!(dentry->d_flags & DCACHE_DENTRY_KILLED))) {
				rcu_read_unlock();
				goto resume;
			}
		}
		goto ascend;
	}
	if (need_seqretry(&rename_lock, seq))
		goto rename_retry;
	rcu_read_unlock();

out_unlock:
	spin_unlock(&this_parent->d_lock);
	done_seqretry(&rename_lock, seq);
	return;

rename_retry:
	spin_unlock(&this_parent->d_lock);
	rcu_read_unlock();
	BUG_ON(seq & 1);
	if (!retry)
		return;
	seq = 1;
	goto again;
}

struct check_mount {
	struct vfsmount *mnt;
	unsigned int mounted;
};

/* locks: mount_locked_reader && dentry->d_lock */
static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry)
{
	struct check_mount *info = data;
	struct path path = { .mnt = info->mnt, .dentry = dentry };

	if (likely(!d_mountpoint(dentry)))
		return D_WALK_CONTINUE;
	if (__path_is_mountpoint(&path)) {
		info->mounted = 1;
		return D_WALK_QUIT;
	}
	return D_WALK_CONTINUE;
}

/**
 * path_has_submounts - check for mounts over a dentry in the
 *                      current namespace.
 * @parent: path to check.
 *
 * Return true if the parent or its subdirectories contain
 * a mount point in the current namespace.
 */
int path_has_submounts(const struct path *parent)
{
	struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };

	guard(mount_locked_reader)();
	d_walk(parent->dentry, &data, path_check_mount);

	return data.mounted;
}
EXPORT_SYMBOL(path_has_submounts);

/*
 * Called by mount code to set a mountpoint and check if the mountpoint is
 * reachable (e.g. NFS can unhash a directory dentry and then the complete
 * subtree can become unreachable).
 *
 * Only one of d_invalidate() and d_set_mounted() must succeed.  For
 * this reason take rename_lock and d_lock on dentry and ancestors.
 */
int d_set_mounted(struct dentry *dentry)
{
	struct dentry *p;
	int ret = -ENOENT;
	read_seqlock_excl(&rename_lock);
	for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
		/* Need exclusion wrt. d_invalidate() */
		spin_lock(&p->d_lock);
		if (unlikely(d_unhashed(p))) {
			spin_unlock(&p->d_lock);
			goto out;
		}
		spin_unlock(&p->d_lock);
	}
	spin_lock(&dentry->d_lock);
	if (!d_unlinked(dentry)) {
		ret = -EBUSY;
		if (!d_mountpoint(dentry)) {
			dentry->d_flags |= DCACHE_MOUNTED;
			ret = 0;
		}
	}
 	spin_unlock(&dentry->d_lock);
out:
	read_sequnlock_excl(&rename_lock);
	return ret;
}

/*
 * Search the dentry child list of the specified parent,
 * and move any unused dentries to the end of the unused
 * list for prune_dcache(). We descend to the next level
 * whenever the d_children list is non-empty and continue
 * searching.
 *
 * It returns zero iff there are no unused children,
 * otherwise  it returns the number of children moved to
 * the end of the unused list. This may not be the total
 * number of unused children, because select_parent can
 * drop the lock and return early due to latency
 * constraints.
 */

struct select_data {
	struct dentry *start;
	union {
		long found;
		struct dentry *victim;
	};
	struct list_head dispose;
};

static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
{
	struct select_data *data = _data;
	enum d_walk_ret ret = D_WALK_CONTINUE;

	if (data->start == dentry)
		goto out;

	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		data->found++;
	} else if (!dentry->d_lockref.count) {
		to_shrink_list(dentry, &data->dispose);
		data->found++;
	} else if (dentry->d_lockref.count < 0) {
		data->found++;
	}
	/*
	 * We can return to the caller if we have found some (this
	 * ensures forward progress). We'll be coming back to find
	 * the rest.
	 */
	if (!list_empty(&data->dispose))
		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
out:
	return ret;
}

static enum d_walk_ret select_collect_umount(void *_data, struct dentry *dentry)
{
	if (dentry->d_flags & DCACHE_PERSISTENT) {
		dentry->d_flags &= ~DCACHE_PERSISTENT;
		dentry->d_lockref.count--;
	}
	return select_collect(_data, dentry);
}

static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
{
	struct select_data *data = _data;
	enum d_walk_ret ret = D_WALK_CONTINUE;

	if (data->start == dentry)
		goto out;

	if (!dentry->d_lockref.count) {
		if (dentry->d_flags & DCACHE_SHRINK_LIST) {
			rcu_read_lock();
			data->victim = dentry;
			return D_WALK_QUIT;
		}
		to_shrink_list(dentry, &data->dispose);
	}
	/*
	 * We can return to the caller if we have found some (this
	 * ensures forward progress). We'll be coming back to find
	 * the rest.
	 */
	if (!list_empty(&data->dispose))
		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
out:
	return ret;
}

/**
 * shrink_dcache_tree - prune dcache
 * @parent: parent of entries to prune
 * @for_umount: true if we want to unpin the persistent ones
 *
 * Prune the dcache to remove unused children of the parent dentry.
 */
static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
{
	for (;;) {
		struct select_data data = {.start = parent};

		INIT_LIST_HEAD(&data.dispose);
		d_walk(parent, &data,
			for_umount ? select_collect_umount : select_collect);

		if (!list_empty(&data.dispose)) {
			shrink_dentry_list(&data.dispose);
			continue;
		}

		cond_resched();
		if (!data.found)
			break;
		data.victim = NULL;
		d_walk(parent, &data, select_collect2);
		if (data.victim) {
			spin_lock(&data.victim->d_lock);
			if (!lock_for_kill(data.victim)) {
				spin_unlock(&data.victim->d_lock);
				rcu_read_unlock();
			} else {
				shrink_kill(data.victim);
			}
		}
		if (!list_empty(&data.dispose))
			shrink_dentry_list(&data.dispose);
	}
}

void shrink_dcache_parent(struct dentry *parent)
{
	shrink_dcache_tree(parent, false);
}
EXPORT_SYMBOL(shrink_dcache_parent);

static enum d_walk_ret umount_check(void *_data, struct dentry *dentry)
{
	/* it has busy descendents; complain about those instead */
	if (!hlist_empty(&dentry->d_children))
		return D_WALK_CONTINUE;

	/* root with refcount 1 is fine */
	if (dentry == _data && dentry->d_lockref.count == 1)
		return D_WALK_CONTINUE;

	WARN(1, "BUG: Dentry %p{i=%lx,n=%pd} "
			" still in use (%d) [unmount of %s %s]\n",
		       dentry,
		       dentry->d_inode ?
		       dentry->d_inode->i_ino : 0UL,
		       dentry,
		       dentry->d_lockref.count,
		       dentry->d_sb->s_type->name,
		       dentry->d_sb->s_id);
	return D_WALK_CONTINUE;
}

static void do_one_tree(struct dentry *dentry)
{
	shrink_dcache_tree(dentry, true);
	d_walk(dentry, dentry, umount_check);
	d_drop(dentry);
	dput(dentry);
}

/*
 * destroy the dentries attached to a superblock on unmounting
 */
void shrink_dcache_for_umount(struct super_block *sb)
{
	struct dentry *dentry;

	rwsem_assert_held_write(&sb->s_umount);

	dentry = sb->s_root;
	sb->s_root = NULL;
	do_one_tree(dentry);

	while (!hlist_bl_empty(&sb->s_roots)) {
		dentry = dget(hlist_bl_entry(hlist_bl_first(&sb->s_roots), struct dentry, d_hash));
		do_one_tree(dentry);
	}
}

static enum d_walk_ret find_submount(void *_data, struct dentry *dentry)
{
	struct dentry **victim = _data;
	if (d_mountpoint(dentry)) {
		*victim = dget_dlock(dentry);
		return D_WALK_QUIT;
	}
	return D_WALK_CONTINUE;
}

/**
 * d_invalidate - detach submounts, prune dcache, and drop
 * @dentry: dentry to invalidate (aka detach, prune and drop)
 */
void d_invalidate(struct dentry *dentry)
{
	bool had_submounts = false;
	spin_lock(&dentry->d_lock);
	if (d_unhashed(dentry)) {
		spin_unlock(&dentry->d_lock);
		return;
	}
	__d_drop(dentry);
	spin_unlock(&dentry->d_lock);

	/* Negative dentries can be dropped without further checks */
	if (!dentry->d_inode)
		return;

	shrink_dcache_parent(dentry);
	for (;;) {
		struct dentry *victim = NULL;
		d_walk(dentry, &victim, find_submount);
		if (!victim) {
			if (had_submounts)
				shrink_dcache_parent(dentry);
			return;
		}
		had_submounts = true;
		detach_mounts(victim);
		dput(victim);
	}
}
EXPORT_SYMBOL(d_invalidate);

/**
 * __d_alloc	-	allocate a dcache entry
 * @sb: filesystem it will belong to
 * @name: qstr of the name
 *
 * Allocates a dentry. It returns %NULL if there is insufficient memory
 * available. On a success the dentry is returned. The name passed in is
 * copied and the copy passed in may be reused after this call.
 */
 
static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
{
	struct dentry *dentry;
	char *dname;
	int err;

	dentry = kmem_cache_alloc_lru(dentry_cache, &sb->s_dentry_lru,
				      GFP_KERNEL);
	if (!dentry)
		return NULL;

	/*
	 * We guarantee that the inline name is always NUL-terminated.
	 * This way the memcpy() done by the name switching in rename
	 * will still always have a NUL at the end, even if we might
	 * be overwriting an internal NUL character
	 */
	dentry->d_shortname.string[DNAME_INLINE_LEN-1] = 0;
	if (unlikely(!name)) {
		name = &slash_name;
		dname = dentry->d_shortname.string;
	} else if (name->len > DNAME_INLINE_LEN-1) {
		size_t size = offsetof(struct external_name, name[1]);
		struct external_name *p = kmalloc(size + name->len,
						  GFP_KERNEL_ACCOUNT |
						  __GFP_RECLAIMABLE);
		if (!p) {
			kmem_cache_free(dentry_cache, dentry); 
			return NULL;
		}
		atomic_set(&p->count, 1);
		dname = p->name;
	} else  {
		dname = dentry->d_shortname.string;
	}	

	dentry->__d_name.len = name->len;
	dentry->__d_name.hash = name->hash;
	memcpy(dname, name->name, name->len);
	dname[name->len] = 0;

	/* Make sure we always see the terminating NUL character */
	smp_store_release(&dentry->__d_name.name, dname); /* ^^^ */

	dentry->d_flags = 0;
	lockref_init(&dentry->d_lockref);
	seqcount_spinlock_init(&dentry->d_seq, &dentry->d_lock);
	dentry->d_inode = NULL;
	dentry->d_parent = dentry;
	dentry->d_sb = sb;
	dentry->d_op = sb->__s_d_op;
	dentry->d_flags = sb->s_d_flags;
	dentry->d_fsdata = NULL;
	INIT_HLIST_BL_NODE(&dentry->d_hash);
	INIT_LIST_HEAD(&dentry->d_lru);
	INIT_HLIST_HEAD(&dentry->d_children);
	INIT_HLIST_NODE(&dentry->d_u.d_alias);
	INIT_HLIST_NODE(&dentry->d_sib);

	if (dentry->d_op && dentry->d_op->d_init) {
		err = dentry->d_op->d_init(dentry);
		if (err) {
			if (dname_external(dentry))
				kfree(external_name(dentry));
			kmem_cache_free(dentry_cache, dentry);
			return NULL;
		}
	}

	this_cpu_inc(nr_dentry);

	return dentry;
}

/**
 * d_alloc	-	allocate a dcache entry
 * @parent: parent of entry to allocate
 * @name: qstr of the name
 *
 * Allocates a dentry. It returns %NULL if there is insufficient memory
 * available. On a success the dentry is returned. The name passed in is
 * copied and the copy passed in may be reused after this call.
 */
struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
{
	struct dentry *dentry = __d_alloc(parent->d_sb, name);
	if (!dentry)
		return NULL;
	spin_lock(&parent->d_lock);
	/*
	 * don't need child lock because it is not subject
	 * to concurrency here
	 */
	dentry->d_parent = dget_dlock(parent);
	hlist_add_head(&dentry->d_sib, &parent->d_children);
	spin_unlock(&parent->d_lock);

	return dentry;
}
EXPORT_SYMBOL(d_alloc);

struct dentry *d_alloc_anon(struct super_block *sb)
{
	return __d_alloc(sb, NULL);
}
EXPORT_SYMBOL(d_alloc_anon);

struct dentry *d_alloc_cursor(struct dentry * parent)
{
	struct dentry *dentry = d_alloc_anon(parent->d_sb);
	if (dentry) {
		dentry->d_flags |= DCACHE_DENTRY_CURSOR;
		dentry->d_parent = dget(parent);
	}
	return dentry;
}

/**
 * d_alloc_pseudo - allocate a dentry (for lookup-less filesystems)
 * @sb: the superblock
 * @name: qstr of the name
 *
 * For a filesystem that just pins its dentries in memory and never
 * performs lookups at all, return an unhashed IS_ROOT dentry.
 * never be anyone's children or parents; this is used for pipes,
 * sockets et al.  Unlike all other
 * dentries, these will not have an RCU delay between dropping the
 * last reference and freeing them.
 *
 * The only user is alloc_file_pseudo() and that's what should
 * be considered a public interface.  Don't use directly.
 */
struct dentry *d_alloc_pseudo(struct super_block *sb, const struct qstr *name)
{
	static const struct dentry_operations anon_ops = {
		.d_dname = simple_dname
	};
	struct dentry *dentry = __d_alloc(sb, name);
	if (likely(dentry)) {
		dentry->d_flags |= DCACHE_NORCU;
		/* d_op_flags(&anon_ops) is 0 */
		if (!dentry->d_op)
			dentry->d_op = &anon_ops;
	}
	return dentry;
}

struct dentry *d_alloc_name(struct dentry *parent, const char *name)
{
	struct qstr q;

	q.name = name;
	q.hash_len = hashlen_string(parent, name);
	return d_alloc(parent, &q);
}
EXPORT_SYMBOL(d_alloc_name);

#define DCACHE_OP_FLAGS \
	(DCACHE_OP_HASH | DCACHE_OP_COMPARE | DCACHE_OP_REVALIDATE | \
	 DCACHE_OP_WEAK_REVALIDATE | DCACHE_OP_DELETE | DCACHE_OP_PRUNE | \
	 DCACHE_OP_REAL)

static unsigned int d_op_flags(const struct dentry_operations *op)
{
	unsigned int flags = 0;
	if (op) {
		if (op->d_hash)
			flags |= DCACHE_OP_HASH;
		if (op->d_compare)
			flags |= DCACHE_OP_COMPARE;
		if (op->d_revalidate)
			flags |= DCACHE_OP_REVALIDATE;
		if (op->d_weak_revalidate)
			flags |= DCACHE_OP_WEAK_REVALIDATE;
		if (op->d_delete)
			flags |= DCACHE_OP_DELETE;
		if (op->d_prune)
			flags |= DCACHE_OP_PRUNE;
		if (op->d_real)
			flags |= DCACHE_OP_REAL;
	}
	return flags;
}

static void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
{
	unsigned int flags = d_op_flags(op);
	WARN_ON_ONCE(dentry->d_op);
	WARN_ON_ONCE(dentry->d_flags & DCACHE_OP_FLAGS);
	dentry->d_op = op;
	if (flags)
		dentry->d_flags |= flags;
}

void set_default_d_op(struct super_block *s, const struct dentry_operations *ops)
{
	unsigned int flags = d_op_flags(ops);
	s->__s_d_op = ops;
	s->s_d_flags = (s->s_d_flags & ~DCACHE_OP_FLAGS) | flags;
}
EXPORT_SYMBOL(set_default_d_op);

static unsigned d_flags_for_inode(struct inode *inode)
{
	unsigned add_flags = DCACHE_REGULAR_TYPE;

	if (!inode)
		return DCACHE_MISS_TYPE;

	if (S_ISDIR(inode->i_mode)) {
		add_flags = DCACHE_DIRECTORY_TYPE;
		if (unlikely(!(inode->i_opflags & IOP_LOOKUP))) {
			if (unlikely(!inode->i_op->lookup))
				add_flags = DCACHE_AUTODIR_TYPE;
			else
				inode->i_opflags |= IOP_LOOKUP;
		}
		goto type_determined;
	}

	if (unlikely(!(inode->i_opflags & IOP_NOFOLLOW))) {
		if (unlikely(inode->i_op->get_link)) {
			add_flags = DCACHE_SYMLINK_TYPE;
			goto type_determined;
		}
		inode->i_opflags |= IOP_NOFOLLOW;
	}

	if (unlikely(!S_ISREG(inode->i_mode)))
		add_flags = DCACHE_SPECIAL_TYPE;

type_determined:
	if (unlikely(IS_AUTOMOUNT(inode)))
		add_flags |= DCACHE_NEED_AUTOMOUNT;
	return add_flags;
}

static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
	unsigned add_flags = d_flags_for_inode(inode);
	WARN_ON(d_in_lookup(dentry));

	/*
	 * The negative counter only tracks dentries on the LRU. Don't dec if
	 * d_lru is on another list.
	 */
	if ((dentry->d_flags &
	     (DCACHE_LRU_LIST|DCACHE_SHRINK_LIST)) == DCACHE_LRU_LIST)
		this_cpu_dec(nr_dentry_negative);
	hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
	raw_write_seqcount_begin(&dentry->d_seq);
	__d_set_inode_and_type(dentry, inode, add_flags);
	raw_write_seqcount_end(&dentry->d_seq);
	fsnotify_update_flags(dentry);
}

/**
 * d_instantiate - fill in inode information for a dentry
 * @entry: dentry to complete
 * @inode: inode to attach to this dentry
 *
 * Fill in inode information in the entry.
 *
 * This turns negative dentries into productive full members
 * of society.
 *
 * NOTE! This assumes that the inode count has been incremented
 * (or otherwise set) by the caller to indicate that it is now
 * in use by the dcache.
 */
 
void d_instantiate(struct dentry *entry, struct inode * inode)
{
	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
	if (inode) {
		security_d_instantiate(entry, inode);
		spin_lock(&inode->i_lock);
		spin_lock(&entry->d_lock);
		__d_instantiate(entry, inode);
		spin_unlock(&entry->d_lock);
		spin_unlock(&inode->i_lock);
	}
}
EXPORT_SYMBOL(d_instantiate);

/*
 * This should be equivalent to d_instantiate() + unlock_new_inode(),
 * with lockdep-related part of unlock_new_inode() done before
 * anything else.  Use that instead of open-coding d_instantiate()/
 * unlock_new_inode() combinations.
 */
void d_instantiate_new(struct dentry *entry, struct inode *inode)
{
	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
	BUG_ON(!inode);
	lockdep_annotate_inode_mutex_key(inode);
	security_d_instantiate(entry, inode);
	spin_lock(&inode->i_lock);
	spin_lock(&entry->d_lock);
	__d_instantiate(entry, inode);
	spin_unlock(&entry->d_lock);
	WARN_ON(!(inode_state_read(inode) & I_NEW));
	inode_state_clear(inode, I_NEW | I_CREATING);
	inode_wake_up_bit(inode, __I_NEW);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_instantiate_new);

struct dentry *d_make_root(struct inode *root_inode)
{
	struct dentry *res = NULL;

	if (root_inode) {
		res = d_alloc_anon(root_inode->i_sb);
		if (res)
			d_instantiate(res, root_inode);
		else
			iput(root_inode);
	}
	return res;
}
EXPORT_SYMBOL(d_make_root);

static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
{
	struct super_block *sb;
	struct dentry *new, *res;

	if (!inode)
		return ERR_PTR(-ESTALE);
	if (IS_ERR(inode))
		return ERR_CAST(inode);

	sb = inode->i_sb;

	res = d_find_any_alias(inode); /* existing alias? */
	if (res)
		goto out;

	new = d_alloc_anon(sb);
	if (!new) {
		res = ERR_PTR(-ENOMEM);
		goto out;
	}

	security_d_instantiate(new, inode);
	spin_lock(&inode->i_lock);
	res = __d_find_any_alias(inode); /* recheck under lock */
	if (likely(!res)) { /* still no alias, attach a disconnected dentry */
		unsigned add_flags = d_flags_for_inode(inode);

		if (disconnected)
			add_flags |= DCACHE_DISCONNECTED;

		spin_lock(&new->d_lock);
		__d_set_inode_and_type(new, inode, add_flags);
		hlist_add_head(&new->d_u.d_alias, &inode->i_dentry);
		if (!disconnected) {
			hlist_bl_lock(&sb->s_roots);
			hlist_bl_add_head(&new->d_hash, &sb->s_roots);
			hlist_bl_unlock(&sb->s_roots);
		}
		spin_unlock(&new->d_lock);
		spin_unlock(&inode->i_lock);
		inode = NULL; /* consumed by new->d_inode */
		res = new;
	} else {
		spin_unlock(&inode->i_lock);
		dput(new);
	}

 out:
	iput(inode);
	return res;
}

/**
 * d_obtain_alias - find or allocate a DISCONNECTED dentry for a given inode
 * @inode: inode to allocate the dentry for
 *
 * Obtain a dentry for an inode resulting from NFS filehandle conversion or
 * similar open by handle operations.  The returned dentry may be anonymous,
 * or may have a full name (if the inode was already in the cache).
 *
 * When called on a directory inode, we must ensure that the inode only ever
 * has one dentry.  If a dentry is found, that is returned instead of
 * allocating a new one.
 *
 * On successful return, the reference to the inode has been transferred
 * to the dentry.  In case of an error the reference on the inode is released.
 * To make it easier to use in export operations a %NULL or IS_ERR inode may
 * be passed in and the error will be propagated to the return value,
 * with a %NULL @inode replaced by ERR_PTR(-ESTALE).
 */
struct dentry *d_obtain_alias(struct inode *inode)
{
	return __d_obtain_alias(inode, true);
}
EXPORT_SYMBOL(d_obtain_alias);

/**
 * d_obtain_root - find or allocate a dentry for a given inode
 * @inode: inode to allocate the dentry for
 *
 * Obtain an IS_ROOT dentry for the root of a filesystem.
 *
 * We must ensure that directory inodes only ever have one dentry.  If a
 * dentry is found, that is returned instead of allocating a new one.
 *
 * On successful return, the reference to the inode has been transferred
 * to the dentry.  In case of an error the reference on the inode is
 * released.  A %NULL or IS_ERR inode may be passed in and the error will
 * be propagated to the return value, with a %NULL @inode replaced by
 * ERR_PTR(-ESTALE).
 */
struct dentry *d_obtain_root(struct inode *inode)
{
	return __d_obtain_alias(inode, false);
}
EXPORT_SYMBOL(d_obtain_root);

/**
 * d_add_ci - lookup or allocate new dentry with case-exact name
 * @dentry: the negative dentry that was passed to the parent's lookup func
 * @inode:  the inode case-insensitive lookup has found
 * @name:   the case-exact name to be associated with the returned dentry
 *
 * This is to avoid filling the dcache with case-insensitive names to the
 * same inode; only the actual correct case is stored in the dcache for
 * case-insensitive filesystems.
 *
 * For a case-insensitive lookup match and if the case-exact dentry
 * already exists in the dcache, use it and return it.
 *
 * If no entry exists with the exact case name, allocate new dentry with
 * the exact case, and return the spliced entry.
 */
struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
			struct qstr *name)
{
	struct dentry *found, *res;

	/*
	 * First check if a dentry matching the name already exists,
	 * if not go ahead and create it now.
	 */
	found = d_hash_and_lookup(dentry->d_parent, name);
	if (found) {
		iput(inode);
		return found;
	}
	if (d_in_lookup(dentry)) {
		found = d_alloc_parallel(dentry->d_parent, name,
					dentry->d_wait);
		if (IS_ERR(found) || !d_in_lookup(found)) {
			iput(inode);
			return found;
		}
	} else {
		found = d_alloc(dentry->d_parent, name);
		if (!found) {
			iput(inode);
			return ERR_PTR(-ENOMEM);
		} 
	}
	res = d_splice_alias(inode, found);
	if (res) {
		d_lookup_done(found);
		dput(found);
		return res;
	}
	return found;
}
EXPORT_SYMBOL(d_add_ci);

/**
 * d_same_name - compare dentry name with a given name
 * @dentry: dentry whose name to compare
 * @parent: parent dentry, consulted for a possible ->d_compare() op
 * @name:   name to compare against
 *
 * Return: true if the names are the same, false otherwise
 */
bool d_same_name(const struct dentry *dentry, const struct dentry *parent,
		 const struct qstr *name)
{
	if (likely(!(parent->d_flags & DCACHE_OP_COMPARE))) {
		if (dentry->d_name.len != name->len)
			return false;
		return dentry_cmp(dentry, name->name, name->len) == 0;
	}
	return parent->d_op->d_compare(dentry,
				       dentry->d_name.len, dentry->d_name.name,
				       name) == 0;
}
EXPORT_SYMBOL_GPL(d_same_name);

/*
 * This is __d_lookup_rcu() when the parent dentry has
 * DCACHE_OP_COMPARE, which makes things much nastier.
 */
static noinline struct dentry *__d_lookup_rcu_op_compare(
	const struct dentry *parent,
	const struct qstr *name,
	unsigned *seqp)
{
	u64 hashlen = name->hash_len;
	struct hlist_bl_head *b = d_hash(hashlen);
	struct hlist_bl_node *node;
	struct dentry *dentry;

	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
		int tlen;
		const char *tname;
		unsigned seq;

seqretry:
		seq = raw_seqcount_begin(&dentry->d_seq);
		if (dentry->d_parent != parent)
			continue;
		if (d_unhashed(dentry))
			continue;
		if (dentry->d_name.hash != hashlen_hash(hashlen))
			continue;
		tlen = dentry->d_name.len;
		tname = dentry->d_name.name;
		/* we want a consistent (name,len) pair */
		if (read_seqcount_retry(&dentry->d_seq, seq)) {
			cpu_relax();
			goto seqretry;
		}
		if (parent->d_op->d_compare(dentry, tlen, tname, name) != 0)
			continue;
		*seqp = seq;
		return dentry;
	}
	return NULL;
}

/**
 * __d_lookup_rcu - search for a dentry (racy, store-free)
 * @parent: parent dentry
 * @name: qstr of name we wish to find
 * @seqp: returns d_seq value at the point where the dentry was found
 * Returns: dentry, or NULL
 *
 * __d_lookup_rcu is the dcache lookup function for rcu-walk name
 * resolution (store-free path walking) design described in
 * Documentation/filesystems/path-lookup.txt.
 *
 * This is not to be used outside core vfs.
 *
 * __d_lookup_rcu must only be used in rcu-walk mode, ie. with vfsmount lock
 * held, and rcu_read_lock held. The returned dentry must not be stored into
 * without taking d_lock and checking d_seq sequence count against @seq
 * returned here.
 *
 * Alternatively, __d_lookup_rcu may be called again to look up the child of
 * the returned dentry, so long as its parent's seqlock is checked after the
 * child is looked up. Thus, an interlocking stepping of sequence lock checks
 * is formed, giving integrity down the path walk.
 *
 * NOTE! The caller *has* to check the resulting dentry against the sequence
 * number we've returned before using any of the resulting dentry state!
 */
struct dentry *__d_lookup_rcu(const struct dentry *parent,
				const struct qstr *name,
				unsigned *seqp)
{
	u64 hashlen = name->hash_len;
	const unsigned char *str = name->name;
	struct hlist_bl_head *b = d_hash(hashlen);
	struct hlist_bl_node *node;
	struct dentry *dentry;

	/*
	 * Note: There is significant duplication with __d_lookup_rcu which is
	 * required to prevent single threaded performance regressions
	 * especially on architectures where smp_rmb (in seqcounts) are costly.
	 * Keep the two functions in sync.
	 */

	if (unlikely(parent->d_flags & DCACHE_OP_COMPARE))
		return __d_lookup_rcu_op_compare(parent, name, seqp);

	/*
	 * The hash list is protected using RCU.
	 *
	 * Carefully use d_seq when comparing a candidate dentry, to avoid
	 * races with d_move().
	 *
	 * It is possible that concurrent renames can mess up our list
	 * walk here and result in missing our dentry, resulting in the
	 * false-negative result. d_lookup() protects against concurrent
	 * renames using rename_lock seqlock.
	 *
	 * See Documentation/filesystems/path-lookup.txt for more details.
	 */
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
		unsigned seq;

		/*
		 * The dentry sequence count protects us from concurrent
		 * renames, and thus protects parent and name fields.
		 *
		 * The caller must perform a seqcount check in order
		 * to do anything useful with the returned dentry.
		 *
		 * NOTE! We do a "raw" seqcount_begin here. That means that
		 * we don't wait for the sequence count to stabilize if it
		 * is in the middle of a sequence change. If we do the slow
		 * dentry compare, we will do seqretries until it is stable,
		 * and if we end up with a successful lookup, we actually
		 * want to exit RCU lookup anyway.
		 *
		 * Note that raw_seqcount_begin still *does* smp_rmb(), so
		 * we are still guaranteed NUL-termination of ->d_name.name.
		 */
		seq = raw_seqcount_begin(&dentry->d_seq);
		if (dentry->d_parent != parent)
			continue;
		if (dentry->d_name.hash_len != hashlen)
			continue;
		if (unlikely(dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0))
			continue;
		/*
		 * Check for the dentry being unhashed.
		 *
		 * As tempting as it is, we *can't* skip it because of a race window
		 * between us finding the dentry before it gets unhashed and loading
		 * the sequence counter after unhashing is finished.
		 *
		 * We can at least predict on it.
		 */
		if (unlikely(d_unhashed(dentry)))
			continue;
		*seqp = seq;
		return dentry;
	}
	return NULL;
}

/**
 * d_lookup - search for a dentry
 * @parent: parent dentry
 * @name: qstr of name we wish to find
 * Returns: dentry, or NULL
 *
 * d_lookup searches the children of the parent dentry for the name in
 * question. If the dentry is found its reference count is incremented and the
 * dentry is returned. The caller must use dput to free the entry when it has
 * finished using it. %NULL is returned if the dentry does not exist.
 */
struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
{
	struct dentry *dentry;
	unsigned seq;

	do {
		seq = read_seqbegin(&rename_lock);
		dentry = __d_lookup(parent, name);
		if (dentry)
			break;
	} while (read_seqretry(&rename_lock, seq));
	return dentry;
}
EXPORT_SYMBOL(d_lookup);

/**
 * __d_lookup - search for a dentry (racy)
 * @parent: parent dentry
 * @name: qstr of name we wish to find
 * Returns: dentry, or NULL
 *
 * __d_lookup is like d_lookup, however it may (rarely) return a
 * false-negative result due to unrelated rename activity.
 *
 * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
 * however it must be used carefully, eg. with a following d_lookup in
 * the case of failure.
 *
 * __d_lookup callers must be commented.
 */
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
	unsigned int hash = name->hash;
	struct hlist_bl_head *b = d_hash(hash);
	struct hlist_bl_node *node;
	struct dentry *found = NULL;
	struct dentry *dentry;

	/*
	 * Note: There is significant duplication with __d_lookup_rcu which is
	 * required to prevent single threaded performance regressions
	 * especially on architectures where smp_rmb (in seqcounts) are costly.
	 * Keep the two functions in sync.
	 */

	/*
	 * The hash list is protected using RCU.
	 *
	 * Take d_lock when comparing a candidate dentry, to avoid races
	 * with d_move().
	 *
	 * It is possible that concurrent renames can mess up our list
	 * walk here and result in missing our dentry, resulting in the
	 * false-negative result. d_lookup() protects against concurrent
	 * renames using rename_lock seqlock.
	 *
	 * See Documentation/filesystems/path-lookup.txt for more details.
	 */
	rcu_read_lock();
	
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

		if (dentry->d_name.hash != hash)
			continue;

		spin_lock(&dentry->d_lock);
		if (dentry->d_parent != parent)
			goto next;
		if (d_unhashed(dentry))
			goto next;

		if (!d_same_name(dentry, parent, name))
			goto next;

		dentry->d_lockref.count++;
		found = dentry;
		spin_unlock(&dentry->d_lock);
		break;
next:
		spin_unlock(&dentry->d_lock);
 	}
 	rcu_read_unlock();

 	return found;
}

/**
 * d_hash_and_lookup - hash the qstr then search for a dentry
 * @dir: Directory to search in
 * @name: qstr of name we wish to find
 *
 * On lookup failure NULL is returned; on bad name - ERR_PTR(-error)
 */
struct dentry *d_hash_and_lookup(struct dentry *dir, struct qstr *name)
{
	/*
	 * Check for a fs-specific hash function. Note that we must
	 * calculate the standard hash first, as the d_op->d_hash()
	 * routine may choose to leave the hash value unchanged.
	 */
	name->hash = full_name_hash(dir, name->name, name->len);
	if (dir->d_flags & DCACHE_OP_HASH) {
		int err = dir->d_op->d_hash(dir, name);
		if (unlikely(err < 0))
			return ERR_PTR(err);
	}
	return d_lookup(dir, name);
}

/*
 * When a file is deleted, we have two options:
 * - turn this dentry into a negative dentry
 * - unhash this dentry and free it.
 *
 * Usually, we want to just turn this into
 * a negative dentry, but if anybody else is
 * currently using the dentry or the inode
 * we can't do that and we fall back on removing
 * it from the hash queues and waiting for
 * it to be deleted later when it has no users
 */
 
/**
 * d_delete - delete a dentry
 * @dentry: The dentry to delete
 *
 * Turn the dentry into a negative dentry if possible, otherwise
 * remove it from the hash queues so it can be deleted later
 */
 
void d_delete(struct dentry * dentry)
{
	struct inode *inode = dentry->d_inode;

	spin_lock(&inode->i_lock);
	spin_lock(&dentry->d_lock);
	/*
	 * Are we the only user?
	 */
	if (dentry->d_lockref.count == 1) {
		if (dentry_negative_policy)
			__d_drop(dentry);
		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
		dentry_unlink_inode(dentry);
	} else {
		__d_drop(dentry);
		spin_unlock(&dentry->d_lock);
		spin_unlock(&inode->i_lock);
	}
}
EXPORT_SYMBOL(d_delete);

static void __d_rehash(struct dentry *entry)
{
	struct hlist_bl_head *b = d_hash(entry->d_name.hash);

	hlist_bl_lock(b);
	hlist_bl_add_head_rcu(&entry->d_hash, b);
	hlist_bl_unlock(b);
}

/**
 * d_rehash	- add an entry back to the hash
 * @entry: dentry to add to the hash
 *
 * Adds a dentry to the hash according to its name.
 */
 
void d_rehash(struct dentry * entry)
{
	spin_lock(&entry->d_lock);
	__d_rehash(entry);
	spin_unlock(&entry->d_lock);
}
EXPORT_SYMBOL(d_rehash);

static inline unsigned start_dir_add(struct inode *dir)
{
	preempt_disable_nested();
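	/* i_dir_seq is even when stable; an odd value means an add is in flight. */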
	for (;;) {
		unsigned n = READ_ONCE(dir->i_dir_seq);
		if (!(n & 1) && try_cmpxchg(&dir->i_dir_seq, &n, n + 1))
			return n;
		cpu_relax();
	}
}

static inline void end_dir_add(struct inode *dir, unsigned int n,
			       wait_queue_head_t *d_wait)
{
	smp_store_release(&dir->i_dir_seq, n + 2);
	preempt_enable_nested();
	if (wq_has_sleeper(d_wait))
		wake_up_all(d_wait);
}

static void d_wait_lookup(struct dentry *dentry)
{
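	/*
	 * Classic waitqueue loop: called with d_lock held, drops it around
	 * schedule() and retakes it before rechecking d_in_lookup().
	 */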
	if (d_in_lookup(dentry)) {
		DECLARE_WAITQUEUE(wait, current);
		add_wait_queue(dentry->d_wait, &wait);
		do {
			set_current_state(TASK_UNINTERRUPTIBLE);
			spin_unlock(&dentry->d_lock);
			schedule();
			spin_lock(&dentry->d_lock);
		} while (d_in_lookup(dentry));
	}
}

struct dentry *d_alloc_parallel(struct dentry *parent,
				const struct qstr *name,
				wait_queue_head_t *wq)
{
	unsigned int hash = name->hash;
	struct hlist_bl_head *b = in_lookup_hash(parent, hash);
	struct hlist_bl_node *node;
	struct dentry *new = __d_alloc(parent->d_sb, name);
	struct dentry *dentry;
	unsigned seq, r_seq, d_seq;

	if (unlikely(!new))
		return ERR_PTR(-ENOMEM);

	new->d_flags |= DCACHE_PAR_LOOKUP;
	spin_lock(&parent->d_lock);
	new->d_parent = dget_dlock(parent);
	hlist_add_head(&new->d_sib, &parent->d_children);
	if (parent->d_flags & DCACHE_DISCONNECTED)
		new->d_flags |= DCACHE_DISCONNECTED;
	spin_unlock(&parent->d_lock);

retry:
	rcu_read_lock();
	seq = smp_load_acquire(&parent->d_inode->i_dir_seq);
	r_seq = read_seqbegin(&rename_lock);
	dentry = __d_lookup_rcu(parent, name, &d_seq);
	if (unlikely(dentry)) {
		if (!lockref_get_not_dead(&dentry->d_lockref)) {
			rcu_read_unlock();
			goto retry;
		}
		if (read_seqcount_retry(&dentry->d_seq, d_seq)) {
			rcu_read_unlock();
			dput(dentry);
			goto retry;
		}
		rcu_read_unlock();
		dput(new);
		return dentry;
	}
	if (unlikely(read_seqretry(&rename_lock, r_seq))) {
		rcu_read_unlock();
		goto retry;
	}

	if (unlikely(seq & 1)) {
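		/* Odd i_dir_seq: a directory add is in flight; retry. */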
		rcu_read_unlock();
		goto retry;
	}

	hlist_bl_lock(b);
	if (unlikely(READ_ONCE(parent->d_inode->i_dir_seq) != seq)) {
		hlist_bl_unlock(b);
		rcu_read_unlock();
		goto retry;
	}
	/*
	 * No changes for the parent since the beginning of d_lookup().
	 * Since all removals from the chain happen with hlist_bl_lock(),
	 * any potential in-lookup matches are going to stay here until
	 * we unlock the chain.  All fields are stable in everything
	 * we encounter.
	 */
	hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
		if (dentry->d_name.hash != hash)
			continue;
		if (dentry->d_parent != parent)
			continue;
		if (!d_same_name(dentry, parent, name))
			continue;
		hlist_bl_unlock(b);
		/* now we can try to grab a reference */
		if (!lockref_get_not_dead(&dentry->d_lockref)) {
			rcu_read_unlock();
			goto retry;
		}

		rcu_read_unlock();
		/*
		 * somebody is likely to be still doing lookup for it;
		 * wait for them to finish
		 */
		spin_lock(&dentry->d_lock);
		d_wait_lookup(dentry);
		/*
		 * it's not in-lookup anymore; in principle we should repeat
		 * everything from dcache lookup, but it's likely to be what
		 * d_lookup() would've found anyway.  If it is, just return it;
		 * otherwise we really have to repeat the whole thing.
		 */
		if (unlikely(dentry->d_name.hash != hash))
			goto mismatch;
		if (unlikely(dentry->d_parent != parent))
			goto mismatch;
		if (unlikely(d_unhashed(dentry)))
			goto mismatch;
		if (unlikely(!d_same_name(dentry, parent, name)))
			goto mismatch;
		/* OK, it *is* a hashed match; return it */
		spin_unlock(&dentry->d_lock);
		dput(new);
		return dentry;
	}
	rcu_read_unlock();
	new->d_wait = wq;
	hlist_bl_add_head(&new->d_u.d_in_lookup_hash, b);
	hlist_bl_unlock(b);
	return new;
mismatch:
	spin_unlock(&dentry->d_lock);
	dput(dentry);
	goto retry;
}
EXPORT_SYMBOL(d_alloc_parallel);

/*
 * - Unhash the dentry
 * - Retrieve and clear the waitqueue head in dentry
 * - Return the waitqueue head
 */
static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
{
	wait_queue_head_t *d_wait;
	struct hlist_bl_head *b;

	lockdep_assert_held(&dentry->d_lock);

	b = in_lookup_hash(dentry->d_parent, dentry->d_name.hash);
	hlist_bl_lock(b);
	dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
	__hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
	d_wait = dentry->d_wait;
	dentry->d_wait = NULL;
	hlist_bl_unlock(b);
	INIT_HLIST_NODE(&dentry->d_u.d_alias);
	INIT_LIST_HEAD(&dentry->d_lru);
	return d_wait;
}

void __d_lookup_unhash_wake(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	wake_up_all(__d_lookup_unhash(dentry));
	spin_unlock(&dentry->d_lock);
}
EXPORT_SYMBOL(__d_lookup_unhash_wake);

/* inode->i_lock held if inode is non-NULL */

static inline void __d_add(struct dentry *dentry, struct inode *inode,
			   const struct dentry_operations *ops)
{
	wait_queue_head_t *d_wait;
	struct inode *dir = NULL;
	unsigned n;
	spin_lock(&dentry->d_lock);
	if (unlikely(d_in_lookup(dentry))) {
		dir = dentry->d_parent->d_inode;
		n = start_dir_add(dir);
		d_wait = __d_lookup_unhash(dentry);
	}
	if (unlikely(ops))
		d_set_d_op(dentry, ops);
	if (inode) {
		unsigned add_flags = d_flags_for_inode(inode);
		hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
		raw_write_seqcount_begin(&dentry->d_seq);
		__d_set_inode_and_type(dentry, inode, add_flags);
		raw_write_seqcount_end(&dentry->d_seq);
		fsnotify_update_flags(dentry);
	}
	__d_rehash(dentry);
	if (dir)
		end_dir_add(dir, n, d_wait);
	spin_unlock(&dentry->d_lock);
	if (inode)
		spin_unlock(&inode->i_lock);
}

/**
 * d_add - add dentry to hash queues
 * @entry: dentry to add
 * @inode: The inode to attach to this dentry
 *
 * This adds the entry to the hash queues and initializes @inode.
 * The entry was actually filled in earlier during d_alloc().
 */

void d_add(struct dentry *entry, struct inode *inode)
{
	if (inode) {
		security_d_instantiate(entry, inode);
		spin_lock(&inode->i_lock);
	}
	__d_add(entry, inode, NULL);
}
EXPORT_SYMBOL(d_add);

struct dentry *d_make_persistent(struct dentry *dentry, struct inode *inode)
{
	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
	WARN_ON(!inode);
	security_d_instantiate(dentry, inode);
	spin_lock(&inode->i_lock);
	spin_lock(&dentry->d_lock);
	__d_instantiate(dentry, inode);
	dentry->d_flags |= DCACHE_PERSISTENT;
	dget_dlock(dentry);
	if (d_unhashed(dentry))
		__d_rehash(dentry);
	spin_unlock(&dentry->d_lock);
	spin_unlock(&inode->i_lock);
	return dentry;
}
EXPORT_SYMBOL(d_make_persistent);

static void swap_names(struct dentry *dentry, struct dentry *target)
{
	if (unlikely(dname_external(target))) {
		if (unlikely(dname_external(dentry))) {
			/*
			 * Both external: swap the pointers
			 */
			swap(target->__d_name.name, dentry->__d_name.name);
		} else {
			/*
			 * dentry:internal, target:external.  Steal target's
			 * storage and make target internal.
			 */
			dentry->__d_name.name = target->__d_name.name;
			target->d_shortname = dentry->d_shortname;
			target->__d_name.name = target->d_shortname.string;
		}
	} else {
		if (unlikely(dname_external(dentry))) {
			/*
			 * dentry:external, target:internal.  Give dentry's
			 * storage to target and make dentry internal
			 */
			target->__d_name.name = dentry->__d_name.name;
			dentry->d_shortname = target->d_shortname;
			dentry->__d_name.name = dentry->d_shortname.string;
		} else {
			/*
			 * Both are internal.
			 */
			for (int i = 0; i < DNAME_INLINE_WORDS; i++)
				swap(dentry->d_shortname.words[i],
				     target->d_shortname.words[i]);
		}
	}
	swap(dentry->__d_name.hash_len, target->__d_name.hash_len);
}

static void copy_name(struct dentry *dentry, struct dentry *target)
{
	struct external_name *old_name = NULL;
	if (unlikely(dname_external(dentry)))
		old_name = external_name(dentry);
	if (unlikely(dname_external(target))) {
		atomic_inc(&external_name(target)->count);
		dentry->__d_name = target->__d_name;
	} else {
		dentry->d_shortname = target->d_shortname;
		dentry->__d_name.name = dentry->d_shortname.string;
		dentry->__d_name.hash_len = target->__d_name.hash_len;
	}
	if (old_name && likely(atomic_dec_and_test(&old_name->count)))
		kfree_rcu(old_name, head);
}

/*
 * __d_move - move a dentry
 * @dentry: entry to move
 * @target: new dentry
 * @exchange: exchange the two dentries
 *
 * Update the dcache to reflect the move of a file name. Negative dcache
 * entries should not be moved in this way. Caller must hold rename_lock, the
 * i_rwsem of the source and target directories (exclusively), and the sb->
 * s_vfs_rename_mutex if they differ. See lock_rename().
 */
static void __d_move(struct dentry *dentry, struct dentry *target,
		     bool exchange)
{
	struct dentry *old_parent, *p;
	wait_queue_head_t *d_wait;
	struct inode *dir = NULL;
	unsigned n;

	WARN_ON(!dentry->d_inode);
	if (WARN_ON(dentry == target))
		return;

	BUG_ON(d_ancestor(target, dentry));
	old_parent = dentry->d_parent;
	p = d_ancestor(old_parent, target);
	if (IS_ROOT(dentry)) {
		BUG_ON(p);
		spin_lock(&target->d_parent->d_lock);
	} else if (!p) {
		/* target is not a descendent of dentry->d_parent */
		spin_lock(&target->d_parent->d_lock);
		spin_lock_nested(&old_parent->d_lock, DENTRY_D_LOCK_NESTED);
	} else {
		BUG_ON(p == dentry);
		spin_lock(&old_parent->d_lock);
		if (p != target)
			spin_lock_nested(&target->d_parent->d_lock,
					DENTRY_D_LOCK_NESTED);
	}
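	/*
	 * All relevant parent locks are held at this point; dentry and
	 * target themselves take the deeper lockdep subclasses 2 and 3
	 * to express the ordering.
	 */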
	spin_lock_nested(&dentry->d_lock, 2);
	spin_lock_nested(&target->d_lock, 3);

	if (unlikely(d_in_lookup(target))) {
		dir = target->d_parent->d_inode;
		n = start_dir_add(dir);
		d_wait = __d_lookup_unhash(target);
	}

	write_seqcount_begin(&dentry->d_seq);
	write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED);

	/* unhash both */
	if (!d_unhashed(dentry))
		___d_drop(dentry);
	if (!d_unhashed(target))
		___d_drop(target);

	/* ... and switch them in the tree */
	dentry->d_parent = target->d_parent;
	if (!exchange) {
		copy_name(dentry, target);
		target->d_hash.pprev = NULL;
		dentry->d_parent->d_lockref.count++;
		if (dentry != old_parent) /* wasn't IS_ROOT */
			WARN_ON(!--old_parent->d_lockref.count);
	} else {
		target->d_parent = old_parent;
		swap_names(dentry, target);
		if (!hlist_unhashed(&target->d_sib))
			__hlist_del(&target->d_sib);
		hlist_add_head(&target->d_sib, &target->d_parent->d_children);
		__d_rehash(target);
		fsnotify_update_flags(target);
	}
	if (!hlist_unhashed(&dentry->d_sib))
		__hlist_del(&dentry->d_sib);
	hlist_add_head(&dentry->d_sib, &dentry->d_parent->d_children);
	__d_rehash(dentry);
	fsnotify_update_flags(dentry);
	fscrypt_handle_d_move(dentry);

	write_seqcount_end(&target->d_seq);
	write_seqcount_end(&dentry->d_seq);

	if (dir)
		end_dir_add(dir, n, d_wait);

	if (dentry->d_parent != old_parent)
		spin_unlock(&dentry->d_parent->d_lock);
	if (dentry != old_parent)
		spin_unlock(&old_parent->d_lock);
	spin_unlock(&target->d_lock);
	spin_unlock(&dentry->d_lock);
}

/*
 * d_move - move a dentry
 * @dentry: entry to move
 * @target: new dentry
 *
 * Update the dcache to reflect the move of a file name. Negative
 * dcache entries should not be moved in this way. See the locking
 * requirements for __d_move.
 */
void d_move(struct dentry *dentry, struct dentry *target)
{
	write_seqlock(&rename_lock);
	__d_move(dentry, target, false);
	write_sequnlock(&rename_lock);
}
EXPORT_SYMBOL(d_move);

/*
 * d_exchange - exchange two dentries
 * @dentry1: first dentry
 * @dentry2: second dentry
 */
void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
{
	write_seqlock(&rename_lock);

	WARN_ON(!dentry1->d_inode);
	WARN_ON(!dentry2->d_inode);
	WARN_ON(IS_ROOT(dentry1));
	WARN_ON(IS_ROOT(dentry2));

	__d_move(dentry1, dentry2, true);

	write_sequnlock(&rename_lock);
}
EXPORT_SYMBOL(d_exchange);

/**
 * d_ancestor - search for an ancestor
 * @p1: ancestor dentry
 * @p2: child dentry
 *
 * Returns the ancestor dentry of p2 which is a child of p1, if p1 is
 * an ancestor of p2, else NULL.
 */
struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
{
	struct dentry *p;

	for (p = p2; !IS_ROOT(p); p = p->d_parent) {
		if (p->d_parent == p1)
			return p;
	}
	return NULL;
}

/*
 * This helper attempts to cope with remotely renamed directories
 *
 * It assumes that the caller is already holding
 * dentry->d_parent->d_inode->i_rwsem, and rename_lock
 *
 * Note: If ever the locking in lock_rename() changes, then please
 * remember to update this too...
 */
static int __d_unalias(struct dentry *dentry, struct dentry *alias)
{
	struct mutex *m1 = NULL;
	struct rw_semaphore *m2 = NULL;
	int ret = -ESTALE;

	/* If alias and dentry share a parent, then no extra locks required */
	if (alias->d_parent == dentry->d_parent)
		goto out_unalias;

	/* See lock_rename() */
	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
		goto out_err;
	m1 = &dentry->d_sb->s_vfs_rename_mutex;
	if (!inode_trylock_shared(alias->d_parent->d_inode))
		goto out_err;
	m2 = &alias->d_parent->d_inode->i_rwsem;
out_unalias:
	if (alias->d_op && alias->d_op->d_unalias_trylock &&
	    !alias->d_op->d_unalias_trylock(alias))
		goto out_err;
	__d_move(alias, dentry, false);
	if (alias->d_op && alias->d_op->d_unalias_unlock)
		alias->d_op->d_unalias_unlock(alias);
	ret = 0;
out_err:
	if (m2)
		up_read(m2);
	if (m1)
		mutex_unlock(m1);
	return ret;
}

struct dentry *d_splice_alias_ops(struct inode *inode, struct dentry *dentry,
				  const struct dentry_operations *ops)
{
	if (IS_ERR(inode))
		return ERR_CAST(inode);

	BUG_ON(!d_unhashed(dentry));

	if (!inode)
		goto out;

	security_d_instantiate(dentry, inode);
	spin_lock(&inode->i_lock);
	if (S_ISDIR(inode->i_mode)) {
		struct dentry *new = __d_find_any_alias(inode);
		if (unlikely(new)) {
			/* The reference to new ensures it remains an alias */
			spin_unlock(&inode->i_lock);
			write_seqlock(&rename_lock);
			if (unlikely(d_ancestor(new, dentry))) {
				write_sequnlock(&rename_lock);
				dput(new);
				new = ERR_PTR(-ELOOP);
				pr_warn_ratelimited(
					"VFS: Lookup of '%s' in %s %s"
					" would have caused loop\n",
					dentry->d_name.name,
					inode->i_sb->s_type->name,
					inode->i_sb->s_id);
			} else if (!IS_ROOT(new)) {
				struct dentry *old_parent = dget(new->d_parent);
				int err = __d_unalias(dentry, new);
				write_sequnlock(&rename_lock);
				if (err) {
					dput(new);
					new = ERR_PTR(err);
				}
				dput(old_parent);
			} else {
				__d_move(new, dentry, false);
				write_sequnlock(&rename_lock);
			}
			iput(inode);
			return new;
		}
	}
out:
	__d_add(dentry, inode, ops);
	return NULL;
}

/**
 * d_splice_alias - splice a disconnected dentry into the tree if one exists
 * @inode:  the inode which may have a disconnected dentry
 * @dentry: a negative dentry which we want to point to the inode.
 *
 * If inode is a directory and has an IS_ROOT alias, then d_move that in
 * place of the given dentry and return it, else simply d_add the inode
 * to the dentry and return NULL.
 *
 * If a non-IS_ROOT directory is found, the filesystem is corrupt, and
 * we should error out: directories can't have multiple aliases.
 *
 * This is needed in the lookup routine of any filesystem that is exportable
 * (via knfsd) so that we can build dcache paths to directories effectively.
 *
 * If a dentry was found and moved, then it is returned.  Otherwise NULL
 * is returned.  This matches the expected return value of ->lookup.
 *
 * Cluster filesystems may call this function with a negative, hashed dentry.
 * In that case, we know that the inode will be a regular file, and also this
 * will only occur during atomic_open. So we need to check for the dentry
 * being already hashed only in the final case.
 */
struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
{
	return d_splice_alias_ops(inode, dentry, NULL);
}
EXPORT_SYMBOL(d_splice_alias);

/*
 * Test whether new_dentry is a subdirectory of old_dentry.
 *
 * Trivially implemented using the dcache structure
 */

/**
 * is_subdir - is new dentry a subdirectory of old_dentry
 * @new_dentry: new dentry
 * @old_dentry: old dentry
 *
 * Returns true if new_dentry is a subdirectory of old_dentry (at any depth).
 * Returns false otherwise.
 * Caller must ensure that "new_dentry" is pinned before calling is_subdir()
 */
  
bool is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
{
	bool subdir;
	unsigned seq;

	if (new_dentry == old_dentry)
		return true;

	/* Access d_parent under rcu as d_move() may change it. */
	rcu_read_lock();
	seq = read_seqbegin(&rename_lock);
	subdir = d_ancestor(old_dentry, new_dentry);
	 /* Try lockless once... */
	if (read_seqretry(&rename_lock, seq)) {
		/* ...else acquire lock for progress even on deep chains. */
		read_seqlock_excl(&rename_lock);
		subdir = d_ancestor(old_dentry, new_dentry);
		read_sequnlock_excl(&rename_lock);
	}
	rcu_read_unlock();
	return subdir;
}
EXPORT_SYMBOL(is_subdir);

void d_mark_tmpfile(struct file *file, struct inode *inode)
{
	struct dentry *dentry = file->f_path.dentry;

	BUG_ON(dname_external(dentry) ||
		!hlist_unhashed(&dentry->d_u.d_alias) ||
		!d_unlinked(dentry));
	spin_lock(&dentry->d_parent->d_lock);
	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
	dentry->__d_name.len = sprintf(dentry->d_shortname.string, "#%llu",
				(unsigned long long)inode->i_ino);
	spin_unlock(&dentry->d_lock);
	spin_unlock(&dentry->d_parent->d_lock);
}
EXPORT_SYMBOL(d_mark_tmpfile);

void d_tmpfile(struct file *file, struct inode *inode)
{
	struct dentry *dentry = file->f_path.dentry;

	inode_dec_link_count(inode);
	d_mark_tmpfile(file, inode);
	d_instantiate(dentry, inode);
}
EXPORT_SYMBOL(d_tmpfile);

/*
 * Obtain inode number of the parent dentry.
 */
ino_t d_parent_ino(struct dentry *dentry)
{
	struct dentry *parent;
	struct inode *iparent;
	unsigned seq;
	ino_t ret;

	scoped_guard(rcu) {
		seq = raw_seqcount_begin(&dentry->d_seq);
		parent = READ_ONCE(dentry->d_parent);
		iparent = d_inode_rcu(parent);
		if (likely(iparent)) {
			ret = iparent->i_ino;
			if (!read_seqcount_retry(&dentry->d_seq, seq))
				return ret;
		}
	}

	spin_lock(&dentry->d_lock);
	ret = dentry->d_parent->d_inode->i_ino;
	spin_unlock(&dentry->d_lock);
	return ret;
}
EXPORT_SYMBOL(d_parent_ino);

static __initdata unsigned long dhash_entries;
static int __init set_dhash_entries(char *str)
{
	if (!str)
		return 0;
	dhash_entries = simple_strtoul(str, &str, 0);
	return 1;
}
__setup("dhash_entries=", set_dhash_entries);

static void __init dcache_init_early(void)
{
	/* If hashes are distributed across NUMA nodes, defer
	 * hash allocation until vmalloc space is available.
	 */
	if (hashdist)
		return;

	dentry_hashtable =
		alloc_large_system_hash("Dentry cache",
					sizeof(struct hlist_bl_head),
					dhash_entries,
					13,
					HASH_EARLY | HASH_ZERO,
					&d_hash_shift,
					NULL,
					0,
					0);
	d_hash_shift = 32 - d_hash_shift;
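	/* Convert to a right-shift so d_hash() indexes with the hash's top bits. */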

	runtime_const_init(shift, d_hash_shift);
	runtime_const_init(ptr, dentry_hashtable);
}

static void __init dcache_init(void)
{
	/*
	 * A constructor could be added for stable state like the lists,
	 * but it is probably not worth it because of the cache nature
	 * of the dcache.
	 */
	__dentry_cache = KMEM_CACHE_USERCOPY(dentry,
		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_ACCOUNT,
		d_shortname.string);
	runtime_const_init(ptr, __dentry_cache);

	/* Hash may have been set up in dcache_init_early */
	if (!hashdist)
		return;

	dentry_hashtable =
		alloc_large_system_hash("Dentry cache",
					sizeof(struct hlist_bl_head),
					dhash_entries,
					13,
					HASH_ZERO,
					&d_hash_shift,
					NULL,
					0,
					0);
	d_hash_shift = 32 - d_hash_shift;

	runtime_const_init(shift, d_hash_shift);
	runtime_const_init(ptr, dentry_hashtable);
}

/* SLAB cache for __getname() consumers */
struct kmem_cache *names_cachep __ro_after_init;
EXPORT_SYMBOL(names_cachep);

void __init vfs_caches_init_early(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(in_lookup_hashtable); i++)
		INIT_HLIST_BL_HEAD(&in_lookup_hashtable[i]);

	dcache_init_early();
	inode_init_early();
}

void __init vfs_caches_init(void)
{
	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
			SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);

	dcache_init();
	inode_init();
	files_init();
	files_maxfiles_init();
	mnt_init();
	bdev_cache_init();
	chrdev_init();
}
]

7/2 2026/02/02 12:49 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 37431
	output: 23
	thoughts: 1521

Thoughts:
**Analyzing a Potential Race Condition in Dentry Destruction**

Okay, let's break this down. I'm looking at a report that flags a potential data race. It seems we have two tasks, 3002 and 3304, both operating on the same `dentry` within the `__dentry_kill` function in `fs/dcache.c`. Specifically, task 3002 is performing a write: `dentry->d_flags &= ~DCACHE_LRU_LIST;` within `d_lru_del`, while task 3304 is performing a read-modify-write: `dentry->d_flags |= DCACHE_DENTRY_KILLED;` within `dentry_unlist`. The report pinpoints the read to a specific line in `dentry_unlist`, which, from what I see, is correct. Both these function calls occur inside `__dentry_kill`.

My initial thought is: how can two separate tasks be executing `__dentry_kill` on the same `dentry`? `__dentry_kill` is called from `finish_dput`. The function starts by marking the `dentry` as dead with `lockref_mark_dead`, which sets the lockref count to a negative value, so no other task should be able to take a reference to the dentry afterwards.

I can see in `finish_dput` that there is a loop calling `__dentry_kill`. The loop calls `lock_for_kill`, which checks the reference count: if the count is non-zero it returns false and the dentry is left alone. Indeed, the first check inside `lock_for_kill` is `unlikely(dentry->d_lockref.count)`.

If something is wrong with the reference counting or locking within `dentry` destruction, it is likely that one task has managed to decrement the reference count down to zero, and another task is trying to decrement the same dentry, causing this race condition. In theory, `lockref_mark_dead` within `__dentry_kill` is designed to prevent this; it makes the `dentry` unrecoverably dead. So I'm now curious about the implementation of `lockref_mark_dead`. I need to review `include/linux/lockref.h` now to understand its behavior.

This whole thing raises a red flag. It implies that two tasks are concurrently attempting to "kill" the same dentry, which should be impossible given the presumed locking mechanisms. This is likely a bug.



8/2 2026/02/02 12:50 tool read-file
0m Args:
map[File:include/linux/lockref.h]

Results:
map[Contents:/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __LINUX_LOCKREF_H
#define __LINUX_LOCKREF_H

/*
 * Locked reference counts.
 *
 * These are different from just plain atomic refcounts in that they
 * are atomic with respect to the spinlock that goes with them.  In
 * particular, there can be implementations that don't actually get
 * the spinlock for the common decrement/increment operations, but they
 * still have to check that the operation is done semantically as if
 * the spinlock had been taken (using a cmpxchg operation that covers
 * both the lock and the count word, or using memory transactions, for
 * example).
 */

#include <linux/spinlock.h>
#include <generated/bounds.h>

#define USE_CMPXCHG_LOCKREF \
	(IS_ENABLED(CONFIG_ARCH_USE_CMPXCHG_LOCKREF) && \
	 IS_ENABLED(CONFIG_SMP) && SPINLOCK_SIZE <= 4)

struct lockref {
	union {
#if USE_CMPXCHG_LOCKREF
		aligned_u64 lock_count;
#endif
		struct {
			spinlock_t lock;
			int count;
		};
	};
};

/**
 * lockref_init - Initialize a lockref
 * @lockref: pointer to lockref structure
 *
 * Initializes @lockref->count to 1.
 */
static inline void lockref_init(struct lockref *lockref)
{
	spin_lock_init(&lockref->lock);
	lockref->count = 1;
}

void lockref_get(struct lockref *lockref);
int lockref_put_return(struct lockref *lockref);
bool lockref_get_not_zero(struct lockref *lockref);
bool lockref_put_or_lock(struct lockref *lockref);
#define lockref_put_or_lock(_lockref) \
	(!__cond_lock((_lockref)->lock, !lockref_put_or_lock(_lockref)))

void lockref_mark_dead(struct lockref *lockref);
bool lockref_get_not_dead(struct lockref *lockref);

/* Must be called under spinlock for reliable results */
static inline bool __lockref_is_dead(const struct lockref *l)
{
	return ((int)l->count < 0);
}

#endif /* __LINUX_LOCKREF_H */
]

9/2 2026/02/02 12:50 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 40128
	output: 21
	thoughts: 12

Thoughts:

**Lockref Implementation Considerations**

Right, so I'm thinking about the implementation details. I need to keep the high-level design in mind, but at this stage I'm turning my attention to the actual code. The core logic will probably reside in `lib/lockref.c`, so I'll dig in there and look at how the primitives are orchestrated: the atomicity guarantees, memory ordering, and the locking strategy, since contention and subtle ordering flaws are exactly what could let two tasks race on the same refcount. I also need to keep architecture differences in mind, since the cmpxchg-based fast path from the header is conditional on the platform.



10/2 2026/02/02 12:50 tool read-file
0m Args:
map[File:lib/lockref.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0
#include <linux/export.h>
#include <linux/lockref.h>

#if USE_CMPXCHG_LOCKREF

/*
 * Note that the "cmpxchg()" reloads the "old" value for the
 * failure case.
 */
#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
	int retry = 100;							\
	struct lockref old;							\
	BUILD_BUG_ON(sizeof(old) != 8);						\
	old.lock_count = READ_ONCE(lockref->lock_count);			\
	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {  	\
		struct lockref new = old;					\
		CODE								\
		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
						 &old.lock_count,		\
						 new.lock_count))) {		\
			SUCCESS;						\
		}								\
		if (!--retry)							\
			break;							\
	}									\
} while (0)

#else

#define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)

#endif

/**
 * lockref_get - Increments reference count unconditionally
 * @lockref: pointer to lockref structure
 *
 * This operation is only valid if you already hold a reference
 * to the object, so you know the count cannot be zero.
 */
void lockref_get(struct lockref *lockref)
{
	CMPXCHG_LOOP(
		new.count++;
	,
		return;
	);

	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}
EXPORT_SYMBOL(lockref_get);

/**
 * lockref_get_not_zero - Increments count unless the count is 0 or dead
 * @lockref: pointer to lockref structure
 * Return: 1 if count updated successfully or 0 if count was zero
 */
bool lockref_get_not_zero(struct lockref *lockref)
{
	bool retval = false;

	CMPXCHG_LOOP(
		new.count++;
		if (old.count <= 0)
			return false;
	,
		return true;
	);

	spin_lock(&lockref->lock);
	if (lockref->count > 0) {
		lockref->count++;
		retval = true;
	}
	spin_unlock(&lockref->lock);
	return retval;
}
EXPORT_SYMBOL(lockref_get_not_zero);

/**
 * lockref_put_return - Decrement reference count if possible
 * @lockref: pointer to lockref structure
 *
 * Decrement the reference count and return the new value.
 * If the lockref was dead or locked, return -1.
 */
int lockref_put_return(struct lockref *lockref)
{
	CMPXCHG_LOOP(
		new.count--;
		if (old.count <= 0)
			return -1;
	,
		return new.count;
	);
	return -1;
}
EXPORT_SYMBOL(lockref_put_return);

/**
 * lockref_put_or_lock - decrements count unless count <= 1 before decrement
 * @lockref: pointer to lockref structure
 * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
 */
#undef lockref_put_or_lock
bool lockref_put_or_lock(struct lockref *lockref)
{
	CMPXCHG_LOOP(
		new.count--;
		if (old.count <= 1)
			break;
	,
		return true;
	);

	spin_lock(&lockref->lock);
	if (lockref->count <= 1)
		return false;
	lockref->count--;
	spin_unlock(&lockref->lock);
	return true;
}
EXPORT_SYMBOL(lockref_put_or_lock);

/**
 * lockref_mark_dead - mark lockref dead
 * @lockref: pointer to lockref structure
 */
void lockref_mark_dead(struct lockref *lockref)
{
	assert_spin_locked(&lockref->lock);
	lockref->count = -128;
}
EXPORT_SYMBOL(lockref_mark_dead);

/**
 * lockref_get_not_dead - Increments count unless the ref is dead
 * @lockref: pointer to lockref structure
 * Return: 1 if count updated successfully or 0 if lockref was dead
 */
bool lockref_get_not_dead(struct lockref *lockref)
{
	bool retval = false;

	CMPXCHG_LOOP(
		new.count++;
		if (old.count < 0)
			return false;
	,
		return true;
	);

	spin_lock(&lockref->lock);
	if (lockref->count >= 0) {
		lockref->count++;
		retval = true;
	}
	spin_unlock(&lockref->lock);
	return retval;
}
EXPORT_SYMBOL(lockref_get_not_dead);
]

11/2 2026/02/02 12:50 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 41698
							output: 21
							thoughts: 2783
						

Thoughts:
**Dentry Lifecycle and Race Conditions: A Deep Dive**

Okay, so I'm digging into this dentry kill situation. It looks like `__dentry_kill` is the heart of the matter, being responsible for marking a dentry as "dead". Initially, I noticed that `lockref_mark_dead` is called within `__dentry_kill` while holding `dentry->d_lock`. Seems straightforward, right? `__dentry_kill` is called from `finish_dput`, which runs when a dentry's reference count drops to zero; the main entry point is `dput`. The `lock_for_kill` function is responsible for ensuring exclusive access to the dentry before the kill can proceed: it acquires `dentry->d_lock` (and possibly `inode->i_lock`) before `__dentry_kill` is called.

Now, the interesting bit. `fast_dput` is called before `finish_dput`. If `fast_dput` finds the dentry already "dead", it returns early with a WARN and `finish_dput` is skipped. If it sees `dentry->d_lockref.count` drop to zero, it takes the dentry's lock, and `finish_dput` is then called.

Now, the race condition. While `__dentry_kill` is running, it calls `dentry_unlink_inode`, which drops `dentry->d_lock`, and that could allow another thread to start running `__dentry_kill` on the same dentry. This seems to be where the problems arise.

If another task sees that the dentry is already dead, `lockref_put_return` will return -1; and if `dentry->d_lockref.count` is found at -128, the caller WARNs.

Looking closer at `__dentry_kill`, it *can* return a parent dentry, allowing for a loop. The main issue is that during the process, there's a window where `dentry->d_lock` is unlocked before it's re-acquired. During this window, another task could potentially start `__dentry_kill` on the *same* dentry, or a new dentry.
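
To keep that window straight in my head, here is a rough skeleton of `__dentry_kill` as I currently understand it (reconstructed from the report and the discussion above, with assumed ordering; `fs/dcache.c` is the authoritative source):

```c
static struct dentry *__dentry_kill(struct dentry *dentry)
{
	struct dentry *parent = NULL;

	/* entered with dentry->d_lock held (via lock_for_kill()) */
	lockref_mark_dead(&dentry->d_lockref);	/* count = -128 */
	d_lru_del(dentry);			/* clears DCACHE_LRU_LIST */
	dentry_unlink_inode(dentry);		/* drops dentry->d_lock */

	cond_resched();				/* <-- the unlocked window */

	spin_lock(&dentry->d_lock);		/* re-acquire */
	dentry_unlist(dentry);			/* sets DCACHE_DENTRY_KILLED */
	/* ... unlock and free the dentry ... */
	return parent;				/* caller may loop on the parent */
}
```

If a second task can enter `__dentry_kill` for the same dentry during that window, its `d_lru_del` write to `d_flags` races with the first task's `dentry_unlist` read-modify-write on the same field.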

So I need to consider `shrink_dentry_list`, since it also calls `__dentry_kill`; this might be how the second kill is reached. But `shrink_dentry_list` calls `lock_for_kill` before it starts, and `lock_for_kill` checks `dentry->d_lockref.count` and returns `false` if the dentry is already dead, so it shouldn't proceed on a dead dentry. That makes this path unlikely.

Ah, the log indicates a WARN, confirming the kernel's concern. But if `fast_dput` is triggering the WARN, it returns *before* `finish_dput` gets called, so it cannot be the one calling `__dentry_kill`.

The key is that both tasks in the report have the *same* stack trace up to `lookup_fast`, which means both are calling `dput` on the same dentry from the same `lookup_fast` routine. Now I need to work out how `lookup_fast` could end up issuing concurrent `dput` calls on the same dentry; the candidate call sites are quoted just below.
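
For orientation before reading the file, these are the two `dput` call sites reachable from the `lookup_fast` path that I'll be looking at, quoted from the file contents fetched next (comments mine):

```c
/* 1. lookup_fast(): shared tail when revalidation fails */
	if (unlikely(status <= 0)) {
		if (!status)
			d_invalidate(dentry);
		dput(dentry);		/* can drop the last reference */
		return ERR_PTR(status);
	}

/* 2. try_to_unlazy_next(): d_seq retry after the reference was grabbed */
out_dput:
	leave_rcu(nd);
	dput(dentry);			/* likewise */
	return false;
```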



12/2 2026/02/02 12:50 tool read-file
0m Args:
map[File:fs/namei.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0
/*
 *  linux/fs/namei.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 */

/*
 * Some corrections by tytso.
 */

/* [Feb 1997 T. Schoebel-Theuer] Complete rewrite of the pathname
 * lookup logic.
 */
/* [Feb-Apr 2000, AV] Rewrite to the new namespace architecture.
 */

#include <linux/init.h>
#include <linux/export.h>
#include <linux/slab.h>
#include <linux/wordpart.h>
#include <linux/fs.h>
#include <linux/filelock.h>
#include <linux/namei.h>
#include <linux/pagemap.h>
#include <linux/sched/mm.h>
#include <linux/fsnotify.h>
#include <linux/personality.h>
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/mount.h>
#include <linux/audit.h>
#include <linux/capability.h>
#include <linux/file.h>
#include <linux/fcntl.h>
#include <linux/device_cgroup.h>
#include <linux/fs_struct.h>
#include <linux/posix_acl.h>
#include <linux/hash.h>
#include <linux/bitops.h>
#include <linux/init_task.h>
#include <linux/uaccess.h>

#include "internal.h"
#include "mount.h"

/* [Feb-1997 T. Schoebel-Theuer]
 * Fundamental changes in the pathname lookup mechanisms (namei)
 * were necessary because of omirr.  The reason is that omirr needs
 * to know the _real_ pathname, not the user-supplied one, in case
 * of symlinks (and also when transname replacements occur).
 *
 * The new code replaces the old recursive symlink resolution with
 * an iterative one (in case of non-nested symlink chains).  It does
 * this with calls to <fs>_follow_link().
 * As a side effect, dir_namei(), _namei() and follow_link() are now 
 * replaced with a single function lookup_dentry() that can handle all 
 * the special cases of the former code.
 *
 * With the new dcache, the pathname is stored at each inode, at least as
 * long as the refcount of the inode is positive.  As a side effect, the
 * size of the dcache depends on the inode cache and thus is dynamic.
 *
 * [29-Apr-1998 C. Scott Ananian] Updated above description of symlink
 * resolution to correspond with current state of the code.
 *
 * Note that the symlink resolution is not *completely* iterative.
 * There is still a significant amount of tail- and mid- recursion in
 * the algorithm.  Also, note that <fs>_readlink() is not used in
 * lookup_dentry(): lookup_dentry() on the result of <fs>_readlink()
 * may return different results than <fs>_follow_link().  Many virtual
 * filesystems (including /proc) exhibit this behavior.
 */

/* [24-Feb-97 T. Schoebel-Theuer] Side effects caused by new implementation:
 * New symlink semantics: when open() is called with flags O_CREAT | O_EXCL
 * and the name already exists in form of a symlink, try to create the new
 * name indicated by the symlink. The old code always complained that the
 * name already exists, due to not following the symlink even if its target
 * is nonexistent.  The new semantics affects also mknod() and link() when
 * the name is a symlink pointing to a non-existent name.
 *
 * I don't know which semantics is the right one, since I have no access
 * to standards. But I found by trial that HP-UX 9.0 has the full "new"
 * semantics implemented, while SunOS 4.1.1 and Solaris (SunOS 5.4) have the
 * "old" one. Personally, I think the new semantics is much more logical.
 * Note that "ln old new" where "new" is a symlink pointing to a non-existing
 * file does succeed in both HP-UX and SunOs, but not in Solaris
 * and in the old Linux semantics.
 */

/* [16-Dec-97 Kevin Buhr] For security reasons, we change some symlink
 * semantics.  See the comments in "open_namei" and "do_link" below.
 *
 * [10-Sep-98 Alan Modra] Another symlink change.
 */

/* [Feb-Apr 2000 AV] Complete rewrite. Rules for symlinks:
 *	inside the path - always follow.
 *	in the last component in creation/removal/renaming - never follow.
 *	if LOOKUP_FOLLOW passed - follow.
 *	if the pathname has trailing slashes - follow.
 *	otherwise - don't follow.
 * (applied in that order).
 *
 * [Jun 2000 AV] Inconsistent behaviour of open() in case if flags==O_CREAT
 * restored for 2.4. This is the last surviving part of old 4.2BSD bug.
 * During the 2.4 we need to fix the userland stuff depending on it -
 * hopefully we will be able to get rid of that wart in 2.5. So far only
 * XEmacs seems to be relying on it...
 */
/*
 * [Sep 2001 AV] Single-semaphore locking scheme (kudos to David Holland)
 * implemented.  Let's see if raised priority of ->s_vfs_rename_mutex gives
 * any extra contention...
 */

/* In order to reduce some races, while at the same time doing additional
 * checking and hopefully speeding things up, we copy filenames to the
 * kernel data space before using them..
 *
 * POSIX.1 2.4: an empty pathname is invalid (ENOENT).
 * PATH_MAX includes the nul terminator --RR.
 */

#define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))

static inline void initname(struct filename *name, const char __user *uptr)
{
	name->uptr = uptr;
	name->aname = NULL;
	atomic_set(&name->refcnt, 1);
}

struct filename *
getname_flags(const char __user *filename, int flags)
{
	struct filename *result;
	char *kname;
	int len;

	result = audit_reusename(filename);
	if (result)
		return result;

	result = __getname();
	if (unlikely(!result))
		return ERR_PTR(-ENOMEM);

	/*
	 * First, try to embed the struct filename inside the names_cache
	 * allocation
	 */
	kname = (char *)result->iname;
	result->name = kname;

	len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
	/*
	 * Handle both empty path and copy failure in one go.
	 */
	if (unlikely(len <= 0)) {
		if (unlikely(len < 0)) {
			__putname(result);
			return ERR_PTR(len);
		}

		/* The empty path is special. */
		if (!(flags & LOOKUP_EMPTY)) {
			__putname(result);
			return ERR_PTR(-ENOENT);
		}
	}

	/*
	 * Uh-oh. We have a name that's approaching PATH_MAX. Allocate a
	 * separate struct filename so we can dedicate the entire
	 * names_cache allocation for the pathname, and re-do the copy from
	 * userland.
	 */
	if (unlikely(len == EMBEDDED_NAME_MAX)) {
		const size_t size = offsetof(struct filename, iname[1]);
		kname = (char *)result;

		/*
		 * size is chosen that way we to guarantee that
		 * result->iname[0] is within the same object and that
		 * kname can't be equal to result->iname, no matter what.
		 */
		result = kzalloc(size, GFP_KERNEL);
		if (unlikely(!result)) {
			__putname(kname);
			return ERR_PTR(-ENOMEM);
		}
		result->name = kname;
		len = strncpy_from_user(kname, filename, PATH_MAX);
		if (unlikely(len < 0)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(len);
		}
		/* The empty path is special. */
		if (unlikely(!len) && !(flags & LOOKUP_EMPTY)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(-ENOENT);
		}
		if (unlikely(len == PATH_MAX)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(-ENAMETOOLONG);
		}
	}
	initname(result, filename);
	audit_getname(result);
	return result;
}

struct filename *getname_uflags(const char __user *filename, int uflags)
{
	int flags = (uflags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;

	return getname_flags(filename, flags);
}

struct filename *__getname_maybe_null(const char __user *pathname)
{
	struct filename *name;
	char c;

	/* try to save on allocations; loss on um, though */
	if (get_user(c, pathname))
		return ERR_PTR(-EFAULT);
	if (!c)
		return NULL;

	name = getname_flags(pathname, LOOKUP_EMPTY);
	if (!IS_ERR(name) && !(name->name[0])) {
		putname(name);
		name = NULL;
	}
	return name;
}

struct filename *getname_kernel(const char * filename)
{
	struct filename *result;
	int len = strlen(filename) + 1;

	result = __getname();
	if (unlikely(!result))
		return ERR_PTR(-ENOMEM);

	if (len <= EMBEDDED_NAME_MAX) {
		result->name = (char *)result->iname;
	} else if (len <= PATH_MAX) {
		const size_t size = offsetof(struct filename, iname[1]);
		struct filename *tmp;

		tmp = kmalloc(size, GFP_KERNEL);
		if (unlikely(!tmp)) {
			__putname(result);
			return ERR_PTR(-ENOMEM);
		}
		tmp->name = (char *)result;
		result = tmp;
	} else {
		__putname(result);
		return ERR_PTR(-ENAMETOOLONG);
	}
	memcpy((char *)result->name, filename, len);
	initname(result, NULL);
	audit_getname(result);
	return result;
}
EXPORT_SYMBOL(getname_kernel);

void putname(struct filename *name)
{
	int refcnt;

	if (IS_ERR_OR_NULL(name))
		return;

	refcnt = atomic_read(&name->refcnt);
	if (unlikely(refcnt != 1)) {
		if (WARN_ON_ONCE(!refcnt))
			return;

		if (!atomic_dec_and_test(&name->refcnt))
			return;
	}

	if (unlikely(name->name != name->iname)) {
		__putname(name->name);
		kfree(name);
	} else
		__putname(name);
}
EXPORT_SYMBOL(putname);

/**
 * check_acl - perform ACL permission checking
 * @idmap:	idmap of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
 * This function performs the ACL permission checking. Since this function
 * retrieve POSIX acls it needs to know whether it is called from a blocking or
 * non-blocking context and thus cares about the MAY_NOT_BLOCK bit.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
static int check_acl(struct mnt_idmap *idmap,
		     struct inode *inode, int mask)
{
#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl *acl;

	if (mask & MAY_NOT_BLOCK) {
		acl = get_cached_acl_rcu(inode, ACL_TYPE_ACCESS);
	        if (!acl)
	                return -EAGAIN;
		/* no ->get_inode_acl() calls in RCU mode... */
		if (is_uncached_acl(acl))
			return -ECHILD;
	        return posix_acl_permission(idmap, inode, acl, mask);
	}

	acl = get_inode_acl(inode, ACL_TYPE_ACCESS);
	if (IS_ERR(acl))
		return PTR_ERR(acl);
	if (acl) {
	        int error = posix_acl_permission(idmap, inode, acl, mask);
	        posix_acl_release(acl);
	        return error;
	}
#endif

	return -EAGAIN;
}

/*
 * Very quick optimistic "we know we have no ACL's" check.
 *
 * Note that this is purely for ACL_TYPE_ACCESS, and purely
 * for the "we have cached that there are no ACLs" case.
 *
 * If this returns true, we know there are no ACLs. But if
 * it returns false, we might still not have ACLs (it could
 * be the is_uncached_acl() case).
 */
static inline bool no_acl_inode(struct inode *inode)
{
#ifdef CONFIG_FS_POSIX_ACL
	return likely(!READ_ONCE(inode->i_acl));
#else
	return true;
#endif
}

/**
 * acl_permission_check - perform basic UNIX permission checking
 * @idmap:	idmap of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
 * This function performs the basic UNIX permission checking. Since this
 * function may retrieve POSIX acls it needs to know whether it is called from a
 * blocking or non-blocking context and thus cares about the MAY_NOT_BLOCK bit.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
static int acl_permission_check(struct mnt_idmap *idmap,
				struct inode *inode, int mask)
{
	unsigned int mode = inode->i_mode;
	vfsuid_t vfsuid;

	/*
	 * Common cheap case: everybody has the requested
	 * rights, and there are no ACLs to check. No need
	 * to do any owner/group checks in that case.
	 *
	 *  - 'mask&7' is the requested permission bit set
	 *  - multiplying by 0111 spreads them out to all of ugo
	 *  - '& ~mode' looks for missing inode permission bits
	 *  - the '!' is for "no missing permissions"
	 *
	 * After that, we just need to check that there are no
	 * ACL's on the inode - do the 'IS_POSIXACL()' check last
	 * because it will dereference the ->i_sb pointer and we
	 * want to avoid that if at all possible.
	 */
	if (!((mask & 7) * 0111 & ~mode)) {
		if (no_acl_inode(inode))
			return 0;
		if (!IS_POSIXACL(inode))
			return 0;
	}

	/* Are we the owner? If so, ACL's don't matter */
	vfsuid = i_uid_into_vfsuid(idmap, inode);
	if (likely(vfsuid_eq_kuid(vfsuid, current_fsuid()))) {
		mask &= 7;
		mode >>= 6;
		return (mask & ~mode) ? -EACCES : 0;
	}

	/* Do we have ACL's? */
	if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
		int error = check_acl(idmap, inode, mask);
		if (error != -EAGAIN)
			return error;
	}

	/* Only RWX matters for group/other mode bits */
	mask &= 7;

	/*
	 * Are the group permissions different from
	 * the other permissions in the bits we care
	 * about? Need to check group ownership if so.
	 */
	if (mask & (mode ^ (mode >> 3))) {
		vfsgid_t vfsgid = i_gid_into_vfsgid(idmap, inode);
		if (vfsgid_in_group_p(vfsgid))
			mode >>= 3;
	}

	/* Bits in 'mode' clear that we require? */
	return (mask & ~mode) ? -EACCES : 0;
}

/**
 * generic_permission -  check for access rights on a Posix-like filesystem
 * @idmap:	idmap of the mount the inode was found from
 * @inode:	inode to check access rights for
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC,
 *		%MAY_NOT_BLOCK ...)
 *
 * Used to check for read/write/execute permissions on a file.
 * We use "fsuid" for this, letting us set arbitrary permissions
 * for filesystem access without changing the "normal" uids which
 * are used for other things.
 *
 * generic_permission is rcu-walk aware. It returns -ECHILD in case an rcu-walk
 * request cannot be satisfied (eg. requires blocking or too much complexity).
 * It would then be called again in ref-walk mode.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int generic_permission(struct mnt_idmap *idmap, struct inode *inode,
		       int mask)
{
	int ret;

	/*
	 * Do the basic permission checks.
	 */
	ret = acl_permission_check(idmap, inode, mask);
	if (ret != -EACCES)
		return ret;

	if (S_ISDIR(inode->i_mode)) {
		/* DACs are overridable for directories */
		if (!(mask & MAY_WRITE))
			if (capable_wrt_inode_uidgid(idmap, inode,
						     CAP_DAC_READ_SEARCH))
				return 0;
		if (capable_wrt_inode_uidgid(idmap, inode,
					     CAP_DAC_OVERRIDE))
			return 0;
		return -EACCES;
	}

	/*
	 * Searching includes executable on directories, else just read.
	 */
	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
	if (mask == MAY_READ)
		if (capable_wrt_inode_uidgid(idmap, inode,
					     CAP_DAC_READ_SEARCH))
			return 0;
	/*
	 * Read/write DACs are always overridable.
	 * Executable DACs are overridable when there is
	 * at least one exec bit set.
	 */
	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
		if (capable_wrt_inode_uidgid(idmap, inode,
					     CAP_DAC_OVERRIDE))
			return 0;

	return -EACCES;
}
EXPORT_SYMBOL(generic_permission);

/**
 * do_inode_permission - UNIX permission checking
 * @idmap:	idmap of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
 * We _really_ want to just do "generic_permission()" without
 * even looking at the inode->i_op values. So we keep a cache
 * flag in inode->i_opflags, that says "this has not special
 * permission function, use the fast case".
 */
static inline int do_inode_permission(struct mnt_idmap *idmap,
				      struct inode *inode, int mask)
{
	if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
		if (likely(inode->i_op->permission))
			return inode->i_op->permission(idmap, inode, mask);

		/* This gets set once for the inode lifetime */
		spin_lock(&inode->i_lock);
		inode->i_opflags |= IOP_FASTPERM;
		spin_unlock(&inode->i_lock);
	}
	return generic_permission(idmap, inode, mask);
}

/**
 * sb_permission - Check superblock-level permissions
 * @sb: Superblock of inode to check permission on
 * @inode: Inode to check permission on
 * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
 *
 * Separate out file-system wide checks from inode-specific permission checks.
 *
 * Note: lookup_inode_permission_may_exec() does not call here. If you add
 * MAY_EXEC checks, adjust it.
 */
static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
{
	if (mask & MAY_WRITE) {
		umode_t mode = inode->i_mode;

		/* Nobody gets write access to a read-only fs. */
		if (sb_rdonly(sb) && (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
			return -EROFS;
	}
	return 0;
}

/**
 * inode_permission - Check for access rights to a given inode
 * @idmap:	idmap of the mount the inode was found from
 * @inode:	Inode to check permission on
 * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
 *
 * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
 * this, letting us set arbitrary permissions for filesystem access without
 * changing the "normal" UIDs which are used for other things.
 *
 * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
 */
int inode_permission(struct mnt_idmap *idmap,
		     struct inode *inode, int mask)
{
	int retval;

	retval = sb_permission(inode->i_sb, inode, mask);
	if (unlikely(retval))
		return retval;

	if (mask & MAY_WRITE) {
		/*
		 * Nobody gets write access to an immutable file.
		 */
		if (unlikely(IS_IMMUTABLE(inode)))
			return -EPERM;

		/*
		 * Updating mtime will likely cause i_uid and i_gid to be
		 * written back improperly if their true value is unknown
		 * to the vfs.
		 */
		if (unlikely(HAS_UNMAPPED_ID(idmap, inode)))
			return -EACCES;
	}

	retval = do_inode_permission(idmap, inode, mask);
	if (unlikely(retval))
		return retval;

	retval = devcgroup_inode_permission(inode, mask);
	if (unlikely(retval))
		return retval;

	return security_inode_permission(inode, mask);
}
EXPORT_SYMBOL(inode_permission);

/*
 * lookup_inode_permission_may_exec - Check traversal right for given inode
 *
 * This is a special case routine for may_lookup() making assumptions specific
 * to path traversal. Use inode_permission() if you are doing something else.
 *
 * Work is shaved off compared to inode_permission() as follows:
 * - we know for a fact there is no MAY_WRITE to worry about
 * - it is an invariant the inode is a directory
 *
 * Since majority of real-world traversal happens on inodes which grant it for
 * everyone, we check it upfront and only resort to more expensive work if it
 * fails.
 *
 * Filesystems which have their own ->permission hook and consequently miss out
 * on IOP_FASTPERM can still get the optimization if they set IOP_FASTPERM_MAY_EXEC
 * on their directory inodes.
 */
static __always_inline int lookup_inode_permission_may_exec(struct mnt_idmap *idmap,
	struct inode *inode, int mask)
{
	/* Lookup already checked this to return -ENOTDIR */
	VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode), inode);
	VFS_BUG_ON((mask & ~MAY_NOT_BLOCK) != 0);

	mask |= MAY_EXEC;

	if (unlikely(!(inode->i_opflags & (IOP_FASTPERM | IOP_FASTPERM_MAY_EXEC))))
		return inode_permission(idmap, inode, mask);

	if (unlikely(((inode->i_mode & 0111) != 0111) || !no_acl_inode(inode)))
		return inode_permission(idmap, inode, mask);

	return security_inode_permission(inode, mask);
}

/**
 * path_get - get a reference to a path
 * @path: path to get the reference to
 *
 * Given a path increment the reference count to the dentry and the vfsmount.
 */
void path_get(const struct path *path)
{
	mntget(path->mnt);
	dget(path->dentry);
}
EXPORT_SYMBOL(path_get);

/**
 * path_put - put a reference to a path
 * @path: path to put the reference to
 *
 * Given a path decrement the reference count to the dentry and the vfsmount.
 */
void path_put(const struct path *path)
{
	dput(path->dentry);
	mntput(path->mnt);
}
EXPORT_SYMBOL(path_put);

#define EMBEDDED_LEVELS 2
struct nameidata {
	struct path	path;
	struct qstr	last;
	struct path	root;
	struct inode	*inode; /* path.dentry.d_inode */
	unsigned int	flags, state;
	unsigned	seq, next_seq, m_seq, r_seq;
	int		last_type;
	unsigned	depth;
	int		total_link_count;
	struct saved {
		struct path link;
		struct delayed_call done;
		const char *name;
		unsigned seq;
	} *stack, internal[EMBEDDED_LEVELS];
	struct filename	*name;
	const char *pathname;
	struct nameidata *saved;
	unsigned	root_seq;
	int		dfd;
	vfsuid_t	dir_vfsuid;
	umode_t		dir_mode;
} __randomize_layout;

#define ND_ROOT_PRESET 1
#define ND_ROOT_GRABBED 2
#define ND_JUMPED 4

static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
	struct nameidata *old = current->nameidata;
	p->stack = p->internal;
	p->depth = 0;
	p->dfd = dfd;
	p->name = name;
	p->pathname = likely(name) ? name->name : "";
	p->path.mnt = NULL;
	p->path.dentry = NULL;
	p->total_link_count = old ? old->total_link_count : 0;
	p->saved = old;
	current->nameidata = p;
}

static inline void set_nameidata(struct nameidata *p, int dfd, struct filename *name,
			  const struct path *root)
{
	__set_nameidata(p, dfd, name);
	p->state = 0;
	if (unlikely(root)) {
		p->state = ND_ROOT_PRESET;
		p->root = *root;
	}
}

static void restore_nameidata(void)
{
	struct nameidata *now = current->nameidata, *old = now->saved;

	current->nameidata = old;
	if (old)
		old->total_link_count = now->total_link_count;
	if (now->stack != now->internal)
		kfree(now->stack);
}

static bool nd_alloc_stack(struct nameidata *nd)
{
	struct saved *p;

	p= kmalloc_array(MAXSYMLINKS, sizeof(struct saved),
			 nd->flags & LOOKUP_RCU ? GFP_ATOMIC : GFP_KERNEL);
	if (unlikely(!p))
		return false;
	memcpy(p, nd->internal, sizeof(nd->internal));
	nd->stack = p;
	return true;
}

/**
 * path_connected - Verify that a dentry is below mnt.mnt_root
 * @mnt: The mountpoint to check.
 * @dentry: The dentry to check.
 *
 * Rename can sometimes move a file or directory outside of a bind
 * mount, path_connected allows those cases to be detected.
 */
static bool path_connected(struct vfsmount *mnt, struct dentry *dentry)
{
	struct super_block *sb = mnt->mnt_sb;

	/* Bind mounts can have disconnected paths */
	if (mnt->mnt_root == sb->s_root)
		return true;

	return is_subdir(dentry, mnt->mnt_root);
}

static void drop_links(struct nameidata *nd)
{
	int i = nd->depth;
	while (i--) {
		struct saved *last = nd->stack + i;
		do_delayed_call(&last->done);
		clear_delayed_call(&last->done);
	}
}

static void leave_rcu(struct nameidata *nd)
{
	nd->flags &= ~LOOKUP_RCU;
	nd->seq = nd->next_seq = 0;
	rcu_read_unlock();
}

static void terminate_walk(struct nameidata *nd)
{
	if (unlikely(nd->depth))
		drop_links(nd);
	if (!(nd->flags & LOOKUP_RCU)) {
		int i;
		path_put(&nd->path);
		for (i = 0; i < nd->depth; i++)
			path_put(&nd->stack[i].link);
		if (nd->state & ND_ROOT_GRABBED) {
			path_put(&nd->root);
			nd->state &= ~ND_ROOT_GRABBED;
		}
	} else {
		leave_rcu(nd);
	}
	nd->depth = 0;
	nd->path.mnt = NULL;
	nd->path.dentry = NULL;
}

/* path_put is needed afterwards regardless of success or failure */
static bool __legitimize_path(struct path *path, unsigned seq, unsigned mseq)
{
	int res = __legitimize_mnt(path->mnt, mseq);
	if (unlikely(res)) {
		if (res > 0)
			path->mnt = NULL;
		path->dentry = NULL;
		return false;
	}
	if (unlikely(!lockref_get_not_dead(&path->dentry->d_lockref))) {
		path->dentry = NULL;
		return false;
	}
	return !read_seqcount_retry(&path->dentry->d_seq, seq);
}

static inline bool legitimize_path(struct nameidata *nd,
			    struct path *path, unsigned seq)
{
	return __legitimize_path(path, seq, nd->m_seq);
}

static bool legitimize_links(struct nameidata *nd)
{
	int i;

	VFS_BUG_ON(nd->flags & LOOKUP_CACHED);

	for (i = 0; i < nd->depth; i++) {
		struct saved *last = nd->stack + i;
		if (unlikely(!legitimize_path(nd, &last->link, last->seq))) {
			drop_links(nd);
			nd->depth = i + 1;
			return false;
		}
	}
	return true;
}

static bool legitimize_root(struct nameidata *nd)
{
	/* Nothing to do if nd->root is zero or is managed by the VFS user. */
	if (!nd->root.mnt || (nd->state & ND_ROOT_PRESET))
		return true;
	nd->state |= ND_ROOT_GRABBED;
	return legitimize_path(nd, &nd->root, nd->root_seq);
}

/*
 * Path walking has 2 modes, rcu-walk and ref-walk (see
 * Documentation/filesystems/path-lookup.txt).  In situations when we can't
 * continue in RCU mode, we attempt to drop out of rcu-walk mode and grab
 * normal reference counts on dentries and vfsmounts to transition to ref-walk
 * mode.  Refcounts are grabbed at the last known good point before rcu-walk
 * got stuck, so ref-walk may continue from there. If this is not successful
 * (eg. a seqcount has changed), then failure is returned and it's up to caller
 * to restart the path walk from the beginning in ref-walk mode.
 */

/**
 * try_to_unlazy - try to switch to ref-walk mode.
 * @nd: nameidata pathwalk data
 * Returns: true on success, false on failure
 *
 * try_to_unlazy attempts to legitimize the current nd->path and nd->root
 * for ref-walk mode.
 * Must be called from rcu-walk context.
 * Nothing should touch nameidata between try_to_unlazy() failure and
 * terminate_walk().
 */
static bool try_to_unlazy(struct nameidata *nd)
{
	struct dentry *parent = nd->path.dentry;

	BUG_ON(!(nd->flags & LOOKUP_RCU));

	if (unlikely(nd->flags & LOOKUP_CACHED)) {
		drop_links(nd);
		nd->depth = 0;
		goto out1;
	}
	if (unlikely(nd->depth && !legitimize_links(nd)))
		goto out1;
	if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
		goto out;
	if (unlikely(!legitimize_root(nd)))
		goto out;
	leave_rcu(nd);
	BUG_ON(nd->inode != parent->d_inode);
	return true;

out1:
	nd->path.mnt = NULL;
	nd->path.dentry = NULL;
out:
	leave_rcu(nd);
	return false;
}

/**
 * try_to_unlazy_next - try to switch to ref-walk mode.
 * @nd: nameidata pathwalk data
 * @dentry: next dentry to step into
 * Returns: true on success, false on failure
 *
 * Similar to try_to_unlazy(), but here we have the next dentry already
 * picked by rcu-walk and want to legitimize that in addition to the current
 * nd->path and nd->root for ref-walk mode.  Must be called from rcu-walk context.
 * Nothing should touch nameidata between try_to_unlazy_next() failure and
 * terminate_walk().
 */
static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
{
	int res;
	BUG_ON(!(nd->flags & LOOKUP_RCU));

	if (unlikely(nd->flags & LOOKUP_CACHED)) {
		drop_links(nd);
		nd->depth = 0;
		goto out2;
	}
	if (unlikely(nd->depth && !legitimize_links(nd)))
		goto out2;
	res = __legitimize_mnt(nd->path.mnt, nd->m_seq);
	if (unlikely(res)) {
		if (res > 0)
			goto out2;
		goto out1;
	}
	if (unlikely(!lockref_get_not_dead(&nd->path.dentry->d_lockref)))
		goto out1;

	/*
	 * We need to move both the parent and the dentry from the RCU domain
	 * to be properly refcounted. And the sequence number in the dentry
	 * validates *both* dentry counters, since we checked the sequence
	 * number of the parent after we got the child sequence number. So we
	 * know the parent must still be valid if the child sequence number is
	 */
	if (unlikely(!lockref_get_not_dead(&dentry->d_lockref)))
		goto out;
	if (read_seqcount_retry(&dentry->d_seq, nd->next_seq))
		goto out_dput;
	/*
	 * Sequence counts matched. Now make sure that the root is
	 * still valid and get it if required.
	 */
	if (unlikely(!legitimize_root(nd)))
		goto out_dput;
	leave_rcu(nd);
	return true;

out2:
	nd->path.mnt = NULL;
out1:
	nd->path.dentry = NULL;
out:
	leave_rcu(nd);
	return false;
out_dput:
	leave_rcu(nd);
	dput(dentry);
	return false;
}

static inline int d_revalidate(struct inode *dir, const struct qstr *name,
			       struct dentry *dentry, unsigned int flags)
{
	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
		return dentry->d_op->d_revalidate(dir, name, dentry, flags);
	else
		return 1;
}

/**
 * complete_walk - successful completion of path walk
 * @nd:  pointer nameidata
 *
 * If we had been in RCU mode, drop out of it and legitimize nd->path.
 * Revalidate the final result, unless we'd already done that during
 * the path walk or the filesystem doesn't ask for it.  Return 0 on
 * success, -error on failure.  In case of failure caller does not
 * need to drop nd->path.
 */
static int complete_walk(struct nameidata *nd)
{
	struct dentry *dentry = nd->path.dentry;
	int status;

	if (nd->flags & LOOKUP_RCU) {
		/*
		 * We don't want to zero nd->root for scoped-lookups or
		 * externally-managed nd->root.
		 */
		if (likely(!(nd->state & ND_ROOT_PRESET)))
			if (likely(!(nd->flags & LOOKUP_IS_SCOPED)))
				nd->root.mnt = NULL;
		nd->flags &= ~LOOKUP_CACHED;
		if (!try_to_unlazy(nd))
			return -ECHILD;
	}

	if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
		/*
		 * While the guarantee of LOOKUP_IS_SCOPED is (roughly) "don't
		 * ever step outside the root during lookup" and should already
		 * be guaranteed by the rest of namei, we want to avoid a namei
		 * BUG resulting in userspace being given a path that was not
		 * scoped within the root at some point during the lookup.
		 *
		 * So, do a final sanity-check to make sure that in the
		 * worst-case scenario (a complete bypass of LOOKUP_IS_SCOPED)
		 * we won't silently return an fd completely outside of the
		 * requested root to userspace.
		 *
		 * Userspace could move the path outside the root after this
		 * check, but as discussed elsewhere this is not a concern (the
		 * resolved file was inside the root at some point).
		 */
		if (!path_is_under(&nd->path, &nd->root))
			return -EXDEV;
	}

	if (likely(!(nd->state & ND_JUMPED)))
		return 0;

	if (likely(!(dentry->d_flags & DCACHE_OP_WEAK_REVALIDATE)))
		return 0;

	status = dentry->d_op->d_weak_revalidate(dentry, nd->flags);
	if (status > 0)
		return 0;

	if (!status)
		status = -ESTALE;

	return status;
}

static int set_root(struct nameidata *nd)
{
	struct fs_struct *fs = current->fs;

	/*
	 * Jumping to the real root in a scoped-lookup is a BUG in namei, but we
	 * still have to ensure it doesn't happen because it will cause a breakout
	 * from the dirfd.
	 */
	if (WARN_ON(nd->flags & LOOKUP_IS_SCOPED))
		return -ENOTRECOVERABLE;

	if (nd->flags & LOOKUP_RCU) {
		unsigned seq;

		do {
			seq = read_seqbegin(&fs->seq);
			nd->root = fs->root;
			nd->root_seq = __read_seqcount_begin(&nd->root.dentry->d_seq);
		} while (read_seqretry(&fs->seq, seq));
	} else {
		get_fs_root(fs, &nd->root);
		nd->state |= ND_ROOT_GRABBED;
	}
	return 0;
}

static int nd_jump_root(struct nameidata *nd)
{
	if (unlikely(nd->flags & LOOKUP_BENEATH))
		return -EXDEV;
	if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
		/* Absolute path arguments to path_init() are allowed. */
		if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
			return -EXDEV;
	}
	if (!nd->root.mnt) {
		int error = set_root(nd);
		if (unlikely(error))
			return error;
	}
	if (nd->flags & LOOKUP_RCU) {
		struct dentry *d;
		nd->path = nd->root;
		d = nd->path.dentry;
		nd->inode = d->d_inode;
		nd->seq = nd->root_seq;
		if (read_seqcount_retry(&d->d_seq, nd->seq))
			return -ECHILD;
	} else {
		path_put(&nd->path);
		nd->path = nd->root;
		path_get(&nd->path);
		nd->inode = nd->path.dentry->d_inode;
	}
	nd->state |= ND_JUMPED;
	return 0;
}

/*
 * Helper to directly jump to a known parsed path from ->get_link,
 * caller must have taken a reference to path beforehand.
 */
int nd_jump_link(const struct path *path)
{
	int error = -ELOOP;
	struct nameidata *nd = current->nameidata;

	if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
		goto err;

	error = -EXDEV;
	if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
		if (nd->path.mnt != path->mnt)
			goto err;
	}
	/* Not currently safe for scoped-lookups. */
	if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
		goto err;

	path_put(&nd->path);
	nd->path = *path;
	nd->inode = nd->path.dentry->d_inode;
	nd->state |= ND_JUMPED;
	return 0;

err:
	path_put(path);
	return error;
}

static inline void put_link(struct nameidata *nd)
{
	struct saved *last = nd->stack + --nd->depth;
	do_delayed_call(&last->done);
	if (!(nd->flags & LOOKUP_RCU))
		path_put(&last->link);
}

static int sysctl_protected_symlinks __read_mostly;
static int sysctl_protected_hardlinks __read_mostly;
static int sysctl_protected_fifos __read_mostly;
static int sysctl_protected_regular __read_mostly;

#ifdef CONFIG_SYSCTL
static const struct ctl_table namei_sysctls[] = {
	{
		.procname	= "protected_symlinks",
		.data		= &sysctl_protected_symlinks,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{
		.procname	= "protected_hardlinks",
		.data		= &sysctl_protected_hardlinks,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{
		.procname	= "protected_fifos",
		.data		= &sysctl_protected_fifos,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_TWO,
	},
	{
		.procname	= "protected_regular",
		.data		= &sysctl_protected_regular,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_TWO,
	},
};

static int __init init_fs_namei_sysctls(void)
{
	register_sysctl_init("fs", namei_sysctls);
	return 0;
}
fs_initcall(init_fs_namei_sysctls);

#endif /* CONFIG_SYSCTL */

/**
 * may_follow_link - Check symlink following for unsafe situations
 * @nd: nameidata pathwalk data
 * @inode: Used for idmapping.
 *
 * In the case of the sysctl_protected_symlinks sysctl being enabled,
 * CAP_DAC_OVERRIDE needs to be specifically ignored if the symlink is
 * in a sticky world-writable directory. This is to protect privileged
 * processes from failing races against path names that may change out
 * from under them by way of other users creating malicious symlinks.
 * It will permit symlinks to be followed only when outside a sticky
 * world-writable directory, or when the uid of the symlink and follower
 * match, or when the directory owner matches the symlink's owner.
 *
 * Returns 0 if following the symlink is allowed, -ve on error.
 */
static inline int may_follow_link(struct nameidata *nd, const struct inode *inode)
{
	struct mnt_idmap *idmap;
	vfsuid_t vfsuid;

	if (!sysctl_protected_symlinks)
		return 0;

	idmap = mnt_idmap(nd->path.mnt);
	vfsuid = i_uid_into_vfsuid(idmap, inode);
	/* Allowed if owner and follower match. */
	if (vfsuid_eq_kuid(vfsuid, current_fsuid()))
		return 0;

	/* Allowed if parent directory not sticky and world-writable. */
	if ((nd->dir_mode & (S_ISVTX|S_IWOTH)) != (S_ISVTX|S_IWOTH))
		return 0;

	/* Allowed if parent directory and link owner match. */
	if (vfsuid_valid(nd->dir_vfsuid) && vfsuid_eq(nd->dir_vfsuid, vfsuid))
		return 0;

	if (nd->flags & LOOKUP_RCU)
		return -ECHILD;

	audit_inode(nd->name, nd->stack[0].link.dentry, 0);
	audit_log_path_denied(AUDIT_ANOM_LINK, "follow_link");
	return -EACCES;
}

/**
 * safe_hardlink_source - Check for safe hardlink conditions
 * @idmap: idmap of the mount the inode was found from
 * @inode: the source inode to hardlink from
 *
 * Return false if at least one of the following conditions:
 *    - inode is not a regular file
 *    - inode is setuid
 *    - inode is setgid and group-exec
 *    - access failure for read and write
 *
 * Otherwise returns true.
 */
static bool safe_hardlink_source(struct mnt_idmap *idmap,
				 struct inode *inode)
{
	umode_t mode = inode->i_mode;

	/* Special files should not get pinned to the filesystem. */
	if (!S_ISREG(mode))
		return false;

	/* Setuid files should not get pinned to the filesystem. */
	if (mode & S_ISUID)
		return false;

	/* Executable setgid files should not get pinned to the filesystem. */
	if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
		return false;

	/* Hardlinking to unreadable or unwritable sources is dangerous. */
	if (inode_permission(idmap, inode, MAY_READ | MAY_WRITE))
		return false;

	return true;
}

/**
 * may_linkat - Check permissions for creating a hardlink
 * @idmap: idmap of the mount the inode was found from
 * @link:  the source to hardlink from
 *
 * Block hardlink when all of:
 *  - sysctl_protected_hardlinks enabled
 *  - fsuid does not match inode
 *  - hardlink source is unsafe (see safe_hardlink_source() above)
 *  - not CAP_FOWNER in a namespace with the inode owner uid mapped
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 *
 * Returns 0 if successful, -ve on error.
 */
int may_linkat(struct mnt_idmap *idmap, const struct path *link)
{
	struct inode *inode = link->dentry->d_inode;

	/* Inode writeback is not safe when the uid or gid are invalid. */
	if (!vfsuid_valid(i_uid_into_vfsuid(idmap, inode)) ||
	    !vfsgid_valid(i_gid_into_vfsgid(idmap, inode)))
		return -EOVERFLOW;

	if (!sysctl_protected_hardlinks)
		return 0;

	/* Source inode owner (or CAP_FOWNER) can hardlink all they like,
	 * otherwise, it must be a safe source.
	 */
	if (safe_hardlink_source(idmap, inode) ||
	    inode_owner_or_capable(idmap, inode))
		return 0;

	audit_log_path_denied(AUDIT_ANOM_LINK, "linkat");
	return -EPERM;
}

/**
 * may_create_in_sticky - Check whether an O_CREAT open in a sticky directory
 *			  should be allowed, or not, on files that already
 *			  exist.
 * @idmap: idmap of the mount the inode was found from
 * @nd: nameidata pathwalk data
 * @inode: the inode of the file to open
 *
 * Block an O_CREAT open of a FIFO (or a regular file) when:
 *   - sysctl_protected_fifos (or sysctl_protected_regular) is enabled
 *   - the file already exists
 *   - we are in a sticky directory
 *   - we don't own the file
 *   - the owner of the directory doesn't own the file
 *   - the directory is world writable
 * If the sysctl_protected_fifos (or sysctl_protected_regular) is set to 2
 * the directory doesn't have to be world writable: being group writable will
 * be enough.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 *
 * Returns 0 if the open is allowed, -ve on error.
 */
static int may_create_in_sticky(struct mnt_idmap *idmap, struct nameidata *nd,
				struct inode *const inode)
{
	umode_t dir_mode = nd->dir_mode;
	vfsuid_t dir_vfsuid = nd->dir_vfsuid, i_vfsuid;

	if (likely(!(dir_mode & S_ISVTX)))
		return 0;

	if (S_ISREG(inode->i_mode) && !sysctl_protected_regular)
		return 0;

	if (S_ISFIFO(inode->i_mode) && !sysctl_protected_fifos)
		return 0;

	i_vfsuid = i_uid_into_vfsuid(idmap, inode);

	if (vfsuid_eq(i_vfsuid, dir_vfsuid))
		return 0;

	if (vfsuid_eq_kuid(i_vfsuid, current_fsuid()))
		return 0;

	if (likely(dir_mode & 0002)) {
		audit_log_path_denied(AUDIT_ANOM_CREAT, "sticky_create");
		return -EACCES;
	}

	if (dir_mode & 0020) {
		if (sysctl_protected_fifos >= 2 && S_ISFIFO(inode->i_mode)) {
			audit_log_path_denied(AUDIT_ANOM_CREAT,
					      "sticky_create_fifo");
			return -EACCES;
		}

		if (sysctl_protected_regular >= 2 && S_ISREG(inode->i_mode)) {
			audit_log_path_denied(AUDIT_ANOM_CREAT,
					      "sticky_create_regular");
			return -EACCES;
		}
	}

	return 0;
}

/*
 * follow_up - Find the mountpoint of path's vfsmount
 *
 * Given a path, find the mountpoint of its source file system.
 * Replace @path with the path of the mountpoint in the parent mount.
 * Up is towards /.
 *
 * Return 1 if we went up a level and 0 if we were already at the
 * root.
 */
int follow_up(struct path *path)
{
	struct mount *mnt = real_mount(path->mnt);
	struct mount *parent;
	struct dentry *mountpoint;

	read_seqlock_excl(&mount_lock);
	parent = mnt->mnt_parent;
	if (parent == mnt) {
		read_sequnlock_excl(&mount_lock);
		return 0;
	}
	mntget(&parent->mnt);
	mountpoint = dget(mnt->mnt_mountpoint);
	read_sequnlock_excl(&mount_lock);
	dput(path->dentry);
	path->dentry = mountpoint;
	mntput(path->mnt);
	path->mnt = &parent->mnt;
	return 1;
}
EXPORT_SYMBOL(follow_up);

static bool choose_mountpoint_rcu(struct mount *m, const struct path *root,
				  struct path *path, unsigned *seqp)
{
	while (mnt_has_parent(m)) {
		struct dentry *mountpoint = m->mnt_mountpoint;

		m = m->mnt_parent;
		if (unlikely(root->dentry == mountpoint &&
			     root->mnt == &m->mnt))
			break;
		if (mountpoint != m->mnt.mnt_root) {
			path->mnt = &m->mnt;
			path->dentry = mountpoint;
			*seqp = read_seqcount_begin(&mountpoint->d_seq);
			return true;
		}
	}
	return false;
}

static bool choose_mountpoint(struct mount *m, const struct path *root,
			      struct path *path)
{
	bool found;

	rcu_read_lock();
	while (1) {
		unsigned seq, mseq = read_seqbegin(&mount_lock);

		found = choose_mountpoint_rcu(m, root, path, &seq);
		if (unlikely(!found)) {
			if (!read_seqretry(&mount_lock, mseq))
				break;
		} else {
			if (likely(__legitimize_path(path, seq, mseq)))
				break;
			rcu_read_unlock();
			path_put(path);
			rcu_read_lock();
		}
	}
	rcu_read_unlock();
	return found;
}

/*
 * Perform an automount
 * - return -EISDIR to tell follow_managed() to stop and return the path we
 *   were called with.
 */
static int follow_automount(struct path *path, int *count, unsigned lookup_flags)
{
	struct dentry *dentry = path->dentry;

	/* We don't want to mount if someone's just doing a stat -
	 * unless they're stat'ing a directory and appended a '/' to
	 * the name.
	 *
	 * We do, however, want to mount if someone wants to open or
	 * create a file of any type under the mountpoint, wants to
	 * traverse through the mountpoint or wants to open the
	 * mounted directory.  Also, autofs may mark negative dentries
	 * as being automount points.  These will need the attentions
	 * of the daemon to instantiate them before they can be used.
	 */
	if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
	    dentry->d_inode)
		return -EISDIR;

	/* No need to trigger automounts if mountpoint crossing is disabled. */
	if (lookup_flags & LOOKUP_NO_XDEV)
		return -EXDEV;

	if (count && (*count)++ >= MAXSYMLINKS)
		return -ELOOP;

	return finish_automount(dentry->d_op->d_automount(path), path);
}

/*
 * mount traversal - out-of-line part.  One note on ->d_flags accesses -
 * dentries are pinned but not locked here, so negative dentry can go
 * positive right under us.  Use of smp_load_acquire() provides a barrier
 * sufficient for ->d_inode and ->d_flags consistency.
 */
static int __traverse_mounts(struct path *path, unsigned flags, bool *jumped,
			     int *count, unsigned lookup_flags)
{
	struct vfsmount *mnt = path->mnt;
	bool need_mntput = false;
	int ret = 0;

	while (flags & DCACHE_MANAGED_DENTRY) {
		/* Allow the filesystem to manage the transit without i_rwsem
		 * being held. */
		if (flags & DCACHE_MANAGE_TRANSIT) {
			if (lookup_flags & LOOKUP_NO_XDEV) {
				ret = -EXDEV;
				break;
			}
			ret = path->dentry->d_op->d_manage(path, false);
			flags = smp_load_acquire(&path->dentry->d_flags);
			if (ret < 0)
				break;
		}

		if (flags & DCACHE_MOUNTED) {	// something's mounted on it..
			struct vfsmount *mounted = lookup_mnt(path);
			if (mounted) {		// ... in our namespace
				dput(path->dentry);
				if (need_mntput)
					mntput(path->mnt);
				path->mnt = mounted;
				path->dentry = dget(mounted->mnt_root);
				// here we know it's positive
				flags = path->dentry->d_flags;
				need_mntput = true;
				if (unlikely(lookup_flags & LOOKUP_NO_XDEV)) {
					ret = -EXDEV;
					break;
				}
				continue;
			}
		}

		if (!(flags & DCACHE_NEED_AUTOMOUNT))
			break;

		// uncovered automount point
		ret = follow_automount(path, count, lookup_flags);
		flags = smp_load_acquire(&path->dentry->d_flags);
		if (ret < 0)
			break;
	}

	if (ret == -EISDIR)
		ret = 0;
	// possible if you race with several mount --move
	if (need_mntput && path->mnt == mnt)
		mntput(path->mnt);
	if (!ret && unlikely(d_flags_negative(flags)))
		ret = -ENOENT;
	*jumped = need_mntput;
	return ret;
}

static inline int traverse_mounts(struct path *path, bool *jumped,
				  int *count, unsigned lookup_flags)
{
	unsigned flags = smp_load_acquire(&path->dentry->d_flags);

	/* fastpath */
	if (likely(!(flags & DCACHE_MANAGED_DENTRY))) {
		*jumped = false;
		if (unlikely(d_flags_negative(flags)))
			return -ENOENT;
		return 0;
	}
	return __traverse_mounts(path, flags, jumped, count, lookup_flags);
}

int follow_down_one(struct path *path)
{
	struct vfsmount *mounted;

	mounted = lookup_mnt(path);
	if (mounted) {
		dput(path->dentry);
		mntput(path->mnt);
		path->mnt = mounted;
		path->dentry = dget(mounted->mnt_root);
		return 1;
	}
	return 0;
}
EXPORT_SYMBOL(follow_down_one);

/*
 * Follow down to the covering mount currently visible to userspace.  At each
 * point, the filesystem owning that dentry may be queried as to whether the
 * caller is permitted to proceed or not.
 */
int follow_down(struct path *path, unsigned int flags)
{
	struct vfsmount *mnt = path->mnt;
	bool jumped;
	int ret = traverse_mounts(path, &jumped, NULL, flags);

	if (path->mnt != mnt)
		mntput(mnt);
	return ret;
}
EXPORT_SYMBOL(follow_down);

/*
 * Try to skip to top of mountpoint pile in rcuwalk mode.  Fail if
 * we meet a managed dentry that would need blocking.
 */
static bool __follow_mount_rcu(struct nameidata *nd, struct path *path)
{
	struct dentry *dentry = path->dentry;
	unsigned int flags = dentry->d_flags;

	if (likely(!(flags & DCACHE_MANAGED_DENTRY)))
		return true;

	if (unlikely(nd->flags & LOOKUP_NO_XDEV))
		return false;

	for (;;) {
		/*
		 * Don't forget we might have a non-mountpoint managed dentry
		 * that wants to block transit.
		 */
		if (unlikely(flags & DCACHE_MANAGE_TRANSIT)) {
			int res = dentry->d_op->d_manage(path, true);
			if (res)
				return res == -EISDIR;
			flags = dentry->d_flags;
		}

		if (flags & DCACHE_MOUNTED) {
			struct mount *mounted = __lookup_mnt(path->mnt, dentry);
			if (mounted) {
				path->mnt = &mounted->mnt;
				dentry = path->dentry = mounted->mnt.mnt_root;
				nd->state |= ND_JUMPED;
				nd->next_seq = read_seqcount_begin(&dentry->d_seq);
				flags = dentry->d_flags;
				// makes sure that non-RCU pathwalk could reach
				// this state.
				if (read_seqretry(&mount_lock, nd->m_seq))
					return false;
				continue;
			}
			if (read_seqretry(&mount_lock, nd->m_seq))
				return false;
		}
		return !(flags & DCACHE_NEED_AUTOMOUNT);
	}
}

static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
			  struct path *path)
{
	bool jumped;
	int ret;

	path->mnt = nd->path.mnt;
	path->dentry = dentry;
	if (nd->flags & LOOKUP_RCU) {
		unsigned int seq = nd->next_seq;
		if (likely(!d_managed(dentry)))
			return 0;
		if (likely(__follow_mount_rcu(nd, path)))
			return 0;
		// *path and nd->next_seq might've been clobbered
		path->mnt = nd->path.mnt;
		path->dentry = dentry;
		nd->next_seq = seq;
		if (unlikely(!try_to_unlazy_next(nd, dentry)))
			return -ECHILD;
	}
	ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
	if (jumped)
		nd->state |= ND_JUMPED;
	if (unlikely(ret)) {
		dput(path->dentry);
		if (path->mnt != nd->path.mnt)
			mntput(path->mnt);
	}
	return ret;
}

/*
 * This looks up the name in dcache and possibly revalidates the found dentry.
 * NULL is returned if the dentry does not exist in the cache.
 */
static struct dentry *lookup_dcache(const struct qstr *name,
				    struct dentry *dir,
				    unsigned int flags)
{
	struct dentry *dentry = d_lookup(dir, name);
	if (dentry) {
		int error = d_revalidate(dir->d_inode, name, dentry, flags);
		if (unlikely(error <= 0)) {
			if (!error)
				d_invalidate(dentry);
			dput(dentry);
			return ERR_PTR(error);
		}
	}
	return dentry;
}

/*
 * Parent directory has inode locked exclusive.  This is one
 * and only case when ->lookup() gets called on non in-lookup
 * dentries - as the matter of fact, this only gets called
 * when directory is guaranteed to have no in-lookup children
 * at all.
 * Will return -ENOENT if name isn't found and LOOKUP_CREATE wasn't passed.
 * Will return -EEXIST if name is found and LOOKUP_EXCL was passed.
 */
struct dentry *lookup_one_qstr_excl(const struct qstr *name,
				    struct dentry *base, unsigned int flags)
{
	struct dentry *dentry;
	struct dentry *old;
	struct inode *dir;

	dentry = lookup_dcache(name, base, flags);
	if (dentry)
		goto found;

	/* Don't create child dentry for a dead directory. */
	dir = base->d_inode;
	if (unlikely(IS_DEADDIR(dir)))
		return ERR_PTR(-ENOENT);

	dentry = d_alloc(base, name);
	if (unlikely(!dentry))
		return ERR_PTR(-ENOMEM);

	old = dir->i_op->lookup(dir, dentry, flags);
	if (unlikely(old)) {
		dput(dentry);
		dentry = old;
	}
found:
	if (IS_ERR(dentry))
		return dentry;
	if (d_is_negative(dentry) && !(flags & LOOKUP_CREATE)) {
		dput(dentry);
		return ERR_PTR(-ENOENT);
	}
	if (d_is_positive(dentry) && (flags & LOOKUP_EXCL)) {
		dput(dentry);
		return ERR_PTR(-EEXIST);
	}
	return dentry;
}
EXPORT_SYMBOL(lookup_one_qstr_excl);

/**
 * lookup_fast - do fast lockless (but racy) lookup of a dentry
 * @nd: current nameidata
 *
 * Do a fast, but racy lookup in the dcache for the given dentry, and
 * revalidate it. Returns a valid dentry pointer or NULL if one wasn't
 * found. On error, an ERR_PTR will be returned.
 *
 * If this function returns a valid dentry and the walk is no longer
 * lazy, the dentry will carry a reference that must later be put. If
 * RCU mode is still in force, then this is not the case and the dentry
 * must be legitimized before use. If this returns NULL, then the walk
 * will no longer be in RCU mode.
 */
static struct dentry *lookup_fast(struct nameidata *nd)
{
	struct dentry *dentry, *parent = nd->path.dentry;
	int status = 1;

	/*
	 * Rename seqlock is not required here because on the off chance
	 * of a false negative due to a concurrent rename, the caller is
	 * going to fall back to non-racy lookup.
	 */
	if (nd->flags & LOOKUP_RCU) {
		dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
		if (unlikely(!dentry)) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
			return NULL;
		}

		/*
		 * This sequence count validates that the parent had no
		 * changes while we did the lookup of the dentry above.
		 */
		if (read_seqcount_retry(&parent->d_seq, nd->seq))
			return ERR_PTR(-ECHILD);

		status = d_revalidate(nd->inode, &nd->last, dentry, nd->flags);
		if (likely(status > 0))
			return dentry;
		if (!try_to_unlazy_next(nd, dentry))
			return ERR_PTR(-ECHILD);
		if (status == -ECHILD)
			/* we'd been told to redo it in non-rcu mode */
			status = d_revalidate(nd->inode, &nd->last,
					      dentry, nd->flags);
	} else {
		dentry = __d_lookup(parent, &nd->last);
		if (unlikely(!dentry))
			return NULL;
		status = d_revalidate(nd->inode, &nd->last, dentry, nd->flags);
	}
	if (unlikely(status <= 0)) {
		if (!status)
			d_invalidate(dentry);
		dput(dentry);
		return ERR_PTR(status);
	}
	return dentry;
}

/* Fast lookup failed, do it the slow way */
static struct dentry *__lookup_slow(const struct qstr *name,
				    struct dentry *dir,
				    unsigned int flags)
{
	struct dentry *dentry, *old;
	struct inode *inode = dir->d_inode;
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

	/* Don't go there if it's already dead */
	if (unlikely(IS_DEADDIR(inode)))
		return ERR_PTR(-ENOENT);
again:
	dentry = d_alloc_parallel(dir, name, &wq);
	if (IS_ERR(dentry))
		return dentry;
	if (unlikely(!d_in_lookup(dentry))) {
		int error = d_revalidate(inode, name, dentry, flags);
		if (unlikely(error <= 0)) {
			if (!error) {
				d_invalidate(dentry);
				dput(dentry);
				goto again;
			}
			dput(dentry);
			dentry = ERR_PTR(error);
		}
	} else {
		old = inode->i_op->lookup(inode, dentry, flags);
		d_lookup_done(dentry);
		if (unlikely(old)) {
			dput(dentry);
			dentry = old;
		}
	}
	return dentry;
}

static noinline struct dentry *lookup_slow(const struct qstr *name,
				  struct dentry *dir,
				  unsigned int flags)
{
	struct inode *inode = dir->d_inode;
	struct dentry *res;
	inode_lock_shared(inode);
	res = __lookup_slow(name, dir, flags);
	inode_unlock_shared(inode);
	return res;
}

static struct dentry *lookup_slow_killable(const struct qstr *name,
					   struct dentry *dir,
					   unsigned int flags)
{
	struct inode *inode = dir->d_inode;
	struct dentry *res;

	if (inode_lock_shared_killable(inode))
		return ERR_PTR(-EINTR);
	res = __lookup_slow(name, dir, flags);
	inode_unlock_shared(inode);
	return res;
}

static inline int may_lookup(struct mnt_idmap *idmap,
			     struct nameidata *restrict nd)
{
	int err, mask;

	mask = nd->flags & LOOKUP_RCU ? MAY_NOT_BLOCK : 0;
	err = lookup_inode_permission_may_exec(idmap, nd->inode, mask);
	if (likely(!err))
		return 0;

	// If we failed, and we weren't in LOOKUP_RCU, it's final
	if (!(nd->flags & LOOKUP_RCU))
		return err;

	// Drop out of RCU mode to make sure it wasn't transient
	if (!try_to_unlazy(nd))
		return -ECHILD;	// redo it all non-lazy

	if (err != -ECHILD)	// hard error
		return err;

	return lookup_inode_permission_may_exec(idmap, nd->inode, 0);
}

static int reserve_stack(struct nameidata *nd, struct path *link)
{
	if (unlikely(nd->total_link_count++ >= MAXSYMLINKS))
		return -ELOOP;

	if (likely(nd->depth != EMBEDDED_LEVELS))
		return 0;
	if (likely(nd->stack != nd->internal))
		return 0;
	if (likely(nd_alloc_stack(nd)))
		return 0;

	if (nd->flags & LOOKUP_RCU) {
		// we need to grab link before we do unlazy.  And we can't skip
		// unlazy even if we fail to grab the link - cleanup needs it
		bool grabbed_link = legitimize_path(nd, link, nd->next_seq);

		if (!try_to_unlazy(nd) || !grabbed_link)
			return -ECHILD;

		if (nd_alloc_stack(nd))
			return 0;
	}
	return -ENOMEM;
}

enum {WALK_TRAILING = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};

static noinline const char *pick_link(struct nameidata *nd, struct path *link,
		     struct inode *inode, int flags)
{
	struct saved *last;
	const char *res;
	int error;

	if (nd->flags & LOOKUP_RCU) {
		/* make sure that d_is_symlink from step_into_slowpath() matches the inode */
		if (read_seqcount_retry(&link->dentry->d_seq, nd->next_seq))
			return ERR_PTR(-ECHILD);
	} else {
		if (link->mnt == nd->path.mnt)
			mntget(link->mnt);
	}

	error = reserve_stack(nd, link);
	if (unlikely(error)) {
		if (!(nd->flags & LOOKUP_RCU))
			path_put(link);
		return ERR_PTR(error);
	}
	last = nd->stack + nd->depth++;
	last->link = *link;
	clear_delayed_call(&last->done);
	last->seq = nd->next_seq;

	if (flags & WALK_TRAILING) {
		error = may_follow_link(nd, inode);
		if (unlikely(error))
			return ERR_PTR(error);
	}

	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
			unlikely(link->mnt->mnt_flags & MNT_NOSYMFOLLOW))
		return ERR_PTR(-ELOOP);

	if (unlikely(atime_needs_update(&last->link, inode))) {
		if (nd->flags & LOOKUP_RCU) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
		}
		touch_atime(&last->link);
		cond_resched();
	}

	error = security_inode_follow_link(link->dentry, inode,
					   nd->flags & LOOKUP_RCU);
	if (unlikely(error))
		return ERR_PTR(error);

	res = READ_ONCE(inode->i_link);
	if (!res) {
		const char * (*get)(struct dentry *, struct inode *,
				struct delayed_call *);
		get = inode->i_op->get_link;
		if (nd->flags & LOOKUP_RCU) {
			res = get(NULL, inode, &last->done);
			if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
				res = get(link->dentry, inode, &last->done);
		} else {
			res = get(link->dentry, inode, &last->done);
		}
		if (!res)
			goto all_done;
		if (IS_ERR(res))
			return res;
	}
	if (*res == '/') {
		error = nd_jump_root(nd);
		if (unlikely(error))
			return ERR_PTR(error);
		while (unlikely(*++res == '/'))
			;
	}
	if (*res)
		return res;
all_done: // pure jump
	put_link(nd);
	return NULL;
}

/*
 * Do we need to follow links? We _really_ want to be able
 * to do this check without having to look at inode->i_op,
 * so we keep a cache of "no, this doesn't need follow_link"
 * for the common case.
 *
 * NOTE: dentry must be what nd->next_seq had been sampled from.
 */
static noinline const char *step_into_slowpath(struct nameidata *nd, int flags,
		     struct dentry *dentry)
{
	struct path path;
	struct inode *inode;
	int err;

	err = handle_mounts(nd, dentry, &path);
	if (unlikely(err < 0))
		return ERR_PTR(err);
	inode = path.dentry->d_inode;
	if (likely(!d_is_symlink(path.dentry)) ||
	   ((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
	   (flags & WALK_NOFOLLOW)) {
		/* not a symlink or should not follow */
		if (nd->flags & LOOKUP_RCU) {
			if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
				return ERR_PTR(-ECHILD);
			if (unlikely(!inode))
				return ERR_PTR(-ENOENT);
		} else {
			dput(nd->path.dentry);
			if (nd->path.mnt != path.mnt)
				mntput(nd->path.mnt);
		}
		nd->path = path;
		nd->inode = inode;
		nd->seq = nd->next_seq;
		return NULL;
	}
	return pick_link(nd, &path, inode, flags);
}

static __always_inline const char *step_into(struct nameidata *nd, int flags,
                    struct dentry *dentry)
{
	/*
	 * In the common case we are in rcu-walk, traversing a directory
	 * that has nothing mounted on it (as opposed to, e.g., a symlink).
	 *
	 * We can handle that case, and negative entries, with the checks below.
	 */
	if (likely((nd->flags & LOOKUP_RCU) &&
	    !d_managed(dentry) && !d_is_symlink(dentry))) {
		struct inode *inode = dentry->d_inode;
		if (read_seqcount_retry(&dentry->d_seq, nd->next_seq))
			return ERR_PTR(-ECHILD);
		if (unlikely(!inode))
			return ERR_PTR(-ENOENT);
		nd->path.dentry = dentry;
		/* nd->path.mnt is retained on purpose */
		nd->inode = inode;
		nd->seq = nd->next_seq;
		return NULL;
	}
	return step_into_slowpath(nd, flags, dentry);
}

static struct dentry *follow_dotdot_rcu(struct nameidata *nd)
{
	struct dentry *parent, *old;

	if (path_equal(&nd->path, &nd->root))
		goto in_root;
	if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
		struct path path;
		unsigned seq;
		if (!choose_mountpoint_rcu(real_mount(nd->path.mnt),
					   &nd->root, &path, &seq))
			goto in_root;
		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
			return ERR_PTR(-ECHILD);
		nd->path = path;
		nd->inode = path.dentry->d_inode;
		nd->seq = seq;
		// makes sure that non-RCU pathwalk could reach this state
		if (read_seqretry(&mount_lock, nd->m_seq))
			return ERR_PTR(-ECHILD);
		/* we know that mountpoint was pinned */
	}
	old = nd->path.dentry;
	parent = old->d_parent;
	nd->next_seq = read_seqcount_begin(&parent->d_seq);
	// makes sure that non-RCU pathwalk could reach this state
	if (read_seqcount_retry(&old->d_seq, nd->seq))
		return ERR_PTR(-ECHILD);
	if (unlikely(!path_connected(nd->path.mnt, parent)))
		return ERR_PTR(-ECHILD);
	return parent;
in_root:
	if (read_seqretry(&mount_lock, nd->m_seq))
		return ERR_PTR(-ECHILD);
	if (unlikely(nd->flags & LOOKUP_BENEATH))
		return ERR_PTR(-ECHILD);
	nd->next_seq = nd->seq;
	return nd->path.dentry;
}

static struct dentry *follow_dotdot(struct nameidata *nd)
{
	struct dentry *parent;

	if (path_equal(&nd->path, &nd->root))
		goto in_root;
	if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
		struct path path;

		if (!choose_mountpoint(real_mount(nd->path.mnt),
				       &nd->root, &path))
			goto in_root;
		path_put(&nd->path);
		nd->path = path;
		nd->inode = path.dentry->d_inode;
		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
			return ERR_PTR(-EXDEV);
	}
	/* rare case of legitimate dget_parent()... */
	parent = dget_parent(nd->path.dentry);
	if (unlikely(!path_connected(nd->path.mnt, parent))) {
		dput(parent);
		return ERR_PTR(-ENOENT);
	}
	return parent;

in_root:
	if (unlikely(nd->flags & LOOKUP_BENEATH))
		return ERR_PTR(-EXDEV);
	return dget(nd->path.dentry);
}

static const char *handle_dots(struct nameidata *nd, int type)
{
	if (type == LAST_DOTDOT) {
		const char *error = NULL;
		struct dentry *parent;

		if (!nd->root.mnt) {
			error = ERR_PTR(set_root(nd));
			if (unlikely(error))
				return error;
		}
		if (nd->flags & LOOKUP_RCU)
			parent = follow_dotdot_rcu(nd);
		else
			parent = follow_dotdot(nd);
		if (IS_ERR(parent))
			return ERR_CAST(parent);
		error = step_into(nd, WALK_NOFOLLOW, parent);
		if (unlikely(error))
			return error;

		if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
			/*
			 * If there was a racing rename or mount along our
			 * path, then we can't be sure that ".." hasn't jumped
			 * above nd->root (and so userspace should retry or use
			 * some fallback).
			 */
			smp_rmb();
			if (__read_seqcount_retry(&mount_lock.seqcount, nd->m_seq))
				return ERR_PTR(-EAGAIN);
			if (__read_seqcount_retry(&rename_lock.seqcount, nd->r_seq))
				return ERR_PTR(-EAGAIN);
		}
	}
	return NULL;
}

static __always_inline const char *walk_component(struct nameidata *nd, int flags)
{
	struct dentry *dentry;
	/*
	 * "." and ".." are special - ".." especially so because it has
	 * to be able to know about the current root directory and
	 * parent relationships.
	 */
	if (unlikely(nd->last_type != LAST_NORM)) {
		if (unlikely(nd->depth) && !(flags & WALK_MORE))
			put_link(nd);
		return handle_dots(nd, nd->last_type);
	}
	dentry = lookup_fast(nd);
	if (IS_ERR(dentry))
		return ERR_CAST(dentry);
	if (unlikely(!dentry)) {
		dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
		if (IS_ERR(dentry))
			return ERR_CAST(dentry);
	}
	if (unlikely(nd->depth) && !(flags & WALK_MORE))
		put_link(nd);
	return step_into(nd, flags, dentry);
}

/*
 * We can do the critical dentry name comparison and hashing
 * operations one word at a time, but we are limited to:
 *
 * - Architectures with fast unaligned word accesses. We could
 *   do a "get_unaligned()" if this helps and is sufficiently
 *   fast.
 *
 * - non-CONFIG_DEBUG_PAGEALLOC configurations (so that we
 *   do not trap on the (extremely unlikely) case of a page
 *   crossing operation).
 *
 * - Furthermore, we need an efficient 64-bit compile for the
 *   64-bit case in order to generate the "number of bytes in
 *   the final mask". Again, that could be replaced with an
 *   efficient population count instruction or similar.
 */
#ifdef CONFIG_DCACHE_WORD_ACCESS

#include <asm/word-at-a-time.h>

#ifdef HASH_MIX

/* Architecture provides HASH_MIX and fold_hash() in <asm/hash.h> */

#elif defined(CONFIG_64BIT)
/*
 * Register pressure in the mixing function is an issue, particularly
 * on 32-bit x86, but almost any function requires one state value and
 * one temporary.  Instead, use a function designed for two state values
 * and no temporaries.
 *
 * This function cannot create a collision in only two iterations, so
 * we have two iterations to achieve avalanche.  In those two iterations,
 * we have six layers of mixing, which is enough to spread one bit's
 * influence out to 2^6 = 64 state bits.
 *
 * Rotate constants are scored by considering either 64 one-bit input
 * deltas or 64*63/2 = 2016 two-bit input deltas, and finding the
 * probability of that delta causing a change to each of the 128 output
 * bits, using a sample of random initial states.
 *
 * The Shannon entropy of the computed probabilities is then summed
 * to produce a score.  Ideally, any input change has a 50% chance of
 * toggling any given output bit.
 *
 * Mixing scores (in bits) for (12,45):
 * Input delta: 1-bit      2-bit
 * 1 round:     713.3    42542.6
 * 2 rounds:   2753.7   140389.8
 * 3 rounds:   5954.1   233458.2
 * 4 rounds:   7862.6   256672.2
 * Perfect:    8192     258048
 *            (64*128) (64*63/2 * 128)
 */
#define HASH_MIX(x, y, a)	\
	(	x ^= (a),	\
	y ^= x,	x = rol64(x,12),\
	x += y,	y = rol64(y,45),\
	y *= 9			)

/*
 * Fold two longs into one 32-bit hash value.  This must be fast, but
 * latency isn't quite as critical, as there is a fair bit of additional
 * work done before the hash value is used.
 */
static inline unsigned int fold_hash(unsigned long x, unsigned long y)
{
	y ^= x * GOLDEN_RATIO_64;
	y *= GOLDEN_RATIO_64;
	return y >> 32;
}

#else	/* 32-bit case */

/*
 * Mixing scores (in bits) for (7,20):
 * Input delta: 1-bit      2-bit
 * 1 round:     330.3     9201.6
 * 2 rounds:   1246.4    25475.4
 * 3 rounds:   1907.1    31295.1
 * 4 rounds:   2042.3    31718.6
 * Perfect:    2048      31744
 *            (32*64)   (32*31/2 * 64)
 */
#define HASH_MIX(x, y, a)	\
	(	x ^= (a),	\
	y ^= x,	x = rol32(x, 7),\
	x += y,	y = rol32(y,20),\
	y *= 9			)

static inline unsigned int fold_hash(unsigned long x, unsigned long y)
{
	/* Use arch-optimized multiply if one exists */
	return __hash_32(y ^ __hash_32(x));
}

#endif

/*
 * Return the hash of a string of known length.  This is carefully
 * designed to match hash_name(), which is the more critical function.
 * In particular, we must end by hashing a final word containing 0..7
 * payload bytes, to match the way that hash_name() iterates until it
 * finds the delimiter after the name.
 */
unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
{
	unsigned long a, x = 0, y = (unsigned long)salt;

	for (;;) {
		if (!len)
			goto done;
		a = load_unaligned_zeropad(name);
		if (len < sizeof(unsigned long))
			break;
		HASH_MIX(x, y, a);
		name += sizeof(unsigned long);
		len -= sizeof(unsigned long);
	}
	x ^= a & bytemask_from_count(len);
done:
	return fold_hash(x, y);
}
EXPORT_SYMBOL(full_name_hash);
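
/*
 * Example use (illustrative only): computing the dcache hash of a fixed
 * component, salted with the parent dentry the way __d_lookup() expects:
 *
 *	struct qstr q = QSTR_INIT("foo", 3);
 *	q.hash = full_name_hash(parent, q.name, q.len);
 */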

/* Return the "hash_len" (hash and length) of a null-terminated string */
u64 hashlen_string(const void *salt, const char *name)
{
	unsigned long a = 0, x = 0, y = (unsigned long)salt;
	unsigned long adata, mask, len;
	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;

	len = 0;
	goto inside;

	do {
		HASH_MIX(x, y, a);
		len += sizeof(unsigned long);
inside:
		a = load_unaligned_zeropad(name+len);
	} while (!has_zero(a, &adata, &constants));

	adata = prep_zero_mask(a, adata, &constants);
	mask = create_zero_mask(adata);
	x ^= a & zero_bytemask(mask);

	return hashlen_create(fold_hash(x, y), len + find_zero(mask));
}
EXPORT_SYMBOL(hashlen_string);
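
/*
 * Illustrative: the returned u64 packs the length in the high 32 bits
 * and the hash in the low 32 bits (see hashlen_create()), so e.g.:
 *
 *	u64 hl = hashlen_string(parent, "foo");
 *	// hashlen_len(hl) == 3, hashlen_hash(hl) is the 32-bit hash
 */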

/*
 * Calculate the length and hash of the path component, and
 * return the length as the result.
 */
static inline const char *hash_name(struct nameidata *nd,
				    const char *name,
				    unsigned long *lastword)
{
	unsigned long a, b, x, y = (unsigned long)nd->path.dentry;
	unsigned long adata, bdata, mask, len;
	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;

	/*
	 * The first iteration is special, because it can result in
	 * '.' and '..' and has no mixing other than the final fold.
	 */
	a = load_unaligned_zeropad(name);
	b = a ^ REPEAT_BYTE('/');
	if (has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)) {
		adata = prep_zero_mask(a, adata, &constants);
		bdata = prep_zero_mask(b, bdata, &constants);
		mask = create_zero_mask(adata | bdata);
		a &= zero_bytemask(mask);
		*lastword = a;
		len = find_zero(mask);
		nd->last.hash = fold_hash(a, y);
		nd->last.len = len;
		return name + len;
	}

	len = 0;
	x = 0;
	do {
		HASH_MIX(x, y, a);
		len += sizeof(unsigned long);
		a = load_unaligned_zeropad(name+len);
		b = a ^ REPEAT_BYTE('/');
	} while (!(has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)));

	adata = prep_zero_mask(a, adata, &constants);
	bdata = prep_zero_mask(b, bdata, &constants);
	mask = create_zero_mask(adata | bdata);
	a &= zero_bytemask(mask);
	x ^= a;
	len += find_zero(mask);
	*lastword = 0;		// Multi-word components cannot be DOT or DOTDOT

	nd->last.hash = fold_hash(x, y);
	nd->last.len = len;
	return name + len;
}

/*
 * Note that the 'last' word is always zero-masked, but
 * was loaded as a possibly big-endian word.
 */
#ifdef __BIG_ENDIAN
  #define LAST_WORD_IS_DOT	(0x2eul << (BITS_PER_LONG-8))
  #define LAST_WORD_IS_DOTDOT	(0x2e2eul << (BITS_PER_LONG-16))
#endif

#else	/* !CONFIG_DCACHE_WORD_ACCESS: Slow, byte-at-a-time version */

/* Return the hash of a string of known length */
unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
{
	unsigned long hash = init_name_hash(salt);
	while (len--)
		hash = partial_name_hash((unsigned char)*name++, hash);
	return end_name_hash(hash);
}
EXPORT_SYMBOL(full_name_hash);

/* Return the "hash_len" (hash and length) of a null-terminated string */
u64 hashlen_string(const void *salt, const char *name)
{
	unsigned long hash = init_name_hash(salt);
	unsigned long len = 0, c;

	c = (unsigned char)*name;
	while (c) {
		len++;
		hash = partial_name_hash(c, hash);
		c = (unsigned char)name[len];
	}
	return hashlen_create(end_name_hash(hash), len);
}
EXPORT_SYMBOL(hashlen_string);

/*
 * We know there's a real path component here of at least
 * one character.
 */
static inline const char *hash_name(struct nameidata *nd, const char *name, unsigned long *lastword)
{
	unsigned long hash = init_name_hash(nd->path.dentry);
	unsigned long len = 0, c, last = 0;

	c = (unsigned char)*name;
	do {
		last = (last << 8) + c;
		len++;
		hash = partial_name_hash(c, hash);
		c = (unsigned char)name[len];
	} while (c && c != '/');

	// This is reliable for DOT or DOTDOT, since the component
	// cannot contain NUL characters - top bits being zero means
	// we cannot have had any other pathnames.
	*lastword = last;
	nd->last.hash = end_name_hash(hash);
	nd->last.len = len;
	return name + len;
}

#endif

#ifndef LAST_WORD_IS_DOT
  #define LAST_WORD_IS_DOT	0x2e
  #define LAST_WORD_IS_DOTDOT	0x2e2e
#endif
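
/*
 * Worked example (illustrative, little-endian): the zero-masked last
 * word of "." is 0x2e and of ".." is 0x2e2e, matching the constants
 * above; any other single-word component yields a value matching
 * neither, and the word-at-a-time hash_name() reports 0 for
 * multi-word components.
 */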

/*
 * Name resolution.
 * This is the basic name resolution function, turning a pathname into
 * the final dentry. We expect 'base' to be positive and a directory.
 *
 * Returns 0, and nd will have a valid dentry and mnt, on success.
 * Returns an error and drops the reference to the input nameidata on failure.
 */
static int link_path_walk(const char *name, struct nameidata *nd)
{
	int depth = 0; // depth <= nd->depth
	int err;

	nd->last_type = LAST_ROOT;
	nd->flags |= LOOKUP_PARENT;
	if (IS_ERR(name))
		return PTR_ERR(name);
	if (*name == '/') {
		do {
			name++;
		} while (unlikely(*name == '/'));
	}
	if (unlikely(!*name)) {
		nd->dir_mode = 0; // short-circuit the 'hardening' idiocy
		return 0;
	}

	/* At this point we know we have a real path component. */
	for(;;) {
		struct mnt_idmap *idmap;
		const char *link;
		unsigned long lastword;

		idmap = mnt_idmap(nd->path.mnt);
		err = may_lookup(idmap, nd);
		if (unlikely(err))
			return err;

		nd->last.name = name;
		name = hash_name(nd, name, &lastword);

		switch(lastword) {
		case LAST_WORD_IS_DOTDOT:
			nd->last_type = LAST_DOTDOT;
			nd->state |= ND_JUMPED;
			break;

		case LAST_WORD_IS_DOT:
			nd->last_type = LAST_DOT;
			break;

		default:
			nd->last_type = LAST_NORM;
			nd->state &= ~ND_JUMPED;

			struct dentry *parent = nd->path.dentry;
			if (unlikely(parent->d_flags & DCACHE_OP_HASH)) {
				err = parent->d_op->d_hash(parent, &nd->last);
				if (err < 0)
					return err;
			}
		}

		if (!*name)
			goto OK;
		/*
		 * If it wasn't NUL, we know it was '/'. Skip that
		 * slash, and continue until no more slashes.
		 */
		do {
			name++;
		} while (unlikely(*name == '/'));
		if (unlikely(!*name)) {
OK:
			/* pathname or trailing symlink, done */
			if (likely(!depth)) {
				nd->dir_vfsuid = i_uid_into_vfsuid(idmap, nd->inode);
				nd->dir_mode = nd->inode->i_mode;
				nd->flags &= ~LOOKUP_PARENT;
				return 0;
			}
			/* last component of nested symlink */
			name = nd->stack[--depth].name;
			link = walk_component(nd, 0);
		} else {
			/* not the last component */
			link = walk_component(nd, WALK_MORE);
		}
		if (unlikely(link)) {
			if (IS_ERR(link))
				return PTR_ERR(link);
			/* a symlink to follow */
			nd->stack[depth++].name = name;
			name = link;
			continue;
		}
		if (unlikely(!d_can_lookup(nd->path.dentry))) {
			if (nd->flags & LOOKUP_RCU) {
				if (!try_to_unlazy(nd))
					return -ECHILD;
			}
			return -ENOTDIR;
		}
	}
}
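
/*
 * Illustrative trace (not from the original source): for the pathname
 * "a/b//c" the loop above hashes "a" and "b" and walks each via
 * walk_component(nd, WALK_MORE), skips the duplicate '/', and finally
 * returns 0 with nd->last naming "c" and LOOKUP_PARENT cleared - the
 * trailing component is left for the caller (e.g. lookup_last()) to
 * resolve.
 */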

/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
	int error;
	const char *s = nd->pathname;

	/* LOOKUP_CACHED requires RCU, ask caller to retry */
	if (unlikely((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED))
		return ERR_PTR(-EAGAIN);

	if (unlikely(!*s))
		flags &= ~LOOKUP_RCU;
	if (flags & LOOKUP_RCU)
		rcu_read_lock();
	else
		nd->seq = nd->next_seq = 0;

	nd->flags = flags;
	nd->state |= ND_JUMPED;

	nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
	nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
	smp_rmb();

	if (unlikely(nd->state & ND_ROOT_PRESET)) {
		struct dentry *root = nd->root.dentry;
		struct inode *inode = root->d_inode;
		if (*s && unlikely(!d_can_lookup(root)))
			return ERR_PTR(-ENOTDIR);
		nd->path = nd->root;
		nd->inode = inode;
		if (flags & LOOKUP_RCU) {
			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
			nd->root_seq = nd->seq;
		} else {
			path_get(&nd->path);
		}
		return s;
	}

	nd->root.mnt = NULL;

	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
	if (*s == '/' && likely(!(flags & LOOKUP_IN_ROOT))) {
		error = nd_jump_root(nd);
		if (unlikely(error))
			return ERR_PTR(error);
		return s;
	}

	/* Relative pathname -- get the starting-point it is relative to. */
	if (nd->dfd == AT_FDCWD) {
		if (flags & LOOKUP_RCU) {
			struct fs_struct *fs = current->fs;
			unsigned seq;

			do {
				seq = read_seqbegin(&fs->seq);
				nd->path = fs->pwd;
				nd->inode = nd->path.dentry->d_inode;
				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
			} while (read_seqretry(&fs->seq, seq));
		} else {
			get_fs_pwd(current->fs, &nd->path);
			nd->inode = nd->path.dentry->d_inode;
		}
	} else {
		/* Caller must check execute permissions on the starting path component */
		CLASS(fd_raw, f)(nd->dfd);
		struct dentry *dentry;

		if (fd_empty(f))
			return ERR_PTR(-EBADF);

		if (flags & LOOKUP_LINKAT_EMPTY) {
			if (fd_file(f)->f_cred != current_cred() &&
			    !ns_capable(fd_file(f)->f_cred->user_ns, CAP_DAC_READ_SEARCH))
				return ERR_PTR(-ENOENT);
		}

		dentry = fd_file(f)->f_path.dentry;

		if (*s && unlikely(!d_can_lookup(dentry)))
			return ERR_PTR(-ENOTDIR);

		nd->path = fd_file(f)->f_path;
		if (flags & LOOKUP_RCU) {
			nd->inode = nd->path.dentry->d_inode;
			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
		} else {
			path_get(&nd->path);
			nd->inode = nd->path.dentry->d_inode;
		}
	}

	/* For scoped-lookups we need to set the root to the dirfd as well. */
	if (unlikely(flags & LOOKUP_IS_SCOPED)) {
		nd->root = nd->path;
		if (flags & LOOKUP_RCU) {
			nd->root_seq = nd->seq;
		} else {
			path_get(&nd->root);
			nd->state |= ND_ROOT_GRABBED;
		}
	}
	return s;
}

static inline const char *lookup_last(struct nameidata *nd)
{
	if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
		nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;

	return walk_component(nd, WALK_TRAILING);
}

static int handle_lookup_down(struct nameidata *nd)
{
	if (!(nd->flags & LOOKUP_RCU))
		dget(nd->path.dentry);
	nd->next_seq = nd->seq;
	return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry));
}

/* Returns 0 and nd will be valid on success; Returns error, otherwise. */
static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path)
{
	const char *s = path_init(nd, flags);
	int err;

	if (unlikely(flags & LOOKUP_DOWN) && !IS_ERR(s)) {
		err = handle_lookup_down(nd);
		if (unlikely(err < 0))
			s = ERR_PTR(err);
	}

	while (!(err = link_path_walk(s, nd)) &&
	       (s = lookup_last(nd)) != NULL)
		;
	if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) {
		err = handle_lookup_down(nd);
		nd->state &= ~ND_JUMPED; // no d_weak_revalidate(), please...
	}
	if (!err)
		err = complete_walk(nd);

	if (!err && nd->flags & LOOKUP_DIRECTORY)
		if (!d_can_lookup(nd->path.dentry))
			err = -ENOTDIR;
	if (!err) {
		*path = nd->path;
		nd->path.mnt = NULL;
		nd->path.dentry = NULL;
	}
	terminate_walk(nd);
	return err;
}

int filename_lookup(int dfd, struct filename *name, unsigned flags,
		    struct path *path, const struct path *root)
{
	int retval;
	struct nameidata nd;
	if (IS_ERR(name))
		return PTR_ERR(name);
	set_nameidata(&nd, dfd, name, root);
	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
	if (unlikely(retval == -ECHILD))
		retval = path_lookupat(&nd, flags, path);
	if (unlikely(retval == -ESTALE))
		retval = path_lookupat(&nd, flags | LOOKUP_REVAL, path);

	if (likely(!retval))
		audit_inode(name, path->dentry,
			    flags & LOOKUP_MOUNTPOINT ? AUDIT_INODE_NOEVAL : 0);
	restore_nameidata();
	return retval;
}

/* Returns 0 and nd will be valid on success; Returns error, otherwise. */
static int path_parentat(struct nameidata *nd, unsigned flags,
				struct path *parent)
{
	const char *s = path_init(nd, flags);
	int err = link_path_walk(s, nd);
	if (!err)
		err = complete_walk(nd);
	if (!err) {
		*parent = nd->path;
		nd->path.mnt = NULL;
		nd->path.dentry = NULL;
	}
	terminate_walk(nd);
	return err;
}

/* Note: this does not consume "name" */
static int __filename_parentat(int dfd, struct filename *name,
			       unsigned int flags, struct path *parent,
			       struct qstr *last, int *type,
			       const struct path *root)
{
	int retval;
	struct nameidata nd;

	if (IS_ERR(name))
		return PTR_ERR(name);
	set_nameidata(&nd, dfd, name, root);
	retval = path_parentat(&nd, flags | LOOKUP_RCU, parent);
	if (unlikely(retval == -ECHILD))
		retval = path_parentat(&nd, flags, parent);
	if (unlikely(retval == -ESTALE))
		retval = path_parentat(&nd, flags | LOOKUP_REVAL, parent);
	if (likely(!retval)) {
		*last = nd.last;
		*type = nd.last_type;
		audit_inode(name, parent->dentry, AUDIT_INODE_PARENT);
	}
	restore_nameidata();
	return retval;
}

static int filename_parentat(int dfd, struct filename *name,
			     unsigned int flags, struct path *parent,
			     struct qstr *last, int *type)
{
	return __filename_parentat(dfd, name, flags, parent, last, type, NULL);
}

/**
 * __start_dirop - begin a create or remove dirop, performing locking and lookup
 * @parent:       the dentry of the parent in which the operation will occur
 * @name:         a qstr holding the name within that parent
 * @lookup_flags: intent and other lookup flags.
 * @state:        task state bitmask
 *
 * The lookup is performed and necessary locks are taken so that, on success,
 * the returned dentry can be operated on safely.
 * The qstr must already have the hash value calculated.
 *
 * Returns: a locked dentry, or an error.
 *
 */
static struct dentry *__start_dirop(struct dentry *parent, struct qstr *name,
				    unsigned int lookup_flags,
				    unsigned int state)
{
	struct dentry *dentry;
	struct inode *dir = d_inode(parent);

	if (state == TASK_KILLABLE) {
		int ret = down_write_killable_nested(&dir->i_rwsem,
						     I_MUTEX_PARENT);
		if (ret)
			return ERR_PTR(ret);
	} else {
		inode_lock_nested(dir, I_MUTEX_PARENT);
	}
	dentry = lookup_one_qstr_excl(name, parent, lookup_flags);
	if (IS_ERR(dentry))
		inode_unlock(dir);
	return dentry;
}

struct dentry *start_dirop(struct dentry *parent, struct qstr *name,
			   unsigned int lookup_flags)
{
	return __start_dirop(parent, name, lookup_flags, TASK_NORMAL);
}

/**
 * end_dirop - signal completion of a dirop
 * @de: the dentry which was returned by start_dirop or similar.
 *
 * If @de is an error, nothing happens.  Otherwise any lock taken to
 * protect the dentry is dropped and the dentry itself is released (dput()).
 */
void end_dirop(struct dentry *de)
{
	if (!IS_ERR(de)) {
		inode_unlock(de->d_parent->d_inode);
		dput(de);
	}
}
EXPORT_SYMBOL(end_dirop);
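
/*
 * Illustrative pairing (error handling trimmed; the qstr hash must
 * already be computed, e.g. via lookup_noperm_common()):
 *
 *	struct dentry *d = start_dirop(parent, &q, LOOKUP_CREATE);
 *	if (!IS_ERR(d)) {
 *		... create the object at d ...
 *		end_dirop(d);	// unlocks the parent and dput()s d
 *	}
 */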

/* does lookup, returns the object with parent locked */
static struct dentry *__start_removing_path(int dfd, struct filename *name,
					   struct path *path)
{
	struct path parent_path __free(path_put) = {};
	struct dentry *d;
	struct qstr last;
	int type, error;

	error = filename_parentat(dfd, name, 0, &parent_path, &last, &type);
	if (error)
		return ERR_PTR(error);
	if (unlikely(type != LAST_NORM))
		return ERR_PTR(-EINVAL);
	/* don't fail immediately if it's r/o, at least try to report other errors */
	error = mnt_want_write(parent_path.mnt);
	d = start_dirop(parent_path.dentry, &last, 0);
	if (IS_ERR(d))
		goto drop;
	if (error)
		goto fail;
	path->dentry = no_free_ptr(parent_path.dentry);
	path->mnt = no_free_ptr(parent_path.mnt);
	return d;

fail:
	end_dirop(d);
	d = ERR_PTR(error);
drop:
	if (!error)
		mnt_drop_write(parent_path.mnt);
	return d;
}

/**
 * kern_path_parent - lookup a path, returning the parent and the target
 * @name: path name
 * @path: path in which to store the parent
 *
 * The path @name should end with a normal component, not "." or ".." or "/".
 * A lookup is performed and, if successful, the parent information
 * is stored in @path and the target dentry is returned.
 *
 * The dentry may be negative; the parent will be positive.
 *
 * Returns:  dentry or error.
 */
struct dentry *kern_path_parent(const char *name, struct path *path)
{
	struct path parent_path __free(path_put) = {};
	struct filename *filename __free(putname) = getname_kernel(name);
	struct dentry *d;
	struct qstr last;
	int type, error;

	error = filename_parentat(AT_FDCWD, filename, 0, &parent_path, &last, &type);
	if (error)
		return ERR_PTR(error);
	if (unlikely(type != LAST_NORM))
		return ERR_PTR(-EINVAL);

	d = lookup_noperm_unlocked(&last, parent_path.dentry);
	if (IS_ERR(d))
		return d;
	path->dentry = no_free_ptr(parent_path.dentry);
	path->mnt = no_free_ptr(parent_path.mnt);
	return d;
}

struct dentry *start_removing_path(const char *name, struct path *path)
{
	struct filename *filename = getname_kernel(name);
	struct dentry *res = __start_removing_path(AT_FDCWD, filename, path);

	putname(filename);
	return res;
}

struct dentry *start_removing_user_path_at(int dfd,
					   const char __user *name,
					   struct path *path)
{
	struct filename *filename = getname(name);
	struct dentry *res = __start_removing_path(dfd, filename, path);

	putname(filename);
	return res;
}
EXPORT_SYMBOL(start_removing_user_path_at);

int kern_path(const char *name, unsigned int flags, struct path *path)
{
	struct filename *filename = getname_kernel(name);
	int ret = filename_lookup(AT_FDCWD, filename, flags, path, NULL);

	putname(filename);
	return ret;
}
EXPORT_SYMBOL(kern_path);
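
/*
 * Example (illustrative): resolving a kernel string to a struct path
 * and releasing it when done:
 *
 *	struct path p;
 *	int err = kern_path("/dev/null", LOOKUP_FOLLOW, &p);
 *	if (!err) {
 *		... use p.mnt and p.dentry ...
 *		path_put(&p);
 *	}
 */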

/**
 * vfs_path_parent_lookup - lookup a parent path relative to a dentry-vfsmount pair
 * @filename: filename structure
 * @flags: lookup flags
 * @parent: pointer to struct path to fill
 * @last: last component
 * @type: type of the last component
 * @root: pointer to struct path of the base directory
 */
int vfs_path_parent_lookup(struct filename *filename, unsigned int flags,
			   struct path *parent, struct qstr *last, int *type,
			   const struct path *root)
{
	return __filename_parentat(AT_FDCWD, filename, flags, parent, last,
				   type, root);
}
EXPORT_SYMBOL(vfs_path_parent_lookup);

/**
 * vfs_path_lookup - lookup a file path relative to a dentry-vfsmount pair
 * @dentry:  pointer to dentry of the base directory
 * @mnt: pointer to vfs mount of the base directory
 * @name: pointer to file name
 * @flags: lookup flags
 * @path: pointer to struct path to fill
 */
int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
		    const char *name, unsigned int flags,
		    struct path *path)
{
	struct filename *filename;
	struct path root = {.mnt = mnt, .dentry = dentry};
	int ret;

	filename = getname_kernel(name);
	/* the first argument of filename_lookup() is ignored with root */
	ret = filename_lookup(AT_FDCWD, filename, flags, path, &root);
	putname(filename);
	return ret;
}
EXPORT_SYMBOL(vfs_path_lookup);

int lookup_noperm_common(struct qstr *qname, struct dentry *base)
{
	const char *name = qname->name;
	u32 len = qname->len;

	qname->hash = full_name_hash(base, name, len);
	if (!len)
		return -EACCES;

	if (is_dot_dotdot(name, len))
		return -EACCES;

	while (len--) {
		unsigned int c = *(const unsigned char *)name++;
		if (c == '/' || c == '\0')
			return -EACCES;
	}
	/*
	 * See if the low-level filesystem might want
	 * to use its own hash..
	 */
	if (base->d_flags & DCACHE_OP_HASH) {
		int err = base->d_op->d_hash(base, qname);
		if (err < 0)
			return err;
	}
	return 0;
}

static int lookup_one_common(struct mnt_idmap *idmap,
			     struct qstr *qname, struct dentry *base)
{
	int err;
	err = lookup_noperm_common(qname, base);
	if (err < 0)
		return err;
	return inode_permission(idmap, base->d_inode, MAY_EXEC);
}

/**
 * try_lookup_noperm - filesystem helper to lookup single pathname component
 * @name:	qstr storing pathname component to lookup
 * @base:	base directory to lookup from
 *
 * Look up a dentry by name in the dcache, returning NULL if it does not
 * currently exist.  The function does not try to create a dentry and if one
 * is found it doesn't try to revalidate it.
 *
 * Note that this routine is purely a helper for filesystem usage and should
 * not be called by generic code.  It does no permission checking.
 *
 * No locks need be held - only a counted reference to @base is needed.
 *
 */
struct dentry *try_lookup_noperm(struct qstr *name, struct dentry *base)
{
	int err;

	err = lookup_noperm_common(name, base);
	if (err)
		return ERR_PTR(err);

	return d_lookup(base, name);
}
EXPORT_SYMBOL(try_lookup_noperm);
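
/*
 * Illustrative: peeking at the dcache without risking I/O or dentry
 * creation (no locks needed beyond a counted reference on @base):
 *
 *	struct qstr q = QSTR_INIT("foo", 3);
 *	struct dentry *d = try_lookup_noperm(&q, base);
 *	if (!IS_ERR_OR_NULL(d))
 *		dput(d);
 */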

/**
 * lookup_noperm - filesystem helper to lookup single pathname component
 * @name:	qstr storing pathname component to lookup
 * @base:	base directory to lookup from
 *
 * Note that this routine is purely a helper for filesystem usage and should
 * not be called by generic code.  It does no permission checking.
 *
 * The caller must hold base->i_rwsem.
 */
struct dentry *lookup_noperm(struct qstr *name, struct dentry *base)
{
	struct dentry *dentry;
	int err;

	WARN_ON_ONCE(!inode_is_locked(base->d_inode));

	err = lookup_noperm_common(name, base);
	if (err)
		return ERR_PTR(err);

	dentry = lookup_dcache(name, base, 0);
	return dentry ? dentry : __lookup_slow(name, base, 0);
}
EXPORT_SYMBOL(lookup_noperm);

/**
 * lookup_one - lookup single pathname component
 * @idmap:	idmap of the mount the lookup is performed from
 * @name:	qstr holding pathname component to lookup
 * @base:	base directory to lookup from
 *
 * This can be used for in-kernel filesystem clients such as file servers.
 *
 * The caller must hold base->i_rwsem.
 */
struct dentry *lookup_one(struct mnt_idmap *idmap, struct qstr *name,
			  struct dentry *base)
{
	struct dentry *dentry;
	int err;

	WARN_ON_ONCE(!inode_is_locked(base->d_inode));

	err = lookup_one_common(idmap, name, base);
	if (err)
		return ERR_PTR(err);

	dentry = lookup_dcache(name, base, 0);
	return dentry ? dentry : __lookup_slow(name, base, 0);
}
EXPORT_SYMBOL(lookup_one);
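
/*
 * Sketch of a typical caller, e.g. an in-kernel file server, holding
 * the parent lock as required (illustrative only):
 *
 *	struct qstr q = QSTR_INIT("foo", 3);
 *	struct dentry *d;
 *
 *	inode_lock(d_inode(base));
 *	d = lookup_one(idmap, &q, base);
 *	inode_unlock(d_inode(base));
 *	if (!IS_ERR(d))
 *		dput(d);
 */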

/**
 * lookup_one_unlocked - lookup single pathname component
 * @idmap:	idmap of the mount the lookup is performed from
 * @name:	qstr holding pathname component to lookup
 * @base:	base directory to lookup from
 *
 * This can be used for in-kernel filesystem clients such as file servers.
 *
 * Unlike lookup_one, it should be called without the parent
 * i_rwsem held, and will take the i_rwsem itself if necessary.
 */
struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap, struct qstr *name,
				   struct dentry *base)
{
	int err;
	struct dentry *ret;

	err = lookup_one_common(idmap, name, base);
	if (err)
		return ERR_PTR(err);

	ret = lookup_dcache(name, base, 0);
	if (!ret)
		ret = lookup_slow(name, base, 0);
	return ret;
}
EXPORT_SYMBOL(lookup_one_unlocked);

/**
 * lookup_one_positive_killable - lookup single pathname component
 * @idmap:	idmap of the mount the lookup is performed from
 * @name:	qstr holding pathname component to lookup
 * @base:	base directory to lookup from
 *
 * This helper will yield ERR_PTR(-ENOENT) on negatives. The helper returns
 * known positive or ERR_PTR(). This is what most of the users want.
 *
 * Note that pinned negative with unlocked parent _can_ become positive at any
 * time, so callers of lookup_one_unlocked() need to be very careful; pinned
 * positives have ->d_inode stable, so this one avoids such problems.
 *
 * This can be used for in-kernel filesystem clients such as file servers.
 *
 * It should be called without the parent i_rwsem held, and will take
 * the i_rwsem itself if necessary.  If a fatal signal is pending or
 * delivered, it will return %-EINTR if the lock is needed.
 */
struct dentry *lookup_one_positive_killable(struct mnt_idmap *idmap,
					    struct qstr *name,
					    struct dentry *base)
{
	int err;
	struct dentry *ret;

	err = lookup_one_common(idmap, name, base);
	if (err)
		return ERR_PTR(err);

	ret = lookup_dcache(name, base, 0);
	if (!ret)
		ret = lookup_slow_killable(name, base, 0);
	if (!IS_ERR(ret) && d_flags_negative(smp_load_acquire(&ret->d_flags))) {
		dput(ret);
		ret = ERR_PTR(-ENOENT);
	}
	return ret;
}
EXPORT_SYMBOL(lookup_one_positive_killable);

/**
 * lookup_one_positive_unlocked - lookup single pathname component
 * @idmap:	idmap of the mount the lookup is performed from
 * @name:	qstr holding pathname component to lookup
 * @base:	base directory to lookup from
 *
 * This helper will yield ERR_PTR(-ENOENT) on negatives. The helper returns
 * known positive or ERR_PTR(). This is what most of the users want.
 *
 * Note that pinned negative with unlocked parent _can_ become positive at any
 * time, so callers of lookup_one_unlocked() need to be very careful; pinned
 * positives have ->d_inode stable, so this one avoids such problems.
 *
 * This can be used for in-kernel filesystem clients such as file servers.
 *
 * The helper should be called without i_rwsem held.
 */
struct dentry *lookup_one_positive_unlocked(struct mnt_idmap *idmap,
					    struct qstr *name,
					    struct dentry *base)
{
	struct dentry *ret = lookup_one_unlocked(idmap, name, base);

	if (!IS_ERR(ret) && d_flags_negative(smp_load_acquire(&ret->d_flags))) {
		dput(ret);
		ret = ERR_PTR(-ENOENT);
	}
	return ret;
}
EXPORT_SYMBOL(lookup_one_positive_unlocked);

/**
 * lookup_noperm_unlocked - filesystem helper to lookup single pathname component
 * @name:	pathname component to lookup
 * @base:	base directory to lookup from
 *
 * Note that this routine is purely a helper for filesystem usage and should
 * not be called by generic code. It does no permission checking.
 *
 * Unlike lookup_noperm(), it should be called without the parent
 * i_rwsem held, and will take the i_rwsem itself if necessary.
 *
 * Unlike try_lookup_noperm() it *does* revalidate the dentry if it already
 * existed.
 */
struct dentry *lookup_noperm_unlocked(struct qstr *name, struct dentry *base)
{
	struct dentry *ret;
	int err;

	err = lookup_noperm_common(name, base);
	if (err)
		return ERR_PTR(err);

	ret = lookup_dcache(name, base, 0);
	if (!ret)
		ret = lookup_slow(name, base, 0);
	return ret;
}
EXPORT_SYMBOL(lookup_noperm_unlocked);

/*
 * Like lookup_noperm_unlocked(), except that it yields ERR_PTR(-ENOENT)
 * on negatives.  Returns known positive or ERR_PTR(); that's what
 * most of the users want.  Note that pinned negative with unlocked parent
 * _can_ become positive at any time, so callers of lookup_noperm_unlocked()
 * need to be very careful; pinned positives have ->d_inode stable, so
 * this one avoids such problems.
 */
struct dentry *lookup_noperm_positive_unlocked(struct qstr *name,
					       struct dentry *base)
{
	struct dentry *ret;

	ret = lookup_noperm_unlocked(name, base);
	if (!IS_ERR(ret) && d_flags_negative(smp_load_acquire(&ret->d_flags))) {
		dput(ret);
		ret = ERR_PTR(-ENOENT);
	}
	return ret;
}
EXPORT_SYMBOL(lookup_noperm_positive_unlocked);

/**
 * start_creating - prepare to create a given name with permission checking
 * @idmap:  idmap of the mount
 * @parent: directory in which to prepare to create the name
 * @name:   the name to be created
 *
 * Locks are taken and a lookup is performed prior to creating
 * an object in a directory.  Permission checking (MAY_EXEC) is performed
 * against @idmap.
 *
 * If the name already exists, a positive dentry is returned, so
 * behaviour is similar to O_CREAT without O_EXCL, which doesn't fail
 * with -EEXIST.
 *
 * Returns: a negative or positive dentry, or an error.
 */
struct dentry *start_creating(struct mnt_idmap *idmap, struct dentry *parent,
			      struct qstr *name)
{
	int err = lookup_one_common(idmap, name, parent);

	if (err)
		return ERR_PTR(err);
	return start_dirop(parent, name, LOOKUP_CREATE);
}
EXPORT_SYMBOL(start_creating);
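
/*
 * Sketch (illustrative; the actual creation call depends on the object
 * type and the current vfs_*() signatures):
 *
 *	struct dentry *d = start_creating(idmap, parent, &q);
 *	if (!IS_ERR(d)) {
 *		if (d_is_negative(d))
 *			... vfs_create()/vfs_mkdir()/... on d ...
 *		end_dirop(d);
 *	}
 */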

/**
 * start_removing - prepare to remove a given name with permission checking
 * @idmap:  idmap of the mount
 * @parent: directory in which to find the name
 * @name:   the name to be removed
 *
 * Locks are taken and a lookup is performed prior to removing
 * an object from a directory.  Permission checking (MAY_EXEC) is performed
 * against @idmap.
 *
 * If the name doesn't exist, an error is returned.
 *
 * end_removing() should be called when removal is complete, or aborted.
 *
 * Returns: a positive dentry, or an error.
 */
struct dentry *start_removing(struct mnt_idmap *idmap, struct dentry *parent,
			      struct qstr *name)
{
	int err = lookup_one_common(idmap, name, parent);

	if (err)
		return ERR_PTR(err);
	return start_dirop(parent, name, 0);
}
EXPORT_SYMBOL(start_removing);

/**
 * start_creating_killable - prepare to create a given name with permission checking
 * @idmap:  idmap of the mount
 * @parent: directory in which to prepare to create the name
 * @name:   the name to be created
 *
 * Locks are taken and a lookup is performed prior to creating
 * an object in a directory.  Permission checking (MAY_EXEC) is performed
 * against @idmap.
 *
 * If the name already exists, a positive dentry is returned.
 *
 * If a fatal signal is received or was already pending, the function
 * aborts with -EINTR.
 *
 * Returns: a negative or positive dentry, or an error.
 */
struct dentry *start_creating_killable(struct mnt_idmap *idmap,
				       struct dentry *parent,
				       struct qstr *name)
{
	int err = lookup_one_common(idmap, name, parent);

	if (err)
		return ERR_PTR(err);
	return __start_dirop(parent, name, LOOKUP_CREATE, TASK_KILLABLE);
}
EXPORT_SYMBOL(start_creating_killable);

/**
 * start_removing_killable - prepare to remove a given name with permission checking
 * @idmap:  idmap of the mount
 * @parent: directory in which to find the name
 * @name:   the name to be removed
 *
 * Locks are taken and a lookup is performed prior to removing
 * an object from a directory.  Permission checking (MAY_EXEC) is performed
 * against @idmap.
 *
 * If the name doesn't exist, an error is returned.
 *
 * end_removing() should be called when removal is complete, or aborted.
 *
 * If a fatal signal is received or was already pending, the function
 * aborts with -EINTR.
 *
 * Returns: a positive dentry, or an error.
 */
struct dentry *start_removing_killable(struct mnt_idmap *idmap,
				       struct dentry *parent,
				       struct qstr *name)
{
	int err = lookup_one_common(idmap, name, parent);

	if (err)
		return ERR_PTR(err);
	return __start_dirop(parent, name, 0, TASK_KILLABLE);
}
EXPORT_SYMBOL(start_removing_killable);

/**
 * start_creating_noperm - prepare to create a given name without permission checking
 * @parent: directory in which to prepare to create the name
 * @name:   the name to be created
 *
 * Locks are taken and a lookup is performed prior to creating
 * an object in a directory.
 *
 * If the name already exists, a positive dentry is returned.
 *
 * Returns: a negative or positive dentry, or an error.
 */
struct dentry *start_creating_noperm(struct dentry *parent,
				     struct qstr *name)
{
	int err = lookup_noperm_common(name, parent);

	if (err)
		return ERR_PTR(err);
	return start_dirop(parent, name, LOOKUP_CREATE);
}
EXPORT_SYMBOL(start_creating_noperm);

/**
 * start_removing_noperm - prepare to remove a given name without permission checking
 * @parent: directory in which to find the name
 * @name:   the name to be removed
 *
 * Locks are taken and a lookup is performed prior to removing
 * an object from a directory.
 *
 * If the name doesn't exist, an error is returned.
 *
 * end_removing() should be called when removal is complete, or aborted.
 *
 * Returns: a positive dentry, or an error.
 */
struct dentry *start_removing_noperm(struct dentry *parent,
				     struct qstr *name)
{
	int err = lookup_noperm_common(name, parent);

	if (err)
		return ERR_PTR(err);
	return start_dirop(parent, name, 0);
}
EXPORT_SYMBOL(start_removing_noperm);

/**
 * start_creating_dentry - prepare to create a given dentry
 * @parent: directory in which the dentry should be created
 * @child:  the dentry to be created
 *
 * A lock is taken to protect the dentry against other dirops and
 * the validity of the dentry is checked: correct parent and still hashed.
 *
 * If the dentry is valid and negative, a reference is taken and
 * returned.  If not, an error is returned.
 *
 * end_creating() should be called when creation is complete, or aborted.
 *
 * Returns: the valid dentry, or an error.
 */
struct dentry *start_creating_dentry(struct dentry *parent,
				     struct dentry *child)
{
	inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
	if (unlikely(IS_DEADDIR(parent->d_inode) ||
		     child->d_parent != parent ||
		     d_unhashed(child))) {
		inode_unlock(parent->d_inode);
		return ERR_PTR(-EINVAL);
	}
	if (d_is_positive(child)) {
		inode_unlock(parent->d_inode);
		return ERR_PTR(-EEXIST);
	}
	return dget(child);
}
EXPORT_SYMBOL(start_creating_dentry);

/**
 * start_removing_dentry - prepare to remove a given dentry
 * @parent: directory from which dentry should be removed
 * @child:  the dentry to be removed
 *
 * A lock is taken to protect the dentry against other dirops and
 * the validity of the dentry is checked: correct parent and still hashed.
 *
 * If the dentry is valid and positive, a reference is taken and
 * returned.  If not, an error is returned.
 *
 * end_removing() should be called when removal is complete, or aborted.
 *
 * Returns: the valid dentry, or an error.
 */
struct dentry *start_removing_dentry(struct dentry *parent,
				     struct dentry *child)
{
	inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
	if (unlikely(IS_DEADDIR(parent->d_inode) ||
		     child->d_parent != parent ||
		     d_unhashed(child))) {
		inode_unlock(parent->d_inode);
		return ERR_PTR(-EINVAL);
	}
	if (d_is_negative(child)) {
		inode_unlock(parent->d_inode);
		return ERR_PTR(-ENOENT);
	}
	return dget(child);
}
EXPORT_SYMBOL(start_removing_dentry);

#ifdef CONFIG_UNIX98_PTYS
int path_pts(struct path *path)
{
	/* Find something mounted on "pts" in the same directory as
	 * the input path.
	 */
	struct dentry *parent = dget_parent(path->dentry);
	struct dentry *child;
	struct qstr this = QSTR_INIT("pts", 3);

	if (unlikely(!path_connected(path->mnt, parent))) {
		dput(parent);
		return -ENOENT;
	}
	dput(path->dentry);
	path->dentry = parent;
	child = d_hash_and_lookup(parent, &this);
	if (IS_ERR_OR_NULL(child))
		return -ENOENT;

	path->dentry = child;
	dput(parent);
	follow_down(path, 0);
	return 0;
}
#endif

int user_path_at(int dfd, const char __user *name, unsigned flags,
		 struct path *path)
{
	struct filename *filename = getname_flags(name, flags);
	int ret = filename_lookup(dfd, filename, flags, path, NULL);

	putname(filename);
	return ret;
}
EXPORT_SYMBOL(user_path_at);

int __check_sticky(struct mnt_idmap *idmap, struct inode *dir,
		   struct inode *inode)
{
	kuid_t fsuid = current_fsuid();

	if (vfsuid_eq_kuid(i_uid_into_vfsuid(idmap, inode), fsuid))
		return 0;
	if (vfsuid_eq_kuid(i_uid_into_vfsuid(idmap, dir), fsuid))
		return 0;
	return !capable_wrt_inode_uidgid(idmap, inode, CAP_FOWNER);
}
EXPORT_SYMBOL(__check_sticky);

/*
 *	Check whether we can remove a link victim from directory dir, check
 *  whether the type of victim is right.
 *  1. We can't do it if dir is read-only (done in permission())
 *  2. We should have write and exec permissions on dir
 *  3. We can't remove anything from append-only dir
 *  4. We can't do anything with immutable dir (done in permission())
 *  5. If the sticky bit on dir is set we should either
 *	a. be owner of dir, or
 *	b. be owner of victim, or
 *	c. have CAP_FOWNER capability
 *  6. If the victim is append-only or immutable we can't do anything with
 *     links pointing to it.
 *  7. If the victim has an unknown uid or gid we can't change the inode.
 *  8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
 *  9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
 * 10. We can't remove a root or mountpoint.
 * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
 *     nfs_async_unlink().
 */
static int may_delete(struct mnt_idmap *idmap, struct inode *dir,
		      struct dentry *victim, bool isdir)
{
	struct inode *inode = d_backing_inode(victim);
	int error;

	if (d_is_negative(victim))
		return -ENOENT;
	BUG_ON(!inode);

	BUG_ON(victim->d_parent->d_inode != dir);

	/* Inode writeback is not safe when the uid or gid are invalid. */
	if (!vfsuid_valid(i_uid_into_vfsuid(idmap, inode)) ||
	    !vfsgid_valid(i_gid_into_vfsgid(idmap, inode)))
		return -EOVERFLOW;

	audit_inode_child(dir, victim, AUDIT_TYPE_CHILD_DELETE);

	error = inode_permission(idmap, dir, MAY_WRITE | MAY_EXEC);
	if (error)
		return error;
	if (IS_APPEND(dir))
		return -EPERM;

	if (check_sticky(idmap, dir, inode) || IS_APPEND(inode) ||
	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) ||
	    HAS_UNMAPPED_ID(idmap, inode))
		return -EPERM;
	if (isdir) {
		if (!d_is_dir(victim))
			return -ENOTDIR;
		if (IS_ROOT(victim))
			return -EBUSY;
	} else if (d_is_dir(victim))
		return -EISDIR;
	if (IS_DEADDIR(dir))
		return -ENOENT;
	if (victim->d_flags & DCACHE_NFSFS_RENAMED)
		return -EBUSY;
	return 0;
}

/*	Check whether we can create an object with dentry child in directory
 *  dir.
 *  1. We can't do it if child already exists (open has special treatment for
 *     this case, but since we are inlined it's OK)
 *  2. We can't do it if dir is read-only (done in permission())
 *  3. We can't do it if the fs can't represent the fsuid or fsgid.
 *  4. We should have write and exec permissions on dir
 *  5. We can't do it if dir is immutable (done in permission())
 */
static inline int may_create(struct mnt_idmap *idmap,
			     struct inode *dir, struct dentry *child)
{
	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
	if (child->d_inode)
		return -EEXIST;
	if (IS_DEADDIR(dir))
		return -ENOENT;
	if (!fsuidgid_has_mapping(dir->i_sb, idmap))
		return -EOVERFLOW;

	return inode_permission(idmap, dir, MAY_WRITE | MAY_EXEC);
}

// p1 != p2, both are on the same filesystem, ->s_vfs_rename_mutex is held
static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
{
	struct dentry *p = p1, *q = p2, *r;

	while ((r = p->d_parent) != p2 && r != p)
		p = r;
	if (r == p2) {
		// p is a child of p2 and an ancestor of p1 or p1 itself
		inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
		inode_lock_nested(p1->d_inode, I_MUTEX_PARENT2);
		return p;
	}
	// p is the root of connected component that contains p1
	// p2 does not occur on the path from p to p1
	while ((r = q->d_parent) != p1 && r != p && r != q)
		q = r;
	if (r == p1) {
		// q is a child of p1 and an ancestor of p2 or p2 itself
		inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
		inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
		return q;
	} else if (likely(r == p)) {
		// both p1 and p2 are descendants of p
		inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
		inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
		return NULL;
	} else { // no common ancestor at the time we'd been called
		mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
		return ERR_PTR(-EXDEV);
	}
}

/*
 * p1 and p2 should be directories on the same fs.
 */
struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
{
	if (p1 == p2) {
		inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
		return NULL;
	}

	mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
	return lock_two_directories(p1, p2);
}
EXPORT_SYMBOL(lock_rename);

/*
 * c1 and p2 should be on the same fs.
 */
struct dentry *lock_rename_child(struct dentry *c1, struct dentry *p2)
{
	if (READ_ONCE(c1->d_parent) == p2) {
		/*
		 * hopefully won't need to touch ->s_vfs_rename_mutex at all.
		 */
		inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
		/*
		 * now that p2 is locked, nobody can move in or out of it,
		 * so the test below is safe.
		 */
		if (likely(c1->d_parent == p2))
			return NULL;

		/*
		 * c1 got moved out of p2 while we'd been taking locks;
		 * unlock and fall back to slow case.
		 */
		inode_unlock(p2->d_inode);
	}

	mutex_lock(&c1->d_sb->s_vfs_rename_mutex);
	/*
	 * nobody can move out of any directories on this fs.
	 */
	if (likely(c1->d_parent != p2))
		return lock_two_directories(c1->d_parent, p2);

	/*
	 * c1 got moved into p2 while we were taking locks;
	 * we need p2 locked and ->s_vfs_rename_mutex unlocked,
	 * for consistency with lock_rename().
	 */
	inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
	mutex_unlock(&c1->d_sb->s_vfs_rename_mutex);
	return NULL;
}
EXPORT_SYMBOL(lock_rename_child);

void unlock_rename(struct dentry *p1, struct dentry *p2)
{
	inode_unlock(p1->d_inode);
	if (p1 != p2) {
		inode_unlock(p2->d_inode);
		mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
	}
}
EXPORT_SYMBOL(unlock_rename);
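
/*
 * Illustrative pairing (not from the original source): the returned
 * "trap" dentry, when non-NULL, is the dentry that neither the rename
 * source nor the rename target may resolve to:
 *
 *	struct dentry *trap = lock_rename(old_dir, new_dir);
 *	if (!IS_ERR(trap)) {
 *		... look up source and target, compare each against trap ...
 *		unlock_rename(old_dir, new_dir);
 *	}
 */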

/**
 * __start_renaming - lookup and lock names for rename
 * @rd:           rename data containing parents and flags, and
 *                for receiving found dentries
 * @lookup_flags: extra flags to pass to ->lookup (e.g. LOOKUP_REVAL,
 *                LOOKUP_NO_SYMLINKS etc).
 * @old_last:     name of object in @rd.old_parent
 * @new_last:     name of object in @rd.new_parent
 *
 * Look up two names and ensure locks are in place for
 * rename.
 *
 * On success the found dentries are stored in @rd.old_dentry and
 * @rd.new_dentry, and an extra reference is taken on @rd.old_parent.
 * These references and the lock are dropped by end_renaming().
 *
 * The passed-in qstrs must have the hash calculated, and no permission
 * checking is performed.
 *
 * Returns: zero or an error.
 */
static int
__start_renaming(struct renamedata *rd, int lookup_flags,
		 struct qstr *old_last, struct qstr *new_last)
{
	struct dentry *trap;
	struct dentry *d1, *d2;
	int target_flags = LOOKUP_RENAME_TARGET | LOOKUP_CREATE;
	int err;

	if (rd->flags & RENAME_EXCHANGE)
		target_flags = 0;
	if (rd->flags & RENAME_NOREPLACE)
		target_flags |= LOOKUP_EXCL;

	trap = lock_rename(rd->old_parent, rd->new_parent);
	if (IS_ERR(trap))
		return PTR_ERR(trap);

	d1 = lookup_one_qstr_excl(old_last, rd->old_parent,
				  lookup_flags);
	err = PTR_ERR(d1);
	if (IS_ERR(d1))
		goto out_unlock;

	d2 = lookup_one_qstr_excl(new_last, rd->new_parent,
				  lookup_flags | target_flags);
	err = PTR_ERR(d2);
	if (IS_ERR(d2))
		goto out_dput_d1;

	if (d1 == trap) {
		/* source is an ancestor of target */
		err = -EINVAL;
		goto out_dput_d2;
	}

	if (d2 == trap) {
		/* target is an ancestor of source */
		if (rd->flags & RENAME_EXCHANGE)
			err = -EINVAL;
		else
			err = -ENOTEMPTY;
		goto out_dput_d2;
	}

	rd->old_dentry = d1;
	rd->new_dentry = d2;
	dget(rd->old_parent);
	return 0;

out_dput_d2:
	dput(d2);
out_dput_d1:
	dput(d1);
out_unlock:
	unlock_rename(rd->old_parent, rd->new_parent);
	return err;
}

/**
 * start_renaming - lookup and lock names for rename with permission checking
 * @rd:           rename data containing parents and flags, and
 *                for receiving found dentries
 * @lookup_flags: extra flags to pass to ->lookup (e.g. LOOKUP_REVAL,
 *                LOOKUP_NO_SYMLINKS etc).
 * @old_last:     name of object in @rd.old_parent
 * @new_last:     name of object in @rd.new_parent
 *
 * Look up two names and ensure locks are in place for
 * rename.
 *
 * On success the found dentries are stored in @rd.old_dentry,
 * @rd.new_dentry.  Also the refcount on @rd.old_parent is increased.
 * These references and the lock are dropped by end_renaming().
 *
 * The passed in qstrs need not have the hash calculated, and basic
 * eXecute permission checking is performed against @rd.mnt_idmap.
 *
 * Returns: zero or an error.
 */
int start_renaming(struct renamedata *rd, int lookup_flags,
		   struct qstr *old_last, struct qstr *new_last)
{
	int err;

	err = lookup_one_common(rd->mnt_idmap, old_last, rd->old_parent);
	if (err)
		return err;
	err = lookup_one_common(rd->mnt_idmap, new_last, rd->new_parent);
	if (err)
		return err;
	return __start_renaming(rd, lookup_flags, old_last, new_last);
}
EXPORT_SYMBOL(start_renaming);

static int
__start_renaming_dentry(struct renamedata *rd, int lookup_flags,
			struct dentry *old_dentry, struct qstr *new_last)
{
	struct dentry *trap;
	struct dentry *d2;
	int target_flags = LOOKUP_RENAME_TARGET | LOOKUP_CREATE;
	int err;

	if (rd->flags & RENAME_EXCHANGE)
		target_flags = 0;
	if (rd->flags & RENAME_NOREPLACE)
		target_flags |= LOOKUP_EXCL;

	/* Already have the dentry - need to be sure to lock the correct parent */
	trap = lock_rename_child(old_dentry, rd->new_parent);
	if (IS_ERR(trap))
		return PTR_ERR(trap);
	if (d_unhashed(old_dentry) ||
	    (rd->old_parent && rd->old_parent != old_dentry->d_parent)) {
		/* dentry was removed, or moved and explicit parent requested */
		err = -EINVAL;
		goto out_unlock;
	}

	d2 = lookup_one_qstr_excl(new_last, rd->new_parent,
				  lookup_flags | target_flags);
	err = PTR_ERR(d2);
	if (IS_ERR(d2))
		goto out_unlock;

	if (old_dentry == trap) {
		/* source is an ancestor of target */
		err = -EINVAL;
		goto out_dput_d2;
	}

	if (d2 == trap) {
		/* target is an ancestor of source */
		if (rd->flags & RENAME_EXCHANGE)
			err = -EINVAL;
		else
			err = -ENOTEMPTY;
		goto out_dput_d2;
	}

	rd->old_dentry = dget(old_dentry);
	rd->new_dentry = d2;
	rd->old_parent = dget(old_dentry->d_parent);
	return 0;

out_dput_d2:
	dput(d2);
out_unlock:
	unlock_rename(old_dentry->d_parent, rd->new_parent);
	return err;
}

/**
 * start_renaming_dentry - lookup and lock name for rename with permission checking
 * @rd:           rename data containing parents and flags, and
 *                for receiving found dentries
 * @lookup_flags: extra flags to pass to ->lookup (e.g. LOOKUP_REVAL,
 *                LOOKUP_NO_SYMLINKS etc).
 * @old_dentry:   dentry of name to move
 * @new_last:     name of target in @rd.new_parent
 *
 * Look up target name and ensure locks are in place for
 * rename.
 *
 * On success the found dentry is stored in @rd.new_dentry and
 * @rd.old_parent is confirmed to be the parent of @old_dentry.  If it
 * was originally %NULL, it is set.  In either case a reference is taken
 * so that end_renaming() can have a stable reference to unlock.
 *
 * References and the lock can be dropped with end_renaming()
 *
 * The passed in qstr need not have the hash calculated, and basic
 * eXecute permission checking is performed against @rd.mnt_idmap.
 *
 * Returns: zero or an error.
 */
int start_renaming_dentry(struct renamedata *rd, int lookup_flags,
			  struct dentry *old_dentry, struct qstr *new_last)
{
	int err;

	err = lookup_one_common(rd->mnt_idmap, new_last, rd->new_parent);
	if (err)
		return err;
	return __start_renaming_dentry(rd, lookup_flags, old_dentry, new_last);
}
EXPORT_SYMBOL(start_renaming_dentry);

/**
 * start_renaming_two_dentries - Lock two dentries in given parents for rename
 * @rd:           rename data containing parents and flags
 * @old_dentry:   dentry of name to move
 * @new_dentry:   dentry to move to
 *
 * Ensure locks are in place for rename and check parentage is still correct.
 *
 * On success the two dentries are stored in @rd.old_dentry and
 * @rd.new_dentry and @rd.old_parent and @rd.new_parent are confirmed to
 * be the parents of the dentries.
 *
 * References and the lock can be dropped with end_renaming()
 *
 * Returns: zero or an error.
 */
int
start_renaming_two_dentries(struct renamedata *rd,
			    struct dentry *old_dentry, struct dentry *new_dentry)
{
	struct dentry *trap;
	int err;

	/* Already have the dentry - need to be sure to lock the correct parent */
	trap = lock_rename_child(old_dentry, rd->new_parent);
	if (IS_ERR(trap))
		return PTR_ERR(trap);
	err = -EINVAL;
	if (d_unhashed(old_dentry) ||
	    (rd->old_parent && rd->old_parent != old_dentry->d_parent))
		/* old_dentry was removed, or moved and explicit parent requested */
		goto out_unlock;
	if (d_unhashed(new_dentry) ||
	    rd->new_parent != new_dentry->d_parent)
		/* new_dentry was removed or moved */
		goto out_unlock;

	if (old_dentry == trap)
		/* source is an ancestor of target */
		goto out_unlock;

	if (new_dentry == trap) {
		/* target is an ancestor of source */
		if (rd->flags & RENAME_EXCHANGE)
			err = -EINVAL;
		else
			err = -ENOTEMPTY;
		goto out_unlock;
	}

	err = -EEXIST;
	if (d_is_positive(new_dentry) && (rd->flags & RENAME_NOREPLACE))
		goto out_unlock;

	rd->old_dentry = dget(old_dentry);
	rd->new_dentry = dget(new_dentry);
	rd->old_parent = dget(old_dentry->d_parent);
	return 0;

out_unlock:
	unlock_rename(old_dentry->d_parent, rd->new_parent);
	return err;
}
EXPORT_SYMBOL(start_renaming_two_dentries);

void end_renaming(struct renamedata *rd)
{
	unlock_rename(rd->old_parent, rd->new_parent);
	dput(rd->old_dentry);
	dput(rd->new_dentry);
	dput(rd->old_parent);
}
EXPORT_SYMBOL(end_renaming);
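
/*
 * Illustrative sketch (hypothetical caller): start_renaming() and
 * end_renaming() are intended to bracket a vfs_rename() call.  The
 * caller is assumed to have filled in rd->old_parent, rd->new_parent,
 * rd->mnt_idmap and rd->flags beforehand, as do_renameat2() below does.
 */
#if 0
static int example_start_end_renaming(struct renamedata *rd,
				      struct qstr *old, struct qstr *new)
{
	int err = start_renaming(rd, 0, old, new);

	if (err)
		return err;
	err = vfs_rename(rd);
	end_renaming(rd);
	return err;
}
#endif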

/**
 * vfs_prepare_mode - prepare the mode to be used for a new inode
 * @idmap:	idmap of the mount the inode was found from
 * @dir:	parent directory of the new inode
 * @mode:	mode of the new inode
 * @mask_perms:	allowed permission by the vfs
 * @type:	type of file to be created
 *
 * This helper consolidates and enforces vfs restrictions on the @mode of a new
 * object to be created.
 *
 * Umask stripping depends on whether the filesystem supports POSIX ACLs (see
 * the kernel documentation for mode_strip_umask()). Moving umask stripping
 * after setgid stripping allows the same ordering for both non-POSIX ACL and
 * POSIX ACL supporting filesystems.
 *
 * Note that it's currently valid for @type to be 0 if a directory is created.
 * Filesystems raise that flag individually and we need to check whether each
 * filesystem can deal with receiving S_IFDIR from the vfs before we enforce a
 * non-zero type.
 *
 * Returns: mode to be passed to the filesystem
 */
static inline umode_t vfs_prepare_mode(struct mnt_idmap *idmap,
				       const struct inode *dir, umode_t mode,
				       umode_t mask_perms, umode_t type)
{
	mode = mode_strip_sgid(idmap, dir, mode);
	mode = mode_strip_umask(dir, mode);

	/*
	 * Apply the vfs mandated allowed permission mask and set the type of
	 * file to be created before we call into the filesystem.
	 */
	mode &= (mask_perms & ~S_IFMT);
	mode |= (type & S_IFMT);

	return mode;
}
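
/*
 * Worked example (assuming a filesystem without POSIX ACL support and a
 * process umask of 022): vfs_create() below calls
 * vfs_prepare_mode(idmap, dir, 0666, S_IALLUGO, S_IFREG), which yields
 *
 *	mode_strip_sgid():		0666	(no S_ISGID handling needed)
 *	mode_strip_umask():		0666 & ~022 = 0644
 *	& (S_IALLUGO & ~S_IFMT):	0644
 *	| (S_IFREG & S_IFMT):		S_IFREG | 0644
 *
 * so the filesystem sees a regular-file mode with the umask applied.
 */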

/**
 * vfs_create - create new file
 * @idmap:	idmap of the mount the inode was found from
 * @dentry:	dentry of the child file
 * @mode:	mode of the child file
 * @di:		returns parent inode, if the inode is delegated.
 *
 * Create a new file.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode,
	       struct delegated_inode *di)
{
	struct inode *dir = d_inode(dentry->d_parent);
	int error;

	error = may_create(idmap, dir, dentry);
	if (error)
		return error;

	if (!dir->i_op->create)
		return -EACCES;	/* shouldn't it be ENOSYS? */

	mode = vfs_prepare_mode(idmap, dir, mode, S_IALLUGO, S_IFREG);
	error = security_inode_create(dir, dentry, mode);
	if (error)
		return error;
	error = try_break_deleg(dir, di);
	if (error)
		return error;
	error = dir->i_op->create(idmap, dir, dentry, mode, true);
	if (!error)
		fsnotify_create(dir, dentry);
	return error;
}
EXPORT_SYMBOL(vfs_create);

int vfs_mkobj(struct dentry *dentry, umode_t mode,
		int (*f)(struct dentry *, umode_t, void *),
		void *arg)
{
	struct inode *dir = dentry->d_parent->d_inode;
	int error = may_create(&nop_mnt_idmap, dir, dentry);
	if (error)
		return error;

	mode &= S_IALLUGO;
	mode |= S_IFREG;
	error = security_inode_create(dir, dentry, mode);
	if (error)
		return error;
	error = f(dentry, mode, arg);
	if (!error)
		fsnotify_create(dir, dentry);
	return error;
}
EXPORT_SYMBOL(vfs_mkobj);

bool may_open_dev(const struct path *path)
{
	return !(path->mnt->mnt_flags & MNT_NODEV) &&
		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

static int may_open(struct mnt_idmap *idmap, const struct path *path,
		    int acc_mode, int flag)
{
	struct dentry *dentry = path->dentry;
	struct inode *inode = dentry->d_inode;
	int error;

	if (!inode)
		return -ENOENT;

	switch (inode->i_mode & S_IFMT) {
	case S_IFLNK:
		return -ELOOP;
	case S_IFDIR:
		if (acc_mode & MAY_WRITE)
			return -EISDIR;
		if (acc_mode & MAY_EXEC)
			return -EACCES;
		break;
	case S_IFBLK:
	case S_IFCHR:
		if (!may_open_dev(path))
			return -EACCES;
		fallthrough;
	case S_IFIFO:
	case S_IFSOCK:
		if (acc_mode & MAY_EXEC)
			return -EACCES;
		flag &= ~O_TRUNC;
		break;
	case S_IFREG:
		if ((acc_mode & MAY_EXEC) && path_noexec(path))
			return -EACCES;
		break;
	default:
		VFS_BUG_ON_INODE(!IS_ANON_FILE(inode), inode);
	}

	error = inode_permission(idmap, inode, MAY_OPEN | acc_mode);
	if (error)
		return error;

	/*
	 * An append-only file must be opened in append mode for writing.
	 */
	if (IS_APPEND(inode)) {
		if ((flag & O_ACCMODE) != O_RDONLY && !(flag & O_APPEND))
			return -EPERM;
		if (flag & O_TRUNC)
			return -EPERM;
	}

	/* O_NOATIME can only be set by the owner or superuser */
	if (flag & O_NOATIME && !inode_owner_or_capable(idmap, inode))
		return -EPERM;

	return 0;
}

static int handle_truncate(struct mnt_idmap *idmap, struct file *filp)
{
	const struct path *path = &filp->f_path;
	struct inode *inode = path->dentry->d_inode;
	int error = get_write_access(inode);
	if (error)
		return error;

	error = security_file_truncate(filp);
	if (!error) {
		error = do_truncate(idmap, path->dentry, 0,
				    ATTR_MTIME|ATTR_CTIME|ATTR_OPEN,
				    filp);
	}
	put_write_access(inode);
	return error;
}

static inline int open_to_namei_flags(int flag)
{
	if ((flag & O_ACCMODE) == 3)
		flag--;
	return flag;
}
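
/*
 * Historically, an access mode of 3 (both O_ACCMODE bits set) is the
 * legacy "special" open used by a few drivers for ioctl-only access;
 * decrementing maps it onto O_RDWR so that ->atomic_open() only ever
 * sees one of the three well-defined access modes.
 */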

static int may_o_create(struct mnt_idmap *idmap,
			const struct path *dir, struct dentry *dentry,
			umode_t mode)
{
	int error = security_path_mknod(dir, dentry, mode, 0);
	if (error)
		return error;

	if (!fsuidgid_has_mapping(dir->dentry->d_sb, idmap))
		return -EOVERFLOW;

	error = inode_permission(idmap, dir->dentry->d_inode,
				 MAY_WRITE | MAY_EXEC);
	if (error)
		return error;

	return security_inode_create(dir->dentry->d_inode, dentry, mode);
}

/*
 * Attempt to atomically look up, create and open a file from a negative
 * dentry.
 *
 * Returns the dentry on success.  If the file was opened, FMODE_OPENED is
 * set and the file has been attached to @file by the filesystem calling
 * finish_open().
 *
 * If the file was looked up only or didn't need creating, FMODE_OPENED won't
 * be set.  The caller will need to perform the open themselves.  The
 * returned dentry is the new last component; it may be negative.
 *
 * Returns an ERR_PTR() on failure.
 */
static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
				  struct file *file,
				  int open_flag, umode_t mode)
{
	struct dentry *const DENTRY_NOT_SET = (void *) -1UL;
	struct inode *dir = nd->path.dentry->d_inode;
	int error;

	if (nd->flags & LOOKUP_DIRECTORY)
		open_flag |= O_DIRECTORY;

	file->__f_path.dentry = DENTRY_NOT_SET;
	file->__f_path.mnt = nd->path.mnt;
	error = dir->i_op->atomic_open(dir, dentry, file,
				       open_to_namei_flags(open_flag), mode);
	d_lookup_done(dentry);
	if (!error) {
		if (file->f_mode & FMODE_OPENED) {
			if (unlikely(dentry != file->f_path.dentry)) {
				dput(dentry);
				dentry = dget(file->f_path.dentry);
			}
		} else if (WARN_ON(file->f_path.dentry == DENTRY_NOT_SET)) {
			error = -EIO;
		} else {
			if (file->f_path.dentry) {
				dput(dentry);
				dentry = file->f_path.dentry;
			}
			if (unlikely(d_is_negative(dentry)))
				error = -ENOENT;
		}
	}
	if (error) {
		dput(dentry);
		dentry = ERR_PTR(error);
	}
	return dentry;
}

/*
 * Look up and maybe create and open the last component.
 *
 * Must be called with parent locked (exclusive in O_CREAT case).
 *
 * Returns the dentry on success, that is, if
 *  the file was successfully atomically created (if necessary) and opened, or
 *  the file was not completely opened at this time, though lookups and
 *  creations were performed.
 * These cases are distinguished by the presence of FMODE_OPENED on
 * file->f_mode.  In the latter case the returned dentry may be negative
 * if O_CREAT hadn't been specified.
 *
 * An ERR_PTR() is returned on failure.
 */
static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
				  const struct open_flags *op,
				  bool got_write, struct delegated_inode *delegated_inode)
{
	struct mnt_idmap *idmap;
	struct dentry *dir = nd->path.dentry;
	struct inode *dir_inode = dir->d_inode;
	int open_flag = op->open_flag;
	struct dentry *dentry;
	int error, create_error = 0;
	umode_t mode = op->mode;
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

	if (unlikely(IS_DEADDIR(dir_inode)))
		return ERR_PTR(-ENOENT);

	file->f_mode &= ~FMODE_CREATED;
	dentry = d_lookup(dir, &nd->last);
	for (;;) {
		if (!dentry) {
			dentry = d_alloc_parallel(dir, &nd->last, &wq);
			if (IS_ERR(dentry))
				return dentry;
		}
		if (d_in_lookup(dentry))
			break;

		error = d_revalidate(dir_inode, &nd->last, dentry, nd->flags);
		if (likely(error > 0))
			break;
		if (error)
			goto out_dput;
		d_invalidate(dentry);
		dput(dentry);
		dentry = NULL;
	}
	if (dentry->d_inode) {
		/* Cached positive dentry: will open in f_op->open */
		return dentry;
	}

	if (open_flag & O_CREAT)
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);

	/*
	 * Checking write permission is tricky, because we don't know if we are
	 * going to actually need it: O_CREAT opens should work as long as the
	 * file exists.  But checking existence breaks atomicity.  The trick is
	 * to check access and if not granted clear O_CREAT from the flags.
	 *
	 * Another problem is returning the "right" error value (e.g. for an
	 * O_EXCL open we want to return EEXIST not EROFS).
	 */
	if (unlikely(!got_write))
		open_flag &= ~O_TRUNC;
	idmap = mnt_idmap(nd->path.mnt);
	if (open_flag & O_CREAT) {
		if (open_flag & O_EXCL)
			open_flag &= ~O_TRUNC;
		mode = vfs_prepare_mode(idmap, dir->d_inode, mode, mode, mode);
		if (likely(got_write))
			create_error = may_o_create(idmap, &nd->path,
						    dentry, mode);
		else
			create_error = -EROFS;
	}
	if (create_error)
		open_flag &= ~O_CREAT;
	if (dir_inode->i_op->atomic_open) {
		dentry = atomic_open(nd, dentry, file, open_flag, mode);
		if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
			dentry = ERR_PTR(create_error);
		return dentry;
	}

	if (d_in_lookup(dentry)) {
		struct dentry *res = dir_inode->i_op->lookup(dir_inode, dentry,
							     nd->flags);
		d_lookup_done(dentry);
		if (unlikely(res)) {
			if (IS_ERR(res)) {
				error = PTR_ERR(res);
				goto out_dput;
			}
			dput(dentry);
			dentry = res;
		}
	}

	/* Negative dentry, just create the file */
	if (!dentry->d_inode && (open_flag & O_CREAT)) {
		/* but break the directory lease first! */
		error = try_break_deleg(dir_inode, delegated_inode);
		if (error)
			goto out_dput;

		file->f_mode |= FMODE_CREATED;
		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
		if (!dir_inode->i_op->create) {
			error = -EACCES;
			goto out_dput;
		}

		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
						mode, open_flag & O_EXCL);
		if (error)
			goto out_dput;
	}
	if (unlikely(create_error) && !dentry->d_inode) {
		error = create_error;
		goto out_dput;
	}
	return dentry;

out_dput:
	dput(dentry);
	return ERR_PTR(error);
}

static inline bool trailing_slashes(struct nameidata *nd)
{
	return (bool)nd->last.name[nd->last.len];
}

static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
{
	struct dentry *dentry;

	if (open_flag & O_CREAT) {
		if (trailing_slashes(nd))
			return ERR_PTR(-EISDIR);

		/* Don't bother on an O_EXCL create */
		if (open_flag & O_EXCL)
			return NULL;
	}

	if (trailing_slashes(nd))
		nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;

	dentry = lookup_fast(nd);
	if (IS_ERR_OR_NULL(dentry))
		return dentry;

	if (open_flag & O_CREAT) {
		/* Discard negative dentries. Need inode_lock to do the create */
		if (!dentry->d_inode) {
			if (!(nd->flags & LOOKUP_RCU))
				dput(dentry);
			dentry = NULL;
		}
	}
	return dentry;
}

static const char *open_last_lookups(struct nameidata *nd,
		   struct file *file, const struct open_flags *op)
{
	struct delegated_inode delegated_inode = { };
	struct dentry *dir = nd->path.dentry;
	int open_flag = op->open_flag;
	bool got_write = false;
	struct dentry *dentry;
	const char *res;

	nd->flags |= op->intent;

	if (nd->last_type != LAST_NORM) {
		if (nd->depth)
			put_link(nd);
		return handle_dots(nd, nd->last_type);
	}

	/* We _can_ be in RCU mode here */
	dentry = lookup_fast_for_open(nd, open_flag);
	if (IS_ERR(dentry))
		return ERR_CAST(dentry);

	if (likely(dentry))
		goto finish_lookup;

	if (!(open_flag & O_CREAT)) {
		if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
			return ERR_PTR(-ECHILD);
	} else {
		if (nd->flags & LOOKUP_RCU) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
		}
	}
retry:
	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
		got_write = !mnt_want_write(nd->path.mnt);
		/*
		 * do _not_ fail yet - we might not need that or fail with
		 * a different error; let lookup_open() decide; we'll be
		 * dropping this one anyway.
		 */
	}
	if (open_flag & O_CREAT)
		inode_lock(dir->d_inode);
	else
		inode_lock_shared(dir->d_inode);
	dentry = lookup_open(nd, file, op, got_write, &delegated_inode);
	if (!IS_ERR(dentry)) {
		if (file->f_mode & FMODE_CREATED)
			fsnotify_create(dir->d_inode, dentry);
		if (file->f_mode & FMODE_OPENED)
			fsnotify_open(file);
	}
	if (open_flag & O_CREAT)
		inode_unlock(dir->d_inode);
	else
		inode_unlock_shared(dir->d_inode);

	if (got_write)
		mnt_drop_write(nd->path.mnt);

	if (IS_ERR(dentry)) {
		if (is_delegated(&delegated_inode)) {
			int error = break_deleg_wait(&delegated_inode);

			if (!error)
				goto retry;
			return ERR_PTR(error);
		}
		return ERR_CAST(dentry);
	}

	if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
		dput(nd->path.dentry);
		nd->path.dentry = dentry;
		return NULL;
	}

finish_lookup:
	if (nd->depth)
		put_link(nd);
	res = step_into(nd, WALK_TRAILING, dentry);
	if (unlikely(res))
		nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
	return res;
}

/*
 * Handle the last step of open()
 */
static int do_open(struct nameidata *nd,
		   struct file *file, const struct open_flags *op)
{
	struct mnt_idmap *idmap;
	int open_flag = op->open_flag;
	bool do_truncate;
	int acc_mode;
	int error;

	if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
		error = complete_walk(nd);
		if (error)
			return error;
	}
	if (!(file->f_mode & FMODE_CREATED))
		audit_inode(nd->name, nd->path.dentry, 0);
	idmap = mnt_idmap(nd->path.mnt);
	if (open_flag & O_CREAT) {
		if ((open_flag & O_EXCL) && !(file->f_mode & FMODE_CREATED))
			return -EEXIST;
		if (d_is_dir(nd->path.dentry))
			return -EISDIR;
		error = may_create_in_sticky(idmap, nd,
					     d_backing_inode(nd->path.dentry));
		if (unlikely(error))
			return error;
	}
	if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
		return -ENOTDIR;

	do_truncate = false;
	acc_mode = op->acc_mode;
	if (file->f_mode & FMODE_CREATED) {
		/* Don't check for write permission, don't truncate */
		open_flag &= ~O_TRUNC;
		acc_mode = 0;
	} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
		error = mnt_want_write(nd->path.mnt);
		if (error)
			return error;
		do_truncate = true;
	}
	error = may_open(idmap, &nd->path, acc_mode, open_flag);
	if (!error && !(file->f_mode & FMODE_OPENED))
		error = vfs_open(&nd->path, file);
	if (!error)
		error = security_file_post_open(file, op->acc_mode);
	if (!error && do_truncate)
		error = handle_truncate(idmap, file);
	if (unlikely(error > 0)) {
		WARN_ON(1);
		error = -EINVAL;
	}
	if (do_truncate)
		mnt_drop_write(nd->path.mnt);
	return error;
}

/**
 * vfs_tmpfile - create tmpfile
 * @idmap:	idmap of the mount the inode was found from
 * @parentpath:	pointer to the path of the base directory
 * @file:	file descriptor of the new tmpfile
 * @mode:	mode of the new tmpfile
 *
 * Create a temporary file.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int vfs_tmpfile(struct mnt_idmap *idmap,
		const struct path *parentpath,
		struct file *file, umode_t mode)
{
	struct dentry *child;
	struct inode *dir = d_inode(parentpath->dentry);
	struct inode *inode;
	int error;
	int open_flag = file->f_flags;

	/* we want directory to be writable */
	error = inode_permission(idmap, dir, MAY_WRITE | MAY_EXEC);
	if (error)
		return error;
	if (!dir->i_op->tmpfile)
		return -EOPNOTSUPP;
	child = d_alloc(parentpath->dentry, &slash_name);
	if (unlikely(!child))
		return -ENOMEM;
	file->__f_path.mnt = parentpath->mnt;
	file->__f_path.dentry = child;
	mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
	error = dir->i_op->tmpfile(idmap, dir, file, mode);
	dput(child);
	if (file->f_mode & FMODE_OPENED)
		fsnotify_open(file);
	if (error)
		return error;
	/* Don't check for other permissions, the inode was just created */
	error = may_open(idmap, &file->f_path, 0, file->f_flags);
	if (error)
		return error;
	inode = file_inode(file);
	if (!(open_flag & O_EXCL)) {
		spin_lock(&inode->i_lock);
		inode_state_set(inode, I_LINKABLE);
		spin_unlock(&inode->i_lock);
	}
	security_inode_post_create_tmpfile(idmap, inode);
	return 0;
}

/**
 * kernel_tmpfile_open - open a tmpfile for kernel internal use
 * @idmap:	idmap of the mount the inode was found from
 * @parentpath:	path of the base directory
 * @mode:	mode of the new tmpfile
 * @open_flag:	flags
 * @cred:	credentials for open
 *
 * Create and open a temporary file.  The file is not accounted in nr_files,
 * hence this is only for kernel internal use, and must not be installed into
 * file tables or such.
 */
struct file *kernel_tmpfile_open(struct mnt_idmap *idmap,
				 const struct path *parentpath,
				 umode_t mode, int open_flag,
				 const struct cred *cred)
{
	struct file *file;
	int error;

	file = alloc_empty_file_noaccount(open_flag, cred);
	if (IS_ERR(file))
		return file;

	error = vfs_tmpfile(idmap, parentpath, file, mode);
	if (error) {
		fput(file);
		file = ERR_PTR(error);
	}
	return file;
}
EXPORT_SYMBOL(kernel_tmpfile_open);
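
/*
 * Illustrative sketch (hypothetical caller): a kernel subsystem holding
 * a path to a directory can create an anonymous temporary file in it.
 * Passing O_EXCL keeps vfs_tmpfile() from marking the inode I_LINKABLE,
 * so the file can never be linked into the namespace afterwards.
 */
#if 0
static struct file *example_tmpfile(const struct path *dir)
{
	return kernel_tmpfile_open(mnt_idmap(dir->mnt), dir, 0600,
				   O_RDWR | O_EXCL, current_cred());
}
#endif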

static int do_tmpfile(struct nameidata *nd, unsigned flags,
		const struct open_flags *op,
		struct file *file)
{
	struct path path;
	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);

	if (unlikely(error))
		return error;
	error = mnt_want_write(path.mnt);
	if (unlikely(error))
		goto out;
	error = vfs_tmpfile(mnt_idmap(path.mnt), &path, file, op->mode);
	if (error)
		goto out2;
	audit_inode(nd->name, file->f_path.dentry, 0);
out2:
	mnt_drop_write(path.mnt);
out:
	path_put(&path);
	return error;
}

static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
{
	struct path path;
	int error = path_lookupat(nd, flags, &path);
	if (!error) {
		audit_inode(nd->name, path.dentry, 0);
		error = vfs_open(&path, file);
		path_put(&path);
	}
	return error;
}

static struct file *path_openat(struct nameidata *nd,
			const struct open_flags *op, unsigned flags)
{
	struct file *file;
	int error;

	file = alloc_empty_file(op->open_flag, current_cred());
	if (IS_ERR(file))
		return file;

	if (unlikely(file->f_flags & __O_TMPFILE)) {
		error = do_tmpfile(nd, flags, op, file);
	} else if (unlikely(file->f_flags & O_PATH)) {
		error = do_o_path(nd, flags, file);
	} else {
		const char *s = path_init(nd, flags);
		while (!(error = link_path_walk(s, nd)) &&
		       (s = open_last_lookups(nd, file, op)) != NULL)
			;
		if (!error)
			error = do_open(nd, file, op);
		terminate_walk(nd);
	}
	if (likely(!error)) {
		if (likely(file->f_mode & FMODE_OPENED))
			return file;
		WARN_ON(1);
		error = -EINVAL;
	}
	fput_close(file);
	if (error == -EOPENSTALE) {
		if (flags & LOOKUP_RCU)
			error = -ECHILD;
		else
			error = -ESTALE;
	}
	return ERR_PTR(error);
}

struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;
	int flags = op->lookup_flags;
	struct file *filp;

	set_nameidata(&nd, dfd, pathname, NULL);
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}

struct file *do_file_open_root(const struct path *root,
		const char *name, const struct open_flags *op)
{
	struct nameidata nd;
	struct file *file;
	struct filename *filename;
	int flags = op->lookup_flags;

	if (d_is_symlink(root->dentry) && op->intent & LOOKUP_OPEN)
		return ERR_PTR(-ELOOP);

	filename = getname_kernel(name);
	if (IS_ERR(filename))
		return ERR_CAST(filename);

	set_nameidata(&nd, -1, filename, root);
	file = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(file == ERR_PTR(-ECHILD)))
		file = path_openat(&nd, op, flags);
	if (unlikely(file == ERR_PTR(-ESTALE)))
		file = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	putname(filename);
	return file;
}

static struct dentry *filename_create(int dfd, struct filename *name,
				      struct path *path, unsigned int lookup_flags)
{
	struct dentry *dentry = ERR_PTR(-EEXIST);
	struct qstr last;
	bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
	unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
	unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
	int type;
	int error;

	error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
	if (error)
		return ERR_PTR(error);

	/*
	 * Yucky last component or no last component at all?
	 * (foo/., foo/.., /////)
	 */
	if (unlikely(type != LAST_NORM))
		goto out;

	/* don't fail immediately if it's r/o, at least try to report other errors */
	error = mnt_want_write(path->mnt);
	/*
	 * Do the final lookup.  Suppress 'create' if there is a trailing
	 * '/', and a directory wasn't requested.
	 */
	if (last.name[last.len] && !want_dir)
		create_flags &= ~LOOKUP_CREATE;
	dentry = start_dirop(path->dentry, &last, reval_flag | create_flags);
	if (IS_ERR(dentry))
		goto out_drop_write;

	if (unlikely(error))
		goto fail;

	return dentry;
fail:
	end_dirop(dentry);
	dentry = ERR_PTR(error);
out_drop_write:
	if (!error)
		mnt_drop_write(path->mnt);
out:
	path_put(path);
	return dentry;
}

struct dentry *start_creating_path(int dfd, const char *pathname,
				   struct path *path, unsigned int lookup_flags)
{
	struct filename *filename = getname_kernel(pathname);
	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);

	putname(filename);
	return res;
}
EXPORT_SYMBOL(start_creating_path);

/**
 * end_creating_path - finish a code section started by start_creating_path()
 * @path: the path instantiated by start_creating_path()
 * @dentry: the dentry returned by start_creating_path()
 *
 * end_creating_path() will release any locks taken by start_creating_path()
 * and drop any references that were taken.  It should only be called
 * if start_creating_path() returned a non-error.
 * If vfs_mkdir() was called and it returned an error, the ERR_PTR() dentry
 * it returned *should* be passed to end_creating_path() together with the
 * path.
 */
void end_creating_path(const struct path *path, struct dentry *dentry)
{
	end_creating(dentry);
	mnt_drop_write(path->mnt);
	path_put(path);
}
EXPORT_SYMBOL(end_creating_path);
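
/*
 * Illustrative sketch (hypothetical helper): start_creating_path() and
 * end_creating_path() bracket in-kernel object creation; here paired
 * with vfs_create() from above.  A complete caller would retry on a
 * delegation break, as do_mknodat() below does.
 */
#if 0
static int example_create_file(const char *pathname, umode_t mode)
{
	struct delegated_inode di = { };
	struct path path;
	struct dentry *dentry;
	int error;

	dentry = start_creating_path(AT_FDCWD, pathname, &path, 0);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);
	error = vfs_create(mnt_idmap(path.mnt), dentry, mode, &di);
	end_creating_path(&path, dentry);
	return error;
}
#endif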

inline struct dentry *start_creating_user_path(
	int dfd, const char __user *pathname,
	struct path *path, unsigned int lookup_flags)
{
	struct filename *filename = getname(pathname);
	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);

	putname(filename);
	return res;
}
EXPORT_SYMBOL(start_creating_user_path);


/**
 * vfs_mknod - create device node or file
 * @idmap:		idmap of the mount the inode was found from
 * @dir:		inode of the parent directory
 * @dentry:		dentry of the child device node
 * @mode:		mode of the child device node
 * @dev:		device number of device to create
 * @delegated_inode:	returns parent inode, if the inode is delegated.
 *
 * Create a device node or file.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
	      struct dentry *dentry, umode_t mode, dev_t dev,
	      struct delegated_inode *delegated_inode)
{
	bool is_whiteout = S_ISCHR(mode) && dev == WHITEOUT_DEV;
	int error = may_create(idmap, dir, dentry);

	if (error)
		return error;

	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
	    !capable(CAP_MKNOD))
		return -EPERM;

	if (!dir->i_op->mknod)
		return -EPERM;

	mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
	error = devcgroup_inode_mknod(mode, dev);
	if (error)
		return error;

	error = security_inode_mknod(dir, dentry, mode, dev);
	if (error)
		return error;

	error = try_break_deleg(dir, delegated_inode);
	if (error)
		return error;

	error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
	if (!error)
		fsnotify_create(dir, dentry);
	return error;
}
EXPORT_SYMBOL(vfs_mknod);

static int may_mknod(umode_t mode)
{
	switch (mode & S_IFMT) {
	case S_IFREG:
	case S_IFCHR:
	case S_IFBLK:
	case S_IFIFO:
	case S_IFSOCK:
	case 0: /* zero mode translates to S_IFREG */
		return 0;
	case S_IFDIR:
		return -EPERM;
	default:
		return -EINVAL;
	}
}

static int do_mknodat(int dfd, struct filename *name, umode_t mode,
		unsigned int dev)
{
	struct delegated_inode di = { };
	struct mnt_idmap *idmap;
	struct dentry *dentry;
	struct path path;
	int error;
	unsigned int lookup_flags = 0;

	error = may_mknod(mode);
	if (error)
		goto out1;
retry:
	dentry = filename_create(dfd, name, &path, lookup_flags);
	error = PTR_ERR(dentry);
	if (IS_ERR(dentry))
		goto out1;

	error = security_path_mknod(&path, dentry,
			mode_strip_umask(path.dentry->d_inode, mode), dev);
	if (error)
		goto out2;

	idmap = mnt_idmap(path.mnt);
	switch (mode & S_IFMT) {
		case 0: case S_IFREG:
			error = vfs_create(idmap, dentry, mode, &di);
			if (!error)
				security_path_post_mknod(idmap, dentry);
			break;
		case S_IFCHR: case S_IFBLK:
			error = vfs_mknod(idmap, path.dentry->d_inode,
					  dentry, mode, new_decode_dev(dev), &di);
			break;
		case S_IFIFO: case S_IFSOCK:
			error = vfs_mknod(idmap, path.dentry->d_inode,
					  dentry, mode, 0, &di);
			break;
	}
out2:
	end_creating_path(&path, dentry);
	if (is_delegated(&di)) {
		error = break_deleg_wait(&di);
		if (!error)
			goto retry;
	}
	if (retry_estale(error, lookup_flags)) {
		lookup_flags |= LOOKUP_REVAL;
		goto retry;
	}
out1:
	putname(name);
	return error;
}

SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
		unsigned int, dev)
{
	return do_mknodat(dfd, getname(filename), mode, dev);
}

SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, dev)
{
	return do_mknodat(AT_FDCWD, getname(filename), mode, dev);
}

/**
 * vfs_mkdir - create directory returning correct dentry if possible
 * @idmap:		idmap of the mount the inode was found from
 * @dir:		inode of the parent directory
 * @dentry:		dentry of the child directory
 * @mode:		mode of the child directory
 * @delegated_inode:	returns parent inode, if the inode is delegated.
 *
 * Create a directory.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 *
 * In the event that the filesystem does not use the *@dentry but leaves it
 * negative or unhashes it and possibly splices a different one returning it,
 * the original dentry is dput() and the alternate is returned.
 *
 * In case of an error the dentry is dput() and an ERR_PTR() is returned.
 */
struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
			 struct dentry *dentry, umode_t mode,
			 struct delegated_inode *delegated_inode)
{
	int error;
	unsigned max_links = dir->i_sb->s_max_links;
	struct dentry *de;

	error = may_create(idmap, dir, dentry);
	if (error)
		goto err;

	error = -EPERM;
	if (!dir->i_op->mkdir)
		goto err;

	mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
	error = security_inode_mkdir(dir, dentry, mode);
	if (error)
		goto err;

	error = -EMLINK;
	if (max_links && dir->i_nlink >= max_links)
		goto err;

	error = try_break_deleg(dir, delegated_inode);
	if (error)
		goto err;

	de = dir->i_op->mkdir(idmap, dir, dentry, mode);
	error = PTR_ERR(de);
	if (IS_ERR(de))
		goto err;
	if (de) {
		dput(dentry);
		dentry = de;
	}
	fsnotify_mkdir(dir, dentry);
	return dentry;

err:
	end_creating(dentry);
	return ERR_PTR(error);
}
EXPORT_SYMBOL(vfs_mkdir);

int do_mkdirat(int dfd, struct filename *name, umode_t mode)
{
	struct dentry *dentry;
	struct path path;
	int error;
	unsigned int lookup_flags = LOOKUP_DIRECTORY;
	struct delegated_inode delegated_inode = { };

retry:
	dentry = filename_create(dfd, name, &path, lookup_flags);
	error = PTR_ERR(dentry);
	if (IS_ERR(dentry))
		goto out_putname;

	error = security_path_mkdir(&path, dentry,
			mode_strip_umask(path.dentry->d_inode, mode));
	if (!error) {
		dentry = vfs_mkdir(mnt_idmap(path.mnt), path.dentry->d_inode,
				   dentry, mode, &delegated_inode);
		if (IS_ERR(dentry))
			error = PTR_ERR(dentry);
	}
	end_creating_path(&path, dentry);
	if (is_delegated(&delegated_inode)) {
		error = break_deleg_wait(&delegated_inode);
		if (!error)
			goto retry;
	}
	if (retry_estale(error, lookup_flags)) {
		lookup_flags |= LOOKUP_REVAL;
		goto retry;
	}
out_putname:
	putname(name);
	return error;
}

SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
{
	return do_mkdirat(dfd, getname(pathname), mode);
}

SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
{
	return do_mkdirat(AT_FDCWD, getname(pathname), mode);
}

/**
 * vfs_rmdir - remove directory
 * @idmap:		idmap of the mount the inode was found from
 * @dir:		inode of the parent directory
 * @dentry:		dentry of the child directory
 * @delegated_inode:	returns parent inode, if it's delegated.
 *
 * Remove a directory.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
	      struct dentry *dentry, struct delegated_inode *delegated_inode)
{
	int error = may_delete(idmap, dir, dentry, 1);

	if (error)
		return error;

	if (!dir->i_op->rmdir)
		return -EPERM;

	dget(dentry);
	inode_lock(dentry->d_inode);

	error = -EBUSY;
	if (is_local_mountpoint(dentry) ||
	    (dentry->d_inode->i_flags & S_KERNEL_FILE))
		goto out;

	error = security_inode_rmdir(dir, dentry);
	if (error)
		goto out;

	error = try_break_deleg(dir, delegated_inode);
	if (error)
		goto out;

	error = dir->i_op->rmdir(dir, dentry);
	if (error)
		goto out;

	shrink_dcache_parent(dentry);
	dentry->d_inode->i_flags |= S_DEAD;
	dont_mount(dentry);
	detach_mounts(dentry);

out:
	inode_unlock(dentry->d_inode);
	dput(dentry);
	if (!error)
		d_delete_notify(dir, dentry);
	return error;
}
EXPORT_SYMBOL(vfs_rmdir);

int do_rmdir(int dfd, struct filename *name)
{
	int error;
	struct dentry *dentry;
	struct path path;
	struct qstr last;
	int type;
	unsigned int lookup_flags = 0;
	struct delegated_inode delegated_inode = { };
retry:
	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
	if (error)
		goto exit1;

	switch (type) {
	case LAST_DOTDOT:
		error = -ENOTEMPTY;
		goto exit2;
	case LAST_DOT:
		error = -EINVAL;
		goto exit2;
	case LAST_ROOT:
		error = -EBUSY;
		goto exit2;
	}

	error = mnt_want_write(path.mnt);
	if (error)
		goto exit2;

	dentry = start_dirop(path.dentry, &last, lookup_flags);
	error = PTR_ERR(dentry);
	if (IS_ERR(dentry))
		goto exit3;
	error = security_path_rmdir(&path, dentry);
	if (error)
		goto exit4;
	error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode,
			  dentry, &delegated_inode);
exit4:
	end_dirop(dentry);
exit3:
	mnt_drop_write(path.mnt);
exit2:
	path_put(&path);
	if (is_delegated(&delegated_inode)) {
		error = break_deleg_wait(&delegated_inode);
		if (!error)
			goto retry;
	}
	if (retry_estale(error, lookup_flags)) {
		lookup_flags |= LOOKUP_REVAL;
		goto retry;
	}
exit1:
	putname(name);
	return error;
}

SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
{
	return do_rmdir(AT_FDCWD, getname(pathname));
}

/**
 * vfs_unlink - unlink a filesystem object
 * @idmap:	idmap of the mount the inode was found from
 * @dir:	parent directory
 * @dentry:	victim
 * @delegated_inode: returns victim inode, if the inode is delegated.
 *
 * The caller must hold dir->i_rwsem exclusively.
 *
 * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
 * return a reference to the inode in delegated_inode.  The caller
 * should then break the delegation on that inode and retry.  Because
 * breaking a delegation may take a long time, the caller should drop
 * dir->i_rwsem before doing so.
 *
 * Alternatively, a caller may pass NULL for delegated_inode.  This may
 * be appropriate for callers that expect the underlying filesystem not
 * to be NFS exported.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
	       struct dentry *dentry, struct delegated_inode *delegated_inode)
{
	struct inode *target = dentry->d_inode;
	int error = may_delete(idmap, dir, dentry, 0);

	if (error)
		return error;

	if (!dir->i_op->unlink)
		return -EPERM;

	inode_lock(target);
	if (IS_SWAPFILE(target))
		error = -EPERM;
	else if (is_local_mountpoint(dentry))
		error = -EBUSY;
	else {
		error = security_inode_unlink(dir, dentry);
		if (!error) {
			error = try_break_deleg(dir, delegated_inode);
			if (error)
				goto out;
			error = try_break_deleg(target, delegated_inode);
			if (error)
				goto out;
			error = dir->i_op->unlink(dir, dentry);
			if (!error) {
				dont_mount(dentry);
				detach_mounts(dentry);
			}
		}
	}
out:
	inode_unlock(target);

	/* We don't d_delete() NFS sillyrenamed files--they still exist. */
	if (!error && dentry->d_flags & DCACHE_NFSFS_RENAMED) {
		fsnotify_unlink(dir, dentry);
	} else if (!error) {
		fsnotify_link_count(target);
		d_delete_notify(dir, dentry);
	}

	return error;
}
EXPORT_SYMBOL(vfs_unlink);

/*
 * Make sure that the actual truncation of the file will occur outside its
 * directory's i_rwsem.  Truncate can take a long time if there is a lot of
 * writeout happening, and we don't want to prevent access to the directory
 * while waiting on the I/O.
 */
int do_unlinkat(int dfd, struct filename *name)
{
	int error;
	struct dentry *dentry;
	struct path path;
	struct qstr last;
	int type;
	struct inode *inode;
	struct delegated_inode delegated_inode = { };
	unsigned int lookup_flags = 0;
retry:
	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
	if (error)
		goto exit_putname;

	error = -EISDIR;
	if (type != LAST_NORM)
		goto exit_path_put;

	error = mnt_want_write(path.mnt);
	if (error)
		goto exit_path_put;
retry_deleg:
	dentry = start_dirop(path.dentry, &last, lookup_flags);
	error = PTR_ERR(dentry);
	if (IS_ERR(dentry))
		goto exit_drop_write;

	/* Why not before? Because we want correct error value */
	if (unlikely(last.name[last.len])) {
		if (d_is_dir(dentry))
			error = -EISDIR;
		else
			error = -ENOTDIR;
		end_dirop(dentry);
		goto exit_drop_write;
	}
	inode = dentry->d_inode;
	ihold(inode);
	error = security_path_unlink(&path, dentry);
	if (error)
		goto exit_end_dirop;
	error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
			   dentry, &delegated_inode);
exit_end_dirop:
	end_dirop(dentry);
	iput(inode);	/* truncate the inode here */
	if (is_delegated(&delegated_inode)) {
		error = break_deleg_wait(&delegated_inode);
		if (!error)
			goto retry_deleg;
	}
exit_drop_write:
	mnt_drop_write(path.mnt);
exit_path_put:
	path_put(&path);
	if (retry_estale(error, lookup_flags)) {
		lookup_flags |= LOOKUP_REVAL;
		goto retry;
	}
exit_putname:
	putname(name);
	return error;
}

SYSCALL_DEFINE3(unlinkat, int, dfd, const char __user *, pathname, int, flag)
{
	if ((flag & ~AT_REMOVEDIR) != 0)
		return -EINVAL;

	if (flag & AT_REMOVEDIR)
		return do_rmdir(dfd, getname(pathname));
	return do_unlinkat(dfd, getname(pathname));
}

SYSCALL_DEFINE1(unlink, const char __user *, pathname)
{
	return do_unlinkat(AT_FDCWD, getname(pathname));
}

/**
 * vfs_symlink - create symlink
 * @idmap:	idmap of the mount the inode was found from
 * @dir:	inode of the parent directory
 * @dentry:	dentry of the child symlink file
 * @oldname:	name of the file to link to
 * @delegated_inode: returns parent inode, if the inode is delegated.
 *
 * Create a symlink.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
		struct dentry *dentry, const char *oldname,
		struct delegated_inode *delegated_inode)
{
	int error;

	error = may_create(idmap, dir, dentry);
	if (error)
		return error;

	if (!dir->i_op->symlink)
		return -EPERM;

	error = security_inode_symlink(dir, dentry, oldname);
	if (error)
		return error;

	error = try_break_deleg(dir, delegated_inode);
	if (error)
		return error;

	error = dir->i_op->symlink(idmap, dir, dentry, oldname);
	if (!error)
		fsnotify_create(dir, dentry);
	return error;
}
EXPORT_SYMBOL(vfs_symlink);

int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
{
	int error;
	struct dentry *dentry;
	struct path path;
	unsigned int lookup_flags = 0;
	struct delegated_inode delegated_inode = { };

	if (IS_ERR(from)) {
		error = PTR_ERR(from);
		goto out_putnames;
	}
retry:
	dentry = filename_create(newdfd, to, &path, lookup_flags);
	error = PTR_ERR(dentry);
	if (IS_ERR(dentry))
		goto out_putnames;

	error = security_path_symlink(&path, dentry, from->name);
	if (!error)
		error = vfs_symlink(mnt_idmap(path.mnt), path.dentry->d_inode,
				    dentry, from->name, &delegated_inode);
	end_creating_path(&path, dentry);
	if (is_delegated(&delegated_inode)) {
		error = break_deleg_wait(&delegated_inode);
		if (!error)
			goto retry;
	}
	if (retry_estale(error, lookup_flags)) {
		lookup_flags |= LOOKUP_REVAL;
		goto retry;
	}
out_putnames:
	putname(to);
	putname(from);
	return error;
}

SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
		int, newdfd, const char __user *, newname)
{
	return do_symlinkat(getname(oldname), newdfd, getname(newname));
}

SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newname)
{
	return do_symlinkat(getname(oldname), AT_FDCWD, getname(newname));
}

/**
 * vfs_link - create a new link
 * @old_dentry:	object to be linked
 * @idmap:	idmap of the mount
 * @dir:	new parent
 * @new_dentry:	where to create the new link
 * @delegated_inode: returns inode needing a delegation break
 *
 * The caller must hold dir->i_rwsem exclusively.
 *
 * If vfs_link discovers a delegation on the to-be-linked file in need
 * of breaking, it will return -EWOULDBLOCK and return a reference to the
 * inode in delegated_inode.  The caller should then break the delegation
 * and retry.  Because breaking a delegation may take a long time, the
 * caller should drop the i_rwsem before doing so.
 *
 * Alternatively, a caller may pass NULL for delegated_inode.  This may
 * be appropriate for callers that expect the underlying filesystem not
 * to be NFS exported.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
	     struct inode *dir, struct dentry *new_dentry,
	     struct delegated_inode *delegated_inode)
{
	struct inode *inode = old_dentry->d_inode;
	unsigned max_links = dir->i_sb->s_max_links;
	int error;

	if (!inode)
		return -ENOENT;

	error = may_create(idmap, dir, new_dentry);
	if (error)
		return error;

	if (dir->i_sb != inode->i_sb)
		return -EXDEV;

	/*
	 * A link to an append-only or immutable file cannot be created.
	 */
	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
		return -EPERM;
	/*
	 * Updating the link count will likely cause i_uid and i_gid to
	 * be written back improperly if their true value is unknown to
	 * the vfs.
	 */
	if (HAS_UNMAPPED_ID(idmap, inode))
		return -EPERM;
	if (!dir->i_op->link)
		return -EPERM;
	if (S_ISDIR(inode->i_mode))
		return -EPERM;

	error = security_inode_link(old_dentry, dir, new_dentry);
	if (error)
		return error;

	inode_lock(inode);
	/* Make sure we don't allow creating a hardlink to an unlinked file */
	if (inode->i_nlink == 0 && !(inode_state_read_once(inode) & I_LINKABLE))
		error = -ENOENT;
	else if (max_links && inode->i_nlink >= max_links)
		error = -EMLINK;
	else {
		error = try_break_deleg(dir, delegated_inode);
		if (!error)
			error = try_break_deleg(inode, delegated_inode);
		if (!error)
			error = dir->i_op->link(old_dentry, dir, new_dentry);
	}

	if (!error && (inode_state_read_once(inode) & I_LINKABLE)) {
		spin_lock(&inode->i_lock);
		inode_state_clear(inode, I_LINKABLE);
		spin_unlock(&inode->i_lock);
	}
	inode_unlock(inode);
	if (!error)
		fsnotify_link(dir, inode, new_dentry);
	return error;
}
EXPORT_SYMBOL(vfs_link);

/*
 * Hardlinks are often used in delicate situations.  We avoid
 * security-related surprises by not following symlinks on the
 * newname.  --KAB
 *
 * We don't follow them on the oldname either to be compatible
 * with linux 2.0, and to avoid hard-linking to directories
 * and other special files.  --ADM
 */
int do_linkat(int olddfd, struct filename *old, int newdfd,
	      struct filename *new, int flags)
{
	struct mnt_idmap *idmap;
	struct dentry *new_dentry;
	struct path old_path, new_path;
	struct delegated_inode delegated_inode = { };
	int how = 0;
	int error;

	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
		error = -EINVAL;
		goto out_putnames;
	}
	/*
	 * To use null names we require CAP_DAC_READ_SEARCH or
	 * that the open-time creds of the dfd matches current.
	 * This ensures that not everyone will be able to create
	 * a hardlink using the passed file descriptor.
	 */
	if (flags & AT_EMPTY_PATH)
		how |= LOOKUP_LINKAT_EMPTY;

	if (flags & AT_SYMLINK_FOLLOW)
		how |= LOOKUP_FOLLOW;
retry:
	error = filename_lookup(olddfd, old, how, &old_path, NULL);
	if (error)
		goto out_putnames;

	new_dentry = filename_create(newdfd, new, &new_path,
					(how & LOOKUP_REVAL));
	error = PTR_ERR(new_dentry);
	if (IS_ERR(new_dentry))
		goto out_putpath;

	error = -EXDEV;
	if (old_path.mnt != new_path.mnt)
		goto out_dput;
	idmap = mnt_idmap(new_path.mnt);
	error = may_linkat(idmap, &old_path);
	if (unlikely(error))
		goto out_dput;
	error = security_path_link(old_path.dentry, &new_path, new_dentry);
	if (error)
		goto out_dput;
	error = vfs_link(old_path.dentry, idmap, new_path.dentry->d_inode,
			 new_dentry, &delegated_inode);
out_dput:
	end_creating_path(&new_path, new_dentry);
	if (is_delegated(&delegated_inode)) {
		error = break_deleg_wait(&delegated_inode);
		if (!error) {
			path_put(&old_path);
			goto retry;
		}
	}
	if (retry_estale(error, how)) {
		path_put(&old_path);
		how |= LOOKUP_REVAL;
		goto retry;
	}
out_putpath:
	path_put(&old_path);
out_putnames:
	putname(old);
	putname(new);

	return error;
}

SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
		int, newdfd, const char __user *, newname, int, flags)
{
	return do_linkat(olddfd, getname_uflags(oldname, flags),
		newdfd, getname(newname), flags);
}

SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname)
{
	return do_linkat(AT_FDCWD, getname(oldname), AT_FDCWD, getname(newname), 0);
}

/**
 * vfs_rename - rename a filesystem object
 * @rd:		pointer to &struct renamedata info
 *
 * The caller must hold multiple mutexes--see lock_rename().
 *
 * If vfs_rename discovers a delegation in need of breaking at either
 * the source or destination, it will return -EWOULDBLOCK and return a
 * reference to the inode in delegated_inode.  The caller should then
 * break the delegation and retry.  Because breaking a delegation may
 * take a long time, the caller should drop all locks before doing
 * so.
 *
 * Alternatively, a caller may pass NULL for delegated_inode.  This may
 * be appropriate for callers that expect the underlying filesystem not
 * to be NFS exported.
 *
 * The worst of all namespace operations - renaming directory. "Perverted"
 * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
 * Problems:
 *
 *	a) we can get into loop creation.
 *	b) race potential - two innocent renames can create a loop together.
 *	   That's where 4.4BSD screws up. Current fix: serialization on
 *	   sb->s_vfs_rename_mutex. We might be more accurate, but that's another
 *	   story.
 *	c) we may have to lock up to _four_ objects - parents and victim (if it exists),
 *	   and source (if it's a non-directory or a subdirectory that moves to
 *	   different parent).
 *	   And that - after we got ->i_rwsem on parents (until then we don't know
 *	   whether the target exists).  Solution: try to be smart with locking
 *	   order for inodes.  We rely on the fact that tree topology may change
 *	   only under ->s_vfs_rename_mutex _and_ that parent of the object we
 *	   move will be locked.  Thus we can rank directories by the tree
 *	   (ancestors first) and rank all non-directories after them.
 *	   That works since everybody except rename does "lock parent, lookup,
 *	   lock child" and rename is under ->s_vfs_rename_mutex.
 *	   HOWEVER, it relies on the assumption that any object with ->lookup()
 *	   has no more than 1 dentry.  If "hybrid" objects will ever appear,
 *	   we'd better make sure that there's no link(2) for them.
 *	d) conversion from fhandle to dentry may come in the wrong moment - when
 *	   we are removing the target. Solution: we will have to grab ->i_rwsem
 *	   in the fhandle_to_dentry code. [FIXME - current nfsfh.c relies on
 *	   ->i_rwsem on parents, which works but leads to some truly excessive
 *	   locking].
 */
int vfs_rename(struct renamedata *rd)
{
	int error;
	struct inode *old_dir = d_inode(rd->old_parent);
	struct inode *new_dir = d_inode(rd->new_parent);
	struct dentry *old_dentry = rd->old_dentry;
	struct dentry *new_dentry = rd->new_dentry;
	struct delegated_inode *delegated_inode = rd->delegated_inode;
	unsigned int flags = rd->flags;
	bool is_dir = d_is_dir(old_dentry);
	struct inode *source = old_dentry->d_inode;
	struct inode *target = new_dentry->d_inode;
	bool new_is_dir = false;
	unsigned max_links = new_dir->i_sb->s_max_links;
	struct name_snapshot old_name;
	bool lock_old_subdir, lock_new_subdir;

	if (source == target)
		return 0;

	error = may_delete(rd->mnt_idmap, old_dir, old_dentry, is_dir);
	if (error)
		return error;

	if (!target) {
		error = may_create(rd->mnt_idmap, new_dir, new_dentry);
	} else {
		new_is_dir = d_is_dir(new_dentry);

		if (!(flags & RENAME_EXCHANGE))
			error = may_delete(rd->mnt_idmap, new_dir,
					   new_dentry, is_dir);
		else
			error = may_delete(rd->mnt_idmap, new_dir,
					   new_dentry, new_is_dir);
	}
	if (error)
		return error;

	if (!old_dir->i_op->rename)
		return -EPERM;

	/*
	 * If we are going to change the parent - check write permissions,
	 * we'll need to flip '..'.
	 */
	if (new_dir != old_dir) {
		if (is_dir) {
			error = inode_permission(rd->mnt_idmap, source,
						 MAY_WRITE);
			if (error)
				return error;
		}
		if ((flags & RENAME_EXCHANGE) && new_is_dir) {
			error = inode_permission(rd->mnt_idmap, target,
						 MAY_WRITE);
			if (error)
				return error;
		}
	}

	error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry,
				      flags);
	if (error)
		return error;

	take_dentry_name_snapshot(&old_name, old_dentry);
	dget(new_dentry);
	/*
	 * Lock children.
	 * The source subdirectory needs to be locked on cross-directory
	 * rename or cross-directory exchange since its parent changes.
	 * The target subdirectory needs to be locked on cross-directory
	 * exchange due to parent change and on any rename due to becoming
	 * a victim.
	 * Non-directories need locking in all cases (for NFS reasons);
	 * they get locked after any subdirectories (in inode address order).
	 *
	 * NOTE: WE ONLY LOCK UNRELATED DIRECTORIES IN CROSS-DIRECTORY CASE.
	 * NEVER, EVER DO THAT WITHOUT ->s_vfs_rename_mutex.
	 */
	lock_old_subdir = new_dir != old_dir;
	lock_new_subdir = new_dir != old_dir || !(flags & RENAME_EXCHANGE);
	if (is_dir) {
		if (lock_old_subdir)
			inode_lock_nested(source, I_MUTEX_CHILD);
		if (target && (!new_is_dir || lock_new_subdir))
			inode_lock(target);
	} else if (new_is_dir) {
		if (lock_new_subdir)
			inode_lock_nested(target, I_MUTEX_CHILD);
		inode_lock(source);
	} else {
		lock_two_nondirectories(source, target);
	}

	error = -EPERM;
	if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
		goto out;

	error = -EBUSY;
	if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
		goto out;

	if (max_links && new_dir != old_dir) {
		error = -EMLINK;
		if (is_dir && !new_is_dir && new_dir->i_nlink >= max_links)
			goto out;
		if ((flags & RENAME_EXCHANGE) && !is_dir && new_is_dir &&
		    old_dir->i_nlink >= max_links)
			goto out;
	}
	error = try_break_deleg(old_dir, delegated_inode);
	if (error)
		goto out;
	if (new_dir != old_dir) {
		error = try_break_deleg(new_dir, delegated_inode);
		if (error)
			goto out;
	}
	if (!is_dir) {
		error = try_break_deleg(source, delegated_inode);
		if (error)
			goto out;
	}
	if (target && !new_is_dir) {
		error = try_break_deleg(target, delegated_inode);
		if (error)
			goto out;
	}
	error = old_dir->i_op->rename(rd->mnt_idmap, old_dir, old_dentry,
				      new_dir, new_dentry, flags);
	if (error)
		goto out;

	if (!(flags & RENAME_EXCHANGE) && target) {
		if (is_dir) {
			shrink_dcache_parent(new_dentry);
			target->i_flags |= S_DEAD;
		}
		dont_mount(new_dentry);
		detach_mounts(new_dentry);
	}
	if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE)) {
		if (!(flags & RENAME_EXCHANGE))
			d_move(old_dentry, new_dentry);
		else
			d_exchange(old_dentry, new_dentry);
	}
out:
	if (!is_dir || lock_old_subdir)
		inode_unlock(source);
	if (target && (!new_is_dir || lock_new_subdir))
		inode_unlock(target);
	dput(new_dentry);
	if (!error) {
		fsnotify_move(old_dir, new_dir, &old_name.name, is_dir,
			      !(flags & RENAME_EXCHANGE) ? target : NULL, old_dentry);
		if (flags & RENAME_EXCHANGE) {
			fsnotify_move(new_dir, old_dir, &old_dentry->d_name,
				      new_is_dir, NULL, new_dentry);
		}
	}
	release_dentry_name_snapshot(&old_name);

	return error;
}
EXPORT_SYMBOL(vfs_rename);

int do_renameat2(int olddfd, struct filename *from, int newdfd,
		 struct filename *to, unsigned int flags)
{
	struct renamedata rd;
	struct path old_path, new_path;
	struct qstr old_last, new_last;
	int old_type, new_type;
	struct delegated_inode delegated_inode = { };
	unsigned int lookup_flags = 0;
	bool should_retry = false;
	int error = -EINVAL;

	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
		goto put_names;

	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
	    (flags & RENAME_EXCHANGE))
		goto put_names;

retry:
	error = filename_parentat(olddfd, from, lookup_flags, &old_path,
				  &old_last, &old_type);
	if (error)
		goto put_names;

	error = filename_parentat(newdfd, to, lookup_flags, &new_path, &new_last,
				  &new_type);
	if (error)
		goto exit1;

	error = -EXDEV;
	if (old_path.mnt != new_path.mnt)
		goto exit2;

	error = -EBUSY;
	if (old_type != LAST_NORM)
		goto exit2;

	if (flags & RENAME_NOREPLACE)
		error = -EEXIST;
	if (new_type != LAST_NORM)
		goto exit2;

	error = mnt_want_write(old_path.mnt);
	if (error)
		goto exit2;

retry_deleg:
	rd.old_parent	   = old_path.dentry;
	rd.mnt_idmap	   = mnt_idmap(old_path.mnt);
	rd.new_parent	   = new_path.dentry;
	rd.delegated_inode = &delegated_inode;
	rd.flags	   = flags;

	error = __start_renaming(&rd, lookup_flags, &old_last, &new_last);
	if (error)
		goto exit_lock_rename;

	if (flags & RENAME_EXCHANGE) {
		if (!d_is_dir(rd.new_dentry)) {
			error = -ENOTDIR;
			if (new_last.name[new_last.len])
				goto exit_unlock;
		}
	}
	/* unless the source is a directory trailing slashes give -ENOTDIR */
	if (!d_is_dir(rd.old_dentry)) {
		error = -ENOTDIR;
		if (old_last.name[old_last.len])
			goto exit_unlock;
		if (!(flags & RENAME_EXCHANGE) && new_last.name[new_last.len])
			goto exit_unlock;
	}

	error = security_path_rename(&old_path, rd.old_dentry,
				     &new_path, rd.new_dentry, flags);
	if (error)
		goto exit_unlock;

	error = vfs_rename(&rd);
exit_unlock:
	end_renaming(&rd);
exit_lock_rename:
	if (is_delegated(&delegated_inode)) {
		error = break_deleg_wait(&delegated_inode);
		if (!error)
			goto retry_deleg;
	}
	mnt_drop_write(old_path.mnt);
exit2:
	if (retry_estale(error, lookup_flags))
		should_retry = true;
	path_put(&new_path);
exit1:
	path_put(&old_path);
	if (should_retry) {
		should_retry = false;
		lookup_flags |= LOOKUP_REVAL;
		goto retry;
	}
put_names:
	putname(from);
	putname(to);
	return error;
}

SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
		int, newdfd, const char __user *, newname, unsigned int, flags)
{
	return do_renameat2(olddfd, getname(oldname), newdfd, getname(newname),
				flags);
}

SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
		int, newdfd, const char __user *, newname)
{
	return do_renameat2(olddfd, getname(oldname), newdfd, getname(newname),
				0);
}

SYSCALL_DEFINE2(rename, const char __user *, oldname, const char __user *, newname)
{
	return do_renameat2(AT_FDCWD, getname(oldname), AT_FDCWD,
				getname(newname), 0);
}

int readlink_copy(char __user *buffer, int buflen, const char *link, int linklen)
{
	int copylen;

	copylen = linklen;
	if (unlikely(copylen > (unsigned) buflen))
		copylen = buflen;
	if (copy_to_user(buffer, link, copylen))
		copylen = -EFAULT;
	return copylen;
}

/**
 * vfs_readlink - copy symlink body into userspace buffer
 * @dentry: dentry on which to get symbolic link
 * @buffer: user memory pointer
 * @buflen: size of buffer
 *
 * Does not touch atime.  That's up to the caller if necessary
 *
 * Does not call security hook.
 */
int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen)
{
	struct inode *inode = d_inode(dentry);
	DEFINE_DELAYED_CALL(done);
	const char *link;
	int res;

	if (inode->i_opflags & IOP_CACHED_LINK)
		return readlink_copy(buffer, buflen, inode->i_link, inode->i_linklen);

	if (unlikely(!(inode->i_opflags & IOP_DEFAULT_READLINK))) {
		if (unlikely(inode->i_op->readlink))
			return inode->i_op->readlink(dentry, buffer, buflen);

		if (!d_is_symlink(dentry))
			return -EINVAL;

		spin_lock(&inode->i_lock);
		inode->i_opflags |= IOP_DEFAULT_READLINK;
		spin_unlock(&inode->i_lock);
	}

	link = READ_ONCE(inode->i_link);
	if (!link) {
		link = inode->i_op->get_link(dentry, inode, &done);
		if (IS_ERR(link))
			return PTR_ERR(link);
	}
	res = readlink_copy(buffer, buflen, link, strlen(link));
	do_delayed_call(&done);
	return res;
}
EXPORT_SYMBOL(vfs_readlink);

/**
 * vfs_get_link - get symlink body
 * @dentry: dentry on which to get symbolic link
 * @done: caller needs to free returned data with this
 *
 * Calls security hook and i_op->get_link() on the supplied inode.
 *
 * It does not touch atime.  That's up to the caller if necessary.
 *
 * Does not work on "special" symlinks like /proc/$$/fd/N
 */
const char *vfs_get_link(struct dentry *dentry, struct delayed_call *done)
{
	const char *res = ERR_PTR(-EINVAL);
	struct inode *inode = d_inode(dentry);

	if (d_is_symlink(dentry)) {
		res = ERR_PTR(security_inode_readlink(dentry));
		if (!res)
			res = inode->i_op->get_link(dentry, inode, done);
	}
	return res;
}
EXPORT_SYMBOL(vfs_get_link);

/* get the link contents into pagecache */
static char *__page_get_link(struct dentry *dentry, struct inode *inode,
			     struct delayed_call *callback)
{
	struct folio *folio;
	struct address_space *mapping = inode->i_mapping;

	if (!dentry) {
		folio = filemap_get_folio(mapping, 0);
		if (IS_ERR(folio))
			return ERR_PTR(-ECHILD);
		if (!folio_test_uptodate(folio)) {
			folio_put(folio);
			return ERR_PTR(-ECHILD);
		}
	} else {
		folio = read_mapping_folio(mapping, 0, NULL);
		if (IS_ERR(folio))
			return ERR_CAST(folio);
	}
	set_delayed_call(callback, page_put_link, folio);
	BUG_ON(mapping_gfp_mask(mapping) & __GFP_HIGHMEM);
	return folio_address(folio);
}

const char *page_get_link_raw(struct dentry *dentry, struct inode *inode,
			      struct delayed_call *callback)
{
	return __page_get_link(dentry, inode, callback);
}
EXPORT_SYMBOL_GPL(page_get_link_raw);

/**
 * page_get_link() - An implementation of the get_link inode_operation.
 * @dentry: The directory entry which is the symlink.
 * @inode: The inode for the symlink.
 * @callback: Used to drop the reference to the symlink.
 *
 * Filesystems which store their symlinks in the page cache should use
 * this to implement the get_link() member of their inode_operations.
 *
 * Return: A pointer to the NUL-terminated symlink.
 */
const char *page_get_link(struct dentry *dentry, struct inode *inode,
					struct delayed_call *callback)
{
	char *kaddr = __page_get_link(dentry, inode, callback);

	if (!IS_ERR(kaddr))
		nd_terminate_link(kaddr, inode->i_size, PAGE_SIZE - 1);
	return kaddr;
}
EXPORT_SYMBOL(page_get_link);

/**
 * page_put_link() - Drop the reference to the symlink.
 * @arg: The folio which contains the symlink.
 *
 * This is used internally by page_get_link().  It is exported for use
 * by filesystems which need to implement a variant of page_get_link()
 * themselves.  Despite the apparent symmetry, filesystems which use
 * page_get_link() do not need to call page_put_link().
 *
 * The argument, while it has a void pointer type, must be a pointer to
 * the folio which was retrieved from the page cache.  The delayed_call
 * infrastructure is used to drop the reference count once the caller
 * is done with the symlink.
 */
void page_put_link(void *arg)
{
	folio_put(arg);
}
EXPORT_SYMBOL(page_put_link);

int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
{
	const char *link;
	int res;

	DEFINE_DELAYED_CALL(done);
	link = page_get_link(dentry, d_inode(dentry), &done);
	res = PTR_ERR(link);
	if (!IS_ERR(link))
		res = readlink_copy(buffer, buflen, link, strlen(link));
	do_delayed_call(&done);
	return res;
}
EXPORT_SYMBOL(page_readlink);

int page_symlink(struct inode *inode, const char *symname, int len)
{
	struct address_space *mapping = inode->i_mapping;
	const struct address_space_operations *aops = mapping->a_ops;
	bool nofs = !mapping_gfp_constraint(mapping, __GFP_FS);
	struct folio *folio;
	void *fsdata = NULL;
	int err;
	unsigned int flags;

retry:
	if (nofs)
		flags = memalloc_nofs_save();
	err = aops->write_begin(NULL, mapping, 0, len-1, &folio, &fsdata);
	if (nofs)
		memalloc_nofs_restore(flags);
	if (err)
		goto fail;

	memcpy(folio_address(folio), symname, len - 1);

	err = aops->write_end(NULL, mapping, 0, len - 1, len - 1,
						folio, fsdata);
	if (err < 0)
		goto fail;
	if (err < len-1)
		goto retry;

	mark_inode_dirty(inode);
	return 0;
fail:
	return err;
}
EXPORT_SYMBOL(page_symlink);

const struct inode_operations page_symlink_inode_operations = {
	.get_link	= page_get_link,
};
EXPORT_SYMBOL(page_symlink_inode_operations);
]

13/2 2026/02/02 12:50 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 106953
	output: 22
	thoughts: 4805

Thoughts:
**Analyzing a Double Dentry Kill: A Race Condition in the Dentry Cache**

Okay, so I'm looking at this report showing two tasks, 3002 and 3304, apparently both calling `__dentry_kill` on the *same* dentry, both arriving via `lookup_fast`. This is a problem, because `__dentry_kill` is supposed to be the *final* destruction stage and should only run once per dentry. My expert intuition screams "race condition." Let's break this down.

The problem starts with `lookup_fast`, specifically the call to `dput` after `d_revalidate` fails. A dentry is found via `__d_lookup`, its refcount is incremented, and then, if revalidation fails, both tasks end up calling `dput`. `dput`, in turn, can trigger `finish_dput`, which leads to `__dentry_kill`. So far, so good - but this should only happen for ONE of the tasks.

The initial hunch is that they *shouldn't* both reach `__dentry_kill`... it's a refcount problem. `__d_lookup` increments the refcount, and both tasks find the *same* dentry, so the count is at least two. Two `dput` calls then bring it back down: only the `dput` that takes the count to zero should trigger the kill, while the other should simply decrement and return.

Then I checked the code and realized the process is more complex. Specifically, `d_invalidate` could do interesting things. `__d_lookup` increments the refcount and `dput` decrements it; the dentry is only *killed* when the refcount hits zero and `finish_dput` calls `__dentry_kill`. The problem arises after revalidation fails: each task drops the reference it took. If both tasks found the *same* dentry, the count can reach zero - but even then, only ONE task should call `__dentry_kill`.

So, the report shows both tasks are in `__dentry_kill` - that's impossible *unless* there's a race, right?

Alright, diving into `__dentry_kill` and the associated logic. `finish_dput` calls `lock_for_kill`, and the core problem lies in what happens *inside* `__dentry_kill` and its locking. `lock_for_kill` checks the dentry's `d_lockref.count`: if the count is already 0, it returns true, the caller proceeds to `__dentry_kill`, and the count is set to a special negative value (`-128`), which I now know indicates the dentry is *dead* and *actively being killed*. The critical point is that the `d_lockref.count` check is done *before* `lockref_mark_dead`. This is a window.

If Task A decrements the count to 0, it *should* proceed to `__dentry_kill`, and set the count to `-128`.  If Task B *then* attempts to dput the same dentry, `lock_for_kill` *should* see -128, return false, and *not* enter the `__dentry_kill` path. No problem, right?
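Let me pin that down with a quick sketch - this is *not* the real `lock_for_kill`/`fast_dput` code, just the claim-the-kill pattern I'm describing, with the `-128` sentinel taken as an assumption from the above:

```c
/*
 * Illustrative sketch only, not the actual fs/dcache.c source: the
 * task that finds the count at zero claims the kill by storing the
 * dead sentinel, so a concurrent dput() seeing -128 backs off instead
 * of entering __dentry_kill() a second time.
 */
#define SKETCH_DENTRY_DEAD	(-128)

static bool sketch_lock_for_kill(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	if (dentry->d_lockref.count != 0) {
		/* still referenced, or already dead (-128) */
		spin_unlock(&dentry->d_lock);
		return false;
	}
	dentry->d_lockref.count = SKETCH_DENTRY_DEAD;	/* claim the kill */
	return true;	/* proceed to __dentry_kill() with d_lock held */
}
```

If that holds, Task B can never sit in `__dentry_kill` alongside Task A for the same dentry.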

But the report shows that both are there, and this is a data-race report, not a simple double free. I check the locking again. The lock is held in `lock_for_kill`, `fast_dput`, and `finish_dput`. If nothing raced on `d_lockref.count`, this scheme would hold up.

So, how can *both* tasks be in `__dentry_kill`?  AHA! Maybe there's a problem with *parent* dentry killing.  `__dentry_kill` can *recursively* call itself on the parent.

If the first task kills a child, and dropping the child's reference brings the *parent's* refcount to zero, the parent gets killed too.

Let's assume there are dentries P, C1 and C2.
Task 3002 is `dput(C1)`, Task 3304 is `dput(C2)`.
Task 3002 calls `__dentry_kill(C1)`; killing C1 drops a reference on P.
Task 3304 calls `__dentry_kill(C2)`, which likewise drops a reference on P - so both tasks can end up going after the same P at the same time.
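
Sketching that interleaving as a timeline (hypothetical - the exact ordering is my reconstruction, not taken from the report):

```c
/*
 * Hypothetical timeline, assuming that killing a child drops the last
 * reference on the shared parent P and the killer then retries on P:
 *
 *   Task 3002                         Task 3304
 *   ---------                         ---------
 *   dput(C1)
 *     __dentry_kill(C1)
 *       drops C1's ref on P           dput(C2)
 *                                       __dentry_kill(C2)
 *                                         drops C2's ref on P
 *     sees P's count hit 0
 *                                       sees P's count hit 0
 *   -> both tasks may now try to enter __dentry_kill(P)
 */
```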




14/2 2026/02/02 12:51 tool grepper
0m Args:
map[Expression:#define DCACHE_.*0x]

Results:
map[Output:arch/m68k/include/asm/m54xxacr.h-51-#define ICACHE_SIZE 0x4000	/* instruction - 16k */
arch/m68k/include/asm/m54xxacr.h:52:#define DCACHE_SIZE 0x2000	/* data - 8k */
arch/m68k/include/asm/m54xxacr.h-53-
--
arch/m68k/include/asm/m54xxacr.h-56-#define ICACHE_SIZE 0x8000	/* instruction - 32k */
arch/m68k/include/asm/m54xxacr.h:57:#define DCACHE_SIZE 0x8000	/* data - 32k */
arch/m68k/include/asm/m54xxacr.h-58-
--
arch/m68k/include/asm/m54xxacr.h-61-#define ICACHE_SIZE 0x2000	/* instruction - 8k */
arch/m68k/include/asm/m54xxacr.h:62:#define DCACHE_SIZE 0x2000	/* data - 8k */
arch/m68k/include/asm/m54xxacr.h-63-#endif
--
arch/sh/mm/cache-j2.c-18-#define ICACHE_ENABLE	0x1
arch/sh/mm/cache-j2.c:19:#define DCACHE_ENABLE	0x2
arch/sh/mm/cache-j2.c-20-#define CACHE_ENABLE	(ICACHE_ENABLE | DCACHE_ENABLE)
arch/sh/mm/cache-j2.c-21-#define ICACHE_FLUSH	0x100
arch/sh/mm/cache-j2.c:22:#define DCACHE_FLUSH	0x200
arch/sh/mm/cache-j2.c-23-#define CACHE_FLUSH	(ICACHE_FLUSH | DCACHE_FLUSH)
]

15/2 2026/02/02 12:51 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 113335
	output: 23
	thoughts: 14

Thoughts:
Okay, here's my attempt at summarizing that thought process, tailored for an expert audience:

**Diving into the dcache**

Right, so I need to understand something related to the dcache, likely low-level structure or interaction, and I immediately know the starting point. I instinctively think to look within `include/linux/dcache.h`. That's where the core definitions and probably some critical structures related to the dcache are going to be defined. That header file is the most likely spot to get a precise definition of the data structures and core functions involved in how the kernel handles directory caching. It should give me a good lay of the land, the base to begin my investigation.



16/2 2026/02/02 12:51 tool read-file
0m Args:
map[File:include/linux/dcache.h]

Results:
map[Contents:/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __LINUX_DCACHE_H
#define __LINUX_DCACHE_H

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/math.h>
#include <linux/rculist.h>
#include <linux/rculist_bl.h>
#include <linux/spinlock.h>
#include <linux/seqlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>
#include <linux/lockref.h>
#include <linux/stringhash.h>
#include <linux/wait.h>

struct path;
struct file;
struct vfsmount;

/*
 * linux/include/linux/dcache.h
 *
 * Dirent cache data structures
 *
 * (C) Copyright 1997 Thomas Schoebel-Theuer,
 * with heavy changes by Linus Torvalds
 */

#define IS_ROOT(x) ((x) == (x)->d_parent)

/* The hash is always the low bits of hash_len */
#ifdef __LITTLE_ENDIAN
 #define HASH_LEN_DECLARE u32 hash; u32 len
 #define bytemask_from_count(cnt)	(~(~0ul << (cnt)*8))
#else
 #define HASH_LEN_DECLARE u32 len; u32 hash
 #define bytemask_from_count(cnt)	(~(~0ul >> (cnt)*8))
#endif

/*
 * "quick string" -- eases parameter passing, but more importantly
 * saves "metadata" about the string (ie length and the hash).
 *
 * hash comes first so it snuggles against d_parent in the
 * dentry.
 */
struct qstr {
	union {
		struct {
			HASH_LEN_DECLARE;
		};
		u64 hash_len;
	};
	const unsigned char *name;
};

#define QSTR_INIT(n,l) { { { .len = l } }, .name = n }
#define QSTR_LEN(n,l) (struct qstr)QSTR_INIT(n,l)
#define QSTR(n) QSTR_LEN(n, strlen(n))

extern const struct qstr empty_name;
extern const struct qstr slash_name;
extern const struct qstr dotdot_name;

/*
 * Try to keep struct dentry aligned on 64 byte cachelines (this will
 * give reasonable cacheline footprint with larger lines without the
 * large memory footprint increase).
 */
#ifdef CONFIG_64BIT
# define DNAME_INLINE_WORDS 5 /* 192 bytes */
#else
# ifdef CONFIG_SMP
#  define DNAME_INLINE_WORDS 9 /* 128 bytes */
# else
#  define DNAME_INLINE_WORDS 11 /* 128 bytes */
# endif
#endif

#define DNAME_INLINE_LEN (DNAME_INLINE_WORDS*sizeof(unsigned long))

union shortname_store {
	unsigned char string[DNAME_INLINE_LEN];
	unsigned long words[DNAME_INLINE_WORDS];
};

#define d_lock	d_lockref.lock
#define d_iname d_shortname.string

struct dentry {
	/* RCU lookup touched fields */
	unsigned int d_flags;		/* protected by d_lock */
	seqcount_spinlock_t d_seq;	/* per dentry seqlock */
	struct hlist_bl_node d_hash;	/* lookup hash list */
	struct dentry *d_parent;	/* parent directory */
	union {
	struct qstr __d_name;		/* for use ONLY in fs/dcache.c */
	const struct qstr d_name;
	};
	struct inode *d_inode;		/* Where the name belongs to - NULL is
					 * negative */
	union shortname_store d_shortname;
	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */

	/* Ref lookup also touches following */
	const struct dentry_operations *d_op;
	struct super_block *d_sb;	/* The root of the dentry tree */
	unsigned long d_time;		/* used by d_revalidate */
	void *d_fsdata;			/* fs-specific data */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct lockref d_lockref;	/* per-dentry lock and refcount
					 * keep separate from RCU lookup area if
					 * possible!
					 */

	union {
		struct list_head d_lru;		/* LRU list */
		wait_queue_head_t *d_wait;	/* in-lookup ones only */
	};
	struct hlist_node d_sib;	/* child of parent list */
	struct hlist_head d_children;	/* our children */
	/*
	 * d_alias and d_rcu can share memory
	 */
	union {
		struct hlist_node d_alias;	/* inode alias list */
		struct hlist_bl_node d_in_lookup_hash;	/* only for in-lookup ones */
	 	struct rcu_head d_rcu;
	} d_u;
};

/*
 * dentry->d_lock spinlock nesting subclasses:
 *
 * 0: normal
 * 1: nested
 */
enum dentry_d_lock_class
{
	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
	DENTRY_D_LOCK_NESTED
};

enum d_real_type {
	D_REAL_DATA,
	D_REAL_METADATA,
};

struct dentry_operations {
	int (*d_revalidate)(struct inode *, const struct qstr *,
			    struct dentry *, unsigned int);
	int (*d_weak_revalidate)(struct dentry *, unsigned int);
	int (*d_hash)(const struct dentry *, struct qstr *);
	int (*d_compare)(const struct dentry *,
			unsigned int, const char *, const struct qstr *);
	int (*d_delete)(const struct dentry *);
	int (*d_init)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_prune)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
	char *(*d_dname)(struct dentry *, char *, int);
	struct vfsmount *(*d_automount)(struct path *);
	int (*d_manage)(const struct path *, bool);
	struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
	bool (*d_unalias_trylock)(const struct dentry *);
	void (*d_unalias_unlock)(const struct dentry *);
} ____cacheline_aligned;

/*
 * Locking rules for dentry_operations callbacks are to be found in
 * Documentation/filesystems/locking.rst. Keep it updated!
 *
 * Further descriptions are found in Documentation/filesystems/vfs.rst.
 * Keep it updated too!
 */

/* d_flags entries */
enum dentry_flags {
	DCACHE_OP_HASH			= BIT(0),
	DCACHE_OP_COMPARE		= BIT(1),
	DCACHE_OP_REVALIDATE		= BIT(2),
	DCACHE_OP_DELETE		= BIT(3),
	DCACHE_OP_PRUNE			= BIT(4),
	/*
	 * This dentry is possibly not currently connected to the dcache tree,
	 * in which case its parent will either be itself, or will have this
	 * flag as well.  nfsd will not use a dentry with this bit set, but will
	 * first endeavour to clear the bit either by discovering that it is
	 * connected, or by performing lookup operations.  Any filesystem which
	 * supports nfsd_operations MUST have a lookup function which, if it
	 * finds a directory inode with a DCACHE_DISCONNECTED dentry, will
	 * d_move that dentry into place and return that dentry rather than the
	 * passed one, typically using d_splice_alias.
	 */
	DCACHE_DISCONNECTED		= BIT(5),
	DCACHE_REFERENCED		= BIT(6),	/* Recently used, don't discard. */
	DCACHE_DONTCACHE		= BIT(7),	/* Purge from memory on final dput() */
	DCACHE_CANT_MOUNT		= BIT(8),
	DCACHE_SHRINK_LIST		= BIT(10),
	DCACHE_OP_WEAK_REVALIDATE	= BIT(11),
	/*
	 * this dentry has been "silly renamed" and has to be deleted on the
	 * last dput()
	 */
	DCACHE_NFSFS_RENAMED		= BIT(12),
	DCACHE_FSNOTIFY_PARENT_WATCHED	= BIT(13),	/* Parent inode is watched by some fsnotify listener */
	DCACHE_DENTRY_KILLED		= BIT(14),
	DCACHE_MOUNTED			= BIT(15),	/* is a mountpoint */
	DCACHE_NEED_AUTOMOUNT		= BIT(16),	/* handle automount on this dir */
	DCACHE_MANAGE_TRANSIT		= BIT(17),	/* manage transit from this dirent */
	DCACHE_LRU_LIST			= BIT(18),
	DCACHE_ENTRY_TYPE		= (7 << 19),	/* bits 19..21 are for storing type: */
	DCACHE_MISS_TYPE		= (0 << 19),	/* Negative dentry */
	DCACHE_WHITEOUT_TYPE		= (1 << 19),	/* Whiteout dentry (stop pathwalk) */
	DCACHE_DIRECTORY_TYPE		= (2 << 19),	/* Normal directory */
	DCACHE_AUTODIR_TYPE		= (3 << 19),	/* Lookupless directory (presumed automount) */
	DCACHE_REGULAR_TYPE		= (4 << 19),	/* Regular file type */
	DCACHE_SPECIAL_TYPE		= (5 << 19),	/* Other file type */
	DCACHE_SYMLINK_TYPE		= (6 << 19),	/* Symlink */
	DCACHE_NOKEY_NAME		= BIT(22),	/* Encrypted name encoded without key */
	DCACHE_OP_REAL			= BIT(23),
	DCACHE_PAR_LOOKUP		= BIT(24),	/* being looked up (with parent locked shared) */
	DCACHE_DENTRY_CURSOR		= BIT(25),
	DCACHE_NORCU			= BIT(26),	/* No RCU delay for freeing */
	DCACHE_PERSISTENT		= BIT(27)
};

#define DCACHE_MANAGED_DENTRY \
	(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)

extern seqlock_t rename_lock;

/*
 * These are the low-level FS interfaces to the dcache..
 */
extern void d_instantiate(struct dentry *, struct inode *);
extern void d_instantiate_new(struct dentry *, struct inode *);
extern void __d_drop(struct dentry *dentry);
extern void d_drop(struct dentry *dentry);
extern void d_delete(struct dentry *);

/* allocate/de-allocate */
extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
extern struct dentry * d_alloc_anon(struct super_block *);
extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *,
					wait_queue_head_t *);
extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
/* weird procfs mess; *NOT* exported */
extern struct dentry * d_splice_alias_ops(struct inode *, struct dentry *,
					  const struct dentry_operations *);
extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
extern bool d_same_name(const struct dentry *dentry, const struct dentry *parent,
			const struct qstr *name);
extern struct dentry *d_find_any_alias(struct inode *inode);
extern struct dentry * d_obtain_alias(struct inode *);
extern struct dentry * d_obtain_root(struct inode *);
extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
extern void d_invalidate(struct dentry *);

/* only used at mount-time */
extern struct dentry * d_make_root(struct inode *);

extern void d_mark_tmpfile(struct file *, struct inode *);
extern void d_tmpfile(struct file *, struct inode *);

extern struct dentry *d_find_alias(struct inode *);
extern void d_prune_aliases(struct inode *);
extern void d_dispose_if_unused(struct dentry *, struct list_head *);
extern void shrink_dentry_list(struct list_head *);

extern struct dentry *d_find_alias_rcu(struct inode *);

/* test whether we have any submounts in a subdir tree */
extern int path_has_submounts(const struct path *);

/*
 * This adds the entry to the hash queues.
 */
extern void d_rehash(struct dentry *);
 
extern void d_add(struct dentry *, struct inode *);

/* used for rename() and baskets */
extern void d_move(struct dentry *, struct dentry *);
extern void d_exchange(struct dentry *, struct dentry *);
extern struct dentry *d_ancestor(struct dentry *, struct dentry *);

extern struct dentry *d_lookup(const struct dentry *, const struct qstr *);

static inline unsigned d_count(const struct dentry *dentry)
{
	return dentry->d_lockref.count;
}

ino_t d_parent_ino(struct dentry *dentry);

/*
 * helper function for dentry_operations.d_dname() members
 */
extern __printf(3, 4)
char *dynamic_dname(char *, int, const char *, ...);

extern char *__d_path(const struct path *, const struct path *, char *, int);
extern char *d_absolute_path(const struct path *, char *, int);
extern char *d_path(const struct path *, char *, int);
extern char *dentry_path_raw(const struct dentry *, char *, int);
extern char *dentry_path(const struct dentry *, char *, int);

/* Allocation counts.. */

/**
 * dget_dlock -	get a reference to a dentry
 * @dentry: dentry to get a reference to
 *
 * Given a live dentry, increment the reference count and return the dentry.
 * Caller must hold @dentry->d_lock.  Making sure that dentry is alive is
 * caller's responsibility.  There are many conditions sufficient to guarantee
 * that; e.g. anything with non-negative refcount is alive, so's anything
 * hashed, anything positive, anyone's parent, etc.
 */
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
	dentry->d_lockref.count++;
	return dentry;
}


/**
 * dget - get a reference to a dentry
 * @dentry: dentry to get a reference to
 *
 * Given a dentry or %NULL pointer increment the reference count
 * if appropriate and return the dentry.  A dentry will not be
 * destroyed when it has references.  Conversely, a dentry with
 * no references can disappear for any number of reasons, starting
 * with memory pressure.  In other words, that primitive is
 * used to clone an existing reference; using it on something with
 * zero refcount is a bug.
 *
 * NOTE: it will spin if @dentry->d_lock is held.  From the deadlock
 * avoidance point of view it is equivalent to spin_lock()/increment
 * refcount/spin_unlock(), so calling it under @dentry->d_lock is
 * always a bug; so's calling it under ->d_lock on any of its descendents.
 *
 */
static inline struct dentry *dget(struct dentry *dentry)
{
	if (dentry)
		lockref_get(&dentry->d_lockref);
	return dentry;
}

extern struct dentry *dget_parent(struct dentry *dentry);

/**
 * d_unhashed - is dentry hashed
 * @dentry: entry to check
 *
 * Returns true if the dentry passed is not currently hashed.
 */
static inline int d_unhashed(const struct dentry *dentry)
{
	return hlist_bl_unhashed(&dentry->d_hash);
}

static inline int d_unlinked(const struct dentry *dentry)
{
	return d_unhashed(dentry) && !IS_ROOT(dentry);
}

static inline int cant_mount(const struct dentry *dentry)
{
	return (dentry->d_flags & DCACHE_CANT_MOUNT);
}

static inline void dont_mount(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	dentry->d_flags |= DCACHE_CANT_MOUNT;
	spin_unlock(&dentry->d_lock);
}

extern void __d_lookup_unhash_wake(struct dentry *dentry);

static inline int d_in_lookup(const struct dentry *dentry)
{
	return dentry->d_flags & DCACHE_PAR_LOOKUP;
}

static inline void d_lookup_done(struct dentry *dentry)
{
	if (unlikely(d_in_lookup(dentry)))
		__d_lookup_unhash_wake(dentry);
}

extern void dput(struct dentry *);

static inline bool d_managed(const struct dentry *dentry)
{
	return dentry->d_flags & DCACHE_MANAGED_DENTRY;
}

static inline bool d_mountpoint(const struct dentry *dentry)
{
	return dentry->d_flags & DCACHE_MOUNTED;
}

/*
 * Directory cache entry type accessor functions.
 */
static inline unsigned __d_entry_type(const struct dentry *dentry)
{
	return dentry->d_flags & DCACHE_ENTRY_TYPE;
}

static inline bool d_is_miss(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_MISS_TYPE;
}

static inline bool d_is_whiteout(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_WHITEOUT_TYPE;
}

static inline bool d_can_lookup(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_DIRECTORY_TYPE;
}

static inline bool d_is_autodir(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_AUTODIR_TYPE;
}

static inline bool d_is_dir(const struct dentry *dentry)
{
	return d_can_lookup(dentry) || d_is_autodir(dentry);
}

static inline bool d_is_symlink(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_SYMLINK_TYPE;
}

static inline bool d_is_reg(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_REGULAR_TYPE;
}

static inline bool d_is_special(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_SPECIAL_TYPE;
}

static inline bool d_is_file(const struct dentry *dentry)
{
	return d_is_reg(dentry) || d_is_special(dentry);
}

static inline bool d_is_negative(const struct dentry *dentry)
{
	// TODO: check d_is_whiteout(dentry) also.
	return d_is_miss(dentry);
}

static inline bool d_flags_negative(unsigned flags)
{
	return (flags & DCACHE_ENTRY_TYPE) == DCACHE_MISS_TYPE;
}

static inline bool d_is_positive(const struct dentry *dentry)
{
	return !d_is_negative(dentry);
}

/**
 * d_really_is_negative - Determine if a dentry is really negative (ignoring fallthroughs)
 * @dentry: The dentry in question
 *
 * Returns true if the dentry represents either an absent name or a name that
 * doesn't map to an inode (ie. ->d_inode is NULL).  The dentry could represent
 * a true miss, a whiteout that isn't represented by a 0,0 chardev or a
 * fallthrough marker in an opaque directory.
 *
 * Note!  (1) This should be used *only* by a filesystem to examine its own
 * dentries.  It should not be used to look at some other filesystem's
 * dentries.  (2) It should also be used in combination with d_inode() to get
 * the inode.  (3) The dentry may have something attached to ->d_lower and the
 * type field of the flags may be set to something other than miss or whiteout.
 */
static inline bool d_really_is_negative(const struct dentry *dentry)
{
	return dentry->d_inode == NULL;
}

/**
 * d_really_is_positive - Determine if a dentry is really positive (ignoring fallthroughs)
 * @dentry: The dentry in question
 *
 * Returns true if the dentry represents a name that maps to an inode
 * (ie. ->d_inode is not NULL).  The dentry might still represent a whiteout if
 * that is represented on medium as a 0,0 chardev.
 *
 * Note!  (1) This should be used *only* by a filesystem to examine its own
 * dentries.  It should not be used to look at some other filesystem's
 * dentries.  (2) It should also be used in combination with d_inode() to get
 * the inode.
 */
static inline bool d_really_is_positive(const struct dentry *dentry)
{
	return dentry->d_inode != NULL;
}

static inline int simple_positive(const struct dentry *dentry)
{
	return d_really_is_positive(dentry) && !d_unhashed(dentry);
}

unsigned long vfs_pressure_ratio(unsigned long val);

/**
 * d_inode - Get the actual inode of this dentry
 * @dentry: The dentry to query
 *
 * This is the helper normal filesystems should use to get at their own inodes
 * in their own dentries and ignore the layering superimposed upon them.
 */
static inline struct inode *d_inode(const struct dentry *dentry)
{
	return dentry->d_inode;
}

/**
 * d_inode_rcu - Get the actual inode of this dentry with READ_ONCE()
 * @dentry: The dentry to query
 *
 * This is the helper normal filesystems should use to get at their own inodes
 * in their own dentries and ignore the layering superimposed upon them.
 */
static inline struct inode *d_inode_rcu(const struct dentry *dentry)
{
	return READ_ONCE(dentry->d_inode);
}

/**
 * d_backing_inode - Get upper or lower inode we should be using
 * @upper: The upper layer
 *
 * This is the helper that should be used to get at the inode that will be used
 * if this dentry were to be opened as a file.  The inode may be on the upper
 * dentry or it may be on a lower dentry pinned by the upper.
 *
 * Normal filesystems should not use this to access their own inodes.
 */
static inline struct inode *d_backing_inode(const struct dentry *upper)
{
	struct inode *inode = upper->d_inode;

	return inode;
}

/**
 * d_real - Return the real dentry
 * @dentry: the dentry to query
 * @type: the type of real dentry (data or metadata)
 *
 * If dentry is on a union/overlay, then return the underlying, real dentry.
 * Otherwise return the dentry itself.
 *
 * See also: Documentation/filesystems/vfs.rst
 */
static inline struct dentry *d_real(struct dentry *dentry, enum d_real_type type)
{
	if (unlikely(dentry->d_flags & DCACHE_OP_REAL))
		return dentry->d_op->d_real(dentry, type);
	else
		return dentry;
}

/**
 * d_real_inode - Return the real inode hosting the data
 * @dentry: The dentry to query
 *
 * If dentry is on a union/overlay, then return the underlying, real inode.
 * Otherwise return d_inode().
 */
static inline struct inode *d_real_inode(const struct dentry *dentry)
{
	/* This usage of d_real() results in const dentry */
	return d_inode(d_real((struct dentry *) dentry, D_REAL_DATA));
}

struct name_snapshot {
	struct qstr name;
	union shortname_store inline_name;
};
void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
void release_dentry_name_snapshot(struct name_snapshot *);

static inline struct dentry *d_first_child(const struct dentry *dentry)
{
	return hlist_entry_safe(dentry->d_children.first, struct dentry, d_sib);
}

static inline struct dentry *d_next_sibling(const struct dentry *dentry)
{
	return hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib);
}

void set_default_d_op(struct super_block *, const struct dentry_operations *);
struct dentry *d_make_persistent(struct dentry *, struct inode *);
void d_make_discardable(struct dentry *dentry);

#endif	/* __LINUX_DCACHE_H */
]

17/2 2026/02/02 12:51 llm expert
4m Model:
gemini-3-flash-preview

Tokens:
	input: 119907
	output: 25
	thoughts: 39191

18/2 2026/02/02 12:55 tool grepper
0m Args:
map[Expression:WRITE_ONCE\(dentry->d_flags]

Results:
map[Output:fs/dcache.c=414=static inline void __d_clear_type_and_inode(struct dentry *dentry)
--
fs/dcache.c-418-	flags &= ~DCACHE_ENTRY_TYPE;
fs/dcache.c:419:	WRITE_ONCE(dentry->d_flags, flags);
fs/dcache.c-420-	dentry->d_inode = NULL;
]

19/2 2026/02/02 12:55 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 159242
	output: 19
	thoughts: 1003

Thoughts:
**Data Race Analysis: Dentry Flag Manipulation**

Okay, so I'm looking at the grep results, and immediately I see that `WRITE_ONCE` on `d_flags` is only used in `__d_clear_type_and_inode` - not in the functions I'd initially suspect, `d_lru_del` or `dentry_unlist`. That's a good starting point. I need to figure out where the data race is happening.

Aha! Got it! The race is between `dentry->d_flags &= ~DCACHE_LRU_LIST;` in `d_lru_del` and `dentry->d_flags |= DCACHE_DENTRY_KILLED;` in `dentry_unlist`. Both are non-atomic operations, that's the key.

If this race occurs, it implies a locking problem: either one of the tasks isn't holding `d_lock`, or they're working on different dentries. But they are the same dentry, so the locking is the problem. Both are within `__dentry_kill`.

Wait, that's it! `__dentry_kill` *drops* `d_lock`! That's the crucial detail!

```c
	if (dentry->d_inode)
		dentry_unlink_inode(dentry);
	else
		spin_unlock(&dentry->d_lock);
```

So, Task 3304 (the one setting `DCACHE_DENTRY_KILLED`) drops the lock, and then Task 3002 (clearing `DCACHE_LRU_LIST`) takes it. The race is between these tasks, one releasing the lock, the other acquiring. Shouldn't Task 3002 see the dentry as "dead" at that point?
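
Let me lay out the interleaving I have in mind (my reconstruction around the lock drop quoted above, not taken from the report):

```c
/*
 * Hypothetical interleaving of the two flag updates around the
 * __dentry_kill() lock drop shown in the snippet above:
 *
 *   Task 3304                          Task 3002
 *   ---------                          ---------
 *   dentry_unlist() begins its
 *   |= by reading d_flags
 *                                      holds d_lock, runs d_lru_del():
 *                                      d_flags &= ~DCACHE_LRU_LIST
 *   writes back the stale value
 *   with DCACHE_DENTRY_KILLED set
 *   -> Task 3002's DCACHE_LRU_LIST clear is lost
 */
```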

Hold on... `lock_for_kill` checks `d_lockref.count`. If it's -128, the function returns `false`. But is it possible that `lockref_mark_dead` hasn't been called yet? No, `dentry_unlist` is much later in the process.

Okay, is it possible that the function `lockref_mark_dead` has been called but Task 3002 still sees the count as zero? I already dismissed this. Is Task 3002 even calling `lock_for_kill`? The stack trace says yes.

Alright, I'm leaning toward calling this benign. The dentry is being killed and will be freed soon. The flag manipulations are for a dentry that's no longer reachable and not in the LRU. The only potential issue is `DCACHE_DENTRY_KILLED` not being set, but `d_walk` also checks `d_unhashed`.

Wait, I should check if `d_lru_del` can be called twice. If it is, then `list_lru_del_obj` will be called twice. `list_lru_del_obj` returns `false` if the object wasn't on the list. It doesn't seem to crash, that's good.

So, benign it is.

One more check, `dentry_unlist` also has list operations:
```c
	if (unlikely(hlist_unhashed(&dentry->d_sib)))
		return;
	__hlist_del(&dentry->d_sib);
```

If it's called twice, `hlist_unhashed` will catch it the second time. That's safe.
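
To convince myself, here's a tiny hypothetical helper mirroring that guard (the name and the explicit `pprev` reset are my own, for illustration only):

```c
/*
 * Hypothetical helper mirroring the dentry_unlist() guard above:
 * hlist_unhashed() returns true when node->pprev is NULL, so once the
 * node is unlinked and pprev is reset, a second call is a no-op.
 */
static void sketch_unlist_once(struct dentry *dentry)
{
	if (unlikely(hlist_unhashed(&dentry->d_sib)))
		return;				/* already unlinked */
	__hlist_del(&dentry->d_sib);
	dentry->d_sib.pprev = NULL;		/* makes the guard fire next time */
}
```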

Wait! I see a real reason why it's benign! The race is on `d_flags`.

Task A: `flags &= ~BIT_X`
Task B: `flags |= BIT_Y`

If they race:
1. Task B reads `flags` (BIT_X set, BIT_Y clear).
2. Task A writes `flags` (BIT_X clear, BIT_Y clear).
3. Task B writes `flags` (BIT_X set, BIT_Y set).

Result: `BIT_X` is still set!

But `BIT_X` is `DCACHE_LRU_LIST`.

If it's still set, the dentry is freed.

When it's freed, nobody cares about the flags. That's the real clincher. So it's benign.
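
The lost-update pattern itself is trivial to reproduce outside the kernel; here's a minimal userspace C sketch of the same `&=`/`|=` collision (illustrative only - the bit positions come from dcache.h, everything else is made up):

```c
/* Userspace demo of the lost update: two threads doing non-atomic
 * read-modify-writes on one flags word.  Build with: cc -pthread demo.c
 * A single run will rarely hit the window; the point is the pattern.
 */
#include <pthread.h>
#include <stdio.h>

#define BIT_X (1u << 18)	/* stands in for DCACHE_LRU_LIST */
#define BIT_Y (1u << 14)	/* stands in for DCACHE_DENTRY_KILLED */

static volatile unsigned int flags = BIT_X;

static void *clear_x(void *arg)
{
	flags &= ~BIT_X;	/* like d_lru_del() */
	return NULL;
}

static void *set_y(void *arg)
{
	flags |= BIT_Y;		/* like dentry_unlist() */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, clear_x, NULL);
	pthread_create(&b, NULL, set_y, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Under an unlucky interleaving, BIT_X survives: update lost. */
	printf("flags = %#x%s\n", flags,
	       (flags & BIT_X) ? " (BIT_X still set -- lost update)" : "");
	return 0;
}
```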

Wait! I should check if `DCACHE_LRU_LIST` being set causes any issues during freeing. No, `dentry_free` doesn't care.

So I'm confident this is benign.

Wait! I found a similar report where the conclusion was also "benign"!

"K


20/2 2026/02/02 12:55 tool set-results
0m Args:
map[Benign:true Confident:true]

Results:
map[Benign:true Confident:true]

21/2 2026/02/02 12:55 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 161306
	output: 930
	thoughts: 419

Thoughts:
**Analyzing the Dentry Race Condition**

Okay, one last pass over these stack traces. Looks like Task 3002 and Task 3304 are both hitting `__dentry_kill`, specifically around `d_lru_del` and `dentry_unlist`, respectively. They're both operating on the same `dentry->d_flags`. The value changed, and that's where the interesting stuff is happening. Initially, it's `0x00140004`, and it gets clobbered down to `0x00000004`. Clearly, something cleared `DCACHE_LRU_LIST` and `DCACHE_DIRECTORY_TYPE`.

Let's see… Task 3002 is responsible for clearing `DCACHE_LRU_LIST`, that makes sense. But Task 3304 appears to be clearing `DCACHE_DIRECTORY_TYPE`, probably in `dentry_unlink_inode`. Wait a minute, though... isn't `dentry_unlink_inode` *before* `dentry_unlist` in the calling sequence? Aha!
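
Decoding the two observed values against the enum in dcache.h backs that up (my arithmetic):

```c
/*
 * Observed d_flags transition, decoded with the dcache.h bit values:
 *
 *   0x00140004 = DCACHE_DIRECTORY_TYPE (2 << 19  = 0x100000)
 *              | DCACHE_LRU_LIST       (BIT(18)  = 0x040000)
 *              | DCACHE_OP_REVALIDATE  (BIT(2)   = 0x000004)
 *
 *   0x00000004 = DCACHE_OP_REVALIDATE alone: both the LRU bit and the
 *   type bits are gone, consistent with d_lru_del() plus the
 *   WRITE_ONCE() in __d_clear_type_and_inode() having run.
 */
```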

So, the race is between Task 3002's write to `d_flags` and Task 3304's read *during* `dentry_unlist`. Task 3304 is using an `OR` operation, which means it reads the existing `d_flags` value, then updates it. Here's that crucial snippet:

```c
static inline void dentry_unlist(struct dentry *dentry)
{
    // ...
	dentry->d_flags |= DCACHE_DENTRY_KILLED;
```

If Task 3002 writes its value (clearing `DCACHE_LRU_LIST`) between Task 3304's read and write, then Task 3304 will stomp on Task 3002's changes. But is this bad? Well, because we're killing the dentry anyway, I can't see this causing any real problems. The race condition is there, but I think the result is benign. Final verdict: benign.


