syzbot


ID:       be481e9c-9cf6-4c85-b258-d9a4710a79e1
Workflow: assessment-kcsan
Result:   Benign: ✅  Confident: ✅
Bug:      KCSAN: data-race in aio_read_events / iocb_put (13)
Created:  2026/02/24 18:29
Started:  2026/02/24 18:29
Finished: 2026/02/24 18:32
Revision: 305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9

Explanation:
The data race occurs between `aio_complete()` writing to `ring->tail` and `aio_read_events_ring()` reading from it. This is an intentional lockless producer-consumer synchronization pattern used in the Linux AIO (Asynchronous I/O) implementation.

In `aio_complete()` (the producer):
1. The completion event is written to the ring buffer.
2. `smp_wmb()` is called to ensure the event data is visible to other CPUs before the `tail` pointer is updated.
3. `ring->tail` is updated to signal that a new event is available.

In `aio_read_events_ring()` (the consumer):
1. `ring->tail` is read to determine how many events are available.
2. `smp_rmb()` is called to ensure that subsequent reads of the event data happen after the `tail` pointer has been read.
3. The event data is copied to the user-provided buffer.

This pattern ensures that if the consumer sees an updated `tail` pointer, it is guaranteed to see the corresponding event data. The race on the `tail` pointer itself is benign: if the consumer reads a stale value, it will simply process fewer events and catch the new ones in a subsequent call. On x86_64, 32-bit aligned accesses (like this `unsigned int`) are atomic, preventing "torn" reads/writes.

The lack of `READ_ONCE()` and `WRITE_ONCE()` annotations is what triggers the KCSAN report, but the algorithm is fundamentally sound and robust against the observed concurrency. Similar intentional races on the `head` pointer are explicitly documented as safe in the same file (see comments in `user_refill_reqs_available()`).
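
For illustration, a minimal sketch (not an actual patch) of how the two racy accesses could be marked with `WRITE_ONCE()`/`READ_ONCE()` to document the intent and silence KCSAN; folio handling, index wrap-around and error handling are elided, and `copy_events_to_user()` is a hypothetical stand-in for the real copy loop:

```c
/* Producer side, as in aio_complete(), runs under ctx->completion_lock. */
*event = iocb->ki_res;              /* store the completion event           */
smp_wmb();                          /* make event visible before the tail   */
WRITE_ONCE(ring->tail, tail);       /* marked store: intentionally lockless */

/* Consumer side, as in aio_read_events_ring(), runs under ctx->ring_lock. */
head = ring->head;
tail = READ_ONCE(ring->tail);       /* marked load: a stale value only means
                                     * fewer events are returned this time  */
smp_rmb();                          /* pairs with the producer's smp_wmb()  */
if (head != tail)
	copy_events_to_user(ctx, head, tail);   /* hypothetical helper */
```

On x86_64 the marked accesses compile to the same single 4-byte loads and stores as the plain accesses; the annotations only constrain the compiler (for example, they forbid reloading `tail` after the `smp_rmb()`) and tell KCSAN that the race is intentional.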

Conclusion: The data race is benign.

Crash report:
==================================================================
BUG: KCSAN: data-race in aio_read_events / iocb_put

write to 0xffff88813ae2800c of 4 bytes by task 10185 on cpu 1:
 aio_complete fs/aio.c:1162 [inline]
 iocb_put+0x18c/0x680 fs/aio.c:1209
 io_submit_one+0xbcb/0x1350 fs/aio.c:2056
 __do_sys_io_submit fs/aio.c:2112 [inline]
 __se_sys_io_submit+0xfb/0x280 fs/aio.c:2082
 __x64_sys_io_submit+0x43/0x50 fs/aio.c:2082
 x64_sys_call+0x19d5/0x3020 arch/x86/include/generated/asm/syscalls_64.h:210
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88813ae2800c of 4 bytes by task 10186 on cpu 0:
 aio_read_events_ring fs/aio.c:1238 [inline]
 aio_read_events+0xb2/0x400 fs/aio.c:1298
 read_events fs/aio.c:1334 [inline]
 do_io_getevents+0x11b/0x450 fs/aio.c:2230
 __do_sys_io_getevents fs/aio.c:2263 [inline]
 __se_sys_io_getevents fs/aio.c:2251 [inline]
 __x64_sys_io_getevents+0xca/0x190 fs/aio.c:2251
 x64_sys_call+0x2cfe/0x3020 arch/x86/include/generated/asm/syscalls_64.h:209
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00000003 -> 0x00000008

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 10186 Comm: syz.7.1697 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
==================================================================

Trajectory:
Seq Timestamp Type Name Duration
0/0 2026/02/24 18:29 flow assessment-kcsan
2m Results:
map[Benign:true Confident:true Explanation:The data race occurs between `aio_complete()` writing to `ring->tail` and `aio_read_events_ring()` reading from it. This is an intentional lockless producer-consumer synchronization pattern used in the Linux AIO (Asynchronous I/O) implementation.

In `aio_complete()` (the producer):
1. The completion event is written to the ring buffer.
2. `smp_wmb()` is called to ensure the event data is visible to other CPUs before the `tail` pointer is updated.
3. `ring->tail` is updated to signal that a new event is available.

In `aio_read_events_ring()` (the consumer):
1. `ring->tail` is read to determine how many events are available.
2. `smp_rmb()` is called to ensure that subsequent reads of the event data happen after the `tail` pointer has been read.
3. The event data is copied to the user-provided buffer.

This pattern ensures that if the consumer sees an updated `tail` pointer, it is guaranteed to see the corresponding event data. The race on the `tail` pointer itself is benign: if the consumer reads a stale value, it will simply process fewer events and catch the new ones in a subsequent call. On x86_64, 32-bit aligned accesses (like this `unsigned int`) are atomic, preventing "torn" reads/writes.

The lack of `READ_ONCE()` and `WRITE_ONCE()` annotations is what triggers the KCSAN report, but the algorithm is fundamentally sound and robust against the observed concurrency. Similar intentional races on the `head` pointer are explicitly documented as safe in the same file (see comments in `user_refill_reqs_available()`).

Conclusion: The data race is benign.]

1/1 2026/02/24 18:29 action kernel-checkouter
0m Results:
map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/2fdd1b0b70843cf11339656b168b6d2ccec9cbc2]

2/1 2026/02/24 18:29 action kernel-builder
0m Results:
map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/b07903d644582d0e293db32421ead036a9fd68a4]

3/1 2026/02/24 18:29 action codesearch-prepare
0m Results:
map[Index:codesearch-index]

4/1 2026/02/24 18:29 agent expert
2m Model:
gemini-3-flash-preview

Results:
map[Benign:true Confident:true]

Instruction:
You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

A "benign" data races are on a simple int/bool variable or similar field,
and the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flags fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful" data race) can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data races is race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

Also consider races that happen at the same time as the given one.
If there is no synchronization in between, other memory accesses in the involved threads
race with each other if they access the same memory. For example, if both threads execute:

	some_struct->have_elements = true;
	list_add(new_node, &some_struct->list_head);

the race on some_struct->have_elements may appear benign; however, it also implies there
is a race on some_struct->list_head, which is not benign, since the list is not thread-safe.

Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses,
and thus even with a conservative compiler may not be fully atomic. However, such races may
still be benign depending on how writes are done, and how read data is used.

In the final reply explain why you think the given data race is benign or is harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.


Use set-results tool to provide results of the analysis.
It must be called exactly once before the final reply.
Ignore results of this tool.

Prompt:
The data race report is:

==================================================================
BUG: KCSAN: data-race in aio_read_events / iocb_put

write to 0xffff88813ae2800c of 4 bytes by task 10185 on cpu 1:
 aio_complete fs/aio.c:1162 [inline]
 iocb_put+0x18c/0x680 fs/aio.c:1209
 io_submit_one+0xbcb/0x1350 fs/aio.c:2056
 __do_sys_io_submit fs/aio.c:2112 [inline]
 __se_sys_io_submit+0xfb/0x280 fs/aio.c:2082
 __x64_sys_io_submit+0x43/0x50 fs/aio.c:2082
 x64_sys_call+0x19d5/0x3020 arch/x86/include/generated/asm/syscalls_64.h:210
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88813ae2800c of 4 bytes by task 10186 on cpu 0:
 aio_read_events_ring fs/aio.c:1238 [inline]
 aio_read_events+0xb2/0x400 fs/aio.c:1298
 read_events fs/aio.c:1334 [inline]
 do_io_getevents+0x11b/0x450 fs/aio.c:2230
 __do_sys_io_getevents fs/aio.c:2263 [inline]
 __se_sys_io_getevents fs/aio.c:2251 [inline]
 __x64_sys_io_getevents+0xca/0x190 fs/aio.c:2251
 x64_sys_call+0x2cfe/0x3020 arch/x86/include/generated/asm/syscalls_64.h:209
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00000003 -> 0x00000008

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 10186 Comm: syz.7.1697 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
==================================================================


Reply:
The data race occurs between `aio_complete()` writing to `ring->tail` and `aio_read_events_ring()` reading from it. This is an intentional lockless producer-consumer synchronization pattern used in the Linux AIO (Asynchronous I/O) implementation.

In `aio_complete()` (the producer):
1. The completion event is written to the ring buffer.
2. `smp_wmb()` is called to ensure the event data is visible to other CPUs before the `tail` pointer is updated.
3. `ring->tail` is updated to signal that a new event is available.

In `aio_read_events_ring()` (the consumer):
1. `ring->tail` is read to determine how many events are available.
2. `smp_rmb()` is called to ensure that subsequent reads of the event data happen after the `tail` pointer has been read.
3. The event data is copied to the user-provided buffer.

This pattern ensures that if the consumer sees an updated `tail` pointer, it is guaranteed to see the corresponding event data. The race on the `tail` pointer itself is benign: if the consumer reads a stale value, it will simply process fewer events and catch the new ones in a subsequent call. On x86_64, 32-bit aligned accesses (like this `unsigned int`) are atomic, preventing "torn" reads/writes.

The lack of `READ_ONCE()` and `WRITE_ONCE()` annotations is what triggers the KCSAN report, but the algorithm is fundamentally sound and robust against the observed concurrency. Similar intentional races on the `head` pointer are explicitly documented as safe in the same file (see comments in `user_refill_reqs_available()`).

Conclusion: The data race is benign.

5/2 2026/02/24 18:29 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 3877
	output: 78
	thoughts: 128

Thoughts:
**Analyzing a Data Race in the AIO Subsystem**

Okay, here's what's hitting me. This report is pointing at a potential data race within the asynchronous I/O (AIO) subsystem, specifically between `aio_complete` and `aio_read_events_ring`. We're talking about a conflict on memory address `0xffff88813ae2800c`. The write operation, the suspect in this scenario, originates from `aio_complete` deep down in `iocb_put`, which is triggered during a `io_submit_one` call – classic AIO flow. The read, the other side of this race, is occurring in `aio_read_events_ring`, which is called through `aio_read_events`, implying a potential concurrency issue where a completion callback could be modifying data while event retrieval is in progress.

My immediate reaction is to dive into the `fs/aio.c` code at those specific line numbers. Line 1162 in `aio_complete` is where the write is happening. Line 1238 in `aio_read_events_ring` is where the read occurs. This is likely related to the management of completion queues or event structures. I need to understand how these two functions interact with that specific memory address to confirm the race condition and then pinpoint the exact cause. Is it a missing lock, incorrect memory barriers, or something else entirely? Time to get the debugger out.



6/2 2026/02/24 18:29 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/aio.c IncludeLines:true Name:aio_complete]

Results:
map[SourceCode:1118:	/* aio_complete
1119:	 *	Called when the io request on the given iocb is complete.
1120:	 */
1121:	static void aio_complete(struct aio_kiocb *iocb)
1122:	{
1123:		struct kioctx	*ctx = iocb->ki_ctx;
1124:		struct aio_ring	*ring;
1125:		struct io_event	*ev_page, *event;
1126:		unsigned tail, pos, head, avail;
1127:		unsigned long	flags;
1128:	
1129:		/*
1130:		 * Add a completion event to the ring buffer. Must be done holding
1131:		 * ctx->completion_lock to prevent other code from messing with the tail
1132:		 * pointer since we might be called from irq context.
1133:		 */
1134:		spin_lock_irqsave(&ctx->completion_lock, flags);
1135:	
1136:		tail = ctx->tail;
1137:		pos = tail + AIO_EVENTS_OFFSET;
1138:	
1139:		if (++tail >= ctx->nr_events)
1140:			tail = 0;
1141:	
1142:		ev_page = folio_address(ctx->ring_folios[pos / AIO_EVENTS_PER_PAGE]);
1143:		event = ev_page + pos % AIO_EVENTS_PER_PAGE;
1144:	
1145:		*event = iocb->ki_res;
1146:	
1147:		flush_dcache_folio(ctx->ring_folios[pos / AIO_EVENTS_PER_PAGE]);
1148:	
1149:		pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb,
1150:			 (void __user *)(unsigned long)iocb->ki_res.obj,
1151:			 iocb->ki_res.data, iocb->ki_res.res, iocb->ki_res.res2);
1152:	
1153:		/* after flagging the request as done, we
1154:		 * must never even look at it again
1155:		 */
1156:		smp_wmb();	/* make event visible before updating tail */
1157:	
1158:		ctx->tail = tail;
1159:	
1160:		ring = folio_address(ctx->ring_folios[0]);
1161:		head = ring->head;
1162:		ring->tail = tail;
1163:		flush_dcache_folio(ctx->ring_folios[0]);
1164:	
1165:		ctx->completed_events++;
1166:		if (ctx->completed_events > 1)
1167:			refill_reqs_available(ctx, head, tail);
1168:	
1169:		avail = tail > head
1170:			? tail - head
1171:			: tail + ctx->nr_events - head;
1172:		spin_unlock_irqrestore(&ctx->completion_lock, flags);
1173:	
1174:		pr_debug("added to ring %p at [%u]\n", iocb, tail);
1175:	
1176:		/*
1177:		 * Check if the user asked us to deliver the result through an
1178:		 * eventfd. The eventfd_signal() function is safe to be called
1179:		 * from IRQ context.
1180:		 */
1181:		if (iocb->ki_eventfd)
1182:			eventfd_signal(iocb->ki_eventfd);
1183:	
1184:		/*
1185:		 * We have to order our ring_info tail store above and test
1186:		 * of the wait list below outside the wait lock.  This is
1187:		 * like in wake_up_bit() where clearing a bit has to be
1188:		 * ordered with the unlocked test.
1189:		 */
1190:		smp_mb();
1191:	
1192:		if (waitqueue_active(&ctx->wait)) {
1193:			struct aio_waiter *curr, *next;
1194:			unsigned long flags;
1195:	
1196:			spin_lock_irqsave(&ctx->wait.lock, flags);
1197:			list_for_each_entry_safe(curr, next, &ctx->wait.head, w.entry)
1198:				if (avail >= curr->min_nr) {
1199:					wake_up_process(curr->w.private);
1200:					list_del_init_careful(&curr->w.entry);
1201:				}
1202:			spin_unlock_irqrestore(&ctx->wait.lock, flags);
1203:		}
1204:	}
 SourceFile:fs/aio.c]

7/2 2026/02/24 18:29 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/aio.c IncludeLines:true Name:aio_read_events_ring]

Results:
map[SourceCode:1214:	/* aio_read_events_ring
1215:	 *	Pull an event off of the ioctx's event ring.  Returns the number of
1216:	 *	events fetched
1217:	 */
1218:	static long aio_read_events_ring(struct kioctx *ctx,
1219:					 struct io_event __user *event, long nr)
1220:	{
1221:		struct aio_ring *ring;
1222:		unsigned head, tail, pos;
1223:		long ret = 0;
1224:		int copy_ret;
1225:	
1226:		/*
1227:		 * The mutex can block and wake us up and that will cause
1228:		 * wait_event_interruptible_hrtimeout() to schedule without sleeping
1229:		 * and repeat. This should be rare enough that it doesn't cause
1230:		 * peformance issues. See the comment in read_events() for more detail.
1231:		 */
1232:		sched_annotate_sleep();
1233:		mutex_lock(&ctx->ring_lock);
1234:	
1235:		/* Access to ->ring_folios here is protected by ctx->ring_lock. */
1236:		ring = folio_address(ctx->ring_folios[0]);
1237:		head = ring->head;
1238:		tail = ring->tail;
1239:	
1240:		/*
1241:		 * Ensure that once we've read the current tail pointer, that
1242:		 * we also see the events that were stored up to the tail.
1243:		 */
1244:		smp_rmb();
1245:	
1246:		pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
1247:	
1248:		if (head == tail)
1249:			goto out;
1250:	
1251:		head %= ctx->nr_events;
1252:		tail %= ctx->nr_events;
1253:	
1254:		while (ret < nr) {
1255:			long avail;
1256:			struct io_event *ev;
1257:			struct folio *folio;
1258:	
1259:			avail = (head <= tail ?  tail : ctx->nr_events) - head;
1260:			if (head == tail)
1261:				break;
1262:	
1263:			pos = head + AIO_EVENTS_OFFSET;
1264:			folio = ctx->ring_folios[pos / AIO_EVENTS_PER_PAGE];
1265:			pos %= AIO_EVENTS_PER_PAGE;
1266:	
1267:			avail = min(avail, nr - ret);
1268:			avail = min_t(long, avail, AIO_EVENTS_PER_PAGE - pos);
1269:	
1270:			ev = folio_address(folio);
1271:			copy_ret = copy_to_user(event + ret, ev + pos,
1272:						sizeof(*ev) * avail);
1273:	
1274:			if (unlikely(copy_ret)) {
1275:				ret = -EFAULT;
1276:				goto out;
1277:			}
1278:	
1279:			ret += avail;
1280:			head += avail;
1281:			head %= ctx->nr_events;
1282:		}
1283:	
1284:		ring = folio_address(ctx->ring_folios[0]);
1285:		ring->head = head;
1286:		flush_dcache_folio(ctx->ring_folios[0]);
1287:	
1288:		pr_debug("%li  h%u t%u\n", ret, head, tail);
1289:	out:
1290:		mutex_unlock(&ctx->ring_lock);
1291:	
1292:		return ret;
1293:	}
 SourceFile:fs/aio.c]

8/2 2026/02/24 18:29 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 6932
	output: 37
	thoughts: 508

Thoughts:
**Analyzing the `ring->tail` Race Condition**

Okay, so we have a race condition on `ring->tail`. Let's break this down from the core. In `aio_complete`, we're *writing* to `ring->tail`:

```c
1160:		ring = folio_address(ctx->ring_folios[0]);
1161:		head = ring->head;
1162:		ring->tail = tail;
```

This write is protected by `ctx->completion_lock` (a spinlock), which should be fast, but we need to verify we have the right lock, as there may be lock hierarchy issues.

Now, over in `aio_read_events_ring`, we're *reading* from `ring->tail`:

```c
1236:		ring = folio_address(ctx->ring_folios[0]);
1237:		head = ring->head;
1238:		tail = ring->tail;
```

This read is protected by `ctx->ring_lock`, which is a mutex. Okay, so these are *different* locks, which makes it interesting. The use of a spinlock for writers and a mutex for readers is often an indication of high contention writers. So, let's step back and consider: are these the *only* synchronization mechanisms involved here?

Looking closer at `aio_complete` again, specifically the relevant section:

```c
1156:		smp_wmb();	/* make event visible before updating tail */
1157:	
1158:		ctx->tail = tail;
1159:	
1160:		ring = folio_address(ctx->ring_folios[0]);
1161:		head = ring->head;
1162:		ring->tail = tail;
```

Ah, there's `smp_wmb()`. That's a write memory barrier. It's meant to ensure that the event data is written *before* the `tail` is updated, which is crucial for data consistency.

Now, in `aio_read_events_ring`:

```c
1238:		tail = ring->tail;
1239:	
1240:		/*
1241:		 * Ensure that once we've read the current tail pointer, that
1242:		 * we also see the events that were stored up to the tail.
1243:		 */
1244:		smp_rmb();
```

And there's `smp_rmb()`. That's a read memory barrier. It's paired with the write memory barrier in `aio_complete`, and it's intended to ensure that, once we read `tail`, we also see all the events that were written up to that point. So the barriers are synchronizing the *event data* with the `tail` updates, as the comment correctly indicates.

The use of `ring` suggests it is shared with userspace, which has further implications for how data is shared and how to handle it. So we need to carefully examine `struct aio_ring` now. This is where things get interesting!



9/2 2026/02/24 18:29 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/aio.c IncludeLines:true Name:aio_ring]

Results:
map[SourceCode:  56:	struct aio_ring {
  57:		unsigned	id;	/* kernel internal index number */
  58:		unsigned	nr;	/* number of io_events */
  59:		unsigned	head;	/* Written to by userland or under ring_lock
  60:					 * mutex by aio_read_events_ring(). */
  61:		unsigned	tail;
  62:	
  63:		unsigned	magic;
  64:		unsigned	compat_features;
  65:		unsigned	incompat_features;
  66:		unsigned	header_length;	/* size of aio_ring */
  67:	
  68:	
  69:		struct io_event		io_events[];
  70:	}; /* 128 bytes + ring size */
 SourceFile:fs/aio.c]

10/2 2026/02/24 18:29 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 8430
	output: 41
	thoughts: 1884

Thoughts:
**Analyzing the Asynchronous I/O Event Ring Data Race**

Okay, so I'm looking at this asynchronous I/O (AIO) event ring implementation, and KCSAN flagged a potential data race. My mind is immediately drawn to the `tail` and `head` fields. I see that the kernel updates `tail` in `aio_complete`, while the kernel or userland updates `head`. The kernel also reads `tail` in `aio_read_events_ring`.

The primary concern is the potential race between `aio_complete` writing `ring->tail` and `aio_read_events_ring` reading it. Is this benign? It's an `unsigned int` so should be a 32-bit atomic operation on x86_64, which is good. The code uses `smp_wmb()` and `smp_rmb()` to make the event data visible before `ring->tail` is updated, and the event data readable after. So, conceptually, it's a lockless producer-consumer pattern, and those memory barriers are supposed to handle it. However, KCSAN's flagging it as a race because it's a plain access.

I now remember that `ring->tail` is in shared memory with userspace. This means that userspace is also reading and writing `ring->tail`, as well as writing `ring->head`. The kernel has to handle this shared-memory access very carefully.

Specifically, in `aio_read_events_ring`, the kernel is reading `ring->tail` without the protection of the lock that `aio_complete` holds. `aio_complete` uses `ctx->completion_lock`, while `aio_read_events_ring` takes `ctx->ring_lock`. If `aio_read_events_ring` reads a stale value for `tail`, it might miss some events, but that's usually not a big deal since it will see them the next time. If it were a "torn" read on an architecture with non-atomic 32-bit reads, it *could* be bad, but x86_64 is fine.

The value change from `0x3` to `0x8` is multi-bit, so no chance of torn reads.

Let me see where else `ring->tail` is being accessed.

The comment in `aio_ring` says that only the kernel is writing to `tail`, which is consistent with my current understanding. However, the lack of an explicit writer in the comment is slightly concerning, since both the kernel and userspace need to know who updates it.

Since the reads and writes are 4-byte aligned, the read/write of `tail` itself *should* be atomic on x86.

The data race is real in the sense that they are concurrent accesses without explicit synchronization, as KCSAN points out.

Is it harmful? If `aio_read_events_ring` reads an older value of `tail`, it won't see the new events. If it reads the new value, the `smp_rmb()` ensures it also sees the event data.

However, I now see something else. `aio_complete` first updates `ctx->tail`, then `ring->tail`. The former is internal to the kernel, and the latter is shared. Then `aio_read_events_ring` uses the `smp_rmb()` after reading the value of tail. This design looks like a standard lockless ring buffer, and the "race" is intentional.

The compiler optimizations could cause problems, and using `READ_ONCE` and `WRITE_ONCE` could prevent those. But, is it going to cause a crash? Probably not, if the compiler doesn't do anything too crazy.

I now see the write to `ring->head` in `aio_read_events_ring`. And also `aio_complete` reads `ring->head`. This is another potential race.

Can `aio_complete` be called concurrently for the same `ctx`? Yes, because multiple I/Os can complete concurrently. However, I see that `aio_complete` uses `ctx->completion_lock` and `aio_read_events_ring` uses `ctx->ring_lock`. This *does* serialize updates to `ring->tail` (via `completion_lock`) and `ring->head` (via `ring_lock`).

Here's the problem. It is definitely a lockless design. And those head and tail pointers are also accessed by userspace.

If userspace can change `ring->head`, the kernel must be extremely careful when reading it. `aio_complete` reads `ring->head` without the `ring_lock`. In `aio_complete`, the value of `head`


11/2 2026/02/24 18:30 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/aio.c IncludeLines:true Name:refill_reqs_available]

Results:
map[SourceCode: 978:	/* refill_reqs_available
 979:	 *	Updates the reqs_available reference counts used for tracking the
 980:	 *	number of free slots in the completion ring.  This can be called
 981:	 *	from aio_complete() (to optimistically update reqs_available) or
 982:	 *	from aio_get_req() (the we're out of events case).  It must be
 983:	 *	called holding ctx->completion_lock.
 984:	 */
 985:	static void refill_reqs_available(struct kioctx *ctx, unsigned head,
 986:	                                  unsigned tail)
 987:	{
 988:		unsigned events_in_ring, completed;
 989:	
 990:		/* Clamp head since userland can write to it. */
 991:		head %= ctx->nr_events;
 992:		if (head <= tail)
 993:			events_in_ring = tail - head;
 994:		else
 995:			events_in_ring = ctx->nr_events - (head - tail);
 996:	
 997:		completed = ctx->completed_events;
 998:		if (events_in_ring < completed)
 999:			completed -= events_in_ring;
1000:		else
1001:			completed = 0;
1002:	
1003:		if (!completed)
1004:			return;
1005:	
1006:		ctx->completed_events -= completed;
1007:		put_reqs_available(ctx, completed);
1008:	}
 SourceFile:fs/aio.c]

12/2 2026/02/24 18:30 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 11862
	output: 18
	thoughts: 868
						

Thoughts:
**Analyzing a Potential Data Race in Asynchronous I/O Event Handling**

Okay, so I'm looking at this code and a KCSAN report. The core of the issue seems to be around `ring->tail` and potential data races. `aio_complete` writes to it, and `aio_read_events_ring` reads it. The report is saying that there is a data race.

Now, as an expert I know that this is tricky. Since it's a 32-bit aligned read/write on x86, we won't get torn values, but we might read an old or new value. If we read the old `tail`, we just think there are no new events, which isn't a *correctness* issue, only a performance one. If we read the new `tail`, we see the new events. The `smp_wmb()` and `smp_rmb()` should handle the memory ordering, so it *should* be fine, right?

But, KCSAN is complaining about the lack of `READ_ONCE`/`WRITE_ONCE`. Hmm... it's nagging me about something else.

Then it hits me: what if the compiler reloads `ring->tail` into the local variable `tail` *after* the `smp_rmb()`?

Looking at the code block in `aio_read_events_ring`, I see the critical bit. We read `tail`, then have the memory barrier `smp_rmb()`. After that, we check if `head == tail`, calculate `tail %= ctx->nr_events`, and calculate `avail` using `tail`. The compiler might optimize and reload `tail` from `ring->tail`. But, if this happens *after* the barrier, the ordering is lost for that second read. This is a problem! We could end up with an updated `tail` value for which the events are *not* guaranteed to be visible, potentially reading uninitialized or outdated event data.

Thinking back, that `tail` variable is a local. The compiler *could* decide to reload it from memory if it thinks it's beneficial. It's not like the value *has* to stay in that register. We use `READ_ONCE` to prevent this kind of optimization.

So, if the compiler reloads the `tail` variable *after* the `smp_rmb()`, we have a problem.

Now, how likely is this to happen? I see that `tail` is used in a few places in the subsequent logic: `head == tail`, `tail %= ctx->nr_events`, and calculating `avail`. The calculation of `avail` relies on the value of `tail`. Even if we're okay with the compiler reloading it, it has to be the latest value when used in the calculation of `avail`. The `smp_rmb()` *should* have ordered the read of events relative to the initial read of `tail`, but a second read could skip that memory ordering. That memory ordering is lost!

So yes, this *is* a potentially harmful race, if the compiler decides to reload `tail`. We use `READ_ONCE` specifically to prevent this.

The more I think about it, the more I'm leaning toward it being important here. `ring->tail` is in a folio address, too. And `tail` is used quite a bit!

I should look for examples of `READ_ONCE` elsewhere in `fs/aio.c` to be sure.



13/2 2026/02/24 18:30 tool grepper
0m Args:
map[Expression:READ_ONCE]

Results:
map[Output:
Full output is too long, showing 500 out of 37872 lines.
Use more precise expression if possible.

[Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=78=lock-acquisition and lock-release functions::
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-85-    6   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:86:    7   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-87-    8   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-93-   14   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:94:   15   r2 = READ_ONCE(z);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-95-   16   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-101-   22   smp_mb();
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:102:   23   r3 = READ_ONCE(x);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-103-   24 }
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=112=The ``smp_mb__after_unlock_lock()`` invocations prevent this
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-133-|  g = get_state_synchronize_rcu()           smp_mb()                   |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:134:|  while (!poll_state_synchronize_rcu(g))    r1 = READ_ONCE(X)          |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-135-|          continue;                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:136:|  r0 = READ_ONCE(Y)                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-137-|                                                                       |
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=198=newly arrived RCU callbacks against future grace periods:
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-213-   12   /* Handle nohz enablement switches conservatively. */
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:214:   13   tne = READ_ONCE(tick_nohz_active);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-215-   14   if (tne != rdp->tick_nohz_enabled_snap) {
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-167-         x="255.60869"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:168:         y="309.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-169-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-189-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:190:         y="384.71124">r2 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-191-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-200-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:201:         y="582.13617">r3 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-202-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-195-         x="255.60869"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:196:         y="297.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-197-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-228-         x="396.10254"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:229:         y="572.13617">r2 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-230-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-463-         x="541.70508"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:464:         y="387.6217">r3 = READ_ONCE(d);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-465-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-509-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:510:         y="461.83929">r4 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-511-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-520-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:521:         y="669.26422">r5 = READ_ONCE(e);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-522-    <text
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=84=overhead to readers, for example:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-92-       5   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:93:       6   r1 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst:94:       7   r2 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-95-       8   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=138=recovery from node failure, more or less as follows:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-153-      12   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:154:      13   state_snap = READ_ONCE(state);
 Documentation/RCU/Design/Requirements/Requirements.rst-155-      14   if (state_snap == STATE_NORMAL)
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=467=resembling the dependency-ordering barrier that was later subsumed
 Documentation/RCU/Design/Requirements/Requirements.rst:468:into rcu_dereference() and later still into READ_ONCE(). The
 Documentation/RCU/Design/Requirements/Requirements.rst-469-need for these operations made itself known quite suddenly at a
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=702=threads:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-718-      13   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:719:      14   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-720-      15   rcu_read_unlock();
 Documentation/RCU/Design/Requirements/Requirements.rst-721-      16   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:722:      17   r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-723-      18   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=737=significant ordering constraints would slow down these fast-path APIs.
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-745-+-----------------------------------------------------------------------+
 Documentation/RCU/Design/Requirements/Requirements.rst:746:| No, the volatile casts in READ_ONCE() and WRITE_ONCE()                |
 Documentation/RCU/Design/Requirements/Requirements.rst-747-| prevent the compiler from reordering in this particular case.         |
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=755=example illustrates this:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-761-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:762:       4   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-763-       5   if (r1) {
 Documentation/RCU/Design/Requirements/Requirements.rst-764-       6     do_something_with_nonzero_x();
 Documentation/RCU/Design/Requirements/Requirements.rst:765:       7     r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-766-       8     WARN_ON(!r2); /* BUG!!! */
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=819=are initially all zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-832-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:833:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-834-      12   synchronize_rcu();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-840-      18   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:841:      19   r2 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:842:      20   r3 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-843-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=862=period is known to end before the second grace period starts:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-875-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:876:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-877-      12   synchronize_rcu();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-882-      17 {
 Documentation/RCU/Design/Requirements/Requirements.rst:883:      18   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-884-      19   synchronize_rcu();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-890-      25   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:891:      26   r3 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:892:      27   r4 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-893-      28   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=920=illustrated by the following, with all variables initially zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-933-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:934:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-935-      12   synchronize_rcu();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-942-      19   WRITE_ONCE(d, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:943:      20   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-944-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-948-      25 {
 Documentation/RCU/Design/Requirements/Requirements.rst:949:      26   r3 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-950-      27   synchronize_rcu();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-956-      33   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:957:      34   r4 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:958:      35   r5 = READ_ONCE(e);
 Documentation/RCU/Design/Requirements/Requirements.rst-959-      36   rcu_read_unlock();
 --
 Documentation/RCU/checklist.rst=12=over a rather long period of time, but improvements are always welcome!
 --
 Documentation/RCU/checklist.rst-363-	time that readers might be accessing that structure.  In such
 Documentation/RCU/checklist.rst:364:	cases, READ_ONCE() may be used in place of rcu_dereference()
 Documentation/RCU/checklist.rst-365-	and the read-side markers (rcu_read_lock() and rcu_read_unlock(),
 --
 Documentation/RCU/listRCU.rst=145=has become list_for_each_entry_rcu().  The **_rcu()** list-traversal
 Documentation/RCU/listRCU.rst:146:primitives add READ_ONCE() and diagnostic checks for incorrect use
 Documentation/RCU/listRCU.rst-147-outside of an RCU read-side critical section.
 --
 Documentation/RCU/rcu_dereference.rst=25=readers working properly:
 --
 Documentation/RCU/rcu_dereference.rst-35-	return data preceding initialization that preceded the store
 Documentation/RCU/rcu_dereference.rst:36:	of the pointer.  (As noted later, in recent kernels READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-37-	also prevents DEC Alpha from playing these tricks.)
 --
 Documentation/RCU/rcu_dereference.rst-45--	In the special case where data is added but is never removed
 Documentation/RCU/rcu_dereference.rst:46:	while readers are accessing the structure, READ_ONCE() may be used
 Documentation/RCU/rcu_dereference.rst:47:	instead of rcu_dereference().  In this case, use of READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-48-	takes on the role of the lockless_dereference() primitive that
 --
 Documentation/RCU/whatisRCU.rst=687=don't forget about them when submitting patches making use of RCU!]::
 --
 Documentation/RCU/whatisRCU.rst-695-	({ \
 Documentation/RCU/whatisRCU.rst:696:		typeof(p) _________p1 = READ_ONCE(p); \
 Documentation/RCU/whatisRCU.rst-697-		(_________p1); \
 --
 Documentation/atomic_t.txt=82=The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
 Documentation/atomic_t.txt:83:implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
 Documentation/atomic_t.txt-84-smp_store_release() respectively. Therefore, if you find yourself only using
 --
 Documentation/atomic_t.txt=119=with a lock:
 --
 Documentation/atomic_t.txt-124-    lock();
 Documentation/atomic_t.txt:125:    ret = READ_ONCE(v->counter); // == 1
 Documentation/atomic_t.txt-126-						atomic_set(v, 0);
 --
 Documentation/atomic_t.txt=234=strictly stronger than ACQUIRE. As illustrated:
 --
 Documentation/atomic_t.txt-242-  {
 Documentation/atomic_t.txt:243:    r0 = READ_ONCE(*x);
 Documentation/atomic_t.txt-244-    smp_rmb();
 --
 Documentation/core-api/circular-buffers.rst=154=The producer will look something like this::
 --
 Documentation/core-api/circular-buffers.rst-159-	/* The spin_unlock() and next spin_lock() provide needed ordering. */
 Documentation/core-api/circular-buffers.rst:160:	unsigned long tail = READ_ONCE(buffer->tail);
 Documentation/core-api/circular-buffers.rst-161-
 --
 Documentation/core-api/circular-buffers.rst=219=before it writes the new tail pointer, which will erase the item.
 Documentation/core-api/circular-buffers.rst-220-
 Documentation/core-api/circular-buffers.rst:221:Note the use of READ_ONCE() and smp_load_acquire() to read the
 Documentation/core-api/circular-buffers.rst-222-opposition index.  This prevents the compiler from discarding and
 --
 Documentation/core-api/errseq.rst=144=errseq_check_and_advance after taking the lock. e.g.::
 Documentation/core-api/errseq.rst-145-
 Documentation/core-api/errseq.rst:146:        if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) {
 Documentation/core-api/errseq.rst-147-                /* su.s_wd_err is protected by s_wd_err_lock */
 --
 Documentation/core-api/refcount-vs-atomic.rst=37=are executed in program order on a single CPU.
 Documentation/core-api/refcount-vs-atomic.rst:38:This is implemented using READ_ONCE()/WRITE_ONCE() and
 Documentation/core-api/refcount-vs-atomic.rst-39-compare-and-swap primitives.
 --
 Documentation/dev-tools/kcsan.rst=214=and if that code is free from data races.
 Documentation/dev-tools/kcsan.rst-215-
 Documentation/dev-tools/kcsan.rst:216:KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
 Documentation/dev-tools/kcsan.rst-217-``atomic_*``, etc.), and a subset of ordering guarantees implied by memory
 --
 Documentation/dev-tools/kcsan.rst=297=barrier. Consider the example::
 --
 Documentation/dev-tools/kcsan.rst-306-    {
 Documentation/dev-tools/kcsan.rst:307:        while (!READ_ONCE(flag));   // correct: smp_load_acquire(&flag)
 Documentation/dev-tools/kcsan.rst-308-        ... = x;                    // data race!
 --
 Documentation/driver-api/surface_aggregator/internal.rst=270=submission, i.e. cancellation, can not rely on the ``ptl`` reference to be
 Documentation/driver-api/surface_aggregator/internal.rst:271:set. Access to it in these functions is guarded by ``READ_ONCE()``, whereas
 Documentation/driver-api/surface_aggregator/internal.rst-272-setting ``ptl`` is equally guarded with ``WRITE_ONCE()`` for symmetry.
 --
 Documentation/driver-api/surface_aggregator/internal.rst=275=them, specifically priority and state for tracing. In those cases, proper
 Documentation/driver-api/surface_aggregator/internal.rst:276:access is ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``. Such
 Documentation/driver-api/surface_aggregator/internal.rst-277-read-only access is only allowed when stale values are not critical.
 --
 Documentation/driver-api/surface_aggregator/internal.rst=451=them, specifically the state for tracing. In those cases, proper access is
 Documentation/driver-api/surface_aggregator/internal.rst:452:ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``. Such read-only
 Documentation/driver-api/surface_aggregator/internal.rst-453-access is only allowed when stale values are not critical.
 --
 Documentation/driver-api/surface_aggregator/internal.rst=572=invalid usages, but rather aim to help catch them. In those cases, proper
 Documentation/driver-api/surface_aggregator/internal.rst:573:variable access is ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``.
 Documentation/driver-api/surface_aggregator/internal.rst-574-
 --
 Documentation/filesystems/path-lookup.rst=896=similar.
 Documentation/filesystems/path-lookup.rst-897-
 Documentation/filesystems/path-lookup.rst:898:.. _READ_ONCE: https://lwn.net/Articles/624126/
 Documentation/filesystems/path-lookup.rst-899-
 --
 Documentation/filesystems/path-lookup.rst=904=when accessing fields in the dentry.  This "extra care" typically
 Documentation/filesystems/path-lookup.rst:905:involves using  `READ_ONCE() <READ_ONCE_>`_ to access fields, and verifying the
 Documentation/filesystems/path-lookup.rst-906-result is not NULL before using it.  This pattern can be seen in
 --
 Documentation/gpu/todo.rst=932=struct drm_sched_rq is read at many places without any locks, not even with a
 Documentation/gpu/todo.rst:933:READ_ONCE. At XDC 2025 no one could really tell why that is the case, whether
 Documentation/gpu/todo.rst-934-locks are needed and whether they could be added. (But for real, that should
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus=14=P0(int *x, atomic_t *y)
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-18-
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus:19:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-20-	smp_rmb();
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus=12=P0(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-19-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:20:	r0 = READ_ONCE(*y);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-21-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus=23=P1(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-30-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:31:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-32-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus=20=P1(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-26-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus:27:	r2 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-28-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus=13=P0(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-19-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:20:	r0 = READ_ONCE(*y);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-21-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus=23=P1(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-29-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:30:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-31-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus=21=P1(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-26-	r1 = cmpxchg(y, 0, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus:27:	r2 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-28-}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus=15=P0(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-20-
 Documentation/litmus-tests/locking/DCL-broken.litmus:21:	r0 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-22-	if (r0 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus-23-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-broken.litmus:24:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-25-		if (r1 == 0) {
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-30-	}
 Documentation/litmus-tests/locking/DCL-broken.litmus:31:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-broken.litmus-32-}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus=34=P1(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-39-
 Documentation/litmus-tests/locking/DCL-broken.litmus:40:	r0 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-41-	if (r0 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus-42-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-broken.litmus:43:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-44-		if (r1 == 0) {
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-49-	}
 Documentation/litmus-tests/locking/DCL-broken.litmus:50:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-broken.litmus-51-}
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus=16=P0(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-24-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-fixed.litmus:25:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-26-		if (r1 == 0) {
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-31-	}
 Documentation/litmus-tests/locking/DCL-fixed.litmus:32:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-33-}
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus=35=P1(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-43-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-fixed.litmus:44:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-45-		if (r1 == 0) {
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-50-	}
 Documentation/litmus-tests/locking/DCL-fixed.litmus:51:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-52-}
 --
 Documentation/litmus-tests/locking/RM-broken.litmus=26=P1(int *x, atomic_t *y, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/RM-broken.litmus-32-	spin_lock(lck);
 Documentation/litmus-tests/locking/RM-broken.litmus:33:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/locking/RM-broken.litmus:34:	r1 = READ_ONCE(*x);
 Documentation/litmus-tests/locking/RM-broken.litmus-35-	r2 = atomic_inc_return(y);
 --
 Documentation/litmus-tests/locking/RM-fixed.litmus=26=P1(int *x, atomic_t *y, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/RM-fixed.litmus-31-
 Documentation/litmus-tests/locking/RM-fixed.litmus:32:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/locking/RM-fixed.litmus:33:	r1 = READ_ONCE(*x);
 Documentation/litmus-tests/locking/RM-fixed.litmus-34-	spin_lock(lck);
 --
 Documentation/litmus-tests/rcu/RCU+sync+free.litmus=24=P0(int *x, int *z, int **y)
 --
 Documentation/litmus-tests/rcu/RCU+sync+free.litmus-30-	r0 = rcu_dereference(*y);
 Documentation/litmus-tests/rcu/RCU+sync+free.litmus:31:	r1 = READ_ONCE(*r0);
 Documentation/litmus-tests/rcu/RCU+sync+free.litmus-32-	rcu_read_unlock();
 --
 Documentation/litmus-tests/rcu/RCU+sync+read.litmus=27=P1(int *x, int *y)
 --
 Documentation/litmus-tests/rcu/RCU+sync+read.litmus-31-
 Documentation/litmus-tests/rcu/RCU+sync+read.litmus:32:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/rcu/RCU+sync+read.litmus-33-	synchronize_rcu();
 Documentation/litmus-tests/rcu/RCU+sync+read.litmus:34:	r1 = READ_ONCE(*y);
 Documentation/litmus-tests/rcu/RCU+sync+read.litmus-35-}
 --
 Documentation/livepatch/callbacks.rst=119=so that tcp_send_challenge_ack() could first be changed to read
 Documentation/livepatch/callbacks.rst:120:sysctl_tcp_challenge_ack_limit with READ_ONCE.
 Documentation/livepatch/callbacks.rst-121-
 --
 Documentation/memory-barriers.txt=226=There are some minimal guarantees that may be expected of a CPU:
 --
 Documentation/memory-barriers.txt-230-
 Documentation/memory-barriers.txt:231:	Q = READ_ONCE(P); D = READ_ONCE(*Q);
 Documentation/memory-barriers.txt-232-
 --
 Documentation/memory-barriers.txt-236-
 Documentation/memory-barriers.txt:237:     and always in that order.  However, on DEC Alpha, READ_ONCE() also
 Documentation/memory-barriers.txt-238-     emits a memory-barrier instruction, so that a DEC Alpha CPU will
 --
 Documentation/memory-barriers.txt-242-
 Documentation/memory-barriers.txt:243:     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
 Documentation/memory-barriers.txt-244-     mischief.
 --
 Documentation/memory-barriers.txt-248-
 Documentation/memory-barriers.txt:249:	a = READ_ONCE(*X); WRITE_ONCE(*X, b);
 Documentation/memory-barriers.txt-250-
 --
 Documentation/memory-barriers.txt-256-
 Documentation/memory-barriers.txt:257:	WRITE_ONCE(*X, c); d = READ_ONCE(*X);
 Documentation/memory-barriers.txt-258-
 --
 Documentation/memory-barriers.txt=266=And there are a number of things that _must_ or _must_not_ be assumed:
 --
 Documentation/memory-barriers.txt-268- (*) It _must_not_ be assumed that the compiler will do what you want
 Documentation/memory-barriers.txt:269:     with memory references that are not protected by READ_ONCE() and
 Documentation/memory-barriers.txt-270-     WRITE_ONCE().  Without them, the compiler is within its rights to
 --
 Documentation/memory-barriers.txt=377=Memory barriers come in four basic varieties:
 --
 Documentation/memory-barriers.txt-438-     dependency barriers.  Nowadays, APIs for marking loads from shared
 Documentation/memory-barriers.txt:439:     variables such as READ_ONCE() and rcu_dereference() provide implicit
 Documentation/memory-barriers.txt-440-     address-dependency barriers.
 --
 Documentation/memory-barriers.txt=567=see Documentation/RCU/rcu_dereference.rst.
 Documentation/memory-barriers.txt-568-
 Documentation/memory-barriers.txt:569:As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
 Documentation/memory-barriers.txt-570-DEC Alpha, which means that about the only people who need to pay attention
 Documentation/memory-barriers.txt=571=to this section are those working on DEC Alpha architecture-specific code
 Documentation/memory-barriers.txt:572:and those working on READ_ONCE() itself.  For those who need it, and for
 Documentation/memory-barriers.txt-573-those who are interested in the history, here is the story of
 --
 Documentation/memory-barriers.txt=582=following sequence of events:
 --
 Documentation/memory-barriers.txt-589-	WRITE_ONCE(P, &B);
 Documentation/memory-barriers.txt:590:			      Q = READ_ONCE_OLD(P);
 Documentation/memory-barriers.txt-591-			      D = *Q;
 Documentation/memory-barriers.txt-592-
 Documentation/memory-barriers.txt:593:[!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernel, which
 Documentation/memory-barriers.txt-594-doesn't imply an address-dependency barrier.
 --
 Documentation/memory-barriers.txt=609=Alpha).
 Documentation/memory-barriers.txt-610-
 Documentation/memory-barriers.txt:611:To deal with this, READ_ONCE() provides an implicit address-dependency barrier
 Documentation/memory-barriers.txt-612-since kernel release v4.15:
 --
 Documentation/memory-barriers.txt-619-	WRITE_ONCE(P, &B);
 Documentation/memory-barriers.txt:620:			      Q = READ_ONCE(P);
 Documentation/memory-barriers.txt-621-			      <implicit address-dependency barrier>
 --
 Documentation/memory-barriers.txt=644=dependencies in a great many highly creative ways.
 --
 Documentation/memory-barriers.txt-651-	WRITE_ONCE(P, &B);
 Documentation/memory-barriers.txt:652:			      Q = READ_ONCE_OLD(P);
 Documentation/memory-barriers.txt-653-			      WRITE_ONCE(*Q, 5);
 --
 Documentation/memory-barriers.txt=656=Q with the store into *Q.  In other words, this outcome is prohibited,
 Documentation/memory-barriers.txt:657:even without an implicit address-dependency barrier of modern READ_ONCE():
 Documentation/memory-barriers.txt-658-
 --
 Documentation/memory-barriers.txt=689=Consider the following bit of code:
 Documentation/memory-barriers.txt-690-
 Documentation/memory-barriers.txt:691:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-692-	<implicit address-dependency barrier>
 --
 Documentation/memory-barriers.txt-694-		/* BUG: No address dependency!!! */
 Documentation/memory-barriers.txt:695:		p = READ_ONCE(b);
 Documentation/memory-barriers.txt-696-	}
 --
 Documentation/memory-barriers.txt=702=what's actually required is:
 Documentation/memory-barriers.txt-703-
 Documentation/memory-barriers.txt:704:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-705-	if (q) {
 Documentation/memory-barriers.txt-706-		<read barrier>
 Documentation/memory-barriers.txt:707:		p = READ_ONCE(b);
 Documentation/memory-barriers.txt-708-	}
 --
 Documentation/memory-barriers.txt=711=for load-store control dependencies, as in the following example:
 Documentation/memory-barriers.txt-712-
 Documentation/memory-barriers.txt:713:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-714-	if (q) {
 --
 Documentation/memory-barriers.txt=718=Control dependencies pair normally with other types of barriers.
 Documentation/memory-barriers.txt:719:That said, please note that neither READ_ONCE() nor WRITE_ONCE()
 Documentation/memory-barriers.txt:720:are optional! Without the READ_ONCE(), the compiler might combine the
 Documentation/memory-barriers.txt-721-load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
 --
 Documentation/memory-barriers.txt=728=as follows:
 --
 Documentation/memory-barriers.txt-732-
 Documentation/memory-barriers.txt:733:So don't leave out the READ_ONCE().
 Documentation/memory-barriers.txt-734-
 --
 Documentation/memory-barriers.txt=736=branches of the "if" statement as follows:
 Documentation/memory-barriers.txt-737-
 Documentation/memory-barriers.txt:738:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-739-	if (q) {
 --
 Documentation/memory-barriers.txt=750=optimization levels:
 Documentation/memory-barriers.txt-751-
 Documentation/memory-barriers.txt:752:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-753-	barrier();
 --
 Documentation/memory-barriers.txt=768=memory barriers, for example, smp_store_release():
 Documentation/memory-barriers.txt-769-
 Documentation/memory-barriers.txt:770:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-771-	if (q) {
 --
 Documentation/memory-barriers.txt=780=ordering is guaranteed only when the stores differ, for example:
 Documentation/memory-barriers.txt-781-
 Documentation/memory-barriers.txt:782:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-783-	if (q) {
 --
 Documentation/memory-barriers.txt-790-
 Documentation/memory-barriers.txt:791:The initial READ_ONCE() is still required to prevent the compiler from
 Documentation/memory-barriers.txt-792-proving the value of 'a'.
 --
 Documentation/memory-barriers.txt=796=the needed conditional.  For example:
 Documentation/memory-barriers.txt-797-
 Documentation/memory-barriers.txt:798:	q = READ_ONCE(a);
 Documentation/memory-barriers.txt-799-	if (q % MAX) {
]
]

14/2 2026/02/24 18:30 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 24627
output: 23
thoughts: 14

Thoughts:
**Post-Hoc Regret: Targeted Grepping**

Looking back at that debugging session, I could have reached the root cause faster. The problem was clearly centered on asynchronous I/O, so I should have restricted my `grep` searches to `fs/aio.c` from the outset; that is where the relevant code lives, and instead I spent a good chunk of time sifting through irrelevant output. My initial approach was too broad-brush: narrowing the scope up front would have been more efficient, less noisy, and less distracting. Next time, I'm starting with pinpoint accuracy.



15/2 2026/02/24 18:30 tool grepper
0m Args:
map[Expression:READ_ONCE|WRITE_ONCE]

Results:
map[Output:
Full output is too long, showing 500 out of 55723 lines.
Use more precise expression if possible.

[Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=78=lock-acquisition and lock-release functions::
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-84-    5   raw_spin_lock_rcu_node(rnp);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:85:    6   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:86:    7   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-87-    8   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-92-   13   raw_spin_lock_rcu_node(rnp);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:93:   14   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:94:   15   r2 = READ_ONCE(z);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-95-   16   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-99-   20 {
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:100:   21   WRITE_ONCE(z, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-101-   22   smp_mb();
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:102:   23   r3 = READ_ONCE(x);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-103-   24 }
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=112=The ``smp_mb__after_unlock_lock()`` invocations prevent this
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-131-|  ----                                      ----                       |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:132:|  WRITE_ONCE(X, 1)                          WRITE_ONCE(Y, 1)           |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-133-|  g = get_state_synchronize_rcu()           smp_mb()                   |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:134:|  while (!poll_state_synchronize_rcu(g))    r1 = READ_ONCE(X)          |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-135-|          continue;                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:136:|  r0 = READ_ONCE(Y)                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-137-|                                                                       |
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=198=newly arrived RCU callbacks against future grace periods:
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-213-   12   /* Handle nohz enablement switches conservatively. */
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:214:   13   tne = READ_ONCE(tick_nohz_active);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-215-   14   if (tne != rdp->tick_nohz_enabled_snap) {
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-145-         x="112.04738"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:146:         y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-147-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-156-         x="112.04738"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:157:         y="439.13766">WRITE_ONCE(b, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-158-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-167-         x="255.60869"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:168:         y="309.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-169-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-178-         x="255.14423"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:179:         y="520.61786">WRITE_ONCE(c, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-180-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-189-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:190:         y="384.71124">r2 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-191-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-200-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:201:         y="582.13617">r3 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-202-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-173-         x="112.04738"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:174:         y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-175-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-184-         x="112.04738"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:185:         y="487.13766">WRITE_ONCE(b, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-186-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-195-         x="255.60869"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:196:         y="297.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-197-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-206-         x="255.14423"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:207:         y="554.61786">WRITE_ONCE(c, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-208-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-217-         x="396.10254"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:218:         y="370.71124">WRITE_ONCE(d, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-219-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-228-         x="396.10254"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:229:         y="572.13617">r2 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-230-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-463-         x="541.70508"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:464:         y="387.6217">r3 = READ_ONCE(d);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-465-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-474-         x="541.2406"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:475:         y="646.94611">WRITE_ONCE(e, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-476-    <path
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-509-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:510:         y="461.83929">r4 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-511-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-520-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:521:         y="669.26422">r5 = READ_ONCE(e);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-522-    <text
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=84=overhead to readers, for example:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-92-       5   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:93:       6   r1 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst:94:       7   r2 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-95-       8   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-99-      12 {
 Documentation/RCU/Design/Requirements/Requirements.rst:100:      13   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-101-      14   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:102:      15   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-103-      16 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=138=recovery from node failure, more or less as follows:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-153-      12   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:154:      13   state_snap = READ_ONCE(state);
 Documentation/RCU/Design/Requirements/Requirements.rst-155-      14   if (state_snap == STATE_NORMAL)
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-163-      22 {
 Documentation/RCU/Design/Requirements/Requirements.rst:164:      23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
 Documentation/RCU/Design/Requirements/Requirements.rst-165-      24   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:166:      25   WRITE_ONCE(state, STATE_RECOVERING);
 Documentation/RCU/Design/Requirements/Requirements.rst-167-      26   recovery();
 Documentation/RCU/Design/Requirements/Requirements.rst:168:      27   WRITE_ONCE(state, STATE_WANT_NORMAL);
 Documentation/RCU/Design/Requirements/Requirements.rst-169-      28   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:170:      29   WRITE_ONCE(state, STATE_NORMAL);
 Documentation/RCU/Design/Requirements/Requirements.rst-171-      30 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=467=resembling the dependency-ordering barrier that was later subsumed
 Documentation/RCU/Design/Requirements/Requirements.rst:468:into rcu_dereference() and later still into READ_ONCE(). The
 Documentation/RCU/Design/Requirements/Requirements.rst-469-need for these operations made itself known quite suddenly at a
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=702=threads:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-708-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:709:       4   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-710-       5   rcu_read_unlock();
 Documentation/RCU/Design/Requirements/Requirements.rst-711-       6   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:712:       7   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-713-       8   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-718-      13   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:719:      14   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-720-      15   rcu_read_unlock();
 Documentation/RCU/Design/Requirements/Requirements.rst-721-      16   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:722:      17   r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-723-      18   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=737=significant ordering constraints would slow down these fast-path APIs.
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-745-+-----------------------------------------------------------------------+
 Documentation/RCU/Design/Requirements/Requirements.rst:746:| No, the volatile casts in READ_ONCE() and WRITE_ONCE()                |
 Documentation/RCU/Design/Requirements/Requirements.rst-747-| prevent the compiler from reordering in this particular case.         |
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=755=example illustrates this:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-761-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:762:       4   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-763-       5   if (r1) {
 Documentation/RCU/Design/Requirements/Requirements.rst-764-       6     do_something_with_nonzero_x();
 Documentation/RCU/Design/Requirements/Requirements.rst:765:       7     r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-766-       8     WARN_ON(!r2); /* BUG!!! */
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-773-      15   spin_lock(&my_lock);
 Documentation/RCU/Design/Requirements/Requirements.rst:774:      16   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:775:      17   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-776-      18   spin_unlock(&my_lock);
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=819=are initially all zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-825-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:826:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:827:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-828-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-832-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:833:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-834-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:835:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-836-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-840-      18   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:841:      19   r2 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:842:      20   r3 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-843-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=862=period is known to end before the second grace period starts:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-868-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:869:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:870:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-871-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-875-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:876:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-877-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:878:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-879-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-882-      17 {
 Documentation/RCU/Design/Requirements/Requirements.rst:883:      18   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-884-      19   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:885:      20   WRITE_ONCE(d, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-886-      21 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-890-      25   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:891:      26   r3 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:892:      27   r4 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-893-      28   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=920=illustrated by the following, with all variables initially zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-926-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:927:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:928:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-929-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-933-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:934:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-935-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:936:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-937-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-941-      18   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:942:      19   WRITE_ONCE(d, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:943:      20   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-944-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-948-      25 {
 Documentation/RCU/Design/Requirements/Requirements.rst:949:      26   r3 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-950-      27   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:951:      28   WRITE_ONCE(e, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-952-      29 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-956-      33   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:957:      34   r4 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:958:      35   r5 = READ_ONCE(e);
 Documentation/RCU/Design/Requirements/Requirements.rst-959-      36   rcu_read_unlock();
 --
 Documentation/RCU/checklist.rst=12=over a rather long period of time, but improvements are always welcome!
 --
 Documentation/RCU/checklist.rst-363-	time that readers might be accessing that structure.  In such
 Documentation/RCU/checklist.rst:364:	cases, READ_ONCE() may be used in place of rcu_dereference()
 Documentation/RCU/checklist.rst-365-	and the read-side markers (rcu_read_lock() and rcu_read_unlock(),
 --
 Documentation/RCU/listRCU.rst=145=has become list_for_each_entry_rcu().  The **_rcu()** list-traversal
 Documentation/RCU/listRCU.rst:146:primitives add READ_ONCE() and diagnostic checks for incorrect use
 Documentation/RCU/listRCU.rst-147-outside of an RCU read-side critical section.
 --
 Documentation/RCU/rcu_dereference.rst=25=readers working properly:
 --
 Documentation/RCU/rcu_dereference.rst-35-	return data preceding initialization that preceded the store
 Documentation/RCU/rcu_dereference.rst:36:	of the pointer.  (As noted later, in recent kernels READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-37-	also prevents DEC Alpha from playing these tricks.)
 --
 Documentation/RCU/rcu_dereference.rst-45--	In the special case where data is added but is never removed
 Documentation/RCU/rcu_dereference.rst:46:	while readers are accessing the structure, READ_ONCE() may be used
 Documentation/RCU/rcu_dereference.rst:47:	instead of rcu_dereference().  In this case, use of READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-48-	takes on the role of the lockless_dereference() primitive that
 --
 Documentation/RCU/whatisRCU.rst=687=don't forget about them when submitting patches making use of RCU!]::
 --
 Documentation/RCU/whatisRCU.rst-695-	({ \
 Documentation/RCU/whatisRCU.rst:696:		typeof(p) _________p1 = READ_ONCE(p); \
 Documentation/RCU/whatisRCU.rst-697-		(_________p1); \
 --
 Documentation/atomic_t.txt=82=The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
 Documentation/atomic_t.txt:83:implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
 Documentation/atomic_t.txt-84-smp_store_release() respectively. Therefore, if you find yourself only using
 --
 Documentation/atomic_t.txt=119=with a lock:
 --
 Documentation/atomic_t.txt-124-    lock();
 Documentation/atomic_t.txt:125:    ret = READ_ONCE(v->counter); // == 1
 Documentation/atomic_t.txt-126-						atomic_set(v, 0);
 Documentation/atomic_t.txt:127:    if (ret != u)				  WRITE_ONCE(v->counter, 0);
 Documentation/atomic_t.txt:128:      WRITE_ONCE(v->counter, ret + 1);
 Documentation/atomic_t.txt-129-    unlock();
 --
 Documentation/atomic_t.txt=234=strictly stronger than ACQUIRE. As illustrated:
 --
 Documentation/atomic_t.txt-242-  {
 Documentation/atomic_t.txt:243:    r0 = READ_ONCE(*x);
 Documentation/atomic_t.txt-244-    smp_rmb();
 --
 Documentation/atomic_t.txt-251-    smp_mb__after_atomic();
 Documentation/atomic_t.txt:252:    WRITE_ONCE(*x, 1);
 Documentation/atomic_t.txt-253-  }
 --
 Documentation/atomic_t.txt=260=because it would not order the W part of the RMW against the following
 Documentation/atomic_t.txt:261:WRITE_ONCE.  Thus:
 Documentation/atomic_t.txt-262-
 --
 Documentation/core-api/circular-buffers.rst=154=The producer will look something like this::
 --
 Documentation/core-api/circular-buffers.rst-159-	/* The spin_unlock() and next spin_lock() provide needed ordering. */
 Documentation/core-api/circular-buffers.rst:160:	unsigned long tail = READ_ONCE(buffer->tail);
 Documentation/core-api/circular-buffers.rst-161-
 --
 Documentation/core-api/circular-buffers.rst=219=before it writes the new tail pointer, which will erase the item.
 Documentation/core-api/circular-buffers.rst-220-
 Documentation/core-api/circular-buffers.rst:221:Note the use of READ_ONCE() and smp_load_acquire() to read the
 Documentation/core-api/circular-buffers.rst-222-opposition index.  This prevents the compiler from discarding and
 --
 Documentation/core-api/errseq.rst=144=errseq_check_and_advance after taking the lock. e.g.::
 Documentation/core-api/errseq.rst-145-
 Documentation/core-api/errseq.rst:146:        if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) {
 Documentation/core-api/errseq.rst-147-                /* su.s_wd_err is protected by s_wd_err_lock */
 --
 Documentation/core-api/refcount-vs-atomic.rst=37=are executed in program order on a single CPU.
 Documentation/core-api/refcount-vs-atomic.rst:38:This is implemented using READ_ONCE()/WRITE_ONCE() and
 Documentation/core-api/refcount-vs-atomic.rst-39-compare-and-swap primitives.
 --
 Documentation/dev-tools/kcsan.rst=214=and if that code is free from data races.
 Documentation/dev-tools/kcsan.rst-215-
 Documentation/dev-tools/kcsan.rst:216:KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
 Documentation/dev-tools/kcsan.rst-217-``atomic_*``, etc.), and a subset of ordering guarantees implied by memory
 --
 Documentation/dev-tools/kcsan.rst=297=barrier. Consider the example::
 --
 Documentation/dev-tools/kcsan.rst-302-        x = 1;                  // data race!
 Documentation/dev-tools/kcsan.rst:303:        WRITE_ONCE(flag, 1);    // correct: smp_store_release(&flag, 1)
 Documentation/dev-tools/kcsan.rst-304-    }
 --
 Documentation/dev-tools/kcsan.rst-306-    {
 Documentation/dev-tools/kcsan.rst:307:        while (!READ_ONCE(flag));   // correct: smp_load_acquire(&flag)
 Documentation/dev-tools/kcsan.rst-308-        ... = x;                    // data race!
 --
 Documentation/driver-api/surface_aggregator/internal.rst=270=submission, i.e. cancellation, can not rely on the ``ptl`` reference to be
 Documentation/driver-api/surface_aggregator/internal.rst:271:set. Access to it in these functions is guarded by ``READ_ONCE()``, whereas
 Documentation/driver-api/surface_aggregator/internal.rst:272:setting ``ptl`` is equally guarded with ``WRITE_ONCE()`` for symmetry.
 Documentation/driver-api/surface_aggregator/internal.rst-273-
 --
 Documentation/driver-api/surface_aggregator/internal.rst=275=them, specifically priority and state for tracing. In those cases, proper
 Documentation/driver-api/surface_aggregator/internal.rst:276:access is ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``. Such
 Documentation/driver-api/surface_aggregator/internal.rst-277-read-only access is only allowed when stale values are not critical.
 --
 Documentation/driver-api/surface_aggregator/internal.rst=451=them, specifically the state for tracing. In those cases, proper access is
 Documentation/driver-api/surface_aggregator/internal.rst:452:ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``. Such read-only
 Documentation/driver-api/surface_aggregator/internal.rst-453-access is only allowed when stale values are not critical.
 --
 Documentation/driver-api/surface_aggregator/internal.rst=572=invalid usages, but rather aim to help catch them. In those cases, proper
 Documentation/driver-api/surface_aggregator/internal.rst:573:variable access is ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``.
 Documentation/driver-api/surface_aggregator/internal.rst-574-
 --
 Documentation/filesystems/path-lookup.rst=896=similar.
 Documentation/filesystems/path-lookup.rst-897-
 Documentation/filesystems/path-lookup.rst:898:.. _READ_ONCE: https://lwn.net/Articles/624126/
 Documentation/filesystems/path-lookup.rst-899-
 --
 Documentation/filesystems/path-lookup.rst=904=when accessing fields in the dentry.  This "extra care" typically
 Documentation/filesystems/path-lookup.rst:905:involves using  `READ_ONCE() <READ_ONCE_>`_ to access fields, and verifying the
 Documentation/filesystems/path-lookup.rst-906-result is not NULL before using it.  This pattern can be seen in
 --
 Documentation/gpu/todo.rst=932=struct drm_sched_rq is read at many places without any locks, not even with a
 Documentation/gpu/todo.rst:933:READ_ONCE. At XDC 2025 no one could really tell why that is the case, whether
 Documentation/gpu/todo.rst-934-locks are needed and whether they could be added. (But for real, that should
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus=14=P0(int *x, atomic_t *y)
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-18-
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus:19:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-20-	smp_rmb();
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus=24=P1(int *x, atomic_t *y)
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-27-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus:28:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-29-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus=12=P0(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-16-
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:17:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-18-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-19-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:20:	r0 = READ_ONCE(*y);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-21-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus=23=P1(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-27-
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:28:	WRITE_ONCE(*y, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-29-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-30-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:31:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-32-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus=12=P0(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-15-
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus:16:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-17-	r1 = cmpxchg(y, 0, 1);
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus=20=P1(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-26-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus:27:	r2 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-28-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus=13=P0(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-17-
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:18:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-19-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:20:	r0 = READ_ONCE(*y);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-21-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus=23=P1(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-27-
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:28:	WRITE_ONCE(*y, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-29-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:30:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-31-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus=13=P0(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-16-
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus:17:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-18-	r1 = cmpxchg(y, 0, 1);
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus=21=P1(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-26-	r1 = cmpxchg(y, 0, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus:27:	r2 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-28-}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus=15=P0(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-20-
 Documentation/litmus-tests/locking/DCL-broken.litmus:21:	r0 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-22-	if (r0 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus-23-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-broken.litmus:24:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-25-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus:26:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus:27:			WRITE_ONCE(*flag, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus-28-		}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-30-	}
 Documentation/litmus-tests/locking/DCL-broken.litmus:31:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-broken.litmus-32-}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus=34=P1(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-39-
 Documentation/litmus-tests/locking/DCL-broken.litmus:40:	r0 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-41-	if (r0 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus-42-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-broken.litmus:43:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-44-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus:45:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus:46:			WRITE_ONCE(*flag, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus-47-		}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-49-	}
 Documentation/litmus-tests/locking/DCL-broken.litmus:50:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-broken.litmus-51-}
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus=16=P0(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-24-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-fixed.litmus:25:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-26-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-fixed.litmus:27:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-28-			smp_store_release(flag, 1);
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-31-	}
 Documentation/litmus-tests/locking/DCL-fixed.litmus:32:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-33-}
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus=35=P1(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-43-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-fixed.litmus:44:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-45-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-fixed.litmus:46:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-47-			smp_store_release(flag, 1);
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-50-	}
 Documentation/litmus-tests/locking/DCL-fixed.litmus:51:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-52-}
 --
 Documentation/litmus-tests/locking/RM-broken.litmus=16=P0(int *x, atomic_t *y, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/RM-broken.litmus-21-	r2 = atomic_inc_return(y);
 Documentation/litmus-tests/locking/RM-broken.litmus:22:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/locking/RM-broken.litmus-23-	spin_unlock(lck);
 --
 Documentation/litmus-tests/locking/RM-broken.litmus=26=P1(int *x, atomic_t *y, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/RM-broken.litmus-32-	spin_lock(lck);
 Documentation/litmus-tests/locking/RM-broken.litmus:33:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/locking/RM-broken.litmus:34:	r1 = READ_ONCE(*x);
 Documentation/litmus-tests/locking/RM-broken.litmus-35-	r2 = atomic_inc_return(y);
 --
 Documentation/litmus-tests/locking/RM-fixed.litmus=16=P0(int *x, atomic_t *y, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/RM-fixed.litmus-21-	r2 = atomic_inc_return(y);
 Documentation/litmus-tests/locking/RM-fixed.litmus:22:	WRITE_ONCE(*x, 1);
]
]

16/2 2026/02/24 18:30 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 37229
output: 32
thoughts: 18

Thoughts:
**Focusing the Search**

I'm swimming in irrelevant results again; the Documentation tree is returning far too much noise. What I'm after is specific, and it lives in the `fs` directory, almost certainly `fs/aio.c`, since the report concerns asynchronous I/O. I need to refine the search expression and cut through the generic documentation hits.



17/2 2026/02/24 18:30 tool grepper
0m Args:
map[Expression:(READ_ONCE|WRITE_ONCE).*ring->(head|tail)]

Results:
map[Output:drivers/gpu/drm/i915/gt/intel_ring.h=90=assert_ring_tail_valid(const struct intel_ring *ring, unsigned int tail)
drivers/gpu/drm/i915/gt/intel_ring.h-91-{
drivers/gpu/drm/i915/gt/intel_ring.h:92:	unsigned int head = READ_ONCE(ring->head);
drivers/gpu/drm/i915/gt/intel_ring.h-93-
--
drivers/net/ethernet/actions/owl-emac.c=695=static void owl_emac_tx_complete(struct owl_emac_priv *priv)
--
drivers/net/ethernet/actions/owl-emac.c-732-
drivers/net/ethernet/actions/owl-emac.c:733:			status = READ_ONCE(ring->descs[ring->tail].status);
drivers/net/ethernet/actions/owl-emac.c-734-			dma_rmb(); /* Ensure data has been read before used. */
drivers/net/ethernet/actions/owl-emac.c-735-			status &= ~OWL_EMAC_BIT_TDES0_OWN;
drivers/net/ethernet/actions/owl-emac.c:736:			WRITE_ONCE(ring->descs[ring->tail].status, status);
drivers/net/ethernet/actions/owl-emac.c-737-
]

18/2 2026/02/24 18:30 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 37823
output: 23
thoughts: 4071

Thoughts:
**Analyzing the aio ring buffer data race**

Okay, I'm diving back into `fs/aio.c`, focusing on a potential data race in the AIO ring buffer implementation. My initial concern centers on the accesses to `ring->tail` and `ring->head` in `aio_complete` and `aio_read_events_ring`. These are plain memory accesses, which is the first red flag. Also important is how `aio_read_events_ring` is reached (through `aio_read_events` and `do_io_getevents`), which reveals that `ring` lives in shared memory mapped into userspace.

The code includes `smp_rmb()` to ensure ordering, but the absence of `READ_ONCE`/`WRITE_ONCE` for accessing shared indices is where things get interesting. KCSAN caught this, making it a real race. The question is, is this a benign data race or something more serious?

Considering the kernel's memory model, these indices *should* be accessed using `READ_ONCE`/`WRITE_ONCE` to prevent compiler reordering and reloading. However, many older parts of the kernel don't yet follow this practice. I have to assess whether this is actually causing a problem.

The potential for harm arises if `ring->tail` gets updated "ahead" of the events being written. Examining `aio_complete` shows it writes the event, flushes the dcache, uses `smp_wmb()` for a memory barrier, and then updates `ring->tail`. This sequence is correct on the producer side. The consumer side (`aio_read_events_ring`) correctly implements a memory barrier via `smp_rmb()` after reading `ring->tail`. The only real problem is the lack of `READ_ONCE`/`WRITE_ONCE`.
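
To make the ordering concrete, here is a minimal sketch of the pattern I'm describing. It is not the fs/aio.c code: the `demo_ring` type, its field layout, and the helper names are made up for illustration, and `smp_wmb()`/`smp_rmb()` are the usual kernel barrier primitives.

```c
/* Illustrative only: a lockless producer/consumer ring in the same shape. */
extern void consume_one(int ev);	/* hypothetical payload handler */

struct demo_ring {
	unsigned int	head;		/* consumer index                 */
	unsigned int	tail;		/* producer index (the racy one)  */
	unsigned int	mask;		/* slot count - 1, power of two   */
	int		events[];	/* event payload slots            */
};

/* Producer: write the payload first, then publish the new tail. */
static void demo_publish(struct demo_ring *ring, int ev)
{
	unsigned int tail = ring->tail;		/* producer-private value      */

	ring->events[tail & ring->mask] = ev;	/* 1. store the event          */
	smp_wmb();				/* 2. payload before tail      */
	ring->tail = tail + 1;			/* 3. plain store (KCSAN write) */
}

/* Consumer: snapshot the tail, then read at most that many payloads. */
static void demo_consume(struct demo_ring *ring)
{
	unsigned int head = ring->head;
	unsigned int tail = ring->tail;		/* plain load (KCSAN read)     */

	smp_rmb();				/* tail before payload reads   */
	while (head != tail)
		consume_one(ring->events[head++ & ring->mask]);
	ring->head = head;
}
```

If the consumer's plain load of `tail` is stale, it simply under-counts and returns early; the barrier pairing guarantees that any slot strictly below the observed `tail` holds fully written data, which is exactly the argument being made for the real ring.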

According to the instructions, a "benign" data race is one involving a simple variable like an `unsigned int` without mutual exclusion, whereas a "harmful" race could lead to corruption/crash even with a conservative compiler. If we assume a conservative compiler that *doesn't* reload or split accesses, this race *should* be benign due to the existing memory barriers. However, the kernel's own memory model mandates `READ_ONCE`/`WRITE_ONCE` for such cases.

If the compiler *does* reload the `tail` variable *after* the `smp_rmb()`, then it is possible we could read uninitialized data. Specifically, the consumer's `avail` variable might be larger than what was guaranteed by the memory barrier. This is exactly the scenario that `READ_ONCE` is designed to prevent.
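
Purely as a sketch of what such an annotation would look like (I am not claiming this patch exists in the tree), marking the two index accesses forces a single load and a single untorn store, removing the reload hazard. This reuses the hypothetical `demo_ring` from the previous sketch:

```c
/* Consumer side: take exactly one snapshot of the published tail. */
static unsigned int demo_snapshot_tail(struct demo_ring *ring)
{
	unsigned int tail = READ_ONCE(ring->tail);	/* single load, no reload */

	smp_rmb();		/* every slot below 'tail' is now readable */
	return tail;
}

/* Producer side: publish the new tail with a single, untorn store. */
static void demo_publish_tail(struct demo_ring *ring, unsigned int new_tail)
{
	smp_wmb();				/* payload before index       */
	WRITE_ONCE(ring->tail, new_tail);	/* marked store, KCSAN-clean  */
}
```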

I need to assess if this is a plausible scenario. The use of `tail` in multiple places could work against a reload, but compilers are aggressive. The provided instructions state that such races are benign when assuming a conservative compiler. But the KCSAN reports usually prompt a fix. I have seen the addition of `READ_ONCE`/`WRITE_ONCE` to AIO ring buffers before.

If it *were* harmful, the symptom would be the consumer observing an updated `tail` but copying stale or garbage event data to userspace. That is data corruption from the application's perspective, but it is not memory corruption or a crash in the kernel. By the definitions provided, this data race is "benign" if we assume a conservative compiler, and the conclusion is reinforced by the fact that the code correctly pairs `smp_wmb()` with `smp_rmb()`.

I should also consider other races. The `ring->head` variable is also subject to similar access patterns. If we look at how the aio functions are implemented, they are designed to be lockless and somewhat robust against stale values.

I must classify this as either benign or harmful based on the definitions. While the ideal solution is to add `READ_ONCE`/`WRITE_ONCE`, in the context of this task, the race is Benign.

I have to confirm there is no other way for this to be harmful. The use of `tail` on the `copy_to_user()` path, where a racy value could in principle lead to out-of-bounds reads, is not a problem: the `avail` value derived from `tail` is always capped by `AIO_EVENTS_PER_PAGE - pos`, and the modulo operations on the indices keep them within the ring. Even a compiler reload does not appear able to create a buffer overflow here.
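
A simplified rendering of the clamping I am relying on here; this paraphrases the shape of the logic from memory rather than quoting `aio_read_events_ring()`, so the exact expressions and constants in fs/aio.c differ (`DEMO_EVENTS_PER_PAGE` stands in for `AIO_EVENTS_PER_PAGE`):

```c
#define DEMO_EVENTS_PER_PAGE	128U	/* stand-in for AIO_EVENTS_PER_PAGE */

/*
 * How many events the consumer may copy in one iteration, given a
 * possibly racy 'tail' snapshot.  Whatever value 'tail' has, the result
 * never exceeds the caller's buffer or the end of the current ring page.
 */
static unsigned int demo_clamp_avail(unsigned int head, unsigned int tail,
				     unsigned int nr_events,
				     unsigned int wanted)
{
	unsigned int avail, pos;

	/* published events between head and tail, handling wrap-around */
	avail = (head <= tail ? tail : nr_events) - head;

	pos = head % DEMO_EVENTS_PER_PAGE;	/* offset within the page     */
	if (avail > wanted)
		avail = wanted;			/* caller's remaining space   */
	if (avail > DEMO_EVENTS_PER_PAGE - pos)
		avail = DEMO_EVENTS_PER_PAGE - pos;	/* stop at page end   */

	return avail;
}
```

Even if a reload produced an inconsistent `tail`, the page-boundary clamp is what keeps the subsequent copy inside mapped ring memory.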

I can confirm that the `ring->tail` variable is *only* written by `aio_complete`, which also means it is safe from other kinds of writes that could create a race condition. Thus I am confident that the race is Benign.



19/2 2026/02/24 18:31 tool codesearch-find-references
0m Args:
map[Name:aio_ring::tail]

Results:
map[References:[map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:aio_complete SourceFile:fs/aio.c SourceLine:1162] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:aio_read_events_ring SourceFile:fs/aio.c SourceLine:1238] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:aio_setup_ring SourceFile:fs/aio.c SourceLine:570]] TruncatedOutput:false]

20/2 2026/02/24 18:31 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 43097
output: 38
thoughts: 328

Thoughts:
**Analyzing the AIO Ring Data Race**

Okay, so I'm digging into this AIO code, trying to nail down a potential data race. My initial understanding is that `aio_setup_ring` sets things up, `aio_complete` updates the ring's tail, and `aio_read_events_ring` reads from it. Pretty straightforward, right?

But then I stopped to look more closely at `aio_read_events_ring`. The report shows it reading the ring, and the read at `fs/aio.c:1238` is simply `tail = ring->tail;`, while the write at `fs/aio.c:1162` is `ring->tail = tail;`. So the analysis holds up: the same field is written by the completion path and read by the getevents path.

Now, the real question is whether there are *other* places touching this value. The reference search is clear enough: the only kernel-side reader is `aio_read_events_ring`, and the writers are `aio_complete` and `aio_setup_ring` (the latter only at setup time). KCSAN isn't going to see userspace mappings, so I can safely ignore those.

So it's a clear race between `aio_complete` and `aio_read_events_ring`: the former writes `tail` while the latter reads it concurrently, and no common lock serializes the two accesses.

One more thing that I noticed: `aio_complete` is actually called from `iocb_put`. Let me take a quick look at that function to make sure it confirms this analysis.
```c
1209:		aio_complete(iocb);
```



21/2 2026/02/24 18:31 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/aio.c IncludeLines:true Name:iocb_put]

Results:
map[SourceCode:1206:	static inline void iocb_put(struct aio_kiocb *iocb)
1207:	{
1208:		if (refcount_dec_and_test(&iocb->ki_refcnt)) {
1209:			aio_complete(iocb);
1210:			iocb_destroy(iocb);
1211:		}
1212:	}
 SourceFile:fs/aio.c]

22/2 2026/02/24 18:31 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 43946
output: 31
thoughts: 563

Thoughts:
**Analyzing the AIO Data Race**

Okay, so the write side of the race fires when the last reference to an AIO request is dropped: the stack trace shows `io_submit_one` calling `iocb_put(req)`, and the final `iocb_put()` is what invokes `aio_complete()`. So the race is between a fast completion on the submission path and the `getevents` path. This kind of concurrency is common in AIO, so I need to determine whether it's actually benign.

In the kernel context, "benign" typically means intentional and safe, as opposed to something that can lead to data corruption or a system crash, especially when considering conservative compiler optimizations. Based on my analysis, it seems like the race shouldn't cause corruption.

But hold on, let me double-check the `aio_read_events_ring` code. I need to be absolutely sure. Okay, here it is:

```c
1233:		mutex_lock(&ctx->ring_lock);
1234:	
1235:		/* Access to ->ring_folios here is protected by ctx->ring_lock. */
1236:		ring = folio_address(ctx->ring_folios[0]);
1237:		head = ring->head;
1238:		tail = ring->tail;
```

The comment *claims* that `ctx->ring_lock` protects access to `->ring_folios`, but critically, it says nothing about `ring->tail` being protected by that mutex. The tail is updated in `aio_complete`, which does not hold `ring_lock`. This strongly suggests that concurrent access to `ring->tail` is expected by design.

Now, if this concurrent access is intentional and guarded correctly with barriers, it's considered benign. My understanding is that the AIO design permits this access pattern, so I need to confirm that `ring` itself won't be freed while it's being accessed.

Let's see. `ring` is a part of `ctx`, and `ctx` is kept alive by system calls. `aio_read_events_ring` is called from `do_io_getevents`, which holds a reference to `ctx`. `aio_complete` is called from `io_submit_one`, and also holds a reference to `ctx`. Okay, so the memory for `ring` appears to be safely protected from being freed.
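
For completeness, this is the lifetime rule being relied on, written as a generic sketch: every syscall path pins the context before touching the ring and unpins it afterwards. The `refcount_t` usage and the `demo_*` names are purely illustrative; fs/aio.c implements this with its own kioctx reference scheme rather than the code below.

```c
#include <linux/refcount.h>

struct demo_ring;	/* the mapping that must stay alive */

struct demo_ctx {
	refcount_t		users;	/* one ref per in-flight syscall + 1 */
	struct demo_ring	*ring;
};

extern void demo_free_ctx(struct demo_ctx *ctx);	/* hypothetical teardown */

/* Called at syscall entry: pin the context before dereferencing the ring. */
static struct demo_ctx *demo_get_ctx(struct demo_ctx *ctx)
{
	if (!refcount_inc_not_zero(&ctx->users))
		return NULL;		/* context is already being torn down */
	return ctx;
}

/* Called at syscall exit: the last putter frees the context and its ring. */
static void demo_put_ctx(struct demo_ctx *ctx)
{
	if (refcount_dec_and_test(&ctx->users))
		demo_free_ctx(ctx);
}
```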

The observed change in `ring->tail` from `0x3` to `0x8` indicates five events were completed between the read and the write. This is totally normal behavior for AIO.

Therefore, I think this data race is indeed benign. I'm going to conclude it is safe because of the locking that is present and the protection of the data structures.

Just to be absolutely certain, I'll search for any recent patches to `fs/aio.c` related to KCSAN or data races, just in case there's something recent I've missed.



23/2 2026/02/24 18:31 tool grepper
0m Args:
map[Expression:KCSAN|data-race|READ_ONCE|WRITE_ONCE]

Results:
map[Output:
Full output is too long, showing 500 out of 58565 lines.
Use more precise expression if possible.

[Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=78=lock-acquisition and lock-release functions::
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-84-    5   raw_spin_lock_rcu_node(rnp);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:85:    6   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:86:    7   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-87-    8   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-92-   13   raw_spin_lock_rcu_node(rnp);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:93:   14   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:94:   15   r2 = READ_ONCE(z);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-95-   16   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-99-   20 {
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:100:   21   WRITE_ONCE(z, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-101-   22   smp_mb();
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:102:   23   r3 = READ_ONCE(x);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-103-   24 }
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=112=The ``smp_mb__after_unlock_lock()`` invocations prevent this
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-131-|  ----                                      ----                       |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:132:|  WRITE_ONCE(X, 1)                          WRITE_ONCE(Y, 1)           |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-133-|  g = get_state_synchronize_rcu()           smp_mb()                   |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:134:|  while (!poll_state_synchronize_rcu(g))    r1 = READ_ONCE(X)          |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-135-|          continue;                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:136:|  r0 = READ_ONCE(Y)                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-137-|                                                                       |
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=198=newly arrived RCU callbacks against future grace periods:
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-213-   12   /* Handle nohz enablement switches conservatively. */
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:214:   13   tne = READ_ONCE(tick_nohz_active);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-215-   14   if (tne != rdp->tick_nohz_enabled_snap) {
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-145-         x="112.04738"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:146:         y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-147-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-156-         x="112.04738"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:157:         y="439.13766">WRITE_ONCE(b, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-158-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-167-         x="255.60869"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:168:         y="309.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-169-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-178-         x="255.14423"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:179:         y="520.61786">WRITE_ONCE(c, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-180-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-189-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:190:         y="384.71124">r2 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-191-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-200-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:201:         y="582.13617">r3 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-202-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-173-         x="112.04738"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:174:         y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-175-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-184-         x="112.04738"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:185:         y="487.13766">WRITE_ONCE(b, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-186-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-195-         x="255.60869"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:196:         y="297.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-197-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-206-         x="255.14423"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:207:         y="554.61786">WRITE_ONCE(c, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-208-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-217-         x="396.10254"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:218:         y="370.71124">WRITE_ONCE(d, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-219-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-228-         x="396.10254"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:229:         y="572.13617">r2 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-230-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-463-         x="541.70508"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:464:         y="387.6217">r3 = READ_ONCE(d);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-465-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-474-         x="541.2406"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:475:         y="646.94611">WRITE_ONCE(e, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-476-    <path
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-509-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:510:         y="461.83929">r4 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-511-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-520-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:521:         y="669.26422">r5 = READ_ONCE(e);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-522-    <text
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=84=overhead to readers, for example:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-92-       5   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:93:       6   r1 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst:94:       7   r2 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-95-       8   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-99-      12 {
 Documentation/RCU/Design/Requirements/Requirements.rst:100:      13   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-101-      14   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:102:      15   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-103-      16 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=138=recovery from node failure, more or less as follows:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-153-      12   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:154:      13   state_snap = READ_ONCE(state);
 Documentation/RCU/Design/Requirements/Requirements.rst-155-      14   if (state_snap == STATE_NORMAL)
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-163-      22 {
 Documentation/RCU/Design/Requirements/Requirements.rst:164:      23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
 Documentation/RCU/Design/Requirements/Requirements.rst-165-      24   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:166:      25   WRITE_ONCE(state, STATE_RECOVERING);
 Documentation/RCU/Design/Requirements/Requirements.rst-167-      26   recovery();
 Documentation/RCU/Design/Requirements/Requirements.rst:168:      27   WRITE_ONCE(state, STATE_WANT_NORMAL);
 Documentation/RCU/Design/Requirements/Requirements.rst-169-      28   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:170:      29   WRITE_ONCE(state, STATE_NORMAL);
 Documentation/RCU/Design/Requirements/Requirements.rst-171-      30 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=467=resembling the dependency-ordering barrier that was later subsumed
 Documentation/RCU/Design/Requirements/Requirements.rst:468:into rcu_dereference() and later still into READ_ONCE(). The
 Documentation/RCU/Design/Requirements/Requirements.rst-469-need for these operations made itself known quite suddenly at a
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=702=threads:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-708-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:709:       4   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-710-       5   rcu_read_unlock();
 Documentation/RCU/Design/Requirements/Requirements.rst-711-       6   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:712:       7   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-713-       8   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-718-      13   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:719:      14   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-720-      15   rcu_read_unlock();
 Documentation/RCU/Design/Requirements/Requirements.rst-721-      16   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:722:      17   r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-723-      18   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=737=significant ordering constraints would slow down these fast-path APIs.
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-745-+-----------------------------------------------------------------------+
 Documentation/RCU/Design/Requirements/Requirements.rst:746:| No, the volatile casts in READ_ONCE() and WRITE_ONCE()                |
 Documentation/RCU/Design/Requirements/Requirements.rst-747-| prevent the compiler from reordering in this particular case.         |
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=755=example illustrates this:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-761-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:762:       4   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-763-       5   if (r1) {
 Documentation/RCU/Design/Requirements/Requirements.rst-764-       6     do_something_with_nonzero_x();
 Documentation/RCU/Design/Requirements/Requirements.rst:765:       7     r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-766-       8     WARN_ON(!r2); /* BUG!!! */
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-773-      15   spin_lock(&my_lock);
 Documentation/RCU/Design/Requirements/Requirements.rst:774:      16   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:775:      17   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-776-      18   spin_unlock(&my_lock);
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=819=are initially all zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-825-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:826:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:827:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-828-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-832-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:833:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-834-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:835:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-836-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-840-      18   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:841:      19   r2 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:842:      20   r3 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-843-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=862=period is known to end before the second grace period starts:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-868-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:869:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:870:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-871-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-875-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:876:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-877-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:878:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-879-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-882-      17 {
 Documentation/RCU/Design/Requirements/Requirements.rst:883:      18   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-884-      19   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:885:      20   WRITE_ONCE(d, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-886-      21 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-890-      25   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:891:      26   r3 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:892:      27   r4 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-893-      28   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=920=illustrated by the following, with all variables initially zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-926-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:927:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:928:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-929-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-933-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:934:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-935-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:936:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-937-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-941-      18   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:942:      19   WRITE_ONCE(d, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:943:      20   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-944-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-948-      25 {
 Documentation/RCU/Design/Requirements/Requirements.rst:949:      26   r3 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-950-      27   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:951:      28   WRITE_ONCE(e, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-952-      29 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-956-      33   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:957:      34   r4 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:958:      35   r5 = READ_ONCE(e);
 Documentation/RCU/Design/Requirements/Requirements.rst-959-      36   rcu_read_unlock();
 --
 Documentation/RCU/checklist.rst=12=over a rather long period of time, but improvements are always welcome!
 --
 Documentation/RCU/checklist.rst-363-	time that readers might be accessing that structure.  In such
 Documentation/RCU/checklist.rst:364:	cases, READ_ONCE() may be used in place of rcu_dereference()
 Documentation/RCU/checklist.rst-365-	and the read-side markers (rcu_read_lock() and rcu_read_unlock(),
 --
 Documentation/RCU/listRCU.rst=145=has become list_for_each_entry_rcu().  The **_rcu()** list-traversal
 Documentation/RCU/listRCU.rst:146:primitives add READ_ONCE() and diagnostic checks for incorrect use
 Documentation/RCU/listRCU.rst-147-outside of an RCU read-side critical section.
 --
 Documentation/RCU/rcu_dereference.rst=25=readers working properly:
 --
 Documentation/RCU/rcu_dereference.rst-35-	return data preceding initialization that preceded the store
 Documentation/RCU/rcu_dereference.rst:36:	of the pointer.  (As noted later, in recent kernels READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-37-	also prevents DEC Alpha from playing these tricks.)
 --
 Documentation/RCU/rcu_dereference.rst-45--	In the special case where data is added but is never removed
 Documentation/RCU/rcu_dereference.rst:46:	while readers are accessing the structure, READ_ONCE() may be used
 Documentation/RCU/rcu_dereference.rst:47:	instead of rcu_dereference().  In this case, use of READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-48-	takes on the role of the lockless_dereference() primitive that
 --
 Documentation/RCU/whatisRCU.rst=687=don't forget about them when submitting patches making use of RCU!]::
 --
 Documentation/RCU/whatisRCU.rst-695-	({ \
 Documentation/RCU/whatisRCU.rst:696:		typeof(p) _________p1 = READ_ONCE(p); \
 Documentation/RCU/whatisRCU.rst-697-		(_________p1); \
 --
 Documentation/atomic_t.txt=82=The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
 Documentation/atomic_t.txt:83:implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
 Documentation/atomic_t.txt-84-smp_store_release() respectively. Therefore, if you find yourself only using
 --
 Documentation/atomic_t.txt=119=with a lock:
 --
 Documentation/atomic_t.txt-124-    lock();
 Documentation/atomic_t.txt:125:    ret = READ_ONCE(v->counter); // == 1
 Documentation/atomic_t.txt-126-						atomic_set(v, 0);
 Documentation/atomic_t.txt:127:    if (ret != u)				  WRITE_ONCE(v->counter, 0);
 Documentation/atomic_t.txt:128:      WRITE_ONCE(v->counter, ret + 1);
 Documentation/atomic_t.txt-129-    unlock();
 --
 Documentation/atomic_t.txt=234=strictly stronger than ACQUIRE. As illustrated:
 --
 Documentation/atomic_t.txt-242-  {
 Documentation/atomic_t.txt:243:    r0 = READ_ONCE(*x);
 Documentation/atomic_t.txt-244-    smp_rmb();
 --
 Documentation/atomic_t.txt-251-    smp_mb__after_atomic();
 Documentation/atomic_t.txt:252:    WRITE_ONCE(*x, 1);
 Documentation/atomic_t.txt-253-  }
 --
 Documentation/atomic_t.txt=260=because it would not order the W part of the RMW against the following
 Documentation/atomic_t.txt:261:WRITE_ONCE.  Thus:
 Documentation/atomic_t.txt-262-
 --
 Documentation/core-api/circular-buffers.rst=154=The producer will look something like this::
 --
 Documentation/core-api/circular-buffers.rst-159-	/* The spin_unlock() and next spin_lock() provide needed ordering. */
 Documentation/core-api/circular-buffers.rst:160:	unsigned long tail = READ_ONCE(buffer->tail);
 Documentation/core-api/circular-buffers.rst-161-
 --
 Documentation/core-api/circular-buffers.rst=219=before it writes the new tail pointer, which will erase the item.
 Documentation/core-api/circular-buffers.rst-220-
 Documentation/core-api/circular-buffers.rst:221:Note the use of READ_ONCE() and smp_load_acquire() to read the
 Documentation/core-api/circular-buffers.rst-222-opposition index.  This prevents the compiler from discarding and
 --
 Documentation/core-api/errseq.rst=144=errseq_check_and_advance after taking the lock. e.g.::
 Documentation/core-api/errseq.rst-145-
 Documentation/core-api/errseq.rst:146:        if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) {
 Documentation/core-api/errseq.rst-147-                /* su.s_wd_err is protected by s_wd_err_lock */
 --
 Documentation/core-api/refcount-vs-atomic.rst=37=are executed in program order on a single CPU.
 Documentation/core-api/refcount-vs-atomic.rst:38:This is implemented using READ_ONCE()/WRITE_ONCE() and
 Documentation/core-api/refcount-vs-atomic.rst-39-compare-and-swap primitives.
 --
 Documentation/dev-tools/kcsan.rst-3-
 Documentation/dev-tools/kcsan.rst:4:Kernel Concurrency Sanitizer (KCSAN)
 Documentation/dev-tools/kcsan.rst-5-====================================
 Documentation/dev-tools/kcsan.rst-6-
 Documentation/dev-tools/kcsan.rst:7:The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector, which
 Documentation/dev-tools/kcsan.rst-8-relies on compile-time instrumentation, and uses a watchpoint-based sampling
 Documentation/dev-tools/kcsan.rst:9:approach to detect races. KCSAN's primary purpose is to detect `data races`_.
 Documentation/dev-tools/kcsan.rst-10-
 Documentation/dev-tools/kcsan.rst=11=Usage
 --
 Documentation/dev-tools/kcsan.rst-13-
 Documentation/dev-tools/kcsan.rst:14:KCSAN is supported by both GCC and Clang. With GCC we require version 11 or
 Documentation/dev-tools/kcsan.rst-15-later, and with Clang also require version 11 or later.
 Documentation/dev-tools/kcsan.rst-16-
 Documentation/dev-tools/kcsan.rst:17:To enable KCSAN configure the kernel with::
 Documentation/dev-tools/kcsan.rst-18-
 Documentation/dev-tools/kcsan.rst:19:    CONFIG_KCSAN = y
 Documentation/dev-tools/kcsan.rst-20-
 Documentation/dev-tools/kcsan.rst:21:KCSAN provides several other configuration options to customize behaviour (see
 Documentation/dev-tools/kcsan.rst-22-the respective help text in ``lib/Kconfig.kcsan`` for more info).
 --
 Documentation/dev-tools/kcsan.rst=27=A typical data race report looks like this::
 --
 Documentation/dev-tools/kcsan.rst-29-    ==================================================================
 Documentation/dev-tools/kcsan.rst:30:    BUG: KCSAN: data-race in test_kernel_read / test_kernel_write
 Documentation/dev-tools/kcsan.rst-31-
 --
 Documentation/dev-tools/kcsan.rst=52=the race. It is followed by the access types and stack traces of the 2 threads
 Documentation/dev-tools/kcsan.rst:53:involved in the data race. If KCSAN also observed a value change, the observed
 Documentation/dev-tools/kcsan.rst-54-old value and new value are shown on the "value changed" line respectively.
 --
 Documentation/dev-tools/kcsan.rst=56=The other less common type of data race report looks like this::
 --
 Documentation/dev-tools/kcsan.rst-58-    ==================================================================
 Documentation/dev-tools/kcsan.rst:59:    BUG: KCSAN: data-race in test_kernel_rmw_array+0x71/0xd0
 Documentation/dev-tools/kcsan.rst-60-
 --
 Documentation/dev-tools/kcsan.rst=78=the racing thread, but could also occur due to e.g. DMA accesses. Such reports
 Documentation/dev-tools/kcsan.rst:79:are shown only if ``CONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=y``, which is
 Documentation/dev-tools/kcsan.rst-80-enabled by default.
 --
 Documentation/dev-tools/kcsan.rst=87=the below options are available:
 Documentation/dev-tools/kcsan.rst-88-
 Documentation/dev-tools/kcsan.rst:89:* KCSAN understands the ``data_race(expr)`` annotation, which tells KCSAN that
 Documentation/dev-tools/kcsan.rst-90-  any data races due to accesses in ``expr`` should be ignored and resulting
 --
 Documentation/dev-tools/kcsan.rst-95-  to document that all data races due to accesses to a variable are intended
 Documentation/dev-tools/kcsan.rst:96:  and should be ignored by KCSAN::
 Documentation/dev-tools/kcsan.rst-97-
 --
 Documentation/dev-tools/kcsan.rst-116-
 Documentation/dev-tools/kcsan.rst:117:    KCSAN_SANITIZE_file.o := n
 Documentation/dev-tools/kcsan.rst-118-
 --
 Documentation/dev-tools/kcsan.rst-121-
 Documentation/dev-tools/kcsan.rst:122:    KCSAN_SANITIZE := n
 Documentation/dev-tools/kcsan.rst-123-
 --
 Documentation/dev-tools/kcsan.rst-125-
 Documentation/dev-tools/kcsan.rst:126:Furthermore, it is possible to tell KCSAN to show or hide entire classes of
 Documentation/dev-tools/kcsan.rst-127-data races, depending on preferences. These can be changed via the following
 Documentation/dev-tools/kcsan.rst=128=Kconfig options:
 Documentation/dev-tools/kcsan.rst-129-
 Documentation/dev-tools/kcsan.rst:130:* ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY``: If enabled and a conflicting write
 Documentation/dev-tools/kcsan.rst-131-  is observed via a watchpoint, but the data value of the memory location was
 --
 Documentation/dev-tools/kcsan.rst-133-
 Documentation/dev-tools/kcsan.rst:134:* ``CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC``: Assume that plain aligned writes
 Documentation/dev-tools/kcsan.rst-135-  up to word size are atomic by default. Assumes that such writes are not
 Documentation/dev-tools/kcsan.rst-136-  subject to unsafe compiler optimizations resulting in data races. The option
 Documentation/dev-tools/kcsan.rst:137:  causes KCSAN to not report data races due to conflicts where the only plain
 Documentation/dev-tools/kcsan.rst-138-  accesses are aligned writes up to word size.
 Documentation/dev-tools/kcsan.rst-139-
 Documentation/dev-tools/kcsan.rst:140:* ``CONFIG_KCSAN_PERMISSIVE``: Enable additional permissive rules to ignore
 Documentation/dev-tools/kcsan.rst-141-  certain classes of common data races. Unlike the above, the rules are more
 Documentation/dev-tools/kcsan.rst-142-  complex involving value-change patterns, access type, and address. This
 Documentation/dev-tools/kcsan.rst:143:  option depends on ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=y``. For details
 Documentation/dev-tools/kcsan.rst-144-  please see the ``kernel/kcsan/permissive.h``. Testers and maintainers that
 --
 Documentation/dev-tools/kcsan.rst-147-
 Documentation/dev-tools/kcsan.rst:148:To use the strictest possible rules, select ``CONFIG_KCSAN_STRICT=y``, which
 Documentation/dev-tools/kcsan.rst:149:configures KCSAN to follow the Linux-kernel memory consistency model (LKMM) as
 Documentation/dev-tools/kcsan.rst-150-closely as possible.
 --
 Documentation/dev-tools/kcsan.rst=155=The file ``/sys/kernel/debug/kcsan`` provides the following interface:
 --
 Documentation/dev-tools/kcsan.rst-158-
 Documentation/dev-tools/kcsan.rst:159:* Writing ``on`` or ``off`` to ``/sys/kernel/debug/kcsan`` allows turning KCSAN
 Documentation/dev-tools/kcsan.rst-160-  on or off, respectively.
 --
 Documentation/dev-tools/kcsan.rst=172=Tuning performance
 --
 Documentation/dev-tools/kcsan.rst-174-
 Documentation/dev-tools/kcsan.rst:175:Core parameters that affect KCSAN's overall performance and bug detection
 Documentation/dev-tools/kcsan.rst-176-ability are exposed as kernel command-line arguments whose defaults can also be
 Documentation/dev-tools/kcsan.rst=177=changed via the corresponding Kconfig options.
 Documentation/dev-tools/kcsan.rst-178-
 Documentation/dev-tools/kcsan.rst:179:* ``kcsan.skip_watch`` (``CONFIG_KCSAN_SKIP_WATCH``): Number of per-CPU memory
 Documentation/dev-tools/kcsan.rst-180-  operations to skip, before another watchpoint is set up. Setting up
 --
 Documentation/dev-tools/kcsan.rst-184-
 Documentation/dev-tools/kcsan.rst:185:* ``kcsan.udelay_task`` (``CONFIG_KCSAN_UDELAY_TASK``): For tasks, the
 Documentation/dev-tools/kcsan.rst-186-  microsecond delay to stall execution after a watchpoint has been set up.
 --
 Documentation/dev-tools/kcsan.rst-189-
 Documentation/dev-tools/kcsan.rst:190:* ``kcsan.udelay_interrupt`` (``CONFIG_KCSAN_UDELAY_INTERRUPT``): For
 Documentation/dev-tools/kcsan.rst-191-  interrupts, the microsecond delay to stall execution after a watchpoint has
 --
 Documentation/dev-tools/kcsan.rst=214=and if that code is free from data races.
 Documentation/dev-tools/kcsan.rst-215-
 Documentation/dev-tools/kcsan.rst:216:KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
 Documentation/dev-tools/kcsan.rst-217-``atomic_*``, etc.), and a subset of ordering guarantees implied by memory
 Documentation/dev-tools/kcsan.rst:218:barriers. With ``CONFIG_KCSAN_WEAK_MEMORY=y``, KCSAN models load or store
 Documentation/dev-tools/kcsan.rst-219-buffering, and can detect missing ``smp_mb()``, ``smp_wmb()``, ``smp_rmb()``,
 --
 Documentation/dev-tools/kcsan.rst=221=implied barriers.
 Documentation/dev-tools/kcsan.rst-222-
 Documentation/dev-tools/kcsan.rst:223:Note, KCSAN will not report all data races due to missing memory ordering,
 Documentation/dev-tools/kcsan.rst-224-specifically where a memory barrier would be required to prohibit subsequent
 --
 Documentation/dev-tools/kcsan.rst=243=Implementation Details
 --
 Documentation/dev-tools/kcsan.rst-245-
 Documentation/dev-tools/kcsan.rst:246:KCSAN relies on observing that two accesses happen concurrently. Crucially, we
 Documentation/dev-tools/kcsan.rst-247-want to (a) increase the chances of observing races (especially for races that
 --
 Documentation/dev-tools/kcsan.rst=255=in `DataCollider
 Documentation/dev-tools/kcsan.rst-256-<http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf>`_.
 Documentation/dev-tools/kcsan.rst:257:Unlike DataCollider, KCSAN does not use hardware watchpoints, but instead
 Documentation/dev-tools/kcsan.rst-258-relies on compiler instrumentation and "soft watchpoints".
 Documentation/dev-tools/kcsan.rst-259-
 Documentation/dev-tools/kcsan.rst:260:In KCSAN, watchpoints are implemented using an efficient encoding that stores
 Documentation/dev-tools/kcsan.rst-261-access type, size, and address in a long; the benefits of using "soft
 Documentation/dev-tools/kcsan.rst:262:watchpoints" are portability and greater flexibility. KCSAN then relies on the
 Documentation/dev-tools/kcsan.rst-263-compiler instrumenting plain accesses. For each instrumented plain access:
 --
 Documentation/dev-tools/kcsan.rst-273-
 Documentation/dev-tools/kcsan.rst:274:To detect data races between plain and marked accesses, KCSAN also annotates
 Documentation/dev-tools/kcsan.rst:275:marked accesses, but only to check if a watchpoint exists; i.e. KCSAN never
 Documentation/dev-tools/kcsan.rst-276-sets up a watchpoint on marked accesses. By never setting up watchpoints for
 Documentation/dev-tools/kcsan.rst=277=marked operations, if all accesses to a variable that is accessed concurrently
 Documentation/dev-tools/kcsan.rst:278:are properly marked, KCSAN will never trigger a watchpoint and therefore never
 Documentation/dev-tools/kcsan.rst-279-report the accesses.
 --
 Documentation/dev-tools/kcsan.rst=281=Modeling Weak Memory
 --
 Documentation/dev-tools/kcsan.rst-283-
 Documentation/dev-tools/kcsan.rst:284:KCSAN's approach to detecting data races due to missing memory barriers is
 Documentation/dev-tools/kcsan.rst:285:based on modeling access reordering (with ``CONFIG_KCSAN_WEAK_MEMORY=y``).
 Documentation/dev-tools/kcsan.rst-286-Each plain memory access for which a watchpoint is set up, is also selected for
 --
 Documentation/dev-tools/kcsan.rst=293=reordering.
 Documentation/dev-tools/kcsan.rst-294-
 Documentation/dev-tools/kcsan.rst:295:When the result of a memory operation should be ordered by a barrier, KCSAN can
 Documentation/dev-tools/kcsan.rst-296-then detect data races where the conflict only occurs as a result of a missing
 Documentation/dev-tools/kcsan.rst=297=barrier. Consider the example::
 --
 Documentation/dev-tools/kcsan.rst-302-        x = 1;                  // data race!
 Documentation/dev-tools/kcsan.rst:303:        WRITE_ONCE(flag, 1);    // correct: smp_store_release(&flag, 1)
 Documentation/dev-tools/kcsan.rst-304-    }
 --
 Documentation/dev-tools/kcsan.rst-306-    {
 Documentation/dev-tools/kcsan.rst:307:        while (!READ_ONCE(flag));   // correct: smp_load_acquire(&flag)
 Documentation/dev-tools/kcsan.rst-308-        ... = x;                    // data race!
 --
 Documentation/dev-tools/kcsan.rst-310-
 Documentation/dev-tools/kcsan.rst:311:When weak memory modeling is enabled, KCSAN can consider ``x`` in ``T1`` for
 Documentation/dev-tools/kcsan.rst-312-simulated reordering. After the write of ``flag``, ``x`` is again checked for
 --
 Documentation/dev-tools/kcsan.rst=321=the effects of "buffering" (delaying accesses), since the runtime cannot
 Documentation/dev-tools/kcsan.rst-322-"prefetch" accesses. Also recall that watchpoints are only set up for plain
 Documentation/dev-tools/kcsan.rst:323:accesses, and the only access type for which KCSAN simulates reordering. This
 Documentation/dev-tools/kcsan.rst-324-means reordering of marked accesses is not modeled.
 --
 Documentation/dev-tools/kcsan.rst=331=Key Properties
 --
 Documentation/dev-tools/kcsan.rst-337-
 Documentation/dev-tools/kcsan.rst:338:2. **Performance Overhead:** KCSAN's runtime aims to be minimal, using an
 Documentation/dev-tools/kcsan.rst-339-   efficient watchpoint encoding that does not require acquiring any shared
 --
 Documentation/dev-tools/kcsan.rst-341-
 Documentation/dev-tools/kcsan.rst:342:   - 5.0x slow-down with the default KCSAN config;
 Documentation/dev-tools/kcsan.rst-343-   - 2.8x slow-down from runtime fast-path overhead only (set very large
 Documentation/dev-tools/kcsan.rst:344:     ``KCSAN_SKIP_WATCH`` and unset ``KCSAN_SKIP_WATCH_RANDOMIZE``).
 Documentation/dev-tools/kcsan.rst-345-
 Documentation/dev-tools/kcsan.rst:346:3. **Annotation Overheads:** Minimal annotations are required outside the KCSAN
 Documentation/dev-tools/kcsan.rst-347-   runtime. As a result, maintenance overheads are minimal as the kernel
 --
]
]

24/2 2026/02/24 18:31 llm expert
0m Model: gemini-3-flash-preview

Tokens:
	input: 57163
	output: 27
	thoughts: 27

Thoughts:
Okay, here's my take on those thoughts, tailored for an expert audience:

**Analyzing Potential Concurrency Issues in `fs/aio.c`**

Alright, my first instinct is to dive into `fs/aio.c`. Given the asynchronous nature of AIO and the amount of shared state touched from submission, completion, and interrupt context, I need to scrutinize this file for data races and missing synchronization. Specifically, I'm going to look for `data_race()` annotations (or similar constructs documenting intentional races) and for `READ_ONCE()` calls, which are what keep reads of concurrently modified shared data from being torn or mangled by the compiler. My goal is to determine whether the memory orderings are correct and whether shared accesses are either locked or properly marked; any unexpected usage, or the absence of one where it is needed, warrants close inspection.
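
For reference while reading the grep hits that follow, here is a minimal sketch of the paired marked-access/barrier pattern such a search is meant to surface. This is kernel-style C that assumes kernel context (READ_ONCE()/WRITE_ONCE() from <linux/compiler.h>, smp_wmb()/smp_rmb() from <asm/barrier.h>); it is not taken from fs/aio.c, and the names (struct ring_buf, publish_event(), consume_events()) are illustrative only:

	/*
	 * Illustrative single-producer/single-consumer sketch, not real kernel
	 * code: the producer stores a payload, then updates a shared index with
	 * a marked store behind a write barrier; the consumer pairs a marked
	 * load of the index with a read barrier before touching the payload.
	 * No full-ring or wraparound handling.
	 */
	#define RING_SIZE 16

	struct ring_buf {
		unsigned int tail;		/* producer writes, consumer reads */
		int events[RING_SIZE];
	};

	static void publish_event(struct ring_buf *rb, int ev)
	{
		unsigned int t = rb->tail;	/* only the producer writes tail */

		rb->events[t % RING_SIZE] = ev;	/* 1: store the payload */
		smp_wmb();			/* 2: order payload before index */
		WRITE_ONCE(rb->tail, t + 1);	/* 3: marked store of the index */
	}

	static int consume_events(struct ring_buf *rb, unsigned int head)
	{
		unsigned int t = READ_ONCE(rb->tail);	/* marked load of the index */
		int sum = 0;

		smp_rmb();			/* pairs with the producer's smp_wmb() */
		while (head != t) {
			sum += rb->events[head % RING_SIZE];
			head++;
		}
		return sum;
	}

In newer code the same pairing is often written with smp_store_release() and smp_load_acquire() instead of the explicit barriers, as the kcsan.rst flag example quoted above suggests.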



25/2 2026/02/24 18:31 tool grepper
0m Args:
map[Expression:data_race|READ_ONCE|WRITE_ONCE]

Results:
map[Output:
Full output is too long, showing 500 out of 56601 lines.
Use more precise expression if possible.

[Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=78=lock-acquisition and lock-release functions::
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-84-    5   raw_spin_lock_rcu_node(rnp);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:85:    6   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:86:    7   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-87-    8   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-92-   13   raw_spin_lock_rcu_node(rnp);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:93:   14   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:94:   15   r2 = READ_ONCE(z);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-95-   16   raw_spin_unlock_rcu_node(rnp);
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-99-   20 {
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:100:   21   WRITE_ONCE(z, 1);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-101-   22   smp_mb();
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:102:   23   r3 = READ_ONCE(x);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-103-   24 }
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=112=The ``smp_mb__after_unlock_lock()`` invocations prevent this
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-131-|  ----                                      ----                       |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:132:|  WRITE_ONCE(X, 1)                          WRITE_ONCE(Y, 1)           |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-133-|  g = get_state_synchronize_rcu()           smp_mb()                   |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:134:|  while (!poll_state_synchronize_rcu(g))    r1 = READ_ONCE(X)          |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-135-|          continue;                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:136:|  r0 = READ_ONCE(Y)                                                    |
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-137-|                                                                       |
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst=198=newly arrived RCU callbacks against future grace periods:
 --
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-213-   12   /* Handle nohz enablement switches conservatively. */
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst:214:   13   tne = READ_ONCE(tick_nohz_active);
 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst-215-   14   if (tne != rdp->tick_nohz_enabled_snap) {
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-145-         x="112.04738"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:146:         y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-147-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-156-         x="112.04738"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:157:         y="439.13766">WRITE_ONCE(b, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-158-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-167-         x="255.60869"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:168:         y="309.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-169-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-178-         x="255.14423"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:179:         y="520.61786">WRITE_ONCE(c, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-180-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-189-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:190:         y="384.71124">r2 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-191-    <text
 --
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-200-         x="396.10254"
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg:201:         y="582.13617">r3 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg-202-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-173-         x="112.04738"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:174:         y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-175-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-184-         x="112.04738"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:185:         y="487.13766">WRITE_ONCE(b, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-186-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-195-         x="255.60869"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:196:         y="297.29346">r1 = READ_ONCE(a);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-197-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-206-         x="255.14423"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:207:         y="554.61786">WRITE_ONCE(c, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-208-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-217-         x="396.10254"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:218:         y="370.71124">WRITE_ONCE(d, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-219-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-228-         x="396.10254"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:229:         y="572.13617">r2 = READ_ONCE(c);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-230-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-463-         x="541.70508"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:464:         y="387.6217">r3 = READ_ONCE(d);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-465-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-474-         x="541.2406"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:475:         y="646.94611">WRITE_ONCE(e, 1);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-476-    <path
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-509-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:510:         y="461.83929">r4 = READ_ONCE(b);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-511-    <text
 --
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-520-         x="686.27747"
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg:521:         y="669.26422">r5 = READ_ONCE(e);</tspan></text>
 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg-522-    <text
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=84=overhead to readers, for example:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-92-       5   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:93:       6   r1 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst:94:       7   r2 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-95-       8   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-99-      12 {
 Documentation/RCU/Design/Requirements/Requirements.rst:100:      13   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-101-      14   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:102:      15   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-103-      16 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=138=recovery from node failure, more or less as follows:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-153-      12   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:154:      13   state_snap = READ_ONCE(state);
 Documentation/RCU/Design/Requirements/Requirements.rst-155-      14   if (state_snap == STATE_NORMAL)
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-163-      22 {
 Documentation/RCU/Design/Requirements/Requirements.rst:164:      23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
 Documentation/RCU/Design/Requirements/Requirements.rst-165-      24   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:166:      25   WRITE_ONCE(state, STATE_RECOVERING);
 Documentation/RCU/Design/Requirements/Requirements.rst-167-      26   recovery();
 Documentation/RCU/Design/Requirements/Requirements.rst:168:      27   WRITE_ONCE(state, STATE_WANT_NORMAL);
 Documentation/RCU/Design/Requirements/Requirements.rst-169-      28   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:170:      29   WRITE_ONCE(state, STATE_NORMAL);
 Documentation/RCU/Design/Requirements/Requirements.rst-171-      30 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=467=resembling the dependency-ordering barrier that was later subsumed
 Documentation/RCU/Design/Requirements/Requirements.rst:468:into rcu_dereference() and later still into READ_ONCE(). The
 Documentation/RCU/Design/Requirements/Requirements.rst-469-need for these operations made itself known quite suddenly at a
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=702=threads:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-708-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:709:       4   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-710-       5   rcu_read_unlock();
 Documentation/RCU/Design/Requirements/Requirements.rst-711-       6   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:712:       7   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-713-       8   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-718-      13   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:719:      14   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-720-      15   rcu_read_unlock();
 Documentation/RCU/Design/Requirements/Requirements.rst-721-      16   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:722:      17   r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-723-      18   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=737=significant ordering constraints would slow down these fast-path APIs.
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-745-+-----------------------------------------------------------------------+
 Documentation/RCU/Design/Requirements/Requirements.rst:746:| No, the volatile casts in READ_ONCE() and WRITE_ONCE()                |
 Documentation/RCU/Design/Requirements/Requirements.rst-747-| prevent the compiler from reordering in this particular case.         |
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=755=example illustrates this:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-761-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:762:       4   r1 = READ_ONCE(y);
 Documentation/RCU/Design/Requirements/Requirements.rst-763-       5   if (r1) {
 Documentation/RCU/Design/Requirements/Requirements.rst-764-       6     do_something_with_nonzero_x();
 Documentation/RCU/Design/Requirements/Requirements.rst:765:       7     r2 = READ_ONCE(x);
 Documentation/RCU/Design/Requirements/Requirements.rst-766-       8     WARN_ON(!r2); /* BUG!!! */
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-773-      15   spin_lock(&my_lock);
 Documentation/RCU/Design/Requirements/Requirements.rst:774:      16   WRITE_ONCE(x, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:775:      17   WRITE_ONCE(y, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-776-      18   spin_unlock(&my_lock);
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=819=are initially all zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-825-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:826:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:827:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-828-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-832-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:833:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-834-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:835:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-836-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-840-      18   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:841:      19   r2 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:842:      20   r3 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-843-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=862=period is known to end before the second grace period starts:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-868-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:869:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:870:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-871-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-875-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:876:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-877-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:878:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-879-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-882-      17 {
 Documentation/RCU/Design/Requirements/Requirements.rst:883:      18   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-884-      19   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:885:      20   WRITE_ONCE(d, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-886-      21 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-890-      25   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:891:      26   r3 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:892:      27   r4 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-893-      28   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst=920=illustrated by the following, with all variables initially zero:
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-926-       3   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:927:       4   WRITE_ONCE(a, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:928:       5   WRITE_ONCE(b, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-929-       6   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-933-      10 {
 Documentation/RCU/Design/Requirements/Requirements.rst:934:      11   r1 = READ_ONCE(a);
 Documentation/RCU/Design/Requirements/Requirements.rst-935-      12   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:936:      13   WRITE_ONCE(c, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-937-      14 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-941-      18   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:942:      19   WRITE_ONCE(d, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst:943:      20   r2 = READ_ONCE(c);
 Documentation/RCU/Design/Requirements/Requirements.rst-944-      21   rcu_read_unlock();
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-948-      25 {
 Documentation/RCU/Design/Requirements/Requirements.rst:949:      26   r3 = READ_ONCE(d);
 Documentation/RCU/Design/Requirements/Requirements.rst-950-      27   synchronize_rcu();
 Documentation/RCU/Design/Requirements/Requirements.rst:951:      28   WRITE_ONCE(e, 1);
 Documentation/RCU/Design/Requirements/Requirements.rst-952-      29 }
 --
 Documentation/RCU/Design/Requirements/Requirements.rst-956-      33   rcu_read_lock();
 Documentation/RCU/Design/Requirements/Requirements.rst:957:      34   r4 = READ_ONCE(b);
 Documentation/RCU/Design/Requirements/Requirements.rst:958:      35   r5 = READ_ONCE(e);
 Documentation/RCU/Design/Requirements/Requirements.rst-959-      36   rcu_read_unlock();
 --
 Documentation/RCU/checklist.rst=12=over a rather long period of time, but improvements are always welcome!
 --
 Documentation/RCU/checklist.rst-363-	time that readers might be accessing that structure.  In such
 Documentation/RCU/checklist.rst:364:	cases, READ_ONCE() may be used in place of rcu_dereference()
 Documentation/RCU/checklist.rst-365-	and the read-side markers (rcu_read_lock() and rcu_read_unlock(),
 --
 Documentation/RCU/listRCU.rst=145=has become list_for_each_entry_rcu().  The **_rcu()** list-traversal
 Documentation/RCU/listRCU.rst:146:primitives add READ_ONCE() and diagnostic checks for incorrect use
 Documentation/RCU/listRCU.rst-147-outside of an RCU read-side critical section.
 --
 Documentation/RCU/rcu_dereference.rst=25=readers working properly:
 --
 Documentation/RCU/rcu_dereference.rst-35-	return data preceding initialization that preceded the store
 Documentation/RCU/rcu_dereference.rst:36:	of the pointer.  (As noted later, in recent kernels READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-37-	also prevents DEC Alpha from playing these tricks.)
 --
 Documentation/RCU/rcu_dereference.rst-45--	In the special case where data is added but is never removed
 Documentation/RCU/rcu_dereference.rst:46:	while readers are accessing the structure, READ_ONCE() may be used
 Documentation/RCU/rcu_dereference.rst:47:	instead of rcu_dereference().  In this case, use of READ_ONCE()
 Documentation/RCU/rcu_dereference.rst-48-	takes on the role of the lockless_dereference() primitive that
 --
 Documentation/RCU/whatisRCU.rst=687=don't forget about them when submitting patches making use of RCU!]::
 --
 Documentation/RCU/whatisRCU.rst-695-	({ \
 Documentation/RCU/whatisRCU.rst:696:		typeof(p) _________p1 = READ_ONCE(p); \
 Documentation/RCU/whatisRCU.rst-697-		(_________p1); \
 --
 Documentation/atomic_t.txt=82=The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
 Documentation/atomic_t.txt:83:implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
 Documentation/atomic_t.txt-84-smp_store_release() respectively. Therefore, if you find yourself only using
 --
 Documentation/atomic_t.txt=119=with a lock:
 --
 Documentation/atomic_t.txt-124-    lock();
 Documentation/atomic_t.txt:125:    ret = READ_ONCE(v->counter); // == 1
 Documentation/atomic_t.txt-126-						atomic_set(v, 0);
 Documentation/atomic_t.txt:127:    if (ret != u)				  WRITE_ONCE(v->counter, 0);
 Documentation/atomic_t.txt:128:      WRITE_ONCE(v->counter, ret + 1);
 Documentation/atomic_t.txt-129-    unlock();
 --
 Documentation/atomic_t.txt=234=strictly stronger than ACQUIRE. As illustrated:
 --
 Documentation/atomic_t.txt-242-  {
 Documentation/atomic_t.txt:243:    r0 = READ_ONCE(*x);
 Documentation/atomic_t.txt-244-    smp_rmb();
 --
 Documentation/atomic_t.txt-251-    smp_mb__after_atomic();
 Documentation/atomic_t.txt:252:    WRITE_ONCE(*x, 1);
 Documentation/atomic_t.txt-253-  }
 --
 Documentation/atomic_t.txt=260=because it would not order the W part of the RMW against the following
 Documentation/atomic_t.txt:261:WRITE_ONCE.  Thus:
 Documentation/atomic_t.txt-262-
 --
 Documentation/core-api/circular-buffers.rst=154=The producer will look something like this::
 --
 Documentation/core-api/circular-buffers.rst-159-	/* The spin_unlock() and next spin_lock() provide needed ordering. */
 Documentation/core-api/circular-buffers.rst:160:	unsigned long tail = READ_ONCE(buffer->tail);
 Documentation/core-api/circular-buffers.rst-161-
 --
 Documentation/core-api/circular-buffers.rst=219=before it writes the new tail pointer, which will erase the item.
 Documentation/core-api/circular-buffers.rst-220-
 Documentation/core-api/circular-buffers.rst:221:Note the use of READ_ONCE() and smp_load_acquire() to read the
 Documentation/core-api/circular-buffers.rst-222-opposition index.  This prevents the compiler from discarding and
 --
 Documentation/core-api/errseq.rst=144=errseq_check_and_advance after taking the lock. e.g.::
 Documentation/core-api/errseq.rst-145-
 Documentation/core-api/errseq.rst:146:        if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) {
 Documentation/core-api/errseq.rst-147-                /* su.s_wd_err is protected by s_wd_err_lock */
 --
 Documentation/core-api/refcount-vs-atomic.rst=37=are executed in program order on a single CPU.
 Documentation/core-api/refcount-vs-atomic.rst:38:This is implemented using READ_ONCE()/WRITE_ONCE() and
 Documentation/core-api/refcount-vs-atomic.rst-39-compare-and-swap primitives.
 --
 Documentation/dev-tools/checkpatch.rst=456=Comments
 --
 Documentation/dev-tools/checkpatch.rst-476-  **DATA_RACE**
 Documentation/dev-tools/checkpatch.rst:477:    Applications of data_race() should have a comment so as to document the
 Documentation/dev-tools/checkpatch.rst-478-    reasoning behind why it was deemed safe.
 --
 Documentation/dev-tools/kcsan.rst=87=the below options are available:
 Documentation/dev-tools/kcsan.rst-88-
 Documentation/dev-tools/kcsan.rst:89:* KCSAN understands the ``data_race(expr)`` annotation, which tells KCSAN that
 Documentation/dev-tools/kcsan.rst-90-  any data races due to accesses in ``expr`` should be ignored and resulting
 --
 Documentation/dev-tools/kcsan.rst-93-
 Documentation/dev-tools/kcsan.rst:94:* Similar to ``data_race(...)``, the type qualifier ``__data_racy`` can be used
 Documentation/dev-tools/kcsan.rst-95-  to document that all data races due to accesses to a variable are intended
 --
 Documentation/dev-tools/kcsan.rst=214=and if that code is free from data races.
 Documentation/dev-tools/kcsan.rst-215-
 Documentation/dev-tools/kcsan.rst:216:KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
 Documentation/dev-tools/kcsan.rst-217-``atomic_*``, etc.), and a subset of ordering guarantees implied by memory
 --
 Documentation/dev-tools/kcsan.rst=297=barrier. Consider the example::
 --
 Documentation/dev-tools/kcsan.rst-302-        x = 1;                  // data race!
 Documentation/dev-tools/kcsan.rst:303:        WRITE_ONCE(flag, 1);    // correct: smp_store_release(&flag, 1)
 Documentation/dev-tools/kcsan.rst-304-    }
 --
 Documentation/dev-tools/kcsan.rst-306-    {
 Documentation/dev-tools/kcsan.rst:307:        while (!READ_ONCE(flag));   // correct: smp_load_acquire(&flag)
 Documentation/dev-tools/kcsan.rst-308-        ... = x;                    // data race!
 --
 Documentation/driver-api/surface_aggregator/internal.rst=270=submission, i.e. cancellation, can not rely on the ``ptl`` reference to be
 Documentation/driver-api/surface_aggregator/internal.rst:271:set. Access to it in these functions is guarded by ``READ_ONCE()``, whereas
 Documentation/driver-api/surface_aggregator/internal.rst:272:setting ``ptl`` is equally guarded with ``WRITE_ONCE()`` for symmetry.
 Documentation/driver-api/surface_aggregator/internal.rst-273-
 --
 Documentation/driver-api/surface_aggregator/internal.rst=275=them, specifically priority and state for tracing. In those cases, proper
 Documentation/driver-api/surface_aggregator/internal.rst:276:access is ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``. Such
 Documentation/driver-api/surface_aggregator/internal.rst-277-read-only access is only allowed when stale values are not critical.
 --
 Documentation/driver-api/surface_aggregator/internal.rst=451=them, specifically the state for tracing. In those cases, proper access is
 Documentation/driver-api/surface_aggregator/internal.rst:452:ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``. Such read-only
 Documentation/driver-api/surface_aggregator/internal.rst-453-access is only allowed when stale values are not critical.
 --
 Documentation/driver-api/surface_aggregator/internal.rst=572=invalid usages, but rather aim to help catch them. In those cases, proper
 Documentation/driver-api/surface_aggregator/internal.rst:573:variable access is ensured by employing ``WRITE_ONCE()`` and ``READ_ONCE()``.
 Documentation/driver-api/surface_aggregator/internal.rst-574-
 --
 Documentation/filesystems/path-lookup.rst=896=similar.
 Documentation/filesystems/path-lookup.rst-897-
 Documentation/filesystems/path-lookup.rst:898:.. _READ_ONCE: https://lwn.net/Articles/624126/
 Documentation/filesystems/path-lookup.rst-899-
 --
 Documentation/filesystems/path-lookup.rst=904=when accessing fields in the dentry.  This "extra care" typically
 Documentation/filesystems/path-lookup.rst:905:involves using  `READ_ONCE() <READ_ONCE_>`_ to access fields, and verifying the
 Documentation/filesystems/path-lookup.rst-906-result is not NULL before using it.  This pattern can be seen in
 --
 Documentation/gpu/todo.rst=932=struct drm_sched_rq is read at many places without any locks, not even with a
 Documentation/gpu/todo.rst:933:READ_ONCE. At XDC 2025 no one could really tell why that is the case, whether
 Documentation/gpu/todo.rst-934-locks are needed and whether they could be added. (But for real, that should
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus=14=P0(int *x, atomic_t *y)
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-18-
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus:19:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-20-	smp_rmb();
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus=24=P1(int *x, atomic_t *y)
 --
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-27-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus:28:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/Atomic-RMW+mb__after_atomic-is-stronger-than-acquire.litmus-29-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus=12=P0(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-16-
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:17:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-18-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-19-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:20:	r0 = READ_ONCE(*y);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-21-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus=23=P1(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-27-
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:28:	WRITE_ONCE(*y, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-29-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-30-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus:31:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-1.litmus-32-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus=12=P0(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-15-
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus:16:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-17-	r1 = cmpxchg(y, 0, 1);
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus=20=P1(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-26-	smp_mb__after_atomic();
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus:27:	r2 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-ordered-2.litmus-28-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus=13=P0(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-17-
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:18:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-19-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:20:	r0 = READ_ONCE(*y);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-21-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus=23=P1(int *x, int *y, int *z)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-27-
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:28:	WRITE_ONCE(*y, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-29-	r1 = cmpxchg(z, 1, 0);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus:30:	r0 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-1.litmus-31-}
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus=13=P0(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-16-
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus:17:	WRITE_ONCE(*x, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-18-	r1 = cmpxchg(y, 0, 1);
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus=21=P1(int *x, int *y)
 --
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-26-	r1 = cmpxchg(y, 0, 1);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus:27:	r2 = READ_ONCE(*x);
 Documentation/litmus-tests/atomic/cmpxchg-fail-unordered-2.litmus-28-}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus=15=P0(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-20-
 Documentation/litmus-tests/locking/DCL-broken.litmus:21:	r0 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-22-	if (r0 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus-23-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-broken.litmus:24:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-25-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus:26:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus:27:			WRITE_ONCE(*flag, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus-28-		}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-30-	}
 Documentation/litmus-tests/locking/DCL-broken.litmus:31:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-broken.litmus-32-}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus=34=P1(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-39-
 Documentation/litmus-tests/locking/DCL-broken.litmus:40:	r0 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-41-	if (r0 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus-42-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-broken.litmus:43:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-broken.litmus-44-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-broken.litmus:45:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus:46:			WRITE_ONCE(*flag, 1);
 Documentation/litmus-tests/locking/DCL-broken.litmus-47-		}
 --
 Documentation/litmus-tests/locking/DCL-broken.litmus-49-	}
 Documentation/litmus-tests/locking/DCL-broken.litmus:50:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-broken.litmus-51-}
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus=16=P0(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-24-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-fixed.litmus:25:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-26-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-fixed.litmus:27:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-28-			smp_store_release(flag, 1);
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-31-	}
 Documentation/litmus-tests/locking/DCL-fixed.litmus:32:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-33-}
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus=35=P1(int *flag, int *data, spinlock_t *lck)
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-43-		spin_lock(lck);
 Documentation/litmus-tests/locking/DCL-fixed.litmus:44:		r1 = READ_ONCE(*flag);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-45-		if (r1 == 0) {
 Documentation/litmus-tests/locking/DCL-fixed.litmus:46:			WRITE_ONCE(*data, 1);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-47-			smp_store_release(flag, 1);
 --
 Documentation/litmus-tests/locking/DCL-fixed.litmus-50-	}
 Documentation/litmus-tests/locking/DCL-fixed.litmus:51:	r2 = READ_ONCE(*data);
 Documentation/litmus-tests/locking/DCL-fixed.litmus-52-}
 --
 Documentation/litmus-tests/locking/RM-broken.litmus=16=P0(int *x, atomic_t *y, spinlock_t *lck)
 --
]
]
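
The Documentation hits above (circular-buffers.rst's producer example, kcsan.rst's `data_race()`/`__data_racy`, checkpatch's DATA_RACE rule) describe exactly the idiom and annotation options relevant here. Purely as an illustration of those options -- not a proposed patch, and written against the field names in the report rather than a verified diff -- the flagged `ring->tail` accesses could be marked in either of two ways, with the ordering still provided by the existing smp_wmb()/smp_rmb() pair:

```c
/* Option A: marked plain accesses, so KCSAN treats the race as intentional. */
WRITE_ONCE(ring->tail, tail);		/* producer: aio_complete()          */
tail = READ_ONCE(ring->tail);		/* consumer: aio_read_events_ring()  */

/* Option B: data_race(), with the comment checkpatch's DATA_RACE rule asks for. */
/* Benign race: a stale tail only means fewer events are returned this call. */
tail = data_race(ring->tail);
```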

26/2 2026/02/24 18:31 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 69726
	output: 20
	thoughts: 17

Thoughts:
**Analyzing the AIO Implementation**

I'm going to dive straight into `fs/aio.c`. I'm already familiar with the broader architecture, so the focus is narrow: pinpoint the functions and data structures involved in the reported race, the synchronization primitives and locking around them, and how the I/O operations are initiated and completed. The aim is a working picture of the asynchronous I/O flow, not a general overview.
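
For reference, the shape I expect to find -- inferred from the report's stack traces and to be confirmed against the file below -- is the classic lockless ring-buffer hand-off. A minimal sketch (illustrative pseudo-kernel C, not code from the tree; `next()` and `copy_events_to_user()` are hypothetical helpers):

```c
/* Producer side (aio_complete), serialized by ctx->completion_lock. */
ring->io_events[tail] = *event;		/* 1: store the completed event           */
smp_wmb();				/* 2: order the event before the tail     */
ring->tail = next(tail);		/* 3: publish the slot to the consumer    */

/* Consumer side (aio_read_events_ring), serialized by ctx->ring_lock. */
tail = ring->tail;			/* 1: snapshot the producer index          */
smp_rmb();				/* 2: read events only after reading tail  */
copy_events_to_user(ring, head, tail);	/* 3: copy events up to the snapshot      */
```

A stale snapshot of `tail` only means fewer events are returned this call; the smp_wmb()/smp_rmb() pair guarantees that any event covered by the snapshot is fully visible.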



27/2 2026/02/24 18:31 tool read-file
0m Args:
map[File:fs/aio.c]

Results:
map[Contents:/*
 *	An async IO implementation for Linux
 *	Written by Benjamin LaHaise <bcrl@kvack.org>
 *
 *	Implements an efficient asynchronous io interface.
 *
 *	Copyright 2000, 2001, 2002 Red Hat, Inc.  All Rights Reserved.
 *	Copyright 2018 Christoph Hellwig.
 *
 *	See ../COPYING for licensing terms.
 */
#define pr_fmt(fmt) "%s: " fmt, __func__

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/time.h>
#include <linux/aio_abi.h>
#include <linux/export.h>
#include <linux/syscalls.h>
#include <linux/backing-dev.h>
#include <linux/refcount.h>
#include <linux/uio.h>

#include <linux/sched/signal.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/timer.h>
#include <linux/aio.h>
#include <linux/highmem.h>
#include <linux/workqueue.h>
#include <linux/security.h>
#include <linux/eventfd.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
#include <linux/migrate.h>
#include <linux/ramfs.h>
#include <linux/percpu-refcount.h>
#include <linux/mount.h>
#include <linux/pseudo_fs.h>

#include <linux/uaccess.h>
#include <linux/nospec.h>

#include "internal.h"

#define KIOCB_KEY		0

#define AIO_RING_MAGIC			0xa10a10a1
#define AIO_RING_COMPAT_FEATURES	1
#define AIO_RING_INCOMPAT_FEATURES	0
struct aio_ring {
	unsigned	id;	/* kernel internal index number */
	unsigned	nr;	/* number of io_events */
	unsigned	head;	/* Written to by userland or under ring_lock
				 * mutex by aio_read_events_ring(). */
	unsigned	tail;

	unsigned	magic;
	unsigned	compat_features;
	unsigned	incompat_features;
	unsigned	header_length;	/* size of aio_ring */


	struct io_event		io_events[];
}; /* 128 bytes + ring size */

/*
 * Plugging is meant to work with larger batches of IOs. If we don't
 * have more than the below, then don't bother setting up a plug.
 */
#define AIO_PLUG_THRESHOLD	2

#define AIO_RING_PAGES	8

struct kioctx_table {
	struct rcu_head		rcu;
	unsigned		nr;
	struct kioctx __rcu	*table[] __counted_by(nr);
};

struct kioctx_cpu {
	unsigned		reqs_available;
};

struct ctx_rq_wait {
	struct completion comp;
	atomic_t count;
};

struct kioctx {
	struct percpu_ref	users;
	atomic_t		dead;

	struct percpu_ref	reqs;

	unsigned long		user_id;

	struct kioctx_cpu __percpu *cpu;

	/*
	 * For percpu reqs_available, number of slots we move to/from global
	 * counter at a time:
	 */
	unsigned		req_batch;
	/*
	 * This is what userspace passed to io_setup(), it's not used for
	 * anything but counting against the global max_reqs quota.
	 *
	 * The real limit is nr_events - 1, which will be larger (see
	 * aio_setup_ring())
	 */
	unsigned		max_reqs;

	/* Size of ringbuffer, in units of struct io_event */
	unsigned		nr_events;

	unsigned long		mmap_base;
	unsigned long		mmap_size;

	struct folio		**ring_folios;
	long			nr_pages;

	struct rcu_work		free_rwork;	/* see free_ioctx() */

	/*
	 * signals when all in-flight requests are done
	 */
	struct ctx_rq_wait	*rq_wait;

	struct {
		/*
		 * This counts the number of available slots in the ringbuffer,
		 * so we avoid overflowing it: it's decremented (if positive)
		 * when allocating a kiocb and incremented when the resulting
		 * io_event is pulled off the ringbuffer.
		 *
		 * We batch accesses to it with a percpu version.
		 */
		atomic_t	reqs_available;
	} ____cacheline_aligned_in_smp;

	struct {
		spinlock_t	ctx_lock;
		struct list_head active_reqs;	/* used for cancellation */
	} ____cacheline_aligned_in_smp;

	struct {
		struct mutex	ring_lock;
		wait_queue_head_t wait;
	} ____cacheline_aligned_in_smp;

	struct {
		unsigned	tail;
		unsigned	completed_events;
		spinlock_t	completion_lock;
	} ____cacheline_aligned_in_smp;

	struct folio		*internal_folios[AIO_RING_PAGES];
	struct file		*aio_ring_file;

	unsigned		id;
};

/*
 * First field must be the file pointer in all the
 * iocb unions! See also 'struct kiocb' in <linux/fs.h>
 */
struct fsync_iocb {
	struct file		*file;
	struct work_struct	work;
	bool			datasync;
	struct cred		*creds;
};

struct poll_iocb {
	struct file		*file;
	struct wait_queue_head	*head;
	__poll_t		events;
	bool			cancelled;
	bool			work_scheduled;
	bool			work_need_resched;
	struct wait_queue_entry	wait;
	struct work_struct	work;
};

/*
 * NOTE! Each of the iocb union members has the file pointer
 * as the first entry in their struct definition. So you can
 * access the file pointer through any of the sub-structs,
 * or directly as just 'ki_filp' in this struct.
 */
struct aio_kiocb {
	union {
		struct file		*ki_filp;
		struct kiocb		rw;
		struct fsync_iocb	fsync;
		struct poll_iocb	poll;
	};

	struct kioctx		*ki_ctx;
	kiocb_cancel_fn		*ki_cancel;

	struct io_event		ki_res;

	struct list_head	ki_list;	/* the aio core uses this
						 * for cancellation */
	refcount_t		ki_refcnt;

	/*
	 * If the aio_resfd field of the userspace iocb is not zero,
	 * this is the underlying eventfd context to deliver events to.
	 */
	struct eventfd_ctx	*ki_eventfd;
};

/*------ sysctl variables----*/
static DEFINE_SPINLOCK(aio_nr_lock);
static unsigned long aio_nr;		/* current system wide number of aio requests */
static unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
/*----end sysctl variables---*/
#ifdef CONFIG_SYSCTL
static const struct ctl_table aio_sysctls[] = {
	{
		.procname	= "aio-nr",
		.data		= &aio_nr,
		.maxlen		= sizeof(aio_nr),
		.mode		= 0444,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{
		.procname	= "aio-max-nr",
		.data		= &aio_max_nr,
		.maxlen		= sizeof(aio_max_nr),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
};

static void __init aio_sysctl_init(void)
{
	register_sysctl_init("fs", aio_sysctls);
}
#else
#define aio_sysctl_init() do { } while (0)
#endif

static struct kmem_cache	*kiocb_cachep;
static struct kmem_cache	*kioctx_cachep;

static struct vfsmount *aio_mnt;

static const struct file_operations aio_ring_fops;
static const struct address_space_operations aio_ctx_aops;

static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
{
	struct file *file;
	struct inode *inode = alloc_anon_inode(aio_mnt->mnt_sb);
	if (IS_ERR(inode))
		return ERR_CAST(inode);

	inode->i_mapping->a_ops = &aio_ctx_aops;
	inode->i_mapping->i_private_data = ctx;
	inode->i_size = PAGE_SIZE * nr_pages;

	file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
				O_RDWR, &aio_ring_fops);
	if (IS_ERR(file))
		iput(inode);
	return file;
}

static int aio_init_fs_context(struct fs_context *fc)
{
	if (!init_pseudo(fc, AIO_RING_MAGIC))
		return -ENOMEM;
	fc->s_iflags |= SB_I_NOEXEC;
	return 0;
}

/* aio_setup
 *	Creates the slab caches used by the aio routines, panic on
 *	failure as this is done early during the boot sequence.
 */
static int __init aio_setup(void)
{
	static struct file_system_type aio_fs = {
		.name		= "aio",
		.init_fs_context = aio_init_fs_context,
		.kill_sb	= kill_anon_super,
	};
	aio_mnt = kern_mount(&aio_fs);
	if (IS_ERR(aio_mnt))
		panic("Failed to create aio fs mount.");

	kiocb_cachep = KMEM_CACHE(aio_kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
	aio_sysctl_init();
	return 0;
}
__initcall(aio_setup);

static void put_aio_ring_file(struct kioctx *ctx)
{
	struct file *aio_ring_file = ctx->aio_ring_file;
	struct address_space *i_mapping;

	if (aio_ring_file) {
		truncate_setsize(file_inode(aio_ring_file), 0);

		/* Prevent further access to the kioctx from migratepages */
		i_mapping = aio_ring_file->f_mapping;
		spin_lock(&i_mapping->i_private_lock);
		i_mapping->i_private_data = NULL;
		ctx->aio_ring_file = NULL;
		spin_unlock(&i_mapping->i_private_lock);

		fput(aio_ring_file);
	}
}

static void aio_free_ring(struct kioctx *ctx)
{
	int i;

	/* Disconnect the kioctx from the ring file.  This prevents future
	 * accesses to the kioctx from page migration.
	 */
	put_aio_ring_file(ctx);

	for (i = 0; i < ctx->nr_pages; i++) {
		struct folio *folio = ctx->ring_folios[i];

		if (!folio)
			continue;

		pr_debug("pid(%d) [%d] folio->count=%d\n", current->pid, i,
			 folio_ref_count(folio));
		ctx->ring_folios[i] = NULL;
		folio_put(folio);
	}

	if (ctx->ring_folios && ctx->ring_folios != ctx->internal_folios) {
		kfree(ctx->ring_folios);
		ctx->ring_folios = NULL;
	}
}

static int aio_ring_mremap(struct vm_area_struct *vma)
{
	struct file *file = vma->vm_file;
	struct mm_struct *mm = vma->vm_mm;
	struct kioctx_table *table;
	int i, res = -EINVAL;

	spin_lock(&mm->ioctx_lock);
	rcu_read_lock();
	table = rcu_dereference(mm->ioctx_table);
	if (!table)
		goto out_unlock;

	for (i = 0; i < table->nr; i++) {
		struct kioctx *ctx;

		ctx = rcu_dereference(table->table[i]);
		if (ctx && ctx->aio_ring_file == file) {
			if (!atomic_read(&ctx->dead)) {
				ctx->user_id = ctx->mmap_base = vma->vm_start;
				res = 0;
			}
			break;
		}
	}

out_unlock:
	rcu_read_unlock();
	spin_unlock(&mm->ioctx_lock);
	return res;
}

static const struct vm_operations_struct aio_ring_vm_ops = {
	.mremap		= aio_ring_mremap,
#if IS_ENABLED(CONFIG_MMU)
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= filemap_page_mkwrite,
#endif
};

static int aio_ring_mmap_prepare(struct vm_area_desc *desc)
{
	desc->vm_flags |= VM_DONTEXPAND;
	desc->vm_ops = &aio_ring_vm_ops;
	return 0;
}

static const struct file_operations aio_ring_fops = {
	.mmap_prepare = aio_ring_mmap_prepare,
};

#if IS_ENABLED(CONFIG_MIGRATION)
static int aio_migrate_folio(struct address_space *mapping, struct folio *dst,
			struct folio *src, enum migrate_mode mode)
{
	struct kioctx *ctx;
	unsigned long flags;
	pgoff_t idx;
	int rc = 0;

	/* mapping->i_private_lock here protects against the kioctx teardown.  */
	spin_lock(&mapping->i_private_lock);
	ctx = mapping->i_private_data;
	if (!ctx) {
		rc = -EINVAL;
		goto out;
	}

	/* The ring_lock mutex.  This prevents aio_read_events() from writing
	 * to the ring's head, and prevents page migration from mucking in
	 * a partially initialized kioctx.
	 */
	if (!mutex_trylock(&ctx->ring_lock)) {
		rc = -EAGAIN;
		goto out;
	}

	idx = src->index;
	if (idx < (pgoff_t)ctx->nr_pages) {
		/* Make sure the old folio hasn't already been changed */
		if (ctx->ring_folios[idx] != src)
			rc = -EAGAIN;
	} else
		rc = -EINVAL;

	if (rc != 0)
		goto out_unlock;

	/* Writeback must be complete */
	BUG_ON(folio_test_writeback(src));
	folio_get(dst);

	rc = folio_migrate_mapping(mapping, dst, src, 1);
	if (rc) {
		folio_put(dst);
		goto out_unlock;
	}

	/* Take completion_lock to prevent other writes to the ring buffer
	 * while the old folio is copied to the new.  This prevents new
	 * events from being lost.
	 */
	spin_lock_irqsave(&ctx->completion_lock, flags);
	folio_copy(dst, src);
	folio_migrate_flags(dst, src);
	BUG_ON(ctx->ring_folios[idx] != src);
	ctx->ring_folios[idx] = dst;
	spin_unlock_irqrestore(&ctx->completion_lock, flags);

	/* The old folio is no longer accessible. */
	folio_put(src);

out_unlock:
	mutex_unlock(&ctx->ring_lock);
out:
	spin_unlock(&mapping->i_private_lock);
	return rc;
}
#else
#define aio_migrate_folio NULL
#endif

static const struct address_space_operations aio_ctx_aops = {
	.dirty_folio	= noop_dirty_folio,
	.migrate_folio	= aio_migrate_folio,
};

static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
{
	struct aio_ring *ring;
	struct mm_struct *mm = current->mm;
	unsigned long size, unused;
	int nr_pages;
	int i;
	struct file *file;

	/* Compensate for the ring buffer's head/tail overlap entry */
	nr_events += 2;	/* 1 is required, 2 for good luck */

	size = sizeof(struct aio_ring);
	size += sizeof(struct io_event) * nr_events;

	nr_pages = PFN_UP(size);
	if (nr_pages < 0)
		return -EINVAL;

	file = aio_private_file(ctx, nr_pages);
	if (IS_ERR(file)) {
		ctx->aio_ring_file = NULL;
		return -ENOMEM;
	}

	ctx->aio_ring_file = file;
	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
			/ sizeof(struct io_event);

	ctx->ring_folios = ctx->internal_folios;
	if (nr_pages > AIO_RING_PAGES) {
		ctx->ring_folios = kcalloc(nr_pages, sizeof(struct folio *),
					   GFP_KERNEL);
		if (!ctx->ring_folios) {
			put_aio_ring_file(ctx);
			return -ENOMEM;
		}
	}

	for (i = 0; i < nr_pages; i++) {
		struct folio *folio;

		folio = __filemap_get_folio(file->f_mapping, i,
					    FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
					    GFP_USER | __GFP_ZERO);
		if (IS_ERR(folio))
			break;

		pr_debug("pid(%d) [%d] folio->count=%d\n", current->pid, i,
			 folio_ref_count(folio));
		folio_end_read(folio, true);

		ctx->ring_folios[i] = folio;
	}
	ctx->nr_pages = i;

	if (unlikely(i != nr_pages)) {
		aio_free_ring(ctx);
		return -ENOMEM;
	}

	ctx->mmap_size = nr_pages * PAGE_SIZE;
	pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);

	if (mmap_write_lock_killable(mm)) {
		ctx->mmap_size = 0;
		aio_free_ring(ctx);
		return -EINTR;
	}

	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
				 PROT_READ | PROT_WRITE,
				 MAP_SHARED, 0, 0, &unused, NULL);
	mmap_write_unlock(mm);
	if (IS_ERR((void *)ctx->mmap_base)) {
		ctx->mmap_size = 0;
		aio_free_ring(ctx);
		return -ENOMEM;
	}

	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);

	ctx->user_id = ctx->mmap_base;
	ctx->nr_events = nr_events; /* trusted copy */

	ring = folio_address(ctx->ring_folios[0]);
	ring->nr = nr_events;	/* user copy */
	ring->id = ~0U;
	ring->head = ring->tail = 0;
	ring->magic = AIO_RING_MAGIC;
	ring->compat_features = AIO_RING_COMPAT_FEATURES;
	ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
	ring->header_length = sizeof(struct aio_ring);
	flush_dcache_folio(ctx->ring_folios[0]);

	return 0;
}

#define AIO_EVENTS_PER_PAGE	(PAGE_SIZE / sizeof(struct io_event))
#define AIO_EVENTS_FIRST_PAGE	((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
#define AIO_EVENTS_OFFSET	(AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)

void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
{
	struct aio_kiocb *req;
	struct kioctx *ctx;
	unsigned long flags;

	/*
	 * kiocb didn't come from aio or is neither a read nor a write, hence
	 * ignore it.
	 */
	if (!(iocb->ki_flags & IOCB_AIO_RW))
		return;

	req = container_of(iocb, struct aio_kiocb, rw);

	if (WARN_ON_ONCE(!list_empty(&req->ki_list)))
		return;

	ctx = req->ki_ctx;

	spin_lock_irqsave(&ctx->ctx_lock, flags);
	list_add_tail(&req->ki_list, &ctx->active_reqs);
	req->ki_cancel = cancel;
	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
}
EXPORT_SYMBOL(kiocb_set_cancel_fn);

/*
 * free_ioctx() should be RCU delayed to synchronize against the RCU
 * protected lookup_ioctx() and also needs process context to call
 * aio_free_ring().  Use rcu_work.
 */
static void free_ioctx(struct work_struct *work)
{
	struct kioctx *ctx = container_of(to_rcu_work(work), struct kioctx,
					  free_rwork);
	pr_debug("freeing %p\n", ctx);

	aio_free_ring(ctx);
	free_percpu(ctx->cpu);
	percpu_ref_exit(&ctx->reqs);
	percpu_ref_exit(&ctx->users);
	kmem_cache_free(kioctx_cachep, ctx);
}

static void free_ioctx_reqs(struct percpu_ref *ref)
{
	struct kioctx *ctx = container_of(ref, struct kioctx, reqs);

	/* At this point we know that there are no in-flight requests */
	if (ctx->rq_wait && atomic_dec_and_test(&ctx->rq_wait->count))
		complete(&ctx->rq_wait->comp);

	/* Synchronize against RCU protected table->table[] dereferences */
	INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
	queue_rcu_work(system_percpu_wq, &ctx->free_rwork);
}

/*
 * When this function runs, the kioctx has been removed from the "hash table"
 * and ctx->users has dropped to 0, so we know no more kiocbs can be submitted -
 * now it's safe to cancel any that need to be.
 */
static void free_ioctx_users(struct percpu_ref *ref)
{
	struct kioctx *ctx = container_of(ref, struct kioctx, users);
	struct aio_kiocb *req;

	spin_lock_irq(&ctx->ctx_lock);

	while (!list_empty(&ctx->active_reqs)) {
		req = list_first_entry(&ctx->active_reqs,
				       struct aio_kiocb, ki_list);
		req->ki_cancel(&req->rw);
		list_del_init(&req->ki_list);
	}

	spin_unlock_irq(&ctx->ctx_lock);

	percpu_ref_kill(&ctx->reqs);
	percpu_ref_put(&ctx->reqs);
}

static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
{
	unsigned i, new_nr;
	struct kioctx_table *table, *old;
	struct aio_ring *ring;

	spin_lock(&mm->ioctx_lock);
	table = rcu_dereference_raw(mm->ioctx_table);

	while (1) {
		if (table)
			for (i = 0; i < table->nr; i++)
				if (!rcu_access_pointer(table->table[i])) {
					ctx->id = i;
					rcu_assign_pointer(table->table[i], ctx);
					spin_unlock(&mm->ioctx_lock);

					/* While kioctx setup is in progress,
					 * we are protected from page migration
					 * changes ring_folios by ->ring_lock.
					 */
					ring = folio_address(ctx->ring_folios[0]);
					ring->id = ctx->id;
					return 0;
				}

		new_nr = (table ? table->nr : 1) * 4;
		spin_unlock(&mm->ioctx_lock);

		table = kzalloc(struct_size(table, table, new_nr), GFP_KERNEL);
		if (!table)
			return -ENOMEM;

		table->nr = new_nr;

		spin_lock(&mm->ioctx_lock);
		old = rcu_dereference_raw(mm->ioctx_table);

		if (!old) {
			rcu_assign_pointer(mm->ioctx_table, table);
		} else if (table->nr > old->nr) {
			memcpy(table->table, old->table,
			       old->nr * sizeof(struct kioctx *));

			rcu_assign_pointer(mm->ioctx_table, table);
			kfree_rcu(old, rcu);
		} else {
			kfree(table);
			table = old;
		}
	}
}

static void aio_nr_sub(unsigned nr)
{
	spin_lock(&aio_nr_lock);
	if (WARN_ON(aio_nr - nr > aio_nr))
		aio_nr = 0;
	else
		aio_nr -= nr;
	spin_unlock(&aio_nr_lock);
}

/* ioctx_alloc
 *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
 */
static struct kioctx *ioctx_alloc(unsigned nr_events)
{
	struct mm_struct *mm = current->mm;
	struct kioctx *ctx;
	int err = -ENOMEM;

	/*
	 * Store the original nr_events -- what userspace passed to io_setup(),
	 * for counting against the global limit -- before it changes.
	 */
	unsigned int max_reqs = nr_events;

	/*
	 * We keep track of the number of available ringbuffer slots, to prevent
	 * overflow (reqs_available), and we also use percpu counters for this.
	 *
	 * So since up to half the slots might be on other cpu's percpu counters
	 * and unavailable, double nr_events so userspace sees what they
	 * expected: additionally, we move req_batch slots to/from percpu
	 * counters at a time, so make sure that isn't 0:
	 */
	nr_events = max(nr_events, num_possible_cpus() * 4);
	nr_events *= 2;

	/* Prevent overflows */
	if (nr_events > (0x10000000U / sizeof(struct io_event))) {
		pr_debug("ENOMEM: nr_events too high\n");
		return ERR_PTR(-EINVAL);
	}

	if (!nr_events || (unsigned long)max_reqs > aio_max_nr)
		return ERR_PTR(-EAGAIN);

	ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
	if (!ctx)
		return ERR_PTR(-ENOMEM);

	ctx->max_reqs = max_reqs;

	spin_lock_init(&ctx->ctx_lock);
	spin_lock_init(&ctx->completion_lock);
	mutex_init(&ctx->ring_lock);
	/* Protect against page migration throughout kioctx setup by keeping
	 * the ring_lock mutex held until setup is complete. */
	mutex_lock(&ctx->ring_lock);
	init_waitqueue_head(&ctx->wait);

	INIT_LIST_HEAD(&ctx->active_reqs);

	if (percpu_ref_init(&ctx->users, free_ioctx_users, 0, GFP_KERNEL))
		goto err;

	if (percpu_ref_init(&ctx->reqs, free_ioctx_reqs, 0, GFP_KERNEL))
		goto err;

	ctx->cpu = alloc_percpu(struct kioctx_cpu);
	if (!ctx->cpu)
		goto err;

	err = aio_setup_ring(ctx, nr_events);
	if (err < 0)
		goto err;

	atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
	ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
	if (ctx->req_batch < 1)
		ctx->req_batch = 1;

	/* limit the number of system wide aios */
	spin_lock(&aio_nr_lock);
	if (aio_nr + ctx->max_reqs > aio_max_nr ||
	    aio_nr + ctx->max_reqs < aio_nr) {
		spin_unlock(&aio_nr_lock);
		err = -EAGAIN;
		goto err_ctx;
	}
	aio_nr += ctx->max_reqs;
	spin_unlock(&aio_nr_lock);

	percpu_ref_get(&ctx->users);	/* io_setup() will drop this ref */
	percpu_ref_get(&ctx->reqs);	/* free_ioctx_users() will drop this */

	err = ioctx_add_table(ctx, mm);
	if (err)
		goto err_cleanup;

	/* Release the ring_lock mutex now that all setup is complete. */
	mutex_unlock(&ctx->ring_lock);

	pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
		 ctx, ctx->user_id, mm, ctx->nr_events);
	return ctx;

err_cleanup:
	aio_nr_sub(ctx->max_reqs);
err_ctx:
	atomic_set(&ctx->dead, 1);
	if (ctx->mmap_size)
		vm_munmap(ctx->mmap_base, ctx->mmap_size);
	aio_free_ring(ctx);
err:
	mutex_unlock(&ctx->ring_lock);
	free_percpu(ctx->cpu);
	percpu_ref_exit(&ctx->reqs);
	percpu_ref_exit(&ctx->users);
	kmem_cache_free(kioctx_cachep, ctx);
	pr_debug("error allocating ioctx %d\n", err);
	return ERR_PTR(err);
}

/* kill_ioctx
 *	Cancels all outstanding aio requests on an aio context.  Used
 *	when the processes owning a context have all exited to encourage
 *	the rapid destruction of the kioctx.
 */
static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
		      struct ctx_rq_wait *wait)
{
	struct kioctx_table *table;

	spin_lock(&mm->ioctx_lock);
	if (atomic_xchg(&ctx->dead, 1)) {
		spin_unlock(&mm->ioctx_lock);
		return -EINVAL;
	}

	table = rcu_dereference_raw(mm->ioctx_table);
	WARN_ON(ctx != rcu_access_pointer(table->table[ctx->id]));
	RCU_INIT_POINTER(table->table[ctx->id], NULL);
	spin_unlock(&mm->ioctx_lock);

	/* free_ioctx_reqs() will do the necessary RCU synchronization */
	wake_up_all(&ctx->wait);

	/*
	 * It'd be more correct to do this in free_ioctx(), after all
	 * the outstanding kiocbs have finished - but by then io_destroy
	 * has already returned, so io_setup() could potentially return
	 * -EAGAIN with no ioctxs actually in use (as far as userspace
	 *  could tell).
	 */
	aio_nr_sub(ctx->max_reqs);

	if (ctx->mmap_size)
		vm_munmap(ctx->mmap_base, ctx->mmap_size);

	ctx->rq_wait = wait;
	percpu_ref_kill(&ctx->users);
	return 0;
}

/*
 * exit_aio: called when the last user of mm goes away.  At this point, there is
 * no way for any new requests to be submitted or any of the io_* syscalls to be
 * called on the context.
 *
 * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
 * them.
 */
void exit_aio(struct mm_struct *mm)
{
	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
	struct ctx_rq_wait wait;
	int i, skipped;

	if (!table)
		return;

	atomic_set(&wait.count, table->nr);
	init_completion(&wait.comp);

	skipped = 0;
	for (i = 0; i < table->nr; ++i) {
		struct kioctx *ctx =
			rcu_dereference_protected(table->table[i], true);

		if (!ctx) {
			skipped++;
			continue;
		}

		/*
		 * We don't need to bother with munmap() here - exit_mmap(mm)
		 * is coming and it'll unmap everything. And we simply can't,
		 * this is not necessarily our ->mm.
		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
		 * that it needs to unmap the area, just set it to 0.
		 */
		ctx->mmap_size = 0;
		kill_ioctx(mm, ctx, &wait);
	}

	if (!atomic_sub_and_test(skipped, &wait.count)) {
		/* Wait until all IO for the context are done. */
		wait_for_completion(&wait.comp);
	}

	RCU_INIT_POINTER(mm->ioctx_table, NULL);
	kfree(table);
}

static void put_reqs_available(struct kioctx *ctx, unsigned nr)
{
	struct kioctx_cpu *kcpu;
	unsigned long flags;

	local_irq_save(flags);
	kcpu = this_cpu_ptr(ctx->cpu);
	kcpu->reqs_available += nr;

	while (kcpu->reqs_available >= ctx->req_batch * 2) {
		kcpu->reqs_available -= ctx->req_batch;
		atomic_add(ctx->req_batch, &ctx->reqs_available);
	}

	local_irq_restore(flags);
}

static bool __get_reqs_available(struct kioctx *ctx)
{
	struct kioctx_cpu *kcpu;
	bool ret = false;
	unsigned long flags;

	local_irq_save(flags);
	kcpu = this_cpu_ptr(ctx->cpu);
	if (!kcpu->reqs_available) {
		int avail = atomic_read(&ctx->reqs_available);

		do {
			if (avail < ctx->req_batch)
				goto out;
		} while (!atomic_try_cmpxchg(&ctx->reqs_available,
					     &avail, avail - ctx->req_batch));

		kcpu->reqs_available += ctx->req_batch;
	}

	ret = true;
	kcpu->reqs_available--;
out:
	local_irq_restore(flags);
	return ret;
}

/* refill_reqs_available
 *	Updates the reqs_available reference counts used for tracking the
 *	number of free slots in the completion ring.  This can be called
 *	from aio_complete() (to optimistically update reqs_available) or
 *	from aio_get_req() (the "we're out of events" case).  It must be
 *	called holding ctx->completion_lock.
 */
static void refill_reqs_available(struct kioctx *ctx, unsigned head,
                                  unsigned tail)
{
	unsigned events_in_ring, completed;

	/* Clamp head since userland can write to it. */
	head %= ctx->nr_events;
	if (head <= tail)
		events_in_ring = tail - head;
	else
		events_in_ring = ctx->nr_events - (head - tail);

	completed = ctx->completed_events;
	if (events_in_ring < completed)
		completed -= events_in_ring;
	else
		completed = 0;

	if (!completed)
		return;

	ctx->completed_events -= completed;
	put_reqs_available(ctx, completed);
}

/* user_refill_reqs_available
 *	Called to refill reqs_available when aio_get_req() encounters an
 *	out of space in the completion ring.
 */
static void user_refill_reqs_available(struct kioctx *ctx)
{
	spin_lock_irq(&ctx->completion_lock);
	if (ctx->completed_events) {
		struct aio_ring *ring;
		unsigned head;

		/* Access of ring->head may race with aio_read_events_ring()
		 * here, but that's okay since whether we read the old version
		 * or the new version, either will be valid.  The important
		 * part is that head cannot pass tail since we prevent
		 * aio_complete() from updating tail by holding
		 * ctx->completion_lock.  Even if head is invalid, the check
		 * against ctx->completed_events below will make sure we do the
		 * safe/right thing.
		 */
		ring = folio_address(ctx->ring_folios[0]);
		head = ring->head;

		refill_reqs_available(ctx, head, ctx->tail);
	}

	spin_unlock_irq(&ctx->completion_lock);
}

static bool get_reqs_available(struct kioctx *ctx)
{
	if (__get_reqs_available(ctx))
		return true;
	user_refill_reqs_available(ctx);
	return __get_reqs_available(ctx);
}

/* aio_get_req
 *	Allocate a slot for an aio request.
 * Returns NULL if no requests are free.
 *
 * The refcount is initialized to 2 - one for the async op completion,
 * one for the synchronous code that does this.
 */
static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
{
	struct aio_kiocb *req;

	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
	if (unlikely(!req))
		return NULL;

	if (unlikely(!get_reqs_available(ctx))) {
		kmem_cache_free(kiocb_cachep, req);
		return NULL;
	}

	percpu_ref_get(&ctx->reqs);
	req->ki_ctx = ctx;
	INIT_LIST_HEAD(&req->ki_list);
	refcount_set(&req->ki_refcnt, 2);
	req->ki_eventfd = NULL;
	return req;
}

static struct kioctx *lookup_ioctx(unsigned long ctx_id)
{
	struct aio_ring __user *ring  = (void __user *)ctx_id;
	struct mm_struct *mm = current->mm;
	struct kioctx *ctx, *ret = NULL;
	struct kioctx_table *table;
	unsigned id;

	if (get_user(id, &ring->id))
		return NULL;

	rcu_read_lock();
	table = rcu_dereference(mm->ioctx_table);

	if (!table || id >= table->nr)
		goto out;

	id = array_index_nospec(id, table->nr);
	ctx = rcu_dereference(table->table[id]);
	if (ctx && ctx->user_id == ctx_id) {
		if (percpu_ref_tryget_live(&ctx->users))
			ret = ctx;
	}
out:
	rcu_read_unlock();
	return ret;
}

static inline void iocb_destroy(struct aio_kiocb *iocb)
{
	if (iocb->ki_eventfd)
		eventfd_ctx_put(iocb->ki_eventfd);
	if (iocb->ki_filp)
		fput(iocb->ki_filp);
	percpu_ref_put(&iocb->ki_ctx->reqs);
	kmem_cache_free(kiocb_cachep, iocb);
}

struct aio_waiter {
	struct wait_queue_entry	w;
	size_t			min_nr;
};

/* aio_complete
 *	Called when the io request on the given iocb is complete.
 */
static void aio_complete(struct aio_kiocb *iocb)
{
	struct kioctx	*ctx = iocb->ki_ctx;
	struct aio_ring	*ring;
	struct io_event	*ev_page, *event;
	unsigned tail, pos, head, avail;
	unsigned long	flags;

	/*
	 * Add a completion event to the ring buffer. Must be done holding
	 * ctx->completion_lock to prevent other code from messing with the tail
	 * pointer since we might be called from irq context.
	 */
	spin_lock_irqsave(&ctx->completion_lock, flags);

	tail = ctx->tail;
	pos = tail + AIO_EVENTS_OFFSET;

	if (++tail >= ctx->nr_events)
		tail = 0;

	ev_page = folio_address(ctx->ring_folios[pos / AIO_EVENTS_PER_PAGE]);
	event = ev_page + pos % AIO_EVENTS_PER_PAGE;

	*event = iocb->ki_res;

	flush_dcache_folio(ctx->ring_folios[pos / AIO_EVENTS_PER_PAGE]);

	pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb,
		 (void __user *)(unsigned long)iocb->ki_res.obj,
		 iocb->ki_res.data, iocb->ki_res.res, iocb->ki_res.res2);

	/* after flagging the request as done, we
	 * must never even look at it again
	 */
	smp_wmb();	/* make event visible before updating tail */

	ctx->tail = tail;

	ring = folio_address(ctx->ring_folios[0]);
	head = ring->head;
	ring->tail = tail;
	flush_dcache_folio(ctx->ring_folios[0]);

	ctx->completed_events++;
	if (ctx->completed_events > 1)
		refill_reqs_available(ctx, head, tail);

	avail = tail > head
		? tail - head
		: tail + ctx->nr_events - head;
	spin_unlock_irqrestore(&ctx->completion_lock, flags);

	pr_debug("added to ring %p at [%u]\n", iocb, tail);

	/*
	 * Check if the user asked us to deliver the result through an
	 * eventfd. The eventfd_signal() function is safe to be called
	 * from IRQ context.
	 */
	if (iocb->ki_eventfd)
		eventfd_signal(iocb->ki_eventfd);

	/*
	 * We have to order our ring_info tail store above and test
	 * of the wait list below outside the wait lock.  This is
	 * like in wake_up_bit() where clearing a bit has to be
	 * ordered with the unlocked test.
	 */
	smp_mb();

	if (waitqueue_active(&ctx->wait)) {
		struct aio_waiter *curr, *next;
		unsigned long flags;

		spin_lock_irqsave(&ctx->wait.lock, flags);
		list_for_each_entry_safe(curr, next, &ctx->wait.head, w.entry)
			if (avail >= curr->min_nr) {
				wake_up_process(curr->w.private);
				list_del_init_careful(&curr->w.entry);
			}
		spin_unlock_irqrestore(&ctx->wait.lock, flags);
	}
}

static inline void iocb_put(struct aio_kiocb *iocb)
{
	if (refcount_dec_and_test(&iocb->ki_refcnt)) {
		aio_complete(iocb);
		iocb_destroy(iocb);
	}
}

/* aio_read_events_ring
 *	Pull an event off of the ioctx's event ring.  Returns the number of
 *	events fetched
 */
static long aio_read_events_ring(struct kioctx *ctx,
				 struct io_event __user *event, long nr)
{
	struct aio_ring *ring;
	unsigned head, tail, pos;
	long ret = 0;
	int copy_ret;

	/*
	 * The mutex can block and wake us up and that will cause
	 * wait_event_interruptible_hrtimeout() to schedule without sleeping
	 * and repeat. This should be rare enough that it doesn't cause
	 * performance issues. See the comment in read_events() for more detail.
	 */
	sched_annotate_sleep();
	mutex_lock(&ctx->ring_lock);

	/* Access to ->ring_folios here is protected by ctx->ring_lock. */
	ring = folio_address(ctx->ring_folios[0]);
	head = ring->head;
	tail = ring->tail;

	/*
	 * Ensure that once we've read the current tail pointer, that
	 * we also see the events that were stored up to the tail.
	 */
	smp_rmb();

	pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);

	if (head == tail)
		goto out;

	head %= ctx->nr_events;
	tail %= ctx->nr_events;

	while (ret < nr) {
		long avail;
		struct io_event *ev;
		struct folio *folio;

		avail = (head <= tail ?  tail : ctx->nr_events) - head;
		if (head == tail)
			break;

		pos = head + AIO_EVENTS_OFFSET;
		folio = ctx->ring_folios[pos / AIO_EVENTS_PER_PAGE];
		pos %= AIO_EVENTS_PER_PAGE;

		avail = min(avail, nr - ret);
		avail = min_t(long, avail, AIO_EVENTS_PER_PAGE - pos);

		ev = folio_address(folio);
		copy_ret = copy_to_user(event + ret, ev + pos,
					sizeof(*ev) * avail);

		if (unlikely(copy_ret)) {
			ret = -EFAULT;
			goto out;
		}

		ret += avail;
		head += avail;
		head %= ctx->nr_events;
	}

	ring = folio_address(ctx->ring_folios[0]);
	ring->head = head;
	flush_dcache_folio(ctx->ring_folios[0]);

	pr_debug("%li  h%u t%u\n", ret, head, tail);
out:
	mutex_unlock(&ctx->ring_lock);

	return ret;
}

static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
			    struct io_event __user *event, long *i)
{
	long ret = aio_read_events_ring(ctx, event + *i, nr - *i);

	if (ret > 0)
		*i += ret;

	if (unlikely(atomic_read(&ctx->dead)))
		ret = -EINVAL;

	if (!*i)
		*i = ret;

	return ret < 0 || *i >= min_nr;
}

static long read_events(struct kioctx *ctx, long min_nr, long nr,
			struct io_event __user *event,
			ktime_t until)
{
	struct hrtimer_sleeper	t;
	struct aio_waiter	w;
	long ret = 0, ret2 = 0;

	/*
	 * Note that aio_read_events() is being called as the conditional - i.e.
	 * we're calling it after prepare_to_wait() has set task state to
	 * TASK_INTERRUPTIBLE.
	 *
	 * But aio_read_events() can block, and if it blocks it's going to flip
	 * the task state back to TASK_RUNNING.
	 *
	 * This should be ok, provided it doesn't flip the state back to
	 * TASK_RUNNING and return 0 too much - that causes us to spin. That
	 * will only happen if the mutex_lock() call blocks, and we then find
	 * the ringbuffer empty. So in practice we should be ok, but it's
	 * something to be aware of when touching this code.
	 */
	aio_read_events(ctx, min_nr, nr, event, &ret);
	if (until == 0 || ret < 0 || ret >= min_nr)
		return ret;

	hrtimer_setup_sleeper_on_stack(&t, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	if (until != KTIME_MAX) {
		hrtimer_set_expires_range_ns(&t.timer, until, current->timer_slack_ns);
		hrtimer_sleeper_start_expires(&t, HRTIMER_MODE_REL);
	}

	init_wait(&w.w);

	while (1) {
		unsigned long nr_got = ret;

		w.min_nr = min_nr - ret;

		ret2 = prepare_to_wait_event(&ctx->wait, &w.w, TASK_INTERRUPTIBLE);
		if (!ret2 && !t.task)
			ret2 = -ETIME;

		if (aio_read_events(ctx, min_nr, nr, event, &ret) || ret2)
			break;

		if (nr_got == ret)
			schedule();
	}

	finish_wait(&ctx->wait, &w.w);
	hrtimer_cancel(&t.timer);
	destroy_hrtimer_on_stack(&t.timer);

	return ret;
}

/* sys_io_setup:
 *	Create an aio_context capable of receiving at least nr_events.
 *	ctxp must not point to an aio_context that already exists, and
 *	must be initialized to 0 prior to the call.  On successful
 *	creation of the aio_context, *ctxp is filled in with the resulting 
 *	handle.  May fail with -EINVAL if *ctxp is not initialized,
 *	or if the specified nr_events exceeds internal limits.  May fail
 *	with -EAGAIN if the specified nr_events exceeds the user's limit 
 *	of available events.  May fail with -ENOMEM if insufficient kernel
 *	resources are available.  May fail with -EFAULT if an invalid
 *	pointer is passed for ctxp.  Will fail with -ENOSYS if not
 *	implemented.
 */
SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
{
	struct kioctx *ioctx = NULL;
	unsigned long ctx;
	long ret;

	ret = get_user(ctx, ctxp);
	if (unlikely(ret))
		goto out;

	ret = -EINVAL;
	if (unlikely(ctx || nr_events == 0)) {
		pr_debug("EINVAL: ctx %lu nr_events %u\n",
		         ctx, nr_events);
		goto out;
	}

	ioctx = ioctx_alloc(nr_events);
	ret = PTR_ERR(ioctx);
	if (!IS_ERR(ioctx)) {
		ret = put_user(ioctx->user_id, ctxp);
		if (ret)
			kill_ioctx(current->mm, ioctx, NULL);
		percpu_ref_put(&ioctx->users);
	}

out:
	return ret;
}

#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p)
{
	struct kioctx *ioctx = NULL;
	unsigned long ctx;
	long ret;

	ret = get_user(ctx, ctx32p);
	if (unlikely(ret))
		goto out;

	ret = -EINVAL;
	if (unlikely(ctx || nr_events == 0)) {
		pr_debug("EINVAL: ctx %lu nr_events %u\n",
		         ctx, nr_events);
		goto out;
	}

	ioctx = ioctx_alloc(nr_events);
	ret = PTR_ERR(ioctx);
	if (!IS_ERR(ioctx)) {
		/* truncating is ok because it's a user address */
		ret = put_user((u32)ioctx->user_id, ctx32p);
		if (ret)
			kill_ioctx(current->mm, ioctx, NULL);
		percpu_ref_put(&ioctx->users);
	}

out:
	return ret;
}
#endif

/* sys_io_destroy:
 *	Destroy the aio_context specified.  May cancel any outstanding 
 *	AIOs and block on completion.  Will fail with -ENOSYS if not
 *	implemented.  May fail with -EINVAL if the context pointed to
 *	is invalid.
 */
SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
{
	struct kioctx *ioctx = lookup_ioctx(ctx);
	if (likely(NULL != ioctx)) {
		struct ctx_rq_wait wait;
		int ret;

		init_completion(&wait.comp);
		atomic_set(&wait.count, 1);

		/* Pass requests_done to kill_ioctx() where it can be set
		 * in a thread-safe way. If we try to set it here then we have
		 * a race condition if two io_destroy() called simultaneously.
		 */
		ret = kill_ioctx(current->mm, ioctx, &wait);
		percpu_ref_put(&ioctx->users);

		/* Wait until all IO for the context are done. Otherwise the kernel
		 * keeps using user-space buffers even if the user thinks the context
		 * is destroyed.
		 */
		if (!ret)
			wait_for_completion(&wait.comp);

		return ret;
	}
	pr_debug("EINVAL: invalid context id\n");
	return -EINVAL;
}

static void aio_remove_iocb(struct aio_kiocb *iocb)
{
	struct kioctx *ctx = iocb->ki_ctx;
	unsigned long flags;

	spin_lock_irqsave(&ctx->ctx_lock, flags);
	list_del(&iocb->ki_list);
	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
}

static void aio_complete_rw(struct kiocb *kiocb, long res)
{
	struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, rw);

	if (!list_empty_careful(&iocb->ki_list))
		aio_remove_iocb(iocb);

	if (kiocb->ki_flags & IOCB_WRITE) {
		struct inode *inode = file_inode(kiocb->ki_filp);

		if (S_ISREG(inode->i_mode))
			kiocb_end_write(kiocb);
	}

	iocb->ki_res.res = res;
	iocb->ki_res.res2 = 0;
	iocb_put(iocb);
}

static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
{
	int ret;

	req->ki_write_stream = 0;
	req->ki_complete = aio_complete_rw;
	req->private = NULL;
	req->ki_pos = iocb->aio_offset;
	req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW;
	if (iocb->aio_flags & IOCB_FLAG_RESFD)
		req->ki_flags |= IOCB_EVENTFD;
	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
		/*
		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
		 * aio_reqprio is interpreted as an I/O scheduling
		 * class and priority.
		 */
		ret = ioprio_check_cap(iocb->aio_reqprio);
		if (ret) {
			pr_debug("aio ioprio check cap error: %d\n", ret);
			return ret;
		}

		req->ki_ioprio = iocb->aio_reqprio;
	} else
		req->ki_ioprio = get_current_ioprio();

	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags, rw_type);
	if (unlikely(ret))
		return ret;

	req->ki_flags &= ~IOCB_HIPRI; /* no one is going to poll for this I/O */
	return 0;
}

static ssize_t aio_setup_rw(int rw, const struct iocb *iocb,
		struct iovec **iovec, bool vectored, bool compat,
		struct iov_iter *iter)
{
	void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf;
	size_t len = iocb->aio_nbytes;

	if (!vectored) {
		ssize_t ret = import_ubuf(rw, buf, len, iter);
		*iovec = NULL;
		return ret;
	}

	return __import_iovec(rw, buf, len, UIO_FASTIOV, iovec, iter, compat);
}

static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
{
	switch (ret) {
	case -EIOCBQUEUED:
		break;
	case -ERESTARTSYS:
	case -ERESTARTNOINTR:
	case -ERESTARTNOHAND:
	case -ERESTART_RESTARTBLOCK:
		/*
		 * There's no easy way to restart the syscall since other AIOs
		 * may already be running. Just fail this IO with EINTR.
		 */
		ret = -EINTR;
		fallthrough;
	default:
		req->ki_complete(req, ret);
	}
}

static int aio_read(struct kiocb *req, const struct iocb *iocb,
			bool vectored, bool compat)
{
	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
	struct iov_iter iter;
	struct file *file;
	int ret;

	ret = aio_prep_rw(req, iocb, READ);
	if (ret)
		return ret;
	file = req->ki_filp;
	if (unlikely(!(file->f_mode & FMODE_READ)))
		return -EBADF;
	if (unlikely(!file->f_op->read_iter))
		return -EINVAL;

	ret = aio_setup_rw(ITER_DEST, iocb, &iovec, vectored, compat, &iter);
	if (ret < 0)
		return ret;
	ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter));
	if (!ret)
		aio_rw_done(req, file->f_op->read_iter(req, &iter));
	kfree(iovec);
	return ret;
}

static int aio_write(struct kiocb *req, const struct iocb *iocb,
			 bool vectored, bool compat)
{
	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
	struct iov_iter iter;
	struct file *file;
	int ret;

	ret = aio_prep_rw(req, iocb, WRITE);
	if (ret)
		return ret;
	file = req->ki_filp;

	if (unlikely(!(file->f_mode & FMODE_WRITE)))
		return -EBADF;
	if (unlikely(!file->f_op->write_iter))
		return -EINVAL;

	ret = aio_setup_rw(ITER_SOURCE, iocb, &iovec, vectored, compat, &iter);
	if (ret < 0)
		return ret;
	ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
	if (!ret) {
		if (S_ISREG(file_inode(file)->i_mode))
			kiocb_start_write(req);
		req->ki_flags |= IOCB_WRITE;
		aio_rw_done(req, file->f_op->write_iter(req, &iter));
	}
	kfree(iovec);
	return ret;
}

static void aio_fsync_work(struct work_struct *work)
{
	struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, fsync.work);

	scoped_with_creds(iocb->fsync.creds)
		iocb->ki_res.res = vfs_fsync(iocb->fsync.file, iocb->fsync.datasync);

	put_cred(iocb->fsync.creds);
	iocb_put(iocb);
}

static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
		     bool datasync)
{
	if (unlikely(iocb->aio_buf || iocb->aio_offset || iocb->aio_nbytes ||
			iocb->aio_rw_flags))
		return -EINVAL;

	if (unlikely(!req->file->f_op->fsync))
		return -EINVAL;

	req->creds = prepare_creds();
	if (!req->creds)
		return -ENOMEM;

	req->datasync = datasync;
	INIT_WORK(&req->work, aio_fsync_work);
	schedule_work(&req->work);
	return 0;
}

static void aio_poll_put_work(struct work_struct *work)
{
	struct poll_iocb *req = container_of(work, struct poll_iocb, work);
	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);

	iocb_put(iocb);
}

/*
 * Safely lock the waitqueue which the request is on, synchronizing with the
 * case where the ->poll() provider decides to free its waitqueue early.
 *
 * Returns true on success, meaning that req->head->lock was locked, req->wait
 * is on req->head, and an RCU read lock was taken.  Returns false if the
 * request was already removed from its waitqueue (which might no longer exist).
 */
static bool poll_iocb_lock_wq(struct poll_iocb *req)
{
	wait_queue_head_t *head;

	/*
	 * While we hold the waitqueue lock and the waitqueue is nonempty,
	 * wake_up_pollfree() will wait for us.  However, taking the waitqueue
	 * lock in the first place can race with the waitqueue being freed.
	 *
	 * We solve this as eventpoll does: by taking advantage of the fact that
	 * all users of wake_up_pollfree() will RCU-delay the actual free.  If
	 * we enter rcu_read_lock() and see that the pointer to the queue is
	 * non-NULL, we can then lock it without the memory being freed out from
	 * under us, then check whether the request is still on the queue.
	 *
	 * Keep holding rcu_read_lock() as long as we hold the queue lock, in
	 * case the caller deletes the entry from the queue, leaving it empty.
	 * In that case, only RCU prevents the queue memory from being freed.
	 */
	rcu_read_lock();
	head = smp_load_acquire(&req->head);
	if (head) {
		spin_lock(&head->lock);
		if (!list_empty(&req->wait.entry))
			return true;
		spin_unlock(&head->lock);
	}
	rcu_read_unlock();
	return false;
}

static void poll_iocb_unlock_wq(struct poll_iocb *req)
{
	spin_unlock(&req->head->lock);
	rcu_read_unlock();
}

static void aio_poll_complete_work(struct work_struct *work)
{
	struct poll_iocb *req = container_of(work, struct poll_iocb, work);
	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
	struct poll_table_struct pt = { ._key = req->events };
	struct kioctx *ctx = iocb->ki_ctx;
	__poll_t mask = 0;

	if (!READ_ONCE(req->cancelled))
		mask = vfs_poll(req->file, &pt) & req->events;

	/*
	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
	 * synchronize with them.  In the cancellation case the list_del_init
	 * itself is not actually needed, but harmless so we keep it in to
	 * avoid further branches in the fast path.
	 */
	spin_lock_irq(&ctx->ctx_lock);
	if (poll_iocb_lock_wq(req)) {
		if (!mask && !READ_ONCE(req->cancelled)) {
			/*
			 * The request isn't actually ready to be completed yet.
			 * Reschedule completion if another wakeup came in.
			 */
			if (req->work_need_resched) {
				schedule_work(&req->work);
				req->work_need_resched = false;
			} else {
				req->work_scheduled = false;
			}
			poll_iocb_unlock_wq(req);
			spin_unlock_irq(&ctx->ctx_lock);
			return;
		}
		list_del_init(&req->wait.entry);
		poll_iocb_unlock_wq(req);
	} /* else, POLLFREE has freed the waitqueue, so we must complete */
	list_del_init(&iocb->ki_list);
	iocb->ki_res.res = mangle_poll(mask);
	spin_unlock_irq(&ctx->ctx_lock);

	iocb_put(iocb);
}

/* assumes we are called with irqs disabled */
static int aio_poll_cancel(struct kiocb *iocb)
{
	struct aio_kiocb *aiocb = container_of(iocb, struct aio_kiocb, rw);
	struct poll_iocb *req = &aiocb->poll;

	if (poll_iocb_lock_wq(req)) {
		WRITE_ONCE(req->cancelled, true);
		if (!req->work_scheduled) {
			schedule_work(&aiocb->poll.work);
			req->work_scheduled = true;
		}
		poll_iocb_unlock_wq(req);
	} /* else, the request was force-cancelled by POLLFREE already */

	return 0;
}

static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
		void *key)
{
	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
	__poll_t mask = key_to_poll(key);
	unsigned long flags;

	/* for instances that support it check for an event match first: */
	if (mask && !(mask & req->events))
		return 0;

	/*
	 * Complete the request inline if possible.  This requires that three
	 * conditions be met:
	 *   1. An event mask must have been passed.  If a plain wakeup was done
	 *	instead, then mask == 0 and we have to call vfs_poll() to get
	 *	the events, so inline completion isn't possible.
	 *   2. The completion work must not have already been scheduled.
	 *   3. ctx_lock must not be busy.  We have to use trylock because we
	 *	already hold the waitqueue lock, so this inverts the normal
	 *	locking order.  Use irqsave/irqrestore because not all
	 *	filesystems (e.g. fuse) call this function with IRQs disabled,
	 *	yet IRQs have to be disabled before ctx_lock is obtained.
	 */
	if (mask && !req->work_scheduled &&
	    spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
		struct kioctx *ctx = iocb->ki_ctx;

		list_del_init(&req->wait.entry);
		list_del(&iocb->ki_list);
		iocb->ki_res.res = mangle_poll(mask);
		if (iocb->ki_eventfd && !eventfd_signal_allowed()) {
			iocb = NULL;
			INIT_WORK(&req->work, aio_poll_put_work);
			schedule_work(&req->work);
		}
		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
		if (iocb)
			iocb_put(iocb);
	} else {
		/*
		 * Schedule the completion work if needed.  If it was already
		 * scheduled, record that another wakeup came in.
		 *
		 * Don't remove the request from the waitqueue here, as it might
		 * not actually be complete yet (we won't know until vfs_poll()
		 * is called), and we must not miss any wakeups.  POLLFREE is an
		 * exception to this; see below.
		 */
		if (req->work_scheduled) {
			req->work_need_resched = true;
		} else {
			schedule_work(&req->work);
			req->work_scheduled = true;
		}

		/*
		 * If the waitqueue is being freed early but we can't complete
		 * the request inline, we have to tear down the request as best
		 * we can.  That means immediately removing the request from its
		 * waitqueue and preventing all further accesses to the
		 * waitqueue via the request.  We also need to schedule the
		 * completion work (done above).  Also mark the request as
		 * cancelled, to potentially skip an unneeded call to ->poll().
		 */
		if (mask & POLLFREE) {
			WRITE_ONCE(req->cancelled, true);
			list_del_init(&req->wait.entry);

			/*
			 * Careful: this *must* be the last step, since as soon
			 * as req->head is NULL'ed out, the request can be
			 * completed and freed, since aio_poll_complete_work()
			 * will no longer need to take the waitqueue lock.
			 */
			smp_store_release(&req->head, NULL);
		}
	}
	return 1;
}

struct aio_poll_table {
	struct poll_table_struct	pt;
	struct aio_kiocb		*iocb;
	bool				queued;
	int				error;
};

static void
aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
		struct poll_table_struct *p)
{
	struct aio_poll_table *pt = container_of(p, struct aio_poll_table, pt);

	/* multiple wait queues per file are not supported */
	if (unlikely(pt->queued)) {
		pt->error = -EINVAL;
		return;
	}

	pt->queued = true;
	pt->error = 0;
	pt->iocb->poll.head = head;
	add_wait_queue(head, &pt->iocb->poll.wait);
}

static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
{
	struct kioctx *ctx = aiocb->ki_ctx;
	struct poll_iocb *req = &aiocb->poll;
	struct aio_poll_table apt;
	bool cancel = false;
	__poll_t mask;

	/* reject any unknown events outside the normal event mask. */
	if ((u16)iocb->aio_buf != iocb->aio_buf)
		return -EINVAL;
	/* reject fields that are not defined for poll */
	if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
		return -EINVAL;

	INIT_WORK(&req->work, aio_poll_complete_work);
	req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;

	req->head = NULL;
	req->cancelled = false;
	req->work_scheduled = false;
	req->work_need_resched = false;

	apt.pt._qproc = aio_poll_queue_proc;
	apt.pt._key = req->events;
	apt.iocb = aiocb;
	apt.queued = false;
	apt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */

	/* initialize the list so that we can do list_empty checks */
	INIT_LIST_HEAD(&req->wait.entry);
	init_waitqueue_func_entry(&req->wait, aio_poll_wake);

	mask = vfs_poll(req->file, &apt.pt) & req->events;
	spin_lock_irq(&ctx->ctx_lock);
	if (likely(apt.queued)) {
		bool on_queue = poll_iocb_lock_wq(req);

		if (!on_queue || req->work_scheduled) {
			/*
			 * aio_poll_wake() already either scheduled the async
			 * completion work, or completed the request inline.
			 */
			if (apt.error) /* unsupported case: multiple queues */
				cancel = true;
			apt.error = 0;
			mask = 0;
		}
		if (mask || apt.error) {
			/* Steal to complete synchronously. */
			list_del_init(&req->wait.entry);
		} else if (cancel) {
			/* Cancel if possible (may be too late though). */
			WRITE_ONCE(req->cancelled, true);
		} else if (on_queue) {
			/*
			 * Actually waiting for an event, so add the request to
			 * active_reqs so that it can be cancelled if needed.
			 */
			list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
			aiocb->ki_cancel = aio_poll_cancel;
		}
		if (on_queue)
			poll_iocb_unlock_wq(req);
	}
	if (mask) { /* no async, we'd stolen it */
		aiocb->ki_res.res = mangle_poll(mask);
		apt.error = 0;
	}
	spin_unlock_irq(&ctx->ctx_lock);
	if (mask)
		iocb_put(aiocb);
	return apt.error;
}

static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
			   struct iocb __user *user_iocb, struct aio_kiocb *req,
			   bool compat)
{
	req->ki_filp = fget(iocb->aio_fildes);
	if (unlikely(!req->ki_filp))
		return -EBADF;

	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
		struct eventfd_ctx *eventfd;
		/*
		 * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
		 * instance of the file* now. The file descriptor must be
		 * an eventfd() fd, and will be signaled for each completed
		 * event using the eventfd_signal() function.
		 */
		eventfd = eventfd_ctx_fdget(iocb->aio_resfd);
		if (IS_ERR(eventfd))
			return PTR_ERR(eventfd);

		req->ki_eventfd = eventfd;
	}

	if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) {
		pr_debug("EFAULT: aio_key\n");
		return -EFAULT;
	}

	req->ki_res.obj = (u64)(unsigned long)user_iocb;
	req->ki_res.data = iocb->aio_data;
	req->ki_res.res = 0;
	req->ki_res.res2 = 0;

	switch (iocb->aio_lio_opcode) {
	case IOCB_CMD_PREAD:
		return aio_read(&req->rw, iocb, false, compat);
	case IOCB_CMD_PWRITE:
		return aio_write(&req->rw, iocb, false, compat);
	case IOCB_CMD_PREADV:
		return aio_read(&req->rw, iocb, true, compat);
	case IOCB_CMD_PWRITEV:
		return aio_write(&req->rw, iocb, true, compat);
	case IOCB_CMD_FSYNC:
		return aio_fsync(&req->fsync, iocb, false);
	case IOCB_CMD_FDSYNC:
		return aio_fsync(&req->fsync, iocb, true);
	case IOCB_CMD_POLL:
		return aio_poll(req, iocb);
	default:
		pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode);
		return -EINVAL;
	}
}

static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
			 bool compat)
{
	struct aio_kiocb *req;
	struct iocb iocb;
	int err;

	if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
		return -EFAULT;

	/* enforce forwards compatibility on users */
	if (unlikely(iocb.aio_reserved2)) {
		pr_debug("EINVAL: reserve field set\n");
		return -EINVAL;
	}

	/* prevent overflows */
	if (unlikely(
	    (iocb.aio_buf != (unsigned long)iocb.aio_buf) ||
	    (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) ||
	    ((ssize_t)iocb.aio_nbytes < 0)
	   )) {
		pr_debug("EINVAL: overflow check\n");
		return -EINVAL;
	}

	req = aio_get_req(ctx);
	if (unlikely(!req))
		return -EAGAIN;

	err = __io_submit_one(ctx, &iocb, user_iocb, req, compat);

	/* Done with the synchronous reference */
	iocb_put(req);

	/*
	 * If err is 0, we'd either done aio_complete() ourselves or have
	 * arranged for that to be done asynchronously.  Anything non-zero
	 * means that we need to destroy req ourselves.
	 */
	if (unlikely(err)) {
		iocb_destroy(req);
		put_reqs_available(ctx, 1);
	}
	return err;
}

/* sys_io_submit:
 *	Queue the nr iocbs pointed to by iocbpp for processing.  Returns
 *	the number of iocbs queued.  May return -EINVAL if the aio_context
 *	specified by ctx_id is invalid, if nr is < 0, if the iocb at
 *	*iocbpp[0] is not properly initialized, if the operation specified
 *	is invalid for the file descriptor in the iocb.  May fail with
 *	-EFAULT if any of the data structures point to invalid data.  May
 *	fail with -EBADF if the file descriptor specified in the first
 *	iocb is invalid.  May fail with -EAGAIN if insufficient resources
 *	are available to queue any iocbs.  Will return 0 if nr is 0.  Will
 *	fail with -ENOSYS if not implemented.
 */
SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
		struct iocb __user * __user *, iocbpp)
{
	struct kioctx *ctx;
	long ret = 0;
	int i = 0;
	struct blk_plug plug;

	if (unlikely(nr < 0))
		return -EINVAL;

	ctx = lookup_ioctx(ctx_id);
	if (unlikely(!ctx)) {
		pr_debug("EINVAL: invalid context id\n");
		return -EINVAL;
	}

	if (nr > ctx->nr_events)
		nr = ctx->nr_events;

	if (nr > AIO_PLUG_THRESHOLD)
		blk_start_plug(&plug);
	for (i = 0; i < nr; i++) {
		struct iocb __user *user_iocb;

		if (unlikely(get_user(user_iocb, iocbpp + i))) {
			ret = -EFAULT;
			break;
		}

		ret = io_submit_one(ctx, user_iocb, false);
		if (ret)
			break;
	}
	if (nr > AIO_PLUG_THRESHOLD)
		blk_finish_plug(&plug);

	percpu_ref_put(&ctx->users);
	return i ? i : ret;
}

#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
		       int, nr, compat_uptr_t __user *, iocbpp)
{
	struct kioctx *ctx;
	long ret = 0;
	int i = 0;
	struct blk_plug plug;

	if (unlikely(nr < 0))
		return -EINVAL;

	ctx = lookup_ioctx(ctx_id);
	if (unlikely(!ctx)) {
		pr_debug("EINVAL: invalid context id\n");
		return -EINVAL;
	}

	if (nr > ctx->nr_events)
		nr = ctx->nr_events;

	if (nr > AIO_PLUG_THRESHOLD)
		blk_start_plug(&plug);
	for (i = 0; i < nr; i++) {
		compat_uptr_t user_iocb;

		if (unlikely(get_user(user_iocb, iocbpp + i))) {
			ret = -EFAULT;
			break;
		}

		ret = io_submit_one(ctx, compat_ptr(user_iocb), true);
		if (ret)
			break;
	}
	if (nr > AIO_PLUG_THRESHOLD)
		blk_finish_plug(&plug);

	percpu_ref_put(&ctx->users);
	return i ? i : ret;
}
#endif

/* sys_io_cancel:
 *	Attempts to cancel an iocb previously passed to io_submit.  If
 *	the operation is successfully cancelled, the resulting event is
 *	copied into the memory pointed to by result without being placed
 *	into the completion queue and 0 is returned.  May fail with
 *	-EFAULT if any of the data structures pointed to are invalid.
 *	May fail with -EINVAL if aio_context specified by ctx_id is
 *	invalid.  May fail with -EAGAIN if the iocb specified was not
 *	cancelled.  Will fail with -ENOSYS if not implemented.
 */
SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
		struct io_event __user *, result)
{
	struct kioctx *ctx;
	struct aio_kiocb *kiocb;
	int ret = -EINVAL;
	u32 key;
	u64 obj = (u64)(unsigned long)iocb;

	if (unlikely(get_user(key, &iocb->aio_key)))
		return -EFAULT;
	if (unlikely(key != KIOCB_KEY))
		return -EINVAL;

	ctx = lookup_ioctx(ctx_id);
	if (unlikely(!ctx))
		return -EINVAL;

	spin_lock_irq(&ctx->ctx_lock);
	list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
		if (kiocb->ki_res.obj == obj) {
			ret = kiocb->ki_cancel(&kiocb->rw);
			list_del_init(&kiocb->ki_list);
			break;
		}
	}
	spin_unlock_irq(&ctx->ctx_lock);

	if (!ret) {
		/*
		 * The result argument is no longer used - the io_event is
		 * always delivered via the ring buffer. -EINPROGRESS indicates
		 * cancellation is in progress:
		 */
		ret = -EINPROGRESS;
	}

	percpu_ref_put(&ctx->users);

	return ret;
}

static long do_io_getevents(aio_context_t ctx_id,
		long min_nr,
		long nr,
		struct io_event __user *events,
		struct timespec64 *ts)
{
	ktime_t until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX;
	struct kioctx *ioctx = lookup_ioctx(ctx_id);
	long ret = -EINVAL;

	if (likely(ioctx)) {
		if (likely(min_nr <= nr && min_nr >= 0))
			ret = read_events(ioctx, min_nr, nr, events, until);
		percpu_ref_put(&ioctx->users);
	}

	return ret;
}

/* io_getevents:
 *	Attempts to read at least min_nr events and up to nr events from
 *	the completion queue for the aio_context specified by ctx_id. If
 *	it succeeds, the number of read events is returned. May fail with
 *	-EINVAL if ctx_id is invalid, if min_nr is out of range, if nr is
 *	out of range, if timeout is out of range.  May fail with -EFAULT
 *	if any of the memory specified is invalid.  May return 0 or
 *	< min_nr if the timeout specified by timeout has elapsed
 *	before sufficient events are available, where timeout == NULL
 *	specifies an infinite timeout. Note that the timeout pointed to by
 *	timeout is relative.  Will fail with -ENOSYS if not implemented.
 */
#ifdef CONFIG_64BIT

SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
		long, min_nr,
		long, nr,
		struct io_event __user *, events,
		struct __kernel_timespec __user *, timeout)
{
	struct timespec64	ts;
	int			ret;

	if (timeout && unlikely(get_timespec64(&ts, timeout)))
		return -EFAULT;

	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
	if (!ret && signal_pending(current))
		ret = -EINTR;
	return ret;
}

#endif

struct __aio_sigset {
	const sigset_t __user	*sigmask;
	size_t		sigsetsize;
};

SYSCALL_DEFINE6(io_pgetevents,
		aio_context_t, ctx_id,
		long, min_nr,
		long, nr,
		struct io_event __user *, events,
		struct __kernel_timespec __user *, timeout,
		const struct __aio_sigset __user *, usig)
{
	struct __aio_sigset	ksig = { NULL, };
	struct timespec64	ts;
	bool interrupted;
	int ret;

	if (timeout && unlikely(get_timespec64(&ts, timeout)))
		return -EFAULT;

	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
		return -EFAULT;

	ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize);
	if (ret)
		return ret;

	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);

	interrupted = signal_pending(current);
	restore_saved_sigmask_unless(interrupted);
	if (interrupted && !ret)
		ret = -ERESTARTNOHAND;

	return ret;
}

#if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT)

SYSCALL_DEFINE6(io_pgetevents_time32,
		aio_context_t, ctx_id,
		long, min_nr,
		long, nr,
		struct io_event __user *, events,
		struct old_timespec32 __user *, timeout,
		const struct __aio_sigset __user *, usig)
{
	struct __aio_sigset	ksig = { NULL, };
	struct timespec64	ts;
	bool interrupted;
	int ret;

	if (timeout && unlikely(get_old_timespec32(&ts, timeout)))
		return -EFAULT;

	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
		return -EFAULT;


	ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize);
	if (ret)
		return ret;

	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);

	interrupted = signal_pending(current);
	restore_saved_sigmask_unless(interrupted);
	if (interrupted && !ret)
		ret = -ERESTARTNOHAND;

	return ret;
}

#endif

#if defined(CONFIG_COMPAT_32BIT_TIME)

SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id,
		__s32, min_nr,
		__s32, nr,
		struct io_event __user *, events,
		struct old_timespec32 __user *, timeout)
{
	struct timespec64 t;
	int ret;

	if (timeout && get_old_timespec32(&t, timeout))
		return -EFAULT;

	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
	if (!ret && signal_pending(current))
		ret = -EINTR;
	return ret;
}

#endif

#ifdef CONFIG_COMPAT

struct __compat_aio_sigset {
	compat_uptr_t		sigmask;
	compat_size_t		sigsetsize;
};

#if defined(CONFIG_COMPAT_32BIT_TIME)

COMPAT_SYSCALL_DEFINE6(io_pgetevents,
		compat_aio_context_t, ctx_id,
		compat_long_t, min_nr,
		compat_long_t, nr,
		struct io_event __user *, events,
		struct old_timespec32 __user *, timeout,
		const struct __compat_aio_sigset __user *, usig)
{
	struct __compat_aio_sigset ksig = { 0, };
	struct timespec64 t;
	bool interrupted;
	int ret;

	if (timeout && get_old_timespec32(&t, timeout))
		return -EFAULT;

	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
		return -EFAULT;

	ret = set_compat_user_sigmask(compat_ptr(ksig.sigmask), ksig.sigsetsize);
	if (ret)
		return ret;

	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);

	interrupted = signal_pending(current);
	restore_saved_sigmask_unless(interrupted);
	if (interrupted && !ret)
		ret = -ERESTARTNOHAND;

	return ret;
}

#endif

COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64,
		compat_aio_context_t, ctx_id,
		compat_long_t, min_nr,
		compat_long_t, nr,
		struct io_event __user *, events,
		struct __kernel_timespec __user *, timeout,
		const struct __compat_aio_sigset __user *, usig)
{
	struct __compat_aio_sigset ksig = { 0, };
	struct timespec64 t;
	bool interrupted;
	int ret;

	if (timeout && get_timespec64(&t, timeout))
		return -EFAULT;

	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
		return -EFAULT;

	ret = set_compat_user_sigmask(compat_ptr(ksig.sigmask), ksig.sigsetsize);
	if (ret)
		return ret;

	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);

	interrupted = signal_pending(current);
	restore_saved_sigmask_unless(interrupted);
	if (interrupted && !ret)
		ret = -ERESTARTNOHAND;

	return ret;
}
#endif

28/2 2026/02/24 18:31 llm expert
0m Model: gemini-3-flash-preview

Tokens:
  input: 93766
  output: 19
  thoughts: 809

Thoughts:
**Analyzing the AIO Ring Buffer Data Race**

Okay, so I've been looking through this AIO code, and the first thing I noticed is that `READ_ONCE`/`WRITE_ONCE` annotations are used elsewhere in the file (for example on `req->cancelled` in `aio_poll_wake` and `aio_poll_complete_work`), but the `ring->tail` accesses in `aio_complete` and `aio_read_events_ring` are plain. That's exactly what KCSAN is flagging. But then I caught this crucial comment in `user_refill_reqs_available`:

```c
/* Access of ring->head may race with aio_read_events_ring()
 * here, but that's okay since whether we read the old version
 * or the new version, and either will be valid.  The important
 * part is that head cannot pass tail since we prevent
 * aio_complete() from updating tail by holding
 * ctx->completion_lock.  Even if head is invalid, the check
 * against ctx->completed_events below will make sure we do the
 * safe/right thing.
 */
```

Right, that's the key. This race on `ring->head` is *explicitly* acknowledged as acceptable. The race on `ring->tail` between `aio_complete` and `aio_read_events_ring` is functionally identical. Let's break down the access pattern:

1.  Producer (`aio_complete`) writes the event data.
2.  `smp_wmb()` is called.
3.  Producer updates `ring->tail`.
4.  Consumer (`aio_read_events_ring`) reads `ring->tail`.
5.  `smp_rmb()` is called.
6.  Consumer reads the event data.

This is a textbook lockless producer-consumer pattern. The race on the index, `tail`, is entirely intentional. If the consumer reads an old `tail`, it just misses some events, which is fine. If it reads a new `tail`, the `smp_rmb()` ensures the data is visible.
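
As a minimal sketch of that pairing (simplified names and layout, not the actual fs/aio.c code, and assuming the usual kernel `smp_wmb()`/`smp_rmb()` primitives and `struct io_event` from `<linux/aio_abi.h>`), it looks roughly like this:

```c
/* Simplified illustration of the ordering above; not the real fs/aio.c. */
struct ring {
	unsigned int	head;
	unsigned int	tail;
	unsigned int	nr;		/* number of slots in io_events[] */
	struct io_event	io_events[];
};

/* Producer side (what aio_complete() does): store the event, then bump tail. */
static void ring_publish(struct ring *ring, const struct io_event *ev,
			 unsigned int tail)
{
	ring->io_events[tail] = *ev;		/* 1. write the event data */
	smp_wmb();				/* 2. data visible before the tail update */
	ring->tail = (tail + 1) % ring->nr;	/* 3. the plain store KCSAN flags */
}

/* Consumer side (what aio_read_events_ring() does): read tail, then the data. */
static unsigned int ring_events_ready(struct ring *ring, unsigned int head)
{
	unsigned int tail = ring->tail;		/* 4. the plain load KCSAN flags */

	smp_rmb();				/* 5. order the tail read before the data reads */

	/* events in [head, tail) are now safe to copy out */
	return tail >= head ? tail - head : (ring->nr - head) + tail;
}
```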

Given that the KCSAN warning is triggered because the accesses aren't explicitly marked with `READ_ONCE`/`WRITE_ONCE`, and considering the nature of the lockless algorithm, coupled with the explicit comment about the `ring->head` race, I'm comfortable that this is a benign data race. The instructions also state that "A 'benign' data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic". This behavior aligns perfectly with the stated definition of a benign race.

To be absolutely sure, I checked the references to `ring->tail` and found that the only places it is read or written (apart from initialization) are `aio_complete` and `aio_read_events_ring`. Likewise, `ring->head` is accessed in `aio_complete`, `aio_read_events_ring`, and `user_refill_reqs_available`. These accesses are all plain: the producer-side update happens under `ctx->completion_lock`, but the consumer reads are not serialized against it by any common lock, so correctness relies on the atomicity of aligned 32-bit accesses and on the paired memory barriers.
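
If one wanted to make the intent explicit and silence KCSAN, the two accesses could be annotated; this is purely illustrative, not an actual upstream patch, and on x86_64 it would not change the generated code for an aligned 32-bit access:

```c
/* Hypothetical annotations, shown only to illustrate what KCSAN is asking for. */

/* producer side, in aio_complete(): */
WRITE_ONCE(ring->tail, tail);

/* consumer side, in aio_read_events_ring(): */
tail = READ_ONCE(ring->tail);
smp_rmb();
```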

Therefore, I'm confident in classifying this race as benign.



29/2 2026/02/24 18:31 tool set-results
0m Args:
map[Benign:true Confident:true]

Results:
map[Benign:true Confident:true]

30/2 2026/02/24 18:31 llm expert
0m Model: gemini-3-flash-preview

Tokens:
  input: 95307
  output: 389
  thoughts: 279

Thoughts:
**Analyzing the Data Race in the AIO Ring Buffer**

Okay, so KCSAN flagged a potential data race here, and my initial thought is, "Ah, classic producer-consumer." Specifically, we're talking about the `aio_complete()` function, acting as the producer, updating `ring->tail`, and the `aio_read_events_ring()` function, our consumer, reading from that same `ring->tail`. It's that ring buffer again, which is a shared memory region designed for efficient event passing to userspace.

The code uses `smp_wmb()` in the producer to make sure all the event data is fully written to memory before `tail` gets bumped.  And on the consumer side, `smp_rmb()` ensures the reverse: that any reads of the event data happen *after* we've checked the latest `tail` value.  This is all by design.

The thing is, the accesses *aren't* wrapped in `READ_ONCE()` or `WRITE_ONCE()`. That's why KCSAN is getting noisy. But I believe the report is benign rather than a real bug: the race is real, it just has no harmful effect. We're on x86_64, where aligned 32-bit accesses are performed atomically, so reads and writes of `tail` can't be torn.

Even if the consumer happens to read a stale value of `tail` (which is possible given the lockless design), it's not the end of the world. It just means it'll see a slightly older view of the buffer, miss a few events, and catch up in the subsequent read or syscall. The code comments (like in `user_refill_reqs_available`) actually acknowledge this behavior. We *expect* similar race conditions on the `head` pointer and the code is structured to handle them.
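
To make the stale-tail point concrete, here's a hypothetical userspace sequence (illustrative values only, assuming libaio's `io_getevents()` and an `io_context_t` already set up with `io_setup()`); a reader that races with completion simply gets fewer events and picks up the rest on its next call:

```c
#include <libaio.h>

/* ctx is assumed to have in-flight requests completing concurrently. */
static void drain_example(io_context_t ctx)
{
	struct io_event evs[8];
	int got;

	/*
	 * If this call races with aio_complete() and observes an older
	 * tail, it may return only a subset of the completions, which is
	 * still correct: just fewer events than are "really" there.
	 */
	got = io_getevents(ctx, 1, 8, evs, NULL);	/* e.g. got == 3 */

	/*
	 * A later call reads the updated tail; the smp_wmb()/smp_rmb()
	 * pairing guarantees the event data is visible by then, so the
	 * remaining completions are returned.
	 */
	got = io_getevents(ctx, 1, 8, evs, NULL);	/* e.g. got == 5 */
	(void)got;
}
```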

So, in the end, I'm confident this data race is, for lack of a better word, *benign*. It's not leading to memory corruption, it won't crash the system, and with a conservative compiler, we should be fine. It's a trade-off: speed in exchange for a theoretical race condition that's handled correctly by the hardware and the overall algorithm. It's a well-known pattern.


