syzbot


ID Workflow Result Correct Bug Created Started Finished Revision Error
2a99d994-2690-4e69-9a4e-e8a844a26c3a assessment-kcsan Benign: ✅  Confident: ✅  KCSAN: data-race in copy_folio_from_iter_atomic / copy_folio_from_iter_atomic 2026/02/24 21:45 2026/02/24 21:45 2026/02/24 21:54 305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9

Explanation:
The data race occurs in `copy_folio_from_iter_atomic`, which is called during buffered file writes via `generic_perform_write`. 

### Analysis of the Race

1.  **Task 7152** is performing a buffered write to an **ext4** file (via `sendfile` -> `iter_file_splice_write`). It holds the `inode_lock` (exclusive) for the ext4 inode and is writing data into a folio in the ext4 page cache. The `write` reported by KCSAN is the `memcpy` into this ext4 folio.
2.  **Task 3545** is a loop device worker performing a write to a **shmem** backing file. The source of this write is a `bio` that contains the same ext4 folio (likely because someone used `splice`/`sendfile` to move data from the ext4 file to the loop device). The `read` reported by KCSAN is the `memcpy` reading from this ext4 folio to copy its contents into a shmem folio.
3.  **The Conflict**: Task 7152 is modifying the ext4 folio while Task 3545 is reading from it. While Task 7152 holds the `inode_lock` for the ext4 file, Task 3545 (the loop worker) does not, as it is only concerned with the destination shmem file's locking.

### Why it is Benign

*   **File Data Consistency**: In Linux, concurrent unsynchronized reads and writes to the same file range are allowed but result in undefined data (e.g., torn reads). This is a well-known behavior of the POSIX I/O model and is managed at the application level if consistency is required.
*   **Kernel Stability**: Both tasks are accessing valid, referenced folios. The `memcpy` operations are performed on kernel memory mapped via `kmap_local_folio`. A data race on the *contents* of these pages does not lead to kernel memory corruption, null pointer dereferences, or crashes.
*   **No Metadata Race**: The race is strictly on the data payload within the folio. Metadata updates (like marking the folio dirty or updating file sizes) are protected by the respective inode locks and other internal synchronization mechanisms.
*   **Atomic Copy**: `copy_folio_from_iter_atomic` is specifically designed to handle copies safely in various contexts, including those where page faults must be avoided.

In summary, this is a classic data race on file contents. While it can lead to inconsistent data being written to the loop device's backing file, it is not a kernel bug and does not compromise the stability of the system.

Final Verdict: **Benign**

Crash report:
==================================================================
BUG: KCSAN: data-race in copy_folio_from_iter_atomic / copy_folio_from_iter_atomic

write to 0xffff8881814fc4ba of 20 bytes by task 7152 on cpu 0:
 memcpy_from_iter lib/iov_iter.c:85 [inline]
 iterate_bvec include/linux/iov_iter.h:123 [inline]
 iterate_and_advance2 include/linux/iov_iter.h:306 [inline]
 iterate_and_advance include/linux/iov_iter.h:330 [inline]
 __copy_from_iter lib/iov_iter.c:261 [inline]
 copy_folio_from_iter_atomic+0x75f/0x1170 lib/iov_iter.c:491
 generic_perform_write+0x2c1/0x490 mm/filemap.c:4332
 ext4_buffered_write_iter+0x1ee/0x3c0 fs/ext4/file.c:300
 ext4_file_write_iter+0x380/0xfa0 fs/ext4/file.c:-1
 iter_file_splice_write+0x6bc/0xa80 fs/splice.c:738
 do_splice_from fs/splice.c:938 [inline]
 direct_splice_actor+0x156/0x2a0 fs/splice.c:1161
 splice_direct_to_actor+0x311/0x670 fs/splice.c:1105
 do_splice_direct_actor fs/splice.c:1204 [inline]
 do_splice_direct+0x119/0x1a0 fs/splice.c:1230
 do_sendfile+0x382/0x650 fs/read_write.c:1372
 __do_sys_sendfile64 fs/read_write.c:1433 [inline]
 __se_sys_sendfile64 fs/read_write.c:1419 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1419
 x64_sys_call+0x2dc4/0x3020 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff8881814fc400 of 1024 bytes by task 3545 on cpu 1:
 memcpy_from_iter lib/iov_iter.c:85 [inline]
 iterate_bvec include/linux/iov_iter.h:123 [inline]
 iterate_and_advance2 include/linux/iov_iter.h:306 [inline]
 iterate_and_advance include/linux/iov_iter.h:330 [inline]
 __copy_from_iter lib/iov_iter.c:261 [inline]
 copy_folio_from_iter_atomic+0x75f/0x1170 lib/iov_iter.c:491
 generic_perform_write+0x2c1/0x490 mm/filemap.c:4332
 shmem_file_write_iter+0xc5/0xf0 mm/shmem.c:3502
 lo_rw_aio+0x67d/0x730 drivers/block/loop.c:-1
 do_req_filebacked drivers/block/loop.c:-1 [inline]
 loop_handle_cmd drivers/block/loop.c:1926 [inline]
 loop_process_work+0x56c/0xac0 drivers/block/loop.c:1961
 loop_workfn+0x31/0x40 drivers/block/loop.c:1985
 process_one_work kernel/workqueue.c:3275 [inline]
 process_scheduled_works+0x4de/0x9e0 kernel/workqueue.c:3358
 worker_thread+0x581/0x770 kernel/workqueue.c:3439
 kthread+0x22a/0x280 kernel/kthread.c:467
 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 3545 Comm: kworker/u8:7 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
Workqueue: loop8 loop_workfn
==================================================================

Trajectory:
Seq Timestamp Type Name Duration
0/0 2026/02/24 21:45 flow assessment-kcsan
9m Results:
map[Benign:true Confident:true Explanation:The data race occurs in `copy_folio_from_iter_atomic`, which is called during buffered file writes via `generic_perform_write`. 

### Analysis of the Race

1.  **Task 7152** is performing a buffered write to an **ext4** file (via `sendfile` -> `iter_file_splice_write`). It holds the `inode_lock` (exclusive) for the ext4 inode and is writing data into a folio in the ext4 page cache. The `write` reported by KCSAN is the `memcpy` into this ext4 folio.
2.  **Task 3545** is a loop device worker performing a write to a **shmem** backing file. The source of this write is a `bio` that contains the same ext4 folio (likely because someone used `splice`/`sendfile` to move data from the ext4 file to the loop device). The `read` reported by KCSAN is the `memcpy` reading from this ext4 folio to copy its contents into a shmem folio.
3.  **The Conflict**: Task 7152 is modifying the ext4 folio while Task 3545 is reading from it. While Task 7152 holds the `inode_lock` for the ext4 file, Task 3545 (the loop worker) does not, as it is only concerned with the destination shmem file's locking.

### Why it is Benign

*   **File Data Consistency**: In Linux, concurrent unsynchronized reads and writes to the same file range are allowed but result in undefined data (e.g., torn reads). This is a well-known behavior of the POSIX I/O model and is managed at the application level if consistency is required.
*   **Kernel Stability**: Both tasks are accessing valid, referenced folios. The `memcpy` operations are performed on kernel memory mapped via `kmap_local_folio`. A data race on the *contents* of these pages does not lead to kernel memory corruption, null pointer dereferences, or crashes.
*   **No Metadata Race**: The race is strictly on the data payload within the folio. Metadata updates (like marking the folio dirty or updating file sizes) are protected by the respective inode locks and other internal synchronization mechanisms.
*   **Atomic Copy**: `copy_folio_from_iter_atomic` is specifically designed to handle copies safely in various contexts, including those where page faults must be avoided.

In summary, this is a classic data race on file contents. While it can lead to inconsistent data being written to the loop device's backing file, it is not a kernel bug and does not compromise the stability of the system.

Final Verdict: **Benign**]

1/1 2026/02/24 21:45 action kernel-checkouter
0m Results:
map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/64ce7073360d8ffada95f0fe7d47ad11e5540bda]

2/1 2026/02/24 21:45 action kernel-builder
3m Results:
map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/d1c9dae6e07a54194024c7c79d26c7d4c2234fd5]

3/1 2026/02/24 21:49 action codesearch-prepare
2m Results:
map[Index:codesearch-index]

4/1 2026/02/24 21:52 agent expert
2m Model:
gemini-3-flash-preview

Results:
map[Benign:true Confident:true]

Instruction:
You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

A "benign" data race is on a simple int/bool variable or similar field,
and the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flags fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful" data race) can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data race is a race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

Also consider races that happen at the same time with the given one.
If there is no synchronization in between, other memory accesses in the involved threads
race with each other if they access the same memory. For example, if both threads execute:

	some_struct->have_elements = true;
	list_add(new_node, &some_struct->list_head);

the race on some_struct->have_elements may appear benign, however it also implies there
is a race on some_struct->list_head which is not benign, since the list is not thread-safe.

Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses,
and thus even with a conservative compiler may not be fully atomic. However, such races may
still be benign depending on how writes are done, and how read data is used.

In the final reply explain why you think the given data race is benign or is harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.


Use set-results tool to provide results of the analysis.
It must be called exactly once before the final reply.
Ignore results of this tool.

Prompt:
The data race report is:

==================================================================
BUG: KCSAN: data-race in copy_folio_from_iter_atomic / copy_folio_from_iter_atomic

write to 0xffff8881814fc4ba of 20 bytes by task 7152 on cpu 0:
 memcpy_from_iter lib/iov_iter.c:85 [inline]
 iterate_bvec include/linux/iov_iter.h:123 [inline]
 iterate_and_advance2 include/linux/iov_iter.h:306 [inline]
 iterate_and_advance include/linux/iov_iter.h:330 [inline]
 __copy_from_iter lib/iov_iter.c:261 [inline]
 copy_folio_from_iter_atomic+0x75f/0x1170 lib/iov_iter.c:491
 generic_perform_write+0x2c1/0x490 mm/filemap.c:4332
 ext4_buffered_write_iter+0x1ee/0x3c0 fs/ext4/file.c:300
 ext4_file_write_iter+0x380/0xfa0 fs/ext4/file.c:-1
 iter_file_splice_write+0x6bc/0xa80 fs/splice.c:738
 do_splice_from fs/splice.c:938 [inline]
 direct_splice_actor+0x156/0x2a0 fs/splice.c:1161
 splice_direct_to_actor+0x311/0x670 fs/splice.c:1105
 do_splice_direct_actor fs/splice.c:1204 [inline]
 do_splice_direct+0x119/0x1a0 fs/splice.c:1230
 do_sendfile+0x382/0x650 fs/read_write.c:1372
 __do_sys_sendfile64 fs/read_write.c:1433 [inline]
 __se_sys_sendfile64 fs/read_write.c:1419 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1419
 x64_sys_call+0x2dc4/0x3020 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff8881814fc400 of 1024 bytes by task 3545 on cpu 1:
 memcpy_from_iter lib/iov_iter.c:85 [inline]
 iterate_bvec include/linux/iov_iter.h:123 [inline]
 iterate_and_advance2 include/linux/iov_iter.h:306 [inline]
 iterate_and_advance include/linux/iov_iter.h:330 [inline]
 __copy_from_iter lib/iov_iter.c:261 [inline]
 copy_folio_from_iter_atomic+0x75f/0x1170 lib/iov_iter.c:491
 generic_perform_write+0x2c1/0x490 mm/filemap.c:4332
 shmem_file_write_iter+0xc5/0xf0 mm/shmem.c:3502
 lo_rw_aio+0x67d/0x730 drivers/block/loop.c:-1
 do_req_filebacked drivers/block/loop.c:-1 [inline]
 loop_handle_cmd drivers/block/loop.c:1926 [inline]
 loop_process_work+0x56c/0xac0 drivers/block/loop.c:1961
 loop_workfn+0x31/0x40 drivers/block/loop.c:1985
 process_one_work kernel/workqueue.c:3275 [inline]
 process_scheduled_works+0x4de/0x9e0 kernel/workqueue.c:3358
 worker_thread+0x581/0x770 kernel/workqueue.c:3439
 kthread+0x22a/0x280 kernel/kthread.c:467
 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 3545 Comm: kworker/u8:7 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
Workqueue: loop8 loop_workfn
==================================================================


Reply:
The data race occurs in `copy_folio_from_iter_atomic`, which is called during buffered file writes via `generic_perform_write`. 

### Analysis of the Race

1.  **Task 7152** is performing a buffered write to an **ext4** file (via `sendfile` -> `iter_file_splice_write`). It holds the `inode_lock` (exclusive) for the ext4 inode and is writing data into a folio in the ext4 page cache. The `write` reported by KCSAN is the `memcpy` into this ext4 folio.
2.  **Task 3545** is a loop device worker performing a write to a **shmem** backing file. The source of this write is a `bio` that contains the same ext4 folio (likely because someone used `splice`/`sendfile` to move data from the ext4 file to the loop device). The `read` reported by KCSAN is the `memcpy` reading from this ext4 folio to copy its contents into a shmem folio.
3.  **The Conflict**: Task 7152 is modifying the ext4 folio while Task 3545 is reading from it. While Task 7152 holds the `inode_lock` for the ext4 file, Task 3545 (the loop worker) does not, as it is only concerned with the destination shmem file's locking.

### Why it is Benign

*   **File Data Consistency**: In Linux, concurrent unsynchronized reads and writes to the same file range are allowed but result in undefined data (e.g., torn reads). This is a well-known behavior of the POSIX I/O model and is managed at the application level if consistency is required.
*   **Kernel Stability**: Both tasks are accessing valid, referenced folios. The `memcpy` operations are performed on kernel memory mapped via `kmap_local_folio`. A data race on the *contents* of these pages does not lead to kernel memory corruption, null pointer dereferences, or crashes.
*   **No Metadata Race**: The race is strictly on the data payload within the folio. Metadata updates (like marking the folio dirty or updating file sizes) are protected by the respective inode locks and other internal synchronization mechanisms.
*   **Atomic Copy**: `copy_folio_from_iter_atomic` is specifically designed to handle copies safely in various contexts, including those where page faults must be avoided.

In summary, this is a classic data race on file contents. While it can lead to inconsistent data being written to the loop device's backing file, it is not a kernel bug and does not compromise the stability of the system.

Final Verdict: **Benign**

5/2 2026/02/24 21:52 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
  input: 4391
  output: 41
  thoughts: 601

Thoughts:
**Analyzing a Data Race in `copy_folio_from_iter_atomic`**

Okay, so I'm looking at this report and it looks like we have a data race within `copy_folio_from_iter_atomic`, which is called through `generic_perform_write`. Task 7152 is trying to write 20 bytes to address `0xffff8881814fc4ba`, while simultaneously, Task 3545 is reading 1024 bytes from `0xffff8881814fc400`. The report says "read *to*", which is a little unusual. KCSAN normally reports "read from" or "write to". This makes me think… wait… "read to" must indicate the destination of the read operation. This makes sense considering `copy_folio_from_iter_atomic`.

The report also shows me that `copy_folio_from_iter_atomic` doesn't directly copy the data. Instead it calls `__copy_from_iter`, which subsequently calls `memcpy_from_iter`, which uses `memcpy` in `lib/iov_iter.c:85`.

Now let's see how these tasks are reaching this function.

Task 7152's path is: `ext4_buffered_write_iter` -> `generic_perform_write` -> `copy_folio_from_iter_atomic`. Task 3545's path is: `shmem_file_write_iter` -> `generic_perform_write` -> `copy_folio_from_iter_atomic`.

Ah, okay, both tasks are writing data, but to different file systems. Task 7152 is writing to an ext4 file and is using `sendfile`. Task 3545 is writing to a shmem file that's probably on a loop device.

The addresses being accessed are: Task 7152 is writing to `0xffff8881814fc4ba` (20 bytes). Task 3545 is reading to `0xffff8881814fc400` (1024 bytes). Given the report's language, the "read to" implies that `0xffff8881814fc400` is the *destination* of the copy, which aligns with `copy_folio_from_iter_atomic`'s intention to copy data from an iterator (the source) to a folio (the destination). So, from the KCSAN report, "read to" means the memory location the value is being written into and "write to" means the memory location the value is being written from.

Therefore, the source of the read in `memcpy` should be `0xffff8881814fc400` (from task 3545), and `0xffff8881814fc4ba` is the destination (from task 7152), and this is confirmed by my understanding of `copy_folio_from_iter_atomic`. So this is `memcpy(dest, src, n)` - read from src and write to dest.

Let's check `lib/iov_iter.c:85` to confirm.



6/2 2026/02/24 21:52 tool codesearch-definition-source
0m Args:
map[ContextFile:lib/iov_iter.c IncludeLines:true Name:memcpy_from_iter]

Results:
map[SourceCode:  81:	static __always_inline
  82:	size_t memcpy_from_iter(void *iter_from, size_t progress,
  83:				size_t len, void *to, void *priv2)
  84:	{
  85:		memcpy(to + progress, iter_from, len);
  86:		return 0;
  87:	}
 SourceFile:lib/iov_iter.c]

7/2 2026/02/24 21:52 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
  input: 5873
  output: 45
  thoughts: 4352

Thoughts:
**Analyzing a Data Race in the Loop Device and Ext4 Interaction**

Okay, so I'm looking at this KCSAN report, and I need to piece together what's going on. It looks like `memcpy_from_iter` is the critical function, the callback for `iterate_and_advance`. I'm seeing two tasks, 7152 and 3545, both hitting the same memory address, which means a potential race condition.

Task 7152 is writing to `0xffff8881814fc4ba`, while task 3545 is doing a "read to" `0xffff8881814fc400`. The "read to" is bugging me a bit, as I'm used to seeing "read from". I double-checked the report; yes, it's reading to that address. My focus is reading into the direction of `to` and `iter_from` and the `memcpy` call in `iterate_and_advance`. `to` is my destination; so this is important!

The addresses are close, so they are hitting the same page. So that means task 3545 is reading from an area of memory that 7152 is writing to. This is where it gets interesting, as Task 3545 is involved in loop device operations, and Task 7152 is doing an ext4 write via `sendfile`.

`sendfile` itself is a clue. It generally involves reading data from one file (`in_file`) and writing to another (`out_file`). Task 7152 is writing to an ext4 file, so the data is coming from the `in_file` of the `sendfile`. But `sendfile` is actually using a pipe internally, or a `splice` mechanism! So I need to trace this `splice` call further.

The `sendfile` call will eventually call `generic_perform_write` which then calls `copy_folio_from_iter_atomic`. This confirms that 7152 is writing to the ext4 file's page cache. The source is the iterator created inside `iter_file_splice_write`, which comes from the pipe data in this case. In essence, `sendfile` is reading data from the `in_file` into this pipe, then writing the contents of the pipe to the ext4 file.

Task 3545 involves `shmem_file_write_iter` which is writing to a shmem file via the loop device. The loop device itself is taking requests from block I/O. Therefore I need to understand from where this iterator comes from.

The key to the race must be how data moves *between* these two processes. It looks like Task 7152 is writing to a file, and Task 3545 is reading it, as an iterator, and writing to another location. If the loop device's block layer is involved then the issue is the source and destination are shared. That's likely where the race exists. Task 3545 is reading *from* the ext4 folio, to *write to* shmem. This means those folios are the source for writes from the loop device.

The interesting thing is where the loop device block I/O comes from. If the ext4 folio is the *source* of the write data for the loop device then we have a problem. But how can ext4 data become the input for a loop write?

This suggests that `sendfile` is being used to write to the loop device. This means the block I/O is somehow referencing data from the ext4 file. So it must be a `sendfile(loop_fd, ext4_fd,...)` that's the cause.

So task 7152 is writing the ext4 file and that's the *target*, while task 3545 is reading from an area that `sendfile` may have written to.

Therefore 3545 is the loop worker. It handles the write request to the loop device, and reads the *request's* pages to write to the backing shmem file. And the kicker is that those "request pages" are *the ext4 folios*! This means someone is using data mapped from the ext4 file when writing to the loop device, and this could be an issue if there are other writes to the same folios.

Therefore, the key is Task A doing something like `write(loop_fd, ext4_mapped_buffer, len)`. I think the problem is that `sendfile` must be creating this issue somehow. If `sendfile` *reads* ext4 folios to populate a pipe for the *input* to loop, the ext


8/2 2026/02/24 21:52 tool codesearch-definition-source
0m Args:
map[ContextFile:lib/iov_iter.c IncludeLines:true Name:copy_folio_from_iter_atomic]

Results:
map[SourceCode: 472:	size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
 473:			size_t bytes, struct iov_iter *i)
 474:	{
 475:		size_t n, copied = 0;
 476:	
 477:		if (!page_copy_sane(&folio->page, offset, bytes))
 478:			return 0;
 479:		if (WARN_ON_ONCE(!i->data_source))
 480:			return 0;
 481:	
 482:		do {
 483:			char *to = kmap_local_folio(folio, offset);
 484:	
 485:			n = bytes - copied;
 486:			if (folio_test_partial_kmap(folio) &&
 487:			    n > PAGE_SIZE - offset_in_page(offset))
 488:				n = PAGE_SIZE - offset_in_page(offset);
 489:	
 490:			pagefault_disable();
 491:			n = __copy_from_iter(to, n, i);
 492:			pagefault_enable();
 493:			kunmap_local(to);
 494:			copied += n;
 495:			offset += n;
 496:		} while (copied != bytes && n > 0);
 497:	
 498:		return copied;
 499:	}
 SourceFile:lib/iov_iter.c]

9/2 2026/02/24 21:52 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
  input: 11730
  output: 40
  thoughts: 626

Thoughts:
**Analyzing the Data Race**

Okay, let's break this down. It looks like we've got a potential data race situation. The trace shows a `__copy_from_iter` call, which, if it's a `bvec` or `kvec` iterator (and this looks like a `bvec` iterator from the `iterate_bvec` reference), eventually leads to `memcpy_from_iter`. Specifically, Task 3545 is reading from a `bvec`, which, given the loop device context, likely means it's reading from an ext4 folio.

On the other hand, Task 7152 is writing to that same ext4 folio. It's doing so via `ext4_buffered_write_iter`, which calls `generic_perform_write`, and ultimately `copy_folio_from_iter_atomic`. The source for Task 7152's write is an iterator, likely a pipe iterator from `sendfile` through `iter_file_splice_write`. So we have, effectively, `memcpy(ext4_folio + offset, pipe_buffer, len)` for Task 7152 and `memcpy(shmem_folio + offset, ext4_folio + offset, len)` for Task 3545.

The crux of the matter is the two `memcpy`s racing on `ext4_folio + offset`. Task 7152 is the writer, and Task 3545 is the reader. In the standard concurrent I/O case, this is usually acceptable, and just means the shmem file will get a mix of old and new data. But is it *harmful* here?

`copy_folio_from_iter_atomic` uses `kmap_local_folio` and `pagefault_disable()`, which is a clear indication that it's designed for atomic context (or at least without causing page faults). The data race is on the *contents* of the page, not the page's existence, so we need to be careful.

If Task 3545 is reading from a page that's being modified, could this lead to a crash? `memcpy` from a partially modified page is generally safe, resulting in inconsistent data. But, if the page was *unmapped* during the read, it'd be a problem. However, both tasks have the folio pinned. Task 7152 is writing to it within `generic_perform_write`, so it's locked or at least referenced. Task 3545 is reading via a bio submitted to the loop device, so the bio holds the reference.

But does `generic_perform_write` lock the page? I should check `generic_perform_write` in `mm/filemap.c`. Because, if it doesn't lock before the `copy_folio_from_iter_atomic` call, we have a major issue.




10/2 2026/02/24 21:53 tool codesearch-definition-source
0m Args:
map[ContextFile:mm/filemap.c IncludeLines:true Name:generic_perform_write]

Results:
map[SourceCode:4286:	ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
4287:	{
4288:		struct file *file = iocb->ki_filp;
4289:		loff_t pos = iocb->ki_pos;
4290:		struct address_space *mapping = file->f_mapping;
4291:		const struct address_space_operations *a_ops = mapping->a_ops;
4292:		size_t chunk = mapping_max_folio_size(mapping);
4293:		long status = 0;
4294:		ssize_t written = 0;
4295:	
4296:		do {
4297:			struct folio *folio;
4298:			size_t offset;		/* Offset into folio */
4299:			size_t bytes;		/* Bytes to write to folio */
4300:			size_t copied;		/* Bytes copied from user */
4301:			void *fsdata = NULL;
4302:	
4303:			bytes = iov_iter_count(i);
4304:	retry:
4305:			offset = pos & (chunk - 1);
4306:			bytes = min(chunk - offset, bytes);
4307:			balance_dirty_pages_ratelimited(mapping);
4308:	
4309:			if (fatal_signal_pending(current)) {
4310:				status = -EINTR;
4311:				break;
4312:			}
4313:	
4314:			status = a_ops->write_begin(iocb, mapping, pos, bytes,
4315:							&folio, &fsdata);
4316:			if (unlikely(status < 0))
4317:				break;
4318:	
4319:			offset = offset_in_folio(folio, pos);
4320:			if (bytes > folio_size(folio) - offset)
4321:				bytes = folio_size(folio) - offset;
4322:	
4323:			if (mapping_writably_mapped(mapping))
4324:				flush_dcache_folio(folio);
4325:	
4326:			/*
4327:			 * Faults here on mmap()s can recurse into arbitrary
4328:			 * filesystem code. Lots of locks are held that can
4329:			 * deadlock. Use an atomic copy to avoid deadlocking
4330:			 * in page fault handling.
4331:			 */
4332:			copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
4333:			flush_dcache_folio(folio);
4334:	
4335:			status = a_ops->write_end(iocb, mapping, pos, bytes, copied,
4336:							folio, fsdata);
4337:			if (unlikely(status != copied)) {
4338:				iov_iter_revert(i, copied - max(status, 0L));
4339:				if (unlikely(status < 0))
4340:					break;
4341:			}
4342:			cond_resched();
4343:	
4344:			if (unlikely(status == 0)) {
4345:				/*
4346:				 * A short copy made ->write_end() reject the
4347:				 * thing entirely.  Might be memory poisoning
4348:				 * halfway through, might be a race with munmap,
4349:				 * might be severe memory pressure.
4350:				 */
4351:				if (chunk > PAGE_SIZE)
4352:					chunk /= 2;
4353:				if (copied) {
4354:					bytes = copied;
4355:					goto retry;
4356:				}
4357:	
4358:				/*
4359:				 * 'folio' is now unlocked and faults on it can be
4360:				 * handled. Ensure forward progress by trying to
4361:				 * fault it in now.
4362:				 */
4363:				if (fault_in_iov_iter_readable(i, bytes) == bytes) {
4364:					status = -EFAULT;
4365:					break;
4366:				}
4367:			} else {
4368:				pos += status;
4369:				written += status;
4370:			}
4371:		} while (iov_iter_count(i));
4372:	
4373:		if (!written)
4374:			return status;
4375:		iocb->ki_pos += written;
4376:		return written;
4377:	}
 SourceFile:mm/filemap.c]

11/2 2026/02/24 21:53 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
  input: 14348
  output: 41
  thoughts: 2040

Thoughts:
**Data Race Analysis: Concurrent Read/Write to ext4 Folio**

Okay, so here's the situation as I understand it. I've got Task 7152 that's calling `a_ops->write_begin`, which, for ext4, locks the folio. This makes sense; we want to ensure exclusive access for the write operation. Task 7152 holds the lock on that ext4 folio. But, and this is the kicker, Task 3545, the loop worker, is *also* reading from the same folio. It’s essentially a `memcpy` operation stemming from a bio, reading directly from that ext4 folio.

My initial thought was, "Does the loop worker check for the lock?" Nope. It’s a direct `memcpy` from the pages provided in the bio. The bio itself, it seems, was created earlier. The critical question: Was the folio locked when the bio was built? If this bio came about from something like `sendfile` -> pipe -> loop, then the folio was *probably* not locked when the bio was created. Or, perhaps it *was* locked and then unlocked after it had been put into the pipe.

Okay, let's step back and consider the `splice_to_pipe` path used by `sendfile`. That will: 1) find or create folios in the page cache; 2) wait for those folios to be up-to-date; 3) put them into the pipe. *At this point*, the folios are in the pipe and unlocked. Then `sendfile` calls `splice_from_pipe` (or similar) to write to the loop device (`out_fd`). `splice_write` (through `iter_file_splice_write`) will create an iterator from the pipe and calls `vfs_iter_write`. This, for a block device like the loop device, leads to `loop_make_request`. The pages, now from the pipe, are *now* in a bio and queued to the loop worker. The loop worker (Task 3545) starts its read.

And, bingo, during this read, Task 7152 can start its write to the *same* ext4 file. It locks the folio and calls `copy_folio_from_iter_atomic`, which, as the name suggests, is an atomic copy, but not under the shared context of the loop worker's read. So, we have concurrent read and write to the *same* ext4 folio. The loop worker is reading the folio to write to shmem, the other task is writing new data to the ext4 folio. Classic data race.

Now, the obvious question is, is this benign? In the Linux kernel, these data races on file contents are often considered benign, as long as they don't lead to memory corruption or crashes. A `memcpy` between valid kernel memory areas (folios) is safe. The resulting data in the shmem file will be a mix of old and new data, depending on when the ext4 folio was updated during the copy. This outcome is expected, especially if the user doesn't use any locking on their part.

But let's not be too quick to dismiss it. What about folio freeing? Task 3545 *does* hold a reference to the folio via the bio, and Task 7152 holds one too, so neither side can have the folio freed out from under it.

Metadata corruption? No, the race is on the *contents*.

Then, KCSAN throws a wrench into things: it flags any pair of concurrent, unsynchronized accesses to the same memory where at least one is a write, regardless of whether the race is actually harmful.

Looking at `generic_perform_write`, we see the pattern: `copy_folio_from_iter_atomic`, followed by `flush_dcache_folio`, and finally `write_end`. The race occurs inside `copy_folio_from_iter_atomic`; `write_end` then marks the folio dirty and unlocks it. Task 3545 (the loop worker) is running the same `generic_perform_write` path, but against shmem, and its half of the race is the read from the ext4 folio. Task 7152's write does update the ext4 folio's dirty flag via `write_end`, but Task 3545 never touches that metadata, only the contents.
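For reference, the shape of that loop, as a simplified non-runnable sketch (function names follow the upstream write path, but argument lists are approximate and error handling is elided):

```c
/* Simplified sketch of generic_perform_write(); not the literal source. */
do {
	/* write_begin: returns a locked destination folio */
	status = a_ops->write_begin(file, mapping, pos, bytes, &folio, &fsdata);

	/* the raced memcpy: copies from the iterator (here, bvecs that
	 * still point at ext4 page-cache folios) into the locked folio */
	copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

	flush_dcache_folio(folio);

	/* write_end: marks the folio dirty and unlocks it */
	status = a_ops->write_end(file, mapping, pos, bytes, copied,
				  folio, fsdata);
} while (iov_iter_count(i));
```

Note that the folio lock taken here protects the *destination* folio only; nothing in this loop locks the *source* folios behind the iterator, which is exactly where the race lives.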

Could this be a “stable writes” issue? No, we're not actually writing to *disk* here. The ext4 folio is being read by the loop device which is then writing to shmem. Shmem doesn't have the same guarantees about stable writes. Even if it did, the race is on the *source* of the write, which is ext4 here. If the source changes during the copy, the shmem folio will just have mixed data.

What if the ext


12/2 2026/02/24 21:53 tool codesearch-definition-source
0m Args:
map[ContextFile:drivers/block/loop.c IncludeLines:true Name:lo_rw_aio]

Results:
map[SourceCode: 340:	static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 341:			     loff_t pos, int rw)
 342:	{
 343:		struct iov_iter iter;
 344:		struct req_iterator rq_iter;
 345:		struct bio_vec *bvec;
 346:		struct request *rq = blk_mq_rq_from_pdu(cmd);
 347:		struct bio *bio = rq->bio;
 348:		struct file *file = lo->lo_backing_file;
 349:		struct bio_vec tmp;
 350:		unsigned int offset;
 351:		unsigned int nr_bvec;
 352:		int ret;
 353:	
 354:		nr_bvec = blk_rq_nr_bvec(rq);
 355:	
 356:		if (rq->bio != rq->biotail) {
 357:	
 358:			bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 359:					     GFP_NOIO);
 360:			if (!bvec)
 361:				return -EIO;
 362:			cmd->bvec = bvec;
 363:	
 364:			/*
 365:			 * The bios of the request may be started from the middle of
 366:			 * the 'bvec' because of bio splitting, so we can't directly
 367:			 * copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
 368:			 * API will take care of all details for us.
 369:			 */
 370:			rq_for_each_bvec(tmp, rq, rq_iter) {
 371:				*bvec = tmp;
 372:				bvec++;
 373:			}
 374:			bvec = cmd->bvec;
 375:			offset = 0;
 376:		} else {
 377:			/*
 378:			 * Same here, this bio may be started from the middle of the
 379:			 * 'bvec' because of bio splitting, so offset from the bvec
 380:			 * must be passed to iov iterator
 381:			 */
 382:			offset = bio->bi_iter.bi_bvec_done;
 383:			bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
 384:		}
 385:		atomic_set(&cmd->ref, 2);
 386:	
 387:		iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
 388:		iter.iov_offset = offset;
 389:	
 390:		cmd->iocb.ki_pos = pos;
 391:		cmd->iocb.ki_filp = file;
 392:		cmd->iocb.ki_ioprio = req_get_ioprio(rq);
 393:		if (cmd->use_aio) {
 394:			cmd->iocb.ki_complete = lo_rw_aio_complete;
 395:			cmd->iocb.ki_flags = IOCB_DIRECT;
 396:		} else {
 397:			cmd->iocb.ki_complete = NULL;
 398:			cmd->iocb.ki_flags = 0;
 399:		}
 400:	
 401:		if (rw == ITER_SOURCE) {
 402:			kiocb_start_write(&cmd->iocb);
 403:			ret = file->f_op->write_iter(&cmd->iocb, &iter);
 404:		} else
 405:			ret = file->f_op->read_iter(&cmd->iocb, &iter);
 406:	
 407:		lo_rw_aio_do_completion(cmd);
 408:	
 409:		if (ret != -EIOCBQUEUED)
 410:			lo_rw_aio_complete(&cmd->iocb, ret);
 411:		return -EIOCBQUEUED;
 412:	}
 SourceFile:drivers/block/loop.c]

13/2 2026/02/24 21:53 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 18644
							output: 45
							thoughts: 834
						

Thoughts:
**Analyzing a Potential Data Race**

Okay, so I'm looking at this scenario where `lo_rw_aio` is creating an `iov_iter` from request bvecs and then calls `write_iter` on the backing file. If this request originated from a `sendfile` utilizing `splice` to pull pages from an ext4 file, then the bvecs are essentially pointing to ext4 folios.

Now, the potential race is what's catching my eye. There are two `memcpy` operations that are conflicting. One is reading from the ext4 folio, specifically within `shmem_file_write_iter` ultimately calling `copy_folio_from_iter_atomic`. The other is writing to that same ext4 folio, originating from `ext4_buffered_write_iter`, which also ends up calling `copy_folio_from_iter_atomic`. This is a clear data race on the file data.

The important question is whether this is actually harmful. If the data ends up torn in the shmem, it's just file data corruption, which is already a possibility with concurrent, unsynchronized I/O. So the kernel should still be stable. But, I should go through the mental checklist to see if there's anything else in play.

Okay, is the `memcpy` implementation problematic? No, `memcpy` itself should be fine. Is there any other shared state? The `iov_iter`s are local to each task's call stack. The folios are shared, but they are mapped via `kmap_local_folio`, so the kernel addresses being copied are valid for the duration of the copy.

Now, a point to consider is that `copy_folio_from_iter_atomic` uses `pagefault_disable()`. This is to avoid potential deadlocks if the iterator pointed to a memory-mapped region of the file being written to. But in this case, the iterator points to kernel pages from bvecs for one task and pipe for the other. So, no page faults are expected either way. So it boils down to just a race on the actual data inside those pages.

Are we at risk of invalid addresses with `memcpy`? No, the pages are pinned and referenced, and it's the same for the destination folio. So there shouldn't be.

My preliminary conclusion is that this is a benign data race, in the sense that it isn't going to crash the kernel or corrupt memory. It *might* result in inconsistent data, but that's within the realm of what's expected with concurrent, unsynchronized I/O.

Okay, let's step back, and check if there's anything similar that's been reported before, and maybe there's some useful context there. Ah, I've seen similar reports about `copy_to_user` and `copy_from_user`. Those are usually benign as well if they are just racing on user data. This is different in that it's kernel-to-kernel copy, using `copy_folio_from_iter_atomic` with bvec/pipe.

Now, let me investigate `ext4_buffered_write_iter` specifically. Does it introduce any relevant factors? It does take `i_rwsem`: Task 7152 is holding it on the ext4 file. But Task 3545 *isn't* holding `i_rwsem` on the ext4 file; it holds it on the shmem file. That's why the race exists. Even if Task 3545 were doing a "normal" read from the ext4 file with `i_rwsem` held in shared mode, one shared holder would not exclude another shared holder; only the writer's exclusive mode serializes anything. The question is, does the buffered write take `i_rwsem` in exclusive mode? Yes, usually it does. Let me just confirm that in the code... Yup.



14/2 2026/02/24 21:53 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/ext4/file.c IncludeLines:true Name:ext4_buffered_write_iter]

Results:
map[SourceCode: 286:	static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 287:						struct iov_iter *from)
 288:	{
 289:		ssize_t ret;
 290:		struct inode *inode = file_inode(iocb->ki_filp);
 291:	
 292:		if (iocb->ki_flags & IOCB_NOWAIT)
 293:			return -EOPNOTSUPP;
 294:	
 295:		inode_lock(inode);
 296:		ret = ext4_write_checks(iocb, from);
 297:		if (ret <= 0)
 298:			goto out;
 299:	
 300:		ret = generic_perform_write(iocb, from);
 301:	
 302:	out:
 303:		inode_unlock(inode);
 304:		if (unlikely(ret <= 0))
 305:			return ret;
 306:		return generic_write_sync(iocb, ret);
 307:	}
 SourceFile:fs/ext4/file.c]

15/2 2026/02/24 21:53 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 20699
							output: 43
							thoughts: 1151
						

Thoughts:
**Benign Data Race in ext4/loop Device Interaction**

Okay, let's break this down. I see that `inode_lock(inode)` is acquired exclusively by Task 7152, meaning it has exclusive access to the ext4 inode. That part is straightforward. If another task tries to write or read directly using the standard file operations on the same file, it would be blocked due to the lock.

However, Task 3545 isn't using the normal file operations. It's working with the folios directly, because they were passed to the loop device within a bio. This bio came about from a `sendfile` operation from the ext4 file to the loop device. The core issue is that `sendfile`'s read operation, specifically within `splice_to_pipe`, takes the lock, gets the pages, and then *releases* the lock. Once these pages are in the pipe, they're no longer protected by the `inode_lock`. Then, they're sent to the loop device for async processing.

This is where the race happens. While the loop device is asynchronously processing those pages, Task 7152 or another task could write to the ext4 file again. The loop device, using the original pages, will not be privy to those changes.

Is this a bug in `splice`? Well, it's a known characteristic. `splice` doesn't guarantee data stability. If you need that, you'd use `O_DIRECT` or similar mechanisms, and even then, it's tricky. For most applications, this is "don't do that if you want data consistency."

From the kernel's perspective, this isn't a crash-inducing bug. It's a data race on the contents of the page.

I should verify if there are any other consequences. Could `copy_folio_from_iter_atomic` have any implications? But it just returns the number of bytes copied. If it's short, `generic_perform_write` handles it. `memcpy` doesn't really "fail." It just returns the length copied. Looking at `__copy_from_iter` for `ITER_BVEC`, it's just `memcpy`.

So, yeah, it really is a data race on the data itself.

One thing to consider is `flush_dcache_folio(folio)`. Task 7152 calls it after writing. Task 3545 (shmem write) also calls it, but 3545 is reading from the ext4 folio, so does it need to flush the dcache? `memcpy` generally handles dcache coherency, or the caller does. Here it's a read. In any case, KCSAN is reporting the race on `memcpy` itself.

Is there any reason to think this isn't benign? Is there a case that this could lead to a kernel panic? I don't see one. The addresses are valid, sizes are within bounds.

What if there's a race on `bytes` or `offset`? No, those are local variables in each task. The only shared thing is the memory behind the pointers. Let's verify the addresses: Task 7152 is writing to `0xffff8881814fc4ba` (20 bytes), Task 3545 is reading from `0xffff8881814fc400` (1024 bytes). `0x4ba` falls inside the read range, so this is very much a standard data race on the data itself. I'm going to mark this as benign.

I should check if there are any similar cases where `data_race()` is used for `memcpy`. Actually, `memcpy` is often used where data races are possible.

Could there be something funky with type-punning? No, it's just bytes.

One detail: Task 3545 is `shmem_file_write_iter`. Shmem is special, but `generic_perform_write` is used for shmem too. I should check if `shmem_file_write_iter` does something different, but I don't see a significant difference that would change this analysis.



16/2 2026/02/24 21:53 tool codesearch-definition-source
0m Args:
map[ContextFile:mm/shmem.c IncludeLines:true Name:shmem_file_write_iter]

Results:
map[SourceCode:3486:	static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
3487:	{
3488:		struct file *file = iocb->ki_filp;
3489:		struct inode *inode = file->f_mapping->host;
3490:		ssize_t ret;
3491:	
3492:		inode_lock(inode);
3493:		ret = generic_write_checks(iocb, from);
3494:		if (ret <= 0)
3495:			goto unlock;
3496:		ret = file_remove_privs(file);
3497:		if (ret)
3498:			goto unlock;
3499:		ret = file_update_time(file);
3500:		if (ret)
3501:			goto unlock;
3502:		ret = generic_perform_write(iocb, from);
3503:	unlock:
3504:		inode_unlock(inode);
3505:		return ret;
3506:	}
 SourceFile:mm/shmem.c]

17/2 2026/02/24 21:53 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 23132
							output: 42
							thoughts: 1064
						

Thoughts:
**Analyzing a Potential Data Race in File I/O**

Okay, let's break this down. It looks like we've got a data race report, and the initial symptoms remind me of how `ext4_buffered_write_iter` works. We've got two tasks, Task 3545 and Task 7152. From the outset, Task 3545 seems to be holding the `inode_lock` on a shmem file, while Task 7152 has the lock for an ext4 file. Since they're different files, the locking itself isn't the problem, but the fact that they're sharing the same data pages because of `splice` or `sendfile` is immediately concerning. That suggests a benign data race on file contents, which isn't the end of the world but warrants a closer look.

Let me just check... KCSAN only shows those two tasks involved, so that simplifies things a bit.

Could this be related to the `loop` device's `transfer` function? No, the stack trace shows `shmem_file_write_iter` -> `generic_perform_write`, so the loop device is acting as a backing file and writing to it normally.

I need to confirm what type of I/O each task is doing. Is the `iov_iter` in Task 3545 a `user` iterator? No, the stack trace shows `iterate_bvec`. The `lo_rw_aio` function is creating a bvec iterator.

What about Task 7152? It's `ext4_buffered_write_iter` -> `generic_perform_write`. The iterator `from` comes from `ext4_file_write_iter`. Within `ext4_file_write_iter`, the code branches, either calling `ext4_dio_write_iter` if `IOCB_DIRECT` is set or `ext4_buffered_write_iter`. And `ext4_file_write_iter` is itself called from `iter_file_splice_write`. This means Task 7152 is using a pipe iterator. The `__copy_from_iter` function for a pipe iterator is a simple wrapper for `iterate_and_advance` which is fine.

Hold on, I see `iterate_bvec` in BOTH tasks, according to the report. That means Task 7152 is also using a bvec iterator. But wait, if Task 7152 is indeed `iter_file_splice_write`, why is it using a bvec iterator? Ah, `iter_file_splice_write` might use bvecs if the pipe buffers are already in bvec form? No, pipes use `pipe_buffer`.

Let me re-read the report... Both tasks show `iterate_bvec`. So whatever iterator `iter_file_splice_write` builds for the write side must be a bvec iterator; I need to see how that iterator is actually constructed.



18/2 2026/02/24 21:54 tool codesearch-definition-source
0m Args:
map[ContextFile:lib/iov_iter.c IncludeLines:true Name:__copy_from_iter]

Results:
map[SourceCode: 258:	static __always_inline
 259:	size_t __copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 260:	{
 261:		return iterate_and_advance(i, bytes, addr,
 262:					   copy_from_user_iter, memcpy_from_iter);
 263:	}
 SourceFile:lib/iov_iter.c]

19/2 2026/02/24 21:54 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 25005
							output: 20
							thoughts: 217
						

Thoughts:
**Debugging a Mismatched Iterator Type**

Okay, so I'm digging into this KCSAN report. The code uses `iterate_and_advance`, which is, as expected, a macro or inline function. It's supposed to handle different iterator types by dispatching to the appropriate function. Specifically, it should call `iterate_bvec` when the iterator type is `ITER_BVEC` and `iterate_pipe` when it's `ITER_PIPE`.

The problem is, Task 7152, which I *believe* is related to `iter_file_splice_write`, *should* be operating with a `ITER_PIPE` iterator. However, KCSAN is pointing to `iterate_bvec`. This doesn't make sense! I have to figure out why.

My first thought is that the function being called, although it appears to be `iter_file_splice_write`, is actually something else.

Let's look at the stack trace KCSAN gave me: `iter_file_splice_write+0x6bc/0xa80 fs/splice.c:738`... right, that points directly to the function *it* thinks is the problem, but let's make sure. The stack trace indicates `iter_file_splice_write` is at the top, then it goes down through `do_splice_from` and `direct_splice_actor`.

Okay, so the trace does indeed point to the `iter_file_splice_write` function! So, what's happening *inside* that function? I'm going to look at `fs/splice.c:738` to see what is going on.



20/2 2026/02/24 21:54 tool read-file
0m Args:
map[File:fs/splice.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
 * "splice": joining two ropes together by interweaving their strands.
 *
 * This is the "extended pipe" functionality, where a pipe is used as
 * an arbitrary in-memory buffer. Think of a pipe as a small kernel
 * buffer that you can use to transfer data from one end to the other.
 *
 * The traditional unix read/write is extended with a "splice()" operation
 * that transfers data buffers to or from a pipe buffer.
 *
 * Named by Larry McVoy, original implementation from Linus, extended by
 * Jens to support splicing to files, network, direct splicing, etc and
 * fixing lots of bugs.
 *
 * Copyright (C) 2005-2006 Jens Axboe <axboe@kernel.dk>
 * Copyright (C) 2005-2006 Linus Torvalds <torvalds@osdl.org>
 * Copyright (C) 2006 Ingo Molnar <mingo@elte.hu>
 *
 */
#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/pagemap.h>
#include <linux/splice.h>
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/export.h>
#include <linux/syscalls.h>
#include <linux/uio.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
#include <linux/gfp.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/sched/signal.h>

#include "internal.h"

/*
 * Splice doesn't support FMODE_NOWAIT. Since pipes may set this flag to
 * indicate they support non-blocking reads or writes, we must clear it
 * here if set to avoid blocking other users of this pipe if splice is
 * being done on it.
 */
static noinline void pipe_clear_nowait(struct file *file)
{
	fmode_t fmode = READ_ONCE(file->f_mode);

	do {
		if (!(fmode & FMODE_NOWAIT))
			break;
	} while (!try_cmpxchg(&file->f_mode, &fmode, fmode & ~FMODE_NOWAIT));
}

/*
 * Attempt to steal a page from a pipe buffer. This should perhaps go into
 * a vm helper function, it's already simplified quite a bit by the
 * addition of remove_mapping(). If success is returned, the caller may
 * attempt to reuse this page for another destination.
 */
static bool page_cache_pipe_buf_try_steal(struct pipe_inode_info *pipe,
		struct pipe_buffer *buf)
{
	struct folio *folio = page_folio(buf->page);
	struct address_space *mapping;

	folio_lock(folio);

	mapping = folio_mapping(folio);
	if (mapping) {
		WARN_ON(!folio_test_uptodate(folio));

		/*
		 * At least for ext2 with nobh option, we need to wait on
		 * writeback completing on this folio, since we'll remove it
		 * from the pagecache.  Otherwise truncate wont wait on the
		 * folio, allowing the disk blocks to be reused by someone else
		 * before we actually wrote our data to them. fs corruption
		 * ensues.
		 */
		folio_wait_writeback(folio);

		if (!filemap_release_folio(folio, GFP_KERNEL))
			goto out_unlock;

		/*
		 * If we succeeded in removing the mapping, set LRU flag
		 * and return good.
		 */
		if (remove_mapping(mapping, folio)) {
			buf->flags |= PIPE_BUF_FLAG_LRU;
			return true;
		}
	}

	/*
	 * Raced with truncate or failed to remove folio from current
	 * address space, unlock and return failure.
	 */
out_unlock:
	folio_unlock(folio);
	return false;
}

static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
					struct pipe_buffer *buf)
{
	put_page(buf->page);
	buf->flags &= ~PIPE_BUF_FLAG_LRU;
}

/*
 * Check whether the contents of buf is OK to access. Since the content
 * is a page cache page, IO may be in flight.
 */
static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
				       struct pipe_buffer *buf)
{
	struct folio *folio = page_folio(buf->page);
	int err;

	if (!folio_test_uptodate(folio)) {
		folio_lock(folio);

		/*
		 * Folio got truncated/unhashed. This will cause a 0-byte
		 * splice, if this is the first page.
		 */
		if (!folio->mapping) {
			err = -ENODATA;
			goto error;
		}

		/*
		 * Uh oh, read-error from disk.
		 */
		if (!folio_test_uptodate(folio)) {
			err = -EIO;
			goto error;
		}

		/* Folio is ok after all, we are done */
		folio_unlock(folio);
	}

	return 0;
error:
	folio_unlock(folio);
	return err;
}

const struct pipe_buf_operations page_cache_pipe_buf_ops = {
	.confirm	= page_cache_pipe_buf_confirm,
	.release	= page_cache_pipe_buf_release,
	.try_steal	= page_cache_pipe_buf_try_steal,
	.get		= generic_pipe_buf_get,
};

static bool user_page_pipe_buf_try_steal(struct pipe_inode_info *pipe,
		struct pipe_buffer *buf)
{
	if (!(buf->flags & PIPE_BUF_FLAG_GIFT))
		return false;

	buf->flags |= PIPE_BUF_FLAG_LRU;
	return generic_pipe_buf_try_steal(pipe, buf);
}

static const struct pipe_buf_operations user_page_pipe_buf_ops = {
	.release	= page_cache_pipe_buf_release,
	.try_steal	= user_page_pipe_buf_try_steal,
	.get		= generic_pipe_buf_get,
};

static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
{
	smp_mb();
	if (waitqueue_active(&pipe->rd_wait))
		wake_up_interruptible(&pipe->rd_wait);
	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}

/**
 * splice_to_pipe - fill passed data into a pipe
 * @pipe:	pipe to fill
 * @spd:	data to fill
 *
 * Description:
 *    @spd contains a map of pages and len/offset tuples, along with
 *    the struct pipe_buf_operations associated with these pages. This
 *    function will link that data to the pipe.
 *
 */
ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
		       struct splice_pipe_desc *spd)
{
	unsigned int spd_pages = spd->nr_pages;
	unsigned int tail = pipe->tail;
	unsigned int head = pipe->head;
	ssize_t ret = 0;
	int page_nr = 0;

	if (!spd_pages)
		return 0;

	if (unlikely(!pipe->readers)) {
		send_sig(SIGPIPE, current, 0);
		ret = -EPIPE;
		goto out;
	}

	while (!pipe_full(head, tail, pipe->max_usage)) {
		struct pipe_buffer *buf = pipe_buf(pipe, head);

		buf->page = spd->pages[page_nr];
		buf->offset = spd->partial[page_nr].offset;
		buf->len = spd->partial[page_nr].len;
		buf->private = spd->partial[page_nr].private;
		buf->ops = spd->ops;
		buf->flags = 0;

		head++;
		pipe->head = head;
		page_nr++;
		ret += buf->len;

		if (!--spd->nr_pages)
			break;
	}

	if (!ret)
		ret = -EAGAIN;

out:
	while (page_nr < spd_pages)
		spd->spd_release(spd, page_nr++);

	return ret;
}
EXPORT_SYMBOL_GPL(splice_to_pipe);

ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
{
	unsigned int head = pipe->head;
	unsigned int tail = pipe->tail;
	int ret;

	if (unlikely(!pipe->readers)) {
		send_sig(SIGPIPE, current, 0);
		ret = -EPIPE;
	} else if (pipe_full(head, tail, pipe->max_usage)) {
		ret = -EAGAIN;
	} else {
		*pipe_buf(pipe, head) = *buf;
		pipe->head = head + 1;
		return buf->len;
	}
	pipe_buf_release(pipe, buf);
	return ret;
}
EXPORT_SYMBOL(add_to_pipe);

/*
 * Check if we need to grow the arrays holding pages and partial page
 * descriptions.
 */
int splice_grow_spd(const struct pipe_inode_info *pipe, struct splice_pipe_desc *spd)
{
	unsigned int max_usage = READ_ONCE(pipe->max_usage);

	spd->nr_pages_max = max_usage;
	if (max_usage <= PIPE_DEF_BUFFERS)
		return 0;

	spd->pages = kmalloc_array(max_usage, sizeof(struct page *), GFP_KERNEL);
	spd->partial = kmalloc_array(max_usage, sizeof(struct partial_page),
				     GFP_KERNEL);

	if (spd->pages && spd->partial)
		return 0;

	kfree(spd->pages);
	kfree(spd->partial);
	return -ENOMEM;
}

void splice_shrink_spd(struct splice_pipe_desc *spd)
{
	if (spd->nr_pages_max <= PIPE_DEF_BUFFERS)
		return;

	kfree(spd->pages);
	kfree(spd->partial);
}

/**
 * copy_splice_read -  Copy data from a file and splice the copy into a pipe
 * @in: The file to read from
 * @ppos: Pointer to the file position to read from
 * @pipe: The pipe to splice into
 * @len: The amount to splice
 * @flags: The SPLICE_F_* flags
 *
 * This function allocates a bunch of pages sufficient to hold the requested
 * amount of data (but limited by the remaining pipe capacity), passes it to
 * the file's ->read_iter() to read into and then splices the used pages into
 * the pipe.
 *
 * Return: On success, the number of bytes read will be returned and *@ppos
 * will be updated if appropriate; 0 will be returned if there is no more data
 * to be read; -EAGAIN will be returned if the pipe had no space, and some
 * other negative error code will be returned on error.  A short read may occur
 * if the pipe has insufficient space, we reach the end of the data or we hit a
 * hole.
 */
ssize_t copy_splice_read(struct file *in, loff_t *ppos,
			 struct pipe_inode_info *pipe,
			 size_t len, unsigned int flags)
{
	struct iov_iter to;
	struct bio_vec *bv;
	struct kiocb kiocb;
	struct page **pages;
	ssize_t ret;
	size_t used, npages, chunk, remain, keep = 0;
	int i;

	/* Work out how much data we can actually add into the pipe */
	used = pipe_buf_usage(pipe);
	npages = max_t(ssize_t, pipe->max_usage - used, 0);
	len = min_t(size_t, len, npages * PAGE_SIZE);
	npages = DIV_ROUND_UP(len, PAGE_SIZE);

	bv = kzalloc(array_size(npages, sizeof(bv[0])) +
		     array_size(npages, sizeof(struct page *)), GFP_KERNEL);
	if (!bv)
		return -ENOMEM;

	pages = (struct page **)(bv + npages);
	npages = alloc_pages_bulk(GFP_USER, npages, pages);
	if (!npages) {
		kfree(bv);
		return -ENOMEM;
	}

	remain = len = min_t(size_t, len, npages * PAGE_SIZE);

	for (i = 0; i < npages; i++) {
		chunk = min_t(size_t, PAGE_SIZE, remain);
		bv[i].bv_page = pages[i];
		bv[i].bv_offset = 0;
		bv[i].bv_len = chunk;
		remain -= chunk;
	}

	/* Do the I/O */
	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
	init_sync_kiocb(&kiocb, in);
	kiocb.ki_pos = *ppos;
	ret = in->f_op->read_iter(&kiocb, &to);

	if (ret > 0) {
		keep = DIV_ROUND_UP(ret, PAGE_SIZE);
		*ppos = kiocb.ki_pos;
	}

	/*
	 * Callers of ->splice_read() expect -EAGAIN on "can't put anything in
	 * there", rather than -EFAULT.
	 */
	if (ret == -EFAULT)
		ret = -EAGAIN;

	/* Free any pages that didn't get touched at all. */
	if (keep < npages)
		release_pages(pages + keep, npages - keep);

	/* Push the remaining pages into the pipe. */
	remain = ret;
	for (i = 0; i < keep; i++) {
		struct pipe_buffer *buf = pipe_head_buf(pipe);

		chunk = min_t(size_t, remain, PAGE_SIZE);
		*buf = (struct pipe_buffer) {
			.ops	= &default_pipe_buf_ops,
			.page	= bv[i].bv_page,
			.offset	= 0,
			.len	= chunk,
		};
		pipe->head++;
		remain -= chunk;
	}

	kfree(bv);
	return ret;
}
EXPORT_SYMBOL(copy_splice_read);

const struct pipe_buf_operations default_pipe_buf_ops = {
	.release	= generic_pipe_buf_release,
	.try_steal	= generic_pipe_buf_try_steal,
	.get		= generic_pipe_buf_get,
};

/* Pipe buffer operations for a socket and similar. */
const struct pipe_buf_operations nosteal_pipe_buf_ops = {
	.release	= generic_pipe_buf_release,
	.get		= generic_pipe_buf_get,
};
EXPORT_SYMBOL(nosteal_pipe_buf_ops);

static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
{
	smp_mb();
	if (waitqueue_active(&pipe->wr_wait))
		wake_up_interruptible(&pipe->wr_wait);
	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}

/**
 * splice_from_pipe_feed - feed available data from a pipe to a file
 * @pipe:	pipe to splice from
 * @sd:		information to @actor
 * @actor:	handler that splices the data
 *
 * Description:
 *    This function loops over the pipe and calls @actor to do the
 *    actual moving of a single struct pipe_buffer to the desired
 *    destination.  It returns when there's no more buffers left in
 *    the pipe or if the requested number of bytes (@sd->total_len)
 *    have been copied.  It returns a positive number (one) if the
 *    pipe needs to be filled with more data, zero if the required
 *    number of bytes have been copied and -errno on error.
 *
 *    This, together with splice_from_pipe_{begin,end,next}, may be
 *    used to implement the functionality of __splice_from_pipe() when
 *    locking is required around copying the pipe buffers to the
 *    destination.
 */
static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd,
			  splice_actor *actor)
{
	unsigned int head = pipe->head;
	unsigned int tail = pipe->tail;
	int ret;

	while (!pipe_empty(head, tail)) {
		struct pipe_buffer *buf = pipe_buf(pipe, tail);

		sd->len = buf->len;
		if (sd->len > sd->total_len)
			sd->len = sd->total_len;

		ret = pipe_buf_confirm(pipe, buf);
		if (unlikely(ret)) {
			if (ret == -ENODATA)
				ret = 0;
			return ret;
		}

		ret = actor(pipe, buf, sd);
		if (ret <= 0)
			return ret;

		buf->offset += ret;
		buf->len -= ret;

		sd->num_spliced += ret;
		sd->len -= ret;
		sd->pos += ret;
		sd->total_len -= ret;

		if (!buf->len) {
			pipe_buf_release(pipe, buf);
			tail++;
			pipe->tail = tail;
			if (pipe->files)
				sd->need_wakeup = true;
		}

		if (!sd->total_len)
			return 0;
	}

	return 1;
}

/* We know we have a pipe buffer, but maybe it's empty? */
static inline bool eat_empty_buffer(struct pipe_inode_info *pipe)
{
	unsigned int tail = pipe->tail;
	struct pipe_buffer *buf = pipe_buf(pipe, tail);

	if (unlikely(!buf->len)) {
		pipe_buf_release(pipe, buf);
		pipe->tail = tail+1;
		return true;
	}

	return false;
}

/**
 * splice_from_pipe_next - wait for some data to splice from
 * @pipe:	pipe to splice from
 * @sd:		information about the splice operation
 *
 * Description:
 *    This function will wait for some data and return a positive
 *    value (one) if pipe buffers are available.  It will return zero
 *    or -errno if no more data needs to be spliced.
 */
static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd)
{
	/*
	 * Check for signal early to make process killable when there are
	 * always buffers available
	 */
	if (signal_pending(current))
		return -ERESTARTSYS;

repeat:
	while (pipe_is_empty(pipe)) {
		if (!pipe->writers)
			return 0;

		if (sd->num_spliced)
			return 0;

		if (sd->flags & SPLICE_F_NONBLOCK)
			return -EAGAIN;

		if (signal_pending(current))
			return -ERESTARTSYS;

		if (sd->need_wakeup) {
			wakeup_pipe_writers(pipe);
			sd->need_wakeup = false;
		}

		pipe_wait_readable(pipe);
	}

	if (eat_empty_buffer(pipe))
		goto repeat;

	return 1;
}

/**
 * splice_from_pipe_begin - start splicing from pipe
 * @sd:		information about the splice operation
 *
 * Description:
 *    This function should be called before a loop containing
 *    splice_from_pipe_next() and splice_from_pipe_feed() to
 *    initialize the necessary fields of @sd.
 */
static void splice_from_pipe_begin(struct splice_desc *sd)
{
	sd->num_spliced = 0;
	sd->need_wakeup = false;
}

/**
 * splice_from_pipe_end - finish splicing from pipe
 * @pipe:	pipe to splice from
 * @sd:		information about the splice operation
 *
 * Description:
 *    This function will wake up pipe writers if necessary.  It should
 *    be called after a loop containing splice_from_pipe_next() and
 *    splice_from_pipe_feed().
 */
static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_desc *sd)
{
	if (sd->need_wakeup)
		wakeup_pipe_writers(pipe);
}

/**
 * __splice_from_pipe - splice data from a pipe to given actor
 * @pipe:	pipe to splice from
 * @sd:		information to @actor
 * @actor:	handler that splices the data
 *
 * Description:
 *    This function does little more than loop over the pipe and call
 *    @actor to do the actual moving of a single struct pipe_buffer to
 *    the desired destination. See pipe_to_file, pipe_to_sendmsg, or
 *    pipe_to_user.
 *
 */
ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
			   splice_actor *actor)
{
	int ret;

	splice_from_pipe_begin(sd);
	do {
		cond_resched();
		ret = splice_from_pipe_next(pipe, sd);
		if (ret > 0)
			ret = splice_from_pipe_feed(pipe, sd, actor);
	} while (ret > 0);
	splice_from_pipe_end(pipe, sd);

	return sd->num_spliced ? sd->num_spliced : ret;
}
EXPORT_SYMBOL(__splice_from_pipe);

/**
 * splice_from_pipe - splice data from a pipe to a file
 * @pipe:	pipe to splice from
 * @out:	file to splice to
 * @ppos:	position in @out
 * @len:	how many bytes to splice
 * @flags:	splice modifier flags
 * @actor:	handler that splices the data
 *
 * Description:
 *    See __splice_from_pipe. This function locks the pipe inode,
 *    otherwise it's identical to __splice_from_pipe().
 *
 */
ssize_t splice_from_pipe(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags,
			 splice_actor *actor)
{
	ssize_t ret;
	struct splice_desc sd = {
		.total_len = len,
		.flags = flags,
		.pos = *ppos,
		.u.file = out,
	};

	pipe_lock(pipe);
	ret = __splice_from_pipe(pipe, &sd, actor);
	pipe_unlock(pipe);

	return ret;
}

/**
 * iter_file_splice_write - splice data from a pipe to a file
 * @pipe:	pipe info
 * @out:	file to write to
 * @ppos:	position in @out
 * @len:	number of bytes to splice
 * @flags:	splice modifier flags
 *
 * Description:
 *    Will either move or copy pages (determined by @flags options) from
 *    the given pipe inode to the given file.
 *    This one is ->write_iter-based.
 *
 */
ssize_t
iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
			  loff_t *ppos, size_t len, unsigned int flags)
{
	struct splice_desc sd = {
		.total_len = len,
		.flags = flags,
		.pos = *ppos,
		.u.file = out,
	};
	int nbufs = pipe->max_usage;
	struct bio_vec *array;
	ssize_t ret;

	if (!out->f_op->write_iter)
		return -EINVAL;

	array = kcalloc(nbufs, sizeof(struct bio_vec), GFP_KERNEL);
	if (unlikely(!array))
		return -ENOMEM;

	pipe_lock(pipe);

	splice_from_pipe_begin(&sd);
	while (sd.total_len) {
		struct kiocb kiocb;
		struct iov_iter from;
		unsigned int head, tail;
		size_t left;
		int n;

		ret = splice_from_pipe_next(pipe, &sd);
		if (ret <= 0)
			break;

		if (unlikely(nbufs < pipe->max_usage)) {
			kfree(array);
			nbufs = pipe->max_usage;
			array = kcalloc(nbufs, sizeof(struct bio_vec),
					GFP_KERNEL);
			if (!array) {
				ret = -ENOMEM;
				break;
			}
		}

		head = pipe->head;
		tail = pipe->tail;

		/* build the vector */
		left = sd.total_len;
		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			size_t this_len = buf->len;

			/* zero-length bvecs are not supported, skip them */
			if (!this_len)
				continue;
			this_len = min(this_len, left);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				goto done;
			}

			bvec_set_page(&array[n], buf->page, this_len,
				      buf->offset);
			left -= this_len;
			n++;
		}

		iov_iter_bvec(&from, ITER_SOURCE, array, n, sd.total_len - left);
		init_sync_kiocb(&kiocb, out);
		kiocb.ki_pos = sd.pos;
		ret = out->f_op->write_iter(&kiocb, &from);
		sd.pos = kiocb.ki_pos;
		if (ret <= 0)
			break;
		WARN_ONCE(ret > sd.total_len - left,
			  "Splice Exceeded! ret=%zd tot=%zu left=%zu\n",
			  ret, sd.total_len, left);

		sd.num_spliced += ret;
		sd.total_len -= ret;
		*ppos = sd.pos;

		/* dismiss the fully eaten buffers, adjust the partial one */
		tail = pipe->tail;
		while (ret) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			if (ret >= buf->len) {
				ret -= buf->len;
				buf->len = 0;
				pipe_buf_release(pipe, buf);
				tail++;
				pipe->tail = tail;
				if (pipe->files)
					sd.need_wakeup = true;
			} else {
				buf->offset += ret;
				buf->len -= ret;
				ret = 0;
			}
		}
	}
done:
	kfree(array);
	splice_from_pipe_end(pipe, &sd);

	pipe_unlock(pipe);

	if (sd.num_spliced)
		ret = sd.num_spliced;

	return ret;
}

EXPORT_SYMBOL(iter_file_splice_write);

#ifdef CONFIG_NET
/**
 * splice_to_socket - splice data from a pipe to a socket
 * @pipe:	pipe to splice from
 * @out:	socket to write to
 * @ppos:	position in @out
 * @len:	number of bytes to splice
 * @flags:	splice modifier flags
 *
 * Description:
 *    Will send @len bytes from the pipe to a network socket. No data copying
 *    is involved.
 *
 */
ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags)
{
	struct socket *sock = sock_from_file(out);
	struct bio_vec bvec[16];
	struct msghdr msg = {};
	ssize_t ret = 0;
	size_t spliced = 0;
	bool need_wakeup = false;

	pipe_lock(pipe);

	while (len > 0) {
		unsigned int head, tail, bc = 0;
		size_t remain = len;

		/*
		 * Check for signal early to make process killable when there
		 * are always buffers available
		 */
		ret = -ERESTARTSYS;
		if (signal_pending(current))
			break;

		while (pipe_is_empty(pipe)) {
			ret = 0;
			if (!pipe->writers)
				goto out;

			if (spliced)
				goto out;

			ret = -EAGAIN;
			if (flags & SPLICE_F_NONBLOCK)
				goto out;

			ret = -ERESTARTSYS;
			if (signal_pending(current))
				goto out;

			if (need_wakeup) {
				wakeup_pipe_writers(pipe);
				need_wakeup = false;
			}

			pipe_wait_readable(pipe);
		}

		head = pipe->head;
		tail = pipe->tail;

		while (!pipe_empty(head, tail)) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			size_t seg;

			if (!buf->len) {
				tail++;
				continue;
			}

			seg = min_t(size_t, remain, buf->len);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				break;
			}

			bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
			remain -= seg;
			if (remain == 0 || bc >= ARRAY_SIZE(bvec))
				break;
			tail++;
		}

		if (!bc)
			break;

		msg.msg_flags = MSG_SPLICE_PAGES;
		if (flags & SPLICE_F_MORE)
			msg.msg_flags |= MSG_MORE;
		if (remain && pipe_occupancy(pipe->head, tail) > 0)
			msg.msg_flags |= MSG_MORE;
		if (out->f_flags & O_NONBLOCK)
			msg.msg_flags |= MSG_DONTWAIT;

		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc,
			      len - remain);
		ret = sock_sendmsg(sock, &msg);
		if (ret <= 0)
			break;

		spliced += ret;
		len -= ret;
		tail = pipe->tail;
		while (ret > 0) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			size_t seg = min_t(size_t, ret, buf->len);

			buf->offset += seg;
			buf->len -= seg;
			ret -= seg;

			if (!buf->len) {
				pipe_buf_release(pipe, buf);
				tail++;
			}
		}

		if (tail != pipe->tail) {
			pipe->tail = tail;
			if (pipe->files)
				need_wakeup = true;
		}
	}

out:
	pipe_unlock(pipe);
	if (need_wakeup)
		wakeup_pipe_writers(pipe);
	return spliced ?: ret;
}
#endif
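The splice-to-socket path above can be driven from userspace. The following is a hedged sketch (assuming Linux and glibc's splice(2) wrapper; `pipe_to_socket` is an illustrative helper, not part of this file) that queues bytes in a pipe and splices them to one half of an AF_UNIX socketpair, so the data never passes through a user-space buffer on the way to the socket:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write `msg` into a pipe, then splice it from the pipe's read end to
 * one half of a socketpair; the peer half reads the bytes back into
 * `out`.  Returns the number of bytes received, or -1 on error. */
static ssize_t pipe_to_socket(const char *msg, char *out, size_t outlen)
{
	int p[2], sv[2];
	ssize_t n, got = -1;

	if (pipe(p) < 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	n = write(p[1], msg, strlen(msg));
	/* splice(2) with a pipe source hands the pipe's pages to the
	 * socket layer; no copy through user memory is involved. */
	if (n > 0 && splice(p[0], NULL, sv[0], NULL, n, 0) == n)
		got = read(sv[1], out, outlen);

	close(p[0]); close(p[1]);
	close(sv[0]); close(sv[1]);
	return got;
}
```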

static int warn_unsupported(struct file *file, const char *op)
{
	pr_debug_ratelimited(
		"splice %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
		op, file, current->pid, current->comm);
	return -EINVAL;
}

/*
 * Attempt to initiate a splice from pipe to file.
 */
static ssize_t do_splice_from(struct pipe_inode_info *pipe, struct file *out,
			      loff_t *ppos, size_t len, unsigned int flags)
{
	if (unlikely(!out->f_op->splice_write))
		return warn_unsupported(out, "write");
	return out->f_op->splice_write(pipe, out, ppos, len, flags);
}

/*
 * Indicate to the caller that there was a premature EOF when reading from the
 * source and the caller didn't indicate they would be sending more data after
 * this.
 */
static void do_splice_eof(struct splice_desc *sd)
{
	if (sd->splice_eof)
		sd->splice_eof(sd);
}

/*
 * Callers already called rw_verify_area() on the entire range.
 * No need to call it for sub ranges.
 */
static ssize_t do_splice_read(struct file *in, loff_t *ppos,
			      struct pipe_inode_info *pipe, size_t len,
			      unsigned int flags)
{
	unsigned int p_space;

	if (unlikely(!(in->f_mode & FMODE_READ)))
		return -EBADF;
	if (!len)
		return 0;

	/* Don't try to read more than the pipe has space for. */
	p_space = pipe->max_usage - pipe_buf_usage(pipe);
	len = min_t(size_t, len, p_space << PAGE_SHIFT);

	if (unlikely(len > MAX_RW_COUNT))
		len = MAX_RW_COUNT;

	if (unlikely(!in->f_op->splice_read))
		return warn_unsupported(in, "read");
	/*
	 * O_DIRECT and DAX don't deal with the pagecache, so we allocate a
	 * buffer, copy into it and splice that into the pipe.
	 */
	if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
		return copy_splice_read(in, ppos, pipe, len, flags);
	return in->f_op->splice_read(in, ppos, pipe, len, flags);
}

/**
 * vfs_splice_read - Read data from a file and splice it into a pipe
 * @in:		File to splice from
 * @ppos:	Input file offset
 * @pipe:	Pipe to splice to
 * @len:	Number of bytes to splice
 * @flags:	Splice modifier flags (SPLICE_F_*)
 *
 * Splice the requested amount of data from the input file to the pipe.  This
 * is synchronous as the caller must hold the pipe lock across the entire
 * operation.
 *
 * If successful, it returns the amount of data spliced, 0 if it hit the EOF or
 * a hole and a negative error code otherwise.
 */
ssize_t vfs_splice_read(struct file *in, loff_t *ppos,
			struct pipe_inode_info *pipe, size_t len,
			unsigned int flags)
{
	ssize_t ret;

	ret = rw_verify_area(READ, in, ppos, len);
	if (unlikely(ret < 0))
		return ret;

	return do_splice_read(in, ppos, pipe, len, flags);
}
EXPORT_SYMBOL_GPL(vfs_splice_read);

/**
 * splice_direct_to_actor - splices data directly between two non-pipes
 * @in:		file to splice from
 * @sd:		actor information on where to splice to
 * @actor:	handles the data splicing
 *
 * Description:
 *    This is a special case helper to splice directly between two
 *    points, without requiring an explicit pipe. Internally an allocated
 *    pipe is cached in the process, and reused during the lifetime of
 *    that process.
 *
 */
ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
			       splice_direct_actor *actor)
{
	struct pipe_inode_info *pipe;
	ssize_t ret, bytes;
	size_t len;
	int i, flags, more;

	/*
	 * We require the input to be seekable, as we don't want to randomly
	 * drop data for eg socket -> socket splicing. Use the piped splicing
	 * for that!
	 */
	if (unlikely(!(in->f_mode & FMODE_LSEEK)))
		return -EINVAL;

	/*
	 * neither in nor out is a pipe, setup an internal pipe attached to
	 * 'out' and transfer the wanted data from 'in' to 'out' through that
	 */
	pipe = current->splice_pipe;
	if (unlikely(!pipe)) {
		pipe = alloc_pipe_info();
		if (!pipe)
			return -ENOMEM;

		/*
		 * We don't have an immediate reader, but we'll read the stuff
		 * out of the pipe right after the splice_to_pipe(). So set
		 * PIPE_READERS appropriately.
		 */
		pipe->readers = 1;

		current->splice_pipe = pipe;
	}

	/*
	 * Do the splice.
	 */
	bytes = 0;
	len = sd->total_len;

	/* Don't block on output, we have to drain the direct pipe. */
	flags = sd->flags;
	sd->flags &= ~SPLICE_F_NONBLOCK;

	/*
	 * We signal MORE until we've read sufficient data to fulfill the
	 * request and we keep signalling it if the caller set it.
	 */
	more = sd->flags & SPLICE_F_MORE;
	sd->flags |= SPLICE_F_MORE;

	WARN_ON_ONCE(!pipe_is_empty(pipe));

	while (len) {
		size_t read_len;
		loff_t pos = sd->pos, prev_pos = pos;

		ret = do_splice_read(in, &pos, pipe, len, flags);
		if (unlikely(ret <= 0))
			goto read_failure;

		read_len = ret;
		sd->total_len = read_len;

		/*
		 * If we now have sufficient data to fulfill the request then
		 * we clear SPLICE_F_MORE if it was not set initially.
		 */
		if (read_len >= len && !more)
			sd->flags &= ~SPLICE_F_MORE;

		/*
		 * NOTE: nonblocking mode only applies to the input. We
		 * must not do the output in nonblocking mode as then we
		 * could get stuck data in the internal pipe:
		 */
		ret = actor(pipe, sd);
		if (unlikely(ret <= 0)) {
			sd->pos = prev_pos;
			goto out_release;
		}

		bytes += ret;
		len -= ret;
		sd->pos = pos;

		if (ret < read_len) {
			sd->pos = prev_pos + ret;
			goto out_release;
		}
	}

done:
	pipe->tail = pipe->head = 0;
	file_accessed(in);
	return bytes;

read_failure:
	/*
	 * If the user did *not* set SPLICE_F_MORE *and* we didn't hit that
	 * "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
	 * "->splice_in()" that returned EOF (ie zero) *and* we have sent at
	 * least 1 byte *then* we will also do the ->splice_eof() call.
	 */
	if (ret == 0 && !more && len > 0 && bytes)
		do_splice_eof(sd);
out_release:
	/*
	 * If we did an incomplete transfer we must release
	 * the pipe buffers in question:
	 */
	for (i = 0; i < pipe->ring_size; i++) {
		struct pipe_buffer *buf = &pipe->bufs[i];

		if (buf->ops)
			pipe_buf_release(pipe, buf);
	}

	if (!bytes)
		bytes = ret;

	goto done;
}
EXPORT_SYMBOL(splice_direct_to_actor);

static int direct_splice_actor(struct pipe_inode_info *pipe,
			       struct splice_desc *sd)
{
	struct file *file = sd->u.file;
	long ret;

	file_start_write(file);
	ret = do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags);
	file_end_write(file);
	return ret;
}

static int splice_file_range_actor(struct pipe_inode_info *pipe,
					struct splice_desc *sd)
{
	struct file *file = sd->u.file;

	return do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags);
}

static void direct_file_splice_eof(struct splice_desc *sd)
{
	struct file *file = sd->u.file;

	if (file->f_op->splice_eof)
		file->f_op->splice_eof(file);
}

static ssize_t do_splice_direct_actor(struct file *in, loff_t *ppos,
				      struct file *out, loff_t *opos,
				      size_t len, unsigned int flags,
				      splice_direct_actor *actor)
{
	struct splice_desc sd = {
		.len		= len,
		.total_len	= len,
		.flags		= flags,
		.pos		= *ppos,
		.u.file		= out,
		.splice_eof	= direct_file_splice_eof,
		.opos		= opos,
	};
	ssize_t ret;

	if (unlikely(!(out->f_mode & FMODE_WRITE)))
		return -EBADF;

	if (unlikely(out->f_flags & O_APPEND))
		return -EINVAL;

	ret = splice_direct_to_actor(in, &sd, actor);
	if (ret > 0)
		*ppos = sd.pos;

	return ret;
}
/**
 * do_splice_direct - splices data directly between two files
 * @in:		file to splice from
 * @ppos:	input file offset
 * @out:	file to splice to
 * @opos:	output file offset
 * @len:	number of bytes to splice
 * @flags:	splice modifier flags
 *
 * Description:
 *    For use by do_sendfile(). splice can easily emulate sendfile, but
 *    doing it in the application would incur an extra system call
 *    (splice in + splice out, as compared to just sendfile()). So this helper
 *    can splice directly through a process-private pipe.
 *
 * Callers already called rw_verify_area() on the entire range.
 */
ssize_t do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
			 loff_t *opos, size_t len, unsigned int flags)
{
	return do_splice_direct_actor(in, ppos, out, opos, len, flags,
				      direct_splice_actor);
}
EXPORT_SYMBOL(do_splice_direct);
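The sendfile() emulation described in the comment above is what userspace reaches via sendfile(2). A minimal sketch (assuming Linux, where sendfile(2) has accepted a regular file as the destination since 2.6.33; `copy_with_sendfile` is an illustrative helper):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Copy up to `len` bytes from offset 0 of `in_fd` to the current file
 * position of `out_fd` in a single syscall.  Returns the number of
 * bytes copied, or -1 on error. */
static ssize_t copy_with_sendfile(int in_fd, int out_fd, size_t len)
{
	off_t off = 0;	/* read offset in the source; updated by the kernel */

	return sendfile(out_fd, in_fd, &off, len);
}
```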

/**
 * splice_file_range - splices data between two files for copy_file_range()
 * @in:		file to splice from
 * @ppos:	input file offset
 * @out:	file to splice to
 * @opos:	output file offset
 * @len:	number of bytes to splice
 *
 * Description:
 *    For use by ->copy_file_range() methods.
 *    Like do_splice_direct(), but vfs_copy_file_range() already holds
 *    start_file_write() on @out file.
 *
 * Callers already called rw_verify_area() on the entire range.
 */
ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
			  loff_t *opos, size_t len)
{
	lockdep_assert(file_write_started(out));

	return do_splice_direct_actor(in, ppos, out, opos,
				      min_t(size_t, len, MAX_RW_COUNT),
				      0, splice_file_range_actor);
}
EXPORT_SYMBOL(splice_file_range);

static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
{
	for (;;) {
		if (unlikely(!pipe->readers)) {
			send_sig(SIGPIPE, current, 0);
			return -EPIPE;
		}
		if (!pipe_is_full(pipe))
			return 0;
		if (flags & SPLICE_F_NONBLOCK)
			return -EAGAIN;
		if (signal_pending(current))
			return -ERESTARTSYS;
		pipe_wait_writable(pipe);
	}
}

static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
			       struct pipe_inode_info *opipe,
			       size_t len, unsigned int flags);

ssize_t splice_file_to_pipe(struct file *in,
			    struct pipe_inode_info *opipe,
			    loff_t *offset,
			    size_t len, unsigned int flags)
{
	ssize_t ret;

	pipe_lock(opipe);
	ret = wait_for_space(opipe, flags);
	if (!ret)
		ret = do_splice_read(in, offset, opipe, len, flags);
	pipe_unlock(opipe);
	if (ret > 0)
		wakeup_pipe_readers(opipe);
	return ret;
}

/*
 * Determine where to splice to/from.
 */
ssize_t do_splice(struct file *in, loff_t *off_in, struct file *out,
		  loff_t *off_out, size_t len, unsigned int flags)
{
	struct pipe_inode_info *ipipe;
	struct pipe_inode_info *opipe;
	loff_t offset;
	ssize_t ret;

	if (unlikely(!(in->f_mode & FMODE_READ) ||
		     !(out->f_mode & FMODE_WRITE)))
		return -EBADF;

	ipipe = get_pipe_info(in, true);
	opipe = get_pipe_info(out, true);

	if (ipipe && opipe) {
		if (off_in || off_out)
			return -ESPIPE;

		/* Splicing to self would be fun, but... */
		if (ipipe == opipe)
			return -EINVAL;

		if ((in->f_flags | out->f_flags) & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		ret = splice_pipe_to_pipe(ipipe, opipe, len, flags);
	} else if (ipipe) {
		if (off_in)
			return -ESPIPE;
		if (off_out) {
			if (!(out->f_mode & FMODE_PWRITE))
				return -EINVAL;
			offset = *off_out;
		} else {
			offset = out->f_pos;
		}

		if (unlikely(out->f_flags & O_APPEND))
			return -EINVAL;

		ret = rw_verify_area(WRITE, out, &offset, len);
		if (unlikely(ret < 0))
			return ret;

		if (in->f_flags & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		file_start_write(out);
		ret = do_splice_from(ipipe, out, &offset, len, flags);
		file_end_write(out);

		if (!off_out)
			out->f_pos = offset;
		else
			*off_out = offset;
	} else if (opipe) {
		if (off_out)
			return -ESPIPE;
		if (off_in) {
			if (!(in->f_mode & FMODE_PREAD))
				return -EINVAL;
			offset = *off_in;
		} else {
			offset = in->f_pos;
		}

		ret = rw_verify_area(READ, in, &offset, len);
		if (unlikely(ret < 0))
			return ret;

		if (out->f_flags & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		ret = splice_file_to_pipe(in, opipe, &offset, len, flags);

		if (!off_in)
			in->f_pos = offset;
		else
			*off_in = offset;
	} else {
		ret = -EINVAL;
	}

	if (ret > 0) {
		/*
		 * Generate modify out before access in:
		 * do_splice_from() may've already sent modify out,
		 * and this ensures the events get merged.
		 */
		fsnotify_modify(out);
		fsnotify_access(in);
	}

	return ret;
}

static ssize_t __do_splice(struct file *in, loff_t __user *off_in,
			   struct file *out, loff_t __user *off_out,
			   size_t len, unsigned int flags)
{
	struct pipe_inode_info *ipipe;
	struct pipe_inode_info *opipe;
	loff_t offset, *__off_in = NULL, *__off_out = NULL;
	ssize_t ret;

	ipipe = get_pipe_info(in, true);
	opipe = get_pipe_info(out, true);

	if (ipipe) {
		if (off_in)
			return -ESPIPE;
		pipe_clear_nowait(in);
	}
	if (opipe) {
		if (off_out)
			return -ESPIPE;
		pipe_clear_nowait(out);
	}

	if (off_out) {
		if (copy_from_user(&offset, off_out, sizeof(loff_t)))
			return -EFAULT;
		__off_out = &offset;
	}
	if (off_in) {
		if (copy_from_user(&offset, off_in, sizeof(loff_t)))
			return -EFAULT;
		__off_in = &offset;
	}

	ret = do_splice(in, __off_in, out, __off_out, len, flags);
	if (ret < 0)
		return ret;

	if (__off_out && copy_to_user(off_out, __off_out, sizeof(loff_t)))
		return -EFAULT;
	if (__off_in && copy_to_user(off_in, __off_in, sizeof(loff_t)))
		return -EFAULT;

	return ret;
}

static ssize_t iter_to_pipe(struct iov_iter *from,
			    struct pipe_inode_info *pipe,
			    unsigned int flags)
{
	struct pipe_buffer buf = {
		.ops = &user_page_pipe_buf_ops,
		.flags = flags
	};
	size_t total = 0;
	ssize_t ret = 0;

	while (iov_iter_count(from)) {
		struct page *pages[16];
		ssize_t left;
		size_t start;
		int i, n;

		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
		if (left <= 0) {
			ret = left;
			break;
		}

		n = DIV_ROUND_UP(left + start, PAGE_SIZE);
		for (i = 0; i < n; i++) {
			int size = umin(left, PAGE_SIZE - start);

			buf.page = pages[i];
			buf.offset = start;
			buf.len = size;
			ret = add_to_pipe(pipe, &buf);
			if (unlikely(ret < 0)) {
				iov_iter_revert(from, left);
				// this one got dropped by add_to_pipe()
				while (++i < n)
					put_page(pages[i]);
				goto out;
			}
			total += ret;
			left -= size;
			start = 0;
		}
	}
out:
	return total ? total : ret;
}

static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
			struct splice_desc *sd)
{
	int n = copy_page_to_iter(buf->page, buf->offset, sd->len, sd->u.data);
	return n == sd->len ? n : -EFAULT;
}

/*
 * For lack of a better implementation, implement vmsplice() to userspace
 * as a simple copy of the pipe's pages to the user iov.
 */
static ssize_t vmsplice_to_user(struct file *file, struct iov_iter *iter,
				unsigned int flags)
{
	struct pipe_inode_info *pipe = get_pipe_info(file, true);
	struct splice_desc sd = {
		.total_len = iov_iter_count(iter),
		.flags = flags,
		.u.data = iter
	};
	ssize_t ret = 0;

	if (!pipe)
		return -EBADF;

	pipe_clear_nowait(file);

	if (sd.total_len) {
		pipe_lock(pipe);
		ret = __splice_from_pipe(pipe, &sd, pipe_to_user);
		pipe_unlock(pipe);
	}

	if (ret > 0)
		fsnotify_access(file);

	return ret;
}

/*
 * vmsplice splices a user address range into a pipe. It can be thought of
 * as splice-from-memory, where the regular splice is splice-from-file (or
 * to file). In both cases the output is a pipe, naturally.
 */
static ssize_t vmsplice_to_pipe(struct file *file, struct iov_iter *iter,
				unsigned int flags)
{
	struct pipe_inode_info *pipe;
	ssize_t ret = 0;
	unsigned buf_flag = 0;

	if (flags & SPLICE_F_GIFT)
		buf_flag = PIPE_BUF_FLAG_GIFT;

	pipe = get_pipe_info(file, true);
	if (!pipe)
		return -EBADF;

	pipe_clear_nowait(file);

	pipe_lock(pipe);
	ret = wait_for_space(pipe, flags);
	if (!ret)
		ret = iter_to_pipe(iter, pipe, buf_flag);
	pipe_unlock(pipe);
	if (ret > 0) {
		wakeup_pipe_readers(pipe);
		fsnotify_modify(file);
	}
	return ret;
}

/*
 * Note that vmsplice only really supports true splicing _from_ user memory
 * to a pipe, not the other way around. Splicing from user memory is a simple
 * operation that can be supported without any funky alignment restrictions
 * or nasty vm tricks. We simply map in the user memory and fill them into
 * a pipe. The reverse isn't quite as easy, though. There are two possible
 * solutions for that:
 *
 *	- memcpy() the data internally, at which point we might as well just
 *	  do a regular read() on the buffer anyway.
 *	- Lots of nasty vm tricks, that are neither fast nor flexible (it
 *	  has restriction limitations on both ends of the pipe).
 *
 * Currently we punt and implement it as a normal copy, see pipe_to_user().
 *
 */
SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, uiov,
		unsigned long, nr_segs, unsigned int, flags)
{
	struct iovec iovstack[UIO_FASTIOV];
	struct iovec *iov = iovstack;
	struct iov_iter iter;
	ssize_t error;
	int type;

	if (unlikely(flags & ~SPLICE_F_ALL))
		return -EINVAL;

	CLASS(fd, f)(fd);
	if (fd_empty(f))
		return -EBADF;
	if (fd_file(f)->f_mode & FMODE_WRITE)
		type = ITER_SOURCE;
	else if (fd_file(f)->f_mode & FMODE_READ)
		type = ITER_DEST;
	else
		return -EBADF;

	error = import_iovec(type, uiov, nr_segs,
			     ARRAY_SIZE(iovstack), &iov, &iter);
	if (error < 0)
		return error;

	if (!iov_iter_count(&iter))
		error = 0;
	else if (type == ITER_SOURCE)
		error = vmsplice_to_pipe(fd_file(f), &iter, flags);
	else
		error = vmsplice_to_user(fd_file(f), &iter, flags);

	kfree(iov);
	return error;
}
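As a hedged userspace illustration of the semantics described in the block comment above (assuming Linux and glibc's vmsplice(2) wrapper; `vmsplice_roundtrip` is an illustrative helper): the input side references the user pages directly, while reading back out of the pipe is the ordinary copy performed by pipe_to_user().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Map a user buffer into a pipe with vmsplice(), then read it back.
 * Returns the number of bytes read back, or -1 on error. */
static ssize_t vmsplice_roundtrip(const char *msg, char *out, size_t outlen)
{
	int p[2];
	struct iovec iov = {
		.iov_base = (void *)msg,
		.iov_len  = strlen(msg),
	};
	ssize_t got = -1;

	if (pipe(p) < 0)
		return -1;

	/* Splice-from-memory: the pipe buffers reference these pages. */
	if (vmsplice(p[1], &iov, 1, 0) == (ssize_t)iov.iov_len)
		got = read(p[0], out, outlen);

	close(p[0]);
	close(p[1]);
	return got;
}
```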

SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
		int, fd_out, loff_t __user *, off_out,
		size_t, len, unsigned int, flags)
{
	if (unlikely(!len))
		return 0;

	if (unlikely(flags & ~SPLICE_F_ALL))
		return -EINVAL;

	CLASS(fd, in)(fd_in);
	if (fd_empty(in))
		return -EBADF;

	CLASS(fd, out)(fd_out);
	if (fd_empty(out))
		return -EBADF;

	return __do_splice(fd_file(in), off_in, fd_file(out), off_out,
					    len, flags);
}
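The file-to-pipe case handled by the syscall above can be sketched from userspace as follows (assuming Linux and glibc's splice(2) wrapper; `file_through_pipe` is an illustrative helper). Passing an explicit offset pointer exercises the `off_in` path, which requires FMODE_PREAD and leaves the file's position untouched:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Splice up to `len` bytes from offset 0 of regular file `fd` into a
 * freshly created pipe, then read them back out into `buf`.  Returns
 * the number of bytes read back, or -1 on error. */
static ssize_t file_through_pipe(int fd, char *buf, size_t len)
{
	int p[2];
	loff_t off = 0;		/* explicit offset: fd's f_pos is untouched */
	ssize_t n, got = -1;

	if (pipe(p) < 0)
		return -1;

	n = splice(fd, &off, p[1], NULL, len, 0);
	if (n >= 0)
		got = read(p[0], buf, n);

	close(p[0]);
	close(p[1]);
	return got;
}
```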

/*
 * Make sure there's data to read. Wait for input if we can, otherwise
 * return an appropriate error.
 */
static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
{
	int ret;

	/*
	 * Check the pipe occupancy without the inode lock first. This function
	 * is speculative anyways, so missing one is ok.
	 */
	if (!pipe_is_empty(pipe))
		return 0;

	ret = 0;
	pipe_lock(pipe);

	while (pipe_is_empty(pipe)) {
		if (signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}
		if (!pipe->writers)
			break;
		if (flags & SPLICE_F_NONBLOCK) {
			ret = -EAGAIN;
			break;
		}
		pipe_wait_readable(pipe);
	}

	pipe_unlock(pipe);
	return ret;
}

/*
 * Make sure there's writeable room. Wait for room if we can, otherwise
 * return an appropriate error.
 */
static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
{
	int ret;

	/*
	 * Check pipe occupancy without the inode lock first. This function
	 * is speculative anyways, so missing one is ok.
	 */
	if (!pipe_is_full(pipe))
		return 0;

	ret = 0;
	pipe_lock(pipe);

	while (pipe_is_full(pipe)) {
		if (!pipe->readers) {
			send_sig(SIGPIPE, current, 0);
			ret = -EPIPE;
			break;
		}
		if (flags & SPLICE_F_NONBLOCK) {
			ret = -EAGAIN;
			break;
		}
		if (signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}
		pipe_wait_writable(pipe);
	}

	pipe_unlock(pipe);
	return ret;
}

/*
 * Splice contents of ipipe to opipe.
 */
static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
			       struct pipe_inode_info *opipe,
			       size_t len, unsigned int flags)
{
	struct pipe_buffer *ibuf, *obuf;
	unsigned int i_head, o_head;
	unsigned int i_tail, o_tail;
	int ret = 0;
	bool input_wakeup = false;


retry:
	ret = ipipe_prep(ipipe, flags);
	if (ret)
		return ret;

	ret = opipe_prep(opipe, flags);
	if (ret)
		return ret;

	/*
	 * Potential ABBA deadlock, work around it by ordering lock
	 * grabbing by pipe info address. Otherwise two different processes
	 * could deadlock (one doing tee from A -> B, the other from B -> A).
	 */
	pipe_double_lock(ipipe, opipe);

	i_tail = ipipe->tail;
	o_head = opipe->head;

	do {
		size_t o_len;

		if (!opipe->readers) {
			send_sig(SIGPIPE, current, 0);
			if (!ret)
				ret = -EPIPE;
			break;
		}

		i_head = ipipe->head;
		o_tail = opipe->tail;

		if (pipe_empty(i_head, i_tail) && !ipipe->writers)
			break;

		/*
		 * Cannot make any progress, because either the input
		 * pipe is empty or the output pipe is full.
		 */
		if (pipe_empty(i_head, i_tail) ||
		    pipe_full(o_head, o_tail, opipe->max_usage)) {
			/* Already processed some buffers, break */
			if (ret)
				break;

			if (flags & SPLICE_F_NONBLOCK) {
				ret = -EAGAIN;
				break;
			}

			/*
			 * We raced with another reader/writer and haven't
			 * managed to process any buffers.  A zero return
			 * value means EOF, so retry instead.
			 */
			pipe_unlock(ipipe);
			pipe_unlock(opipe);
			goto retry;
		}

		ibuf = pipe_buf(ipipe, i_tail);
		obuf = pipe_buf(opipe, o_head);

		if (len >= ibuf->len) {
			/*
			 * Simply move the whole buffer from ipipe to opipe
			 */
			*obuf = *ibuf;
			ibuf->ops = NULL;
			i_tail++;
			ipipe->tail = i_tail;
			input_wakeup = true;
			o_len = obuf->len;
			o_head++;
			opipe->head = o_head;
		} else {
			/*
			 * Get a reference to this pipe buffer,
			 * so we can copy the contents over.
			 */
			if (!pipe_buf_get(ipipe, ibuf)) {
				if (ret == 0)
					ret = -EFAULT;
				break;
			}
			*obuf = *ibuf;

			/*
			 * Don't inherit the gift and merge flags, we need to
			 * prevent multiple steals of this page.
			 */
			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
			obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;

			obuf->len = len;
			ibuf->offset += len;
			ibuf->len -= len;
			o_len = len;
			o_head++;
			opipe->head = o_head;
		}
		ret += o_len;
		len -= o_len;
	} while (len);

	pipe_unlock(ipipe);
	pipe_unlock(opipe);

	/*
	 * If we put data in the output pipe, wakeup any potential readers.
	 */
	if (ret > 0)
		wakeup_pipe_readers(opipe);

	if (input_wakeup)
		wakeup_pipe_writers(ipipe);

	return ret;
}

/*
 * Link contents of ipipe to opipe.
 */
static ssize_t link_pipe(struct pipe_inode_info *ipipe,
			 struct pipe_inode_info *opipe,
			 size_t len, unsigned int flags)
{
	struct pipe_buffer *ibuf, *obuf;
	unsigned int i_head, o_head;
	unsigned int i_tail, o_tail;
	ssize_t ret = 0;

	/*
	 * Potential ABBA deadlock, work around it by ordering lock
	 * grabbing by pipe info address. Otherwise two different processes
	 * could deadlock (one doing tee from A -> B, the other from B -> A).
	 */
	pipe_double_lock(ipipe, opipe);

	i_tail = ipipe->tail;
	o_head = opipe->head;

	do {
		if (!opipe->readers) {
			send_sig(SIGPIPE, current, 0);
			if (!ret)
				ret = -EPIPE;
			break;
		}

		i_head = ipipe->head;
		o_tail = opipe->tail;

		/*
		 * If we have iterated all input buffers or run out of
		 * output room, break.
		 */
		if (pipe_empty(i_head, i_tail) ||
		    pipe_full(o_head, o_tail, opipe->max_usage))
			break;

		ibuf = pipe_buf(ipipe, i_tail);
		obuf = pipe_buf(opipe, o_head);

		/*
		 * Get a reference to this pipe buffer,
		 * so we can copy the contents over.
		 */
		if (!pipe_buf_get(ipipe, ibuf)) {
			if (ret == 0)
				ret = -EFAULT;
			break;
		}

		*obuf = *ibuf;

		/*
		 * Don't inherit the gift and merge flag, we need to prevent
		 * multiple steals of this page.
		 */
		obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
		obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;

		if (obuf->len > len)
			obuf->len = len;
		ret += obuf->len;
		len -= obuf->len;

		o_head++;
		opipe->head = o_head;
		i_tail++;
	} while (len);

	pipe_unlock(ipipe);
	pipe_unlock(opipe);

	/*
	 * If we put data in the output pipe, wakeup any potential readers.
	 */
	if (ret > 0)
		wakeup_pipe_readers(opipe);

	return ret;
}

/*
 * This is a tee(1) implementation that works on pipes. It doesn't copy
 * any data, it simply references the 'in' pages on the 'out' pipe.
 * The 'flags' used are the SPLICE_F_* variants, currently the only
 * applicable one is SPLICE_F_NONBLOCK.
 */
ssize_t do_tee(struct file *in, struct file *out, size_t len,
	       unsigned int flags)
{
	struct pipe_inode_info *ipipe = get_pipe_info(in, true);
	struct pipe_inode_info *opipe = get_pipe_info(out, true);
	ssize_t ret = -EINVAL;

	if (unlikely(!(in->f_mode & FMODE_READ) ||
		     !(out->f_mode & FMODE_WRITE)))
		return -EBADF;

	/*
	 * Duplicate the contents of ipipe to opipe without actually
	 * copying the data.
	 */
	if (ipipe && opipe && ipipe != opipe) {
		if ((in->f_flags | out->f_flags) & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		/*
		 * Keep going, unless we encounter an error. The ipipe/opipe
		 * ordering doesn't really matter.
		 */
		ret = ipipe_prep(ipipe, flags);
		if (!ret) {
			ret = opipe_prep(opipe, flags);
			if (!ret)
				ret = link_pipe(ipipe, opipe, len, flags);
		}
	}

	if (ret > 0) {
		fsnotify_access(in);
		fsnotify_modify(out);
	}

	return ret;
}

SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
{
	if (unlikely(flags & ~SPLICE_F_ALL))
		return -EINVAL;

	if (unlikely(!len))
		return 0;

	CLASS(fd, in)(fdin);
	if (fd_empty(in))
		return -EBADF;

	CLASS(fd, out)(fdout);
	if (fd_empty(out))
		return -EBADF;

	return do_tee(fd_file(in), fd_file(out), len, flags);
}
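The reference-only duplication performed by link_pipe() is observable from userspace via tee(2). A hedged sketch assuming Linux (`tee_roundtrip` is an illustrative helper): the same bytes remain readable from the source pipe after tee() runs, because only page references are duplicated.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Queue `msg` in pipe A, duplicate it into pipe B with tee(2), then
 * read both pipes.  Returns the message length, or -1 on error. */
static ssize_t tee_roundtrip(const char *msg, char *out_a, char *out_b,
			     size_t outlen)
{
	int a[2], b[2];
	ssize_t n, got = -1;

	if (pipe(a) < 0 || pipe(b) < 0)
		return -1;

	n = write(a[1], msg, strlen(msg));
	/* tee() does not consume A's buffers, so both reads succeed. */
	if (n > 0 && tee(a[0], b[1], n, 0) == n &&
	    read(a[0], out_a, outlen) == n &&
	    read(b[0], out_b, outlen) == n)
		got = n;

	close(a[0]); close(a[1]);
	close(b[0]); close(b[1]);
	return got;
}
```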
]

21/2 2026/02/24 21:54 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 42751
							output: 19
							thoughts: 1206
						

Thoughts:
**Benign Data Race Analysis**

Okay, let's break this down. From the KCSAN report, I see a write in `iter_file_splice_write` – this is where the `bvec` array is built and the iterator is used. The key is that both tasks are using `ITER_BVEC` to access the same pages.

So, Task 7152 is writing using `generic_perform_write` to ext4, and Task 3545 is also using `generic_perform_write`, but to shmem. Initially, I thought Task 3545 was reading from the ext4 folio, but that's not quite right. `generic_perform_write` means it's also writing! So, Task 3545 is writing to shmem, but the *source* of that write is the ext4 folio.

Right, so Task 7152 is writing *to* ext4 through `iter_file_splice_write`, using data from a pipe. This implies a `sendfile`-like operation where data is read from some other file or socket, put into a pipe, and then written *to* ext4. Task 3545 is taking that *same* data that was written to ext4 and copying it into shmem.

Therefore, the race occurs on the ext4 folio. Task 7152 is doing `memcpy(ext4_folio, pipe_buffer_page, len)`, and Task 3545 is doing `memcpy(shmem_folio, ext4_folio, len)`. I assumed correctly that `pipe_buffer_page` is *not* necessarily the same as `shmem_folio`, and that it doesn't matter.

Task 7152 holds the `inode_lock` on the ext4 inode; Task 3545 does not. So, the race is between these two memory copies on the ext4 folio. It looks like the ext4 folio is being written *to* and read *from* at the same time. Since this is just file data, it's a benign data race. Both copies go through `copy_folio_from_iter_atomic`, which maps the pages with `kmap_local_folio`, so each side operates on a valid, referenced kernel mapping; only the folio *contents* race, not any kernel data structure.

So, I'm concluding that this is indeed a benign data race.



22/2 2026/02/24 21:54 tool set-results (0m)
Args: map[Benign:true Confident:true]
Results: map[Benign:true Confident:true]

23/2 2026/02/24 21:54 llm expert (0m)
Model: gemini-3-flash-preview
Tokens: input: 44515, output: 586, thoughts: 195

Thoughts:
**Analyzing the Data Race - My Perspective**

Okay, so I've been digging into this data race, and I've concluded it's benign, thankfully. The core issue boils down to a race condition between two `memcpy` operations accessing the same folio within the page cache. One process, task 7152, is performing a buffered write *to* an ext4 file through the standard buffered-I/O path, filling a folio in the ext4 page cache. Another task, 3545, is reading from that same folio, using it as the *source* of a write to a shmem file. The wrinkle is that this read happens through a loop device, which was handed this specific folio in a bio.

Here's the rub: the reading task, the loop worker, doesn't hold the `inode_lock` of the ext4 file during its `memcpy` operation. Meanwhile, the writing task *does* hold that lock. That's the crux of the race.

Effectively, this scenario is a standard data race on file contents as seen in the Linux kernel. This race leads to inconsistent data in the *destination* shmem file if the user is performing unsynchronized concurrent I/O. It won't cause memory corruption or crashes in the kernel itself, so we can consider it harmless in the context of system stability. It's essentially the classic "read-while-write" issue, and the system is designed to handle this specific class of races.


