syzbot

ID	Workflow	Result	Correct	Bug	Created	Started	Finished	Revision	Error
cbb3e7f6-7c59-4ee5-b470-059dfcdc8232	assessment-kcsan	Benign: ✅ Confident: ✅	❓	KCSAN: data-race in filemap_read / filemap_splice_read (3)	2026/02/26 05:04	2026/02/26 05:04	2026/02/26 05:14	305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9

The data race occurs on the `prev_pos` field of `struct file_ra_state`, which is part of the `struct file`. This field is used by the readahead algorithm to track the last byte read in the previous request, helping it detect sequential access patterns and decide when to trigger readahead.

The race happens when multiple tasks perform read operations (such as `filemap_read` and `filemap_splice_read`) on the same file descriptor concurrently. In the reported case, one task is performing a `sendfile` (which uses `filemap_splice_read`) while another is performing an `io_uring` read (which uses `filemap_read`). Both functions update `in->f_ra.prev_pos` to reflect their progress.

This data race is considered **benign** for the following reasons:
1.  **Hint-only field**: The `prev_pos` field (and the entire `file_ra_state` structure) is used only as a hint for readahead and LRU management. It does not affect the correctness of the data returned to the user or the integrity of the filesystem.
2.  **Intentional lack of synchronization**: The readahead state is intentionally accessed without strict synchronization (like locks) to avoid performance degradation on hot read paths. The kernel documentation for readahead (`mm/readahead.c`) explicitly mentions that concurrent streams on the same file descriptor can interfere with each other's readahead state.
3.  **Atomic writes**: On the x86_64 architecture, 8-byte aligned writes to a `loff_t` are atomic. The `prev_pos` field is 8-byte aligned within `struct file_ra_state`. Thus, even with a race, the field will always contain a valid (though potentially "stale" from another thread's perspective) offset.
4.  **Robust algorithm**: The readahead algorithm is designed to be robust. If `prev_pos` is updated by another thread, the algorithm might temporarily fail to detect a sequential pattern and fall back to a slower path that queries the page cache, but it will still function correctly and eventually re-establish readahead if the pattern persists.
5.  **Harmless side effects**: In `filemap_read`, `prev_pos` is also used to avoid redundant `folio_mark_accessed()` calls. A race here might result in an extra call to mark a folio as accessed, which is a harmless operation for the LRU cache.

In summary, this is a known and expected data race on a performance-hinting field that does not compromise system stability or data integrity.

==================================================================
BUG: KCSAN: data-race in filemap_read / filemap_splice_read

write to 0xffff88811a6f8228 of 8 bytes by task 10061 on cpu 0:
 filemap_splice_read+0x523/0x780 mm/filemap.c:3125
 do_splice_read fs/splice.c:980 [inline]
 splice_direct_to_actor+0x26e/0x670 fs/splice.c:1084
 do_splice_direct_actor fs/splice.c:1202 [inline]
 do_splice_direct+0x119/0x1a0 fs/splice.c:1228
 do_sendfile+0x382/0x650 fs/read_write.c:1372
 __do_sys_sendfile64 fs/read_write.c:1433 [inline]
 __se_sys_sendfile64 fs/read_write.c:1419 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1419
 x64_sys_call+0x2dc4/0x3020 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

write to 0xffff88811a6f8228 of 8 bytes by task 10066 on cpu 1:
 filemap_read+0x98d/0xa10 mm/filemap.c:2873
 generic_file_read_iter+0x79/0x330 mm/filemap.c:2992
 io_iter_do_read io_uring/rw.c:835 [inline]
 __io_read+0x463/0xc50 io_uring/rw.c:951
 io_read+0x4a/0x190 io_uring/rw.c:1031
 __io_issue_sqe+0xfd/0x2d0 io_uring/io_uring.c:1384
 io_issue_sqe+0x206/0xc00 io_uring/io_uring.c:1407
 io_wq_submit_work+0x5bc/0x830 io_uring/io_uring.c:1519
 io_worker_handle_work+0x41e/0x950 io_uring/io-wq.c:652
 io_wq_worker+0x23d/0x970 io_uring/io-wq.c:706
 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

value changed: 0x00000000000000d8 -> 0x00000000000014f0

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 10066 Comm: iou-wrk-10063 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
==================================================================

Trajectory:
Seq	Timestamp	Type	Name	Duration
0/0	2026/02/26 05:04	flow	assessment-kcsan	9m Results: map[Benign:true Confident:true Explanation:The data race occurs on the `prev_pos` field of `struct file_ra_state`, which is part of the `struct file`. This field is used by the readahead algorithm to track the last byte read in the previous request, helping it detect sequential access patterns and decide when to trigger readahead. The race happens when multiple tasks perform read operations (such as `filemap_read` and `filemap_splice_read`) on the same file descriptor concurrently. In the reported case, one task is performing a `sendfile` (which uses `filemap_splice_read`) while another is performing an `io_uring` read (which uses `filemap_read`). Both functions update `in->f_ra.prev_pos` to reflect their progress. This data race is considered benign for the following reasons: 1. Hint-only field: The `prev_pos` field (and the entire `file_ra_state` structure) is used only as a hint for readahead and LRU management. It does not affect the correctness of the data returned to the user or the integrity of the filesystem. 2. Intentional lack of synchronization: The readahead state is intentionally accessed without strict synchronization (like locks) to avoid performance degradation on hot read paths. The kernel documentation for readahead (`mm/readahead.c`) explicitly mentions that concurrent streams on the same file descriptor can interfere with each other's readahead state. 3. Atomic writes: On the x86_64 architecture, 8-byte aligned writes to a `loff_t` are atomic. The `prev_pos` field is 8-byte aligned within `struct file_ra_state`. Thus, even with a race, the field will always contain a valid (though potentially "stale" from another thread's perspective) offset. 4. Robust algorithm: The readahead algorithm is designed to be robust. If `prev_pos` is updated by another thread, the algorithm might temporarily fail to detect a sequential pattern and fall back to a slower path that queries the page cache, but it will still function correctly and eventually re-establish readahead if the pattern persists. 5. Harmless side effects: In `filemap_read`, `prev_pos` is also used to avoid redundant `folio_mark_accessed()` calls. A race here might result in an extra call to mark a folio as accessed, which is a harmless operation for the LRU cache. In summary, this is a known and expected data race on a performance-hinting field that does not compromise system stability or data integrity.]
1/1	2026/02/26 05:04	action	kernel-checkouter	0m Results: map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/4a2b064596d08cb6a84691d51f20dc6b9bfd3434]
2/1	2026/02/26 05:05	action	kernel-builder	3m Results: map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/da6fca5c4649a67b9e82aa07431048b6f910d2c8]
3/1	2026/02/26 05:09	action	codesearch-prepare	2m Results: map[Index:codesearch-index]
4/1	2026/02/26 05:12	agent	expert	1m Model: gemini-3-flash-preview Results: map[Benign:true Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. A "benign" data races are on a simple int/bool variable or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A non-benign (or "harmful" data race) can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data races is race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. Also consider races that happen at the same time with the given one. If there is no synchronization in between, other memory accesses in the involved threads race with each other if they access the same memory. For example, if both threads execute: some_struct->have_elements = true; list_add(new_node, &some_struct->list_head); the race on some_struct->have_elements may appear benign, however it also implies there is a race on some_struct->list_head which is not benign, since the list is not thread-safe. Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses, and thus even with a conservative compiler may not be fully atomic. However, such races may still be benign depending on how writes are done, and how read data is used. In the final reply explain why you think the given data race is benign or is harmful. Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. Prompt: The data race report is: ================================================================== BUG: KCSAN: data-race in filemap_read / filemap_splice_read write to 0xffff88811a6f8228 of 8 bytes by task 10061 on cpu 0: filemap_splice_read+0x523/0x780 mm/filemap.c:3125 do_splice_read fs/splice.c:980 [inline] splice_direct_to_actor+0x26e/0x670 fs/splice.c:1084 do_splice_direct_actor fs/splice.c:1202 [inline] do_splice_direct+0x119/0x1a0 fs/splice.c:1228 do_sendfile+0x382/0x650 fs/read_write.c:1372 __do_sys_sendfile64 fs/read_write.c:1433 [inline] __se_sys_sendfile64 fs/read_write.c:1419 [inline] __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1419 x64_sys_call+0x2dc4/0x3020 arch/x86/include/generated/asm/syscalls_64.h:41 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f write to 0xffff88811a6f8228 of 8 bytes by task 10066 on cpu 1: filemap_read+0x98d/0xa10 mm/filemap.c:2873 generic_file_read_iter+0x79/0x330 mm/filemap.c:2992 io_iter_do_read io_uring/rw.c:835 [inline] __io_read+0x463/0xc50 io_uring/rw.c:951 io_read+0x4a/0x190 io_uring/rw.c:1031 __io_issue_sqe+0xfd/0x2d0 io_uring/io_uring.c:1384 io_issue_sqe+0x206/0xc00 io_uring/io_uring.c:1407 io_wq_submit_work+0x5bc/0x830 io_uring/io_uring.c:1519 io_worker_handle_work+0x41e/0x950 io_uring/io-wq.c:652 io_wq_worker+0x23d/0x970 io_uring/io-wq.c:706 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 value changed: 0x00000000000000d8 -> 0x00000000000014f0 Reported by Kernel Concurrency Sanitizer on: CPU: 1 UID: 0 PID: 10066 Comm: iou-wrk-10063 Tainted: G W syzkaller #0 PREEMPT(full) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 ================================================================== Reply: The data race occurs on the `prev_pos` field of `struct file_ra_state`, which is part of the `struct file`. This field is used by the readahead algorithm to track the last byte read in the previous request, helping it detect sequential access patterns and decide when to trigger readahead. The race happens when multiple tasks perform read operations (such as `filemap_read` and `filemap_splice_read`) on the same file descriptor concurrently. In the reported case, one task is performing a `sendfile` (which uses `filemap_splice_read`) while another is performing an `io_uring` read (which uses `filemap_read`). Both functions update `in->f_ra.prev_pos` to reflect their progress. This data race is considered benign for the following reasons: 1. Hint-only field: The `prev_pos` field (and the entire `file_ra_state` structure) is used only as a hint for readahead and LRU management. It does not affect the correctness of the data returned to the user or the integrity of the filesystem. 2. Intentional lack of synchronization: The readahead state is intentionally accessed without strict synchronization (like locks) to avoid performance degradation on hot read paths. The kernel documentation for readahead (`mm/readahead.c`) explicitly mentions that concurrent streams on the same file descriptor can interfere with each other's readahead state. 3. Atomic writes: On the x86_64 architecture, 8-byte aligned writes to a `loff_t` are atomic. The `prev_pos` field is 8-byte aligned within `struct file_ra_state`. Thus, even with a race, the field will always contain a valid (though potentially "stale" from another thread's perspective) offset. 4. Robust algorithm: The readahead algorithm is designed to be robust. If `prev_pos` is updated by another thread, the algorithm might temporarily fail to detect a sequential pattern and fall back to a slower path that queries the page cache, but it will still function correctly and eventually re-establish readahead if the pattern persists. 5. Harmless side effects: In `filemap_read`, `prev_pos` is also used to avoid redundant `folio_mark_accessed()` calls. A race here might result in an extra call to mark a folio as accessed, which is a harmless operation for the LRU cache. In summary, this is a known and expected data race on a performance-hinting field that does not compromise system stability or data integrity.
5/2	2026/02/26 05:12	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 4040 output: 41 thoughts: 142 Thoughts: Okay, here's my summary, taking on the persona you've requested. Analyzing the Reported Data Race Alright, a data race has popped up, specifically between `filemap_splice_read` and `filemap_read`. The report is pointing a finger at a shared memory address: `0xffff88811a6f8228`. Both of these functions, apparently, are issuing 8-byte writes to the same location. That's a red flag, and it warrants a deeper dive. My immediate next steps are to pinpoint the exact code involved. The report has conveniently provided the specific lines: `mm/filemap.c:3125` for `filemap_splice_read` and `mm/filemap.c:2873` for `filemap_read`. I'll first list the files as a reference, then I’m going to go straight to those lines and examine the surrounding context. Understanding the adjacent code is crucial for getting a full picture of what's happening. I need to see what's being written, under what conditions, and if there are any obvious synchronization mechanisms in place, or the lack thereof. Let’s get to it.
6/2	2026/02/26 05:12	tool	codesearch-dir-index	0m Args: map[Dir:mm/] Results: map[Files:[backing-dev.c balloon.c bootmem_info.c bpf_memcontrol.c cma.c cma.h cma_debug.c cma_sysfs.c compaction.c debug.c debug_page_alloc.c debug_page_ref.c debug_vm_pgtable.c dmapool.c dmapool_test.c early_ioremap.c execmem.c fadvise.c fail_page_alloc.c failslab.c filemap.c folio-compat.c gup.c gup_test.c gup_test.h highmem.c hmm.c huge_memory.c hugetlb.c hugetlb_cgroup.c hugetlb_cma.c hugetlb_cma.h hugetlb_internal.h hugetlb_sysctl.c hugetlb_sysfs.c hugetlb_vmemmap.c hugetlb_vmemmap.h hwpoison-inject.c init-mm.c internal.h interval_tree.c ioremap.c khugepaged.c kmemleak.c ksm.c list_lru.c maccess.c madvise.c mapping_dirty_helpers.c memblock.c memcontrol-v1.c memcontrol-v1.h memcontrol.c memfd.c memfd_luo.c memory-failure.c memory-tiers.c memory.c memory_hotplug.c mempolicy.c mempool.c memremap.c memtest.c migrate.c migrate_device.c mincore.c mlock.c mm_init.c mm_slot.h mmap.c mmap_lock.c mmu_gather.c mmu_notifier.c mmzone.c mprotect.c mremap.c mseal.c msync.c nommu.c numa.c numa_emulation.c numa_memblks.c oom_kill.c page-writeback.c page_alloc.c page_counter.c page_ext.c page_frag_cache.c page_idle.c page_io.c page_isolation.c page_owner.c page_poison.c page_reporting.c page_reporting.h page_table_check.c page_vma_mapped.c pagewalk.c percpu-internal.h percpu-km.c percpu-stats.c percpu-vm.c percpu.c pgalloc-track.h pgtable-generic.c process_vm_access.c ptdump.c readahead.c rmap.c rodata_test.c secretmem.c shmem.c shmem_quota.c show_mem.c shrinker.c shrinker_debug.c shuffle.c shuffle.h slab.h slab_common.c slub.c sparse-vmemmap.c sparse.c swap.c swap.h swap_cgroup.c swap_state.c swap_table.h swapfile.c truncate.c usercopy.c userfaultfd.c util.c vma.c vma.h vma_exec.c vma_init.c vma_internal.h vmalloc.c vmpressure.c vmscan.c vmstat.c workingset.c zpdesc.h zsmalloc.c zswap.c] Subdirs:[damon kasan kfence kmsan tests]]
7/2	2026/02/26 05:12	tool	read-file	0m Args: map[File:mm/filemap.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0-only /* * linux/mm/filemap.c * * Copyright (C) 1994-1999 Linus Torvalds / / * This file handles the generic file mmap semantics used by * most "normal" filesystems (but you don't /have/ to use this: * the NFS filesystem used to do this differently, for example) / #include <linux/export.h> #include <linux/compiler.h> #include <linux/dax.h> #include <linux/fs.h> #include <linux/sched/signal.h> #include <linux/uaccess.h> #include <linux/capability.h> #include <linux/kernel_stat.h> #include <linux/gfp.h> #include <linux/mm.h> #include <linux/swap.h> #include <linux/leafops.h> #include <linux/syscalls.h> #include <linux/mman.h> #include <linux/pagemap.h> #include <linux/file.h> #include <linux/uio.h> #include <linux/error-injection.h> #include <linux/hash.h> #include <linux/writeback.h> #include <linux/backing-dev.h> #include <linux/pagevec.h> #include <linux/security.h> #include <linux/cpuset.h> #include <linux/hugetlb.h> #include <linux/memcontrol.h> #include <linux/shmem_fs.h> #include <linux/rmap.h> #include <linux/delayacct.h> #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> #include <linux/migrate.h> #include <linux/pipe_fs_i.h> #include <linux/splice.h> #include <linux/rcupdate_wait.h> #include <linux/sched/mm.h> #include <linux/sysctl.h> #include <linux/pgalloc.h> #include <asm/tlbflush.h> #include "internal.h" #define CREATE_TRACE_POINTS #include <trace/events/filemap.h> / * FIXME: remove all knowledge of the buffer layer from the core VM / #include <linux/buffer_head.h> / for try_to_free_buffers / #include <asm/mman.h> #include "swap.h" / * Shared mappings implemented 30.11.1994. It's not fully working yet, * though. * * Shared mappings now work. 15.8.1995 Bruno. * * finished 'unifying' the page and buffer cache and SMP-threaded the * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com> * * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de> / / * Lock ordering: * * ->i_mmap_rwsem (truncate_pagecache) * ->private_lock (__free_pte->block_dirty_folio) * ->swap_lock (exclusive_swap_page, others) * ->i_pages lock * * ->i_rwsem * ->invalidate_lock (acquired by fs in truncate path) * ->i_mmap_rwsem (truncate->unmap_mapping_range) * * ->mmap_lock * ->i_mmap_rwsem * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->i_pages lock (arch-dependent flush_dcache_mmap_lock) * * ->mmap_lock * ->invalidate_lock (filemap_fault) * ->lock_page (filemap_fault, access_process_vm) * * ->i_rwsem (generic_perform_write) * ->mmap_lock (fault_in_readable->do_page_fault) * * bdi->wb.list_lock * sb_lock (fs/fs-writeback.c) * ->i_pages lock (__sync_single_inode) * * ->i_mmap_rwsem * ->anon_vma.lock (vma_merge) * * ->anon_vma.lock * ->page_table_lock or pte_lock (anon_vma_prepare and various) * * ->page_table_lock or pte_lock * ->swap_lock (try_to_unmap_one) * ->private_lock (try_to_unmap_one) * ->i_pages lock (try_to_unmap_one) * ->lruvec->lru_lock (follow_page_mask->mark_page_accessed) * ->lruvec->lru_lock (check_pte_range->folio_isolate_lru) * ->private_lock (folio_remove_rmap_pte->set_page_dirty) * ->i_pages lock (folio_remove_rmap_pte->set_page_dirty) * bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty) * ->inode->i_lock (folio_remove_rmap_pte->set_page_dirty) * bdi.wb->list_lock (zap_pte_range->set_page_dirty) * ->inode->i_lock (zap_pte_range->set_page_dirty) * ->private_lock (zap_pte_range->block_dirty_folio) / static void page_cache_delete(struct address_space mapping, struct folio folio, void shadow) { XA_STATE(xas, &mapping->i_pages, folio->index); long nr = 1; mapping_set_update(&xas, mapping); xas_set_order(&xas, folio->index, folio_order(folio)); nr = folio_nr_pages(folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); xas_store(&xas, shadow); xas_init_marks(&xas); folio->mapping = NULL; /* Leave folio->index set: truncation lookup relies upon it / mapping->nrpages -= nr; } static void filemap_unaccount_folio(struct address_space mapping, struct folio folio) { long nr; VM_BUG_ON_FOLIO(folio_mapped(folio), folio); if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(folio_mapped(folio))) { pr_alert("BUG: Bad page cache in process %s pfn:%05lx\n", current->comm, folio_pfn(folio)); dump_page(&folio->page, "still mapped when deleted"); dump_stack(); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); if (mapping_exiting(mapping) && !folio_test_large(folio)) { int mapcount = folio_mapcount(folio); if (folio_ref_count(folio) >= mapcount + 2) { / * All vmas have already been torn down, so it's * a good bet that actually the page is unmapped * and we'd rather not leak it: if we're wrong, * another bad page check should catch it later. / atomic_set(&folio->_mapcount, -1); folio_ref_sub(folio, mapcount); } } } / hugetlb folios do not participate in page cache accounting. / if (folio_test_hugetlb(folio)) return; nr = folio_nr_pages(folio); lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr); if (folio_test_swapbacked(folio)) { lruvec_stat_mod_folio(folio, NR_SHMEM, -nr); if (folio_test_pmd_mappable(folio)) lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr); } else if (folio_test_pmd_mappable(folio)) { lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr); filemap_nr_thps_dec(mapping); } if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags)) mod_node_page_state(folio_pgdat(folio), NR_KERNEL_FILE_PAGES, -nr); / * At this point folio must be either written or cleaned by * truncate. Dirty folio here signals a bug and loss of * unwritten data - on ordinary filesystems. * * But it's harmless on in-memory filesystems like tmpfs; and can * occur when a driver which did get_user_pages() sets page dirty * before putting it, while the inode is being finally evicted. * * Below fixes dirty accounting after removing the folio entirely * but leaves the dirty flag set: it has no effect for truncated * folio and anyway will be cleared before returning folio to * buddy allocator. / if (WARN_ON_ONCE(folio_test_dirty(folio) && mapping_can_writeback(mapping))) folio_account_cleaned(folio, inode_to_wb(mapping->host)); } / * Delete a page from the page cache and free it. Caller has to make * sure the page is locked and that nobody else uses it - or that usage * is safe. The caller must hold the i_pages lock. / void __filemap_remove_folio(struct folio folio, void shadow) { struct address_space mapping = folio->mapping; trace_mm_filemap_delete_from_page_cache(folio); filemap_unaccount_folio(mapping, folio); page_cache_delete(mapping, folio, shadow); } void filemap_free_folio(struct address_space mapping, struct folio folio) { void (free_folio)(struct folio ); free_folio = mapping->a_ops->free_folio; if (free_folio) free_folio(folio); folio_put_refs(folio, folio_nr_pages(folio)); } /** * filemap_remove_folio - Remove folio from page cache. * @folio: The folio. * * This must be called only on folios that are locked and have been * verified to be in the page cache. It will never put the folio into * the free list because the caller has a reference on the page. / void filemap_remove_folio(struct folio folio) { struct address_space mapping = folio->mapping; BUG_ON(!folio_test_locked(folio)); spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __filemap_remove_folio(folio, NULL); xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) inode_lru_list_add(mapping->host); spin_unlock(&mapping->host->i_lock); filemap_free_folio(mapping, folio); } / * page_cache_delete_batch - delete several folios from page cache * @mapping: the mapping to which folios belong * @fbatch: batch of folios to delete * * The function walks over mapping->i_pages and removes folios passed in * @fbatch from the mapping. The function expects @fbatch to be sorted * by page index and is optimised for it to be dense. * It tolerates holes in @fbatch (mapping entries at those indices are not * modified). * * The function expects the i_pages lock to be held. / static void page_cache_delete_batch(struct address_space mapping, struct folio_batch fbatch) { XA_STATE(xas, &mapping->i_pages, fbatch->folios[0]->index); long total_pages = 0; int i = 0; struct folio folio; mapping_set_update(&xas, mapping); xas_for_each(&xas, folio, ULONG_MAX) { if (i >= folio_batch_count(fbatch)) break; /* A swap/dax/shadow entry got inserted? Skip it. / if (xa_is_value(folio)) continue; / * A page got inserted in our range? Skip it. We have our * pages locked so they are protected from being removed. * If we see a page whose index is higher than ours, it * means our page has been removed, which shouldn't be * possible because we're holding the PageLock. / if (folio != fbatch->folios[i]) { VM_BUG_ON_FOLIO(folio->index > fbatch->folios[i]->index, folio); continue; } WARN_ON_ONCE(!folio_test_locked(folio)); folio->mapping = NULL; / Leave folio->index set: truncation lookup relies on it / i++; xas_store(&xas, NULL); total_pages += folio_nr_pages(folio); } mapping->nrpages -= total_pages; } void delete_from_page_cache_batch(struct address_space mapping, struct folio_batch fbatch) { int i; if (!folio_batch_count(fbatch)) return; spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); for (i = 0; i < folio_batch_count(fbatch); i++) { struct folio folio = fbatch->folios[i]; trace_mm_filemap_delete_from_page_cache(folio); filemap_unaccount_folio(mapping, folio); } page_cache_delete_batch(mapping, fbatch); xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) inode_lru_list_add(mapping->host); spin_unlock(&mapping->host->i_lock); for (i = 0; i < folio_batch_count(fbatch); i++) filemap_free_folio(mapping, fbatch->folios[i]); } int filemap_check_errors(struct address_space mapping) { int ret = 0; / Check for outstanding write errors / if (test_bit(AS_ENOSPC, &mapping->flags) && test_and_clear_bit(AS_ENOSPC, &mapping->flags)) ret = -ENOSPC; if (test_bit(AS_EIO, &mapping->flags) && test_and_clear_bit(AS_EIO, &mapping->flags)) ret = -EIO; return ret; } EXPORT_SYMBOL(filemap_check_errors); static int filemap_check_and_keep_errors(struct address_space mapping) { /* Check for outstanding write errors / if (test_bit(AS_EIO, &mapping->flags)) return -EIO; if (test_bit(AS_ENOSPC, &mapping->flags)) return -ENOSPC; return 0; } static int filemap_writeback(struct address_space mapping, loff_t start, loff_t end, enum writeback_sync_modes sync_mode, long nr_to_write) { struct writeback_control wbc = { .sync_mode = sync_mode, .nr_to_write = nr_to_write ? nr_to_write : LONG_MAX, .range_start = start, .range_end = end, }; int ret; if (!mapping_can_writeback(mapping) \|\| !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) return 0; wbc_attach_fdatawrite_inode(&wbc, mapping->host); ret = do_writepages(mapping, &wbc); wbc_detach_inode(&wbc); if (!ret && nr_to_write) nr_to_write = wbc.nr_to_write; return ret; } /* * filemap_fdatawrite_range - start writeback on mapping dirty pages in range * @mapping: address space structure to write * @start: offset in bytes where the range starts * @end: offset in bytes where the range ends (inclusive) * * Start writeback against all of a mapping's dirty pages that lie * within the byte offsets <start, end> inclusive. * * This is a data integrity operation that waits upon dirty or in writeback * pages. * * Return: %0 on success, negative error code otherwise. / int filemap_fdatawrite_range(struct address_space mapping, loff_t start, loff_t end) { return filemap_writeback(mapping, start, end, WB_SYNC_ALL, NULL); } EXPORT_SYMBOL(filemap_fdatawrite_range); int filemap_fdatawrite(struct address_space mapping) { return filemap_fdatawrite_range(mapping, 0, LLONG_MAX); } EXPORT_SYMBOL(filemap_fdatawrite); /* * filemap_flush_range - start writeback on a range * @mapping: target address_space * @start: index to start writeback on * @end: last (inclusive) index for writeback * * This is a non-integrity writeback helper, to start writing back folios * for the indicated range. * * Return: %0 on success, negative error code otherwise. / int filemap_flush_range(struct address_space mapping, loff_t start, loff_t end) { return filemap_writeback(mapping, start, end, WB_SYNC_NONE, NULL); } EXPORT_SYMBOL_GPL(filemap_flush_range); /** * filemap_flush - mostly a non-blocking flush * @mapping: target address_space * * This is a mostly non-blocking flush. Not suitable for data-integrity * purposes - I/O may not be started against all dirty pages. * * Return: %0 on success, negative error code otherwise. / int filemap_flush(struct address_space mapping) { return filemap_flush_range(mapping, 0, LLONG_MAX); } EXPORT_SYMBOL(filemap_flush); /* * Start writeback on @nr_to_write pages from @mapping. No one but the existing * btrfs caller should be using this. Talk to linux-mm if you think adding a * new caller is a good idea. / int filemap_flush_nr(struct address_space mapping, long nr_to_write) { return filemap_writeback(mapping, 0, LLONG_MAX, WB_SYNC_NONE, nr_to_write); } EXPORT_SYMBOL_FOR_MODULES(filemap_flush_nr, "btrfs"); /* * filemap_range_has_page - check if a page exists in range. * @mapping: address space within which to check * @start_byte: offset in bytes where the range starts * @end_byte: offset in bytes where the range ends (inclusive) * * Find at least one page in the range supplied, usually used to check if * direct writing in this range will trigger a writeback. * * Return: %true if at least one page exists in the specified range, * %false otherwise. / bool filemap_range_has_page(struct address_space mapping, loff_t start_byte, loff_t end_byte) { struct folio folio; XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT); pgoff_t max = end_byte >> PAGE_SHIFT; if (end_byte < start_byte) return false; rcu_read_lock(); for (;;) { folio = xas_find(&xas, max); if (xas_retry(&xas, folio)) continue; / Shadow entries don't count / if (xa_is_value(folio)) continue; / * We don't need to try to pin this page; we're about to * release the RCU lock anyway. It is enough to know that * there was a page here recently. / break; } rcu_read_unlock(); return folio != NULL; } EXPORT_SYMBOL(filemap_range_has_page); static void __filemap_fdatawait_range(struct address_space mapping, loff_t start_byte, loff_t end_byte) { pgoff_t index = start_byte >> PAGE_SHIFT; pgoff_t end = end_byte >> PAGE_SHIFT; struct folio_batch fbatch; unsigned nr_folios; folio_batch_init(&fbatch); while (index <= end) { unsigned i; nr_folios = filemap_get_folios_tag(mapping, &index, end, PAGECACHE_TAG_WRITEBACK, &fbatch); if (!nr_folios) break; for (i = 0; i < nr_folios; i++) { struct folio folio = fbatch.folios[i]; folio_wait_writeback(folio); } folio_batch_release(&fbatch); cond_resched(); } } /* * filemap_fdatawait_range - wait for writeback to complete * @mapping: address space structure to wait for * @start_byte: offset in bytes where the range starts * @end_byte: offset in bytes where the range ends (inclusive) * * Walk the list of under-writeback pages of the given address space * in the given range and wait for all of them. Check error status of * the address space and return it. * * Since the error status of the address space is cleared by this function, * callers are responsible for checking the return value and handling and/or * reporting the error. * * Return: error status of the address space. / int filemap_fdatawait_range(struct address_space mapping, loff_t start_byte, loff_t end_byte) { __filemap_fdatawait_range(mapping, start_byte, end_byte); return filemap_check_errors(mapping); } EXPORT_SYMBOL(filemap_fdatawait_range); /** * filemap_fdatawait_range_keep_errors - wait for writeback to complete * @mapping: address space structure to wait for * @start_byte: offset in bytes where the range starts * @end_byte: offset in bytes where the range ends (inclusive) * * Walk the list of under-writeback pages of the given address space in the * given range and wait for all of them. Unlike filemap_fdatawait_range(), * this function does not clear error status of the address space. * * Use this function if callers don't handle errors themselves. Expected * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), * fsfreeze(8) / int filemap_fdatawait_range_keep_errors(struct address_space mapping, loff_t start_byte, loff_t end_byte) { __filemap_fdatawait_range(mapping, start_byte, end_byte); return filemap_check_and_keep_errors(mapping); } EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors); /** * file_fdatawait_range - wait for writeback to complete * @file: file pointing to address space structure to wait for * @start_byte: offset in bytes where the range starts * @end_byte: offset in bytes where the range ends (inclusive) * * Walk the list of under-writeback pages of the address space that file * refers to, in the given range and wait for all of them. Check error * status of the address space vs. the file->f_wb_err cursor and return it. * * Since the error status of the file is advanced by this function, * callers are responsible for checking the return value and handling and/or * reporting the error. * * Return: error status of the address space vs. the file->f_wb_err cursor. / int file_fdatawait_range(struct file file, loff_t start_byte, loff_t end_byte) { struct address_space mapping = file->f_mapping; __filemap_fdatawait_range(mapping, start_byte, end_byte); return file_check_and_advance_wb_err(file); } EXPORT_SYMBOL(file_fdatawait_range); /* * filemap_fdatawait_keep_errors - wait for writeback without clearing errors * @mapping: address space structure to wait for * * Walk the list of under-writeback pages of the given address space * and wait for all of them. Unlike filemap_fdatawait(), this function * does not clear error status of the address space. * * Use this function if callers don't handle errors themselves. Expected * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), * fsfreeze(8) * * Return: error status of the address space. / int filemap_fdatawait_keep_errors(struct address_space mapping) { __filemap_fdatawait_range(mapping, 0, LLONG_MAX); return filemap_check_and_keep_errors(mapping); } EXPORT_SYMBOL(filemap_fdatawait_keep_errors); /* Returns true if writeback might be needed or already in progress. / static bool mapping_needs_writeback(struct address_space mapping) { return mapping->nrpages; } bool filemap_range_has_writeback(struct address_space mapping, loff_t start_byte, loff_t end_byte) { XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT); pgoff_t max = end_byte >> PAGE_SHIFT; struct folio folio; if (end_byte < start_byte) return false; rcu_read_lock(); xas_for_each(&xas, folio, max) { if (xas_retry(&xas, folio)) continue; if (xa_is_value(folio)) continue; if (folio_test_dirty(folio) \|\| folio_test_locked(folio) \|\| folio_test_writeback(folio)) break; } rcu_read_unlock(); return folio != NULL; } EXPORT_SYMBOL_GPL(filemap_range_has_writeback); /** * filemap_write_and_wait_range - write out & wait on a file range * @mapping: the address_space for the pages * @lstart: offset in bytes where the range starts * @lend: offset in bytes where the range ends (inclusive) * * Write out and wait upon file offsets lstart->lend, inclusive. * * Note that @lend is inclusive (describes the last byte to be written) so * that this function can be used to write to the very end-of-file (end = -1). * * Return: error status of the address space. / int filemap_write_and_wait_range(struct address_space mapping, loff_t lstart, loff_t lend) { int err = 0, err2; if (lend < lstart) return 0; if (mapping_needs_writeback(mapping)) { err = filemap_fdatawrite_range(mapping, lstart, lend); /* * Even if the above returned error, the pages may be * written partially (e.g. -ENOSPC), so we wait for it. * But the -EIO is special case, it may indicate the worst * thing (e.g. bug) happened, so we avoid waiting for it. / if (err != -EIO) __filemap_fdatawait_range(mapping, lstart, lend); } err2 = filemap_check_errors(mapping); if (!err) err = err2; return err; } EXPORT_SYMBOL(filemap_write_and_wait_range); void __filemap_set_wb_err(struct address_space mapping, int err) { errseq_t eseq = errseq_set(&mapping->wb_err, err); trace_filemap_set_wb_err(mapping, eseq); } EXPORT_SYMBOL(__filemap_set_wb_err); /** * file_check_and_advance_wb_err - report wb error (if any) that was previously * and advance wb_err to current one * @file: struct file on which the error is being reported * * When userland calls fsync (or something like nfsd does the equivalent), we * want to report any writeback errors that occurred since the last fsync (or * since the file was opened if there haven't been any). * * Grab the wb_err from the mapping. If it matches what we have in the file, * then just quickly return 0. The file is all caught up. * * If it doesn't match, then take the mapping value, set the "seen" flag in * it and try to swap it into place. If it works, or another task beat us * to it with the new value, then update the f_wb_err and return the error * portion. The error at this point must be reported via proper channels * (a'la fsync, or NFS COMMIT operation, etc.). * * While we handle mapping->wb_err with atomic operations, the f_wb_err * value is protected by the f_lock since we must ensure that it reflects * the latest value swapped in for this file descriptor. * * Return: %0 on success, negative error code otherwise. / int file_check_and_advance_wb_err(struct file file) { int err = 0; errseq_t old = READ_ONCE(file->f_wb_err); struct address_space mapping = file->f_mapping; / Locklessly handle the common case where nothing has changed / if (errseq_check(&mapping->wb_err, old)) { / Something changed, must use slow path / spin_lock(&file->f_lock); old = file->f_wb_err; err = errseq_check_and_advance(&mapping->wb_err, &file->f_wb_err); trace_file_check_and_advance_wb_err(file, old); spin_unlock(&file->f_lock); } / * We're mostly using this function as a drop in replacement for * filemap_check_errors. Clear AS_EIO/AS_ENOSPC to emulate the effect * that the legacy code would have had on these flags. / clear_bit(AS_EIO, &mapping->flags); clear_bit(AS_ENOSPC, &mapping->flags); return err; } EXPORT_SYMBOL(file_check_and_advance_wb_err); /* * file_write_and_wait_range - write out & wait on a file range * @file: file pointing to address_space with pages * @lstart: offset in bytes where the range starts * @lend: offset in bytes where the range ends (inclusive) * * Write out and wait upon file offsets lstart->lend, inclusive. * * Note that @lend is inclusive (describes the last byte to be written) so * that this function can be used to write to the very end-of-file (end = -1). * * After writing out and waiting on the data, we check and advance the * f_wb_err cursor to the latest value, and return any errors detected there. * * Return: %0 on success, negative error code otherwise. / int file_write_and_wait_range(struct file file, loff_t lstart, loff_t lend) { int err = 0, err2; struct address_space mapping = file->f_mapping; if (lend < lstart) return 0; if (mapping_needs_writeback(mapping)) { err = filemap_fdatawrite_range(mapping, lstart, lend); / See comment of filemap_write_and_wait() / if (err != -EIO) __filemap_fdatawait_range(mapping, lstart, lend); } err2 = file_check_and_advance_wb_err(file); if (!err) err = err2; return err; } EXPORT_SYMBOL(file_write_and_wait_range); /* * replace_page_cache_folio - replace a pagecache folio with a new one * @old: folio to be replaced * @new: folio to replace with * * This function replaces a folio in the pagecache with a new one. On * success it acquires the pagecache reference for the new folio and * drops it for the old folio. Both the old and new folios must be * locked. This function does not add the new folio to the LRU, the * caller must do that. * * The remove + add is atomic. This function cannot fail. / void replace_page_cache_folio(struct folio old, struct folio new) { struct address_space mapping = old->mapping; void (free_folio)(struct folio ) = mapping->a_ops->free_folio; pgoff_t offset = old->index; XA_STATE(xas, &mapping->i_pages, offset); VM_BUG_ON_FOLIO(!folio_test_locked(old), old); VM_BUG_ON_FOLIO(!folio_test_locked(new), new); VM_BUG_ON_FOLIO(new->mapping, new); folio_get(new); new->mapping = mapping; new->index = offset; mem_cgroup_replace_folio(old, new); xas_lock_irq(&xas); xas_store(&xas, new); old->mapping = NULL; /* hugetlb pages do not participate in page cache accounting. / if (!folio_test_hugetlb(old)) lruvec_stat_sub_folio(old, NR_FILE_PAGES); if (!folio_test_hugetlb(new)) lruvec_stat_add_folio(new, NR_FILE_PAGES); if (folio_test_swapbacked(old)) lruvec_stat_sub_folio(old, NR_SHMEM); if (folio_test_swapbacked(new)) lruvec_stat_add_folio(new, NR_SHMEM); xas_unlock_irq(&xas); if (free_folio) free_folio(old); folio_put(old); } EXPORT_SYMBOL_GPL(replace_page_cache_folio); noinline int __filemap_add_folio(struct address_space mapping, struct folio folio, pgoff_t index, gfp_t gfp, void shadowp) { XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio)); bool huge; long nr; unsigned int forder = folio_order(folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio); VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping), folio); mapping_set_update(&xas, mapping); VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); huge = folio_test_hugetlb(folio); nr = folio_nr_pages(folio); gfp &= GFP_RECLAIM_MASK; folio_ref_add(folio, nr); folio->mapping = mapping; folio->index = xas.xa_index; for (;;) { int order = -1; void entry, old = NULL; xas_lock_irq(&xas); xas_for_each_conflict(&xas, entry) { old = entry; if (!xa_is_value(entry)) { xas_set_err(&xas, -EEXIST); goto unlock; } / * If a larger entry exists, * it will be the first and only entry iterated. / if (order == -1) order = xas_get_order(&xas); } if (old) { if (order > 0 && order > forder) { unsigned int split_order = max(forder, xas_try_split_min_order(order)); / How to handle large swap entries? / BUG_ON(shmem_mapping(mapping)); while (order > forder) { xas_set_order(&xas, index, split_order); xas_try_split(&xas, old, order); if (xas_error(&xas)) goto unlock; order = split_order; split_order = max(xas_try_split_min_order( split_order), forder); } xas_reset(&xas); } if (shadowp) shadowp = old; } xas_store(&xas, folio); if (xas_error(&xas)) goto unlock; mapping->nrpages += nr; /* hugetlb pages do not participate in page cache accounting / if (!huge) { lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr); if (folio_test_pmd_mappable(folio)) lruvec_stat_mod_folio(folio, NR_FILE_THPS, nr); } unlock: xas_unlock_irq(&xas); if (!xas_nomem(&xas, gfp)) break; } if (xas_error(&xas)) goto error; trace_mm_filemap_add_to_page_cache(folio); return 0; error: folio->mapping = NULL; / Leave folio->index set: truncation relies upon it / folio_put_refs(folio, nr); return xas_error(&xas); } ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO); int filemap_add_folio(struct address_space mapping, struct folio folio, pgoff_t index, gfp_t gfp) { void shadow = NULL; int ret; struct mem_cgroup tmp; bool kernel_file = test_bit(AS_KERNEL_FILE, &mapping->flags); if (kernel_file) tmp = set_active_memcg(root_mem_cgroup); ret = mem_cgroup_charge(folio, NULL, gfp); if (kernel_file) set_active_memcg(tmp); if (ret) return ret; __folio_set_locked(folio); ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow); if (unlikely(ret)) { mem_cgroup_uncharge(folio); __folio_clear_locked(folio); } else { / * The folio might have been evicted from cache only * recently, in which case it should be activated like * any other repeatedly accessed folio. * The exception is folios getting rewritten; evicting other * data from the working set, only to cache data that will * get overwritten with something else, is a waste of memory. / WARN_ON_ONCE(folio_test_active(folio)); if (!(gfp & __GFP_WRITE) && shadow) workingset_refault(folio, shadow); folio_add_lru(folio); if (kernel_file) mod_node_page_state(folio_pgdat(folio), NR_KERNEL_FILE_PAGES, folio_nr_pages(folio)); } return ret; } EXPORT_SYMBOL_GPL(filemap_add_folio); #ifdef CONFIG_NUMA struct folio filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order, struct mempolicy policy) { int n; struct folio folio; if (policy) return folio_alloc_mpol_noprof(gfp, order, policy, NO_INTERLEAVE_INDEX, numa_node_id()); if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { cpuset_mems_cookie = read_mems_allowed_begin(); n = cpuset_mem_spread_node(); folio = __folio_alloc_node_noprof(gfp, order, n); } while (!folio && read_mems_allowed_retry(cpuset_mems_cookie)); return folio; } return folio_alloc_noprof(gfp, order); } EXPORT_SYMBOL(filemap_alloc_folio_noprof); #endif /* * filemap_invalidate_lock_two - lock invalidate_lock for two mappings * * Lock exclusively invalidate_lock of any passed mapping that is not NULL. * * @mapping1: the first mapping to lock * @mapping2: the second mapping to lock / void filemap_invalidate_lock_two(struct address_space mapping1, struct address_space mapping2) { if (mapping1 > mapping2) swap(mapping1, mapping2); if (mapping1) down_write(&mapping1->invalidate_lock); if (mapping2 && mapping1 != mapping2) down_write_nested(&mapping2->invalidate_lock, 1); } EXPORT_SYMBOL(filemap_invalidate_lock_two); / * filemap_invalidate_unlock_two - unlock invalidate_lock for two mappings * * Unlock exclusive invalidate_lock of any passed mapping that is not NULL. * * @mapping1: the first mapping to unlock * @mapping2: the second mapping to unlock / void filemap_invalidate_unlock_two(struct address_space mapping1, struct address_space mapping2) { if (mapping1) up_write(&mapping1->invalidate_lock); if (mapping2 && mapping1 != mapping2) up_write(&mapping2->invalidate_lock); } EXPORT_SYMBOL(filemap_invalidate_unlock_two); / * In order to wait for pages to become available there must be * waitqueues associated with pages. By using a hash table of * waitqueues where the bucket discipline is to maintain all * waiters on the same queue and wake all when any of the pages * become available, and for the woken contexts to check to be * sure the appropriate page became available, this saves space * at a cost of "thundering herd" phenomena during rare hash * collisions. / #define PAGE_WAIT_TABLE_BITS 8 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS) static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned; static wait_queue_head_t folio_waitqueue(struct folio folio) { return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)]; } / How many times do we accept lock stealing from under a waiter? / static int sysctl_page_lock_unfairness = 5; static const struct ctl_table filemap_sysctl_table[] = { { .procname = "page_lock_unfairness", .data = &sysctl_page_lock_unfairness, .maxlen = sizeof(sysctl_page_lock_unfairness), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, } }; void __init pagecache_init(void) { int i; for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++) init_waitqueue_head(&folio_wait_table[i]); page_writeback_init(); register_sysctl_init("vm", filemap_sysctl_table); } / * The page wait code treats the "wait->flags" somewhat unusually, because * we have multiple different kinds of waits, not just the usual "exclusive" * one. * * We have: * * (a) no special bits set: * * We're just waiting for the bit to be released, and when a waker * calls the wakeup function, we set WQ_FLAG_WOKEN and wake it up, * and remove it from the wait queue. * * Simple and straightforward. * * (b) WQ_FLAG_EXCLUSIVE: * * The waiter is waiting to get the lock, and only one waiter should * be woken up to avoid any thundering herd behavior. We'll set the * WQ_FLAG_WOKEN bit, wake it up, and remove it from the wait queue. * * This is the traditional exclusive wait. * * (c) WQ_FLAG_EXCLUSIVE \| WQ_FLAG_CUSTOM: * * The waiter is waiting to get the bit, and additionally wants the * lock to be transferred to it for fair lock behavior. If the lock * cannot be taken, we stop walking the wait queue without waking * the waiter. * * This is the "fair lock handoff" case, and in addition to setting * WQ_FLAG_WOKEN, we set WQ_FLAG_DONE to let the waiter easily see * that it now has the lock. / static int wake_page_function(wait_queue_entry_t wait, unsigned mode, int sync, void arg) { unsigned int flags; struct wait_page_key key = arg; struct wait_page_queue wait_page = container_of(wait, struct wait_page_queue, wait); if (!wake_page_match(wait_page, key)) return 0; / * If it's a lock handoff wait, we get the bit for it, and * stop walking (and do not wake it up) if we can't. / flags = wait->flags; if (flags & WQ_FLAG_EXCLUSIVE) { if (test_bit(key->bit_nr, &key->folio->flags.f)) return -1; if (flags & WQ_FLAG_CUSTOM) { if (test_and_set_bit(key->bit_nr, &key->folio->flags.f)) return -1; flags \|= WQ_FLAG_DONE; } } / * We are holding the wait-queue lock, but the waiter that * is waiting for this will be checking the flags without * any locking. * * So update the flags atomically, and wake up the waiter * afterwards to avoid any races. This store-release pairs * with the load-acquire in folio_wait_bit_common(). / smp_store_release(&wait->flags, flags \| WQ_FLAG_WOKEN); wake_up_state(wait->private, mode); / * Ok, we have successfully done what we're waiting for, * and we can unconditionally remove the wait entry. * * Note that this pairs with the "finish_wait()" in the * waiter, and has to be the absolute last thing we do. * After this list_del_init(&wait->entry) the wait entry * might be de-allocated and the process might even have * exited. / list_del_init_careful(&wait->entry); return (flags & WQ_FLAG_EXCLUSIVE) != 0; } static void folio_wake_bit(struct folio folio, int bit_nr) { wait_queue_head_t q = folio_waitqueue(folio); struct wait_page_key key; unsigned long flags; key.folio = folio; key.bit_nr = bit_nr; key.page_match = 0; spin_lock_irqsave(&q->lock, flags); __wake_up_locked_key(q, TASK_NORMAL, &key); / * It's possible to miss clearing waiters here, when we woke our page * waiters, but the hashed waitqueue has waiters for other pages on it. * That's okay, it's a rare case. The next waker will clear it. * * Note that, depending on the page pool (buddy, hugetlb, ZONE_DEVICE, * other), the flag may be cleared in the course of freeing the page; * but that is not required for correctness. / if (!waitqueue_active(q) \|\| !key.page_match) folio_clear_waiters(folio); spin_unlock_irqrestore(&q->lock, flags); } / * A choice of three behaviors for folio_wait_bit_common(): / enum behavior { EXCLUSIVE, / Hold ref to page and take the bit when woken, like * __folio_lock() waiting on then setting PG_locked. / SHARED, / Hold ref to page and check the bit when woken, like * folio_wait_writeback() waiting on PG_writeback. / DROP, / Drop ref to page before wait, no check when woken, * like folio_put_wait_locked() on PG_locked. / }; / * Attempt to check (or get) the folio flag, and mark us done * if successful. / static inline bool folio_trylock_flag(struct folio folio, int bit_nr, struct wait_queue_entry wait) { if (wait->flags & WQ_FLAG_EXCLUSIVE) { if (test_and_set_bit(bit_nr, &folio->flags.f)) return false; } else if (test_bit(bit_nr, &folio->flags.f)) return false; wait->flags \|= WQ_FLAG_WOKEN \| WQ_FLAG_DONE; return true; } static inline int folio_wait_bit_common(struct folio folio, int bit_nr, int state, enum behavior behavior) { wait_queue_head_t q = folio_waitqueue(folio); int unfairness = sysctl_page_lock_unfairness; struct wait_page_queue wait_page; wait_queue_entry_t wait = &wait_page.wait; bool thrashing = false; unsigned long pflags; bool in_thrashing; if (bit_nr == PG_locked && !folio_test_uptodate(folio) && folio_test_workingset(folio)) { delayacct_thrashing_start(&in_thrashing); psi_memstall_enter(&pflags); thrashing = true; } init_wait(wait); wait->func = wake_page_function; wait_page.folio = folio; wait_page.bit_nr = bit_nr; repeat: wait->flags = 0; if (behavior == EXCLUSIVE) { wait->flags = WQ_FLAG_EXCLUSIVE; if (--unfairness < 0) wait->flags \|= WQ_FLAG_CUSTOM; } /* * Do one last check whether we can get the * page bit synchronously. * * Do the folio_set_waiters() marking before that * to let any waker we _just_ missed know they * need to wake us up (otherwise they'll never * even go to the slow case that looks at the * page queue), and add ourselves to the wait * queue if we need to sleep. * * This part needs to be done under the queue * lock to avoid races. / spin_lock_irq(&q->lock); folio_set_waiters(folio); if (!folio_trylock_flag(folio, bit_nr, wait)) __add_wait_queue_entry_tail(q, wait); spin_unlock_irq(&q->lock); / * From now on, all the logic will be based on * the WQ_FLAG_WOKEN and WQ_FLAG_DONE flag, to * see whether the page bit testing has already * been done by the wake function. * * We can drop our reference to the folio. / if (behavior == DROP) folio_put(folio); / * Note that until the "finish_wait()", or until * we see the WQ_FLAG_WOKEN flag, we need to * be very careful with the 'wait->flags', because * we may race with a waker that sets them. / for (;;) { unsigned int flags; set_current_state(state); / Loop until we've been woken or interrupted / flags = smp_load_acquire(&wait->flags); if (!(flags & WQ_FLAG_WOKEN)) { if (signal_pending_state(state, current)) break; io_schedule(); continue; } / If we were non-exclusive, we're done / if (behavior != EXCLUSIVE) break; / If the waker got the lock for us, we're done / if (flags & WQ_FLAG_DONE) break; / * Otherwise, if we're getting the lock, we need to * try to get it ourselves. * * And if that fails, we'll have to retry this all. / if (unlikely(test_and_set_bit(bit_nr, folio_flags(folio, 0)))) goto repeat; wait->flags \|= WQ_FLAG_DONE; break; } / * If a signal happened, this 'finish_wait()' may remove the last * waiter from the wait-queues, but the folio waiters bit will remain * set. That's ok. The next wakeup will take care of it, and trying * to do it here would be difficult and prone to races. / finish_wait(q, wait); if (thrashing) { delayacct_thrashing_end(&in_thrashing); psi_memstall_leave(&pflags); } / * NOTE! The wait->flags weren't stable until we've done the * 'finish_wait()', and we could have exited the loop above due * to a signal, and had a wakeup event happen after the signal * test but before the 'finish_wait()'. * * So only after the finish_wait() can we reliably determine * if we got woken up or not, so we can now figure out the final * return value based on that state without races. * * Also note that WQ_FLAG_WOKEN is sufficient for a non-exclusive * waiter, but an exclusive one requires WQ_FLAG_DONE. / if (behavior == EXCLUSIVE) return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR; return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR; } #ifdef CONFIG_MIGRATION /* * migration_entry_wait_on_locked - Wait for a migration entry to be removed * @entry: migration swap entry. * @ptl: already locked ptl. This function will drop the lock. * * Wait for a migration entry referencing the given page to be removed. This is * equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE) except * this can be called without taking a reference on the page. Instead this * should be called while holding the ptl for the migration entry referencing * the page. * * Returns after unlocking the ptl. * * This follows the same logic as folio_wait_bit_common() so see the comments * there. / void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t ptl) __releases(ptl) { struct wait_page_queue wait_page; wait_queue_entry_t wait = &wait_page.wait; bool thrashing = false; unsigned long pflags; bool in_thrashing; wait_queue_head_t q; struct folio folio = softleaf_to_folio(entry); q = folio_waitqueue(folio); if (!folio_test_uptodate(folio) && folio_test_workingset(folio)) { delayacct_thrashing_start(&in_thrashing); psi_memstall_enter(&pflags); thrashing = true; } init_wait(wait); wait->func = wake_page_function; wait_page.folio = folio; wait_page.bit_nr = PG_locked; wait->flags = 0; spin_lock_irq(&q->lock); folio_set_waiters(folio); if (!folio_trylock_flag(folio, PG_locked, wait)) __add_wait_queue_entry_tail(q, wait); spin_unlock_irq(&q->lock); / * If a migration entry exists for the page the migration path must hold * a valid reference to the page, and it must take the ptl to remove the * migration entry. So the page is valid until the ptl is dropped. / spin_unlock(ptl); for (;;) { unsigned int flags; set_current_state(TASK_UNINTERRUPTIBLE); / Loop until we've been woken or interrupted / flags = smp_load_acquire(&wait->flags); if (!(flags & WQ_FLAG_WOKEN)) { if (signal_pending_state(TASK_UNINTERRUPTIBLE, current)) break; io_schedule(); continue; } break; } finish_wait(q, wait); if (thrashing) { delayacct_thrashing_end(&in_thrashing); psi_memstall_leave(&pflags); } } #endif void folio_wait_bit(struct folio folio, int bit_nr) { folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED); } EXPORT_SYMBOL(folio_wait_bit); int folio_wait_bit_killable(struct folio folio, int bit_nr) { return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED); } EXPORT_SYMBOL(folio_wait_bit_killable); /* * folio_put_wait_locked - Drop a reference and wait for it to be unlocked * @folio: The folio to wait for. * @state: The sleep state (TASK_KILLABLE, TASK_UNINTERRUPTIBLE, etc). * * The caller should hold a reference on @folio. They expect the page to * become unlocked relatively soon, but do not wish to hold up migration * (for example) by holding the reference while waiting for the folio to * come unlocked. After this function returns, the caller should not * dereference @folio. * * Return: 0 if the folio was unlocked or -EINTR if interrupted by a signal. / static int folio_put_wait_locked(struct folio folio, int state) { return folio_wait_bit_common(folio, PG_locked, state, DROP); } /** * folio_unlock - Unlock a locked folio. * @folio: The folio. * * Unlocks the folio and wakes up any thread sleeping on the page lock. * * Context: May be called from interrupt or process context. May not be * called from NMI context. / void folio_unlock(struct folio folio) { /* Bit 7 allows x86 to check the byte's sign bit / BUILD_BUG_ON(PG_waiters != 7); BUILD_BUG_ON(PG_locked > 7); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); if (folio_xor_flags_has_waiters(folio, 1 << PG_locked)) folio_wake_bit(folio, PG_locked); } EXPORT_SYMBOL(folio_unlock); /* * folio_end_read - End read on a folio. * @folio: The folio. * @success: True if all reads completed successfully. * * When all reads against a folio have completed, filesystems should * call this function to let the pagecache know that no more reads * are outstanding. This will unlock the folio and wake up any thread * sleeping on the lock. The folio will also be marked uptodate if all * reads succeeded. * * Context: May be called from interrupt or process context. May not be * called from NMI context. / void folio_end_read(struct folio folio, bool success) { unsigned long mask = 1 << PG_locked; /* Must be in bottom byte for x86 to work / BUILD_BUG_ON(PG_uptodate > 7); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(success && folio_test_uptodate(folio), folio); if (likely(success)) mask \|= 1 << PG_uptodate; if (folio_xor_flags_has_waiters(folio, mask)) folio_wake_bit(folio, PG_locked); } EXPORT_SYMBOL(folio_end_read); /* * folio_end_private_2 - Clear PG_private_2 and wake any waiters. * @folio: The folio. * * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for * it. The folio reference held for PG_private_2 being set is released. * * This is, for example, used when a netfs folio is being written to a local * disk cache, thereby allowing writes to the cache for the same folio to be * serialised. / void folio_end_private_2(struct folio folio) { VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); folio_wake_bit(folio, PG_private_2); folio_put(folio); } EXPORT_SYMBOL(folio_end_private_2); /** * folio_wait_private_2 - Wait for PG_private_2 to be cleared on a folio. * @folio: The folio to wait on. * * Wait for PG_private_2 to be cleared on a folio. / void folio_wait_private_2(struct folio folio) { while (folio_test_private_2(folio)) folio_wait_bit(folio, PG_private_2); } EXPORT_SYMBOL(folio_wait_private_2); /** * folio_wait_private_2_killable - Wait for PG_private_2 to be cleared on a folio. * @folio: The folio to wait on. * * Wait for PG_private_2 to be cleared on a folio or until a fatal signal is * received by the calling task. * * Return: * - 0 if successful. * - -EINTR if a fatal signal was encountered. / int folio_wait_private_2_killable(struct folio folio) { int ret = 0; while (folio_test_private_2(folio)) { ret = folio_wait_bit_killable(folio, PG_private_2); if (ret < 0) break; } return ret; } EXPORT_SYMBOL(folio_wait_private_2_killable); static void filemap_end_dropbehind(struct folio folio) { struct address_space mapping = folio->mapping; VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); if (folio_test_writeback(folio) \|\| folio_test_dirty(folio)) return; if (!folio_test_clear_dropbehind(folio)) return; if (mapping) folio_unmap_invalidate(mapping, folio, 0); } /* * If folio was marked as dropbehind, then pages should be dropped when writeback * completes. Do that now. If we fail, it's likely because of a big folio - * just reset dropbehind for that case and latter completions should invalidate. / void folio_end_dropbehind(struct folio folio) { if (!folio_test_dropbehind(folio)) return; /* * Hitting !in_task() should not happen off RWF_DONTCACHE writeback, * but can happen if normal writeback just happens to find dirty folios * that were created as part of uncached writeback, and that writeback * would otherwise not need non-IRQ handling. Just skip the * invalidation in that case. / if (in_task() && folio_trylock(folio)) { filemap_end_dropbehind(folio); folio_unlock(folio); } } EXPORT_SYMBOL_GPL(folio_end_dropbehind); /* * folio_end_writeback_no_dropbehind - End writeback against a folio. * @folio: The folio. * * The folio must actually be under writeback. * This call is intended for filesystems that need to defer dropbehind. * * Context: May be called from process or interrupt context. / void folio_end_writeback_no_dropbehind(struct folio folio) { VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio); /* * folio_test_clear_reclaim() could be used here but it is an * atomic operation and overkill in this particular case. Failing * to shuffle a folio marked for immediate reclaim is too mild * a gain to justify taking an atomic operation penalty at the * end of every folio writeback. / if (folio_test_reclaim(folio)) { folio_clear_reclaim(folio); folio_rotate_reclaimable(folio); } if (__folio_end_writeback(folio)) folio_wake_bit(folio, PG_writeback); acct_reclaim_writeback(folio); } EXPORT_SYMBOL_GPL(folio_end_writeback_no_dropbehind); /* * folio_end_writeback - End writeback against a folio. * @folio: The folio. * * The folio must actually be under writeback. * * Context: May be called from process or interrupt context. / void folio_end_writeback(struct folio folio) { VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio); /* * Writeback does not hold a folio reference of its own, relying * on truncation to wait for the clearing of PG_writeback. * But here we must make sure that the folio is not freed and * reused before the folio_wake_bit(). / folio_get(folio); folio_end_writeback_no_dropbehind(folio); folio_end_dropbehind(folio); folio_put(folio); } EXPORT_SYMBOL(folio_end_writeback); /* * __folio_lock - Get a lock on the folio, assuming we need to sleep to get it. * @folio: The folio to lock / void __folio_lock(struct folio folio) { folio_wait_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE, EXCLUSIVE); } EXPORT_SYMBOL(__folio_lock); int __folio_lock_killable(struct folio folio) { return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE, EXCLUSIVE); } EXPORT_SYMBOL_GPL(__folio_lock_killable); static int __folio_lock_async(struct folio folio, struct wait_page_queue wait) { struct wait_queue_head q = folio_waitqueue(folio); int ret; wait->folio = folio; wait->bit_nr = PG_locked; spin_lock_irq(&q->lock); __add_wait_queue_entry_tail(q, &wait->wait); folio_set_waiters(folio); ret = !folio_trylock(folio); /* * If we were successful now, we know we're still on the * waitqueue as we're still under the lock. This means it's * safe to remove and return success, we know the callback * isn't going to trigger. / if (!ret) __remove_wait_queue(q, &wait->wait); else ret = -EIOCBQUEUED; spin_unlock_irq(&q->lock); return ret; } / * Return values: * 0 - folio is locked. * non-zero - folio is not locked. * mmap_lock or per-VMA lock has been released (mmap_read_unlock() or * vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and * FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held. * * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0 * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed. / vm_fault_t __folio_lock_or_retry(struct folio folio, struct vm_fault vmf) { unsigned int flags = vmf->flags; if (fault_flag_allow_retry_first(flags)) { / * CAUTION! In this case, mmap_lock/per-VMA lock is not * released even though returning VM_FAULT_RETRY. / if (flags & FAULT_FLAG_RETRY_NOWAIT) return VM_FAULT_RETRY; release_fault_lock(vmf); if (flags & FAULT_FLAG_KILLABLE) folio_wait_locked_killable(folio); else folio_wait_locked(folio); return VM_FAULT_RETRY; } if (flags & FAULT_FLAG_KILLABLE) { bool ret; ret = __folio_lock_killable(folio); if (ret) { release_fault_lock(vmf); return VM_FAULT_RETRY; } } else { __folio_lock(folio); } return 0; } /* * page_cache_next_miss() - Find the next gap in the page cache. * @mapping: Mapping. * @index: Index. * @max_scan: Maximum range to search. * * Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the * gap with the lowest index. * * This function may be called under the rcu_read_lock. However, this will * not atomically search a snapshot of the cache at a single point in time. * For example, if a gap is created at index 5, then subsequently a gap is * created at index 10, page_cache_next_miss covering both indices may * return 10 if called under the rcu_read_lock. * * Return: The index of the gap if found, otherwise an index outside the * range specified (in which case 'return - index >= max_scan' will be true). * In the rare case of index wrap-around, 0 will be returned. / pgoff_t page_cache_next_miss(struct address_space mapping, pgoff_t index, unsigned long max_scan) { XA_STATE(xas, &mapping->i_pages, index); unsigned long nr = max_scan; while (nr--) { void entry = xas_next(&xas); if (!entry \|\| xa_is_value(entry)) return xas.xa_index; if (xas.xa_index == 0) return 0; } return index + max_scan; } EXPORT_SYMBOL(page_cache_next_miss); /* * page_cache_prev_miss() - Find the previous gap in the page cache. * @mapping: Mapping. * @index: Index. * @max_scan: Maximum range to search. * * Search the range [max(index - max_scan + 1, 0), index] for the * gap with the highest index. * * This function may be called under the rcu_read_lock. However, this will * not atomically search a snapshot of the cache at a single point in time. * For example, if a gap is created at index 10, then subsequently a gap is * created at index 5, page_cache_prev_miss() covering both indices may * return 5 if called under the rcu_read_lock. * * Return: The index of the gap if found, otherwise an index outside the * range specified (in which case 'index - return >= max_scan' will be true). * In the rare case of wrap-around, ULONG_MAX will be returned. / pgoff_t page_cache_prev_miss(struct address_space mapping, pgoff_t index, unsigned long max_scan) { XA_STATE(xas, &mapping->i_pages, index); while (max_scan--) { void entry = xas_prev(&xas); if (!entry \|\| xa_is_value(entry)) break; if (xas.xa_index == ULONG_MAX) break; } return xas.xa_index; } EXPORT_SYMBOL(page_cache_prev_miss); / * Lockless page cache protocol: * On the lookup side: * 1. Load the folio from i_pages * 2. Increment the refcount if it's not zero * 3. If the folio is not found by xas_reload(), put the refcount and retry * * On the removal side: * A. Freeze the page (by zeroing the refcount if nobody else has a reference) * B. Remove the page from i_pages * C. Return the page to the page allocator * * This means that any page may have its reference count temporarily * increased by a speculative page cache (or GUP-fast) lookup as it can * be allocated by another user before the RCU grace period expires. * Because the refcount temporarily acquired here may end up being the * last refcount on the page, any page allocation must be freeable by * folio_put(). / / * filemap_get_entry - Get a page cache entry. * @mapping: the address_space to search * @index: The page cache index. * * Looks up the page cache entry at @mapping & @index. If it is a folio, * it is returned with an increased refcount. If it is a shadow entry * of a previously evicted folio, or a swap entry from shmem/tmpfs, * it is returned without further action. * * Return: The folio, swap or shadow entry, %NULL if nothing is found. / void filemap_get_entry(struct address_space mapping, pgoff_t index) { XA_STATE(xas, &mapping->i_pages, index); struct folio folio; rcu_read_lock(); repeat: xas_reset(&xas); folio = xas_load(&xas); if (xas_retry(&xas, folio)) goto repeat; /* * A shadow entry of a recently evicted page, or a swap entry from * shmem/tmpfs. Return it without attempting to raise page count. / if (!folio \|\| xa_is_value(folio)) goto out; if (!folio_try_get(folio)) goto repeat; if (unlikely(folio != xas_reload(&xas))) { folio_put(folio); goto repeat; } out: rcu_read_unlock(); return folio; } /* * __filemap_get_folio_mpol - Find and get a reference to a folio. * @mapping: The address_space to search. * @index: The page index. * @fgp_flags: %FGP flags modify how the folio is returned. * @gfp: Memory allocation flags to use if %FGP_CREAT is specified. * @policy: NUMA memory allocation policy to follow. * * Looks up the page cache entry at @mapping & @index. * * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even * if the %GFP flags specified for %FGP_CREAT are atomic. * * If this function returns a folio, it is returned with an increased refcount. * * Return: The found folio or an ERR_PTR() otherwise. / struct folio __filemap_get_folio_mpol(struct address_space mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy policy) { struct folio folio; repeat: folio = filemap_get_entry(mapping, index); if (xa_is_value(folio)) folio = NULL; if (!folio) goto no_page; if (fgp_flags & FGP_LOCK) { if (fgp_flags & FGP_NOWAIT) { if (!folio_trylock(folio)) { folio_put(folio); return ERR_PTR(-EAGAIN); } } else { folio_lock(folio); } / Has the page been truncated? / if (unlikely(folio->mapping != mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; } VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio); } if (fgp_flags & FGP_ACCESSED) folio_mark_accessed(folio); else if (fgp_flags & FGP_WRITE) { / Clear idle flag for buffer write / if (folio_test_idle(folio)) folio_clear_idle(folio); } if (fgp_flags & FGP_STABLE) folio_wait_stable(folio); no_page: if (!folio && (fgp_flags & FGP_CREAT)) { unsigned int min_order = mapping_min_folio_order(mapping); unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags)); int err; index = mapping_align_index(mapping, index); if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping)) gfp \|= __GFP_WRITE; if (fgp_flags & FGP_NOFS) gfp &= ~__GFP_FS; if (fgp_flags & FGP_NOWAIT) { gfp &= ~GFP_KERNEL; gfp \|= GFP_NOWAIT; } if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK \| FGP_FOR_MMAP)))) fgp_flags \|= FGP_LOCK; if (order > mapping_max_folio_order(mapping)) order = mapping_max_folio_order(mapping); / If we're not aligned, allocate a smaller folio / if (index & ((1UL << order) - 1)) order = __ffs(index); do { gfp_t alloc_gfp = gfp; err = -ENOMEM; if (order > min_order) alloc_gfp \|= __GFP_NORETRY \| __GFP_NOWARN; folio = filemap_alloc_folio(alloc_gfp, order, policy); if (!folio) continue; / Init accessed so avoid atomic mark_page_accessed later / if (fgp_flags & FGP_ACCESSED) __folio_set_referenced(folio); if (fgp_flags & FGP_DONTCACHE) __folio_set_dropbehind(folio); err = filemap_add_folio(mapping, folio, index, gfp); if (!err) break; folio_put(folio); folio = NULL; } while (order-- > min_order); if (err == -EEXIST) goto repeat; if (err) { / * When NOWAIT I/O fails to allocate folios this could * be due to a nonblocking memory allocation and not * because the system actually is out of memory. * Return -EAGAIN so that there caller retries in a * blocking fashion instead of propagating -ENOMEM * to the application. / if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM) err = -EAGAIN; return ERR_PTR(err); } / * filemap_add_folio locks the page, and for mmap * we expect an unlocked page. / if (folio && (fgp_flags & FGP_FOR_MMAP)) folio_unlock(folio); } if (!folio) return ERR_PTR(-ENOENT); / not an uncached lookup, clear uncached if set / if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE)) folio_clear_dropbehind(folio); return folio; } EXPORT_SYMBOL(__filemap_get_folio_mpol); static inline struct folio find_get_entry(struct xa_state xas, pgoff_t max, xa_mark_t mark) { struct folio folio; retry: if (mark == XA_PRESENT) folio = xas_find(xas, max); else folio = xas_find_marked(xas, max, mark); if (xas_retry(xas, folio)) goto retry; /* * A shadow entry of a recently evicted page, a swap * entry from shmem/tmpfs or a DAX entry. Return it * without attempting to raise page count. / if (!folio \|\| xa_is_value(folio)) return folio; if (!folio_try_get(folio)) goto reset; if (unlikely(folio != xas_reload(xas))) { folio_put(folio); goto reset; } return folio; reset: xas_reset(xas); goto retry; } /* * find_get_entries - gang pagecache lookup * @mapping: The address_space to search * @start: The starting page cache index * @end: The final page index (inclusive). * @fbatch: Where the resulting entries are placed. * @indices: The cache indices corresponding to the entries in @entries * * find_get_entries() will search for and return a batch of entries in * the mapping. The entries are placed in @fbatch. find_get_entries() * takes a reference on any actual folios it returns. * * The entries have ascending indexes. The indices may not be consecutive * due to not-present entries or large folios. * * Any shadow entries of evicted folios, or swap entries from * shmem/tmpfs, are included in the returned array. * * Return: The number of entries which were found. / unsigned find_get_entries(struct address_space mapping, pgoff_t start, pgoff_t end, struct folio_batch fbatch, pgoff_t indices) { XA_STATE(xas, &mapping->i_pages, start); struct folio folio; rcu_read_lock(); while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) { indices[fbatch->nr] = xas.xa_index; if (!folio_batch_add(fbatch, folio)) break; } if (folio_batch_count(fbatch)) { unsigned long nr; int idx = folio_batch_count(fbatch) - 1; folio = fbatch->folios[idx]; if (!xa_is_value(folio)) nr = folio_nr_pages(folio); else nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]); start = round_down(indices[idx] + nr, nr); } rcu_read_unlock(); return folio_batch_count(fbatch); } /** * find_lock_entries - Find a batch of pagecache entries. * @mapping: The address_space to search. * @start: The starting page cache index. * @end: The final page index (inclusive). * @fbatch: Where the resulting entries are placed. * @indices: The cache indices of the entries in @fbatch. * * find_lock_entries() will return a batch of entries from @mapping. * Swap, shadow and DAX entries are included. Folios are returned * locked and with an incremented refcount. Folios which are locked * by somebody else or under writeback are skipped. Folios which are * partially outside the range are not returned. * * The entries have ascending indexes. The indices may not be consecutive * due to not-present entries, large folios, folios which could not be * locked or folios under writeback. * * Return: The number of entries which were found. / unsigned find_lock_entries(struct address_space mapping, pgoff_t start, pgoff_t end, struct folio_batch fbatch, pgoff_t indices) { XA_STATE(xas, &mapping->i_pages, start); struct folio folio; rcu_read_lock(); while ((folio = find_get_entry(&xas, end, XA_PRESENT))) { unsigned long base; unsigned long nr; if (!xa_is_value(folio)) { nr = folio_nr_pages(folio); base = folio->index; / Omit large folio which begins before the start / if (base < start) goto put; /* Omit large folio which extends beyond the end / if (base + nr - 1 > end) goto put; if (!folio_trylock(folio)) goto put; if (folio->mapping != mapping \|\| folio_test_writeback(folio)) goto unlock; VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index), folio); } else { nr = 1 << xas_get_order(&xas); base = xas.xa_index & ~(nr - 1); / Omit order>0 value which begins before the start / if (base < start) continue; /* Omit order>0 value which extends beyond the end / if (base + nr - 1 > end) break; } / Update start now so that last update is correct on return / start = base + nr; indices[fbatch->nr] = xas.xa_index; if (!folio_batch_add(fbatch, folio)) break; continue; unlock: folio_unlock(folio); put: folio_put(folio); } rcu_read_unlock(); return folio_batch_count(fbatch); } /** * filemap_get_folios - Get a batch of folios * @mapping: The address_space to search * @start: The starting page index * @end: The final page index (inclusive) * @fbatch: The batch to fill. * * Search for and return a batch of folios in the mapping starting at * index @start and up to index @end (inclusive). The folios are returned * in @fbatch with an elevated reference count. * * Return: The number of folios which were found. * We also update @start to index the next folio for the traversal. / unsigned filemap_get_folios(struct address_space mapping, pgoff_t start, pgoff_t end, struct folio_batch fbatch) { return filemap_get_folios_tag(mapping, start, end, XA_PRESENT, fbatch); } EXPORT_SYMBOL(filemap_get_folios); /** * filemap_get_folios_contig - Get a batch of contiguous folios * @mapping: The address_space to search * @start: The starting page index * @end: The final page index (inclusive) * @fbatch: The batch to fill * * filemap_get_folios_contig() works exactly like filemap_get_folios(), * except the returned folios are guaranteed to be contiguous. This may * not return all contiguous folios if the batch gets filled up. * * Return: The number of folios found. * Also update @start to be positioned for traversal of the next folio. / unsigned filemap_get_folios_contig(struct address_space mapping, pgoff_t start, pgoff_t end, struct folio_batch fbatch) { XA_STATE(xas, &mapping->i_pages, start); unsigned long nr; struct folio folio; rcu_read_lock(); for (folio = xas_load(&xas); folio && xas.xa_index <= end; folio = xas_next(&xas)) { if (xas_retry(&xas, folio)) continue; /* * If the entry has been swapped out, we can stop looking. * No current caller is looking for DAX entries. / if (xa_is_value(folio)) goto update_start; / If we landed in the middle of a THP, continue at its end. / if (xa_is_sibling(folio)) goto update_start; if (!folio_try_get(folio)) goto retry; if (unlikely(folio != xas_reload(&xas))) goto put_folio; if (!folio_batch_add(fbatch, folio)) { nr = folio_nr_pages(folio); start = folio->index + nr; goto out; } xas_advance(&xas, folio_next_index(folio) - 1); continue; put_folio: folio_put(folio); retry: xas_reset(&xas); } update_start: nr = folio_batch_count(fbatch); if (nr) { folio = fbatch->folios[nr - 1]; start = folio_next_index(folio); } out: rcu_read_unlock(); return folio_batch_count(fbatch); } EXPORT_SYMBOL(filemap_get_folios_contig); /* * filemap_get_folios_tag - Get a batch of folios matching @tag * @mapping: The address_space to search * @start: The starting page index * @end: The final page index (inclusive) * @tag: The tag index * @fbatch: The batch to fill * * The first folio may start before @start; if it does, it will contain * @start. The final folio may extend beyond @end; if it does, it will * contain @end. The folios have ascending indices. There may be gaps * between the folios if there are indices which have no folio in the * page cache. If folios are added to or removed from the page cache * while this is running, they may or may not be found by this call. * Only returns folios that are tagged with @tag. * * Return: The number of folios found. * Also update @start to index the next folio for traversal. / unsigned filemap_get_folios_tag(struct address_space mapping, pgoff_t start, pgoff_t end, xa_mark_t tag, struct folio_batch fbatch) { XA_STATE(xas, &mapping->i_pages, start); struct folio folio; rcu_read_lock(); while ((folio = find_get_entry(&xas, end, tag)) != NULL) { /* * Shadow entries should never be tagged, but this iteration * is lockless so there is a window for page reclaim to evict * a page we saw tagged. Skip over it. / if (xa_is_value(folio)) continue; if (!folio_batch_add(fbatch, folio)) { unsigned long nr = folio_nr_pages(folio); start = folio->index + nr; goto out; } } /* * We come here when there is no page beyond @end. We take care to not * overflow the index @start as it confuses some of the callers. This * breaks the iteration when there is a page at index -1 but that is * already broke anyway. / if (end == (pgoff_t)-1) start = (pgoff_t)-1; else start = end + 1; out: rcu_read_unlock(); return folio_batch_count(fbatch); } EXPORT_SYMBOL(filemap_get_folios_tag); /* * filemap_get_folios_dirty - Get a batch of dirty folios * @mapping: The address_space to search * @start: The starting folio index * @end: The final folio index (inclusive) * @fbatch: The batch to fill * * filemap_get_folios_dirty() works exactly like filemap_get_folios(), except * the returned folios are presumed to be dirty or undergoing writeback. Dirty * state is presumed because we don't block on folio lock nor want to miss * folios. Callers that need to can recheck state upon locking the folio. * * This may not return all dirty folios if the batch gets filled up. * * Return: The number of folios found. * Also update @start to be positioned for traversal of the next folio. / unsigned filemap_get_folios_dirty(struct address_space mapping, pgoff_t start, pgoff_t end, struct folio_batch fbatch) { XA_STATE(xas, &mapping->i_pages, start); struct folio folio; rcu_read_lock(); while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) { if (xa_is_value(folio)) continue; if (folio_trylock(folio)) { bool clean = !folio_test_dirty(folio) && !folio_test_writeback(folio); folio_unlock(folio); if (clean) { folio_put(folio); continue; } } if (!folio_batch_add(fbatch, folio)) { unsigned long nr = folio_nr_pages(folio); start = folio->index + nr; goto out; } } / * We come here when there is no folio beyond @end. We take care to not * overflow the index @start as it confuses some of the callers. This * breaks the iteration when there is a folio at index -1 but that is * already broke anyway. / if (end == (pgoff_t)-1) start = (pgoff_t)-1; else start = end + 1; out: rcu_read_unlock(); return folio_batch_count(fbatch); } / * CD/DVDs are error prone. When a medium error occurs, the driver may fail * a _large_ part of the i/o request. Imagine the worst scenario: * * ---R__________________________________________B__________ * ^ reading here ^ bad block(assume 4k) * * read(R) => miss => readahead(R...B) => media error => frustrating retries * => failing the whole request => read(R) => read(R+1) => * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) => * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) => * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ...... * * It is going insane. Fix it by quickly scaling down the readahead size. / static void shrink_readahead_size_eio(struct file_ra_state ra) { ra->ra_pages /= 4; } /* * filemap_get_read_batch - Get a batch of folios for read * * Get a batch of folios which represent a contiguous range of bytes in * the file. No exceptional entries will be returned. If @index is in * the middle of a folio, the entire folio will be returned. The last * folio in the batch may have the readahead flag set or the uptodate flag * clear so that the caller can take the appropriate action. / static void filemap_get_read_batch(struct address_space mapping, pgoff_t index, pgoff_t max, struct folio_batch fbatch) { XA_STATE(xas, &mapping->i_pages, index); struct folio folio; rcu_read_lock(); for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) { if (xas_retry(&xas, folio)) continue; if (xas.xa_index > max \|\| xa_is_value(folio)) break; if (xa_is_sibling(folio)) break; if (!folio_try_get(folio)) goto retry; if (unlikely(folio != xas_reload(&xas))) goto put_folio; if (!folio_batch_add(fbatch, folio)) break; if (!folio_test_uptodate(folio)) break; if (folio_test_readahead(folio)) break; xas_advance(&xas, folio_next_index(folio) - 1); continue; put_folio: folio_put(folio); retry: xas_reset(&xas); } rcu_read_unlock(); } static int filemap_read_folio(struct file file, filler_t filler, struct folio folio) { bool workingset = folio_test_workingset(folio); unsigned long pflags; int error; /* Start the actual read. The read will unlock the page. / if (unlikely(workingset)) psi_memstall_enter(&pflags); error = filler(file, folio); if (unlikely(workingset)) psi_memstall_leave(&pflags); if (error) return error; error = folio_wait_locked_killable(folio); if (error) return error; if (folio_test_uptodate(folio)) return 0; if (file) shrink_readahead_size_eio(&file->f_ra); return -EIO; } static bool filemap_range_uptodate(struct address_space mapping, loff_t pos, size_t count, struct folio folio, bool need_uptodate) { if (folio_test_uptodate(folio)) return true; / pipes can't handle partially uptodate pages / if (need_uptodate) return false; if (!mapping->a_ops->is_partially_uptodate) return false; if (mapping->host->i_blkbits >= folio_shift(folio)) return false; if (folio_pos(folio) > pos) { count -= folio_pos(folio) - pos; pos = 0; } else { pos -= folio_pos(folio); } if (pos == 0 && count >= folio_size(folio)) return false; return mapping->a_ops->is_partially_uptodate(folio, pos, count); } static int filemap_update_page(struct kiocb iocb, struct address_space mapping, size_t count, struct folio folio, bool need_uptodate) { int error; if (iocb->ki_flags & IOCB_NOWAIT) { if (!filemap_invalidate_trylock_shared(mapping)) return -EAGAIN; } else { filemap_invalidate_lock_shared(mapping); } if (!folio_trylock(folio)) { error = -EAGAIN; if (iocb->ki_flags & (IOCB_NOWAIT \| IOCB_NOIO)) goto unlock_mapping; if (!(iocb->ki_flags & IOCB_WAITQ)) { filemap_invalidate_unlock_shared(mapping); /* * This is where we usually end up waiting for a * previously submitted readahead to finish. / folio_put_wait_locked(folio, TASK_KILLABLE); return AOP_TRUNCATED_PAGE; } error = __folio_lock_async(folio, iocb->ki_waitq); if (error) goto unlock_mapping; } error = AOP_TRUNCATED_PAGE; if (!folio->mapping) goto unlock; error = 0; if (filemap_range_uptodate(mapping, iocb->ki_pos, count, folio, need_uptodate)) goto unlock; error = -EAGAIN; if (iocb->ki_flags & (IOCB_NOIO \| IOCB_NOWAIT \| IOCB_WAITQ)) goto unlock; error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio, folio); goto unlock_mapping; unlock: folio_unlock(folio); unlock_mapping: filemap_invalidate_unlock_shared(mapping); if (error == AOP_TRUNCATED_PAGE) folio_put(folio); return error; } static int filemap_create_folio(struct kiocb iocb, struct folio_batch fbatch) { struct address_space mapping = iocb->ki_filp->f_mapping; struct folio folio; int error; unsigned int min_order = mapping_min_folio_order(mapping); pgoff_t index; if (iocb->ki_flags & (IOCB_NOWAIT \| IOCB_WAITQ)) return -EAGAIN; folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order, NULL); if (!folio) return -ENOMEM; if (iocb->ki_flags & IOCB_DONTCACHE) __folio_set_dropbehind(folio); / * Protect against truncate / hole punch. Grabbing invalidate_lock * here assures we cannot instantiate and bring uptodate new * pagecache folios after evicting page cache during truncate * and before actually freeing blocks. Note that we could * release invalidate_lock after inserting the folio into * the page cache as the locked folio would then be enough to * synchronize with hole punching. But there are code paths * such as filemap_update_page() filling in partially uptodate * pages or ->readahead() that need to hold invalidate_lock * while mapping blocks for IO so let's hold the lock here as * well to keep locking rules simple. / filemap_invalidate_lock_shared(mapping); index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order; error = filemap_add_folio(mapping, folio, index, mapping_gfp_constraint(mapping, GFP_KERNEL)); if (error == -EEXIST) error = AOP_TRUNCATED_PAGE; if (error) goto error; error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio, folio); if (error) goto error; filemap_invalidate_unlock_shared(mapping); folio_batch_add(fbatch, folio); return 0; error: filemap_invalidate_unlock_shared(mapping); folio_put(folio); return error; } static int filemap_readahead(struct kiocb iocb, struct file file, struct address_space mapping, struct folio folio, pgoff_t last_index) { DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, folio->index); if (iocb->ki_flags & IOCB_NOIO) return -EAGAIN; if (iocb->ki_flags & IOCB_DONTCACHE) ractl.dropbehind = 1; page_cache_async_ra(&ractl, folio, last_index - folio->index); return 0; } static int filemap_get_pages(struct kiocb iocb, size_t count, struct folio_batch fbatch, bool need_uptodate) { struct file filp = iocb->ki_filp; struct address_space mapping = filp->f_mapping; pgoff_t index = iocb->ki_pos >> PAGE_SHIFT; pgoff_t last_index; struct folio folio; unsigned int flags; int err = 0; /* "last_index" is the index of the folio beyond the end of the read / last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping)) >> PAGE_SHIFT; retry: if (fatal_signal_pending(current)) return -EINTR; filemap_get_read_batch(mapping, index, last_index - 1, fbatch); if (!folio_batch_count(fbatch)) { DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index); if (iocb->ki_flags & IOCB_NOIO) return -EAGAIN; if (iocb->ki_flags & IOCB_NOWAIT) flags = memalloc_noio_save(); if (iocb->ki_flags & IOCB_DONTCACHE) ractl.dropbehind = 1; page_cache_sync_ra(&ractl, last_index - index); if (iocb->ki_flags & IOCB_NOWAIT) memalloc_noio_restore(flags); filemap_get_read_batch(mapping, index, last_index - 1, fbatch); } if (!folio_batch_count(fbatch)) { err = filemap_create_folio(iocb, fbatch); if (err == AOP_TRUNCATED_PAGE) goto retry; return err; } folio = fbatch->folios[folio_batch_count(fbatch) - 1]; if (folio_test_readahead(folio)) { err = filemap_readahead(iocb, filp, mapping, folio, last_index); if (err) goto err; } if (!folio_test_uptodate(folio)) { if (folio_batch_count(fbatch) > 1) { err = -EAGAIN; goto err; } err = filemap_update_page(iocb, mapping, count, folio, need_uptodate); if (err) goto err; } trace_mm_filemap_get_pages(mapping, index, last_index - 1); return 0; err: if (err < 0) folio_put(folio); if (likely(--fbatch->nr)) return 0; if (err == AOP_TRUNCATED_PAGE) goto retry; return err; } static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio folio) { unsigned int shift = folio_shift(folio); return (pos1 >> shift == pos2 >> shift); } static void filemap_end_dropbehind_read(struct folio folio) { if (!folio_test_dropbehind(folio)) return; if (folio_test_writeback(folio) \|\| folio_test_dirty(folio)) return; if (folio_trylock(folio)) { filemap_end_dropbehind(folio); folio_unlock(folio); } } /* * filemap_read - Read data from the page cache. * @iocb: The iocb to read. * @iter: Destination for the data. * @already_read: Number of bytes already read by the caller. * * Copies data from the page cache. If the data is not currently present, * uses the readahead and read_folio address_space operations to fetch it. * * Return: Total number of bytes copied, including those already read by * the caller. If an error happens before any bytes are copied, returns * a negative error number. / ssize_t filemap_read(struct kiocb iocb, struct iov_iter iter, ssize_t already_read) { struct file filp = iocb->ki_filp; struct file_ra_state ra = &filp->f_ra; struct address_space mapping = filp->f_mapping; struct inode inode = mapping->host; struct folio_batch fbatch; int i, error = 0; bool writably_mapped; loff_t isize, end_offset; loff_t last_pos = ra->prev_pos; if (unlikely(iocb->ki_pos < 0)) return -EINVAL; if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes)) return 0; if (unlikely(!iov_iter_count(iter))) return 0; iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos); folio_batch_init(&fbatch); do { cond_resched(); / * If we've already successfully copied some data, then we * can no longer safely return -EIOCBQUEUED. Hence mark * an async read NOWAIT at that point. / if ((iocb->ki_flags & IOCB_WAITQ) && already_read) iocb->ki_flags \|= IOCB_NOWAIT; if (unlikely(iocb->ki_pos >= i_size_read(inode))) break; error = filemap_get_pages(iocb, iter->count, &fbatch, false); if (error < 0) break; / * i_size must be checked after we know the pages are Uptodate. * * Checking i_size after the check allows us to calculate * the correct value for "nr", which means the zero-filled * part of the page is not copied back to userspace (unless * another truncate extends the file - this is desired though). / isize = i_size_read(inode); if (unlikely(iocb->ki_pos >= isize)) goto put_folios; end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count); / * Once we start copying data, we don't want to be touching any * cachelines that might be contended: / writably_mapped = mapping_writably_mapped(mapping); / * When a read accesses the same folio several times, only * mark it as accessed the first time. / if (!pos_same_folio(iocb->ki_pos, last_pos - 1, fbatch.folios[0])) folio_mark_accessed(fbatch.folios[0]); for (i = 0; i < folio_batch_count(&fbatch); i++) { struct folio folio = fbatch.folios[i]; size_t fsize = folio_size(folio); size_t offset = iocb->ki_pos & (fsize - 1); size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos, fsize - offset); size_t copied; if (end_offset < folio_pos(folio)) break; if (i > 0) folio_mark_accessed(folio); /* * If users can be writing to this folio using arbitrary * virtual addresses, take care of potential aliasing * before reading the folio on the kernel side. / if (writably_mapped) flush_dcache_folio(folio); copied = copy_folio_to_iter(folio, offset, bytes, iter); already_read += copied; iocb->ki_pos += copied; last_pos = iocb->ki_pos; if (copied < bytes) { error = -EFAULT; break; } } put_folios: for (i = 0; i < folio_batch_count(&fbatch); i++) { struct folio folio = fbatch.folios[i]; filemap_end_dropbehind_read(folio); folio_put(folio); } folio_batch_init(&fbatch); } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error); file_accessed(filp); ra->prev_pos = last_pos; return already_read ? already_read : error; } EXPORT_SYMBOL_GPL(filemap_read); int kiocb_write_and_wait(struct kiocb iocb, size_t count) { struct address_space mapping = iocb->ki_filp->f_mapping; loff_t pos = iocb->ki_pos; loff_t end = pos + count - 1; if (iocb->ki_flags & IOCB_NOWAIT) { if (filemap_range_needs_writeback(mapping, pos, end)) return -EAGAIN; return 0; } return filemap_write_and_wait_range(mapping, pos, end); } EXPORT_SYMBOL_GPL(kiocb_write_and_wait); int filemap_invalidate_pages(struct address_space mapping, loff_t pos, loff_t end, bool nowait) { int ret; if (nowait) { / we could block if there are any pages in the range / if (filemap_range_has_page(mapping, pos, end)) return -EAGAIN; } else { ret = filemap_write_and_wait_range(mapping, pos, end); if (ret) return ret; } / * After a write we want buffered reads to be sure to go to disk to get * the new data. We invalidate clean cached page from the region we're * about to write. We do this before the write so that we can return * without clobbering -EIOCBQUEUED from ->direct_IO(). / return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end >> PAGE_SHIFT); } int kiocb_invalidate_pages(struct kiocb iocb, size_t count) { struct address_space mapping = iocb->ki_filp->f_mapping; return filemap_invalidate_pages(mapping, iocb->ki_pos, iocb->ki_pos + count - 1, iocb->ki_flags & IOCB_NOWAIT); } EXPORT_SYMBOL_GPL(kiocb_invalidate_pages); /* * generic_file_read_iter - generic filesystem read routine * @iocb: kernel I/O control block * @iter: destination for the data read * * This is the "read_iter()" routine for all filesystems * that can use the page cache directly. * * The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall * be returned when no data can be read without waiting for I/O requests * to complete; it doesn't prevent readahead. * * The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O * requests shall be made for the read or for readahead. When no data * can be read, -EAGAIN shall be returned. When readahead would be * triggered, a partial, possibly empty read shall be returned. * * Return: * * number of bytes copied, even for partial reads * * negative error code (or 0 if IOCB_NOIO) if nothing was read / ssize_t generic_file_read_iter(struct kiocb iocb, struct iov_iter iter) { size_t count = iov_iter_count(iter); ssize_t retval = 0; if (!count) return 0; / skip atime / if (iocb->ki_flags & IOCB_DIRECT) { struct file file = iocb->ki_filp; struct address_space mapping = file->f_mapping; struct inode inode = mapping->host; retval = kiocb_write_and_wait(iocb, count); if (retval < 0) return retval; file_accessed(file); retval = mapping->a_ops->direct_IO(iocb, iter); if (retval >= 0) { iocb->ki_pos += retval; count -= retval; } if (retval != -EIOCBQUEUED) iov_iter_revert(iter, count - iov_iter_count(iter)); /* * Btrfs can have a short DIO read if we encounter * compressed extents, so if there was an error, or if * we've already read everything we wanted to, or if * there was a short read because we hit EOF, go ahead * and return. Otherwise fallthrough to buffered io for * the rest of the read. Buffered reads will not work for * DAX files, so don't bother trying. / if (retval < 0 \|\| !count \|\| IS_DAX(inode)) return retval; if (iocb->ki_pos >= i_size_read(inode)) return retval; } return filemap_read(iocb, iter, retval); } EXPORT_SYMBOL(generic_file_read_iter); / * Splice subpages from a folio into a pipe. / size_t splice_folio_into_pipe(struct pipe_inode_info pipe, struct folio folio, loff_t fpos, size_t size) { struct page page; size_t spliced = 0, offset = offset_in_folio(folio, fpos); page = folio_page(folio, offset / PAGE_SIZE); size = min(size, folio_size(folio) - offset); offset %= PAGE_SIZE; while (spliced < size && !pipe_is_full(pipe)) { struct pipe_buffer buf = pipe_head_buf(pipe); size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced); buf = (struct pipe_buffer) { .ops = &page_cache_pipe_buf_ops, .page = page, .offset = offset, .len = part, }; folio_get(folio); pipe->head++; page++; spliced += part; offset = 0; } return spliced; } /** * filemap_splice_read - Splice data from a file's pagecache into a pipe * @in: The file to read from * @ppos: Pointer to the file position to read from * @pipe: The pipe to splice into * @len: The amount to splice * @flags: The SPLICE_F_* flags * * This function gets folios from a file's pagecache and splices them into the * pipe. Readahead will be called as necessary to fill more folios. This may * be used for blockdevs also. * * Return: On success, the number of bytes read will be returned and @ppos will be updated if appropriate; 0 will be returned if there is no more data * to be read; -EAGAIN will be returned if the pipe had no space, and some * other negative error code will be returned on error. A short read may occur * if the pipe has insufficient space, we reach the end of the data or we hit a * hole. / ssize_t filemap_splice_read(struct file in, loff_t ppos, struct pipe_inode_info pipe, size_t len, unsigned int flags) { struct folio_batch fbatch; struct kiocb iocb; size_t total_spliced = 0, used, npages; loff_t isize, end_offset; bool writably_mapped; int i, error = 0; if (unlikely(ppos >= in->f_mapping->host->i_sb->s_maxbytes)) return 0; init_sync_kiocb(&iocb, in); iocb.ki_pos = ppos; /* Work out how much data we can actually add into the pipe / used = pipe_buf_usage(pipe); npages = max_t(ssize_t, pipe->max_usage - used, 0); len = min_t(size_t, len, npages PAGE_SIZE); folio_batch_init(&fbatch); do { cond_resched(); if (ppos >= i_size_read(in->f_mapping->host)) break; iocb.ki_pos = ppos; error = filemap_get_pages(&iocb, len, &fbatch, true); if (error < 0) break; /* * i_size must be checked after we know the pages are Uptodate. * * Checking i_size after the check allows us to calculate * the correct value for "nr", which means the zero-filled * part of the page is not copied back to userspace (unless * another truncate extends the file - this is desired though). / isize = i_size_read(in->f_mapping->host); if (unlikely(ppos >= isize)) break; end_offset = min_t(loff_t, isize, ppos + len); / * Once we start copying data, we don't want to be touching any * cachelines that might be contended: / writably_mapped = mapping_writably_mapped(in->f_mapping); for (i = 0; i < folio_batch_count(&fbatch); i++) { struct folio folio = fbatch.folios[i]; size_t n; if (folio_pos(folio) >= end_offset) goto out; folio_mark_accessed(folio); /* * If users can be writing to this folio using arbitrary * virtual addresses, take care of potential aliasing * before reading the folio on the kernel side. / if (writably_mapped) flush_dcache_folio(folio); n = min_t(loff_t, len, isize - ppos); n = splice_folio_into_pipe(pipe, folio, ppos, n); if (!n) goto out; len -= n; total_spliced += n; ppos += n; in->f_ra.prev_pos = ppos; if (pipe_is_full(pipe)) goto out; } folio_batch_release(&fbatch); } while (len); out: folio_batch_release(&fbatch); file_accessed(in); return total_spliced ? total_spliced : error; } EXPORT_SYMBOL(filemap_splice_read); static inline loff_t folio_seek_hole_data(struct xa_state xas, struct address_space mapping, struct folio folio, loff_t start, loff_t end, bool seek_data) { const struct address_space_operations ops = mapping->a_ops; size_t offset, bsz = i_blocksize(mapping->host); if (xa_is_value(folio) \|\| folio_test_uptodate(folio)) return seek_data ? start : end; if (!ops->is_partially_uptodate) return seek_data ? end : start; xas_pause(xas); rcu_read_unlock(); folio_lock(folio); if (unlikely(folio->mapping != mapping)) goto unlock; offset = offset_in_folio(folio, start) & ~(bsz - 1); do { if (ops->is_partially_uptodate(folio, offset, bsz) == seek_data) break; start = (start + bsz) & ~((u64)bsz - 1); offset += bsz; } while (offset < folio_size(folio)); unlock: folio_unlock(folio); rcu_read_lock(); return start; } static inline size_t seek_folio_size(struct xa_state xas, struct folio folio) { if (xa_is_value(folio)) return PAGE_SIZE << xas_get_order(xas); return folio_size(folio); } /* * mapping_seek_hole_data - Seek for SEEK_DATA / SEEK_HOLE in the page cache. * @mapping: Address space to search. * @start: First byte to consider. * @end: Limit of search (exclusive). * @whence: Either SEEK_HOLE or SEEK_DATA. * * If the page cache knows which blocks contain holes and which blocks * contain data, your filesystem can use this function to implement * SEEK_HOLE and SEEK_DATA. This is useful for filesystems which are * entirely memory-based such as tmpfs, and filesystems which support * unwritten extents. * * Return: The requested offset on success, or -ENXIO if @whence specifies * SEEK_DATA and there is no data after @start. There is an implicit hole * after @end - 1, so SEEK_HOLE returns @end if all the bytes between @start * and @end contain data. / loff_t mapping_seek_hole_data(struct address_space mapping, loff_t start, loff_t end, int whence) { XA_STATE(xas, &mapping->i_pages, start >> PAGE_SHIFT); pgoff_t max = (end - 1) >> PAGE_SHIFT; bool seek_data = (whence == SEEK_DATA); struct folio folio; if (end <= start) return -ENXIO; rcu_read_lock(); while ((folio = find_get_entry(&xas, max, XA_PRESENT))) { loff_t pos = (u64)xas.xa_index << PAGE_SHIFT; size_t seek_size; if (start < pos) { if (!seek_data) goto unlock; start = pos; } seek_size = seek_folio_size(&xas, folio); pos = round_up((u64)pos + 1, seek_size); start = folio_seek_hole_data(&xas, mapping, folio, start, pos, seek_data); if (start < pos) goto unlock; if (start >= end) break; if (seek_size > PAGE_SIZE) xas_set(&xas, pos >> PAGE_SHIFT); if (!xa_is_value(folio)) folio_put(folio); } if (seek_data) start = -ENXIO; unlock: rcu_read_unlock(); if (folio && !xa_is_value(folio)) folio_put(folio); if (start > end) return end; return start; } #ifdef CONFIG_MMU #define MMAP_LOTSAMISS (100) / * lock_folio_maybe_drop_mmap - lock the page, possibly dropping the mmap_lock * @vmf - the vm_fault for this fault. * @folio - the folio to lock. * @fpin - the pointer to the file we may pin (or is already pinned). * * This works similar to lock_folio_or_retry in that it can drop the * mmap_lock. It differs in that it actually returns the folio locked * if it returns 1 and 0 if it couldn't lock the folio. If we did have * to drop the mmap_lock then fpin will point to the pinned file and * needs to be fput()'ed at a later point. / static int lock_folio_maybe_drop_mmap(struct vm_fault vmf, struct folio folio, struct file fpin) { if (folio_trylock(folio)) return 1; / * NOTE! This will make us return with VM_FAULT_RETRY, but with * the fault lock still held. That's how FAULT_FLAG_RETRY_NOWAIT * is supposed to work. We have way too many special cases.. / if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) return 0; fpin = maybe_unlock_mmap_for_io(vmf, fpin); if (vmf->flags & FAULT_FLAG_KILLABLE) { if (__folio_lock_killable(folio)) { / * We didn't have the right flags to drop the * fault lock, but all fault_handlers only check * for fatal signals if we return VM_FAULT_RETRY, * so we need to drop the fault lock here and * return 0 if we don't have a fpin. / if (fpin == NULL) release_fault_lock(vmf); return 0; } } else __folio_lock(folio); return 1; } /* * Synchronous readahead happens when we don't even find a page in the page * cache at all. We don't want to perform IO under the mmap sem, so if we have * to drop the mmap sem we return the file that was pinned in order for us to do * that. If we didn't pin a file then we return NULL. The file that is * returned needs to be fput()'ed when we're done with it. / static struct file do_sync_mmap_readahead(struct vm_fault vmf) { struct file file = vmf->vma->vm_file; struct file_ra_state ra = &file->f_ra; struct address_space mapping = file->f_mapping; DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff); struct file fpin = NULL; vm_flags_t vm_flags = vmf->vma->vm_flags; bool force_thp_readahead = false; unsigned short mmap_miss; / Use the readahead code, even if readahead is disabled / if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) force_thp_readahead = true; if (!force_thp_readahead) { / * If we don't want any read-ahead, don't bother. * VM_EXEC case below is already intended for random access. / if ((vm_flags & (VM_RAND_READ \| VM_EXEC)) == VM_RAND_READ) return fpin; if (!ra->ra_pages) return fpin; if (vm_flags & VM_SEQ_READ) { fpin = maybe_unlock_mmap_for_io(vmf, fpin); page_cache_sync_ra(&ractl, ra->ra_pages); return fpin; } } if (!(vm_flags & VM_SEQ_READ)) { / Avoid banging the cache line if not needed / mmap_miss = READ_ONCE(ra->mmap_miss); if (mmap_miss < MMAP_LOTSAMISS 10) WRITE_ONCE(ra->mmap_miss, ++mmap_miss); /* * Do we miss much more than hit in this file? If so, * stop bothering with read-ahead. It will only hurt. / if (mmap_miss > MMAP_LOTSAMISS) return fpin; } if (force_thp_readahead) { fpin = maybe_unlock_mmap_for_io(vmf, fpin); ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1); ra->size = HPAGE_PMD_NR; / * Fetch two PMD folios, so we get the chance to actually * readahead, unless we've been told not to. / if (!(vm_flags & VM_RAND_READ)) ra->size = 2; ra->async_size = HPAGE_PMD_NR; ra->order = HPAGE_PMD_ORDER; page_cache_ra_order(&ractl, ra); return fpin; } if (vm_flags & VM_EXEC) { /* * Allow arch to request a preferred minimum folio order for * executable memory. This can often be beneficial to * performance if (e.g.) arm64 can contpte-map the folio. * Executable memory rarely benefits from readahead, due to its * random access nature, so set async_size to 0. * * Limit to the boundaries of the VMA to avoid reading in any * pad that might exist between sections, which would be a waste * of memory. / struct vm_area_struct vma = vmf->vma; unsigned long start = vma->vm_pgoff; unsigned long end = start + vma_pages(vma); unsigned long ra_end; ra->order = exec_folio_order(); ra->start = round_down(vmf->pgoff, 1UL << ra->order); ra->start = max(ra->start, start); ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order); ra_end = min(ra_end, end); ra->size = ra_end - ra->start; ra->async_size = 0; } else { /* * mmap read-around / ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2); ra->size = ra->ra_pages; ra->async_size = ra->ra_pages / 4; ra->order = 0; } fpin = maybe_unlock_mmap_for_io(vmf, fpin); ractl._index = ra->start; page_cache_ra_order(&ractl, ra); return fpin; } / * Asynchronous readahead happens when we find the page and PG_readahead, * so we want to possibly extend the readahead further. We return the file that * was pinned if we have to drop the mmap_lock in order to do IO. / static struct file do_async_mmap_readahead(struct vm_fault vmf, struct folio folio) { struct file file = vmf->vma->vm_file; struct file_ra_state ra = &file->f_ra; DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff); struct file fpin = NULL; unsigned short mmap_miss; / If we don't want any read-ahead, don't bother / if (vmf->vma->vm_flags & VM_RAND_READ \|\| !ra->ra_pages) return fpin; / * If the folio is locked, we're likely racing against another fault. * Don't touch the mmap_miss counter to avoid decreasing it multiple * times for a single folio and break the balance with mmap_miss * increase in do_sync_mmap_readahead(). / if (likely(!folio_test_locked(folio))) { mmap_miss = READ_ONCE(ra->mmap_miss); if (mmap_miss) WRITE_ONCE(ra->mmap_miss, --mmap_miss); } if (folio_test_readahead(folio)) { fpin = maybe_unlock_mmap_for_io(vmf, fpin); page_cache_async_ra(&ractl, folio, ra->ra_pages); } return fpin; } static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault vmf) { struct vm_area_struct vma = vmf->vma; vm_fault_t ret = 0; pte_t ptep; /* * We might have COW'ed a pagecache folio and might now have an mlocked * anon folio mapped. The original pagecache folio is not mlocked and * might have been evicted. During a read+clear/modify/write update of * the PTE, such as done in do_numa_page()/change_pte_range(), we * temporarily clear the PTE under PT lock and might detect it here as * "none" when not holding the PT lock. * * Not rechecking the PTE under PT lock could result in an unexpected * major fault in an mlock'ed region. Recheck only for this special * scenario while holding the PT lock, to not degrade non-mlocked * scenarios. Recheck the PTE without PT lock firstly, thereby reducing * the number of times we hold PT lock. / if (!(vma->vm_flags & VM_LOCKED)) return 0; if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)) return 0; ptep = pte_offset_map_ro_nolock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!ptep)) return VM_FAULT_NOPAGE; if (unlikely(!pte_none(ptep_get_lockless(ptep)))) { ret = VM_FAULT_NOPAGE; } else { spin_lock(vmf->ptl); if (unlikely(!pte_none(ptep_get(ptep)))) ret = VM_FAULT_NOPAGE; spin_unlock(vmf->ptl); } pte_unmap(ptep); return ret; } /* * filemap_fault - read in file data for page fault handling * @vmf: struct vm_fault containing details of the fault * * filemap_fault() is invoked via the vma operations vector for a * mapped memory region to read in file data during a page fault. * * The goto's are kind of ugly, but this streamlines the normal case of having * it in the page cache, and handles the special cases reasonably without * having a lot of duplicated code. * * vma->vm_mm->mmap_lock must be held on entry. * * If our return value has VM_FAULT_RETRY set, it's because the mmap_lock * may be dropped before doing I/O or by lock_folio_maybe_drop_mmap(). * * If our return value does not have VM_FAULT_RETRY set, the mmap_lock * has not been released. * * We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set. * * Return: bitwise-OR of %VM_FAULT_ codes. / vm_fault_t filemap_fault(struct vm_fault vmf) { int error; struct file file = vmf->vma->vm_file; struct file fpin = NULL; struct address_space mapping = file->f_mapping; struct inode inode = mapping->host; pgoff_t max_idx, index = vmf->pgoff; struct folio folio; vm_fault_t ret = 0; bool mapping_locked = false; max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); if (unlikely(index >= max_idx)) return VM_FAULT_SIGBUS; trace_mm_filemap_fault(mapping, index); / * Do we have something in the page cache already? / folio = filemap_get_folio(mapping, index); if (likely(!IS_ERR(folio))) { / * We found the page, so try async readahead before waiting for * the lock. / if (!(vmf->flags & FAULT_FLAG_TRIED)) fpin = do_async_mmap_readahead(vmf, folio); if (unlikely(!folio_test_uptodate(folio))) { filemap_invalidate_lock_shared(mapping); mapping_locked = true; } } else { ret = filemap_fault_recheck_pte_none(vmf); if (unlikely(ret)) return ret; / No page in the page cache at all / count_vm_event(PGMAJFAULT); count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; fpin = do_sync_mmap_readahead(vmf); retry_find: / * See comment in filemap_create_folio() why we need * invalidate_lock / if (!mapping_locked) { filemap_invalidate_lock_shared(mapping); mapping_locked = true; } folio = __filemap_get_folio(mapping, index, FGP_CREAT\|FGP_FOR_MMAP, vmf->gfp_mask); if (IS_ERR(folio)) { if (fpin) goto out_retry; filemap_invalidate_unlock_shared(mapping); return VM_FAULT_OOM; } } if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin)) goto out_retry; / Did it get truncated? / if (unlikely(folio->mapping != mapping)) { folio_unlock(folio); folio_put(folio); goto retry_find; } VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio); / * We have a locked folio in the page cache, now we need to check * that it's up-to-date. If not, it is going to be due to an error, * or because readahead was otherwise unable to retrieve it. / if (unlikely(!folio_test_uptodate(folio))) { / * If the invalidate lock is not held, the folio was in cache * and uptodate and now it is not. Strange but possible since we * didn't hold the page lock all the time. Let's drop * everything, get the invalidate lock and try again. / if (!mapping_locked) { folio_unlock(folio); folio_put(folio); goto retry_find; } / * OK, the folio is really not uptodate. This can be because the * VMA has the VM_RAND_READ flag set, or because an error * arose. Let's read it in directly. / goto page_not_uptodate; } / * We've made it this far and we had to drop our mmap_lock, now is the * time to return to the upper layer and have it re-find the vma and * redo the fault. / if (fpin) { folio_unlock(folio); goto out_retry; } if (mapping_locked) filemap_invalidate_unlock_shared(mapping); / * Found the page and have a reference on it. * We must recheck i_size under page lock. / max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); if (unlikely(index >= max_idx)) { folio_unlock(folio); folio_put(folio); return VM_FAULT_SIGBUS; } vmf->page = folio_file_page(folio, index); return ret \| VM_FAULT_LOCKED; page_not_uptodate: / * Umm, take care of errors if the page isn't up-to-date. * Try to re-read it _once_. We do this synchronously, * because there really aren't any performance issues here * and we need to check for errors. / fpin = maybe_unlock_mmap_for_io(vmf, fpin); error = filemap_read_folio(file, mapping->a_ops->read_folio, folio); if (fpin) goto out_retry; folio_put(folio); if (!error \|\| error == AOP_TRUNCATED_PAGE) goto retry_find; filemap_invalidate_unlock_shared(mapping); return VM_FAULT_SIGBUS; out_retry: / * We dropped the mmap_lock, we need to return to the fault handler to * re-find the vma and come back and find our hopefully still populated * page. / if (!IS_ERR(folio)) folio_put(folio); if (mapping_locked) filemap_invalidate_unlock_shared(mapping); if (fpin) fput(fpin); return ret \| VM_FAULT_RETRY; } EXPORT_SYMBOL(filemap_fault); static bool filemap_map_pmd(struct vm_fault vmf, struct folio folio, pgoff_t start) { struct mm_struct mm = vmf->vma->vm_mm; /* Huge page is mapped? No need to proceed. / if (pmd_trans_huge(vmf->pmd)) { folio_unlock(folio); folio_put(folio); return true; } if (pmd_none(vmf->pmd) && folio_test_pmd_mappable(folio)) { struct page page = folio_file_page(folio, start); vm_fault_t ret = do_set_pmd(vmf, folio, page); if (!ret) { /* The page is mapped successfully, reference consumed. / folio_unlock(folio); return true; } } if (pmd_none(vmf->pmd) && vmf->prealloc_pte) pmd_install(mm, vmf->pmd, &vmf->prealloc_pte); return false; } static struct folio next_uptodate_folio(struct xa_state xas, struct address_space mapping, pgoff_t end_pgoff) { struct folio folio = xas_next_entry(xas, end_pgoff); unsigned long max_idx; do { if (!folio) return NULL; if (xas_retry(xas, folio)) continue; if (xa_is_value(folio)) continue; if (!folio_try_get(folio)) continue; if (folio_test_locked(folio)) goto skip; /* Has the page moved or been split? / if (unlikely(folio != xas_reload(xas))) goto skip; if (!folio_test_uptodate(folio) \|\| folio_test_readahead(folio)) goto skip; if (!folio_trylock(folio)) goto skip; if (folio->mapping != mapping) goto unlock; if (!folio_test_uptodate(folio)) goto unlock; max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); if (xas->xa_index >= max_idx) goto unlock; return folio; unlock: folio_unlock(folio); skip: folio_put(folio); } while ((folio = xas_next_entry(xas, end_pgoff)) != NULL); return NULL; } / * Map page range [start_page, start_page + nr_pages) of folio. * start_page is gotten from start by folio_page(folio, start) / static vm_fault_t filemap_map_folio_range(struct vm_fault vmf, struct folio folio, unsigned long start, unsigned long addr, unsigned int nr_pages, unsigned long rss, unsigned short mmap_miss, pgoff_t file_end) { struct address_space mapping = folio->mapping; unsigned int ref_from_caller = 1; vm_fault_t ret = 0; struct page page = folio_page(folio, start); unsigned int count = 0; pte_t old_ptep = vmf->pte; unsigned long addr0; /* * Map the large folio fully where possible: * * - The folio is fully within size of the file or belong * to shmem/tmpfs; * - The folio doesn't cross VMA boundary; * - The folio doesn't cross page table boundary; / addr0 = addr - start PAGE_SIZE; if ((file_end >= folio_next_index(folio) \|\| shmem_mapping(mapping)) && folio_within_vma(folio, vmf->vma) && (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) { vmf->pte -= start; page -= start; addr = addr0; nr_pages = folio_nr_pages(folio); } do { if (PageHWPoison(page + count)) goto skip; /* * If there are too many folios that are recently evicted * in a file, they will probably continue to be evicted. * In such situation, read-ahead is only a waste of IO. * Don't decrease mmap_miss in this scenario to make sure * we can stop read-ahead. / if (!folio_test_workingset(folio)) (mmap_miss)++; /* * NOTE: If there're PTE markers, we'll leave them to be * handled in the specific fault path, and it'll prohibit the * fault-around logic. / if (!pte_none(ptep_get(&vmf->pte[count]))) goto skip; count++; continue; skip: if (count) { set_pte_range(vmf, folio, page, count, addr); rss += count; folio_ref_add(folio, count - ref_from_caller); ref_from_caller = 0; if (in_range(vmf->address, addr, count * PAGE_SIZE)) ret = VM_FAULT_NOPAGE; } count++; page += count; vmf->pte += count; addr += count * PAGE_SIZE; count = 0; } while (--nr_pages > 0); if (count) { set_pte_range(vmf, folio, page, count, addr); rss += count; folio_ref_add(folio, count - ref_from_caller); ref_from_caller = 0; if (in_range(vmf->address, addr, count PAGE_SIZE)) ret = VM_FAULT_NOPAGE; } vmf->pte = old_ptep; if (ref_from_caller) /* Locked folios cannot get truncated. / folio_ref_dec(folio); return ret; } static vm_fault_t filemap_map_order0_folio(struct vm_fault vmf, struct folio folio, unsigned long addr, unsigned long rss, unsigned short mmap_miss) { vm_fault_t ret = 0; struct page page = &folio->page; if (PageHWPoison(page)) goto out; /* See comment of filemap_map_folio_range() / if (!folio_test_workingset(folio)) (mmap_miss)++; /* * NOTE: If there're PTE markers, we'll leave them to be * handled in the specific fault path, and it'll prohibit * the fault-around logic. / if (!pte_none(ptep_get(vmf->pte))) goto out; if (vmf->address == addr) ret = VM_FAULT_NOPAGE; set_pte_range(vmf, folio, page, 1, addr); (rss)++; return ret; out: /* Locked folios cannot get truncated. / folio_ref_dec(folio); return ret; } vm_fault_t filemap_map_pages(struct vm_fault vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { struct vm_area_struct vma = vmf->vma; struct file file = vma->vm_file; struct address_space mapping = file->f_mapping; pgoff_t file_end, last_pgoff = start_pgoff; unsigned long addr; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct folio folio; vm_fault_t ret = 0; unsigned long rss = 0; unsigned int nr_pages = 0, folio_type; unsigned short mmap_miss = 0, mmap_miss_saved; rcu_read_lock(); folio = next_uptodate_folio(&xas, mapping, end_pgoff); if (!folio) goto out; file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1; end_pgoff = min(end_pgoff, file_end); /* * Do not allow to map with PMD across i_size to preserve * SIGBUS semantics. * * Make an exception for shmem/tmpfs that for long time * intentionally mapped with PMDs across i_size. / if ((file_end >= folio_next_index(folio) \|\| shmem_mapping(mapping)) && filemap_map_pmd(vmf, folio, start_pgoff)) { ret = VM_FAULT_NOPAGE; goto out; } addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) { folio_unlock(folio); folio_put(folio); goto out; } folio_type = mm_counter_file(folio); do { unsigned long end; addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT; vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; end = folio_next_index(folio) - 1; nr_pages = min(end, end_pgoff) - xas.xa_index + 1; if (!folio_test_large(folio)) ret \|= filemap_map_order0_folio(vmf, folio, addr, &rss, &mmap_miss); else ret \|= filemap_map_folio_range(vmf, folio, xas.xa_index - folio->index, addr, nr_pages, &rss, &mmap_miss, file_end); folio_unlock(folio); } while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL); add_mm_counter(vma->vm_mm, folio_type, rss); pte_unmap_unlock(vmf->pte, vmf->ptl); trace_mm_filemap_map_pages(mapping, start_pgoff, end_pgoff); out: rcu_read_unlock(); mmap_miss_saved = READ_ONCE(file->f_ra.mmap_miss); if (mmap_miss >= mmap_miss_saved) WRITE_ONCE(file->f_ra.mmap_miss, 0); else WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss_saved - mmap_miss); return ret; } EXPORT_SYMBOL(filemap_map_pages); vm_fault_t filemap_page_mkwrite(struct vm_fault vmf) { struct address_space mapping = vmf->vma->vm_file->f_mapping; struct folio folio = page_folio(vmf->page); vm_fault_t ret = VM_FAULT_LOCKED; sb_start_pagefault(mapping->host->i_sb); file_update_time(vmf->vma->vm_file); folio_lock(folio); if (folio->mapping != mapping) { folio_unlock(folio); ret = VM_FAULT_NOPAGE; goto out; } /* * We mark the folio dirty already here so that when freeze is in * progress, we are guaranteed that writeback during freezing will * see the dirty folio and writeprotect it again. / folio_mark_dirty(folio); folio_wait_stable(folio); out: sb_end_pagefault(mapping->host->i_sb); return ret; } const struct vm_operations_struct generic_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, .page_mkwrite = filemap_page_mkwrite, }; / This is used for a general mmap of a disk file / int generic_file_mmap(struct file file, struct vm_area_struct vma) { struct address_space mapping = file->f_mapping; if (!mapping->a_ops->read_folio) return -ENOEXEC; file_accessed(file); vma->vm_ops = &generic_file_vm_ops; return 0; } int generic_file_mmap_prepare(struct vm_area_desc desc) { struct file file = desc->file; struct address_space mapping = file->f_mapping; if (!mapping->a_ops->read_folio) return -ENOEXEC; file_accessed(file); desc->vm_ops = &generic_file_vm_ops; return 0; } / * This is for filesystems which do not implement ->writepage. / int generic_file_readonly_mmap(struct file file, struct vm_area_struct vma) { if (vma_is_shared_maywrite(vma)) return -EINVAL; return generic_file_mmap(file, vma); } int generic_file_readonly_mmap_prepare(struct vm_area_desc desc) { if (is_shared_maywrite(&desc->vma_flags)) return -EINVAL; return generic_file_mmap_prepare(desc); } #else vm_fault_t filemap_page_mkwrite(struct vm_fault vmf) { return VM_FAULT_SIGBUS; } int generic_file_mmap(struct file file, struct vm_area_struct vma) { return -ENOSYS; } int generic_file_mmap_prepare(struct vm_area_desc desc) { return -ENOSYS; } int generic_file_readonly_mmap(struct file file, struct vm_area_struct vma) { return -ENOSYS; } int generic_file_readonly_mmap_prepare(struct vm_area_desc desc) { return -ENOSYS; } #endif / CONFIG_MMU / EXPORT_SYMBOL(filemap_page_mkwrite); EXPORT_SYMBOL(generic_file_mmap); EXPORT_SYMBOL(generic_file_mmap_prepare); EXPORT_SYMBOL(generic_file_readonly_mmap); EXPORT_SYMBOL(generic_file_readonly_mmap_prepare); static struct folio do_read_cache_folio(struct address_space mapping, pgoff_t index, filler_t filler, struct file file, gfp_t gfp) { struct folio folio; int err; if (!filler) filler = mapping->a_ops->read_folio; repeat: folio = filemap_get_folio(mapping, index); if (IS_ERR(folio)) { folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping), NULL); if (!folio) return ERR_PTR(-ENOMEM); index = mapping_align_index(mapping, index); err = filemap_add_folio(mapping, folio, index, gfp); if (unlikely(err)) { folio_put(folio); if (err == -EEXIST) goto repeat; / Presumably ENOMEM for xarray node / return ERR_PTR(err); } goto filler; } if (folio_test_uptodate(folio)) goto out; if (!folio_trylock(folio)) { folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE); goto repeat; } / Folio was truncated from mapping / if (!folio->mapping) { folio_unlock(folio); folio_put(folio); goto repeat; } / Someone else locked and filled the page in a very small window / if (folio_test_uptodate(folio)) { folio_unlock(folio); goto out; } filler: err = filemap_read_folio(file, filler, folio); if (err) { folio_put(folio); if (err == AOP_TRUNCATED_PAGE) goto repeat; return ERR_PTR(err); } out: folio_mark_accessed(folio); return folio; } /* * read_cache_folio - Read into page cache, fill it if needed. * @mapping: The address_space to read from. * @index: The index to read. * @filler: Function to perform the read, or NULL to use aops->read_folio(). * @file: Passed to filler function, may be NULL if not required. * * Read one page into the page cache. If it succeeds, the folio returned * will contain @index, but it may not be the first page of the folio. * * If the filler function returns an error, it will be returned to the * caller. * * Context: May sleep. Expects mapping->invalidate_lock to be held. * Return: An uptodate folio on success, ERR_PTR() on failure. / struct folio read_cache_folio(struct address_space mapping, pgoff_t index, filler_t filler, struct file file) { return do_read_cache_folio(mapping, index, filler, file, mapping_gfp_mask(mapping)); } EXPORT_SYMBOL(read_cache_folio); /** * mapping_read_folio_gfp - Read into page cache, using specified allocation flags. * @mapping: The address_space for the folio. * @index: The index that the allocated folio will contain. * @gfp: The page allocator flags to use if allocating. * * This is the same as "read_cache_folio(mapping, index, NULL, NULL)", but with * any new memory allocations done using the specified allocation flags. * * The most likely error from this function is EIO, but ENOMEM is * possible and so is EINTR. If ->read_folio returns another error, * that will be returned to the caller. * * The function expects mapping->invalidate_lock to be already held. * * Return: Uptodate folio on success, ERR_PTR() on failure. / struct folio mapping_read_folio_gfp(struct address_space mapping, pgoff_t index, gfp_t gfp) { return do_read_cache_folio(mapping, index, NULL, NULL, gfp); } EXPORT_SYMBOL(mapping_read_folio_gfp); static struct page do_read_cache_page(struct address_space mapping, pgoff_t index, filler_t filler, struct file file, gfp_t gfp) { struct folio folio; folio = do_read_cache_folio(mapping, index, filler, file, gfp); if (IS_ERR(folio)) return &folio->page; return folio_file_page(folio, index); } struct page read_cache_page(struct address_space mapping, pgoff_t index, filler_t filler, struct file file) { return do_read_cache_page(mapping, index, filler, file, mapping_gfp_mask(mapping)); } EXPORT_SYMBOL(read_cache_page); /** * read_cache_page_gfp - read into page cache, using specified page allocation flags. * @mapping: the page's address_space * @index: the page index * @gfp: the page allocator flags to use if allocating * * This is the same as "read_mapping_page(mapping, index, NULL)", but with * any new page allocations done using the specified allocation flags. * * If the page does not get brought uptodate, return -EIO. * * The function expects mapping->invalidate_lock to be already held. * * Return: up to date page on success, ERR_PTR() on failure. / struct page read_cache_page_gfp(struct address_space mapping, pgoff_t index, gfp_t gfp) { return do_read_cache_page(mapping, index, NULL, NULL, gfp); } EXPORT_SYMBOL(read_cache_page_gfp); / * Warn about a page cache invalidation failure during a direct I/O write. / static void dio_warn_stale_pagecache(struct file filp) { static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST); char pathname[128]; char path; errseq_set(&filp->f_mapping->wb_err, -EIO); if (__ratelimit(&_rs)) { path = file_path(filp, pathname, sizeof(pathname)); if (IS_ERR(path)) path = "(unknown)"; pr_crit("Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!\n"); pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid, current->comm); } } void kiocb_invalidate_post_direct_write(struct kiocb iocb, size_t count) { struct address_space mapping = iocb->ki_filp->f_mapping; if (mapping->nrpages && invalidate_inode_pages2_range(mapping, iocb->ki_pos >> PAGE_SHIFT, (iocb->ki_pos + count - 1) >> PAGE_SHIFT)) dio_warn_stale_pagecache(iocb->ki_filp); } ssize_t generic_file_direct_write(struct kiocb iocb, struct iov_iter from) { struct address_space mapping = iocb->ki_filp->f_mapping; size_t write_len = iov_iter_count(from); ssize_t written; /* * If a page can not be invalidated, return 0 to fall back * to buffered write. / written = kiocb_invalidate_pages(iocb, write_len); if (written) { if (written == -EBUSY) return 0; return written; } written = mapping->a_ops->direct_IO(iocb, from); / * Finally, try again to invalidate clean pages which might have been * cached by non-direct readahead, or faulted in by get_user_pages() * if the source of the write was an mmap'ed region of the file * we're writing. Either one is a pretty crazy thing to do, * so we don't support it 100%. If this invalidation * fails, tough, the write still worked... * * Most of the time we do not need this since dio_complete() will do * the invalidation for us. However there are some file systems that * do not end up with dio_complete() being called, so let's not break * them by removing it completely. * * Noticeable example is a blkdev_direct_IO(). * * Skip invalidation for async writes or if mapping has no pages. / if (written > 0) { struct inode inode = mapping->host; loff_t pos = iocb->ki_pos; kiocb_invalidate_post_direct_write(iocb, written); pos += written; write_len -= written; if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) { i_size_write(inode, pos); mark_inode_dirty(inode); } iocb->ki_pos = pos; } if (written != -EIOCBQUEUED) iov_iter_revert(from, write_len - iov_iter_count(from)); return written; } EXPORT_SYMBOL(generic_file_direct_write); ssize_t generic_perform_write(struct kiocb iocb, struct iov_iter i) { struct file file = iocb->ki_filp; loff_t pos = iocb->ki_pos; struct address_space mapping = file->f_mapping; const struct address_space_operations a_ops = mapping->a_ops; size_t chunk = mapping_max_folio_size(mapping); long status = 0; ssize_t written = 0; do { struct folio folio; size_t offset; /* Offset into folio / size_t bytes; / Bytes to write to folio / size_t copied; / Bytes copied from user / void fsdata = NULL; bytes = iov_iter_count(i); retry: offset = pos & (chunk - 1); bytes = min(chunk - offset, bytes); balance_dirty_pages_ratelimited(mapping); if (fatal_signal_pending(current)) { status = -EINTR; break; } status = a_ops->write_begin(iocb, mapping, pos, bytes, &folio, &fsdata); if (unlikely(status < 0)) break; offset = offset_in_folio(folio, pos); if (bytes > folio_size(folio) - offset) bytes = folio_size(folio) - offset; if (mapping_writably_mapped(mapping)) flush_dcache_folio(folio); /* * Faults here on mmap()s can recurse into arbitrary * filesystem code. Lots of locks are held that can * deadlock. Use an atomic copy to avoid deadlocking * in page fault handling. / copied = copy_folio_from_iter_atomic(folio, offset, bytes, i); flush_dcache_folio(folio); status = a_ops->write_end(iocb, mapping, pos, bytes, copied, folio, fsdata); if (unlikely(status != copied)) { iov_iter_revert(i, copied - max(status, 0L)); if (unlikely(status < 0)) break; } cond_resched(); if (unlikely(status == 0)) { / * A short copy made ->write_end() reject the * thing entirely. Might be memory poisoning * halfway through, might be a race with munmap, * might be severe memory pressure. / if (chunk > PAGE_SIZE) chunk /= 2; if (copied) { bytes = copied; goto retry; } / * 'folio' is now unlocked and faults on it can be * handled. Ensure forward progress by trying to * fault it in now. / if (fault_in_iov_iter_readable(i, bytes) == bytes) { status = -EFAULT; break; } } else { pos += status; written += status; } } while (iov_iter_count(i)); if (!written) return status; iocb->ki_pos += written; return written; } EXPORT_SYMBOL(generic_perform_write); /* * __generic_file_write_iter - write data to a file * @iocb: IO state structure (file, offset, etc.) * @from: iov_iter with data to write * * This function does all the work needed for actually writing data to a * file. It does all basic checks, removes SUID from the file, updates * modification times and calls proper subroutines depending on whether we * do direct IO or a standard buffered write. * * It expects i_rwsem to be grabbed unless we work on a block device or similar * object which does not need locking at all. * * This function does not take care of syncing data in case of O_SYNC write. * A caller has to handle it. This is mainly due to the fact that we want to * avoid syncing under i_rwsem. * * Return: * * number of bytes written, even for truncated writes * * negative error code if no data has been written at all / ssize_t __generic_file_write_iter(struct kiocb iocb, struct iov_iter from) { struct file file = iocb->ki_filp; struct address_space mapping = file->f_mapping; struct inode inode = mapping->host; ssize_t ret; ret = file_remove_privs(file); if (ret) return ret; ret = file_update_time(file); if (ret) return ret; if (iocb->ki_flags & IOCB_DIRECT) { ret = generic_file_direct_write(iocb, from); /* * If the write stopped short of completing, fall back to * buffered writes. Some filesystems do this for writes to * holes, for example. For DAX files, a buffered write will * not succeed (even if it did, DAX does not handle dirty * page-cache pages correctly). / if (ret < 0 \|\| !iov_iter_count(from) \|\| IS_DAX(inode)) return ret; return direct_write_fallback(iocb, from, ret, generic_perform_write(iocb, from)); } return generic_perform_write(iocb, from); } EXPORT_SYMBOL(__generic_file_write_iter); /* * generic_file_write_iter - write data to a file * @iocb: IO state structure * @from: iov_iter with data to write * * This is a wrapper around __generic_file_write_iter() to be used by most * filesystems. It takes care of syncing the file in case of O_SYNC file * and acquires i_rwsem as needed. * Return: * * negative error code if no data has been written at all of * vfs_fsync_range() failed for a synchronous write * * number of bytes written, even for truncated writes / ssize_t generic_file_write_iter(struct kiocb iocb, struct iov_iter from) { struct file file = iocb->ki_filp; struct inode inode = file->f_mapping->host; ssize_t ret; inode_lock(inode); ret = generic_write_checks(iocb, from); if (ret > 0) ret = __generic_file_write_iter(iocb, from); inode_unlock(inode); if (ret > 0) ret = generic_write_sync(iocb, ret); return ret; } EXPORT_SYMBOL(generic_file_write_iter); /* * filemap_release_folio() - Release fs-specific metadata on a folio. * @folio: The folio which the kernel is trying to free. * @gfp: Memory allocation flags (and I/O mode). * * The address_space is trying to release any data attached to a folio * (presumably at folio->private). * * This will also be called if the private_2 flag is set on a page, * indicating that the folio has other metadata associated with it. * * The @gfp argument specifies whether I/O may be performed to release * this page (__GFP_IO), and whether the call may block * (__GFP_RECLAIM & __GFP_FS). * * Return: %true if the release was successful, otherwise %false. / bool filemap_release_folio(struct folio folio, gfp_t gfp) { struct address_space * const mapping = folio->mapping; BUG_ON(!folio_test_locked(folio)); if (!folio_needs_release(folio)) return true; if (folio_test_writeback(folio)) return false; if (mapping && mapping->a_ops->release_folio) return mapping->a_ops->release_folio(folio, gfp); return try_to_free_buffers(folio); } EXPORT_SYMBOL(filemap_release_folio); /** * filemap_invalidate_inode - Invalidate/forcibly write back a range of an inode's pagecache * @inode: The inode to flush * @flush: Set to write back rather than simply invalidate. * @start: First byte to in range. * @end: Last byte in range (inclusive), or LLONG_MAX for everything from start * onwards. * * Invalidate all the folios on an inode that contribute to the specified * range, possibly writing them back first. Whilst the operation is * undertaken, the invalidate lock is held to prevent new folios from being * installed. / int filemap_invalidate_inode(struct inode inode, bool flush, loff_t start, loff_t end) { struct address_space mapping = inode->i_mapping; pgoff_t first = start >> PAGE_SHIFT; pgoff_t last = end >> PAGE_SHIFT; pgoff_t nr = end == LLONG_MAX ? ULONG_MAX : last - first + 1; if (!mapping \|\| !mapping->nrpages \|\| end < start) goto out; / Prevent new folios from being added to the inode. / filemap_invalidate_lock(mapping); if (!mapping->nrpages) goto unlock; unmap_mapping_pages(mapping, first, nr, false); / Write back the data if we're asked to. / if (flush) filemap_fdatawrite_range(mapping, start, end); / Wait for writeback to complete on all folios and discard. / invalidate_inode_pages2_range(mapping, start / PAGE_SIZE, end / PAGE_SIZE); unlock: filemap_invalidate_unlock(mapping); out: return filemap_check_errors(mapping); } EXPORT_SYMBOL_GPL(filemap_invalidate_inode); #ifdef CONFIG_CACHESTAT_SYSCALL /* * filemap_cachestat() - compute the page cache statistics of a mapping * @mapping: The mapping to compute the statistics for. * @first_index: The starting page cache index. * @last_index: The final page index (inclusive). * @cs: the cachestat struct to write the result to. * * This will query the page cache statistics of a mapping in the * page range of [first_index, last_index] (inclusive). The statistics * queried include: number of dirty pages, number of pages marked for * writeback, and the number of (recently) evicted pages. / static void filemap_cachestat(struct address_space mapping, pgoff_t first_index, pgoff_t last_index, struct cachestat cs) { XA_STATE(xas, &mapping->i_pages, first_index); struct folio folio; /* Flush stats (and potentially sleep) outside the RCU read section. / mem_cgroup_flush_stats_ratelimited(NULL); rcu_read_lock(); xas_for_each(&xas, folio, last_index) { int order; unsigned long nr_pages; pgoff_t folio_first_index, folio_last_index; / * Don't deref the folio. It is not pinned, and might * get freed (and reused) underneath us. * * We could pin it, but that would be expensive for * what should be a fast and lightweight syscall. * * Instead, derive all information of interest from * the rcu-protected xarray. / if (xas_retry(&xas, folio)) continue; order = xas_get_order(&xas); nr_pages = 1 << order; folio_first_index = round_down(xas.xa_index, 1 << order); folio_last_index = folio_first_index + nr_pages - 1; / Folios might straddle the range boundaries, only count covered pages / if (folio_first_index < first_index) nr_pages -= first_index - folio_first_index; if (folio_last_index > last_index) nr_pages -= folio_last_index - last_index; if (xa_is_value(folio)) { / page is evicted / void shadow = (void )folio; bool workingset; / not used / cs->nr_evicted += nr_pages; #ifdef CONFIG_SWAP / implies CONFIG_MMU / if (shmem_mapping(mapping)) { / shmem file - in swap cache / swp_entry_t swp = radix_to_swp_entry(folio); / swapin error results in poisoned entry / if (!softleaf_is_swap(swp)) goto resched; / * Getting a swap entry from the shmem * inode means we beat * shmem_unuse(). rcu_read_lock() * ensures swapoff waits for us before * freeing the swapper space. However, * we can race with swapping and * invalidation, so there might not be * a shadow in the swapcache (yet). / shadow = swap_cache_get_shadow(swp); if (!shadow) goto resched; } #endif if (workingset_test_recent(shadow, true, &workingset, false)) cs->nr_recently_evicted += nr_pages; goto resched; } / page is in cache / cs->nr_cache += nr_pages; if (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY)) cs->nr_dirty += nr_pages; if (xas_get_mark(&xas, PAGECACHE_TAG_WRITEBACK)) cs->nr_writeback += nr_pages; resched: if (need_resched()) { xas_pause(&xas); cond_resched_rcu(); } } rcu_read_unlock(); } / * See mincore: reveal pagecache information only for files * that the calling process has write access to, or could (if * tried) open for writing. / static inline bool can_do_cachestat(struct file f) { if (f->f_mode & FMODE_WRITE) return true; if (inode_owner_or_capable(file_mnt_idmap(f), file_inode(f))) return true; return file_permission(f, MAY_WRITE) == 0; } /* * The cachestat(2) system call. * * cachestat() returns the page cache statistics of a file in the * bytes range specified by `off` and `len`: number of cached pages, * number of dirty pages, number of pages marked for writeback, * number of evicted pages, and number of recently evicted pages. * * An evicted page is a page that is previously in the page cache * but has been evicted since. A page is recently evicted if its last * eviction was recent enough that its reentry to the cache would * indicate that it is actively being used by the system, and that * there is memory pressure on the system. * * `off` and `len` must be non-negative integers. If `len` > 0, * the queried range is [`off`, `off` + `len`]. If `len` == 0, * we will query in the range from `off` to the end of the file. * * The `flags` argument is unused for now, but is included for future * extensibility. User should pass 0 (i.e no flag specified). * * Currently, hugetlbfs is not supported. * * Because the status of a page can change after cachestat() checks it * but before it returns to the application, the returned values may * contain stale information. * * return values: * zero - success * -EFAULT - cstat or cstat_range points to an illegal address * -EINVAL - invalid flags * -EBADF - invalid file descriptor * -EOPNOTSUPP - file descriptor is of a hugetlbfs file / SYSCALL_DEFINE4(cachestat, unsigned int, fd, struct cachestat_range __user , cstat_range, struct cachestat __user , cstat, unsigned int, flags) { CLASS(fd, f)(fd); struct address_space mapping; struct cachestat_range csr; struct cachestat cs; pgoff_t first_index, last_index; if (fd_empty(f)) return -EBADF; if (copy_from_user(&csr, cstat_range, sizeof(struct cachestat_range))) return -EFAULT; /* hugetlbfs is not supported / if (is_file_hugepages(fd_file(f))) return -EOPNOTSUPP; if (!can_do_cachestat(fd_file(f))) return -EPERM; if (flags != 0) return -EINVAL; first_index = csr.off >> PAGE_SHIFT; last_index = csr.len == 0 ? ULONG_MAX : (csr.off + csr.len - 1) >> PAGE_SHIFT; memset(&cs, 0, sizeof(struct cachestat)); mapping = fd_file(f)->f_mapping; filemap_cachestat(mapping, first_index, last_index, &cs); if (copy_to_user(cstat, &cs, sizeof(struct cachestat))) return -EFAULT; return 0; } #endif / CONFIG_CACHESTAT_SYSCALL */ ]
8/2	2026/02/26 05:12	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 53072 output: 40 thoughts: 371 Thoughts: Analyzing a Data Race in the Filemap Code Okay, so I've got a data race on address `0xffff88811a6f8228`, which is... hmm, looks like that's `f_ra.prev_pos`. Task 10061 is in `filemap_splice_read` at line 3125 of `mm/filemap.c`, and task 10066 is in `filemap_read` at line 2873 of the same file. I need to take a closer look at those lines of code. Alright, let's see... Line 3125 in `filemap_splice_read` is writing `*ppos` to `in->f_ra.prev_pos`, and in `filemap_read`, line 2873, it looks like `last_pos` is being assigned to `ra->prev_pos`. Interesting. Both functions are indeed writing to this `prev_pos` field, with `ra` being the same thing as `&filp->f_ra`. So, it's a conflict between two tasks updating `f_ra.prev_pos`. Now, I need to confirm that this is indeed what's happening and figure out what the root cause and possible solutions are.
9/2	2026/02/26 05:12	tool	codesearch-definition-source	0m Args: map[ContextFile:mm/filemap.c IncludeLines:true Name:file_ra_state] Results: map[SourceCode:1197: /** 1198: * struct file_ra_state - Track a file's readahead state. 1199: * @start: Where the most recent readahead started. 1200: * @size: Number of pages read in the most recent readahead. 1201: * @async_size: Numer of pages that were/are not needed immediately 1202: * and so were/are genuinely "ahead". Start next readahead when 1203: * the first of these pages is accessed. 1204: * @ra_pages: Maximum size of a readahead request, copied from the bdi. 1205: * @order: Preferred folio order used for most recent readahead. 1206: * @mmap_miss: How many mmap accesses missed in the page cache. 1207: * @prev_pos: The last byte in the most recent read request. 1208: * 1209: * When this structure is passed to ->readahead(), the "most recent" 1210: * readahead means the current readahead. 1211: */ 1212: struct file_ra_state { 1213: pgoff_t start; 1214: unsigned int size; 1215: unsigned int async_size; 1216: unsigned int ra_pages; 1217: unsigned short order; 1218: unsigned short mmap_miss; 1219: loff_t prev_pos; 1220: }; SourceFile:include/linux/fs.h]
10/2	2026/02/26 05:12	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 54211 output: 21 thoughts: 1065 Thoughts: Analyzing a Potential Race Condition in Readahead Logic Okay, so I'm looking at this readahead implementation, specifically around how the `prev_pos` field in `struct file_ra_state` (`f_ra`) is handled. It seems like both `filemap_read` and `filemap_splice_read` are updating `f_ra.prev_pos`, which is a `loff_t` (64-bit). The point is to track the last read position for readahead decisions. My initial thought is, is this concurrent access a problem? Both functions update `f_ra.prev_pos` to track the last read position for readahead decisions. The updates happen within the read loops in these functions. I realize that multiple threads can read from the same file descriptor (meaning they can share the `struct file` that contains `f_ra`). The VFS layer usually doesn't provide explicit locking for concurrent reads for performance reasons unless the filesystem itself implements it. Crucially, neither `filemap_read` nor `filemap_splice_read` appears to be taking the `inode->i_rwsem` lock in all cases, which could provide protection. `generic_file_read_iter` calls `filemap_read`, and it doesn't take the `inode_lock` either for buffered reads. `filemap_splice_read` also isn't using it. So, if two threads are reading from the same file, they are both updating `f_ra.prev_pos` simultaneously. Considering `prev_pos` is just a hint for readahead, the worst-case scenario should be suboptimal readahead behavior, but ideally, no memory corruption or crashes. The write itself should be atomic since `loff_t` is 64-bit and aligned on x86_64, right? The system in question is 64-bit. But, KCSAN reported a data race. `prev_pos` value is changing from `0xd8` to `0x14f0`, which is definitely a conflict. On 32-bit systems, where the atomic-ness of writes might not be guaranteed, this could be a more severe problem, but here we are on x86_64. Okay, let me double-check how these functions are handling the `iocb` structure. In `filemap_read`, the `iocb` is passed in, while in `filemap_splice_read`, a local `iocb` is initialized. However, the `in->f_ra` is shared. So, they both access `in->f_ra`, in other words, they access the same `struct file`, hence the same `f_ra`. So the question becomes: is `f_ra` itself protected by any lock? I understand that readahead is often treated as best-effort, and access to it might not be strictly synchronized to avoid contention on critical paths. I should probably check if there are any other places where `f_ra` is being accessed and if any locking is involved there. `page_cache_sync_ra` and `page_cache_async_ra` use `f_ra`, so I should dive into that. I'll look at the `page_cache_sync_ra` function in `mm/readahead.c` to see if there are any hints.
11/2	2026/02/26 05:12	tool	read-file	0m Args: map[File:mm/readahead.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0-only /* * mm/readahead.c - address_space-level file readahead. * * Copyright (C) 2002, Linus Torvalds * * 09Apr2002 Andrew Morton * Initial version. / /* * DOC: Readahead Overview * * Readahead is used to read content into the page cache before it is * explicitly requested by the application. Readahead only ever * attempts to read folios that are not yet in the page cache. If a * folio is present but not up-to-date, readahead will not try to read * it. In that case a simple ->read_folio() will be requested. * * Readahead is triggered when an application read request (whether a * system call or a page fault) finds that the requested folio is not in * the page cache, or that it is in the page cache and has the * readahead flag set. This flag indicates that the folio was read * as part of a previous readahead request and now that it has been * accessed, it is time for the next readahead. * * Each readahead request is partly synchronous read, and partly async * readahead. This is reflected in the struct file_ra_state which * contains ->size being the total number of pages, and ->async_size * which is the number of pages in the async section. The readahead * flag will be set on the first folio in this async section to trigger * a subsequent readahead. Once a series of sequential reads has been * established, there should be no need for a synchronous component and * all readahead request will be fully asynchronous. * * When either of the triggers causes a readahead, three numbers need * to be determined: the start of the region to read, the size of the * region, and the size of the async tail. * * The start of the region is simply the first page address at or after * the accessed address, which is not currently populated in the page * cache. This is found with a simple search in the page cache. * * The size of the async tail is determined by subtracting the size that * was explicitly requested from the determined request size, unless * this would be less than zero - then zero is used. NOTE THIS * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED * PAGE. ALSO THIS CALCULATION IS NOT USED CONSISTENTLY. * * The size of the region is normally determined from the size of the * previous readahead which loaded the preceding pages. This may be * discovered from the struct file_ra_state for simple sequential reads, * or from examining the state of the page cache when multiple * sequential reads are interleaved. Specifically: where the readahead * was triggered by the readahead flag, the size of the previous * readahead is assumed to be the number of pages from the triggering * page to the start of the new readahead. In these cases, the size of * the previous readahead is scaled, often doubled, for the new * readahead, though see get_next_ra_size() for details. * * If the size of the previous read cannot be determined, the number of * preceding pages in the page cache is used to estimate the size of * a previous read. This estimate could easily be misled by random * reads being coincidentally adjacent, so it is ignored unless it is * larger than the current request, and it is not scaled up, unless it * is at the start of file. * * In general readahead is accelerated at the start of the file, as * reads from there are often sequential. There are other minor * adjustments to the readahead size in various special cases and these * are best discovered by reading the code. * * The above calculation, based on the previous readahead size, * determines the size of the readahead, to which any requested read * size may be added. * * Readahead requests are sent to the filesystem using the ->readahead() * address space operation, for which mpage_readahead() is a canonical * implementation. ->readahead() should normally initiate reads on all * folios, but may fail to read any or all folios without causing an I/O * error. The page cache reading code will issue a ->read_folio() request * for any folio which ->readahead() did not read, and only an error * from this will be final. * * ->readahead() will generally call readahead_folio() repeatedly to get * each folio from those prepared for readahead. It may fail to read a * folio by: * * * not calling readahead_folio() sufficiently many times, effectively * ignoring some folios, as might be appropriate if the path to * storage is congested. * * * failing to actually submit a read request for a given folio, * possibly due to insufficient resources, or * * * getting an error during subsequent processing of a request. * * In the last two cases, the folio should be unlocked by the filesystem * to indicate that the read attempt has failed. In the first case the * folio will be unlocked by the VFS. * * Those folios not in the final ``async_size`` of the request should be * considered to be important and ->readahead() should not fail them due * to congestion or temporary resource unavailability, but should wait * for necessary resources (e.g. memory or indexing information) to * become available. Folios in the final ``async_size`` may be * considered less urgent and failure to read them is more acceptable. * In this case it is best to use filemap_remove_folio() to remove the * folios from the page cache as is automatically done for folios that * were not fetched with readahead_folio(). This will allow a * subsequent synchronous readahead request to try them again. If they * are left in the page cache, then they will be read individually using * ->read_folio() which may be less efficient. / #include <linux/blkdev.h> #include <linux/kernel.h> #include <linux/dax.h> #include <linux/gfp.h> #include <linux/export.h> #include <linux/backing-dev.h> #include <linux/task_io_accounting_ops.h> #include <linux/pagemap.h> #include <linux/psi.h> #include <linux/syscalls.h> #include <linux/file.h> #include <linux/mm_inline.h> #include <linux/blk-cgroup.h> #include <linux/fadvise.h> #include <linux/sched/mm.h> #define CREATE_TRACE_POINTS #include <trace/events/readahead.h> #include "internal.h" / * Initialise a struct file's readahead state. Assumes that the caller has * memset ra to zero. / void file_ra_state_init(struct file_ra_state ra, struct address_space mapping) { ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages; ra->prev_pos = -1; } EXPORT_SYMBOL_GPL(file_ra_state_init); static void read_pages(struct readahead_control rac) { const struct address_space_operations aops = rac->mapping->a_ops; struct folio folio; struct blk_plug plug; if (!readahead_count(rac)) return; if (unlikely(rac->_workingset)) psi_memstall_enter(&rac->_pflags); blk_start_plug(&plug); if (aops->readahead) { aops->readahead(rac); / Clean up the remaining folios. / while ((folio = readahead_folio(rac)) != NULL) { folio_get(folio); filemap_remove_folio(folio); folio_unlock(folio); folio_put(folio); } } else { while ((folio = readahead_folio(rac)) != NULL) aops->read_folio(rac->file, folio); } blk_finish_plug(&plug); if (unlikely(rac->_workingset)) psi_memstall_leave(&rac->_pflags); rac->_workingset = false; BUG_ON(readahead_count(rac)); } static struct folio ractl_alloc_folio(struct readahead_control ractl, gfp_t gfp_mask, unsigned int order) { struct folio folio; folio = filemap_alloc_folio(gfp_mask, order, NULL); if (folio && ractl->dropbehind) __folio_set_dropbehind(folio); return folio; } /** * page_cache_ra_unbounded - Start unchecked readahead. * @ractl: Readahead control. * @nr_to_read: The number of pages to read. * @lookahead_size: Where to start the next readahead. * * This function is for filesystems to call when they want to start * readahead beyond a file's stated i_size. This is almost certainly * not the function you want to call. Use page_cache_async_readahead() * or page_cache_sync_readahead() instead. * * Context: File is referenced by caller, and ractl->mapping->invalidate_lock * must be held by the caller at least in shared mode. Mutexes may be held by * caller. May sleep, but will not reenter filesystem to reclaim memory. / void page_cache_ra_unbounded(struct readahead_control ractl, unsigned long nr_to_read, unsigned long lookahead_size) { struct address_space mapping = ractl->mapping; unsigned long index = readahead_index(ractl); gfp_t gfp_mask = readahead_gfp_mask(mapping); unsigned long mark = ULONG_MAX, i = 0; unsigned int min_nrpages = mapping_min_folio_nrpages(mapping); / * Partway through the readahead operation, we will have added * locked pages to the page cache, but will not yet have submitted * them for I/O. Adding another page may need to allocate memory, * which can trigger memory reclaim. Telling the VM we're in * the middle of a filesystem operation will cause it to not * touch file-backed pages, preventing a deadlock. Most (all?) * filesystems already specify __GFP_NOFS in their mapping's * gfp_mask, but let's be explicit here. / unsigned int nofs = memalloc_nofs_save(); lockdep_assert_held(&mapping->invalidate_lock); trace_page_cache_ra_unbounded(mapping->host, index, nr_to_read, lookahead_size); index = mapping_align_index(mapping, index); / * As iterator `i` is aligned to min_nrpages, round_up the * difference between nr_to_read and lookahead_size to mark the * index that only has lookahead or "async_region" to set the * readahead flag. / if (lookahead_size <= nr_to_read) { unsigned long ra_folio_index; ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size, min_nrpages); mark = ra_folio_index - index; } nr_to_read += readahead_index(ractl) - index; ractl->_index = index; / * Preallocate as many pages as we will need. / while (i < nr_to_read) { struct folio folio = xa_load(&mapping->i_pages, index + i); int ret; if (folio && !xa_is_value(folio)) { /* * Page already present? Kick off the current batch * of contiguous pages before continuing with the * next batch. This page may be the one we would * have intended to mark as Readahead, but we don't * have a stable reference to this page, and it's * not worth getting one just for that. / read_pages(ractl); ractl->_index += min_nrpages; i = ractl->_index + ractl->_nr_pages - index; continue; } folio = ractl_alloc_folio(ractl, gfp_mask, mapping_min_folio_order(mapping)); if (!folio) break; ret = filemap_add_folio(mapping, folio, index + i, gfp_mask); if (ret < 0) { folio_put(folio); if (ret == -ENOMEM) break; read_pages(ractl); ractl->_index += min_nrpages; i = ractl->_index + ractl->_nr_pages - index; continue; } if (i == mark) folio_set_readahead(folio); ractl->_workingset \|= folio_test_workingset(folio); ractl->_nr_pages += min_nrpages; i += min_nrpages; } / * Now start the IO. We ignore I/O errors - if the folio is not * uptodate then the caller will launch read_folio again, and * will then handle the error. / read_pages(ractl); memalloc_nofs_restore(nofs); } EXPORT_SYMBOL_GPL(page_cache_ra_unbounded); / * do_page_cache_ra() actually reads a chunk of disk. It allocates * the pages first, then submits them for I/O. This avoids the very bad * behaviour which would occur if page allocations are causing VM writeback. * We really don't want to intermingle reads and writes like that. / static void do_page_cache_ra(struct readahead_control ractl, unsigned long nr_to_read, unsigned long lookahead_size) { struct address_space mapping = ractl->mapping; unsigned long index = readahead_index(ractl); loff_t isize = i_size_read(mapping->host); pgoff_t end_index; / The last page we want to read / if (isize == 0) return; end_index = (isize - 1) >> PAGE_SHIFT; if (index > end_index) return; / Don't read past the page containing the last byte of the file / if (nr_to_read > end_index - index) nr_to_read = end_index - index + 1; filemap_invalidate_lock_shared(mapping); page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size); filemap_invalidate_unlock_shared(mapping); } / * Chunk the readahead into 2 megabyte units, so that we don't pin too much * memory at once. / void force_page_cache_ra(struct readahead_control ractl, unsigned long nr_to_read) { struct address_space mapping = ractl->mapping; struct file_ra_state ra = ractl->ra; struct backing_dev_info bdi = inode_to_bdi(mapping->host); unsigned long max_pages; if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead)) return; / * If the request exceeds the readahead window, allow the read to * be up to the optimal hardware IO size / max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages); nr_to_read = min_t(unsigned long, nr_to_read, max_pages); while (nr_to_read) { unsigned long this_chunk = (2 1024 * 1024) / PAGE_SIZE; if (this_chunk > nr_to_read) this_chunk = nr_to_read; do_page_cache_ra(ractl, this_chunk, 0); nr_to_read -= this_chunk; } } /* * Set the initial window size, round to next power of 2 and square * for small size, x 4 for medium, and x 2 for large * for 128k (32 page) max ra * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial / static unsigned long get_init_ra_size(unsigned long size, unsigned long max) { unsigned long newsize = roundup_pow_of_two(size); if (newsize <= max / 32) newsize = newsize 4; else if (newsize <= max / 4) newsize = newsize * 2; else newsize = max; return newsize; } /* * Get the previous window size, ramp it up, and * return it as the new window size. / static unsigned long get_next_ra_size(struct file_ra_state ra, unsigned long max) { unsigned long cur = ra->size; if (cur < max / 16) return 4 * cur; if (cur <= max / 2) return 2 * cur; return max; } /* * On-demand readahead design. * * The fields in struct file_ra_state represent the most-recently-executed * readahead attempt: * * \|<----- async_size ---------\| * \|------------------- size -------------------->\| * \|==================#===========================\| * ^start ^page marked with PG_readahead * * To overlap application thinking time and disk I/O time, we do * `readahead pipelining': Do not wait until the application consumed all * readahead pages and stalled on the missing page at readahead_index; * Instead, submit an asynchronous readahead I/O as soon as there are * only async_size pages left in the readahead window. Normally async_size * will be equal to size, for maximum pipelining. * * In interleaved sequential reads, concurrent streams on the same fd can * be invalidating each other's readahead state. So we flag the new readahead * page at (start+size-async_size) with PG_readahead, and use it as readahead * indicator. The flag won't be set on already cached pages, to avoid the * readahead-for-nothing fuss, saving pointless page cache lookups. * * prev_pos tracks the last visited byte in the _previous_ read request. * It should be maintained by the caller, and will be used for detecting * small random reads. Note that the readahead algorithm checks loosely * for sequential patterns. Hence interleaved reads might be served as * sequential ones. * * There is a special-case: if the first page which the application tries to * read happens to be the first page of the file, it is assumed that a linear * read is about to happen and the window is immediately set to the initial size * based on I/O request size and the max_readahead. * * The code ramps up the readahead size aggressively at first, but slow down as * it approaches max_readahead. / static inline int ra_alloc_folio(struct readahead_control ractl, pgoff_t index, pgoff_t mark, unsigned int order, gfp_t gfp) { int err; struct folio folio = ractl_alloc_folio(ractl, gfp, order); if (!folio) return -ENOMEM; mark = round_down(mark, 1UL << order); if (index == mark) folio_set_readahead(folio); err = filemap_add_folio(ractl->mapping, folio, index, gfp); if (err) { folio_put(folio); return err; } ractl->_nr_pages += 1UL << order; ractl->_workingset \|= folio_test_workingset(folio); return 0; } void page_cache_ra_order(struct readahead_control ractl, struct file_ra_state ra) { struct address_space mapping = ractl->mapping; pgoff_t start = readahead_index(ractl); pgoff_t index = start; unsigned int min_order = mapping_min_folio_order(mapping); pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT; pgoff_t mark = index + ra->size - ra->async_size; unsigned int nofs; int err = 0; gfp_t gfp = readahead_gfp_mask(mapping); unsigned int new_order = ra->order; trace_page_cache_ra_order(mapping->host, start, ra); if (!mapping_large_folio_support(mapping)) { ra->order = 0; goto fallback; } limit = min(limit, index + ra->size - 1); new_order = min(mapping_max_folio_order(mapping), new_order); new_order = min_t(unsigned int, new_order, ilog2(ra->size)); new_order = max(new_order, min_order); ra->order = new_order; /* See comment in page_cache_ra_unbounded() / nofs = memalloc_nofs_save(); filemap_invalidate_lock_shared(mapping); / * If the new_order is greater than min_order and index is * already aligned to new_order, then this will be noop as index * aligned to new_order should also be aligned to min_order. / ractl->_index = mapping_align_index(mapping, index); index = readahead_index(ractl); while (index <= limit) { unsigned int order = new_order; / Align with smaller pages if needed / if (index & ((1UL << order) - 1)) order = __ffs(index); / Don't allocate pages past EOF / while (order > min_order && index + (1UL << order) - 1 > limit) order--; err = ra_alloc_folio(ractl, index, mark, order, gfp); if (err) break; index += 1UL << order; } read_pages(ractl); filemap_invalidate_unlock_shared(mapping); memalloc_nofs_restore(nofs); / * If there were already pages in the page cache, then we may have * left some gaps. Let the regular readahead code take care of this * situation below. / if (!err) return; fallback: / * ->readahead() may have updated readahead window size so we have to * check there's still something to read. / if (ra->size > index - start) do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size); } static unsigned long ractl_max_pages(struct readahead_control ractl, unsigned long req_size) { struct backing_dev_info bdi = inode_to_bdi(ractl->mapping->host); unsigned long max_pages = ractl->ra->ra_pages; / * If the request exceeds the readahead window, allow the read to * be up to the optimal hardware IO size / if (req_size > max_pages && bdi->io_pages > max_pages) max_pages = min(req_size, bdi->io_pages); return max_pages; } void page_cache_sync_ra(struct readahead_control ractl, unsigned long req_count) { pgoff_t index = readahead_index(ractl); bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM); struct file_ra_state ra = ractl->ra; unsigned long max_pages, contig_count; pgoff_t prev_index, miss; trace_page_cache_sync_ra(ractl->mapping->host, index, ra, req_count); / * Even if readahead is disabled, issue this request as readahead * as we'll need it to satisfy the requested range. The forced * readahead will do the right thing and limit the read to just the * requested range, which we'll set to 1 page for this case. / if (!ra->ra_pages \|\| blk_cgroup_congested()) { if (!ractl->file) return; req_count = 1; do_forced_ra = true; } / be dumb / if (do_forced_ra) { force_page_cache_ra(ractl, req_count); return; } max_pages = ractl_max_pages(ractl, req_count); prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT; / * A start of file, oversized read, or sequential cache miss: * trivial case: (index - prev_index) == 1 * unaligned reads: (index - prev_index) == 0 / if (!index \|\| req_count > max_pages \|\| index - prev_index <= 1UL) { ra->start = index; ra->size = get_init_ra_size(req_count, max_pages); ra->async_size = ra->size > req_count ? ra->size - req_count : ra->size >> 1; goto readit; } / * Query the page cache and look for the traces(cached history pages) * that a sequential stream would leave behind. / rcu_read_lock(); miss = page_cache_prev_miss(ractl->mapping, index - 1, max_pages); rcu_read_unlock(); contig_count = index - miss - 1; / * Standalone, small random read. Read as is, and do not pollute the * readahead state. / if (contig_count <= req_count) { do_page_cache_ra(ractl, req_count, 0); return; } / * File cached from the beginning: * it is a strong indication of long-run stream (or whole-file-read) / if (miss == ULONG_MAX) contig_count = 2; ra->start = index; ra->size = min(contig_count + req_count, max_pages); ra->async_size = 1; readit: ra->order = 0; ractl->_index = ra->start; page_cache_ra_order(ractl, ra); } EXPORT_SYMBOL_GPL(page_cache_sync_ra); void page_cache_async_ra(struct readahead_control ractl, struct folio folio, unsigned long req_count) { unsigned long max_pages; struct file_ra_state ra = ractl->ra; pgoff_t index = readahead_index(ractl); pgoff_t expected, start, end, aligned_end, align; / no readahead / if (!ra->ra_pages) return; / * Same bit is used for PG_readahead and PG_reclaim. / if (folio_test_writeback(folio)) return; trace_page_cache_async_ra(ractl->mapping->host, index, ra, req_count); folio_clear_readahead(folio); if (blk_cgroup_congested()) return; max_pages = ractl_max_pages(ractl, req_count); / * It's the expected callback index, assume sequential access. * Ramp up sizes, and push forward the readahead window. / expected = round_down(ra->start + ra->size - ra->async_size, folio_nr_pages(folio)); if (index == expected) { ra->start += ra->size; / * In the case of MADV_HUGEPAGE, the actual size might exceed * the readahead window. / ra->size = max(ra->size, get_next_ra_size(ra, max_pages)); goto readit; } / * Hit a marked folio without valid readahead state. * E.g. interleaved reads. * Query the pagecache for async_size, which normally equals to * readahead size. Ramp it up and use it as the new readahead size. / rcu_read_lock(); start = page_cache_next_miss(ractl->mapping, index + 1, max_pages); rcu_read_unlock(); if (!start \|\| start - index > max_pages) return; ra->start = start; ra->size = start - index; / old async_size / ra->size += req_count; ra->size = get_next_ra_size(ra, max_pages); readit: ra->order += 2; align = 1UL << min(ra->order, ffs(max_pages) - 1); end = ra->start + ra->size; aligned_end = round_down(end, align); if (aligned_end > ra->start) ra->size -= end - aligned_end; ra->async_size = ra->size; ractl->_index = ra->start; page_cache_ra_order(ractl, ra); } EXPORT_SYMBOL_GPL(page_cache_async_ra); ssize_t ksys_readahead(int fd, loff_t offset, size_t count) { struct file file; const struct inode inode; CLASS(fd, f)(fd); if (fd_empty(f)) return -EBADF; file = fd_file(f); if (!(file->f_mode & FMODE_READ)) return -EBADF; / * The readahead() syscall is intended to run only on files * that can execute readahead. If readahead is not possible * on this file, then we must return -EINVAL. / if (!file->f_mapping) return -EINVAL; if (!file->f_mapping->a_ops) return -EINVAL; inode = file_inode(file); if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode)) return -EINVAL; if (IS_ANON_FILE(inode)) return -EINVAL; return vfs_fadvise(fd_file(f), offset, count, POSIX_FADV_WILLNEED); } SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count) { return ksys_readahead(fd, offset, count); } #if defined(CONFIG_COMPAT) && defined(__ARCH_WANT_COMPAT_READAHEAD) COMPAT_SYSCALL_DEFINE4(readahead, int, fd, compat_arg_u64_dual(offset), size_t, count) { return ksys_readahead(fd, compat_arg_u64_glue(offset), count); } #endif /* * readahead_expand - Expand a readahead request * @ractl: The request to be expanded * @new_start: The revised start * @new_len: The revised size of the request * * Attempt to expand a readahead request outwards from the current size to the * specified size by inserting locked pages before and after the current window * to increase the size to the new window. This may involve the insertion of * THPs, in which case the window may get expanded even beyond what was * requested. * * The algorithm will stop if it encounters a conflicting page already in the * pagecache and leave a smaller expansion than requested. * * The caller must check for this by examining the revised @ractl object for a * different expansion than was requested. / void readahead_expand(struct readahead_control ractl, loff_t new_start, size_t new_len) { struct address_space mapping = ractl->mapping; struct file_ra_state ra = ractl->ra; pgoff_t new_index, new_nr_pages; gfp_t gfp_mask = readahead_gfp_mask(mapping); unsigned long min_nrpages = mapping_min_folio_nrpages(mapping); unsigned int min_order = mapping_min_folio_order(mapping); new_index = new_start / PAGE_SIZE; /* * Readahead code should have aligned the ractl->_index to * min_nrpages before calling readahead aops. / VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages)); / Expand the leading edge downwards / while (ractl->_index > new_index) { unsigned long index = ractl->_index - 1; struct folio folio = xa_load(&mapping->i_pages, index); if (folio && !xa_is_value(folio)) return; /* Folio apparently present / folio = ractl_alloc_folio(ractl, gfp_mask, min_order); if (!folio) return; index = mapping_align_index(mapping, index); if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) { folio_put(folio); return; } if (unlikely(folio_test_workingset(folio)) && !ractl->_workingset) { ractl->_workingset = true; psi_memstall_enter(&ractl->_pflags); } ractl->_nr_pages += min_nrpages; ractl->_index = folio->index; } new_len += new_start - readahead_pos(ractl); new_nr_pages = DIV_ROUND_UP(new_len, PAGE_SIZE); / Expand the trailing edge upwards / while (ractl->_nr_pages < new_nr_pages) { unsigned long index = ractl->_index + ractl->_nr_pages; struct folio folio = xa_load(&mapping->i_pages, index); if (folio && !xa_is_value(folio)) return; /* Folio apparently present */ folio = ractl_alloc_folio(ractl, gfp_mask, min_order); if (!folio) return; index = mapping_align_index(mapping, index); if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) { folio_put(folio); return; } if (unlikely(folio_test_workingset(folio)) && !ractl->_workingset) { ractl->_workingset = true; psi_memstall_enter(&ractl->_pflags); } ractl->_nr_pages += min_nrpages; if (ra) { ra->size += min_nrpages; ra->async_size += min_nrpages; } } } EXPORT_SYMBOL(readahead_expand); ]
12/2	2026/02/26 05:12	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 65043 output: 17 thoughts: 1820 Thoughts: Analyzing the Data Race on `file_ra_state->prev_pos` Okay, so I'm digging into this data race on `prev_pos` within the `file_ra_state` struct. From what I'm seeing, `prev_pos` is primarily used as a hint for the readahead algorithm, specifically to detect sequential access patterns. It's written in `filemap_read` and `filemap_splice_read`, and read in `page_cache_sync_ra`. The issue arises when multiple threads are reading from the same file using a shared file descriptor; they're all vying to update `prev_pos`. My initial thought is that this is likely a benign data race. Here's why: `prev_pos` is a `loff_t` (64-bit on x86_64), and its offset within `file_ra_state` (offset 24) is 8-byte aligned. This means writes to it should be atomic on x86_64. So, a corrupted write is unlikely. Even if the value is incorrect, it just affects readahead. It's not part of the file's correct contents. Looking at the code, in `page_cache_sync_ra`, it's used to calculate `prev_index`. This is then compared with the current `index` to determine if readahead should be triggered. Specifically: `if (!index \|\| req_count > max_pages \|\| index - prev_index <= 1UL)`. A corrupted `prev_pos` could lead to an incorrect `prev_index`, and hence, a wrong readahead decision. But, if `index` and `prev_index` are both `pgoff_t` and `index < prev_index`, `index - prev_index <= 1UL` will be false as there will be a large number after the subtraction. The readahead decision would simply be missed, or occasionally triggered when it could have been skipped. That's not a correctness issue, just a performance one. The kernel developers seem aware of this "interleaved reads" behaviour, which is confirmed by the comment in `mm/readahead.c`. There's no lock, and no atomic operations. This reinforces the idea that it's a trade-off. Readahead is a performance optimization. Not a necessity. So this looks like a classic "benign" data race on a statistics/hint field. However, I'm making sure to check for any other uses of `prev_pos`. The report only mentions this. I've also noted that `filemap_splice_read` updates `prev_pos` within its loop, while `filemap_read` updates it after its loop. This is consistent across both functions. If I were going to write to it, I'd probably write the most recent value after I'm done processing the data, so it makes sense to me. I also checked to make sure that the subtraction done with the `index` and `prev_index` in `page_cache_sync_ra` does not lead to any underflow, and this is confirmed. The condition is just used to decide whether to do read ahead. In summary, I think KCSAN reported a race that is unlikely to be harmful, given how `prev_pos` is used, the atomicity of writes on x86_64, and the lack of other uses of this value.
13/2	2026/02/26 05:13	tool	grepper	0m Args: map[Expression:prev_pos] Results: map[Output:drivers/gpu/drm/xe/xe_vm.c=2796=static void xe_vma_op_unwind(struct xe_vm vm, struct xe_vma_op op, drivers/gpu/drm/xe/xe_vm.c:2797: bool post_commit, bool prev_post_commit, drivers/gpu/drm/xe/xe_vm.c-2798- bool next_post_commit) -- drivers/gpu/drm/xe/xe_vm.c-2826- if (op->remap.prev) { drivers/gpu/drm/xe/xe_vm.c:2827: prep_vma_destroy(vm, op->remap.prev, prev_post_commit); drivers/gpu/drm/xe/xe_vm.c-2828- xe_vma_destroy_unlocked(op->remap.prev); -- drivers/media/platform/samsung/exynos4-is/fimc-is.h=160=struct is_af_info { -- drivers/media/platform/samsung/exynos4-is/fimc-is.h-167- u16 pos_y; drivers/media/platform/samsung/exynos4-is/fimc-is.h:168: u16 prev_pos_x; drivers/media/platform/samsung/exynos4-is/fimc-is.h:169: u16 prev_pos_y; drivers/media/platform/samsung/exynos4-is/fimc-is.h-170- u16 use_af; -- drivers/pci/pci.h=117=bool pcie_cap_has_rtctl(const struct pci_dev dev); -- drivers/pci/pci.h-138- u8 __id, __found_pos = 0; \ drivers/pci/pci.h:139: u8 __prev_pos = (start); \ drivers/pci/pci.h-140- u8 __pos = (start); \ -- drivers/pci/pci.h-158- if (prev_ptr != NULL) \ drivers/pci/pci.h:159: (u8 )prev_ptr = __prev_pos; \ drivers/pci/pci.h-160- break; \ -- drivers/pci/pci.h-162- \ drivers/pci/pci.h:163: __prev_pos = __pos; \ drivers/pci/pci.h-164- __pos = FIELD_GET(PCI_CAP_LIST_NEXT_MASK, __ent); \ -- drivers/pci/pci.h-189- u16 __found_pos = 0; \ drivers/pci/pci.h:190: u16 __prev_pos; \ drivers/pci/pci.h-191- int __ttl, __ret; \ -- drivers/pci/pci.h-193- \ drivers/pci/pci.h:194: __prev_pos = __pos; \ drivers/pci/pci.h-195- __ttl = (PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) / 8; \ -- drivers/pci/pci.h-206- if (prev_ptr != NULL) \ drivers/pci/pci.h:207: (u16 )prev_ptr = __prev_pos; \ drivers/pci/pci.h-208- break; \ -- drivers/pci/pci.h-210- \ drivers/pci/pci.h:211: __prev_pos = __pos; \ drivers/pci/pci.h-212- __pos = PCI_EXT_CAP_NEXT(__header); \ -- drivers/scsi/aic7xxx/aic79xx_core.c=8000=ahd_qinfifo_requeue_tail(struct ahd_softc ahd, struct scb scb) -- drivers/scsi/aic7xxx/aic79xx_core.c-8009- u_int prev_tag; drivers/scsi/aic7xxx/aic79xx_core.c:8010: u_int prev_pos; drivers/scsi/aic7xxx/aic79xx_core.c-8011- drivers/scsi/aic7xxx/aic79xx_core.c:8012: prev_pos = AHD_QIN_WRAP(ahd->qinfifonext - 1); drivers/scsi/aic7xxx/aic79xx_core.c:8013: prev_tag = ahd->qinfifo[prev_pos]; drivers/scsi/aic7xxx/aic79xx_core.c-8014- prev_scb = ahd_lookup_scb(ahd, prev_tag); -- drivers/scsi/aic7xxx/aic7xxx_core.c=5745=ahc_qinfifo_requeue_tail(struct ahc_softc ahc, struct scb scb) -- drivers/scsi/aic7xxx/aic7xxx_core.c-5751- u_int prev_tag; drivers/scsi/aic7xxx/aic7xxx_core.c:5752: uint8_t prev_pos; drivers/scsi/aic7xxx/aic7xxx_core.c-5753- drivers/scsi/aic7xxx/aic7xxx_core.c:5754: prev_pos = ahc->qinfifonext - 1; drivers/scsi/aic7xxx/aic7xxx_core.c:5755: prev_tag = ahc->qinfifo[prev_pos]; drivers/scsi/aic7xxx/aic7xxx_core.c-5756- prev_scb = ahc_lookup_scb(ahc, prev_tag); -- fs/ext4/dir.c=130=static int ext4_readdir(struct file file, struct dir_context ctx) -- fs/ext4/dir.c-202- 1 << EXT4_SB(sb)->s_min_folio_order); fs/ext4/dir.c:203: file->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT; fs/ext4/dir.c-204- bh = ext4_bread(NULL, inode, map.m_lblk, 0); -- fs/fuse/file.c=2162=static bool fuse_folios_need_send(struct fuse_conn fc, loff_t pos, -- fs/fuse/file.c-2168- unsigned bytes = cur_bytes + len; fs/fuse/file.c:2169: loff_t prev_pos; fs/fuse/file.c-2170- size_t max_bytes = write ? fc->max_write : fc->max_read; -- fs/fuse/file.c-2183- prev_desc = ap->descs[ap->num_folios - 1]; fs/fuse/file.c:2184: prev_pos = folio_pos(prev_folio) + prev_desc.offset + prev_desc.length; fs/fuse/file.c:2185: if (prev_pos != pos) fs/fuse/file.c-2186- return true; -- fs/ntfs3/fsntfs.c=1184=int ntfs_read_run_nb_ra(struct ntfs_sb_info sbi, const struct runs_tree run, -- fs/ntfs3/fsntfs.c-1241- index, 1); fs/ntfs3/fsntfs.c:1242: ra->prev_pos = (loff_t)index << PAGE_SHIFT; fs/ntfs3/fsntfs.c-1243- } -- fs/splice.c=1025=ssize_t splice_direct_to_actor(struct file in, struct splice_desc sd, -- fs/splice.c-1081- size_t read_len; fs/splice.c:1082: loff_t pos = sd->pos, prev_pos = pos; fs/splice.c-1083- -- fs/splice.c-1104- if (unlikely(ret <= 0)) { fs/splice.c:1105: sd->pos = prev_pos; fs/splice.c-1106- goto out_release; -- fs/splice.c-1113- if (ret < read_len) { fs/splice.c:1114: sd->pos = prev_pos + ret; fs/splice.c-1115- goto out_release; -- include/linux/fs.h=1188=struct fown_struct { -- include/linux/fs.h-1206- * @mmap_miss: How many mmap accesses missed in the page cache. include/linux/fs.h:1207: * @prev_pos: The last byte in the most recent read request. include/linux/fs.h-1208- * -- include/linux/fs.h=1212=struct file_ra_state { -- include/linux/fs.h-1218- unsigned short mmap_miss; include/linux/fs.h:1219: loff_t prev_pos; include/linux/fs.h-1220-}; -- include/trace/events/readahead.h=76=DECLARE_EVENT_CLASS(page_cache_ra_op, -- include/trace/events/readahead.h-90- __field(unsigned int, mmap_miss) include/trace/events/readahead.h:91: __field(loff_t, prev_pos) include/trace/events/readahead.h-92- __field(unsigned long, req_count) -- include/trace/events/readahead.h-103- __entry->mmap_miss = ra->mmap_miss; include/trace/events/readahead.h:104: __entry->prev_pos = ra->prev_pos; include/trace/events/readahead.h-105- __entry->req_count = req_count; -- include/trace/events/readahead.h-108- TP_printk( include/trace/events/readahead.h:109: "dev=%d:%d ino=%lx index=%lu req_count=%lu order=%u size=%u async_size=%u ra_pages=%u mmap_miss=%u prev_pos=%lld", include/trace/events/readahead.h-110- MAJOR(__entry->s_dev), MINOR(__entry->s_dev), __entry->i_ino, -- include/trace/events/readahead.h-112- __entry->size, __entry->async_size, __entry->ra_pages, include/trace/events/readahead.h:113: __entry->mmap_miss, __entry->prev_pos include/trace/events/readahead.h-114- ) -- lib/xz/xz_dec_bcj.c=88=static size_t bcj_x86(struct xz_dec_bcj s, uint8_t buf, size_t size) -- lib/xz/xz_dec_bcj.c-95- size_t i; lib/xz/xz_dec_bcj.c:96: size_t prev_pos = (size_t)-1; lib/xz/xz_dec_bcj.c-97- uint32_t prev_mask = s->x86_prev_mask; -- lib/xz/xz_dec_bcj.c-110- lib/xz/xz_dec_bcj.c:111: prev_pos = i - prev_pos; lib/xz/xz_dec_bcj.c:112: if (prev_pos > 3) { lib/xz/xz_dec_bcj.c-113- prev_mask = 0; lib/xz/xz_dec_bcj.c-114- } else { lib/xz/xz_dec_bcj.c:115: prev_mask = (prev_mask << (prev_pos - 1)) & 7; lib/xz/xz_dec_bcj.c-116- if (prev_mask != 0) { -- lib/xz/xz_dec_bcj.c-119- \|\| bcj_x86_test_msbyte(b)) { lib/xz/xz_dec_bcj.c:120: prev_pos = i; lib/xz/xz_dec_bcj.c-121- prev_mask = (prev_mask << 1) \| 1; -- lib/xz/xz_dec_bcj.c-126- lib/xz/xz_dec_bcj.c:127: prev_pos = i; lib/xz/xz_dec_bcj.c-128- -- lib/xz/xz_dec_bcj.c-152- lib/xz/xz_dec_bcj.c:153: prev_pos = i - prev_pos; lib/xz/xz_dec_bcj.c:154: s->x86_prev_mask = prev_pos > 3 ? 0 : prev_mask << (prev_pos - 1); lib/xz/xz_dec_bcj.c-155- return i; -- mm/filemap.c=2763=ssize_t filemap_read(struct kiocb iocb, struct iov_iter iter, -- mm/filemap.c-2773- loff_t isize, end_offset; mm/filemap.c:2774: loff_t last_pos = ra->prev_pos; mm/filemap.c-2775- -- mm/filemap.c-2872- file_accessed(filp); mm/filemap.c:2873: ra->prev_pos = last_pos; mm/filemap.c-2874- return already_read ? already_read : error; -- mm/filemap.c=3048=ssize_t filemap_splice_read(struct file in, loff_t ppos, -- mm/filemap.c-3124- ppos += n; mm/filemap.c:3125: in->f_ra.prev_pos = ppos; mm/filemap.c-3126- if (pipe_is_full(pipe)) -- mm/readahead.c=142=file_ra_state_init(struct file_ra_state ra, struct address_space mapping) -- mm/readahead.c-144- ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages; mm/readahead.c:145: ra->prev_pos = -1; mm/readahead.c-146-} -- mm/readahead.c=394=static unsigned long get_next_ra_size(struct file_ra_state ra, -- mm/readahead.c-429- mm/readahead.c:430: * prev_pos tracks the last visited byte in the _previous_ read request. mm/readahead.c-431- * It should be maintained by the caller, and will be used for detecting -- mm/readahead.c=557=void page_cache_sync_ra(struct readahead_control ractl, -- mm/readahead.c-586- max_pages = ractl_max_pages(ractl, req_count); mm/readahead.c:587: prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT; mm/readahead.c-588- / -- mm/shmem.c=3553=static ssize_t shmem_file_splice_read(struct file in, loff_t ppos, -- mm/shmem.c-3650- ppos += n; mm/shmem.c:3651: in->f_ra.prev_pos = ppos; mm/shmem.c-3652- if (pipe_is_full(pipe)) -- sound/soc/intel/avs/core.c=239=static void hdac_stream_update_pos(struct hdac_stream stream, u64 buffer_size) sound/soc/intel/avs/core.c-240-{ sound/soc/intel/avs/core.c:241: u64 prev_pos, pos, num_bytes; sound/soc/intel/avs/core.c-242- sound/soc/intel/avs/core.c:243: div64_u64_rem(stream->curr_pos, buffer_size, &prev_pos); sound/soc/intel/avs/core.c-244- pos = snd_hdac_stream_get_pos_posbuf(stream); sound/soc/intel/avs/core.c-245- sound/soc/intel/avs/core.c:246: if (pos < prev_pos) sound/soc/intel/avs/core.c:247: num_bytes = (buffer_size - prev_pos) + pos; sound/soc/intel/avs/core.c-248- else sound/soc/intel/avs/core.c:249: num_bytes = pos - prev_pos; sound/soc/intel/avs/core.c-250- -- sound/soc/sof/compress.c=15=static void sof_set_transferred_bytes(struct sof_compr_stream sstream, -- sound/soc/sof/compress.c-17-{ sound/soc/sof/compress.c:18: u64 prev_pos; sound/soc/sof/compress.c-19- unsigned int copied; sound/soc/sof/compress.c-20- sound/soc/sof/compress.c:21: div64_u64_rem(sstream->copied_total, buffer_size, &prev_pos); sound/soc/sof/compress.c-22- sound/soc/sof/compress.c:23: if (host_pos < prev_pos) sound/soc/sof/compress.c:24: copied = (buffer_size - prev_pos) + host_pos; sound/soc/sof/compress.c-25- else sound/soc/sof/compress.c:26: copied = host_pos - prev_pos; sound/soc/sof/compress.c-27- -- sound/soc/sof/intel/hda-stream.c=800=hda_dsp_compr_bytes_transferred(struct hdac_stream *hstream, int direction) -- sound/soc/sof/intel/hda-stream.c-802- u64 buffer_size = hstream->bufsize; sound/soc/sof/intel/hda-stream.c:803: u64 prev_pos, pos, num_bytes; sound/soc/sof/intel/hda-stream.c-804- sound/soc/sof/intel/hda-stream.c:805: div64_u64_rem(hstream->curr_pos, buffer_size, &prev_pos); sound/soc/sof/intel/hda-stream.c-806- pos = hda_dsp_stream_get_position(hstream, direction, false); sound/soc/sof/intel/hda-stream.c-807- sound/soc/sof/intel/hda-stream.c:808: if (pos < prev_pos) sound/soc/sof/intel/hda-stream.c:809: num_bytes = (buffer_size - prev_pos) + pos; sound/soc/sof/intel/hda-stream.c-810- else sound/soc/sof/intel/hda-stream.c:811: num_bytes = pos - prev_pos; sound/soc/sof/intel/hda-stream.c-812- ]
14/2	2026/02/26 05:13	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 72916 output: 20 thoughts: 4888 Thoughts: Benign Data Race in `struct file_ra_state` Okay, here's the deal. I've been looking into this KCSAN report about a data race on the `prev_pos` field within the `struct file_ra_state` (`f_ra`) structure. From what I've gathered, this race is expected and, for the most part, harmless. Essentially, `prev_pos`, along with other fields like `start`, `size`, `async_size`, `order`, and `mmap_miss` in `f_ra`, are used to optimize readahead. They are read and updated without explicit locking, which causes the data race. The key functions involved in the readahead process are `filemap_read`, `filemap_splice_read`, and `page_cache_sync_ra`, and a few others. The concern is this: if multiple threads use the same file descriptor (and thus share the `f_ra` structure), they can interfere with each other's readahead hints. For example, `filemap_splice_read` and `filemap_read` concurrently updating `prev_pos`. I initially worried that this could lead to out-of-bounds reads or other serious issues, but after tracing the readahead logic, I'm confident that the consequences are limited to inefficient readahead. The calculations of the read range are bounded by the file size, and the `order` is sanitized to stay within reasonable limits. There is code in `page_cache_sync_ra` that tries to account for races. Specifically, integer overflows due to the large negative value `-1` initially assigned to `prev_pos` are handled correctly, as there is a check in place. The race is between two writers, updating a hint. Each write is, essentially, the most recent reading position of a thread using this shared file descriptor. Because these are all advancing positions within a file, this race should not have any problematic consequences. I did a deep dive and looked at the code paths that use `prev_pos`, the values that it is used with, and how they relate. This includes how values of read position are handled. Based on my analysis, these values are used to decide whether the fast or slow path is followed. The readahead state is updated in either case. Even with the "wrong" `prev_pos` value, it still converges to the readahead state. I also double-checked whether any other `f_ra` fields are similarly raced, and yes, `mmap_miss` is also updated without locks, using `READ_ONCE` and `WRITE_ONCE`. While `WRITE_ONCE` is not enough to avoid data races, the races on `mmap_miss` are understood to be fine. This is likely an older code path. The key thing is that any "lost updates" only impact the readahead effectiveness, not correctness or security. I looked at the code path for splice, and verified that these races are acceptable, even in high concurrency scenarios. `prev_pos` is updated frequently in splice. `splice_direct_to_actor` in `fs/splice.c` uses a local variable that is separate from `in->f_ra.prev_pos`. The race also could happen in `splice`, but only with shared file descriptors, which do not have any implications for correctness, safety, or security. Regarding the KCSAN report itself, the "torn write" scenario (partially written values) isn't a concern on x86-64, because 8-byte writes are atomic. So, no torn writes or other issues like that. In conclusion, the data race is benign. It's a known trade-off for performance, and the code is structured to handle this shared mutable state reasonably well. While technically a race, it's not a bug.
15/2	2026/02/26 05:13	tool	read-file	0m Args: map[File:fs/splice.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0-only /* * "splice": joining two ropes together by interweaving their strands. * * This is the "extended pipe" functionality, where a pipe is used as * an arbitrary in-memory buffer. Think of a pipe as a small kernel * buffer that you can use to transfer data from one end to the other. * * The traditional unix read/write is extended with a "splice()" operation * that transfers data buffers to or from a pipe buffer. * * Named by Larry McVoy, original implementation from Linus, extended by * Jens to support splicing to files, network, direct splicing, etc and * fixing lots of bugs. * * Copyright (C) 2005-2006 Jens Axboe <axboe@kernel.dk> * Copyright (C) 2005-2006 Linus Torvalds <torvalds@osdl.org> * Copyright (C) 2006 Ingo Molnar <mingo@elte.hu> * / #include <linux/bvec.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/pagemap.h> #include <linux/splice.h> #include <linux/memcontrol.h> #include <linux/mm_inline.h> #include <linux/swap.h> #include <linux/writeback.h> #include <linux/export.h> #include <linux/syscalls.h> #include <linux/uio.h> #include <linux/fsnotify.h> #include <linux/security.h> #include <linux/gfp.h> #include <linux/net.h> #include <linux/socket.h> #include <linux/sched/signal.h> #include "internal.h" / * Splice doesn't support FMODE_NOWAIT. Since pipes may set this flag to * indicate they support non-blocking reads or writes, we must clear it * here if set to avoid blocking other users of this pipe if splice is * being done on it. / static noinline void pipe_clear_nowait(struct file file) { fmode_t fmode = READ_ONCE(file->f_mode); do { if (!(fmode & FMODE_NOWAIT)) break; } while (!try_cmpxchg(&file->f_mode, &fmode, fmode & ~FMODE_NOWAIT)); } /* * Attempt to steal a page from a pipe buffer. This should perhaps go into * a vm helper function, it's already simplified quite a bit by the * addition of remove_mapping(). If success is returned, the caller may * attempt to reuse this page for another destination. / static bool page_cache_pipe_buf_try_steal(struct pipe_inode_info pipe, struct pipe_buffer buf) { struct folio folio = page_folio(buf->page); struct address_space mapping; folio_lock(folio); mapping = folio_mapping(folio); if (mapping) { WARN_ON(!folio_test_uptodate(folio)); / * At least for ext2 with nobh option, we need to wait on * writeback completing on this folio, since we'll remove it * from the pagecache. Otherwise truncate wont wait on the * folio, allowing the disk blocks to be reused by someone else * before we actually wrote our data to them. fs corruption * ensues. / folio_wait_writeback(folio); if (!filemap_release_folio(folio, GFP_KERNEL)) goto out_unlock; / * If we succeeded in removing the mapping, set LRU flag * and return good. / if (remove_mapping(mapping, folio)) { buf->flags \|= PIPE_BUF_FLAG_LRU; return true; } } / * Raced with truncate or failed to remove folio from current * address space, unlock and return failure. / out_unlock: folio_unlock(folio); return false; } static void page_cache_pipe_buf_release(struct pipe_inode_info pipe, struct pipe_buffer buf) { put_page(buf->page); buf->flags &= ~PIPE_BUF_FLAG_LRU; } / * Check whether the contents of buf is OK to access. Since the content * is a page cache page, IO may be in flight. / static int page_cache_pipe_buf_confirm(struct pipe_inode_info pipe, struct pipe_buffer buf) { struct folio folio = page_folio(buf->page); int err; if (!folio_test_uptodate(folio)) { folio_lock(folio); /* * Folio got truncated/unhashed. This will cause a 0-byte * splice, if this is the first page. / if (!folio->mapping) { err = -ENODATA; goto error; } / * Uh oh, read-error from disk. / if (!folio_test_uptodate(folio)) { err = -EIO; goto error; } / Folio is ok after all, we are done / folio_unlock(folio); } return 0; error: folio_unlock(folio); return err; } const struct pipe_buf_operations page_cache_pipe_buf_ops = { .confirm = page_cache_pipe_buf_confirm, .release = page_cache_pipe_buf_release, .try_steal = page_cache_pipe_buf_try_steal, .get = generic_pipe_buf_get, }; static bool user_page_pipe_buf_try_steal(struct pipe_inode_info pipe, struct pipe_buffer buf) { if (!(buf->flags & PIPE_BUF_FLAG_GIFT)) return false; buf->flags \|= PIPE_BUF_FLAG_LRU; return generic_pipe_buf_try_steal(pipe, buf); } static const struct pipe_buf_operations user_page_pipe_buf_ops = { .release = page_cache_pipe_buf_release, .try_steal = user_page_pipe_buf_try_steal, .get = generic_pipe_buf_get, }; static void wakeup_pipe_readers(struct pipe_inode_info pipe) { smp_mb(); if (waitqueue_active(&pipe->rd_wait)) wake_up_interruptible(&pipe->rd_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } /** * splice_to_pipe - fill passed data into a pipe * @pipe: pipe to fill * @spd: data to fill * * Description: * @spd contains a map of pages and len/offset tuples, along with * the struct pipe_buf_operations associated with these pages. This * function will link that data to the pipe. * / ssize_t splice_to_pipe(struct pipe_inode_info pipe, struct splice_pipe_desc spd) { unsigned int spd_pages = spd->nr_pages; unsigned int tail = pipe->tail; unsigned int head = pipe->head; ssize_t ret = 0; int page_nr = 0; if (!spd_pages) return 0; if (unlikely(!pipe->readers)) { send_sig(SIGPIPE, current, 0); ret = -EPIPE; goto out; } while (!pipe_full(head, tail, pipe->max_usage)) { struct pipe_buffer buf = pipe_buf(pipe, head); buf->page = spd->pages[page_nr]; buf->offset = spd->partial[page_nr].offset; buf->len = spd->partial[page_nr].len; buf->private = spd->partial[page_nr].private; buf->ops = spd->ops; buf->flags = 0; head++; pipe->head = head; page_nr++; ret += buf->len; if (!--spd->nr_pages) break; } if (!ret) ret = -EAGAIN; out: while (page_nr < spd_pages) spd->spd_release(spd, page_nr++); return ret; } EXPORT_SYMBOL_GPL(splice_to_pipe); ssize_t add_to_pipe(struct pipe_inode_info pipe, struct pipe_buffer buf) { unsigned int head = pipe->head; unsigned int tail = pipe->tail; int ret; if (unlikely(!pipe->readers)) { send_sig(SIGPIPE, current, 0); ret = -EPIPE; } else if (pipe_full(head, tail, pipe->max_usage)) { ret = -EAGAIN; } else { pipe_buf(pipe, head) = buf; pipe->head = head + 1; return buf->len; } pipe_buf_release(pipe, buf); return ret; } EXPORT_SYMBOL(add_to_pipe); /* * Check if we need to grow the arrays holding pages and partial page * descriptions. / int splice_grow_spd(const struct pipe_inode_info pipe, struct splice_pipe_desc spd) { unsigned int max_usage = READ_ONCE(pipe->max_usage); spd->nr_pages_max = max_usage; if (max_usage <= PIPE_DEF_BUFFERS) return 0; spd->pages = kmalloc_objs(struct page , max_usage); spd->partial = kmalloc_objs(struct partial_page, max_usage); if (spd->pages && spd->partial) return 0; kfree(spd->pages); kfree(spd->partial); return -ENOMEM; } void splice_shrink_spd(struct splice_pipe_desc spd) { if (spd->nr_pages_max <= PIPE_DEF_BUFFERS) return; kfree(spd->pages); kfree(spd->partial); } /* * copy_splice_read - Copy data from a file and splice the copy into a pipe * @in: The file to read from * @ppos: Pointer to the file position to read from * @pipe: The pipe to splice into * @len: The amount to splice * @flags: The SPLICE_F_* flags * * This function allocates a bunch of pages sufficient to hold the requested * amount of data (but limited by the remaining pipe capacity), passes it to * the file's ->read_iter() to read into and then splices the used pages into * the pipe. * * Return: On success, the number of bytes read will be returned and @ppos will be updated if appropriate; 0 will be returned if there is no more data * to be read; -EAGAIN will be returned if the pipe had no space, and some * other negative error code will be returned on error. A short read may occur * if the pipe has insufficient space, we reach the end of the data or we hit a * hole. / ssize_t copy_splice_read(struct file in, loff_t ppos, struct pipe_inode_info pipe, size_t len, unsigned int flags) { struct iov_iter to; struct bio_vec bv; struct kiocb kiocb; struct page pages; ssize_t ret; size_t used, npages, chunk, remain, keep = 0; int i; / Work out how much data we can actually add into the pipe / used = pipe_buf_usage(pipe); npages = max_t(ssize_t, pipe->max_usage - used, 0); len = min_t(size_t, len, npages PAGE_SIZE); npages = DIV_ROUND_UP(len, PAGE_SIZE); bv = kzalloc(array_size(npages, sizeof(bv[0])) + array_size(npages, sizeof(struct page )), GFP_KERNEL); if (!bv) return -ENOMEM; pages = (struct page )(bv + npages); npages = alloc_pages_bulk(GFP_USER, npages, pages); if (!npages) { kfree(bv); return -ENOMEM; } remain = len = min_t(size_t, len, npages PAGE_SIZE); for (i = 0; i < npages; i++) { chunk = min_t(size_t, PAGE_SIZE, remain); bv[i].bv_page = pages[i]; bv[i].bv_offset = 0; bv[i].bv_len = chunk; remain -= chunk; } /* Do the I/O / iov_iter_bvec(&to, ITER_DEST, bv, npages, len); init_sync_kiocb(&kiocb, in); kiocb.ki_pos = ppos; ret = in->f_op->read_iter(&kiocb, &to); if (ret > 0) { keep = DIV_ROUND_UP(ret, PAGE_SIZE); ppos = kiocb.ki_pos; } / * Callers of ->splice_read() expect -EAGAIN on "can't put anything in * there", rather than -EFAULT. / if (ret == -EFAULT) ret = -EAGAIN; / Free any pages that didn't get touched at all. / if (keep < npages) release_pages(pages + keep, npages - keep); / Push the remaining pages into the pipe. / remain = ret; for (i = 0; i < keep; i++) { struct pipe_buffer buf = pipe_head_buf(pipe); chunk = min_t(size_t, remain, PAGE_SIZE); buf = (struct pipe_buffer) { .ops = &default_pipe_buf_ops, .page = bv[i].bv_page, .offset = 0, .len = chunk, }; pipe->head++; remain -= chunk; } kfree(bv); return ret; } EXPORT_SYMBOL(copy_splice_read); const struct pipe_buf_operations default_pipe_buf_ops = { .release = generic_pipe_buf_release, .try_steal = generic_pipe_buf_try_steal, .get = generic_pipe_buf_get, }; / Pipe buffer operations for a socket and similar. / const struct pipe_buf_operations nosteal_pipe_buf_ops = { .release = generic_pipe_buf_release, .get = generic_pipe_buf_get, }; EXPORT_SYMBOL(nosteal_pipe_buf_ops); static void wakeup_pipe_writers(struct pipe_inode_info pipe) { smp_mb(); if (waitqueue_active(&pipe->wr_wait)) wake_up_interruptible(&pipe->wr_wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } /** * splice_from_pipe_feed - feed available data from a pipe to a file * @pipe: pipe to splice from * @sd: information to @actor * @actor: handler that splices the data * * Description: * This function loops over the pipe and calls @actor to do the * actual moving of a single struct pipe_buffer to the desired * destination. It returns when there's no more buffers left in * the pipe or if the requested number of bytes (@sd->total_len) * have been copied. It returns a positive number (one) if the * pipe needs to be filled with more data, zero if the required * number of bytes have been copied and -errno on error. * * This, together with splice_from_pipe_{begin,end,next}, may be * used to implement the functionality of __splice_from_pipe() when * locking is required around copying the pipe buffers to the * destination. / static int splice_from_pipe_feed(struct pipe_inode_info pipe, struct splice_desc sd, splice_actor actor) { unsigned int head = pipe->head; unsigned int tail = pipe->tail; int ret; while (!pipe_empty(head, tail)) { struct pipe_buffer buf = pipe_buf(pipe, tail); sd->len = buf->len; if (sd->len > sd->total_len) sd->len = sd->total_len; ret = pipe_buf_confirm(pipe, buf); if (unlikely(ret)) { if (ret == -ENODATA) ret = 0; return ret; } ret = actor(pipe, buf, sd); if (ret <= 0) return ret; buf->offset += ret; buf->len -= ret; sd->num_spliced += ret; sd->len -= ret; sd->pos += ret; sd->total_len -= ret; if (!buf->len) { pipe_buf_release(pipe, buf); tail++; pipe->tail = tail; if (pipe->files) sd->need_wakeup = true; } if (!sd->total_len) return 0; } return 1; } / We know we have a pipe buffer, but maybe it's empty? / static inline bool eat_empty_buffer(struct pipe_inode_info pipe) { unsigned int tail = pipe->tail; struct pipe_buffer buf = pipe_buf(pipe, tail); if (unlikely(!buf->len)) { pipe_buf_release(pipe, buf); pipe->tail = tail+1; return true; } return false; } /* * splice_from_pipe_next - wait for some data to splice from * @pipe: pipe to splice from * @sd: information about the splice operation * * Description: * This function will wait for some data and return a positive * value (one) if pipe buffers are available. It will return zero * or -errno if no more data needs to be spliced. / static int splice_from_pipe_next(struct pipe_inode_info pipe, struct splice_desc sd) { / * Check for signal early to make process killable when there are * always buffers available / if (signal_pending(current)) return -ERESTARTSYS; repeat: while (pipe_is_empty(pipe)) { if (!pipe->writers) return 0; if (sd->num_spliced) return 0; if (sd->flags & SPLICE_F_NONBLOCK) return -EAGAIN; if (signal_pending(current)) return -ERESTARTSYS; if (sd->need_wakeup) { wakeup_pipe_writers(pipe); sd->need_wakeup = false; } pipe_wait_readable(pipe); } if (eat_empty_buffer(pipe)) goto repeat; return 1; } /* * splice_from_pipe_begin - start splicing from pipe * @sd: information about the splice operation * * Description: * This function should be called before a loop containing * splice_from_pipe_next() and splice_from_pipe_feed() to * initialize the necessary fields of @sd. / static void splice_from_pipe_begin(struct splice_desc sd) { sd->num_spliced = 0; sd->need_wakeup = false; } /** * splice_from_pipe_end - finish splicing from pipe * @pipe: pipe to splice from * @sd: information about the splice operation * * Description: * This function will wake up pipe writers if necessary. It should * be called after a loop containing splice_from_pipe_next() and * splice_from_pipe_feed(). / static void splice_from_pipe_end(struct pipe_inode_info pipe, struct splice_desc sd) { if (sd->need_wakeup) wakeup_pipe_writers(pipe); } /* * __splice_from_pipe - splice data from a pipe to given actor * @pipe: pipe to splice from * @sd: information to @actor * @actor: handler that splices the data * * Description: * This function does little more than loop over the pipe and call * @actor to do the actual moving of a single struct pipe_buffer to * the desired destination. See pipe_to_file, pipe_to_sendmsg, or * pipe_to_user. * / ssize_t __splice_from_pipe(struct pipe_inode_info pipe, struct splice_desc sd, splice_actor actor) { int ret; splice_from_pipe_begin(sd); do { cond_resched(); ret = splice_from_pipe_next(pipe, sd); if (ret > 0) ret = splice_from_pipe_feed(pipe, sd, actor); } while (ret > 0); splice_from_pipe_end(pipe, sd); return sd->num_spliced ? sd->num_spliced : ret; } EXPORT_SYMBOL(__splice_from_pipe); /** * splice_from_pipe - splice data from a pipe to a file * @pipe: pipe to splice from * @out: file to splice to * @ppos: position in @out * @len: how many bytes to splice * @flags: splice modifier flags * @actor: handler that splices the data * * Description: * See __splice_from_pipe. This function locks the pipe inode, * otherwise it's identical to __splice_from_pipe(). * / ssize_t splice_from_pipe(struct pipe_inode_info pipe, struct file out, loff_t ppos, size_t len, unsigned int flags, splice_actor actor) { ssize_t ret; struct splice_desc sd = { .total_len = len, .flags = flags, .pos = ppos, .u.file = out, }; pipe_lock(pipe); ret = __splice_from_pipe(pipe, &sd, actor); pipe_unlock(pipe); return ret; } /** * iter_file_splice_write - splice data from a pipe to a file * @pipe: pipe info * @out: file to write to * @ppos: position in @out * @len: number of bytes to splice * @flags: splice modifier flags * * Description: * Will either move or copy pages (determined by @flags options) from * the given pipe inode to the given file. * This one is ->write_iter-based. * / ssize_t iter_file_splice_write(struct pipe_inode_info pipe, struct file out, loff_t ppos, size_t len, unsigned int flags) { struct splice_desc sd = { .total_len = len, .flags = flags, .pos = ppos, .u.file = out, }; int nbufs = pipe->max_usage; struct bio_vec array; ssize_t ret; if (!out->f_op->write_iter) return -EINVAL; array = kzalloc_objs(struct bio_vec, nbufs); if (unlikely(!array)) return -ENOMEM; pipe_lock(pipe); splice_from_pipe_begin(&sd); while (sd.total_len) { struct kiocb kiocb; struct iov_iter from; unsigned int head, tail; size_t left; int n; ret = splice_from_pipe_next(pipe, &sd); if (ret <= 0) break; if (unlikely(nbufs < pipe->max_usage)) { kfree(array); nbufs = pipe->max_usage; array = kzalloc_objs(struct bio_vec, nbufs); if (!array) { ret = -ENOMEM; break; } } head = pipe->head; tail = pipe->tail; /* build the vector / left = sd.total_len; for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) { struct pipe_buffer buf = pipe_buf(pipe, tail); size_t this_len = buf->len; /* zero-length bvecs are not supported, skip them / if (!this_len) continue; this_len = min(this_len, left); ret = pipe_buf_confirm(pipe, buf); if (unlikely(ret)) { if (ret == -ENODATA) ret = 0; goto done; } bvec_set_page(&array[n], buf->page, this_len, buf->offset); left -= this_len; n++; } iov_iter_bvec(&from, ITER_SOURCE, array, n, sd.total_len - left); init_sync_kiocb(&kiocb, out); kiocb.ki_pos = sd.pos; ret = out->f_op->write_iter(&kiocb, &from); sd.pos = kiocb.ki_pos; if (ret <= 0) break; WARN_ONCE(ret > sd.total_len - left, "Splice Exceeded! ret=%zd tot=%zu left=%zu\n", ret, sd.total_len, left); sd.num_spliced += ret; sd.total_len -= ret; ppos = sd.pos; /* dismiss the fully eaten buffers, adjust the partial one / tail = pipe->tail; while (ret) { struct pipe_buffer buf = pipe_buf(pipe, tail); if (ret >= buf->len) { ret -= buf->len; buf->len = 0; pipe_buf_release(pipe, buf); tail++; pipe->tail = tail; if (pipe->files) sd.need_wakeup = true; } else { buf->offset += ret; buf->len -= ret; ret = 0; } } } done: kfree(array); splice_from_pipe_end(pipe, &sd); pipe_unlock(pipe); if (sd.num_spliced) ret = sd.num_spliced; return ret; } EXPORT_SYMBOL(iter_file_splice_write); #ifdef CONFIG_NET /** * splice_to_socket - splice data from a pipe to a socket * @pipe: pipe to splice from * @out: socket to write to * @ppos: position in @out * @len: number of bytes to splice * @flags: splice modifier flags * * Description: * Will send @len bytes from the pipe to a network socket. No data copying * is involved. * / ssize_t splice_to_socket(struct pipe_inode_info pipe, struct file out, loff_t ppos, size_t len, unsigned int flags) { struct socket sock = sock_from_file(out); struct bio_vec bvec[16]; struct msghdr msg = {}; ssize_t ret = 0; size_t spliced = 0; bool need_wakeup = false; pipe_lock(pipe); while (len > 0) { unsigned int head, tail, bc = 0; size_t remain = len; / * Check for signal early to make process killable when there * are always buffers available / ret = -ERESTARTSYS; if (signal_pending(current)) break; while (pipe_is_empty(pipe)) { ret = 0; if (!pipe->writers) goto out; if (spliced) goto out; ret = -EAGAIN; if (flags & SPLICE_F_NONBLOCK) goto out; ret = -ERESTARTSYS; if (signal_pending(current)) goto out; if (need_wakeup) { wakeup_pipe_writers(pipe); need_wakeup = false; } pipe_wait_readable(pipe); } head = pipe->head; tail = pipe->tail; while (!pipe_empty(head, tail)) { struct pipe_buffer buf = pipe_buf(pipe, tail); size_t seg; if (!buf->len) { tail++; continue; } seg = min_t(size_t, remain, buf->len); ret = pipe_buf_confirm(pipe, buf); if (unlikely(ret)) { if (ret == -ENODATA) ret = 0; break; } bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset); remain -= seg; if (remain == 0 \|\| bc >= ARRAY_SIZE(bvec)) break; tail++; } if (!bc) break; msg.msg_flags = MSG_SPLICE_PAGES; if (flags & SPLICE_F_MORE) msg.msg_flags \|= MSG_MORE; if (remain && pipe_occupancy(pipe->head, tail) > 0) msg.msg_flags \|= MSG_MORE; if (out->f_flags & O_NONBLOCK) msg.msg_flags \|= MSG_DONTWAIT; iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain); ret = sock_sendmsg(sock, &msg); if (ret <= 0) break; spliced += ret; len -= ret; tail = pipe->tail; while (ret > 0) { struct pipe_buffer buf = pipe_buf(pipe, tail); size_t seg = min_t(size_t, ret, buf->len); buf->offset += seg; buf->len -= seg; ret -= seg; if (!buf->len) { pipe_buf_release(pipe, buf); tail++; } } if (tail != pipe->tail) { pipe->tail = tail; if (pipe->files) need_wakeup = true; } } out: pipe_unlock(pipe); if (need_wakeup) wakeup_pipe_writers(pipe); return spliced ?: ret; } #endif static int warn_unsupported(struct file file, const char op) { pr_debug_ratelimited( "splice %s not supported for file %pD4 (pid: %d comm: %.20s)\n", op, file, current->pid, current->comm); return -EINVAL; } / * Attempt to initiate a splice from pipe to file. / static ssize_t do_splice_from(struct pipe_inode_info pipe, struct file out, loff_t ppos, size_t len, unsigned int flags) { if (unlikely(!out->f_op->splice_write)) return warn_unsupported(out, "write"); return out->f_op->splice_write(pipe, out, ppos, len, flags); } /* * Indicate to the caller that there was a premature EOF when reading from the * source and the caller didn't indicate they would be sending more data after * this. / static void do_splice_eof(struct splice_desc sd) { if (sd->splice_eof) sd->splice_eof(sd); } /* * Callers already called rw_verify_area() on the entire range. * No need to call it for sub ranges. / static ssize_t do_splice_read(struct file in, loff_t ppos, struct pipe_inode_info pipe, size_t len, unsigned int flags) { unsigned int p_space; if (unlikely(!(in->f_mode & FMODE_READ))) return -EBADF; if (!len) return 0; /* Don't try to read more the pipe has space for. / p_space = pipe->max_usage - pipe_buf_usage(pipe); len = min_t(size_t, len, p_space << PAGE_SHIFT); if (unlikely(len > MAX_RW_COUNT)) len = MAX_RW_COUNT; if (unlikely(!in->f_op->splice_read)) return warn_unsupported(in, "read"); / * O_DIRECT and DAX don't deal with the pagecache, so we allocate a * buffer, copy into it and splice that into the pipe. / if ((in->f_flags & O_DIRECT) \|\| IS_DAX(in->f_mapping->host)) return copy_splice_read(in, ppos, pipe, len, flags); return in->f_op->splice_read(in, ppos, pipe, len, flags); } /* * vfs_splice_read - Read data from a file and splice it into a pipe * @in: File to splice from * @ppos: Input file offset * @pipe: Pipe to splice to * @len: Number of bytes to splice * @flags: Splice modifier flags (SPLICE_F_) * Splice the requested amount of data from the input file to the pipe. This * is synchronous as the caller must hold the pipe lock across the entire * operation. * * If successful, it returns the amount of data spliced, 0 if it hit the EOF or * a hole and a negative error code otherwise. / ssize_t vfs_splice_read(struct file in, loff_t ppos, struct pipe_inode_info pipe, size_t len, unsigned int flags) { ssize_t ret; ret = rw_verify_area(READ, in, ppos, len); if (unlikely(ret < 0)) return ret; return do_splice_read(in, ppos, pipe, len, flags); } EXPORT_SYMBOL_GPL(vfs_splice_read); /** * splice_direct_to_actor - splices data directly between two non-pipes * @in: file to splice from * @sd: actor information on where to splice to * @actor: handles the data splicing * * Description: * This is a special case helper to splice directly between two * points, without requiring an explicit pipe. Internally an allocated * pipe is cached in the process, and reused during the lifetime of * that process. * / ssize_t splice_direct_to_actor(struct file in, struct splice_desc sd, splice_direct_actor actor) { struct pipe_inode_info pipe; ssize_t ret, bytes; size_t len; int i, flags, more; / * We require the input to be seekable, as we don't want to randomly * drop data for eg socket -> socket splicing. Use the piped splicing * for that! / if (unlikely(!(in->f_mode & FMODE_LSEEK))) return -EINVAL; / * neither in nor out is a pipe, setup an internal pipe attached to * 'out' and transfer the wanted data from 'in' to 'out' through that / pipe = current->splice_pipe; if (unlikely(!pipe)) { pipe = alloc_pipe_info(); if (!pipe) return -ENOMEM; / * We don't have an immediate reader, but we'll read the stuff * out of the pipe right after the splice_to_pipe(). So set * PIPE_READERS appropriately. / pipe->readers = 1; current->splice_pipe = pipe; } / * Do the splice. / bytes = 0; len = sd->total_len; / Don't block on output, we have to drain the direct pipe. / flags = sd->flags; sd->flags &= ~SPLICE_F_NONBLOCK; / * We signal MORE until we've read sufficient data to fulfill the * request and we keep signalling it if the caller set it. / more = sd->flags & SPLICE_F_MORE; sd->flags \|= SPLICE_F_MORE; WARN_ON_ONCE(!pipe_is_empty(pipe)); while (len) { size_t read_len; loff_t pos = sd->pos, prev_pos = pos; ret = do_splice_read(in, &pos, pipe, len, flags); if (unlikely(ret <= 0)) goto read_failure; read_len = ret; sd->total_len = read_len; / * If we now have sufficient data to fulfill the request then * we clear SPLICE_F_MORE if it was not set initially. / if (read_len >= len && !more) sd->flags &= ~SPLICE_F_MORE; / * NOTE: nonblocking mode only applies to the input. We * must not do the output in nonblocking mode as then we * could get stuck data in the internal pipe: / ret = actor(pipe, sd); if (unlikely(ret <= 0)) { sd->pos = prev_pos; goto out_release; } bytes += ret; len -= ret; sd->pos = pos; if (ret < read_len) { sd->pos = prev_pos + ret; goto out_release; } } done: pipe->tail = pipe->head = 0; file_accessed(in); return bytes; read_failure: / * If the user did not set SPLICE_F_MORE and we didn't hit that * "use all of len" case that cleared SPLICE_F_MORE, and we did a * "->splice_in()" that returned EOF (ie zero) and we have sent at * least 1 byte then we will also do the ->splice_eof() call. / if (ret == 0 && !more && len > 0 && bytes) do_splice_eof(sd); out_release: / * If we did an incomplete transfer we must release * the pipe buffers in question: / for (i = 0; i < pipe->ring_size; i++) { struct pipe_buffer buf = &pipe->bufs[i]; if (buf->ops) pipe_buf_release(pipe, buf); } if (!bytes) bytes = ret; goto done; } EXPORT_SYMBOL(splice_direct_to_actor); static int direct_splice_actor(struct pipe_inode_info pipe, struct splice_desc sd) { struct file file = sd->u.file; long ret; file_start_write(file); ret = do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags); file_end_write(file); return ret; } static int splice_file_range_actor(struct pipe_inode_info pipe, struct splice_desc sd) { struct file file = sd->u.file; return do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags); } static void direct_file_splice_eof(struct splice_desc sd) { struct file file = sd->u.file; if (file->f_op->splice_eof) file->f_op->splice_eof(file); } static ssize_t do_splice_direct_actor(struct file in, loff_t ppos, struct file out, loff_t opos, size_t len, unsigned int flags, splice_direct_actor actor) { struct splice_desc sd = { .len = len, .total_len = len, .flags = flags, .pos = ppos, .u.file = out, .splice_eof = direct_file_splice_eof, .opos = opos, }; ssize_t ret; if (unlikely(!(out->f_mode & FMODE_WRITE))) return -EBADF; if (unlikely(out->f_flags & O_APPEND)) return -EINVAL; ret = splice_direct_to_actor(in, &sd, actor); if (ret > 0) ppos = sd.pos; return ret; } /* * do_splice_direct - splices data directly between two files * @in: file to splice from * @ppos: input file offset * @out: file to splice to * @opos: output file offset * @len: number of bytes to splice * @flags: splice modifier flags * * Description: * For use by do_sendfile(). splice can easily emulate sendfile, but * doing it in the application would incur an extra system call * (splice in + splice out, as compared to just sendfile()). So this helper * can splice directly through a process-private pipe. * * Callers already called rw_verify_area() on the entire range. / ssize_t do_splice_direct(struct file in, loff_t ppos, struct file out, loff_t opos, size_t len, unsigned int flags) { return do_splice_direct_actor(in, ppos, out, opos, len, flags, direct_splice_actor); } EXPORT_SYMBOL(do_splice_direct); /* * splice_file_range - splices data between two files for copy_file_range() * @in: file to splice from * @ppos: input file offset * @out: file to splice to * @opos: output file offset * @len: number of bytes to splice * * Description: * For use by ->copy_file_range() methods. * Like do_splice_direct(), but vfs_copy_file_range() already holds * start_file_write() on @out file. * * Callers already called rw_verify_area() on the entire range. / ssize_t splice_file_range(struct file in, loff_t ppos, struct file out, loff_t opos, size_t len) { lockdep_assert(file_write_started(out)); return do_splice_direct_actor(in, ppos, out, opos, min_t(size_t, len, MAX_RW_COUNT), 0, splice_file_range_actor); } EXPORT_SYMBOL(splice_file_range); static int wait_for_space(struct pipe_inode_info pipe, unsigned flags) { for (;;) { if (unlikely(!pipe->readers)) { send_sig(SIGPIPE, current, 0); return -EPIPE; } if (!pipe_is_full(pipe)) return 0; if (flags & SPLICE_F_NONBLOCK) return -EAGAIN; if (signal_pending(current)) return -ERESTARTSYS; pipe_wait_writable(pipe); } } static int splice_pipe_to_pipe(struct pipe_inode_info ipipe, struct pipe_inode_info opipe, size_t len, unsigned int flags); ssize_t splice_file_to_pipe(struct file in, struct pipe_inode_info opipe, loff_t offset, size_t len, unsigned int flags) { ssize_t ret; pipe_lock(opipe); ret = wait_for_space(opipe, flags); if (!ret) ret = do_splice_read(in, offset, opipe, len, flags); pipe_unlock(opipe); if (ret > 0) wakeup_pipe_readers(opipe); return ret; } / * Determine where to splice to/from. / ssize_t do_splice(struct file in, loff_t off_in, struct file out, loff_t off_out, size_t len, unsigned int flags) { struct pipe_inode_info ipipe; struct pipe_inode_info opipe; loff_t offset; ssize_t ret; if (unlikely(!(in->f_mode & FMODE_READ) \|\| !(out->f_mode & FMODE_WRITE))) return -EBADF; ipipe = get_pipe_info(in, true); opipe = get_pipe_info(out, true); if (ipipe && opipe) { if (off_in \|\| off_out) return -ESPIPE; / Splicing to self would be fun, but... / if (ipipe == opipe) return -EINVAL; if ((in->f_flags \| out->f_flags) & O_NONBLOCK) flags \|= SPLICE_F_NONBLOCK; ret = splice_pipe_to_pipe(ipipe, opipe, len, flags); } else if (ipipe) { if (off_in) return -ESPIPE; if (off_out) { if (!(out->f_mode & FMODE_PWRITE)) return -EINVAL; offset = off_out; } else { offset = out->f_pos; } if (unlikely(out->f_flags & O_APPEND)) return -EINVAL; ret = rw_verify_area(WRITE, out, &offset, len); if (unlikely(ret < 0)) return ret; if (in->f_flags & O_NONBLOCK) flags \|= SPLICE_F_NONBLOCK; file_start_write(out); ret = do_splice_from(ipipe, out, &offset, len, flags); file_end_write(out); if (!off_out) out->f_pos = offset; else off_out = offset; } else if (opipe) { if (off_out) return -ESPIPE; if (off_in) { if (!(in->f_mode & FMODE_PREAD)) return -EINVAL; offset = off_in; } else { offset = in->f_pos; } ret = rw_verify_area(READ, in, &offset, len); if (unlikely(ret < 0)) return ret; if (out->f_flags & O_NONBLOCK) flags \|= SPLICE_F_NONBLOCK; ret = splice_file_to_pipe(in, opipe, &offset, len, flags); if (!off_in) in->f_pos = offset; else off_in = offset; } else { ret = -EINVAL; } if (ret > 0) { / * Generate modify out before access in: * do_splice_from() may've already sent modify out, * and this ensures the events get merged. / fsnotify_modify(out); fsnotify_access(in); } return ret; } static ssize_t __do_splice(struct file in, loff_t __user off_in, struct file out, loff_t __user off_out, size_t len, unsigned int flags) { struct pipe_inode_info ipipe; struct pipe_inode_info opipe; loff_t offset, __off_in = NULL, __off_out = NULL; ssize_t ret; ipipe = get_pipe_info(in, true); opipe = get_pipe_info(out, true); if (ipipe) { if (off_in) return -ESPIPE; pipe_clear_nowait(in); } if (opipe) { if (off_out) return -ESPIPE; pipe_clear_nowait(out); } if (off_out) { if (copy_from_user(&offset, off_out, sizeof(loff_t))) return -EFAULT; __off_out = &offset; } if (off_in) { if (copy_from_user(&offset, off_in, sizeof(loff_t))) return -EFAULT; __off_in = &offset; } ret = do_splice(in, __off_in, out, __off_out, len, flags); if (ret < 0) return ret; if (__off_out && copy_to_user(off_out, __off_out, sizeof(loff_t))) return -EFAULT; if (__off_in && copy_to_user(off_in, __off_in, sizeof(loff_t))) return -EFAULT; return ret; } static ssize_t iter_to_pipe(struct iov_iter from, struct pipe_inode_info pipe, unsigned int flags) { struct pipe_buffer buf = { .ops = &user_page_pipe_buf_ops, .flags = flags }; size_t total = 0; ssize_t ret = 0; while (iov_iter_count(from)) { struct page pages[16]; ssize_t left; size_t start; int i, n; left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start); if (left <= 0) { ret = left; break; } n = DIV_ROUND_UP(left + start, PAGE_SIZE); for (i = 0; i < n; i++) { int size = umin(left, PAGE_SIZE - start); buf.page = pages[i]; buf.offset = start; buf.len = size; ret = add_to_pipe(pipe, &buf); if (unlikely(ret < 0)) { iov_iter_revert(from, left); // this one got dropped by add_to_pipe() while (++i < n) put_page(pages[i]); goto out; } total += ret; left -= size; start = 0; } } out: return total ? total : ret; } static int pipe_to_user(struct pipe_inode_info pipe, struct pipe_buffer buf, struct splice_desc sd) { int n = copy_page_to_iter(buf->page, buf->offset, sd->len, sd->u.data); return n == sd->len ? n : -EFAULT; } / * For lack of a better implementation, implement vmsplice() to userspace * as a simple copy of the pipe's pages to the user iov. / static ssize_t vmsplice_to_user(struct file file, struct iov_iter iter, unsigned int flags) { struct pipe_inode_info pipe = get_pipe_info(file, true); struct splice_desc sd = { .total_len = iov_iter_count(iter), .flags = flags, .u.data = iter }; ssize_t ret = 0; if (!pipe) return -EBADF; pipe_clear_nowait(file); if (sd.total_len) { pipe_lock(pipe); ret = __splice_from_pipe(pipe, &sd, pipe_to_user); pipe_unlock(pipe); } if (ret > 0) fsnotify_access(file); return ret; } /* * vmsplice splices a user address range into a pipe. It can be thought of * as splice-from-memory, where the regular splice is splice-from-file (or * to file). In both cases the output is a pipe, naturally. / static ssize_t vmsplice_to_pipe(struct file file, struct iov_iter iter, unsigned int flags) { struct pipe_inode_info pipe; ssize_t ret = 0; unsigned buf_flag = 0; if (flags & SPLICE_F_GIFT) buf_flag = PIPE_BUF_FLAG_GIFT; pipe = get_pipe_info(file, true); if (!pipe) return -EBADF; pipe_clear_nowait(file); pipe_lock(pipe); ret = wait_for_space(pipe, flags); if (!ret) ret = iter_to_pipe(iter, pipe, buf_flag); pipe_unlock(pipe); if (ret > 0) { wakeup_pipe_readers(pipe); fsnotify_modify(file); } return ret; } /* * Note that vmsplice only really supports true splicing _from_ user memory * to a pipe, not the other way around. Splicing from user memory is a simple * operation that can be supported without any funky alignment restrictions * or nasty vm tricks. We simply map in the user memory and fill them into * a pipe. The reverse isn't quite as easy, though. There are two possible * solutions for that: * * - memcpy() the data internally, at which point we might as well just * do a regular read() on the buffer anyway. * - Lots of nasty vm tricks, that are neither fast nor flexible (it * has restriction limitations on both ends of the pipe). * * Currently we punt and implement it as a normal copy, see pipe_to_user(). * / SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user , uiov, unsigned long, nr_segs, unsigned int, flags) { struct iovec iovstack[UIO_FASTIOV]; struct iovec iov = iovstack; struct iov_iter iter; ssize_t error; int type; if (unlikely(flags & ~SPLICE_F_ALL)) return -EINVAL; CLASS(fd, f)(fd); if (fd_empty(f)) return -EBADF; if (fd_file(f)->f_mode & FMODE_WRITE) type = ITER_SOURCE; else if (fd_file(f)->f_mode & FMODE_READ) type = ITER_DEST; else return -EBADF; error = import_iovec(type, uiov, nr_segs, ARRAY_SIZE(iovstack), &iov, &iter); if (error < 0) return error; if (!iov_iter_count(&iter)) error = 0; else if (type == ITER_SOURCE) error = vmsplice_to_pipe(fd_file(f), &iter, flags); else error = vmsplice_to_user(fd_file(f), &iter, flags); kfree(iov); return error; } SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user , off_in, int, fd_out, loff_t __user , off_out, size_t, len, unsigned int, flags) { if (unlikely(!len)) return 0; if (unlikely(flags & ~SPLICE_F_ALL)) return -EINVAL; CLASS(fd, in)(fd_in); if (fd_empty(in)) return -EBADF; CLASS(fd, out)(fd_out); if (fd_empty(out)) return -EBADF; return __do_splice(fd_file(in), off_in, fd_file(out), off_out, len, flags); } / * Make sure there's data to read. Wait for input if we can, otherwise * return an appropriate error. / static int ipipe_prep(struct pipe_inode_info pipe, unsigned int flags) { int ret; /* * Check the pipe occupancy without the inode lock first. This function * is speculative anyways, so missing one is ok. / if (!pipe_is_empty(pipe)) return 0; ret = 0; pipe_lock(pipe); while (pipe_is_empty(pipe)) { if (signal_pending(current)) { ret = -ERESTARTSYS; break; } if (!pipe->writers) break; if (flags & SPLICE_F_NONBLOCK) { ret = -EAGAIN; break; } pipe_wait_readable(pipe); } pipe_unlock(pipe); return ret; } / * Make sure there's writeable room. Wait for room if we can, otherwise * return an appropriate error. / static int opipe_prep(struct pipe_inode_info pipe, unsigned int flags) { int ret; /* * Check pipe occupancy without the inode lock first. This function * is speculative anyways, so missing one is ok. / if (!pipe_is_full(pipe)) return 0; ret = 0; pipe_lock(pipe); while (pipe_is_full(pipe)) { if (!pipe->readers) { send_sig(SIGPIPE, current, 0); ret = -EPIPE; break; } if (flags & SPLICE_F_NONBLOCK) { ret = -EAGAIN; break; } if (signal_pending(current)) { ret = -ERESTARTSYS; break; } pipe_wait_writable(pipe); } pipe_unlock(pipe); return ret; } / * Splice contents of ipipe to opipe. / static int splice_pipe_to_pipe(struct pipe_inode_info ipipe, struct pipe_inode_info opipe, size_t len, unsigned int flags) { struct pipe_buffer ibuf, obuf; unsigned int i_head, o_head; unsigned int i_tail, o_tail; int ret = 0; bool input_wakeup = false; retry: ret = ipipe_prep(ipipe, flags); if (ret) return ret; ret = opipe_prep(opipe, flags); if (ret) return ret; / * Potential ABBA deadlock, work around it by ordering lock * grabbing by pipe info address. Otherwise two different processes * could deadlock (one doing tee from A -> B, the other from B -> A). / pipe_double_lock(ipipe, opipe); i_tail = ipipe->tail; o_head = opipe->head; do { size_t o_len; if (!opipe->readers) { send_sig(SIGPIPE, current, 0); if (!ret) ret = -EPIPE; break; } i_head = ipipe->head; o_tail = opipe->tail; if (pipe_empty(i_head, i_tail) && !ipipe->writers) break; / * Cannot make any progress, because either the input * pipe is empty or the output pipe is full. / if (pipe_empty(i_head, i_tail) \|\| pipe_full(o_head, o_tail, opipe->max_usage)) { / Already processed some buffers, break / if (ret) break; if (flags & SPLICE_F_NONBLOCK) { ret = -EAGAIN; break; } / * We raced with another reader/writer and haven't * managed to process any buffers. A zero return * value means EOF, so retry instead. / pipe_unlock(ipipe); pipe_unlock(opipe); goto retry; } ibuf = pipe_buf(ipipe, i_tail); obuf = pipe_buf(opipe, o_head); if (len >= ibuf->len) { / * Simply move the whole buffer from ipipe to opipe / obuf = ibuf; ibuf->ops = NULL; i_tail++; ipipe->tail = i_tail; input_wakeup = true; o_len = obuf->len; o_head++; opipe->head = o_head; } else { / * Get a reference to this pipe buffer, * so we can copy the contents over. / if (!pipe_buf_get(ipipe, ibuf)) { if (ret == 0) ret = -EFAULT; break; } obuf = ibuf; / * Don't inherit the gift and merge flags, we need to * prevent multiple steals of this page. / obuf->flags &= ~PIPE_BUF_FLAG_GIFT; obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE; obuf->len = len; ibuf->offset += len; ibuf->len -= len; o_len = len; o_head++; opipe->head = o_head; } ret += o_len; len -= o_len; } while (len); pipe_unlock(ipipe); pipe_unlock(opipe); / * If we put data in the output pipe, wakeup any potential readers. / if (ret > 0) wakeup_pipe_readers(opipe); if (input_wakeup) wakeup_pipe_writers(ipipe); return ret; } / * Link contents of ipipe to opipe. / static ssize_t link_pipe(struct pipe_inode_info ipipe, struct pipe_inode_info opipe, size_t len, unsigned int flags) { struct pipe_buffer ibuf, obuf; unsigned int i_head, o_head; unsigned int i_tail, o_tail; ssize_t ret = 0; / * Potential ABBA deadlock, work around it by ordering lock * grabbing by pipe info address. Otherwise two different processes * could deadlock (one doing tee from A -> B, the other from B -> A). / pipe_double_lock(ipipe, opipe); i_tail = ipipe->tail; o_head = opipe->head; do { if (!opipe->readers) { send_sig(SIGPIPE, current, 0); if (!ret) ret = -EPIPE; break; } i_head = ipipe->head; o_tail = opipe->tail; / * If we have iterated all input buffers or run out of * output room, break. / if (pipe_empty(i_head, i_tail) \|\| pipe_full(o_head, o_tail, opipe->max_usage)) break; ibuf = pipe_buf(ipipe, i_tail); obuf = pipe_buf(opipe, o_head); / * Get a reference to this pipe buffer, * so we can copy the contents over. / if (!pipe_buf_get(ipipe, ibuf)) { if (ret == 0) ret = -EFAULT; break; } obuf = ibuf; / * Don't inherit the gift and merge flag, we need to prevent * multiple steals of this page. / obuf->flags &= ~PIPE_BUF_FLAG_GIFT; obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE; if (obuf->len > len) obuf->len = len; ret += obuf->len; len -= obuf->len; o_head++; opipe->head = o_head; i_tail++; } while (len); pipe_unlock(ipipe); pipe_unlock(opipe); / * If we put data in the output pipe, wakeup any potential readers. / if (ret > 0) wakeup_pipe_readers(opipe); return ret; } / * This is a tee(1) implementation that works on pipes. It doesn't copy * any data, it simply references the 'in' pages on the 'out' pipe. * The 'flags' used are the SPLICE_F_* variants, currently the only * applicable one is SPLICE_F_NONBLOCK. / ssize_t do_tee(struct file in, struct file out, size_t len, unsigned int flags) { struct pipe_inode_info ipipe = get_pipe_info(in, true); struct pipe_inode_info opipe = get_pipe_info(out, true); ssize_t ret = -EINVAL; if (unlikely(!(in->f_mode & FMODE_READ) \|\| !(out->f_mode & FMODE_WRITE))) return -EBADF; / * Duplicate the contents of ipipe to opipe without actually * copying the data. / if (ipipe && opipe && ipipe != opipe) { if ((in->f_flags \| out->f_flags) & O_NONBLOCK) flags \|= SPLICE_F_NONBLOCK; / * Keep going, unless we encounter an error. The ipipe/opipe * ordering doesn't really matter. */ ret = ipipe_prep(ipipe, flags); if (!ret) { ret = opipe_prep(opipe, flags); if (!ret) ret = link_pipe(ipipe, opipe, len, flags); } } if (ret > 0) { fsnotify_access(in); fsnotify_modify(out); } return ret; } SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags) { if (unlikely(flags & ~SPLICE_F_ALL)) return -EINVAL; if (unlikely(!len)) return 0; CLASS(fd, in)(fdin); if (fd_empty(in)) return -EBADF; CLASS(fd, out)(fdout); if (fd_empty(out)) return -EBADF; return do_tee(fd_file(in), fd_file(out), len, flags); } ]
16/2	2026/02/26 05:13	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 95774 output: 19 thoughts: 1184 Thoughts: Analysis of a KCSAN Report: Benign Race Condition in Readahead Okay, let's break this down. I'm looking at a KCSAN report indicating a potential race condition. The key functions involved are `sendfile`, `splice_direct_to_actor`, and `filemap_read`, all dealing with file I/O. Task 10061 is running `sendfile` and Task 10066 is doing an `io_read` via `io_uring`. Both appear to be operating on the same file. The reported race is on `f_ra.prev_pos`, the previous position used for readahead. My initial thought is that this is likely a standard, expected race in the readahead mechanism. Readahead hints are inherently susceptible to these kinds of races when multiple threads share a file descriptor. The readahead state isn't precisely maintained; it's a hint. I quickly recall that `sendfile` itself uses `f_pos`, and the relevant splice functions use `f_pos` or `ppos`, depending on the situation. I dig into the code and see that `do_splice` uses `in->f_pos` if `off_in` is null. `do_splice_direct_actor` uses the passed in position from `splice_desc.pos = ppos`. I needed to make sure it wasn't something problematic. I review the code again to make sure that the reported field, `f_ra.prev_pos`, is the only one affected. My next step is to make sure this is the only relevant field in `f_ra` that could cause a problem. Upon closer inspection, `prev_pos` is the only field flagged by KCSAN. Now, I look at how `prev_pos` is actually used within `filemap_read`. I confirm that the readahead's `prev_pos` is used to decide when to call `folio_mark_accessed`. Specifically, I note this section: ```c 2842: if (!pos_same_folio(iocb->ki_pos, last_pos - 1, 2843: fbatch.folios[0])) 2844: folio_mark_accessed(fbatch.folios[0]); ``` `folio_mark_accessed` is, itself, just a hint, mostly for LRU (Least Recently Used). If the `last_pos` is racy, it means that `folio_mark_accessed` might be called more or less often than would be perfectly ideal. This isn't critical. `pos_same_folio` also just checks if the positions are in the same folio. A racy `prev_pos` could lead to a different thread's position being used in the `pos_same_folio` check, resulting in a false negative (we don't mark as accessed when we should) or a false positive (we mark as accessed when we shouldn't) for `folio_mark_accessed`. Since `folio_mark_accessed` is a hint for LRU, both are harmless. The worst-case scenario is a slightly suboptimal LRU behavior. This is clearly a benign race condition. I also consider that `filemap_read` and `filemap_splice_read` are both incredibly common, so if this were a serious issue, it would have been found and fixed a long time ago. Therefore, my verdict is that this race condition on `f_ra.prev_pos` is benign.
17/2	2026/02/26 05:14	tool	set-results	0m Args: map[Benign:true Confident:true] Results: map[Benign:true Confident:true]
18/2	2026/02/26 05:14	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 97775 output: 569 thoughts: 336 Thoughts: My Assessment of the Data Race in File Readahead Okay, so I've been looking at this data race on the `prev_pos` field within the `file_ra_state` struct, nested within the `struct file` in the kernel. My understanding is that this field is crucial for the readahead algorithm; it tracks the last read position to optimize for sequential access patterns. The race specifically occurs because of concurrent read operations on the same file descriptor. Things like `splice` and `sendfile`, they can all hit the same file concurrently, and this is where the problem lies. The devs deliberately left the readahead state unlocked to prevent performance bottlenecks. A classic trade-off, really. From what I can see, the worst-case scenarios are suboptimal readahead and redundant `folio_mark_accessed` calls. The readahead algorithm might misinterpret access patterns, which means I/O performance will suffer. That second issue, the potential for those extra `folio_mark_accessed` calls is essentially harmless, though inefficient. It is only used to avoid multiple accesses for a single read. However, and this is crucial, `prev_pos` is a simple `loff_t`, a 64-bit integer. It's only a hint, an optimization. It's not involved in any data integrity checks. On an x86_64 system, 8-byte aligned writes are atomic, meaning even if a write is "torn," the value won't be nonsensical. Memory corruption, crashes, or incorrect data returned to the user are simply not on the table here. And, of course, the kernel documentation and the code comments themselves acknowledge that concurrent streams will interfere with each other’s readahead. Considering all this, it looks like this is a benign data race. The benefits of avoiding those read-path bottlenecks outweigh the minor performance hiccups that might occur.

map[Benign:true Confident:true Explanation:The data race occurs on the `prev_pos` field of `struct file_ra_state`, which is part of the `struct file`. This field is used by the readahead algorithm to track the last byte read in the previous request, helping it detect sequential access patterns and decide when to trigger readahead.

The race happens when multiple tasks perform read operations (such as `filemap_read` and `filemap_splice_read`) on the same file descriptor concurrently. In the reported case, one task is performing a `sendfile` (which uses `filemap_splice_read`) while another is performing an `io_uring` read (which uses `filemap_read`). Both functions update `in->f_ra.prev_pos` to reflect their progress.

This data race is considered **benign** for the following reasons:
1.  **Hint-only field**: The `prev_pos` field (and the entire `file_ra_state` structure) is used only as a hint for readahead and LRU management. It does not affect the correctness of the data returned to the user or the integrity of the filesystem.
2.  **Intentional lack of synchronization**: The readahead state is intentionally accessed without strict synchronization (like locks) to avoid performance degradation on hot read paths. The kernel documentation for readahead (`mm/readahead.c`) explicitly mentions that concurrent streams on the same file descriptor can interfere with each other's readahead state.
3.  **Atomic writes**: On the x86_64 architecture, 8-byte aligned writes to a `loff_t` are atomic. The `prev_pos` field is 8-byte aligned within `struct file_ra_state`. Thus, even with a race, the field will always contain a valid (though potentially "stale" from another thread's perspective) offset.
4.  **Robust algorithm**: The readahead algorithm is designed to be robust. If `prev_pos` is updated by another thread, the algorithm might temporarily fail to detect a sequential pattern and fall back to a slower path that queries the page cache, but it will still function correctly and eventually re-establish readahead if the pattern persists.
5.  **Harmless side effects**: In `filemap_read`, `prev_pos` is also used to avoid redundant `folio_mark_accessed()` calls. A race here might result in an extra call to mark a folio as accessed, which is a harmless operation for the LRU cache.

In summary, this is a known and expected data race on a performance-hinting field that does not compromise system stability or data integrity.]

map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/4a2b064596d08cb6a84691d51f20dc6b9bfd3434]

map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/da6fca5c4649a67b9e82aa07431048b6f910d2c8]

map[Index:codesearch-index]

gemini-3-flash-preview

map[Benign:true Confident:true]

You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

A "benign" data races are on a simple int/bool variable or similar field,
and the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flags fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful" data race) can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data races is race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

Also consider races that happen at the same time with the given one.
If there is no synchronization in between, other memory accesses in the involved threads
race with each other if they access the same memory. For example, if both threads execute:

some_struct->have_elements = true;
list_add(new_node, &some_struct->list_head);

the race on some_struct->have_elements may appear benign, however it also implies there
is a race on some_struct->list_head which is not benign, since the list is not thread-safe.

Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses,
and thus even with a conservative compiler may not be fully atomic. However, such races may
still be benign depending on how writes are done, and how read data is used.

In the final reply explain why you think the given data race is benign or is harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.

Use set-results tool to provide results of the analysis.
It must be called exactly once before the final reply.
Ignore results of this tool.

The data race report is:

==================================================================
BUG: KCSAN: data-race in filemap_read / filemap_splice_read

write to 0xffff88811a6f8228 of 8 bytes by task 10061 on cpu 0:
 filemap_splice_read+0x523/0x780 mm/filemap.c:3125
 do_splice_read fs/splice.c:980 [inline]
 splice_direct_to_actor+0x26e/0x670 fs/splice.c:1084
 do_splice_direct_actor fs/splice.c:1202 [inline]
 do_splice_direct+0x119/0x1a0 fs/splice.c:1228
 do_sendfile+0x382/0x650 fs/read_write.c:1372
 __do_sys_sendfile64 fs/read_write.c:1433 [inline]
 __se_sys_sendfile64 fs/read_write.c:1419 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1419
 x64_sys_call+0x2dc4/0x3020 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

write to 0xffff88811a6f8228 of 8 bytes by task 10066 on cpu 1:
 filemap_read+0x98d/0xa10 mm/filemap.c:2873
 generic_file_read_iter+0x79/0x330 mm/filemap.c:2992
 io_iter_do_read io_uring/rw.c:835 [inline]
 __io_read+0x463/0xc50 io_uring/rw.c:951
 io_read+0x4a/0x190 io_uring/rw.c:1031
 __io_issue_sqe+0xfd/0x2d0 io_uring/io_uring.c:1384
 io_issue_sqe+0x206/0xc00 io_uring/io_uring.c:1407
 io_wq_submit_work+0x5bc/0x830 io_uring/io_uring.c:1519
 io_worker_handle_work+0x41e/0x950 io_uring/io-wq.c:652
 io_wq_worker+0x23d/0x970 io_uring/io-wq.c:706
 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

value changed: 0x00000000000000d8 -> 0x00000000000014f0

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 10066 Comm: iou-wrk-10063 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
==================================================================

The data race occurs on the `prev_pos` field of `struct file_ra_state`, which is part of the `struct file`. This field is used by the readahead algorithm to track the last byte read in the previous request, helping it detect sequential access patterns and decide when to trigger readahead.

The race happens when multiple tasks perform read operations (such as `filemap_read` and `filemap_splice_read`) on the same file descriptor concurrently. In the reported case, one task is performing a `sendfile` (which uses `filemap_splice_read`) while another is performing an `io_uring` read (which uses `filemap_read`). Both functions update `in->f_ra.prev_pos` to reflect their progress.

This data race is considered **benign** for the following reasons:
1.  **Hint-only field**: The `prev_pos` field (and the entire `file_ra_state` structure) is used only as a hint for readahead and LRU management. It does not affect the correctness of the data returned to the user or the integrity of the filesystem.
2.  **Intentional lack of synchronization**: The readahead state is intentionally accessed without strict synchronization (like locks) to avoid performance degradation on hot read paths. The kernel documentation for readahead (`mm/readahead.c`) explicitly mentions that concurrent streams on the same file descriptor can interfere with each other's readahead state.
3.  **Atomic writes**: On the x86_64 architecture, 8-byte aligned writes to a `loff_t` are atomic. The `prev_pos` field is 8-byte aligned within `struct file_ra_state`. Thus, even with a race, the field will always contain a valid (though potentially "stale" from another thread's perspective) offset.
4.  **Robust algorithm**: The readahead algorithm is designed to be robust. If `prev_pos` is updated by another thread, the algorithm might temporarily fail to detect a sequential pattern and fall back to a slower path that queries the page cache, but it will still function correctly and eventually re-establish readahead if the pattern persists.
5.  **Harmless side effects**: In `filemap_read`, `prev_pos` is also used to avoid redundant `folio_mark_accessed()` calls. A race here might result in an extra call to mark a folio as accessed, which is a harmless operation for the LRU cache.

In summary, this is a known and expected data race on a performance-hinting field that does not compromise system stability or data integrity.

gemini-3-flash-preview

							input: 4040
							output: 41
							thoughts: 142

Okay, here's my summary, taking on the persona you've requested.

**Analyzing the Reported Data Race**

Alright, a data race has popped up, specifically between `filemap_splice_read` and `filemap_read`. The report is pointing a finger at a shared memory address: `0xffff88811a6f8228`. Both of these functions, apparently, are issuing 8-byte writes to the same location. That's a red flag, and it warrants a deeper dive.

My immediate next steps are to pinpoint the exact code involved. The report has conveniently provided the specific lines: `mm/filemap.c:3125` for `filemap_splice_read` and `mm/filemap.c:2873` for `filemap_read`. I'll first list the files as a reference, then I’m going to go straight to those lines and examine the surrounding context. Understanding the adjacent code is crucial for getting a full picture of what's happening. I need to see what's being written, under what conditions, and if there are any obvious synchronization mechanisms in place, or the lack thereof. Let’s get to it.

map[Files:[backing-dev.c balloon.c bootmem_info.c bpf_memcontrol.c cma.c cma.h cma_debug.c cma_sysfs.c compaction.c debug.c debug_page_alloc.c debug_page_ref.c debug_vm_pgtable.c dmapool.c dmapool_test.c early_ioremap.c execmem.c fadvise.c fail_page_alloc.c failslab.c filemap.c folio-compat.c gup.c gup_test.c gup_test.h highmem.c hmm.c huge_memory.c hugetlb.c hugetlb_cgroup.c hugetlb_cma.c hugetlb_cma.h hugetlb_internal.h hugetlb_sysctl.c hugetlb_sysfs.c hugetlb_vmemmap.c hugetlb_vmemmap.h hwpoison-inject.c init-mm.c internal.h interval_tree.c ioremap.c khugepaged.c kmemleak.c ksm.c list_lru.c maccess.c madvise.c mapping_dirty_helpers.c memblock.c memcontrol-v1.c memcontrol-v1.h memcontrol.c memfd.c memfd_luo.c memory-failure.c memory-tiers.c memory.c memory_hotplug.c mempolicy.c mempool.c memremap.c memtest.c migrate.c migrate_device.c mincore.c mlock.c mm_init.c mm_slot.h mmap.c mmap_lock.c mmu_gather.c mmu_notifier.c mmzone.c mprotect.c mremap.c mseal.c msync.c nommu.c numa.c numa_emulation.c numa_memblks.c oom_kill.c page-writeback.c page_alloc.c page_counter.c page_ext.c page_frag_cache.c page_idle.c page_io.c page_isolation.c page_owner.c page_poison.c page_reporting.c page_reporting.h page_table_check.c page_vma_mapped.c pagewalk.c percpu-internal.h percpu-km.c percpu-stats.c percpu-vm.c percpu.c pgalloc-track.h pgtable-generic.c process_vm_access.c ptdump.c readahead.c rmap.c rodata_test.c secretmem.c shmem.c shmem_quota.c show_mem.c shrinker.c shrinker_debug.c shuffle.c shuffle.h slab.h slab_common.c slub.c sparse-vmemmap.c sparse.c swap.c swap.h swap_cgroup.c swap_state.c swap_table.h swapfile.c truncate.c usercopy.c userfaultfd.c util.c vma.c vma.h vma_exec.c vma_init.c vma_internal.h vmalloc.c vmpressure.c vmscan.c vmstat.c workingset.c zpdesc.h zsmalloc.c zswap.c] Subdirs:[damon kasan kfence kmsan tests]]

map[File:mm/filemap.c]

map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
 *	linux/mm/filemap.c
 *
 * Copyright (C) 1994-1999  Linus Torvalds
 */

/*
 * This file handles the generic file mmap semantics used by
 * most "normal" filesystems (but you don't /have/ to use this:
 * the NFS filesystem used to do this differently, for example)
 */
#include <linux/export.h>
#include <linux/compiler.h>
#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <linux/capability.h>
#include <linux/kernel_stat.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/leafops.h>
#include <linux/syscalls.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/uio.h>
#include <linux/error-injection.h>
#include <linux/hash.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/security.h>
#include <linux/cpuset.h>
#include <linux/hugetlb.h>
#include <linux/memcontrol.h>
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
#include <linux/delayacct.h>
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
#include <linux/migrate.h>
#include <linux/pipe_fs_i.h>
#include <linux/splice.h>
#include <linux/rcupdate_wait.h>
#include <linux/sched/mm.h>
#include <linux/sysctl.h>
#include <linux/pgalloc.h>

#include <asm/tlbflush.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
#include <trace/events/filemap.h>

/*
 * FIXME: remove all knowledge of the buffer layer from the core VM
 */
#include <linux/buffer_head.h> /* for try_to_free_buffers */

#include <asm/mman.h>

#include "swap.h"

/*
 * Shared mappings implemented 30.11.1994. It's not fully working yet,
 * though.
 *
 * Shared mappings now work. 15.8.1995  Bruno.
 *
 * finished 'unifying' the page and buffer cache and SMP-threaded the
 * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
 *
 * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
 */

/*
 * Lock ordering:
 *
 *  ->i_mmap_rwsem		(truncate_pagecache)
 *    ->private_lock		(__free_pte->block_dirty_folio)
 *      ->swap_lock		(exclusive_swap_page, others)
 *        ->i_pages lock
 *
 *  ->i_rwsem
 *    ->invalidate_lock		(acquired by fs in truncate path)
 *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
 *
 *  ->mmap_lock
 *    ->i_mmap_rwsem
 *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
 *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
 *
 *  ->mmap_lock
 *    ->invalidate_lock		(filemap_fault)
 *      ->lock_page		(filemap_fault, access_process_vm)
 *
 *  ->i_rwsem			(generic_perform_write)
 *    ->mmap_lock		(fault_in_readable->do_page_fault)
 *
 *  bdi->wb.list_lock
 *    sb_lock			(fs/fs-writeback.c)
 *    ->i_pages lock		(__sync_single_inode)
 *
 *  ->i_mmap_rwsem
 *    ->anon_vma.lock		(vma_merge)
 *
 *  ->anon_vma.lock
 *    ->page_table_lock or pte_lock	(anon_vma_prepare and various)
 *
 *  ->page_table_lock or pte_lock
 *    ->swap_lock		(try_to_unmap_one)
 *    ->private_lock		(try_to_unmap_one)
 *    ->i_pages lock		(try_to_unmap_one)
 *    ->lruvec->lru_lock	(follow_page_mask->mark_page_accessed)
 *    ->lruvec->lru_lock	(check_pte_range->folio_isolate_lru)
 *    ->private_lock		(folio_remove_rmap_pte->set_page_dirty)
 *    ->i_pages lock		(folio_remove_rmap_pte->set_page_dirty)
 *    bdi.wb->list_lock		(folio_remove_rmap_pte->set_page_dirty)
 *    ->inode->i_lock		(folio_remove_rmap_pte->set_page_dirty)
 *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
 *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
 *    ->private_lock		(zap_pte_range->block_dirty_folio)
 */

static void page_cache_delete(struct address_space *mapping,
				   struct folio *folio, void *shadow)
{
	XA_STATE(xas, &mapping->i_pages, folio->index);
	long nr = 1;

	mapping_set_update(&xas, mapping);

	xas_set_order(&xas, folio->index, folio_order(folio));
	nr = folio_nr_pages(folio);

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

	xas_store(&xas, shadow);
	xas_init_marks(&xas);

	folio->mapping = NULL;
	/* Leave folio->index set: truncation lookup relies upon it */
	mapping->nrpages -= nr;
}

static void filemap_unaccount_folio(struct address_space *mapping,
		struct folio *folio)
{
	long nr;

	VM_BUG_ON_FOLIO(folio_mapped(folio), folio);
	if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(folio_mapped(folio))) {
		pr_alert("BUG: Bad page cache in process %s  pfn:%05lx\n",
			 current->comm, folio_pfn(folio));
		dump_page(&folio->page, "still mapped when deleted");
		dump_stack();
		add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);

		if (mapping_exiting(mapping) && !folio_test_large(folio)) {
			int mapcount = folio_mapcount(folio);

			if (folio_ref_count(folio) >= mapcount + 2) {
				/*
				 * All vmas have already been torn down, so it's
				 * a good bet that actually the page is unmapped
				 * and we'd rather not leak it: if we're wrong,
				 * another bad page check should catch it later.
				 */
				atomic_set(&folio->_mapcount, -1);
				folio_ref_sub(folio, mapcount);
			}
		}
	}

	/* hugetlb folios do not participate in page cache accounting. */
	if (folio_test_hugetlb(folio))
		return;

	nr = folio_nr_pages(folio);

	lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
	if (folio_test_swapbacked(folio)) {
		lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
		if (folio_test_pmd_mappable(folio))
			lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
	} else if (folio_test_pmd_mappable(folio)) {
		lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
		filemap_nr_thps_dec(mapping);
	}
	if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags))
		mod_node_page_state(folio_pgdat(folio),
				    NR_KERNEL_FILE_PAGES, -nr);

	/*
	 * At this point folio must be either written or cleaned by
	 * truncate.  Dirty folio here signals a bug and loss of
	 * unwritten data - on ordinary filesystems.
	 *
	 * But it's harmless on in-memory filesystems like tmpfs; and can
	 * occur when a driver which did get_user_pages() sets page dirty
	 * before putting it, while the inode is being finally evicted.
	 *
	 * Below fixes dirty accounting after removing the folio entirely
	 * but leaves the dirty flag set: it has no effect for truncated
	 * folio and anyway will be cleared before returning folio to
	 * buddy allocator.
	 */
	if (WARN_ON_ONCE(folio_test_dirty(folio) &&
			 mapping_can_writeback(mapping)))
		folio_account_cleaned(folio, inode_to_wb(mapping->host));
}

/*
 * Delete a page from the page cache and free it. Caller has to make
 * sure the page is locked and that nobody else uses it - or that usage
 * is safe.  The caller must hold the i_pages lock.
 */
void __filemap_remove_folio(struct folio *folio, void *shadow)
{
	struct address_space *mapping = folio->mapping;

	trace_mm_filemap_delete_from_page_cache(folio);
	filemap_unaccount_folio(mapping, folio);
	page_cache_delete(mapping, folio, shadow);
}

void filemap_free_folio(struct address_space *mapping, struct folio *folio)
{
	void (*free_folio)(struct folio *);

	free_folio = mapping->a_ops->free_folio;
	if (free_folio)
		free_folio(folio);

	folio_put_refs(folio, folio_nr_pages(folio));
}

/**
 * filemap_remove_folio - Remove folio from page cache.
 * @folio: The folio.
 *
 * This must be called only on folios that are locked and have been
 * verified to be in the page cache.  It will never put the folio into
 * the free list because the caller has a reference on the page.
 */
void filemap_remove_folio(struct folio *folio)
{
	struct address_space *mapping = folio->mapping;

	BUG_ON(!folio_test_locked(folio));
	spin_lock(&mapping->host->i_lock);
	xa_lock_irq(&mapping->i_pages);
	__filemap_remove_folio(folio, NULL);
	xa_unlock_irq(&mapping->i_pages);
	if (mapping_shrinkable(mapping))
		inode_lru_list_add(mapping->host);
	spin_unlock(&mapping->host->i_lock);

	filemap_free_folio(mapping, folio);
}

/*
 * page_cache_delete_batch - delete several folios from page cache
 * @mapping: the mapping to which folios belong
 * @fbatch: batch of folios to delete
 *
 * The function walks over mapping->i_pages and removes folios passed in
 * @fbatch from the mapping. The function expects @fbatch to be sorted
 * by page index and is optimised for it to be dense.
 * It tolerates holes in @fbatch (mapping entries at those indices are not
 * modified).
 *
 * The function expects the i_pages lock to be held.
 */
static void page_cache_delete_batch(struct address_space *mapping,
			     struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, fbatch->folios[0]->index);
	long total_pages = 0;
	int i = 0;
	struct folio *folio;

	mapping_set_update(&xas, mapping);
	xas_for_each(&xas, folio, ULONG_MAX) {
		if (i >= folio_batch_count(fbatch))
			break;

		/* A swap/dax/shadow entry got inserted? Skip it. */
		if (xa_is_value(folio))
			continue;
		/*
		 * A page got inserted in our range? Skip it. We have our
		 * pages locked so they are protected from being removed.
		 * If we see a page whose index is higher than ours, it
		 * means our page has been removed, which shouldn't be
		 * possible because we're holding the PageLock.
		 */
		if (folio != fbatch->folios[i]) {
			VM_BUG_ON_FOLIO(folio->index >
					fbatch->folios[i]->index, folio);
			continue;
		}

		WARN_ON_ONCE(!folio_test_locked(folio));

		folio->mapping = NULL;
		/* Leave folio->index set: truncation lookup relies on it */

		i++;
		xas_store(&xas, NULL);
		total_pages += folio_nr_pages(folio);
	}
	mapping->nrpages -= total_pages;
}

void delete_from_page_cache_batch(struct address_space *mapping,
				  struct folio_batch *fbatch)
{
	int i;

	if (!folio_batch_count(fbatch))
		return;

	spin_lock(&mapping->host->i_lock);
	xa_lock_irq(&mapping->i_pages);
	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		trace_mm_filemap_delete_from_page_cache(folio);
		filemap_unaccount_folio(mapping, folio);
	}
	page_cache_delete_batch(mapping, fbatch);
	xa_unlock_irq(&mapping->i_pages);
	if (mapping_shrinkable(mapping))
		inode_lru_list_add(mapping->host);
	spin_unlock(&mapping->host->i_lock);

	for (i = 0; i < folio_batch_count(fbatch); i++)
		filemap_free_folio(mapping, fbatch->folios[i]);
}

int filemap_check_errors(struct address_space *mapping)
{
	int ret = 0;
	/* Check for outstanding write errors */
	if (test_bit(AS_ENOSPC, &mapping->flags) &&
	    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_bit(AS_EIO, &mapping->flags) &&
	    test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}
EXPORT_SYMBOL(filemap_check_errors);

static int filemap_check_and_keep_errors(struct address_space *mapping)
{
	/* Check for outstanding write errors */
	if (test_bit(AS_EIO, &mapping->flags))
		return -EIO;
	if (test_bit(AS_ENOSPC, &mapping->flags))
		return -ENOSPC;
	return 0;
}

static int filemap_writeback(struct address_space *mapping, loff_t start,
		loff_t end, enum writeback_sync_modes sync_mode,
		long *nr_to_write)
{
	struct writeback_control wbc = {
		.sync_mode	= sync_mode,
		.nr_to_write	= nr_to_write ? *nr_to_write : LONG_MAX,
		.range_start	= start,
		.range_end	= end,
	};
	int ret;

	if (!mapping_can_writeback(mapping) ||
	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		return 0;

	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
	ret = do_writepages(mapping, &wbc);
	wbc_detach_inode(&wbc);

	if (!ret && nr_to_write)
		*nr_to_write = wbc.nr_to_write;
	return ret;
}

/**
 * filemap_fdatawrite_range - start writeback on mapping dirty pages in range
 * @mapping:	address space structure to write
 * @start:	offset in bytes where the range starts
 * @end:	offset in bytes where the range ends (inclusive)
 *
 * Start writeback against all of a mapping's dirty pages that lie
 * within the byte offsets <start, end> inclusive.
 *
 * This is a data integrity operation that waits upon dirty or in writeback
 * pages.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
		loff_t end)
{
	return filemap_writeback(mapping, start, end, WB_SYNC_ALL, NULL);
}
EXPORT_SYMBOL(filemap_fdatawrite_range);

int filemap_fdatawrite(struct address_space *mapping)
{
	return filemap_fdatawrite_range(mapping, 0, LLONG_MAX);
}
EXPORT_SYMBOL(filemap_fdatawrite);

/**
 * filemap_flush_range - start writeback on a range
 * @mapping:	target address_space
 * @start:	index to start writeback on
 * @end:	last (inclusive) index for writeback
 *
 * This is a non-integrity writeback helper, to start writing back folios
 * for the indicated range.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int filemap_flush_range(struct address_space *mapping, loff_t start,
				  loff_t end)
{
	return filemap_writeback(mapping, start, end, WB_SYNC_NONE, NULL);
}
EXPORT_SYMBOL_GPL(filemap_flush_range);

/**
 * filemap_flush - mostly a non-blocking flush
 * @mapping:	target address_space
 *
 * This is a mostly non-blocking flush.  Not suitable for data-integrity
 * purposes - I/O may not be started against all dirty pages.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int filemap_flush(struct address_space *mapping)
{
	return filemap_flush_range(mapping, 0, LLONG_MAX);
}
EXPORT_SYMBOL(filemap_flush);

/*
 * Start writeback on @nr_to_write pages from @mapping.  No one but the existing
 * btrfs caller should be using this.  Talk to linux-mm if you think adding a
 * new caller is a good idea.
 */
int filemap_flush_nr(struct address_space *mapping, long *nr_to_write)
{
	return filemap_writeback(mapping, 0, LLONG_MAX, WB_SYNC_NONE,
			nr_to_write);
}
EXPORT_SYMBOL_FOR_MODULES(filemap_flush_nr, "btrfs");

/**
 * filemap_range_has_page - check if a page exists in range.
 * @mapping:           address space within which to check
 * @start_byte:        offset in bytes where the range starts
 * @end_byte:          offset in bytes where the range ends (inclusive)
 *
 * Find at least one page in the range supplied, usually used to check if
 * direct writing in this range will trigger a writeback.
 *
 * Return: %true if at least one page exists in the specified range,
 * %false otherwise.
 */
bool filemap_range_has_page(struct address_space *mapping,
			   loff_t start_byte, loff_t end_byte)
{
	struct folio *folio;
	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
	pgoff_t max = end_byte >> PAGE_SHIFT;

	if (end_byte < start_byte)
		return false;

	rcu_read_lock();
	for (;;) {
		folio = xas_find(&xas, max);
		if (xas_retry(&xas, folio))
			continue;
		/* Shadow entries don't count */
		if (xa_is_value(folio))
			continue;
		/*
		 * We don't need to try to pin this page; we're about to
		 * release the RCU lock anyway.  It is enough to know that
		 * there was a page here recently.
		 */
		break;
	}
	rcu_read_unlock();

	return folio != NULL;
}
EXPORT_SYMBOL(filemap_range_has_page);

static void __filemap_fdatawait_range(struct address_space *mapping,
				     loff_t start_byte, loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_SHIFT;
	pgoff_t end = end_byte >> PAGE_SHIFT;
	struct folio_batch fbatch;
	unsigned nr_folios;

	folio_batch_init(&fbatch);

	while (index <= end) {
		unsigned i;

		nr_folios = filemap_get_folios_tag(mapping, &index, end,
				PAGECACHE_TAG_WRITEBACK, &fbatch);

		if (!nr_folios)
			break;

		for (i = 0; i < nr_folios; i++) {
			struct folio *folio = fbatch.folios[i];

			folio_wait_writeback(folio);
		}
		folio_batch_release(&fbatch);
		cond_resched();
	}
}

/**
 * filemap_fdatawait_range - wait for writeback to complete
 * @mapping:		address space structure to wait for
 * @start_byte:		offset in bytes where the range starts
 * @end_byte:		offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the given address space
 * in the given range and wait for all of them.  Check error status of
 * the address space and return it.
 *
 * Since the error status of the address space is cleared by this function,
 * callers are responsible for checking the return value and handling and/or
 * reporting the error.
 *
 * Return: error status of the address space.
 */
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
			    loff_t end_byte)
{
	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return filemap_check_errors(mapping);
}
EXPORT_SYMBOL(filemap_fdatawait_range);

/**
 * filemap_fdatawait_range_keep_errors - wait for writeback to complete
 * @mapping:		address space structure to wait for
 * @start_byte:		offset in bytes where the range starts
 * @end_byte:		offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the given address space in the
 * given range and wait for all of them.  Unlike filemap_fdatawait_range(),
 * this function does not clear error status of the address space.
 *
 * Use this function if callers don't handle errors themselves.  Expected
 * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
 * fsfreeze(8)
 */
int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
		loff_t start_byte, loff_t end_byte)
{
	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return filemap_check_and_keep_errors(mapping);
}
EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);

/**
 * file_fdatawait_range - wait for writeback to complete
 * @file:		file pointing to address space structure to wait for
 * @start_byte:		offset in bytes where the range starts
 * @end_byte:		offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the address space that file
 * refers to, in the given range and wait for all of them.  Check error
 * status of the address space vs. the file->f_wb_err cursor and return it.
 *
 * Since the error status of the file is advanced by this function,
 * callers are responsible for checking the return value and handling and/or
 * reporting the error.
 *
 * Return: error status of the address space vs. the file->f_wb_err cursor.
 */
int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
{
	struct address_space *mapping = file->f_mapping;

	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return file_check_and_advance_wb_err(file);
}
EXPORT_SYMBOL(file_fdatawait_range);

/**
 * filemap_fdatawait_keep_errors - wait for writeback without clearing errors
 * @mapping: address space structure to wait for
 *
 * Walk the list of under-writeback pages of the given address space
 * and wait for all of them.  Unlike filemap_fdatawait(), this function
 * does not clear error status of the address space.
 *
 * Use this function if callers don't handle errors themselves.  Expected
 * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
 * fsfreeze(8)
 *
 * Return: error status of the address space.
 */
int filemap_fdatawait_keep_errors(struct address_space *mapping)
{
	__filemap_fdatawait_range(mapping, 0, LLONG_MAX);
	return filemap_check_and_keep_errors(mapping);
}
EXPORT_SYMBOL(filemap_fdatawait_keep_errors);

/* Returns true if writeback might be needed or already in progress. */
static bool mapping_needs_writeback(struct address_space *mapping)
{
	return mapping->nrpages;
}

bool filemap_range_has_writeback(struct address_space *mapping,
				 loff_t start_byte, loff_t end_byte)
{
	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
	pgoff_t max = end_byte >> PAGE_SHIFT;
	struct folio *folio;

	if (end_byte < start_byte)
		return false;

	rcu_read_lock();
	xas_for_each(&xas, folio, max) {
		if (xas_retry(&xas, folio))
			continue;
		if (xa_is_value(folio))
			continue;
		if (folio_test_dirty(folio) || folio_test_locked(folio) ||
				folio_test_writeback(folio))
			break;
	}
	rcu_read_unlock();
	return folio != NULL;
}
EXPORT_SYMBOL_GPL(filemap_range_has_writeback);

/**
 * filemap_write_and_wait_range - write out & wait on a file range
 * @mapping:	the address_space for the pages
 * @lstart:	offset in bytes where the range starts
 * @lend:	offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
 * Note that @lend is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 *
 * Return: error status of the address space.
 */
int filemap_write_and_wait_range(struct address_space *mapping,
				 loff_t lstart, loff_t lend)
{
	int err = 0, err2;

	if (lend < lstart)
		return 0;

	if (mapping_needs_writeback(mapping)) {
		err = filemap_fdatawrite_range(mapping, lstart, lend);
		/*
		 * Even if the above returned error, the pages may be
		 * written partially (e.g. -ENOSPC), so we wait for it.
		 * But the -EIO is special case, it may indicate the worst
		 * thing (e.g. bug) happened, so we avoid waiting for it.
		 */
		if (err != -EIO)
			__filemap_fdatawait_range(mapping, lstart, lend);
	}
	err2 = filemap_check_errors(mapping);
	if (!err)
		err = err2;
	return err;
}
EXPORT_SYMBOL(filemap_write_and_wait_range);

void __filemap_set_wb_err(struct address_space *mapping, int err)
{
	errseq_t eseq = errseq_set(&mapping->wb_err, err);

	trace_filemap_set_wb_err(mapping, eseq);
}
EXPORT_SYMBOL(__filemap_set_wb_err);

/**
 * file_check_and_advance_wb_err - report wb error (if any) that was previously
 * 				   and advance wb_err to current one
 * @file: struct file on which the error is being reported
 *
 * When userland calls fsync (or something like nfsd does the equivalent), we
 * want to report any writeback errors that occurred since the last fsync (or
 * since the file was opened if there haven't been any).
 *
 * Grab the wb_err from the mapping. If it matches what we have in the file,
 * then just quickly return 0. The file is all caught up.
 *
 * If it doesn't match, then take the mapping value, set the "seen" flag in
 * it and try to swap it into place. If it works, or another task beat us
 * to it with the new value, then update the f_wb_err and return the error
 * portion. The error at this point must be reported via proper channels
 * (a'la fsync, or NFS COMMIT operation, etc.).
 *
 * While we handle mapping->wb_err with atomic operations, the f_wb_err
 * value is protected by the f_lock since we must ensure that it reflects
 * the latest value swapped in for this file descriptor.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int file_check_and_advance_wb_err(struct file *file)
{
	int err = 0;
	errseq_t old = READ_ONCE(file->f_wb_err);
	struct address_space *mapping = file->f_mapping;

	/* Locklessly handle the common case where nothing has changed */
	if (errseq_check(&mapping->wb_err, old)) {
		/* Something changed, must use slow path */
		spin_lock(&file->f_lock);
		old = file->f_wb_err;
		err = errseq_check_and_advance(&mapping->wb_err,
						&file->f_wb_err);
		trace_file_check_and_advance_wb_err(file, old);
		spin_unlock(&file->f_lock);
	}

	/*
	 * We're mostly using this function as a drop in replacement for
	 * filemap_check_errors. Clear AS_EIO/AS_ENOSPC to emulate the effect
	 * that the legacy code would have had on these flags.
	 */
	clear_bit(AS_EIO, &mapping->flags);
	clear_bit(AS_ENOSPC, &mapping->flags);
	return err;
}
EXPORT_SYMBOL(file_check_and_advance_wb_err);

/**
 * file_write_and_wait_range - write out & wait on a file range
 * @file:	file pointing to address_space with pages
 * @lstart:	offset in bytes where the range starts
 * @lend:	offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
 * Note that @lend is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 *
 * After writing out and waiting on the data, we check and advance the
 * f_wb_err cursor to the latest value, and return any errors detected there.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
{
	int err = 0, err2;
	struct address_space *mapping = file->f_mapping;

	if (lend < lstart)
		return 0;

	if (mapping_needs_writeback(mapping)) {
		err = filemap_fdatawrite_range(mapping, lstart, lend);
		/* See comment of filemap_write_and_wait() */
		if (err != -EIO)
			__filemap_fdatawait_range(mapping, lstart, lend);
	}
	err2 = file_check_and_advance_wb_err(file);
	if (!err)
		err = err2;
	return err;
}
EXPORT_SYMBOL(file_write_and_wait_range);

/**
 * replace_page_cache_folio - replace a pagecache folio with a new one
 * @old:	folio to be replaced
 * @new:	folio to replace with
 *
 * This function replaces a folio in the pagecache with a new one.  On
 * success it acquires the pagecache reference for the new folio and
 * drops it for the old folio.  Both the old and new folios must be
 * locked.  This function does not add the new folio to the LRU, the
 * caller must do that.
 *
 * The remove + add is atomic.  This function cannot fail.
 */
void replace_page_cache_folio(struct folio *old, struct folio *new)
{
	struct address_space *mapping = old->mapping;
	void (*free_folio)(struct folio *) = mapping->a_ops->free_folio;
	pgoff_t offset = old->index;
	XA_STATE(xas, &mapping->i_pages, offset);

	VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
	VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
	VM_BUG_ON_FOLIO(new->mapping, new);

	folio_get(new);
	new->mapping = mapping;
	new->index = offset;

	mem_cgroup_replace_folio(old, new);

	xas_lock_irq(&xas);
	xas_store(&xas, new);

	old->mapping = NULL;
	/* hugetlb pages do not participate in page cache accounting. */
	if (!folio_test_hugetlb(old))
		lruvec_stat_sub_folio(old, NR_FILE_PAGES);
	if (!folio_test_hugetlb(new))
		lruvec_stat_add_folio(new, NR_FILE_PAGES);
	if (folio_test_swapbacked(old))
		lruvec_stat_sub_folio(old, NR_SHMEM);
	if (folio_test_swapbacked(new))
		lruvec_stat_add_folio(new, NR_SHMEM);
	xas_unlock_irq(&xas);
	if (free_folio)
		free_folio(old);
	folio_put(old);
}
EXPORT_SYMBOL_GPL(replace_page_cache_folio);

noinline int __filemap_add_folio(struct address_space *mapping,
		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
	bool huge;
	long nr;
	unsigned int forder = folio_order(folio);

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
			folio);
	mapping_set_update(&xas, mapping);

	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
	huge = folio_test_hugetlb(folio);
	nr = folio_nr_pages(folio);

	gfp &= GFP_RECLAIM_MASK;
	folio_ref_add(folio, nr);
	folio->mapping = mapping;
	folio->index = xas.xa_index;

	for (;;) {
		int order = -1;
		void *entry, *old = NULL;

		xas_lock_irq(&xas);
		xas_for_each_conflict(&xas, entry) {
			old = entry;
			if (!xa_is_value(entry)) {
				xas_set_err(&xas, -EEXIST);
				goto unlock;
			}
			/*
			 * If a larger entry exists,
			 * it will be the first and only entry iterated.
			 */
			if (order == -1)
				order = xas_get_order(&xas);
		}

		if (old) {
			if (order > 0 && order > forder) {
				unsigned int split_order = max(forder,
						xas_try_split_min_order(order));

				/* How to handle large swap entries? */
				BUG_ON(shmem_mapping(mapping));

				while (order > forder) {
					xas_set_order(&xas, index, split_order);
					xas_try_split(&xas, old, order);
					if (xas_error(&xas))
						goto unlock;
					order = split_order;
					split_order =
						max(xas_try_split_min_order(
							    split_order),
						    forder);
				}
				xas_reset(&xas);
			}
			if (shadowp)
				*shadowp = old;
		}

		xas_store(&xas, folio);
		if (xas_error(&xas))
			goto unlock;

		mapping->nrpages += nr;

		/* hugetlb pages do not participate in page cache accounting */
		if (!huge) {
			lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
			if (folio_test_pmd_mappable(folio))
				lruvec_stat_mod_folio(folio,
						NR_FILE_THPS, nr);
		}

unlock:
		xas_unlock_irq(&xas);

		if (!xas_nomem(&xas, gfp))
			break;
	}

	if (xas_error(&xas))
		goto error;

	trace_mm_filemap_add_to_page_cache(folio);
	return 0;
error:
	folio->mapping = NULL;
	/* Leave folio->index set: truncation relies upon it */
	folio_put_refs(folio, nr);
	return xas_error(&xas);
}
ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);

int filemap_add_folio(struct address_space *mapping, struct folio *folio,
				pgoff_t index, gfp_t gfp)
{
	void *shadow = NULL;
	int ret;
	struct mem_cgroup *tmp;
	bool kernel_file = test_bit(AS_KERNEL_FILE, &mapping->flags);

	if (kernel_file)
		tmp = set_active_memcg(root_mem_cgroup);
	ret = mem_cgroup_charge(folio, NULL, gfp);
	if (kernel_file)
		set_active_memcg(tmp);
	if (ret)
		return ret;

	__folio_set_locked(folio);
	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
	if (unlikely(ret)) {
		mem_cgroup_uncharge(folio);
		__folio_clear_locked(folio);
	} else {
		/*
		 * The folio might have been evicted from cache only
		 * recently, in which case it should be activated like
		 * any other repeatedly accessed folio.
		 * The exception is folios getting rewritten; evicting other
		 * data from the working set, only to cache data that will
		 * get overwritten with something else, is a waste of memory.
		 */
		WARN_ON_ONCE(folio_test_active(folio));
		if (!(gfp & __GFP_WRITE) && shadow)
			workingset_refault(folio, shadow);
		folio_add_lru(folio);
		if (kernel_file)
			mod_node_page_state(folio_pgdat(folio),
					    NR_KERNEL_FILE_PAGES,
					    folio_nr_pages(folio));
	}
	return ret;
}
EXPORT_SYMBOL_GPL(filemap_add_folio);

#ifdef CONFIG_NUMA
struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order,
		struct mempolicy *policy)
{
	int n;
	struct folio *folio;

	if (policy)
		return folio_alloc_mpol_noprof(gfp, order, policy,
				NO_INTERLEAVE_INDEX, numa_node_id());

	if (cpuset_do_page_mem_spread()) {
		unsigned int cpuset_mems_cookie;
		do {
			cpuset_mems_cookie = read_mems_allowed_begin();
			n = cpuset_mem_spread_node();
			folio = __folio_alloc_node_noprof(gfp, order, n);
		} while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));

		return folio;
	}
	return folio_alloc_noprof(gfp, order);
}
EXPORT_SYMBOL(filemap_alloc_folio_noprof);
#endif

/*
 * filemap_invalidate_lock_two - lock invalidate_lock for two mappings
 *
 * Lock exclusively invalidate_lock of any passed mapping that is not NULL.
 *
 * @mapping1: the first mapping to lock
 * @mapping2: the second mapping to lock
 */
void filemap_invalidate_lock_two(struct address_space *mapping1,
				 struct address_space *mapping2)
{
	if (mapping1 > mapping2)
		swap(mapping1, mapping2);
	if (mapping1)
		down_write(&mapping1->invalidate_lock);
	if (mapping2 && mapping1 != mapping2)
		down_write_nested(&mapping2->invalidate_lock, 1);
}
EXPORT_SYMBOL(filemap_invalidate_lock_two);

/*
 * filemap_invalidate_unlock_two - unlock invalidate_lock for two mappings
 *
 * Unlock exclusive invalidate_lock of any passed mapping that is not NULL.
 *
 * @mapping1: the first mapping to unlock
 * @mapping2: the second mapping to unlock
 */
void filemap_invalidate_unlock_two(struct address_space *mapping1,
				   struct address_space *mapping2)
{
	if (mapping1)
		up_write(&mapping1->invalidate_lock);
	if (mapping2 && mapping1 != mapping2)
		up_write(&mapping2->invalidate_lock);
}
EXPORT_SYMBOL(filemap_invalidate_unlock_two);

/*
 * In order to wait for pages to become available there must be
 * waitqueues associated with pages. By using a hash table of
 * waitqueues where the bucket discipline is to maintain all
 * waiters on the same queue and wake all when any of the pages
 * become available, and for the woken contexts to check to be
 * sure the appropriate page became available, this saves space
 * at a cost of "thundering herd" phenomena during rare hash
 * collisions.
 */
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}

/* How many times do we accept lock stealing from under a waiter? */
static int sysctl_page_lock_unfairness = 5;
static const struct ctl_table filemap_sysctl_table[] = {
	{
		.procname	= "page_lock_unfairness",
		.data		= &sysctl_page_lock_unfairness,
		.maxlen		= sizeof(sysctl_page_lock_unfairness),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
	}
};

void __init pagecache_init(void)
{
	int i;

	for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
		init_waitqueue_head(&folio_wait_table[i]);

	page_writeback_init();
	register_sysctl_init("vm", filemap_sysctl_table);
}

/*
 * The page wait code treats the "wait->flags" somewhat unusually, because
 * we have multiple different kinds of waits, not just the usual "exclusive"
 * one.
 *
 * We have:
 *
 *  (a) no special bits set:
 *
 *	We're just waiting for the bit to be released, and when a waker
 *	calls the wakeup function, we set WQ_FLAG_WOKEN and wake it up,
 *	and remove it from the wait queue.
 *
 *	Simple and straightforward.
 *
 *  (b) WQ_FLAG_EXCLUSIVE:
 *
 *	The waiter is waiting to get the lock, and only one waiter should
 *	be woken up to avoid any thundering herd behavior. We'll set the
 *	WQ_FLAG_WOKEN bit, wake it up, and remove it from the wait queue.
 *
 *	This is the traditional exclusive wait.
 *
 *  (c) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:
 *
 *	The waiter is waiting to get the bit, and additionally wants the
 *	lock to be transferred to it for fair lock behavior. If the lock
 *	cannot be taken, we stop walking the wait queue without waking
 *	the waiter.
 *
 *	This is the "fair lock handoff" case, and in addition to setting
 *	WQ_FLAG_WOKEN, we set WQ_FLAG_DONE to let the waiter easily see
 *	that it now has the lock.
 */
static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
{
	unsigned int flags;
	struct wait_page_key *key = arg;
	struct wait_page_queue *wait_page
		= container_of(wait, struct wait_page_queue, wait);

	if (!wake_page_match(wait_page, key))
		return 0;

	/*
	 * If it's a lock handoff wait, we get the bit for it, and
	 * stop walking (and do not wake it up) if we can't.
	 */
	flags = wait->flags;
	if (flags & WQ_FLAG_EXCLUSIVE) {
		if (test_bit(key->bit_nr, &key->folio->flags.f))
			return -1;
		if (flags & WQ_FLAG_CUSTOM) {
			if (test_and_set_bit(key->bit_nr, &key->folio->flags.f))
				return -1;
			flags |= WQ_FLAG_DONE;
		}
	}

	/*
	 * We are holding the wait-queue lock, but the waiter that
	 * is waiting for this will be checking the flags without
	 * any locking.
	 *
	 * So update the flags atomically, and wake up the waiter
	 * afterwards to avoid any races. This store-release pairs
	 * with the load-acquire in folio_wait_bit_common().
	 */
	smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
	wake_up_state(wait->private, mode);

	/*
	 * Ok, we have successfully done what we're waiting for,
	 * and we can unconditionally remove the wait entry.
	 *
	 * Note that this pairs with the "finish_wait()" in the
	 * waiter, and has to be the absolute last thing we do.
	 * After this list_del_init(&wait->entry) the wait entry
	 * might be de-allocated and the process might even have
	 * exited.
	 */
	list_del_init_careful(&wait->entry);
	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}

static void folio_wake_bit(struct folio *folio, int bit_nr)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	struct wait_page_key key;
	unsigned long flags;

	key.folio = folio;
	key.bit_nr = bit_nr;
	key.page_match = 0;

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key(q, TASK_NORMAL, &key);

	/*
	 * It's possible to miss clearing waiters here, when we woke our page
	 * waiters, but the hashed waitqueue has waiters for other pages on it.
	 * That's okay, it's a rare case. The next waker will clear it.
	 *
	 * Note that, depending on the page pool (buddy, hugetlb, ZONE_DEVICE,
	 * other), the flag may be cleared in the course of freeing the page;
	 * but that is not required for correctness.
	 */
	if (!waitqueue_active(q) || !key.page_match)
		folio_clear_waiters(folio);

	spin_unlock_irqrestore(&q->lock, flags);
}

/*
 * A choice of three behaviors for folio_wait_bit_common():
 */
enum behavior {
	EXCLUSIVE,	/* Hold ref to page and take the bit when woken, like
			 * __folio_lock() waiting on then setting PG_locked.
			 */
	SHARED,		/* Hold ref to page and check the bit when woken, like
			 * folio_wait_writeback() waiting on PG_writeback.
			 */
	DROP,		/* Drop ref to page before wait, no check when woken,
			 * like folio_put_wait_locked() on PG_locked.
			 */
};

/*
 * Attempt to check (or get) the folio flag, and mark us done
 * if successful.
 */
static inline bool folio_trylock_flag(struct folio *folio, int bit_nr,
					struct wait_queue_entry *wait)
{
	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
		if (test_and_set_bit(bit_nr, &folio->flags.f))
			return false;
	} else if (test_bit(bit_nr, &folio->flags.f))
		return false;

	wait->flags |= WQ_FLAG_WOKEN | WQ_FLAG_DONE;
	return true;
}

static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
		int state, enum behavior behavior)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	int unfairness = sysctl_page_lock_unfairness;
	struct wait_page_queue wait_page;
	wait_queue_entry_t *wait = &wait_page.wait;
	bool thrashing = false;
	unsigned long pflags;
	bool in_thrashing;

	if (bit_nr == PG_locked &&
	    !folio_test_uptodate(folio) && folio_test_workingset(folio)) {
		delayacct_thrashing_start(&in_thrashing);
		psi_memstall_enter(&pflags);
		thrashing = true;
	}

	init_wait(wait);
	wait->func = wake_page_function;
	wait_page.folio = folio;
	wait_page.bit_nr = bit_nr;

repeat:
	wait->flags = 0;
	if (behavior == EXCLUSIVE) {
		wait->flags = WQ_FLAG_EXCLUSIVE;
		if (--unfairness < 0)
			wait->flags |= WQ_FLAG_CUSTOM;
	}

	/*
	 * Do one last check whether we can get the
	 * page bit synchronously.
	 *
	 * Do the folio_set_waiters() marking before that
	 * to let any waker we _just_ missed know they
	 * need to wake us up (otherwise they'll never
	 * even go to the slow case that looks at the
	 * page queue), and add ourselves to the wait
	 * queue if we need to sleep.
	 *
	 * This part needs to be done under the queue
	 * lock to avoid races.
	 */
	spin_lock_irq(&q->lock);
	folio_set_waiters(folio);
	if (!folio_trylock_flag(folio, bit_nr, wait))
		__add_wait_queue_entry_tail(q, wait);
	spin_unlock_irq(&q->lock);

	/*
	 * From now on, all the logic will be based on
	 * the WQ_FLAG_WOKEN and WQ_FLAG_DONE flag, to
	 * see whether the page bit testing has already
	 * been done by the wake function.
	 *
	 * We can drop our reference to the folio.
	 */
	if (behavior == DROP)
		folio_put(folio);

	/*
	 * Note that until the "finish_wait()", or until
	 * we see the WQ_FLAG_WOKEN flag, we need to
	 * be very careful with the 'wait->flags', because
	 * we may race with a waker that sets them.
	 */
	for (;;) {
		unsigned int flags;

		set_current_state(state);

		/* Loop until we've been woken or interrupted */
		flags = smp_load_acquire(&wait->flags);
		if (!(flags & WQ_FLAG_WOKEN)) {
			if (signal_pending_state(state, current))
				break;

			io_schedule();
			continue;
		}

		/* If we were non-exclusive, we're done */
		if (behavior != EXCLUSIVE)
			break;

		/* If the waker got the lock for us, we're done */
		if (flags & WQ_FLAG_DONE)
			break;

		/*
		 * Otherwise, if we're getting the lock, we need to
		 * try to get it ourselves.
		 *
		 * And if that fails, we'll have to retry this all.
		 */
		if (unlikely(test_and_set_bit(bit_nr, folio_flags(folio, 0))))
			goto repeat;

		wait->flags |= WQ_FLAG_DONE;
		break;
	}

	/*
	 * If a signal happened, this 'finish_wait()' may remove the last
	 * waiter from the wait-queues, but the folio waiters bit will remain
	 * set. That's ok. The next wakeup will take care of it, and trying
	 * to do it here would be difficult and prone to races.
	 */
	finish_wait(q, wait);

	if (thrashing) {
		delayacct_thrashing_end(&in_thrashing);
		psi_memstall_leave(&pflags);
	}

	/*
	 * NOTE! The wait->flags weren't stable until we've done the
	 * 'finish_wait()', and we could have exited the loop above due
	 * to a signal, and had a wakeup event happen after the signal
	 * test but before the 'finish_wait()'.
	 *
	 * So only after the finish_wait() can we reliably determine
	 * if we got woken up or not, so we can now figure out the final
	 * return value based on that state without races.
	 *
	 * Also note that WQ_FLAG_WOKEN is sufficient for a non-exclusive
	 * waiter, but an exclusive one requires WQ_FLAG_DONE.
	 */
	if (behavior == EXCLUSIVE)
		return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;

	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
}

#ifdef CONFIG_MIGRATION
/**
 * migration_entry_wait_on_locked - Wait for a migration entry to be removed
 * @entry: migration swap entry.
 * @ptl: already locked ptl. This function will drop the lock.
 *
 * Wait for a migration entry referencing the given page to be removed. This is
 * equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE) except
 * this can be called without taking a reference on the page. Instead this
 * should be called while holding the ptl for the migration entry referencing
 * the page.
 *
 * Returns after unlocking the ptl.
 *
 * This follows the same logic as folio_wait_bit_common() so see the comments
 * there.
 */
void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
	__releases(ptl)
{
	struct wait_page_queue wait_page;
	wait_queue_entry_t *wait = &wait_page.wait;
	bool thrashing = false;
	unsigned long pflags;
	bool in_thrashing;
	wait_queue_head_t *q;
	struct folio *folio = softleaf_to_folio(entry);

	q = folio_waitqueue(folio);
	if (!folio_test_uptodate(folio) && folio_test_workingset(folio)) {
		delayacct_thrashing_start(&in_thrashing);
		psi_memstall_enter(&pflags);
		thrashing = true;
	}

	init_wait(wait);
	wait->func = wake_page_function;
	wait_page.folio = folio;
	wait_page.bit_nr = PG_locked;
	wait->flags = 0;

	spin_lock_irq(&q->lock);
	folio_set_waiters(folio);
	if (!folio_trylock_flag(folio, PG_locked, wait))
		__add_wait_queue_entry_tail(q, wait);
	spin_unlock_irq(&q->lock);

	/*
	 * If a migration entry exists for the page the migration path must hold
	 * a valid reference to the page, and it must take the ptl to remove the
	 * migration entry. So the page is valid until the ptl is dropped.
	 */
	spin_unlock(ptl);

	for (;;) {
		unsigned int flags;

		set_current_state(TASK_UNINTERRUPTIBLE);

		/* Loop until we've been woken or interrupted */
		flags = smp_load_acquire(&wait->flags);
		if (!(flags & WQ_FLAG_WOKEN)) {
			if (signal_pending_state(TASK_UNINTERRUPTIBLE, current))
				break;

			io_schedule();
			continue;
		}
		break;
	}

	finish_wait(q, wait);

	if (thrashing) {
		delayacct_thrashing_end(&in_thrashing);
		psi_memstall_leave(&pflags);
	}
}
#endif

void folio_wait_bit(struct folio *folio, int bit_nr)
{
	folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
}
EXPORT_SYMBOL(folio_wait_bit);

int folio_wait_bit_killable(struct folio *folio, int bit_nr)
{
	return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED);
}
EXPORT_SYMBOL(folio_wait_bit_killable);

/**
 * folio_put_wait_locked - Drop a reference and wait for it to be unlocked
 * @folio: The folio to wait for.
 * @state: The sleep state (TASK_KILLABLE, TASK_UNINTERRUPTIBLE, etc).
 *
 * The caller should hold a reference on @folio.  They expect the page to
 * become unlocked relatively soon, but do not wish to hold up migration
 * (for example) by holding the reference while waiting for the folio to
 * come unlocked.  After this function returns, the caller should not
 * dereference @folio.
 *
 * Return: 0 if the folio was unlocked or -EINTR if interrupted by a signal.
 */
static int folio_put_wait_locked(struct folio *folio, int state)
{
	return folio_wait_bit_common(folio, PG_locked, state, DROP);
}

/**
 * folio_unlock - Unlock a locked folio.
 * @folio: The folio.
 *
 * Unlocks the folio and wakes up any thread sleeping on the page lock.
 *
 * Context: May be called from interrupt or process context.  May not be
 * called from NMI context.
 */
void folio_unlock(struct folio *folio)
{
	/* Bit 7 allows x86 to check the byte's sign bit */
	BUILD_BUG_ON(PG_waiters != 7);
	BUILD_BUG_ON(PG_locked > 7);
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	if (folio_xor_flags_has_waiters(folio, 1 << PG_locked))
		folio_wake_bit(folio, PG_locked);
}
EXPORT_SYMBOL(folio_unlock);

/**
 * folio_end_read - End read on a folio.
 * @folio: The folio.
 * @success: True if all reads completed successfully.
 *
 * When all reads against a folio have completed, filesystems should
 * call this function to let the pagecache know that no more reads
 * are outstanding.  This will unlock the folio and wake up any thread
 * sleeping on the lock.  The folio will also be marked uptodate if all
 * reads succeeded.
 *
 * Context: May be called from interrupt or process context.  May not be
 * called from NMI context.
 */
void folio_end_read(struct folio *folio, bool success)
{
	unsigned long mask = 1 << PG_locked;

	/* Must be in bottom byte for x86 to work */
	BUILD_BUG_ON(PG_uptodate > 7);
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	VM_BUG_ON_FOLIO(success && folio_test_uptodate(folio), folio);

	if (likely(success))
		mask |= 1 << PG_uptodate;
	if (folio_xor_flags_has_waiters(folio, mask))
		folio_wake_bit(folio, PG_locked);
}
EXPORT_SYMBOL(folio_end_read);

/**
 * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
 * @folio: The folio.
 *
 * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
 * it.  The folio reference held for PG_private_2 being set is released.
 *
 * This is, for example, used when a netfs folio is being written to a local
 * disk cache, thereby allowing writes to the cache for the same folio to be
 * serialised.
 */
void folio_end_private_2(struct folio *folio)
{
	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
	folio_wake_bit(folio, PG_private_2);
	folio_put(folio);
}
EXPORT_SYMBOL(folio_end_private_2);

/**
 * folio_wait_private_2 - Wait for PG_private_2 to be cleared on a folio.
 * @folio: The folio to wait on.
 *
 * Wait for PG_private_2 to be cleared on a folio.
 */
void folio_wait_private_2(struct folio *folio)
{
	while (folio_test_private_2(folio))
		folio_wait_bit(folio, PG_private_2);
}
EXPORT_SYMBOL(folio_wait_private_2);

/**
 * folio_wait_private_2_killable - Wait for PG_private_2 to be cleared on a folio.
 * @folio: The folio to wait on.
 *
 * Wait for PG_private_2 to be cleared on a folio or until a fatal signal is
 * received by the calling task.
 *
 * Return:
 * - 0 if successful.
 * - -EINTR if a fatal signal was encountered.
 */
int folio_wait_private_2_killable(struct folio *folio)
{
	int ret = 0;

	while (folio_test_private_2(folio)) {
		ret = folio_wait_bit_killable(folio, PG_private_2);
		if (ret < 0)
			break;
	}

	return ret;
}
EXPORT_SYMBOL(folio_wait_private_2_killable);

static void filemap_end_dropbehind(struct folio *folio)
{
	struct address_space *mapping = folio->mapping;

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

	if (folio_test_writeback(folio) || folio_test_dirty(folio))
		return;
	if (!folio_test_clear_dropbehind(folio))
		return;
	if (mapping)
		folio_unmap_invalidate(mapping, folio, 0);
}

/*
 * If folio was marked as dropbehind, then pages should be dropped when writeback
 * completes. Do that now. If we fail, it's likely because of a big folio -
 * just reset dropbehind for that case and latter completions should invalidate.
 */
void folio_end_dropbehind(struct folio *folio)
{
	if (!folio_test_dropbehind(folio))
		return;

	/*
	 * Hitting !in_task() should not happen off RWF_DONTCACHE writeback,
	 * but can happen if normal writeback just happens to find dirty folios
	 * that were created as part of uncached writeback, and that writeback
	 * would otherwise not need non-IRQ handling. Just skip the
	 * invalidation in that case.
	 */
	if (in_task() && folio_trylock(folio)) {
		filemap_end_dropbehind(folio);
		folio_unlock(folio);
	}
}
EXPORT_SYMBOL_GPL(folio_end_dropbehind);

/**
 * folio_end_writeback_no_dropbehind - End writeback against a folio.
 * @folio: The folio.
 *
 * The folio must actually be under writeback.
 * This call is intended for filesystems that need to defer dropbehind.
 *
 * Context: May be called from process or interrupt context.
 */
void folio_end_writeback_no_dropbehind(struct folio *folio)
{
	VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);

	/*
	 * folio_test_clear_reclaim() could be used here but it is an
	 * atomic operation and overkill in this particular case. Failing
	 * to shuffle a folio marked for immediate reclaim is too mild
	 * a gain to justify taking an atomic operation penalty at the
	 * end of every folio writeback.
	 */
	if (folio_test_reclaim(folio)) {
		folio_clear_reclaim(folio);
		folio_rotate_reclaimable(folio);
	}

	if (__folio_end_writeback(folio))
		folio_wake_bit(folio, PG_writeback);

	acct_reclaim_writeback(folio);
}
EXPORT_SYMBOL_GPL(folio_end_writeback_no_dropbehind);

/**
 * folio_end_writeback - End writeback against a folio.
 * @folio: The folio.
 *
 * The folio must actually be under writeback.
 *
 * Context: May be called from process or interrupt context.
 */
void folio_end_writeback(struct folio *folio)
{
	VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);

	/*
	 * Writeback does not hold a folio reference of its own, relying
	 * on truncation to wait for the clearing of PG_writeback.
	 * But here we must make sure that the folio is not freed and
	 * reused before the folio_wake_bit().
	 */
	folio_get(folio);
	folio_end_writeback_no_dropbehind(folio);
	folio_end_dropbehind(folio);
	folio_put(folio);
}
EXPORT_SYMBOL(folio_end_writeback);

/**
 * __folio_lock - Get a lock on the folio, assuming we need to sleep to get it.
 * @folio: The folio to lock
 */
void __folio_lock(struct folio *folio)
{
	folio_wait_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE,
				EXCLUSIVE);
}
EXPORT_SYMBOL(__folio_lock);

int __folio_lock_killable(struct folio *folio)
{
	return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE,
					EXCLUSIVE);
}
EXPORT_SYMBOL_GPL(__folio_lock_killable);

static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
{
	struct wait_queue_head *q = folio_waitqueue(folio);
	int ret;

	wait->folio = folio;
	wait->bit_nr = PG_locked;

	spin_lock_irq(&q->lock);
	__add_wait_queue_entry_tail(q, &wait->wait);
	folio_set_waiters(folio);
	ret = !folio_trylock(folio);
	/*
	 * If we were successful now, we know we're still on the
	 * waitqueue as we're still under the lock. This means it's
	 * safe to remove and return success, we know the callback
	 * isn't going to trigger.
	 */
	if (!ret)
		__remove_wait_queue(q, &wait->wait);
	else
		ret = -EIOCBQUEUED;
	spin_unlock_irq(&q->lock);
	return ret;
}

/*
 * Return values:
 * 0 - folio is locked.
 * non-zero - folio is not locked.
 *     mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
 *     vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
 *     FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
 *
 * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
 * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
 */
vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
{
	unsigned int flags = vmf->flags;

	if (fault_flag_allow_retry_first(flags)) {
		/*
		 * CAUTION! In this case, mmap_lock/per-VMA lock is not
		 * released even though returning VM_FAULT_RETRY.
		 */
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return VM_FAULT_RETRY;

		release_fault_lock(vmf);
		if (flags & FAULT_FLAG_KILLABLE)
			folio_wait_locked_killable(folio);
		else
			folio_wait_locked(folio);
		return VM_FAULT_RETRY;
	}
	if (flags & FAULT_FLAG_KILLABLE) {
		bool ret;

		ret = __folio_lock_killable(folio);
		if (ret) {
			release_fault_lock(vmf);
			return VM_FAULT_RETRY;
		}
	} else {
		__folio_lock(folio);
	}

	return 0;
}

/**
 * page_cache_next_miss() - Find the next gap in the page cache.
 * @mapping: Mapping.
 * @index: Index.
 * @max_scan: Maximum range to search.
 *
 * Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the
 * gap with the lowest index.
 *
 * This function may be called under the rcu_read_lock.  However, this will
 * not atomically search a snapshot of the cache at a single point in time.
 * For example, if a gap is created at index 5, then subsequently a gap is
 * created at index 10, page_cache_next_miss covering both indices may
 * return 10 if called under the rcu_read_lock.
 *
 * Return: The index of the gap if found, otherwise an index outside the
 * range specified (in which case 'return - index >= max_scan' will be true).
 * In the rare case of index wrap-around, 0 will be returned.
 */
pgoff_t page_cache_next_miss(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan)
{
	XA_STATE(xas, &mapping->i_pages, index);
	unsigned long nr = max_scan;

	while (nr--) {
		void *entry = xas_next(&xas);
		if (!entry || xa_is_value(entry))
			return xas.xa_index;
		if (xas.xa_index == 0)
			return 0;
	}

	return index + max_scan;
}
EXPORT_SYMBOL(page_cache_next_miss);

/**
 * page_cache_prev_miss() - Find the previous gap in the page cache.
 * @mapping: Mapping.
 * @index: Index.
 * @max_scan: Maximum range to search.
 *
 * Search the range [max(index - max_scan + 1, 0), index] for the
 * gap with the highest index.
 *
 * This function may be called under the rcu_read_lock.  However, this will
 * not atomically search a snapshot of the cache at a single point in time.
 * For example, if a gap is created at index 10, then subsequently a gap is
 * created at index 5, page_cache_prev_miss() covering both indices may
 * return 5 if called under the rcu_read_lock.
 *
 * Return: The index of the gap if found, otherwise an index outside the
 * range specified (in which case 'index - return >= max_scan' will be true).
 * In the rare case of wrap-around, ULONG_MAX will be returned.
 */
pgoff_t page_cache_prev_miss(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan)
{
	XA_STATE(xas, &mapping->i_pages, index);

	while (max_scan--) {
		void *entry = xas_prev(&xas);
		if (!entry || xa_is_value(entry))
			break;
		if (xas.xa_index == ULONG_MAX)
			break;
	}

	return xas.xa_index;
}
EXPORT_SYMBOL(page_cache_prev_miss);

/*
 * Lockless page cache protocol:
 * On the lookup side:
 * 1. Load the folio from i_pages
 * 2. Increment the refcount if it's not zero
 * 3. If the folio is not found by xas_reload(), put the refcount and retry
 *
 * On the removal side:
 * A. Freeze the page (by zeroing the refcount if nobody else has a reference)
 * B. Remove the page from i_pages
 * C. Return the page to the page allocator
 *
 * This means that any page may have its reference count temporarily
 * increased by a speculative page cache (or GUP-fast) lookup as it can
 * be allocated by another user before the RCU grace period expires.
 * Because the refcount temporarily acquired here may end up being the
 * last refcount on the page, any page allocation must be freeable by
 * folio_put().
 */

/*
 * filemap_get_entry - Get a page cache entry.
 * @mapping: the address_space to search
 * @index: The page cache index.
 *
 * Looks up the page cache entry at @mapping & @index.  If it is a folio,
 * it is returned with an increased refcount.  If it is a shadow entry
 * of a previously evicted folio, or a swap entry from shmem/tmpfs,
 * it is returned without further action.
 *
 * Return: The folio, swap or shadow entry, %NULL if nothing is found.
 */
void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
{
	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (xas_retry(&xas, folio))
		goto repeat;
	/*
	 * A shadow entry of a recently evicted page, or a swap entry from
	 * shmem/tmpfs.  Return it without attempting to raise page count.
	 */
	if (!folio || xa_is_value(folio))
		goto out;

	if (!folio_try_get(folio))
		goto repeat;

	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);
		goto repeat;
	}
out:
	rcu_read_unlock();

	return folio;
}

/**
 * __filemap_get_folio_mpol - Find and get a reference to a folio.
 * @mapping: The address_space to search.
 * @index: The page index.
 * @fgp_flags: %FGP flags modify how the folio is returned.
 * @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
 * @policy: NUMA memory allocation policy to follow.
 *
 * Looks up the page cache entry at @mapping & @index.
 *
 * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
 * if the %GFP flags specified for %FGP_CREAT are atomic.
 *
 * If this function returns a folio, it is returned with an increased refcount.
 *
 * Return: The found folio or an ERR_PTR() otherwise.
 */
struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
		pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy *policy)
{
	struct folio *folio;

repeat:
	folio = filemap_get_entry(mapping, index);
	if (xa_is_value(folio))
		folio = NULL;
	if (!folio)
		goto no_page;

	if (fgp_flags & FGP_LOCK) {
		if (fgp_flags & FGP_NOWAIT) {
			if (!folio_trylock(folio)) {
				folio_put(folio);
				return ERR_PTR(-EAGAIN);
			}
		} else {
			folio_lock(folio);
		}

		/* Has the page been truncated? */
		if (unlikely(folio->mapping != mapping)) {
			folio_unlock(folio);
			folio_put(folio);
			goto repeat;
		}
		VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
	}

	if (fgp_flags & FGP_ACCESSED)
		folio_mark_accessed(folio);
	else if (fgp_flags & FGP_WRITE) {
		/* Clear idle flag for buffer write */
		if (folio_test_idle(folio))
			folio_clear_idle(folio);
	}

	if (fgp_flags & FGP_STABLE)
		folio_wait_stable(folio);
no_page:
	if (!folio && (fgp_flags & FGP_CREAT)) {
		unsigned int min_order = mapping_min_folio_order(mapping);
		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
		int err;
		index = mapping_align_index(mapping, index);

		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
			gfp |= __GFP_WRITE;
		if (fgp_flags & FGP_NOFS)
			gfp &= ~__GFP_FS;
		if (fgp_flags & FGP_NOWAIT) {
			gfp &= ~GFP_KERNEL;
			gfp |= GFP_NOWAIT;
		}
		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
			fgp_flags |= FGP_LOCK;

		if (order > mapping_max_folio_order(mapping))
			order = mapping_max_folio_order(mapping);
		/* If we're not aligned, allocate a smaller folio */
		if (index & ((1UL << order) - 1))
			order = __ffs(index);

		do {
			gfp_t alloc_gfp = gfp;

			err = -ENOMEM;
			if (order > min_order)
				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
			folio = filemap_alloc_folio(alloc_gfp, order, policy);
			if (!folio)
				continue;

			/* Init accessed so avoid atomic mark_page_accessed later */
			if (fgp_flags & FGP_ACCESSED)
				__folio_set_referenced(folio);
			if (fgp_flags & FGP_DONTCACHE)
				__folio_set_dropbehind(folio);

			err = filemap_add_folio(mapping, folio, index, gfp);
			if (!err)
				break;
			folio_put(folio);
			folio = NULL;
		} while (order-- > min_order);

		if (err == -EEXIST)
			goto repeat;
		if (err) {
			/*
			 * When NOWAIT I/O fails to allocate folios this could
			 * be due to a nonblocking memory allocation and not
			 * because the system actually is out of memory.
			 * Return -EAGAIN so that there caller retries in a
			 * blocking fashion instead of propagating -ENOMEM
			 * to the application.
			 */
			if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM)
				err = -EAGAIN;
			return ERR_PTR(err);
		}
		/*
		 * filemap_add_folio locks the page, and for mmap
		 * we expect an unlocked page.
		 */
		if (folio && (fgp_flags & FGP_FOR_MMAP))
			folio_unlock(folio);
	}

	if (!folio)
		return ERR_PTR(-ENOENT);
	/* not an uncached lookup, clear uncached if set */
	if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
		folio_clear_dropbehind(folio);
	return folio;
}
EXPORT_SYMBOL(__filemap_get_folio_mpol);

static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
		xa_mark_t mark)
{
	struct folio *folio;

retry:
	if (mark == XA_PRESENT)
		folio = xas_find(xas, max);
	else
		folio = xas_find_marked(xas, max, mark);

	if (xas_retry(xas, folio))
		goto retry;
	/*
	 * A shadow entry of a recently evicted page, a swap
	 * entry from shmem/tmpfs or a DAX entry.  Return it
	 * without attempting to raise page count.
	 */
	if (!folio || xa_is_value(folio))
		return folio;

	if (!folio_try_get(folio))
		goto reset;

	if (unlikely(folio != xas_reload(xas))) {
		folio_put(folio);
		goto reset;
	}

	return folio;
reset:
	xas_reset(xas);
	goto retry;
}

/**
 * find_get_entries - gang pagecache lookup
 * @mapping:	The address_space to search
 * @start:	The starting page cache index
 * @end:	The final page index (inclusive).
 * @fbatch:	Where the resulting entries are placed.
 * @indices:	The cache indices corresponding to the entries in @entries
 *
 * find_get_entries() will search for and return a batch of entries in
 * the mapping.  The entries are placed in @fbatch.  find_get_entries()
 * takes a reference on any actual folios it returns.
 *
 * The entries have ascending indexes.  The indices may not be consecutive
 * due to not-present entries or large folios.
 *
 * Any shadow entries of evicted folios, or swap entries from
 * shmem/tmpfs, are included in the returned array.
 *
 * Return: The number of entries which were found.
 */
unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
		pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	struct folio *folio;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
		indices[fbatch->nr] = xas.xa_index;
		if (!folio_batch_add(fbatch, folio))
			break;
	}

	if (folio_batch_count(fbatch)) {
		unsigned long nr;
		int idx = folio_batch_count(fbatch) - 1;

		folio = fbatch->folios[idx];
		if (!xa_is_value(folio))
			nr = folio_nr_pages(folio);
		else
			nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
		*start = round_down(indices[idx] + nr, nr);
	}
	rcu_read_unlock();

	return folio_batch_count(fbatch);
}

/**
 * find_lock_entries - Find a batch of pagecache entries.
 * @mapping:	The address_space to search.
 * @start:	The starting page cache index.
 * @end:	The final page index (inclusive).
 * @fbatch:	Where the resulting entries are placed.
 * @indices:	The cache indices of the entries in @fbatch.
 *
 * find_lock_entries() will return a batch of entries from @mapping.
 * Swap, shadow and DAX entries are included.  Folios are returned
 * locked and with an incremented refcount.  Folios which are locked
 * by somebody else or under writeback are skipped.  Folios which are
 * partially outside the range are not returned.
 *
 * The entries have ascending indexes.  The indices may not be consecutive
 * due to not-present entries, large folios, folios which could not be
 * locked or folios under writeback.
 *
 * Return: The number of entries which were found.
 */
unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
		pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	struct folio *folio;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
		unsigned long base;
		unsigned long nr;

		if (!xa_is_value(folio)) {
			nr = folio_nr_pages(folio);
			base = folio->index;
			/* Omit large folio which begins before the start */
			if (base < *start)
				goto put;
			/* Omit large folio which extends beyond the end */
			if (base + nr - 1 > end)
				goto put;
			if (!folio_trylock(folio))
				goto put;
			if (folio->mapping != mapping ||
			    folio_test_writeback(folio))
				goto unlock;
			VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
					folio);
		} else {
			nr = 1 << xas_get_order(&xas);
			base = xas.xa_index & ~(nr - 1);
			/* Omit order>0 value which begins before the start */
			if (base < *start)
				continue;
			/* Omit order>0 value which extends beyond the end */
			if (base + nr - 1 > end)
				break;
		}

		/* Update start now so that last update is correct on return */
		*start = base + nr;
		indices[fbatch->nr] = xas.xa_index;
		if (!folio_batch_add(fbatch, folio))
			break;
		continue;
unlock:
		folio_unlock(folio);
put:
		folio_put(folio);
	}
	rcu_read_unlock();

	return folio_batch_count(fbatch);
}

/**
 * filemap_get_folios - Get a batch of folios
 * @mapping:	The address_space to search
 * @start:	The starting page index
 * @end:	The final page index (inclusive)
 * @fbatch:	The batch to fill.
 *
 * Search for and return a batch of folios in the mapping starting at
 * index @start and up to index @end (inclusive).  The folios are returned
 * in @fbatch with an elevated reference count.
 *
 * Return: The number of folios which were found.
 * We also update @start to index the next folio for the traversal.
 */
unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
		pgoff_t end, struct folio_batch *fbatch)
{
	return filemap_get_folios_tag(mapping, start, end, XA_PRESENT, fbatch);
}
EXPORT_SYMBOL(filemap_get_folios);

/**
 * filemap_get_folios_contig - Get a batch of contiguous folios
 * @mapping:	The address_space to search
 * @start:	The starting page index
 * @end:	The final page index (inclusive)
 * @fbatch:	The batch to fill
 *
 * filemap_get_folios_contig() works exactly like filemap_get_folios(),
 * except the returned folios are guaranteed to be contiguous. This may
 * not return all contiguous folios if the batch gets filled up.
 *
 * Return: The number of folios found.
 * Also update @start to be positioned for traversal of the next folio.
 */

unsigned filemap_get_folios_contig(struct address_space *mapping,
		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	unsigned long nr;
	struct folio *folio;

	rcu_read_lock();

	for (folio = xas_load(&xas); folio && xas.xa_index <= end;
			folio = xas_next(&xas)) {
		if (xas_retry(&xas, folio))
			continue;
		/*
		 * If the entry has been swapped out, we can stop looking.
		 * No current caller is looking for DAX entries.
		 */
		if (xa_is_value(folio))
			goto update_start;

		/* If we landed in the middle of a THP, continue at its end. */
		if (xa_is_sibling(folio))
			goto update_start;

		if (!folio_try_get(folio))
			goto retry;

		if (unlikely(folio != xas_reload(&xas)))
			goto put_folio;

		if (!folio_batch_add(fbatch, folio)) {
			nr = folio_nr_pages(folio);
			*start = folio->index + nr;
			goto out;
		}
		xas_advance(&xas, folio_next_index(folio) - 1);
		continue;
put_folio:
		folio_put(folio);

retry:
		xas_reset(&xas);
	}

update_start:
	nr = folio_batch_count(fbatch);

	if (nr) {
		folio = fbatch->folios[nr - 1];
		*start = folio_next_index(folio);
	}
out:
	rcu_read_unlock();
	return folio_batch_count(fbatch);
}
EXPORT_SYMBOL(filemap_get_folios_contig);

/**
 * filemap_get_folios_tag - Get a batch of folios matching @tag
 * @mapping:    The address_space to search
 * @start:      The starting page index
 * @end:        The final page index (inclusive)
 * @tag:        The tag index
 * @fbatch:     The batch to fill
 *
 * The first folio may start before @start; if it does, it will contain
 * @start.  The final folio may extend beyond @end; if it does, it will
 * contain @end.  The folios have ascending indices.  There may be gaps
 * between the folios if there are indices which have no folio in the
 * page cache.  If folios are added to or removed from the page cache
 * while this is running, they may or may not be found by this call.
 * Only returns folios that are tagged with @tag.
 *
 * Return: The number of folios found.
 * Also update @start to index the next folio for traversal.
 */
unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
			pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	struct folio *folio;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, end, tag)) != NULL) {
		/*
		 * Shadow entries should never be tagged, but this iteration
		 * is lockless so there is a window for page reclaim to evict
		 * a page we saw tagged. Skip over it.
		 */
		if (xa_is_value(folio))
			continue;
		if (!folio_batch_add(fbatch, folio)) {
			unsigned long nr = folio_nr_pages(folio);
			*start = folio->index + nr;
			goto out;
		}
	}
	/*
	 * We come here when there is no page beyond @end. We take care to not
	 * overflow the index @start as it confuses some of the callers. This
	 * breaks the iteration when there is a page at index -1 but that is
	 * already broke anyway.
	 */
	if (end == (pgoff_t)-1)
		*start = (pgoff_t)-1;
	else
		*start = end + 1;
out:
	rcu_read_unlock();

	return folio_batch_count(fbatch);
}
EXPORT_SYMBOL(filemap_get_folios_tag);

/**
 * filemap_get_folios_dirty - Get a batch of dirty folios
 * @mapping:	The address_space to search
 * @start:	The starting folio index
 * @end:	The final folio index (inclusive)
 * @fbatch:	The batch to fill
 *
 * filemap_get_folios_dirty() works exactly like filemap_get_folios(), except
 * the returned folios are presumed to be dirty or undergoing writeback. Dirty
 * state is presumed because we don't block on folio lock nor want to miss
 * folios. Callers that need to can recheck state upon locking the folio.
 *
 * This may not return all dirty folios if the batch gets filled up.
 *
 * Return: The number of folios found.
 * Also update @start to be positioned for traversal of the next folio.
 */
unsigned filemap_get_folios_dirty(struct address_space *mapping, pgoff_t *start,
			pgoff_t end, struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	struct folio *folio;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
		if (xa_is_value(folio))
			continue;
		if (folio_trylock(folio)) {
			bool clean = !folio_test_dirty(folio) &&
				     !folio_test_writeback(folio);
			folio_unlock(folio);
			if (clean) {
				folio_put(folio);
				continue;
			}
		}
		if (!folio_batch_add(fbatch, folio)) {
			unsigned long nr = folio_nr_pages(folio);
			*start = folio->index + nr;
			goto out;
		}
	}
	/*
	 * We come here when there is no folio beyond @end. We take care to not
	 * overflow the index @start as it confuses some of the callers. This
	 * breaks the iteration when there is a folio at index -1 but that is
	 * already broke anyway.
	 */
	if (end == (pgoff_t)-1)
		*start = (pgoff_t)-1;
	else
		*start = end + 1;
out:
	rcu_read_unlock();

	return folio_batch_count(fbatch);
}

/*
 * CD/DVDs are error prone. When a medium error occurs, the driver may fail
 * a _large_ part of the i/o request. Imagine the worst scenario:
 *
 *      ---R__________________________________________B__________
 *         ^ reading here                             ^ bad block(assume 4k)
 *
 * read(R) => miss => readahead(R...B) => media error => frustrating retries
 * => failing the whole request => read(R) => read(R+1) =>
 * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
 * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
 * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
 *
 * It is going insane. Fix it by quickly scaling down the readahead size.
 */
static void shrink_readahead_size_eio(struct file_ra_state *ra)
{
	ra->ra_pages /= 4;
}

/*
 * filemap_get_read_batch - Get a batch of folios for read
 *
 * Get a batch of folios which represent a contiguous range of bytes in
 * the file.  No exceptional entries will be returned.  If @index is in
 * the middle of a folio, the entire folio will be returned.  The last
 * folio in the batch may have the readahead flag set or the uptodate flag
 * clear so that the caller can take the appropriate action.
 */
static void filemap_get_read_batch(struct address_space *mapping,
		pgoff_t index, pgoff_t max, struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
	for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
		if (xas_retry(&xas, folio))
			continue;
		if (xas.xa_index > max || xa_is_value(folio))
			break;
		if (xa_is_sibling(folio))
			break;
		if (!folio_try_get(folio))
			goto retry;

		if (unlikely(folio != xas_reload(&xas)))
			goto put_folio;

		if (!folio_batch_add(fbatch, folio))
			break;
		if (!folio_test_uptodate(folio))
			break;
		if (folio_test_readahead(folio))
			break;
		xas_advance(&xas, folio_next_index(folio) - 1);
		continue;
put_folio:
		folio_put(folio);
retry:
		xas_reset(&xas);
	}
	rcu_read_unlock();
}

static int filemap_read_folio(struct file *file, filler_t filler,
		struct folio *folio)
{
	bool workingset = folio_test_workingset(folio);
	unsigned long pflags;
	int error;

	/* Start the actual read. The read will unlock the page. */
	if (unlikely(workingset))
		psi_memstall_enter(&pflags);
	error = filler(file, folio);
	if (unlikely(workingset))
		psi_memstall_leave(&pflags);
	if (error)
		return error;

	error = folio_wait_locked_killable(folio);
	if (error)
		return error;
	if (folio_test_uptodate(folio))
		return 0;
	if (file)
		shrink_readahead_size_eio(&file->f_ra);
	return -EIO;
}

static bool filemap_range_uptodate(struct address_space *mapping,
		loff_t pos, size_t count, struct folio *folio,
		bool need_uptodate)
{
	if (folio_test_uptodate(folio))
		return true;
	/* pipes can't handle partially uptodate pages */
	if (need_uptodate)
		return false;
	if (!mapping->a_ops->is_partially_uptodate)
		return false;
	if (mapping->host->i_blkbits >= folio_shift(folio))
		return false;

	if (folio_pos(folio) > pos) {
		count -= folio_pos(folio) - pos;
		pos = 0;
	} else {
		pos -= folio_pos(folio);
	}

	if (pos == 0 && count >= folio_size(folio))
		return false;

	return mapping->a_ops->is_partially_uptodate(folio, pos, count);
}

static int filemap_update_page(struct kiocb *iocb,
		struct address_space *mapping, size_t count,
		struct folio *folio, bool need_uptodate)
{
	int error;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!filemap_invalidate_trylock_shared(mapping))
			return -EAGAIN;
	} else {
		filemap_invalidate_lock_shared(mapping);
	}

	if (!folio_trylock(folio)) {
		error = -EAGAIN;
		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
			goto unlock_mapping;
		if (!(iocb->ki_flags & IOCB_WAITQ)) {
			filemap_invalidate_unlock_shared(mapping);
			/*
			 * This is where we usually end up waiting for a
			 * previously submitted readahead to finish.
			 */
			folio_put_wait_locked(folio, TASK_KILLABLE);
			return AOP_TRUNCATED_PAGE;
		}
		error = __folio_lock_async(folio, iocb->ki_waitq);
		if (error)
			goto unlock_mapping;
	}

	error = AOP_TRUNCATED_PAGE;
	if (!folio->mapping)
		goto unlock;

	error = 0;
	if (filemap_range_uptodate(mapping, iocb->ki_pos, count, folio,
				   need_uptodate))
		goto unlock;

	error = -EAGAIN;
	if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
		goto unlock;

	error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
			folio);
	goto unlock_mapping;
unlock:
	folio_unlock(folio);
unlock_mapping:
	filemap_invalidate_unlock_shared(mapping);
	if (error == AOP_TRUNCATED_PAGE)
		folio_put(folio);
	return error;
}

static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	struct folio *folio;
	int error;
	unsigned int min_order = mapping_min_folio_order(mapping);
	pgoff_t index;

	if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
		return -EAGAIN;

	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order, NULL);
	if (!folio)
		return -ENOMEM;
	if (iocb->ki_flags & IOCB_DONTCACHE)
		__folio_set_dropbehind(folio);

	/*
	 * Protect against truncate / hole punch. Grabbing invalidate_lock
	 * here assures we cannot instantiate and bring uptodate new
	 * pagecache folios after evicting page cache during truncate
	 * and before actually freeing blocks.	Note that we could
	 * release invalidate_lock after inserting the folio into
	 * the page cache as the locked folio would then be enough to
	 * synchronize with hole punching. But there are code paths
	 * such as filemap_update_page() filling in partially uptodate
	 * pages or ->readahead() that need to hold invalidate_lock
	 * while mapping blocks for IO so let's hold the lock here as
	 * well to keep locking rules simple.
	 */
	filemap_invalidate_lock_shared(mapping);
	index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order;
	error = filemap_add_folio(mapping, folio, index,
			mapping_gfp_constraint(mapping, GFP_KERNEL));
	if (error == -EEXIST)
		error = AOP_TRUNCATED_PAGE;
	if (error)
		goto error;

	error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
					folio);
	if (error)
		goto error;

	filemap_invalidate_unlock_shared(mapping);
	folio_batch_add(fbatch, folio);
	return 0;
error:
	filemap_invalidate_unlock_shared(mapping);
	folio_put(folio);
	return error;
}

static int filemap_readahead(struct kiocb *iocb, struct file *file,
		struct address_space *mapping, struct folio *folio,
		pgoff_t last_index)
{
	DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, folio->index);

	if (iocb->ki_flags & IOCB_NOIO)
		return -EAGAIN;
	if (iocb->ki_flags & IOCB_DONTCACHE)
		ractl.dropbehind = 1;
	page_cache_async_ra(&ractl, folio, last_index - folio->index);
	return 0;
}

static int filemap_get_pages(struct kiocb *iocb, size_t count,
		struct folio_batch *fbatch, bool need_uptodate)
{
	struct file *filp = iocb->ki_filp;
	struct address_space *mapping = filp->f_mapping;
	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
	pgoff_t last_index;
	struct folio *folio;
	unsigned int flags;
	int err = 0;

	/* "last_index" is the index of the folio beyond the end of the read */
	last_index = round_up(iocb->ki_pos + count,
			mapping_min_folio_nrbytes(mapping)) >> PAGE_SHIFT;
retry:
	if (fatal_signal_pending(current))
		return -EINTR;

	filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	if (!folio_batch_count(fbatch)) {
		DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);

		if (iocb->ki_flags & IOCB_NOIO)
			return -EAGAIN;
		if (iocb->ki_flags & IOCB_NOWAIT)
			flags = memalloc_noio_save();
		if (iocb->ki_flags & IOCB_DONTCACHE)
			ractl.dropbehind = 1;
		page_cache_sync_ra(&ractl, last_index - index);
		if (iocb->ki_flags & IOCB_NOWAIT)
			memalloc_noio_restore(flags);
		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	}
	if (!folio_batch_count(fbatch)) {
		err = filemap_create_folio(iocb, fbatch);
		if (err == AOP_TRUNCATED_PAGE)
			goto retry;
		return err;
	}

	folio = fbatch->folios[folio_batch_count(fbatch) - 1];
	if (folio_test_readahead(folio)) {
		err = filemap_readahead(iocb, filp, mapping, folio, last_index);
		if (err)
			goto err;
	}
	if (!folio_test_uptodate(folio)) {
		if (folio_batch_count(fbatch) > 1) {
			err = -EAGAIN;
			goto err;
		}
		err = filemap_update_page(iocb, mapping, count, folio,
					  need_uptodate);
		if (err)
			goto err;
	}

	trace_mm_filemap_get_pages(mapping, index, last_index - 1);
	return 0;
err:
	if (err < 0)
		folio_put(folio);
	if (likely(--fbatch->nr))
		return 0;
	if (err == AOP_TRUNCATED_PAGE)
		goto retry;
	return err;
}

static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
{
	unsigned int shift = folio_shift(folio);

	return (pos1 >> shift == pos2 >> shift);
}

static void filemap_end_dropbehind_read(struct folio *folio)
{
	if (!folio_test_dropbehind(folio))
		return;
	if (folio_test_writeback(folio) || folio_test_dirty(folio))
		return;
	if (folio_trylock(folio)) {
		filemap_end_dropbehind(folio);
		folio_unlock(folio);
	}
}

/**
 * filemap_read - Read data from the page cache.
 * @iocb: The iocb to read.
 * @iter: Destination for the data.
 * @already_read: Number of bytes already read by the caller.
 *
 * Copies data from the page cache.  If the data is not currently present,
 * uses the readahead and read_folio address_space operations to fetch it.
 *
 * Return: Total number of bytes copied, including those already read by
 * the caller.  If an error happens before any bytes are copied, returns
 * a negative error number.
 */
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
		ssize_t already_read)
{
	struct file *filp = iocb->ki_filp;
	struct file_ra_state *ra = &filp->f_ra;
	struct address_space *mapping = filp->f_mapping;
	struct inode *inode = mapping->host;
	struct folio_batch fbatch;
	int i, error = 0;
	bool writably_mapped;
	loff_t isize, end_offset;
	loff_t last_pos = ra->prev_pos;

	if (unlikely(iocb->ki_pos < 0))
		return -EINVAL;
	if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
		return 0;
	if (unlikely(!iov_iter_count(iter)))
		return 0;

	iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
	folio_batch_init(&fbatch);

	do {
		cond_resched();

		/*
		 * If we've already successfully copied some data, then we
		 * can no longer safely return -EIOCBQUEUED. Hence mark
		 * an async read NOWAIT at that point.
		 */
		if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
			iocb->ki_flags |= IOCB_NOWAIT;

		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
			break;

		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
		if (error < 0)
			break;

		/*
		 * i_size must be checked after we know the pages are Uptodate.
		 *
		 * Checking i_size after the check allows us to calculate
		 * the correct value for "nr", which means the zero-filled
		 * part of the page is not copied back to userspace (unless
		 * another truncate extends the file - this is desired though).
		 */
		isize = i_size_read(inode);
		if (unlikely(iocb->ki_pos >= isize))
			goto put_folios;
		end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);

		/*
		 * Once we start copying data, we don't want to be touching any
		 * cachelines that might be contended:
		 */
		writably_mapped = mapping_writably_mapped(mapping);

		/*
		 * When a read accesses the same folio several times, only
		 * mark it as accessed the first time.
		 */
		if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
				    fbatch.folios[0]))
			folio_mark_accessed(fbatch.folios[0]);

		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];
			size_t fsize = folio_size(folio);
			size_t offset = iocb->ki_pos & (fsize - 1);
			size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
					     fsize - offset);
			size_t copied;

			if (end_offset < folio_pos(folio))
				break;
			if (i > 0)
				folio_mark_accessed(folio);
			/*
			 * If users can be writing to this folio using arbitrary
			 * virtual addresses, take care of potential aliasing
			 * before reading the folio on the kernel side.
			 */
			if (writably_mapped)
				flush_dcache_folio(folio);

			copied = copy_folio_to_iter(folio, offset, bytes, iter);

			already_read += copied;
			iocb->ki_pos += copied;
			last_pos = iocb->ki_pos;

			if (copied < bytes) {
				error = -EFAULT;
				break;
			}
		}
put_folios:
		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];

			filemap_end_dropbehind_read(folio);
			folio_put(folio);
		}
		folio_batch_init(&fbatch);
	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);

	file_accessed(filp);
	ra->prev_pos = last_pos;
	return already_read ? already_read : error;
}
EXPORT_SYMBOL_GPL(filemap_read);

int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	loff_t end = pos + count - 1;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (filemap_range_needs_writeback(mapping, pos, end))
			return -EAGAIN;
		return 0;
	}

	return filemap_write_and_wait_range(mapping, pos, end);
}
EXPORT_SYMBOL_GPL(kiocb_write_and_wait);

int filemap_invalidate_pages(struct address_space *mapping,
			     loff_t pos, loff_t end, bool nowait)
{
	int ret;

	if (nowait) {
		/* we could block if there are any pages in the range */
		if (filemap_range_has_page(mapping, pos, end))
			return -EAGAIN;
	} else {
		ret = filemap_write_and_wait_range(mapping, pos, end);
		if (ret)
			return ret;
	}

	/*
	 * After a write we want buffered reads to be sure to go to disk to get
	 * the new data.  We invalidate clean cached page from the region we're
	 * about to write.  We do this *before* the write so that we can return
	 * without clobbering -EIOCBQUEUED from ->direct_IO().
	 */
	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
					     end >> PAGE_SHIFT);
}

int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;

	return filemap_invalidate_pages(mapping, iocb->ki_pos,
					iocb->ki_pos + count - 1,
					iocb->ki_flags & IOCB_NOWAIT);
}
EXPORT_SYMBOL_GPL(kiocb_invalidate_pages);

/**
 * generic_file_read_iter - generic filesystem read routine
 * @iocb:	kernel I/O control block
 * @iter:	destination for the data read
 *
 * This is the "read_iter()" routine for all filesystems
 * that can use the page cache directly.
 *
 * The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall
 * be returned when no data can be read without waiting for I/O requests
 * to complete; it doesn't prevent readahead.
 *
 * The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O
 * requests shall be made for the read or for readahead.  When no data
 * can be read, -EAGAIN shall be returned.  When readahead would be
 * triggered, a partial, possibly empty read shall be returned.
 *
 * Return:
 * * number of bytes copied, even for partial reads
 * * negative error code (or 0 if IOCB_NOIO) if nothing was read
 */
ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	size_t count = iov_iter_count(iter);
	ssize_t retval = 0;

	if (!count)
		return 0; /* skip atime */

	if (iocb->ki_flags & IOCB_DIRECT) {
		struct file *file = iocb->ki_filp;
		struct address_space *mapping = file->f_mapping;
		struct inode *inode = mapping->host;

		retval = kiocb_write_and_wait(iocb, count);
		if (retval < 0)
			return retval;
		file_accessed(file);

		retval = mapping->a_ops->direct_IO(iocb, iter);
		if (retval >= 0) {
			iocb->ki_pos += retval;
			count -= retval;
		}
		if (retval != -EIOCBQUEUED)
			iov_iter_revert(iter, count - iov_iter_count(iter));

		/*
		 * Btrfs can have a short DIO read if we encounter
		 * compressed extents, so if there was an error, or if
		 * we've already read everything we wanted to, or if
		 * there was a short read because we hit EOF, go ahead
		 * and return.  Otherwise fallthrough to buffered io for
		 * the rest of the read.  Buffered reads will not work for
		 * DAX files, so don't bother trying.
		 */
		if (retval < 0 || !count || IS_DAX(inode))
			return retval;
		if (iocb->ki_pos >= i_size_read(inode))
			return retval;
	}

	return filemap_read(iocb, iter, retval);
}
EXPORT_SYMBOL(generic_file_read_iter);

/*
 * Splice subpages from a folio into a pipe.
 */
size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
			      struct folio *folio, loff_t fpos, size_t size)
{
	struct page *page;
	size_t spliced = 0, offset = offset_in_folio(folio, fpos);

	page = folio_page(folio, offset / PAGE_SIZE);
	size = min(size, folio_size(folio) - offset);
	offset %= PAGE_SIZE;

	while (spliced < size && !pipe_is_full(pipe)) {
		struct pipe_buffer *buf = pipe_head_buf(pipe);
		size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);

		*buf = (struct pipe_buffer) {
			.ops	= &page_cache_pipe_buf_ops,
			.page	= page,
			.offset	= offset,
			.len	= part,
		};
		folio_get(folio);
		pipe->head++;
		page++;
		spliced += part;
		offset = 0;
	}

	return spliced;
}

/**
 * filemap_splice_read -  Splice data from a file's pagecache into a pipe
 * @in: The file to read from
 * @ppos: Pointer to the file position to read from
 * @pipe: The pipe to splice into
 * @len: The amount to splice
 * @flags: The SPLICE_F_* flags
 *
 * This function gets folios from a file's pagecache and splices them into the
 * pipe.  Readahead will be called as necessary to fill more folios.  This may
 * be used for blockdevs also.
 *
 * Return: On success, the number of bytes read will be returned and *@ppos
 * will be updated if appropriate; 0 will be returned if there is no more data
 * to be read; -EAGAIN will be returned if the pipe had no space, and some
 * other negative error code will be returned on error.  A short read may occur
 * if the pipe has insufficient space, we reach the end of the data or we hit a
 * hole.
 */
ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
			    struct pipe_inode_info *pipe,
			    size_t len, unsigned int flags)
{
	struct folio_batch fbatch;
	struct kiocb iocb;
	size_t total_spliced = 0, used, npages;
	loff_t isize, end_offset;
	bool writably_mapped;
	int i, error = 0;

	if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
		return 0;

	init_sync_kiocb(&iocb, in);
	iocb.ki_pos = *ppos;

	/* Work out how much data we can actually add into the pipe */
	used = pipe_buf_usage(pipe);
	npages = max_t(ssize_t, pipe->max_usage - used, 0);
	len = min_t(size_t, len, npages * PAGE_SIZE);

	folio_batch_init(&fbatch);

	do {
		cond_resched();

		if (*ppos >= i_size_read(in->f_mapping->host))
			break;

		iocb.ki_pos = *ppos;
		error = filemap_get_pages(&iocb, len, &fbatch, true);
		if (error < 0)
			break;

		/*
		 * i_size must be checked after we know the pages are Uptodate.
		 *
		 * Checking i_size after the check allows us to calculate
		 * the correct value for "nr", which means the zero-filled
		 * part of the page is not copied back to userspace (unless
		 * another truncate extends the file - this is desired though).
		 */
		isize = i_size_read(in->f_mapping->host);
		if (unlikely(*ppos >= isize))
			break;
		end_offset = min_t(loff_t, isize, *ppos + len);

		/*
		 * Once we start copying data, we don't want to be touching any
		 * cachelines that might be contended:
		 */
		writably_mapped = mapping_writably_mapped(in->f_mapping);

		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];
			size_t n;

			if (folio_pos(folio) >= end_offset)
				goto out;
			folio_mark_accessed(folio);

			/*
			 * If users can be writing to this folio using arbitrary
			 * virtual addresses, take care of potential aliasing
			 * before reading the folio on the kernel side.
			 */
			if (writably_mapped)
				flush_dcache_folio(folio);

			n = min_t(loff_t, len, isize - *ppos);
			n = splice_folio_into_pipe(pipe, folio, *ppos, n);
			if (!n)
				goto out;
			len -= n;
			total_spliced += n;
			*ppos += n;
			in->f_ra.prev_pos = *ppos;
			if (pipe_is_full(pipe))
				goto out;
		}

		folio_batch_release(&fbatch);
	} while (len);

out:
	folio_batch_release(&fbatch);
	file_accessed(in);

	return total_spliced ? total_spliced : error;
}
EXPORT_SYMBOL(filemap_splice_read);

static inline loff_t folio_seek_hole_data(struct xa_state *xas,
		struct address_space *mapping, struct folio *folio,
		loff_t start, loff_t end, bool seek_data)
{
	const struct address_space_operations *ops = mapping->a_ops;
	size_t offset, bsz = i_blocksize(mapping->host);

	if (xa_is_value(folio) || folio_test_uptodate(folio))
		return seek_data ? start : end;
	if (!ops->is_partially_uptodate)
		return seek_data ? end : start;

	xas_pause(xas);
	rcu_read_unlock();
	folio_lock(folio);
	if (unlikely(folio->mapping != mapping))
		goto unlock;

	offset = offset_in_folio(folio, start) & ~(bsz - 1);

	do {
		if (ops->is_partially_uptodate(folio, offset, bsz) ==
							seek_data)
			break;
		start = (start + bsz) & ~((u64)bsz - 1);
		offset += bsz;
	} while (offset < folio_size(folio));
unlock:
	folio_unlock(folio);
	rcu_read_lock();
	return start;
}

static inline size_t seek_folio_size(struct xa_state *xas, struct folio *folio)
{
	if (xa_is_value(folio))
		return PAGE_SIZE << xas_get_order(xas);
	return folio_size(folio);
}

/**
 * mapping_seek_hole_data - Seek for SEEK_DATA / SEEK_HOLE in the page cache.
 * @mapping: Address space to search.
 * @start: First byte to consider.
 * @end: Limit of search (exclusive).
 * @whence: Either SEEK_HOLE or SEEK_DATA.
 *
 * If the page cache knows which blocks contain holes and which blocks
 * contain data, your filesystem can use this function to implement
 * SEEK_HOLE and SEEK_DATA.  This is useful for filesystems which are
 * entirely memory-based such as tmpfs, and filesystems which support
 * unwritten extents.
 *
 * Return: The requested offset on success, or -ENXIO if @whence specifies
 * SEEK_DATA and there is no data after @start.  There is an implicit hole
 * after @end - 1, so SEEK_HOLE returns @end if all the bytes between @start
 * and @end contain data.
 */
loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
		loff_t end, int whence)
{
	XA_STATE(xas, &mapping->i_pages, start >> PAGE_SHIFT);
	pgoff_t max = (end - 1) >> PAGE_SHIFT;
	bool seek_data = (whence == SEEK_DATA);
	struct folio *folio;

	if (end <= start)
		return -ENXIO;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, max, XA_PRESENT))) {
		loff_t pos = (u64)xas.xa_index << PAGE_SHIFT;
		size_t seek_size;

		if (start < pos) {
			if (!seek_data)
				goto unlock;
			start = pos;
		}

		seek_size = seek_folio_size(&xas, folio);
		pos = round_up((u64)pos + 1, seek_size);
		start = folio_seek_hole_data(&xas, mapping, folio, start, pos,
				seek_data);
		if (start < pos)
			goto unlock;
		if (start >= end)
			break;
		if (seek_size > PAGE_SIZE)
			xas_set(&xas, pos >> PAGE_SHIFT);
		if (!xa_is_value(folio))
			folio_put(folio);
	}
	if (seek_data)
		start = -ENXIO;
unlock:
	rcu_read_unlock();
	if (folio && !xa_is_value(folio))
		folio_put(folio);
	if (start > end)
		return end;
	return start;
}

#ifdef CONFIG_MMU
#define MMAP_LOTSAMISS  (100)
/*
 * lock_folio_maybe_drop_mmap - lock the page, possibly dropping the mmap_lock
 * @vmf - the vm_fault for this fault.
 * @folio - the folio to lock.
 * @fpin - the pointer to the file we may pin (or is already pinned).
 *
 * This works similar to lock_folio_or_retry in that it can drop the
 * mmap_lock.  It differs in that it actually returns the folio locked
 * if it returns 1 and 0 if it couldn't lock the folio.  If we did have
 * to drop the mmap_lock then fpin will point to the pinned file and
 * needs to be fput()'ed at a later point.
 */
static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio,
				     struct file **fpin)
{
	if (folio_trylock(folio))
		return 1;

	/*
	 * NOTE! This will make us return with VM_FAULT_RETRY, but with
	 * the fault lock still held. That's how FAULT_FLAG_RETRY_NOWAIT
	 * is supposed to work. We have way too many special cases..
	 */
	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
		return 0;

	*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
	if (vmf->flags & FAULT_FLAG_KILLABLE) {
		if (__folio_lock_killable(folio)) {
			/*
			 * We didn't have the right flags to drop the
			 * fault lock, but all fault_handlers only check
			 * for fatal signals if we return VM_FAULT_RETRY,
			 * so we need to drop the fault lock here and
			 * return 0 if we don't have a fpin.
			 */
			if (*fpin == NULL)
				release_fault_lock(vmf);
			return 0;
		}
	} else
		__folio_lock(folio);

	return 1;
}

/*
 * Synchronous readahead happens when we don't even find a page in the page
 * cache at all.  We don't want to perform IO under the mmap sem, so if we have
 * to drop the mmap sem we return the file that was pinned in order for us to do
 * that.  If we didn't pin a file then we return NULL.  The file that is
 * returned needs to be fput()'ed when we're done with it.
 */
static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
{
	struct file *file = vmf->vma->vm_file;
	struct file_ra_state *ra = &file->f_ra;
	struct address_space *mapping = file->f_mapping;
	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
	struct file *fpin = NULL;
	vm_flags_t vm_flags = vmf->vma->vm_flags;
	bool force_thp_readahead = false;
	unsigned short mmap_miss;

	/* Use the readahead code, even if readahead is disabled */
	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
		force_thp_readahead = true;

	if (!force_thp_readahead) {
		/*
		 * If we don't want any read-ahead, don't bother.
		 * VM_EXEC case below is already intended for random access.
		 */
		if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
			return fpin;

		if (!ra->ra_pages)
			return fpin;

		if (vm_flags & VM_SEQ_READ) {
			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
			page_cache_sync_ra(&ractl, ra->ra_pages);
			return fpin;
		}
	}

	if (!(vm_flags & VM_SEQ_READ)) {
		/* Avoid banging the cache line if not needed */
		mmap_miss = READ_ONCE(ra->mmap_miss);
		if (mmap_miss < MMAP_LOTSAMISS * 10)
			WRITE_ONCE(ra->mmap_miss, ++mmap_miss);

		/*
		 * Do we miss much more than hit in this file? If so,
		 * stop bothering with read-ahead. It will only hurt.
		 */
		if (mmap_miss > MMAP_LOTSAMISS)
			return fpin;
	}

	if (force_thp_readahead) {
		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
		ra->size = HPAGE_PMD_NR;
		/*
		 * Fetch two PMD folios, so we get the chance to actually
		 * readahead, unless we've been told not to.
		 */
		if (!(vm_flags & VM_RAND_READ))
			ra->size *= 2;
		ra->async_size = HPAGE_PMD_NR;
		ra->order = HPAGE_PMD_ORDER;
		page_cache_ra_order(&ractl, ra);
		return fpin;
	}

	if (vm_flags & VM_EXEC) {
		/*
		 * Allow arch to request a preferred minimum folio order for
		 * executable memory. This can often be beneficial to
		 * performance if (e.g.) arm64 can contpte-map the folio.
		 * Executable memory rarely benefits from readahead, due to its
		 * random access nature, so set async_size to 0.
		 *
		 * Limit to the boundaries of the VMA to avoid reading in any
		 * pad that might exist between sections, which would be a waste
		 * of memory.
		 */
		struct vm_area_struct *vma = vmf->vma;
		unsigned long start = vma->vm_pgoff;
		unsigned long end = start + vma_pages(vma);
		unsigned long ra_end;

		ra->order = exec_folio_order();
		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
		ra->start = max(ra->start, start);
		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
		ra_end = min(ra_end, end);
		ra->size = ra_end - ra->start;
		ra->async_size = 0;
	} else {
		/*
		 * mmap read-around
		 */
		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
		ra->size = ra->ra_pages;
		ra->async_size = ra->ra_pages / 4;
		ra->order = 0;
	}

	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
	ractl._index = ra->start;
	page_cache_ra_order(&ractl, ra);
	return fpin;
}

/*
 * Asynchronous readahead happens when we find the page and PG_readahead,
 * so we want to possibly extend the readahead further.  We return the file that
 * was pinned if we have to drop the mmap_lock in order to do IO.
 */
static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
					    struct folio *folio)
{
	struct file *file = vmf->vma->vm_file;
	struct file_ra_state *ra = &file->f_ra;
	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
	struct file *fpin = NULL;
	unsigned short mmap_miss;

	/* If we don't want any read-ahead, don't bother */
	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
		return fpin;

	/*
	 * If the folio is locked, we're likely racing against another fault.
	 * Don't touch the mmap_miss counter to avoid decreasing it multiple
	 * times for a single folio and break the balance with mmap_miss
	 * increase in do_sync_mmap_readahead().
	 */
	if (likely(!folio_test_locked(folio))) {
		mmap_miss = READ_ONCE(ra->mmap_miss);
		if (mmap_miss)
			WRITE_ONCE(ra->mmap_miss, --mmap_miss);
	}

	if (folio_test_readahead(folio)) {
		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
		page_cache_async_ra(&ractl, folio, ra->ra_pages);
	}
	return fpin;
}

static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	vm_fault_t ret = 0;
	pte_t *ptep;

	/*
	 * We might have COW'ed a pagecache folio and might now have an mlocked
	 * anon folio mapped. The original pagecache folio is not mlocked and
	 * might have been evicted. During a read+clear/modify/write update of
	 * the PTE, such as done in do_numa_page()/change_pte_range(), we
	 * temporarily clear the PTE under PT lock and might detect it here as
	 * "none" when not holding the PT lock.
	 *
	 * Not rechecking the PTE under PT lock could result in an unexpected
	 * major fault in an mlock'ed region. Recheck only for this special
	 * scenario while holding the PT lock, to not degrade non-mlocked
	 * scenarios. Recheck the PTE without PT lock firstly, thereby reducing
	 * the number of times we hold PT lock.
	 */
	if (!(vma->vm_flags & VM_LOCKED))
		return 0;

	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
		return 0;

	ptep = pte_offset_map_ro_nolock(vma->vm_mm, vmf->pmd, vmf->address,
					&vmf->ptl);
	if (unlikely(!ptep))
		return VM_FAULT_NOPAGE;

	if (unlikely(!pte_none(ptep_get_lockless(ptep)))) {
		ret = VM_FAULT_NOPAGE;
	} else {
		spin_lock(vmf->ptl);
		if (unlikely(!pte_none(ptep_get(ptep))))
			ret = VM_FAULT_NOPAGE;
		spin_unlock(vmf->ptl);
	}
	pte_unmap(ptep);
	return ret;
}

/**
 * filemap_fault - read in file data for page fault handling
 * @vmf:	struct vm_fault containing details of the fault
 *
 * filemap_fault() is invoked via the vma operations vector for a
 * mapped memory region to read in file data during a page fault.
 *
 * The goto's are kind of ugly, but this streamlines the normal case of having
 * it in the page cache, and handles the special cases reasonably without
 * having a lot of duplicated code.
 *
 * vma->vm_mm->mmap_lock must be held on entry.
 *
 * If our return value has VM_FAULT_RETRY set, it's because the mmap_lock
 * may be dropped before doing I/O or by lock_folio_maybe_drop_mmap().
 *
 * If our return value does not have VM_FAULT_RETRY set, the mmap_lock
 * has not been released.
 *
 * We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.
 *
 * Return: bitwise-OR of %VM_FAULT_ codes.
 */
vm_fault_t filemap_fault(struct vm_fault *vmf)
{
	int error;
	struct file *file = vmf->vma->vm_file;
	struct file *fpin = NULL;
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
	pgoff_t max_idx, index = vmf->pgoff;
	struct folio *folio;
	vm_fault_t ret = 0;
	bool mapping_locked = false;

	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	if (unlikely(index >= max_idx))
		return VM_FAULT_SIGBUS;

	trace_mm_filemap_fault(mapping, index);

	/*
	 * Do we have something in the page cache already?
	 */
	folio = filemap_get_folio(mapping, index);
	if (likely(!IS_ERR(folio))) {
		/*
		 * We found the page, so try async readahead before waiting for
		 * the lock.
		 */
		if (!(vmf->flags & FAULT_FLAG_TRIED))
			fpin = do_async_mmap_readahead(vmf, folio);
		if (unlikely(!folio_test_uptodate(folio))) {
			filemap_invalidate_lock_shared(mapping);
			mapping_locked = true;
		}
	} else {
		ret = filemap_fault_recheck_pte_none(vmf);
		if (unlikely(ret))
			return ret;

		/* No page in the page cache at all */
		count_vm_event(PGMAJFAULT);
		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
		ret = VM_FAULT_MAJOR;
		fpin = do_sync_mmap_readahead(vmf);
retry_find:
		/*
		 * See comment in filemap_create_folio() why we need
		 * invalidate_lock
		 */
		if (!mapping_locked) {
			filemap_invalidate_lock_shared(mapping);
			mapping_locked = true;
		}
		folio = __filemap_get_folio(mapping, index,
					  FGP_CREAT|FGP_FOR_MMAP,
					  vmf->gfp_mask);
		if (IS_ERR(folio)) {
			if (fpin)
				goto out_retry;
			filemap_invalidate_unlock_shared(mapping);
			return VM_FAULT_OOM;
		}
	}

	if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
		goto out_retry;

	/* Did it get truncated? */
	if (unlikely(folio->mapping != mapping)) {
		folio_unlock(folio);
		folio_put(folio);
		goto retry_find;
	}
	VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);

	/*
	 * We have a locked folio in the page cache, now we need to check
	 * that it's up-to-date. If not, it is going to be due to an error,
	 * or because readahead was otherwise unable to retrieve it.
	 */
	if (unlikely(!folio_test_uptodate(folio))) {
		/*
		 * If the invalidate lock is not held, the folio was in cache
		 * and uptodate and now it is not. Strange but possible since we
		 * didn't hold the page lock all the time. Let's drop
		 * everything, get the invalidate lock and try again.
		 */
		if (!mapping_locked) {
			folio_unlock(folio);
			folio_put(folio);
			goto retry_find;
		}

		/*
		 * OK, the folio is really not uptodate. This can be because the
		 * VMA has the VM_RAND_READ flag set, or because an error
		 * arose. Let's read it in directly.
		 */
		goto page_not_uptodate;
	}

	/*
	 * We've made it this far and we had to drop our mmap_lock, now is the
	 * time to return to the upper layer and have it re-find the vma and
	 * redo the fault.
	 */
	if (fpin) {
		folio_unlock(folio);
		goto out_retry;
	}
	if (mapping_locked)
		filemap_invalidate_unlock_shared(mapping);

	/*
	 * Found the page and have a reference on it.
	 * We must recheck i_size under page lock.
	 */
	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	if (unlikely(index >= max_idx)) {
		folio_unlock(folio);
		folio_put(folio);
		return VM_FAULT_SIGBUS;
	}

	vmf->page = folio_file_page(folio, index);
	return ret | VM_FAULT_LOCKED;

page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
	if (fpin)
		goto out_retry;
	folio_put(folio);

	if (!error || error == AOP_TRUNCATED_PAGE)
		goto retry_find;
	filemap_invalidate_unlock_shared(mapping);

	return VM_FAULT_SIGBUS;

out_retry:
	/*
	 * We dropped the mmap_lock, we need to return to the fault handler to
	 * re-find the vma and come back and find our hopefully still populated
	 * page.
	 */
	if (!IS_ERR(folio))
		folio_put(folio);
	if (mapping_locked)
		filemap_invalidate_unlock_shared(mapping);
	if (fpin)
		fput(fpin);
	return ret | VM_FAULT_RETRY;
}
EXPORT_SYMBOL(filemap_fault);

static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
		pgoff_t start)
{
	struct mm_struct *mm = vmf->vma->vm_mm;

	/* Huge page is mapped? No need to proceed. */
	if (pmd_trans_huge(*vmf->pmd)) {
		folio_unlock(folio);
		folio_put(folio);
		return true;
	}

	if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
		struct page *page = folio_file_page(folio, start);
		vm_fault_t ret = do_set_pmd(vmf, folio, page);
		if (!ret) {
			/* The page is mapped successfully, reference consumed. */
			folio_unlock(folio);
			return true;
		}
	}

	if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);

	return false;
}

static struct folio *next_uptodate_folio(struct xa_state *xas,
		struct address_space *mapping, pgoff_t end_pgoff)
{
	struct folio *folio = xas_next_entry(xas, end_pgoff);
	unsigned long max_idx;

	do {
		if (!folio)
			return NULL;
		if (xas_retry(xas, folio))
			continue;
		if (xa_is_value(folio))
			continue;
		if (!folio_try_get(folio))
			continue;
		if (folio_test_locked(folio))
			goto skip;
		/* Has the page moved or been split? */
		if (unlikely(folio != xas_reload(xas)))
			goto skip;
		if (!folio_test_uptodate(folio) || folio_test_readahead(folio))
			goto skip;
		if (!folio_trylock(folio))
			goto skip;
		if (folio->mapping != mapping)
			goto unlock;
		if (!folio_test_uptodate(folio))
			goto unlock;
		max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
		if (xas->xa_index >= max_idx)
			goto unlock;
		return folio;
unlock:
		folio_unlock(folio);
skip:
		folio_put(folio);
	} while ((folio = xas_next_entry(xas, end_pgoff)) != NULL);

	return NULL;
}

/*
 * Map page range [start_page, start_page + nr_pages) of folio.
 * start_page is gotten from start by folio_page(folio, start)
 */
static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
			struct folio *folio, unsigned long start,
			unsigned long addr, unsigned int nr_pages,
			unsigned long *rss, unsigned short *mmap_miss,
			pgoff_t file_end)
{
	struct address_space *mapping = folio->mapping;
	unsigned int ref_from_caller = 1;
	vm_fault_t ret = 0;
	struct page *page = folio_page(folio, start);
	unsigned int count = 0;
	pte_t *old_ptep = vmf->pte;
	unsigned long addr0;

	/*
	 * Map the large folio fully where possible:
	 *
	 *  - The folio is fully within size of the file or belong
	 *    to shmem/tmpfs;
	 *  - The folio doesn't cross VMA boundary;
	 *  - The folio doesn't cross page table boundary;
	 */
	addr0 = addr - start * PAGE_SIZE;
	if ((file_end >= folio_next_index(folio) || shmem_mapping(mapping)) &&
	    folio_within_vma(folio, vmf->vma) &&
	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
		vmf->pte -= start;
		page -= start;
		addr = addr0;
		nr_pages = folio_nr_pages(folio);
	}

	do {
		if (PageHWPoison(page + count))
			goto skip;

		/*
		 * If there are too many folios that are recently evicted
		 * in a file, they will probably continue to be evicted.
		 * In such situation, read-ahead is only a waste of IO.
		 * Don't decrease mmap_miss in this scenario to make sure
		 * we can stop read-ahead.
		 */
		if (!folio_test_workingset(folio))
			(*mmap_miss)++;

		/*
		 * NOTE: If there're PTE markers, we'll leave them to be
		 * handled in the specific fault path, and it'll prohibit the
		 * fault-around logic.
		 */
		if (!pte_none(ptep_get(&vmf->pte[count])))
			goto skip;

		count++;
		continue;
skip:
		if (count) {
			set_pte_range(vmf, folio, page, count, addr);
			*rss += count;
			folio_ref_add(folio, count - ref_from_caller);
			ref_from_caller = 0;
			if (in_range(vmf->address, addr, count * PAGE_SIZE))
				ret = VM_FAULT_NOPAGE;
		}

		count++;
		page += count;
		vmf->pte += count;
		addr += count * PAGE_SIZE;
		count = 0;
	} while (--nr_pages > 0);

	if (count) {
		set_pte_range(vmf, folio, page, count, addr);
		*rss += count;
		folio_ref_add(folio, count - ref_from_caller);
		ref_from_caller = 0;
		if (in_range(vmf->address, addr, count * PAGE_SIZE))
			ret = VM_FAULT_NOPAGE;
	}

	vmf->pte = old_ptep;
	if (ref_from_caller)
		/* Locked folios cannot get truncated. */
		folio_ref_dec(folio);

	return ret;
}

static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
		struct folio *folio, unsigned long addr,
		unsigned long *rss, unsigned short *mmap_miss)
{
	vm_fault_t ret = 0;
	struct page *page = &folio->page;

	if (PageHWPoison(page))
		goto out;

	/* See comment of filemap_map_folio_range() */
	if (!folio_test_workingset(folio))
		(*mmap_miss)++;

	/*
	 * NOTE: If there're PTE markers, we'll leave them to be
	 * handled in the specific fault path, and it'll prohibit
	 * the fault-around logic.
	 */
	if (!pte_none(ptep_get(vmf->pte)))
		goto out;

	if (vmf->address == addr)
		ret = VM_FAULT_NOPAGE;

	set_pte_range(vmf, folio, page, 1, addr);
	(*rss)++;
	return ret;

out:
	/* Locked folios cannot get truncated. */
	folio_ref_dec(folio);
	return ret;
}

vm_fault_t filemap_map_pages(struct vm_fault *vmf,
			     pgoff_t start_pgoff, pgoff_t end_pgoff)
{
	struct vm_area_struct *vma = vmf->vma;
	struct file *file = vma->vm_file;
	struct address_space *mapping = file->f_mapping;
	pgoff_t file_end, last_pgoff = start_pgoff;
	unsigned long addr;
	XA_STATE(xas, &mapping->i_pages, start_pgoff);
	struct folio *folio;
	vm_fault_t ret = 0;
	unsigned long rss = 0;
	unsigned int nr_pages = 0, folio_type;
	unsigned short mmap_miss = 0, mmap_miss_saved;

	rcu_read_lock();
	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
	if (!folio)
		goto out;

	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
	end_pgoff = min(end_pgoff, file_end);

	/*
	 * Do not allow to map with PMD across i_size to preserve
	 * SIGBUS semantics.
	 *
	 * Make an exception for shmem/tmpfs that for long time
	 * intentionally mapped with PMDs across i_size.
	 */
	if ((file_end >= folio_next_index(folio) || shmem_mapping(mapping)) &&
	    filemap_map_pmd(vmf, folio, start_pgoff)) {
		ret = VM_FAULT_NOPAGE;
		goto out;
	}

	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
	if (!vmf->pte) {
		folio_unlock(folio);
		folio_put(folio);
		goto out;
	}

	folio_type = mm_counter_file(folio);
	do {
		unsigned long end;

		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
		vmf->pte += xas.xa_index - last_pgoff;
		last_pgoff = xas.xa_index;
		end = folio_next_index(folio) - 1;
		nr_pages = min(end, end_pgoff) - xas.xa_index + 1;

		if (!folio_test_large(folio))
			ret |= filemap_map_order0_folio(vmf,
					folio, addr, &rss, &mmap_miss);
		else
			ret |= filemap_map_folio_range(vmf, folio,
					xas.xa_index - folio->index, addr,
					nr_pages, &rss, &mmap_miss, file_end);

		folio_unlock(folio);
	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
	add_mm_counter(vma->vm_mm, folio_type, rss);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	trace_mm_filemap_map_pages(mapping, start_pgoff, end_pgoff);
out:
	rcu_read_unlock();

	mmap_miss_saved = READ_ONCE(file->f_ra.mmap_miss);
	if (mmap_miss >= mmap_miss_saved)
		WRITE_ONCE(file->f_ra.mmap_miss, 0);
	else
		WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss_saved - mmap_miss);

	return ret;
}
EXPORT_SYMBOL(filemap_map_pages);

vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
	struct folio *folio = page_folio(vmf->page);
	vm_fault_t ret = VM_FAULT_LOCKED;

	sb_start_pagefault(mapping->host->i_sb);
	file_update_time(vmf->vma->vm_file);
	folio_lock(folio);
	if (folio->mapping != mapping) {
		folio_unlock(folio);
		ret = VM_FAULT_NOPAGE;
		goto out;
	}
	/*
	 * We mark the folio dirty already here so that when freeze is in
	 * progress, we are guaranteed that writeback during freezing will
	 * see the dirty folio and writeprotect it again.
	 */
	folio_mark_dirty(folio);
	folio_wait_stable(folio);
out:
	sb_end_pagefault(mapping->host->i_sb);
	return ret;
}

const struct vm_operations_struct generic_file_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= filemap_page_mkwrite,
};

/* This is used for a general mmap of a disk file */

int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct address_space *mapping = file->f_mapping;

	if (!mapping->a_ops->read_folio)
		return -ENOEXEC;
	file_accessed(file);
	vma->vm_ops = &generic_file_vm_ops;
	return 0;
}

int generic_file_mmap_prepare(struct vm_area_desc *desc)
{
	struct file *file = desc->file;
	struct address_space *mapping = file->f_mapping;

	if (!mapping->a_ops->read_folio)
		return -ENOEXEC;
	file_accessed(file);
	desc->vm_ops = &generic_file_vm_ops;
	return 0;
}

/*
 * This is for filesystems which do not implement ->writepage.
 */
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
{
	if (vma_is_shared_maywrite(vma))
		return -EINVAL;
	return generic_file_mmap(file, vma);
}

int generic_file_readonly_mmap_prepare(struct vm_area_desc *desc)
{
	if (is_shared_maywrite(&desc->vma_flags))
		return -EINVAL;
	return generic_file_mmap_prepare(desc);
}
#else
vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
{
	return VM_FAULT_SIGBUS;
}
int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	return -ENOSYS;
}
int generic_file_mmap_prepare(struct vm_area_desc *desc)
{
	return -ENOSYS;
}
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
{
	return -ENOSYS;
}
int generic_file_readonly_mmap_prepare(struct vm_area_desc *desc)
{
	return -ENOSYS;
}
#endif /* CONFIG_MMU */

EXPORT_SYMBOL(filemap_page_mkwrite);
EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_file_mmap_prepare);
EXPORT_SYMBOL(generic_file_readonly_mmap);
EXPORT_SYMBOL(generic_file_readonly_mmap_prepare);

static struct folio *do_read_cache_folio(struct address_space *mapping,
		pgoff_t index, filler_t filler, struct file *file, gfp_t gfp)
{
	struct folio *folio;
	int err;

	if (!filler)
		filler = mapping->a_ops->read_folio;
repeat:
	folio = filemap_get_folio(mapping, index);
	if (IS_ERR(folio)) {
		folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping), NULL);
		if (!folio)
			return ERR_PTR(-ENOMEM);
		index = mapping_align_index(mapping, index);
		err = filemap_add_folio(mapping, folio, index, gfp);
		if (unlikely(err)) {
			folio_put(folio);
			if (err == -EEXIST)
				goto repeat;
			/* Presumably ENOMEM for xarray node */
			return ERR_PTR(err);
		}

		goto filler;
	}
	if (folio_test_uptodate(folio))
		goto out;

	if (!folio_trylock(folio)) {
		folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE);
		goto repeat;
	}

	/* Folio was truncated from mapping */
	if (!folio->mapping) {
		folio_unlock(folio);
		folio_put(folio);
		goto repeat;
	}

	/* Someone else locked and filled the page in a very small window */
	if (folio_test_uptodate(folio)) {
		folio_unlock(folio);
		goto out;
	}

filler:
	err = filemap_read_folio(file, filler, folio);
	if (err) {
		folio_put(folio);
		if (err == AOP_TRUNCATED_PAGE)
			goto repeat;
		return ERR_PTR(err);
	}

out:
	folio_mark_accessed(folio);
	return folio;
}

/**
 * read_cache_folio - Read into page cache, fill it if needed.
 * @mapping: The address_space to read from.
 * @index: The index to read.
 * @filler: Function to perform the read, or NULL to use aops->read_folio().
 * @file: Passed to filler function, may be NULL if not required.
 *
 * Read one page into the page cache.  If it succeeds, the folio returned
 * will contain @index, but it may not be the first page of the folio.
 *
 * If the filler function returns an error, it will be returned to the
 * caller.
 *
 * Context: May sleep.  Expects mapping->invalidate_lock to be held.
 * Return: An uptodate folio on success, ERR_PTR() on failure.
 */
struct folio *read_cache_folio(struct address_space *mapping, pgoff_t index,
		filler_t filler, struct file *file)
{
	return do_read_cache_folio(mapping, index, filler, file,
			mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(read_cache_folio);

/**
 * mapping_read_folio_gfp - Read into page cache, using specified allocation flags.
 * @mapping:	The address_space for the folio.
 * @index:	The index that the allocated folio will contain.
 * @gfp:	The page allocator flags to use if allocating.
 *
 * This is the same as "read_cache_folio(mapping, index, NULL, NULL)", but with
 * any new memory allocations done using the specified allocation flags.
 *
 * The most likely error from this function is EIO, but ENOMEM is
 * possible and so is EINTR.  If ->read_folio returns another error,
 * that will be returned to the caller.
 *
 * The function expects mapping->invalidate_lock to be already held.
 *
 * Return: Uptodate folio on success, ERR_PTR() on failure.
 */
struct folio *mapping_read_folio_gfp(struct address_space *mapping,
		pgoff_t index, gfp_t gfp)
{
	return do_read_cache_folio(mapping, index, NULL, NULL, gfp);
}
EXPORT_SYMBOL(mapping_read_folio_gfp);

static struct page *do_read_cache_page(struct address_space *mapping,
		pgoff_t index, filler_t *filler, struct file *file, gfp_t gfp)
{
	struct folio *folio;

	folio = do_read_cache_folio(mapping, index, filler, file, gfp);
	if (IS_ERR(folio))
		return &folio->page;
	return folio_file_page(folio, index);
}

struct page *read_cache_page(struct address_space *mapping,
			pgoff_t index, filler_t *filler, struct file *file)
{
	return do_read_cache_page(mapping, index, filler, file,
			mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(read_cache_page);

/**
 * read_cache_page_gfp - read into page cache, using specified page allocation flags.
 * @mapping:	the page's address_space
 * @index:	the page index
 * @gfp:	the page allocator flags to use if allocating
 *
 * This is the same as "read_mapping_page(mapping, index, NULL)", but with
 * any new page allocations done using the specified allocation flags.
 *
 * If the page does not get brought uptodate, return -EIO.
 *
 * The function expects mapping->invalidate_lock to be already held.
 *
 * Return: up to date page on success, ERR_PTR() on failure.
 */
struct page *read_cache_page_gfp(struct address_space *mapping,
				pgoff_t index,
				gfp_t gfp)
{
	return do_read_cache_page(mapping, index, NULL, NULL, gfp);
}
EXPORT_SYMBOL(read_cache_page_gfp);

/*
 * Warn about a page cache invalidation failure during a direct I/O write.
 */
static void dio_warn_stale_pagecache(struct file *filp)
{
	static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
	char pathname[128];
	char *path;

	errseq_set(&filp->f_mapping->wb_err, -EIO);
	if (__ratelimit(&_rs)) {
		path = file_path(filp, pathname, sizeof(pathname));
		if (IS_ERR(path))
			path = "(unknown)";
		pr_crit("Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!\n");
		pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
			current->comm);
	}
}

void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;

	if (mapping->nrpages &&
	    invalidate_inode_pages2_range(mapping,
			iocb->ki_pos >> PAGE_SHIFT,
			(iocb->ki_pos + count - 1) >> PAGE_SHIFT))
		dio_warn_stale_pagecache(iocb->ki_filp);
}

ssize_t
generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	size_t write_len = iov_iter_count(from);
	ssize_t written;

	/*
	 * If a page can not be invalidated, return 0 to fall back
	 * to buffered write.
	 */
	written = kiocb_invalidate_pages(iocb, write_len);
	if (written) {
		if (written == -EBUSY)
			return 0;
		return written;
	}

	written = mapping->a_ops->direct_IO(iocb, from);

	/*
	 * Finally, try again to invalidate clean pages which might have been
	 * cached by non-direct readahead, or faulted in by get_user_pages()
	 * if the source of the write was an mmap'ed region of the file
	 * we're writing.  Either one is a pretty crazy thing to do,
	 * so we don't support it 100%.  If this invalidation
	 * fails, tough, the write still worked...
	 *
	 * Most of the time we do not need this since dio_complete() will do
	 * the invalidation for us. However there are some file systems that
	 * do not end up with dio_complete() being called, so let's not break
	 * them by removing it completely.
	 *
	 * Noticeable example is a blkdev_direct_IO().
	 *
	 * Skip invalidation for async writes or if mapping has no pages.
	 */
	if (written > 0) {
		struct inode *inode = mapping->host;
		loff_t pos = iocb->ki_pos;

		kiocb_invalidate_post_direct_write(iocb, written);
		pos += written;
		write_len -= written;
		if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
			i_size_write(inode, pos);
			mark_inode_dirty(inode);
		}
		iocb->ki_pos = pos;
	}
	if (written != -EIOCBQUEUED)
		iov_iter_revert(from, write_len - iov_iter_count(from));
	return written;
}
EXPORT_SYMBOL(generic_file_direct_write);

ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
{
	struct file *file = iocb->ki_filp;
	loff_t pos = iocb->ki_pos;
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	size_t chunk = mapping_max_folio_size(mapping);
	long status = 0;
	ssize_t written = 0;

	do {
		struct folio *folio;
		size_t offset;		/* Offset into folio */
		size_t bytes;		/* Bytes to write to folio */
		size_t copied;		/* Bytes copied from user */
		void *fsdata = NULL;

		bytes = iov_iter_count(i);
retry:
		offset = pos & (chunk - 1);
		bytes = min(chunk - offset, bytes);
		balance_dirty_pages_ratelimited(mapping);

		if (fatal_signal_pending(current)) {
			status = -EINTR;
			break;
		}

		status = a_ops->write_begin(iocb, mapping, pos, bytes,
						&folio, &fsdata);
		if (unlikely(status < 0))
			break;

		offset = offset_in_folio(folio, pos);
		if (bytes > folio_size(folio) - offset)
			bytes = folio_size(folio) - offset;

		if (mapping_writably_mapped(mapping))
			flush_dcache_folio(folio);

		/*
		 * Faults here on mmap()s can recurse into arbitrary
		 * filesystem code. Lots of locks are held that can
		 * deadlock. Use an atomic copy to avoid deadlocking
		 * in page fault handling.
		 */
		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
		flush_dcache_folio(folio);

		status = a_ops->write_end(iocb, mapping, pos, bytes, copied,
						folio, fsdata);
		if (unlikely(status != copied)) {
			iov_iter_revert(i, copied - max(status, 0L));
			if (unlikely(status < 0))
				break;
		}
		cond_resched();

		if (unlikely(status == 0)) {
			/*
			 * A short copy made ->write_end() reject the
			 * thing entirely.  Might be memory poisoning
			 * halfway through, might be a race with munmap,
			 * might be severe memory pressure.
			 */
			if (chunk > PAGE_SIZE)
				chunk /= 2;
			if (copied) {
				bytes = copied;
				goto retry;
			}

			/*
			 * 'folio' is now unlocked and faults on it can be
			 * handled. Ensure forward progress by trying to
			 * fault it in now.
			 */
			if (fault_in_iov_iter_readable(i, bytes) == bytes) {
				status = -EFAULT;
				break;
			}
		} else {
			pos += status;
			written += status;
		}
	} while (iov_iter_count(i));

	if (!written)
		return status;
	iocb->ki_pos += written;
	return written;
}
EXPORT_SYMBOL(generic_perform_write);

/**
 * __generic_file_write_iter - write data to a file
 * @iocb:	IO state structure (file, offset, etc.)
 * @from:	iov_iter with data to write
 *
 * This function does all the work needed for actually writing data to a
 * file. It does all basic checks, removes SUID from the file, updates
 * modification times and calls proper subroutines depending on whether we
 * do direct IO or a standard buffered write.
 *
 * It expects i_rwsem to be grabbed unless we work on a block device or similar
 * object which does not need locking at all.
 *
 * This function does *not* take care of syncing data in case of O_SYNC write.
 * A caller has to handle it. This is mainly due to the fact that we want to
 * avoid syncing under i_rwsem.
 *
 * Return:
 * * number of bytes written, even for truncated writes
 * * negative error code if no data has been written at all
 */
ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
	ssize_t ret;

	ret = file_remove_privs(file);
	if (ret)
		return ret;

	ret = file_update_time(file);
	if (ret)
		return ret;

	if (iocb->ki_flags & IOCB_DIRECT) {
		ret = generic_file_direct_write(iocb, from);
		/*
		 * If the write stopped short of completing, fall back to
		 * buffered writes.  Some filesystems do this for writes to
		 * holes, for example.  For DAX files, a buffered write will
		 * not succeed (even if it did, DAX does not handle dirty
		 * page-cache pages correctly).
		 */
		if (ret < 0 || !iov_iter_count(from) || IS_DAX(inode))
			return ret;
		return direct_write_fallback(iocb, from, ret,
				generic_perform_write(iocb, from));
	}

	return generic_perform_write(iocb, from);
}
EXPORT_SYMBOL(__generic_file_write_iter);

/**
 * generic_file_write_iter - write data to a file
 * @iocb:	IO state structure
 * @from:	iov_iter with data to write
 *
 * This is a wrapper around __generic_file_write_iter() to be used by most
 * filesystems. It takes care of syncing the file in case of O_SYNC file
 * and acquires i_rwsem as needed.
 * Return:
 * * negative error code if no data has been written at all of
 *   vfs_fsync_range() failed for a synchronous write
 * * number of bytes written, even for truncated writes
 */
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct inode *inode = file->f_mapping->host;
	ssize_t ret;

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
		ret = __generic_file_write_iter(iocb, from);
	inode_unlock(inode);

	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
	return ret;
}
EXPORT_SYMBOL(generic_file_write_iter);

/**
 * filemap_release_folio() - Release fs-specific metadata on a folio.
 * @folio: The folio which the kernel is trying to free.
 * @gfp: Memory allocation flags (and I/O mode).
 *
 * The address_space is trying to release any data attached to a folio
 * (presumably at folio->private).
 *
 * This will also be called if the private_2 flag is set on a page,
 * indicating that the folio has other metadata associated with it.
 *
 * The @gfp argument specifies whether I/O may be performed to release
 * this page (__GFP_IO), and whether the call may block
 * (__GFP_RECLAIM & __GFP_FS).
 *
 * Return: %true if the release was successful, otherwise %false.
 */
bool filemap_release_folio(struct folio *folio, gfp_t gfp)
{
	struct address_space * const mapping = folio->mapping;

	BUG_ON(!folio_test_locked(folio));
	if (!folio_needs_release(folio))
		return true;
	if (folio_test_writeback(folio))
		return false;

	if (mapping && mapping->a_ops->release_folio)
		return mapping->a_ops->release_folio(folio, gfp);
	return try_to_free_buffers(folio);
}
EXPORT_SYMBOL(filemap_release_folio);

/**
 * filemap_invalidate_inode - Invalidate/forcibly write back a range of an inode's pagecache
 * @inode: The inode to flush
 * @flush: Set to write back rather than simply invalidate.
 * @start: First byte to in range.
 * @end: Last byte in range (inclusive), or LLONG_MAX for everything from start
 *       onwards.
 *
 * Invalidate all the folios on an inode that contribute to the specified
 * range, possibly writing them back first.  Whilst the operation is
 * undertaken, the invalidate lock is held to prevent new folios from being
 * installed.
 */
int filemap_invalidate_inode(struct inode *inode, bool flush,
			     loff_t start, loff_t end)
{
	struct address_space *mapping = inode->i_mapping;
	pgoff_t first = start >> PAGE_SHIFT;
	pgoff_t last = end >> PAGE_SHIFT;
	pgoff_t nr = end == LLONG_MAX ? ULONG_MAX : last - first + 1;

	if (!mapping || !mapping->nrpages || end < start)
		goto out;

	/* Prevent new folios from being added to the inode. */
	filemap_invalidate_lock(mapping);

	if (!mapping->nrpages)
		goto unlock;

	unmap_mapping_pages(mapping, first, nr, false);

	/* Write back the data if we're asked to. */
	if (flush)
		filemap_fdatawrite_range(mapping, start, end);

	/* Wait for writeback to complete on all folios and discard. */
	invalidate_inode_pages2_range(mapping, start / PAGE_SIZE, end / PAGE_SIZE);

unlock:
	filemap_invalidate_unlock(mapping);
out:
	return filemap_check_errors(mapping);
}
EXPORT_SYMBOL_GPL(filemap_invalidate_inode);

#ifdef CONFIG_CACHESTAT_SYSCALL
/**
 * filemap_cachestat() - compute the page cache statistics of a mapping
 * @mapping:	The mapping to compute the statistics for.
 * @first_index:	The starting page cache index.
 * @last_index:	The final page index (inclusive).
 * @cs:	the cachestat struct to write the result to.
 *
 * This will query the page cache statistics of a mapping in the
 * page range of [first_index, last_index] (inclusive). The statistics
 * queried include: number of dirty pages, number of pages marked for
 * writeback, and the number of (recently) evicted pages.
 */
static void filemap_cachestat(struct address_space *mapping,
		pgoff_t first_index, pgoff_t last_index, struct cachestat *cs)
{
	XA_STATE(xas, &mapping->i_pages, first_index);
	struct folio *folio;

	/* Flush stats (and potentially sleep) outside the RCU read section. */
	mem_cgroup_flush_stats_ratelimited(NULL);

	rcu_read_lock();
	xas_for_each(&xas, folio, last_index) {
		int order;
		unsigned long nr_pages;
		pgoff_t folio_first_index, folio_last_index;

		/*
		 * Don't deref the folio. It is not pinned, and might
		 * get freed (and reused) underneath us.
		 *
		 * We *could* pin it, but that would be expensive for
		 * what should be a fast and lightweight syscall.
		 *
		 * Instead, derive all information of interest from
		 * the rcu-protected xarray.
		 */

		if (xas_retry(&xas, folio))
			continue;

		order = xas_get_order(&xas);
		nr_pages = 1 << order;
		folio_first_index = round_down(xas.xa_index, 1 << order);
		folio_last_index = folio_first_index + nr_pages - 1;

		/* Folios might straddle the range boundaries, only count covered pages */
		if (folio_first_index < first_index)
			nr_pages -= first_index - folio_first_index;

		if (folio_last_index > last_index)
			nr_pages -= folio_last_index - last_index;

		if (xa_is_value(folio)) {
			/* page is evicted */
			void *shadow = (void *)folio;
			bool workingset; /* not used */

			cs->nr_evicted += nr_pages;

#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
			if (shmem_mapping(mapping)) {
				/* shmem file - in swap cache */
				swp_entry_t swp = radix_to_swp_entry(folio);

				/* swapin error results in poisoned entry */
				if (!softleaf_is_swap(swp))
					goto resched;

				/*
				 * Getting a swap entry from the shmem
				 * inode means we beat
				 * shmem_unuse(). rcu_read_lock()
				 * ensures swapoff waits for us before
				 * freeing the swapper space. However,
				 * we can race with swapping and
				 * invalidation, so there might not be
				 * a shadow in the swapcache (yet).
				 */
				shadow = swap_cache_get_shadow(swp);
				if (!shadow)
					goto resched;
			}
#endif
			if (workingset_test_recent(shadow, true, &workingset, false))
				cs->nr_recently_evicted += nr_pages;

			goto resched;
		}

		/* page is in cache */
		cs->nr_cache += nr_pages;

		if (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY))
			cs->nr_dirty += nr_pages;

		if (xas_get_mark(&xas, PAGECACHE_TAG_WRITEBACK))
			cs->nr_writeback += nr_pages;

resched:
		if (need_resched()) {
			xas_pause(&xas);
			cond_resched_rcu();
		}
	}
	rcu_read_unlock();
}

/*
 * See mincore: reveal pagecache information only for files
 * that the calling process has write access to, or could (if
 * tried) open for writing.
 */
static inline bool can_do_cachestat(struct file *f)
{
	if (f->f_mode & FMODE_WRITE)
		return true;
	if (inode_owner_or_capable(file_mnt_idmap(f), file_inode(f)))
		return true;
	return file_permission(f, MAY_WRITE) == 0;
}

/*
 * The cachestat(2) system call.
 *
 * cachestat() returns the page cache statistics of a file in the
 * bytes range specified by `off` and `len`: number of cached pages,
 * number of dirty pages, number of pages marked for writeback,
 * number of evicted pages, and number of recently evicted pages.
 *
 * An evicted page is a page that is previously in the page cache
 * but has been evicted since. A page is recently evicted if its last
 * eviction was recent enough that its reentry to the cache would
 * indicate that it is actively being used by the system, and that
 * there is memory pressure on the system.
 *
 * `off` and `len` must be non-negative integers. If `len` > 0,
 * the queried range is [`off`, `off` + `len`]. If `len` == 0,
 * we will query in the range from `off` to the end of the file.
 *
 * The `flags` argument is unused for now, but is included for future
 * extensibility. User should pass 0 (i.e no flag specified).
 *
 * Currently, hugetlbfs is not supported.
 *
 * Because the status of a page can change after cachestat() checks it
 * but before it returns to the application, the returned values may
 * contain stale information.
 *
 * return values:
 *  zero        - success
 *  -EFAULT     - cstat or cstat_range points to an illegal address
 *  -EINVAL     - invalid flags
 *  -EBADF      - invalid file descriptor
 *  -EOPNOTSUPP - file descriptor is of a hugetlbfs file
 */
SYSCALL_DEFINE4(cachestat, unsigned int, fd,
		struct cachestat_range __user *, cstat_range,
		struct cachestat __user *, cstat, unsigned int, flags)
{
	CLASS(fd, f)(fd);
	struct address_space *mapping;
	struct cachestat_range csr;
	struct cachestat cs;
	pgoff_t first_index, last_index;

	if (fd_empty(f))
		return -EBADF;

	if (copy_from_user(&csr, cstat_range,
			sizeof(struct cachestat_range)))
		return -EFAULT;

	/* hugetlbfs is not supported */
	if (is_file_hugepages(fd_file(f)))
		return -EOPNOTSUPP;

	if (!can_do_cachestat(fd_file(f)))
		return -EPERM;

	if (flags != 0)
		return -EINVAL;

	first_index = csr.off >> PAGE_SHIFT;
	last_index =
		csr.len == 0 ? ULONG_MAX : (csr.off + csr.len - 1) >> PAGE_SHIFT;
	memset(&cs, 0, sizeof(struct cachestat));
	mapping = fd_file(f)->f_mapping;
	filemap_cachestat(mapping, first_index, last_index, &cs);

	if (copy_to_user(cstat, &cs, sizeof(struct cachestat)))
		return -EFAULT;

	return 0;
}
#endif /* CONFIG_CACHESTAT_SYSCALL */
]

gemini-3-flash-preview

							input: 53072
							output: 40
							thoughts: 371

**Analyzing a Data Race in the Filemap Code**

Okay, so I've got a data race on address `0xffff88811a6f8228`, which is... hmm, looks like that's `f_ra.prev_pos`. Task 10061 is in `filemap_splice_read` at line 3125 of `mm/filemap.c`, and task 10066 is in `filemap_read` at line 2873 of the same file. I need to take a closer look at those lines of code.

Alright, let's see... Line 3125 in `filemap_splice_read` is writing `*ppos` to `in->f_ra.prev_pos`, and in `filemap_read`, line 2873, it looks like `last_pos` is being assigned to `ra->prev_pos`. Interesting. Both functions are indeed writing to this `prev_pos` field, with `ra` being the same thing as `&filp->f_ra`. So, it's a conflict between two tasks updating `f_ra.prev_pos`. Now, I need to confirm that this is indeed what's happening and figure out what the root cause and possible solutions are.

map[ContextFile:mm/filemap.c IncludeLines:true Name:file_ra_state]

map[SourceCode:1197:	/**
1198:	 * struct file_ra_state - Track a file's readahead state.
1199:	 * @start: Where the most recent readahead started.
1200:	 * @size: Number of pages read in the most recent readahead.
1201:	 * @async_size: Numer of pages that were/are not needed immediately
1202:	 *      and so were/are genuinely "ahead".  Start next readahead when
1203:	 *      the first of these pages is accessed.
1204:	 * @ra_pages: Maximum size of a readahead request, copied from the bdi.
1205:	 * @order: Preferred folio order used for most recent readahead.
1206:	 * @mmap_miss: How many mmap accesses missed in the page cache.
1207:	 * @prev_pos: The last byte in the most recent read request.
1208:	 *
1209:	 * When this structure is passed to ->readahead(), the "most recent"
1210:	 * readahead means the current readahead.
1211:	 */
1212:	struct file_ra_state {
1213:		pgoff_t start;
1214:		unsigned int size;
1215:		unsigned int async_size;
1216:		unsigned int ra_pages;
1217:		unsigned short order;
1218:		unsigned short mmap_miss;
1219:		loff_t prev_pos;
1220:	};
 SourceFile:include/linux/fs.h]

gemini-3-flash-preview

							input: 54211
							output: 21
							thoughts: 1065

**Analyzing a Potential Race Condition in Readahead Logic**

Okay, so I'm looking at this readahead implementation, specifically around how the `prev_pos` field in `struct file_ra_state` (`f_ra`) is handled. It seems like both `filemap_read` and `filemap_splice_read` are updating `f_ra.prev_pos`, which is a `loff_t` (64-bit). The point is to track the last read position for readahead decisions. My initial thought is, is this concurrent access a problem?

Both functions update `f_ra.prev_pos` to track the last read position for readahead decisions. The updates happen within the read loops in these functions.

I realize that multiple threads can read from the same file descriptor (meaning they can share the `struct file` that contains `f_ra`). The VFS layer usually doesn't provide explicit locking for concurrent reads for performance reasons unless the filesystem itself implements it.

Crucially, neither `filemap_read` nor `filemap_splice_read` appears to be taking the `inode->i_rwsem` lock in all cases, which could provide protection. `generic_file_read_iter` calls `filemap_read`, and it doesn't take the `inode_lock` either for buffered reads. `filemap_splice_read` also isn't using it.

So, if two threads are reading from the same file, they are both updating `f_ra.prev_pos` simultaneously. Considering `prev_pos` is just a *hint* for readahead, the worst-case scenario should be suboptimal readahead behavior, but ideally, no memory corruption or crashes. The write itself should be atomic since `loff_t` is 64-bit and aligned on x86_64, right? The system in question is 64-bit.

But, KCSAN reported a data race. `prev_pos` value is changing from `0xd8` to `0x14f0`, which is definitely a conflict. On 32-bit systems, where the atomic-ness of writes might not be guaranteed, this could be a more severe problem, but here we are on x86_64.

Okay, let me double-check how these functions are handling the `iocb` structure. In `filemap_read`, the `iocb` is passed in, while in `filemap_splice_read`, a local `iocb` is initialized. However, the `in->f_ra` is shared. So, they both access `in->f_ra`, in other words, they access the same `struct file`, hence the same `f_ra`.

So the question becomes: is `f_ra` itself protected by any lock? I understand that readahead is often treated as best-effort, and access to it might not be strictly synchronized to avoid contention on critical paths.

I should probably check if there are any other places where `f_ra` is being accessed and if any locking is involved there. `page_cache_sync_ra` and `page_cache_async_ra` use `f_ra`, so I should dive into that. I'll look at the `page_cache_sync_ra` function in `mm/readahead.c` to see if there are any hints.

map[File:mm/readahead.c]

map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
 * mm/readahead.c - address_space-level file readahead.
 *
 * Copyright (C) 2002, Linus Torvalds
 *
 * 09Apr2002	Andrew Morton
 *		Initial version.
 */

/**
 * DOC: Readahead Overview
 *
 * Readahead is used to read content into the page cache before it is
 * explicitly requested by the application.  Readahead only ever
 * attempts to read folios that are not yet in the page cache.  If a
 * folio is present but not up-to-date, readahead will not try to read
 * it. In that case a simple ->read_folio() will be requested.
 *
 * Readahead is triggered when an application read request (whether a
 * system call or a page fault) finds that the requested folio is not in
 * the page cache, or that it is in the page cache and has the
 * readahead flag set.  This flag indicates that the folio was read
 * as part of a previous readahead request and now that it has been
 * accessed, it is time for the next readahead.
 *
 * Each readahead request is partly synchronous read, and partly async
 * readahead.  This is reflected in the struct file_ra_state which
 * contains ->size being the total number of pages, and ->async_size
 * which is the number of pages in the async section.  The readahead
 * flag will be set on the first folio in this async section to trigger
 * a subsequent readahead.  Once a series of sequential reads has been
 * established, there should be no need for a synchronous component and
 * all readahead request will be fully asynchronous.
 *
 * When either of the triggers causes a readahead, three numbers need
 * to be determined: the start of the region to read, the size of the
 * region, and the size of the async tail.
 *
 * The start of the region is simply the first page address at or after
 * the accessed address, which is not currently populated in the page
 * cache.  This is found with a simple search in the page cache.
 *
 * The size of the async tail is determined by subtracting the size that
 * was explicitly requested from the determined request size, unless
 * this would be less than zero - then zero is used.  NOTE THIS
 * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
 * PAGE.  ALSO THIS CALCULATION IS NOT USED CONSISTENTLY.
 *
 * The size of the region is normally determined from the size of the
 * previous readahead which loaded the preceding pages.  This may be
 * discovered from the struct file_ra_state for simple sequential reads,
 * or from examining the state of the page cache when multiple
 * sequential reads are interleaved.  Specifically: where the readahead
 * was triggered by the readahead flag, the size of the previous
 * readahead is assumed to be the number of pages from the triggering
 * page to the start of the new readahead.  In these cases, the size of
 * the previous readahead is scaled, often doubled, for the new
 * readahead, though see get_next_ra_size() for details.
 *
 * If the size of the previous read cannot be determined, the number of
 * preceding pages in the page cache is used to estimate the size of
 * a previous read.  This estimate could easily be misled by random
 * reads being coincidentally adjacent, so it is ignored unless it is
 * larger than the current request, and it is not scaled up, unless it
 * is at the start of file.
 *
 * In general readahead is accelerated at the start of the file, as
 * reads from there are often sequential.  There are other minor
 * adjustments to the readahead size in various special cases and these
 * are best discovered by reading the code.
 *
 * The above calculation, based on the previous readahead size,
 * determines the size of the readahead, to which any requested read
 * size may be added.
 *
 * Readahead requests are sent to the filesystem using the ->readahead()
 * address space operation, for which mpage_readahead() is a canonical
 * implementation.  ->readahead() should normally initiate reads on all
 * folios, but may fail to read any or all folios without causing an I/O
 * error.  The page cache reading code will issue a ->read_folio() request
 * for any folio which ->readahead() did not read, and only an error
 * from this will be final.
 *
 * ->readahead() will generally call readahead_folio() repeatedly to get
 * each folio from those prepared for readahead.  It may fail to read a
 * folio by:
 *
 * * not calling readahead_folio() sufficiently many times, effectively
 *   ignoring some folios, as might be appropriate if the path to
 *   storage is congested.
 *
 * * failing to actually submit a read request for a given folio,
 *   possibly due to insufficient resources, or
 *
 * * getting an error during subsequent processing of a request.
 *
 * In the last two cases, the folio should be unlocked by the filesystem
 * to indicate that the read attempt has failed.  In the first case the
 * folio will be unlocked by the VFS.
 *
 * Those folios not in the final ``async_size`` of the request should be
 * considered to be important and ->readahead() should not fail them due
 * to congestion or temporary resource unavailability, but should wait
 * for necessary resources (e.g.  memory or indexing information) to
 * become available.  Folios in the final ``async_size`` may be
 * considered less urgent and failure to read them is more acceptable.
 * In this case it is best to use filemap_remove_folio() to remove the
 * folios from the page cache as is automatically done for folios that
 * were not fetched with readahead_folio().  This will allow a
 * subsequent synchronous readahead request to try them again.  If they
 * are left in the page cache, then they will be read individually using
 * ->read_folio() which may be less efficient.
 */

#include <linux/blkdev.h>
#include <linux/kernel.h>
#include <linux/dax.h>
#include <linux/gfp.h>
#include <linux/export.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/pagemap.h>
#include <linux/psi.h>
#include <linux/syscalls.h>
#include <linux/file.h>
#include <linux/mm_inline.h>
#include <linux/blk-cgroup.h>
#include <linux/fadvise.h>
#include <linux/sched/mm.h>

#define CREATE_TRACE_POINTS
#include <trace/events/readahead.h>

#include "internal.h"

/*
 * Initialise a struct file's readahead state.  Assumes that the caller has
 * memset *ra to zero.
 */
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
	ra->prev_pos = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);

static void read_pages(struct readahead_control *rac)
{
	const struct address_space_operations *aops = rac->mapping->a_ops;
	struct folio *folio;
	struct blk_plug plug;

	if (!readahead_count(rac))
		return;

	if (unlikely(rac->_workingset))
		psi_memstall_enter(&rac->_pflags);
	blk_start_plug(&plug);

	if (aops->readahead) {
		aops->readahead(rac);
		/* Clean up the remaining folios. */
		while ((folio = readahead_folio(rac)) != NULL) {
			folio_get(folio);
			filemap_remove_folio(folio);
			folio_unlock(folio);
			folio_put(folio);
		}
	} else {
		while ((folio = readahead_folio(rac)) != NULL)
			aops->read_folio(rac->file, folio);
	}

	blk_finish_plug(&plug);
	if (unlikely(rac->_workingset))
		psi_memstall_leave(&rac->_pflags);
	rac->_workingset = false;

	BUG_ON(readahead_count(rac));
}

static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
				       gfp_t gfp_mask, unsigned int order)
{
	struct folio *folio;

	folio = filemap_alloc_folio(gfp_mask, order, NULL);
	if (folio && ractl->dropbehind)
		__folio_set_dropbehind(folio);

	return folio;
}

/**
 * page_cache_ra_unbounded - Start unchecked readahead.
 * @ractl: Readahead control.
 * @nr_to_read: The number of pages to read.
 * @lookahead_size: Where to start the next readahead.
 *
 * This function is for filesystems to call when they want to start
 * readahead beyond a file's stated i_size.  This is almost certainly
 * not the function you want to call.  Use page_cache_async_readahead()
 * or page_cache_sync_readahead() instead.
 *
 * Context: File is referenced by caller, and ractl->mapping->invalidate_lock
 * must be held by the caller at least in shared mode.  Mutexes may be held by
 * caller.  May sleep, but will not reenter filesystem to reclaim memory.
 */
void page_cache_ra_unbounded(struct readahead_control *ractl,
		unsigned long nr_to_read, unsigned long lookahead_size)
{
	struct address_space *mapping = ractl->mapping;
	unsigned long index = readahead_index(ractl);
	gfp_t gfp_mask = readahead_gfp_mask(mapping);
	unsigned long mark = ULONG_MAX, i = 0;
	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);

	/*
	 * Partway through the readahead operation, we will have added
	 * locked pages to the page cache, but will not yet have submitted
	 * them for I/O.  Adding another page may need to allocate memory,
	 * which can trigger memory reclaim.  Telling the VM we're in
	 * the middle of a filesystem operation will cause it to not
	 * touch file-backed pages, preventing a deadlock.  Most (all?)
	 * filesystems already specify __GFP_NOFS in their mapping's
	 * gfp_mask, but let's be explicit here.
	 */
	unsigned int nofs = memalloc_nofs_save();

	lockdep_assert_held(&mapping->invalidate_lock);

	trace_page_cache_ra_unbounded(mapping->host, index, nr_to_read,
				      lookahead_size);
	index = mapping_align_index(mapping, index);

	/*
	 * As iterator `i` is aligned to min_nrpages, round_up the
	 * difference between nr_to_read and lookahead_size to mark the
	 * index that only has lookahead or "async_region" to set the
	 * readahead flag.
	 */
	if (lookahead_size <= nr_to_read) {
		unsigned long ra_folio_index;

		ra_folio_index = round_up(readahead_index(ractl) +
					  nr_to_read - lookahead_size,
					  min_nrpages);
		mark = ra_folio_index - index;
	}
	nr_to_read += readahead_index(ractl) - index;
	ractl->_index = index;

	/*
	 * Preallocate as many pages as we will need.
	 */
	while (i < nr_to_read) {
		struct folio *folio = xa_load(&mapping->i_pages, index + i);
		int ret;

		if (folio && !xa_is_value(folio)) {
			/*
			 * Page already present?  Kick off the current batch
			 * of contiguous pages before continuing with the
			 * next batch.  This page may be the one we would
			 * have intended to mark as Readahead, but we don't
			 * have a stable reference to this page, and it's
			 * not worth getting one just for that.
			 */
			read_pages(ractl);
			ractl->_index += min_nrpages;
			i = ractl->_index + ractl->_nr_pages - index;
			continue;
		}

		folio = ractl_alloc_folio(ractl, gfp_mask,
					mapping_min_folio_order(mapping));
		if (!folio)
			break;

		ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
		if (ret < 0) {
			folio_put(folio);
			if (ret == -ENOMEM)
				break;
			read_pages(ractl);
			ractl->_index += min_nrpages;
			i = ractl->_index + ractl->_nr_pages - index;
			continue;
		}
		if (i == mark)
			folio_set_readahead(folio);
		ractl->_workingset |= folio_test_workingset(folio);
		ractl->_nr_pages += min_nrpages;
		i += min_nrpages;
	}

	/*
	 * Now start the IO.  We ignore I/O errors - if the folio is not
	 * uptodate then the caller will launch read_folio again, and
	 * will then handle the error.
	 */
	read_pages(ractl);
	memalloc_nofs_restore(nofs);
}
EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);

/*
 * do_page_cache_ra() actually reads a chunk of disk.  It allocates
 * the pages first, then submits them for I/O. This avoids the very bad
 * behaviour which would occur if page allocations are causing VM writeback.
 * We really don't want to intermingle reads and writes like that.
 */
static void do_page_cache_ra(struct readahead_control *ractl,
		unsigned long nr_to_read, unsigned long lookahead_size)
{
	struct address_space *mapping = ractl->mapping;
	unsigned long index = readahead_index(ractl);
	loff_t isize = i_size_read(mapping->host);
	pgoff_t end_index;	/* The last page we want to read */

	if (isize == 0)
		return;

	end_index = (isize - 1) >> PAGE_SHIFT;
	if (index > end_index)
		return;
	/* Don't read past the page containing the last byte of the file */
	if (nr_to_read > end_index - index)
		nr_to_read = end_index - index + 1;

	filemap_invalidate_lock_shared(mapping);
	page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size);
	filemap_invalidate_unlock_shared(mapping);
}

/*
 * Chunk the readahead into 2 megabyte units, so that we don't pin too much
 * memory at once.
 */
void force_page_cache_ra(struct readahead_control *ractl,
		unsigned long nr_to_read)
{
	struct address_space *mapping = ractl->mapping;
	struct file_ra_state *ra = ractl->ra;
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	unsigned long max_pages;

	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
		return;

	/*
	 * If the request exceeds the readahead window, allow the read to
	 * be up to the optimal hardware IO size
	 */
	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
	while (nr_to_read) {
		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;

		if (this_chunk > nr_to_read)
			this_chunk = nr_to_read;
		do_page_cache_ra(ractl, this_chunk, 0);

		nr_to_read -= this_chunk;
	}
}

/*
 * Set the initial window size, round to next power of 2 and square
 * for small size, x 4 for medium, and x 2 for large
 * for 128k (32 page) max ra
 * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
 */
static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
{
	unsigned long newsize = roundup_pow_of_two(size);

	if (newsize <= max / 32)
		newsize = newsize * 4;
	else if (newsize <= max / 4)
		newsize = newsize * 2;
	else
		newsize = max;

	return newsize;
}

/*
 *  Get the previous window size, ramp it up, and
 *  return it as the new window size.
 */
static unsigned long get_next_ra_size(struct file_ra_state *ra,
				      unsigned long max)
{
	unsigned long cur = ra->size;

	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}

/*
 * On-demand readahead design.
 *
 * The fields in struct file_ra_state represent the most-recently-executed
 * readahead attempt:
 *
 *                        |<----- async_size ---------|
 *     |------------------- size -------------------->|
 *     |==================#===========================|
 *     ^start             ^page marked with PG_readahead
 *
 * To overlap application thinking time and disk I/O time, we do
 * `readahead pipelining': Do not wait until the application consumed all
 * readahead pages and stalled on the missing page at readahead_index;
 * Instead, submit an asynchronous readahead I/O as soon as there are
 * only async_size pages left in the readahead window. Normally async_size
 * will be equal to size, for maximum pipelining.
 *
 * In interleaved sequential reads, concurrent streams on the same fd can
 * be invalidating each other's readahead state. So we flag the new readahead
 * page at (start+size-async_size) with PG_readahead, and use it as readahead
 * indicator. The flag won't be set on already cached pages, to avoid the
 * readahead-for-nothing fuss, saving pointless page cache lookups.
 *
 * prev_pos tracks the last visited byte in the _previous_ read request.
 * It should be maintained by the caller, and will be used for detecting
 * small random reads. Note that the readahead algorithm checks loosely
 * for sequential patterns. Hence interleaved reads might be served as
 * sequential ones.
 *
 * There is a special-case: if the first page which the application tries to
 * read happens to be the first page of the file, it is assumed that a linear
 * read is about to happen and the window is immediately set to the initial size
 * based on I/O request size and the max_readahead.
 *
 * The code ramps up the readahead size aggressively at first, but slow down as
 * it approaches max_readahead.
 */

static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
		pgoff_t mark, unsigned int order, gfp_t gfp)
{
	int err;
	struct folio *folio = ractl_alloc_folio(ractl, gfp, order);

	if (!folio)
		return -ENOMEM;
	mark = round_down(mark, 1UL << order);
	if (index == mark)
		folio_set_readahead(folio);
	err = filemap_add_folio(ractl->mapping, folio, index, gfp);
	if (err) {
		folio_put(folio);
		return err;
	}

	ractl->_nr_pages += 1UL << order;
	ractl->_workingset |= folio_test_workingset(folio);
	return 0;
}

void page_cache_ra_order(struct readahead_control *ractl,
		struct file_ra_state *ra)
{
	struct address_space *mapping = ractl->mapping;
	pgoff_t start = readahead_index(ractl);
	pgoff_t index = start;
	unsigned int min_order = mapping_min_folio_order(mapping);
	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
	pgoff_t mark = index + ra->size - ra->async_size;
	unsigned int nofs;
	int err = 0;
	gfp_t gfp = readahead_gfp_mask(mapping);
	unsigned int new_order = ra->order;

	trace_page_cache_ra_order(mapping->host, start, ra);
	if (!mapping_large_folio_support(mapping)) {
		ra->order = 0;
		goto fallback;
	}

	limit = min(limit, index + ra->size - 1);

	new_order = min(mapping_max_folio_order(mapping), new_order);
	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
	new_order = max(new_order, min_order);

	ra->order = new_order;

	/* See comment in page_cache_ra_unbounded() */
	nofs = memalloc_nofs_save();
	filemap_invalidate_lock_shared(mapping);
	/*
	 * If the new_order is greater than min_order and index is
	 * already aligned to new_order, then this will be noop as index
	 * aligned to new_order should also be aligned to min_order.
	 */
	ractl->_index = mapping_align_index(mapping, index);
	index = readahead_index(ractl);

	while (index <= limit) {
		unsigned int order = new_order;

		/* Align with smaller pages if needed */
		if (index & ((1UL << order) - 1))
			order = __ffs(index);
		/* Don't allocate pages past EOF */
		while (order > min_order && index + (1UL << order) - 1 > limit)
			order--;
		err = ra_alloc_folio(ractl, index, mark, order, gfp);
		if (err)
			break;
		index += 1UL << order;
	}

	read_pages(ractl);
	filemap_invalidate_unlock_shared(mapping);
	memalloc_nofs_restore(nofs);

	/*
	 * If there were already pages in the page cache, then we may have
	 * left some gaps.  Let the regular readahead code take care of this
	 * situation below.
	 */
	if (!err)
		return;
fallback:
	/*
	 * ->readahead() may have updated readahead window size so we have to
	 * check there's still something to read.
	 */
	if (ra->size > index - start)
		do_page_cache_ra(ractl, ra->size - (index - start),
				 ra->async_size);
}

static unsigned long ractl_max_pages(struct readahead_control *ractl,
		unsigned long req_size)
{
	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
	unsigned long max_pages = ractl->ra->ra_pages;

	/*
	 * If the request exceeds the readahead window, allow the read to
	 * be up to the optimal hardware IO size
	 */
	if (req_size > max_pages && bdi->io_pages > max_pages)
		max_pages = min(req_size, bdi->io_pages);
	return max_pages;
}

void page_cache_sync_ra(struct readahead_control *ractl,
		unsigned long req_count)
{
	pgoff_t index = readahead_index(ractl);
	bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
	struct file_ra_state *ra = ractl->ra;
	unsigned long max_pages, contig_count;
	pgoff_t prev_index, miss;

	trace_page_cache_sync_ra(ractl->mapping->host, index, ra, req_count);
	/*
	 * Even if readahead is disabled, issue this request as readahead
	 * as we'll need it to satisfy the requested range. The forced
	 * readahead will do the right thing and limit the read to just the
	 * requested range, which we'll set to 1 page for this case.
	 */
	if (!ra->ra_pages || blk_cgroup_congested()) {
		if (!ractl->file)
			return;
		req_count = 1;
		do_forced_ra = true;
	}

	/* be dumb */
	if (do_forced_ra) {
		force_page_cache_ra(ractl, req_count);
		return;
	}

	max_pages = ractl_max_pages(ractl, req_count);
	prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT;
	/*
	 * A start of file, oversized read, or sequential cache miss:
	 * trivial case: (index - prev_index) == 1
	 * unaligned reads: (index - prev_index) == 0
	 */
	if (!index || req_count > max_pages || index - prev_index <= 1UL) {
		ra->start = index;
		ra->size = get_init_ra_size(req_count, max_pages);
		ra->async_size = ra->size > req_count ? ra->size - req_count :
							ra->size >> 1;
		goto readit;
	}

	/*
	 * Query the page cache and look for the traces(cached history pages)
	 * that a sequential stream would leave behind.
	 */
	rcu_read_lock();
	miss = page_cache_prev_miss(ractl->mapping, index - 1, max_pages);
	rcu_read_unlock();
	contig_count = index - miss - 1;
	/*
	 * Standalone, small random read. Read as is, and do not pollute the
	 * readahead state.
	 */
	if (contig_count <= req_count) {
		do_page_cache_ra(ractl, req_count, 0);
		return;
	}
	/*
	 * File cached from the beginning:
	 * it is a strong indication of long-run stream (or whole-file-read)
	 */
	if (miss == ULONG_MAX)
		contig_count *= 2;
	ra->start = index;
	ra->size = min(contig_count + req_count, max_pages);
	ra->async_size = 1;
readit:
	ra->order = 0;
	ractl->_index = ra->start;
	page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);

void page_cache_async_ra(struct readahead_control *ractl,
		struct folio *folio, unsigned long req_count)
{
	unsigned long max_pages;
	struct file_ra_state *ra = ractl->ra;
	pgoff_t index = readahead_index(ractl);
	pgoff_t expected, start, end, aligned_end, align;

	/* no readahead */
	if (!ra->ra_pages)
		return;

	/*
	 * Same bit is used for PG_readahead and PG_reclaim.
	 */
	if (folio_test_writeback(folio))
		return;

	trace_page_cache_async_ra(ractl->mapping->host, index, ra, req_count);
	folio_clear_readahead(folio);

	if (blk_cgroup_congested())
		return;

	max_pages = ractl_max_pages(ractl, req_count);
	/*
	 * It's the expected callback index, assume sequential access.
	 * Ramp up sizes, and push forward the readahead window.
	 */
	expected = round_down(ra->start + ra->size - ra->async_size,
			folio_nr_pages(folio));
	if (index == expected) {
		ra->start += ra->size;
		/*
		 * In the case of MADV_HUGEPAGE, the actual size might exceed
		 * the readahead window.
		 */
		ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
		goto readit;
	}

	/*
	 * Hit a marked folio without valid readahead state.
	 * E.g. interleaved reads.
	 * Query the pagecache for async_size, which normally equals to
	 * readahead size. Ramp it up and use it as the new readahead size.
	 */
	rcu_read_lock();
	start = page_cache_next_miss(ractl->mapping, index + 1, max_pages);
	rcu_read_unlock();

	if (!start || start - index > max_pages)
		return;

	ra->start = start;
	ra->size = start - index;	/* old async_size */
	ra->size += req_count;
	ra->size = get_next_ra_size(ra, max_pages);
readit:
	ra->order += 2;
	align = 1UL << min(ra->order, ffs(max_pages) - 1);
	end = ra->start + ra->size;
	aligned_end = round_down(end, align);
	if (aligned_end > ra->start)
		ra->size -= end - aligned_end;
	ra->async_size = ra->size;
	ractl->_index = ra->start;
	page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_async_ra);

ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
{
	struct file *file;
	const struct inode *inode;

	CLASS(fd, f)(fd);
	if (fd_empty(f))
		return -EBADF;

	file = fd_file(f);
	if (!(file->f_mode & FMODE_READ))
		return -EBADF;

	/*
	 * The readahead() syscall is intended to run only on files
	 * that can execute readahead. If readahead is not possible
	 * on this file, then we must return -EINVAL.
	 */
	if (!file->f_mapping)
		return -EINVAL;
	if (!file->f_mapping->a_ops)
		return -EINVAL;

	inode = file_inode(file);
	if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
		return -EINVAL;
	if (IS_ANON_FILE(inode))
		return -EINVAL;

	return vfs_fadvise(fd_file(f), offset, count, POSIX_FADV_WILLNEED);
}

SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
{
	return ksys_readahead(fd, offset, count);
}

#if defined(CONFIG_COMPAT) && defined(__ARCH_WANT_COMPAT_READAHEAD)
COMPAT_SYSCALL_DEFINE4(readahead, int, fd, compat_arg_u64_dual(offset), size_t, count)
{
	return ksys_readahead(fd, compat_arg_u64_glue(offset), count);
}
#endif

/**
 * readahead_expand - Expand a readahead request
 * @ractl: The request to be expanded
 * @new_start: The revised start
 * @new_len: The revised size of the request
 *
 * Attempt to expand a readahead request outwards from the current size to the
 * specified size by inserting locked pages before and after the current window
 * to increase the size to the new window.  This may involve the insertion of
 * THPs, in which case the window may get expanded even beyond what was
 * requested.
 *
 * The algorithm will stop if it encounters a conflicting page already in the
 * pagecache and leave a smaller expansion than requested.
 *
 * The caller must check for this by examining the revised @ractl object for a
 * different expansion than was requested.
 */
void readahead_expand(struct readahead_control *ractl,
		      loff_t new_start, size_t new_len)
{
	struct address_space *mapping = ractl->mapping;
	struct file_ra_state *ra = ractl->ra;
	pgoff_t new_index, new_nr_pages;
	gfp_t gfp_mask = readahead_gfp_mask(mapping);
	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
	unsigned int min_order = mapping_min_folio_order(mapping);

	new_index = new_start / PAGE_SIZE;
	/*
	 * Readahead code should have aligned the ractl->_index to
	 * min_nrpages before calling readahead aops.
	 */
	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));

	/* Expand the leading edge downwards */
	while (ractl->_index > new_index) {
		unsigned long index = ractl->_index - 1;
		struct folio *folio = xa_load(&mapping->i_pages, index);

		if (folio && !xa_is_value(folio))
			return; /* Folio apparently present */

		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
		if (!folio)
			return;

		index = mapping_align_index(mapping, index);
		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
			folio_put(folio);
			return;
		}
		if (unlikely(folio_test_workingset(folio)) &&
				!ractl->_workingset) {
			ractl->_workingset = true;
			psi_memstall_enter(&ractl->_pflags);
		}
		ractl->_nr_pages += min_nrpages;
		ractl->_index = folio->index;
	}

	new_len += new_start - readahead_pos(ractl);
	new_nr_pages = DIV_ROUND_UP(new_len, PAGE_SIZE);

	/* Expand the trailing edge upwards */
	while (ractl->_nr_pages < new_nr_pages) {
		unsigned long index = ractl->_index + ractl->_nr_pages;
		struct folio *folio = xa_load(&mapping->i_pages, index);

		if (folio && !xa_is_value(folio))
			return; /* Folio apparently present */

		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
		if (!folio)
			return;

		index = mapping_align_index(mapping, index);
		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
			folio_put(folio);
			return;
		}
		if (unlikely(folio_test_workingset(folio)) &&
				!ractl->_workingset) {
			ractl->_workingset = true;
			psi_memstall_enter(&ractl->_pflags);
		}
		ractl->_nr_pages += min_nrpages;
		if (ra) {
			ra->size += min_nrpages;
			ra->async_size += min_nrpages;
		}
	}
}
EXPORT_SYMBOL(readahead_expand);
]

gemini-3-flash-preview

							input: 65043
							output: 17
							thoughts: 1820

**Analyzing the Data Race on `file_ra_state->prev_pos`**

Okay, so I'm digging into this data race on `prev_pos` within the `file_ra_state` struct. From what I'm seeing, `prev_pos` is primarily used as a hint for the readahead algorithm, specifically to detect sequential access patterns. It's written in `filemap_read` and `filemap_splice_read`, and read in `page_cache_sync_ra`. The issue arises when multiple threads are reading from the same file using a shared file descriptor; they're all vying to update `prev_pos`.

My initial thought is that this is likely a benign data race. Here's why: `prev_pos` is a `loff_t` (64-bit on x86_64), and its offset within `file_ra_state` (offset 24) is 8-byte aligned. This means writes to it should be atomic on x86_64. So, a corrupted write is unlikely. Even if the value is incorrect, it just affects readahead. It's not part of the file's *correct* contents.

Looking at the code, in `page_cache_sync_ra`, it's used to calculate `prev_index`. This is then compared with the current `index` to determine if readahead should be triggered. Specifically: `if (!index || req_count > max_pages || index - prev_index <= 1UL)`. A corrupted `prev_pos` could lead to an incorrect `prev_index`, and hence, a wrong readahead decision. But, if `index` and `prev_index` are both `pgoff_t` and `index < prev_index`, `index - prev_index <= 1UL` will be false as there will be a large number after the subtraction. The readahead decision would simply be missed, or occasionally triggered when it could have been skipped. That's not a correctness issue, just a performance one.

The kernel developers seem aware of this "interleaved reads" behaviour, which is confirmed by the comment in `mm/readahead.c`. There's no lock, and no atomic operations. This reinforces the idea that it's a trade-off. Readahead is a performance optimization. Not a necessity. So this looks like a classic "benign" data race on a statistics/hint field.

However, I'm making sure to check for any other uses of `prev_pos`. The report only mentions this. I've also noted that `filemap_splice_read` updates `prev_pos` *within* its loop, while `filemap_read` updates it *after* its loop. This is consistent across both functions. If I *were* going to write to it, I'd probably write the most recent value after I'm done processing the data, so it makes sense to me.

I also checked to make sure that the subtraction done with the `index` and `prev_index` in `page_cache_sync_ra` does not lead to any underflow, and this is confirmed. The condition is just used to decide whether to do read ahead. In summary, I think KCSAN reported a race that is unlikely to be harmful, given how `prev_pos` is used, the atomicity of writes on x86_64, and the lack of other uses of this value.

map[Expression:prev_pos]

map[Output:drivers/gpu/drm/xe/xe_vm.c=2796=static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
drivers/gpu/drm/xe/xe_vm.c:2797:			     bool post_commit, bool prev_post_commit,
drivers/gpu/drm/xe/xe_vm.c-2798-			     bool next_post_commit)
--
drivers/gpu/drm/xe/xe_vm.c-2826-		if (op->remap.prev) {
drivers/gpu/drm/xe/xe_vm.c:2827:			prep_vma_destroy(vm, op->remap.prev, prev_post_commit);
drivers/gpu/drm/xe/xe_vm.c-2828-			xe_vma_destroy_unlocked(op->remap.prev);
--
drivers/media/platform/samsung/exynos4-is/fimc-is.h=160=struct is_af_info {
--
drivers/media/platform/samsung/exynos4-is/fimc-is.h-167-	u16 pos_y;
drivers/media/platform/samsung/exynos4-is/fimc-is.h:168:	u16 prev_pos_x;
drivers/media/platform/samsung/exynos4-is/fimc-is.h:169:	u16 prev_pos_y;
drivers/media/platform/samsung/exynos4-is/fimc-is.h-170-	u16 use_af;
--
drivers/pci/pci.h=117=bool pcie_cap_has_rtctl(const struct pci_dev *dev);
--
drivers/pci/pci.h-138-	u8 __id,  __found_pos = 0;					\
drivers/pci/pci.h:139:	u8 __prev_pos = (start);					\
drivers/pci/pci.h-140-	u8 __pos = (start);						\
--
drivers/pci/pci.h-158-			if (prev_ptr != NULL)				\
drivers/pci/pci.h:159:				*(u8 *)prev_ptr = __prev_pos;		\
drivers/pci/pci.h-160-			break;						\
--
drivers/pci/pci.h-162-									\
drivers/pci/pci.h:163:		__prev_pos = __pos;					\
drivers/pci/pci.h-164-		__pos = FIELD_GET(PCI_CAP_LIST_NEXT_MASK, __ent);	\
--
drivers/pci/pci.h-189-	u16 __found_pos = 0;						\
drivers/pci/pci.h:190:	u16 __prev_pos;							\
drivers/pci/pci.h-191-	int __ttl, __ret;						\
--
drivers/pci/pci.h-193-									\
drivers/pci/pci.h:194:	__prev_pos = __pos;						\
drivers/pci/pci.h-195-	__ttl = (PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) / 8;	\
--
drivers/pci/pci.h-206-			if (prev_ptr != NULL)				\
drivers/pci/pci.h:207:				*(u16 *)prev_ptr = __prev_pos;		\
drivers/pci/pci.h-208-			break;						\
--
drivers/pci/pci.h-210-									\
drivers/pci/pci.h:211:		__prev_pos = __pos;					\
drivers/pci/pci.h-212-		__pos = PCI_EXT_CAP_NEXT(__header);			\
--
drivers/scsi/aic7xxx/aic79xx_core.c=8000=ahd_qinfifo_requeue_tail(struct ahd_softc *ahd, struct scb *scb)
--
drivers/scsi/aic7xxx/aic79xx_core.c-8009-		u_int prev_tag;
drivers/scsi/aic7xxx/aic79xx_core.c:8010:		u_int prev_pos;
drivers/scsi/aic7xxx/aic79xx_core.c-8011-
drivers/scsi/aic7xxx/aic79xx_core.c:8012:		prev_pos = AHD_QIN_WRAP(ahd->qinfifonext - 1);
drivers/scsi/aic7xxx/aic79xx_core.c:8013:		prev_tag = ahd->qinfifo[prev_pos];
drivers/scsi/aic7xxx/aic79xx_core.c-8014-		prev_scb = ahd_lookup_scb(ahd, prev_tag);
--
drivers/scsi/aic7xxx/aic7xxx_core.c=5745=ahc_qinfifo_requeue_tail(struct ahc_softc *ahc, struct scb *scb)
--
drivers/scsi/aic7xxx/aic7xxx_core.c-5751-		u_int prev_tag;
drivers/scsi/aic7xxx/aic7xxx_core.c:5752:		uint8_t prev_pos;
drivers/scsi/aic7xxx/aic7xxx_core.c-5753-
drivers/scsi/aic7xxx/aic7xxx_core.c:5754:		prev_pos = ahc->qinfifonext - 1;
drivers/scsi/aic7xxx/aic7xxx_core.c:5755:		prev_tag = ahc->qinfifo[prev_pos];
drivers/scsi/aic7xxx/aic7xxx_core.c-5756-		prev_scb = ahc_lookup_scb(ahc, prev_tag);
--
fs/ext4/dir.c=130=static int ext4_readdir(struct file *file, struct dir_context *ctx)
--
fs/ext4/dir.c-202-					1 << EXT4_SB(sb)->s_min_folio_order);
fs/ext4/dir.c:203:			file->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT;
fs/ext4/dir.c-204-			bh = ext4_bread(NULL, inode, map.m_lblk, 0);
--
fs/fuse/file.c=2162=static bool fuse_folios_need_send(struct fuse_conn *fc, loff_t pos,
--
fs/fuse/file.c-2168-	unsigned bytes = cur_bytes + len;
fs/fuse/file.c:2169:	loff_t prev_pos;
fs/fuse/file.c-2170-	size_t max_bytes = write ? fc->max_write : fc->max_read;
--
fs/fuse/file.c-2183-	prev_desc = ap->descs[ap->num_folios - 1];
fs/fuse/file.c:2184:	prev_pos = folio_pos(prev_folio) + prev_desc.offset + prev_desc.length;
fs/fuse/file.c:2185:	if (prev_pos != pos)
fs/fuse/file.c-2186-		return true;
--
fs/ntfs3/fsntfs.c=1184=int ntfs_read_run_nb_ra(struct ntfs_sb_info *sbi, const struct runs_tree *run,
--
fs/ntfs3/fsntfs.c-1241-							  index, 1);
fs/ntfs3/fsntfs.c:1242:				ra->prev_pos = (loff_t)index << PAGE_SHIFT;
fs/ntfs3/fsntfs.c-1243-			}
--
fs/splice.c=1025=ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
--
fs/splice.c-1081-		size_t read_len;
fs/splice.c:1082:		loff_t pos = sd->pos, prev_pos = pos;
fs/splice.c-1083-
--
fs/splice.c-1104-		if (unlikely(ret <= 0)) {
fs/splice.c:1105:			sd->pos = prev_pos;
fs/splice.c-1106-			goto out_release;
--
fs/splice.c-1113-		if (ret < read_len) {
fs/splice.c:1114:			sd->pos = prev_pos + ret;
fs/splice.c-1115-			goto out_release;
--
include/linux/fs.h=1188=struct fown_struct {
--
include/linux/fs.h-1206- * @mmap_miss: How many mmap accesses missed in the page cache.
include/linux/fs.h:1207: * @prev_pos: The last byte in the most recent read request.
include/linux/fs.h-1208- *
--
include/linux/fs.h=1212=struct file_ra_state {
--
include/linux/fs.h-1218-	unsigned short mmap_miss;
include/linux/fs.h:1219:	loff_t prev_pos;
include/linux/fs.h-1220-};
--
include/trace/events/readahead.h=76=DECLARE_EVENT_CLASS(page_cache_ra_op,
--
include/trace/events/readahead.h-90-		__field(unsigned int, mmap_miss)
include/trace/events/readahead.h:91:		__field(loff_t, prev_pos)
include/trace/events/readahead.h-92-		__field(unsigned long, req_count)
--
include/trace/events/readahead.h-103-		__entry->mmap_miss = ra->mmap_miss;
include/trace/events/readahead.h:104:		__entry->prev_pos = ra->prev_pos;
include/trace/events/readahead.h-105-		__entry->req_count = req_count;
--
include/trace/events/readahead.h-108-	TP_printk(
include/trace/events/readahead.h:109:		"dev=%d:%d ino=%lx index=%lu req_count=%lu order=%u size=%u async_size=%u ra_pages=%u mmap_miss=%u prev_pos=%lld",
include/trace/events/readahead.h-110-		MAJOR(__entry->s_dev), MINOR(__entry->s_dev), __entry->i_ino,
--
include/trace/events/readahead.h-112-		__entry->size, __entry->async_size, __entry->ra_pages,
include/trace/events/readahead.h:113:		__entry->mmap_miss, __entry->prev_pos
include/trace/events/readahead.h-114-	)
--
lib/xz/xz_dec_bcj.c=88=static size_t bcj_x86(struct xz_dec_bcj *s, uint8_t *buf, size_t size)
--
lib/xz/xz_dec_bcj.c-95-	size_t i;
lib/xz/xz_dec_bcj.c:96:	size_t prev_pos = (size_t)-1;
lib/xz/xz_dec_bcj.c-97-	uint32_t prev_mask = s->x86_prev_mask;
--
lib/xz/xz_dec_bcj.c-110-
lib/xz/xz_dec_bcj.c:111:		prev_pos = i - prev_pos;
lib/xz/xz_dec_bcj.c:112:		if (prev_pos > 3) {
lib/xz/xz_dec_bcj.c-113-			prev_mask = 0;
lib/xz/xz_dec_bcj.c-114-		} else {
lib/xz/xz_dec_bcj.c:115:			prev_mask = (prev_mask << (prev_pos - 1)) & 7;
lib/xz/xz_dec_bcj.c-116-			if (prev_mask != 0) {
--
lib/xz/xz_dec_bcj.c-119-						|| bcj_x86_test_msbyte(b)) {
lib/xz/xz_dec_bcj.c:120:					prev_pos = i;
lib/xz/xz_dec_bcj.c-121-					prev_mask = (prev_mask << 1) | 1;
--
lib/xz/xz_dec_bcj.c-126-
lib/xz/xz_dec_bcj.c:127:		prev_pos = i;
lib/xz/xz_dec_bcj.c-128-
--
lib/xz/xz_dec_bcj.c-152-
lib/xz/xz_dec_bcj.c:153:	prev_pos = i - prev_pos;
lib/xz/xz_dec_bcj.c:154:	s->x86_prev_mask = prev_pos > 3 ? 0 : prev_mask << (prev_pos - 1);
lib/xz/xz_dec_bcj.c-155-	return i;
--
mm/filemap.c=2763=ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
--
mm/filemap.c-2773-	loff_t isize, end_offset;
mm/filemap.c:2774:	loff_t last_pos = ra->prev_pos;
mm/filemap.c-2775-
--
mm/filemap.c-2872-	file_accessed(filp);
mm/filemap.c:2873:	ra->prev_pos = last_pos;
mm/filemap.c-2874-	return already_read ? already_read : error;
--
mm/filemap.c=3048=ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
--
mm/filemap.c-3124-			*ppos += n;
mm/filemap.c:3125:			in->f_ra.prev_pos = *ppos;
mm/filemap.c-3126-			if (pipe_is_full(pipe))
--
mm/readahead.c=142=file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
--
mm/readahead.c-144-	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
mm/readahead.c:145:	ra->prev_pos = -1;
mm/readahead.c-146-}
--
mm/readahead.c=394=static unsigned long get_next_ra_size(struct file_ra_state *ra,
--
mm/readahead.c-429- *
mm/readahead.c:430: * prev_pos tracks the last visited byte in the _previous_ read request.
mm/readahead.c-431- * It should be maintained by the caller, and will be used for detecting
--
mm/readahead.c=557=void page_cache_sync_ra(struct readahead_control *ractl,
--
mm/readahead.c-586-	max_pages = ractl_max_pages(ractl, req_count);
mm/readahead.c:587:	prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT;
mm/readahead.c-588-	/*
--
mm/shmem.c=3553=static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
--
mm/shmem.c-3650-		*ppos += n;
mm/shmem.c:3651:		in->f_ra.prev_pos = *ppos;
mm/shmem.c-3652-		if (pipe_is_full(pipe))
--
sound/soc/intel/avs/core.c=239=static void hdac_stream_update_pos(struct hdac_stream *stream, u64 buffer_size)
sound/soc/intel/avs/core.c-240-{
sound/soc/intel/avs/core.c:241:	u64 prev_pos, pos, num_bytes;
sound/soc/intel/avs/core.c-242-
sound/soc/intel/avs/core.c:243:	div64_u64_rem(stream->curr_pos, buffer_size, &prev_pos);
sound/soc/intel/avs/core.c-244-	pos = snd_hdac_stream_get_pos_posbuf(stream);
sound/soc/intel/avs/core.c-245-
sound/soc/intel/avs/core.c:246:	if (pos < prev_pos)
sound/soc/intel/avs/core.c:247:		num_bytes = (buffer_size - prev_pos) +  pos;
sound/soc/intel/avs/core.c-248-	else
sound/soc/intel/avs/core.c:249:		num_bytes = pos - prev_pos;
sound/soc/intel/avs/core.c-250-
--
sound/soc/sof/compress.c=15=static void sof_set_transferred_bytes(struct sof_compr_stream *sstream,
--
sound/soc/sof/compress.c-17-{
sound/soc/sof/compress.c:18:	u64 prev_pos;
sound/soc/sof/compress.c-19-	unsigned int copied;
sound/soc/sof/compress.c-20-
sound/soc/sof/compress.c:21:	div64_u64_rem(sstream->copied_total, buffer_size, &prev_pos);
sound/soc/sof/compress.c-22-
sound/soc/sof/compress.c:23:	if (host_pos < prev_pos)
sound/soc/sof/compress.c:24:		copied = (buffer_size - prev_pos) + host_pos;
sound/soc/sof/compress.c-25-	else
sound/soc/sof/compress.c:26:		copied = host_pos - prev_pos;
sound/soc/sof/compress.c-27-
--
sound/soc/sof/intel/hda-stream.c=800=hda_dsp_compr_bytes_transferred(struct hdac_stream *hstream, int direction)
--
sound/soc/sof/intel/hda-stream.c-802-	u64 buffer_size = hstream->bufsize;
sound/soc/sof/intel/hda-stream.c:803:	u64 prev_pos, pos, num_bytes;
sound/soc/sof/intel/hda-stream.c-804-
sound/soc/sof/intel/hda-stream.c:805:	div64_u64_rem(hstream->curr_pos, buffer_size, &prev_pos);
sound/soc/sof/intel/hda-stream.c-806-	pos = hda_dsp_stream_get_position(hstream, direction, false);
sound/soc/sof/intel/hda-stream.c-807-
sound/soc/sof/intel/hda-stream.c:808:	if (pos < prev_pos)
sound/soc/sof/intel/hda-stream.c:809:		num_bytes = (buffer_size - prev_pos) +  pos;
sound/soc/sof/intel/hda-stream.c-810-	else
sound/soc/sof/intel/hda-stream.c:811:		num_bytes = pos - prev_pos;
sound/soc/sof/intel/hda-stream.c-812-
]

gemini-3-flash-preview

							input: 72916
							output: 20
							thoughts: 4888

**Benign Data Race in `struct file_ra_state`**

Okay, here's the deal. I've been looking into this KCSAN report about a data race on the `prev_pos` field within the `struct file_ra_state` (`f_ra`) structure. From what I've gathered, this race is expected and, for the most part, harmless.

Essentially, `prev_pos`, along with other fields like `start`, `size`, `async_size`, `order`, and `mmap_miss` in `f_ra`, are used to optimize readahead. They are read and updated without explicit locking, which causes the data race. The key functions involved in the readahead process are `filemap_read`, `filemap_splice_read`, and `page_cache_sync_ra`, and a few others.

The concern is this: if multiple threads use the same file descriptor (and thus share the `f_ra` structure), they can interfere with each other's readahead hints. For example, `filemap_splice_read` and `filemap_read` concurrently updating `prev_pos`.

I initially worried that this could lead to out-of-bounds reads or other serious issues, but after tracing the readahead logic, I'm confident that the consequences are limited to inefficient readahead. The calculations of the read range are bounded by the file size, and the `order` is sanitized to stay within reasonable limits. There is code in `page_cache_sync_ra` that tries to account for races.

Specifically, integer overflows due to the large negative value `-1` initially assigned to `prev_pos` are handled correctly, as there is a check in place. The race is between two writers, updating a hint. Each write is, essentially, the most recent reading position of a thread using this shared file descriptor. Because these are all advancing positions within a file, this race should not have any problematic consequences.

I did a deep dive and looked at the code paths that use `prev_pos`, the values that it is used with, and how they relate. This includes how values of read position are handled. Based on my analysis, these values are used to decide whether the fast or slow path is followed. The readahead state is updated in either case. Even with the "wrong" `prev_pos` value, it still converges to the readahead state.

I also double-checked whether any other `f_ra` fields are similarly raced, and yes, `mmap_miss` is also updated without locks, using `READ_ONCE` and `WRITE_ONCE`. While `WRITE_ONCE` is not enough to avoid data races, the races on `mmap_miss` are understood to be fine. This is likely an older code path. The key thing is that any "lost updates" only impact the readahead effectiveness, not correctness or security.

I looked at the code path for splice, and verified that these races are acceptable, even in high concurrency scenarios. `prev_pos` is updated frequently in splice. `splice_direct_to_actor` in `fs/splice.c` uses a local variable that is separate from `in->f_ra.prev_pos`. The race also could happen in `splice`, but only with shared file descriptors, which do not have any implications for correctness, safety, or security.

Regarding the KCSAN report itself, the "torn write" scenario (partially written values) isn't a concern on x86-64, because 8-byte writes are atomic. So, no torn writes or other issues like that.

In conclusion, the data race is benign. It's a known trade-off for performance, and the code is structured to handle this shared mutable state reasonably well. While technically a race, it's not a bug.

map[File:fs/splice.c]

map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
 * "splice": joining two ropes together by interweaving their strands.
 *
 * This is the "extended pipe" functionality, where a pipe is used as
 * an arbitrary in-memory buffer. Think of a pipe as a small kernel
 * buffer that you can use to transfer data from one end to the other.
 *
 * The traditional unix read/write is extended with a "splice()" operation
 * that transfers data buffers to or from a pipe buffer.
 *
 * Named by Larry McVoy, original implementation from Linus, extended by
 * Jens to support splicing to files, network, direct splicing, etc and
 * fixing lots of bugs.
 *
 * Copyright (C) 2005-2006 Jens Axboe <axboe@kernel.dk>
 * Copyright (C) 2005-2006 Linus Torvalds <torvalds@osdl.org>
 * Copyright (C) 2006 Ingo Molnar <mingo@elte.hu>
 *
 */
#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/pagemap.h>
#include <linux/splice.h>
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/export.h>
#include <linux/syscalls.h>
#include <linux/uio.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
#include <linux/gfp.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/sched/signal.h>

#include "internal.h"

/*
 * Splice doesn't support FMODE_NOWAIT. Since pipes may set this flag to
 * indicate they support non-blocking reads or writes, we must clear it
 * here if set to avoid blocking other users of this pipe if splice is
 * being done on it.
 */
static noinline void pipe_clear_nowait(struct file *file)
{
	fmode_t fmode = READ_ONCE(file->f_mode);

	do {
		if (!(fmode & FMODE_NOWAIT))
			break;
	} while (!try_cmpxchg(&file->f_mode, &fmode, fmode & ~FMODE_NOWAIT));
}

/*
 * Attempt to steal a page from a pipe buffer. This should perhaps go into
 * a vm helper function, it's already simplified quite a bit by the
 * addition of remove_mapping(). If success is returned, the caller may
 * attempt to reuse this page for another destination.
 */
static bool page_cache_pipe_buf_try_steal(struct pipe_inode_info *pipe,
		struct pipe_buffer *buf)
{
	struct folio *folio = page_folio(buf->page);
	struct address_space *mapping;

	folio_lock(folio);

	mapping = folio_mapping(folio);
	if (mapping) {
		WARN_ON(!folio_test_uptodate(folio));

		/*
		 * At least for ext2 with nobh option, we need to wait on
		 * writeback completing on this folio, since we'll remove it
		 * from the pagecache.  Otherwise truncate wont wait on the
		 * folio, allowing the disk blocks to be reused by someone else
		 * before we actually wrote our data to them. fs corruption
		 * ensues.
		 */
		folio_wait_writeback(folio);

		if (!filemap_release_folio(folio, GFP_KERNEL))
			goto out_unlock;

		/*
		 * If we succeeded in removing the mapping, set LRU flag
		 * and return good.
		 */
		if (remove_mapping(mapping, folio)) {
			buf->flags |= PIPE_BUF_FLAG_LRU;
			return true;
		}
	}

	/*
	 * Raced with truncate or failed to remove folio from current
	 * address space, unlock and return failure.
	 */
out_unlock:
	folio_unlock(folio);
	return false;
}

static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
					struct pipe_buffer *buf)
{
	put_page(buf->page);
	buf->flags &= ~PIPE_BUF_FLAG_LRU;
}

/*
 * Check whether the contents of buf is OK to access. Since the content
 * is a page cache page, IO may be in flight.
 */
static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
				       struct pipe_buffer *buf)
{
	struct folio *folio = page_folio(buf->page);
	int err;

	if (!folio_test_uptodate(folio)) {
		folio_lock(folio);

		/*
		 * Folio got truncated/unhashed. This will cause a 0-byte
		 * splice, if this is the first page.
		 */
		if (!folio->mapping) {
			err = -ENODATA;
			goto error;
		}

		/*
		 * Uh oh, read-error from disk.
		 */
		if (!folio_test_uptodate(folio)) {
			err = -EIO;
			goto error;
		}

		/* Folio is ok after all, we are done */
		folio_unlock(folio);
	}

	return 0;
error:
	folio_unlock(folio);
	return err;
}

const struct pipe_buf_operations page_cache_pipe_buf_ops = {
	.confirm	= page_cache_pipe_buf_confirm,
	.release	= page_cache_pipe_buf_release,
	.try_steal	= page_cache_pipe_buf_try_steal,
	.get		= generic_pipe_buf_get,
};

static bool user_page_pipe_buf_try_steal(struct pipe_inode_info *pipe,
		struct pipe_buffer *buf)
{
	if (!(buf->flags & PIPE_BUF_FLAG_GIFT))
		return false;

	buf->flags |= PIPE_BUF_FLAG_LRU;
	return generic_pipe_buf_try_steal(pipe, buf);
}

static const struct pipe_buf_operations user_page_pipe_buf_ops = {
	.release	= page_cache_pipe_buf_release,
	.try_steal	= user_page_pipe_buf_try_steal,
	.get		= generic_pipe_buf_get,
};

static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
{
	smp_mb();
	if (waitqueue_active(&pipe->rd_wait))
		wake_up_interruptible(&pipe->rd_wait);
	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}

/**
 * splice_to_pipe - fill passed data into a pipe
 * @pipe:	pipe to fill
 * @spd:	data to fill
 *
 * Description:
 *    @spd contains a map of pages and len/offset tuples, along with
 *    the struct pipe_buf_operations associated with these pages. This
 *    function will link that data to the pipe.
 *
 */
ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
		       struct splice_pipe_desc *spd)
{
	unsigned int spd_pages = spd->nr_pages;
	unsigned int tail = pipe->tail;
	unsigned int head = pipe->head;
	ssize_t ret = 0;
	int page_nr = 0;

	if (!spd_pages)
		return 0;

	if (unlikely(!pipe->readers)) {
		send_sig(SIGPIPE, current, 0);
		ret = -EPIPE;
		goto out;
	}

	while (!pipe_full(head, tail, pipe->max_usage)) {
		struct pipe_buffer *buf = pipe_buf(pipe, head);

		buf->page = spd->pages[page_nr];
		buf->offset = spd->partial[page_nr].offset;
		buf->len = spd->partial[page_nr].len;
		buf->private = spd->partial[page_nr].private;
		buf->ops = spd->ops;
		buf->flags = 0;

		head++;
		pipe->head = head;
		page_nr++;
		ret += buf->len;

		if (!--spd->nr_pages)
			break;
	}

	if (!ret)
		ret = -EAGAIN;

out:
	while (page_nr < spd_pages)
		spd->spd_release(spd, page_nr++);

	return ret;
}
EXPORT_SYMBOL_GPL(splice_to_pipe);

ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
{
	unsigned int head = pipe->head;
	unsigned int tail = pipe->tail;
	int ret;

	if (unlikely(!pipe->readers)) {
		send_sig(SIGPIPE, current, 0);
		ret = -EPIPE;
	} else if (pipe_full(head, tail, pipe->max_usage)) {
		ret = -EAGAIN;
	} else {
		*pipe_buf(pipe, head) = *buf;
		pipe->head = head + 1;
		return buf->len;
	}
	pipe_buf_release(pipe, buf);
	return ret;
}
EXPORT_SYMBOL(add_to_pipe);

/*
 * Check if we need to grow the arrays holding pages and partial page
 * descriptions.
 */
int splice_grow_spd(const struct pipe_inode_info *pipe, struct splice_pipe_desc *spd)
{
	unsigned int max_usage = READ_ONCE(pipe->max_usage);

	spd->nr_pages_max = max_usage;
	if (max_usage <= PIPE_DEF_BUFFERS)
		return 0;

	spd->pages = kmalloc_objs(struct page *, max_usage);
	spd->partial = kmalloc_objs(struct partial_page, max_usage);

	if (spd->pages && spd->partial)
		return 0;

	kfree(spd->pages);
	kfree(spd->partial);
	return -ENOMEM;
}

void splice_shrink_spd(struct splice_pipe_desc *spd)
{
	if (spd->nr_pages_max <= PIPE_DEF_BUFFERS)
		return;

	kfree(spd->pages);
	kfree(spd->partial);
}

/**
 * copy_splice_read -  Copy data from a file and splice the copy into a pipe
 * @in: The file to read from
 * @ppos: Pointer to the file position to read from
 * @pipe: The pipe to splice into
 * @len: The amount to splice
 * @flags: The SPLICE_F_* flags
 *
 * This function allocates a bunch of pages sufficient to hold the requested
 * amount of data (but limited by the remaining pipe capacity), passes it to
 * the file's ->read_iter() to read into and then splices the used pages into
 * the pipe.
 *
 * Return: On success, the number of bytes read will be returned and *@ppos
 * will be updated if appropriate; 0 will be returned if there is no more data
 * to be read; -EAGAIN will be returned if the pipe had no space, and some
 * other negative error code will be returned on error.  A short read may occur
 * if the pipe has insufficient space, we reach the end of the data or we hit a
 * hole.
 */
ssize_t copy_splice_read(struct file *in, loff_t *ppos,
			 struct pipe_inode_info *pipe,
			 size_t len, unsigned int flags)
{
	struct iov_iter to;
	struct bio_vec *bv;
	struct kiocb kiocb;
	struct page **pages;
	ssize_t ret;
	size_t used, npages, chunk, remain, keep = 0;
	int i;

	/* Work out how much data we can actually add into the pipe */
	used = pipe_buf_usage(pipe);
	npages = max_t(ssize_t, pipe->max_usage - used, 0);
	len = min_t(size_t, len, npages * PAGE_SIZE);
	npages = DIV_ROUND_UP(len, PAGE_SIZE);

	bv = kzalloc(array_size(npages, sizeof(bv[0])) +
		     array_size(npages, sizeof(struct page *)), GFP_KERNEL);
	if (!bv)
		return -ENOMEM;

	pages = (struct page **)(bv + npages);
	npages = alloc_pages_bulk(GFP_USER, npages, pages);
	if (!npages) {
		kfree(bv);
		return -ENOMEM;
	}

	remain = len = min_t(size_t, len, npages * PAGE_SIZE);

	for (i = 0; i < npages; i++) {
		chunk = min_t(size_t, PAGE_SIZE, remain);
		bv[i].bv_page = pages[i];
		bv[i].bv_offset = 0;
		bv[i].bv_len = chunk;
		remain -= chunk;
	}

	/* Do the I/O */
	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
	init_sync_kiocb(&kiocb, in);
	kiocb.ki_pos = *ppos;
	ret = in->f_op->read_iter(&kiocb, &to);

	if (ret > 0) {
		keep = DIV_ROUND_UP(ret, PAGE_SIZE);
		*ppos = kiocb.ki_pos;
	}

	/*
	 * Callers of ->splice_read() expect -EAGAIN on "can't put anything in
	 * there", rather than -EFAULT.
	 */
	if (ret == -EFAULT)
		ret = -EAGAIN;

	/* Free any pages that didn't get touched at all. */
	if (keep < npages)
		release_pages(pages + keep, npages - keep);

	/* Push the remaining pages into the pipe. */
	remain = ret;
	for (i = 0; i < keep; i++) {
		struct pipe_buffer *buf = pipe_head_buf(pipe);

		chunk = min_t(size_t, remain, PAGE_SIZE);
		*buf = (struct pipe_buffer) {
			.ops	= &default_pipe_buf_ops,
			.page	= bv[i].bv_page,
			.offset	= 0,
			.len	= chunk,
		};
		pipe->head++;
		remain -= chunk;
	}

	kfree(bv);
	return ret;
}
EXPORT_SYMBOL(copy_splice_read);

const struct pipe_buf_operations default_pipe_buf_ops = {
	.release	= generic_pipe_buf_release,
	.try_steal	= generic_pipe_buf_try_steal,
	.get		= generic_pipe_buf_get,
};

/* Pipe buffer operations for a socket and similar. */
const struct pipe_buf_operations nosteal_pipe_buf_ops = {
	.release	= generic_pipe_buf_release,
	.get		= generic_pipe_buf_get,
};
EXPORT_SYMBOL(nosteal_pipe_buf_ops);

static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
{
	smp_mb();
	if (waitqueue_active(&pipe->wr_wait))
		wake_up_interruptible(&pipe->wr_wait);
	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}

/**
 * splice_from_pipe_feed - feed available data from a pipe to a file
 * @pipe:	pipe to splice from
 * @sd:		information to @actor
 * @actor:	handler that splices the data
 *
 * Description:
 *    This function loops over the pipe and calls @actor to do the
 *    actual moving of a single struct pipe_buffer to the desired
 *    destination.  It returns when there's no more buffers left in
 *    the pipe or if the requested number of bytes (@sd->total_len)
 *    have been copied.  It returns a positive number (one) if the
 *    pipe needs to be filled with more data, zero if the required
 *    number of bytes have been copied and -errno on error.
 *
 *    This, together with splice_from_pipe_{begin,end,next}, may be
 *    used to implement the functionality of __splice_from_pipe() when
 *    locking is required around copying the pipe buffers to the
 *    destination.
 */
static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd,
			  splice_actor *actor)
{
	unsigned int head = pipe->head;
	unsigned int tail = pipe->tail;
	int ret;

	while (!pipe_empty(head, tail)) {
		struct pipe_buffer *buf = pipe_buf(pipe, tail);

		sd->len = buf->len;
		if (sd->len > sd->total_len)
			sd->len = sd->total_len;

		ret = pipe_buf_confirm(pipe, buf);
		if (unlikely(ret)) {
			if (ret == -ENODATA)
				ret = 0;
			return ret;
		}

		ret = actor(pipe, buf, sd);
		if (ret <= 0)
			return ret;

		buf->offset += ret;
		buf->len -= ret;

		sd->num_spliced += ret;
		sd->len -= ret;
		sd->pos += ret;
		sd->total_len -= ret;

		if (!buf->len) {
			pipe_buf_release(pipe, buf);
			tail++;
			pipe->tail = tail;
			if (pipe->files)
				sd->need_wakeup = true;
		}

		if (!sd->total_len)
			return 0;
	}

	return 1;
}

/* We know we have a pipe buffer, but maybe it's empty? */
static inline bool eat_empty_buffer(struct pipe_inode_info *pipe)
{
	unsigned int tail = pipe->tail;
	struct pipe_buffer *buf = pipe_buf(pipe, tail);

	if (unlikely(!buf->len)) {
		pipe_buf_release(pipe, buf);
		pipe->tail = tail+1;
		return true;
	}

	return false;
}

/**
 * splice_from_pipe_next - wait for some data to splice from
 * @pipe:	pipe to splice from
 * @sd:		information about the splice operation
 *
 * Description:
 *    This function will wait for some data and return a positive
 *    value (one) if pipe buffers are available.  It will return zero
 *    or -errno if no more data needs to be spliced.
 */
static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd)
{
	/*
	 * Check for signal early to make process killable when there are
	 * always buffers available
	 */
	if (signal_pending(current))
		return -ERESTARTSYS;

repeat:
	while (pipe_is_empty(pipe)) {
		if (!pipe->writers)
			return 0;

		if (sd->num_spliced)
			return 0;

		if (sd->flags & SPLICE_F_NONBLOCK)
			return -EAGAIN;

		if (signal_pending(current))
			return -ERESTARTSYS;

		if (sd->need_wakeup) {
			wakeup_pipe_writers(pipe);
			sd->need_wakeup = false;
		}

		pipe_wait_readable(pipe);
	}

	if (eat_empty_buffer(pipe))
		goto repeat;

	return 1;
}

/**
 * splice_from_pipe_begin - start splicing from pipe
 * @sd:		information about the splice operation
 *
 * Description:
 *    This function should be called before a loop containing
 *    splice_from_pipe_next() and splice_from_pipe_feed() to
 *    initialize the necessary fields of @sd.
 */
static void splice_from_pipe_begin(struct splice_desc *sd)
{
	sd->num_spliced = 0;
	sd->need_wakeup = false;
}

/**
 * splice_from_pipe_end - finish splicing from pipe
 * @pipe:	pipe to splice from
 * @sd:		information about the splice operation
 *
 * Description:
 *    This function will wake up pipe writers if necessary.  It should
 *    be called after a loop containing splice_from_pipe_next() and
 *    splice_from_pipe_feed().
 */
static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_desc *sd)
{
	if (sd->need_wakeup)
		wakeup_pipe_writers(pipe);
}

/**
 * __splice_from_pipe - splice data from a pipe to given actor
 * @pipe:	pipe to splice from
 * @sd:		information to @actor
 * @actor:	handler that splices the data
 *
 * Description:
 *    This function does little more than loop over the pipe and call
 *    @actor to do the actual moving of a single struct pipe_buffer to
 *    the desired destination. See pipe_to_file, pipe_to_sendmsg, or
 *    pipe_to_user.
 *
 */
ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
			   splice_actor *actor)
{
	int ret;

	splice_from_pipe_begin(sd);
	do {
		cond_resched();
		ret = splice_from_pipe_next(pipe, sd);
		if (ret > 0)
			ret = splice_from_pipe_feed(pipe, sd, actor);
	} while (ret > 0);
	splice_from_pipe_end(pipe, sd);

	return sd->num_spliced ? sd->num_spliced : ret;
}
EXPORT_SYMBOL(__splice_from_pipe);

/**
 * splice_from_pipe - splice data from a pipe to a file
 * @pipe:	pipe to splice from
 * @out:	file to splice to
 * @ppos:	position in @out
 * @len:	how many bytes to splice
 * @flags:	splice modifier flags
 * @actor:	handler that splices the data
 *
 * Description:
 *    See __splice_from_pipe. This function locks the pipe inode,
 *    otherwise it's identical to __splice_from_pipe().
 *
 */
ssize_t splice_from_pipe(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags,
			 splice_actor *actor)
{
	ssize_t ret;
	struct splice_desc sd = {
		.total_len = len,
		.flags = flags,
		.pos = *ppos,
		.u.file = out,
	};

	pipe_lock(pipe);
	ret = __splice_from_pipe(pipe, &sd, actor);
	pipe_unlock(pipe);

	return ret;
}

/**
 * iter_file_splice_write - splice data from a pipe to a file
 * @pipe:	pipe info
 * @out:	file to write to
 * @ppos:	position in @out
 * @len:	number of bytes to splice
 * @flags:	splice modifier flags
 *
 * Description:
 *    Will either move or copy pages (determined by @flags options) from
 *    the given pipe inode to the given file.
 *    This one is ->write_iter-based.
 *
 */
ssize_t
iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
			  loff_t *ppos, size_t len, unsigned int flags)
{
	struct splice_desc sd = {
		.total_len = len,
		.flags = flags,
		.pos = *ppos,
		.u.file = out,
	};
	int nbufs = pipe->max_usage;
	struct bio_vec *array;
	ssize_t ret;

	if (!out->f_op->write_iter)
		return -EINVAL;

	array = kzalloc_objs(struct bio_vec, nbufs);
	if (unlikely(!array))
		return -ENOMEM;

	pipe_lock(pipe);

	splice_from_pipe_begin(&sd);
	while (sd.total_len) {
		struct kiocb kiocb;
		struct iov_iter from;
		unsigned int head, tail;
		size_t left;
		int n;

		ret = splice_from_pipe_next(pipe, &sd);
		if (ret <= 0)
			break;

		if (unlikely(nbufs < pipe->max_usage)) {
			kfree(array);
			nbufs = pipe->max_usage;
			array = kzalloc_objs(struct bio_vec, nbufs);
			if (!array) {
				ret = -ENOMEM;
				break;
			}
		}

		head = pipe->head;
		tail = pipe->tail;

		/* build the vector */
		left = sd.total_len;
		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			size_t this_len = buf->len;

			/* zero-length bvecs are not supported, skip them */
			if (!this_len)
				continue;
			this_len = min(this_len, left);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				goto done;
			}

			bvec_set_page(&array[n], buf->page, this_len,
				      buf->offset);
			left -= this_len;
			n++;
		}

		iov_iter_bvec(&from, ITER_SOURCE, array, n, sd.total_len - left);
		init_sync_kiocb(&kiocb, out);
		kiocb.ki_pos = sd.pos;
		ret = out->f_op->write_iter(&kiocb, &from);
		sd.pos = kiocb.ki_pos;
		if (ret <= 0)
			break;
		WARN_ONCE(ret > sd.total_len - left,
			  "Splice Exceeded! ret=%zd tot=%zu left=%zu\n",
			  ret, sd.total_len, left);

		sd.num_spliced += ret;
		sd.total_len -= ret;
		*ppos = sd.pos;

		/* dismiss the fully eaten buffers, adjust the partial one */
		tail = pipe->tail;
		while (ret) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			if (ret >= buf->len) {
				ret -= buf->len;
				buf->len = 0;
				pipe_buf_release(pipe, buf);
				tail++;
				pipe->tail = tail;
				if (pipe->files)
					sd.need_wakeup = true;
			} else {
				buf->offset += ret;
				buf->len -= ret;
				ret = 0;
			}
		}
	}
done:
	kfree(array);
	splice_from_pipe_end(pipe, &sd);

	pipe_unlock(pipe);

	if (sd.num_spliced)
		ret = sd.num_spliced;

	return ret;
}

EXPORT_SYMBOL(iter_file_splice_write);

#ifdef CONFIG_NET
/**
 * splice_to_socket - splice data from a pipe to a socket
 * @pipe:	pipe to splice from
 * @out:	socket to write to
 * @ppos:	position in @out
 * @len:	number of bytes to splice
 * @flags:	splice modifier flags
 *
 * Description:
 *    Will send @len bytes from the pipe to a network socket. No data copying
 *    is involved.
 *
 */
ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags)
{
	struct socket *sock = sock_from_file(out);
	struct bio_vec bvec[16];
	struct msghdr msg = {};
	ssize_t ret = 0;
	size_t spliced = 0;
	bool need_wakeup = false;

	pipe_lock(pipe);

	while (len > 0) {
		unsigned int head, tail, bc = 0;
		size_t remain = len;

		/*
		 * Check for signal early to make process killable when there
		 * are always buffers available
		 */
		ret = -ERESTARTSYS;
		if (signal_pending(current))
			break;

		while (pipe_is_empty(pipe)) {
			ret = 0;
			if (!pipe->writers)
				goto out;

			if (spliced)
				goto out;

			ret = -EAGAIN;
			if (flags & SPLICE_F_NONBLOCK)
				goto out;

			ret = -ERESTARTSYS;
			if (signal_pending(current))
				goto out;

			if (need_wakeup) {
				wakeup_pipe_writers(pipe);
				need_wakeup = false;
			}

			pipe_wait_readable(pipe);
		}

		head = pipe->head;
		tail = pipe->tail;

		while (!pipe_empty(head, tail)) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			size_t seg;

			if (!buf->len) {
				tail++;
				continue;
			}

			seg = min_t(size_t, remain, buf->len);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				break;
			}

			bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
			remain -= seg;
			if (remain == 0 || bc >= ARRAY_SIZE(bvec))
				break;
			tail++;
		}

		if (!bc)
			break;

		msg.msg_flags = MSG_SPLICE_PAGES;
		if (flags & SPLICE_F_MORE)
			msg.msg_flags |= MSG_MORE;
		if (remain && pipe_occupancy(pipe->head, tail) > 0)
			msg.msg_flags |= MSG_MORE;
		if (out->f_flags & O_NONBLOCK)
			msg.msg_flags |= MSG_DONTWAIT;

		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc,
			      len - remain);
		ret = sock_sendmsg(sock, &msg);
		if (ret <= 0)
			break;

		spliced += ret;
		len -= ret;
		tail = pipe->tail;
		while (ret > 0) {
			struct pipe_buffer *buf = pipe_buf(pipe, tail);
			size_t seg = min_t(size_t, ret, buf->len);

			buf->offset += seg;
			buf->len -= seg;
			ret -= seg;

			if (!buf->len) {
				pipe_buf_release(pipe, buf);
				tail++;
			}
		}

		if (tail != pipe->tail) {
			pipe->tail = tail;
			if (pipe->files)
				need_wakeup = true;
		}
	}

out:
	pipe_unlock(pipe);
	if (need_wakeup)
		wakeup_pipe_writers(pipe);
	return spliced ?: ret;
}
#endif

static int warn_unsupported(struct file *file, const char *op)
{
	pr_debug_ratelimited(
		"splice %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
		op, file, current->pid, current->comm);
	return -EINVAL;
}

/*
 * Attempt to initiate a splice from pipe to file.
 */
static ssize_t do_splice_from(struct pipe_inode_info *pipe, struct file *out,
			      loff_t *ppos, size_t len, unsigned int flags)
{
	if (unlikely(!out->f_op->splice_write))
		return warn_unsupported(out, "write");
	return out->f_op->splice_write(pipe, out, ppos, len, flags);
}

/*
 * Indicate to the caller that there was a premature EOF when reading from the
 * source and the caller didn't indicate they would be sending more data after
 * this.
 */
static void do_splice_eof(struct splice_desc *sd)
{
	if (sd->splice_eof)
		sd->splice_eof(sd);
}

/*
 * Callers already called rw_verify_area() on the entire range.
 * No need to call it for sub ranges.
 */
static ssize_t do_splice_read(struct file *in, loff_t *ppos,
			      struct pipe_inode_info *pipe, size_t len,
			      unsigned int flags)
{
	unsigned int p_space;

	if (unlikely(!(in->f_mode & FMODE_READ)))
		return -EBADF;
	if (!len)
		return 0;

	/* Don't try to read more the pipe has space for. */
	p_space = pipe->max_usage - pipe_buf_usage(pipe);
	len = min_t(size_t, len, p_space << PAGE_SHIFT);

	if (unlikely(len > MAX_RW_COUNT))
		len = MAX_RW_COUNT;

	if (unlikely(!in->f_op->splice_read))
		return warn_unsupported(in, "read");
	/*
	 * O_DIRECT and DAX don't deal with the pagecache, so we allocate a
	 * buffer, copy into it and splice that into the pipe.
	 */
	if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
		return copy_splice_read(in, ppos, pipe, len, flags);
	return in->f_op->splice_read(in, ppos, pipe, len, flags);
}

/**
 * vfs_splice_read - Read data from a file and splice it into a pipe
 * @in:		File to splice from
 * @ppos:	Input file offset
 * @pipe:	Pipe to splice to
 * @len:	Number of bytes to splice
 * @flags:	Splice modifier flags (SPLICE_F_*)
 *
 * Splice the requested amount of data from the input file to the pipe.  This
 * is synchronous as the caller must hold the pipe lock across the entire
 * operation.
 *
 * If successful, it returns the amount of data spliced, 0 if it hit the EOF or
 * a hole and a negative error code otherwise.
 */
ssize_t vfs_splice_read(struct file *in, loff_t *ppos,
			struct pipe_inode_info *pipe, size_t len,
			unsigned int flags)
{
	ssize_t ret;

	ret = rw_verify_area(READ, in, ppos, len);
	if (unlikely(ret < 0))
		return ret;

	return do_splice_read(in, ppos, pipe, len, flags);
}
EXPORT_SYMBOL_GPL(vfs_splice_read);

/**
 * splice_direct_to_actor - splices data directly between two non-pipes
 * @in:		file to splice from
 * @sd:		actor information on where to splice to
 * @actor:	handles the data splicing
 *
 * Description:
 *    This is a special case helper to splice directly between two
 *    points, without requiring an explicit pipe. Internally an allocated
 *    pipe is cached in the process, and reused during the lifetime of
 *    that process.
 *
 */
ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
			       splice_direct_actor *actor)
{
	struct pipe_inode_info *pipe;
	ssize_t ret, bytes;
	size_t len;
	int i, flags, more;

	/*
	 * We require the input to be seekable, as we don't want to randomly
	 * drop data for eg socket -> socket splicing. Use the piped splicing
	 * for that!
	 */
	if (unlikely(!(in->f_mode & FMODE_LSEEK)))
		return -EINVAL;

	/*
	 * neither in nor out is a pipe, setup an internal pipe attached to
	 * 'out' and transfer the wanted data from 'in' to 'out' through that
	 */
	pipe = current->splice_pipe;
	if (unlikely(!pipe)) {
		pipe = alloc_pipe_info();
		if (!pipe)
			return -ENOMEM;

		/*
		 * We don't have an immediate reader, but we'll read the stuff
		 * out of the pipe right after the splice_to_pipe(). So set
		 * PIPE_READERS appropriately.
		 */
		pipe->readers = 1;

		current->splice_pipe = pipe;
	}

	/*
	 * Do the splice.
	 */
	bytes = 0;
	len = sd->total_len;

	/* Don't block on output, we have to drain the direct pipe. */
	flags = sd->flags;
	sd->flags &= ~SPLICE_F_NONBLOCK;

	/*
	 * We signal MORE until we've read sufficient data to fulfill the
	 * request and we keep signalling it if the caller set it.
	 */
	more = sd->flags & SPLICE_F_MORE;
	sd->flags |= SPLICE_F_MORE;

	WARN_ON_ONCE(!pipe_is_empty(pipe));

	while (len) {
		size_t read_len;
		loff_t pos = sd->pos, prev_pos = pos;

		ret = do_splice_read(in, &pos, pipe, len, flags);
		if (unlikely(ret <= 0))
			goto read_failure;

		read_len = ret;
		sd->total_len = read_len;

		/*
		 * If we now have sufficient data to fulfill the request then
		 * we clear SPLICE_F_MORE if it was not set initially.
		 */
		if (read_len >= len && !more)
			sd->flags &= ~SPLICE_F_MORE;

		/*
		 * NOTE: nonblocking mode only applies to the input. We
		 * must not do the output in nonblocking mode as then we
		 * could get stuck data in the internal pipe:
		 */
		ret = actor(pipe, sd);
		if (unlikely(ret <= 0)) {
			sd->pos = prev_pos;
			goto out_release;
		}

		bytes += ret;
		len -= ret;
		sd->pos = pos;

		if (ret < read_len) {
			sd->pos = prev_pos + ret;
			goto out_release;
		}
	}

done:
	pipe->tail = pipe->head = 0;
	file_accessed(in);
	return bytes;

read_failure:
	/*
	 * If the user did *not* set SPLICE_F_MORE *and* we didn't hit that
	 * "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
	 * "->splice_in()" that returned EOF (ie zero) *and* we have sent at
	 * least 1 byte *then* we will also do the ->splice_eof() call.
	 */
	if (ret == 0 && !more && len > 0 && bytes)
		do_splice_eof(sd);
out_release:
	/*
	 * If we did an incomplete transfer we must release
	 * the pipe buffers in question:
	 */
	for (i = 0; i < pipe->ring_size; i++) {
		struct pipe_buffer *buf = &pipe->bufs[i];

		if (buf->ops)
			pipe_buf_release(pipe, buf);
	}

	if (!bytes)
		bytes = ret;

	goto done;
}
EXPORT_SYMBOL(splice_direct_to_actor);

static int direct_splice_actor(struct pipe_inode_info *pipe,
			       struct splice_desc *sd)
{
	struct file *file = sd->u.file;
	long ret;

	file_start_write(file);
	ret = do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags);
	file_end_write(file);
	return ret;
}

static int splice_file_range_actor(struct pipe_inode_info *pipe,
					struct splice_desc *sd)
{
	struct file *file = sd->u.file;

	return do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags);
}

static void direct_file_splice_eof(struct splice_desc *sd)
{
	struct file *file = sd->u.file;

	if (file->f_op->splice_eof)
		file->f_op->splice_eof(file);
}

static ssize_t do_splice_direct_actor(struct file *in, loff_t *ppos,
				      struct file *out, loff_t *opos,
				      size_t len, unsigned int flags,
				      splice_direct_actor *actor)
{
	struct splice_desc sd = {
		.len		= len,
		.total_len	= len,
		.flags		= flags,
		.pos		= *ppos,
		.u.file		= out,
		.splice_eof	= direct_file_splice_eof,
		.opos		= opos,
	};
	ssize_t ret;

	if (unlikely(!(out->f_mode & FMODE_WRITE)))
		return -EBADF;

	if (unlikely(out->f_flags & O_APPEND))
		return -EINVAL;

	ret = splice_direct_to_actor(in, &sd, actor);
	if (ret > 0)
		*ppos = sd.pos;

	return ret;
}
/**
 * do_splice_direct - splices data directly between two files
 * @in:		file to splice from
 * @ppos:	input file offset
 * @out:	file to splice to
 * @opos:	output file offset
 * @len:	number of bytes to splice
 * @flags:	splice modifier flags
 *
 * Description:
 *    For use by do_sendfile(). splice can easily emulate sendfile, but
 *    doing it in the application would incur an extra system call
 *    (splice in + splice out, as compared to just sendfile()). So this helper
 *    can splice directly through a process-private pipe.
 *
 * Callers already called rw_verify_area() on the entire range.
 */
ssize_t do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
			 loff_t *opos, size_t len, unsigned int flags)
{
	return do_splice_direct_actor(in, ppos, out, opos, len, flags,
				      direct_splice_actor);
}
EXPORT_SYMBOL(do_splice_direct);

/**
 * splice_file_range - splices data between two files for copy_file_range()
 * @in:		file to splice from
 * @ppos:	input file offset
 * @out:	file to splice to
 * @opos:	output file offset
 * @len:	number of bytes to splice
 *
 * Description:
 *    For use by ->copy_file_range() methods.
 *    Like do_splice_direct(), but vfs_copy_file_range() already holds
 *    start_file_write() on @out file.
 *
 * Callers already called rw_verify_area() on the entire range.
 */
ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
			  loff_t *opos, size_t len)
{
	lockdep_assert(file_write_started(out));

	return do_splice_direct_actor(in, ppos, out, opos,
				      min_t(size_t, len, MAX_RW_COUNT),
				      0, splice_file_range_actor);
}
EXPORT_SYMBOL(splice_file_range);

static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
{
	for (;;) {
		if (unlikely(!pipe->readers)) {
			send_sig(SIGPIPE, current, 0);
			return -EPIPE;
		}
		if (!pipe_is_full(pipe))
			return 0;
		if (flags & SPLICE_F_NONBLOCK)
			return -EAGAIN;
		if (signal_pending(current))
			return -ERESTARTSYS;
		pipe_wait_writable(pipe);
	}
}

static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
			       struct pipe_inode_info *opipe,
			       size_t len, unsigned int flags);

ssize_t splice_file_to_pipe(struct file *in,
			    struct pipe_inode_info *opipe,
			    loff_t *offset,
			    size_t len, unsigned int flags)
{
	ssize_t ret;

	pipe_lock(opipe);
	ret = wait_for_space(opipe, flags);
	if (!ret)
		ret = do_splice_read(in, offset, opipe, len, flags);
	pipe_unlock(opipe);
	if (ret > 0)
		wakeup_pipe_readers(opipe);
	return ret;
}

/*
 * Determine where to splice to/from.
 */
ssize_t do_splice(struct file *in, loff_t *off_in, struct file *out,
		  loff_t *off_out, size_t len, unsigned int flags)
{
	struct pipe_inode_info *ipipe;
	struct pipe_inode_info *opipe;
	loff_t offset;
	ssize_t ret;

	if (unlikely(!(in->f_mode & FMODE_READ) ||
		     !(out->f_mode & FMODE_WRITE)))
		return -EBADF;

	ipipe = get_pipe_info(in, true);
	opipe = get_pipe_info(out, true);

	if (ipipe && opipe) {
		if (off_in || off_out)
			return -ESPIPE;

		/* Splicing to self would be fun, but... */
		if (ipipe == opipe)
			return -EINVAL;

		if ((in->f_flags | out->f_flags) & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		ret = splice_pipe_to_pipe(ipipe, opipe, len, flags);
	} else if (ipipe) {
		if (off_in)
			return -ESPIPE;
		if (off_out) {
			if (!(out->f_mode & FMODE_PWRITE))
				return -EINVAL;
			offset = *off_out;
		} else {
			offset = out->f_pos;
		}

		if (unlikely(out->f_flags & O_APPEND))
			return -EINVAL;

		ret = rw_verify_area(WRITE, out, &offset, len);
		if (unlikely(ret < 0))
			return ret;

		if (in->f_flags & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		file_start_write(out);
		ret = do_splice_from(ipipe, out, &offset, len, flags);
		file_end_write(out);

		if (!off_out)
			out->f_pos = offset;
		else
			*off_out = offset;
	} else if (opipe) {
		if (off_out)
			return -ESPIPE;
		if (off_in) {
			if (!(in->f_mode & FMODE_PREAD))
				return -EINVAL;
			offset = *off_in;
		} else {
			offset = in->f_pos;
		}

		ret = rw_verify_area(READ, in, &offset, len);
		if (unlikely(ret < 0))
			return ret;

		if (out->f_flags & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		ret = splice_file_to_pipe(in, opipe, &offset, len, flags);

		if (!off_in)
			in->f_pos = offset;
		else
			*off_in = offset;
	} else {
		ret = -EINVAL;
	}

	if (ret > 0) {
		/*
		 * Generate modify out before access in:
		 * do_splice_from() may've already sent modify out,
		 * and this ensures the events get merged.
		 */
		fsnotify_modify(out);
		fsnotify_access(in);
	}

	return ret;
}

static ssize_t __do_splice(struct file *in, loff_t __user *off_in,
			   struct file *out, loff_t __user *off_out,
			   size_t len, unsigned int flags)
{
	struct pipe_inode_info *ipipe;
	struct pipe_inode_info *opipe;
	loff_t offset, *__off_in = NULL, *__off_out = NULL;
	ssize_t ret;

	ipipe = get_pipe_info(in, true);
	opipe = get_pipe_info(out, true);

	if (ipipe) {
		if (off_in)
			return -ESPIPE;
		pipe_clear_nowait(in);
	}
	if (opipe) {
		if (off_out)
			return -ESPIPE;
		pipe_clear_nowait(out);
	}

	if (off_out) {
		if (copy_from_user(&offset, off_out, sizeof(loff_t)))
			return -EFAULT;
		__off_out = &offset;
	}
	if (off_in) {
		if (copy_from_user(&offset, off_in, sizeof(loff_t)))
			return -EFAULT;
		__off_in = &offset;
	}

	ret = do_splice(in, __off_in, out, __off_out, len, flags);
	if (ret < 0)
		return ret;

	if (__off_out && copy_to_user(off_out, __off_out, sizeof(loff_t)))
		return -EFAULT;
	if (__off_in && copy_to_user(off_in, __off_in, sizeof(loff_t)))
		return -EFAULT;

	return ret;
}

static ssize_t iter_to_pipe(struct iov_iter *from,
			    struct pipe_inode_info *pipe,
			    unsigned int flags)
{
	struct pipe_buffer buf = {
		.ops = &user_page_pipe_buf_ops,
		.flags = flags
	};
	size_t total = 0;
	ssize_t ret = 0;

	while (iov_iter_count(from)) {
		struct page *pages[16];
		ssize_t left;
		size_t start;
		int i, n;

		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
		if (left <= 0) {
			ret = left;
			break;
		}

		n = DIV_ROUND_UP(left + start, PAGE_SIZE);
		for (i = 0; i < n; i++) {
			int size = umin(left, PAGE_SIZE - start);

			buf.page = pages[i];
			buf.offset = start;
			buf.len = size;
			ret = add_to_pipe(pipe, &buf);
			if (unlikely(ret < 0)) {
				iov_iter_revert(from, left);
				// this one got dropped by add_to_pipe()
				while (++i < n)
					put_page(pages[i]);
				goto out;
			}
			total += ret;
			left -= size;
			start = 0;
		}
	}
out:
	return total ? total : ret;
}

static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
			struct splice_desc *sd)
{
	int n = copy_page_to_iter(buf->page, buf->offset, sd->len, sd->u.data);
	return n == sd->len ? n : -EFAULT;
}

/*
 * For lack of a better implementation, implement vmsplice() to userspace
 * as a simple copy of the pipe's pages to the user iov.
 */
static ssize_t vmsplice_to_user(struct file *file, struct iov_iter *iter,
				unsigned int flags)
{
	struct pipe_inode_info *pipe = get_pipe_info(file, true);
	struct splice_desc sd = {
		.total_len = iov_iter_count(iter),
		.flags = flags,
		.u.data = iter
	};
	ssize_t ret = 0;

	if (!pipe)
		return -EBADF;

	pipe_clear_nowait(file);

	if (sd.total_len) {
		pipe_lock(pipe);
		ret = __splice_from_pipe(pipe, &sd, pipe_to_user);
		pipe_unlock(pipe);
	}

	if (ret > 0)
		fsnotify_access(file);

	return ret;
}

/*
 * vmsplice splices a user address range into a pipe. It can be thought of
 * as splice-from-memory, where the regular splice is splice-from-file (or
 * to file). In both cases the output is a pipe, naturally.
 */
static ssize_t vmsplice_to_pipe(struct file *file, struct iov_iter *iter,
				unsigned int flags)
{
	struct pipe_inode_info *pipe;
	ssize_t ret = 0;
	unsigned buf_flag = 0;

	if (flags & SPLICE_F_GIFT)
		buf_flag = PIPE_BUF_FLAG_GIFT;

	pipe = get_pipe_info(file, true);
	if (!pipe)
		return -EBADF;

	pipe_clear_nowait(file);

	pipe_lock(pipe);
	ret = wait_for_space(pipe, flags);
	if (!ret)
		ret = iter_to_pipe(iter, pipe, buf_flag);
	pipe_unlock(pipe);
	if (ret > 0) {
		wakeup_pipe_readers(pipe);
		fsnotify_modify(file);
	}
	return ret;
}

/*
 * Note that vmsplice only really supports true splicing _from_ user memory
 * to a pipe, not the other way around. Splicing from user memory is a simple
 * operation that can be supported without any funky alignment restrictions
 * or nasty vm tricks. We simply map in the user memory and fill them into
 * a pipe. The reverse isn't quite as easy, though. There are two possible
 * solutions for that:
 *
 *	- memcpy() the data internally, at which point we might as well just
 *	  do a regular read() on the buffer anyway.
 *	- Lots of nasty vm tricks, that are neither fast nor flexible (it
 *	  has restriction limitations on both ends of the pipe).
 *
 * Currently we punt and implement it as a normal copy, see pipe_to_user().
 *
 */
SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, uiov,
		unsigned long, nr_segs, unsigned int, flags)
{
	struct iovec iovstack[UIO_FASTIOV];
	struct iovec *iov = iovstack;
	struct iov_iter iter;
	ssize_t error;
	int type;

	if (unlikely(flags & ~SPLICE_F_ALL))
		return -EINVAL;

	CLASS(fd, f)(fd);
	if (fd_empty(f))
		return -EBADF;
	if (fd_file(f)->f_mode & FMODE_WRITE)
		type = ITER_SOURCE;
	else if (fd_file(f)->f_mode & FMODE_READ)
		type = ITER_DEST;
	else
		return -EBADF;

	error = import_iovec(type, uiov, nr_segs,
			     ARRAY_SIZE(iovstack), &iov, &iter);
	if (error < 0)
		return error;

	if (!iov_iter_count(&iter))
		error = 0;
	else if (type == ITER_SOURCE)
		error = vmsplice_to_pipe(fd_file(f), &iter, flags);
	else
		error = vmsplice_to_user(fd_file(f), &iter, flags);

	kfree(iov);
	return error;
}

SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
		int, fd_out, loff_t __user *, off_out,
		size_t, len, unsigned int, flags)
{
	if (unlikely(!len))
		return 0;

	if (unlikely(flags & ~SPLICE_F_ALL))
		return -EINVAL;

	CLASS(fd, in)(fd_in);
	if (fd_empty(in))
		return -EBADF;

	CLASS(fd, out)(fd_out);
	if (fd_empty(out))
		return -EBADF;

	return __do_splice(fd_file(in), off_in, fd_file(out), off_out,
					    len, flags);
}

/*
 * Make sure there's data to read. Wait for input if we can, otherwise
 * return an appropriate error.
 */
static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
{
	int ret;

	/*
	 * Check the pipe occupancy without the inode lock first. This function
	 * is speculative anyways, so missing one is ok.
	 */
	if (!pipe_is_empty(pipe))
		return 0;

	ret = 0;
	pipe_lock(pipe);

	while (pipe_is_empty(pipe)) {
		if (signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}
		if (!pipe->writers)
			break;
		if (flags & SPLICE_F_NONBLOCK) {
			ret = -EAGAIN;
			break;
		}
		pipe_wait_readable(pipe);
	}

	pipe_unlock(pipe);
	return ret;
}

/*
 * Make sure there's writeable room. Wait for room if we can, otherwise
 * return an appropriate error.
 */
static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
{
	int ret;

	/*
	 * Check pipe occupancy without the inode lock first. This function
	 * is speculative anyways, so missing one is ok.
	 */
	if (!pipe_is_full(pipe))
		return 0;

	ret = 0;
	pipe_lock(pipe);

	while (pipe_is_full(pipe)) {
		if (!pipe->readers) {
			send_sig(SIGPIPE, current, 0);
			ret = -EPIPE;
			break;
		}
		if (flags & SPLICE_F_NONBLOCK) {
			ret = -EAGAIN;
			break;
		}
		if (signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}
		pipe_wait_writable(pipe);
	}

	pipe_unlock(pipe);
	return ret;
}

/*
 * Splice contents of ipipe to opipe.
 */
static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
			       struct pipe_inode_info *opipe,
			       size_t len, unsigned int flags)
{
	struct pipe_buffer *ibuf, *obuf;
	unsigned int i_head, o_head;
	unsigned int i_tail, o_tail;
	int ret = 0;
	bool input_wakeup = false;


retry:
	ret = ipipe_prep(ipipe, flags);
	if (ret)
		return ret;

	ret = opipe_prep(opipe, flags);
	if (ret)
		return ret;

	/*
	 * Potential ABBA deadlock, work around it by ordering lock
	 * grabbing by pipe info address. Otherwise two different processes
	 * could deadlock (one doing tee from A -> B, the other from B -> A).
	 */
	pipe_double_lock(ipipe, opipe);

	i_tail = ipipe->tail;
	o_head = opipe->head;

	do {
		size_t o_len;

		if (!opipe->readers) {
			send_sig(SIGPIPE, current, 0);
			if (!ret)
				ret = -EPIPE;
			break;
		}

		i_head = ipipe->head;
		o_tail = opipe->tail;

		if (pipe_empty(i_head, i_tail) && !ipipe->writers)
			break;

		/*
		 * Cannot make any progress, because either the input
		 * pipe is empty or the output pipe is full.
		 */
		if (pipe_empty(i_head, i_tail) ||
		    pipe_full(o_head, o_tail, opipe->max_usage)) {
			/* Already processed some buffers, break */
			if (ret)
				break;

			if (flags & SPLICE_F_NONBLOCK) {
				ret = -EAGAIN;
				break;
			}

			/*
			 * We raced with another reader/writer and haven't
			 * managed to process any buffers.  A zero return
			 * value means EOF, so retry instead.
			 */
			pipe_unlock(ipipe);
			pipe_unlock(opipe);
			goto retry;
		}

		ibuf = pipe_buf(ipipe, i_tail);
		obuf = pipe_buf(opipe, o_head);

		if (len >= ibuf->len) {
			/*
			 * Simply move the whole buffer from ipipe to opipe
			 */
			*obuf = *ibuf;
			ibuf->ops = NULL;
			i_tail++;
			ipipe->tail = i_tail;
			input_wakeup = true;
			o_len = obuf->len;
			o_head++;
			opipe->head = o_head;
		} else {
			/*
			 * Get a reference to this pipe buffer,
			 * so we can copy the contents over.
			 */
			if (!pipe_buf_get(ipipe, ibuf)) {
				if (ret == 0)
					ret = -EFAULT;
				break;
			}
			*obuf = *ibuf;

			/*
			 * Don't inherit the gift and merge flags, we need to
			 * prevent multiple steals of this page.
			 */
			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
			obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;

			obuf->len = len;
			ibuf->offset += len;
			ibuf->len -= len;
			o_len = len;
			o_head++;
			opipe->head = o_head;
		}
		ret += o_len;
		len -= o_len;
	} while (len);

	pipe_unlock(ipipe);
	pipe_unlock(opipe);

	/*
	 * If we put data in the output pipe, wakeup any potential readers.
	 */
	if (ret > 0)
		wakeup_pipe_readers(opipe);

	if (input_wakeup)
		wakeup_pipe_writers(ipipe);

	return ret;
}

/*
 * Link contents of ipipe to opipe.
 */
static ssize_t link_pipe(struct pipe_inode_info *ipipe,
			 struct pipe_inode_info *opipe,
			 size_t len, unsigned int flags)
{
	struct pipe_buffer *ibuf, *obuf;
	unsigned int i_head, o_head;
	unsigned int i_tail, o_tail;
	ssize_t ret = 0;

	/*
	 * Potential ABBA deadlock, work around it by ordering lock
	 * grabbing by pipe info address. Otherwise two different processes
	 * could deadlock (one doing tee from A -> B, the other from B -> A).
	 */
	pipe_double_lock(ipipe, opipe);

	i_tail = ipipe->tail;
	o_head = opipe->head;

	do {
		if (!opipe->readers) {
			send_sig(SIGPIPE, current, 0);
			if (!ret)
				ret = -EPIPE;
			break;
		}

		i_head = ipipe->head;
		o_tail = opipe->tail;

		/*
		 * If we have iterated all input buffers or run out of
		 * output room, break.
		 */
		if (pipe_empty(i_head, i_tail) ||
		    pipe_full(o_head, o_tail, opipe->max_usage))
			break;

		ibuf = pipe_buf(ipipe, i_tail);
		obuf = pipe_buf(opipe, o_head);

		/*
		 * Get a reference to this pipe buffer,
		 * so we can copy the contents over.
		 */
		if (!pipe_buf_get(ipipe, ibuf)) {
			if (ret == 0)
				ret = -EFAULT;
			break;
		}

		*obuf = *ibuf;

		/*
		 * Don't inherit the gift and merge flag, we need to prevent
		 * multiple steals of this page.
		 */
		obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
		obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;

		if (obuf->len > len)
			obuf->len = len;
		ret += obuf->len;
		len -= obuf->len;

		o_head++;
		opipe->head = o_head;
		i_tail++;
	} while (len);

	pipe_unlock(ipipe);
	pipe_unlock(opipe);

	/*
	 * If we put data in the output pipe, wakeup any potential readers.
	 */
	if (ret > 0)
		wakeup_pipe_readers(opipe);

	return ret;
}

/*
 * This is a tee(1) implementation that works on pipes. It doesn't copy
 * any data, it simply references the 'in' pages on the 'out' pipe.
 * The 'flags' used are the SPLICE_F_* variants, currently the only
 * applicable one is SPLICE_F_NONBLOCK.
 */
ssize_t do_tee(struct file *in, struct file *out, size_t len,
	       unsigned int flags)
{
	struct pipe_inode_info *ipipe = get_pipe_info(in, true);
	struct pipe_inode_info *opipe = get_pipe_info(out, true);
	ssize_t ret = -EINVAL;

	if (unlikely(!(in->f_mode & FMODE_READ) ||
		     !(out->f_mode & FMODE_WRITE)))
		return -EBADF;

	/*
	 * Duplicate the contents of ipipe to opipe without actually
	 * copying the data.
	 */
	if (ipipe && opipe && ipipe != opipe) {
		if ((in->f_flags | out->f_flags) & O_NONBLOCK)
			flags |= SPLICE_F_NONBLOCK;

		/*
		 * Keep going, unless we encounter an error. The ipipe/opipe
		 * ordering doesn't really matter.
		 */
		ret = ipipe_prep(ipipe, flags);
		if (!ret) {
			ret = opipe_prep(opipe, flags);
			if (!ret)
				ret = link_pipe(ipipe, opipe, len, flags);
		}
	}

	if (ret > 0) {
		fsnotify_access(in);
		fsnotify_modify(out);
	}

	return ret;
}

SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
{
	if (unlikely(flags & ~SPLICE_F_ALL))
		return -EINVAL;

	if (unlikely(!len))
		return 0;

	CLASS(fd, in)(fdin);
	if (fd_empty(in))
		return -EBADF;

	CLASS(fd, out)(fdout);
	if (fd_empty(out))
		return -EBADF;

	return do_tee(fd_file(in), fd_file(out), len, flags);
}
]

gemini-3-flash-preview

							input: 95774
							output: 19
							thoughts: 1184

**Analysis of a KCSAN Report: Benign Race Condition in Readahead**

Okay, let's break this down. I'm looking at a KCSAN report indicating a potential race condition. The key functions involved are `sendfile`, `splice_direct_to_actor`, and `filemap_read`, all dealing with file I/O. Task 10061 is running `sendfile` and Task 10066 is doing an `io_read` via `io_uring`. Both appear to be operating on the same file. The reported race is on `f_ra.prev_pos`, the previous position used for readahead.

My initial thought is that this is likely a standard, expected race in the readahead mechanism. Readahead hints are inherently susceptible to these kinds of races when multiple threads share a file descriptor. The readahead state isn't precisely maintained; it's a *hint*. I quickly recall that `sendfile` itself uses `f_pos`, and the relevant splice functions use `f_pos` or `*ppos`, depending on the situation.

I dig into the code and see that `do_splice` uses `in->f_pos` if `off_in` is null. `do_splice_direct_actor` uses the passed in position from `splice_desc.pos = *ppos`.

I needed to make sure it wasn't something problematic. I review the code again to make sure that the reported field, `f_ra.prev_pos`, is the only one affected.

My next step is to make sure this is the only relevant field in `f_ra` that could cause a problem. Upon closer inspection, `prev_pos` is the only field flagged by KCSAN.

Now, I look at how `prev_pos` is actually *used* within `filemap_read`. I confirm that the readahead's `prev_pos` is used to decide when to call `folio_mark_accessed`. Specifically, I note this section:

```c
2842: if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
2843: fbatch.folios[0]))
2844: folio_mark_accessed(fbatch.folios[0]);
```

`folio_mark_accessed` is, itself, just a hint, mostly for LRU (Least Recently Used). If the `last_pos` is racy, it means that `folio_mark_accessed` might be called more or less often than would be *perfectly* ideal. This isn't critical. `pos_same_folio` also just checks if the positions are in the same folio.

A racy `prev_pos` could lead to a different thread's position being used in the `pos_same_folio` check, resulting in a false negative (we don't mark as accessed when we should) or a false positive (we mark as accessed when we shouldn't) for `folio_mark_accessed`. Since `folio_mark_accessed` is a hint for LRU, both are harmless. The worst-case scenario is a slightly suboptimal LRU behavior. This is clearly a benign race condition. I also consider that `filemap_read` and `filemap_splice_read` are both incredibly common, so if this were a serious issue, it would have been found and fixed a long time ago.
Therefore, my verdict is that this race condition on `f_ra.prev_pos` is benign.

map[Benign:true Confident:true]

map[Benign:true Confident:true]

gemini-3-flash-preview

							input: 97775
							output: 569
							thoughts: 336

**My Assessment of the Data Race in File Readahead**

Okay, so I've been looking at this data race on the `prev_pos` field within the `file_ra_state` struct, nested within the `struct file` in the kernel. My understanding is that this field is crucial for the readahead algorithm; it tracks the last read position to optimize for sequential access patterns.

The race specifically occurs because of concurrent read operations on the same file descriptor. Things like `splice` and `sendfile`, they can all hit the same file concurrently, and this is where the problem lies. The devs deliberately left the readahead state unlocked to prevent performance bottlenecks. A classic trade-off, really.

From what I can see, the worst-case scenarios are suboptimal readahead and redundant `folio_mark_accessed` calls. The readahead algorithm might misinterpret access patterns, which means I/O performance will suffer. That second issue, the potential for those extra `folio_mark_accessed` calls is essentially harmless, though inefficient. It is only used to avoid multiple accesses for a single read.

However, and this is crucial, `prev_pos` is a simple `loff_t`, a 64-bit integer. It's only a hint, an optimization. It's not involved in any data integrity checks. On an x86_64 system, 8-byte aligned writes are atomic, meaning even if a write is "torn," the value won't be nonsensical. Memory corruption, crashes, or incorrect data returned to the user are simply not on the table here.

And, of course, the kernel documentation and the code comments themselves acknowledge that concurrent streams will interfere with each other’s readahead.

Considering all this, it looks like this is a benign data race. The benefits of avoiding those read-path bottlenecks outweigh the minor performance hiccups that might occur.

LLM Calls Summary:
Total Calls	Total Tokens	Avg Tokens	Total Duration (Seconds)	Avg Duration (Seconds)

Tool Calls Summary:
Total Calls	Total Duration (Seconds)	Avg Duration (Seconds)