bisecting cause commit starting from 7c7ec3226f5f33f9c050d85ec20f18419c622ad6 building syzkaller on 2d5ea0cb6edb828803beea8af38dbc04dc80f028 testing commit 7c7ec3226f5f33f9c050d85ec20f18419c622ad6 with gcc (GCC) 8.1.0 kernel signature: 79ade23124832d84491a2de8756dde2756edbbd603c8dd4f10de76b9400edf1a all runs: crashed: BUG: unable to handle kernel NULL pointer dereference in gfs2_withdraw testing release v5.8 testing commit bcf876870b95592b52519ed4aafcf9d95999bc9c with gcc (GCC) 8.1.0 kernel signature: b34c4620b2edc4d2a98e9ddafbb841982198a933426b3c47cba8f3923a7add1f all runs: crashed: BUG: unable to handle kernel NULL pointer dereference in gfs2_withdraw testing release v5.7 testing commit 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162 with gcc (GCC) 8.1.0 kernel signature: 7f53ab93676ff138c03793adc5ab03ffbc4cdaee3b3729f5609180b4c407b422 all runs: crashed: general protection fault in gfs2_withdraw testing release v5.6 testing commit 7111951b8d4973bda27ff663f2cf18b663d15b48 with gcc (GCC) 8.1.0 kernel signature: 83a12a73b4d0c1c5e2793d37d50811f10c6c508d41d413955e617f5e37b0efa8 all runs: OK # git bisect start 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162 7111951b8d4973bda27ff663f2cf18b663d15b48 Bisecting: 7542 revisions left to test after this (roughly 13 steps) [50a5de895dbe5df947b3a695777db5b2c313e065] Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma testing commit 50a5de895dbe5df947b3a695777db5b2c313e065 with gcc (GCC) 8.1.0 kernel signature: ae43903517fdcb5d80c0a0c903dd20e40454efbc2795c092df5ecf15b82a4f11 all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad 50a5de895dbe5df947b3a695777db5b2c313e065 Bisecting: 4204 revisions left to test after this (roughly 12 steps) [56a451b780676bc1cdac011735fe2869fa2e9abf] Merge tag 'ntb-5.7' of git://github.com/jonmason/ntb testing commit 56a451b780676bc1cdac011735fe2869fa2e9abf with gcc (GCC) 8.1.0 kernel signature: 2fb15387b7c6283f34eaf11e9f5d5a3318f89a54858b4b9aae8d34d994519c47 all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad 56a451b780676bc1cdac011735fe2869fa2e9abf Bisecting: 1643 revisions left to test after this (roughly 11 steps) [49835c15a55225e9b3ff9cc9317135b334ea2d49] Merge tag 'pm-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm testing commit 49835c15a55225e9b3ff9cc9317135b334ea2d49 with gcc (GCC) 8.1.0 kernel signature: 15add38062ec163e25d44c899fc0f34a4ae8635eb290202b784624be54028ac5 all runs: OK # git bisect good 49835c15a55225e9b3ff9cc9317135b334ea2d49 Bisecting: 856 revisions left to test after this (roughly 10 steps) [a8222fd5b80c7ec83f257060670becbeea9b50b9] Merge tag 'microblaze-v5.7-rc1' of git://git.monstr.eu/linux-2.6-microblaze testing commit a8222fd5b80c7ec83f257060670becbeea9b50b9 with gcc (GCC) 8.1.0 kernel signature: 3355289aa58eeccf3e327ac216029fbcfee57183a778d0a87ebf4d44658733eb all runs: OK # git bisect good a8222fd5b80c7ec83f257060670becbeea9b50b9 Bisecting: 441 revisions left to test after this (roughly 9 steps) [018d21f5c58c3854ebd7ee18540fc4a03f244d2f] Merge tag 'gfs2-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 testing commit 018d21f5c58c3854ebd7ee18540fc4a03f244d2f with gcc (GCC) 8.1.0 kernel signature: 8221df2e436a628530eb93a8cdc24256a200fe5a46856e799a60921b721441d1 all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad 018d21f5c58c3854ebd7ee18540fc4a03f244d2f Bisecting: 207 revisions left to test after this (roughly 8 steps) [39dba8739c4e360d7d1b27119c728791e68b0448] btrfs: do not resolve backrefs for roots that are being deleted testing commit 39dba8739c4e360d7d1b27119c728791e68b0448 with gcc (GCC) 8.1.0 kernel signature: 3eb2105a043dd94d3e6f6060f40ca6b8f9025a764cfcbf0b94172c972deb5218 all runs: OK # git bisect good 39dba8739c4e360d7d1b27119c728791e68b0448 Bisecting: 108 revisions left to test after this (roughly 7 steps) [97cddfc34549014b902f5954953ebd9a4f3f040a] Merge branch 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip testing commit 97cddfc34549014b902f5954953ebd9a4f3f040a with gcc (GCC) 8.1.0 kernel signature: 338d71cef93e355fa9a443c984334fcf7aec5dd2e2312e87785ed4b76162e184 all runs: OK # git bisect good 97cddfc34549014b902f5954953ebd9a4f3f040a Bisecting: 54 revisions left to test after this (roughly 6 steps) [d9d76778927dc953c553b83ab52287dfbd15ac6a] Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip testing commit d9d76778927dc953c553b83ab52287dfbd15ac6a with gcc (GCC) 8.1.0 kernel signature: f89bfece97105c0418beeb931e6fc4c1a9c88488e27b7d87e5ec08040fbacebe all runs: OK # git bisect good d9d76778927dc953c553b83ab52287dfbd15ac6a Bisecting: 27 revisions left to test after this (roughly 5 steps) [c9ebc4b737999ab1f3264c42431f7be80ac2efbf] gfs2: allow journal replay to hold sd_log_flush_lock testing commit c9ebc4b737999ab1f3264c42431f7be80ac2efbf with gcc (GCC) 8.1.0 kernel signature: cd53182a1141545907768a515f5fa9ac95533bd809093b1a5abf4f04ffa0a154 all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad c9ebc4b737999ab1f3264c42431f7be80ac2efbf Bisecting: 13 revisions left to test after this (roughly 4 steps) [a72d2401f54b7db41c77ab971238a06eafe929fb] gfs2: Allow some glocks to be used during withdraw testing commit a72d2401f54b7db41c77ab971238a06eafe929fb with gcc (GCC) 8.1.0 kernel signature: 6a8e6f27a63321a3250e781859c0fe37dc243cb9c335ee2d8226173b9697c08e all runs: OK # git bisect good a72d2401f54b7db41c77ab971238a06eafe929fb Bisecting: 6 revisions left to test after this (roughly 3 steps) [9ff78289356af640941bbb0dd3f46af2063f0046] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty testing commit 9ff78289356af640941bbb0dd3f46af2063f0046 with gcc (GCC) 8.1.0 kernel signature: 7a785bd358a1e94ce78a47808acd516fe1c76c8ab9a44f8ae4da8044a5c896da all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad 9ff78289356af640941bbb0dd3f46af2063f0046 Bisecting: 3 revisions left to test after this (roughly 2 steps) [7d9f9249580e05a9af823cdbb7a91f2758d1c63b] gfs2: Add verbose option to check_journal_clean testing commit 7d9f9249580e05a9af823cdbb7a91f2758d1c63b with gcc (GCC) 8.1.0 kernel signature: b69956e1f777597d17e3591d2ab89f20c06e821830ee6b17512d022e4c3acd81 all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad 7d9f9249580e05a9af823cdbb7a91f2758d1c63b Bisecting: 0 revisions left to test after this (roughly 1 step) [33dbd1e41a1dd549eb19a29477119d4e29766210] gfs2: fix infinite loop when checking ail item count before go_inval testing commit 33dbd1e41a1dd549eb19a29477119d4e29766210 with gcc (GCC) 8.1.0 kernel signature: 97bc7b59fa3dbaf7b20a8e62941a93e24b7a888603be2843e9ad5cce4b4adf71 all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad 33dbd1e41a1dd549eb19a29477119d4e29766210 Bisecting: 0 revisions left to test after this (roughly 0 steps) [601ef0d52e9617588fcff3df26953592f2eb44ac] gfs2: Force withdraw to replay journals and wait for it to finish testing commit 601ef0d52e9617588fcff3df26953592f2eb44ac with gcc (GCC) 8.1.0 kernel signature: 4b082912da270d19d2414736c101b1bab27b58156d57f54fed4a3412d75d485c all runs: crashed: general protection fault in gfs2_withdraw # git bisect bad 601ef0d52e9617588fcff3df26953592f2eb44ac 601ef0d52e9617588fcff3df26953592f2eb44ac is the first bad commit commit 601ef0d52e9617588fcff3df26953592f2eb44ac Author: Bob Peterson Date: Tue Jan 28 20:23:45 2020 +0100 gfs2: Force withdraw to replay journals and wait for it to finish When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the file system, and none of the other nodes know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources. This patch makes file system withdraws seen by the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery. The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay. Until the journal replay is complete, no i/o related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held. The "live" glock which is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only just used to indicate a withdraw has occurred. Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop. Also note that the withdraw function may be called from a wide variety of situations, and therefore, we need to take extra precautions to make sure pointers are valid before using them in many circumstances. We also need to take care when glocks decide to withdraw, since the withdraw code now uses glocks. Also, before this patch, if a process encountered an error and decided to withdraw, if another process was already withdrawing, the second withdraw would be silently ignored, which set it free to unlock its glocks. That's correct behavior if the original withdrawer encounters further errors down the road. But if secondary waiters don't wait for the journal replay, unlocking glocks will allow other nodes to use them, despite the fact that the journal containing those blocks is being replayed. The replay needs to finish before our glocks are released to other nodes. IOW, secondary withdraws need to wait for the first withdraw to finish. For example, if an rgrp glock is unlocked by a process that didn't wait for the first withdraw, a journal replay could introduce file system corruption by replaying a rgrp block that has already been granted to a different cluster node. Signed-off-by: Bob Peterson fs/gfs2/glock.c | 23 ++++++- fs/gfs2/glops.c | 77 +++++++++++++++++++++- fs/gfs2/incore.h | 9 +++ fs/gfs2/lock_dlm.c | 34 ++++++++++ fs/gfs2/meta_io.c | 3 +- fs/gfs2/ops_fstype.c | 11 +++- fs/gfs2/quota.c | 3 + fs/gfs2/super.c | 75 +++++++++++++++------ fs/gfs2/super.h | 1 - fs/gfs2/sys.c | 2 + fs/gfs2/util.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++- 11 files changed, 390 insertions(+), 31 deletions(-) culprit signature: 4b082912da270d19d2414736c101b1bab27b58156d57f54fed4a3412d75d485c parent signature: 6a8e6f27a63321a3250e781859c0fe37dc243cb9c335ee2d8226173b9697c08e revisions tested: 18, total time: 3h58m29.364710259s (build: 2h11m16.711625886s, test: 1h45m21.161866474s) first bad commit: 601ef0d52e9617588fcff3df26953592f2eb44ac gfs2: Force withdraw to replay journals and wait for it to finish recipients (to): ["agruenba@redhat.com" "cluster-devel@redhat.com" "rpeterso@redhat.com" "rpeterso@redhat.com"] recipients (cc): ["linux-kernel@vger.kernel.org"] crash: general protection fault in gfs2_withdraw gfs2: fsid=syz:syz: Trying to join cluster "lock_nolock", "syz:syz" gfs2: fsid=syz:syz: Now mounting FS... gfs2: fsid=syz:syz.0: fatal: invalid metadata block bh = 2075 (magic number) function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 417 gfs2: fsid=syz:syz.0: about to withdraw this file system general protection fault, probably for non-canonical address 0xdffffc000000000d: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] CPU: 0 PID: 8743 Comm: syz-executor.0 Not tainted 5.5.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:signal_our_withdraw fs/gfs2/util.c:87 [inline] RIP: 0010:gfs2_withdraw.cold.9+0xe6/0xa95 fs/gfs2/util.c:282 Code: 00 48 c1 e0 2a 80 3c 02 00 0f 85 e2 01 00 00 4d 8b ae 28 08 00 00 b8 ff ff 37 00 48 c1 e0 2a 49 8d 7d 68 48 89 fa 48 c1 ea 03 <80> 3c 02 00 74 05 e8 f3 04 93 fe 4d 8b 6d 68 b8 ff ff 37 00 48 c1 RSP: 0018:ffffc90004b77450 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff88808d1e8050 RCX: 0000000000000000 RDX: 000000000000000d RSI: ffffffff88f831c0 RDI: 0000000000000068 RBP: ffff88808d1e82e8 R08: ffffed1015cc6661 R09: ffffed1015cc6661 R10: ffffed1015cc6660 R11: ffff8880ae633307 R12: ffffffff8808a500 R13: 0000000000000000 R14: ffff88808d1e8000 R15: ffff88808d1e826d FS: 00007f0fb70a5700(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff4a1b49e0 CR3: 0000000094f18000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: gfs2_meta_check_ii+0x4d/0xa0 fs/gfs2/util.c:423 gfs2_metatype_check_i fs/gfs2/util.h:116 [inline] gfs2_meta_indirect_buffer+0x31d/0x3a0 fs/gfs2/meta_io.c:417 gfs2_meta_inode_buffer fs/gfs2/meta_io.h:70 [inline] gfs2_inode_refresh+0x8e/0xd70 fs/gfs2/glops.c:405 inode_go_lock+0x25c/0x3cc fs/gfs2/glops.c:435 do_promote+0x382/0xb80 fs/gfs2/glock.c:391 finish_xmote+0x4e9/0xce0 fs/gfs2/glock.c:552 do_xmote+0x3fd/0x570 fs/gfs2/glock.c:626 gfs2_glock_nq+0x754/0x12a0 fs/gfs2/glock.c:1240 gfs2_glock_nq_init fs/gfs2/glock.h:229 [inline] gfs2_lookupi+0x291/0x4e0 fs/gfs2/inode.c:298 gfs2_lookup_simple+0x92/0xd0 fs/gfs2/inode.c:249 init_journal fs/gfs2/ops_fstype.c:620 [inline] init_inodes+0x310/0x1d20 fs/gfs2/ops_fstype.c:756 gfs2_fill_super+0x1749/0x2356 fs/gfs2/ops_fstype.c:1125 get_tree_bdev+0x3a1/0x560 fs/super.c:1342 gfs2_get_tree+0x45/0x240 fs/gfs2/ops_fstype.c:1190 vfs_get_tree+0x7e/0x330 fs/super.c:1547 do_new_mount fs/namespace.c:2822 [inline] do_mount+0x102b/0x16b0 fs/namespace.c:3142 __do_sys_mount fs/namespace.c:3351 [inline] __se_sys_mount fs/namespace.c:3328 [inline] __x64_sys_mount+0x15d/0x1b0 fs/namespace.c:3328 do_syscall_64+0xc6/0x620 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x460bca Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 dd 87 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 ba 87 fb ff c3 66 0f 1f 84 00 00 00 00 00 RSP: 002b:00007f0fb70a4a88 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007f0fb70a4b20 RCX: 0000000000460bca RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f0fb70a4ae0 RBP: 00007f0fb70a4ae0 R08: 00007f0fb70a4b20 R09: 0000000020000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000020000000 R13: 0000000020000100 R14: 0000000020000200 R15: 00000000200a6800 Modules linked in: ---[ end trace 215a82b3517d5577 ]--- RIP: 0010:signal_our_withdraw fs/gfs2/util.c:87 [inline] RIP: 0010:gfs2_withdraw.cold.9+0xe6/0xa95 fs/gfs2/util.c:282 Code: 00 48 c1 e0 2a 80 3c 02 00 0f 85 e2 01 00 00 4d 8b ae 28 08 00 00 b8 ff ff 37 00 48 c1 e0 2a 49 8d 7d 68 48 89 fa 48 c1 ea 03 <80> 3c 02 00 74 05 e8 f3 04 93 fe 4d 8b 6d 68 b8 ff ff 37 00 48 c1 RSP: 0018:ffffc90004b77450 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff88808d1e8050 RCX: 0000000000000000 RDX: 000000000000000d RSI: ffffffff88f831c0 RDI: 0000000000000068 RBP: ffff88808d1e82e8 R08: ffffed1015cc6661 R09: ffffed1015cc6661 R10: ffffed1015cc6660 R11: ffff8880ae633307 R12: ffffffff8808a500 R13: 0000000000000000 R14: ffff88808d1e8000 R15: ffff88808d1e826d FS: 00007f0fb70a5700(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0ed8eccdb8 CR3: 0000000094f18000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400