Kernel issue tracked down in THP (transparent huge page) code
Author: hixichen. Kernel version: 2.6.32. Checked: KSM, THP.
LOG:
<4>------------[ cut here ]------------
<2>kernel BUG at mm/huge_memory.c:1194!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:06:00.0/host7/scsi_host/host7/proc_name
<4>CPU 1
<4>Modules linked in: nf_conntrack bonding 8021q garp stp llc ipv6 openvswitch(U) libcrc32c iptable_filter ip_tables sg igb dca microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ext4 mbcache jbd2 sd_mod crc_t10dif isci libsas scsi_transport_sas ahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3084, comm: redis-server Tainted: G --------------- H 2.6.32-279.19.5.el6.ucloud.x86_64 #1 Huawei Technologies Co., Ltd. Tecal RH2288H V2-12L/BC11SRSG1
<4>RIP: 0010:[<ffffffff81180f22>] [<ffffffff81180f22>] split_huge_page+0x802/0x850
<4>RSP: 0018:ffff880c5edddbd8 EFLAGS: 00010086
<4>RAX: 00000000ffffffff RBX: ffffea0029333970 RCX: ffffc9000c06b000
<4>RDX: 0000000000000002 RSI: ffff880c61fe9c00 RDI: 0000000000000004
<4>RBP: ffff880c5edddca8 R08: 0000000000000064 R09: 0000000000000001
<4>R10: 0000000000000106 R11: ffff880c5941f300 R12: ffffea0029334000
<4>R13: 0000000000000000 R14: ffffea002932d000 R15: 00000007fcd17be1
<4>FS: 00007fcd25858720(0000) GS:ffff880053820000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 00007f7086a8a740 CR3: 0000000c57daf000 CR4: 00000000001406e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process redis-server (pid: 3084, threadinfo ffff880c5eddc000, task ffff880c5941e940)
<4>Stack:
<4> 0000000000000002 0000000000000000 0000000000000000 ffff880c00000001
<4><d> ffff880c5edddc78 ffff88184155d5f0 0000000000000000 ffffffff81180f9b
<4><d> ffff88184155d630 ffffffff810a9f0d ffffea0052938e78 0000000180010f40
<4>Call Trace:
<4> [<ffffffff81180f9b>] ? __split_huge_page_pmd+0x2b/0xc0
<4> [<ffffffff810a9f0d>] ? trace_hardirqs_on_caller+0x14d/0x190
<4> [<ffffffff81180ff0>] __split_huge_page_pmd+0x80/0xc0
<4> [<ffffffff81147ff4>] unmap_vmas+0xa34/0xc00
<4> [<ffffffff81148b9d>] zap_page_range+0x7d/0xe0
<4> [<ffffffff81144b4d>] sys_madvise+0x54d/0x770
<4> [<ffffffff815073bb>] ? thread_return+0x4e/0x7d3
<4> [<ffffffff810a9f0d>] ? trace_hardirqs_on_caller+0x14d/0x190
<4> [<ffffffff81509ee2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 83 c1 01 e8 69 5e 38 00 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 0f
<1>RIP [<ffffffff81180f22>] split_huge_page+0x802/0x850
<4> RSP <ffff880c5edddbd8>
Correlating the code with the back trace above, the call path is:
unmap_vmas
  unmap_page_range
    zap_pud_range
      zap_pmd_range
        split_huge_page_pmd
          __split_huge_page_pmd
            split_huge_page
              __split_huge_page
                __split_huge_page_refcount
mm/huge_memory.c:1194 (the BUG line in the log above) sits at the bottom of this path, in __split_huge_page_refcount().
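For context, here is an abridged paraphrase of that function in 2.6.32-era THP code (the RHEL backport may differ in detail, and the exact line number varies by patch level):

    /* mm/huge_memory.c, __split_huge_page_refcount(), abridged paraphrase */
    static void __split_huge_page_refcount(struct page *page)
    {
            int i;
            ...
            for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
                    struct page *page_tail = page + i;

                    /* At this point, _mapcount of a tail page holds the
                     * "tail pin" count taken/released in get_page() /
                     * put_compound_page(); it must never be negative. */
                    BUG_ON(page_mapcount(page_tail) < 0);  /* <-- line 1194 fires */
                    ...
                    /* Only afterwards does the tail inherit the head's
                     * real _mapcount. */
                    page_tail->_mapcount = page->_mapcount;
                    ...
            }
            ...
    }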
[Assembly]: After disassembling vmlinux (objdump -d vmlinux), note that many of these functions are static inline, so after looking up the jump address in System.map you have to search for the code inside the calling function's disassembly rather than under its own symbol.
RAX: 00000000ffffffff
RBX: ffffea0029333970
R12: ffffea0029334000
Trigger: the value of eax is -1 (RAX: 00000000ffffffff);
eax holds the result of add $0x1,%eax, which corresponds to atomic_read(&(page)->_mapcount) + 1;
i.e. _mapcount was -2, and adding 1 produced the -1.
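That expression is exactly page_mapcount(), defined in 2.6.32-era include/linux/mm.h:

    /* include/linux/mm.h */
    static inline int page_mapcount(struct page *page)
    {
            return atomic_read(&(page)->_mapcount) + 1;
    }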
So the symptom is that mapcount ended up with a value that in theory should not, and cannot, occur.
_mapcount is the "count of ptes mapped in mms, to show when page is mapped & limit reverse map searches" (the struct page field comment), and it starts from -1, meaning "no mappings".
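In the 2.6.32-era struct page layout (include/linux/mm_types.h, abridged) the field and its comment look like this:

    /* include/linux/mm_types.h, abridged */
    struct page {
            unsigned long flags;
            atomic_t _count;            /* Usage count, see below. */
            union {
                    atomic_t _mapcount; /* Count of ptes mapped in mms,
                                         * to show when page is mapped
                                         * & limit reverse map searches.
                                         */
                    ...
            };
            ...
    };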
By this analysis, _mapcount here should in theory be -1: then BUG_ON(page_mapcount(page_tail) < 0) would evaluate page_mapcount() to 0 and the BUG would not trigger.
The observed value does not look like a dirty/random overwrite; it looks like -1 after one extra decrement.
And the places that decrement a tail page's _mapcount are all in put_compound_page()...
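Under the THP tail-page refcounting scheme (upstream ~v3.1; the RHEL 6 2.6.32 kernel appears to carry a backport of it), _mapcount of a tail page is reused as the "tail pin" count, and put_compound_page() is where such a pin is released. The following is an abridged paraphrase of the upstream mm/swap.c code, so the exact RHEL source may differ; one extra pass through this decrement would leave _mapcount at -2:

    /* mm/swap.c, put_compound_page(), abridged paraphrase */
    static void put_compound_page(struct page *page)
    {
            if (unlikely(PageTail(page))) {
                    /* __split_huge_page_refcount can run under us */
                    struct page *page_head = page->first_page;

                    if (likely(PageTail(page) &&
                               get_page_unless_zero(page_head))) {
                            unsigned long flags;

                            /* serialize against __split_huge_page_refcount */
                            flags = compound_lock_irqsave(page_head);
                            ...
                            VM_BUG_ON(page_mapcount(page) <= 0);
                            atomic_dec(&page->_mapcount); /* drop the tail pin */
                            ...
                            compound_unlock_irqrestore(page_head, flags);
                            ...
                    }
                    ...
            }
            ...
    }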
Searching for similar reports turned up the following record:
====== start quote =======
  mapcount 0 page_mapcount 1
  kernel BUG at mm/huge_memory.c:1384!

At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:

  mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.

   143 void pmd_clear_bad(pmd_t *pmd)
   144 {
-> 145     pmd_ERROR(*pmd);
   146     pmd_clear(pmd);
   147 }

After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.

   1381     if (mapcount != page_mapcount(page))
   1382         printk(KERN_ERR "mapcount %d page_mapcount %d\n",
   1383                mapcount, page_mapcount(page));
-> 1384     BUG_ON(mapcount != page_mapcount(page));

The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.

           virtual address space
          .---------------------.
          |                     |
          |                     |
        .-|---------------------|
        | |                     |
        | |                     |<-- B(fault)
        | |                     |
  2 MB  | |/////////////////////|-.
  huge  < |/////////////////////|  > A(range)
  page  | |/////////////////////|-'
        | |                     |
        | |                     |
        '-|---------------------|
          |                     |
          |                     |
          '---------------------'

- Thread A is executing an madvise(..., MADV_DONTNEED) system call
  on the virtual address range "A(range)" shown in the picture.

sys_madvise
  // Acquire the semaphore in shared mode.
  down_read(&current->mm->mmap_sem)
  ...
  madvise_vma
    switch (behavior)
    case MADV_DONTNEED:
      madvise_dontneed
        zap_page_range
          unmap_vmas
            unmap_page_range
              zap_pud_range
                zap_pmd_range
                  //
                  // Assume that this huge page has never been accessed.
                  // I.e. content of the PMD entry is zero (not mapped).
                  //
                  if (pmd_trans_huge(*pmd)) {
                    // We don't get here due to the above assumption.
                  }
                  //
                  // Assume that Thread B incurred a page fault and
  .-------------> // sneaks in here as shown below.
  |               //
  |               if (pmd_none_or_clear_bad(pmd))
  |               {
  |                 if (unlikely(pmd_bad(*pmd)))
  |                   pmd_clear_bad
  |                   {
  |                     pmd_ERROR
  |                       // Log "bad pmd ..." message here.
  |                     pmd_clear
  |                       // Clear the page's PMD entry.
  |                       // Thread B incremented the map count
  |                       // in page_add_new_anon_rmap(), but
  |                       // now the page is no longer mapped
  |                       // by a PMD entry (-> inconsistency).
  |                   }
  |               }
  |
  v
- Thread B is handling a page fault on virtual address "B(fault)" shown
  in the picture.

...
do_page_fault
  __do_page_fault
    // Acquire the semaphore in shared mode.
    down_read_trylock(&mm->mmap_sem)
    ...
    handle_mm_fault
      if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
        // We get here due to the above assumption (PMD entry is zero).
        do_huge_pmd_anonymous_page
          alloc_hugepage_vma
            // Allocate a new transparent huge page here.
          ...
          __do_huge_pmd_anonymous_page
            ...
            spin_lock(&mm->page_table_lock)
            ...
            page_add_new_anon_rmap
              // Here we increment the page's map count (starts at -1).
              atomic_set(&page->_mapcount, 0)
            set_pmd_at
              // Here we set the page's PMD entry which will be cleared
              // when Thread A calls pmd_clear_bad().
            ...
            spin_unlock(&mm->page_table_lock)

The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.

====== end quote =======
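For completeness: this record appears to be from the commit message of the upstream fix, commit 1a5a9906 ("mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode"). That fix has callers such as zap_pmd_range() use a THP-aware helper instead of pmd_none_or_clear_bad(), roughly the following (paraphrased from include/asm-generic/pgtable.h):

    static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
    {
            /* Read the pmd once into a local so its value cannot change
             * under the checks below (upstream adds a compiler barrier,
             * later pmd_read_atomic(), here). */
            pmd_t pmdval = *pmd;

            if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                    return 1;   /* treat a materializing hugepmd as "none" */
            if (unlikely(pmd_bad(pmdval))) {
                    pmd_clear_bad(pmd);
                    return 1;
            }
            return 0;
    }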
Putting the analysis above together with our situation, there is good reason to conclude that the same kind of multi-thread race occurred here: the call path nests several loops, and a multi-threaded interleaving only hits the window under specific conditions.
What still needs to be determined is exactly where in our workload the race is triggered,
and how to construct a test environment to reproduce it; a rough userspace sketch follows.
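A minimal stress sketch along the lines of the quoted race, assuming THP is enabled for the system: one thread madvise(MADV_DONTNEED)s a 2 MB-aligned anonymous range while another keeps faulting pages back in, mimicking the A(range)/B(fault) picture. The region layout and loop structure here are illustrative only; hitting the window is timing-dependent and not guaranteed to reproduce the crash.

    /* Thread A: madvise(MADV_DONTNEED) loop; Thread B: page-fault loop.
     * Build with: gcc -pthread repro.c  (MADV_HUGEPAGE needs THP-aware
     * kernel headers). */
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL << 20)          /* one 2 MB huge page */

    static char *region;

    static void *thread_b(void *arg)        /* Thread B: keep faulting in */
    {
            (void)arg;
            for (;;)
                    region[random() % HPAGE_SIZE] = 1;
            return NULL;
    }

    int main(void)
    {
            pthread_t tb;

            /* 2 MB-aligned anonymous memory so THP can back it */
            if (posix_memalign((void **)&region, HPAGE_SIZE, HPAGE_SIZE))
                    return 1;
            madvise(region, HPAGE_SIZE, MADV_HUGEPAGE);

            pthread_create(&tb, NULL, thread_b, NULL);
            for (;;)                        /* Thread A: zap the same range */
                    madvise(region, HPAGE_SIZE, MADV_DONTNEED);
    }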