
Feb 2018


02/28 Wed

  • patch fork, and cow handler
  • debug pcache, while running python hello world
  • add vDSO, gettimeofday

So, it is the end of the day. After adding the wp handler, I now have the whole picture of pcache activities and the interactions between them. The reclaim, zap, move, copy, and add operations need to be carefully synchronized, as does the refcount etc. I feel the ground rule is: a PCM that a function is currently using must not suddenly become invalid due to other operations. This has to be synced by refcount, lock, and flags. Oh well, mm is hard with SMP, but also fun.
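That ground rule can be sketched in plain C (get_pcm_unless_zero, put_pcm, and try_reclaim are illustrative names, not Lego's actual API): a user takes a reference before touching a PCM, and reclaim only wins when it holds the last reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for a pcache line's metadata (PCM). */
struct pcache_meta {
	atomic_int refcount;	/* 0 means dead / being reclaimed */
};

/* Take a reference only if the PCM is still live, mirroring the
 * get_page_unless_zero() pattern: a PCM a function is currently
 * using can then never become invalid under it. */
static bool get_pcm_unless_zero(struct pcache_meta *pcm)
{
	int old = atomic_load(&pcm->refcount);

	while (old > 0)
		if (atomic_compare_exchange_weak(&pcm->refcount, &old, old + 1))
			return true;
	return false;
}

static void put_pcm(struct pcache_meta *pcm)
{
	atomic_fetch_sub(&pcm->refcount, 1);
}

/* Reclaim/zap/move may only proceed when nobody else holds a reference. */
static bool try_reclaim(struct pcache_meta *pcm)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&pcm->refcount, &expected, 0);
}
```

The refcount alone only guarantees liveness; the lock and flag bits are still needed for the operations that must exclude each other.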

We are very close to having a fully working OS.

I did not have time to look into the python hello world bug. It is a very serious one; it may also flush out some root bugs.


02/27 Tue

Spent two days on the CS527 course project, implemented a small SSHD and SSH client. And we have to inject exactly five bugs, or vulnerabilities, into the systems. Lol, it is really hard to intentionally plant BUGs!

Anyway, back to Lego. Since others are having a hard time compiling programs statically, I will try to add the dynamic loader today.

The interpreter: /lib64/ld-linux-x86-64.so.2.
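That path is what the kernel reads out of the binary's PT_INTERP program header. A userland sketch (synthetic ELF image, not the Lego loader itself) of how the loader would locate it:

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Scan the program headers of an in-memory ELF image for PT_INTERP
 * and return a pointer to the interpreter path, or NULL if the
 * binary is statically linked. */
static const char *find_interp(const unsigned char *image)
{
	const Elf64_Ehdr *eh = (const Elf64_Ehdr *)image;
	const Elf64_Phdr *ph = (const Elf64_Phdr *)(image + eh->e_phoff);

	for (int i = 0; i < eh->e_phnum; i++)
		if (ph[i].p_type == PT_INTERP)
			return (const char *)(image + ph[i].p_offset);
	return NULL;
}
```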

Linux seq.c maps (no randomization):
00400000-00401000 r-xp 00000000 fd:00 18752683                           /root/ys/LegoOS/usr/a.out
00600000-00601000 r--p 00000000 fd:00 18752683                           /root/ys/LegoOS/usr/a.out
00601000-00602000 rw-p 00001000 fd:00 18752683                           /root/ys/LegoOS/usr/a.out
00602000-00604000 rw-p 00000000 00:00 0                                  [heap]
7ffff7a18000-7ffff7bd0000 r-xp 00000000 fd:00 55051990                   /usr/lib64/libc-2.17.so
7ffff7bd0000-7ffff7dd0000 ---p 001b8000 fd:00 55051990                   /usr/lib64/libc-2.17.so
7ffff7dd0000-7ffff7dd4000 r--p 001b8000 fd:00 55051990                   /usr/lib64/libc-2.17.so
7ffff7dd4000-7ffff7dd6000 rw-p 001bc000 fd:00 55051990                   /usr/lib64/libc-2.17.so
7ffff7dd6000-7ffff7ddb000 rw-p 00000000 00:00 0
7ffff7ddb000-7ffff7dfc000 r-xp 00000000 fd:00 55051983                   /usr/lib64/ld-2.17.so
7ffff7fde000-7ffff7fe1000 rw-p 00000000 00:00 0
7ffff7ff9000-7ffff7ffa000 rw-p 00000000 00:00 0
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00021000 fd:00 55051983                   /usr/lib64/ld-2.17.so
7ffff7ffd000-7ffff7ffe000 rw-p 00022000 fd:00 55051983                   /usr/lib64/ld-2.17.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Lego after loading:
00400000-00401000 r-xp 00000000 /root/ys/LegoOS/usr/a.out
00600000-00602000 rw-p 00000000 /root/ys/LegoOS/usr/a.out
00602000-00604000 rw-p 00000000 [heap]
7ffff7ddb000-7ffff7dfc000 r-xp 00000000 /lib64/ld-linux-x86-64.so.2
7ffff7ffc000-7ffff7ffe000 rw-p 00021000 /lib64/ld-linux-x86-64.so.2
7ffff7ffe000-7ffff7fff000 rw-p 00000000
7ffffffde000-7ffffffff000 rw-p 00000000 [stack]


[ 2066.379224] ****    Finish dump final mm
[ 2066.426023] handle_p2m_execve(): reply_status: OKAY, new_ip: 0x7ffff7ddc170, new_sp: 0x7fffffffede0
[ 2066.628949] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:150 vaddr:0x7ffff7ddc170
[ 2066.732034] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:150 vaddr:0x7ffff7ddc170
[ 2066.934947] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:51 vaddr:0x7fffffffedd8
[ 2067.036978] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:51 vaddr:0x7fffffffedd8
[ 2067.238842] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ffce00
[ 2067.340880] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ffce00
[ 2067.542747] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:51 vaddr:0x7ffff7ffd9a8
[ 2067.644774] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:51 vaddr:0x7ffff7ffd9a8
[ 2067.846640] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ddb8e0
[ 2067.948679] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ddb8e0
[ 2068.355424] ------------[ cut here ]------------
[ 2068.408568] WARNING: CPU: 4 PID: 31 at managers/memory/handle_pcache/fault.c:54 handle_p2m_pcache_miss+0x29d/0x380
[ 2068.532327] src_nid:0,pid:32,vaddr:0x7ffff7e0e000
[ 2068.588487] CPU: 4 PID: 31 Comm: mc-manager 4.0.0-lego-ys+ #100
[ 2068.659207] Stack:
[root@wuklab13: lib64] $ ll ld-*
-rwxr-xr-x 1 root root 164112 Nov 30 13:53 ld-2.17.so
lrwxrwxrwx 1 root root     10 Jan  8 12:34 ld-linux-x86-64.so.2 -> ld-2.17.so
[root@wuklab13: lib64]

It turns out there is a bug in the mmap code: I forgot to increment the file refcount when a file-backed vma is created, so a put_file in the loader accidentally freed the ld-linux file. Bug fixed; the dynamic loader works like a charm.
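The fix pattern, as a userland sketch (the structs and helpers here are illustrative, not the actual Lego mmap code): the vma takes its own reference on the file, so the loader's put_file drops only the loader's reference instead of freeing the file under the mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct file {
	atomic_int f_count;
	bool freed;
};

static void get_file(struct file *f)
{
	atomic_fetch_add(&f->f_count, 1);
}

static void put_file(struct file *f)
{
	/* Dropping the last reference releases the file. */
	if (atomic_fetch_sub(&f->f_count, 1) == 1)
		f->freed = true;
}

struct vm_area {
	struct file *vm_file;
};

/* The bug was creating a file-backed vma without get_file(): any
 * later put_file() (e.g. in the loader) then freed the file early. */
static void vma_set_file(struct vm_area *vma, struct file *f)
{
	get_file(f);		/* the missing refcount bump */
	vma->vm_file = f;
}

static void vma_drop_file(struct vm_area *vma)
{
	put_file(vma->vm_file);
	vma->vm_file = NULL;
}
```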


02/24 Sat

Well. PhDs do not have weekends. Anyway, it is Saturday after all, so I relaxed a little bit. I was looking into the pcache issue. Also added our own kernel-level strace.


02/23 Fri

Solved the FPU BUG

current is fine. I should not compare the old implementation with the new per-cpu current. I forgot that the kernel stack is switched in __switch_to_asm. This means that in __switch_to(), we are actually using next_p's kernel stack. So there is a small time window where current_thread_info() points to next_p while current_task is still prev_p. Since interrupts are disabled during context switch, we are fine with this mismatch.
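The window can be simulated in userland C (all names illustrative): __switch_to_asm switches the stack first, so the stack-derived current already says next_p while the per-cpu current_task still says prev_p until __switch_to() writes it.

```c
#include <assert.h>

struct task {
	int pid;
};

/* Simulated per-cpu current_task and stack-derived current. */
static struct task *current_task;	/* this_cpu current_task */
static struct task *thread_info_task;	/* current_thread_info()->task */

/* Inside __switch_to() the kernel stack already belongs to next_p,
 * so the two notions of "current" briefly disagree. Interrupts are
 * off for the whole switch, so nothing can observe the mismatch. */
static void sim_switch_to(struct task *prev_p, struct task *next_p)
{
	assert(thread_info_task == next_p);	/* stack already switched */
	assert(current_task == prev_p);		/* per-cpu not updated yet */
	current_task = next_p;	/* this_cpu_write(current_task, next_p) */
}

/* __switch_to_asm switches the kernel stack, then calls __switch_to. */
static void sim_switch_to_asm(struct task *prev_p, struct task *next_p)
{
	thread_info_task = next_p;
	sim_switch_to(prev_p, next_p);
}
```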

Ruling out current, the only thing left is the fpu__copy warning, which happens during copy_process(). One weird thing is that this function had been called multiple times before it showed a warning. The system itself uses this function to create a lot of background threads, which are fine. Only when it is triggered by sys_clone do we get the warning:

[ 3213.055639] CPU: 6 PID: 17 sys_clone+0x0/0x30
[ 3213.056584] new task_struct: ffff88083e4c9838
[ 3213.057530] arch_dup_task_struct cpu6 dst:ffff88083e4c9838 17 word_count-seq src:ffff88083e457838 17 word_count-seq
[ 3213.059536] TRAP do_general_protection in CPU6, error_code: 0 current:ffff88083e457838 17 word_count-seq
[ 3213.061289] fixup_exception pid(17) cpu(6) insn:0xffffffff81009a21(fpu__copy+0x81/0x260) fixup:0xffffffff8105d9b2(__fixup_text_start+0xc2/0x322) handler:ex_handler_default+0x0/0x20
[ 3213.064114] ------------[ cut here ]------------
[ 3213.065040] WARNING: CPU: 6 PID: 17 at ./arch/x86/include/asm/fpu/internal.h:354 fpu__copy+0xc3/0x260
[ 3213.066760] CPU: 6 PID: 17 Comm: word_count-seq 4.0.0-lego+ #6
[ 3213.067855] Stack:
[ 3213.068424] ffff88083e4c7dd0 ffffffff810124b5 ffff88083e4c9bf8 ffff88083e4c9c38
[ 3213.070133] ffff88083e4c9838 00007ffff7ffd700 ffff88083e4c7de0 ffffffff8101258f
[ 3213.071775] ffff88083e4c7e08 ffffffff81009a63 ffff88083e457838 ffff88083e4c9838
[ 3213.073419] ffff88083e457838 ffff88083e4c7e40 ffffffff81000ebb ffff88083e457838
[ 3213.075057] ffff880800000011 ffff88083e457a68 00000000003d0f00 ffff88083e457838
[ 3213.076703] Call Trace:
[ 3213.077295] <TSK>
[ 3213.077828] [<ffffffff810124c1>] __warn.constprop.0+0x91/0xd0
[ 3213.078855] [<ffffffff8101258f>] warn_slowpath_null+0xf/0x20
[ 3213.081653] [<ffffffff81009a63>] fpu__copy+0xc3/0x260
[ 3213.082543] [<ffffffff81000ebb>] arch_dup_task_struct+0x7b/0x90
[ 3213.083667] [<ffffffff8101d32e>] copy_process+0x14e/0x10e0
[ 3213.084618] [<ffffffff8103a3c6>] ? n_tty_write+0x166/0x3c0
[ 3213.085564] [<ffffffff8101e2e6>] do_fork+0x26/0x140
[ 3213.086439] [<ffffffff8101e4a0>] ? sys_vfork+0x40/0x40
[ 3213.087333] [<ffffffff8101e4a0>] ? sys_vfork+0x40/0x40
[ 3213.088232] [<ffffffff8101e4c9>] sys_clone+0x29/0x30
[ 3213.089109] [<ffffffff8100e719>] do_syscall_64+0x69/0xf0
[ 3213.090030] [<ffffffff8100d5ec>] entry_SYSCALL64_slow_path+0x25/0x25
[ 3213.091078] <EOT>
[ 3213.091580] ---[ end trace 0000000000000000 ]---
[ 3213.093250] TRAP do_general_protection in CPU7, error_code: 0 current:ffff88083fd0f008 0 swapper/7
[ 3213.096526] fixup_exception pid(0) cpu(7) insn:0xffffffff81000c62(__switch_to+0x452/0x630) fixup:0xffffffff8105d922(__fixup_text_start+0x32/0x322) handler:ex_handler_default+0x0/0x20
[ 3213.101241] ------------[ cut here ]------------
[ 3213.103285] WARNING: CPU: 7 PID: 0 at ./arch/x86/include/asm/fpu/internal.h:369 __switch_to+0x47e/0x630

So, dig into fpu__copy() and find out why it fails at this certain point. Glad I have something to dig into.

The instruction that leads to the GP is:

ffffffff8100b0f5:       48 0f ae 27             xsave64 (%rdi)

which is generated by:

#define XSTATE_XSAVE(st, lmask, hmask, err)                             \
        asm volatile(ALTERNATIVE_2(XSAVE,                               \
                                   XSAVEOPT, X86_FEATURE_XSAVEOPT,      \
                                   XSAVES,   X86_FEATURE_XSAVES)        \
                     "\n"                                               \
                     "xor %[err], %[err]\n"                             \
                     "3:\n"                                             \
                     ".pushsection .fixup,\"ax\"\n"                     \
                     "4: movl $-2, %[err]\n"                            \
                     "jmp 3b\n"                                         \
                     ".popsection\n"                                    \
                     _ASM_EXTABLE(661b, 4b)                             \
                     : [err] "=r" (err)                                 \
                     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)    \
                     : "memory")
static inline void copy_xregs_to_kernel(struct xregs_state *xstate)
{
        u64 mask = -1;
        u32 lmask = mask;
        u32 hmask = mask >> 32;
        int err;

        WARN_ON(!alternatives_patched);

        XSTATE_XSAVE(xstate, lmask, hmask, err);

        /* We should never fault when copying to a kernel buffer: */
        WARN_ON_FPU(err);
}

From SDM on XSAVE: Use of a destination operand not aligned to 64-byte boundary (in either 64-bit or 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.

%rdi is struct xregs_state *xstate in the above code. Thus, check whether xstate is 64-byte aligned. Of course, it is not:

[10894.999997] copy_xregs_to_kernel CPU6 xstate: ffff88083e4c8c38

Hehe. Criminal identified. But why? The xstate structure is already marked __attribute__((aligned(64))) in the code. It is the task_struct that is NOT 0x40 aligned. But god why? Because we currently use kmalloc to allocate new task_structs, and its minimum alignment is only 8 bytes. Anyway, use __alloc_pages instead.
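The failure mode is easy to reproduce in userland (the structs below are illustrative stand-ins, not the kernel's): an aligned(64) member is only aligned relative to the start of its container, so an 8-byte-aligned kmalloc-style base leaves the xsave area misaligned and XSAVE faults.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the xsave area; XSAVE requires a 64-byte-aligned
 * destination, hence the alignment attribute (like xregs_state). */
struct fake_xregs_state {
	unsigned char data[512];
} __attribute__((aligned(64)));

/* Stand-in for task_struct embedding the FPU state. */
struct fake_task_struct {
	long pad;
	struct fake_xregs_state xstate;	/* at a 64-byte offset within the struct */
};

/* XSAVE would #GP unless this holds for the destination operand. */
static int xstate_is_xsave_safe(const struct fake_task_struct *t)
{
	return ((uintptr_t)&t->xstate % 64) == 0;
}
```

With the base 64-byte aligned (what a page allocator gives), the member is aligned; shift the base by 8 bytes, as a minimal kmalloc may, and it no longer is.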

Such a deeply hidden bug. Took me almost a month to find.

IB

Seen this during boot (at both P and M, although Lego continues running correctly):

[54017.712533] ***    NodeID    Hostname    LID    QPN
[54017.770776] ***    -------------------------------------
[54017.834220] ***         0    wuklab12     13     72
[54017.892462] ***         1    wuklab14     16     72 <---
[54017.955906] ***         2    wuklab16     20     74
[54018.014149] ***
[54074.552844] ***  Start establish connection (mynodeid: 1)
[54102.554407] ib_process_mad mad_ifc fails
[54130.960691] ***  recvpollcq runs on CPU2
[54131.070918] ***  Successfully built QP for node  0 [LID: 13 QPN: 72]
[54131.152936] ***  Successfully built QP for node  2 [LID: 20 QPN: 74]
[54161.228245] ***  FIT layer ready to go!
[54161.272034] ***
Another one:
[ 1966.930409] ***
[ 1966.951210] ***  FIT_initial_timeout_s:   30
[ 1967.002168] ***  FIT_local_id:            0
[ 1967.052087] ***
[ 1967.072887] ***    NodeID    Hostname    LID    QPN
[ 1967.131126] ***    -------------------------------------
[ 1967.194567] ***         0    wuklab12     13     72 <---
[ 1967.258005] ***         1    wuklab14     16     72
[ 1967.316244] ***         2    wuklab16     20     74
[ 1967.374484] ***
[ 2032.926448] ***  Start establish connection (mynodeid: 0)
[ 2032.996068] Fail to modify qp[6]
[ 2033.032572] Fail to do client_init_ctx
[ 2033.077287] client_establish_conn: ctx           (null) fail to init_interface
[ 2033.164646] ibapi_establish_conn: ctx           (null) fail to init_interface
[ 2033.250967] ***
[ 2035.620167] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[ 2035.713763] IP: [<ffffffff8105c589>] client_send_reply_with_rdma_write_with_imm+0x69/0x3b0
[ 2035.812562] PGD 0
[ 2035.836482] Oops: 0002 [#1] SMP PROCESSOR
[ 2035.884321] CPU: 0 PID: 1 Comm: kernel_init 4.0.0-lego-ys+ #253
[ 2035.955041] RIP: 0010:[<ffffffff8105c589>]  [<ffffffff8105c589>] client_send_reply_with_rdma_write_with_imm+0x69/0x3b0
...
[ 2037.313267] <TSK>
[ 2037.336146] [<ffffffff8105a377>] ibapi_send_reply_timeout+0x57/0x70
[ 2037.411025] [<ffffffff81033d24>] ? net_send_reply_timeout+0x94/0x132
[ 2037.486944] [<ffffffff81033d24>] net_send_reply_timeout+0x94/0x132

pcache

Running word_count-pthread with a 100MB dataset, I finally got a reasonable bug:

[54211.243181] pcache_evict_line(): pset: ffff88207f86e3c0, for uva: 0x7ffff1b8f000
[54211.385654] pcache:ffff88207f86e3a8 mapcount:8 refcount:0 flags:()
[54211.510447] pcache dumped because: PCACHE_BUG_ON_PCM(!PcacheLocked(pcm))
[54212.080336] BUG: failure at managers/processor/pcache/evict.c:240/pcache_evict_line()!
[54212.664785] Kernel Panic - not syncing: BUG!
[54212.715742] CPU: 8 PID: 81 Comm: word_count-pthr 4.0.0-lego-ys+ #252
...
[54213.391706] <TSK>
[54213.414584] [<ffffffff81024180>] panic+0xc2/0xeb
[54213.524818] [<ffffffff8101b81c>] ? task_tick_rt+0x2c/0xd0
[54213.589295] [<ffffffff81018f75>] ? scheduler_tick+0x55/0x60
[54213.655850] [<ffffffff81016625>] ? tick_handle_periodic+0x45/0x70
[54213.728647] [<ffffffff81006634>] ? apic_timer_interrupt+0x54/0x90
[54213.801443] [<ffffffff8100e22a>] ? smp__apic_timer_interrupt+0x6a/0x70
[54213.879439] [<ffffffff8101256d>] ? printk+0x11d/0x1b0
[54214.103027] [<ffffffff8102ecf4>] pcache_evict_line+0x134/0x220
[54214.172703] [<ffffffff8102c6ae>] pcache_alloc+0x22e/0x2e0
[54214.237179] [<ffffffff8102be0a>] common_do_fill_page+0x2a/0x1f0
[54214.307895] [<ffffffff8102baf0>] ? move_page_tables+0x4c0/0x4c0
[54214.378612] [<ffffffff8102c172>] pcache_handle_fault+0x1a2/0x3a0
[54214.450367] [<ffffffff8100fc02>] do_page_fault+0xa2/0x1a0
[54214.514843] [<ffffffff8100d85f>] page_fault+0x1f/0x30
[54214.575161] [<ffffffff81034842>] ? copy_user_enhanced_fast_string+0x2/0x10
[54214.657316] [<ffffffff81032368>] ? seq_read+0x248/0x360
[54214.719714] [<ffffffff810307af>] sys_read+0x3f/0xc0
[54214.777949] [<ffffffff81030770>] ? sweep_pset_lru+0x220/0x220
[54214.846587] [<ffffffff8100e619>] do_syscall_64+0x69/0xf0
[54214.910022] [<ffffffff8100d4ec>] entry_SYSCALL64_slow_path+0x25/0x25
[54214.985939] <EOT>

Another one:

[  735.393244] pcache_evict_line(): pset: ffff88207f86e3c0, for uva: 0x7ffff1b8fd90
[  735.537804] pcache:ffff88207f86e3a8 mapcount:8 refcount:0 flags:()
[  735.663642] pcache dumped because: PCACHE_BUG_ON_PCM(!PcacheLocked(pcm))

Do note this happens after computation, when phoenix creates a lot of threads to sort the results.

Both bugs happen on the same set, same user page. The pcache is clearly corrupted: mapcount:8, refcount:0, flags:().

Come back after dinner. Remember to check alternatives, because the XSAVE above should be XSAVEOPT. Make sure it does not overwrite other memory. Also, check the linker script. Do not forget to link any sections.

Several more bug logs are on wuklab13 and wuklab15: 022318-*. I’m really tired today after fixing the FPU bug. But I’m also pretty confident pcache is something I’m able to debug, even though it is hard in the SMP case.

Anyway, gonna call it a day.


02/22 Thur

  • context switch fpu
  • signal compat check, all good.
  • make current use percpu current_task, so all code in Lego is consistent.
  • checked entry_SYSCALL_64 again, which looks good to me.
  • The only concern is rsp_scratch and current_top_of_stack, which are per-cpu variables. If these per-cpu variables are set up wrong, then we are doomed.
  • Also check whether the per-cpu areas are all cleared.
  • try big syscall lock
  • does x86 have to use different kernel stacks? Interrupts use a different stack in Linux; do we have to do the same???
  • check current is correct. compare with old implementation.

First of all, FPU is definitely functional for now. Since I replaced the current macro today, I added some code to check whether this current matches our old implementation:

static __always_inline struct task_struct *get_current(void)                                                           
{                                                                                                                      
        return this_cpu_read_stable(current_task);                                                                     
}

//#define current get_current()

#define current                                                 \
({                                                              \
        struct task_struct *old = current_thread_info()->task;  \
        struct task_struct *new = get_current();                \
                                                                \
        if (old != new) {                                       \
                printk("%s:%d() cpu:%d old:%pS %d %s new:%pS %d %s\n",  \
                        __func__, __LINE__, smp_processor_id(), old, old->pid, old->comm, \
                        new, new->pid, new->comm);              \
                BUG();                                          \
        }                                                       \
        get_current();                                          \
})

Combined with some FPU warning, it is now like this:

[ 3273.748819] CPU:5 PID:32   sys_clone+0x0/0x30
[ 3273.800808] alloc_task_struct_node: size:740 ffff88107e831838
[ 3273.869451] arch_dup_task_struct() CPU5 current:32 new: ffff88107e831838 old: ffff88107e827838 32
[ 3273.975533] ------------[ cut here ]------------
[ 3274.030651] WARNING: CPU: 5 PID: 32 at ./arch/x86/include/asm/fpu/internal.h:354 fpu__copy+0xe2/0x310
[ 3274.140895] CPU: 5 PID: 32 Comm: word_count-pthr 4.0.0-lego-ys-gdbe6dbe-dirty #249
[ 3274.231377] Stack:
[ 3274.255298] ffff88107e82fd68 ffffffff81016dbf 00000000ffffffff 0000000000000000
[ 3274.342659] 00000000ffffffff 0000000000000000 ffff88107e831bf8 ffff88107e831c38
[ 3274.430021] ffff88107e831838 000000207fe64000 ffff88107e82fd78 ffffffff810170af
[ 3274.517382] ffff88107e82fdc0 ffffffff8100b052 0000000000000020 ffff88107e831838
[ 3274.604745] ffff88107e827838 ffff88107e827838 ffff88107e831838 ffff88107e827838
[ 3274.692106] Call Trace:
[ 3274.721229] <TSK>
[ 3274.744109] [<ffffffff81016dd8>] __warn.constprop.0+0xe8/0x3b0
[ 3274.813790] [<ffffffff810170af>] warn_slowpath_null+0xf/0x20
[ 3274.881391] [<ffffffff8100b052>] fpu__copy+0xe2/0x310
[ 3274.941713] [<ffffffff810012e4>] arch_dup_task_struct+0x84/0x120
[ 3275.013475] [<ffffffff81022c10>] copy_process+0x160/0x1e60
[ 3275.078996] [<ffffffff81024936>] do_fork+0x26/0x140
[ 3275.137238] [<ffffffff81024af0>] ? sys_vfork+0x40/0x40
[ 3275.198599] [<ffffffff81024af0>] ? sys_vfork+0x40/0x40
[ 3275.259960] [<ffffffff81024b19>] sys_clone+0x29/0x30
[ 3275.319242] [<ffffffff81012314>] do_syscall_64+0x84/0x240
[ 3275.383723] [<ffffffff8101106c>] entry_SYSCALL64_slow_path+0x25/0x25
[ 3275.459645] <EOT>
[ 3275.482526] ---[ end trace 0000000000000000 ]---
[ 3275.537648] wake_up_new_task CPU5 task:ffff88107e831838, dest_cpu:6 current:32
[ 3275.623970] SMP IPI: reschedule_interrupt() CPU(6) PID(0)
[ 3275.739412] do_general_protection:186() cpu:6 old:0xffff88107e831838 33 word_count-pthr new:0xffff88107fcaf008 0 swapper/6

[ 3275.871493] ------------[ cut here ]------------
[ 3275.926614] BUG: failure at arch/x86/kernel/traps.c:186/do_general_protection()!
[ 3276.015018] Kernel Panic - not syncing: BUG!
[ 3276.065978] panic:107() cpu:6 old:0xffff88107e831838 33 word_count-pthr new:0xffff88107fcaf008 0 swapper/6

Based on the switch code:

__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
        this_cpu_write(current_task, next_p);

        /* Reload sp0. This changes current_thread_info(). */
        load_sp0(tss, next);
}

Based on the do_general_protection log line near the end (old:0xffff88107e831838 33 word_count-pthr new:0xffff88107fcaf008 0 swapper/6), load_sp0() already happened, which means this_cpu_write(..) happened too. If this_cpu_write(..) happened, then that line’s new should have been updated to 0xffff88107e831838. Something wrong with percpu?


02/21 Wed

  • irq_regs, old code, check
  • signal frame, and fpu hook together Done
  • in_interrupt(), it is empty, TODO
  • check arch/x86/Makefile; it introduces a lot of FPU flags.
  • added more than 4K lines today. Damn FPU. Ugh, go home and sleep.

02/20 Tue Cloudy

Not too many sunny days recently. Well, continue yesterday’s work. I don’t think I can easily find out why so many /proc/meminfo opens happened. Instead, I’m trying to enable flush_thread in P’s exec code.

Along the way, I found some issues related to __ARCH_HAS_SA_RESTORER in the signal code. I need to check whether these x86 macros are defined; Lego does not port them.

Well, it turns out flush_thread does not make much of a difference. Next I’m going to try disabling exit_thread, which uses fpu__drop().

Hmm, disabling exit_thread also does not work.


02/19 Mon Rainy

It is another week. I cannot deny I’m a little tired of the bug. Tried so many possible solutions, but none of them worked. Well, today I first need to test the vma changes (the pgoff and anon_vma thing), especially vma merge and split.

This morning I fixed a bug in the kernel_init process: make kernel_init able to run on all possible CPUs. Because the first user process is forked from kernel_init, it is quite important that it gets the right cpu affinity:

static int kernel_init(void *unused)
{
        ...
        set_cpus_allowed_ptr(current, cpu_possible_mask);
        ...
}

Well, interestingly, the unmodified word_count-pthread succeeds with the 50MB dataset… with or without any DEBUG option! Amazing! I need to find out why cpus_allowed becomes 0 at the beginning of kernel_init, because init_task actually has:

    .cpus_allowed   = CPU_MASK_ALL,
    .nr_cpus_allowed= NR_CPUS,

Things to do next:

  • check why the cpus_allowed changed
  • check why word_count-pthread opens /dev/../cpu so many times. Anything wrong with our copy_files, or open, close?
  • here is an idea: to verify the FPU code is correct, run some scientific benchmarks.

Okay, findings:

  • cpus_allowed is fine; it is reset inside sched_init() when it tries to make init_task the idle thread. Thus it is reasonable to set cpus_allowed again in the kernel_init thread. And it should have NOTHING to do with the bug.

  • about the second, check the following log:

    [11838.364543] STDOUT: ---[
    Wordcount: Running...
    ]---
    [11838.422886] STDOUT: ---[
    
    
    ]---
    [11838.463445] SYSC_open(cpu5 pid:32): f_name: /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_50MB.txt, flags: 0, mode: 900
    [11838.619460] SYSC_open(cpu5 pid:32): fd: 3
    [11838.667406] SYSC_open(cpu5 pid:32): f_name: /sys/devices/system/cpu/online, flags: 80000, mode: 0
    [11838.773351] SYSC_open(cpu5 pid:32): fd: 4
    [11838.821239] seq_file:
      dest_uva: 00007fffffffc8d0, nr_chars: 5
      string: [0-23
    ]
    [11838.913791] SYSC_close(cpu5 pid:32): fd: 4
    [11838.962622] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
    [11840.223255] STDOUT: ---[
    Word Count: Computation Completed 1.555581 sec
    
    ]---
    [11840.309678] SYSC_open(cpu5 pid:32): f_name: /sys/devices/system/cpu/online, flags: 80000, mode: 0
    [11840.415754] SYSC_open(cpu5 pid:32): fd: 4
    [11840.463593] seq_file:
      dest_uva: 00007fffffffc8a0, nr_chars: 5
      string: [0-23
    ]
    [11840.556147] SYSC_close(cpu5 pid:32): fd: 4
    [11840.605024] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
    [11840.677821] STDOUT: ---[
    THe number of processors is 24
    
    รด
    ]---
    [11840.753769] SYSC_open(cpu7 pid:80): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11840.844212] SYSC_open(cpu19 pid:92): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11840.935728] SYSC_open(cpu7 pid:80): fd: 4
    [11840.983567] SYSC_open(cpu19 pid:92): fd: 5
    [11841.032444] seq_file:
      dest_uva: 00007ffff444c000, nr_chars: 172
      string: [MemTotal:       115355128 kB
    MemFree:        115355128 kB
    MemAvailable:   115355128 kB
    DirectMap4k:        5812 kB
    DirectMap2M:     1861632 kB
    DirectMap1G:    134217728 kB
    ]
    [11841.305953] seq_file:
      dest_uva: 00007ffff444b000, nr_chars: 172
      string: [MemTotal:       115355128 kB
    MemFree:        115355128 kB
    MemAvailable:   115355128 kB
    DirectMap4k:        5812 kB
    DirectMap2M:     1861632 kB
    DirectMap1G:    134217728 kB
    ]
    [11841.579460] SYSC_close(cpu7 pid:80): fd: 4
    [11841.628339] SYSC_close(cpu19 pid:92): fd: 5
    [11841.678257] SYSC_close(): [4] -> [/proc/meminfo]
    [11841.733375] SYSC_close(): [5] -> [/proc/meminfo]
    [11841.788493] SYSC_open(cpu18 pid:91): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11841.880008] SYSC_open(cpu6 pid:102): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11841.971523] SYSC_open(cpu12 pid:85): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11842.063040] SYSC_open(cpu0 pid:97): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11842.153516] SYSC_open(cpu14 pid:87): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11842.245032] SYSC_open(cpu16 pid:89): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11842.336548] SYSC_open(cpu4 pid:100): f_name: /proc/meminfo, flags: 80000, mode: 1b6
    [11842.428064] SYSC_open(cpu16 pid:89): fd: 9
    [11842.476942] SYSC_open(cpu4 pid:100): fd: 10
    [11842.526860] seq_file:
      dest_uva: 00007ffff444c000, nr_chars: 172
      string: [MemTotal:       115355128 kB
    MemFree:        115355128 kB
    MemAvailable:   115355128 kB
    DirectMap4k:        5812 kB
    DirectMap2M:     1861632 kB
    DirectMap1G:    134217728 kB
    ]
    [11842.800368] seq_file:
      dest_uva: 00007ffff444b000, nr_chars: 172
      string: [MemTotal:       115355128 kB
    MemFree:        115355128 kB
    MemAvailable:   115355128 kB
    DirectMap4k:        5812 kB
    DirectMap2M:     1861632 kB
    DirectMap1G:    134217728 kB
    ]
    [11843.073877] SYSC_close(cpu16 pid:89): fd: 9
    

  • However, in a normal Linux execution:

    strace -C -o strace_2 ./word_count-pthread ./word_count_datafiles/word_50MB.txt
    
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     86.41    0.052074        1736        30           futex
      6.89    0.004151          67        62           munmap
      2.47    0.001490          17        88           mmap
      2.12    0.001278          14        93           clone
      1.51    0.000912          14        64           mprotect
      0.19    0.000117           7        16           write
      0.15    0.000092          46         2           open
    
    $ cat strace_2 | grep open
      open("./word_count_datafiles/word_50MB.txt", O_RDONLY) = 3
      open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 4
    

  • It opened /proc/meminfo way too many times. In a normal Linux execution, this does not happen. Is it because our meminfo is faked, so glibc is complaining? But why does it not open meminfo while running on Linux? Or did our entry assembly mess up some stuff on the stack, so the return path changed?

  • oh, about the FPU. It reminds me that our flush_thread function actually had an issue before: when I enabled this function during loading on P, P would crash. Within flush_thread, there is a fpu_clear!!! So, check this tomorrow! (12:00am, need to go home)


02/18 Sun Sunny

It is a nice day. Yesterday I changed one line of code in the mmap code path: change an anonymous vma’s pgoff from some value to 0. The result is that I got several successful word_count-pthread (bound to one core) runs. However, it still fails with the unmodified word_count-pthread.

This brings me to inspect the pgoff manipulation code and all of the mmap.c code. We ported everything from Linux with almost zero modification. That means we ported all that useless anon_vma and pgoff code, which is used a lot by the vma_merge and vma_split code. The thing is: our memory manager and our vma code do not need the anon_vma structure and do not maintain pgoff. Thus, I’m a little worried the Linux code may be doing something crazy behind our back: messing up vmas and pages, so a pcache miss gets the wrong page.

Well. Lego does not use anon_vma, and pgoff should only be used by file-backed vmas. So I decided to remove anon_vma from our code and make sure pgoff is used properly. Of course, the goal is to make vma_merge, split, and copy do the things we intended.

Lesson learned.


02/17 Sat Snowy

Fixed the bss bug. It comes from the loader: we did not implement the lego_clear_user function, so some parts of bss are non-zero.
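The required behavior is the classic ELF segment rule: file contents fill [0, p_filesz) and the loader must zero [p_filesz, p_memsz), which is the bss. A userland sketch (memset standing in for a real lego_clear_user over user memory):

```c
#include <assert.h>
#include <string.h>

/* Load one ELF segment: copy the file-backed part, then zero the
 * bss tail. Skipping the memset is exactly the bug: bss inherits
 * whatever stale bytes the page happened to contain. */
static void load_segment(unsigned char *mem, const unsigned char *file,
			 size_t filesz, size_t memsz)
{
	memcpy(mem, file, filesz);
	if (memsz > filesz)
		memset(mem + filesz, 0, memsz - filesz);
}
```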

Bad news is word_count-pthread still fails at the same fpu instruction. Have to look into the memory code more.

This is actually a fun debugging story. We should always add TODO or XXX or some warnings to unfinished code, no matter what. Lesson learned.


02/16 Fri Cloudy

Yilun found a major loader bug yesterday: the .bss section variables are not 0 in the iozone benchmark. I did not encounter this issue before with simple test programs. This is pretty serious.


02/15 Thur Rainy

Today is Chinese New Year.

The two faultin_page lines below show the uvas belong to the same page. Need to revisit get_arg_pages and related functions.

[  108.393991] handle_p2m_execve(): pid:22,argc:2,envc:2,file:/root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[  108.395255]     argc[0] (len: 65):  /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[  108.396329]     argc[1] (len: 82):  /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_100MB.txt
[  108.397530]     envc[0] (len:  7):  HOME=/
[  108.398069]     envc[1] (len: 11):  TERM=linux
[  108.398640] __bprm_mm_init vma: ffff88083effe6b8
[  108.399226] faultin_page vma: ffff88083effe6b8 uva: 0x7fffffffefed
[  108.399949] faultin_page vma: ffff88083effe6b8 uva: 0x7fffffffef94

Well, this turned out to be 100% fine. I wrote this loader code a long time ago and needed some time to pick it up again. After reading the loader code, especially the copy_strings function, I found this is okay: copy_strings is invoked three times, so faultin_page will be invoked at least three times. That is why it went into that pte fault handling code.

Although, actually, I think copy_strings should not use faultin_page; instead, it should use get_user_pages, which walks the page table first and only then falls back to handle_lego_mm_fault.


02/14 Wed Rainy

Hmm, tried to make kmalloc behave like kzalloc and bind all threads to one core; still got the same old bug:

  42731a:       f3 0f 6f 16             movdqu (%rsi),%xmm2

  [93182.657376] word_count-pthr[85] general protection ip:42731a sp:7fffe3ffed28 error:0
  [93182.747959] CPU: 8 PID: 85 Comm: word_count-pthr 4.0.0-lego+ #170
  [93182.820758] RIP: 0033:[<000000000042731a>]  [<000000000042731a>] 0x42731a
  [93182.901878] RSP: 002b:00007fffe3ffed28  EFLAGS: 00010283
  [93182.965317] RAX: 000000000000001f RBX: 00007ffff001b010 RCX: 0000000000000005
  [93183.050596] RDX: 0000000000000000 RSI: 5345485355420045 RDI: 00007ffff294791f
  [93183.135876] RBP: 00007ffff294791f R08: 000000000000ffff R09: 0000000000000008
  [93183.221156] R10: fffffffffffff048 R11: 00000000004acfc0 R12: 0000000000001cde
  [93183.306435] R13: 00000000006e4a8c R14: 0000000000001cd7 R15: 0000000000001cda
  [93183.391716] FS:  00007fffe3fff700(0000) GS:ffff88107fc80000(0000) knlGS:0000000000000000
  [93183.488434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [93183.557075] CR2: 00007ffff27a4000 CR3: 000000107e924000 CR4: 00000000000406a0

  427377:       66 0f 6f 17             movdqa (%rdi),%xmm2

  [93180.527248] word_count-pthr[93]: segfault at 0x0 ip 0000000000427377 sp 00007fffdfff6d28 error 4
  [93180.630314] CPU: 8 PID: 93 Comm: word_count-pthr 4.0.0-lego+ #170
  [93180.703114] RIP: 0033:[<0000000000427377>]  [<0000000000427377>] 0x427377
  [93180.784234] RSP: 002b:00007fffdfff6d28  EFLAGS: 00010297
  [93180.847674] RAX: 0000000000000000 RBX: 000000000073c4c0 RCX: 000000000000000d
  [93180.932953] RDX: 000000000000ffff RSI: 00007ffff4999070 RDI: 0000000000000000
  [93181.018233] RBP: 00007ffff499907d R08: 000000000000ffff R09: 0000000000000000
  [93181.103513] R10: 0000000000427760 R11: 00007ffff49982c0 R12: 0000000000000118
  [93181.188791] R13: 00000000006e4aac R14: 0000000000000116 R15: 0000000000000117
  [93181.274072] FS:  00007fffdfff7700(0000) GS:ffff88107fc80000(0000) knlGS:0000000000000000
  [93181.370790] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [93181.439430] CR2: 0000000000000000 CR3: 000000107e924000 CR4: 00000000000406a0

Tried several ways to ensure memory safety. It still failed even with all of them enabled. So I guess memory initialization is not the problem, and something else is wrong?

  • force alloc_pages to use __GFP_ZERO
  • make kmalloc behave as kzalloc
  • make kfree empty

I also suspect munmap may free extra, wrong pgtable entries. Although I've gone through all the code and checked it, in addition to the above I'm going to:

  • make munmap dummy (no p2m_munmap, return 0 directly)

Failed.

Next, I’m going to:

  • add checksum for every page transferred across network.
  • add warnings for abnormal cases

Bang! I found something while running P+M:

[  115.727597] Memory-component manager is up and running.
[  116.691723] handle_p2m_fork(): nid:0,pid:22,tgid:22,parent_tgid:1
[  116.697038] handle_p2m_fork(): reply: 0:OKAY
[  116.791088] handle_p2m_execve(): pid:22,argc:2,envc:2,file:/root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[  116.792357]     argc[0] (len: 65):  /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[  116.793439]     argc[1] (len: 82):  /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_100MB.txt
[  116.794653]     envc[0] (len:  7):  HOME=/
[  116.795196]     envc[1] (len: 11):  TERM=linux
[  116.795772] __bprm_mm_init vma: ffff88083effe6b8
[  116.796209] faultin_page vma: ffff88083effe6b8
[  116.796729] faultin_page vma: ffff88083effe6b8
[  116.797150] handle_pte_fault vma: ffff88083effe6b8 entry: 0xffff88083e8c1067
[  116.798044] pte:ffff88083e8c0ff0 pfn:0x8083e8c1 flags:(present|writable|user|accessed|dirty|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  116.799462] ------------[ cut here ]------------
[  116.800049] WARNING: CPU: 4 PID: 15 at managers/memory/vm/fault.c:148 handle_lego_mm_fault+0x4d8/0x550
[  116.801148] CPU: 4 PID: 15 Comm: mc-manager 4.0.0-lego+ #78
[  116.801818] Stack:
[  116.802179] ffff88083e893c50 ffffffff8100e827 00007fffffffef94 ffff88083effe6b8
[  116.803283] ffff88083e894008 ffff88083e8c1067 ffff88083e893c60 ffffffff8100e91f
[  116.804387] ffff88083e893cf0 ffffffff8102b008 0000000000000031 ffff88083e893cf0
[  116.805488] 0000000000000296 00003fffffe00000 ffff800000000067 ffff88083e893d50
[  116.806590] ffff880000000001 ffffffff81066798 ffff88083effe6b8 ffff88083e893d50
[  116.807691] Call Trace:
[  116.808087] <TSK>
[  116.808448] [<ffffffff8100e836>] __warn.constprop.0+0xa6/0x100
[  116.809126] [<ffffffff8100e91f>] warn_slowpath_null+0xf/0x20
[  116.809802] [<ffffffff8102b008>] handle_lego_mm_fault+0x4d8/0x550
[  116.810505] [<ffffffff8102cfe3>] faultin_page+0x43/0xb0
[  116.811131] [<ffffffff8102dab1>] copy_strings.isra.1+0xe1/0x130
[  116.811819] [<ffffffff8102dd1e>] exec_loader+0x21e/0x350
[  116.812457] [<ffffffff8102680a>] handle_p2m_execve+0x1aa/0x290

This is a temporary stack vma that the loader created for saving argv and envp. This vma was created here:

static int __bprm_mm_init(struct lego_binprm *bprm)
{
        ...
        bprm->vma = vma = kzalloc(sizeof(*vma), GFP_KERNEL);
        ...
}

And then copy_strings calls faultin_page to populate a page for a specific user virtual address:

int faultin_page(struct vm_area_struct *vma, unsigned long start,
                 unsigned long flags, unsigned long *kvaddr)
{
        ...
        ret = handle_lego_mm_fault(vma, start, flags, kvaddr);
        ...
}

Eventually, handle_lego_mm_fault calls handle_pte_fault:

static int handle_pte_fault(struct vm_area_struct *vma, unsigned long address,
                            unsigned int flags, pte_t *pte, pmd_t *pmd,
                            unsigned long *mapping_flags)
{
        ...
        if (!pte_present(entry)) {
                ...
        }

        pr_info("%s vma: %p entry: %#lx\n", FUNC, vma, entry.pte);
        dump_pte(pte, NULL);
        WARN_ON_ONCE(1);
        ...
}

Apparently, the pte is wrong! But I don't have time today; continue tomorrow. Hmm, I forgot that we are saving kernel virtual addresses in the pte. A quick look at the lego_pud_alloc code suggests it has some issues. I definitely need to check all this stuff tomorrow. I haven't touched this part for too long!


02/13 Tue Sunny

Checking our SLOB allocator today. I found Yutong's code uses set_page_private when slob gets a new page from the buddy allocator. The private field is intended to be used only by buddy, to record the order. This mixed usage confuses buddy and creates bugs.

Even after I removed the set_page_private(page, 0) after free_page, word_count-pthread still fails. Damn.


02/12 Mon Cloudy

Added this commit 4cb3a8b6a943c90714fd9bb5e5465ee315f0aa30:

    memory: Use kzalloc instead of kmalloc in __bprm_mm_init (loader)

    This was a potential bug that was not triggered previously.
    It is simply because kmalloc'ed vma contains some garbage area,
    while later in the pgfault code, we use
            if (vma->vm_ops && vma->vm_ops->fault)
                    ...
    to check if it is an file-backed fault.

    Fortunately the vma->vm_ops happens to have some leftover value.
    So this bug was triggered.

    This actually reminds me that this is a series of potential bugs!
    Even though before I've added things like force GFP_ZERO in all
    physical page allocation, I missed the kmalloc's case!

The story is:

I patched the stop_machine code today and tried to run the code with P+M on a VM; everything worked fine. However, when I tried to run the new code with P+M+S on a physical machine, M crashed at a very weird point:

[ 7791.998168] handle_p2m_execve(): pid:81,argc:2,envc:2,file:/root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 7792.129312] BUG: unable to handle kernel NULL pointer dereference at 0000000000000031
[ 7792.222889] IP: [<ffffffff8102c180>] handle_lego_mm_fault+0x160/0x4b0
[ 7792.299842] PGD 0
[ 7792.323760] Oops: 0000 [#1] PREEMPT SMP MEMORY
[ 7792.376794] CPU: 4 PID: 79 Comm: mc-manager 4.0.0-lego+ #29
[ 7792.443349] RIP: .. [<ffffffff8102c180>] handle_lego_mm_fault+0x160/0x4b0
......
....
[ 7793.750506] Call Trace:
[ 7793.779623] <TSK>
[ 7793.802501] [<ffffffff810053f4>] ? apic_timer_interrupt+0x54/0x90
[ 7793.875295] [<ffffffff8102e469>] faultin_page+0x9/0x70
[ 7793.936649] [<ffffffff8102ef01>] copy_strings.isra.1+0xe1/0x130
[ 7794.007362] [<ffffffff8102f11e>] exec_loader+0x1ce/0x340
[ 7794.070796] [<ffffffff81027def>] handle_p2m_execve+0x12f/0x200
[ 7794.140469] [<ffffffff810274fb>] mc_manager+0x1ab/0x2b0
[ 7794.202864] [<ffffffff81027350>] ? bitmap_fill+0x33/0x33
[ 7794.266298] [<ffffffff8101c6b7>] kthread+0x107/0x130
[ 7794.325572] [<ffffffff8101c5b0>] ? __kthread_parkme+0x90/0x90
[ 7794.394205] [<ffffffff8100b462>] ret_from_fork+0x22/0x30

So the faulting source code is:

static int handle_pte_fault(struct vm_area_struct *vma, unsigned long address,
                            unsigned int flags, pte_t *pte, pmd_t *pmd)
{
    ....
        if (vma->vm_ops && vma->vm_ops->fault)
                return do_linear_fault(vma, address, flags,
                                       pte, pmd, entry)
    ....

Something wrong with the vma? At this loader stage, the vma is a temporary stack vma created for saving argv and envp. So I looked back into the code that created this vma:

managers/memory/loader/core.c:
static int __bprm_mm_init(struct lego_binprm *bprm)
{
        int err;
        struct vm_area_struct *vma = NULL;
        struct lego_mm_struct *mm = bprm->mm;

        bprm->vma = vma = kmalloc(sizeof(*vma), GFP_KERNEL);
        if (!vma)
                return -ENOMEM;

The code after this does NOT do the necessary cleanup. The vm_ops field happens to hold a garbage value from the last user, so it is non-NULL, the vma->vm_ops check passes, and the code tries to read vma->vm_ops->fault. And that, my friend, is where garbage turns into a crash.

This presents a series of potential bugs. Ugh, memory safety!


02/09 Fri Cloudy

Tried to modify the Phoenix code: replace realloc with malloc+memcpy, so the mremap syscall is avoided. But it still hits a general protection fault. Same as yesterday: corrupted at __strcmp_sse42, with a corrupted RSI or RDI. So I guess it is not about mremap itself at all. I will follow yesterday's checklist.


02/08 Thur Cloudy

00000000004272d0 <__strcmp_sse42>:

  4272d0:       89 f1                   mov    %esi,%ecx
  4272d2:       89 f8                   mov    %edi,%eax
  4272d4:       48 83 e1 3f             and    $0x3f,%rcx
  4272d8:       48 83 e0 3f             and    $0x3f,%rax
  4272dc:       83 f9 30                cmp    $0x30,%ecx
  4272df:       77 3f                   ja     427320 <__strcmp_sse42+0x50>
  4272e1:       83 f8 30                cmp    $0x30,%eax
  4272e4:       77 3a                   ja     427320 <__strcmp_sse42+0x50>
  4272e6:       f3 0f 6f 0f             movdqu (%rdi),%xmm1
* 4272ea:       f3 0f 6f 16             movdqu (%rsi),%xmm2
  4272ee:       66 0f ef c0             pxor   %xmm0,%xmm0
  4272f2:       66 0f 74 c1             pcmpeqb %xmm1,%xmm0
  4272f6:       66 0f 74 ca             pcmpeqb %xmm2,%xmm1
  4272fa:       66 0f f8 c8             psubb  %xmm0,%xmm1
  4272fe:       66 0f d7 d1             pmovmskb %xmm1,%edx
  427302:       81 ea ff ff 00 00       sub    $0xffff,%edx
  427308:       0f 85 42 0d 00 00       jne    428050 <__strcmp_sse42+0xd80>
  42730e:       48 83 c6 10             add    $0x10,%rsi
  427312:       48 83 c7 10             add    $0x10,%rdi
  427316:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  42731d:       00 00 00  
  427320:       48 83 e6 f0             and    $0xfffffffffffffff0,%rsi
  427324:       48 83 e7 f0             and    $0xfffffffffffffff0,%rdi
  427328:       ba ff ff 00 00          mov    $0xffff,%edx
  42732d:       45 31 c0                xor    %r8d,%r8d
  427330:       83 e1 0f                and    $0xf,%ecx
  427333:       83 e0 0f                and    $0xf,%eax
  427336:       66 0f ef c0             pxor   %xmm0,%xmm0
  42733a:       39 c1                   cmp    %eax,%ecx
  42733c:       74 32                   je     427370 <__strcmp_sse42+0xa0>
  42733e:       77 07                   ja     427347 <__strcmp_sse42+0x77>
  427340:       41 89 d0                mov    %edx,%r8d
  427343:       91                      xchg   %eax,%ecx
  427344:       48 87 f7                xchg   %rsi,%rdi
* 427347:       66 0f 6f 17             movdqa (%rdi),%xmm2
  (RDI: 0000000000000000)

Frustrating! What is wrong with multithreaded programs? Is it broken FPU-switch code? An inappropriate TLB flush? IB corrupting memory? What? Ugh.

I'm done with random guesses and these frustrating general protection faults and segfaults. I need to first make sure the underlying kernel is 100% correct. This is a checklist:

  • fpu save/restore
    • always fail at some XMM instruction
    • always with corrupted RDI or RSI
  • switch_to_asm
    • %gs and %fs
    • switch_mm (pgd)
    • stack frame
  • set_arch_tls (%fs)
    • glibc’s way of using per thread data
  • some cpu may miss tlb flush
  • kernel entry/exit assembly
    • current_task macro
    • stack_stratch
    • per-cpu data in entry.S
  • futex
    • clear_tid
    • set_tid
    • shared mm
    • robust list
  • interrupts
    • vector array
    • APIC setup
    • IO-APIC
    • timer interrupt
  • cpu_init and Trampoline
  • faked kernel version
  • P side pgfault handling code (SMP)
  • and M side pgfault handling (SMP)
  • mremap, munmap
    • check pgtable boundary
  • In all, check SMP implications

Is there any code that is solely used to test whether the underlying kernel behaves correctly? Something like the glibc test suite?

How to protect kernel virtual memory? Any existing solutions in Linux?

What are the implications of multiple CPUs entering the kernel at the same time? How can it corrupt user pages? Maybe: kernel entry code, per-cpu data in entry code, FPU code, switch_to, the scheduler.

Why does it always fail at FPU code, i.e., the strcmp function? I failed to compile without SSE; any solution? How would disabling it hurt performance?


02/07 Wed Cloudy

20:07
Pushed a small patch on the mremap issue. Hope it will work. mremap really makes the whole thing very interesting; it will be a very good research finding on combining virtual caches and the operating system. Need to go to the gym with a friend; will be back on debugging late tonight.

9:30
Have two meetings today and a security class; won't have much time for coding during the daytime.


02/06 Tue Sunny

Well. We ruled out both smp_call_function and workqueue yesterday with Yiying's help. But the multithreaded word_count still fails :-( The single-threaded word_count just finished a 4GB dataset (with an 8GB pcache). So what could still be wrong with the multithreaded one?

  • chill
  • check exit code
  • (Checked) check pcache’s usage of task_struct, should always use the group_leader
  • check cpu boot code and check the switch code again
  • I believe pinpointing the issue in the multithreaded word_count can solve a lot of issues; it must be some thread creation, removal, or scheduling thing.
  • How about adding a lock for ibapi, making it sequential? Sweet, I tried it; finally, a bug that we are able to debug.

22:39
Done for today. I'm trying to patch move_pte and pcache_move_pte. Although in theory we definitely need to patch them, I keep thinking the code before should not trigger any serious bugs or memory corruption. Ugh. Maybe it is concurrent mremaps, where one remaps from A to B while another remaps from C to A. It is possible. But my dead brain cannot think about this anymore. I'm going to hit the gym and do some squats.

17:01
Criminal found: mremap() and the virtual cache did the crime. Interesting; I have not seen any research paper, tech report, writeup, or code about this, not even the OVC paper, which, by the way, must have considered this case; otherwise a mremap would simply crash its virtual cache. Many thanks went to my smoke-and-think time.

15:14
Something new came up! After adding a spinlock for ibapi, this showed up (I tried one more time after this, and it did not show up again). We are lucky to have caught it; at least I know where to look. Also, this is definitely triggered by mremap. It seems to be overlapping mremap() calls. One thing I do not know is which thread triggered this bug; the sweep thread? Because the mremap-related pcache rmap functions do not use rmap_get_locked_pte.

[ 3826.048774] normal_p2s_open(): f_name: word_100MB.txt, mode: 04400, flags: 0
[ 3827.891622] SYSC_mremap(cpu18): move: [0x7fffe5788000 - 0x7fffe5806000] -> [0x7fffe531b000 - 0x7fffe5399000]
[ 3828.178643] SYSC_mremap(cpu14): move: [0x7fffe5941000 - 0x7fffe5980000] -> [0x7fffe57c7000 - 0x7fffe5806000]

****    ERROR: mismatched PTE and rmap
****    rmap->owner_process: word_count-pthr uva: 0x7fffe57c8000 ptep: ffff88107efe0e40, rmap->page_table: ffff88107efe0e40
****    pcache_pfn: 0x1257c8, pte_pfn: 0x125942

14:00
word_count-pthread: 100MB dataset; pcache: 8GB, 8-way; victim: 8 entries

[ 1294.845313] STDOUT: ---[
Wordcount: Running...
]---
[ 1294.903661] STDOUT: ---[

o;
]---
[ 1294.946301] normal_p2s_open(): f_name: /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_100MB.txt, mode: 04400, flags: 0
[ 1295.100517] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[ 1295.594658] word_count-pthr[59] general protection ip:4272ea sp:7ffff1b8ed28 error:0
[ 1295.685236] CPU: 10 PID: 59 Comm: word_count-pthr 4.0.0-lego+ #113
[ 1295.759070] RIP: 0033:[<00000000004272ea>]  [<00000000004272ea>] 0x4272ea
[ 1295.840184] RSP: 002b:00007ffff1b8ed28  EFLAGS: 00010283
[ 1295.903621] RAX: 000000000000000f RBX: 00007fffe5a3d010 RCX: 0000000000000001
[ 1295.988893] RDX: 0000000000000000 RSI: 4854005942004441 RDI: 00007ffff1c1e80f
[ 1296.074166] RBP: 00007ffff1c1e80f R08: 0000000000000000 R09: 0000000000000010
[ 1296.211435] R10: 0000000000427ce0 R11: 00007ffff1bbb3ba R12: 0000000000001de4
[ 1296.296711] R13: 00000000006e4a80 R14: 0000000000001d9e R15: 0000000000001dc1
[ 1296.433978] FS:  00007ffff1b8f700(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[ 1296.582686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1296.963297] CR2: 00007ffff1c1e000 CR3: 000000207fd8a000 CR4: 00000000000406a0

So what is this ip:4272ea? Let us objdump the binary:

0000000000425e60 <strcmp>:
  425e60:       48 8d 05 69 14 00 00    lea    0x1469(%rip),%rax        # 4272d0 <__strcmp_sse42>
  425e67:       f7 05 5f b8 2b 00 00    testl  $0x100000,0x2bb85f(%rip)        # 6e16d0 <_dl_x86_cpu_features+0x10>
  425e6e:       00 10 00
  425e71:       75 1a                   jne    425e8d <strcmp+0x2d>
  425e73:       48 8d 05 46 b0 00 00    lea    0xb046(%rip),%rax        # 430ec0 <__strcmp_ssse3>
  425e7a:       f7 05 4c b8 2b 00 00    testl  $0x200,0x2bb84c(%rip)        # 6e16d0 <_dl_x86_cpu_features+0x10>
  425e81:       02 00 00
  425e84:       75 07                   jne    425e8d <strcmp+0x2d>
  425e86:       48 8d 05 03 00 00 00    lea    0x3(%rip),%rax        # 425e90 <__GI_strcmp>
  425e8d:       c3                      retq
  425e8e:       66 90                   xchg   %ax,%ax
 .. ..
 .. ..
00000000004272d0 <__strcmp_sse42>:
  4272d0:       89 f1                   mov    %esi,%ecx
  4272d2:       89 f8                   mov    %edi,%eax
  4272d4:       48 83 e1 3f             and    $0x3f,%rcx
  4272d8:       48 83 e0 3f             and    $0x3f,%rax
  4272dc:       83 f9 30                cmp    $0x30,%ecx
  4272df:       77 3f                   ja     427320 <__strcmp_sse42+0x50>
  4272e1:       83 f8 30                cmp    $0x30,%eax
  4272e4:       77 3a                   ja     427320 <__strcmp_sse42+0x50>
  4272e6:       f3 0f 6f 0f             movdqu (%rdi),%xmm1
* 4272ea:       f3 0f 6f 16             movdqu (%rsi),%xmm2
  4272ee:       66 0f ef c0             pxor   %xmm0,%xmm0

You can see %rsi holds a garbage value, RSI: 4854005942004441. Something went wrong. Could it be our FPU code? I'm not quite sure. If the FPU code had an error, why does the single-threaded run succeed? Why does it only show up in multithreaded runs?


02/05 Mon Sunny

From yesterday's testing of Phoenix, it looks like something is wrong in smp_call_function(). It is invoked through TLB flush, which was in turn invoked by mremap or munmap. The warning from smp is:

[ 1260.586696] WARNING: CPU: 0 PID: 73 at kernel/smp.c:129 generic_smp_call_function_single_interrupt+0xb8/0x160
[ 1260.705251] CPU: 0 PID: 73 Comm: word_count-pthr 4.0.0-lego+ #99
[ 1260.777008] Stack:
[ 1260.800927] ffff88207fdffef8 ffffffff8100ec67 ffff88107fc00000 ffff88107fc00000
[ 1260.888283] ffffffff8100d410 ffff88207fe23df0 ffff88207fdfff08 ffffffff8100ed5f
[ 1260.975639] ffff88207fdfff38 ffffffff8100fe68 00007fffe58c3010 0000000000000f96
[ 1261.062995] 000000000000f960 0000000000000f95 ffff88207fdfff48 ffffffff810020dd
[ 1261.150351] 00007ffff58869c1 ffffffff8100b2e9 0000000000000f96 0000000000000f95
[ 1261.237707] Call Trace:
[ 1261.266825] <TSK>
[ 1261.289704] [<ffffffff8100ec76>] __warn.constprop.0+0xa6/0x100
[ 1261.359381] [<ffffffff8100d410>] ? pgd_free+0x90/0x90
[ 1261.419699] [<ffffffff8100ed5f>] warn_slowpath_null+0xf/0x20
[ 1261.487295] [<ffffffff8100fe68>] generic_smp_call_function_single_interrupt+0xb8/0x160
[ 1261.581931] [<ffffffff810020dd>] call_function_interrupt+0x1d/0x20
[ 1261.655767] [<ffffffff8100b2e9>] smp__call_function_interrupt+0x69/0x70

So I decided to look into smp.c a little to find out if there is something wrong (I wrote it a long time ago). The warning itself is real; it means some inconsistent behavior. I saw alloc_percpu stuff during call_function_init, so I probably also need to check the percpu code a little, because I'm not sure I ported all the functionality.

In all, today's task: check the percpu and smp_call_function code. Especially the percpu code; it is crucial, and it is very hard to trace real bugs back to it.

Well… things changed. I found a more serious bug: something about cpu hotplug, even though Lego is not using it. cpuhotplug is a set of implicit callbacks into all the different subsystems that want to do initialization work on each offline->online cpu transition.

Let us dig into how a secondary cpu boots:

Trampoline.. setup 64bit mode
start_secondary()
  smp_callin()
        notify_cpu_starting()
              ...
              while (st->state < target) {
                      st->state++;
                      cpuhp_invoke_callback(cpu, st->state, true, NULL);
              }
          cpuhp_invoke_callback()

See? There will be some callbacks! What are those callbacks exactly? They are predefined in kernel/cpu.c. To save the trouble of reading code, I just printed which functions are executed; the log is:

[    0.118235] cpuhp_invoke_callback(): 136  CPU:0  page_writeback_cpu_online+0x0/0x20

[    0.368478] cpuhp_invoke_callback(): 136  CPU:1  smpboot_create_threads+0x0/0x90
[    0.370196] cpuhp_invoke_callback(): 136  CPU:1  perf_event_init_cpu+0x0/0xa0
[    0.370403] cpuhp_invoke_callback(): 136  CPU:1  workqueue_prepare_cpu+0x0/0x80
[    0.371112] cpuhp_invoke_callback(): 136  CPU:1  hrtimers_prepare_cpu+0x0/0x60
[    0.371339] cpuhp_invoke_callback(): 136  CPU:1  smpcfd_prepare_cpu+0x0/0x80
[    0.371584] cpuhp_invoke_callback(): 136  CPU:1  relay_prepare_cpu+0x0/0xe0
[    0.371794] cpuhp_invoke_callback(): 136  CPU:1  rcutree_prepare_cpu+0x0/0x170
[    0.372333] cpuhp_invoke_callback(): 136  CPU:1  notify_prepare+0x0/0xa0
[    0.372744] cpuhp_invoke_callback(): 136  CPU:1  bringup_cpu+0x0/0x100
[    0.008000] cpuhp_invoke_callback(): 136  CPU:1  sched_cpu_starting+0x0/0x60
[    0.926124] cpuhp_invoke_callback(): 136  CPU:1  smpboot_unpark_threads+0x0/0x90
[    0.926124] cpuhp_invoke_callback(): 136  CPU:1  perf_event_init_cpu+0x0/0xa0
[    0.927028] cpuhp_invoke_callback(): 136  CPU:1  workqueue_online_cpu+0x0/0x2a0
[    0.927768] cpuhp_invoke_callback(): 136  CPU:1  rcutree_online_cpu+0x0/0x70
[    0.928045] cpuhp_invoke_callback(): 136  CPU:1  notify_online+0x0/0x20
[    0.928256] cpuhp_invoke_callback(): 136  CPU:1  page_writeback_cpu_online+0x0/0x20
[    0.928527] cpuhp_invoke_callback(): 136  CPU:1  sched_cpu_activate+0x0/0x190

[    0.929084] cpuhp_invoke_callback(): 136  CPU:2  smpboot_create_threads+0x0/0x90
[    0.930240] cpuhp_invoke_callback(): 136  CPU:2  perf_event_init_cpu+0x0/0xa0
[    0.930434] cpuhp_invoke_callback(): 136  CPU:2  workqueue_prepare_cpu+0x0/0x80
[    0.931070] cpuhp_invoke_callback(): 136  CPU:2  hrtimers_prepare_cpu+0x0/0x60
[    0.931264] cpuhp_invoke_callback(): 136  CPU:2  smpcfd_prepare_cpu+0x0/0x80
[    0.931464] cpuhp_invoke_callback(): 136  CPU:2  relay_prepare_cpu+0x0/0xe0
[    0.931649] cpuhp_invoke_callback(): 136  CPU:2  rcutree_prepare_cpu+0x0/0x170
[    0.932245] cpuhp_invoke_callback(): 136  CPU:2  notify_prepare+0x0/0xa0
[    0.932475] cpuhp_invoke_callback(): 136  CPU:2  bringup_cpu+0x0/0x100
[    0.008000] cpuhp_invoke_callback(): 136  CPU:2  sched_cpu_starting+0x0/0x60
[    1.005023] cpuhp_invoke_callback(): 136  CPU:2  smpboot_unpark_threads+0x0/0x90
[    1.005065] cpuhp_invoke_callback(): 136  CPU:2  perf_event_init_cpu+0x0/0xa0
[    1.005408] cpuhp_invoke_callback(): 136  CPU:2  workqueue_online_cpu+0x0/0x2a0
[    1.005729] cpuhp_invoke_callback(): 136  CPU:2  rcutree_online_cpu+0x0/0x70
[    1.006029] cpuhp_invoke_callback(): 136  CPU:2  notify_online+0x0/0x20
[    1.006206] cpuhp_invoke_callback(): 136  CPU:2  page_writeback_cpu_online+0x0/0x20
[    1.006549] cpuhp_invoke_callback(): 136  CPU:2  sched_cpu_activate+0x0/0x190

Interesting! Currently, Lego needs to add smpboot_create_threads(), workqueue_prepare_cpu(), bringup_cpu(), smpboot_unpark_threads(), and workqueue_online_cpu().

These hidden things are really hard to find and not easy to track during boot. Originally, boot code probably did something like for_each_online_cpu and initialized things one by one. But I guess after cpu hotplug support was added, the code paths merged: some setup runs whenever a cpu is torn down or brought up. And bang, why not reuse the same set of hotplug callbacks during boot, right? Well.