
Aug 2018

Aug 31

One major TODO

Check do_handle_p2m_pcache_miss(). We MUST remove that memcpy, maybe by using another flag in thpool. This is just not acceptable.
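
Rough sketch of the flag idea (everything below is hypothetical: the struct layout, the flag name, and the helper are made up for illustration). The point is to mark the thpool buffer so the reply path can point the WQE at the caller's buffer directly instead of memcpy'ing into it.

```c
/*
 * Hypothetical sketch of the "extra flag in thpool" idea.
 * Field and flag names are made up; not the real thpool code.
 */
struct thpool_buffer {
	void		*reply_data;	/* where the reply payload lives */
	unsigned int	reply_len;
	unsigned long	flags;
};

#define THPOOL_REPLY_NOCOPY	0x1	/* reply_data points at caller-owned memory */

/* Instead of memcpy(tb->inline_buf, src, len): */
static inline void thpool_set_reply_nocopy(struct thpool_buffer *tb,
					   void *src, unsigned int len)
{
	tb->reply_data = src;		/* no copy; the IB layer DMAs straight from src */
	tb->reply_len  = len;
	tb->flags     |= THPOOL_REPLY_NOCOPY;
}
```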

Ugh

Fuck. Without debug_mm, there is still memory corruption.

Try max_send_wr and number of QPs

Without lock_ib, with debug_mm. Changed max_send_wr at all of P, M, S.

  • QP=4, max_send_wr = 1: always fail
  • QP=4, max_send_wr = 256: always fail
  • QP=24, max_send_wr = 1: succeed (0831-w14-18 0831-w14-20)
  • QP=24, max_send_wr = 256: succeed (0831-w14-16 0831-w14-17)

Pay attention to 0831-w14-15 (the QP=24, max_send_wr = 1 case): something wrong with our timekeeping code?
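
For reference, both knobs live in the QP init attributes of the kernel verbs API. A minimal sketch of where they get set (error handling omitted; this is not the actual FIT setup code, just the stock ib_create_qp() path):

```c
#include <rdma/ib_verbs.h>

/* Sketch: create one RC QP with an explicit send-queue depth. */
static struct ib_qp *create_one_qp(struct ib_pd *pd,
				   struct ib_cq *send_cq, struct ib_cq *recv_cq,
				   int max_send_wr)
{
	struct ib_qp_init_attr attr = {
		.send_cq	= send_cq,
		.recv_cq	= recv_cq,
		.qp_type	= IB_QPT_RC,
		.sq_sig_type	= IB_SIGNAL_REQ_WR,	/* only signaled WRs generate send CQEs */
		.cap = {
			.max_send_wr	= max_send_wr,	/* the knob varied above: 1 vs 256 */
			.max_recv_wr	= 256,
			.max_send_sge	= 1,
			.max_recv_sge	= 1,
		},
	};

	return ib_create_qp(pd, &attr);
}
```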

After Victim bug fix

MNIST 4 threads

  • With lock_ib, debug_mm etc: 3 successful runs
  • Only with debug_mm: well, fit failed. Lost CQE.

Now the debug scope is narrowed. Let me try the micro test suite to stress ibapi_send_reply itself.

Potential: read/write buffer.
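
A sketch of what such a micro test could look like. The ibapi_send_reply_timeout() prototype used here is an assumption (the real one lives in the FIT/ibapi headers), and the MICROTEST_* values are made up:

```c
#include <linux/kernel.h>
#include <linux/string.h>

#define MICROTEST_NODE		1
#define MICROTEST_MSG_SIZE	4096
#define MICROTEST_ITERS		100000

/* Assumed prototype, for illustration only. */
extern int ibapi_send_reply_timeout(int target_node, void *msg, int size,
				    void *reply, int max_reply, int timeout_sec);

/*
 * Stress ibapi_send_reply in isolation: hammer one remote node with
 * fixed-size requests and verify every reply arrives within the timeout.
 */
static int ibapi_stress_thread(void *unused)
{
	static char msg[MICROTEST_MSG_SIZE];
	static char reply[MICROTEST_MSG_SIZE];
	int i, ret;

	for (i = 0; i < MICROTEST_ITERS; i++) {
		memset(msg, i & 0xff, sizeof(msg));

		/* Assumed: returns reply length, or <0 on timeout/error. */
		ret = ibapi_send_reply_timeout(MICROTEST_NODE, msg, sizeof(msg),
					       reply, sizeof(reply), 20 /* sec */);
		if (ret < 0) {
			pr_err("microtest: iter %d failed: %d\n", i, ret);
			return ret;
		}
	}
	pr_info("microtest: %d send_reply iterations OK\n", MICROTEST_ITERS);
	return 0;
}
```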

Aug 30

Be humble.

Identified victim bug.

Finally. I thought it through, with the help of 0830-w14-12. The bug is in victim_try_fill_pcache(), when there are multiple hits to the same victim. Since we release the usable_victim_lock after a hit, there can be a race: 1) CPU0 reaches dec_and_test_filling, passes it, and goes on to free the line. 2) CPU1 has just gotten to victim_check_hit and increments the fill counter back to 1. When CPU1 finishes filling and does dec_and_test_filling, it will do the free again!!! What a double free.

Tomorrow, let me do the fix. Thought: add more synchronization in the victim_check_hit part. Basically we want to ensure only one CPU can do the final free.
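
One standard shape for that kind of fix is sketched below; this is a generic pattern, not the actual change. Treat the fill counter as a get/put reference that is primed when the victim becomes usable, so a late hit can never resurrect a victim whose count already reached zero, and exactly one CPU performs the free. The struct and the free helper are hypothetical.

```c
#include <linux/atomic.h>

struct victim;					/* opaque here */
extern void free_victim(struct victim *v);	/* hypothetical free helper */

/* Hit path: refuse to take a reference once the counter has dropped to 0. */
static bool victim_hit_get(atomic_t *nr_fill)
{
	return atomic_inc_not_zero(nr_fill);
}

/* Fill-done path: exactly one CPU sees the 1 -> 0 transition and frees. */
static void victim_fill_put(struct victim *v, atomic_t *nr_fill)
{
	if (atomic_dec_and_test(nr_fill))
		free_victim(v);
}
```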

After adding pi_lock

Okay, the pi_lock is added. Although it is mostly used by futex-pi and rt-mutex, Lego does not have those two. Therefore it is only used by sched/core.c, exit.c, and kthread.c; 99% of it is in core.c.

The purpose of having this back is to have spin_lock_irqsave(&p->pi_lock) back. Most scheduling code is not reentrant, so we have to disable interrupts while holding its locks. Of course we could use spin_lock_irqsave(&rq->lock) instead of spin_lock(&rq->lock), but that is too dangerous at this stage. Porting based on Linux is now the fastest and safest way.

The importance of disabling interrupts in some kernel paths!!

Good. Now I'm seeing a debuggable victim issue.

Classic deadlock caught. Now, only two victims.

```c
[ 2819.068997] CPU14 PID29 Abort victim alloc (20010ms) nr_usable_victims: 2. From pset_idx:532 nr_lru:63 fault_uva: 0x7fff98614000
[ 2819.094409] CPU14 PID29 – Start Dump Victim Cache [0] total: 2
[ 2819.114188] CPU14 PID29 victim[0]:ffffffff810c2880 refcount:2 nr_fill:1 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2819.133289] CPU14 PID29 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffcc000000
[ 2819.141723] CPU14 PID29 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2819.149770] CPU14 PID29 victim[1]:ffffffff810c2900 refcount:2 nr_fill:1 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2819.168870] CPU14 PID29 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffb0000000
[ 2819.177306] CPU14 PID29 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2819.185352] CPU14 PID29 – End Dump Victim Cache [0]

[ 2819.081708] CPU16 PID30 Abort victim alloc (20010ms) nr_usable_victims: 2. From pset_idx:0 nr_lru:63 fault_uva: 0x7fffcc000024
[ 2819.209008] CPU16 PID30 – Start Dump Victim Cache [1] total: 2
[ 2819.223358] CPU16 PID30 victim[0]:ffffffff810c2880 refcount:2 nr_fill:1 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2819.252443] CPU16 PID30 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffcc000000
[ 2819.260879] CPU16 PID30 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2819.268926] CPU16 PID30 victim[1]:ffffffff810c2900 refcount:2 nr_fill:1 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2819.288026] CPU16 PID30 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffb0000000
[ 2819.296461] CPU16 PID30 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2819.304508] CPU16 PID30 – End Dump Victim Cache [1]

[ 2819.101391] CPU18 PID31 Abort victim alloc (20010ms) nr_usable_victims: 2. From pset_idx:15 nr_lru:63 fault_uva: 0x7fff98c0f000
[ 2819.328165] CPU18 PID31 – Start Dump Victim Cache [2] total: 2
[ 2819.335146] CPU18 PID31 victim[0]:ffffffff810c2880 refcount:1 nr_fill:0 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2819.354246] CPU18 PID31 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffcc000000
[ 2819.362680] CPU18 PID31 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2819.370728] CPU18 PID31 victim[1]:ffffffff810c2900 refcount:2 nr_fill:1 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2819.389828] CPU18 PID31 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffb0000000
[ 2819.398262] CPU18 PID31 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2819.406310] CPU18 PID31 – End Dump Victim Cache [2]
```

This guy grabbed the fill counter right before the first timeout. That's why the above three timeouts happened. And this one came 20 s later, which equals the timeout value:

```c
[ 2839.327457] CPU12 PID28 Abort victim alloc (20010ms) nr_usable_victims: 2. From pset_idx:0 nr_lru:63 fault_uva: 0x7fffb0000f00
[ 2839.339964] CPU12 PID28 – Start Dump Victim Cache [3] total: 2
[ 2839.346945] CPU12 PID28 victim[0]:ffffffff810c2880 refcount:1 nr_fill:0 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2839.366046] CPU12 PID28 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffcc000000
[ 2839.374480] CPU12 PID28 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2839.382527] CPU12 PID28 victim[1]:ffffffff810c2900 refcount:2 nr_fill:1 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5a000
[ 2839.401628] CPU12 PID28 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffb0000000
[ 2839.410062] CPU12 PID28 rmap to pset:ffff88207ff5a000 set_idx: 0 nr_lru:63
[ 2839.418109] CPU12 PID28 – End Dump Victim Cache [3]
```

rq->lock deadlock

Alright. We had an rq->lock deadlock issue. Basically, we missed disabling interrupts: a timer interrupt will try to acquire the lock again, and bang, we have a deadlock. Digging into the code, you can find the cause easily. The root cause is that we removed all the pi_lock stuff, which actually had a lot of irqsave usages… Oh man, maybe it's time to add it back.

```c
[ 3367.835389] ------------------- cut here -------------------
[ 3367.841504] Possible deadlock happend locker_cpu: 0
[ 3367.846934] Current call stack:
[ 3367.850425] CPU: 0 PID: 1 Comm: kernel_init 4.0.0-lego+ #437
[ 3367.856726] Stack:
[ 3367.858957] ffff88107ff0fa58 ffffffff8101f4b6 ffff88107fc05e00 00000004a817c800
[ 3367.867101] ffff88107ff0fa80 ffffffff8101f52e ffff88107fc05e00 ffff88107ffb4000
[ 3367.875246] 0000000000000000 ffff88107ff0faa0 ffffffff8101b1ae ffff88107fc04980
[ 3367.883390] 0000000000000000 ffff88107ff0fab8 ffffffff810174f5 0000000000000286
[ 3367.891535] ffff88107ff0fae0 ffffffff81006774 ffff88107ffb9000 ffff88107fc05e00
[ 3367.899680] Call Trace:
[ 3367.902396] <TSK>
[ 3367.904528] [<ffffffff8101f4c2>] report_deadlock+0x62/0x80
[ 3367.910637] [<ffffffff8101f52e>] debug_spin_lock+0x4e/0x60
[ 3367.916745] [<ffffffff8101b1ae>] scheduler_tick+0x2e/0x60
[ 3367.922756] [<ffffffff810174f5>] tick_handle_periodic+0x45/0x70
[ 3367.929350] [<ffffffff81006774>] apic_timer_interrupt+0x54/0x90
[ 3367.935943] [<ffffffff8100e8aa>] smp__apic_timer_interrupt+0x6a/0x70
[ 3367.943021] [<ffffffff8101db99>] ? enqueue_task_rt+0x149/0x250
[ 3367.949518] [<ffffffff8105908a>] ? __mlx4_write_mtt+0xea/0x140
[ 3367.956014] [<ffffffff8101ad34>] activate_task+0x44/0x50
[ 3367.961929] [<ffffffff8101b667>] ttwu_do_activate+0x27/0x50
[ 3367.968134] [<ffffffff8101b89c>] try_to_wake_up+0xdc/0x1f0
[ 3367.974243] [<ffffffff8106cc20>] ? ib_mad_send_done_handler.isra.22+0x4d0/0x4d0
[ 3367.982388] [<ffffffff8101ba80>] wake_up_process+0x10/0x20
[ 3367.988497] [<ffffffff81023116>] __kthread_create_on_node+0x146/0x230
[ 3367.995671] [<ffffffff8102329f>] kthread_create_on_node+0x2f/0x40
[ 3368.002459] [<ffffffff81066873>] ? ib_create_cq+0x23/0x60
[ 3368.008470] [<ffffffff810695e1>] ib_mad_init_device+0x1f1/0x7b0
[ 3368.015064] [<ffffffff81067246>] ib_register_device+0x5d6/0x690
[ 3368.021657] [<ffffffff8105e9d3>] mlx4_ib_add+0x653/0x780
[ 3368.027571] [<ffffffff8105147d>] mlx4_add_device+0x8d/0x130
[ 3368.033777] [<ffffffff8105158c>] mlx4_register_interface+0x6c/0xa0
[ 3368.040661] [<ffffffff811dc660>] mlx4_ib_init+0x10/0x20
[ 3368.046478] [<ffffffff811dc619>] mlx4_init+0x19/0x50
[ 3368.052005] [<ffffffff811dc68d>] ib_core_init+0x1d/0x30
[ 3368.057823] [<ffffffff811db7f9>] device_init+0x9/0x10
[ 3368.063447] [<ffffffff8100030b>] kernel_init+0x4b/0xc0
[ 3368.069168] [<ffffffff8101b0ea>] ? schedule_tail+0xa/0x40
[ 3368.075178] [<ffffffff810002c0>] ? 0xffffffff810002c0
[ 3368.080803] [<ffffffff8100eb32>] ret_from_fork+0x22/0x30
[ 3368.086718] <EOT>
```
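
To spell out the cycle shown in the trace above, a generic sketch (not the Lego code):

```c
/*
 * The deadlock in one picture:
 *
 *   CPU0: spin_lock(&rq->lock);        <- IRQs still enabled
 *         ... timer fires on CPU0 ...
 *         apic_timer_interrupt()
 *           scheduler_tick()
 *             spin_lock(&rq->lock);    <- same CPU, lock already held: spins forever
 *
 * The fix is to hold scheduler locks with interrupts disabled:
 */
#include <linux/spinlock.h>

static void activate_task_example(spinlock_t *rq_lock)
{
	unsigned long flags;

	spin_lock_irqsave(rq_lock, flags);	/* the tick cannot preempt us now */
	/* ... enqueue the task ... */
	spin_unlock_irqrestore(rq_lock, flags);
}
```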

0830-w14-1: I really don't know how this happened. The refcount and fill counter should be enough to serialize..
```c
[37722.177024] CPU20 PID31 victim:ffffffff810c2880 index:0 refcount:0 nr_fill:0 max_fill:4 locked:0 flags:(0x12e)(allocated|usable|hasdata|waitflush|fillfree) pcm: (null) pset:ffff88207ff5b980
[37722.196623] CPU20 PID31 hit[0] owner:22 m_nid:1 rep_nid:1 addr: 0x2c33000
[37722.204572] CPU20 PID31 victim:ffffffff810c2880 index:0 refcount:0 nr_fill:0 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5b980
[37722.224154] CPU20 PID31 rmap to pset:ffff88207ff5b980 set_idx: 51 nr_lru:63
[37722.232299] CPU20 PID31 victim dumped because: PCACHE_BUG_ON_VICTIM(!VictimAllocated(v) || !VictimUsable(v) || !VictimFlushed(v) || VictimWriteback(v) || VictimLocked(v))
[37722.254790] WARNING: CPU: 20 PID: 31 at managers/processor/pcache/victim.c:196 __put_victim_nolist+0xb8/0x140 ffffffff8103e170
[37722.453632] [] __put_victim_nolist+0xb8/0x140 0000000000000000
[37722.461873] [] victim_try_fill_pcache+0x2f8/0x440

[37722.265842] CPU10 PID20 victim:ffffffff810c2880 index:0 refcount:0 nr_fill:0 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5b980
[37722.291438] CPU10 PID20 hit[0] owner:22 m_nid:1 rep_nid:1 addr: 0x2c33000
[37722.301616] CPU10 PID20 victim:ffffffff810c2880 index:0 refcount:0 nr_fill:0 max_fill:4 locked:0 flags:(0x14e)(allocated|usable|hasdata|flushed|fillfree) pcm: (null) pset:ffff88207ff5b980
[37722.324206] CPU10 PID20 rmap to pset:ffff88207ff5b980 set_idx: 51 nr_lru:63
[37722.332349] CPU10 PID20 victim dumped because: PCACHE_BUG_ON_VICTIM(victim_ref_count(v) == 0)
[37722.350673] WARNING: CPU: 10 PID: 20 at ./include/processor/pcache_victim.h:127 __victim_flush_func+0x232/0x250
[37722.363568] CPU: 10 PID: 20 Comm: kvictim_flushd 4.0.0-lego+ #435
[37722.534003] [] __victim_flush_func+0x232/0x250
[37722.547577] [] victim_flush_async+0x69/0xb0
[37722.553975] [] kthread+0x111/0x130
[37722.565900] [] ret_from_fork+0x22/0x30
```

Aug 29

The only thing left about core_IB is ib_sa_query, which will be invoked when there is a mlx4 interrupt.

Not sure if this is important.

Anyway. Testing TF 4 threads MNIST again.

When I enable SEQ_IBAPI:

  • 0829-w14-11 (0829-w09-11): succeeded
  • 0829-w14-12: the P side seems to have a deadlock. Let me enable DEBUG_SPINLOCK.
  • 0829-w14-13: SEQ_IBAPI, DEBUG_SPINLOCK, this is a very useful log:
```c
[ 531.495545] STDOUT: —[ INFO:tensorflow:loss = 0.5256375, step = 101 (25.166 sec) ]—
[ 531.624474] BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
[ 531.633016] IP: [] __put_victim_nolist+0xe/0xa0
[ 531.639803] PGD 0
[ 531.642032] Oops: 0002 [#1] SMP PROCESSOR
[ 531.646493] CPU: 10 PID: 20 Comm: kvictim_flushd 4.0.0-lego+ #426
[ 531.653279] RIP: 0010:[] [] __put_victim_nolist+0xe/0xa0
[ 531.662781] RSP: 0000:ffff880fe392fde0 EFLAGS: 00010006
[ 531.668696] RAX: 0000000000000000 RBX: ffffffff810c2b00 RCX: ffffffff810c2b70
[ 531.676646] RDX: ffffffff810c2b70 RSI: 0000007aea3f42fa RDI: ffffffff810c2b00
[ 531.684597] RBP: ffff880fe392fdf0 R08: 000000000000001f R09: 0000000000000002
[ 531.692548] R10: 0000000080000000 R11: 00000000000664c3 R12: ffff88207ff57000
[ 531.700498] R13: ffffffff810c2b60 R14: ffff880a72555000 R15: ffffffff810c2b48
[ 531.708449] FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[ 531.717466] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 531.723865] CR2: 0000000000000064 CR3: 00000000011b9000 CR4: 00000000000406a0
[ 531.731816] Stack:
[ 531.734046] ffffffff810c2b00 ffff88207ff57000 ffff880fe392fe08 ffffffff8103bbea
[ 531.742190] ffffffff810c2b00 ffff880fe392fe48 ffffffff8103c729 000000008103d7c2
[ 531.750335] ffff880a72555060 ffff88107ff0fdc8 0000000000000000 ffffffff8103c780
[ 531.758479] 0000000000000000 ffff880fe392fe60 ffffffff8103c7e6 ffff880fe391c000
[ 531.766623] ffff880fe392ff48 ffffffff81022e81 0000000000000000 0000000000000000
[ 531.774768] Call Trace:
[ 531.777483]
[ 531.779617] [] __put_victim+0x4a/0x50
[ 531.785433] [] __victim_flush_func+0xb9/0x110
[ 531.792027] [] ? __victim_flush_func+0x110/0x110
[ 531.798911] [] victim_flush_async+0x66/0x90
[ 531.805310] [] kthread+0x111/0x130
[ 531.810836] [] ? __kthread_parkme+0x70/0x70
[ 531.817236] [] ret_from_fork+0x22/0x30
[ 531.823151]
```

  • 0829-w14-14: this looks like a double free, or concurrent eviction. But if you look into the evict code, we check the Flushed flag, which means another eviction routine should have skipped this line and would not pick it for eviction. Some other possibilities?
  • check until 0829-w14-18:
```c
[ 1671.661424] ------------[ cut here ]------------
[ 1671.666378] BUG: failure at managers/processor/pcache/victim.c:610/victim_finish_insert()!
[ 1671.675591] Kernel Panic - not syncing: BUG!
[ 1671.680339] CPU: 20 PID: 31 Comm: python 4.0.0-lego+ #426
[ 1671.686351] Stack:
[ 1671.688581] ffff880fbe76fda0 ffffffff810289b7 ffffffff00000008 ffff880fbe76fdb0
[ 1671.696725] ffff880fbe76fd68 ffffff0021475542 ffff88107fd45e00 ffff880fbe753000
[ 1671.704870] 0000000000000000 0000000000000001 ffff880fbe76f9b0 ffffffff8101b1b7
[ 1671.713015] ffff88107fd44980 ffff880fbe76f9d8 ffffffff8101405f 0000000000000000
[ 1671.721160] 0000000000000001 ffff880ff992a000 0000000000000001 ffff880fbe76f9f0
[ 1671.729304] Call Trace:
[ 1671.732019] <TSK>
[ 1671.734153] [<ffffffff810289c3>] panic+0xc2/0x10a
[ 1671.739388] [<ffffffff8101b1b7>] ? scheduler_tick+0x57/0x60
[ 1671.745593] [<ffffffff8101405f>] ? generic_smp_call_function_single_interrupt+0x8f/0x160
[ 1671.754611] [<ffffffff8100339e>] ? call_function_interrupt+0x2e/0x40
[ 1671.761688] [<ffffffff8100e9fa>] ? smp__call_function_interrupt+0x6a/0x70
[ 1671.769251] [<ffffffff8101f4bb>] ? debug_spin_lock+0x1b/0x50
[ 1671.775555] [<ffffffff81075efc>] ? fit_internal_poll_sendcq+0x6c/0x140
[ 1671.782826] [<ffffffff81042039>] ? find_next_bit+0x19/0x20
[ 1671.788934] [<ffffffff8101f4bb>] ? debug_spin_lock+0x1b/0x50
[ 1671.795236] [<ffffffff8101dcac>] ? task_tick_rt+0x2c/0xd0
[ 1671.801248] [<ffffffff8101b1b7>] ? scheduler_tick+0x57/0x60
[ 1671.807453] [<ffffffff810174d5>] ? tick_handle_periodic+0x45/0x70
[ 1671.814240] [<ffffffff81006774>] ? apic_timer_interrupt+0x54/0x90
[ 1671.821029] [<ffffffff8100e8aa>] ? smp__apic_timer_interrupt+0x6a/0x70
[ 1671.828300] [<ffffffff81012bc8>] ? printk+0x118/0x1b0
[ 1671.833924] [<ffffffff8103c161>] victim_finish_insert+0x171/0x180
[ 1671.840711] [<ffffffff8103b2a2>] pcache_evict_line+0xf2/0x2e0
[ 1671.847110] [<ffffffff81038d7c>] pcache_alloc+0x1ac/0x380
[ 1671.853122] [<ffffffff8103a10c>] ? pcache_add_rmap+0x7c/0x260
[ 1671.859521] [<ffffffff810382bb>] common_do_fill_page+0x2b/0x1e0
[ 1671.866114] [<ffffffff81038631>] pcache_handle_fault+0x1c1/0x620
[ 1671.872804] [<ffffffff81037fc0>] ? pcache_meta_to_kva+0x30/0x30
[ 1671.879398] [<ffffffff8101006f>] do_page_fault+0xaf/0x1c0
[ 1671.885410] [<ffffffff8100dedf>] page_fault+0x1f/0x30
```

  • 0829-w14-16: we got this by having debug_spinlock and seq_ibapi. This is interesting and serious. I think our general C code is fine. Should I go check the assembly part? This is the rq->lock? Come on…
```c
[ 683.748135] ------------------- cut here -------------------
[ 683.754252] Possible deadlock happend
[ 683.758323] Current call stack:
[ 683.761815] CPU: 4 PID: 39 Comm: python 4.0.0-lego+ #428
[ 683.767728] Stack:
[ 683.769959] ffff880fc1c1fc38 ffffffff8101f48c ffff88107fc45e00 ffff880fc1c1fc60
[ 683.778103] ffffffff8101f4e4 ffff88107fc45e00 ffff880fc23fb000 0000000000000000
[ 683.786247] ffff880fc1c1fc80 ffffffff8101b18e ffff88107fc44980 0000000000000004
[ 683.794391] ffff880fc1c1fc98 ffffffff810174d5 ffffffff8101dddb ffff880fc1c1fcc0
[ 683.802537] ffffffff81006774 ffff88107fc45e00 00000004a817c800 0000009a8a78c5e7
[ 683.810680] Call Trace:
[ 683.813396]
[ 683.815528] [] report_deadlock+0x58/0x60
[ 683.821637] [] debug_spin_lock+0x44/0x50
[ 683.827745] [] scheduler_tick+0x2e/0x60
[ 683.833758] [] tick_handle_periodic+0x45/0x70
[ 683.840351] [] ? dequeue_task_rt+0x1b/0x180
[ 683.846750] [] apic_timer_interrupt+0x54/0x90
[ 683.853343] [] smp__apic_timer_interrupt+0x6a/0x70
[ 683.860421] [] ? debug_spin_lock+0x31/0x50
[ 683.866723] [] try_to_wake_up+0xce/0x1f0
[ 683.872832] [] wake_up_q+0x54/0xc0
[ 683.878358] [] do_futex+0x407/0x620
[ 683.883982] [] ? pcache_add_rmap+0xb1/0x600
[ 683.890381] [] sys_futex+0x6c/0x130
[ 683.896005] [] do_syscall_64+0x36/0xc0
[ 683.901919] [] entry_SYSCALL64_slow_path+0x25/0x25
```

Aug 27

There are a lot of lost-CQE cases. This one is about P->M->S, and M lost the CQE for the WQE sent to S.

```c
0827-w9-5
[ 963.304865] watchdog: worker[0] CPU10 stucked
[ 963.309712] watchdog: common_header [op=0x20000000 src_nid:0]
[ 963.316210] CPU: 10 PID: 20 Comm: thpool-worker0 4.0.0-lego+ #43
[ 963.322899] RIP: 0010:[<ffffffff8106ad51>] [<ffffffff8106ad51>] fit_send_reply_with_rdma_write_with_imm+0x2a1/0x3a0
[ 963.334632] RSP: 0000:ffff88103ef3fc20 EFLAGS: 00000287
[ 963.340547] RAX: 00000000ffffb6d4 RBX: 000000000000000b RCX: 0000000000001770
[ 963.348498] RDX: 00000000ffffa70d RSI: fffffffffffff039 RDI: 0000000000000000
[ 963.356450] RBP: ffff88103ef3fcc0 R08: 000000000000001f R09: 0000000000000002
[ 963.364400] R10: 0000000080000000 R11: 000077ff80000000 R12: 0000000000000000
[ 963.372352] R13: ffff88103ef26738 R14: 00000000000b3d54 R15: ffff88103ef25008
[ 963.380303] FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[ 963.389320] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 963.395720] CR2: 0000000000000000 CR3: 000000000116a000 CR4: 00000000000406a0
[ 963.403671] Stack:
[ 963.405901] 00007fff000b3d54 ffffffff800b3d54 ffff881000000004 ffff88103ef3fc78
[ 963.414045] 0000000900000000 ffff881000000000 0000100800000001 ffff88103d216000
[ 963.422191] ffff88103eebae48 800b3d540000011c ffffff9b00000246 ffffea0000000001
[ 963.430337] 000000103d216000 0000000000010c00 000000000000011c 0000000000001008
[ 963.438481] 000000000000011c 0000000000001008 ffff88103eebae48 ffff88103ef3fd70
[ 963.446626] Call Trace:
[ 963.449342] <TSK>
[ 963.451475] [<ffffffff81067c80>] ibapi_send_reply_imm+0x50/0xd0
[ 963.458068] [<ffffffff8102e953>] ? __storage_read+0xc3/0x120
[ 963.464371] [<ffffffff8102e953>] __storage_read+0xc3/0x120
[ 963.470480] [<ffffffff8102e9bf>] storage_read+0xf/0x50
[ 963.476201] [<ffffffff8102eab7>] storage_vma_fault+0xb7/0x130
[ 963.482600] [<ffffffff8103262f>] handle_lego_mm_fault+0x13f/0x4a0
[ 963.489389] [<ffffffff8102ecf4>] common_handle_p2m_miss.isra.1+0x54/0xc0
[ 963.496855] [<ffffffff8102edc7>] handle_p2m_pcache_miss+0x67/0x2d0
[ 963.503739] [<ffffffff8102bf96>] thpool_worker_func+0x296/0x3a0
[ 963.510332] [<ffffffff8102bd00>] ? handle_bad_request+0x40/0x40
[ 963.516926] [<ffffffff81020ca6>] kthread+0xf6/0x120
[ 963.522357] [<ffffffff81020bb0>] ? __kthread_parkme+0x70/0x70
[ 963.528756] [<ffffffff8100e632>] ret_from_fork+0x22/0x30
```

hmm, another one: a lost CQE happened at P. Today is weird; why do we have so many lost CQEs today?

Think about why the CQE is not generated.
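
One thing worth ruling out on the signaling side: with per-WR signaling (IB_SIGNAL_REQ_WR), a send CQE is only generated for WQEs posted with IB_SEND_SIGNALED; an unsignaled WQE never produces one. Below is a minimal sketch of the polling loop we expect to finish, using the stock ib_poll_cq() API and the same 20 s budget as the watchdog; it is not the FIT code itself.

```c
#include <rdma/ib_verbs.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

/* Sketch: poll one send CQE with a 20-second budget, like the FIT timeout. */
static int poll_one_send_cqe(struct ib_cq *send_cq)
{
	struct ib_wc wc;
	unsigned long deadline = jiffies + 20 * HZ;
	int n;

	while (time_before(jiffies, deadline)) {
		n = ib_poll_cq(send_cq, 1, &wc);
		if (n < 0)
			return n;			/* CQ error */
		if (n == 1)
			return wc.status == IB_WC_SUCCESS ? 0 : -EIO;
		cpu_relax();				/* nothing yet, spin politely */
	}
	return -ETIMEDOUT;				/* the "lost CQE" case */
}
```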

```c
0827-w14-6
[ 1185.835707]
***** Fail to to get the CQE from send_cq after 20 seconds!
***** This means the packet was lost and something went wrong
***** with your NIC…
***** connection_id: 7 dest node: 1

[ 1185.856465] IB Stats:
[ 1185.858985] nr_ib_send_reply: 3452
[ 1185.864221] nr_bytes_tx: 506507
[ 1185.869456] nr_bytes_rx: 8981004
[ 1185.874692] ------------[ cut here ]------------
[ 1185.879829] WARNING: CPU: 14 PID: 22 at net/lego/fit_internal.c:1108 fit_internal_poll_sendcq+0xe5/0x140
[ 1185.890399] CPU: 14 PID: 22 Comm: python 4.0.0-lego+ #356
[ 1185.896410] Stack:
[ 1185.898640] ffff88103c49fb30 ffffffff810126f5 ffff88103cb22000 00000004a817c800
[ 1185.906784] 0000010f7139214f 0000000000000007 ffff88103c49fb40 ffffffff810127cf
[ 1185.914927] ffff88103c49fbf0 ffffffff810724b5 000000023cb2c280 ffff88103cb2c1f8
[ 1185.923072] 0000000000000286 ffff88103c49fc18 ffff88103cb06000 ffff88103cb2c150
[ 1185.931217] 000000000000024b ffff88108101c7dc ffff88107fce5d80 ffff88103c46f000
[ 1185.939360] Call Trace:
[ 1185.942075]
[ 1185.944209] [] __warn.constprop.1+0x91/0xd0
[ 1185.950607] [] warn_slowpath_null+0xf/0x20
[ 1185.956909] [] fit_internal_poll_sendcq+0xe5/0x140
[ 1185.963987] [] ? scheduler_tick+0x55/0x60
[ 1185.970192] [] fit_send_message_with_rdma_write_with_imm_request+0x152/0x350
[ 1185.979791] [] fit_send_reply_with_rdma_write_with_imm+0x25f/0x3a0
[ 1185.988420] [] ? __pcache_do_fill_page+0xc2/0x1d0
[ 1185.995401] [] ibapi_send_reply_timeout+0x79/0x120
[ 1186.002479] [] ? __pcache_do_fill_page+0xc2/0x1d0
[ 1186.009459] [] __pcache_do_fill_page+0xc2/0x1d0
[ 1186.016245] [] common_do_fill_page+0xf4/0x1f0
[ 1186.022839] [] pcache_handle_fault+0x1c0/0x610
[ 1186.029528] [] ? __pcache_do_zerofill_page+0x100/0x100
[ 1186.036995] [] do_page_fault+0xaf/0x1c0
[ 1186.043005] [] page_fault+0x1f/0x30

```

Aug 26

Oh well. I saw the same damn lost packet issue again. The issue can be described as: P uses the LITE RPC to send a request to M. M processes the request, then calls the RPC reply to send it back to P. M needs to poll send_cq for the completion, but M fails to get the CQE for the should-be-sent-out WQE.

This is tested with M's CONFIG_FIT_NOWAIT optimization, under which M does not poll the CQ every time a reply is sent out; instead, it does batch polling.
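
For reference, the usual way such a batch/no-wait scheme is built on verbs looks like the sketch below (generic, not the actual FIT implementation; the send_ctx fields are made up): signal only every Nth send WR and reap CQEs in batches, since the CQE for a signaled WR also retires all unsignaled WRs posted before it on that QP.

```c
#include <rdma/ib_verbs.h>

#define SIGNAL_EVERY	32	/* reap send CQEs in batches of this size */

/* Hypothetical per-connection send context. */
struct send_ctx {
	struct ib_qp	*qp;
	struct ib_cq	*send_cq;
	unsigned int	 nr_unsignaled;	/* sends posted since last signaled WR */
};

static int post_one_send(struct send_ctx *ctx, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	/* Only every Nth WR asks for a completion. */
	wr->send_flags &= ~IB_SEND_SIGNALED;
	if (++ctx->nr_unsignaled >= SIGNAL_EVERY) {
		wr->send_flags |= IB_SEND_SIGNALED;
		ctx->nr_unsignaled = 0;
	}
	return ib_post_send(ctx->qp, wr, &bad_wr);
}

static void reap_send_batch(struct send_ctx *ctx)
{
	struct ib_wc wc[SIGNAL_EVERY];
	int n;

	/* Each CQE here retires up to SIGNAL_EVERY unsignaled sends before it. */
	do {
		n = ib_poll_cq(ctx->send_cq, SIGNAL_EVERY, wc);
	} while (n > 0);
}
```

Note the obvious constraint: max_send_wr has to be comfortably larger than SIGNAL_EVERY, or the send queue fills up between reaps.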

The following stack dump was reported by M's watchdog. It is not necessarily mlx4_poll_cq's issue, since there is a while (1) loop in the fit code. Oh well.

```c
Log name: 0826-w9-1

[187736.669027] watchdog: worker[0] CPU10 stucked
[187736.673972] watchdog: common_header [op=0x30000000 src_nid:0]
[187736.680566] CPU: 10 PID: 20 Comm: thpool-worker0 4.0.0-lego+ #26
[187736.687351] RIP: 0010:[] [] mlx4_ib_poll_cq+0x1d3/0x850
[187736.696854] RSP: 0000:ffff88103ef3f750 EFLAGS: 00000286
[187736.702865] RAX: 00000000fffffff5 RBX: 0000000000000000 RCX: ffff88103ed6b050
[187736.710913] RDX: 0000000080630000 RSI: 0000000000000001 RDI: ffff88103edb0bf0
[187736.718961] RBP: ffff88103ef3f7b8 R08: 0000000000000020 R09: 0000000000000002
[187736.727007] R10: 0000000ffc53fddc R11: 0000000040bf1040 R12: ffff88103ef3f7c8
[187736.735055] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88103edb0bf0
[187736.743104] FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[187736.752218] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[187736.758714] CR2: 0000000000000000 CR3: 000000000116a000 CR4: 00000000000406a0
[187736.766762] Stack:
[187736.769089] 0000000ffc53fddc 0000000000000002 0000000000000020 ffff88103edb0c98
[187736.777331] 0000000000000286 0000000080630000 ffff88103ef3f7d0 0000638000000018
[187736.785572] ffff88103edb0bf0 0000000000000001 ffff88103ef25008 0000000000000003
[187736.793813] 000000000000000c ffff88103ef3fd30 ffffffff8106920c ffff88103ef3fd54
[187736.802054] 0000000100000000 0000000100000000 ffff88103edb07b0 ffff88103e81b008
[187736.810296] Call Trace:
[187736.813108]
[187736.815338] [] fit_internal_poll_sendcq+0x6c/0xe0
[187736.822416] [] ? fit_send_reply_with_rdma_write_with_imm+0x25f/0x3a0
[187736.831336] [] ? _lego_copy_to_user+0x110/0x250
[187736.838220] [] ? __free_pages+0x25/0x30
[187736.844329] [] ? __storage_read+0xf1/0x120
[187736.850728] [] ? scheduler_tick+0x55/0x60
[187736.857031] [] ? fit_send_message_with_rdma_write_with_imm_request+0x152/0x350
[187736.866920] [] ? fit_send_message_with_rdma_write_with_imm_request+0x152/0x350
[187736.876810] [] ? __vma_adjust+0x38f/0x550
[187736.883113] [] ? vma_merge+0x1a4/0x280
[187736.889123] [] ? arch_get_unmapped_area_topdown+0xe0/0x220
[187736.897075] [] fit_send_message_with_rdma_write_with_imm_request+0x152/0x350
[187736.906771] [] fit_ack_reply_callback+0x185/0x1e0
[187736.913848] [] ? handle_p2m_flush_one+0x69/0x160
[187736.920830] [] thpool_worker_func+0xe0/0x3a0
[187736.927424] [] ? handle_bad_request+0x40/0x40
[187736.934113] [] kthread+0xf6/0x120
[187736.939639] [] ? __kthread_parkme+0x70/0x70
[187736.946137] [] ret_from_fork+0x22/0x30
```

Aug 22

Damn it!!! After so much effort verifying we had a solid IB stack, we still have memory corruption and deadlock issues. Fuck!

One thing at a time, simple stuff first. Okay, tomorrow first add DEBUG_SPINLOCK to detect possible deadlocks. This could help identify some buggy code. After that, I will spend some time looking into LITE; it's fucking HEAVY. I did find a lot of issues during the summer.
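
A sketch of the kind of DEBUG_SPINLOCK wrapper I have in mind (the struct layout and spin threshold here are illustrative only, even though the report style matches the report_deadlock/debug_spin_lock frames that show up in the traces above): record the owner CPU and scream if we spin for too long.

```c
#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/smp.h>

/* Hypothetical debug wrapper: layout and threshold are illustrative only. */
struct dbg_spinlock {
	spinlock_t	lock;
	int		owner_cpu;	/* -1 when unlocked */
};

#define DEADLOCK_SPIN_LIMIT	(1UL << 28)	/* "way too long" spin count */

static void debug_spin_lock(struct dbg_spinlock *l)
{
	unsigned long spins = 0;

	while (!spin_trylock(&l->lock)) {
		if (++spins > DEADLOCK_SPIN_LIMIT) {
			/* Likely deadlock: report who holds it and where we are. */
			pr_err("Possible deadlock, locker_cpu: %d\n", l->owner_cpu);
			dump_stack();
			spins = 0;
		}
		cpu_relax();
	}
	l->owner_cpu = smp_processor_id();
}

static void debug_spin_unlock(struct dbg_spinlock *l)
{
	l->owner_cpu = -1;
	spin_unlock(&l->lock);
}
```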

Personally, I'm not feeling good these days. I treat someone with love and respect, but there is not much in return. Yeah yeah yeah, I know how this works. It's just sad that sometimes you simply have BAD timing. I've gone through too many things in 2018, good and bad. I care sooo much about the people I love, family and others. I feel this is good, of course. Anyway, this is supposed to be a Lego dump that probably no one is interested in.


Last update: September 1, 2018
